Python: combined data types (module 5: using the jieba library) (examples: basic statistical value calculation & text word frequency statistics)


This chapter introduces Python's combined data types. Using the calculation of basic statistical values as an example, it shows how functions and the various types are used; using text word frequency statistics as an example, it introduces the jieba library.
(From this article onward, some library functions and some simple code are presented as pictures.)

After reading this article, you will understand:
1. Methodology:
How to use the three mainstream combined data types in Python (set, sequence, and dictionary)
2. Practical ability:
Learn to write programs that process a set of data

This chapter will systematically introduce:
1. Set type and operation
2. Sequence type and operation (including tuple type and list type)
3. Example: Calculation of basic statistical value
4. Dictionary type and operation
5. Module 5: use of Jieba Library
6. Example: text word frequency statistics

1, Set type and operation

1. Set type definition

(1) Set: an unordered combination of multiple elements
① The set type is consistent with the concept of a set in mathematics
② Set elements are unordered and unique; no element appears twice
③ Set elements must be of immutable data types; once an element is placed in a set, it cannot be modified
④ Immutable data types: integer, floating-point number, complex number, string, tuple, etc.
(2) Creating a set
① Sets are written with braces {}, and elements are separated by commas
② Create a set with {} or set()
③ To create an empty set, you must use set()
(3) Examples
① A = {"python", 123, ("python", 123)} # use {} to build a set
The data type written with parentheses is called a tuple
② Use the set() function to generate a set
B = set("pypy123")
Each character is split off and becomes an element of the set
Because there are two p's and two y's, the duplicates are removed when the set is generated, leaving five elements. The elements are not kept in their original order
③ Define a set from a literal that contains repeated elements
C = {"python","123","python","123"}
The resulting set has only two elements: {'python', '123'}

2. Set operator

(1) Operations between sets
Four operations: union, difference, intersection, and complement
(2) 6 operators

(3) 4 enhanced operators

The plain operators generate a new set, whereas the enhanced operators modify the original set
(4) Examples
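Because the operator tables were shown as pictures, here is a minimal sketch of them in code. It assumes the six operators are |, -, &, ^ and the comparison pairs <= / < and >= / >, and that the four enhanced operators are |=, -=, &=, ^=. The sets A and B are made up for illustration, and ^ is what the text above calls the complement (the elements found in exactly one of the two sets).

```python
# Two small sets made up for illustration
A = {"p", "y", 123}
B = set("pypy123")          # the characters p, y, 1, 2, 3

union = A | B               # elements in A or in B
difference = A - B          # elements in A but not in B
intersection = A & B        # elements in both A and B
symmetric = A ^ B           # elements in exactly one of A and B
is_subset = {"p"} <= A      # subset test (< is the strict version)
is_superset = A >= {"p"}    # superset test (> is the strict version)

print(union, difference, intersection, symmetric, is_subset, is_superset)

# An enhanced operator updates the set on its left in place
C = A.copy()
alias = C                   # second name for the same set object
C |= B                      # C now holds the union
print(alias is C, alias == union)
```

The last line shows the difference stated above: after C |= B the name alias still refers to the same, now modified, set.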

3. Collection processing method

① The ten methods above cover the basic operations: adding, deleting, clearing, and taking out elements
② They also cover counting the elements of a set, copying a set, and testing membership
③ With these, almost everything a set can do is covered
④ Both for ... in and while can traverse a combined data type; each suits different application scenarios
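The method table was a picture, so the sketch below shows the operations named in the notes above using built-in set methods. Which ten methods the picture listed is an assumption; the set contents are made up for illustration.

```python
A = {"p", "y", 123}

A.add("z")            # add an element (no effect if already present)
A.discard("missing")  # remove if present; no error when absent
A.remove(123)         # remove; raises KeyError when absent
snapshot = A.copy()   # shallow copy, independent of A
size = len(A)         # number of elements
has_p = "p" in A      # membership test

# Traversal with for ... in (the order is not guaranteed)
seen = []
for item in snapshot:
    seen.append(item)

# Traversal with while: pop() removes and returns an arbitrary element
# and raises KeyError once the set is empty
popped = []
try:
    while True:
        popped.append(A.pop())
except KeyError:
    pass

print(size, has_p, sorted(seen), sorted(popped), A)
```

The for loop leaves the set intact, while the while/pop loop empties it, which is the practical difference between the two traversal styles.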

4. Set type application scenarios

(1) Comparison of inclusion relationships

① The reserved word in determines whether an element is in a set
② Judge the relationship between data
(2) Data deduplication
Take advantage of the fact that elements in a set cannot be repeated
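Both scenarios fit in a few lines. This is a minimal sketch with made-up data; note that converting back to a list does not restore the original order.

```python
ls = ["p", "p", "y", "y", 123]   # list with repeated elements

print("p" in {"p", "y", 123})    # membership test with the reserved word in
print({"p", "y"} >= {"p"})       # inclusion relationship between two sets

unique = set(ls)                 # duplicates disappear in the set
deduped = list(unique)           # back to a list; order is not guaranteed
print(len(ls), len(deduped))
```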

2, Sequence type and operation

1. Sequence type definition

(1) Sequence
A sequence is a group of elements with an order
① A sequence is a one-dimensional vector of elements. Elements may be equal to each other, and element types may differ
② It is similar to a mathematical sequence of elements: s0, s1, ..., sn-1
③ Elements are located by sequence number; a specific element is accessed through its subscript (an integer sequence number)
(2) Use of sequence types
Sequence is a base type
① In general it is not used directly; the data types derived from it are used instead
② The string, tuple, and list types are all derived from the sequence type
③ Operations on the sequence type also apply to the derived types
④ Each of the three derived types also has its own specific operations
(3) Definition of sequence numbers
① Elements can be indexed by forward-increasing sequence numbers and by reverse-decreasing sequence numbers
② This is the same numbering introduced earlier for strings

The difference is that every element of a string is a single character, while an element of a general sequence can be of any data type

2. Sequence processing function and method

(1) General sequence operators
6 operators

① s[i]: the sequence number can be given in either system, forward-increasing or reverse-decreasing
(2) General functions and methods for sequence types
5 functions and methods

① For min and max, if the elements of the sequence are of different types that cannot be compared, both functions raise an error
② Examples
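The example picture is missing here, so the following sketch stands in for it. It assumes the six operators are in, not in, +, *, s[i], and slicing s[i:j:k], and that the five functions and methods are len(), min(), max(), s.index(), and s.count(). The values are made up for illustration.

```python
ls = ["python", 123, ".io"]
s = "python123.io"

print(123 in ls)              # membership test
print("abc" not in s)         # negated membership test
joined = ls + [456]           # concatenation creates a new sequence
repeated = ls * 2             # repetition
first, last = ls[0], ls[-1]   # forward index 0, reverse index -1
reversed_s = s[::-1]          # a slice with step -1 reverses the sequence

# General functions and methods
length = len(s)
smallest, largest = min(s), max(s)   # characters are compared by code point
position = s.index("y")       # index of the first occurrence
occurrences = s.count("o")    # number of occurrences

# min() and max() fail when the elements cannot be compared with each other
try:
    max(ls)                   # str and int are not comparable
    comparable = True
except TypeError:
    comparable = False
print(comparable)
```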

3. Tuple type and operation

(1) Tuple type definition
A tuple arranges elements in order and is written with ()
① Tuple is an extension of the sequence type
② A tuple is a sequence type that cannot be modified once created
③ Create it with parentheses () or tuple(); elements are separated by commas
④ The parentheses are optional
⑤ For example, a function that returns several values returns them as one tuple, and the caller can unpack the returned elements
(2) Examples
① As shown in the figure

When you display the created variable, you will see that multiple elements separated by commas are represented internally as a tuple
② As shown in the figure

A tuple can contain another tuple
(3) Tuple type operation
① Tuples inherit all the general operations of sequence types
② Tuples cannot be modified after creation, so they have no special operations
③ They can be written with or without parentheses
Note: slicing does not change the value of the original variable; it generates a new tuple
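The tuple figures are not reproduced here, so this is a minimal sketch of the points above. The variable names, the values, and the helper function min_max are made up for illustration.

```python
creature = "cat", "dog", "tiger", "human"     # parentheses are optional
color = (0x001100, "blue", creature)          # a tuple can contain a tuple

def min_max(values):
    # Several return values are packed into one tuple
    return min(values), max(values)

result = min_max([3, 1, 2])
low, high = result                 # unpack the returned tuple

tail_reversed = creature[::-1]     # slicing builds a new tuple
nested = color[-1][2]              # index into the inner tuple

try:
    creature[0] = "lion"           # tuples do not support item assignment
    modified = True
except TypeError:
    modified = False

print(type(creature), result, tail_reversed, nested, modified)
```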

4. List type and operation

(1) List type definition
List is an extension of the sequence type and is used very often
① A list is a sequence type that can be modified freely after creation
② Create it with square brackets [] or list(); elements are separated by commas
③ The elements of a list can have different types, and there is no length limit
④ Examples

Note: only square brackets [] or list() really create a list; assignment just passes a reference. If you assign a list variable to another variable with =,
no new list is created in the system. Instead, the same list gets two names, for example ls and lt, and both point to the same list
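The note about references can be checked directly. In this sketch the list contents are made up, and lc is an extra name introduced to show what a real copy looks like.

```python
ls = ["cat", "dog", "tiger", 1024]
lt = ls              # no new list: lt is another name for the same object
lc = ls.copy()       # a real, independent copy (list(ls) and ls[:] also work)

lt[0] = "lion"       # change made through lt
print(ls[0])         # the change is visible through ls
print(lc[0])         # the copy is unaffected
print(lt is ls, lc is ls)
```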
(2) List type operation functions and methods

Examples

(3) Modify list content

Examples
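The pictures with the list operations are missing, so the sketch below walks through some common ones. Which operations the pictures showed is an assumption; everything used here is a built-in list operation, and the comments give the list after each step.

```python
ls = ["cat", "dog", "tiger", 1024]

ls[1:2] = [1, 2, 3, 4]   # replace a slice: ['cat', 1, 2, 3, 4, 'tiger', 1024]
del ls[::3]              # delete every third element: [1, 2, 4, 'tiger']
ls.append(1234)          # add at the end: [1, 2, 4, 'tiger', 1234]
ls.insert(1, "human")    # insert before index 1: [1, 'human', 2, 4, 'tiger', 1234]
ls.reverse()             # reverse in place: [1234, 'tiger', 4, 2, 'human', 1]
ls.remove(1234)          # remove the first occurrence of a value
last = ls.pop()          # remove and return the last element
print(ls, last)
```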

5. Sequence type application scenario

(1) Data representation
① Tuples are used where the elements do not change, mostly for fixed combinations
For example, the multiple values a function returns with return
② Lists are more flexible and are the most commonly used sequence type
When one data type is used to process a group of data in varied ways, use a list whenever possible
③ Main function: represent a group of ordered data and then operate on it
(2) Element traversal (more importantly, the expression of data)
(3) Data protection
If you do not want the data to be changed by the program, convert it to tuple type (taking advantage of the immutability of tuple elements)
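A minimal sketch of traversal and data protection, with made-up data: the tuple is a read-only snapshot of the list, and any attempt to assign to one of its elements is rejected.

```python
ls = ["cat", "dog", "tiger", 1024]
lt = tuple(ls)            # immutable snapshot of the list

for item in lt:           # traversal works the same way for lists and tuples
    print(item)

try:
    lt[0] = "lion"
    protected = False
except TypeError:
    protected = True      # the assignment was rejected
print(protected)
```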

3, Example: basic statistical value calculation

1. Problem analysis

(1) Basic statistics
Requirements: given a set of numbers, get a general understanding of them (count, sum, mean, variance, median...)
(2) Related methods
① Count: len()
② Sum: for ... in
③ Mean: sum / count
④ Variance: the average of the squared differences between each value and the mean
⑤ Median: sort the data; take the middle value for an odd count, or the average of the two middle values for an even count

2. Example explanation

(1) Get user input of indefinite length

def getNum():
    nums = []
    iNumStr = input("Please enter a number (enter to exit):")
    while iNumStr != "":
        nums.append(float(iNumStr))  # convert the input to a number and store it in the list
        iNumStr = input("Please enter a number (enter to exit):")
    return nums

① The input function gets the user's input. If the input is not empty, it is converted to a number and appended to the list
② The user is then asked for another input
③ The loop exits when the user enters an empty string (just presses Enter)
④ Finally, the list holding every input is returned to the caller as the input data
(2) Calculate average

def mean(numbers):
    s = 0.0
    for num in numbers:
        s = s + num
    return s/len(numbers)
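
(3) Calculate variance (the function returns its square root, the sample standard deviation)

def dev(numbers, mean):
    sdev = 0.0
    for num in numbers:
        sdev += (num-mean)**2  # accumulate the squared differences from the mean
    # divide by n-1 and take the square root
    return pow(sdev / (len(numbers)-1), 0.5)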

(4) Calculate median

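def median(numbers):
    numbers = sorted(numbers)  # sort a copy; the caller's list is unchanged
    size = len(numbers)
    if size%2 == 0:
        med = (numbers[size//2-1]+numbers[size//2])/2  # even count: average the two middle values
    else:
        med = numbers[size//2]  # odd count: take the middle value
    return med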

① Sort with the sorted function
(5) Call

n = getNum()
m = mean(n)
# dev() returns the sample standard deviation, so it is labeled as such
print("Average:{},Standard deviation:{: .2},median:{}.".format(m, dev(n, m), median(n)))

3. Draw inferences from one instance

Technical capability expansion
① Getting multiple data items: a method for reading an indefinite number of inputs from the console
② Separating multiple functions: modular design
③ Making full use of functions: make full use of the built-in functions Python provides
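As one way to follow point ③, the same statistics can be cross-checked with the statistics module from the Python standard library. The module is not used in the original example, so treat this as a side sketch; the data list is made up.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up sample

avg = statistics.mean(data)                   # arithmetic mean
sample_sd = statistics.stdev(data)            # sample standard deviation (divides by n - 1)
population_var = statistics.pvariance(data)   # population variance (divides by n)
mid = statistics.median(data)                 # average of the two middle values here

print(avg, sample_sd, population_var, mid)
```

statistics.stdev corresponds to what the dev function above computes, while statistics.pvariance matches the definition of variance given in the problem analysis.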

4, Dictionary type and operation

1. Dictionary type definition

(1) Mapping
① A mapping is a correspondence between a key (index) and a value (data)
② A sequence type uses the integers 0 ... N as the default index of its data, whereas a mapping type lets the user define the index
(2) Dictionary type (the embodiment of the mapping type)
① It is a new form of data organization and expression
② Key-value pair: a key is an extension of the data index
③ A dictionary is a collection of key-value pairs; the pairs are accessed by key rather than by position
④ It is created with curly braces {} or dict(), and a colon separates each key from its value
For example: {<key 1>: <value 1>, <key 2>: <value 2>, ..., <key n>: <value n>}

(3) [] is used to index dictionary variables or add elements to them
① <dictionary variable> = {<key 1>: <value 1>, ..., <key n>: <value n>}
② <value> = <dictionary variable>[<key>]
③ <dictionary variable>[<key>] = <value>
(4) Example
Generate a dictionary d in which each element is a key-value pair
d = {"China":"Beijing","U.S.A":"Washington","France":"Paris"}
d["China"] returns 'Beijing'
(5) Generate an empty dictionary
de = {}; type(de)
Note: an empty {} creates a dictionary, so it cannot be used to generate an empty set

2. Dictionary processing function and method

(1) Several functions

① In k in d, k is a key (the index of a data value), not a data value
② d.keys() and d.values() do not return lists; they return the dictionary's keys and values view types (which can be traversed with for ... in)
③ Examples
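The example picture is missing, so here is a minimal sketch reusing the dictionary d defined earlier. It assumes the functions in the table are del d[k], k in d, d.keys(), d.values(), and d.items().

```python
d = {"China": "Beijing", "U.S.A": "Washington", "France": "Paris"}

print("China" in d)        # "China" is a key
print("Beijing" in d)      # "Beijing" is a value, so it is not found

keys = d.keys()            # a view object, not a list
values = list(d.values())  # convert when a real list is needed
for country, capital in d.items():   # iterate over the key-value pairs
    print(country, capital)

del d["France"]            # delete the pair stored under a key
print(list(keys))          # the view reflects the deletion
```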

(2) Method of processing operation
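The method table was also a picture. The sketch below assumes it listed d.get(), d.pop(), d.popitem(), d.clear(), and len(d), all of which are built-in dictionary operations; the key "Pakistan" is made up to show a missing key.

```python
d = {"China": "Beijing", "U.S.A": "Washington", "France": "Paris"}

capital = d.get("China", "unknown")       # the key exists, so its value is returned
fallback = d.get("Pakistan", "unknown")   # the key is missing, so the default is returned
removed = d.pop("France", None)           # remove a key and return its value
size = len(d)                             # number of key-value pairs left
pair = d.popitem()                        # remove and return one (key, value) pair
d.clear()                                 # remove everything

print(capital, fallback, removed, size, pair, d)
```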

3. Dictionary type application scenario

(1) Expression of mapping
① Mapping is everywhere, and key value pairs are everywhere
② For example, counting how many times each data item appears: the data item is the key and the count is the value
③ Main function: express data as key-value pairs and then operate on them

5, Module 5: use of Jieba Library

1. Introduction to Jieba Library

(1) jieba is an excellent third-party library for Chinese word segmentation
① Chinese text must be segmented to obtain individual words
② jieba requires additional installation
③ The jieba library provides three word segmentation modes; at minimum, you only need to master one function
(2) jieba library installation
(cmd command line) pip install jieba
(3) jieba word segmentation principle
① Using a Chinese lexicon, it determines the association probability between Chinese characters
② Characters with a high association probability are grouped into words, which form the segmentation result
③ In addition to the built-in segmentation, users can add custom words (to adapt it to other fields)

2. Description of Jieba Library

(1) Three modes of jieba word segmentation
① Precise mode (most commonly used): cuts the text precisely, with no redundant words
② Full mode: scans out all possible words in the text, with redundancy
③ Search engine mode (more intelligent): on the basis of precise mode, long words are segmented again
(2) Common functions
As shown in figures 22 and 23
(3) Key point of jieba word segmentation
Remember the jieba.lcut(s) function, which performs the word segmentation
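Since the function table is only available as figures, the following sketch shows the three modes side by side. It uses jieba.lcut(s), jieba.lcut(s, cut_all=True), and jieba.lcut_for_search(s); custom words can be added with jieba.add_word(w). The helper segment and the sample sentence are made up for illustration, and the guarded import is only there so the sketch still loads when jieba is not installed.

```python
try:
    import jieba          # third-party library: pip install jieba
except ImportError:
    jieba = None          # lets the sketch load even without jieba

def segment(text):
    """Return the three segmentations of text, or None if jieba is missing."""
    if jieba is None:
        return None
    return {
        "precise": jieba.lcut(text),                # precise mode, no redundancy
        "full": jieba.lcut(text, cut_all=True),     # full mode, all possible words
        "search": jieba.lcut_for_search(text),      # search engine mode
    }

result = segment("中国是一个伟大的国家")
if result is not None:
    for mode, words in result.items():
        print(mode, words)
```

Each call returns an ordinary list of strings, which is why the later examples can feed the result straight into a for loop and a dictionary.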

6, Example: text word frequency statistics

1. Problem analysis

(1) Requirements: which words appear in an article? Which words appear most often?
(2) Word frequency statistics are demonstrated for an English text and a Chinese text
(3) Examples
① English text: word frequency analysis of Hamlet
② Chinese text: analysis of character appearances in Romance of the Three Kingdoms
The text can be downloaded from the link

2. Example explanation

(1) Hamlet English word frequency statistics
Pay attention to case, spaces, and the different punctuation marks
① The text is cleaned of noise and normalized, and then each word is extracted

def getText():  # Read and normalize the text
    txt = open("hamlet.txt", "r").read()  # Open the file and read its contents
    txt = txt.lower()  # Convert everything to lowercase
    for ch in '|"#$%&*()+,-./;:<>=?@[\\]^_\'{}!':  # Go through each special symbol
        txt = txt.replace(ch, " ")  # Replace the special symbol with a space
    return txt  # Normalized result: all words are lowercase, separated by spaces, with no special symbols

hamletTxt = getText()  # Read the file and normalize the text
words = hamletTxt.split()  # split() separates the string on whitespace by default and returns the words as a list
counts = {}  # The dictionary maps each word to its frequency
for word in words:  # Take each element of the words list
    counts[word] = counts.get(word, 0) + 1  # get() returns the value for the key, or the default 0 if the key does not exist
items = list(counts.items())  # Convert the dictionary to a list of (word, count) pairs
items.sort(key=lambda x: x[1], reverse=True)  # The lambda selects the second element of each pair (the count) as the sort key
# sort() orders from small to large by default; reverse=True orders from large to small
for i in range(10):  # Print the 10 most frequent words and their counts
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
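For comparison, the counting and sorting steps can also be written with collections.Counter from the standard library. This is a swapped-in alternative, not the method used in this article, and the sample sentence is made up so the sketch runs without hamlet.txt.

```python
from collections import Counter

sample = "to be or not to be that is the question"   # made-up sample text

counts = Counter(sample.split())             # word -> number of occurrences
for word, count in counts.most_common(3):    # highest counts first
    print("{0:<10}{1:>5}".format(word, count))
```

Counter is a dictionary subclass, so counts[word] and counts.items() behave like the hand-written dictionary version.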

Result statistics

(2) Counting character appearances in Romance of the Three Kingdoms
① Use jieba for word segmentation
② Chinese has no letter case, and Chinese punctuation is handled during word segmentation
③ Code

import jieba
txt = open("thresskingdoms.txt", "r", encoding="utf-8").read()
words = jieba.lcut(txt)  # Segment the text; words is a list of all the words
counts = {}  # Construct the dictionary counts
for word in words:  # Traverse each Chinese word in words one by one
    if len(word) == 1:
        continue  # Skip single-character tokens
    else:
        counts[word] = counts.get(word, 0) + 1  # Count with the dictionary
items = list(counts.items())  # Convert to a list
items.sort(key=lambda x: x[1], reverse=True)  # Sort by count, largest first
for i in range(15):  # Print the 15 most frequent words
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

④ Operation results

Because of the way the text is segmented, the same person appears under several different words, which makes the result unreasonable; the code needs to be modified
(3) Optimizing the character appearance statistics for Romance of the Three Kingdoms
① The result of the original code contains many words that are not names, and different names for the same person are counted separately
② Consider how to adapt the program to the problem
③ Upgraded code

import jieba
txt = open("thresskingdoms.txt", "r", encoding="utf-8").read()
# The words below are shown in English translation; to match jieba's output they must be the Chinese words that appear in the text
excludes = {"general", "But say", "Jingzhou", "Two people", "must not", "No", "such"}  # Set of top-ranked words that are known not to be names
# Run the program repeatedly; whenever a top-ranked word is not a name, add it to this set
words = jieba.lcut(txt)  # Segment the text; words is a list of all the words
counts = {}  # Construct the dictionary counts
for word in words:  # Merge the different names of the same person
    if len(word) == 1:
        continue  # Skip single-character tokens
    elif word == "Zhuge Liang" or word == "Kongming said":
        rword = "Zhuge Liang"
    elif word == "Guan Yu" or word == "Yunchang":
        rword = "Guan Yu"
    elif word == "Xuande" or word == "Xuande said":
        rword = "Liu Bei"
    elif word == "Mengde" or word == "the prime minister":
        rword = "Cao Cao"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1  # Count with the dictionary
for word in excludes:
    counts.pop(word, None)  # Remove the excluded words from the counts
items = list(counts.items())  # Convert to a list
items.sort(key=lambda x: x[1], reverse=True)  # Sort by count, largest first
for i in range(10):  # Print the 10 most frequent names
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

④ Operation results

Keep updating the excludes set to further optimize the output
Final result:

3. Draw inferences from one instance

(1) Extension of application problems
① Character statistics: A Dream of Red Mansions, Journey to the West, Outlaws of the Marsh
② Word frequency statistics: government work reports, scientific research papers, news reports
③ Drawing a word cloud of the words in a text gives a more intuitive display
(2) This is a very good example of making full use of sets, sequences, and dictionaries among the combined data types


After studying this article, you should have a rough command of sets, sequences, dictionaries, and the jieba word segmentation library, enough to compute word frequency statistics for a text. Try writing the code with different data types, compare the advantages and disadvantages of each type, and solve a variety of word segmentation and basic data processing problems.

The next chapter will introduce file and data formatting.

Tags: Python

Posted on Sat, 23 Oct 2021 11:03:06 -0400 by reivax_dj