Count the number of words

The difference between sort and sorted:

  • sort is a method of list, so it only works on lists. sorted is a built-in function that accepts any iterable.

  • list.sort() sorts the existing list in place and returns None, while sorted() returns a new list and leaves the original unchanged.
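
To make the difference concrete, here is a minimal sketch. The variable names and sample values are made up for illustration.

```python
numbers = [3, 1, 2]

# sorted() accepts any iterable and returns a new list; the input is untouched.
ordered = sorted(numbers)
from_tuple = sorted((3, 1, 2), reverse=True)

# list.sort() reorders the list in place and returns None.
result = numbers.sort()

print(ordered)      # [1, 2, 3]
print(from_tuple)   # [3, 2, 1]
print(result)       # None
print(numbers)      # [1, 2, 3]

# A dict has no sort() method, but sorted() can rank its (key, value) pairs.
counts = {'day': 3, 'rain': 1, 'walk': 2}   # hypothetical word counts
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])    # ('day', 3)
```

The last lines show why sorted is the natural choice for ranking word counts stored in a dictionary.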

Here we read the text files in a folder, count the words in each file, treat the most frequent word as the most important one, and print it together with its count.

Process:

  1. Check that the given path is a folder.
  2. Define a list of words to exclude: common words and prepositions such as "and" and "is".
  3. Use str.translate together with the string module to remove all punctuation, strip the surrounding whitespace, and split the text into a list of words. Then traverse the list and store each word in a dictionary, adding 1 to its count every time it appears. If you don't understand these two methods, please refer to my previous article.
  4. In addition to using os.path.isfile to check that the entry is a file, use os.path.splitext (mentioned earlier) to check that the suffix is .txt (or whatever suffix you choose).
  5. Finally, sort the dictionary data by count in descending order and take the first item.

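
As a small illustration of steps 3 and 4, the sketch below runs the two helper calls on their own. The sample sentence and the file name are made up and are not part of the diary data.

```python
import os
import string

# Translation table that deletes every ASCII punctuation character.
table = str.maketrans("", "", string.punctuation)

sample = "  Hello, world! Hello again.  "   # made-up sample text
words = sample.translate(table).strip().split()
print(words)   # ['Hello', 'world', 'Hello', 'again']

# splitext splits a path into (root, extension); the extension keeps its dot.
root, ext = os.path.splitext("diary/day1.txt")   # hypothetical file name
print(root, ext)   # diary/day1 .txt
```

Note that string.punctuation also contains '-', so hyphens disappear along with commas and periods when the table is applied.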
import os
import string

def count_words(dirpath):
    if not os.path.isdir(dirpath):
        print('please input a valid directory path!')
        return

    exclude_words = ['a', 'an', 'the', 'and', 'or', 'of', 'in', 'at', 'to', 'is','...' ]
    table = str.maketrans("", "", string.punctuation)
    for root, dirs, files in os.walk(dirpath):
        for name in files:
            filename = os.path.join(root, name)
            # Skip anything that is not a regular .txt file instead of aborting the whole walk
            if not os.path.isfile(filename) or os.path.splitext(filename)[1] != '.txt':
                print('diary < %s > format is not .txt' % filename)
                continue
            # Read the whole file; the with-block closes it even if reading fails
            with open(filename, 'r', encoding='utf-8') as f:
                data = f.read()

            word_dict = dict()
            # Word splicing: a word that ends with '-' (a word broken across lines)
            # is joined with the next word to form a single word. This has to happen
            # before punctuation is stripped, because the table also removes '-'.
            prefix = ''
            for word in data.split():
                word = word.lower()
                if word[-1] == '-':
                    prefix += word[:-1]
                    continue
                word = (prefix + word).translate(table)
                prefix = ''
                if not word or word in exclude_words:
                    continue
                word_dict[word] = word_dict.get(word, 0) + 1

            if not word_dict:
                print('diary < %s > contains no countable words' % name)
                continue
            # sorted() returns a new list of (word, count) pairs, highest count first
            ranked = sorted(word_dict.items(), key=lambda x: x[1], reverse=True)
            print("ranked", type(ranked))  # a list, not a dict
            print('The most frequent word in diary < %s > is: %s' % (name, ranked[0]))

if __name__ == '__main__':
    count_words('diary')
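
The counting and ranking steps can also be written with collections.Counter from the standard library. This is only an alternative sketch, not the approach used in the script: the function name most_common_word, its exclude parameter, and the sample sentence are hypothetical.

```python
from collections import Counter
import string

def most_common_word(text, exclude=()):
    """Return (word, count) for the most frequent word in text, or None."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    # Counter does the "add 1 per occurrence" bookkeeping for us.
    counts = Counter(w for w in words if w not in exclude)
    top = counts.most_common(1)   # list with at most one (word, count) pair
    return top[0] if top else None

sample = "A rainy day. The day was long, and the day was quiet."
print(most_common_word(sample, exclude={"a", "the", "and"}))   # ('day', 3)
```

Counter.most_common(1) replaces the manual sorted(...)[0] lookup, and returning None for empty input avoids an IndexError on empty files.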



Posted on Mon, 02 Dec 2019 21:39:57 -0500 by Ibnatu