The difference between sort and sorted:
sort is a method of list, while sorted is a built-in function that can sort any iterable object.
The list sort method sorts the existing list in place and returns None, while the built-in sorted function returns a new list and leaves the original unchanged.
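A minimal sketch of the difference, using made-up sample values (the variable names are only for illustration):

```python
numbers = [3, 1, 2]

# sorted() accepts any iterable and returns a new list; the input is unchanged
ordered = sorted(numbers)
assert ordered == [1, 2, 3]
assert numbers == [3, 1, 2]

# list.sort() reorders the list in place and returns None
result = numbers.sort()
assert result is None
assert numbers == [1, 2, 3]

# sorted() also works on non-list iterables such as strings and dicts
letters = sorted("cab")
keys = sorted({"b": 1, "a": 2})  # iterating a dict yields its keys
```

Because sort returns None, writing something like `numbers = numbers.sort()` would discard the list, which is a common source of bugs.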
Here we read the text files in a folder, count how often each word occurs, treat the most frequent word as the most important one, and print it together with its count.
- Check that the given path is a folder
- Define a list of common words and prepositions to exclude, such as "and" and "is"
- Here we use translate together with the string module. After all punctuation is removed, the text is split on whitespace into a list of words. We then traverse the list and record each word in a dictionary, adding 1 to its count each time it appears. If these two methods are unfamiliar, please refer to my previous article
- In addition to using os.path.isfile to check that the path is a file, use splitext (mentioned earlier) to check whether the suffix is .txt (or any suffix you choose)
- Finally, sort the dictionary items by count in descending order and take the first entry
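The individual steps above can be tried in isolation before reading the full script. The following is a minimal sketch; the sample text, the file name, and the short exclusion list are made up for illustration.

```python
import os
import string

# Hypothetical sample text and file name, used only for illustration
text = "The cat, the dog and the cat!"
filename = "diary/day1.txt"

# Build a table that deletes every punctuation character, then split on whitespace
table = str.maketrans("", "", string.punctuation)
words = text.translate(table).lower().split()

# splitext returns a (root, extension) tuple, so compare the second element
root, ext = os.path.splitext(filename)
is_txt = ext == ".txt"

# Count the words in a dictionary, skipping the excluded ones
exclude_words = ["the", "and"]
counts = {}
for word in words:
    if word in exclude_words:
        continue
    counts[word] = counts.get(word, 0) + 1

# Sort (word, count) pairs by count in descending order and take the first
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
top_word, top_count = ranked[0]
```

Note that the sort key selects the count (`item[1]`); sorting the pairs without a key would order them alphabetically by word instead.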
import os
import string


def count_words(dirpath):
    if not os.path.isdir(dirpath):
        print('please input legal dirpath!')
        return

    # Common words and prepositions to ignore
    exclude_words = ['a', 'an', 'the', 'and', 'or', 'of', 'in', 'at', 'to', 'is', '...']
    # Remove all punctuation except '-', which is needed below to rejoin split words
    table = str.maketrans("", "", string.punctuation.replace('-', ''))

    for root, dirs, files in os.walk(dirpath):
        for name in files:
            filename = os.path.join(root, name)
            # splitext returns (root, ext), so compare only the extension
            if not os.path.isfile(filename) or os.path.splitext(filename)[1] != '.txt':
                print('diary < %s > format is not .txt' % filename)
                continue  # skip this file but keep processing the others

            with open(filename, 'r', encoding='utf-8') as f:
                data = f.read()

            words = data.translate(table).split()
            word_dict = dict()
            # Word splicing: a word ending in '-' is combined with the next word
            prefix = ''
            for word in words:
                word = prefix + word.lower()
                prefix = ''
                if word[-1] == '-':
                    prefix = word[:-1]
                    continue
                if word in exclude_words:
                    continue
                if word in word_dict:
                    word_dict[word] += 1
                else:
                    word_dict[word] = 1

            if not word_dict:
                continue  # nothing to report for an empty file
            # sorted returns a new list of (word, count) pairs, highest count first
            word_list = sorted(word_dict.items(), key=lambda x: x[1], reverse=True)
            print('The most frequent word in diary < %s > is: %s' % (name, word_list[0]))


if __name__ == '__main__':
    count_words('diary')
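As a possible alternative for the counting and ranking steps, the standard library's collections.Counter can replace the manual dictionary and the explicit sort. This is only a sketch of a swapped-in technique, not the approach used in the script above; the helper name most_common_word and the sample text are hypothetical.

```python
import string
from collections import Counter


def most_common_word(text, exclude_words=()):
    """Return the (word, count) pair with the highest count, or None if there are no words."""
    table = str.maketrans("", "", string.punctuation)
    words = text.translate(table).lower().split()
    counts = Counter(w for w in words if w not in exclude_words)
    top = counts.most_common(1)  # list holding at most one (word, count) pair
    return top[0] if top else None


sample = "To be, or not to be: that is the question."
best = most_common_word(sample, exclude_words=("to", "or", "is", "the"))
```

Counter.most_common(1) avoids sorting the whole dictionary when only the top entry is needed, at the cost of hiding the sort step that this article is about.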
- For more code details, refer to my GitHub