object-oriented
Implementing a search engine in Python
A search engine consists of four parts: a searcher, an indexer, a retriever, and a user interface.
Generally speaking, a searcher is a crawler. It can crawl a lot of content from various websites on the Internet and send it to an indexer.
After the indexer obtains the web page and content, it processes the content, forms an index, and stores it in the internal database for retrieval.
The user interface refers to the web page or app front end, for example the search pages of Baidu and Google. The user sends a query to the search engine through the interface, and the query is parsed and delivered to the retriever; the retriever searches the index efficiently and returns the results to the user.
example
- Retrieval of five documents
    # 1.txt
    I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.

    # 2.txt
    I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.

    # 3.txt
    I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

    # 4.txt
    This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .

    # 5.txt
    And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"
- Define SearchEngineBase base class
    class SearchEngineBase(object):
        def __init__(self):
            pass

        def add_corpus(self, file_path):
            with open(file_path, 'r') as fin:
                text = fin.read()
            self.process_corpus(file_path, text)

        def process_corpus(self, id, text):
            raise Exception('process_corpus not implemented.')

        def search(self, query):
            raise Exception('search not implemented.')

    def main(search_engine):
        for file_path in ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt']:
            search_engine.add_corpus(file_path)

        while True:
            query = input()
            results = search_engine.search(query)
            print('found {} result(s):'.format(len(results)))
            for result in results:
                print(result)
SearchEngineBase is meant to be inherited, and each subclass represents a different search algorithm. Every engine should implement two functions, process_corpus() and search(), which correspond to the indexer and the retriever.
The add_corpus() function reads the contents of a file and passes them to process_corpus(), with the file path as the ID.
process_corpus() processes the content and stores the result under that ID. The processed content is called the index.
search() takes a query, processes it, looks it up in the index, and returns the results.
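The call flow described above can be checked with a small sketch. It repeats the relevant base-class methods so that it runs on its own; RecordingEngine, the temporary file demo.txt, and its contents are hypothetical names introduced only for illustration.

```python
import os
import tempfile


class SearchEngineBase(object):
    # Copy of the base-class methods from the article (main() omitted).
    def add_corpus(self, file_path):
        with open(file_path, 'r') as fin:
            text = fin.read()
        self.process_corpus(file_path, text)

    def process_corpus(self, id, text):
        raise Exception('process_corpus not implemented.')

    def search(self, query):
        raise Exception('search not implemented.')


class RecordingEngine(SearchEngineBase):
    """Hypothetical subclass that only records what process_corpus() receives."""

    def __init__(self):
        self.calls = []

    def process_corpus(self, id, text):
        self.calls.append((id, text))


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')
    with open(path, 'w') as fout:
        fout.write('I have a dream today.')
    engine = RecordingEngine()
    engine.add_corpus(path)  # reads the file, then calls process_corpus(path, text)

received_id, received_text = engine.calls[0]
print(os.path.basename(received_id), '->', received_text)

# search() was not overridden, so the base-class version raises.
try:
    engine.search('dream')
    search_error = None
except Exception as exc:
    search_error = str(exc)
print(search_error)
```

The file path arrives in process_corpus() as the ID, and calling a function that the subclass has not implemented fails with the message from the base class.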
- A basic working search engine
    class SimpleEngine(SearchEngineBase):
        def __init__(self):
            super(SimpleEngine, self).__init__()
            self.__id_to_texts = {}

        def process_corpus(self, id, text):
            self.__id_to_texts[id] = text

        def search(self, query):
            results = []
            for id, text in self.__id_to_texts.items():
                if query in text:
                    results.append(id)
            return results

    search_engine = SimpleEngine()
    main(search_engine)

    ########## output ##########
    simple
    found 0 result(s):
    little
    found 2 result(s):
    1.txt
    2.txt
SimpleEngine is a subclass of SearchEngineBase. It implements process_corpus() and search(), and it also inherits add_corpus(), which is why main() can call that function directly.
In the new constructor, self.__id_to_texts = {} initializes a private variable of the instance: the dictionary that maps file names to file contents.
The process_corpus() function simply inserts the contents of the file into the dictionary.
The search() function enumerates the dictionary directly and collects the ID of every document whose text contains the query string.
However, this search engine has two problems. The first is inefficiency: the index takes a lot of space because the indexing function does no processing and stores every file in full, and each search has to scan all of the stored text again.
The second is that the query can only be a single word, or several words that appear next to each other in the text.
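The second limitation is easy to reproduce. The following is a minimal sketch that applies the same substring test as SimpleEngine.search() to a hypothetical two-document corpus; the shortened texts and the function name substring_search are assumptions made for illustration.

```python
# Hypothetical miniature corpus; the texts are shortened excerpts.
id_to_texts = {
    '1.txt': 'my four little children will one day live in a nation',
    '2.txt': 'little black boys and black girls will be able to join hands',
}


def substring_search(query):
    # Same test as SimpleEngine.search(): a plain substring check on the raw text.
    return [doc_id for doc_id, text in id_to_texts.items() if query in text]


single_word = substring_search('little')            # one word
adjacent = substring_search('little children')      # adjacent words, in order
reordered = substring_search('children little')     # same words, other order
separated = substring_search('boys girls')          # both in 2.txt, not adjacent

print(single_word, adjacent, reordered, separated)
```

Only the first two queries find anything. A query whose words are reordered or separated by other words returns nothing, even though every word occurs in a document.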
Bag of Words and Inverted Index
    import re

    class BOWInvertedIndexEngine(SearchEngineBase):
        def __init__(self):
            super(BOWInvertedIndexEngine, self).__init__()
            self.inverted_index = {}

        def process_corpus(self, id, text):
            words = self.parse_text_to_words(text)
            for word in words:
                if word not in self.inverted_index:
                    self.inverted_index[word] = []
                self.inverted_index[word].append(id)

        def search(self, query):
            query_words = list(self.parse_text_to_words(query))
            # An empty query has nothing to match (avoids an IndexError below)
            if not query_words:
                return []
            query_words_index = list()
            for query_word in query_words:
                query_words_index.append(0)

            # If the inverted list of any query word is empty, return immediately
            for query_word in query_words:
                if query_word not in self.inverted_index:
                    return []

            result = []
            while True:
                # First, get the current element of every inverted list
                current_ids = []
                for idx, query_word in enumerate(query_words):
                    current_index = query_words_index[idx]
                    current_inverted_list = self.inverted_index[query_word]
                    # One inverted list has been traversed to its end, so the search is over
                    if current_index >= len(current_inverted_list):
                        return result
                    current_ids.append(current_inverted_list[current_index])

                # If all elements of current_ids are the same, every query word
                # appears in the document that this element refers to
                if all(x == current_ids[0] for x in current_ids):
                    result.append(current_ids[0])
                    query_words_index = [x + 1 for x in query_words_index]
                    continue

                # Otherwise, advance the pointer of the smallest element by one
                min_val = min(current_ids)
                min_val_pos = current_ids.index(min_val)
                query_words_index[min_val_pos] += 1

        @staticmethod
        def parse_text_to_words(text):
            # Use a regular expression to remove punctuation and line breaks
            text = re.sub(r'[^\w ]', ' ', text)
            # Convert to lowercase
            text = text.lower()
            # Generate a list of all words
            word_list = text.split(' ')
            # Remove empty words
            word_list = filter(None, word_list)
            # Return the set of words
            return set(word_list)

    search_engine = BOWInvertedIndexEngine()
    main(search_engine)

    ########## output ##########
    little
    found 2 result(s):
    1.txt
    2.txt
    little vicious
    found 1 result(s):
    2.txt
The BOW (bag-of-words) model is one of the most common and simplest models in the NLP field.
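To make the two ideas in the heading concrete, here is a minimal sketch on a hypothetical two-document corpus (a.txt and b.txt are made-up names). It reuses the parsing steps of parse_text_to_words() and then builds a word-to-documents mapping the same way process_corpus() does.

```python
import re


def parse_text_to_words(text):
    # Same steps as the article's helper: strip punctuation, lowercase, split, deduplicate.
    text = re.sub(r'[^\w ]', ' ', text).lower()
    return set(filter(None, text.split(' ')))


# Hypothetical corpus for illustration.
docs = {
    'a.txt': 'Free at last! Free at last!',
    'b.txt': 'At last we are free.',
}

# Bag of words: word order and repetition are discarded.
bags = {doc_id: parse_text_to_words(text) for doc_id, text in docs.items()}

# Inverted index: word -> list of the documents that contain it.
inverted_index = {}
for doc_id in sorted(bags):
    for word in bags[doc_id]:
        inverted_index.setdefault(word, []).append(doc_id)

print(sorted(bags['a.txt']))
print(sorted(inverted_index.items()))
```

Because the documents are added in sorted order, every list in the inverted index is itself sorted, which is the property the search step relies on.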
You can see that the new model keeps using the previous interface: only the three functions __init__(), process_corpus(), and search() are modified.
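The core of search() in BOWInvertedIndexEngine is an intersection of several sorted lists, with one pointer per list. The sketch below isolates that step in a standalone function; the name intersect_sorted and the sample lists are assumptions for illustration, not part of the engine.

```python
def intersect_sorted(lists):
    """Return the values present in every sorted list, using one pointer per list."""
    if not lists:
        return []
    positions = [0] * len(lists)
    result = []
    while True:
        current = []
        for i, lst in enumerate(lists):
            # Once any list is exhausted, no further common value can exist.
            if positions[i] >= len(lst):
                return result
            current.append(lst[positions[i]])
        if all(x == current[0] for x in current):
            # Every pointer sits on the same value: it is in all lists.
            result.append(current[0])
            positions = [p + 1 for p in positions]
        else:
            # Advance the pointer that sits on the smallest value.
            positions[current.index(min(current))] += 1


# Posting lists shaped like those for 'little' and 'vicious' in the example output.
little = ['1.txt', '2.txt']
vicious = ['2.txt']
print(intersect_sorted([little, vicious]))
print(intersect_sorted([[1, 3, 5, 7], [3, 4, 7], [2, 3, 7, 9]]))
```

Each pointer only moves forward, so the work is bounded by the total length of the lists rather than by the number of documents in the corpus.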
LRU and multiple inheritance
Once the search engine goes online, the number of visits keeps growing, and the server becomes a little "overburdened". Suppose you find that a large number of repeated searches account for more than 90% of the traffic. At that point, it makes sense to add a cache to the search engine.
    import pylru

    class LRUCache(object):
        def __init__(self, size=32):
            self.cache = pylru.lrucache(size)

        def has(self, key):
            return key in self.cache

        def get(self, key):
            return self.cache[key]

        def set(self, key, value):
            self.cache[key] = value

    class BOWInvertedIndexEngineWithCache(BOWInvertedIndexEngine, LRUCache):
        def __init__(self):
            super(BOWInvertedIndexEngineWithCache, self).__init__()
            LRUCache.__init__(self)

        def search(self, query):
            if self.has(query):
                print('cache hit!')
                return self.get(query)

            result = super(BOWInvertedIndexEngineWithCache, self).search(query)
            self.set(query, result)

            return result

    search_engine = BOWInvertedIndexEngineWithCache()
    main(search_engine)

    ########## output ##########
    little
    found 2 result(s):
    1.txt
    2.txt
    little
    cache hit!
    found 2 result(s):
    1.txt
    2.txt
LRUCache defines a cache class; a class that inherits from it can call its methods directly.
In search(), the has() function first checks whether the query is already in the cache. If it is, get() returns the cached result directly. If it is not, the query is passed to the parent engine to compute the result, which is then inserted into the cache.
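The eviction behaviour that gives an LRU (least recently used) cache its name can be sketched without the third-party pylru package. TinyLRU below is a hypothetical stand-in built on the standard library's OrderedDict; it mirrors the has()/get()/set() interface of LRUCache but is not pylru's implementation, and the cached values are illustrative.

```python
from collections import OrderedDict


class TinyLRU(object):
    """Hypothetical stand-in for pylru.lrucache, built on OrderedDict."""

    def __init__(self, size=2):
        self.size = size
        self.data = OrderedDict()

    def has(self, key):
        return key in self.data

    def get(self, key):
        self.data.move_to_end(key)  # mark the key as most recently used
        return self.data[key]

    def set(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.size:
            self.data.popitem(last=False)  # evict the least recently used entry


cache = TinyLRU(size=2)
cache.set('little', ['1.txt', '2.txt'])
cache.set('vicious', ['2.txt'])
cache.get('little')                 # 'little' becomes the most recently used key
cache.set('dream', ['1.txt'])       # the cache is full, so 'vicious' is evicted
print(list(cache.data))
```

With a capacity of two, the query that was used longest ago is dropped first, so frequently repeated queries tend to stay cached.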
You can see that the BOWInvertedIndexEngineWithCache class inherits two classes.
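Multiple inheritance has two initialization methods.
- The first
    super(BOWInvertedIndexEngineWithCache, self).__init__()
This call starts with the first parent class in the inheritance chain. When using this method, the top-level parent class of the inheritance chain must inherit from object (in Python 3 every class does so implicitly).
- The second
    LRUCache.__init__(self)
When more than one parent constructor has to run, as in the class above, the remaining constructor is called explicitly on the parent class.
Separately, the following call in search() explicitly invokes the parent-class function that the subclass has overridden:
    super(BOWInvertedIndexEngineWithCache, self).search(query)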
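The following sketch shows why the engine needs both initialization calls. Engine, Cache, and CachedEngine are hypothetical miniature classes that mimic the shape of the real ones; like SearchEngineBase, Engine does not call super().__init__() itself, so the chain stops there.

```python
class Engine(object):
    def __init__(self):
        # Like SearchEngineBase.__init__, this does not call super().__init__().
        self.index = {}

    def search(self, query):
        return 'engine result for ' + query


class Cache(object):
    def __init__(self):
        self.cache = {}


class CachedEngine(Engine, Cache):
    def __init__(self):
        super(CachedEngine, self).__init__()  # runs Engine.__init__ only
        self.cache_ready_after_super = hasattr(self, 'cache')
        Cache.__init__(self)                  # explicit call for the second parent

    def search(self, query):
        if query not in self.cache:
            # Reach the overridden parent implementation through super().
            self.cache[query] = super(CachedEngine, self).search(query)
        return self.cache[query]


engine = CachedEngine()
mro_names = [cls.__name__ for cls in CachedEngine.__mro__]
print(mro_names)
print(engine.cache_ready_after_super)
print(engine.search('little'))
```

After the super() call alone, the cache attribute does not exist yet, which is why the second constructor is invoked explicitly. The method resolution order printed first shows which parent super() reaches.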