Remember this $100 million AI core code?
while True:
    AI = input('I:')
    print(AI.replace("Do you", " ").replace('?', '!'))

That snippet is our topic today: the rule-based chatbot.
Chatbot
A chatbot is a machine or piece of software that simulates human conversation through text. Put simply, it is software you can chat with much as you would talk to another person.
Why try to create a chatbot? Maybe you are interested in a new project, your company needs one, or you want to invest in one. Whatever the motivation, this article will walk through how to create a simple rule-based chatbot.
Rule-based chatbot
What is a rule-based chatbot? It is a chatbot that answers the text a human gives it according to specific rules. Because the rules are imposed by hand, the responses the chatbot generates are largely predictable; however, if it receives a query that matches no rule, the chatbot cannot answer. The other variant is the model-based chatbot, which answers a given query with a machine learning model. (The difference is that with a rule-based bot we must specify every rule ourselves, while a model-based bot derives its rules by training a model. Recall our earlier "Introduction to machine learning" article: "machine learning provides the system with the ability to automatically learn and improve based on experience without explicit programming.")
A rule-based chatbot may rely on rules written by humans, but that does not mean we work without a dataset. The main goal of a chatbot is still to automate answering the questions people ask, so we still need data to formulate specific rules.
In this article we will develop a rule-based chatbot built on cosine similarity. Cosine similarity is a measure of similarity between vectors (specifically, non-zero vectors in an inner product space) and is often used to measure the similarity between two texts.
We will use cosine similarity to create a chatbot that answers a query by comparing its similarity with every sentence in a corpus we build ourselves, which is why we need to develop that corpus first.
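As a quick illustration before we build anything, here is a minimal sketch (separate from the chatbot itself) of how cosine similarity behaves on two toy sentences; the sentences and variable names are made up purely for demonstration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two toy sentences, invented purely for this illustration
texts = ["the cat sat on the mat", "the cat lay on the mat"]

# Vectorize both sentences with TF-IDF and compare them
vectors = TfidfVectorizer().fit_transform(texts)
print(cosine_similarity(vectors[0], vectors[1]))  # a value between 0 and 1; higher means more similar

Because the two sentences share most of their words, the printed similarity is close to 1; two unrelated sentences would score close to 0.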
Create corpus
For this chatbot example, I want to create a chatbot that answers questions about cats. To collect data about cats, I will scrape it from the web.
import bs4 as bs
import urllib.request

# Open the cat Wikipedia page
cat_data = urllib.request.urlopen('https://simple.wikipedia.org/wiki/Cat').read()

# Find all the paragraph tags in the page HTML
cat_data_paragraphs = bs.BeautifulSoup(cat_data, 'lxml').find_all('p')

# Create a lower-cased corpus from all the page paragraphs
cat_text = ''
for p in cat_data_paragraphs:
    cat_text += p.text.lower()

print(cat_text)
With the code above you get a collection of paragraphs from the Wikipedia page. Next, you need to clean the text to remove noise such as bracketed reference numbers and extra whitespace.
import re

cat_text = re.sub(r'\s+', ' ', re.sub(r'\[[0-9]*\]', ' ', cat_text))
The code above removes the bracketed reference numbers and extra whitespace from the corpus. I deliberately did not remove the remaining symbols and punctuation, because the answers sound more natural in conversation when they are kept.
Finally, I will create a list of sentences based on the corpus I created earlier.
import nltk

# sent_tokenize needs the NLTK 'punkt' tokenizer models;
# run nltk.download('punkt') once if they are not installed yet
cat_sentences = nltk.sent_tokenize(cat_text)
Our rule is very simple: measure the cosine similarity between the user's query text and every sentence in the list. The sentence with the closest match (the highest cosine similarity) becomes the chatbot's answer.
Create the chatbot
The corpus above is still plain text, and cosine similarity does not accept text data; therefore the corpus must be converted into numeric vectors. The usual practice is to convert the text into a bag of words (word counts) or to use the TF-IDF (term frequency-inverse document frequency) method. In our example, we will use TF-IDF.
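To make the vectorization step concrete, here is a minimal sketch (separate from the chatbot code) of what TfidfVectorizer produces on a tiny, made-up corpus; the toy_corpus sentences are invented for this example only.

from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny, made-up corpus just to show the shape of the output
toy_corpus = ["cats like to sleep", "cats like to play", "dogs like to play"]

vectorizer = TfidfVectorizer()
toy_vectors = vectorizer.fit_transform(toy_corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary (get_feature_names() in older scikit-learn versions)
print(toy_vectors.shape)                   # (3 sentences, vocabulary size)
print(toy_vectors.toarray())               # one TF-IDF weighted row per sentence

Each sentence becomes one numeric row, which is exactly the form cosine similarity can work with.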
In the following code I create a function that receives the query text and produces an answer based on cosine similarity. Let's take a look at the code.
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

def chatbot_answer(user_query):
    # Append the query to the sentence list
    cat_sentences.append(user_query)

    # Create the sentence vectors based on the list
    vectorizer = TfidfVectorizer()
    sentences_vectors = vectorizer.fit_transform(cat_sentences)

    # Measure the cosine similarity and take the second closest index,
    # because the closest one is the user query itself
    vector_values = cosine_similarity(sentences_vectors[-1], sentences_vectors)
    answer = cat_sentences[vector_values.argsort()[0][-2]]

    # Final check to make sure there is a result. If all the similarities are 0,
    # the text we typed is not captured in the corpus
    input_check = vector_values.flatten()
    input_check.sort()

    if input_check[-2] == 0:
        return "Please Try again"
    else:
        return answer
In short, the function appends the query to the sentence list, vectorizes everything with TF-IDF, measures the cosine similarity between the query and every sentence, and returns the most similar sentence, falling back to "Please Try again" when nothing in the corpus matches.
Finally, create a simple answer interaction using the following code.
print("Hello, I am the Cat Chatbot. What is your meow questions?:") while(True): query = input().lower() if query not in ['bye', 'good bye', 'take care']: print("Cat Chatbot: ", end="") print(chatbot_answer(query)) cat_sentences.remove(query) else: print("See You Again") break
The above script receives queries and processes them through the chatbot we developed earlier.
The results are acceptable, although some answers are a bit strange. Bear in mind, however, that they come from a single data source without any optimization; with additional datasets and rules, the chatbot would certainly answer questions better.
Summary
A chatbot is an exciting data science project because it is helpful in many fields. In this article we used data scraped from a web page, TF-IDF, and cosine similarity to create a simple chatbot in Python, which is essentially our "$100 million" project from the opening joke. There is still plenty of room for improvement:
- For the vectorization step, besides TF-IDF you can also use word2vec, or even a pretrained BERT model, to extract word vectors.
- The answering step is really a search for the best-matching answer in our corpus by some specific algorithm or rule. The similarity top-1 approach used in this article is essentially the simplest greedy-search-like method; to improve the answers, a beam-search-like algorithm could be used to extract several candidate matches instead (see the sketch after this list).
- And much more.
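As a small illustration of that beam-search-like idea, here is one possible sketch that returns the top-k most similar sentences instead of only the single best one. The function name chatbot_top_answers and the parameter k are made up for this example, and the code simply reuses the same TF-IDF and cosine-similarity setup as above rather than implementing a full beam search.

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

def chatbot_top_answers(user_query, k=3):
    # Same vectorization as before: compare the query against every corpus sentence
    sentences = cat_sentences + [user_query]
    vectors = TfidfVectorizer().fit_transform(sentences)
    similarities = cosine_similarity(vectors[-1], vectors[:-1]).flatten()

    # Keep the k highest-scoring sentences (with their scores) as candidate answers
    top_indices = similarities.argsort()[::-1][:k]
    return [(sentences[i], similarities[i]) for i in top_indices if similarities[i] > 0]

A later rule could then re-rank or filter these candidates before picking the final answer, instead of always trusting the single top-1 match.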
Before the rise of end-to-end deep learning, many chatbots operated on rules like these, and there were plenty of real-world deployments. If you need to put together a proof-of-concept demo quickly, this rule-based approach is still very useful.