Almost all of us use WhatsApp on a daily basis. Those conversations are unstructured text that we can learn from and experiment with. In this tutorial I will show how to create a very simple chatbot that you can chat with, simply by training a Doc2Vec model on the messages you already have on your phone.
Disclaimer: This post and its implementation are based on the following great post, which appeared in Towards Data Science
If you’re just interested in the full Python notebook, it’s right here (I changed the original names)
At a high level, the steps would include:
- Loading your WhatsApp conversation into a Python DataFrame
- Preparing a training set of (text, response) tuples – so the chatbot will be able to respond to your input
- Training a Doc2Vec model
- Implementing the chatbot conversation in Python
Let’s start…
Loading your WhatsApp conversation into a Python DataFrame
Start by downloading your selected WhatsApp conversation to your computer.
To do that, open the conversation in the mobile app. Inside the settings menu you’ll see an “export chat” button. Save the file to your Google Drive and copy it to your local computer.
Each row in the file looks like this:
3/5/17, 12:58 - ${full-name}: ${message}
To parse each line and retrieve the information in it, we define a function called “isMessage”, which takes a single line and returns a list with all the parsed info: [date, time, name, message].
The function defines a regex that matches each message line, extracts the content of the captured groups, and returns a list with the data. If the input line is not a message, it returns None.
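The parsing step described above can be sketched roughly as follows. This is a minimal sketch, not the notebook's exact code: the regex is an assumption derived from the sample line shown earlier (the date and time layout varies with the phone's locale), and `MESSAGE_RE`, `load_chat` and the column names are hypothetical names chosen for illustration. It also assumes pandas is the DataFrame library.

```python
import re

import pandas as pd

# Matches lines such as "3/5/17, 12:58 - Full Name: message text".
# Assumption: this layout follows the sample line; other locales differ.
MESSAGE_RE = re.compile(
    r"^(\d{1,2}/\d{1,2}/\d{2,4}), (\d{1,2}:\d{2}) - ([^:]+): (.*)$"
)


def isMessage(line):
    """Return [date, time, name, message] for a message line, else None."""
    match = MESSAGE_RE.match(line.strip())
    if match is None:
        return None
    return list(match.groups())


def load_chat(lines):
    """Parse an iterable of exported lines into a DataFrame of messages.

    Lines that are not messages (system notices, continuation lines of
    multi-line messages) are skipped in this simplified version.
    """
    rows = [parsed for parsed in map(isMessage, lines) if parsed is not None]
    return pd.DataFrame(rows, columns=["date", "time", "name", "message"])
```

With the export saved under a hypothetical name such as `chat.txt`, opening it with `open("chat.txt", encoding="utf-8")` and passing the file object to `load_chat` would yield one row per parsed message.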
Once we have a DataFrame with all the messages, we build a second DataFrame that holds the training data. In our example, we take every pair of consecutive messages and treat it as an [input, output] pair.
The training DataFrame uses the following columns:
- id: this will be an incremental index of the message
- text: this is the input message (for the machine learning algorithm this will be the input X)
- response: this is the output text (for the machine learning algorithm this will be the output Y)
- name: the name of the person who wrote this message (just for visualization if we want)
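Building those columns from the parsed messages might look like the sketch below. It assumes a pandas DataFrame with `name` and `message` columns (hypothetical names) and assumes that `name` refers to the author of the input text; `build_training_set` is an illustrative helper, not a function from the original notebook.

```python
import pandas as pd


def build_training_set(messages):
    """Pair every message with the message that immediately follows it."""
    # Input X: every message except the last one.
    texts = messages["message"].iloc[:-1].reset_index(drop=True)
    # Output Y: the same column shifted by one, so row i answers text i.
    responses = messages["message"].iloc[1:].reset_index(drop=True)
    # Assumption: "name" is the author of the input text.
    names = messages["name"].iloc[:-1].reset_index(drop=True)
    return pd.DataFrame({
        "id": list(range(len(texts))),
        "text": texts,
        "response": responses,
        "name": names,
    })
```

Pairing strictly consecutive messages is the simplest choice; it does not check whether the two messages come from different people or were sent close together in time.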
Training the Model with Gensim
The Doc2Vec model is initialized with the training data and trained for 20 epochs. Doc2Vec learns a vector representation for each token in the vocabulary, as well as a vector for each message in the training set.
Implementing the ChatBot
Lastly, we will write the chatbot loop, which receives input from the user, searches for the most similar response, and prints it to the screen.
Cheers