Representing unstructured documents as vectors can be done in many ways. One very common approach is to take the well-known word2vec algorithm and generalize it to the document level, an approach also known as doc2vec.
A great Python library for training such doc2vec models is Gensim, and that is what this tutorial will show.
From Words to Documents
Similar words appear in similar contexts
This is the underlying assumption behind word2vec, and it is what makes the algorithm so powerful.
When looking at the document level, each document is composed of a different number of words; trying to force all documents to the same number of words would likely cause relevant information to be lost.
We would like to achieve the following:
- We want the document vector to capture as much information as possible
- We want all document vectors to have the same dimension
Gensim is a powerful Python library that lets you achieve exactly that.
When training a doc2vec model with Gensim, the following happens:
- a word vector W is generated for each word
- a document vector D is generated for each document
In the inference stage, the model uses the learned weights to output a new vector D for a given (possibly unseen) document.
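For example, once a trained model has been saved, inferring a vector for a new document takes only a few lines. This is a minimal sketch; the model file name and the example sentence are placeholders:

```python
from gensim.models.doc2vec import Doc2Vec

# Load a previously trained model (the file name is just an example)
model = Doc2Vec.load("my_doc2vec.model")

# Tokenize the new document the same way the training documents were tokenized
tokens = "a new document the model has never seen".split()

# infer_vector runs the inference stage and returns a vector D
# with the same dimension as the trained document vectors
new_vector = model.infer_vector(tokens)
print(new_vector.shape)  # e.g. (300,) if the model was trained with vector_size=300
```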
Training a doc2vec Model on a Large Corpus
Training a doc2vec model the naive way requires all of the data to be held in memory, which makes training on a large corpus nearly impossible on a home laptop.
Gensim introduced a way to stream documents one by one from disk, instead of having them all stored in RAM.
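The streaming idea is just a Python iterable whose __iter__ method yields one TaggedDocument at a time. The article's DocumentsIterable parses XML files; the sketch below assumes, for simplicity, plain-text files with one document per line, so it only illustrates the pattern:

```python
import os
from gensim.models.doc2vec import TaggedDocument

class DocumentsIterable:
    """Streams documents one by one from disk instead of keeping them all in RAM."""

    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        for file_name in os.listdir(self.dir_path):
            with open(os.path.join(self.dir_path, file_name), encoding="utf-8") as f:
                for i, line in enumerate(f):
                    # Only one document is held in memory at any point in time
                    yield TaggedDocument(words=line.split(), tags=[f"{file_name}_{i}"])
```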
If your training documents are stored inside a DataFrame, you can initialize MyCorpus with the DataFrame object (here, I assume the document text is inside a ‘text’ column and the id is inside a ‘documentId’ column).
I also added basic preprocessing steps:
- remove email addresses
- remove websites addresses
- remove quotes
- remove stopwords
- stem the words using PorterStemmer
- convert the text to lowercase
This is done as follows:
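The original snippet is not reproduced here, so the following is a reconstruction of MyCorpus and the preprocessing steps; the regular expressions and the choice of Gensim's remove_stopwords and PorterStemmer helpers are my assumptions:

```python
import re
from gensim.models.doc2vec import TaggedDocument
from gensim.parsing.porter import PorterStemmer
from gensim.parsing.preprocessing import remove_stopwords

stemmer = PorterStemmer()

def preprocess(text):
    """Clean a raw document and return a list of stemmed tokens."""
    text = text.lower()                               # convert to lowercase
    text = re.sub(r"\S+@\S+", " ", text)              # remove email addresses
    text = re.sub(r"(https?://|www\.)\S+", " ", text) # remove website addresses
    text = re.sub(r"[\"']", " ", text)                # remove quotes
    text = remove_stopwords(text)                     # remove stopwords
    return [stemmer.stem(token) for token in text.split()]

class MyCorpus:
    """Streams TaggedDocument objects out of a DataFrame, one row at a time."""

    def __init__(self, df):
        self.df = df

    def __iter__(self):
        for row in self.df.itertuples():
            yield TaggedDocument(words=preprocess(row.text), tags=[row.documentId])
```

With a DataFrame df that has the ‘text’ and ‘documentId’ columns, the training corpus is then simply MyCorpus(df).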
The Training Process
Invoking Doc2VecTrainer.run will start the training process on the input training corpus.
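Doc2VecTrainer itself is not shown in this excerpt; a minimal version built on Gensim's Doc2Vec could look like the sketch below (all hyperparameter values are illustrative, not necessarily the ones used in the article):

```python
from gensim.models.doc2vec import Doc2Vec

class Doc2VecTrainer:
    """Thin wrapper that trains a Gensim Doc2Vec model on a streaming corpus."""

    def __init__(self, corpus, vector_size=300, min_count=2, epochs=20, workers=4):
        self.corpus = corpus
        self.model = Doc2Vec(vector_size=vector_size, min_count=min_count,
                             epochs=epochs, workers=workers)

    def run(self, output_path="doc2vec.model"):
        # First streaming pass: build the vocabulary
        self.model.build_vocab(self.corpus)
        # Additional streaming passes: train the word and document vectors
        self.model.train(self.corpus,
                         total_examples=self.model.corpus_count,
                         epochs=self.model.epochs)
        self.model.save(output_path)
        return self.model
```

Note that the corpus has to be re-iterable (a class implementing __iter__, not a one-shot generator), because build_vocab and train each make their own passes over the data.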
To conclude, there are two approaches you can use to create the training data (a short usage sketch follows the list):
- The first approach uses the DocumentsIterable object, which basically gets a path in the file system and knows how to parse a single document on each iteration (in this case I used XML files, so DocumentsIterable parses the XML files).
- The second approach uses a DataFrame that already contains the training data in the following columns: [“text”, “documentId”]. The DataFrame is passed as a parameter when instantiating the MyCorpus object.
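Putting it together with the sketched classes from above, the two approaches could be wired up like this (the path and file names are placeholders):

```python
# Approach 1: stream the documents from files on disk
corpus = DocumentsIterable("/path/to/training/files")

# Approach 2: stream the documents out of a DataFrame
# corpus = MyCorpus(df)  # df has the "text" and "documentId" columns

trainer = Doc2VecTrainer(corpus)
model = trainer.run("doc2vec.model")
```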
Cheers