Representing unstructured documents as vectors can be done in many ways. One very common approach is to take the well-known word2vec algorithm and generalize it to the document level, which is known as doc2vec.
Gensim is a great Python library for training such doc2vec models, and that is what this tutorial will show.
From Words to Documents
Words that appear in similar contexts have similar meanings
This is the underlying assumption behind word2vec, and it is what allows the algorithm to be so powerful.
When looking at the document level, we see that each document is composed of a different number of words. Trying to force all documents to the same number of words (for example, by truncating or padding them) would likely cause relevant data to be lost.
We would like to achieve the following:
- We want the document vector to capture as much information about the document as possible
- We want all document vectors to have the same dimension
Doc2vec meets both of these goals. During training:
- A word vector W is generated for each word
- A document vector D is generated for each document
In the inference stage, the model uses the weights learned during training, which stay fixed, and outputs a new vector D for a given document.
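With Gensim, this inference step is a single call to infer_vector (the model file name below is illustrative):

```python
from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess

# Load a previously trained model (path is illustrative).
model = Doc2Vec.load("doc2vec.model")

# infer_vector takes a list of tokens and returns a fixed-size vector D
# for the new document, without changing the trained model weights.
new_doc = simple_preprocess("Gensim makes training doc2vec models straightforward")
vector = model.infer_vector(new_doc)
print(vector.shape)  # (vector_size,)
```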
Training a doc2vec model on a large corpus
Training a doc2vec model in the old style required all the data to be in memory, which made training a model on a large corpus (a collection of documents) nearly impossible on a home laptop.
Gensim introduced a way to stream documents one by one from disk, instead of having them all stored in RAM.
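Streaming works by passing the model any restartable iterable of TaggedDocument objects. A minimal sketch of this pattern, assuming a corpus file with one document per line (the file name and layout are illustrative):

```python
from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess

class StreamingCorpus:
    """Streams one document per line from a text file, so the whole
    corpus never has to fit in RAM. The file layout is an assumption
    made for illustration."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                # Each iteration yields a single TaggedDocument, which can
                # be discarded before the next line is read.
                yield TaggedDocument(words=simple_preprocess(line), tags=[i])
```

A class with `__iter__` (rather than a plain generator) is used because Gensim iterates over the corpus several times: once to build the vocabulary and once per training epoch.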
The Training Process
Invoking Doc2VecTrainer.run will start the training process on the input training corpus.
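The Doc2VecTrainer implementation itself isn't shown here; a minimal sketch of what such a wrapper might do using Gensim's standard API (the class body and all hyperparameter values below are assumptions, not the author's actual code):

```python
from gensim.models.doc2vec import Doc2Vec

class Doc2VecTrainer:
    """Hypothetical sketch of a training wrapper around Gensim's Doc2Vec."""
    def __init__(self, corpus_iterable, vector_size=300, epochs=20, min_count=5):
        self.corpus_iterable = corpus_iterable
        self.model = Doc2Vec(vector_size=vector_size, epochs=epochs,
                             min_count=min_count, workers=4)

    def run(self, output_path="doc2vec.model"):
        # A first pass over the streamed corpus builds the vocabulary...
        self.model.build_vocab(self.corpus_iterable)
        # ...then training passes learn the word vectors W and document vectors D.
        self.model.train(self.corpus_iterable,
                         total_examples=self.model.corpus_count,
                         epochs=self.model.epochs)
        self.model.save(output_path)
```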
As I wrote before, in order to train on a large corpus I used the DocumentsIterable object, which receives a path in the file system and knows how to parse a single file on each iteration (in this case the corpus consisted of XML files, so DocumentsIterable parses XML).
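The original DocumentsIterable code isn't included in this post; below is a sketch of such a class, assuming one document per XML file with its content in `<text>` elements (the element name and tagging scheme are assumptions for illustration):

```python
import glob
import os
import xml.etree.ElementTree as ET

from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess

class DocumentsIterable:
    """Sketch of a streaming corpus over a directory of XML files.
    The <text> element name and the file-name tags are assumptions,
    not the author's actual schema."""
    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        for file_path in glob.glob(os.path.join(self.dir_path, "*.xml")):
            root = ET.parse(file_path).getroot()
            # Collect the text of all <text> elements in the file;
            # adjust the element name for the real schema.
            words = []
            for node in root.iter("text"):
                if node.text:
                    words.extend(simple_preprocess(node.text))
            yield TaggedDocument(words=words, tags=[os.path.basename(file_path)])
```

Because only one file is parsed per iteration, memory usage stays roughly constant no matter how many documents the corpus contains.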