Representing unstructured documents as vectors can be done in many ways. One very common approach is to take the well-known word2vec algorithm and generalize it to the document level, an approach also known as doc2vec.
A great Python library for training such doc2vec models is Gensim, and that is what this tutorial will show.
From Words to Documents
Similar words appear in similar contexts
This is the underlying assumption behind word2vec, and it is what makes the algorithm so powerful.
When looking at the document level, each document is composed of a different number of words; trying to force all documents to the same number of words would likely cause relevant information to be lost.
We would like to achieve the following:
- We want the document vector to capture as much information as possible
- We want all document vectors to have the same dimension
Gensim is a powerful Python library that lets you achieve exactly that.
When training a doc2vec model with Gensim, the following happens:
- a word vector W is generated for each word
- a document vector D is generated for each document
In the inference stage, the model uses the learned weights to output a new vector D for a given (possibly unseen) document.
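For example, once a trained model has been saved, inferring a vector for a new document takes only a few lines. This is a minimal sketch; the model file name and the example sentence are placeholders:

```python
from gensim.models.doc2vec import Doc2Vec

# Load a previously trained model (the file name is just an example)
model = Doc2Vec.load("my_doc2vec.model")

# Tokenize the new document the same way the training documents were tokenized
tokens = "a new document the model has never seen".split()

# infer_vector runs the inference stage and returns a vector D
# with the same dimension as the trained document vectors
new_vector = model.infer_vector(tokens)
print(new_vector.shape)  # e.g. (300,) if the model was trained with vector_size=300
```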
Training a doc2vec Model on a Large Corpus
Training a doc2vec model the naive way requires all of the data to be held in memory, which makes training on a large corpus nearly impossible on a home laptop.
Gensim introduced a way to stream documents one by one from disk, instead of having them all stored in RAM.
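The streaming idea is just a Python iterable whose __iter__ method yields one TaggedDocument at a time. The article's DocumentsIterable parses XML files; the sketch below assumes, for simplicity, plain-text files with one document per line, so it only illustrates the pattern:

```python
import os
from gensim.models.doc2vec import TaggedDocument

class DocumentsIterable:
    """Streams documents one by one from disk instead of keeping them all in RAM."""

    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        for file_name in os.listdir(self.dir_path):
            with open(os.path.join(self.dir_path, file_name), encoding="utf-8") as f:
                for i, line in enumerate(f):
                    # Only one document is held in memory at any point in time
                    yield TaggedDocument(words=line.split(), tags=[f"{file_name}_{i}"])
```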
If your training documents are stored inside a DataFrame, you can initialize MyCorpus with the DataFrame object (here, I assume the document text is inside a ‘text’ column and the id is inside a ‘documentId’ column).
I also added basic preprocessing steps:
- remove email addresses
- remove websites addresses
- remove quotes
- remove stopwords
- stem the words using PorterStemmer
- convert the text to lowercase
This is done as follows:
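The original snippet is not reproduced here, so the following is a reconstruction of MyCorpus and the preprocessing steps; the regular expressions and the choice of Gensim's remove_stopwords and PorterStemmer helpers are my assumptions:

```python
import re
from gensim.models.doc2vec import TaggedDocument
from gensim.parsing.porter import PorterStemmer
from gensim.parsing.preprocessing import remove_stopwords

stemmer = PorterStemmer()

def preprocess(text):
    """Clean a raw document and return a list of stemmed tokens."""
    text = text.lower()                               # convert to lowercase
    text = re.sub(r"\S+@\S+", " ", text)              # remove email addresses
    text = re.sub(r"(https?://|www\.)\S+", " ", text) # remove website addresses
    text = re.sub(r"[\"']", " ", text)                # remove quotes
    text = remove_stopwords(text)                     # remove stopwords
    return [stemmer.stem(token) for token in text.split()]

class MyCorpus:
    """Streams TaggedDocument objects out of a DataFrame, one row at a time."""

    def __init__(self, df):
        self.df = df

    def __iter__(self):
        for row in self.df.itertuples():
            yield TaggedDocument(words=preprocess(row.text), tags=[row.documentId])
```

With a DataFrame df that has the ‘text’ and ‘documentId’ columns, the training corpus is then simply MyCorpus(df).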
The Training Process
Invoking Doc2VecTrainer.run will start the training process on the input training corpus.
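Doc2VecTrainer itself is not shown in this excerpt; a minimal version built on Gensim's Doc2Vec could look like the sketch below (all hyperparameter values are illustrative, not necessarily the ones used in the article):

```python
from gensim.models.doc2vec import Doc2Vec

class Doc2VecTrainer:
    """Thin wrapper that trains a Gensim Doc2Vec model on a streaming corpus."""

    def __init__(self, corpus, vector_size=300, min_count=2, epochs=20, workers=4):
        self.corpus = corpus
        self.model = Doc2Vec(vector_size=vector_size, min_count=min_count,
                             epochs=epochs, workers=workers)

    def run(self, output_path="doc2vec.model"):
        # First streaming pass: build the vocabulary
        self.model.build_vocab(self.corpus)
        # Additional streaming passes: train the word and document vectors
        self.model.train(self.corpus,
                         total_examples=self.model.corpus_count,
                         epochs=self.model.epochs)
        self.model.save(output_path)
        return self.model
```

Note that the corpus has to be re-iterable (a class implementing __iter__, not a one-shot generator), because build_vocab and train each make their own passes over the data.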
To conclude, there are two approaches you can use to create the training data (a short usage sketch follows the list):
- The first approach uses the DocumentsIterable object, which basically gets a path in the file system and knows how to parse a single document on each iteration (in this case I used XML files, so DocumentsIterable parses the XML files).
- The second approach uses a DataFrame that already contains the training data in the following columns: [“text”, “documentId”]. The DataFrame is passed as a parameter when instantiating the MyCorpus object.
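Putting it together with the sketched classes from above, the two approaches could be wired up like this (the path and file names are placeholders):

```python
# Approach 1: stream the documents from files on disk
corpus = DocumentsIterable("/path/to/training/files")

# Approach 2: stream the documents out of a DataFrame
# corpus = MyCorpus(df)  # df has the "text" and "documentId" columns

trainer = Doc2VecTrainer(corpus)
model = trainer.run("doc2vec.model")
```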
Cheers