Yaron Vazana

NLP, Algorithms, Machine Learning, Data Science, tutorials, tips and more


Training a Doc2Vec Model with Gensim

January 20, 2018 by Yaron 2 Comments

Representing unstructured documents as vectors can be done in many ways. One very common approach is to take the well-known word2vec algorithm and generalize it to the document level, an approach known as doc2vec.

Gensim is a great Python library for training such doc2vec models, and that is what this tutorial will show.


From Words to Documents

Similar words appear in similar contexts

This is the underlying assumption behind word2vec, and it is what makes the algorithm so powerful.

When looking at the document level, we see that each document is composed of a different number of words; forcing all documents to have the same number of words would likely discard relevant data.

We would like to achieve the following:

  1. We want the document vector to hold as much information as possible
  2. We want all the document vectors to have the same dimension

Gensim is a powerful Python library that allows you to achieve both.
When training a doc2vec model with Gensim, the following happens:

  1. a word vector W is generated for each word
  2. a document vector D is generated for each document

In the inference stage, the model uses the learned weights to output a new vector D for a given document.


Training a doc2vec model on a large corpus

Training a doc2vec model the old way requires all the data to be held in memory, which makes training a model on a large corpus nearly impossible on a home laptop.

Gensim introduced a way to stream documents one by one from the disk, instead of having them all stored in RAM.

If your training documents are stored inside a DataFrame, you can initialize MyCorpus with the DataFrame object (here, I assume the document text is inside a ‘text’ column and the id is inside a ‘documentId’ column).

I also added basic preprocessing steps:

  • remove email addresses
  • remove websites addresses
  • remove quotes
  • remove stopwords
  • stem the words using PorterStemmer
  • convert the text to lowercase

This is done as follows:


The Training Process

Invoking Doc2VecTrainer.run will start the training process on the input training corpus.


To conclude, there are two approaches you can use to create the training data:

  1. The first approach uses the DocumentsIterable object, which takes a path in the file system and knows how to parse a single line on each iteration (in this case I used XML files, so DocumentsIterable parses the XML files).
  2. The second approach uses a DataFrame that already contains the training data in the columns [“text”, “documentId”]. The DataFrame is passed as a parameter when instantiating MyCorpus.


Cheers

Filed Under: Algorithms, Data Science, Python Tagged With: Data Science, doc2vec, python, word2vec

I am a data science team lead at Darrow and an NLP enthusiast. My interests range from machine learning modeling to solving challenging data-related problems. I believe sharing ideas is how we all become better at what we do. If you’d like to get in touch, feel free to say hello through any of the social platforms. More About Yaron…
