
Average Word Vectors – Generate Document / Paragraph / Sentence Embeddings

September 20, 2018 by Yaron

Leveraging the strength of word vectors for larger units of text, such as documents, paragraphs, or sentences, is a very common technique in many NLP use cases.

Let’s look at the basic scenario where you have multiple sentences (or paragraphs) and you want to compare them with each other. In that case, representing each sentence with a fixed-length vector lets you measure the similarity between sentences even though they may have different lengths.

In this post, I will show a very common technique for generating embeddings for sentences / paragraphs / documents from existing pre-trained word embeddings, by averaging the word vectors to create a single fixed-size embedding vector.

[Image: average word vectors]

Calculating the average using a pre-trained word2vec model

First, we need to load an existing word2vec model using gensim.

In this example I will load FastText word embeddings.

We are publishing pre-trained word vectors for 294 languages, trained on Wikipedia using fastText. These vectors in dimension 300 were obtained using the skip-gram model described in Bojanowski et al. (2016) with default parameters.
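A minimal sketch of loading these vectors with gensim is shown below; the file name 'wiki.en.vec' (the English vectors in word2vec text format) is an assumption and should be replaced with the path of the file you actually downloaded.

```python
from gensim.models import KeyedVectors

# Load the pre-trained fastText vectors (word2vec text format) into gensim.
# 'wiki.en.vec' is an assumed file name -- point it at the file you downloaded.
word2vec_model = KeyedVectors.load_word2vec_format('wiki.en.vec', binary=False)

print(word2vec_model.vector_size)  # 300 for the published fastText vectors
```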

 

Calculate the mean vector

Next, we create the method that calculates the mean vector for a given document. Note that we only take into account the words that exist in the word2vec model (a sketch follows the parameter list below).

  • ‘word2vec_model’ parameter is the loaded word2vec model
  • ‘words’ parameter is a list of tokens (strings) from the sentence / paragraph / document that you want to embed
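
A minimal sketch of such a method, assuming gensim's KeyedVectors interface (the helper name get_mean_vector is mine, not necessarily the one used in the original code):

```python
import numpy as np

def get_mean_vector(word2vec_model, words):
    # Keep only the tokens that exist in the model's vocabulary.
    words = [word for word in words if word in word2vec_model]
    if len(words) == 0:
        # No known words: fall back to a zero vector of the model's dimension.
        return np.zeros(word2vec_model.vector_size)
    # Average the word vectors along axis 0 to get one fixed-size vector.
    return np.mean([word2vec_model[word] for word in words], axis=0)
```

Because every sentence / paragraph / document now maps to a vector of the same dimension, any two of them can be compared directly, for example with cosine similarity.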

 

Generate embeddings for the entire corpus

Finally, we iterate over our entire corpus and generate a mean vector for each sentence / paragraph / document.
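
A sketch of that loop, using the get_mean_vector helper above and assuming the corpus yields (documentId, tokens) pairs (the exact shape of the streamed items is my assumption; the corpus generator itself is described next):

```python
# Map each document id to its mean embedding vector.
doc_embeddings = {}

for document_id, tokens in corpus:
    doc_embeddings[document_id] = get_mean_vector(word2vec_model, tokens)
```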

The variable ‘corpus’ is a generator object that streams the documents one by one; there is an example of building it in my previous post.
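
I don't reproduce the original gist here, so the following is only a rough sketch of such a generator, assuming ‘train_data’ is a pandas DataFrame (see the note below) and using a naive whitespace tokenizer:

```python
def stream_corpus(train_data):
    # Yield (documentId, tokens) pairs one row at a time instead of
    # materializing the whole tokenized corpus in memory.
    for _, row in train_data.iterrows():
        tokens = row['text'].lower().split()
        yield row['documentId'], tokens

corpus = stream_corpus(train_data)
```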

Note: the parameter ‘train_data’ is a DataFrame with the mandatory columns [‘text’, ‘documentId’] (additional columns may exist).

Cheers


Filed Under: Algorithms, Data Science, Python

I am a data science team lead at Darrow and an NLP enthusiast. My interests range from machine learning modeling to solving challenging data-related problems. I believe sharing ideas is how we all become better at what we do. If you’d like to get in touch, feel free to say hello through any of the social platforms. More About Yaron…


