Applying the strength of word vectors to larger units of text, such as sentences, paragraphs, or documents, is a very common technique in many NLP use cases.
Let’s look at the basic scenario where you have multiple sentences (or paragraphs) and you want to compare them with each other. In that case, using fixed-length vectors to represent the sentences gives you the ability to measure the similarity between them, even though each sentence can have a different length.
In this post, I will show a very common technique for generating new embeddings for sentences / paragraphs / documents: take existing pre-trained word embeddings and average the word vectors to create a single fixed-size embedding vector.
Calculating the average using a pre-trained word2vec model
First, we need to load an existing word2vec model using gensim. In this example, I will load the FastText word embeddings, which the fastText project describes as follows:
> We are publishing pre-trained word vectors for 294 languages, trained on Wikipedia using fastText. These vectors in dimension 300 were obtained using the skip-gram model described in Bojanowski et al. (2016) with default parameters.
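A minimal sketch of the loading step; the file name ‘wiki.en.vec’ is the English vectors file distributed by the fastText project, so adjust the path to wherever you downloaded it:

```python
from gensim.models import KeyedVectors

# Load the pre-trained fastText vectors, which are distributed
# in the word2vec text format that gensim can read directly.
word2vec_model = KeyedVectors.load_word2vec_format('wiki.en.vec', binary=False)
```

Loading the full file takes a while and is memory-heavy; while experimenting you can pass `limit=100000` to `load_word2vec_format` to load only the most frequent words.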
Calculate the mean vector
Next, we create the method that calculates the mean vector for a given document (a sketch follows the parameter list below). Note that we only take into account the words that exist in the word2vec model:
- the ‘word2vec_model’ parameter is the loaded word2vec model
- the ‘words’ parameter is a list of tokens (strings) from the sentence / paragraph / document whose mean vector you want to calculate
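Here is a sketch of such a method; the name ‘get_mean_vector’ and the zero-vector fallback for documents with no known words are my own choices, not part of any library API:

```python
import numpy as np

def get_mean_vector(word2vec_model, words):
    # Keep only the words that exist in the model's vocabulary.
    words = [word for word in words if word in word2vec_model]
    if words:
        # Average the word vectors along the word axis
        # into a single fixed-size vector.
        return np.mean(word2vec_model[words], axis=0)
    # No known words: fall back to a zero vector of the same dimension.
    return np.zeros(word2vec_model.vector_size)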
Generate embeddings for the entire corpus
Finally, we iterate over our entire corpus and generate a mean vector for each sentence / paragraph / document:
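A sketch of that loop, assuming the ‘get_mean_vector’ helper above and that ‘corpus’ yields (documentId, tokens) pairs; the pair layout is an assumption here, so adapt it to your own generator:

```python
# Map each document id to its fixed-size mean vector.
document_vectors = {}
for document_id, tokens in corpus:
    document_vectors[document_id] = get_mean_vector(word2vec_model, tokens)
```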
The variable ‘corpus’ is a generator object that streams the documents one by one; here’s an example along the lines of my previous post:
Note: the parameter ‘train_data’ is a DataFrame with the mandatory columns [‘text’, ‘documentId’] (additional columns may exist).
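Since the full generator lives in the previous post, here is a minimal stand-in with the same behavior; ‘corpus_generator’ is a hypothetical name, and gensim’s simple_preprocess is used for tokenization:

```python
from gensim.utils import simple_preprocess

def corpus_generator(train_data):
    # Stream (documentId, tokens) pairs one row at a time,
    # instead of materializing the whole corpus in memory.
    for _, row in train_data.iterrows():
        # simple_preprocess lowercases and tokenizes the raw text.
        yield row['documentId'], simple_preprocess(row['text'])

corpus = corpus_generator(train_data)
```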
Cheers