
Training an AutoEncoder to Generate Text Embeddings

September 28, 2019 by Yaron

Calculating sentence / paragraph vectors can be done in many ways. For example, a simple method is to average all the word vectors and retrieve a single vector for the entire piece of text; of course, this requires you to have pre-calculated word embeddings (I already wrote about it here).
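
For reference, a minimal sketch of that averaging approach could look like this (assuming word_vectors is a dict mapping each word to a pre-trained NumPy vector; the names are illustrative):

import numpy as np

def average_sentence_vector(sentence, word_vectors, dim=300):
    # keep only the words we actually have a pre-trained vector for
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vecs:
        return np.zeros(dim)  # fall back to a zero vector for fully out-of-vocabulary text
    return np.mean(vecs, axis=0)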

In this post I will show a different approach that uses an AutoEncoder.

The aim of an AutoEncoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”.

[Figure: AutoEncoder model diagram]

An AutoEncoder takes an input (a sequence of text in our case), squeezes it through a bottleneck layer (which has fewer nodes than the input layer), and passes it to a decoder that tries to reconstruct the exact same input.

This approach allows us to learn vector representations for any unstructured text, of any length. We control the size of the bottleneck layer, which means we control the size of the embedding vectors we want to generate.

After training the AutoEncoder, we can use the encoder model to generate embeddings for any input.

Before we start with the code, here is the Keras documentation on AutoEncoders.

Define a Few Constants

We start by defining a few constants that will serve us in the rest of the code. 

num_words = 2000   # vocabulary size kept by the tokenizer
maxlen = 30        # fixed sequence length after padding
embed_dim = 150    # dimension of the word-embedding layer
batch_size = 16    # training batch size

Preparing the Input

Assuming we have a list of sentences, we tokenize and generate a fixed-length padded sequence from each sentence.
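
For illustration, sentences can simply be a Python list of raw strings (the examples below are arbitrary):

sentences = [
    "the cat sat on the mat",
    "autoencoders learn compact representations",
    "deep learning is fun",
]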

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=num_words, split=' ')
tokenizer.fit_on_texts(sentences)
seqs = tokenizer.texts_to_sequences(sentences)
pad_seqs = pad_sequences(seqs, maxlen)

The Encoder Model

We define the encoder and decoder separately since later (at run time) we will only use the encoder to generate embeddings for unseen text.

The Encoder Consists of a Few Layers

  • Embedding layer
  • Bidirectional LSTM layer

from keras.layers import Input, Embedding, Bidirectional, LSTM, RepeatVector, Dense
from keras.models import Model

encoder_inputs = Input(shape=(maxlen,), name='Encoder-Input')
emb_layer = Embedding(num_words, embed_dim, input_length=maxlen, name='Body-Word-Embedding', mask_zero=False)

x = emb_layer(encoder_inputs)
state_h = Bidirectional(LSTM(128, activation='relu', name='Encoder-Last-LSTM'))(x)
encoder_model = Model(inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')
seq2seq_encoder_out = encoder_model(encoder_inputs)
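
Note that with this setup the generated sentence vectors have size 256 (128 LSTM units per direction of the Bidirectional wrapper), while embed_dim only controls the width of the word-embedding layer; to change the embedding size, change the number of LSTM units. You can confirm the output shape with:

encoder_model.summary()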

The Decoder Model

The decoder model is used only at train time and its goal is to reconstruct the exact input that was given to the network.

decoded = RepeatVector(maxlen)(seq2seq_encoder_out)
decoder_lstm = Bidirectional(LSTM(128, return_sequences=True, name='Decoder-LSTM-before'))
decoder_lstm_output = decoder_lstm(decoded)
decoder_dense = Dense(num_words, activation='softmax', name='Final-Output-Dense-before')
decoder_outputs = decoder_dense(decoder_lstm_output)

The Combined Model

The combined model simply stacks the encoder and the decoder together, and defines the loss and optimizer that it will use.

from keras import optimizers
import numpy as np

seq2seq_Model = Model(encoder_inputs, decoder_outputs)
seq2seq_Model.compile(optimizer=optimizers.Nadam(lr=0.001), loss='sparse_categorical_crossentropy')
history = seq2seq_Model.fit(pad_seqs, np.expand_dims(pad_seqs, -1),
          batch_size=batch_size,
          epochs=10)
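
Once training is done, you may want to persist the encoder so it can be reused for inference without the decoder. A minimal sketch (the file name is arbitrary):

# save only the encoder part for later use
encoder_model.save('encoder_model.h5')

# later, reload it without retraining
from keras.models import load_model
encoder_model = load_model('encoder_model.h5')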

Generating Embeddings

Now, you might want to do two things:
1. Generate embeddings for the initial training data
2. Generate embeddings for new and unseen data

Generate Embeddings for the Training Data

vecs = encoder_model.predict(pad_seqs)

Generate Embeddings for Unseen Data

sentence = "here's a sample unseen sentence"
seq = tokenizer.texts_to_sequences([sentence])
pad_seq = pad_sequences(seq, maxlen)
sentence_vec = encoder_model.predict(pad_seq)[0]
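
As a quick sanity check, you can compare two such embeddings with cosine similarity (a small sketch using scikit-learn; the example sentences are arbitrary):

from sklearn.metrics.pairwise import cosine_similarity

# embed two sentences with the trained encoder and compare them
pad_a = pad_sequences(tokenizer.texts_to_sequences(["the cat sat on the mat"]), maxlen)
pad_b = pad_sequences(tokenizer.texts_to_sequences(["a cat is sitting on a mat"]), maxlen)
vec_a = encoder_model.predict(pad_a)
vec_b = encoder_model.predict(pad_b)
print(cosine_similarity(vec_a, vec_b)[0][0])  # closer to 1.0 means more similar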

Cheers

Filed Under: Data Science, Python Tagged With: autoencoder, Data Science, machine learning, python

I am a data science team lead at Darrow and an NLP enthusiast. My interests range from machine learning modeling to solving challenging data-related problems. I believe sharing ideas is how we all become better at what we do. If you’d like to get in touch, feel free to say hello through any of the social platforms. More About Yaron…
