Calculating sentence or paragraph vectors can be done in many ways. For example, a simple method is to average all the word vectors and obtain a single vector for the entire piece of text. Of course, this requires pre-calculated word embeddings (I already wrote about it here).
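A minimal sketch of that averaging baseline, assuming a hypothetical word_vectors dictionary that maps tokens to pre-trained NumPy vectors:

import numpy as np

def average_sentence_vector(sentence, word_vectors, dim=300):
    # Keep only the tokens we actually have embeddings for
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vectors:
        # No known tokens: fall back to a zero vector
        return np.zeros(dim)
    # The sentence vector is simply the mean of its word vectors
    return np.mean(vectors, axis=0)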
In this post I will show a different approach that uses an AutoEncoder.
The aim of an AutoEncoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”.
An AutoEncoder takes an input (a sequence of text in our case), squeezes it through a bottleneck layer (which has fewer nodes than the input layer), and passes it to a decoder, which tries to reconstruct the exact same input.
This approach lets us learn vector representations for any unstructured text, of any length. We control the size of the bottleneck layer, which means we control the size of the embedding vectors we want to generate.
After training the AutoEncoder, we can use the encoder model alone to generate embeddings for any input.
Before we start with the code, here is the Keras documentation on AutoEncoders.
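The snippets below assume imports along these lines (a sketch for standalone Keras 2.x; adjust the paths if you use tensorflow.keras):

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Embedding, Bidirectional, LSTM, RepeatVector, Dense
from keras.models import Model
from keras import optimizers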
Define a Few Constants
We start by defining a few constants that will serve us in the rest of the code.
num_words = 2000   # vocabulary size kept by the tokenizer
maxlen = 30        # fixed sequence length after padding/truncation
embed_dim = 150    # dimension of the word embeddings
batch_size = 16
Preparing the Input
Assuming we have a list of sentences, we tokenize them and generate a fixed-length padded sequence from each sentence.
# Fit the tokenizer on the corpus and turn every sentence into a fixed-length sequence of word ids
tokenizer = Tokenizer(num_words=num_words, split=' ')
tokenizer.fit_on_texts(sentences)
seqs = tokenizer.texts_to_sequences(sentences)
pad_seqs = pad_sequences(seqs, maxlen)
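A quick sanity check (sentences is assumed to be your own list of strings):

print(pad_seqs.shape)   # (len(sentences), 30)
print(pad_seqs[0])      # integer word ids, left-padded with zeros up to maxlen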
The Encoder Model
We define the encoder and decoder separately since later (at inference time) we will use only the encoder to generate embeddings for unseen text.
The Encoder Consists of a Few Layers
- Embedding layer
- Bidirectional LSTM layer
encoder_inputs = Input(shape=(maxlen,), name='Encoder-Input')
emb_layer = Embedding(num_words, embed_dim, input_length=maxlen, name='Body-Word-Embedding', mask_zero=False)
x = emb_layer(encoder_inputs)
# The Bidirectional LSTM returns one vector per sequence (the concatenated final states of both directions),
# so with 128 units per direction the sentence embedding has 256 dimensions
state_h = Bidirectional(LSTM(128, activation='relu', name='Encoder-Last-LSTM'))(x)
encoder_model = Model(inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')
seq2seq_encoder_out = encoder_model(encoder_inputs)
The Decoder Model
The decoder model is used only at training time, and its goal is to reconstruct the exact input that was given to the network.
# Repeat the sentence vector maxlen times so the decoder can predict one token per position
decoded = RepeatVector(maxlen)(seq2seq_encoder_out)
decoder_lstm = Bidirectional(LSTM(128, return_sequences=True, name='Decoder-LSTM-before'))
decoder_lstm_output = decoder_lstm(decoded)
# A softmax over the vocabulary at every position
decoder_dense = Dense(num_words, activation='softmax', name='Final-Output-Dense-before')
decoder_outputs = decoder_dense(decoder_lstm_output)
The Combined Model
The combined model simply stacks the encoder and the decoder together, and defines the loss and optimizer to use.
seq2seq_Model = Model(encoder_inputs, decoder_outputs)
seq2seq_Model.compile(optimizer=optimizers.Nadam(lr=0.001), loss='sparse_categorical_crossentropy')
# The targets are the padded input sequences themselves, with an extra axis added for sparse_categorical_crossentropy
history = seq2seq_Model.fit(pad_seqs, np.expand_dims(pad_seqs, -1), batch_size=batch_size, epochs=10)
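If you want to reuse the encoder later without retraining, you can persist it with Keras' standard save/load API (the file name below is arbitrary; you should also persist the fitted tokenizer, e.g. with pickle, so unseen text is mapped to the same word ids):

encoder_model.save('sentence_encoder.h5')

# later, in another process
from keras.models import load_model
encoder_model = load_model('sentence_encoder.h5')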
Generating Embeddings
Now, you might want to do two things:
1. Generate embeddings for the initial train data
2. Generate embeddings for new and unseen data
Generate Embeddings for the Training Data
vecs = encoder_model.predict(pad_seqs)
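A quick check of the result (one row per training sentence; with the layer sizes above each row has 256 dimensions):

print(vecs.shape)   # (len(sentences), 256)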
Generate Embeddings for Unseen Data
sentence = "here's a sample unseen sentence"
seq = tokenizer.texts_to_sequences([sentence])
pad_seq = pad_sequences(seq, maxlen)
sentence_vec = encoder_model.predict(pad_seq)[0]
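One way to sanity-check the embeddings is to compare the unseen sentence against the training sentences with cosine similarity (a sketch using scikit-learn; vecs and sentences come from the snippets above):

from sklearn.metrics.pairwise import cosine_similarity

# Similarity between the new sentence vector and every training sentence vector
sims = cosine_similarity(sentence_vec.reshape(1, -1), vecs)[0]

# Print the five closest training sentences
for i in sims.argsort()[::-1][:5]:
    print(round(float(sims[i]), 3), sentences[i])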
Cheers