Calculating sentence or paragraph vectors can be done in many ways. For example, a simple method is to average all the word vectors and obtain a single vector for the entire piece of text. Of course, this requires pre-calculated word embeddings (I already wrote about it here).
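A minimal sketch of that averaging baseline, assuming a hypothetical word_vectors dictionary that maps tokens to pre-trained NumPy vectors:

import numpy as np

def average_sentence_vector(sentence, word_vectors, dim=300):
    # Keep only the tokens we actually have embeddings for
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vectors:
        # No known tokens: fall back to a zero vector
        return np.zeros(dim)
    # The sentence vector is simply the mean of its word vectors
    return np.mean(vectors, axis=0)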
In this post I will show a different approach that uses an AutoEncoder.
The aim of an AutoEncoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”.
An AutoEncoder takes an input (a sequence of text in our case), squeezes it through a bottleneck layer (which has fewer nodes than the input layer), and passes it to a decoder, which tries to reconstruct the exact same input.
This approach lets us learn vector representations for any unstructured text, of any length. We control the size of the bottleneck layer, which means we control the size of the embedding vectors we want to generate.
After training the AutoEncoder, we can use the encoder model alone to generate embeddings for any input.
Before we start with the code, here is the Keras documentation on AutoEncoders.
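The snippets below assume imports along these lines (a sketch for standalone Keras 2.x; adjust the paths if you use tensorflow.keras):

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Embedding, Bidirectional, LSTM, RepeatVector, Dense
from keras.models import Model
from keras import optimizers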
Define a Few Constants
We start by defining a few constants that will serve us in the rest of the code.
num_words = 2000   # vocabulary size kept by the tokenizer
maxlen = 30        # fixed sequence length after padding/truncation
embed_dim = 150    # dimension of the word embeddings
batch_size = 16
Preparing the Input
Assuming we have a list of sentences, we tokenize them and generate a fixed-length padded sequence from each sentence.
# Fit the tokenizer on the corpus and turn every sentence into a fixed-length sequence of word ids
tokenizer = Tokenizer(num_words=num_words, split=' ')
tokenizer.fit_on_texts(sentences)
seqs = tokenizer.texts_to_sequences(sentences)
pad_seqs = pad_sequences(seqs, maxlen)
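A quick sanity check (sentences is assumed to be your own list of strings):

print(pad_seqs.shape)   # (len(sentences), 30)
print(pad_seqs[0])      # integer word ids, left-padded with zeros up to maxlen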
The Encoder Model
We define the encoder and decoder separately since later (at inference time) we will use only the encoder to generate embeddings for unseen text.
The Encoder Consists of a Few Layers
- Embedding layer
- Bidirectional LSTM layer
encoder_inputs = Input(shape=(maxlen,), name='Encoder-Input')
emb_layer = Embedding(num_words, embed_dim, input_length=maxlen, name='Body-Word-Embedding', mask_zero=False)
x = emb_layer(encoder_inputs)
# The Bidirectional LSTM returns one vector per sequence (the concatenated final states of both directions),
# so with 128 units per direction the sentence embedding has 256 dimensions
state_h = Bidirectional(LSTM(128, activation='relu', name='Encoder-Last-LSTM'))(x)
encoder_model = Model(inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')
seq2seq_encoder_out = encoder_model(encoder_inputs)
The Decoder Model
The decoder model is used only at training time, and its goal is to reconstruct the exact input that was given to the network.
# Repeat the sentence vector maxlen times so the decoder can predict one token per position
decoded = RepeatVector(maxlen)(seq2seq_encoder_out)
decoder_lstm = Bidirectional(LSTM(128, return_sequences=True, name='Decoder-LSTM-before'))
decoder_lstm_output = decoder_lstm(decoded)
# A softmax over the vocabulary at every position
decoder_dense = Dense(num_words, activation='softmax', name='Final-Output-Dense-before')
decoder_outputs = decoder_dense(decoder_lstm_output)
The Combined Model
The combined model simply stacks the encoder and the decoder together, and defines the loss and optimizer to use.
seq2seq_Model = Model(encoder_inputs, decoder_outputs)
seq2seq_Model.compile(optimizer=optimizers.Nadam(lr=0.001), loss='sparse_categorical_crossentropy')
# The targets are the padded input sequences themselves, with an extra axis added for sparse_categorical_crossentropy
history = seq2seq_Model.fit(pad_seqs, np.expand_dims(pad_seqs, -1), batch_size=batch_size, epochs=10)
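If you want to reuse the encoder later without retraining, you can persist it with Keras' standard save/load API (the file name below is arbitrary; you should also persist the fitted tokenizer, e.g. with pickle, so unseen text is mapped to the same word ids):

encoder_model.save('sentence_encoder.h5')

# later, in another process
from keras.models import load_model
encoder_model = load_model('sentence_encoder.h5')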
Generating Embeddings
Now, you might want to do two things:
1. Generate embeddings for the initial train data
2. Generate embeddings for new and unseen data
Generate Embeddings for the Training Data
vecs = encoder_model.predict(pad_seqs)
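A quick check of the result (one row per training sentence; with the layer sizes above each row has 256 dimensions):

print(vecs.shape)   # (len(sentences), 256)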
Generate Embeddings for Unseen Data
sentence = "here's a sample unseen sentence"
seq = tokenizer.texts_to_sequences([sentence])
pad_seq = pad_sequences(seq, maxlen)
sentence_vec = encoder_model.predict(pad_seq)[0]
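One way to sanity-check the embeddings is to compare the unseen sentence against the training sentences with cosine similarity (a sketch using scikit-learn; vecs and sentences come from the snippets above):

from sklearn.metrics.pairwise import cosine_similarity

# Similarity between the new sentence vector and every training sentence vector
sims = cosine_similarity(sentence_vec.reshape(1, -1), vecs)[0]

# Print the five closest training sentences
for i in sims.argsort()[::-1][:5]:
    print(round(float(sims[i]), 3), sentences[i])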
Cheers