Technical overview

Aida's NLP pipeline is composed of 3 models. The embeddings model takes text sentences and encodes them into high dimensional vector representations. The text classification model takes the encoded sentences and predicts the intent of the sentence. Finally, the named entity recognition model takes the bigram-level sentence embeddings together with the text classification output (the intent tags) and returns the sentence slots. Here is a more detailed description of each model.
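
As a minimal sketch of how the three models chain together (the model objects and method names below are illustrative, not Aida's actual API):

```python
# Hypothetical pipeline glue; `encode`/`predict` are assumed method names.
def parse(sentence, embeddings_model, classifier, ner_model):
    # 1. Encode the sentence into bigram-level embeddings.
    embeddings = embeddings_model.encode(sentence)
    # 2. Predict the intent probabilities for the sentence.
    intent = classifier.predict(embeddings)
    # 3. Extract the slots using both the embeddings and the intent tag.
    slots = ner_model.predict(embeddings, intent)
    return intent, slots
```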

Embeddings model

Plot: (embeddings model architecture diagram)

The embeddings model uses a pre-trained fastText dictionary of bigrams to form word representations. It first takes a sentence, breaks it into words, and then splits each word into character bigrams. Here is an example of this process (sentence => words => word bigrams):

"hi friend" => ["hi", "friend"] => [["hi"],["fr", "ri", "ie", "en", "nd"]]

Then each word bigram is replaced by its 300 dimensional fastText vector, and we obtain a single 300 dimensional vector for each word of the sentence by summing and averaging its bigram vectors.
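
A minimal sketch of the split-and-average step, assuming the fastText dictionary is exposed as a plain bigram-to-vector mapping:

```python
import numpy as np

def word_bigrams(word):
    # Words shorter than 3 characters are kept whole, as in the
    # "hi" example above.
    if len(word) < 3:
        return [word]
    return [word[i:i + 2] for i in range(len(word) - 1)]

def word_vector(word, fasttext_vectors):
    # `fasttext_vectors` is assumed to be a dict mapping each bigram
    # to its 300-d numpy vector from the pre-trained fastText dictionary.
    vectors = [fasttext_vectors[b] for b in word_bigrams(word)
               if b in fasttext_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(300)
```

For example, `word_bigrams("friend")` returns `['fr', 'ri', 'ie', 'en', 'nd']`, matching the process shown above.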

The dimension of each sentence tensor is 21 (the length in words of the dataset's longest sentence) by 300 (embedding dimensions). Here is a visualization of the embeddings tensor for the sentence "please remind to me watch real madrid match tomorrow at 9pm":

(embeddings tensor visualization for the sample sentence)
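
Assembling the fixed-size tensor could look like the following sketch (zero-padding to the maximum length is an assumption; `word_vector` is from the snippet above):

```python
MAX_WORDS = 21   # longest sentence in the dataset, in words
EMBED_DIM = 300  # fastText embedding dimensions

def sentence_tensor(sentence, fasttext_vectors):
    # Zero-pad every sentence to a fixed 21 x 300 tensor so all
    # model inputs share the same shape.
    tensor = np.zeros((MAX_WORDS, EMBED_DIM))
    for i, word in enumerate(sentence.split()[:MAX_WORDS]):
        tensor[i] = word_vector(word, fasttext_vectors)
    return tensor
```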

Text classification model

Plot: (text classification model diagram)

The text classification model is composed of 3 CNN layers, concatenated and with a skip-layer connection, followed by a dense output layer. The input is the output of the embeddings model for a given sentence using bigram embeddings, and the output is the list of probabilities that the sentence belongs to each of the classification classes.
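
One plausible reading of this architecture as a Keras sketch (the filter counts, kernel sizes, number of classes, and exact placement of the skip connection are assumptions, not Aida's actual hyperparameters):

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 10  # assumed number of intent classes

inputs = keras.Input(shape=(21, 300))  # embeddings model output

# Three parallel 1-D convolutions over the word axis.
convs = [layers.Conv1D(64, k, padding="same", activation="relu")(inputs)
         for k in (1, 2, 3)]

# Concatenate the conv branches, plus a skip connection
# from the raw embeddings.
merged = layers.Concatenate()(convs + [inputs])

pooled = layers.GlobalMaxPooling1D()(merged)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(pooled)

model = keras.Model(inputs, outputs)
```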

Here is an animation of the activations of the 3 convolutional layers during training, where we can see the convolutional filters learning features from the 300 dimensional vector representations, along with a visualization of the output layer. We pass the same sentence (please remind to me watch real madrid match tomorrow at 9pm) to the model at the end of each training batch (~50 batches) and plot the activations to see how they evolve as the model learns to classify. (Note: frame 0 is empty for 1 second, and the last frame also pauses for 1 second.)

First convolutional layer activations animation:

text classification Conv 1

Second convolutional layer activations animation:

text classification Conv 2

Third convolutional layer activations animation:

text classification Conv 3

Output layer visualization:

text classification output

NOTE: The text classification model performs well and is fast enough, so there was no need to add a simple self-attention mechanism, but this could be explored.

Named entity recognition model

Plot: (NER model diagram)

The named entity recognition (NER) model uses 2 inputs: the sentence embeddings at the word-bigram level, and the one-hot encoded classification tag for the sentence (the text classification model output).

The training data is one-hot encoded using the inside-outside (IO) tagging format. The model architecture uses a CNN at the sentence-word-bigram level, concatenated with the classification tags repeated to match the number of words, followed by a bidirectional LSTM with no merge mode (with an optional time-series attention mechanism applied only to the forward LSTM), and a final dense output layer.
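
A hedged Keras sketch of this architecture (the layer sizes and tag/class counts are assumptions, and the optional attention on the forward LSTM is omitted):

```python
from tensorflow import keras
from tensorflow.keras import layers

MAX_WORDS, EMBED_DIM = 21, 300
NUM_CLASSES = 10  # assumed number of intent classes
NUM_TAGS = 5      # assumed number of IO slot tags

emb_in = keras.Input(shape=(MAX_WORDS, EMBED_DIM))  # bigram-level embeddings
cls_in = keras.Input(shape=(NUM_CLASSES,))          # one-hot intent tag

# Convolution over the embedded words; hyperparameters are illustrative.
conv = layers.Conv1D(64, 3, padding="same", activation="relu")(emb_in)

# Repeat the intent tag so it can be concatenated at every word position.
cls_repeated = layers.RepeatVector(MAX_WORDS)(cls_in)
merged = layers.Concatenate()([conv, cls_repeated])

# With merge_mode=None the Bidirectional wrapper returns the forward
# and backward sequences separately.
fwd, bwd = layers.Bidirectional(
    layers.LSTM(64, return_sequences=True), merge_mode=None)(merged)

combined = layers.Concatenate()([fwd, bwd])
outputs = layers.TimeDistributed(
    layers.Dense(NUM_TAGS, activation="softmax"))(combined)

model = keras.Model([emb_in, cls_in], outputs)
```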

Given a sentence, the model will return the slots it detects, each with its tag class and value, e.g.:

{
  "sentence": "please remind to me watch real madrid match tomorrow at 9pm",
  "slots": {
    "dateTime": [{ "value": "tomorrow at 9am" }],
    "calendarEvent": [{ "value": "real madrid match" } ]
  }
}

Here is an animation of the activations of the 2 deep convolutional layers during training, where we can see the convolutional filters learning features from the 300 dimensional vector representations, along with a visualization of the output layer. We pass the same sentence (please remind to me watch real madrid match tomorrow at 9pm) to the model at the end of each training batch (~50 batches) and plot the activations to see how they evolve as the model learns to perform NER. (Note: frame 0 is empty for 1 second, and the last frame also pauses for 1 second.)

First convolutional layer activations animation:

NER Conv 1

Second convolutional layer activations animation:

NER Conv 2

Output layer visualization:

NER output

Visualization code snippets for Python

There is code in classification.py and ner.py, marked inside # === Visualization code block === comments, that can be uncommented to generate images and then GIFs of the activation progress for a given phrase. Also, in the Jupyter notebook, there is code for plotting the model diagrams.
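
For reference, the core of such an activation plot can be sketched roughly as below; the helper is an illustration, not the exact code in those files. The images saved at each batch can then be stitched into a GIF with a tool such as imageio.mimsave.

```python
import matplotlib.pyplot as plt
from tensorflow import keras

def plot_layer_activations(model, layer_name, sentence_tensor, out_path):
    # Build a probe model that exposes the requested layer's output,
    # then plot its activations for one sentence tensor of shape (21, 300).
    probe = keras.Model(model.input, model.get_layer(layer_name).output)
    activations = probe.predict(sentence_tensor[None, ...])[0]
    plt.imshow(activations.T, aspect="auto", cmap="viridis")
    plt.xlabel("word position")
    plt.ylabel("filter")
    plt.savefig(out_path)
    plt.close()
```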

Resources