Deep Learning with Python - Chapter 11

The Chapter "Deep Learning for Text" goes over bag of words models and sequence models for text classification and sequence to sequence tasks. The chapter reviews the Transformer architecture, neural attention, and word embeddings.

Deep Learning for Text

Natural Language Processing: A Bird's-Eye View

With human language, usage comes first and rules come later - natural language was shaped by an evolutionary process (that's what makes it "natural"). Natural language is messy: ambiguous, chaotic, sprawling, and constantly in flux. Modern NLP is not about making computers understand language, but about giving them the ability to ingest a piece of language as input and return something useful. NLP is pattern recognition applied to words, sentences, and paragraphs. The field started with expert systems, graduated to decision trees and logistic regression, then to recurrent neural networks, and then to Transformers.

Preparing Text Data

Vectorizing text is the process of transforming text into numeric tensors.

  • First, you standardize the text to make it easier to process, such as by converting it to lowercase or removing punctuation.
  • You split the text into units (called tokens), such as characters, words, or groups of words. This is called tokenization.
  • You convert each such token into a numerical vector. This will usually involve first indexing all tokens present in the data. (A minimal sketch of these three steps follows below.)
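As a quick illustration, here is a minimal sketch of these three steps in plain Python (the sample sentence and toy vocabulary are made up for illustration):

import string

sample = "The cat sat on the mat."

# Step 1: standardize - lowercase and strip punctuation
standardized = sample.lower().translate(str.maketrans("", "", string.punctuation))

# Step 2: tokenize - split into word-level tokens on whitespace
tokens = standardized.split()

# Step 3: index - assign a unique integer to each token, reserving
# index 0 for the mask token and index 1 for out-of-vocabulary tokens
vocabulary = {"": 0, "[UNK]": 1}
for token in tokens:
  vocabulary.setdefault(token, len(vocabulary))
encoded = [vocabulary.get(token, 1) for token in tokens]
print(encoded)  # [2, 3, 4, 5, 2, 6]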

From Raw Text to Vectors

Text Standardization

Text standardization is a basic form of feature engineering that aims to erase encoding differences that you don't want your model to have to deal with.

One of the simplest and most widespread standardization schemes is "convert to lowercase and remove punctuation characters". Another common transformation is to convert special characters to a standard form (e.g., replacing accented characters with their unaccented equivalents). A much more advanced standardization pattern, more rarely used in a machine learning context, is stemming: converting variations of a term into a single shared representation ("cats" to "cat", "caught" / "been catching" to "catch"). With these standardization techniques, your model will require less training data and will generalize better - but standardization also removes some amount of information.

Tokenization

Three different ways:

  • Word-level tokenization: tokens are space-separated (or punctuation-separated) substrings.
  • N-gram tokenization: Tokens are groups of N consecutive words (see the sketch after this list).
  • Character-level tokenization: Each character is its own token. In practice, this scheme is rarely used, and you only see it in special contexts like text generation or speech recognition.
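As a small sketch of N-gram tokenization, here is an illustrative helper (the function and the sample sentence are made up, not from the chapter) that extracts bigrams from a tokenized sentence:

def ngrams(tokens, n):
  # Return the groups of n consecutive tokens
  return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))
# ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']

Note that, as described in its documentation, passing an integer to the ngrams argument of the Keras TextVectorization layer produces all N-grams up to that size (so ngrams=2 yields unigrams as well as bigrams).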

Vocabulary Indexing

Once your text is split into tokens, you need to encode each token into a numerical representation. The way you'd go about this is to build an index of all terms found in the training data (the "vocabulary") and assign a unique integer to each entry in the vocabulary. You can then convert each integer into a vector encoding that can be processed by a neural network, for instance by one-hot encoding it. It's common to restrict the vocabulary to only the top 20,000 or 30,000 most common words found in the training data. It's also common to have an "out of vocabulary" index (OOV index) - a catch-all for any token that wasn't in the index (usually index 1). Index 0 is reserved for the mask token. The OOV token means "here was a word we did not recognize", while the mask token means "ignore me, I'm not a word".
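A tiny sketch of how the OOV and mask indices behave (toy vocabulary, purely illustrative):

import numpy as np

vocabulary = {"": 0, "[UNK]": 1, "the": 2, "cat": 3, "sat": 4}

def encode(tokens, vocab):
  # Unknown tokens fall back to the OOV index (1); index 0 stays reserved for the mask token
  return [vocab.get(token, 1) for token in tokens]

indices = encode(["the", "platypus", "sat"], vocabulary)
print(indices)  # [2, 1, 4] - "platypus" maps to the OOV index

# One-hot encoding of the integer indices
one_hot = np.zeros((len(indices), len(vocabulary)))
one_hot[np.arange(len(indices)), indices] = 1.0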

Using the TextVectorization Layer

In practice, you'll work with the Keras TextVectorization layer, which is fast and efficient and can be dropped directly into a tf.data pipeline or a Keras model:

from tensorflow.keras.layers import TextVectorization

text_vectorization = TextVectorization(
  # Configures the layer to return sequences of words encoded as integer indices.
  # There are several other output modes available.
  output_mode="int",
)

By default, this layer uses the setting "convert to lowercase and remove punctuation" for text standardization, and "split on whitespace" for tokenization. You can provide custom functions for standardization and tokenization, though. Note that such custom functions should operate on tf.string tensors, not regular Python strings.
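As a sketch, custom standardization and tokenization callables that simply reproduce the default behavior might look like this (the function names are illustrative):

import re
import string
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

def custom_standardization_fn(string_tensor):
  # Lowercase, then strip punctuation characters
  lowercase = tf.strings.lower(string_tensor)
  return tf.strings.regex_replace(
    lowercase, f"[{re.escape(string.punctuation)}]", "")

def custom_split_fn(string_tensor):
  # Split on whitespace
  return tf.strings.split(string_tensor)

text_vectorization = TextVectorization(
  output_mode="int",
  standardize=custom_standardization_fn,
  split=custom_split_fn,
)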

To index the vocabulary of a text corpus, use the adapt() method of the layer with a Dataset object that yields strings (or just with a list of Python strings). Importantly, because TextVectorization is mostly a dictionary lookup operation, it can't be executed on a GPU (or a TPU), only on a CPU. If you're training a model on a GPU, your TextVectorization layer will run on the CPU before sending its output to the GPU.
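A minimal usage sketch, with a made-up toy corpus:

dataset = [
  "I write, erase, rewrite",
  "Erase again, and then",
  "A poppy blooms.",
]
# Index the vocabulary of the corpus
text_vectorization.adapt(dataset)
# Inspect the learned vocabulary (index 0 is the mask token, index 1 is the OOV token)
print(text_vectorization.get_vocabulary())
# Encode a new sentence as a sequence of integer indices
print(text_vectorization("I write, rewrite, and still rewrite again"))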

Two Approaches for Representing Groups of Words: Sets and Sequences

How to represent word order is the pivotal question from which different kinds of NLP architectures spring. The simplest thing you could do is discard order and treat text as an unordered set of words - this gives you bag-of-words models. You could also decide that words should be processed strictly in the order in which they appear, one at a time, like steps in a timeseries - you could then use recurrent models. A hybrid is also possible: the Transformer architecture is technically order-agnostic, yet it injects word-position information into the representations it processes, which enables it to simultaneously look at different parts of a sentence (unlike RNNs) while still being order-aware. Because they take into account word order, both RNNs and Transformers are called sequence models.

# Download the dataset
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
out[2]

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 80.2M 100 80.2M 0 0 5440k 0 0:00:15 0:00:15 --:--:-- 14.2M

# Remove the directory we don't need
!rm -r aclImdb/train/unsup
out[3]
# Take a look at what the data looks like
!cat aclImdb/train/pos/4077_10.txt
out[4]

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy

# Prepare the validation set by setting apart 20% of the training text files into
# a new directory
import os, pathlib, shutil, random

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
  os.makedirs(val_dir / category)
  files = os.listdir(train_dir / category)
  # Shuffle the list of training files using a seed, to ensure
  # we always get the same validation set
  random.Random(1337).shuffle(files)
  # Take 20% of the training files to use for validation
  num_val_samples = int(0.2 * len(files))
  val_files = files[-num_val_samples:]
  for fname in val_files:
    # Move the files
    shutil.move(train_dir / category / fname, val_dir / category / fname)
out[5]
from tensorflow import keras
batch_size = 32
"""
Creating Dataset objects for training, validation, and testing
"""
train_ds = keras.utils.text_dataset_from_directory(
 "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
 "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
 "aclImdb/test", batch_size=batch_size
)
out[6]

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.

"""
Displaying the shapes and dtypes of the first batch
"""
for inputs, targets in train_ds:
  print("inputs.shape:", inputs.shape)
  print("inputs.dtype:", inputs.dtype)
  print("targets.shape:", targets.shape)
  print("targets.dtype:", targets.dtype)
  print("inputs[0]:", inputs[0])
  print("targets[0]:", targets[0])
  break
out[7]

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b'This film illustrates the worst part of surviving war, the memories. For many soldiers, men and women alike, returning home can be the beginning of real problems. I am reminded of my father and his brothers returning from WWII. For one of my uncles the war was never over. He survived the D-Day invasion, something akin to the first 20 minutes of Saving Private Ryan. For him the memories not only lingered but tortured him. He became an alcoholic as did several of my cousins, his sons. Jump ahead 60 years and place the soldiers in a different war, in a different country, the result is the same. When I saw this at the KC FilmFest, I was reminded that there are somethings about war that never change. The idealistic young men and women are not spared the emotional torment of what happened in Iraq, and especially if you are against the war you will come away with more compassion for the soldiers there trying to do what they believe or have been told is right.<br /><br />The tag line from the Vietnam war film Platoon says it all. "The First Casualty of War is Innocence."', shape=(), dtype=string)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)

Processing words as a set: The bag-of-words approach

The simplest way to encode a piece of text for processing by a machine learning model is to discard order and treat it as a set (a "bag") of tokens. You could either look at individual words (unigrams), or try to recover some local order information by looking at groups of consecutive tokens (N-grams).

Single Words (Unigram) with Binary Encoding

The main advantage of this encoding is that you can represent an entire text as a single vector, where each entry is a presence indicator for a given word.

from tensorflow.keras.layers import TextVectorization
"""
Preprocessing our dataset with a TextVectorization layer

Limit the vocabulary to the 20,000 most frequent words. Otherwise, we'd be
indexing every word in the training data. In general, 20,000 is the right
vocabulary size for text classification.
"""
text_vectorization = TextVectorization(
 max_tokens=20000,
 # Encode the output tokens as multi-hot binary vectors
 output_mode="multi_hot",
)
# Prepare a dataset that only yields raw text inputs
# (no labels)
text_only_train_ds = train_ds.map(lambda x, y: x)
# Use that dataset to index the dataset vocabulary via the
# adapt() method
text_vectorization.adapt(text_only_train_ds)
"""
Prepare processed versions of our training, validation,
and test datasets. Make sure to specify num_parallel_calls to leverage
multiple CPU cores.
"""
binary_1gram_train_ds = train_ds.map(
 lambda x, y: (text_vectorization(x), y),
 num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(
 lambda x, y: (text_vectorization(x), y),
 num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(
 lambda x, y: (text_vectorization(x), y),
 num_parallel_calls=4)
out[9]
# Inspecting the output of our binary unigram dataset
for inputs, targets in binary_1gram_train_ds:
  print("inputs.shape:", inputs.shape)
  print("inputs.dtype:", inputs.dtype)
  print("targets.shape:", targets.shape)
  print("targets.dtype:", targets.dtype)
  print("inputs[0]:", inputs[0])
  print("targets[0]:", targets[0])
  break
out[10]

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'int64'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1 1 1 ... 0 0 0], shape=(20000,), dtype=int64)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)

"""
Our model-building utility
"""
from tensorflow import keras
from tensorflow.keras import layers
def get_model(max_tokens=20000, hidden_dim=16):
  inputs = keras.Input(shape=(max_tokens,))
  x = layers.Dense(hidden_dim, activation="relu")(inputs)
  x = layers.Dropout(0.5)(x)
  outputs = layers.Dense(1, activation="sigmoid")(x)
  model = keras.Model(inputs, outputs)
  model.compile(optimizer="rmsprop",
  loss="binary_crossentropy",
  metrics=["accuracy"])
  return model
out[11]
"""
Training and testing the binary unigram model
"""
model = get_model()
model.summary()
callbacks = [
 keras.callbacks.ModelCheckpoint("binary_1gram.keras",save_best_only=True)
]
"""
We call cache() on the datasets to cache them in memory:
this way, we will only do the preprocessing once, during the first epoch, and
we'll reuse the preprocessed texts for the following epochs. This can only be
done if the data is small enough to fit in memory.
"""
model.fit(binary_1gram_train_ds.cache(),
validation_data=binary_1gram_val_ds.cache(),
 epochs=10,
 callbacks=callbacks)
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")
out[12]

Model: "functional"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓

┃ Layer (type)  ┃ Output Shape  ┃  Param # ┃

┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩

│ input_layer (InputLayer) │ (None, 20000) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ dense (Dense) │ (None, 16) │ 320,016 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ dropout (Dropout) │ (None, 16) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ dense_1 (Dense) │ (None, 1) │ 17 │

└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘

 Total params: 320,033 (1.22 MB)

 Trainable params: 320,033 (1.22 MB)

 Non-trainable params: 0 (0.00 B)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 60s 92ms/step - accuracy: 0.7688 - loss: 0.4911 - val_accuracy: 0.8914 - val_loss: 0.2743
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8919 - loss: 0.2816 - val_accuracy: 0.8946 - val_loss: 0.2746
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9151 - loss: 0.2385 - val_accuracy: 0.8974 - val_loss: 0.2752
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9235 - loss: 0.2249 - val_accuracy: 0.8956 - val_loss: 0.2901
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9262 - loss: 0.2193 - val_accuracy: 0.8940 - val_loss: 0.3037
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9323 - loss: 0.2041 - val_accuracy: 0.8946 - val_loss: 0.3091
Epoch 7/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9330 - loss: 0.2093 - val_accuracy: 0.8954 - val_loss: 0.3240
Epoch 8/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9379 - loss: 0.2011 - val_accuracy: 0.8950 - val_loss: 0.3328
Epoch 9/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9394 - loss: 0.2002 - val_accuracy: 0.8874 - val_loss: 0.3416
Epoch 10/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9374 - loss: 0.2045 - val_accuracy: 0.8856 - val_loss: 0.3483
782/782 ━━━━━━━━━━━━━━━━━━━━ 55s 70ms/step - accuracy: 0.8866 - loss: 0.2963
Test acc: 0.886

"""
Configuring the TextVectorization layer to return bigrams
"""
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="multi_hot",
)
out[13]
"""
Training and testing the binary bigram model
"""
text_vectorization.adapt(text_only_train_ds)
binary_2gram_train_ds = train_ds.map(
 lambda x, y: (text_vectorization(x), y),
 num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(
 lambda x, y: (text_vectorization(x), y),
 num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(
 lambda x, y: (text_vectorization(x), y),
 num_parallel_calls=4)
model = get_model()
model.summary()
callbacks = [
 keras.callbacks.ModelCheckpoint("binary_2gram.keras",
 save_best_only=True)
]
model.fit(binary_2gram_train_ds.cache(),
 validation_data=binary_2gram_val_ds.cache(),
 epochs=10,
 callbacks=callbacks)
model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")
out[14]

Model: "functional_1"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓

┃ Layer (type)  ┃ Output Shape  ┃  Param # ┃

┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩

│ input_layer_1 (InputLayer) │ (None, 20000) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ dense_2 (Dense) │ (None, 16) │ 320,016 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ dropout_1 (Dropout) │ (None, 16) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ dense_3 (Dense) │ (None, 1) │ 17 │

└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘

 Total params: 320,033 (1.22 MB)

 Trainable params: 320,033 (1.22 MB)

 Non-trainable params: 0 (0.00 B)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 112s 177ms/step - accuracy: 0.7908 - loss: 0.4515 - val_accuracy: 0.8988 - val_loss: 0.2515
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9097 - loss: 0.2415 - val_accuracy: 0.9054 - val_loss: 0.2498
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9313 - loss: 0.1998 - val_accuracy: 0.9048 - val_loss: 0.2671
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9411 - loss: 0.1838 - val_accuracy: 0.9050 - val_loss: 0.2859
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9532 - loss: 0.1661 - val_accuracy: 0.9028 - val_loss: 0.3023
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9521 - loss: 0.1742 - val_accuracy: 0.9036 - val_loss: 0.3185
Epoch 7/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9561 - loss: 0.1552 - val_accuracy: 0.9016 - val_loss: 0.3312
Epoch 8/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9588 - loss: 0.1544 - val_accuracy: 0.9028 - val_loss: 0.3383
Epoch 9/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9573 - loss: 0.1582 - val_accuracy: 0.9018 - val_loss: 0.3432
Epoch 10/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9599 - loss: 0.1488 - val_accuracy: 0.9016 - val_loss: 0.3512
782/782 ━━━━━━━━━━━━━━━━━━━━ 106s 135ms/step - accuracy: 0.9018 - loss: 0.2699
Test acc: 0.901

The improvement in test accuracy when moving from unigram encoding (88.6%) to binary bigram encoding (90.1%) shows that local word order carries useful information.

Bigrams with TF-IDF Encoding

If you're doing text classification, knowing how many times a word occurs in a sample is critical: any sufficiently long movie review may contain the word "terrible" regardless of sentiment, but a review that contains many instances of the word "terrible" is likely a negative one.

"""
Configuring the TextVectorization layer to return token counts
"""
text_vectorization = TextVectorization(
 ngrams=2,
 max_tokens=20000,
 output_mode="count"
)
out[16]

The best practice for normalizing token counts is TF-IDF normalization - TF-IDF stands for "term frequency, inverse document frequency". For many text-classification datasets, it is typical to see a one-percentage-point increase in accuracy when using TF-IDF compared to plain binary encoding.
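As a rough sketch of the idea (the exact weighting used by the Keras layer may differ slightly; this helper is purely illustrative):

import math

def tfidf(term, document, dataset):
  # Term frequency: how often the term occurs in this document (a list of tokens)
  term_freq = document.count(term)
  # Document frequency: in how many documents of the dataset the term appears
  doc_freq = sum(1 for doc in dataset if term in doc)
  # Terms frequent in this document but rare across the dataset get the highest weights
  return term_freq * math.log(len(dataset) / (1 + doc_freq))

docs = [["terrible", "terrible", "acting"], ["great", "movie"], ["the", "movie"]]
print(tfidf("terrible", docs[0], docs))  # high weight: frequent here, rare elsewhere
print(tfidf("movie", docs[1], docs))     # low weight: common across the dataset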

"""
Configuring TextVectorization to return TF-IDF-weighted outputs
"""
text_vectorization = TextVectorization(
 ngrams=2,
 max_tokens=20000,
 output_mode="tf_idf",
)
out[18]
"""
Training and testing the TF-IDF bigram model
"""
# The adapt() call will learn the TF-IDF weights in addition to
# the vocabulary.
text_vectorization.adapt(text_only_train_ds)
tfidf_2gram_train_ds = train_ds.map(
 lambda x, y: (text_vectorization(x), y),
 num_parallel_calls=4)
tfidf_2gram_val_ds = val_ds.map(
 lambda x, y: (text_vectorization(x), y),
 num_parallel_calls=4)
tfidf_2gram_test_ds = test_ds.map(
 lambda x, y: (text_vectorization(x), y),
 num_parallel_calls=4)
model = get_model()
model.summary()
callbacks = [
 keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
 save_best_only=True)
]
model.fit(tfidf_2gram_train_ds.cache(),
 validation_data=tfidf_2gram_val_ds.cache(),
 epochs=10,
 callbacks=callbacks)
model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")
out[19]

Model: "functional_2"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓

┃ Layer (type)  ┃ Output Shape  ┃  Param # ┃

┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩

│ input_layer_2 (InputLayer) │ (None, 20000) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ dense_4 (Dense) │ (None, 16) │ 320,016 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ dropout_2 (Dropout) │ (None, 16) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ dense_5 (Dense) │ (None, 1) │ 17 │

└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘

 Total params: 320,033 (1.22 MB)

 Trainable params: 320,033 (1.22 MB)

 Non-trainable params: 0 (0.00 B)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 107s 168ms/step - accuracy: 0.6868 - loss: 0.6616 - val_accuracy: 0.8958 - val_loss: 0.2840
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8464 - loss: 0.3428 - val_accuracy: 0.9026 - val_loss: 0.2796
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8757 - loss: 0.2914 - val_accuracy: 0.8880 - val_loss: 0.2942
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8795 - loss: 0.2747 - val_accuracy: 0.8956 - val_loss: 0.2953
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8884 - loss: 0.2633 - val_accuracy: 0.8896 - val_loss: 0.3075
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9001 - loss: 0.2413 - val_accuracy: 0.8908 - val_loss: 0.3251
Epoch 7/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9036 - loss: 0.2271 - val_accuracy: 0.8912 - val_loss: 0.3402
Epoch 8/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9021 - loss: 0.2293 - val_accuracy: 0.8888 - val_loss: 0.3373
Epoch 9/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9054 - loss: 0.2237 - val_accuracy: 0.8798 - val_loss: 0.3424
Epoch 10/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9099 - loss: 0.2122 - val_accuracy: 0.8794 - val_loss: 0.3536
782/782 ━━━━━━━━━━━━━━━━━━━━ 103s 131ms/step - accuracy: 0.8951 - loss: 0.2903
Test acc: 0.894

Processing words as a sequence: The sequence model approach

What if, instead of manually crafting order-based features, we exposed the model to raw word sequences and let it figure out such features on its own? This is what sequence models are about.

To implement a sequence model, you'd start by representing your input samples as sequences of integer indices (one integer standing for one word). Then, you'd map each integer to a vector to obtain vector sequences. Finally, you'd feed these sequences of vectors into a stack of layers that can cross-correlate features from adjacent vectors, such as a 1D convnet, an RNN, or a Transformer.

"""
Preparing Integer Datasets
"""
from tensorflow.keras import layers
max_length = 600
max_tokens = 20000
text_vectorization = layers.TextVectorization(
 max_tokens=max_tokens,
 output_mode="int",
 # In order to keep a manageable input size, we'll truncate the inputs
 # after the first 600 words
 # This is a reasonable choice, since the average review length is 233
 # words, and only 5% of reviews are longer than 600 words.
 output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)
int_train_ds = train_ds.map(
 lambda x, y: (text_vectorization(x), y),
 num_parallel_calls=4)
int_val_ds = val_ds.map(
 lambda x, y: (text_vectorization(x), y),
 num_parallel_calls=4)
int_test_ds = test_ds.map(
 lambda x, y: (text_vectorization(x), y),
 num_parallel_calls=4)
out[21]
"""
The simplest way to convert the integer sequences to vector sequences is to one-
hot encode the integers. On top of these one-hot vectors, we add a simple
bidirectional LSTM.

A sequence model built on one-hot encoded vector sequences
"""
import tensorflow as tf
# One input is a sequence of integers
inputs = keras.Input(shape=(None,), dtype="int64")
# Encode the integers into binary 20,000-dimensional vectors
embedded = tf.one_hot(inputs, depth=max_tokens)
# Add a bidirectional LSTM
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
# Add a classification layer
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
 loss="binary_crossentropy",
 metrics=["accuracy"])
model.summary()
out[22]
"""
Training a first basic sequence model
"""
callbacks = [
 keras.callbacks.ModelCheckpoint("one_hot_bidir_lstm.keras",
 save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10,
 callbacks=callbacks)
model = keras.models.load_model("one_hot_bidir_lstm.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")
"""
Observations:
The model trains very slowly, especially when compared to the lightweight model
of the previous section. This is because our inputs are quite large: each sample
is encoded as a matrix of shape (600, 20000). Second, the model only gets to 87%
test accuracy - clearly, one-hot encoding wasn't a great way to turn words into
vectors. A better way: *word embeddings*.
"""
out[23]
Understanding Word Embeddings

When you encode something with one-hot encoding, you're assuming that the different tokens you're encoding are all independent from each other: one-hot vectors are all orthogonal to one another. With word vectors, by contrast, the geometric relationship between two vectors should reflect the semantic relationship between the two words. Word embeddings are vector representations of words that map human language into a structured geometric space. Word embeddings are low-dimensional floating-point vectors (dense vectors). It is common to see word embeddings that are 256-dimensional, 512-dimensional, or 1,024-dimensional when dealing with very large vocabularies.

One-Hot Encoding vs Word Embeddings

Word embeddings are also structured representations - their structure is learned from data. Similar words get embedded in close locations, and further, specific directions in the embedding space are meaningful. Two ways to obtain word embeddings:

  • Learn word embeddings jointly with the main task you care about (such as document classification or sentiment prediction). In this setup, you start with random word vectors and then learn word vectors in the same way you learn the weights of a neural network.
  • Load into your model word embeddings that were precomputed using a different machine learning task than the one you're trying to solve. These are called pretrained word embeddings.
Learning Word Embeddings with the Embedding Layer

What makes a perfect word-embedding space depends heavily on your task, because the importance of certain semantic relationships varies from task to task.

# The Embedding layer takes at least two arguments: the number of possible
# tokens and the dimensionality of the embeddings (here, 256)
embedding_layer = layers.Embedding(input_dim=max_tokens, output_dim=256)

The Embedding layer is best understood as a dictionary that maps integer indices (which stand for specific words) to dense vectors. It takes integers as input, looks up these integers in an internal dictionary, and returns the associated vectors. The layer takes rank-2 tensors of integers, of shape (batch_size, sequence_length), where each entry is a sequence of integers. The layer then returns a 3D floating-point tensor of shape (batch_size, sequence_length, embedding_dimensionality).
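A quick sketch of that shape transformation (the dummy batch below is random, just to show the shapes):

import numpy as np
from tensorflow.keras import layers

embedding_layer = layers.Embedding(input_dim=20000, output_dim=256)
# A batch of 32 sequences, each made of 600 integer token indices
dummy_inputs = np.random.randint(0, 20000, size=(32, 600))
embedded = embedding_layer(dummy_inputs)
print(embedded.shape)  # (32, 600, 256)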

When you instantiate an Embedding layer, its weights (an internal dictionary of token vectors) are initially random. During training, these word vectors are gradually adjusted via backpropagation, structuring the space into something the downstream model can exploit.

"""
Model that uses an Embedding layer trained from scratch
"""
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
 loss="binary_crossentropy",
 metrics=["accuracy"])
model.summary()
callbacks = [
 keras.callbacks.ModelCheckpoint("embeddings_bidir_gru.keras",
 save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10,
 callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_gru.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")
out[25]

Model: "functional_3"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓

┃ Layer (type)  ┃ Output Shape  ┃  Param # ┃

┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩

│ input_layer_4 (InputLayer) │ (None, None) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ embedding (Embedding) │ (None, None, 256) │ 5,120,000 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ bidirectional (Bidirectional) │ (None, 64) │ 73,984 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ dropout_3 (Dropout) │ (None, 64) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ dense_6 (Dense) │ (None, 1) │ 65 │

└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘

 Total params: 5,194,049 (19.81 MB)

 Trainable params: 5,194,049 (19.81 MB)

 Non-trainable params: 0 (0.00 B)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 34s 47ms/step - accuracy: 0.6210 - loss: 0.6247 - val_accuracy: 0.8236 - val_loss: 0.4160
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 29s 46ms/step - accuracy: 0.8388 - loss: 0.4031 - val_accuracy: 0.8462 - val_loss: 0.3864
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 28s 45ms/step - accuracy: 0.8694 - loss: 0.3368 - val_accuracy: 0.7486 - val_loss: 0.6713
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 29s 46ms/step - accuracy: 0.8959 - loss: 0.2860 - val_accuracy: 0.8820 - val_loss: 0.3065
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 29s 46ms/step - accuracy: 0.9154 - loss: 0.2452 - val_accuracy: 0.8740 - val_loss: 0.3426
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 28s 45ms/step - accuracy: 0.9295 - loss: 0.1980 - val_accuracy: 0.8776 - val_loss: 0.3280
Epoch 7/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 28s 45ms/step - accuracy: 0.9448 - loss: 0.1659 - val_accuracy: 0.8842 - val_loss: 0.3521
Epoch 8/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 28s 46ms/step - accuracy: 0.9544 - loss: 0.1427 - val_accuracy: 0.8698 - val_loss: 0.3598
Epoch 9/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 28s 45ms/step - accuracy: 0.9656 - loss: 0.1109 - val_accuracy: 0.8810 - val_loss: 0.3768
Epoch 10/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 28s 45ms/step - accuracy: 0.9705 - loss: 0.0992 - val_accuracy: 0.8728 - val_loss: 0.4521
782/782 ━━━━━━━━━━━━━━━━━━━━ 15s 19ms/step - accuracy: 0.8689 - loss: 0.3382
Test acc: 0.868

Understanding Padding and Masking

One thing that's slightly hurting model performance here (the word-embedding model falls short of the bigram model) is that our input sequences are full of zeros. max_length=600 means that sentences longer than 600 tokens are truncated to a length of 600 tokens, and sentences shorter than 600 tokens are padded with zeros at the end so that they can be concatenated with other sequences to form contiguous batches.

We need a way to tell the RNN that it should skip these padded iterations: there's an API for that, masking. The mask is a tensor of ones and zeros of shape (batch_size, sequence_length), where each entry indicates whether a given timestep of a given sample should be skipped. By default, masking is not active - you opt in by passing mask_zero=True to the Embedding layer.
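A small sketch of what such a mask looks like for a zero-padded toy batch:

import tensorflow as tf
from tensorflow.keras import layers

embedding_layer = layers.Embedding(input_dim=10, output_dim=4, mask_zero=True)
# Two short sequences, padded with zeros at the end
padded_batch = tf.constant([[5, 3, 2, 0, 0],
                            [7, 1, 0, 0, 0]])
# The mask is True for real tokens and False for padding
print(embedding_layer.compute_mask(padded_batch))
# tf.Tensor(
# [[ True  True  True False False]
#  [ True  True False False False]], shape=(2, 5), dtype=bool)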

"""
Using an Embedding layer with masking enabled
"""
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(
 input_dim=max_tokens, output_dim=256, mask_zero=True)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
 loss="binary_crossentropy",
 metrics=["accuracy"])
model.summary()
callbacks = [
 keras.callbacks.ModelCheckpoint("embeddings_bidir_gru_with_masking.keras",
 save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10,
 callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_gru_with_masking.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")
out[27]

Model: "functional_4"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓

┃ Layer (type)  ┃ Output Shape  ┃  Param # ┃ Connected to  ┃

┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩

│ input_layer_5 │ (None, None) │ 0 │ - │

│ (InputLayer) │ │ │ │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ embedding_1 (Embedding) │ (None, None, 256) │ 5,120,000 │ input_layer_5[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ not_equal (NotEqual) │ (None, None) │ 0 │ input_layer_5[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ bidirectional_1 │ (None, 64) │ 73,984 │ embedding_1[0][0], │

│ (Bidirectional) │ │ │ not_equal[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ dropout_4 (Dropout) │ (None, 64) │ 0 │ bidirectional_1[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ dense_7 (Dense) │ (None, 1) │ 65 │ dropout_4[0][0] │

└───────────────────────────┴────────────────────────┴────────────────┴────────────────────────┘

 Total params: 5,194,049 (19.81 MB)

 Trainable params: 5,194,049 (19.81 MB)

 Non-trainable params: 0 (0.00 B)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 30s 46ms/step - accuracy: 0.6752 - loss: 0.5741 - val_accuracy: 0.8326 - val_loss: 0.3731
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 28s 45ms/step - accuracy: 0.8585 - loss: 0.3314 - val_accuracy: 0.8804 - val_loss: 0.2909
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 28s 44ms/step - accuracy: 0.8947 - loss: 0.2624 - val_accuracy: 0.8724 - val_loss: 0.3155
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 28s 44ms/step - accuracy: 0.9219 - loss: 0.2055 - val_accuracy: 0.8860 - val_loss: 0.3450
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 28s 44ms/step - accuracy: 0.9385 - loss: 0.1625 - val_accuracy: 0.8876 - val_loss: 0.3102
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 28s 44ms/step - accuracy: 0.9579 - loss: 0.1163 - val_accuracy: 0.8916 - val_loss: 0.3550
Epoch 7/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 28s 45ms/step - accuracy: 0.9708 - loss: 0.0850 - val_accuracy: 0.8804 - val_loss: 0.3801
Epoch 8/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 28s 44ms/step - accuracy: 0.9786 - loss: 0.0638 - val_accuracy: 0.8744 - val_loss: 0.4436
Epoch 9/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 28s 44ms/step - accuracy: 0.9860 - loss: 0.0440 - val_accuracy: 0.8736 - val_loss: 0.5018
Epoch 10/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 28s 44ms/step - accuracy: 0.9893 - loss: 0.0352 - val_accuracy: 0.8794 - val_loss: 0.5683
782/782 ━━━━━━━━━━━━━━━━━━━━ 15s 18ms/step - accuracy: 0.8762 - loss: 0.2993
Test acc: 0.876

Using Pretrained Word Embeddings

Sometimes you have so little training data available that you can't use your data alone to learn an appropriate task-specific embedding of your vocabulary. In such cases, you can load embedding vectors from a precomputed embedding space that you know is highly structured and exhibits useful properties - one that captures generic aspects of language structure. Word2Vec and GloVe (Global Vectors for Word Representation) are two examples of pretrained word embeddings.

# Download the GloVe word embeddings precomputed on the 2014 English Wikipedia
# Dataset
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip
out[29]

--2024-09-08 23:40:20-- http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2024-09-08 23:40:20-- https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-09-08 23:40:21-- https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’

glove.6B.zip 100%[===================>] 822.24M 5.10MB/s in 2m 45s

2024-09-08 23:43:07 (4.97 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

"""
Parsing the GloVe word-embeddings file

Parse the unzipped .txt file to build an index that maps words
as strings to their vector representation
"""
import numpy as np
path_to_glove_file = "glove.6B.100d.txt"
embeddings_index = {}
with open(path_to_glove_file) as f:
  for line in f:
    word, coefs = line.split(maxsplit=1)
    coefs = np.fromstring(coefs, "f", sep=" ")
    embeddings_index[word] = coefs
print(f"Found {len(embeddings_index)} word vectors.")
out[30]

Found 400000 word vectors.

"""
Preparing the GloVe word-embeddings matrix

- Build an embedding matrix that you can load into an
Embedding layer
- It must be a matrix of shape (max_words, embedding_dim), where
each entry i contains the embedding_dim-dimensional vector for the
word of index i in the reference word index
"""
embedding_dim = 100
# Retrieve the vocabulary indexed by our previous TextVectorization layer
vocabulary = text_vectorization.get_vocabulary()
# Use it to create a mapping from words to their index in the vocabulary
word_index = dict(zip(vocabulary, range(len(vocabulary))))
# Prepare a matrix that we'll fill with the GloVe vectors
embedding_matrix = np.zeros((max_tokens, embedding_dim))
for word, i in word_index.items():
  if i < max_tokens:
    embedding_vector = embeddings_index.get(word)
    # Fill entry i in the matrix with the word vector for index i
    # Words not found in the embedding index will be all zeros
    if embedding_vector is not None:
      embedding_matrix[i] = embedding_vector
out[31]
"""
Use a Constant initializer to load the pretrained embeddings in an
Embedding layer. Freeze the pretrained representation during training - set
trainable=False
"""
embedding_layer = layers.Embedding(
  max_tokens,
  embedding_dim,
  embeddings_initializer=keras.initializers.Constant(embedding_matrix),
  trainable=False,
  mask_zero=True,
)
out[32]
"""
Model that uses a pretrained Embedding layer
"""
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = embedding_layer(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
 loss="binary_crossentropy",
 metrics=["accuracy"])
model.summary()
callbacks = [
 keras.callbacks.ModelCheckpoint("glove_embeddings_sequence_model.keras",
 save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10,
 callbacks=callbacks)
model = keras.models.load_model("glove_embeddings_sequence_model.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")
out[33]

Model: "functional_5"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓

┃ Layer (type)  ┃ Output Shape  ┃  Param # ┃ Connected to  ┃

┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩

│ input_layer_6 │ (None, None) │ 0 │ - │

│ (InputLayer) │ │ │ │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ embedding_2 (Embedding) │ (None, None, 100) │ 2,000,000 │ input_layer_6[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ not_equal_2 (NotEqual) │ (None, None) │ 0 │ input_layer_6[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ bidirectional_2 │ (None, 64) │ 34,048 │ embedding_2[0][0], │

│ (Bidirectional) │ │ │ not_equal_2[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ dropout_5 (Dropout) │ (None, 64) │ 0 │ bidirectional_2[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ dense_8 (Dense) │ (None, 1) │ 65 │ dropout_5[0][0] │

└───────────────────────────┴────────────────────────┴────────────────┴────────────────────────┘

 Total params: 2,034,113 (7.76 MB)

 Trainable params: 34,113 (133.25 KB)

 Non-trainable params: 2,000,000 (7.63 MB)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 34s 52ms/step - accuracy: 0.6177 - loss: 0.6401 - val_accuracy: 0.7958 - val_loss: 0.4509
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 32s 50ms/step - accuracy: 0.7747 - loss: 0.4801 - val_accuracy: 0.8014 - val_loss: 0.4413
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 33s 52ms/step - accuracy: 0.8144 - loss: 0.4173 - val_accuracy: 0.8424 - val_loss: 0.3748
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 28s 44ms/step - accuracy: 0.8268 - loss: 0.3847 - val_accuracy: 0.8284 - val_loss: 0.3887
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 32s 51ms/step - accuracy: 0.8489 - loss: 0.3517 - val_accuracy: 0.8606 - val_loss: 0.3297
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 32s 51ms/step - accuracy: 0.8548 - loss: 0.3341 - val_accuracy: 0.8678 - val_loss: 0.3220
Epoch 7/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 32s 52ms/step - accuracy: 0.8666 - loss: 0.3054 - val_accuracy: 0.8706 - val_loss: 0.3188
Epoch 8/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 32s 52ms/step - accuracy: 0.8767 - loss: 0.2961 - val_accuracy: 0.8742 - val_loss: 0.3111
Epoch 9/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 32s 51ms/step - accuracy: 0.8823 - loss: 0.2825 - val_accuracy: 0.8696 - val_loss: 0.3084
Epoch 10/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 28s 45ms/step - accuracy: 0.8931 - loss: 0.2637 - val_accuracy: 0.8766 - val_loss: 0.3131
782/782 ━━━━━━━━━━━━━━━━━━━━ 16s 20ms/step - accuracy: 0.8668 - loss: 0.3121
Test acc: 0.868

The Transformer Architecture

Transformers were introduced in the seminal paper "Attention Is All You Need" by Vaswani et al. The gist of the paper: a simple mechanism called neural attention could be used to build powerful sequence models that didn't feature any recurrent or convolutional layers.

Understanding Self-Attention

There are many different forms of attention you could imagine, but they all start by computing importance scores for a set of features, with higher scores for more relevant features and lower scores for less relevant ones. A smart embedding space would provide a different vector representation for a word depending on the other words surrounding it. That's where self-attention comes in. The purpose of self-attention is to modulate the representation of a token by using the representations of related tokens in the sequence. This produces context-aware token representations.

Self Attention

Step 1 is to compute relevancy scores between the vector for "station" and every other word in the sentence. These are our "attention scores". We're simply going to use the dot product between two word vectors as a measure of the strength of their relationship.

Step 2 is to compute the sum of all word vectors in a sentence, weighted by our relevancy scores. The resulting vector is our new representation for "station": a representation which incorporates the surrounding context.

# NumPy-like pseudocode - made runnable by adding the import and a softmax helper
import numpy as np

def softmax(x):
  # Softmax over a 1D array of scores
  e = np.exp(x - x.max())
  return e / e.sum()

def self_attention(input_sequence):
  output = np.zeros(shape=input_sequence.shape)
  # Iterate over each token in the input sequence
  for i, pivot_vector in enumerate(input_sequence):
    scores = np.zeros(shape=(len(input_sequence),))
    for j, vector in enumerate(input_sequence):
      # Compute the dot product (attention score) between the token
      # and every other token
      scores[j] = np.dot(pivot_vector, vector.T)
    # Scale by a normalization factor and apply a softmax
    scores /= np.sqrt(input_sequence.shape[1])
    scores = softmax(scores)
    new_pivot_representation = np.zeros(shape=pivot_vector.shape)
    for j, vector in enumerate(input_sequence):
      # Take the sum of all tokens weighted by the attention scores
      new_pivot_representation += vector * scores[j]
    # That sum is our output
    output[i] = new_pivot_representation
  return output

In Keras, the corresponding built-in layer is MultiHeadAttention:

from tensorflow.keras.layers import MultiHeadAttention

num_heads = 4
embed_dim = 256
mha_layer = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
# Self-attention: the same sequence is passed as the query, value, and key arguments
outputs = mha_layer(inputs, inputs, inputs)
Generalized Self-Attention: The Query-Key-Value Model

A Transformer is a sequence-to-sequence model: it was designed to convert one sequence into another. Self-attention generalizes to three distinct inputs: for each element in the query, compute how much that element is related to every key, and use these scores to weight a sum of the values:

outputs = sum(values * pairwise_scores(query, keys))
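A minimal NumPy sketch of that computation, with random query, key, and value sequences (shapes are arbitrary, just for illustration):

import numpy as np

def softmax(x, axis=-1):
  e = np.exp(x - x.max(axis=axis, keepdims=True))
  return e / e.sum(axis=axis, keepdims=True)

def qkv_attention(query, keys, values):
  # How much each query element relates to each key, scaled by sqrt(key dimension)
  scores = query @ keys.T / np.sqrt(keys.shape[-1])
  weights = softmax(scores, axis=-1)
  # Use the scores to weight a sum of the values
  return weights @ values

seq_len, dim = 4, 8
query = np.random.randn(seq_len, dim)
keys = np.random.randn(seq_len, dim)
values = np.random.randn(seq_len, dim)
print(qkv_attention(query, keys, values).shape)  # (4, 8)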

Query, Keys, and Values

Multi-Head Attention

The "multi-head" moniker refers to the fact that the output space of the self-attention layer gets factored into a set of independent subspaces, learned separately: the initial query, key, and value are sent through three independent sets of dense projections, resulting in three separate vectors. Each vector is processed via neural attention, and the different outputs are concatenated back together into a single output sequence. Each such subspace is called a head.

The presence of the learnable dense projections enables the layer to actually learn something, as opposed to being a purely stateless transformation that would require additional layers before or after it to be useful.

MultiHeadAttention layer

The Transformer Encoder

Factoring outputs into multiple independent spaces, adding residual connections, adding normalization layers - all of these are standard architecture patterns that one would be wise to leverage in any complex model. Together, these bells and whistles form the Transformer encoder - one of two critical parts that make up the Transformer architecture.

The Transformer Encoder

"""
Transformer encoder implemented as a subclassed Layer
"""

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
class TransformerEncoder(layers.Layer):
  def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
    super().__init__(**kwargs)
    # Size of the input token vectors
    self.embed_dim = embed_dim
    # Size of the inner dense layer
    self.dense_dim = dense_dim
    # Number of attention heads
    self.num_heads = num_heads
    self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
    self.dense_proj = keras.Sequential([
        layers.Dense(dense_dim, activation="relu"),
        layers.Dense(embed_dim),
      ]
    )
    self.layernorm_1 = layers.LayerNormalization()
    self.layernorm_2 = layers.LayerNormalization()

  def call(self, inputs, mask=None):
    """
    Computation goes in call()
    """
    if mask is not None:
      # The mask that will be generated by the Embedding layer will
      # be 2D, but the attention layer expects it to be 3D or 4D,
      # so we expand its rank
      mask = mask[:, tf.newaxis, :]
    attention_output = self.attention(inputs, inputs, attention_mask=mask)
    proj_input = self.layernorm_1(inputs + attention_output)
    proj_output = self.dense_proj(proj_input)
    return self.layernorm_2(proj_input + proj_output)
  def get_config(self):
    """
    Implement serialization so we can save the model

    When you write custom layers, make sure to implement the get_config method:
    this enables the layer to be reinstantiated from its config dict, which is
    useful during model saving and loading. This method should return a Python
    dict that contains the values of the constructor arguments used to create
    the layer.
    """
    config = super().get_config()
    config.update({
        "embed_dim": self.embed_dim,
        "num_heads": self.num_heads,
        "dense_dim": self.dense_dim,
    })
    return config
out[35]

Note that we use a LayerNormalization layer here instead of BatchNormalization: BatchNormalization doesn't work well with sequence data, whereas LayerNormalization normalizes each sequence independently of the other sequences in the batch.
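In NumPy pseudocode (ignoring the learnable scale and offset parameters), the difference comes down to which axes the statistics are pooled over; this is only a sketch:

import numpy as np

def layer_normalization(batch_of_sequences):
  # Input shape: (batch_size, sequence_length, embedding_dim)
  # Statistics are computed per sample and per timestep, over the last axis only
  mean = batch_of_sequences.mean(axis=-1, keepdims=True)
  variance = batch_of_sequences.var(axis=-1, keepdims=True)
  return (batch_of_sequences - mean) / np.sqrt(variance + 1e-6)

def batch_normalization(batch_of_images):
  # Input shape: (batch_size, height, width, channels)
  # Statistics are pooled across all samples in the batch, which interacts
  # poorly with variable-length, padded sequence data
  mean = batch_of_images.mean(axis=(0, 1, 2), keepdims=True)
  variance = batch_of_images.var(axis=(0, 1, 2), keepdims=True)
  return (batch_of_images - mean) / np.sqrt(variance + 1e-6)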

"""
Using the Transformer encoder for text classification
"""
vocab_size = 20000
embed_dim = 256
num_heads = 2
dense_dim = 32
inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
"""
Since TransformerEncoder returns full sequences,
we need to reduce each sequence to a single vector for classification
via a global pooling layer.
"""
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
 loss="binary_crossentropy",
 metrics=["accuracy"]
)
model.summary()
"""
Training and evaluating the Transformer-encoder-based model
"""
callbacks = [
 keras.callbacks.ModelCheckpoint("transformer_encoder.keras",
 save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=20,
callbacks=callbacks)
model = keras.models.load_model(
 "transformer_encoder.keras",
 custom_objects={"TransformerEncoder": TransformerEncoder}) # Provide the custom
 # TransformerEncoder class to the model-loading process
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")
out[37]

Model: "functional_7"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓

┃ Layer (type)  ┃ Output Shape  ┃  Param # ┃

┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩

│ input_layer_7 (InputLayer) │ (None, None) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ embedding_3 (Embedding) │ (None, None, 256) │ 5,120,000 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ transformer_encoder │ (None, None, 256) │ 543,776 │

│ (TransformerEncoder) │ │ │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ global_max_pooling1d │ (None, 256) │ 0 │

│ (GlobalMaxPooling1D) │ │ │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ dropout_7 (Dropout) │ (None, 256) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ dense_11 (Dense) │ (None, 1) │ 257 │

└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘

 Total params: 5,664,033 (21.61 MB)

 Trainable params: 5,664,033 (21.61 MB)

 Non-trainable params: 0 (0.00 B)

Epoch 1/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 86s 104ms/step - accuracy: 0.6113 - loss: 0.7622 - val_accuracy: 0.8374 - val_loss: 0.3668
Epoch 2/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 56s 90ms/step - accuracy: 0.8227 - loss: 0.3927 - val_accuracy: 0.8602 - val_loss: 0.3277
Epoch 3/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 56s 90ms/step - accuracy: 0.8554 - loss: 0.3358 - val_accuracy: 0.8454 - val_loss: 0.3461
Epoch 4/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 57s 92ms/step - accuracy: 0.8690 - loss: 0.3087 - val_accuracy: 0.8702 - val_loss: 0.3093
Epoch 5/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 57s 92ms/step - accuracy: 0.8814 - loss: 0.2876 - val_accuracy: 0.8758 - val_loss: 0.2913
Epoch 6/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 58s 92ms/step - accuracy: 0.8934 - loss: 0.2633 - val_accuracy: 0.8768 - val_loss: 0.2967
Epoch 7/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 58s 92ms/step - accuracy: 0.9005 - loss: 0.2445 - val_accuracy: 0.8802 - val_loss: 0.2899
Epoch 8/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 58s 93ms/step - accuracy: 0.9125 - loss: 0.2223 - val_accuracy: 0.8808 - val_loss: 0.2882
Epoch 9/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 57s 91ms/step - accuracy: 0.9190 - loss: 0.2059 - val_accuracy: 0.8754 - val_loss: 0.3156
Epoch 10/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 57s 91ms/step - accuracy: 0.9266 - loss: 0.1897 - val_accuracy: 0.8790 - val_loss: 0.3138
Epoch 11/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 56s 90ms/step - accuracy: 0.9329 - loss: 0.1771 - val_accuracy: 0.8796 - val_loss: 0.3182
Epoch 12/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 56s 89ms/step - accuracy: 0.9403 - loss: 0.1574 - val_accuracy: 0.8724 - val_loss: 0.3522
Epoch 13/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 54s 87ms/step - accuracy: 0.9480 - loss: 0.1391 - val_accuracy: 0.8714 - val_loss: 0.3570
Epoch 14/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 53s 85ms/step - accuracy: 0.9533 - loss: 0.1246 - val_accuracy: 0.8712 - val_loss: 0.3689
Epoch 15/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 51s 81ms/step - accuracy: 0.9620 - loss: 0.1061 - val_accuracy: 0.8710 - val_loss: 0.4077
Epoch 16/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 49s 78ms/step - accuracy: 0.9682 - loss: 0.0916 - val_accuracy: 0.8680 - val_loss: 0.4360
Epoch 17/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 46s 74ms/step - accuracy: 0.9708 - loss: 0.0825 - val_accuracy: 0.8672 - val_loss: 0.4423
Epoch 18/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 44s 70ms/step - accuracy: 0.9753 - loss: 0.0711 - val_accuracy: 0.8662 - val_loss: 0.4465
Epoch 19/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 41s 66ms/step - accuracy: 0.9781 - loss: 0.0634 - val_accuracy: 0.8608 - val_loss: 0.5086
Epoch 20/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 40s 65ms/step - accuracy: 0.9798 - loss: 0.0599 - val_accuracy: 0.8544 - val_loss: 0.5366

/usr/local/lib/python3.10/dist-packages/keras/src/layers/layer.py:372: UserWarning: `build()` was called on layer 'transformer_encoder', however the layer does not have a `build()` method implemented and it looks like it has unbuilt state. This will cause the layer to be marked as built, despite not being actually built, which may cause failures down the line. Make sure to implement a proper `build()` method.
warnings.warn(

782/782 ━━━━━━━━━━━━━━━━━━━━ 9s 9ms/step - accuracy: 0.8660 - loss: 0.3186
Test acc: 0.867

Self-attention is a set-processing mechanism, focused on the relationships between pairs of sequence elements - it's blind to whether these elements occur at the beginning, at the end, or in the middle of a sequence.

Features of Different Types of NLP Models

Using Positional Encoding to Re-inject Order Information

The idea behind positional encoding is very simple: to give the model access to word-order information, we're going to add each word's position in the sentence to its word embedding. The input word embeddings will thus have two components: the usual word vector, which represents the word independently of any specific context, and a position vector, which represents the position of the word in the current sentence.

"""
Implementing positional embedding as a subclassed layer
"""

class PositionalEmbedding(layers.Layer):
  """
  A downside of position embeddings is that the sequence length
  needs to be known in advance
  """
  def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
    super().__init__(**kwargs)
    # Prepare an Embedding layer for the token indices
    self.token_embeddings = layers.Embedding(input_dim=input_dim, output_dim=output_dim)
    # Add another one for the token positions
    self.position_embeddings = layers.Embedding(
        input_dim=sequence_length, output_dim=output_dim)
    self.sequence_length = sequence_length
    self.input_dim = input_dim
    self.output_dim = output_dim
  def call(self, inputs):
    length = tf.shape(inputs)[-1]
    positions = tf.range(start=0, limit=length, delta=1)
    embedded_tokens = self.token_embeddings(inputs)
    embedded_positions = self.position_embeddings(positions)
    # Add both embedding vectors together
    return embedded_tokens + embedded_positions
  def compute_mask(self, inputs, mask=None):
    """
    Like the Embedding Layer, this layer should be able to generate a
    mask so we can ignore padding 0s in the inputs. The compute_mask
    method will be called automatically by the framework, and the mask
    will get propagated to the next layer
    """
    return tf.math.not_equal(inputs, 0)
  def get_config(self):
    """
    Implement serialization so that we can save the model
    """
    config = super().get_config()
    config.update({
      "output_dim": self.output_dim,
      "sequence_length": self.sequence_length,
      "input_dim": self.input_dim,
    })
    return config
out[39]
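
As a quick sanity check (not from the chapter; it assumes tf is already imported and the PositionalEmbedding class above is defined), we can call the layer on a small padded batch to confirm the output shape and the padding mask it produces:

"""
Sketch: checking the PositionalEmbedding output shape and mask
"""
emb = PositionalEmbedding(sequence_length=600, input_dim=20000, output_dim=256)
dummy = tf.constant([[5, 28, 3, 0, 0]])  # one padded sequence of token indices
print(emb(dummy).shape)                  # (1, 5, 256)
print(emb.compute_mask(dummy))           # [[ True  True  True False False]]
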
"""
Combining the Transformer encoder with positional embedding

This model gives an 88.3% test accuracy, which demonstrates the value
of word order information for text classification, but it is still one
notch below the bag-of-words approach.
"""
vocab_size = 20000
sequence_length = 600
embed_dim = 256
num_heads = 2
dense_dim = 32
inputs = keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
 loss="binary_crossentropy",
 metrics=["accuracy"])
model.summary()
callbacks = [
 keras.callbacks.ModelCheckpoint("full_transformer_encoder.keras",
 save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=20,
callbacks=callbacks)
model = keras.models.load_model(
 "full_transformer_encoder.keras",
 custom_objects={"TransformerEncoder": TransformerEncoder,
 "PositionalEmbedding": PositionalEmbedding})
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")
out[40]

When to use sequence models over bag-of-words models

Bag-of-words is still a valid and relevant approach in many cases. When deciding between bag-of-words and a sequence model for a new text-classification task, pay close attention to the ratio between the number of samples in your training data and the mean number of words per sample. If that ratio is small (less than 1,500), a bag-of-bigrams model will perform better; if it is higher than 1,500, go with a sequence model. (Remember, this heuristic applies only to text classification.)
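
As a rough sketch of that heuristic (train_texts here is a hypothetical list of raw training strings, not a variable defined in this chapter):

"""
Sketch: the samples-to-words ratio heuristic for choosing a model family
"""
num_samples = len(train_texts)
mean_words_per_sample = sum(len(text.split()) for text in train_texts) / num_samples
ratio = num_samples / mean_words_per_sample
print("sequence model" if ratio > 1500 else "bag-of-bigrams model")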

Beyond text classification: Sequence to Sequence Learning

A sequence-to-sequence model takes a sequence as input (often a sentence or paragraph) and translates it into a different sequence. This task is at the heart of many of the most successful applications of NLP:

  • Machine translation: Convert a paragraph in a source language to its equivalent in a target language
  • Text summarization: Convert a long document to a shorter version that retains the most important information
  • Question answering: Convert an input question into its answer
  • Chatbots: Convert a dialogue prompt into a reply to this prompt, or convert the history of a conversation into the next reply in the conversation
  • Text generation: Convert a text prompt into a paragraph that completes the prompt

During training,

  • An encoder model turns the source sequence into an intermediate representation
  • A decoder is trained to predict token i in the target sequence by looking at both the previous target tokens (0 through i-1) and the encoded source sequence.

During inference,

  1. We obtain the encoded source sequence from the encoder.
  2. The decoder starts by looking at the encoded source sequence as well as an initial "seed" token (such as the string "[start]"), and uses them to predict the first real token in the target sequence.
  3. The predicted sequence so far is fed back into the decoder, which generates the next token, and so on, until it generates a stop token (such as the string "[end]").

Sequence-to-Sequence Learning

A Machine Translation Example

## Download the English-to-Spanish Dataset
!wget http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
!unzip -q spa-eng.zip
out[42]

--2024-09-09 02:08:25-- http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.130.207, 74.125.68.207, 64.233.170.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.130.207|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2638744 (2.5M) [application/zip]
Saving to: ‘spa-eng.zip’

spa-eng.zip 100%[===================>] 2.52M 1.79MB/s in 1.4s

2024-09-09 02:08:27 (1.79 MB/s) - ‘spa-eng.zip’ saved [2638744/2638744]

# Parsing the text files
text_file = "spa-eng/spa.txt"
with open(text_file) as f:
  lines = f.read().split("\n")[:-1]
text_pairs = []

for line in lines: # Iterate over the lines in the file
  # Each line contains an English phrase and its Spanish translation, tab-separated
  english, spanish = line.split("\t")
  # We prepend "[start]" and append "[end]" to the Spanish sentence
  spanish = "[start] " + spanish + " [end]"
  text_pairs.append((english, spanish))
out[43]
# What our text_pairs look like
import random
print(random.choice(text_pairs))
out[44]

('I wonder what has become of the friend I used to go fishing with.', '[start] Me pregunto qué se habrá hecho del amigo con el que solía ir de pesca. [end]')

"""
Shuffle the pairs and split them into training, validation, and test sets
"""
import random
random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]
out[45]
"""
Vectorizing the English and Spanish text pairs
"""
import tensorflow as tf
import string
import re

"""
Prepare a custom string standardization function for the Spanish
TextVectorization layer: it preserves "[" and "]" but strips the inverted
question mark "¿" (as well as all other characters from string.punctuation)
"""
strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")
def custom_standardization(input_string):
 lowercase = tf.strings.lower(input_string)
 return tf.strings.regex_replace(lowercase, f"[{re.escape(strip_chars)}]", "")
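
# Quick sanity check (hypothetical example string, not from the chapter):
# punctuation and the inverted question mark are stripped, while the bracket
# tokens survive. Expected output is roughly 'qué tal [start] hola [end]'.
print(custom_standardization(tf.constant("¿Qué tal? [start] Hola. [end]")))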

"""
To keep things simple, we will only look at the top 15,000 words in each
language, and we'll restrict sentences to 20 words
"""
vocab_size = 15000
sequence_length = 20
# The English layer
source_vectorization = layers.TextVectorization(
 max_tokens=vocab_size,
 output_mode="int",
 output_sequence_length=sequence_length,
)
# The Spanish Layer
target_vectorization = layers.TextVectorization(
 max_tokens=vocab_size,
 output_mode="int",
 # Generate Spanish sentences that have one extra token
 # Since we'll need to offset the sentence by one step during training
 output_sequence_length=sequence_length + 1,
 standardize=custom_standardization,
)
train_english_texts = [pair[0] for pair in train_pairs]
train_spanish_texts = [pair[1] for pair in train_pairs]
# Learn the vocabulary of each language
source_vectorization.adapt(train_english_texts)
target_vectorization.adapt(train_spanish_texts)
out[46]
"""
Preparing datasets for the translation task


"""
batch_size = 64
def format_dataset(eng, spa):
  eng = source_vectorization(eng)
  spa = target_vectorization(spa)
  return ({
  "english": eng,
  # The input Spanish doesn't include the last token to keep inputs and
  # targets at the same length
  "spanish": spa[:, :-1],
  # The target Spanish sentence is one step ahead.
  # Both are still the same length (20 words)
  }, spa[:, 1:])
def make_dataset(pairs):
  eng_texts, spa_texts = zip(*pairs)
  eng_texts = list(eng_texts)
  spa_texts = list(spa_texts)
  dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts))
  dataset = dataset.batch(batch_size)
  dataset = dataset.map(format_dataset, num_parallel_calls=4)
  # Use in-memory caching to speed up preprocessing
  return dataset.shuffle(2048).prefetch(16).cache()
train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

"""
What our dataset outputs look like
"""
for inputs, targets in train_ds.take(1):
  print(f"inputs['english'].shape: {inputs['english'].shape}")
  print(f"inputs['spanish'].shape: {inputs['spanish'].shape}")
  print(f"targets.shape: {targets.shape}")
out[47]

inputs['english'].shape: (64, 20)
inputs['spanish'].shape: (64, 20)
targets.shape: (64, 20)

Sequence-to-Sequence Learning with RNNs

Recurrent neural networks dominated sequence-to-sequence learning from 2015 to 2017 before being overtaken by the Transformer. It's still worth learning about this approach today, as it provides an easy entry point to understanding sequence-to-sequence models. The simplest, naive way to use RNNs to turn one sequence into another is to keep the output of the RNN at each time step. In Keras:

inputs = keras.Input(shape=(sequence_length,), dtype="int64")
x = layers.Embedding(input_dim=vocab_size, output_dim=128)(inputs)
x = layers.LSTM(32, return_sequences=True)(x)
outputs = layers.Dense(vocab_size, activation="softmax")(x)
model = keras.Model(inputs, outputs)

In a proper sequence-to-sequence setup, you would first use an RNN (the encoder) to turn the entire source sequence into a single vector (or set of vectors). Then you would use this vector (or vectors) as the initial state of another RNN (the decoder), which would look at elements 0...N in the target sequence and try to predict step N+1 in the target sequence.

Sequence-to-Sequence RNN

"""
GRU Based Encoder
"""
from tensorflow import keras
from tensorflow.keras import layers
embed_dim = 256
latent_dim = 1024
# The English source sentence goes here
source = keras.Input(shape=(None,), dtype="int64", name="english")
# Masking is critical in this setup
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)

encoded_source = layers.Bidirectional(
 layers.GRU(latent_dim), merge_mode="sum")(x)  # Our encoded source sentence is the last output of a bidirectional GRU
out[49]
"""
GRU-based decoder and the end-to-end model
"""
# The Spanish target sentence goes here
past_target = keras.Input(shape=(None,), dtype="int64", name="spanish")
# Don't forget masking
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
decoder_gru = layers.GRU(latent_dim, return_sequences=True)
# The encoded source sentence serves as the initial state of the decoder GRU
x = decoder_gru(x, initial_state=encoded_source)
x = layers.Dropout(0.5)(x)
# Predicts the next token
target_next_step = layers.Dense(vocab_size, activation="softmax")(x)
# End-to-end model: maps the source sentence and the target sentence
# to the target sentence one step in the future
seq2seq_rnn = keras.Model([source, past_target], target_next_step)
out[50]
"""
Training our recurrent sequence-to-sequence model
"""
seq2seq_rnn.compile(
 optimizer="rmsprop",
 loss="sparse_categorical_crossentropy",
 metrics=["accuracy"])
seq2seq_rnn.fit(train_ds, epochs=15, validation_data=val_ds)
out[51]

Epoch 1/15
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 126s 95ms/step - accuracy: 0.1496 - loss: 5.2624 - val_accuracy: 0.1586 - val_loss: 3.8709
Epoch 2/15
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 124s 95ms/step - accuracy: 0.1613 - loss: 3.8686 - val_accuracy: 0.1900 - val_loss: 3.2432
Epoch 3/15
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 124s 95ms/step - accuracy: 0.1868 - loss: 3.3079 - val_accuracy: 0.2083 - val_loss: 2.8817
Epoch 4/15
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 124s 95ms/step - accuracy: 0.2044 - loss: 2.9295 - val_accuracy: 0.2217 - val_loss: 2.6399
Epoch 5/15
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 124s 95ms/step - accuracy: 0.2181 - loss: 2.6410 - val_accuracy: 0.2333 - val_loss: 2.4437
Epoch 6/15
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 124s 95ms/step - accuracy: 0.2302 - loss: 2.4063 - val_accuracy: 0.2412 - val_loss: 2.3141
Epoch 7/15
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 124s 95ms/step - accuracy: 0.2403 - loss: 2.2126 - val_accuracy: 0.2475 - val_loss: 2.2146
Epoch 8/15
 567/1302 ━━━━━━━━━━━━━━━━━━━━ 1:05 89ms/step - accuracy: 0.2482 - loss: 2.0680
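
Once training finishes, we can translate new sentences with the inference loop described earlier. Below is a minimal sketch of greedy decoding (it reuses seq2seq_rnn, source_vectorization, and target_vectorization from above; the test sentence is just an arbitrary example):

"""
Sketch: greedy decoding with the trained recurrent sequence-to-sequence model
"""
import numpy as np
spa_vocab = target_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
  tokenized_input_sentence = source_vectorization([input_sentence])
  decoded_sentence = "[start]"
  for i in range(max_decoded_sentence_length):
    tokenized_target_sentence = target_vectorization([decoded_sentence])
    # Predict a distribution over the vocabulary for every target step
    next_token_predictions = seq2seq_rnn.predict(
      [tokenized_input_sentence, tokenized_target_sentence], verbose=0)
    # Greedily pick the most likely token at step i and append it
    sampled_token_index = np.argmax(next_token_predictions[0, i, :])
    sampled_token = spa_index_lookup[sampled_token_index]
    decoded_sentence += " " + sampled_token
    if sampled_token == "[end]":
      break
  return decoded_sentence

print(decode_sequence("It is very cold here in winter."))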

In real-world machine translation systems, you will most likely use BLEU scores to evaluate your models - a metric that looks at entire generated sequences and that seems to correlate well with human perception of translation quality.

The RNN approach to sequence-to-sequence learning has a few fundamental limitations:

  • The source sequence representation has to be held entirely in the encoder state vector(s), which puts significant limitations on the size and complexity of the sentences you can translate.
  • RNNs have trouble dealing with very long sequences, since they tend to progressively forget about the past - by the time you've reached the 100th token in either sequence, little information remains about the start of the sequence.

Sequence-to-Sequence Learning with the Transformer

Sequence-to-Sequence learning is the task where the Transformer really shines. Neural attention enables Transformer models to successfully process sequences that are considerably longer and more complex than those RNNs can handle.

The Transformer Decoder

The image below shows the full sequence-to-sequence Transformer. The Transformer decoder is similar to the Transformer encoder, except that an extra attention block is inserted between the self-attention block applied to the target sequence and the dense layers of the exit block.

Transformer Architecture

"""
The Transformer Decoder
"""
class TransformerDecoder(layers.Layer):
  def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
    super().__init__(**kwargs)
    self.embed_dim = embed_dim
    self.dense_dim = dense_dim
    self.num_heads = num_heads
    self.attention_1 = layers.MultiHeadAttention(
      num_heads=num_heads, key_dim=embed_dim)
    self.attention_2 = layers.MultiHeadAttention(
      num_heads=num_heads, key_dim=embed_dim)
    self.dense_proj = keras.Sequential(
      [layers.Dense(dense_dim, activation="relu"),
      layers.Dense(embed_dim),]
    )
    self.layernorm_1 = layers.LayerNormalization()
    self.layernorm_2 = layers.LayerNormalization()
    self.layernorm_3 = layers.LayerNormalization()
    # This attribute ensures that the layer will propagate its input mask
    # to its outputs; masking in Keras is explicitly opt-in.
    self.supports_masking = True
  def get_config(self):
    config = super().get_config()
    config.update({
      "embed_dim": self.embed_dim,
      "num_heads": self.num_heads,
      "dense_dim": self.dense_dim,
    })
    return config
out[53]

Causal masking is absolutely critical to successfully training a sequence-to-sequence Transformer. The TransformerDecoder is order-agnostic: it looks at the entire target sequence at once, so if it were allowed to use the whole sequence it could simply copy input step N+1 into output position N, reaching perfect training accuracy while being useless at inference time. The mask restricts the decoder's self-attention to past positions only; a sketch of this mask and of the decoder's call() method follows below.
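
The class listing above stops at get_config(). Below is a sketch of the two remaining pieces, following the standard pattern: a helper that builds the causal attention mask (a lower-triangular matrix, so position i can only attend to positions 0 through i) and the call() method that chains the causally masked self-attention, the attention over the encoder output, and the dense projection. They are written at module level and attached to the class so the sketch runs as-is; in a full listing they would simply be methods of TransformerDecoder.

"""
Sketch: the causal attention mask and call() method omitted from the listing above
"""
import tensorflow as tf

def get_causal_attention_mask(self, inputs):
  input_shape = tf.shape(inputs)
  batch_size, sequence_length = input_shape[0], input_shape[1]
  i = tf.range(sequence_length)[:, tf.newaxis]
  j = tf.range(sequence_length)
  # Lower-triangular matrix: position i may only attend to positions j <= i
  mask = tf.cast(i >= j, dtype="int32")
  mask = tf.reshape(mask, (1, sequence_length, sequence_length))
  # Replicate the mask along the batch axis
  mult = tf.concat(
    [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], axis=0)
  return tf.tile(mask, mult)

def call(self, inputs, encoder_outputs, mask=None):
  causal_mask = self.get_causal_attention_mask(inputs)
  if mask is not None:
    # Combine the target padding mask with the causal mask
    padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
    padding_mask = tf.minimum(padding_mask, causal_mask)
  else:
    padding_mask = causal_mask
  # Causally masked self-attention over the target sequence
  attention_output_1 = self.attention_1(
    query=inputs, value=inputs, key=inputs, attention_mask=causal_mask)
  attention_output_1 = self.layernorm_1(inputs + attention_output_1)
  # Attention over the encoder's representation of the source sequence
  attention_output_2 = self.attention_2(
    query=attention_output_1, value=encoder_outputs, key=encoder_outputs,
    attention_mask=padding_mask)
  attention_output_2 = self.layernorm_2(attention_output_1 + attention_output_2)
  proj_output = self.dense_proj(attention_output_2)
  return self.layernorm_3(attention_output_2 + proj_output)

# Attach the sketched methods to the class defined above
TransformerDecoder.get_causal_attention_mask = get_causal_attention_mask
TransformerDecoder.call = call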

Summary

  • Two kinds of NLP models: bag-of-words models that process sets of words or N-grams without taking their order into account, and sequence models that process word order.
  • Neural attention is a way to create context-aware word representations. It's the basis for the Transformer architecture, which yields excellent results on sequence-to-sequence tasks.