Hands On Machine Learning Chapter 13 - Loading and Preprocessing Data with TensorFlow
I am going to re-read Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow because I don't feel that I got a good grasp of machine learning the first time through, and I skipped the neural network chapters entirely on my first read.
Loading and Preprocessing Data with TensorFlow
Deep learning systems are often trained on very large datasets that will not fit in RAM. Ingesting a large dataset and preprocessing it efficiently can be tricky to implement with other deep learning libraries, but TensorFlow makes it easy thanks to the Data API: you just create a dataset object, tell it where to get the data, then transform it in any way you want, and TensorFlow takes care of all the implementation details, such as multithreading, queuing, batching, prefetching, and so on.
The Data API can read from text files, binary files with fixed-size records, and binary files that use TensorFlow's TFRecord format. It also has support for reading from SQL databases. TensorFlow also provides the Features API to help with preprocessing: it lets you easily convert the features in your data into numerical features that your neural network can consume. For example, categorical features with a large number of categories can be encoded using embeddings (an embedding is a trainable dense vector that represents a category). Both the Data API and the Features API work seamlessly with Keras.
This chapter looks at the Data API and the Features API in detail. It also looks at:
- TF Transform (tf.Transform) makes it possible to write a single preprocessing function that can be run in batch mode on your full training set before training, then exported to a TF Function and incorporated into your trained model, so that once the model is deployed in production it can take care of preprocessing new instances on the fly
- TF Datasets (TFDS) provides a convenient function to download many common datasets of all kinds, including ones like ImageNet, and it provides convenient dataset objects to manipulate them using the Data API
The Data API
The Data API revolves around the concept of a dataset: this represents a sequence of data items. Usually you will use datasets that gradually read data from disk.
# Create and return a dataset that will efficiently load the California housing
# data from multiple CSV files, then shuffle it, preprocess it, and batch it
def csv_reader_dataset(filepaths, repeat=None, n_readers=5,
                       n_read_threads=None, shuffle_buffer_size=10000,
                       n_parse_threads=5, batch_size=32):
    dataset = tf.data.Dataset.list_files(filepaths).repeat(repeat)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=n_readers, num_parallel_calls=n_read_threads)
    dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)
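The pipeline above relies on a preprocess() function that is not shown in these notes. A minimal sketch of what it could look like for the California housing CSV files, assuming eight feature columns followed by one target column, and precomputed X_mean and X_std statistics (both assumptions, computed from the training set beforehand):
# Sketch of the preprocess() function assumed by csv_reader_dataset()
# (assumes 8 feature columns + 1 target, and precomputed X_mean / X_std)
n_inputs = 8

def preprocess(line):
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    return (x - X_mean) / X_std, y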
With the prefetch(1) call in the code above, we create a dataset that will do its best to always be one batch ahead. In other words, while our training algorithm is working on one batch, the dataset will already be working in parallel on getting the next batch ready. This can improve performance dramatically. If we also ensure that loading and preprocessing are multithreaded, we can exploit multiple cores on the CPU and hopefully make preparing a batch of data faster than running a training step on the GPU: this way the GPU will be almost 100% utilized, and training will be much faster.
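If you would rather let TensorFlow pick the number of threads and the size of the prefetch buffer dynamically, you can pass the AUTOTUNE setting instead of fixed values; a minimal sketch applied to the map and prefetch calls from the pipeline above:
# Sketch: let TensorFlow tune parallelism and prefetching itself
dataset = dataset.map(preprocess,
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)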
If you plan to purchase a GPU card, its processing power and its memory size are of course very important (in particular, a large amount of RAM is crucial for computer vision), but its memory bandwidth is just as important as the processing power for good performance: this is the number of gigabytes of data it can get in or out of its RAM per second.
import tensorflow as tf

X = tf.range(10)  # any data tensor
dataset = tf.data.Dataset.from_tensor_slices(X)  # takes a tensor and creates a Dataset whose elements are slices of X along the first dimension
dataset
for item in dataset:
    print(item)

# Chaining Transformations
dataset = dataset.repeat(3).batch(7)
for item in dataset:
    print(item)
dataset = dataset.map(lambda x: x * 2)
The TFRecord Format
The TFRecord format is TensorFlow's preferred format for storing large amounts of data and reading it efficiently. It is a very simple binary format that just contains a sequence of binary records of varying sizes (each record has a length, a CRC checksum to check that the length was not corrupted, then the actual data, and finally a CRC checksum for the data). Creating a TFRecord file:
# TFRecord File
with tf.io.TFRecordWriter("my_data.tfrecord") as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")

# TFRecordDataset
filepaths = ["my_data.tfrecord"]
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
    print(item)
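TFRecord files can also be compressed, which is useful when they have to be fetched over a network connection. A minimal sketch using GZIP compression (the file name is illustrative); note that the compression type must be specified again when reading, since it is not stored in the file:
# Write and read a compressed TFRecord file
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as f:
    f.write(b"This is a compressed record")
dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"],
                                  compression_type="GZIP")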
A Brief Introduction to Protocol Buffers
Even though each record can use any binary format you want, TFRecord files usually contain serialized Protocol Buffers (also called protobufs). This is a portable, extensible, and efficient binary format developed at Google in 2001 and open sourced in 2008; protobufs are now widely used, particularly in gRPC, Google's remote procedure call system. Protocol buffers are defined using a simple language that looks like this:
syntax = "proto3";
message Person {
    string name = 1;
    int32 id = 2;
    repeated string email = 3;
}
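In practice, the protobufs you will most often store in TFRecord files are TensorFlow's own predefined ones, in particular the Example protobuf, which represents one instance in a dataset as a set of named features. A sketch of building an Example similar to the Person message above, writing it to a TFRecord file, and parsing it back (the feature values and file name are illustrative):
# Build an Example protobuf, write it to a TFRecord file, then parse it back
person_example = tf.train.Example(
    features=tf.train.Features(
        feature={
            "name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"Alice"])),
            "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[123])),
            "emails": tf.train.Feature(bytes_list=tf.train.BytesList(
                value=[b"a@b.com", b"c@d.com"])),
        }))

with tf.io.TFRecordWriter("my_contacts.tfrecord") as f:
    f.write(person_example.SerializeToString())

# Description of the features to extract from each serialized Example
feature_description = {
    "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "emails": tf.io.VarLenFeature(tf.string),
}
for serialized_example in tf.data.TFRecordDataset(["my_contacts.tfrecord"]):
    parsed_example = tf.io.parse_single_example(serialized_example,
                                                feature_description)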
The Features API
Whichever solution you prefer for preprocessing data, the Features API can help you: it is a set of functions available in the tf.feature_column package, which let you define how each feature (or group of features) in your data should be preprocessed (you can therefore think of this API as the analog of Scikit-Learn's ColumnTransformer class). Numeric columns let you specify a normalization function using the normalizer_fn argument.
For categorical features such as ocean_proximity, there are several options. If the feature is already represented as a category ID (i.e., an integer from 0 to the max ID), then you can use the categorical_column_with_identity() function (specifying the max ID). If not, and you know the list of all possible categories, then you can use categorical_column_with_vocabulary_list().
If you suspect two or more categorical features are more meaningful when used jointly, then you can create a crossed column. The crossed column will compute a hash of every combination of the categorical columns it comes across, modulo the hash_bucket_size, and use that hash as the crossed category ID. A common use case for crossed columns is to cross latitude and longitude into a single categorical feature (see the sketch after the column definitions below).
# Numeric Columns
age_mean, age_std = X_mean[1], X_std[1]  # the housing median age is column 1
housing_median_age = tf.feature_column.numeric_column("housing_median_age", normalizer_fn=lambda x: (x - age_mean) / age_std)
# Categorical Column
ocean_prox_vocab = ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
ocean_proximity = tf.feature_column.categorical_column_with_vocabulary_list("ocean_proximity", ocean_prox_vocab)
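A sketch of the latitude/longitude cross mentioned above: the raw numeric columns are first bucketized, then crossed (the bucket boundaries and hash_bucket_size here are illustrative choices, not values taken from the dataset):
# Sketch: cross bucketized latitude and longitude into one categorical feature
import numpy as np

latitude = tf.feature_column.numeric_column("latitude")
longitude = tf.feature_column.numeric_column("longitude")
bucketized_latitude = tf.feature_column.bucketized_column(
    latitude, boundaries=list(np.linspace(32., 42., 20 - 1)))
bucketized_longitude = tf.feature_column.bucketized_column(
    longitude, boundaries=list(np.linspace(-125., -114., 20 - 1)))
location = tf.feature_column.crossed_column(
    [bucketized_latitude, bucketized_longitude], hash_bucket_size=1000)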
Encoding Categorical Features Using One-Hot Vectors or Embeddings
No matter which option you choose to build a categorical feature, it must be encoded before you can feed it to a neural network. There are two options to encode a categorical feature: one-hot vectors or embeddings. For the first option, simply use the indicator_column() function:
ocean_proximity_one_hot = tf.feature_column.indicator_column(ocean_proximity)
A one-hot vector has one dimension per category in the vocabulary, which is fine if there are just a few possible categories, but if the vocabulary is large you will end up with too many inputs fed to your neural network: it will have too many weights to learn and it will probably not perform very well.
In this case, you should probably encode the categories using embeddings instead. (As a rule of thumb, if the number of categories is lower than 10, one-hot encoding is generally the way to go; if it is greater than 50, embeddings are preferable; in between, experiment with both options. Embeddings also typically require less training data.) An embedding is a trainable dense vector that represents a category. By default, embeddings are initialized randomly. Since these embeddings are trainable, they gradually improve during training, and because similar categories get similar gradients, Gradient Descent will tend to push the embeddings of similar categories closer together. The better the representation, the easier it is for the neural network to make accurate predictions, so training tends to make embeddings useful representations of the categories. This is called representation learning.
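To use embeddings, wrap the categorical column with embedding_column(); the resulting embedding matrix has one row per category and one column per embedding dimension. A sketch using the ocean_proximity column defined above (dimension=2 is just an illustrative choice):
# Sketch: represent each ocean_proximity category as a trainable 2D embedding
ocean_proximity_embed = tf.feature_column.embedding_column(ocean_proximity,
                                                           dimension=2)
# Feature columns can be fed to a Keras model through a DenseFeatures layer
feature_layer = tf.keras.layers.DenseFeatures([housing_median_age,
                                               ocean_proximity_embed])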
Word Embeddings
Not only will embeddings generally be useful representations for the task at hand, but quite often these same embeddings can be reused successfully for other tasks as well. The most common example of this is word embeddings (embeddings of individual words): when you are working on a natural language processing task, you are often better off reusing pre-trained word embeddings than training your own. In 2013, Google researchers published a paper describing how to learn word embeddings using neural networks, much faster than previous attempts. This allowed them to learn embeddings on a very large corpus of text: they trained a neural network to predict the words near any given word, and obtained astounding word embeddings. For example, if you take the embeddings for King, Man, and Woman and compute King - Man + Woman, the resulting vector ends up close to the embedding of Queen, which means that these word embeddings encode the concept of gender.
An embedding matrix has one row per category and one column per embedding dimension. Two related projects are worth knowing about: TensorFlow Extended (TFX) is an end-to-end platform for productionizing TensorFlow models, and the TensorFlow Datasets (TFDS) project makes it trivial to download common datasets, from small ones like MNIST or Fashion MNIST to huge ones like ImageNet, returning dataset objects you can manipulate with the Data API.
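A quick sketch of TFDS (this assumes the separate tensorflow_datasets package is installed, e.g. with pip install tensorflow-datasets):
# Sketch: download MNIST with TFDS and build a Data API pipeline from it
import tensorflow_datasets as tfds

datasets = tfds.load(name="mnist")  # dict with "train" and "test" dataset objects
mnist_train = datasets["train"].shuffle(10000).batch(32).prefetch(1)
for batch in mnist_train.take(1):
    images, labels = batch["image"], batch["label"]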