Recommendation Engines

I wanted to read about recommendation engines for recommending content - the task of selecting which content to show to the user when there is a lot of content to choose from.


References

A recommender system (RecSys), or a recommendation system (sometimes replacing system with terms such as platform, engine, or algorithm), is a subclass of information filtering system that provides suggestions for items that are most pertinent to a particular user. Recommender systems are particularly useful when an individual needs to choose an item from a potentially overwhelming number of items that a service may offer.
— Wikipedia, Recommender system

Recommendation Systems

Background

Recommendations: What and Why?

ML-based recommendation models determine how similar videos and apps are to other things you like and then serve up recommendations on YouTube and the Google Play Store. The two kinds of recommendations that are commonly used are:

  1. Home page recommendations: personalized to a user based on their known interests. Every user sees different recommendations.
  2. Related item recommendations: recommendations similar to a particular item.

A recommendation system helps users find compelling content in a large corpus. A recommendation engine can display items that users might not have thought to search for on their own.

Terminology

  • Items (also known as documents)
    • The entities a system recommends. For YouTube, the items are videos
  • Query (also known as context)
    • The information a system uses to make recommendations. Queries can be a combination of the following:
      • user information
        • the id of the user
        • items that users previously interacted with
      • additional context
        • time of day
        • the user's device
  • Embedding
    • A mapping from a discrete set (in this case, the set of queries, or the set of items to recommend) to a vector space called the embedding space. Many recommendation systems rely on learning an appropriate embedding representation of the queries and items.

Recommendation Systems Overview

One common architecture for recommendation systems consists of the following components:

  • candidate generation
  • scoring
  • re-ranking

Candidate Generation, Scoring, Re-ranking

Candidate Generation

In the first stage, the system starts from a potentially huge corpus and generates a much smaller subset of candidates. For example, the candidate generator in YouTube reduces billions of videos down to hundreds of thousands. The model needs to evaluate queries quickly given the enormous size of the corpus. A given model may provide multiple candidate generators, each nominating a different subset of candidates.

Scoring

Next, another model scores and ranks the candidates in order to select the set of items (on the order of 10) to display to the user. Since the model evaluates a relatively small subset of items, the system can use a more precise model relying on additional queries.

Re-Ranking

Finally, the system must take into account additional constraints for the final ranking. For example, the system removes items that the user explicitly disliked or boosts the score of fresher content. Re-ranking can also help ensure diversity, freshness, and fairness.

Candidate Generation

Candidate Generation Overview

Candidate generation is the first stage of recommendation. Given a query, the system generates a set of relevant candidates. The following table shows two common candidate generation approaches:

  • Content-based filtering
    • Definition: Uses similarity between items to recommend items similar to what the user likes.
    • Example: If user A watches two cat videos, then the system can recommend cute animal videos to that user.
  • Collaborative filtering
    • Definition: Uses similarities between queries and items simultaneously to provide recommendations.
    • Example: If user A is similar to user B, and user B likes video 1, then the system can recommend video 1 to user A (even if user A hasn't seen videos similar to video 1).

Embedding Space

Both content-based and collaborative filtering map each item and each query (or context) to an embedding vector in a common embedding space $E = \mathbb{R}^d$. Typically, the embedding space is low-dimensional (that is, $d$ is much smaller than the size of the corpus), and captures some latent structure of the item or query set.

Similarity Measures

A similarity measure is a function $s : E \times E \rightarrow \mathbb{R}$ that takes a pair of embeddings and returns a scalar measuring their similarity. The embeddings can be used for candidate generation as follows: given a query embedding $q \in E$, the system looks for item embeddings $x \in E$ that are close to $q$, that is, embeddings with high similarity $s(q, x)$. To determine the degree of similarity, most recommendation systems rely on one or more of the following:

  • Cosine
    • The cosine of the angle between the two vectors: $s(q, x) = \cos(q, x)$.
  • Dot product
    • The dot product of two vectors: $s(q, x) = \langle q, x \rangle$. If the embeddings are normalized, then dot product and cosine coincide.
  • Euclidean distance
    • The usual distance in Euclidean space, $s(q, x) = \lVert q - x \rVert = \left[ \sum_{i=1}^d (q_i - x_i)^2 \right]^{1/2}$. A smaller distance means higher similarity. When the embeddings are normalized, the squared Euclidean distance coincides with dot product (and cosine) up to a constant.
Comparing Similarity Measures

In the image above, depending on the similarity measure used, the ranking of the items can be different.
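To make this concrete, here is a minimal NumPy sketch (with made-up 2-D embeddings, not the values from the original figure) that scores the same three items against one query under all three measures. Note how the large-norm item ranks first under dot product but not under cosine or Euclidean distance.

import numpy as np

# Hypothetical item embeddings in a 2-D embedding space.
items = np.array([
    [1.0, 1.0],   # item 0: moderate norm
    [3.0, 0.0],   # item 1: large norm
    [0.8, 0.6],   # item 2: unit norm
])
query = np.array([1.0, 0.5])

def dot(q, x):
    return q @ x

def cosine(q, x):
    return q @ x / (np.linalg.norm(q) * np.linalg.norm(x))

def neg_euclidean(q, x):
    # Negated so that, like the other measures, larger means "more similar".
    return -np.linalg.norm(q - x)

for name, sim in [("dot product", dot), ("cosine", cosine), ("euclidean", neg_euclidean)]:
    scores = np.array([sim(query, x) for x in items])
    print(name, "ranking (best first):", np.argsort(-scores))
# Dot product ranks item 1 first; cosine and Euclidean distance rank item 2 first.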

Which Similarity Measure?

Compared to the cosine, the dot product similarity is sensitive to the norm of the embedding. That is, the larger the norm of an embedding, the higher the similarity (for items with an acute angle) and the more likely the item is to be recommended. This can affect the recommendations:

  • Items that appear very frequently in the training set tend to have embeddings with large norms. If capturing popularity information is desirable, then you should prefer dot product. However, if you are not careful, the popular items may end up dominating the recommendations.
  • Items that appear very rarely may not be updated frequently during training. Consequently, if they are initialized with a large norm, the system may recommend rare items over more relevant items.

Content-Based Filtering

Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or feedback.

Advantages:

  • The model doesn't need any data about other users, since the recommendations are specific to this user. This makes it easier to scale to a large number of users.
  • The model can capture the specific interests of a user and can recommend niche items that very few other users are interested in.

Disadvantages:

  • Since the feature representation of the items is hand-engineered to some extent, this technique requires a lot of domain knowledge.
  • The model can only make recommendations based on the existing interests of the user. In other words, the model has limited ability to expand on the users' existing interests.
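As a rough illustration (with hypothetical, hand-engineered feature vectors), content-based filtering can be sketched as building a user profile from the features of liked items and scoring unseen items against it:

import numpy as np

# Hand-engineered item features (hypothetical): columns = [cats, dogs, gaming, music].
item_features = np.array([
    [1.0, 0.0, 0.0, 0.0],  # cat video 1
    [0.9, 0.1, 0.0, 0.0],  # cat video 2
    [0.2, 0.8, 0.0, 0.0],  # dog video
    [0.0, 0.0, 1.0, 0.0],  # gaming video
    [0.0, 0.0, 0.0, 1.0],  # music video
])

# The user liked items 0 and 1, so their profile is the mean of those feature vectors.
liked = [0, 1]
user_profile = item_features[liked].mean(axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Score the items the user has not seen yet by similarity to the profile.
candidates = [i for i in range(len(item_features)) if i not in liked]
scores = {i: cosine(user_profile, item_features[i]) for i in candidates}
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # the dog video ranks highest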

Collaborative Filtering and Matrix Factorization

To address some of the limitations of content-based filtering, collaborative filtering uses similarities between users and items simultaneously to provide recommendations. This allows for serendipitous recommendations; that is, collaborative filtering models can recommend an item to user A based on the interests of a similar user B.

In practice, the embeddings can be learned automatically, which is the power of collaborative filtering models.

Matrix Factorization

Matrix factorization is a simple embedding model. Given a feedback matrix $A \in \mathbb{R}^{m \times n}$, where $m$ is the number of users (or queries) and $n$ is the number of items, the model learns:

  • A user embedding matrix $U \in \mathbb{R}^{m \times d}$, where row $i$ is the embedding for user $i$.
  • An item embedding matrix $V \in \mathbb{R}^{n \times d}$, where row $j$ is the embedding for item $j$.

Matrix Factorization

The embeddings are learned such that the product $UV^T$ is a good approximation of the feedback matrix $A$. Observe that the $(i, j)$ entry of $UV^T$ is simply the dot product $\langle U_i, V_j \rangle$ of the embeddings of user $i$ and item $j$, which you want to be close to $A_{i,j}$.

Choosing the Objective Function

One intuitive objective function is the squared distance. To do this, minimize the sum of squared errors over all pairs of observed entries:

$$\min_{U \in \mathbb{R}^{m \times d},\ V \in \mathbb{R}^{n \times d}} \sum_{(i,j) \in \text{obs}} \left( A_{ij} - \langle U_i, V_j \rangle \right)^2$$

In this objective function, you only sum over observed pairs $(i, j)$, that is, over non-zero values in the feedback matrix. However, only summing over observed values is not a good idea: a matrix of all ones will have minimal loss and produce a model that can't make effective recommendations and that generalizes poorly.

Choosing an Objective Function

You could treat the unobserved values as zero and sum over all entries in the matrix. This corresponds to minimizing the squared Frobenius distance between $A$ and its approximation $UV^T$:

$$\min_{U \in \mathbb{R}^{m \times d},\ V \in \mathbb{R}^{n \times d}} \lVert A - UV^T \rVert_F^2$$

You can solve this quadratic problem through Singular Value Decomposition (SVD) of the matrix. However, SVD is not a great solution either, because in real applications the matrix $A$ may be very sparse. The solution $UV^T$ will likely be close to zero, leading to poor generalization performance.

In contrast, Weighted Matrix Factorization decomposes the objective into two sums:

  • A sum over observed entries
  • A sum over unobserved entries (treated as zeros)
$$\min_{U \in \mathbb{R}^{m \times d},\ V \in \mathbb{R}^{n \times d}} \sum_{(i,j) \in \text{obs}} \left( A_{ij} - \langle U_i, V_j \rangle \right)^2 + w_0 \sum_{(i,j) \notin \text{obs}} \langle U_i, V_j \rangle^2$$

Here, $w_0$ is a hyperparameter that weights the two terms so that the objective is not dominated by one or the other. Tuning this hyperparameter is very important.

Minimizing the Objective Function

Common algorithms to minimize the objective function include:

  • Stochastic Gradient Descent (SGD), a generic method for minimizing loss functions
  • Weighted Alternating Least Squares (WALS), which is specialized to this particular objective

The objective is quadratic in each of the two matrices $U$ and $V$. WALS works by initializing the embeddings randomly, then alternating between:

  • Fixing $U$ and solving for $V$
  • Fixing $V$ and solving for $U$

Each stage can be solved exactly and can be distributed.
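The following is a minimal NumPy sketch of weighted matrix factorization trained with WALS on a tiny, made-up feedback matrix (all values are illustrative, and a tiny ridge term is added for numerical stability). Each alternation solves a small weighted least-squares problem per row, and the weighted objective above decreases as training proceeds.

import numpy as np

rng = np.random.default_rng(1)

# Toy feedback matrix: 4 users x 5 items, NaN marks unobserved entries.
A = np.array([
    [1.0, np.nan, 1.0, np.nan, np.nan],
    [np.nan, 1.0, np.nan, np.nan, 1.0],
    [1.0, 1.0, np.nan, np.nan, np.nan],
    [np.nan, np.nan, np.nan, 1.0, np.nan],
])
observed = ~np.isnan(A)
A0 = np.where(observed, A, 0.0)      # unobserved entries treated as zeros
W = np.where(observed, 1.0, 0.1)     # per-entry weights: 1 for observed, w0 = 0.1 otherwise

d = 3
U = rng.normal(scale=0.1, size=(A.shape[0], d))   # user embeddings
V = rng.normal(scale=0.1, size=(A.shape[1], d))   # item embeddings

def solve_rows(X, Wx, F):
    # For each row of X, solve the weighted least-squares problem with the other factor F fixed.
    out = np.empty((X.shape[0], F.shape[1]))
    for i in range(X.shape[0]):
        Wi = np.diag(Wx[i])
        # Normal equations, plus a tiny ridge term so the system is never singular.
        out[i] = np.linalg.solve(F.T @ Wi @ F + 1e-6 * np.eye(F.shape[1]),
                                 F.T @ Wi @ X[i])
    return out

for step in range(10):
    U = solve_rows(A0, W, V)         # fix V, solve for U
    V = solve_rows(A0.T, W.T, U)     # fix U, solve for V
    loss = np.sum(W * (A0 - U @ V.T) ** 2)   # the weighted objective
    print(step, round(float(loss), 4))       # should decrease at each step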

Pros and Cons, SGD and WALS

Collaborative Filtering Advantages and Disadvantages
Advantages
  • No Domain Knowledge Necessary: We don't need domain knowledge because the embeddings are automatically learned.
  • Serendipity: The model can help users discover new interests.
  • Great Starting Point: To some extent, the system needs only the feedback matrix to train a matrix factorization model. In particular, the system doesn't need contextual features.
Disadvantages
  • Cannot handle fresh items: If an item is not seen during training, the system can't create an embedding for it and can't query the model with this item. This issue is often called the cold-start problem. There are some techniques that can address this problem to some extent.
  • Hard to include side features for query/item: Side features are any features beyond the query or item ID.

Recommendation Using Deep Neural Networks

Some limitations of matrix factorization include:

  • The difficulty of using side features. As a result, the model can only be queried with a user or item present in the training set.
  • Relevance of recommendations. Popular items tend to be recommended for everyone, especially when using dot product as the similarity measure. It is better to capture specific user interests.

Deep Neural Network (DNN) models can address these limitations of matrix factorization. DNNs can easily incorporate query features and item features (due to the flexibility of the input layer of the network), which can help capture the specific interests of a user and improve the relevance of recommendations.

One possible DNN model is softmax, which treats the problem as a multiclass prediction problem in which:

  • The input is the user query.
  • The output is a probability vector with size equal to the number of items in the corpus, representing the probability of interacting with each item.

The input to a DNN can include:

  • dense features (e.g., watch time and time since last watch)
  • sparse features (e.g., watch history and country)

Softmax Input Layer

The model architecture determines the complexity and expressivity of the model. By adding hidden layers and non-linear activation functions, the model can capture more complex relationships in the data. The output of the last hidden layer can be denoted $\psi(x)$.

The model maps the output of the last hidden layer, $\psi(x)$, through a softmax layer to a probability distribution $\hat{p} = h(\psi(x) V^T)$, where:

  • $h : \mathbb{R}^n \rightarrow \mathbb{R}^n$ is the softmax function, given by $h(y)_i = \frac{e^{y_i}}{\sum_j e^{y_j}}$
  • $V \in \mathbb{R}^{n \times d}$ is the matrix of weights of the softmax layer

The softmax layer maps a vector of scores $y \in \mathbb{R}^n$ (sometimes called the logits) to a probability distribution.

Finally, define a loss function that compares the following:

  • $\hat{p}$, the output of the softmax layer (a probability distribution)
  • $p$, the ground truth, representing the items the user has interacted with. This can be represented as a normalized multi-hot distribution.

The probability of item $j$ is given by $\hat{p}_j = \frac{\exp(\langle \psi(x), V_j \rangle)}{Z}$, where $Z$ is a normalization constant that does not depend on $j$.

In other words, $\log(\hat{p}_j) = \langle \psi(x), V_j \rangle - \log(Z)$, so the log probability of an item $j$ is (up to an additive constant) the dot product of two $d$-dimensional vectors, which can be interpreted as query and item embeddings:

  • $\psi(x) \in \mathbb{R}^d$ is the output of the last hidden layer. We call it the embedding of the query $x$.
  • $V_j \in \mathbb{R}^d$ is the vector of weights connecting the last hidden layer to output $j$. We call it the embedding of item $j$.
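To check this identity numerically, here is a tiny NumPy sketch (with random vectors standing in for $\psi(x)$ and $V$): the log-softmax probabilities differ from the raw dot products only by a constant.

import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5
psi_x = rng.normal(size=d)          # query embedding psi(x)
V = rng.normal(size=(n, d))         # softmax weights, i.e., the item embeddings

logits = V @ psi_x                               # <psi(x), V_j> for every item j
log_p = logits - np.log(np.sum(np.exp(logits)))  # log softmax
# log(p_j) - <psi(x), V_j> is the same constant (-log Z) for every j:
print(np.allclose(log_p - logits, log_p[0] - logits[0]))  # True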

In both the softmax model and the matrix factorization model, the system learns one embedding vector $V_j$ per item $j$. What we called the item embedding matrix $V \in \mathbb{R}^{n \times d}$ in matrix factorization is now the matrix of weights of the softmax layer. The query embeddings, however, are different. Instead of learning one embedding $U_i$ per query $i$, the system learns a mapping from the query features $x$ to an embedding $\psi(x) \in \mathbb{R}^d$. Therefore, you can think of this DNN model as a generalization of matrix factorization in which you replace the query side by a nonlinear function $\psi(\cdot)$.

The softmax training data consists of the query features $x$ and a vector of items the user interacted with (represented as a probability distribution $p$). These are marked in blue in the figure below. The variables of the model are the weights of the different layers. The model is typically trained using any variant of stochastic gradient descent.

Softmax Training
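The following is a minimal Keras sketch of such a softmax model, assuming a small corpus, two made-up query features (a last-watched item id and a watch-time value), and a single interacted item per training example. It is not the exact architecture from the course, just an illustration of the structure described above.

import numpy as np
import tensorflow as tf

n_items, d = 1000, 32   # corpus size and embedding dimension (illustrative values)

# Query features x: one sparse feature and one dense feature (both hypothetical).
history_input = tf.keras.Input(shape=(), dtype=tf.int32, name="last_watched_item")
watch_time_input = tf.keras.Input(shape=(1,), dtype=tf.float32, name="watch_time")

history_emb = tf.keras.layers.Embedding(n_items, d)(history_input)
hidden = tf.keras.layers.Concatenate()([history_emb, watch_time_input])
hidden = tf.keras.layers.Dense(64, activation="relu")(hidden)
psi = tf.keras.layers.Dense(d)(hidden)               # query embedding psi(x)

# The weights of this final layer play the role of the item embedding matrix V:
# logits_j = <psi(x), V_j>, and softmax turns the logits into a distribution over items.
logits = tf.keras.layers.Dense(n_items, use_bias=False)(psi)

model = tf.keras.Model([history_input, watch_time_input], logits)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Toy training data: each example pairs query features with the id of the interacted item.
rng = np.random.default_rng(0)
features = {
    "last_watched_item": rng.integers(0, n_items, size=512).astype(np.int32),
    "watch_time": rng.random((512, 1), dtype=np.float32),
}
labels = rng.integers(0, n_items, size=512)
model.fit(features, labels, batch_size=64, epochs=1)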

Matrix Factorization versus Softmax

Retrieval, Scoring, and Re-Ranking

Retrieval

How to decide which items to recommend? At serve time, given a query, you start by doing one of the following:

  • For a matrix factorization model, the query (or user) embedding is known statically, and the system can simply look it up from the user embedding matrix.
  • For a DNN model, the system computes the query embedding $\psi(x)$ at serve time by running the network on the feature vector $x$.

Once you have the query embedding $q$, search for item embeddings $V_j$ that are close to $q$ in the embedding space. This is a nearest neighbor problem.

To compute the nearest neighbors in the embedding space, the system can exhaustively score every potential candidate. Exhaustive scoring can be expensive for very large corpora, but you can use either of the following strategies to make it more efficient:

  • If the query embedding is known statically, the system can perform exhaustive scoring offline, precomputing and storing a list of the top candidates for each query. This is a common practice for related-item recommendation.
  • Use approximate nearest neighbors. Google provides an open-source tool on GitHub called ScaNN. This tool performs efficient vector similarity search at scale.
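As a baseline, exhaustive (brute-force) retrieval is just a matrix-vector product followed by a top-k selection; the sketch below uses random embeddings for illustration. Libraries such as ScaNN replace this exhaustive scan with approximate nearest neighbor search when the corpus is too large.

import numpy as np

rng = np.random.default_rng(0)
d, n_items, k = 64, 100_000, 10

item_embeddings = rng.normal(size=(n_items, d)).astype(np.float32)  # V: one row per item
query_embedding = rng.normal(size=d).astype(np.float32)             # q = psi(x), or a row of U

# Exhaustive scoring with dot-product similarity.
scores = item_embeddings @ query_embedding
top_k = np.argpartition(-scores, k)[:k]        # indices of the k best items, unordered
top_k = top_k[np.argsort(-scores[top_k])]      # sort those k by score, best first
print(top_k, scores[top_k])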

Scoring

After candidate generation, another model scores and ranks the generated candidates to select the set of items to display. The recommendation system may have multiple candidate generators that use different sources, such as:

  • Related items from a matrix factorization model.
  • User features that account for personalization.
  • "Local" vs "distant" items; that is, taking geographic information into account
  • Popular vs trending items
  • A social graph; that is, items liked or recommended by friends.

The system combines these different sources into a common pool of candidates that are then scored by a single model and ranked according to that score. You should avoid letting the candidate generator compute a score. The choice of scoring function can dramatically affect the ranking of items and ultimately the quality of recommendations.

Re-Ranking

In the final stage of a recommendation system, the system can re-rank the candidates to consider additional criteria or constraints. One re-ranking approach is to use filters that remove some candidates. Another re-ranking approach is to manually transform the score returned by the ranker. You should take into account the freshness, diversity, and fairness of your recommendation system.
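Both approaches can be sketched in a few lines; the candidate fields and the freshness bonus below are hypothetical, purely to illustrate filtering plus score transformation.

from dataclasses import dataclass

@dataclass
class Candidate:
    item_id: str
    score: float        # score from the ranking model
    age_days: float     # days since the item was published
    flagged: bool       # e.g., flagged by a separate quality classifier (hypothetical)

def rerank(candidates, freshness_boost=0.1, max_age_days=30):
    # Filter: drop candidates flagged by the quality classifier.
    kept = [c for c in candidates if not c.flagged]
    # Transform: boost fresher items with a simple age-dependent bonus.
    def final_score(c):
        return c.score + freshness_boost * max(0.0, 1.0 - c.age_days / max_age_days)
    return sorted(kept, key=final_score, reverse=True)

candidates = [
    Candidate("a", 0.90, age_days=200, flagged=False),
    Candidate("b", 0.88, age_days=2, flagged=False),
    Candidate("c", 0.95, age_days=1, flagged=True),
]
print([c.item_id for c in rerank(candidates)])  # ['b', 'a']: 'c' is filtered, 'b' is boosted past 'a'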

TensorFlow Recommenders

TensorFlow Recommenders (TFRS) is a library for building recommender system models. It helps with the full workflow of building a recommender system: data preparation, model formulation, training, evaluation, and deployment. It's built on Keras.

TFRS makes it possible to:

  • Build and evaluate flexible recommendation retrieval models
  • Freely incorporate item, user, and context information into recommendation models
  • Train multi-task models that jointly optimize multiple recommendation objectives

TFRS is open source and available on GitHub.

TensorFlow Recommenders: QuickStart

This tutorial builds a simple matrix factorization model with TFRS using the MovieLens 100K dataset.

!pip install -q tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets


from typing import Dict, Text
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs
# Ratings data.
ratings = tfds.load('movielens/100k-ratings', split="train")
# Features of all the available movies.
movies = tfds.load('movielens/100k-movies', split="train")
# Select the basic features.
ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"]
})
movies = movies.map(lambda x: x["movie_title"])
# Build vocabularies to convert user ids and movie titles into integer indices for embedding layers.
user_ids_vocabulary = tf.keras.layers.StringLookup(mask_token=None)
user_ids_vocabulary.adapt(ratings.map(lambda x: x["user_id"]))

movie_titles_vocabulary = tf.keras.layers.StringLookup(mask_token=None)
movie_titles_vocabulary.adapt(movies)

# Define a TFRS model by inheriting from tfrs.Model and implementing the compute_loss method.

class MovieLensModel(tfrs.Model):
  # We derive from a custom base class to help reduce boilerplate. Under the hood,
  # these are still plain Keras Models.

  def __init__(
      self,
      user_model: tf.keras.Model,
      movie_model: tf.keras.Model,
      task: tfrs.tasks.Retrieval):
    super().__init__()

    # Set up user and movie representations.
    self.user_model = user_model
    self.movie_model = movie_model

    # Set up a retrieval task.
    self.task = task

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    # Define how the loss is computed.

    user_embeddings = self.user_model(features["user_id"])
    movie_embeddings = self.movie_model(features["movie_title"])

    return self.task(user_embeddings, movie_embeddings)

# Define the two models and the retrieval task.

# Define user and movie models.
user_model = tf.keras.Sequential([
    user_ids_vocabulary,
    tf.keras.layers.Embedding(user_ids_vocabulary.vocab_size(), 64)
])
movie_model = tf.keras.Sequential([
    movie_titles_vocabulary,
    tf.keras.layers.Embedding(movie_titles_vocabulary.vocab_size(), 64)
])

# Define your objectives.
task = tfrs.tasks.Retrieval(metrics=tfrs.metrics.FactorizedTopK(
    movies.batch(128).map(movie_model)
  )
)

# Fit and evaluate the model.
# Create a retrieval model.
model = MovieLensModel(user_model, movie_model, task)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.5))

# Train for 3 epochs.
model.fit(ratings.batch(4096), epochs=3)

# Use brute-force search to set up retrieval using the trained representations.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index_from_dataset(
    movies.batch(100).map(lambda title: (title, model.movie_model(title))))

# Get some recommendations.
_, titles = index(np.array(["42"]))
print(f"Top 3 recommendations for user 42: {titles[0, :3]}")

TensorFlow Recommendation Systems Home

Recommendation systems increase user engagement within your app and elevate user experience by providing the most desirable content. Modern recommenders are complex systems that are often broken down into multiple stages to achieve low latency in production. Through the retrieval, ranking, and potentially post-ranking stages, irrelevant items are gradually filtered out from a large pool of candidates and a list of options that users are most likely to interact with are finally presented.

When you've finished training your models, deploy them into production to serve recommendations to end users. TensorFlow Serving productionizes your models for high-performance inference. It aims to maximize the throughput of machine learning models and can support large recommendation models that require distributed serving.

ScaNN is a library for vector similarity search at scale. It leverages state-of-the-art approximate nearest neighbor techniques, such as asymmetric hashing and anisotropic quantization, to accelerate retrieval of top candidates.

TensorFlow Ranking is a library for developing scalable, neural learning-to-rank (LTR) models. It provides additional functionality for ranking candidate items to maximize ranking utilities.

The TPUEmbedding layer API facilitates training and serving large embedding tables on Tensor Processing Units (TPUs).

TensorFlow Recommenders Addons is a community-contributed project that leverages dynamic embedding technology, which is particularly useful for online learning.

TensorFlow Lite provides an on-device recommendation solution that achieves low-latency and high-quality recommendations, while keeping all user data on mobile devices.

TensorFlow Federated is a framework for federated learning and other computations on decentralized data. Federated Reconstruction brings matrix factorization to the federated learning setting and better protects user privacy for recommendations.

TensorFlow Agents Bandits is a comprehensive library of bandit algorithms that can explore and exploit effectively in the recommendation engine setting.

TensorFlow GNN is a library that can efficiently facilitate item recommendations based on network structures and can be used in conjunction with retrieval and ranking models.

Other Resources
