Relational Recurrent Neural Networks

I am reading this paper because it is one of the roughly 30 papers Ilya Sutskever recommended to John Carmack to learn what really matters for machine learning / AI today. The paper confirms the suspicion that standard memory architectures may struggle at tasks that involve understanding the ways in which entities are connected, and then addresses this weakness with a new memory module - a Relational Memory Core.

Reference Link to PDF of Paper


Memory-based neural networks model temporal data by leveraging an ability to remember information for long periods. It is unclear, however, whether they also have an ability to perform complex relational reasoning with the information they remember. The paper first confirms the intuition that standard memory architectures may struggle at tasks that heavily involve an understanding of the ways in which entities are connected - i.e., tasks involving relational reasoning. It then addresses these deficits with a new memory module - a Relational Memory Core (RMC) - which employs multi-head dot product attention to allow memories to interact, and shows improvements on RL tasks, program evaluation, and language modeling.

RNNs like LSTMs - bolstered by augmented memory capacities, bounded computational costs over time, and an ability to deal with vanishing gradients - learn to correlate events across time, which makes them proficient at storing and retrieving information. This paper proposes that it is also fruitful to consider memory interactions alongside storage and retrieval. Although current models can learn to compartmentalize and relate distributed, vectorized memories, they are not biased towards doing so explicitly. The authors hypothesize that such a bias may allow a model to better understand how memories are related, and hence give it a greater capacity for relational reasoning over time. The Relational Memory Core (RMC) provides this bias by using multi-head dot product attention to allow memories to interact with each other.

Relational Reasoning is the process of understanding the ways in which entities are connected and using this understanding to accomplish some higher order goal. Consider sorting the distances of various trees to a park bench: the relations (distances) between the entities (trees and bench) are compared and contrasted to produce the solution, which could not be reached if one reasoned about the properties (positions) of each individual entity in isolation.


Multi-head dot product attention (MHDPA), also known as self-attention, allows memories to interact. Using MHDPA, each memory attends over all of the other memories and updates its content based on the attended information.

  1. A simple linear projection is used to construct queries (Q = MW^q), keys (K = MW^k), and values (V = MW^v) for each memory (row m_i) in the matrix M.
  2. The queries Q are used to perform scaled dot-product attention over the keys K.
  3. The returned scalars are put through a softmax function to produce a set of weights, which are then used to return a weighted average of the values V as A(Q, K, V) = softmax(QK^T / √d^k) V, where d^k is the dimensionality of the key vectors and √d^k acts as a scaling factor. Equivalently:

A_θ(M) = softmax(MW^q (MW^k)^T / √d^k) MW^v, where θ = (W^q, W^k, W^v)
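
To make this concrete, here is a minimal NumPy sketch of a single attention pass over a memory matrix M, following the equation above. The function name attend_over_memories and the random toy weights are my own illustrative choices, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_over_memories(M, Wq, Wk, Wv):
    """Single-head dot-product attention over a memory matrix M (N x F).

    Wq, Wk, Wv are the linear projections theta = (W^q, W^k, W^v)
    from the equation above; names here are illustrative.
    """
    Q = M @ Wq                       # queries, N x d_k
    K = M @ Wk                       # keys,    N x d_k
    V = M @ Wv                       # values,  N x d_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaled dot products, N x N
    weights = softmax(scores, axis=-1)
    return weights @ V               # proposed update M~, N x d_v

# Toy usage: 4 memory slots with feature size 8, random stand-ins for learned weights
rng = np.random.default_rng(0)
N, F = 4, 8
M = rng.normal(size=(N, F))
Wq, Wk, Wv = (rng.normal(size=(F, F)) for _ in range(3))
M_tilde = attend_over_memories(M, Wq, Wk, Wv)
print(M_tilde.shape)  # (4, 8)
```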

The output of A_θ(M), which we will denote as M̃, is a matrix with the same dimensionality as M, and it can be interpreted as a proposed update to M, with each m̃_i comprising information from the memories m_j. Thus, in one step of attention each memory is updated with information originating from the other memories, and it is up to the model to learn (via the parameters W^q, W^k, and W^v) how to shuttle information from memory to memory.

As implied by the name, MHDPA uses multiple heads. We implement this by producing h sets of queries, keys, and values, using unique parameters to compute a linear projection from the original memory for each head. We then apply the attention operation independently for each head. For example, if M is an N × F dimensional matrix and we employ two attention heads, then we compute M̃^1 = A_θ(M) and M̃^2 = A_ϕ(M), where M̃^1 and M̃^2 are N × F/2 matrices, θ and ϕ denote unique parameters for the linear projections used to produce the queries, keys, and values, and M̃ = [M̃^1 : M̃^2], where : denotes column-wise concatenation. Intuitively, heads could be useful for letting a memory share different information, to different targets, using each head.
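
The multi-head case repeats the same computation per head on smaller projections (each of size F/h) and concatenates the results column-wise. The sketch below assumes that layout; multi_head_attention and the list-of-tuples parameter format are my own illustrative choices rather than the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(M, params):
    """Multi-head dot-product attention over memories M (N x F).

    `params` is a list of (Wq, Wk, Wv) tuples, one per head; each head
    projects to F / h dimensions so the column-wise concatenation of the
    per-head outputs has the same shape as M.
    """
    outputs = []
    for Wq, Wk, Wv in params:
        Q, K, V = M @ Wq, M @ Wk, M @ Wv
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        outputs.append(weights @ V)          # N x (F / h)
    return np.concatenate(outputs, axis=-1)  # M~ = [M~^1 : ... : M~^h], N x F

# Toy usage: 4 memories of size 8, two heads of size 4 each
rng = np.random.default_rng(0)
N, F, h = 4, 8, 2
M = rng.normal(size=(N, F))
params = [tuple(rng.normal(size=(F, F // h)) for _ in range(3)) for _ in range(h)]
print(multi_head_attention(M, params).shape)  # (4, 8)
```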
