What I've Learned Building Interactive Embedding Visualizations

I read this blog post about building interactive embedding visualizations because it's something I want to do myself.

Date Created:
1 73

References



Notes


Note: When "I" is used in blockquotes / quotes in this note, it refers to the author of the article linked above.

After completing my most recent attempt, I believe I've come up with a solid process for building high-quality interactive embedding visualizations for a variety of different kinds of entity relationship data.

Embeddings are a way of representing entities as points in N-dimensional space. These entities can be things such as words, products, people, tweets - anything that can be related to something else. The idea is to pick coordinates for each entity such that similar/related entities end up near each other and unrelated entities end up far apart.

This blog post suggests somewhere between 10k and 50k entities for creating embeddings. An entity is an individual data value - one value per entity per instance. The foundation of the embedding-building process is creating a co-occurrence matrix out of the raw source data. This is a square matrix whose side length is equal to the number of entities you're embedding. The idea is that every time you find entity_n and entity_m in the same collection, you increment cooc_matrix[n][m]. For some types of entities, you may have additional data available that can be used to determine the degree to which two entities are related.

Since co-occurrence matrices are square, they grow quadratically with the number of entities being embedded (it might be a good idea to use a sparse matrix in Python). Generating the co-occurrence matrix can be computationally intensive, since it involves counting every pair of entities that appear together in each collection. Once you've built your co-occurrence matrix, you have all the data you need to create the embedding.
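The counting step above can be sketched with a SciPy sparse matrix. This is my own minimal illustration, not the article's code; the entity keys and helper names here are made up for the example.

```python
from itertools import combinations

import numpy as np
from scipy.sparse import dok_matrix


def build_cooc_matrix(collections, entity_ids):
    """Build a symmetric co-occurrence matrix.

    collections: iterable of collections, each an iterable of entity keys
    entity_ids: dict mapping entity key -> matrix index
    """
    n = len(entity_ids)
    # dok_matrix allows cheap incremental writes; convert to CSR when done.
    cooc = dok_matrix((n, n), dtype=np.float64)
    for collection in collections:
        # Count each unordered pair of distinct entities once per collection.
        for a, b in combinations(sorted(set(collection)), 2):
            i, j = entity_ids[a], entity_ids[b]
            cooc[i, j] += 1
            cooc[j, i] += 1
    return cooc.tocsr()


# Hypothetical toy data: two "collections" of entities.
collections = [["apple", "banana"], ["apple", "banana", "cherry"]]
ids = {"apple": 0, "banana": 1, "cherry": 2}
matrix = build_cooc_matrix(collections, ids)
```

Here `matrix[0, 1]` ends up as 2 because apple and banana co-occur in both collections, while `matrix[0, 2]` is 1.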

PyMDE is a Python library implementing an algorithm called Minimum Distortion Embedding. It's the main workhorse of the embedding-generation process and very powerful + versatile. It can embed high-dimensional vectors or graphs natively. embedding_dim is probably the most important parameter in pre-processing, and perhaps the most important parameter in the whole embedding process. This blog post suggests embedding into higher dimensions as an intermediate step and then using a different algorithm to project that embedding down to 2 dimensions.

This blog post recommends UMAP and t-SNE for projecting embeddings down to 2-D and 3-D.
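As a sketch of the projection step, here is t-SNE via scikit-learn (UMAP works similarly through the separate umap-learn package). The 8-dimensional random input stands in for an intermediate embedding; all sizes here are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for an intermediate embedding, e.g. 8-D output from PyMDE.
intermediate = np.random.default_rng(0).normal(size=(200, 8))

# Project down to 2-D for plotting; perplexity must be < n_samples.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
points_2d = tsne.fit_transform(intermediate)
```

The resulting `points_2d` array has one (x, y) pair per entity, ready to feed into a plotting or rendering layer.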

Raw WebGL, or WebGL/WebGPU-powered libraries like Pixi.JS, are great choices for building these kinds of visualizations [visualizations for embeddings].

