What I've Learned Building Interactive Embedding Visualizations
I wanted to read a blog post about building interactive embedding visualizations, since it's something I want to do myself.
References
Notes
Note: When "I" is used in blockquotes / quotes in this note, it refers to the author of the article linked above.
After completing my most recent attempt, I believe I've come up with a solid process for building high-quality interactive embedding visualizations for a variety of different kinds of entity relationship data.
Embeddings are a way of representing entities as points in N-dimensional space. These entities can be things such as words, products, people, tweets - anything that can be related to something else. The idea is to pick coordinates for each entity such that similar/related entities are near each other and vice versa.
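The "similar entities end up near each other" idea can be shown with a tiny hand-picked example. A minimal sketch (the words and coordinates here are made up for illustration; real embeddings are learned, not hand-assigned):

```python
import numpy as np

# Hypothetical 2-D embedding for a few words (coordinates invented for
# illustration only): related animals cluster together, "car" sits apart
embedding = {
    "cat":   np.array([0.9, 0.8]),
    "dog":   np.array([1.0, 0.7]),
    "tiger": np.array([0.7, 0.95]),
    "car":   np.array([-0.9, -0.8]),
}

def nearest(word):
    """Return the other entity closest to `word` in the embedding space."""
    others = [w for w in embedding if w != word]
    return min(others, key=lambda w: np.linalg.norm(embedding[word] - embedding[w]))

print(nearest("cat"))  # -> "dog": the animals are far closer to "cat" than "car" is
```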
This blog post suggests somewhere between 10k-50k entities for creating embeddings. Entities are individual data values - one value per entity per instance. The foundation of the embedding-building process is a co-occurrence matrix built from the raw source data: a square matrix whose size equals the number of entities being embedded. Every time you find `entity_n` and `entity_m` in the same collection, you increment `cooc_matrix[n][m]`. For some types of entities, you may have additional data available that can be used to determine to what degree two entities are related. Since co-occurrence matrices are square, they grow quadratically with the number of entities being embedded (might be a good idea to use a sparse matrix in Python). Generating the co-occurrence matrix can be computationally intensive. Once you've built your co-occurrence matrix, you have all the data that you need to create the embedding.
PyMDE is a Python library implementing an algorithm called Minimum Distortion Embedding. It's the main workhorse of the embedding-generation process and very powerful + versatile. It can embed high-dimensional vectors or graphs natively. `embedding_dim` is probably the most important parameter in pre-processing and perhaps the most important parameter in the whole embedding process. This blog post suggests embedding into higher dimensions as an intermediary step and then using a different algorithm to project that embedding down to 2 dimensions.
This blog post recommends UMAP and t-SNE for projecting embeddings down to 2-D and 3-D.
Raw WebGL, or WebGL/WebGPU-powered libraries like Pixi.JS, are great choices for building these kinds of visualizations [i.e., visualizations for embeddings].