Don't use cosine similarity carelessly
A blog post I saw on Hacker News that looks informative and interesting. (https://p.migdal.pl/blog/2025/01/dont-use-cosine-similarity)
References
Notes
King - man + woman is queen; but why?
word2vec is an algorithm that transforms words into vectors, so that words with similar meanings end up lying close to each other. Moreover, it allows us to use vector arithmetic to work with analogies, for example the famous: king - man + woman = queen
Merely looking at word co-occurrences, while ignoring all grammar and context, can provide us insight into the meaning of a word.
Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. The basic idea of distributional semantics can be summed up in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.
In other words, a word is characterized by the company it keeps
- John Rupert Firth
If we want to teach a computer word meaning using distributional semantics, the simplest, approximate approach is making it look only at word pairs. Let $p(w'|w)$ be the conditional probability that, given a word $w$, there is also a word $w'$ within a short distance (e.g., within no more than 2 words). Then we can claim that two words $w_a$ and $w_b$ are similar if
$p(w|w_a) = p(w|w_b)$
for every word $w$. In other words, if we have this equality, then no matter whether we start from $w_a$ or $w_b$, all other words occur with the same frequency. Even simple word counts compared by source can give interesting results.
Looking at co-occurrences can provide much more information.
Keeping track of $p(w'|w)$ even for a relatively small vocabulary can take up a lot of space, so instead of working with conditional probabilities, we use the pointwise mutual information (PMI):
$\mathrm{PMI}(a, b) = \log \frac{p(a, b)}{p(a)\,p(b)}$
Its direct interpretation is how much more likely it is to encounter the pair $(a, b)$ than if the words occurred at random. The logarithm makes it easier to work with words appearing at frequencies of different orders of magnitude. PMI can be approximated as a scalar product:
$\mathrm{PMI}(a, b) \approx \vec{v}_a \cdot \vec{v}_b$
where $\vec{v}_a$ and $\vec{v}_b$ are vectors, typically of 50-300 dimensions (words can be compressed to a much smaller dimensionality). The fact that the compression is lossy may give it an advantage, as it can discover patterns rather than only memorize each pair.
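To make the definition concrete, here is a minimal sketch (not the post's code) that counts co-occurrences within a small window over a toy corpus and computes PMI from raw counts; the corpus, window size, and lack of smoothing are illustrative assumptions.

```python
import math
from collections import Counter

# Toy corpus and a 2-word co-occurrence window; both are illustrative choices.
corpus = [
    "the king ruled the kingdom",
    "the queen ruled the kingdom",
    "the man walked the dog",
    "the woman walked the dog",
]
window = 2

word_counts = Counter()
pair_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    word_counts.update(tokens)
    for i, w in enumerate(tokens):
        # Count every (center word, context word) pair within the window.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pair_counts[(w, tokens[j])] += 1

total_words = sum(word_counts.values())
total_pairs = sum(pair_counts.values())

def pmi(a: str, b: str) -> float:
    """PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ); -inf if the pair never occurs."""
    p_ab = pair_counts[(a, b)] / total_pairs
    if p_ab == 0:
        return float("-inf")
    p_a = word_counts[a] / total_words
    p_b = word_counts[b] / total_words
    return math.log(p_ab / (p_a * p_b))

print(pmi("king", "ruled"))  # positive: the pair occurs more often than chance
print(pmi("king", "dog"))    # -inf here: never co-occurs in this tiny corpus
```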
The vectors of synonyms, antonyms, and other easily interchangeable words (yellow and blue) are close to each other in vector space. In particular, most opposing ideas will have similar contexts.
Words form a linear space. A zero vector represents a totally uncharacteristic word, occurring with every other word at the random chance level (its scalar product with every word vector is zero, and so is the corresponding PMI). This is one of the reasons why for vector similarity people often use cosine similarity:
$\cos(\vec{v}_a, \vec{v}_b) = \frac{\vec{v}_a \cdot \vec{v}_b}{|\vec{v}_a|\,|\vec{v}_b|}$
That is, it puts the emphasis on the direction in which a given word co-occurs with other words, rather than on the strength of this effect.
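As a quick illustration (with made-up toy vectors, not real word2vec embeddings), cosine similarity in code is just a normalized dot product, and it ignores vector magnitude entirely:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(a, b) = (a . b) / (|a| |b|): compares direction only."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional vectors purely for illustration.
v_yellow = np.array([0.9, 0.1, 0.3])
v_blue = np.array([0.8, 0.2, 0.35])

print(cosine_similarity(v_yellow, v_blue))         # close to 1: similar direction
print(cosine_similarity(v_yellow, 10 * v_yellow))  # exactly 1: scaling changes nothing
```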
- Google Ngram Viewer allows you to compare pairs of words
The projection:
$\vec{v}_w \cdot (\vec{v}_a - \vec{v}_b)$
is exactly a relative occurrence of a word within different contexts (a short derivation follows below). When we want to look at common aspects of a word, it is more natural to average two vectors rather than take their sum.
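Under the PMI approximation above, this projection reduces to a log ratio of conditional probabilities; a short derivation in generic symbols ($w$, $a$, $b$ are my notation, not necessarily the post's):

```latex
\vec{v}_w \cdot (\vec{v}_a - \vec{v}_b)
  \approx \mathrm{PMI}(w, a) - \mathrm{PMI}(w, b)
  = \log\frac{p(w, a)}{p(w)\,p(a)} - \log\frac{p(w, b)}{p(w)\,p(b)}
  = \log\frac{p(w \mid a)}{p(w \mid b)}
```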
Don't use cosine similarity carelessly
Blindly applying cosine similarity to vectors can lead us astray. While embeddings do capture similarities, they often reflect the wrong kind - matching questions to questions rather than questions to answers, or getting distracted by superficial patterns like writing styles and typos rather than meaning.
Vectors can be used to chart entities and the relationships between them - both as structured input to a machine learning algorithm and on their own, to find similar items. Recent research shows that embeddings from large language models are almost as revealing as the original text - Text Embeddings Reveal (Almost) As Much As Text (2023).
Cosine similarity has some interesting properties:
- Identical vectors score a perfect 1
- Random vectors hover around 0
- The result is between -1 and 1
Just because the values usually fall between 0 and 1 doesn't mean they represent probabilities or any other meaningful metric. A value of 0.6 tells us little about whether two items are really similar or not. And while negative values are possible, they rarely indicate semantic opposites - more often, the opposite of something is gibberish.
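A toy demonstration of these properties, using random Gaussian vectors (the dimension 512 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v = rng.normal(size=512)
print(cos(v, v))                                        # identical vectors: exactly 1.0
print(cos(rng.normal(size=512), rng.normal(size=512)))  # independent random vectors: near 0
print(cos(v, -v))                                       # flipped vector: -1, the lower bound
```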
Cosine similarity is the duct tape of vector comparisons. It sticks everything together - images, text, audio, code - but like real duct tape, it's a quick fix that often masks deeper problems rather than solving them. You shouldn't trust cosine similarity for all your vector comparison needs.
Pearson correlation can be seen as a sequence of three operations:
- Subtracting means to center the data
- Normalizing vectors to unit length
- Computing dot products between them
When we work with vectors that are both centered ($\sum_i x_i = 0$) and normalized ($\sum_i x_i^2 = 1$), Pearson correlation, cosine similarity, and dot product are the same. In practical cases, we don't want to center or normalize vectors during each pairwise comparison - we do it just once and then use the dot product. In any case, if you are fine using cosine similarity, you should be just as fine using Pearson correlation (and vice versa).
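A small sketch checking this equivalence numerically (random data, nothing task-specific):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=256)
y = rng.normal(size=256)

def center(v: np.ndarray) -> np.ndarray:
    return v - v.mean()

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# Pearson correlation as: subtract means, normalize to unit length, dot product.
manual = normalize(center(x)) @ normalize(center(y))
print(manual, np.corrcoef(x, y)[0, 1])  # these two values agree

# Once vectors are centered and normalized, cosine similarity is just the dot product.
xc, yc = normalize(center(x)), normalize(center(y))
print(xc @ yc, xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))
```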
The trouble with cosine similarity begins when we venture beyond its comfort zone, specifically when:
- The cost function used in model training isn't cosine similarity
- The training objective differs from what we care about
The normalization gives us some nice mathematical properties (keeping results between -1 and +1, regardless of dimensions), but it's ultimately a hack. Sometimes it helps, sometimes it doesn't.
We are safe only if the model itself uses cosine similarity or a direct function of it.
The best approach to finding similarity between documents is to use LLMs directly to compare two entries. Start with a powerful model of your choice, then write something along the lines of: 'Is {sentence_a} a plausible answer to {sentence_b}?'. We typically want our answers in structured output - what the field calls tools or function calls. Since many models love Markdown, the author's template looks like:
{question}
## A
{sentence_a}
## B
{sentence_b}
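A minimal sketch of filling this template in code; `call_llm` is a hypothetical placeholder for whatever chat-completion client you use, not a real library function:

```python
# Hypothetical helper standing in for your model provider's client.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your chat model of choice")

TEMPLATE = """{question}

## A

{sentence_a}

## B

{sentence_b}
"""

def is_plausible_answer(sentence_a: str, sentence_b: str) -> str:
    prompt = TEMPLATE.format(
        question="Is A a plausible answer to B? Reply with just 'yes' or 'no'.",
        sentence_a=sentence_a,
        sentence_b=sentence_b,
    )
    return call_llm(prompt)
```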
In most cases, though, this approach is impractical: running such a costly operation for each query would be cost-prohibitive.
Going back to embeddings, instead of trusting a black box, we can directly optimize for what we actually care about by creating task-specific embeddings. The two main approaches:
- Fine-tuning - teaching an old model new tricks by adjusting its weights
- Transfer learning - using the model's knowledge to create new, focused embeddings
Asking 'Is A similar to B?', we can write that as:
$\mathrm{sim}(A, B) = \vec{v}_A \cdot \vec{v}_B$
where $\vec{v}_A = M \vec{u}_A$, $\vec{v}_B = M \vec{u}_B$, and $M$ is a matrix that reduces the embedding space to the dimensions we actually care about. Think of it as keeping only the features relevant to our specific similarity definition.
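A sketch of what this looks like in code; here $M$ is a random matrix and the dimensions are arbitrary assumptions, whereas in practice $M$ would be trained on pairs labeled with the similarity we actually care about:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out = 768, 64  # assumed dimensions, not from the post

# In practice M is learned; a random matrix here just shows the mechanics.
M = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)

def task_similarity(u_a: np.ndarray, u_b: np.ndarray) -> float:
    """Project both embeddings with the same matrix, then take a dot product."""
    v_a, v_b = M @ u_a, M @ u_b
    return float(v_a @ v_b)

u_a, u_b = rng.normal(size=d_in), rng.normal(size=d_in)
print(task_similarity(u_a, u_b))
```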
For the question 'Is document B a correct answer to question A?', the relevant probability is:
$P(B \text{ answers } A) = \sigma(\vec{v}_A \cdot \vec{v}_B)$
where $\vec{v}_A = M_Q \vec{u}_A$, $\vec{v}_B = M_D \vec{u}_B$, and $\sigma$ is the logistic (sigmoid) function. The matrices $M_Q$ and $M_D$ transform our embeddings into specialized spaces for queries and documents. It's like having two different languages and learning to translate between them, rather than assuming they're the same thing. This approach works beautifully for RAG too, as we care not about similar documents but about relevant ones.
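And a sketch of the asymmetric variant with two matrices; again, the random matrices and dimensions are placeholders for projections that would be trained on (query, relevant document) pairs:

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_out = 768, 64  # assumed dimensions

# Separate projections for queries and documents (random placeholders here).
M_Q = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
M_D = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)

def p_answers(u_query: np.ndarray, u_doc: np.ndarray) -> float:
    """Sigmoid of the dot product in the projected spaces, read as a probability."""
    v_q, v_d = M_Q @ u_query, M_D @ u_doc
    return float(1.0 / (1.0 + np.exp(-(v_q @ v_d))))

print(p_answers(rng.normal(size=d_in), rng.normal(size=d_in)))
```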
Instead of having to train a model, one of the quickest fixes is to add a prompt to the text, so as to set the apparent context: 'Similar to {term}' (the exact prompt depends on the task).
This approach is useful, but it is not quite a silver bullet. Another approach is to preprocess the text before embedding it. Here is a suggested trick:
Rewrite the following text in standard English using Markdown. Focus on content, ignore style. Limit to 200 words
This approach works wonders. It helps avoid false matches based on superficial similarities like formatting quirks, typos, or unnecessary verbosity.
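A sketch of this preprocessing step; both `call_llm` and `embed` are hypothetical placeholders for your chat and embedding models:

```python
# Hypothetical placeholders; swap in your actual LLM and embedding clients.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your chat model")

def embed(text: str) -> list[float]:
    raise NotImplementedError("wire this to your embedding model")

REWRITE_PROMPT = (
    "Rewrite the following text in standard English using Markdown. "
    "Focus on content, ignore style. Limit to 200 words.\n\n{text}"
)

def robust_embedding(text: str) -> list[float]:
    """Normalize away style, typos, and verbosity before embedding."""
    cleaned = call_llm(REWRITE_PROMPT.format(text=text))
    return embed(cleaned)
```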