MTEB: Massive Text Embedding Benchmark
I need to do a lot with text embeddings, so I am going to read this paper, which establishes a benchmark for testing text embeddings.
Reference: arXiv paper
Introduction
Text embeddings are commonly evaluated on a small set of datasets from a single task, not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or re-ranking. This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation. To solve this problem, the authors introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages, making it the most comprehensive benchmark of text embeddings to date. The paper finds that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks.
Natural language embeddings power a variety of use cases: clustering and topic representation, search systems and text mining, and feature representations for downstream models. The Massive Text Embedding Benchmark (MTEB) aims to provide clarity on how models perform on a variety of embedding tasks and thus serves as the gateway to finding universal text embeddings applicable to a variety of tasks. The paper finds that there is no single solution for embeddings: different embeddings dominate different tasks.
Notes
MTEB is built on a set of desiderata:
- Diversity: MTEB aims to provide an understanding of the usability of embedding models in various use cases. The benchmark comprises 8 different tasks, with up to 15 datasets each.
- Simplicity: MTEB provides a simple API for plugging in any model that, given a list of texts, can produce a vector for each list item with a consistent shape (see the usage sketch after this list).
- Extensibility: New datasets for existing tasks can be benchmarked in MTEB via a single file that specifies the task and a Hugging Face dataset name where the data has been uploaded.
- Reproducibility: Through versioning at the dataset and software level, MTEB aims to make it easy to reproduce results.
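To make the Simplicity point concrete, here is a minimal sketch of the interface MTEB expects: any object with an `encode` method that turns a list of texts into one fixed-size vector per text. The wrapper class, the SentenceTransformer model name, and the task name are illustrative assumptions rather than code from the paper; the `MTEB(tasks=...)` / `run(...)` pattern follows the library's documented usage.

```python
import numpy as np
from mteb import MTEB
from sentence_transformers import SentenceTransformer


class MyModel:
    """Any object exposing encode(list_of_texts) -> one vector per text,
    with a consistent shape, can be plugged into MTEB."""

    def __init__(self, name: str = "all-MiniLM-L6-v2"):  # assumed model name
        self.model = SentenceTransformer(name)

    def encode(self, sentences, **kwargs):
        # Shape: (len(sentences), embedding_dim)
        return np.asarray(self.model.encode(sentences))


# Task name chosen purely as an illustrative example.
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(MyModel(), output_folder="results/my-model")
```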
Tasks and Evaluation
- Bitext mining: Inputs are two sets of sentences in two different languages. For each sentence in the first set, the best match in the second set needs to be found (see the bitext-mining sketch after this task list).
- Classification: A train and test set are embedded with the provided model. The train set embeddings are used to train a logistic regression classifier with 100 maximum iterations, which is scored on the test set (see the classification sketch after this task list).
- Clustering: Given a set of sentences or paragraphs, the goal is to group them into meaningful clusters.
- Pair classification: A pair of text inputs is provided and a label needs to be assigned.
- Reranking: Inputs are a query and a list of relevant and irrelevant reference texts. The aim is to rank the results according to their relevance to the query.
- Retrieval: Each dataset consists of a corpus, queries, and a mapping for each query to relevant documents from the corpus. The aim is to find these relevant documents.
- Semantic Textual Similarity (STS): Given a sentence pair, the aim is to determine their similarity (see the STS sketch after this task list).
- Summarization: A set of human-written and machine-generated summaries is provided; the aim is to score the machine summaries by comparing their embeddings to those of the human-written references.
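A minimal sketch of bitext mining as described above: embed both sentence sets and, for each sentence in the first set, pick the most similar sentence in the second set. Cosine similarity is assumed here as the matching criterion, and `embed` stands in for any embedding function.

```python
import numpy as np


def mine_bitext(embed, src_sentences, tgt_sentences):
    """For each source sentence, return the index of the closest
    target sentence by cosine similarity of the embeddings."""
    src = np.asarray(embed(src_sentences), dtype=float)
    tgt = np.asarray(embed(tgt_sentences), dtype=float)
    # Normalise rows so that a dot product equals cosine similarity.
    src /= np.linalg.norm(src, axis=1, keepdims=True)
    tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T            # (len(src), len(tgt)) similarity matrix
    return sims.argmax(axis=1)    # best match in the second set per source sentence
```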
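The classification protocol translates almost directly into code: embed the train and test splits, fit a logistic regression with at most 100 iterations on the train embeddings, and score it on the test embeddings. The `embed` callable and the use of plain accuracy as the reported score are assumptions of this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def evaluate_classification(embed, train_texts, train_labels, test_texts, test_labels):
    """Embed both splits, fit a logistic regression capped at 100
    iterations on the train embeddings, and score it on the test split."""
    X_train = np.asarray(embed(train_texts))
    X_test = np.asarray(embed(test_texts))
    clf = LogisticRegression(max_iter=100)   # 100 maximum iterations, as described
    clf.fit(X_train, train_labels)
    return accuracy_score(test_labels, clf.predict(X_test))
```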
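For STS, a common embedding-based evaluation (assumed in this sketch, not quoted from the paper) is to score each sentence pair by cosine similarity of the embeddings and report the Spearman correlation against the gold similarity annotations.

```python
import numpy as np
from scipy.stats import spearmanr


def evaluate_sts(embed, sentence_pairs, gold_scores):
    """Score each (sentence1, sentence2) pair by cosine similarity of the
    embeddings and report Spearman correlation with the gold scores."""
    s1 = np.asarray(embed([a for a, _ in sentence_pairs]), dtype=float)
    s2 = np.asarray(embed([b for _, b in sentence_pairs]), dtype=float)
    s1 /= np.linalg.norm(s1, axis=1, keepdims=True)
    s2 /= np.linalg.norm(s2, axis=1, keepdims=True)
    cosine = (s1 * s2).sum(axis=1)            # pairwise cosine similarities
    corr, _ = spearmanr(cosine, gold_scores)  # single number reported per dataset
    return corr
```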