Embedding Models for Semantic Similarity

Core Principle

An embedding model (bi-encoder) compresses text into a fixed-size numerical vector such that semantically similar texts land near each other in vector space. Unlike BM25 - Best Matching 25 Ranking Function, which matches on shared words, embeddings capture meaning — “staying focused” and “maintaining attention” produce similar vectors even though they share no words.

How It Works (First Principles)

The key insight is that a transformer model’s internal representations already encode semantic meaning. An embedding model takes a piece of text, passes it through transformer layers, then collapses the final hidden states into a single vector (typically 768 or 1536 dimensions) via pooling (mean, CLS token, etc.).
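The pooling step can be sketched as follows. This is a minimal mean-pooling example with toy dimensions (real models typically use 768 or 1536), assuming the final-layer hidden states and attention mask are already available:

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average hidden states over non-padding tokens, then unit-normalize."""
    mask = attention_mask[:, None].astype(float)   # (seq_len, 1)
    summed = (hidden_states * mask).sum(axis=0)    # sum over real tokens only
    vec = summed / mask.sum()                      # mean over real tokens
    return vec / np.linalg.norm(vec)               # unit length, ready for cosine sim

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))     # pretend: 4 tokens, 8-dim hidden states
mask = np.array([1, 1, 1, 0])        # last token is padding
embedding = mean_pool(hidden, mask)  # single fixed-size vector
```

CLS pooling is the other common choice: instead of averaging, take only the hidden state of the first ([CLS]) token.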

Training uses contrastive learning: the model is shown pairs of texts that should be similar (query + relevant document) and pairs that should be dissimilar (query + irrelevant document), and learns to push similar pairs closer in vector space while pushing dissimilar pairs apart.
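A common form of this objective is the in-batch contrastive (InfoNCE) loss, sketched below under the assumption that query i's positive is document i and every other document in the batch serves as a negative; a real trainer would backpropagate through this:

```python
import numpy as np

def info_nce_loss(query_vecs: np.ndarray, doc_vecs: np.ndarray,
                  temperature: float = 0.05) -> float:
    """In-batch contrastive loss: the diagonal holds the correct pairings."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = (q @ d.T) / temperature                  # (batch, batch) cosine sims
    sims -= sims.max(axis=1, keepdims=True)         # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))      # cross-entropy on the diagonal

rng = np.random.default_rng(1)
q = rng.normal(size=(8, 16))
matched = info_nce_loss(q, q)           # every query paired with its own document
mismatched = info_nce_loss(q, q[::-1])  # positives shuffled to the wrong rows
```

The loss falls as each query vector moves toward its positive and away from the in-batch negatives, which is exactly the push/pull described above.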

At search time, the query is encoded into a vector independently from documents. Similarity is computed via cosine similarity or dot product. This independence is both the key advantage (you can pre-compute all document vectors offline) and the fundamental limitation (the query vector has no awareness of any specific document).
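The offline/online split can be sketched like this: document vectors are encoded and normalized ahead of time, so search reduces to one matrix-vector product (the vectors here are random stand-ins for encoder output):

```python
import numpy as np

# Offline: encode and normalize all documents once.
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(1000, 64))             # pretend: encoder output
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

def search(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5):
    """Return indices and cosine scores of the k nearest documents."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_vecs @ q                          # cosine sim via dot product
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Online: encode only the query, then one matmul against the stored vectors.
ids, scores = search(doc_vecs[42], doc_vecs)       # query identical to doc 42
```

At production scale the brute-force `argsort` is replaced by an approximate nearest-neighbor index (e.g. HNSW), but the offline/online structure is the same.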

The Independence Problem

Because queries and documents are encoded separately, the model must compress all possible meanings of a document into a single vector before it ever sees the query. This means:

  • Nuance gets lost — a document discussing both pros and cons of X gets a vector somewhere in between
  • Negation is hard — “papers that do NOT use reinforcement learning” produces a vector close to RL papers, because the topical terms dominate the embedding and the negation word barely shifts it
  • The model can’t attend to which parts of a document are relevant to this specific query

This is why Reranker - Cross-Encoder Rescoring exists — it processes query and document together to overcome this bottleneck.

Current Landscape (as of early 2026)

Leading models include Cohere Embed v3, OpenAI text-embedding-3-large, and open-source options like BGE, E5-Mistral, and Nomic Embed. The trend is toward instruction-tuned embeddings where you can prefix the query with a task description to steer the embedding.

Where It Fits

Embeddings serve as a dense retrieval first stage, often combined with BM25 in Hybrid Search - Combining Sparse and Dense Retrieval. They excel where vocabulary mismatch is the problem (user and document use different words for the same concept) but underperform BM25 for exact terminology matching.
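One standard way to combine the two rankings is Reciprocal Rank Fusion (RRF), sketched below; it merges ranked lists without having to reconcile BM25 and cosine score scales. The doc ids are hypothetical; k=60 is the constant from the original RRF paper (Cormack et al., 2009):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists by summing 1/(k + rank) for each document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d2", "d3"]    # sparse (lexical) ranking
dense_ranking = ["d1", "d3", "d4"]   # dense (embedding) ranking
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents ranked well by both retrievers float to the top, which is why hybrid search tolerates each retriever's individual blind spots.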

References

  • Reimers, N., & Gurevych, I. (2019). “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” EMNLP.
  • Karpukhin, V., et al. (2020). “Dense Passage Retrieval for Open-Domain Question Answering.” EMNLP.