Bi-encoders compress text into a fixed-size numerical vector such that semantically similar texts land near each other in vector space. Unlike BM25, which matches on shared words, embeddings capture meaning: “staying focused” and “maintaining attention” produce similar vectors even though they share no words.
The mechanism is that a transformer’s internal representations already encode semantic meaning. The model takes a text, passes it through transformer layers, then collapses the final hidden states into a single vector (typically 768 or 1536 dimensions) via pooling (mean, CLS token, etc.). Training uses contrastive learning: the model is shown pairs of texts that should be similar (query + relevant document) and pairs that should be dissimilar (query + irrelevant document), and learns to push similar pairs closer in vector space while pushing dissimilar pairs apart. At search time, the query is encoded into a vector independently from documents, and similarity is computed via cosine similarity or dot product.
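The pooling, similarity, and contrastive steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than any particular model's implementation: the hidden states are hand-written toy numbers, and the function names (`mean_pool`, `cosine`, `contrastive_loss`), the InfoNCE-style softmax formulation, and the `temperature` value are all assumptions made for the example.

```python
import math

def mean_pool(hidden_states, attention_mask):
    """Average the token vectors at non-padding positions into one vector."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    return [t / count for t in total]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss (an assumed formulation): negative log-softmax
    of the positive pair's similarity against all candidates."""
    sims = [cosine(query, positive)] + [cosine(query, n) for n in negatives]
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Toy "final hidden states": 3 tokens x 2 dimensions; the last token is padding.
hidden = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]
query_vec = mean_pool(hidden, mask)

relevant = [1.0, 1.0]     # points the same way as the query vector
irrelevant = [1.0, -1.0]  # orthogonal to the query vector

loss_good = contrastive_loss(query_vec, relevant, [irrelevant])
loss_bad = contrastive_loss(query_vec, irrelevant, [relevant])
```

The loss is small when the relevant text is already the nearest candidate (`loss_good`) and large when an irrelevant text is nearer (`loss_bad`); minimizing it is what pulls similar pairs together and pushes dissimilar pairs apart.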
That independence is the architectural trade-off. The advantage is offline pre-computation: every document vector can be computed and indexed ahead of time, so query-time work is a single encoder pass over the query plus a nearest-neighbor search over the stored vectors, with no model pass per document. The cost is that the model must compress all possible meanings of a document into one vector before it ever sees the query. Nuance gets lost (a document discussing both pros and cons of X gets a vector somewhere in between), negation is hard (“papers that do NOT use reinforcement learning” lands close to RL papers because the same terms appear), and the model can’t attend to the parts of a document that are relevant to a specific query. This is why cross-encoders exist: they process query and document together to overcome this bottleneck.
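The offline/online split might look like the following sketch. The document vectors are hand-written stand-ins for the output of a real encoder, and `DOC_INDEX`, `search`, and the document IDs are hypothetical names. The scan here is brute force, which is linear in the number of stored vectors; larger systems typically replace it with an approximate nearest-neighbor index.

```python
import heapq

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Offline step: document vectors are computed once and stored.
# These values are illustrative, not produced by a real embedding model.
DOC_INDEX = {
    "doc_focus": [0.9, 0.1, 0.0],
    "doc_cooking": [0.0, 0.2, 0.9],
    "doc_attention": [0.8, 0.3, 0.1],
}

def search(query_vector, index, k=2):
    """Online step: score the query against every cached vector, keep the top k."""
    scored = ((dot(query_vector, vec), doc_id) for doc_id, vec in index.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

# At query time only the query needs encoding; this vector is also a stand-in.
query_vector = [1.0, 0.2, 0.0]
top = search(query_vector, DOC_INDEX, k=2)
```

Nothing in `search` touches document text or runs a model, which is the point of the design: the expensive encoding happened before the query arrived, and so did the lossy compression into one vector per document.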
Embedding models serve as the dense-retrieval first stage, often combined with BM25 in Hybrid Search. They excel where vocabulary mismatch is the problem (user and document use different words for the same concept) but underperform BM25 for exact terminology matching. Leading models in early 2026: Cohere Embed v3, OpenAI text-embedding-3-large, and open-source options like BGE, E5-Mistral, and Nomic Embed. The trend is toward instruction-tuned embeddings where you prefix the query with a task description to steer the embedding. See Reimers and Gurevych (2019), “Sentence-BERT”, for the foundational paper.
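One common way to combine a BM25 ranking with a dense ranking is reciprocal rank fusion (RRF), sketched below. The text above does not say which fusion method is meant, so RRF is an assumption here, as are the constant `k=60` (a conventional default), the document IDs, and the helper names. The `with_instruction` helper shows the general shape of an instruction-prefixed query; the exact template is model-specific, and the one used here is only illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each document scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

def with_instruction(task, query):
    """Prefix a query with a task description (hypothetical template;
    check the model card of the embedding model actually in use)."""
    return f"Instruct: {task}\nQuery: {query}"

# Hypothetical outputs of the two first-stage retrievers.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
prompt = with_instruction("Retrieve passages that answer the question",
                          "how do I stay focused?")
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales. Documents found by both retrievers (`doc_a`, `doc_c`) rise above those found by only one.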