BM25
- tags
- Full Text Search
Acronym: Best Match 25 #
BM25 (Best Match 25), also known as Okapi BM25, is a ranking function used in information retrieval systems and search engines to estimate the relevance of documents to a given search query. It refines the traditional TF-IDF (Term Frequency-Inverse Document Frequency) model: each document receives a relevance score for the query, and documents are ranked by that score.

Key Components of BM25 #
Term Frequency (TF) #
TF refers to the number of times a particular term appears in a document. However, BM25 uses a modified term frequency that takes into account saturation effects to prevent overemphasizing heavily repeated terms.
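A minimal sketch of the saturating term frequency, with length normalization left out. The function name `saturated_tf` is made up for illustration, and `k1 = 1.2` is assumed as a commonly used default rather than a value taken from these notes.

```python
def saturated_tf(tf: float, k1: float = 1.2) -> float:
    """BM25-style term frequency without length normalization.

    Grows with tf but is bounded above by k1 + 1.
    """
    return tf * (k1 + 1) / (tf + k1)


# Each extra occurrence adds less than the previous one.
for tf in (1, 2, 5, 20, 100):
    print(tf, round(saturated_tf(tf), 3))
```

With this form the value can approach `k1 + 1` but never reach it, so a term repeated 100 times scores only slightly higher than one repeated 20 times.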
Inverse Document Frequency (IDF) #
IDF measures the importance of a term in the entire corpus. It assigns higher weights to terms that are rare in the corpus and lower weights to terms that are common. IDF is calculated using the formula: IDF = log((N - n + 0.5) / (n + 0.5)), where N is the total number of documents and n is the number of documents containing the term.
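The IDF formula can be tried directly. This is a small sketch using the natural logarithm; the corpus size and document counts are invented numbers for illustration.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """IDF = log((N - n + 0.5) / (n + 0.5)) with N and n as defined above."""
    return math.log((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


rare = idf(1000, 3)           # term found in 3 of 1000 documents
common = idf(1000, 400)       # term found in 400 of 1000 documents
very_common = idf(1000, 900)  # term found in 900 of 1000 documents
print(rare, common, very_common)
```

Note that this form turns negative once a term appears in more than half of the documents. Some implementations, Lucene among them, add 1 inside the logarithm to keep the weight positive.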
Document Length Normalization #
BM25 incorporates document length normalization to address the impact of document length on relevance scoring. Longer documents tend to have more occurrences of a term, leading to potential bias. Document length normalization counteracts this bias by scaling the term frequency according to the ratio of the document's length to the average document length in the corpus, with a parameter (usually called b) controlling how strongly the normalization is applied.
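A sketch of how length normalization enters the term-frequency part of the score. The helper names are hypothetical, and `k1 = 1.2` and `b = 0.75` are assumed as commonly used defaults.

```python
def length_norm(doc_len: float, avg_doc_len: float, b: float = 0.75) -> float:
    """Normalization factor: 1.0 for an average-length document,
    above 1.0 for longer documents, below 1.0 for shorter ones."""
    return 1 - b + b * doc_len / avg_doc_len


def normalized_tf(tf: float, doc_len: float, avg_doc_len: float,
                  k1: float = 1.2, b: float = 0.75) -> float:
    """Saturating term frequency with the length factor applied."""
    return tf * (k1 + 1) / (tf + k1 * length_norm(doc_len, avg_doc_len, b))


# Same raw term count, different document lengths.
short_doc = normalized_tf(tf=3, doc_len=50, avg_doc_len=100)
long_doc = normalized_tf(tf=3, doc_len=200, avg_doc_len=100)
print(short_doc, long_doc)
```

Setting `b = 0` switches the normalization off entirely, while `b = 1` applies it in full.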
Query Term Saturation #
BM25 also includes a term saturation function to mitigate the impact of excessively high term frequencies. Each additional occurrence of a term contributes less to the score than the one before it, so the contribution of a term levels off instead of growing without bound. The parameter k1 controls how quickly this happens.
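Putting the components above together, a toy scorer might look like the following. This is only a sketch under stated assumptions: whitespace tokenization, the IDF formula from these notes, `k1 = 1.2` and `b = 0.75` as assumed defaults, and a hypothetical function name `bm25_scores`. Real engines differ in tokenization and in IDF details.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for a whitespace-tokenized query."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        norm = 1 - b + b * len(tokens) / avgdl
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # term absent from this document
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores


docs = ["foo", "bar", "world", "hello", "foo bar"]
scores = bm25_scores("foo", docs)
print(scores)
```

On this tiny corpus the one-word document "foo" outranks "foo bar" because it is shorter than average, and documents without the query term score zero.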
Elasticsearch #
BM25 is the default similarity (ranking) algorithm in Elasticsearch.
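Elasticsearch exposes the BM25 parameters through an index's similarity settings. Below is a sketch of a settings body that spells out the default similarity, built as a Python dict and serialized to JSON. The values `k1 = 1.2` and `b = 0.75` are the documented defaults. How the body is sent (index name, client, endpoint) is left out because it depends on the setup.

```python
import json

# Index settings that make BM25 with explicit parameters the default similarity.
index_settings = {
    "settings": {
        "index": {
            "similarity": {
                "default": {"type": "BM25", "k1": 1.2, "b": 0.75}
            }
        }
    }
}

body = json.dumps(index_settings, indent=2)
print(body)
```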

TF-IDF vs BM25 #

OpenSearch #
OpenSearch uses a probabilistic ranking framework called Okapi BM25 to calculate relevance scores. Okapi BM25 is based on the original TF/IDF framework used by Apache Lucene.
#
# Requires the langchain-community and rank_bm25 packages.
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document

# Build an in-memory BM25 index over five tiny documents.
retriever = BM25Retriever.from_documents(
    [
        Document(page_content="foo"),
        Document(page_content="bar"),
        Document(page_content="world"),
        Document(page_content="hello"),
        Document(page_content="foo bar"),
    ]
)

# Retrieve the documents ranked most relevant to the query "foo".
result = retriever.invoke("foo")
# result
# [Document(page_content='foo', metadata={}),
# Document(page_content='foo bar', metadata={}),
# Document(page_content='hello', metadata={}),
# Document(page_content='world', metadata={})]
OCR of Images #
2024-02-27_12-11-14_screenshot.png #

$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$
2024-02-27_11-59-50_screenshot.png #

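$$\sum_{i}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{\text{fieldLen}}{\text{avgFieldLen}}\right)}$$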
2024-02-27_12-00-55_screenshot.png #

[Figure: tf() of TF/IDF compared with tf() of BM25, plotted against term frequency (x-axis: Frequency).]
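The contrast in this plot can be reproduced numerically. The sketch below assumes the TF/IDF curve is the square root of the frequency (as in Lucene's classic similarity) and that the BM25 curve ignores length normalization with `k1 = 1.2`. Both are assumptions, since the screenshot itself does not state them.

```python
import math


def tf_classic(freq: float) -> float:
    """Assumed classic TF/IDF term-frequency component: sqrt(freq)."""
    return math.sqrt(freq)


def tf_bm25(freq: float, k1: float = 1.2) -> float:
    """BM25 term-frequency component for an average-length document."""
    return freq * (k1 + 1) / (freq + k1)


# Same frequency range as the x-axis of the plot.
rows = [(freq, tf_classic(freq), tf_bm25(freq)) for freq in range(10, 100, 10)]
for freq, classic, bm25 in rows:
    print(f"{freq:>3}  classic={classic:.2f}  bm25={bm25:.2f}")
```

The classic component keeps climbing across the range, while the BM25 component is nearly flat, which is the saturation the figure illustrates.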