Semantic Search

October 25, 2024 | seedling, permanent

Semantic Search #

Semantic search denotes search with meaning, as distinguished from lexical search, where the search engine looks for literal matches of the query words (or variants of them) without understanding the overall meaning of the query (Wikipedia).

Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also find documents that use synonyms of the query terms (ref).
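To make the limitation of lexical matching concrete, here is a minimal sketch. The Jaccard token overlap used as the "lexical score" and the example strings are assumptions chosen for illustration; real lexical engines typically rank with something like BM25, but the failure mode on synonyms is the same.

```python
import re

def tokens(text):
    """Lowercase word tokens; punctuation is dropped."""
    return set(re.findall(r"\w+", text.lower()))

def lexical_score(query, doc):
    """Jaccard overlap of query and document tokens (a crude lexical matcher)."""
    q, d = tokens(query), tokens(doc)
    union = q | d
    return len(q & d) / len(union) if union else 0.0

# Hypothetical query and documents, not taken from any real corpus.
query = "car repair"
docs = ["car repair shop", "automobile maintenance"]
scores = {doc: lexical_score(query, doc) for doc in docs}

for doc, score in scores.items():
    print(f"{score:.2f}  {doc}")
```

The second document is about the same topic but shares no words with the query, so a purely lexical score cannot surface it.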

How does it work? #

The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.
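The retrieval step described above can be sketched as follows. The three-dimensional vectors are invented by hand for illustration; in a real setup they would come from an embedding model, and the function names `cosine_similarity` and `top_k` are hypothetical helpers, not part of any library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k(query_vec, corpus_vecs, k=2):
    """Return the k (score, text) pairs closest to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), text)
              for text, vec in corpus_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Assumption: hand-made embeddings standing in for a model's output.
corpus_vecs = {
    "How to learn Python on the web?": [0.8, 0.3, 0.1],
    "Best pasta recipes": [0.1, 0.2, 0.9],
    "Python tutorials for beginners": [0.7, 0.4, 0.2],
}
query_vec = [0.9, 0.3, 0.1]  # pretend embedding of "How to learn Python online?"

results = top_k(query_vec, corpus_vecs, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A brute-force scan like this is fine for small corpora; larger ones usually rely on an approximate nearest-neighbour index instead.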

Symmetric vs. Asymmetric Semantic Search #

A critical distinction for your setup is symmetric vs. asymmetric semantic search:

Symmetric semantic search #

For symmetric semantic search, your query and the entries in your corpus are of about the same length and have the same amount of content. An example would be searching for similar questions: your query could be “How to learn Python online?” and you want to find an entry like “How to learn Python on the web?”. For symmetric tasks, you could potentially flip the query and the entries in your corpus.

symmetric models; colab for symmetric semantic search

Asymmetric semantic search #

For asymmetric semantic search, you usually have a short query (like a question or some keywords) and you want to find a longer paragraph answering the query. An example would be a query like “What is Python?” where you want to find the paragraph “Python is an interpreted, high-level and general-purpose programming language. Python’s design philosophy …”. For asymmetric tasks, flipping the query and the entries in your corpus usually does not make sense.

asymmetric models; colab for asymmetric semantic search

Models #

https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models

Vector Search #

OpenSearch vs Elasticsearch ref

Vector search vs. lexical search #

ref

Best Embeddings for the semantic search #

ref

Word2vec

OpenSearch #

OCR of Images #

2023-12-21_10-42-13_screenshot.png #

Figure: a query and its relevant document.

2024-01-05_14-21-10_screenshot.png #

Figure (graft): an embedding model maps the query into the embedding vector space. Retrieve results based on the highest similarity to the query in that space.

2024-01-05_14-23-07_screenshot.png #

Figure (graft): text as vectors. An embedding model turns a piece of text into a vector such as (0.8, 0.3, 0.1).

2024-05-01_16-32-43_screenshot.png #

Figure (AWS, 2023): Semantic search (neural search plugin) architecture on AWS Cloud. Steps shown in the diagram: create a connection to a third-party (3P) model hosting service (pre-trained BERT model); run the neural search pipeline to ingest business documents into Amazon OpenSearch Service; the frontend client submits a search request to Amazon API Gateway; API Gateway calls the backend service in AWS Lambda; the backend service calls the neural search API to get similar documents and returns them to the client.

2024-05-01_16-32-53_screenshot.png #

Results: Transformer + BM25, NDCG@10

Dataset        BM25    Harmonic   Fine-tuned, arithmetic   Fine-tuned, geometric
NFCorpus       0.343   0.346      0.369                    0.367
Trec-Covid     0.688   0.731      0.752                    0.79
ArguAna        0.472   0.482      0.527                    0.526
FIQA           0.254   0.281      0.364                    0.350
Scifact        0.691   0.673      0.728                    0.727
DBPedia        0.313   0.395      0.373                    0.392
Quora          0.789   0.847      0.874                    0.872
Scidocs        0.165   0.173      0.184                    0.184
CQADupStack    0.325   0.333      0.3673                   0.377
Amazon ESCI    0.081   0.088      0.091                    0.091
(unlabeled)    N/A     6.42%      14.14%                   14.93%

Source: https://opensearch.org/blog/semantic-science-benchmarks/ (AWS re:Invent slide)
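The harmonic, arithmetic, and geometric labels in these results refer to ways of combining a BM25 score with a transformer (neural) score for the same document. Below is a minimal sketch of that idea, assuming min-max normalization before combining; the raw scores and the helper names `min_max_normalize` and `combine` are hypothetical, and the benchmark's exact normalization and weighting may differ.

```python
import math

def min_max_normalize(scores):
    """Rescale scores to [0, 1]; if all scores are equal, return all 1.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def combine(a, b, method="arithmetic"):
    """Combine two normalized scores with the chosen mean."""
    if method == "arithmetic":
        return (a + b) / 2
    if method == "geometric":
        return math.sqrt(a * b)
    if method == "harmonic":
        return 0.0 if a + b == 0 else 2 * a * b / (a + b)
    raise ValueError(f"unknown method: {method}")

# Assumption: made-up raw scores for three documents.
bm25_raw = [12.0, 7.0, 2.0]
neural_raw = [0.2, 0.9, 0.4]

bm25 = min_max_normalize(bm25_raw)
neural = min_max_normalize(neural_raw)

combined = {
    method: [combine(a, b, method) for a, b in zip(bm25, neural)]
    for method in ("arithmetic", "geometric", "harmonic")
}
for method, values in combined.items():
    print(method, [round(v, 3) for v in values])
```

Note that the geometric and harmonic means drop to zero whenever one of the two scores is zero, so they reward documents that both retrievers agree on, while the arithmetic mean lets a strong score from one retriever carry a document.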