sentence-transformers

April 2, 2024 | seedling, permanent

Python Apps for Semantic Search #

Multilingual Sentence & Image Embedding with BERT

This framework provides an easy method to compute dense vector representations for sentences, paragraphs, and images. The models are based on transformer networks like BERT, RoBERTa, and XLM-RoBERTa, and achieve state-of-the-art performance on various tasks. Text is embedded in a vector space such that similar texts are close together and can be found efficiently using cosine similarity.

Cross-Encoder #

In contrast, for a Cross-Encoder we pass both sentences simultaneously to the Transformer network. It then produces an output value between 0 and 1 indicating the similarity of the input sentence pair.

Use case: evaluating the relevance of a query and its results (re-ranking).

Symmetric vs. Asymmetric Search: model recommendation #

ref

Symmetric #

Both query and response are of similar length. Model: multi-qa-MiniLM-L6-cos-v1

Asymmetric #

Query and documents are not of equal length; the documents are typically much longer. Model: all-MiniLM-L6-v2

Multilingual Semantic Search #

distiluse-base-multilingual-cased-v1 #

Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. Supports 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish.

distiluse-base-multilingual-cased-v2 #

Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. This version supports 50+ languages, but performs a bit weaker than the v1 model.

paraphrase-multilingual-MiniLM-L12-v2 #

Multilingual version of paraphrase-MiniLM-L12-v2, trained on parallel data for 50+ languages.

paraphrase-multilingual-mpnet-base-v2 #

Multilingual version of paraphrase-mpnet-base-v2, trained on parallel data for 50+ languages.

