Elasticsearch

May 1, 2024 | seedling

Summary #

The heart of the company

- It is a database with search engine APIs.
- The cloud solution requires elastic-licenses.
- Elasticsearch is built on top of Apache Lucene.
- Elasticsearch turns Lucene into a distributed search engine that scales horizontally.

Elastic cloud features #

https://www.elastic.co/subscriptions/cloud

Where does Elasticsearch store data? #

ref

/var/lib/elasticsearch (the default path.data for DEB/RPM package installs; configurable in elasticsearch.yml)

Why did Elastic change its licenses starting with version 7.11? #

Elastic's justification

Amazon was offering Elastic products as a service, so Elastic changed its licenses to make itself the sole cloud service provider of Elastic products. According to Elastic:
- Our on-prem or Elastic Cloud customers will not be impacted.
- The vast majority of our users will not be impacted.
- The folks who take our products and sell them directly as a service will be impacted, such as the Amazon Elasticsearch Service.

Current licenses #

How Elasticsearch works #

udemy course

Inverted index

Each field in a document gets its own inverted index.
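A minimal sketch of the per-field idea, assuming a toy analyzer that only lowercases and splits on non-word characters (real Elasticsearch analyzers do more). The two documents are the ones from the screenshots further down this note; the function names are made up for illustration.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer (an assumption): lowercase, then split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def build_inverted_indices(docs, text_fields):
    """Return {field: {term: [doc ids]}} with one inverted index per field."""
    indices = {field: defaultdict(set) for field in text_fields}
    for doc_id, doc in docs.items():
        for field in text_fields:
            for term in analyze(doc.get(field, "")):
                indices[field][term].add(doc_id)
    # Sort terms alphabetically and doc ids numerically for readable output.
    return {
        field: {term: sorted(ids) for term, ids in sorted(postings.items())}
        for field, postings in indices.items()
    }


docs = {
    1: {"name": "Coffee Maker", "description": "Makes coffee super fast!"},
    2: {"name": "Toaster", "description": "Makes delicious toasts."},
}
indices = build_inverted_indices(docs, ["name", "description"])
```

Looking up a term is then a dictionary access per field, e.g. `indices["description"]["makes"]` lists both documents, while `indices["name"]` knows nothing about the word "makes".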

Mapping

Identifies the data type of each field.

Using it as a vector store with dense vectors #

Use the dense_vector mapping type.
dense_vector with good diagrams

The dense_vector type is available from version 7.6.
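A rough sketch of what using dense_vector could look like, written as plain Python dictionaries that mirror the request bodies. The field name `embedding`, the `dims` value of 3, and the helper names are assumptions for illustration; the `cosine_similarity` function is a local reimplementation to show what the score means, not Elasticsearch code.

```python
import math

# Hypothetical mapping body: field name "embedding" and dims=3 are illustrative.
mapping_body = {
    "mappings": {
        "properties": {
            "embedding": {"type": "dense_vector", "dims": 3},
        }
    }
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def script_score_query(query_vector, field="embedding"):
    """Build a script_score search body that ranks documents by cosine similarity.

    The "+ 1.0" keeps the score non-negative, which script_score requires.
    """
    return {
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": f"cosineSimilarity(params.query_vector, '{field}') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        }
    }


search_body = script_score_query([0.1, 0.2, 0.3])
```

The two bodies would be sent with `PUT /<index>` and `GET /<index>/_search` respectively; the index name is left open here on purpose.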

#

distance ref

#

7.10 #

ref

OCR of Images #

2023-08-25_22-29-41_screenshot.png #

Elasticsearch uses analyzers to analyze text data. An analyzer has the following components:
- Character filters: zero or more
- Tokenizer: exactly one
- Token filters: zero or more

The following diagram depicts the components of an analyzer:

Figure 3.1: Anatomy of an analyzer. Character filter(s) (zero or more) → Tokenizer (exactly one) → Token filter(s) (zero or more).

Let's understand the role of each component one by one.

2023-12-10_16-33-37_screenshot.png #

Diagram: Document → Analyzer (character filters → tokenizer → token filters) → Storage

2023-12-10_16-35-40_screenshot.png #

Standard analyzer:
- Character filters (none): input "I REALLY like beer!", output "I REALLY like beer!"
- Tokenizer (standard): input "I REALLY like beer!", output ["I", "REALLY", "like", "beer"]
- Token filters (["lowercase"]): input ["I", "REALLY", "like", "beer"], output ["i", "really", "like", "beer"]
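The three stages of the standard analyzer example can be sketched as a small pipeline. This is a simplification under stated assumptions: the regex tokenizer only approximates the real standard tokenizer (which follows Unicode text segmentation rules), and all function names are made up for illustration.

```python
import re


def standard_like_tokenizer(text):
    """Approximation of the standard tokenizer: keep runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def analyze(text, char_filters=(), tokenizer=standard_like_tokenizer,
            token_filters=(lowercase_filter,)):
    """Run text through character filters, exactly one tokenizer, then token filters."""
    for char_filter in char_filters:      # zero or more, operate on the raw string
        text = char_filter(text)
    tokens = tokenizer(text)              # exactly one, string -> list of tokens
    for token_filter in token_filters:    # zero or more, operate on the token list
        tokens = token_filter(tokens)
    return tokens


tokens = analyze("I REALLY like beer!")
```

With no character filters and only the lowercase token filter, `tokens` ends up as the four lowercased words from the example.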

2023-12-10_16-52-47_screenshot.png #

Mapping:

    "properties": {
      "description": {
        "type": "text",
        "analyzer": "stemming_analyzer"
      }
    }

Terms indexed for document #1: love, drink, bottl, of, wine, on, last, year, vacat

Query:

    GET /stemming_test/_search
    {
      "query": {
        "match": {
          "description": "drinking"
        }
      }
    }

The query text "drinking" passes through stemming_analyzer and becomes "drink", which matches the indexed term.

2023-12-10_16-40-04_screenshot.png #

Sentences and their tokens:
- "2 guys walk into a bar, but the third... DUCKS! :-)" → ["2", "guys", "walk", "into", "a", "bar", "but", "the", "third", "ducks"]
- "2 guys went into a bar" → ["2", "guys", "went", "into", "a", "bar"]
- "2 ducks walk around the lake" → ["2", "ducks", "walk", "around", "the", "lake"]

Inverted index (term: documents that contain it):
- 2: #1, #2, #3
- a: #1, #2
- around: #3
- bar: #1, #2
- but: #1
- ducks: #1, #3
- guys: #1, #2
- into: #1, #2
- lake: #3
- the: #1, #3
- third: #1
- walk: #1, #3
- went: #2

2023-12-10_16-40-21_screenshot.png #

Same sentences, tokens, and term/document matrix as the previous screenshot (2023-12-10_16-40-04_screenshot.png).

2023-12-10_16-41-27_screenshot.png #

Inverted indices
- Mapping between terms and which documents contain them
- Outside the context of analyzers, we use the terminology "terms"
- Terms are sorted alphabetically
- Inverted indices contain more than just terms and document IDs
  - E.g. information for relevance scoring

2023-12-10_16-41-51_screenshot.png #

Document #1:

    {
      "name": "Coffee Maker",
      "description": "Makes coffee super fast!",
      "price": 64,
      "in_stock": 10,
      "created_at": "2009-11-08T14:21:51Z"
    }

Document #2:

    {
      "name": "Toaster",
      "description": "Makes delicious toasts.",
      "price": 49,
      "in_stock": 4,
      "created_at": "2007-01-29T09:44:15Z"
    }

Inverted index for the name field (term: documents):
- coffee: #1
- maker: #1
- toaster: #2

Inverted index for the description field (term: documents):
- coffee: #1
- delicious: #2
- fast: #1
- makes: #1, #2
- super: #1
- toasts: #2

2023-12-10_16-42-34_screenshot.png #

Lecture summary (1/2)
- Values for a text field are analyzed and the results are stored within an inverted index
- Each field has a dedicated inverted index
- An inverted index is a mapping between terms and which documents contain them
- Terms are sorted alphabetically for performance reasons
- Created and maintained by Apache Lucene, not Elasticsearch

2023-12-10_16-42-47_screenshot.png #

Lecture summary (2/2)
- Inverted indices enable fast searches
- Inverted indices contain other data as well
  - E.g. things used for relevance scoring (covered later)
- Elasticsearch (technically, Apache Lucene) uses other data structures as well
  - E.g. BKD trees for numeric values, dates, and geospatial data

2023-12-10_16-43-43_screenshot.png #

MySQL:

    CREATE TABLE employees (
      id INT AUTO_INCREMENT PRIMARY KEY,
      first_name VARCHAR(255) NOT NULL,
      last_name VARCHAR(255) NOT NULL,
      dob DATE,
      description TEXT,
      created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );

Elasticsearch:

    PUT /employees
    {
      "mappings": {
        "properties": {
          "id": { "type": "integer" },
          "first_name": { "type": "text" },
          "last_name": { "type": "text" },
          "dob": { "type": "date" },
          "description": { "type": "text" },
          "created_at": { "type": "date" }
        }
      }
    }

2023-12-10_16-43-53_screenshot.png #

What is mapping?
- Defines the structure of documents (e.g. fields and their data types)
- Also used to configure how values are indexed
- Similar to a table's schema in a relational database
- Explicit mapping: we define field mappings ourselves
- Dynamic mapping: Elasticsearch generates field mappings for us
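To make dynamic mapping concrete, here is a simplified sketch of the kind of type inference Elasticsearch performs on JSON values. It is an approximation under stated assumptions: date detection on strings, arrays, and null values are left out, and the function names are hypothetical. The sample document is the product from the later screenshots.

```python
def infer_field_mapping(value):
    """Guess a field mapping from one JSON value (simplified dynamic mapping)."""
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        # Strings become text with a keyword sub-field for exact matching.
        return {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        }
    if isinstance(value, dict):
        # Objects get their own nested "properties".
        return {"properties": infer_mapping(value)}
    raise TypeError(f"value type not covered by this sketch: {type(value).__name__}")


def infer_mapping(doc):
    """Infer a "properties" mapping for every field of a document."""
    return {field: infer_field_mapping(value) for field, value in doc.items()}


product = {
    "name": "Coffee Maker",
    "price": 64.2,
    "in_stock": 10,
    "is_active": True,
    "manufacturer": {"name": "Nespresso", "country": "Switzerland"},
}
mapping = {"mappings": {"properties": infer_mapping(product)}}
```

An explicit mapping is the same structure written by hand, which lets us pick `integer` instead of `long` or drop the keyword sub-field where it is not needed.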

2023-12-10_16-44-29_screenshot.png #

Diagram of data types: long, integer, object, boolean, text, short, double, date, float

2023-12-10_16-45-28_screenshot.png #

Document:

    {
      "name": "Coffee Maker",
      "price": 64.2,
      "in_stock": 10,
      "is_active": true,
      "manufacturer": {
        "name": "Nespresso",
        "country": "Switzerland",
        "owned_by": {
          "name": "Nestlé Group",
          "country": "Switzerland"
        }
      }
    }

Mapping for the manufacturer object:

    PUT /products
    {
      "mappings": {
        "properties": {
          "manufacturer": {
            "properties": {
              "name": { "type": "text" },
              "country": { "type": "text" },
              "owned_by": {
                "properties": {
                  "name": { "type": "text" },
                  "country": { "type": "text" }
                }
              }
            }
          }
        }
      }
    }

2023-12-10_16-46-00_screenshot.png #

How objects are stored (original → flattened), one level of nesting:

    { "name": "Coffee Maker", "price": 64.2, "in_stock": 10, "is_active": true, "manufacturer": { "name": "Nespresso", "country": "Switzerland" } }
    →
    { "name": "Coffee Maker", "price": 64.2, "in_stock": 10, "is_active": true, "manufacturer.name": "Nespresso", "manufacturer.country": "Switzerland" }

Two levels of nesting:

    { "name": "Coffee Maker", "price": 64.2, "in_stock": 10, "is_active": true, "manufacturer": { "name": "Nespresso", "country": "Switzerland", "owned_by": { "name": "Nestlé Group", "country": "Switzerland" } } }
    →
    { "name": "Coffee Maker", "price": 64.2, "in_stock": 10, "is_active": true, "manufacturer.name": "Nespresso", "manufacturer.country": "Switzerland", "manufacturer.owned_by.name": "Nestlé Group", "manufacturer.owned_by.country": "Switzerland" }

2023-12-10_16-46-16_screenshot.png #

How arrays of objects are stored (original → flattened):

    { "name": "Coffee Maker", "reviews": [ { "rating": 5.0, "author": "Average Joe", "description": "Haven't slept for days.. Amazing!" }, { "rating": 3.5, "author": "John Doe", "description": "Could be better :)" } ] }
    →
    { "name": "Coffee Maker", "reviews.rating": [5.0, 3.5], "reviews.author": ["Average Joe", "John Doe"], "reviews.description": ["Haven't slept for days.. Amazing!", "Could be better :)"] }
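The flattening shown in the last two screenshots can be sketched in a few lines. This is an illustration of the idea, not Elasticsearch's actual implementation; the function name `flatten` is made up, and only plain objects and arrays of objects are handled.

```python
def flatten(doc, prefix=""):
    """Flatten nested objects into dotted field names.

    Arrays of objects are merged so that each sub-field holds a list of values.
    """
    flat = {}
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            # Nested object: recurse and extend the dotted path.
            flat.update(flatten(value, path + "."))
        elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            # Array of objects: collect each sub-field's values into one list.
            for item in value:
                for sub_path, sub_value in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, []).append(sub_value)
        else:
            flat[path] = value
    return flat


product = {
    "name": "Coffee Maker",
    "manufacturer": {
        "name": "Nespresso",
        "owned_by": {"name": "Nestlé Group", "country": "Switzerland"},
    },
    "reviews": [
        {"rating": 5.0, "author": "Average Joe"},
        {"rating": 3.5, "author": "John Doe"},
    ],
}
flat = flatten(product)
```

Note what the array case loses: after flattening, nothing records that the 5.0 rating belonged to "Average Joe" rather than "John Doe", since the ratings and authors live in separate lists.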

2023-12-10_16-48-21_screenshot.png #

keyword data type
- Used for exact matching of values
- Typically used for filtering, aggregations, and sorting
  - E.g. searching for articles with a status of PUBLISHED
- For full-text searches, use the text data type instead
  - E.g. searching the body text of an article
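A small sketch of why keyword suits exact matching while text suits full-text search: a keyword value is indexed as one unmodified term, whereas a text value is analyzed into several terms. The toy analyzer (lowercase plus word splitting) and the function name `index_terms` are assumptions for illustration.

```python
import re


def index_terms(value, field_type):
    """Return the terms that would be indexed for a value, in simplified form."""
    if field_type == "keyword":
        # The whole value is kept as a single, unmodified term.
        return [value]
    if field_type == "text":
        # Toy analysis (an assumption): lowercase and split into words.
        return re.findall(r"\w+", value.lower())
    raise ValueError(f"unsupported field type in this sketch: {field_type}")


status_terms = index_terms("PUBLISHED", "keyword")
body_terms = index_terms("Makes coffee super fast!", "text")
```

An exact lookup for "PUBLISHED" finds the keyword term, but "published" does not, because nothing was lowercased; a search for "coffee" finds the text field because the sentence was split into words.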

2023-08-25_22-29-57_screenshot.png #

Tokenizer

An analyzer has exactly one tokenizer. The responsibility of a tokenizer is to receive a stream of characters and generate a stream of tokens. These tokens are used to build an inverted index. A token is roughly equivalent to a word. In addition to breaking down characters into words or tokens, it also produces, in its output, the start and end offset of each token in the input stream.
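The offset bookkeeping described above can be sketched as follows. The regex only approximates a real tokenizer, and the `Token` type and `tokenize` function are hypothetical names; the point is that each token carries the start (inclusive) and end (exclusive) character positions from the input.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    start_offset: int  # index of the token's first character in the input
    end_offset: int    # index just past the token's last character


def tokenize(text: str) -> List[Token]:
    """Split text into word tokens and record where each one came from."""
    return [
        Token(match.group(), match.start(), match.end())
        for match in re.finditer(r"\w+", text)
    ]


tokens = tokenize("I REALLY like beer!")
```

Slicing the input with a token's offsets, `text[start_offset:end_offset]`, gives back the token text, which is what features such as highlighting rely on.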
