Elasticsearch and Full-Text Search

1) The role of Elasticsearch

Elasticsearch (ES) is a distributed search and analytics engine built on inverted indexes, with columnar structures for aggregations. It provides:

Full-text search: relevance ranking (BM25), morphology, fuzzy (typo-tolerant) matching.
Facets and aggregations: fast slices by attribute.
Hybrid search: BM25 + vector kNN (semantic).
Development speed: Query DSL, ingest pipelines, rich ecosystem.

For iGaming/fintech: search across games and providers, promos and rules; responsive facets (provider, volatility, RTP, language); search over KYC/AML logs; log and alert analysis.

2) Data model and mappings

2.1 Field types and indexing

`text` - full-text search.
`keyword` - exact values, aggregations, sorting.
`date`, `long`/`double`, `boolean`, `ip`, `geo_point`.
`nested` - arrays of objects with correct per-object field correlation.
`dense_vector` - vector representations (embeddings).

2.2 Multi-field strategy

Store the same field in several representations: `name.text`, `name.raw` (keyword), `name.ngram` (for autocomplete).

json
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ru_morph",
        "fields": {
          "raw": { "type": "keyword", "ignore_above": 256 },
          "ngram": { "type": "text", "analyzer": "edge_ngram_2_20" }
        }
      },
      "provider": { "type": "keyword" },
      "tags": { "type": "keyword" },
      "rtp": { "type": "float" },
      "released_at": { "type": "date" },
      "lang": { "type": "keyword" },
      "embedding": { "type": "dense_vector", "dims": 384, "index": true, "similarity": "cosine" }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "ru_stop": { "type": "stop", "stopwords": "_russian_" },
        "ru_stemmer": { "type": "stemmer", "language": "russian" },
        "syn_ru": {
          "type": "synonym",
          "lenient": true,
          "synonyms": [
            "slot, slot machine => slot",
            "jackpot, super prize => jackpot"
          ]
        }
      },
      "analyzer": {
        "ru_morph": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "ru_stop", "ru_stemmer", "syn_ru"]
        },
        "edge_ngram_2_20": {
          "type": "custom",
          "tokenizer": "edge_ngram",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "edge_ngram": { "type": "edge_ngram", "min_gram": 2, "max_gram": 20 }
      }
    }
  }
}
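
To make the `edge_ngram_2_20` analyzer concrete, here is a minimal Python sketch of the prefixes it would index for a title. This is a simulation, not Elasticsearch code: the function name `edge_ngrams` is hypothetical, and it assumes the tokenizer's default `token_chars` (empty), under which the whole input is treated as a single token.

```python
def edge_ngrams(text: str, min_gram: int = 2, max_gram: int = 20) -> list[str]:
    """Approximate the prefixes emitted by an edge_ngram tokenizer + lowercase filter."""
    lowered = text.lower()
    longest = min(max_gram, len(lowered))
    # One token per prefix length, from min_gram up to max_gram (or the text length).
    return [lowered[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("Book of"))
```

Each prefix becomes a searchable token, which is why a partially typed query such as `book o` can hit the `title.ngram` subfield without a wildcard.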

2.3 Nested for facets

Map attributes of the form `features: [{name, value}]` as `nested`; otherwise names and values from different objects are mixed, and facets and filters return false matches.
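
The false-match problem can be shown with a small Python simulation. It is only a sketch of the matching semantics, not Elasticsearch code; the helper names and the sample document are made up for illustration.

```python
def flattened_match(features, name, value):
    """Object array without `nested`: names and values are indexed as independent lists."""
    names = {f["name"] for f in features}
    values = {f["value"] for f in features}
    return name in names and value in values


def nested_match(features, name, value):
    """`nested` mapping: name and value must match inside the same object."""
    return any(f["name"] == name and f["value"] == value for f in features)


# Hypothetical game document: volatility is low, only the jackpot attribute is "high".
game_features = [
    {"name": "volatility", "value": "low"},
    {"name": "jackpot", "value": "high"},
]

print(flattened_match(game_features, "volatility", "high"))  # True: a false match
print(nested_match(game_features, "volatility", "high"))     # False: correct
```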

3) Relevance: BM25, boost and hybrid

3.1 Classic (BM25)

Combine fields with weights (`title^4`, `tags^2`, `description`).
Use `minimum_should_match` to control noisy matches.
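
As a rough guide to what a percentage value of `minimum_should_match` means in practice, the sketch below computes how many optional clauses must match. It covers only the positive-percentage form and assumes the documented round-down behavior; the function name is hypothetical.

```python
def required_clauses(optional_clauses: int, percent: int) -> int:
    """Positive-percentage form only: the computed count is rounded down."""
    return (optional_clauses * percent) // 100


for terms in (2, 3, 5, 10):
    print(terms, "terms ->", required_clauses(terms, 60), "must match")
```

Note that with a three-term query, `60%` requires only one matching term, so short queries may need a stricter setting than long ones.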

3.2 Vectors (kNN) + BM25 (rerank)

Embeddings (e.g. 384-768 dimensions) are stored in `dense_vector`.
First retrieve candidates by vector kNN (top 200-500), then rescore with BM25 plus business boosts (recency, RTP, regional license).

Hybrid query example:

json
{
  "knn": {
    "field": "embedding",
    "query_vector": [/... /],
    "k": 400,
    "num_candidates": 2000
  },
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Egyptian jackpot slots",
            "fields": ["title^4", "tags^2", "description"],
            "type": "best_fields",
            "minimum_should_match": "60%"
          }
        }
      ],
      "filter": [
        { "term": { "region": "TR" } },
        { "range": { "rtp": { "gte": 94.0 } } }
      ]
    }
  },
  "rescore": {
    "window_size": 400,
    "query": {
      "rescore_query": {
        "function_score": {
          "query": { "match_all": {} },
          "boost_mode": "sum",
          "functions": [
            { "gauss": { "released_at": { "scale": "180d", "offset": "30d", "decay": 0.5 } } },
            { "field_value_factor": { "field": "popularity", "factor": 0.2, "modifier": "log1p" } }
          ]
        }
      }
    }
  },
  "highlight": { "fields": { "title": {}, "description": {} } }
}
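
To see what the `gauss` recency boost in the example does to a game's score, the following Python sketch evaluates the Gaussian decay curve for a given age in days. It is an offline approximation of the documented decay formula, not a call into Elasticsearch; the function name and the sample ages are illustrative.

```python
import math


def gauss_decay(distance: float, scale: float, offset: float = 0.0, decay: float = 0.5) -> float:
    """Gaussian decay: 1.0 within `offset`, and exactly `decay` at offset + scale."""
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    effective = max(0.0, abs(distance) - offset)
    return math.exp(-(effective ** 2) / (2.0 * sigma_sq))


# Age of a release in days, using scale=180d, offset=30d, decay=0.5 as in the query.
for age_days in (10, 30, 210, 400):
    print(age_days, round(gauss_decay(age_days, scale=180, offset=30), 3))
```

Releases younger than the offset keep the full boost, and a release 210 days old (offset + scale) gets half of it.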

4) Autocomplete and suggestions

Approaches:

Edge n-gram on the `title.ngram` subfield (fast, simple).
Completion suggester (`completion` field) - fast suggestions, but a separate indexing path.
`search_as_you_type` - combines prefix matching on word starts with phrase matching.

Sample suggestion request:

json
{ "suggest": { "game-suggest": { "prefix": "book o", "completion": { "field": "title_suggest", "fuzzy": { "fuzziness": 1 }}}}}

5) Synonyms, typos and multilingual support

Synonyms: load a file/list through the `synonym` filter; keep separate dictionaries per domain (casino/sports).
Typos: `fuzziness: AUTO` in `multi_match`, restricted by term length and to selected fields. For suggestions, use the completion suggester's `fuzzy` mode.
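
For reference, `fuzziness: AUTO` picks the allowed edit distance from the term length. The sketch below mirrors the default thresholds (terms of 1-2 characters must match exactly, 3-5 characters allow one edit, longer terms allow two); the function name is hypothetical.

```python
def auto_fuzziness(term: str) -> int:
    """Maximum edit distance under the default AUTO thresholds (AUTO:3,6)."""
    length = len(term)
    if length <= 2:
        return 0
    if length <= 5:
        return 1
    return 2


for term in ("of", "book", "slots", "jackpot"):
    print(term, auto_fuzziness(term))
```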

Multilingual support:

Index per locale (ru/en/tr/pt-BR) or a multi-analyzer scheme: `title_ru`, `title_en`.
Different analyzers: `russian`, `english`, `turkish`, `portuguese`.
Include the language in the routing key to keep hot locales closer to the user.

6) Filters, facets and aggregations

For facets, use `keyword` fields and `nested` aggregations.
Avoid high-cardinality fields (unique IDs) in aggregations - move them to runtime fields or pre-aggregated data.

Example facets:

json
{
  "size": 20,
  "aggs": {
    "by_provider": { "terms": { "field": "provider", "size": 20 } },
    "by_volatility": { "terms": { "field": "volatility" } },
    "rtp_hist": { "histogram": { "field": "rtp", "interval": 1 } }
  }
}
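
On the client side, the buckets returned for such aggregations are usually flattened into facet counts for the UI. A minimal Python sketch, assuming the standard `aggregations -> <name> -> buckets` response shape; the helper name and the sample counts are made up for illustration.

```python
def facet_counts(response: dict, agg_name: str) -> dict:
    """Flatten one bucket aggregation into {bucket key: doc_count}."""
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Illustrative response fragment (invented counts), shaped like a terms aggregation result.
sample_response = {
    "aggregations": {
        "by_provider": {
            "buckets": [
                {"key": "ProviderA", "doc_count": 42},
                {"key": "ProviderB", "doc_count": 17},
            ]
        }
    }
}

print(facet_counts(sample_response, "by_provider"))
```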

7) Data ingestion and text cleanup

Ingest pipelines: normalization, field extraction, geocoding, HTML stripping.
Attachment/ingest-ocr (as needed): PDF/image indexing (be careful with PII).
Lemmatization: via analyzers or external pipelines (precomputed tokens).
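
When cleanup is done in an external pipeline before indexing, it can look like the following Python sketch: strip HTML, mask e-mail addresses, and normalize whitespace. This is a standard-library illustration, not an ingest pipeline definition; the helper names, the `[email]` placeholder and the simplified e-mail pattern are assumptions.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


# Simplified pattern for illustration; real PII detection needs more than a regex.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def clean_text(raw_html: str) -> str:
    """Strip tags, mask e-mail addresses, collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.parts)
    text = EMAIL_RE.sub("[email]", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("<p>Contact <b>john.doe@example.com</b> for   bonus rules</p>"))
```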

8) Shards, replicas and ILM

8.1 Sizing and sharding

Fewer shards are better. Target: 10-50 GB per shard for mixed workloads.
Start with `number_of_shards: 1-3` and scale based on actual load. Replicas - at least 1 in production.
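
The sizing rule can be turned into a quick back-of-the-envelope calculation. The sketch below is only an estimate helper: the function name is hypothetical, and the 30 GB default is an assumed midpoint of the 10-50 GB range, not a recommendation from Elasticsearch.

```python
import math


def primary_shards(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    """Smallest primary shard count that keeps each shard at or under the target size."""
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_gb))


print(primary_shards(8))         # a small catalog fits in one shard
print(primary_shards(120, 40))   # 120 GB at 40 GB per shard
```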

8.2 ILM (lifecycle)

hot → warm → cold → delete for logs and promo history.
Force merge for cold segments.
For catalogs and product search - a permanently hot index with periodic optimization.

8.3 Zero-downtime migration

Create the new index `games_v2`, run `reindex` and backfill, then switch the alias `games` to it. Remove deprecated fields gradually.
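
The alias switch itself is a single request to the `_aliases` endpoint in which the remove and add actions are applied together. A minimal Python sketch that builds the request body; the old index name `games_v1` and the helper name are assumptions, and sending the request is left to whatever client is in use.

```python
import json


def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Request body for POST /_aliases: move the alias in one atomic step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# `games_v1` is an assumed name for the index currently behind the alias.
body = alias_swap_actions("games", "games_v1", "games_v2")
print(json.dumps(body, indent=2))
```

Because clients only ever query `games`, they never see an intermediate state in which the alias points at no index.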

9) Snapshots, DR and upgrades

Snapshots to object storage (S3/GCS), on a schedule, with regular restore checks.
Rolling node upgrades; verify shard allocation awareness (by zone).
DR plans: cross-cluster replication (CCR) across regions for critical indexes (catalogs, reference data).

10) Security and PII

TLS/mTLS between client and cluster.
RBAC: roles per index/operation; Dev/Stage/Prod kept separate.
PII/PCI: do not index fields with personal data unnecessarily; mask them at ingest.
Right to be forgotten: keep references to documents so they can be deleted by `user_id`; soft-delete + reindex/notification.

11) Observability and search SLO

Metrics:

P50/P95/P99 query latency, 4xx/5xx errors.
Cache hit rate (query cache / shard request cache).
Heap usage, GC pauses, segment merges, thread pools (search/write).
Hot shards/hot nodes, rejections.
kNN: `graph_hits`, `search_k`, latency, recall@k.

SLO examples:

Game search: P95 ≤ 200 ms, errors < 0.5% over a 30-minute window.
Suggestions: P95 ≤ 80 ms.
kNN hybrid: P95 ≤ 350 ms for top-20 results.
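
A check against such an SLO can be sketched in a few lines of Python. The example uses a nearest-rank percentile over raw latency samples; the function names and the defaults (200 ms, 0.5% errors, matching the game-search SLO) are illustrative, and production monitoring would normally compute this in the metrics system rather than in application code.

```python
import math


def percentile(samples, pct: int):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(latencies_ms, errors: int, total: int,
              p95_limit_ms: float = 200, max_error_rate: float = 0.005) -> bool:
    """True if P95 latency and error rate for the window are both within limits."""
    return percentile(latencies_ms, 95) <= p95_limit_ms and errors / total < max_error_rate


window = list(range(1, 101))  # synthetic latencies: 1..100 ms
print(percentile(window, 95), meets_slo(window, errors=0, total=100))
```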

12) FinOps: cost and performance

Index size: be economical with tokenization, disable unnecessary `fielddata`, and keep `doc_values` only where they are needed.

Segments: plan the merge policy and avoid segment fragmentation.

kNN is more expensive in RAM/CPU: limit dims and `num_candidates`, pre-filter with BM25.
Hot fields in RAM: monitor fielddata/heap; move heavy aggregations into separate indexes.
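
To get a feel for why dimensionality matters for kNN cost, the sketch below estimates the raw storage of float32 vectors. It is a lower bound only: the function name is hypothetical, and the overhead of the vector index structure and of replicas is deliberately left out.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_dim: int = 4) -> int:
    """Raw float32 vector storage; index structure overhead is not included."""
    return num_vectors * dims * bytes_per_dim


for dims in (384, 768):
    gib = raw_vector_bytes(1_000_000, dims) / 2**30
    print(f"{dims} dims, 1M vectors: {gib:.2f} GiB")
```

Halving the dimensionality halves this baseline, which is often the cheapest lever before tuning `num_candidates`.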

13) Sample requests

13.1 Multi-field full-text with boost

json
{
  "query": {
    "multi_match": {
      "query": "book of",
      "fields": ["title^4", "title.ngram^2", "tags^2", "description"]
    }
  },
  "sort": ["_score", { "released_at": "desc" }]
}

13.2 Filters + facets

json
{
  "query": {
    "bool": {
      "must": [{ "match": { "title": "egypt" } }],
      "filter": [
        { "terms": { "provider": ["Novomatic", "PragmaticPlay"] } },
        { "range": { "rtp": { "gte": 95 } } }
      ]
    }
  },
  "aggs": {
    "by_provider": { "terms": { "field": "provider" } },
    "by_year": { "date_histogram": { "field": "released_at", "calendar_interval": "year" } }
  }
}

13.3 Nested attribute filtering

json
{
  "query": {
    "nested": {
      "path": "features",
      "query": {
        "bool": {
          "must": [
            { "term": { "features.name": "volatility" } },
            { "term": { "features.value": "high" } }
          ]
        }
      }
    }
  }
}

13.4 Log search (ECS) with highlighting

json
{
  "query": {
    "bool": {
      "must": [{ "match_phrase": { "message": "payment declined" } }],
      "filter": [
        { "term": { "service.name": "payments" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "highlight": { "fields": { "message": { "number_of_fragments": 0 } } }
}

14) Multi-tenancy and isolation

Index per tenant (better isolation) or a `tenant_id` field + ACL filter (more expensive for aggregations).
Routing by `tenant_id` to keep a tenant's data on specific shards.
Cap tenant requests with limits/timeouts and query-phase guardrails.
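
For the shared-index variant, every tenant request can be wrapped in a mandatory `tenant_id` filter with a timeout and a capped page size. A minimal Python sketch of such a wrapper; the function name and the default limits are assumptions chosen for illustration, and routing would additionally be passed as the `routing` request parameter.

```python
def tenant_scoped_search(tenant_id: str, user_query: dict, size: int = 20,
                         max_size: int = 50, timeout: str = "2s") -> dict:
    """Wrap a tenant's query so it cannot escape its tenant or exceed the limits."""
    return {
        "timeout": timeout,
        "size": min(size, max_size),
        "query": {
            "bool": {
                "must": [user_query],
                # Filter context: no scoring impact, and cacheable.
                "filter": [{"term": {"tenant_id": tenant_id}}],
            }
        },
    }


request = tenant_scoped_search("tenant-42", {"match": {"title": "egypt"}}, size=500)
print(request["size"], request["query"]["bool"]["filter"])
```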

15) Implementation checklist

1. Schema: `text`/`keyword`/`nested` + multi-fields, `dense_vector` if necessary.
2. Per-language analyzers, synonyms, edge n-gram for autocomplete.
3. Relevance: BM25 boosts + hybrid kNN → rescore.
4. Facets: keyword/nested; aggregate only on suitable fields.
5. Indexing: ingest pipelines (normalization), batch loading.
6. Sharding: start small, use aliases for migrations, ILM for long-lived logs.
7. DR: snapshot schedule, restore checks, CCR for critical indexes.
8. Security: TLS, RBAC, PII masking, deletion policy.
9. Observability: latency, heap/GC, cache hit rate, hot shards, rejections.
10. FinOps: index size, kNN parameters, disabling unneeded `doc_values`/`fielddata`.

16) Anti-patterns

One index "for everything": different domains (catalog, logs, transactions) require different settings.
Indiscriminate `fuzziness: AUTO` on all fields: slow and noisy.
Synonyms that "eat up meaning": dictionaries are not separated by domain.
No `nested` where field bundles are needed: false facets.
Too many shards (one per document) - cluster state overhead.
Not using aliases during migrations - downtime and broken links.
Indexing PII "as is" - regulatory risks and expensive reindexes.

17) iGaming/fintech context: quick recipes

Game search: `multi_match` with boosts `title^4`, `tags^2`, facets by provider/volatility, filters by region/currency, hybrid with vectors for "themes" (for example, "Egypt", "fruit classic").
Promos/bonuses: synonyms ("freespins", "free spins"), date filters on `active_from`/`active_to`, suggestions via completion.
KYC/AML logs: ECS schema, full-text search on `message`, aggregations by `rule_name` and `country`, anomaly detection on an `@timestamp` histogram.
Provider catalog: keyword fields for facets and sorting; text descriptions as `text` with morphology.
Regulatory pages: multilingual fields, `search_as_you_type` for soft suggestions.

Summary

Effective search on Elasticsearch is more than "turning on BM25": it takes the right analyzers and mappings, multi-fields and nested, a hybrid of BM25 and vectors, careful facets and aggregations, disciplined sharding and ILM, clear SLOs and observability, plus security and FinOps. With these principles, search stays fast, relevant and predictable, and withstands traffic peaks on a product platform.
