Elasticsearch and Full-Text Search
1) The role of Elasticsearch
Elasticsearch (ES) is a distributed search and analytics engine built on inverted indexes and columnar structures (doc values) for aggregations. It provides:
- Full-text search: relevance (BM25), morphology, fuzzy/typo-tolerant matching.
- Facets and aggregations: quick slices by attributes.
- Hybrid search: BM25 + vector kNN (semantics).
- Development speed: Query DSL, ingest pipelines, rich ecosystem.
For iGaming/fintech: search for games/providers, promos and rules, fast facets (provider, volatility, RTP, language), search over KYC/AML logs, log parsing and alerting.
2) Data model and mappings
2.1 Index and field types
`text`: full text.
`keyword`: exact values, aggregations, sorting.
`date`, `long`/`double`, `boolean`, `ip`, `geo_point`.
`nested`: arrays of objects with correct field correlation.
`dense_vector`: vector representations (embeddings).
2.2 Multi-field strategy
Index the same field in several forms: `name.text`, `name.raw` (keyword), `name.ngram` (for auto-completion).
json
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "ru_morph",
"fields": {
"raw": { "type": "keyword", "ignore_above": 256 },
"ngram": { "type": "text", "analyzer": "edge_ngram_2_20" }
}
},
"provider": { "type": "keyword" },
"tags": { "type": "keyword" },
"rtp": { "type": "float" },
"released_at": { "type": "date" },
"lang": { "type": "keyword" },
"embedding": { "type": "dense_vector", "dims": 384, "index": true, "similarity": "cosine" }
}
},
"settings": {
"analysis": {
"filter": {
"ru_stop": { "type": "stop", "stopwords": "_russian_" },
"ru_stemmer": { "type": "stemmer", "language": "russian" },
"syn_ru": { "type": "synonym", "lenient": true, "synonyms": [
"слот,игровой автомат => слот",
"джекпот,суперприз => джекпот"
] }
},
"analyzer": {
"ru_morph": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "ru_stop", "ru_stemmer", "syn_ru"]
},
"edge_ngram_2_20": {
"type": "custom",
"tokenizer": "edge_ngram",
"char_filter": [],
"filter": ["lowercase"]
}
},
"tokenizer": {
"edge_ngram": { "type": "edge_ngram", "min_gram": 2, "max_gram": 20 }
}
}
}
}
2.3 Nested for facets
Map attributes of the form `features: [{name, value}]` as `nested`; otherwise a name from one object can match a value from another, and facets return false matches.
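The false-match problem can be illustrated without a cluster. The sketch below is a simplified model, not Elasticsearch code: `matches_flattened` imitates how a plain object array is flattened into independent value lists per field, and `matches_nested` imitates per-object matching. The sample document and both function names are hypothetical.

```python
# Simplified model of object-array flattening vs. nested matching.
# The document and function names are illustrative assumptions.
game = {
    "features": [
        {"name": "volatility", "value": "low"},
        {"name": "max_win", "value": "high"},
    ]
}


def matches_flattened(doc, name, value):
    """Flattened arrays keep only per-field value sets, so pairing is lost."""
    names = {f["name"] for f in doc["features"]}
    values = {f["value"] for f in doc["features"]}
    return name in names and value in values


def matches_nested(doc, name, value):
    """Nested objects are checked one at a time, so pairing is preserved."""
    return any(f["name"] == name and f["value"] == value for f in doc["features"])


print(matches_flattened(game, "volatility", "high"))  # True: a false match
print(matches_nested(game, "volatility", "high"))     # False
```

The game has low volatility, yet the flattened model reports a hit for "volatility = high" because "high" belongs to a different attribute.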
3) Relevance: BM25, boosts and hybrid search
3.1 Classic (BM25)
Combine fields with weights (`title^4`, `tags^2`, `description`).
Use `minimum_should_match` to control noisy matches.
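To make the effect of field weights concrete, here is a minimal sketch of the textbook BM25 term score and a `best_fields`-style combination that takes the highest boosted field score. It is an approximation for intuition only; Lucene's actual scoring differs in details. The defaults `k1=1.2` and `b=0.75` are the usual BM25 defaults, and all sample numbers are made up.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one field."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)


def best_fields_score(field_scores, boosts):
    """Take the best field score after applying per-field boosts."""
    return max(boosts.get(field, 1.0) * score for field, score in field_scores.items())


# Made-up statistics: a short title vs. a long description.
title = bm25_term_score(tf=1, doc_len=3, avg_doc_len=4, n_docs=1000, doc_freq=50)
description = bm25_term_score(tf=3, doc_len=120, avg_doc_len=100, n_docs=1000, doc_freq=200)
print(best_fields_score({"title": title, "description": description}, {"title": 4.0}))
```

Because term frequency saturates, repeating a word in `description` adds less and less, while a boost on `title` scales the whole field score.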
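3.2 Vectors (kNN) + BM25 (rerank)
Embeddings (e.g. 384-768 dimensions) are stored in `dense_vector`.
First retrieve candidates by vector kNN (top 200-500), combine them with BM25 matches, then rescore with business boosts (novelty, RTP, regional license). Hard filters are repeated inside the `knn` clause, because filters in `query` do not restrict kNN hits.
json
{
"knn": {
"field": "embedding",
"query_vector": [/* ... */],
"k": 400, "num_candidates": 2000,
"filter": { "bool": { "filter": [
{ "term": { "region": "TR" }},
{ "range": { "rtp": { "gte": 94.0 }}}
] } }
},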
"query": {
"bool": {
"should": [
{ "multi_match": {
"query": "египетские слоты джекпот",
"fields": ["title^4","tags^2","description"],
"type": "best_fields",
"minimum_should_match": "60%"
}}
],
"filter": [
{ "term": { "region": "TR" }},
{ "range": { "rtp": { "gte": 94.0 }}}
]
}
},
"rescore": {
"window_size": 400,
"query": {
"rescore_query": {
"function_score": {
"query": { "match_all": {} },
"boost_mode": "sum",
"functions": [
{ "gauss": { "released_at": { "scale": "180d", "offset": "30d", "decay": 0.5 } } },
{ "field_value_factor": { "field": "popularity", "factor": 0.2, "modifier": "log1p" } }
]
}
}
}
},
"highlight": { "fields": { "title": {}, "description": {} } }
}
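The two scoring functions in the rescore step can be reproduced offline to tune their parameters. The sketch below follows the documented shape of the `gauss` decay and of `field_value_factor` with the `log1p` modifier; the function names are hypothetical, and the default arguments simply mirror the request above.

```python
import math


def gauss_decay(age_days, scale=180.0, offset=30.0, decay=0.5):
    """Gaussian decay: 1.0 within `offset`, `decay` at offset + scale."""
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    distance = max(0.0, abs(age_days) - offset)
    return math.exp(-(distance ** 2) / (2.0 * sigma_sq))


def popularity_factor(popularity, factor=0.2):
    """log1p modifier: common logarithm of (1 + factor * value)."""
    return math.log10(1.0 + factor * popularity)


for age in (0, 30, 210, 400):
    print(age, round(gauss_decay(age), 3))
```

A game released within the last 30 days keeps the full recency score, one released 210 days ago gets half of it, and older titles fade smoothly rather than dropping out.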
4) Auto-completion and suggestions
Approaches:
- Edge n-gram on the `title.ngram` subfield (fast, simple).
- Completion suggester (`completion` field): fast suggestions, but a separate indexing path.
- Search-as-you-type: a field type that matches on the beginnings of words and phrases.
json
{ "suggest": { "game-suggest": { "prefix": "book o", "completion": { "field": "title_suggest", "fuzzy": { "fuzziness": 1 }}}}}
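For the edge n-gram approach, it helps to see which tokens end up in the index. The sketch below approximates an `edge_ngram` tokenizer with `min_gram: 2`, `max_gram: 20` and no `token_chars` restriction (so the whole input is treated as one token), followed by `lowercase`. The function name is hypothetical and this is a model, not the analyzer itself.

```python
def edge_ngrams(text, min_gram=2, max_gram=20):
    """Prefixes of the lowercased text, from min_gram to max_gram characters."""
    text = text.lower()
    longest = min(len(text), max_gram)
    return [text[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("Book of"))
# ['bo', 'boo', 'book', 'book ', 'book o', 'book of']
```

Since every prefix is indexed, a typed prefix such as "book o" matches as an exact token. The same analyzer applied at search time would also split the query into prefixes, so a separate `search_analyzer` on the subfield is worth considering.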
5) Synonyms, typos and multilingual search
Synonyms: load a file or list through the `synonym` filter; keep separate dictionaries per domain (casino/sports).
Typos: `fuzziness: AUTO` in `multi_match`, restricted by term length and to selected fields. For suggestions, use the `fuzzy` mode of the completion suggester.
Multilingual:
- Index per locale (ru/en/tr/pt-BR) or a multi-analyzer scheme: `title_ru`, `title_en`.
- Different analyzers: `russian`, `english`, `turkish`, `portuguese`.
- Put the language into the routing key to keep hot locales closer to the user.
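For the typo handling mentioned above, `fuzziness: AUTO` picks the allowed edit distance from the term length. A minimal sketch of that rule with the default thresholds (3 and 6) is shown below; the function name is hypothetical.

```python
def auto_fuzziness(term, low=3, high=6):
    """Allowed edit distance under AUTO: 0 for short, 1 for medium, 2 for long terms."""
    length = len(term)
    if length < low:
        return 0
    if length < high:
        return 1
    return 2


for term in ("of", "book", "jackpot"):
    print(term, auto_fuzziness(term))
# of 0
# book 1
# jackpot 2
```

Short terms must match exactly, which is one reason fuzzy matching on short codes or two-letter words adds cost without improving recall.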
6) Filters, facets and aggregations
For facets, use `keyword` fields and `nested` aggregations.
Avoid high-cardinality fields (unique IDs) in aggregations; move them to runtime fields or pre-aggregated values.
json
{
"size": 20,
"aggs": {
"by_provider": { "terms": { "field": "provider", "size": 20 } },
"by_volatility": { "terms": { "field": "volatility" } },
"rtp_hist": { "histogram": { "field": "rtp", "interval": 1 } }
}
}
7) Data ingestion and text cleanup
Ingest pipelines: normalization, field extraction, geo-enrichment, HTML stripping.
Attachment/ingest-ocr (as needed): indexing PDFs and images (be careful with PII).
Lemmatization: through analyzers or external pipelines (precomputed tokens).
8) Shards, replicas and ILM
8.1 Sizing and sharding
Fewer shards are better. Target: 10-50 GB per shard for mixed loads.
Start with `number_of_shards: 1-3` and scale based on actual load. Replicas: at least 1 in production.
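The sizing rule reduces to simple arithmetic. The sketch below is a rough estimate only: the 30 GB default is an assumed midpoint of the 10-50 GB range, and the function name is hypothetical.

```python
import math


def primary_shards(index_size_gb, target_shard_gb=30.0):
    """Estimate the primary shard count for a target shard size (at least 1)."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


print(primary_shards(5))    # 1
print(primary_shards(120))  # 4
```

A game catalog of a few gigabytes fits in a single primary shard; only large log or history indexes need more.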
8.2 ILM (lifecycle)
hot → warm → cold → delete for logs and promo history.
Force merge segments in the cold phase.
For catalogs and product search: a "perpetual" hot tier with periodic optimization.
8.3 Zero-downtime migration
Create the new index `games_v2`, run `reindex` and backfill, then switch the alias `games` to it. Remove deprecated fields gradually.
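The alias switch itself is a single atomic request to the `_aliases` endpoint. The sketch below only builds the request body; `games_v1` is an assumed name for the old index, and sending the request (for example with an HTTP client to `POST /_aliases`) is left out.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Body for POST /_aliases that moves an alias from one index to another."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# "games_v1" is a hypothetical name for the index being replaced.
print(json.dumps(alias_swap_actions("games", "games_v1", "games_v2"), indent=2))
```

Both actions are applied together, so clients querying `games` never see a moment without a backing index.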
9) Snapshots, DR and updates
Snapshots to object storage (S3/GCS) on a schedule, with regular restore checks.
Rolling upgrades of nodes; verify shard allocation awareness (by zone).
DR plans: cross-cluster replication (CCR) to another region for critical indexes (catalogs, directories).
10) Security and PII
TLS/mTLS between client and cluster.
RBAC: roles per index/operation; Dev/Stage/Prod kept separate.
PII/PCI: do not index fields with personal data unnecessarily; mask them at ingest.
Right to be forgotten: keep references to documents so they can be deleted by user_id; soft-delete + reindex/notification.
11) Observability and search SLO
Metrics:
- P50/P95/P99 query latency, 4xx/5xx errors.
- Cache hit rate (query cache / shard request cache).
- Heap usage, GC pauses, segment merges, thread pools (search/write).
- Hot shards/hot nodes, rejections.
- KNN: `graph_hits`, `search_k`, latency, recall@k.
SLO (examples):
- Game search: P95 ≤ 200 ms, errors < 0.5% in a 30-minute window.
- Suggestions: P95 ≤ 80 ms.
- KNN hybrid: P95 ≤ 350 ms for top-20 results.
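An SLO check of this kind can be sketched from raw latency samples. The example below uses the nearest-rank percentile definition, which is one of several common definitions; the function names and sample values are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of numbers (p in 1..100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(latencies_ms, p95_target_ms):
    """True if the P95 latency is within the target."""
    return percentile(latencies_ms, 95) <= p95_target_ms


# Made-up samples: mostly fast queries with one slow outlier.
latencies = [100] * 19 + [500]
print(percentile(latencies, 95), meets_slo(latencies, 200))  # 100 True
```

In production the percentiles would normally come from the monitoring system rather than from raw samples, but the comparison against the target is the same.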
12) FinOps: cost and performance
Index size: keep tokenization lean, disable unnecessary `fielddata`, keep `doc_values` only where they are needed.
Segments: plan the merge policy and avoid segment fragmentation.
KNN is more expensive in RAM/CPU: limit dims and `num_candidates`, pre-filter with BM25.
Hot fields in RAM: monitor fielddata/heap; move "heavy" aggregations to separate indexes.
13) Sample requests
13.1 Multi-field full-text with boost
json
{
"query": {
"multi_match": {
"query": "book of",
"fields": ["title^4","title.ngram^2","tags^2","description"]
}
},
"sort": ["_score", { "released_at": "desc" }]
}
13.2 Filters + facets
json
{
"query": {
"bool": {
"must": [{ "match": { "title": "египет" }}],
"filter": [
{ "terms": { "provider": ["Novomatic","PragmaticPlay"]}},
{ "range": { "rtp": { "gte": 95 }}}
]
}
},
"aggs": {
"by_provider": { "terms": { "field": "provider" } },
"by_year": { "date_histogram": { "field": "released_at", "calendar_interval": "year" } }
}
}
13.3 Nested attribute filtering
json
{
"query": {
"nested": {
"path": "features",
"query": { "bool": {
"must": [
{ "term": { "features.name": "volatility" }},
{ "term": { "features.value": "high" }}
]
}}
}
}
}
13.4 Log search (ECS) with highlighting
json
{
"query": {
"bool": {
"must": [{ "match_phrase": { "message": "payment declined" }}],
"filter": [
{ "term": { "service.name": "payments" }},
{ "range": { "@timestamp": { "gte": "now-1h" }}}
]
}
},
"highlight": { "fields": { "message": { "number_of_fragments": 0 } } }
}
14) Multi-tenancy and isolation
Index per tenant (better isolation) or a `tenant_id` field + ACL filter (more expensive for aggregations).
Routing by `tenant_id` to localize shards.
Constrain tenant requests with limits/timeouts and `query.phase` guard-rails.
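For the shared-index variant, the tenant filter and a timeout can be enforced in one place before a request is sent. The sketch below only builds the search body; the function names, the tenant value and the `500ms` timeout are assumptions. Routing by `tenant_id` would be passed separately as the `routing` request parameter.

```python
def scope_to_tenant(query, tenant_id):
    """Wrap any query so it can only match documents of one tenant."""
    return {
        "bool": {
            "must": [query],
            "filter": [{"term": {"tenant_id": tenant_id}}],
        }
    }


def tenant_search_body(query, tenant_id, timeout="500ms"):
    """Search body with a mandatory tenant filter and a per-request timeout."""
    return {"query": scope_to_tenant(query, tenant_id), "timeout": timeout}


# "acme" is a hypothetical tenant identifier.
body = tenant_search_body({"match": {"title": "book of"}}, "acme")
print(body)
```

Keeping the wrapper in the service layer means a caller cannot forget or override the tenant filter.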
15) Implementation checklist
1. Schema: `text`/`keyword`/`nested` + multi-fields, `dense_vector` if necessary.
2. Per-language analyzers, synonyms, edge n-gram for auto-completion.
3. Relevance: BM25 boosts + hybrid kNN → rescore.
4. Facets: keyword/nested, aggregations only on "healthy" fields.
5. Indexing: ingest pipelines (normalization), batch loading.
6. Sharding: start small, aliases for migrations, ILM for long-lived logs.
7. DR: snapshot schedule, restore checks, CCR for critical indexes.
8. Security: TLS, RBAC, PII masking, deletion policy.
9. Observability: latency, heap/GC, cache hit rate, hot shards, rejections.
10. FinOps: index size, kNN parameterization, disabling unneeded `doc_values`/`fielddata`.
16) Anti-patterns
One index "for everything": different domains (catalog, logs, transactions) require different settings.
Thoughtless `fuzziness: AUTO` on all fields → slow and noisy.
Synonyms that "eat up meaning": dictionaries are not separated by domain.
No `nested` where fields must stay correlated → false facet matches.
Too many shards (one per document) - cluster state overhead.
No aliases during migrations - downtime and broken links.
Indexing PII "as is" - regulatory risks and expensive reindexing.
17) iGaming/fintech context: quick recipes
Game search: `multi_match` with boosts `title^4`, `tags^2`, facets by provider/volatility, filters by region/currency, hybrid with vectors for "themes" (for example, "Egypt", "fruit classic").
Promos/bonuses: synonyms ("freespins", "free spins"), date filters on `active_from`/`active_to`, suggestions through completion.
KYC/AML logs: ECS schema, full text on `message`, aggregations by `rule_name`, `country`, anomalies via an `@timestamp` histogram.
Provider directory: keyword fields for facets and sorting; text descriptions as `text` with morphology.
Regulatory pages: multilingual fields, `search_as_you_type` for soft suggestions.
Summary
Effective search on Elasticsearch is more than "turning on BM25": it takes the right analyzers and mappings, multi-fields and nested types, a hybrid of BM25 and vectors, careful facets and aggregations, disciplined sharding and ILM, clear SLOs and observability, plus security and FinOps. With these principles in place, search stays fast, relevant and predictable, and withstands traffic peaks on a product platform.