Filtering and Full-Text Search
1) Why you need a search layer
Filtering and full-text search (FTS) provide quick access to data "by meaning," not just by primary keys. A properly designed search layer combines:
- Strict filters (categories, dates, prices, access rights)
- Full text (lexical match and ranking)
- Facets (aggregates for navigation)
- Hybrid Ranking (BM25/TF-IDF + Vector Embeddings)
- Reliable protocols (cursor pagination, token TTL, cross-sharding)
2) Architectural picture
Components:
1. Ingest/ETL → normalization, deduplication, enrichment, building fields for the index.
2. Indexer → inverted index (tokens → documents), columnar structures, vector index (HNSW/IVF-PQ).
3. Query Layer → request parser, application of filters/access rights, shard scheduler, k-way merge.
4. Ranker → BM25 + LTR/Neural re-rank.
5. Serving → cache, cursors, facets, highlights, autocomplete.
6. Observability → latency, quality metrics, A/B experiments.
3) Data and index model
3.1 Fields and analyzers
Types: keyword (exact match), text (analyzed), numeric/date/geo, vector.
Analyzers: tokenization, normalization (lowercase, Unicode NFKC), filters (stopwords, stemming/lemmatization).
Multilingualism: per-field analyzers (ru, uk, en); ICU analysis; transliteration; diacritic handling.
3.2 Inverted index (sparse)
Structure: term → posting list (docID, term freq, positions).
Ranking: BM25 (or classic TF-IDF) with field boosts.
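A toy version of the posting-list structure and BM25 scoring might look like this. It is a sketch under simplifying assumptions: whitespace tokenization, a single field (no boosts), no positions, and one common BM25 formulation with a smoothed IDF; `build_index` and `bm25_search` are illustrative names.

```python
import math
from collections import Counter, defaultdict


def build_index(docs: dict[str, str]):
    """Build term -> {doc_id: term frequency} postings plus document lengths."""
    postings: dict[str, dict[str, int]] = defaultdict(dict)
    lengths: dict[str, int] = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # assumption: trivial analyzer
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    return postings, lengths


def bm25_search(query: str, postings, lengths, k1: float = 1.2, b: float = 0.75):
    """Score documents with BM25 and return (doc_id, score) pairs, best first."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores: dict[str, float] = defaultdict(float)
    for term in query.lower().split():
        docs_with_term = postings.get(term, {})
        if not docs_with_term:
            continue
        df = len(docs_with_term)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in docs_with_term.items():
            # Length normalization: long documents need more occurrences to score as high.
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs = {"d1": "classic slots", "d2": "live casino slots games bonus", "d3": "live roulette"}
postings, lengths = build_index(docs)
results = bm25_search("classic slots", postings, lengths)
```

Field boosts would typically be applied by scoring each field separately and multiplying by the field weight before summing.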
3.3 Vector index (dense)
Text embeddings (for example, 384- to 1024-dimensional).
ANN structures: HNSW, IVF-PQ, Flat (for small sets).
Cosine similarity/inner product; score calibration against BM25 (hybrid).
3.4 Facets and aggregates
Precomputed/columnar storage of values for fast counts.
Hierarchical facets (category/subcategory).
Ranges (price bins, dates).
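Value and range facets can be illustrated with a small in-memory sketch. It assumes the documents have already passed the filters, and it treats range bins as half-open intervals `[low, high)`; `term_facet` and `range_facet` are hypothetical helper names, and a real engine would read from columnar doc values rather than Python dicts.

```python
from bisect import bisect_right
from collections import Counter


def term_facet(docs: list[dict], field: str) -> Counter:
    """Count documents per distinct value of `field`."""
    return Counter(doc[field] for doc in docs if field in doc)


def range_facet(docs: list[dict], field: str, edges: list[float]) -> Counter:
    """Count documents per bin [edges[i], edges[i+1]); values outside the edges are skipped."""
    counts: Counter = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is None or value < edges[0] or value >= edges[-1]:
            continue
        i = bisect_right(edges, value) - 1  # index of the bin's lower edge
        counts[(edges[i], edges[i + 1])] += 1
    return counts


docs = [
    {"brand": "NetEnt", "price": 5},
    {"brand": "NetEnt", "price": 10},
    {"brand": "EGT", "price": 15},
    {"brand": "EGT", "price": 49},
    {"brand": "EGT", "price": 100},
]
brand_counts = term_facet(docs, "brand")
price_counts = range_facet(docs, "price", [0, 10, 20, 50, 100])
```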
4) Queries: filters + full-text + sort
4.1 API Contracts (REST)
Request:
GET /v1/search?q=classic+slots&limit=20&cursor=...&sort=score:desc,created_at:desc
&filters=brand:("NetEnt","EGT"); price:[10 TO 50];published_at:[2024-01-01 TO ]
&facets=brand,year,price:range(0,10,20,50,100)
Response (fragment):
json
{
  "items": [ { "id": "...", "title": "...", "score": 12.3, "highlight": { "content": ["..."] } } ],
  "facets": { "brand": [{"value": "NetEnt", "count": 123}, ...] },
  "page": { "limit": 20, "has_more": true, "next_cursor": "opaque-token" }
}
4.2 GraphQL (simplified)
graphql
type Query {
  search(query: String!, filter: SearchFilter, first: Int, after: String, sort: [Sort!]): SearchConnection!
}
4.3 gRPC
proto
message SearchRequest {
  string query = 1;
  map<string, string> filters = 2;
  int32 page_size = 3;
  string page_token = 4; // cursor
  repeated string facets = 5;
}
5) Natural Language Processing (NLP)
Tokenization/normalization: Unicode-safe, with hyphen/apostrophe handling.
Stopwords: customizable per-language lists.
Stemming vs lemmatization: for ru/uk, lemmatization works better (quality over speed).
Synonyms: bidirectional/directional dictionaries; dictionary versions with TTL.
Typos (fuzzy): Damerau-Levenshtein with a distance limit and boosts for exact matches.
N-grams/edge-ngrams: for autocomplete and suggestions.
Transliteration: correspondence rules such as "щ" ↔ "shch" and "київ" ↔ "kyiv".
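The distance-limited fuzzy match can be sketched with the optimal string alignment variant of Damerau-Levenshtein (insert, delete, substitute, adjacent transposition). This is an illustrative implementation, not a specific library's API; `osa_distance` is a hypothetical name, and it returns `None` as soon as the limit is provably exceeded.

```python
from typing import Optional


def osa_distance(a: str, b: str, max_dist: int = 1) -> Optional[int]:
    """Return the edit distance between a and b, or None if it exceeds max_dist."""
    if abs(len(a) - len(b)) > max_dist:
        return None  # the length difference alone already exceeds the limit
    prev2: list[int] = []
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution or match
            # Adjacent transposition, e.g. "ni" <-> "in".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        if min(cur) > max_dist:
            return None  # row minima never decrease, so the limit cannot be met
        prev2, prev = prev, cur
    return prev[len(b)] if prev[len(b)] <= max_dist else None
```

A distance of 0 identifies the exact match, which can then receive the boost mentioned above.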
6) Relevance and ranking
6.1 Basic lexical scoring
BM25 with 'k1' and 'b' tuned per collection.
Field boosts (title^3, tags^1.5, body^1).
Freshness: 'score += freshness_boost(decay(created_at))'.
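One possible reading of the freshness formula is an exponential decay with a half-life, scaled by a fixed weight. The half-life of 30 days and the weight of 2.0 below are assumptions chosen for illustration, as is the explicit `now` parameter (added so the function is deterministic).

```python
from datetime import datetime, timedelta, timezone


def decay(created_at: datetime, now: datetime, half_life_days: float = 30.0) -> float:
    """Exponential decay in (0, 1]: 1.0 for a brand-new document, 0.5 after one half-life."""
    age_days = max((now - created_at).total_seconds(), 0.0) / 86400
    return 0.5 ** (age_days / half_life_days)


def freshness_boost(decay_value: float, weight: float = 2.0) -> float:
    """Scale the decay into score units; `weight` is a hypothetical tuning knob."""
    return weight * decay_value


def boosted_score(score: float, created_at: datetime, now: datetime) -> float:
    """Apply score += freshness_boost(decay(created_at))."""
    return score + freshness_boost(decay(created_at, now))


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
month_old = boosted_score(10.0, now - timedelta(days=30), now)
```

An additive boost keeps the lexical score dominant as long as the weight is small relative to typical BM25 values.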
6.2 Behavioral signals
Click-through rate, dwell time, saves to favorites (with position-bias correction).
Deduplication: collapse documents with near-identical content (MinHash/SimHash).
6.3 Learning-to-Rank (LTR)
Features: per-field BM25, length, freshness, popularity, phrase match, positional score.
Models: LambdaMART/XGBoost; offline metrics NDCG@k, MAP, Precision@k; online A/B.
6.4 Neural re-ranking
Two-stage: recall (BM25/ANN) → top-N (for example, 200) → cross-encoder re-rank.
Cost control: latency budget; fall back to skipping the neural stage under load.
6.5 Hybrid search (sparse + dense)
Either fusion (score normalization and summation) or multi-stage (dense as re-rank).
Calibration is important: min-max/z-score/quantile mapping.
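The fusion variant with min-max calibration could look like the sketch below. The weight `alpha` and the choice to treat a document missing from one list as scoring 0 there are assumptions; `min_max` and `fuse` are illustrative names.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so sparse and dense scores become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(bm25: dict[str, float], dense: dict[str, float], alpha: float = 0.5):
    """Weighted sum of normalized scores; returns (doc_id, score) pairs, best first."""
    sparse_n, dense_n = min_max(bm25), min_max(dense)
    fused = {
        doc_id: alpha * sparse_n.get(doc_id, 0.0) + (1 - alpha) * dense_n.get(doc_id, 0.0)
        for doc_id in set(sparse_n) | set(dense_n)
    }
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


ranked = fuse({"a": 12.0, "b": 6.0, "c": 3.0}, {"b": 0.9, "c": 0.8, "d": 0.1})
```

Min-max is sensitive to outliers in the candidate set; z-score or quantile mapping trades that sensitivity for extra bookkeeping.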
7) Filtering, facets and access
7.1 Filters
Operators: '=', 'IN', ranges, prefixes, geo-bounding box/geo-distance.
Combinations: 'AND' across filters, 'OR' within a set of values (brand IN ...).
Type safety: numeric fields are not parsed as text.
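The combination rules above (AND across fields, OR within a value set) can be sketched as a small predicate. The filter shapes used here (a list for `IN`, a dict with `gte`/`lte` for ranges, a scalar for equality) are assumptions modeled on the sample requests in this document; `matches` is a hypothetical name.

```python
from typing import Any


def matches(doc: dict[str, Any], filters: dict[str, Any]) -> bool:
    """Return True only if the document satisfies every filter (AND across fields)."""
    for field, cond in filters.items():
        value = doc.get(field)
        if value is None:
            return False  # a missing field never satisfies a filter
        if isinstance(cond, (list, set, tuple)):
            if value not in cond:  # IN: OR within the value set
                return False
        elif isinstance(cond, dict):
            # Range: values are compared in their native type, not as text.
            if "gte" in cond and value < cond["gte"]:
                return False
            if "lte" in cond and value > cond["lte"]:
                return False
        elif value != cond:  # equality
            return False
    return True


filters = {"brand": ["NetEnt", "EGT"], "price": {"gte": 10, "lte": 50}}
doc = {"brand": "NetEnt", "price": 25}
```

Keeping numeric values numeric matters here: compared as strings, "9" would sort after "10" and the range check would give the wrong answer.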
7.2 Facets
Cheap counts from precomputed structures.
"Applied" facets show the counts that remain after the active filters are applied.
7.3 Access/multi-tenancy
Security filters are applied before ranking (pre-filter).
ABAC/RBAC fields in the document ('tenant_id', 'visibility', 'acl').
The request token is signed; in multi-tenant setups, a 'tenant_id' filter is added automatically.
8) Pagination, cursors and consistency
Seek-cursor pagination on '(score, tie-breaker)', or on '(created_at, id)' when sorting by time.
Opaque 'page_token' with HMAC and TTL.
Consistency: the index is near-real-time (NRT), with a delay of 0.5-2 s between a write and its visibility. Document it in the SLA.
Cross-shard: local search → k-way merge by global order, per-shard cursors in the token.
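An opaque page token with an HMAC signature and a TTL might be built as follows. This is a sketch: the token layout (32-byte signature followed by a JSON payload, base64url-encoded), the `SECRET` value, and the function names are all assumptions for illustration. The cursor payload is where the last sort key and any per-shard cursors would go.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"demo-secret"  # hypothetical; load from a secret store in practice


def encode_token(cursor: dict, ttl_s: int = 300, now: Optional[float] = None) -> str:
    """Serialize the cursor with an expiry time and sign it."""
    issued = time.time() if now is None else now
    payload = json.dumps({"c": cursor, "exp": int(issued + ttl_s)},
                         sort_keys=True, separators=(",", ":")).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(sig + payload).decode()


def decode_token(token: str, now: Optional[float] = None) -> dict:
    """Verify signature and TTL, then return the cursor; raise ValueError otherwise."""
    raw = base64.urlsafe_b64decode(token.encode())
    sig, payload = raw[:32], raw[32:]
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):  # constant-time comparison
        raise ValueError("bad signature")
    data = json.loads(payload)
    if data["exp"] < (time.time() if now is None else now):
        raise ValueError("token expired")
    return data["c"]


token = encode_token({"score": 12.3, "id": "doc42"}, ttl_s=60, now=1000.0)
cursor = decode_token(token, now=1010.0)
```

Signing keeps clients from forging or editing cursors; the payload is still readable after base64 decoding, so anything sensitive would need encryption as well.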
9) Autocomplete and suggestions
Suggesters: prefix-trie / edge-ngrams on the `title` field.
Popular queries: click log → suggestions by popularity + personalization (segments).
Spell-as-you-type: fast fuzzy search with distance limit '<= 1'.
GET /v1/suggest?q=kaz&limit=8&locale=ru
→ ["casino", "casual games", ...]
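An edge-ngram suggester ordered by popularity can be sketched in a few lines. The popularity numbers, the `min_len`/`max_len` bounds, and the function names are assumptions; the sketch handles only prefixes of the whole phrase, without transliteration or fuzzy matching.

```python
from collections import defaultdict


def build_edge_ngrams(phrases: dict[str, int], min_len: int = 2, max_len: int = 10):
    """Map every prefix of length min_len..max_len to phrases sorted by popularity."""
    index: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for phrase, popularity in phrases.items():
        text = phrase.lower()
        for n in range(min_len, min(max_len, len(text)) + 1):
            index[text[:n]].append((popularity, phrase))
    for bucket in index.values():
        # Most popular first; alphabetical order breaks ties deterministically.
        bucket.sort(key=lambda entry: (-entry[0], entry[1]))
    return index


def suggest(index, q: str, limit: int = 8) -> list[str]:
    """Return up to `limit` suggestions for the typed prefix."""
    return [phrase for _, phrase in index.get(q.lower(), [])[:limit]]


index = build_edge_ngrams({"casino": 90, "casual games": 40, "cashback": 60, "slots": 75})
hits = suggest(index, "cas")
```

Lookups are a single dictionary access, at the price of an index that stores each phrase once per prefix length.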
10) Highlights and snippets
Positional index → retrieving phrases with matches.
HTML escaping, length limit, merging of adjacent fragments.
Ranking snippets by density of relevant terms.
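A minimal highlighter that escapes HTML before wrapping matches could look like this. It is a simplified sketch: it takes one fragment around the first match instead of ranking fragments by term density, it matches raw substrings rather than analyzed tokens, and `highlight` with its `max_len` and `tag` parameters is a hypothetical signature.

```python
import html
import re


def highlight(text: str, terms: list[str], max_len: int = 80, tag: str = "em") -> str:
    """Return an HTML-safe fragment of `text` with matched terms wrapped in <tag>."""
    if not terms:
        return html.escape(text[:max_len])
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    first = pattern.search(text)
    # Center the fragment window on the first match when there is one.
    start = max(first.start() - max_len // 2, 0) if first else 0
    fragment = text[start:start + max_len]
    out, pos = [], 0
    for m in pattern.finditer(fragment):
        out.append(html.escape(fragment[pos:m.start()]))  # escape untrusted text
        out.append(f"<{tag}>{html.escape(m.group())}</{tag}>")
        pos = m.end()
    out.append(html.escape(fragment[pos:]))
    return "".join(out)


snippet = highlight("Live <b>casino</b> games", ["casino"])
```

Escaping each piece separately ensures that the only markup in the output is the highlight tag itself.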
11) Performance, cache and SLO
Indexes: hot segments in memory; compressed postings; doc values for facets.
Cache: L1 (process), L2 (Redis), facets/aggregates cache; invalidated by index version.
SLO: P95 < 150-200 ms at 'k <= 20', P99 < 500 ms; availability 99.9%.
Backpressure: decrease 'k', disable the neural stage when overloaded.
Rate limiting per API key/user/tenant.
12) Observability and quality metrics
Technical metrics:
- `search_latency_ms` (P50/P95/P99), `qps`, `timeouts`, `error_rate`
- `cache_hit_ratio`, `facet_cache_hit`, `rerank_share`
- `shard_fanout`, `merge_time_ms`, `ann_recall@k`
Quality metrics:
- NDCG@k, MAP, MRR, Recall@k, Precision@k on labeled samples.
- CTR@k, sCTR (satisfied clicks), dwell time, bounce (pogo-stick rate).
A/B: fix "guardrail" metrics (latency, errors) + a target metric (NDCG proxy).
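As a reference point for the offline quality metrics, NDCG@k can be computed as below. The sketch uses the exponential gain form `2^rel - 1` and, as a simplifying assumption, derives the ideal ordering from the returned list only rather than from every judged document for the query.

```python
import math


def dcg_at_k(relevances: list[int], k: int) -> float:
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list[int], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([3, 2, 1, 0], k=4)
swapped = ndcg_at_k([0, 1], k=2)
```

Here `relevances` lists the graded labels of the results in the order the system returned them.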
13) Testing
Relevance unit tests: checking expected matches for key queries.
Property-based: robustness to typos/synonyms/languages.
Pagination: no duplicates at the page boundary (seek contracts).
Security: access filters are always applied (even to facet counts).
Dictionary regressions: versioning synonyms and fuzzy rules.
14) Security and privacy
Fields with PII are not indexed as text; store separately/encrypt.
Minimize stored source data ('store = false', snippet fields only).
Query privacy: do not log raw queries with PII; anonymization/hashing.
Multi-tenant: strict index isolation or a mandatory 'tenant_id' filter.
15) Migrations and interoperability
Index schema versioning (v1→v2) with dual writes and a gradual switchover.
Analyzer compatibility: keep old analyzer chains in place until the data has been re-indexed.
Rotation of synonym/stopword dictionaries: 'version', 'activated_at', rollback.
16) Practical recipes
16.1 Classic Lexical Search (BM25)
Fields: 'title^3', 'tags^2', 'body^1'.
Analyzers: language-specific + lemmatization.
Fuzzy for short queries ('<= 3' tokens), 'fuzziness = 1'.
16.2 Hybrid sparse + dense
1. ANN search by query embedding (k = 200)
2. Merge with top-200 BM25
3. Calibration and rank fusion
4. Take top-N (N = 20); optionally re-rank with a cross-encoder if the budget allows.
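Steps 2-4 can be sketched with Reciprocal Rank Fusion (RRF), which is one common rank-fusion method and is used here as an assumption, since the recipe does not name a specific one. RRF combines ranks instead of raw scores, so it needs no score calibration; `k = 60` is a conventional smoothing constant, and `rrf` is an illustrative name.

```python
def rrf(rankings: list[list[str]], k: int = 60, top_n: int = 20) -> list[str]:
    """Fuse several best-first rankings of doc IDs into one list of top_n IDs."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); documents found by both
            # retrievers accumulate two contributions.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_n]


bm25_top = ["a", "b", "c"]   # stand-in for the top-200 BM25 list
ann_top = ["b", "d", "a"]    # stand-in for the top-200 ANN list
fused = rrf([bm25_top, ann_top])
```

The fused list is what would be passed to the optional cross-encoder stage.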
16.3 Faceted catalog navigation
Hard pre-filter by rights/tenant
Post-filter facets (counts including active filters)
Sort by relevance or business field (price/recency)
17) Sample requests (pseudo-DSL)
Filters and sorting:
json
{
  "query": "live casino",
  "filters": {
    "country": ["EE", "LV", "LT"],
    "license": ["MGA", "UKGC"],
    "launched_at": {"gte": "2023-01-01"}
  },
  "sort": ["_score:desc", "launched_at:desc"],
  "facets": ["country", "license"],
  "page": {"limit": 20, "cursor": "opaque"}
}
Geo search:
json
{
  "query": "casino",
  "geo": {"lat": 59.437, "lon": 24.753, "radius_km": 50}
}
Autocomplete:
json
{ "prefix": "evo", "field": "brand_suggest", "limit": 8 }
18) UX patterns
Active filter chips + "reset all".
Empty results: show "try..." suggestions (synonyms, remove a filter).
Zero-state suggestions: popular queries/categories.
Cursor pagination ("More" button) and infinite scrolling; fixed indicator of applied filters.
Separate toggles: "allow typos", "exact phrase match".
19) Frequent errors and anti-patterns
No tie-breaker when sorting → duplicates/jumps.
Facets that ignore active filters → misleading counts.
Applying access filters after ranking (post-filter).
Mixing different languages with one analyzer.
Deep pagination with OFFSET/LIMIT instead of a seek cursor.
Unbounded fuzzy matching → latency blow-up.
20) Implementation checklist
1. Define the fields and their types, assign per-locale analyzers.
2. Design the inverted index + (optional) vector ANN index.
3. Implement a query parser and secure pre-filters.
4. Set up BM25 and field boosts; attach facets.
5. Introduce cursors (opaque, HMAC, TTL) and k-way merge across shards.
6. Add autocomplete, highlights, safe escaping.
7. Metrics: latency, NDCG@k, CTR; L1/L2 cache.
8. A/B framework for tuning relevance.
9. Document the SLA: NRT delay, bounds on 'limit', consistency guarantees.
10. Migration plan: versions of index, dictionaries and analyzers.
A well-designed filtering and full-text search layer is not only a fast index, but also a clear protocol contract with cursors, security, predictable UX, and measurable relevance. This approach scales from thousands to billions of documents and supports both classical lexical search and modern hybrid scenarios with neural network ranking.