GambleHub

PII Data Tokenization

1) Why tokenization and what exactly we tokenize

The goal: keep "raw" personal data out of the operational circuit and analytics, reduce the risk of leaks, and simplify regulatory compliance.
PII examples: full name, phone number, email, address, passport/ID, TIN, IP addresses, cookie IDs, payment identifiers, date of birth, etc.

The idea: instead of the original value, we use a token - a safe substitute that:
  • does not disclose the original value;
  • can be reversible (via a secure detokenization service) or irreversible;
  • can be deterministic (for join/search) or non-deterministic (for maximum privacy).

2) Threat model and control objectives

Risks: leaks of databases/logs/backups, insider reads, correlation across repeated values, unauthorized detokenization, dictionary/format attacks (email/phone), reuse of secrets.

Objectives:

1. Separate trust zones: the application works with tokens; the originals live only in the token service.

2. Guarantee cryptographic strength of tokens and controlled detokenization.

3. Reduce the blast radius with KMS/HSM, key rotation, and crypto erasure.

4. Ensure suitability for search/joins/analytics at a controlled level of risk.

3) Typology of tokens

Property      | Options                             | Why
------------- | ----------------------------------- | ---
Reversibility | reversible / irreversible           | customer care vs analytics
Determinism   | deterministic / nondeterministic    | joins/deduplication vs anti-correlation
Format        | FPE (format-preserving) / arbitrary | mask adherence (phone/BIN) vs random strings
Scope         | per-tenant / per-dataset / global   | isolation and collision management
Lifetime      | permanent / short-lived             | durable links vs disposable tokens

Recommended profiles:
  • PII for search/joins: reversible deterministic, scope-bound (tenant/purpose), with keys held in KMS.
  • PII for operational masking (UI): reversible nondeterministic with a limited lifetime to reduce reuse risks.
  • Gray-zone analytics: irreversible (keyed HMAC/salted hash) or DP aggregations.
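The profile choices above can be captured in a small per-field configuration. A minimal sketch in Python; the field names and profile keys are illustrative assumptions, not a fixed schema:

```python
# Illustrative per-field tokenization profiles (names/keys are assumptions).
PROFILES = {
    "email": {"reversible": True, "deterministic": True,
              "scope": "tenant", "reason": "search/joins"},
    "phone": {"reversible": True, "deterministic": True,
              "scope": "tenant", "reason": "search/joins"},
    "name":  {"reversible": True, "deterministic": False,
              "ttl_seconds": 3600, "reason": "UI masking"},
    "analytics_id": {"reversible": False, "deterministic": True,
                     "scope": "dataset", "reason": "keyed hash / DP aggregates"},
}

def profile_for(field: str) -> dict:
    """Look up the tokenization profile for a field; fail closed on unknown fields."""
    if field not in PROFILES:
        raise KeyError(f"no tokenization profile for field {field!r}")
    return PROFILES[field]
```

Failing closed on unknown fields keeps new PII columns from slipping through untokenized.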

4) Tokenization architecture

4.1 Components

Tokenization Service (TS): the "tokenize/detokenize/search" API; a high-trust zone.
Token Vault (TV): protected mapping `token → original (+ metadata)`.
KMS/HSM: root key storage (KEK), wrap/sign operations.
Policy Engine: who, where, and why may detokenize; scope/TTL/rate limits; mTLS + OIDC/ABAC.
Audit & Immutability: tamper-proof logs of all tokenization/detokenization operations.

4.2 Key hierarchy

Root/KEK in KMS/HSM (per organization/region/tenant).
DEK-PII per data domain (email/phone/address) and/or dataset.
Rotation: rewrap DEKs without re-encrypting the entire vault; keep a "key compromise" plan.

4.3 Flows

1. Tokenize: client → TS (mTLS + AuthN/AuthZ) → normalization → token computation → write to TV → return the token.
2. Detokenize: client → TS → policy/reason check → return the original (or reject).
3. Search/Match: deterministic tokenization allows searching by token; for email/phone, normalize the format before tokenizing.
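The flows above can be sketched end to end. This is an illustrative in-memory model only (the HMAC-based token, the `tok_` prefix, and the `allowed` flag are assumptions); a real service would sit behind mTLS, keep keys in KMS, and persist the vault:

```python
import base64
import hashlib
import hmac

class TokenizationService:
    """Minimal in-memory sketch of the tokenize/detokenize flow (not production)."""

    def __init__(self, key: bytes):
        self._key = key
        self._vault = {}  # token -> original; stands in for the Token Vault (TV)

    def tokenize(self, field: str, value: str, scope: str) -> str:
        normalized = value.strip().lower()  # normalization step
        mac = hmac.new(self._key,
                       f"{scope}|{field}|{normalized}".encode(),
                       hashlib.sha256).digest()
        # Deterministic: the same (scope, field, value) always yields the same token.
        token = "tok_" + base64.urlsafe_b64encode(mac).decode().rstrip("=")
        self._vault[token] = normalized  # write to the vault
        return token

    def detokenize(self, token: str, allowed: bool) -> str:
        if not allowed:  # stands in for the policy/reason check
            raise PermissionError("detokenization denied by policy")
        return self._vault[token]
```

Determinism is what makes the search/match path work: tokenizing a lookup value reproduces the stored token.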

5) Token designs (crypto design)

5.1 Reversible (recommended for the operational circuit)

AES-SIV/AEAD envelope: `cipher = AEAD_Encrypt(DEK, PII, AAD = scope|tenant|field)`; `token = prefix || iv || cipher || tag`.
FPE (FF1/FF3-1) for fixed formats (e.g., a 10-digit phone number without the country code). Apply with caution and with a correct domain (alphabet/length).
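Real AES-SIV requires a crypto library, but the SIV idea - a synthetic IV computed as a MAC over the AAD and plaintext, which makes encryption deterministic and authenticated - can be sketched with stdlib HMAC. This is an illustrative construction for intuition only, not production cryptography:

```python
import hashlib
import hmac

def _keystream(key: bytes, iv: bytes, n: int) -> bytes:
    """HMAC-in-counter-mode keystream (illustrative stand-in for AES-CTR)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hmac.new(key, iv + counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:n]

def siv_encrypt(k_mac: bytes, k_enc: bytes, plaintext: bytes, aad: bytes) -> bytes:
    # Synthetic IV: a MAC over AAD and plaintext -> deterministic for equal inputs.
    iv = hmac.new(k_mac, aad + b"|" + plaintext, hashlib.sha256).digest()[:16]
    ct = bytes(a ^ b for a, b in zip(plaintext, _keystream(k_enc, iv, len(plaintext))))
    return iv + ct  # token body = IV || ciphertext; the IV doubles as the auth tag

def siv_decrypt(k_mac: bytes, k_enc: bytes, token: bytes, aad: bytes) -> bytes:
    iv, ct = token[:16], token[16:]
    pt = bytes(a ^ b for a, b in zip(ct, _keystream(k_enc, iv, len(ct))))
    expected = hmac.new(k_mac, aad + b"|" + pt, hashlib.sha256).digest()[:16]
    if not hmac.compare_digest(iv, expected):  # wrong AAD or tampering -> reject
        raise ValueError("authentication failed (wrong AAD or corrupted token)")
    return pt
```

Note how a mismatched AAD fails decryption: this is what binds a token to its scope/tenant/field.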

5.2 Irreversible (analytics / de-identification)

Keyed HMAC/hash: `token = HMAC(K_scope, PII_normalized)`; keep salt/pepper separately; one key per tenant or dataset.
Minimize collision risk by choosing an appropriate function (SHA-256/512) and domain.
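A minimal sketch of this keyed-HMAC token in Python (the function name and base64url encoding are assumptions):

```python
import base64
import hashlib
import hmac

def analytics_token(pii_normalized: str, key_scope: bytes) -> str:
    """Irreversible token, deterministic within a scope: token = HMAC(K_scope, PII)."""
    digest = hmac.new(key_scope, pii_normalized.encode("utf-8"),
                      hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
```

The same value under the same scope key always yields the same token (so joins work), while different scope keys yield uncorrelated tokens.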

5.3 Determinism and scope

For joins, use a deterministic scheme with `AAD = tenant || purpose || field` → the same value yields different tokens for different purposes.
For anti-correlation across services, use different keys/scopes.
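One way to obtain a different key per scope is to derive it from a master secret. A sketch (simplified to a single HMAC step as an illustration; a real deployment would use a proper KDF such as HKDF, and keep the master key in KMS):

```python
import hashlib
import hmac

def derive_scope_key(master: bytes, tenant: str, purpose: str, field: str) -> bytes:
    """Derive an independent key per {tenant, purpose, field} from a master secret."""
    info = f"{tenant}|{purpose}|{field}".encode("utf-8")
    return hmac.new(master, info, hashlib.sha256).digest()
```

Feeding the same PII into tokens keyed by `derive_scope_key(master, t, "kyc", f)` and `derive_scope_key(master, t, "analytics", f)` yields unrelated tokens, which is exactly the anti-correlation property described above.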

5.4 Minimizing dictionary attacks

Normalization (canonicalization of email/phone); a pepper kept in KMS; limiting the exposed domain (do not return "no record found" errors as a side channel); rate limits and CAPTCHA/proxy protection for public endpoints.

6) API design and schematics

6.1 REST/gRPC (one option)

`POST /v1/tokenize { field, value, scope, tenant_id, purpose } -> { token, meta }`

`POST /v1/detokenize { token, purpose } -> { value }` (mTLS + OIDC + ABAC; minimal disclosure on output)

`POST /v1/match { field, value } -> { token }` (deterministic search path)

6.2 Storage schema (TV)

Table `tokens(field, scope, tenant_id, token, created_at, version, wrapped_key_id, hash_index)`

Indexes: by `token`, and by `(tenant_id, field, hash_index)` for deduplication/search.
The hash index (HMAC of the normalized PII) allows searching without detokenization.
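A sketch of how such a hash index can answer equality lookups without touching the originals. The row layout mirrors the table above; `find_token` is an illustrative linear scan standing in for a real database index lookup:

```python
import hashlib
import hmac

def hash_index(pii_normalized: str, index_key: bytes) -> str:
    """HMAC-based hash_index: supports equality search without detokenization."""
    return hmac.new(index_key, pii_normalized.encode("utf-8"),
                    hashlib.sha256).hexdigest()

def find_token(rows: list, tenant_id: str, field: str,
               value_norm: str, index_key: bytes):
    """Search tokens by (tenant_id, field, hash_index), as the index above would."""
    h = hash_index(value_norm, index_key)
    for row in rows:
        if (row["tenant_id"], row["field"], row["hash_index"]) == (tenant_id, field, h):
            return row["token"]
    return None
```

The lookup recomputes the HMAC of the normalized query value and compares it against stored indexes; raw PII never leaves the caller.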

6.3 Normalization pipelines

email: lowercase, trim, canonical local part (without aggressively stripping dots for all domains).
phone: E.164 (with country code), formatting characters removed.
address/name: rule-based transliteration, trim, collapsed whitespace.
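These pipelines might look like the following sketch. The default country code is an assumption for illustration, and real E.164 handling should use a dedicated phone-number library:

```python
import re

def normalize_email(raw: str) -> str:
    """Lowercase + trim; deliberately does NOT strip dots from the local part."""
    return raw.strip().lower()

def normalize_phone(raw: str, default_cc: str = "+380") -> str:
    """Reduce to an E.164-like form: keep digits, ensure a country code.
    default_cc is an illustrative assumption, not a universal rule."""
    digits = re.sub(r"[^\d+]", "", raw.strip())
    if digits.startswith("+"):
        return digits
    if digits.startswith("00"):          # international prefix form
        return "+" + digits[2:]
    return default_cc + digits.lstrip("0")  # local form: prepend country code
```

Normalizing before tokenization is what makes deterministic tokens match across differently formatted inputs.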

7) Multi-tenancy and isolation

Keys and namespaces per tenant: KEK/DEK per tenant.
Detokenization policies: role + purpose + reason code + event audit.
Crypto deletion of tenant data: revoke the KEK and destroy the DEKs → the vault becomes useless (for that tenant's records).
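A toy model of crypto deletion: once the tenant's KEK is disabled, unwrap fails, so every wrapped DEK (and thus that tenant's vault records) becomes unreadable. The XOR "wrapping" is a stand-in for real AES key wrap in a real KMS:

```python
class MiniKMS:
    """Toy KMS illustrating crypto deletion; not a real key-management system."""

    def __init__(self):
        self._keys = {}  # kek_id -> {"material": bytes, "enabled": bool}

    def create_key(self, kek_id: str, material: bytes) -> None:
        self._keys[kek_id] = {"material": material, "enabled": True}

    def disable_key(self, kek_id: str) -> None:
        """Crypto deletion step: all unwraps under this KEK fail from now on."""
        self._keys[kek_id]["enabled"] = False

    def unwrap(self, kek_id: str, wrapped_dek: bytes) -> bytes:
        entry = self._keys[kek_id]
        if not entry["enabled"]:
            raise PermissionError(f"KEK {kek_id!r} disabled: detokenization impossible")
        # Toy "unwrap" via XOR; a real KMS uses AES key wrap inside an HSM.
        return bytes(a ^ b for a, b in zip(wrapped_dek, entry["material"]))
```

The data itself is never touched; only the key hierarchy changes, which is what makes the erasure fast and auditable.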

8) Integrations

8.1 Databases and caches

Store only tokens in operational tables.
Rare cases may require on-the-fly detokenization through a proxy/agent.
Token caches: in memory only, with a short TTL, never written to disk.
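A minimal in-memory TTL cache along these lines (expiry checked on access; a real cache would also bound its size and evict proactively):

```python
import time

class TTLCache:
    """In-memory-only cache with a TTL; nothing is ever written to disk."""

    def __init__(self, ttl_seconds: float):
        self._ttl = ttl_seconds
        self._data = {}  # key -> (value, expires_at)

    def put(self, key, value) -> None:
        self._data[key] = (value, time.monotonic() + self._ttl)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if time.monotonic() >= expires_at:
            del self._data[key]  # drop expired entries on access
            return None
        return value
```

Keeping the TTL short bounds how long a leaked process memory dump stays useful to an attacker.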

8.2 Analytics/BI/ML

In the DWH/lake, store tokens or hashes. Joins are performed on deterministic tokens of the matching scope.
For ML, prefer pseudonymization and aggregates; avoid re-identifying individuals.

8.3 Support services and anti-fraud

UI shows masked values (e.g., only the `+380` country-code prefix of a phone number), with occasional detokenization for a documented reason (reason code) plus a second factor.

9) Rotation, versions and lifecycle

Separate the token ID from the encryption version (v1/v2).
Rewrap: change the KEK without touching the data.
Incident plan: key compromise → immediate revocation, a freeze on detokenization, fallback to read-only, rewrap rollout.
Token TTL: by policy - permanent (identifiers) or short-lived (one-time links/temporary integrations).
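Versioning can be as simple as a token prefix, so tokens written under an older key version remain identifiable and decryptable after rotation. A sketch; the `vN:` format is an assumption:

```python
def make_versioned_token(version: int, body: str) -> str:
    """Prefix a token with its encryption version, e.g. 'v2:<body>'."""
    return f"v{version}:{body}"

def parse_versioned_token(token: str) -> tuple:
    """Split a token into (version, body); old versions route to old DEKs."""
    prefix, body = token.split(":", 1)
    if not prefix.startswith("v"):
        raise ValueError("missing version prefix")
    return int(prefix[1:]), body
```

On detokenize, the parsed version selects which (wrapped) DEK to unwrap, which is what lets a KEK rewrap proceed without rewriting stored tokens.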

10) Performance and reliability

Hardware acceleration (AES-NI/ARMv8), connection pools to KMS, a cache of wrapped DEKs.
Scale TS horizontally; split read/write paths.
Idempotency keys so tokenize requests can be safely retried on network failures.
DR/HA: multi-region, asynchronous vault replica, regular recovery tests.
SLO: p99 latency `tokenize` ≤ 50-100 ms; `detokenize` ≤ 50 ms; availability ≥ 99.9%.

11) Observability, audit, compliance

Metrics: QPS per method, AuthN/AuthZ errors, share of detokenizations (by role/purpose), cache hit rate, KMS operation latency.
Audit (immutable): every detokenization with who/what/why/from where, query hash, result.
Retention and WORM policies for the log (see "Audit and immutable logs").
Compliance: GDPR (minimization, right to erasure via crypto erasure), PCI DSS (for PAN - FPE/pseudonymization), ISO/SOC reporting.

12) Testing and safety

Crypto unit tests: stability of deterministic tokens, AAD verification and failure on mismatch.
Negative tests: dictionary attacks, format reversal, rate limits, CSRF (for web panels), SSRF (for backends).
Chaos: KMS/vault unavailable, stale key, partial replication.
Periodic red-team attempts to detokenize without a reason and via side channels.

13) Mini recipes

Deterministic reversible token (AEAD SIV, pseudocode):

  pii_norm = normalize(value)
  aad      = scope || tenant || field
  dek      = kms.unwrap(kek_id, wrapped_dek_for_field)
  token    = aead_siv_encrypt(dek, pii_norm, aad)   # deterministic
  store_vault(token, pii_norm, meta)
  return token

Irreversible analytics token (HMAC):

  pii_norm = normalize(value)
  pepper   = kms.get_secret("pepper/" + tenant + "/" + field)
  token    = HMAC_SHA256(pepper, pii_norm)   # deterministic within a scope
  return base64url(token)

Detokenization policy (idea):

  allow if role in {SupportL2, Risk, DPO}
    and purpose in {KYC, Chargeback, DSAR}
    and mTLS and OIDC_claims match tenant
    and reason_code provided and ticket_id linked
    and rate_limit per actor <= N/min

Tenant crypto deletion:

  kms.disable_key(kek_tenant)     # unwrap is blocked → detokenization becomes impossible
  schedule_destroy(kek_tenant, hold_days=7)

14) Frequent mistakes and how to avoid them

Tokens in logs. Mask the tokens themselves (especially reversible ones) - they are sensitive data.
A single key "for everything." Split by tenant/field/purpose; use AAD.
Ad-hoc normalization. Inconsistent canonicalization breaks search/joins.
Detokenization without a reason or limits. Always require a reason code, audit, and rate limits.
FPE as a panacea. Use it only when the format is truly required, and with a correct domain/keys.
Long-lived caches on disk. Cache only in memory, with a TTL.
No rewrap process. Zero-downtime KEK rotation is mandatory.

15) Checklists

Before production

  • Token profiles selected per field/purpose (reversibility/determinism/scope).
  • Key hierarchy (KEK/DEK), KMS policies, and key-operation auditing configured.
  • Input normalization and format validation pipeline implemented.
  • Rate limits, reason codes, and immutable audit enabled.
  • Tests for dictionary attacks, format attacks, and role-based access passed.
  • DR/vault replica and a key-compromise plan in place.

Operation

  • Monthly detokenization report (who/why/how much).
  • Periodic rotation of KEK/pepper; rewrap of DEKs.
  • Red-team exercises against unauthorized detokenization/side channels.
  • Revise normalization as new formats/regions emerge.

16) FAQ

Q: Is tokenization the same as anonymization?
A: No. Tokenization is pseudonymization: the original can be restored (or matched) while the keys/vault exist. Leaving the scope of GDPR requires genuine anonymization.

Q: How to search by email/phone without detokenization?
A: Deterministic tokenization with canonicalization. For addresses/full names - hash indexes/search keys and auxiliary tables.

Q: When is FPE needed?
A: When an external contract/schema requires format (length/alphabet). In other cases, regular AEAD tokens are simpler and safer.

Q: Is it possible to have one token for all purposes?
A: Prefer different scopes (scope/purpose): the same PII yields different tokens for different tasks → reduces the risk of correlation.

Q: How do you exercise the "right to remove"?
A: Crypto deletion: revoke the KEK/DEK for the corresponding dataset and/or delete the vault entry plus destroy the per-field/per-batch keys; in analytics - TTL/aggregation/de-identification.

Related Materials:
  • "Secret Management"
  • "At Rest Encryption"
  • "In Transit Encryption"
  • "Privacy by Design (GDPR)"
  • "Audit and Immutable Logs"
  • "Key Management and Rotation"