PII Data Tokenization
1) Why tokenization and what exactly we tokenize
The goal: remove access to "raw" personal data from the operational path and analytics, reduce the risk of leaks, and simplify regulatory compliance.
PII examples: full name, phone number, email, address, passport/ID, TIN, IP addresses, cookie IDs, payment identifiers, date of birth, etc.
A token:
- does not disclose the original value;
- can be reversible (via a secure detokenization service) or irreversible;
- can be deterministic (for joins/search) or non-deterministic (for maximum privacy).
2) Threat model and control objectives
Risks: leaks of databases/logs/backups, insider reads, correlation of repeated values, unauthorized detokenization, dictionary/format attacks (email/phone), reuse of secrets.
Objectives:
1. Separate trust zones: applications work with tokens; originals live only in the token service.
2. Guarantee cryptographic strength of tokens and controlled detokenization.
3. Reduce the blast radius with KMS/HSM, key rotation and crypto-shredding.
4. Keep the data usable for search/joins/analytics at a controlled level of risk.
3) Typology of tokens
Recommended profiles:
- PII for search/joins: reversible deterministic, bound to a scope (tenant/purpose), keys wrapped by KMS.
- PII for operational masking (UI): reversible non-deterministic with a limited lifetime to reduce reuse risk.
- For gray-zone analytics: irreversible (keyed HMAC/salted hash) or DP aggregations.
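The profiles above can be captured as a small registry so that services select a profile per field instead of hard-coding crypto choices. A minimal sketch; the `TokenProfile` name and the example entries are illustrative, not a prescribed API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TokenProfile:
    reversible: bool        # original recoverable via the vault?
    deterministic: bool     # same input -> same token (enables join/search)
    scope: str              # key/namespace boundary
    ttl_seconds: Optional[int] = None  # None = permanent token

# Illustrative mapping of use cases to the profiles recommended above
PROFILES = {
    "email_search":   TokenProfile(reversible=True,  deterministic=True,  scope="tenant|purpose|field"),
    "ui_masking":     TokenProfile(reversible=True,  deterministic=False, scope="tenant|field", ttl_seconds=3600),
    "analytics_hash": TokenProfile(reversible=False, deterministic=True,  scope="tenant|dataset"),
}
```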
4) Tokenization architecture
4.1 Components
Tokenization Service (TS): `tokenize/detokenize/search` API, high-trust zone.
Token Vault (TV): protected map `token → original (+ metadata)`.
KMS/HSM: root key storage (KEK), wrapping/signing operations.
Policy Engine: who, where and why may detokenize; scope/TTL/rate-limits; mTLS + OIDC/ABAC.
Audit & Immutability: immutable logs of all tokenize/detokenize operations.
4.2 Key hierarchy
Root/KEK in KMS/HSM (per organization/region/tenant).
DEK-PII per data domain (email/phone/address) and/or dataset.
Rotation: rewrap DEKs without re-encrypting the entire vault; have a key-compromise plan.
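To make rewrap concrete, here is a toy illustration: the DEK is stored wrapped under the KEK, so rotating the KEK means re-wrapping one small key and never touching the data encrypted under the DEK. The XOR/SHA-256 "cipher" below is purely illustrative and NOT real key wrapping; production would use AES-KW or a KMS call:

```python
import hashlib
import secrets

def keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    # Toy keystream (SHA-256 in counter mode) -- for illustration only, not real crypto.
    out = b""
    ctr = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return out[:n]

def wrap(kek: bytes, dek: bytes):
    nonce = secrets.token_bytes(16)
    return nonce, bytes(a ^ b for a, b in zip(dek, keystream(kek, nonce, len(dek))))

def unwrap(kek: bytes, nonce: bytes, wrapped: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(wrapped, keystream(kek, nonce, len(wrapped))))

# Rewrap: only the wrapped DEK changes; data encrypted under the DEK stays untouched.
old_kek, new_kek = secrets.token_bytes(32), secrets.token_bytes(32)
dek = secrets.token_bytes(32)
nonce, wrapped = wrap(old_kek, dek)
new_nonce, new_wrapped = wrap(new_kek, unwrap(old_kek, nonce, wrapped))
assert unwrap(new_kek, new_nonce, new_wrapped) == dek
```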
4.3 Flows
1. Tokenize: client → TS (mTLS + authn/authz) → normalization → token computation → write to TV → return token.
2. Detokenize: client → TS → policy/reason check → return original (or reject).
3. Search/Match: deterministic tokenization allows lookup by token; for email/phone, normalize the format before tokenizing.
5) Token designs (crypto design)
5.1 Reversible (recommended for the operational path)
AES-SIV/AEAD envelope: `cipher = AEAD_Encrypt(DEK, PII, AAD=scope|tenant|field)`; token = `prefix || nonce || cipher || tag`.
FPE (FF1/FF3-1) for fixed formats (e.g. a 10-digit phone number without country code). Apply with caution and a correctly sized domain (alphabet/length).
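The `prefix || nonce || cipher || tag` layout can be framed explicitly so tokens stay self-describing and URL-safe. A sketch of the framing only (not the cryptography); the `tok1` version prefix and length-prefixed fields are assumptions:

```python
import base64
import struct

PREFIX = b"tok1"  # assumed version/profile prefix

def pack_token(nonce: bytes, cipher: bytes, tag: bytes) -> str:
    # prefix || len(nonce) || nonce || len(tag) || tag || cipher, base64url without padding
    body = PREFIX + struct.pack(">B", len(nonce)) + nonce \
           + struct.pack(">B", len(tag)) + tag + cipher
    return base64.urlsafe_b64encode(body).rstrip(b"=").decode()

def unpack_token(token: str):
    raw = base64.urlsafe_b64decode(token + "=" * (-len(token) % 4))
    assert raw[:4] == PREFIX, "unknown token version"
    i = 4
    nlen = raw[i]; i += 1
    nonce = raw[i:i + nlen]; i += nlen
    tlen = raw[i]; i += 1
    tag = raw[i:i + tlen]; i += tlen
    return nonce, raw[i:], tag
```

Length-prefixing keeps the parser independent of the AEAD's nonce/tag sizes, so the same framing survives a cipher change behind a new prefix.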
5.2 Irreversible (analytics/de-identification)
Keyed HMAC: `token = HMAC(K_scope, PII_normalized)`; keep salt/pepper separate, per tenant or dataset.
Minimize collision risk through the choice of function (SHA-256/512) and domain.
5.3 Determinism and scope
For joins, use a deterministic scheme with AAD = `tenant|purpose|field` → the same value yields different tokens for different purposes.
For anti-correlation across services, use different keys/scopes.
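A minimal sketch of scope separation: derive a per-scope key from a master key so that the same normalized PII yields unlinkable tokens across purposes. HMAC-based derivation here is illustrative; in production the master key comes from KMS, not a constant:

```python
import hashlib
import hmac

def scope_key(master: bytes, tenant: str, purpose: str, field: str) -> bytes:
    # Derive a per-scope key so the same PII tokenizes differently per purpose.
    return hmac.new(master, f"{tenant}|{purpose}|{field}".encode(), hashlib.sha256).digest()

def det_token(master: bytes, tenant: str, purpose: str, field: str, pii_norm: str) -> str:
    return hmac.new(scope_key(master, tenant, purpose, field),
                    pii_norm.encode(), hashlib.sha256).hexdigest()

master = b"\x00" * 32  # placeholder; real key material lives in KMS
a = det_token(master, "t1", "kyc", "email", "user@example.com")
b = det_token(master, "t1", "marketing", "email", "user@example.com")
assert a != b  # different purposes -> uncorrelatable tokens
assert a == det_token(master, "t1", "kyc", "email", "user@example.com")  # stable within scope
```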
5.4 Mitigating dictionary attacks
Normalization (canonicalization of email/phone), a pepper stored in KMS, limits on the input domain (and no "record not found" errors that act as a side channel), rate limits and CAPTCHA/proxy for public endpoints.
6) API design and schematics
6.1 REST/gRPC (one option)
`POST /v1/tokenize { field, value, scope, tenant_id, purpose } -> { token, meta }`
`POST /v1/detokenize { token, purpose } -> { value }` (mTLS + OIDC + ABAC; minimal disclosure)
`POST /v1/match { field, value } -> { token }` (deterministic search path)
6.2 Storage schema (TV)
Table `tokens(field, scope, tenant_id, token, created_at, version, wrapped_key_id, hash_index)`
Indexes: on `token`; on `(tenant_id, field, hash_index)` for de-duplication/search.
A hash index (HMAC of the normalized PII) allows search without detokenization.
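The hash-index lookup can be sketched with SQLite: the query matches on an HMAC of the normalized PII, so the read path needs neither the plaintext in storage nor a detokenize call. Table and column names follow the schema above; the HMAC key is a placeholder:

```python
import hashlib
import hmac
import sqlite3

HK = b"hash-index-key"  # placeholder; a real deployment keeps this in KMS

def hash_index(pii_norm: str) -> str:
    return hmac.new(HK, pii_norm.encode(), hashlib.sha256).hexdigest()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tokens (field TEXT, tenant_id TEXT, token TEXT, hash_index TEXT)")
db.execute("CREATE INDEX ix_lookup ON tokens (tenant_id, field, hash_index)")
db.execute("INSERT INTO tokens VALUES ('email', 't1', 'tok_abc', ?)",
           (hash_index("user@example.com"),))

# Lookup by HMAC of the normalized PII: no detokenization, no plaintext in the query.
row = db.execute(
    "SELECT token FROM tokens WHERE tenant_id='t1' AND field='email' AND hash_index=?",
    (hash_index("user@example.com"),),
).fetchone()
assert row == ("tok_abc",)
```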
6.3 Normalization pipelines
email: lowercase, trim, canonical local part (without aggressively stripping dots for all domains).
phone: E.164 (with country code), formatting characters removed.
address/name: rule-based transliteration, trim, collapsed whitespace.
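A minimal sketch of these pipelines, intentionally conservative: no provider-specific dot stripping, and full E.164 validation is left to a dedicated library:

```python
import re

def normalize_email(raw: str) -> str:
    # Lowercase + trim; deliberately no dot stripping, which is provider-specific.
    local, _, domain = raw.strip().lower().partition("@")
    return f"{local}@{domain}"

def normalize_phone(raw: str) -> str:
    # Keep digits only under a leading "+"; real E.164 handling needs a phone library.
    digits = re.sub(r"\D", "", raw)
    return "+" + digits

def normalize_name(raw: str) -> str:
    # Trim and collapse internal whitespace; transliteration rules are region-specific.
    return " ".join(raw.split())
```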
7) Multi-tenancy and isolation
Keys and namespaces per tenant: KEK/DEK per tenant.
Detokenization policies: role + purpose + reason + audited events.
Crypto deletion of tenant data: revoke the KEK and destroy the DEKs → the vault becomes useless (for that tenant's records).
8) Integrations
8.1 Databases and caches
Store only tokens in operating tables.
Rare cases require on-the-fly detokenization through a proxy/agent.
Token caches - only in memory with a short TTL, without writing to disk.
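The in-memory, short-TTL cache can be as simple as this sketch: nothing is persisted to disk, and expired entries are dropped lazily on read:

```python
import time

class TTLCache:
    """Minimal in-memory cache with a per-entry TTL; never written to disk."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}

    def put(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires, value = entry
        if time.monotonic() > expires:
            del self._store[key]  # lazy eviction of expired entries
            return None
        return value
```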
8.2 Analytics/BI/ML
The DWH/lake holds tokens or hashes. Joins are performed on deterministic tokens of the matching scope.
For ML, prefer pseudonymization and aggregates; avoid re-identifying individuals.
8.3 Support services and anti-fraud
UI shows masked values (e.g. `+380******67`); occasional detokenization requires a documented reason (reason code) + a second factor.
9) Rotation, versions and lifecycle
Separate the token ID and encryption version (v1/v2).
Rewrap: change KEK without touching the data.
Incident plan: key compromise → immediate revocation, detokenization freeze, switch to read-only, start rewrap.
Token TTL: by policy - permanent (identifiers) or short-lived (one-time links/temporary integrations).
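Separating the token's encryption version from its payload can look like this sketch: tokens carry a version prefix, so rotation only changes which version new tokens use while old tokens remain resolvable. All names and key material are illustrative:

```python
# Versioned DEKs: old tokens stay readable, new tokens use the active version.
DEKS = {"v1": b"\x01" * 32, "v2": b"\x02" * 32}  # placeholder key material
ACTIVE_VERSION = "v2"

def mint_token(payload: str) -> str:
    # Prefix tokens with the encryption version so rotation needs no data rewrite.
    return f"{ACTIVE_VERSION}:{payload}"

def dek_for(token: str) -> bytes:
    # Select the right DEK from the version prefix carried by the token itself.
    version, _, _ = token.partition(":")
    return DEKS[version]
```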
10) Performance and reliability
Hardware accelerations (AES-NI/ARMv8), pools of connections to KMS, cache of wrapped DEKs.
Horizontal scaling TS; split read/write paths.
Idempotency keys so that retried `tokenize` calls are safe under network failures.
DR/HA: multi-region, asynchronous vault replica, regular recovery tests.
SLO: p99 latency `tokenize` ≤ 50-100 ms; `detokenize` ≤ 50 ms; availability ≥ 99.9%.
11) Observability, audit, compliance
Metrics: QPS per method, authn/authz errors, detokenization rate (by role/purpose), cache hit rate, KMS operation latency.
Audit (immutable): every detokenization with `who/what/why/where`, a hash of the query, and the result.
Retention and WORM policies for the log (see Auditing and Immutable Logs).
Compliance: GDPR (minimization, right to delete via crypto erasure), PCI DSS (for PAN - FPE/pseudonymization), ISO/SOC reporting.
12) Testing and safety
Crypto unit tests: stability of deterministic tokens, AAD verification and failure on mismatch.
Negative tests: dictionary attacks, format reversal, rate limits, CSRF (for web panels), SSRF for backends.
Chaos: KMS/vault unavailable, stale key, partial replication.
Periodic red-team attempts to detokenize without a valid reason or via side channels.
13) Mini recipes
Deterministic reversible token (AEAD SIV, pseudocode):
pii_norm = normalize(value)
aad = scope | tenant | field
dek = kms.unwrap(kek_id, wrapped_dek_for_field)
token = aead_siv_encrypt(dek, pii_norm, aad)  # deterministic
store_vault(token, pii_norm, meta)
return token
Irreversible Analytics Token (HMAC):
pii_norm = normalize(value)
pepper = kms.get_secret("pepper/" + tenant + "/" + field)
token = hmac_sha256(pepper, pii_norm)  # deterministic within the scope
return base64url(token)
Detokenization policy (idea):
allow if role in {SupportL2, Risk, DPO}
  and purpose in {KYC, Chargeback, DSAR}
  and mTLS and OIDC_claims match tenant
  and reason_code provided and ticket_id linked
  and rate_limit per actor <= N/min
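The policy idea translates into a small deny-by-default function that the TS can evaluate before touching the vault. A sketch; the rate-limit check is omitted because it needs shared state, and all names are illustrative:

```python
ALLOWED_ROLES = {"SupportL2", "Risk", "DPO"}
ALLOWED_PURPOSES = {"KYC", "Chargeback", "DSAR"}

def may_detokenize(role: str, purpose: str, mtls_ok: bool,
                   token_tenant: str, claim_tenant: str,
                   reason_code: str, ticket_id: str) -> bool:
    # Deny by default: every condition from the policy sketch must hold.
    return (
        role in ALLOWED_ROLES
        and purpose in ALLOWED_PURPOSES
        and mtls_ok
        and token_tenant == claim_tenant   # OIDC claims must match the token's tenant
        and bool(reason_code)              # reason code is mandatory
        and bool(ticket_id)                # request must be linked to a ticket
    )
```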
Tenant crypto removal:
kms.disable_key(kek_tenant)               # access to unwrap is blocked → detokenization impossible
schedule_destroy(kek_tenant, hold_days=7)
14) Frequent mistakes and how to avoid them
Tokens in logs. Mask the tokens themselves (especially reversible ones) - they are sensitive data.
A single key "for everything." Split keys by tenant/field/purpose; use AAD.
Ad-hoc normalization. Inconsistent canonicalization breaks search/joins.
Detokenization without a reason or limits. Always require a reason code, audit, and rate limits.
FPE as a panacea. Use it only when the format is truly required, and with a correct domain/keys.
Long-lived caches on disk. Cache only in memory, with a TTL.
No rewrap process. Zero-downtime KEK rotation is mandatory.
15) Checklists
Before production
- Token profiles selected per field/purpose (reversibility/determinism/scope).
- Key hierarchy (KEK/DEK), KMS policies, and key-operation auditing configured.
- Input normalization and format-validation pipeline implemented.
- Rate limits, reason codes, immutable audit enabled.
- Tests for dictionary attacks/format/role-based access passed.
- DR/vault replica and a key-compromise plan.
In operation
- Monthly detokenization report (who/why/how much).
- Periodic rotation of KEK/pepper, DEK rewrap.
- Red-team exercises against unauthorized detokenization/side channels.
- Revise normalization as new formats/regions emerge.
16) FAQ
Q: Is tokenization the same as anonymization?
A: No. Tokenization is pseudonymization: the original can be recovered (or correlated) as long as the keys/vault exist. Leaving the scope of the GDPR requires genuine anonymization.
Q: How to search by email/phone without detokenization?
A: Deterministic tokenization with canonicalization. For addresses/full names - hash indexes/search keys and auxiliary tables.
Q: When is FPE needed?
A: When an external contract/schema requires format (length/alphabet). In other cases, regular AEAD tokens are simpler and safer.
Q: Is it possible to have one token for all purposes?
A: Better to use different scopes (scope/purpose): the same PII yields different tokens for different tasks → lower correlation risk.
Q: How do you exercise the "right to remove"?
A: Crypto deletion: revoke the KEK/DEK for the corresponding set and/or delete the vault entry + destroy the per-field/per-batch keys; in analytics - TTL/aggregation/de-identification.
Related topics:
- "Secret Management"
- "At Rest Encryption"
- "In Transit Encryption"
- "Privacy by Design (GDPR)"
- "Audit and Immutable Logs"
- "Key Management and Rotation"