PII Data Tokenization
1) Why tokenization and what exactly we tokenize
The goal: remove access to "raw" personal data from the operational path and analytics, reduce the risk of leaks, and simplify regulatory compliance.
PII examples: full name, phone number, email, address, passport/ID, TIN, IP addresses, cookie IDs, payment identifiers, date of birth, etc.
A token:
- does not disclose the original value;
- can be reversible (via a secure detokenization service) or irreversible;
- can be deterministic (for joins/search) or non-deterministic (for maximum privacy).
2) Threat model and control objectives
Risks: leaks of databases/logs/backups, insider reads, correlation of repeated values, unauthorized detokenization, dictionary/format attacks (email/phone), reuse of secrets.
Objectives:
1. Separate trust zones: applications work with tokens; originals live only in the token service.
2. Guarantee cryptographic strength of tokens and controlled detokenization.
3. Reduce the blast radius with KMS/HSM, key rotation and crypto-shredding.
4. Keep the data usable for search/joins/analytics at a controlled level of risk.
3) Typology of tokens
Recommended profiles:
- PII for search/joins: reversible deterministic, bound to a scope (tenant/purpose), keys wrapped by KMS.
- PII for operational masking (UI): reversible non-deterministic with a limited lifetime to reduce reuse risk.
- For gray-zone analytics: irreversible (keyed HMAC/salted hash) or DP aggregations.
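The profiles above can be captured as a small registry so that services select a profile per field instead of hard-coding crypto choices. A minimal sketch; the `TokenProfile` name and the example entries are illustrative, not a prescribed API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TokenProfile:
    reversible: bool        # original recoverable via the vault?
    deterministic: bool     # same input -> same token (enables join/search)
    scope: str              # key/namespace boundary
    ttl_seconds: Optional[int] = None  # None = permanent token

# Illustrative mapping of use cases to the profiles recommended above
PROFILES = {
    "email_search":   TokenProfile(reversible=True,  deterministic=True,  scope="tenant|purpose|field"),
    "ui_masking":     TokenProfile(reversible=True,  deterministic=False, scope="tenant|field", ttl_seconds=3600),
    "analytics_hash": TokenProfile(reversible=False, deterministic=True,  scope="tenant|dataset"),
}
```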
4) Tokenization architecture
4.1 Components
Tokenization Service (TS): `tokenize/detokenize/search` API, high-trust zone.
Token Vault (TV): protected map `token → original (+ metadata)`.
KMS/HSM: root key storage (KEK), wrapping/signing operations.
Policy Engine: who, where and why may detokenize; scope/TTL/rate-limits; mTLS + OIDC/ABAC.
Audit & Immutability: immutable logs of all tokenize/detokenize operations.
4.2 Key hierarchy
Root/KEK in KMS/HSM (per organization/region/tenant).
DEK-PII per data domain (email/phone/address) and/or dataset.
Rotation: rewrap DEKs without re-encrypting the entire vault; have a key-compromise plan.
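To make rewrap concrete, here is a toy illustration: the DEK is stored wrapped under the KEK, so rotating the KEK means re-wrapping one small key and never touching the data encrypted under the DEK. The XOR/SHA-256 "cipher" below is purely illustrative and NOT real key wrapping; production would use AES-KW or a KMS call:

```python
import hashlib
import secrets

def keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    # Toy keystream (SHA-256 in counter mode) -- for illustration only, not real crypto.
    out = b""
    ctr = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return out[:n]

def wrap(kek: bytes, dek: bytes):
    nonce = secrets.token_bytes(16)
    return nonce, bytes(a ^ b for a, b in zip(dek, keystream(kek, nonce, len(dek))))

def unwrap(kek: bytes, nonce: bytes, wrapped: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(wrapped, keystream(kek, nonce, len(wrapped))))

# Rewrap: only the wrapped DEK changes; data encrypted under the DEK stays untouched.
old_kek, new_kek = secrets.token_bytes(32), secrets.token_bytes(32)
dek = secrets.token_bytes(32)
nonce, wrapped = wrap(old_kek, dek)
new_nonce, new_wrapped = wrap(new_kek, unwrap(old_kek, nonce, wrapped))
assert unwrap(new_kek, new_nonce, new_wrapped) == dek
```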
4.3 Flows
1. Tokenize: client → TS (mTLS + authn/authz) → normalization → token computation → write to TV → return token.
2. Detokenize: client → TS → policy/reason check → return original (or reject).
3. Search/Match: deterministic tokenization allows lookup by token; for email/phone, normalize the format before tokenizing.
5) Token designs (crypto design)
5.1 Reversible (recommended for the operational path)
AES-SIV/AEAD envelope: `cipher = AEAD_Encrypt(DEK, PII, AAD=scope|tenant|field)`; token = `prefix || nonce || cipher || tag`.
FPE (FF1/FF3-1) for fixed formats (e.g. a 10-digit phone number without country code). Apply with caution and a correctly sized domain (alphabet/length).
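The `prefix || nonce || cipher || tag` layout can be framed explicitly so tokens stay self-describing and URL-safe. A sketch of the framing only (not the cryptography); the `tok1` version prefix and length-prefixed fields are assumptions:

```python
import base64
import struct

PREFIX = b"tok1"  # assumed version/profile prefix

def pack_token(nonce: bytes, cipher: bytes, tag: bytes) -> str:
    # prefix || len(nonce) || nonce || len(tag) || tag || cipher, base64url without padding
    body = PREFIX + struct.pack(">B", len(nonce)) + nonce \
           + struct.pack(">B", len(tag)) + tag + cipher
    return base64.urlsafe_b64encode(body).rstrip(b"=").decode()

def unpack_token(token: str):
    raw = base64.urlsafe_b64decode(token + "=" * (-len(token) % 4))
    assert raw[:4] == PREFIX, "unknown token version"
    i = 4
    nlen = raw[i]; i += 1
    nonce = raw[i:i + nlen]; i += nlen
    tlen = raw[i]; i += 1
    tag = raw[i:i + tlen]; i += tlen
    return nonce, raw[i:], tag
```

Length-prefixing keeps the parser independent of the AEAD's nonce/tag sizes, so the same framing survives a cipher change behind a new prefix.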
5.2 Irreversible (analytics/de-identification)
Keyed HMAC: `token = HMAC(K_scope, PII_normalized)`; keep salt/pepper separate, per tenant or dataset.
Minimize collision risk through the choice of function (SHA-256/512) and domain.
5.3 Determinism and scope
For joins, use a deterministic scheme with AAD = `tenant|purpose|field` → the same value yields different tokens for different purposes.
For anti-correlation across services, use different keys/scopes.
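A minimal sketch of scope separation: derive a per-scope key from a master key so that the same normalized PII yields unlinkable tokens across purposes. HMAC-based derivation here is illustrative; in production the master key comes from KMS, not a constant:

```python
import hashlib
import hmac

def scope_key(master: bytes, tenant: str, purpose: str, field: str) -> bytes:
    # Derive a per-scope key so the same PII tokenizes differently per purpose.
    return hmac.new(master, f"{tenant}|{purpose}|{field}".encode(), hashlib.sha256).digest()

def det_token(master: bytes, tenant: str, purpose: str, field: str, pii_norm: str) -> str:
    return hmac.new(scope_key(master, tenant, purpose, field),
                    pii_norm.encode(), hashlib.sha256).hexdigest()

master = b"\x00" * 32  # placeholder; real key material lives in KMS
a = det_token(master, "t1", "kyc", "email", "user@example.com")
b = det_token(master, "t1", "marketing", "email", "user@example.com")
assert a != b  # different purposes -> uncorrelatable tokens
assert a == det_token(master, "t1", "kyc", "email", "user@example.com")  # stable within scope
```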
5.4 Mitigating dictionary attacks
Normalization (canonicalization of email/phone), a pepper stored in KMS, limits on the input domain (and no "record not found" errors that act as a side channel), rate limits and CAPTCHA/proxy for public endpoints.
6) API design and schematics
6.1 REST/gRPC (one option)
`POST /v1/tokenize { field, value, scope, tenant_id, purpose } -> { token, meta }`
`POST /v1/detokenize { token, purpose } -> { value }` (mTLS + OIDC + ABAC; minimal disclosure)
`POST /v1/match { field, value } -> { token }` (deterministic search path)
6.2 Storage schema (TV)
Table `tokens(field, scope, tenant_id, token, created_at, version, wrapped_key_id, hash_index)`
Indexes: on `token`; on `(tenant_id, field, hash_index)` for de-duplication/search.
A hash index (HMAC of the normalized PII) allows search without detokenization.
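The hash-index lookup can be sketched with SQLite: the query matches on an HMAC of the normalized PII, so the read path needs neither the plaintext in storage nor a detokenize call. Table and column names follow the schema above; the HMAC key is a placeholder:

```python
import hashlib
import hmac
import sqlite3

HK = b"hash-index-key"  # placeholder; a real deployment keeps this in KMS

def hash_index(pii_norm: str) -> str:
    return hmac.new(HK, pii_norm.encode(), hashlib.sha256).hexdigest()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tokens (field TEXT, tenant_id TEXT, token TEXT, hash_index TEXT)")
db.execute("CREATE INDEX ix_lookup ON tokens (tenant_id, field, hash_index)")
db.execute("INSERT INTO tokens VALUES ('email', 't1', 'tok_abc', ?)",
           (hash_index("user@example.com"),))

# Lookup by HMAC of the normalized PII: no detokenization, no plaintext in the query.
row = db.execute(
    "SELECT token FROM tokens WHERE tenant_id='t1' AND field='email' AND hash_index=?",
    (hash_index("user@example.com"),),
).fetchone()
assert row == ("tok_abc",)
```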
6.3 Normalization pipelines
email: lowercase, trim, canonical local part (without aggressively stripping dots for all domains).
phone: E.164 (with country code), formatting characters removed.
address/name: rule-based transliteration, trim, collapsed whitespace.
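A minimal sketch of these pipelines, intentionally conservative: no provider-specific dot stripping, and full E.164 validation is left to a dedicated library:

```python
import re

def normalize_email(raw: str) -> str:
    # Lowercase + trim; deliberately no dot stripping, which is provider-specific.
    local, _, domain = raw.strip().lower().partition("@")
    return f"{local}@{domain}"

def normalize_phone(raw: str) -> str:
    # Keep digits only under a leading "+"; real E.164 handling needs a phone library.
    digits = re.sub(r"\D", "", raw)
    return "+" + digits

def normalize_name(raw: str) -> str:
    # Trim and collapse internal whitespace; transliteration rules are region-specific.
    return " ".join(raw.split())
```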
7) Multi-tenancy and isolation
Keys and namespaces per tenant: KEK/DEK per tenant.
Detokenization policies: role + purpose + reason + audited events.
Crypto deletion of tenant data: revoke the KEK and destroy the DEKs → the vault becomes useless (for that tenant's records).
8) Integrations
8.1 Databases and caches
Store only tokens in operating tables.
Rare cases require on-the-fly detokenization through a proxy/agent.
Token caches - only in memory with a short TTL, without writing to disk.
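The in-memory, short-TTL cache can be as simple as this sketch: nothing is persisted to disk, and expired entries are dropped lazily on read:

```python
import time

class TTLCache:
    """Minimal in-memory cache with a per-entry TTL; never written to disk."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}

    def put(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires, value = entry
        if time.monotonic() > expires:
            del self._store[key]  # lazy eviction of expired entries
            return None
        return value
```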
8.2 Analytics/BI/ML
The DWH/lake holds tokens or hashes. Joins are performed on deterministic tokens of the matching scope.
For ML, prefer pseudonymization and aggregates; avoid re-identifying individuals.
8.3 Support services and anti-fraud
UI shows masked values (e.g. `+380******67`); occasional detokenization requires a documented reason (reason code) + a second factor.
9) Rotation, versions and lifecycle
Separate the token ID and encryption version (v1/v2).
Rewrap: change KEK without touching the data.
Incident plan: key compromise → immediate revocation, detokenization freeze, switch to read-only, start rewrap.
Token TTL: by policy - permanent (identifiers) or short-lived (one-time links/temporary integrations).
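Separating the token's encryption version from its payload can look like this sketch: tokens carry a version prefix, so rotation only changes which version new tokens use while old tokens remain resolvable. All names and key material are illustrative:

```python
# Versioned DEKs: old tokens stay readable, new tokens use the active version.
DEKS = {"v1": b"\x01" * 32, "v2": b"\x02" * 32}  # placeholder key material
ACTIVE_VERSION = "v2"

def mint_token(payload: str) -> str:
    # Prefix tokens with the encryption version so rotation needs no data rewrite.
    return f"{ACTIVE_VERSION}:{payload}"

def dek_for(token: str) -> bytes:
    # Select the right DEK from the version prefix carried by the token itself.
    version, _, _ = token.partition(":")
    return DEKS[version]
```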
10) Performance and reliability
Hardware accelerations (AES-NI/ARMv8), pools of connections to KMS, cache of wrapped DEKs.
Horizontal scaling TS; split read/write paths.
Idempotency keys so that retried `tokenize` calls are safe under network failures.
DR/HA: multi-region, asynchronous vault replica, regular recovery tests.
SLO: p99 latency `tokenize` ≤ 50-100 ms; `detokenize` ≤ 50 ms; availability ≥ 99.9%.
11) Observability, audit, compliance
Metrics: QPS per method, authn/authz errors, detokenization rate (by role/purpose), cache hit rate, KMS operation latency.
Audit (immutable): every detokenization with `who/what/why/where`, a hash of the query, and the result.
Retention and WORM policies for the log (see Auditing and Immutable Logs).
Compliance: GDPR (minimization, right to delete via crypto erasure), PCI DSS (for PAN - FPE/pseudonymization), ISO/SOC reporting.
12) Testing and safety
Crypto unit tests: stability of deterministic tokens, AAD verification and failure on mismatch.
Negative tests: dictionary attacks, format reversal, rate limits, CSRF (for web panels), SSRF for backends.
Chaos: KMS/vault unavailable, stale key, partial replication.
Periodic red-team attempts to detokenize without a valid reason or via side channels.
13) Mini recipes
Deterministic reversible token (AEAD SIV, pseudocode):
pii_norm = normalize(value)
aad = scope | tenant | field
dek = kms.unwrap(kek_id, wrapped_dek_for_field)
token = aead_siv_encrypt(dek, pii_norm, aad)  # deterministic
store_vault(token, pii_norm, meta)
return token
Irreversible Analytics Token (HMAC):
pii_norm = normalize(value)
pepper = kms.get_secret("pepper/" + tenant + "/" + field)
token = hmac_sha256(pepper, pii_norm)  # deterministic within the scope
return base64url(token)
Detokenization policy (idea):
allow if role in {SupportL2, Risk, DPO}
  and purpose in {KYC, Chargeback, DSAR}
  and mTLS and OIDC_claims match tenant
  and reason_code provided and ticket_id linked
  and rate_limit per actor <= N/min
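The policy idea translates into a small deny-by-default function that the TS can evaluate before touching the vault. A sketch; the rate-limit check is omitted because it needs shared state, and all names are illustrative:

```python
ALLOWED_ROLES = {"SupportL2", "Risk", "DPO"}
ALLOWED_PURPOSES = {"KYC", "Chargeback", "DSAR"}

def may_detokenize(role: str, purpose: str, mtls_ok: bool,
                   token_tenant: str, claim_tenant: str,
                   reason_code: str, ticket_id: str) -> bool:
    # Deny by default: every condition from the policy sketch must hold.
    return (
        role in ALLOWED_ROLES
        and purpose in ALLOWED_PURPOSES
        and mtls_ok
        and token_tenant == claim_tenant   # OIDC claims must match the token's tenant
        and bool(reason_code)              # reason code is mandatory
        and bool(ticket_id)                # request must be linked to a ticket
    )
```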
Tenant crypto removal:
kms.disable_key(kek_tenant)               # access to unwrap is blocked → detokenization impossible
schedule_destroy(kek_tenant, hold_days=7)
14) Frequent mistakes and how to avoid them
Tokens in logs. Mask the tokens themselves (especially reversible ones) - they are sensitive data.
A single key "for everything." Split keys by tenant/field/purpose; use AAD.
Ad-hoc normalization. Inconsistent canonicalization breaks search/joins.
Detokenization without a reason or limits. Always require a reason code, audit, and rate limits.
FPE as a panacea. Use it only when the format is truly required, and with a correct domain/keys.
Long-lived caches on disk. Cache only in memory, with a TTL.
No rewrap process. Zero-downtime KEK rotation is mandatory.
15) Checklists
Before production
- Token profiles selected per field/purpose (reversibility/determinism/scope).
- Key hierarchy (KEK/DEK), KMS policies, and key-operation auditing configured.
- Input normalization and format-validation pipeline implemented.
- Rate limits, reason codes, immutable audit enabled.
- Tests for dictionary attacks/format/role-based access passed.
- DR/vault replica and a key-compromise plan.
In operation
- Monthly detokenization report (who/why/how much).
- Periodic rotation of KEK/pepper, DEK rewrap.
- Red-team exercises against unauthorized detokenization/side channels.
- Revise normalization as new formats/regions emerge.
16) FAQ
Q: Is tokenization the same as anonymization?
A: No. Tokenization is pseudonymization: the original can be recovered (or correlated) as long as the keys/vault exist. Leaving the scope of the GDPR requires genuine anonymization.
Q: How to search by email/phone without detokenization?
A: Deterministic tokenization with canonicalization. For addresses/full names - hash indexes/search keys and auxiliary tables.
Q: When is FPE needed?
A: When an external contract/schema requires format (length/alphabet). In other cases, regular AEAD tokens are simpler and safer.
Q: Is it possible to have one token for all purposes?
A: Better to use different scopes (scope/purpose): the same PII yields different tokens for different tasks → lower correlation risk.
Q: How do you exercise the "right to remove"?
A: Crypto deletion: revoke the KEK/DEK for the corresponding set and/or delete the vault entry + destroy the per-field/per-batch keys; in analytics - TTL/aggregation/de-identification.
Related topics:
- "Secret Management"
- "At Rest Encryption"
- "In Transit Encryption"
- "Privacy by Design (GDPR)"
- "Audit and Immutable Logs"
- "Key Management and Rotation"