GH GambleHub

Anonymization and Aliasing

1) Terms and key differences

Anonymization: irreversible transformation of a dataset to a form in which the subject can no longer be identified, directly or indirectly, with reasonable effort. After correct anonymization, the data ceases to be personal data.
Aliasing (pseudonymization): replacing direct identifiers (name, phone, email, account number) with aliases (tokens). The mapping is stored separately and protected by cryptography and access procedures. Legally, this is still personal data.
Quasi-identifiers: combinations of seemingly harmless attributes (date of birth, postal code, gender, city, device) that in combination can uniquely point to a person.
Re-identification: re-establishing the link to the subject by joining with external sources or by analyzing rare combinations of attributes.

2) Architectural objectives and requirements

1. Privacy by default: minimize collection, store only the necessary fields, enforce strict TTLs.
2. Separation of zones: production identifiers are isolated from the analytical and ML zones; access to link tables follows the need-to-know principle.
3. Audit and traceability: who gained access to re-identification, when, and why.
4. Reuse policies: data given to partners/external researchers must carry formal privacy guarantees and usage licenses.
5. Risk assessment: quantitative metrics (k-anonymity, linkage probability, ε for differential privacy) treated as engineering SLOs.

3) De-identification techniques

3.1 Aliasing (reversible)

Tokenization: the mapping is stored in a token vault.

Forms: deterministic (one input → one token), randomized (one input → different tokens, derived with a salt and context).
Where appropriate: payment identifiers, accounts, long-lived links between events.
FPE (Format-Preserving Encryption): encryption that preserves the input format (for example, 16-digit PAN → 16-digit ciphertext). Convenient for legacy schemas and validations.
HMAC/deterministic encryption: gives a stable alias for joins, but requires key management and scoped application domains (context binding).
Hashing: acceptable only with a strong salt and only when reversibility is not needed. For small domains (phone, email), plain hashing is vulnerable to brute force.
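The brute-force weakness of plain hashing, and the context binding mentioned above, can be illustrated with a minimal Python sketch. The key value, phone number, and context strings are illustrative assumptions; a real deployment would fetch the key from KMS/HSM.

```python
import hashlib
import hmac

def plain_hash(phone: str) -> str:
    # Unsalted hash: the phone-number space is small enough that an
    # attacker can enumerate it and invert this mapping offline.
    return hashlib.sha256(phone.encode()).hexdigest()

def keyed_alias(phone: str, domain_key: bytes, context: str) -> str:
    # HMAC with a secret, context-bound key: still a stable alias for
    # joins, but infeasible to brute-force without the key.
    msg = f"{context}:{phone}".encode()
    return hmac.new(domain_key, msg, hashlib.sha256).hexdigest()

key = b"example-domain-key"  # illustrative; fetched from KMS in production
a1 = keyed_alias("+380501234567", key, "signup:v1")
a2 = keyed_alias("+380501234567", key, "signup:v1")
a3 = keyed_alias("+380501234567", key, "marketing:v1")
# deterministic within a context, distinct across contexts
```

Binding the key to a context ("signup:v1" vs "marketing:v1") is what prevents tokens from one domain being joined against another.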

3.2 Anonymization (irreversible)

k-anonymity: each quasi-identifier "portrait" in the released data occurs ≥ k times. Achieved by generalization (age → age_band) and suppression of rare combinations.
l-diversity: within each k-group, the sensitive attribute takes ≥ l distinct values, to avoid disclosure through homogeneous clusters.
t-closeness: the distribution of the sensitive attribute within each k-group stays "close" to the global distribution (limits information leakage).
Differential privacy (DP): adding mathematically controlled noise to aggregates, or training models with privacy guarantees (ε-DP). Gives formal guarantees against arbitrary external knowledge held by an attacker.
Masking/permutation/shuffling: appropriate for demo/support environments.
Synthetic data: generating "similar" datasets for development/research with no link to real subjects (GANs/VAEs/tabular synthesizers), with leak testing.
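The effect of generalization on k-anonymity can be shown in a few lines of Python. The toy rows, field names, and 10-year band width are illustrative assumptions, not a prescribed scheme.

```python
from collections import Counter

def age_band(age: int, width: int = 10) -> str:
    # Generalization: exact age -> coarse band, e.g. 23 -> "20-29"
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def k_anonymity(rows, quasi_ids) -> int:
    # k = size of the smallest equivalence class over the quasi-identifiers
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return min(counts.values())

rows = [
    {"age": 23, "city": "Kyiv"}, {"age": 27, "city": "Kyiv"},
    {"age": 24, "city": "Kyiv"}, {"age": 45, "city": "Lviv"},
    {"age": 47, "city": "Lviv"},
]
k_raw = k_anonymity(rows, ["age", "city"])       # every row unique -> k = 1
for r in rows:                                   # generalization step
    r["age_band"] = age_band(r["age"])
k_gen = k_anonymity(rows, ["age_band", "city"])  # classes merge -> k = 2
```

Coarser bands raise k further at the cost of analytical precision; this is the utility/privacy trade-off an anonymization profile has to fix explicitly.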

4) Architectural patterns

4.1 Privacy Gateway at ingress

Flow: Client → API Gateway → Privacy Gateway → Event Bus/Storage.

Functions:
  • schema normalization;
  • detection of sensitive fields (PII/PHI/finance);
  • applying rules: tokenization/FPE/masking;
  • policy logging (policy_id, key version, processing reason).
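One way to sketch the rule-application step is a declarative field-to-action map with an audit record per event. The POLICY structure, field names, and the idea of dropping unknown fields are illustrative assumptions, not a real gateway API.

```python
# Hypothetical rule set: field name -> action; in a real gateway this
# would be driven by schema tags and a versioned policy store.
POLICY = {
    "policy_id": "pii:v3",
    "rules": {"email": "tokenize", "pan": "fpe", "age": "pass", "name": "mask"},
}

def mask(value: str) -> str:
    return value[0] + "***" if value else value

def apply_policy(event: dict, tokenize, fpe) -> tuple[dict, dict]:
    out = {}
    audit = {"policy_id": POLICY["policy_id"], "fields": []}
    for field, value in event.items():
        action = POLICY["rules"].get(field, "drop")  # unknown fields are dropped
        if action == "pass":
            out[field] = value
        elif action == "mask":
            out[field] = mask(value)
        elif action == "tokenize":
            out[field] = tokenize(field, value)
        elif action == "fpe":
            out[field] = fpe(field, value)
        if action != "pass":
            audit["fields"].append((field, action))
    return out, audit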

4.2 Token Vault

A separate service/database backed by HSM/KMS.
RBAC/ABAC over the API; all operations are audited.
Separation of tokenization domains (email/payment/user_id), so that a token from one context cannot be confused with another.
Key rotation and token versioning ('token_v1', 'token_v2') with transparent migration.
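A minimal sketch of version-prefixed tokens: embedding the key id in the token lets detokenization and migration pick the right key. Key material is inline here for demonstration only; in production it lives in KMS/HSM, and a real vault would look values up by token rather than receiving them.

```python
import hashlib
import hmac

# Illustrative key registry: "token_v1" is retired, "token_v2" is active.
KEYS = {"token_v1": b"retired-key", "token_v2": b"active-key"}
ACTIVE_KID = "token_v2"

def tokenize(value: str, kid: str = ACTIVE_KID) -> str:
    # The kid travels with the token so its key version is self-describing.
    digest = hmac.new(KEYS[kid], value.encode(), hashlib.sha256).hexdigest()[:16]
    return f"{kid}:{digest}"

def migrate(token: str, value: str) -> str:
    # Transparent migration: tokens minted under a retired key are
    # re-minted under the active key during the dual-support window.
    kid, _ = token.split(":", 1)
    return token if kid == ACTIVE_KID else tokenize(value, ACTIVE_KID)
```

During rotation both keys stay resolvable, so reads never break while writes and re-mints move everything to the active version.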

4.3 Dual-zone analytics

Zone A (operational): PII is stored minimally; business logic works with tokens.
Zone B (analytical): only anonymized datasets/aggregates; access via secure notebooks; external export only through the DP gate.

4.4 Privacy-aware ML pipeline

Phases: collection → cleaning → pseudonymization → anonymization/DP aggregation → training.
For personalized models, store features keyed by tokens and limit feature identifiability (cardinality caps, tail trimming, DP regularization).
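The cardinality cap and tail trimming mentioned above can be sketched as folding rare categorical values into a single bucket. The thresholds, the "OTHER" label, and the device examples are illustrative assumptions.

```python
from collections import Counter

def cap_cardinality(values, top_n=2, min_count=2, other="OTHER"):
    # Keep only frequent categories; the long tail of rare values is
    # folded into one bucket so no category is rare enough to single
    # out a user through its feature value.
    counts = Counter(values)
    keep = {v for v, c in counts.most_common(top_n) if c >= min_count}
    return [v if v in keep else other for v in values]

devices = ["ios", "ios", "android", "android", "android", "kaios"]
capped = cap_cardinality(devices, top_n=2, min_count=2)
# "kaios" appears once, so it is folded into OTHER
```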

5) Protocols and flows (example)

Email Aliasing Protocol:

1. The API receives 'email'.

2. The Privacy Gateway calls the Token Vault: `tokenize("email", value, context="signup:v1")`.

3. The application stores 'email_token' instead of the email.

4. For notifications, a separate service holds the right to detokenize on a case-by-case basis, with an audit trail.
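The case-by-case detokenization with audit could look like the following sketch. The in-memory store, the (role, purpose) allow-list, and the audit list are illustrative stand-ins for a vault lookup, an ABAC policy engine, and a WORM log.

```python
import time

TOKEN_STORE = {"tok_123": "alice@example.com"}  # stand-in for the vault mapping
ALLOWED = {("notifications", "send_email")}     # permitted (role, purpose) pairs
AUDIT_LOG = []                                  # stand-in for an append-only log

def detokenize(token: str, role: str, purpose: str) -> str:
    # Every attempt, allowed or denied, lands in the audit log with its
    # purpose, so mass detokenization is visible and attributable.
    allowed = (role, purpose) in ALLOWED
    AUDIT_LOG.append({"ts": time.time(), "token": token,
                      "role": role, "purpose": purpose, "ok": allowed})
    if not allowed:
        raise PermissionError("detokenization denied")
    return TOKEN_STORE[token]
```

Denied attempts are logged before the exception is raised, which is what makes alerting on detokenization spikes (see the observability section) possible.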

Report anonymization protocol:

1. The analyst queries the data mart (tokens and non-sensitive fields only).

2. The engine applies k-anonymization over the quasi-identifiers ('country', 'age_band', 'device_class').

3. For indicators with a disclosure risk, DP noise is added.

4. The export is labeled with 'anonymization_profile_id' and its ε budget.
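Steps 2-4 can be combined into one small sketch: suppress thin classes, add Laplace noise to the surviving counts, and label the export. The row shape, k and ε defaults, and profile id are illustrative assumptions.

```python
import random
from collections import defaultdict

def dp_noisy_counts(rows, quasi_ids, k=2, epsilon=1.0, profile_id="anon:v1"):
    # Group by quasi-identifiers, suppress classes smaller than k, then
    # add Laplace noise to each surviving count (a count has sensitivity 1).
    groups = defaultdict(int)
    for row in rows:
        groups[tuple(row[q] for q in quasi_ids)] += 1
    noisy = {}
    for group, count in groups.items():
        if count < k:
            continue  # suppression: thin classes never leave the gate
        # Laplace(0, 1/epsilon) as the difference of two exponentials
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        noisy[group] = count + noise
    return {"profile_id": profile_id, "epsilon": epsilon, "counts": noisy}
```

The returned envelope carries the profile id and ε, so the consumer of the export can account for the privacy budget spent.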

6) Risk metrics and validation

k-anonymity: the minimum equivalence-class size (target: k ≥ 5/10/20 depending on the domain).
l-diversity/t-closeness: control the leakage of sensitive values within k-classes.
Uniqueness score: the share of unique "portraits" in the dataset; reduce it via generalization.
Linkability/inference risk: the probability that a record can be matched against an external dataset (estimated via attack simulations).
DP ε-budget: maintain a "privacy budget" per subject/dataset and track its consumption.
Attack simulations: regular red-team re-identification exercises on test slices.
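The first and third metrics above are cheap to compute continuously; a sketch, with illustrative rows and field names:

```python
from collections import Counter

def risk_metrics(rows, quasi_ids):
    # k_min: size of the smallest equivalence class; uniqueness: share of
    # rows that are alone in their class (prime re-identification targets).
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    k_min = min(counts.values())
    unique_share = sum(c for c in counts.values() if c == 1) / len(rows)
    return {"k_min": k_min, "uniqueness": unique_share}

rows = [
    {"age_band": "20-29", "city": "Kyiv"},
    {"age_band": "20-29", "city": "Kyiv"},
    {"age_band": "40-49", "city": "Lviv"},
]
metrics = risk_metrics(rows, ["age_band", "city"])
# the single Lviv row is unique -> k_min = 1, uniqueness = 1/3
```

Wired into a dashboard, k_min dropping below the SLO threshold is exactly the "thin classes" alert described in the observability section.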

7) Keys, cryptography, and operations

KMS/HSM: key generation and storage for FPE/deterministic encryption/HMAC.
Versioning: 'key_id', 'created_at', 'status = active | retired'. Store the 'kid' alongside the data for reversibility.
Rotation: scheduled (quarterly) and forced (on incident). Support dual encryption for the duration of the migration.
Access policies: no mass detokenization; RPS/volume limits; a mandatory 'purpose' field.
Audit: an immutable log (WORM/append-only) with signatures.

8) Integration into microservices and protocols

Protobuf/JSON Schema: tag fields with 'pii:direct', 'pii:quasi', 'sensitive', 'policy_id'.
Events: two sets of topics - "raw" (inner zone) and "de-identified" (for analytics/partners).
Partner gate: an egress service with anonymization profiles (rule set + risk metrics + version).
Logs/traces: exclude PII; use tokens/hashes, and FPE/HMAC for correlation.
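Scrubbing PII from logs while preserving correlation can be sketched as replacing each email with a stable keyed token before the line is emitted. The regex, key, and token format are illustrative assumptions; the real key would come from KMS.

```python
import hashlib
import hmac
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
LOG_KEY = b"log-correlation-key"  # illustrative; real key comes from KMS

def scrub(line: str) -> str:
    # Replace every email with a stable HMAC token: the PII never reaches
    # the log, but the same address still yields the same token, so events
    # about one user can still be correlated.
    def repl(match):
        tok = hmac.new(LOG_KEY, match.group(0).lower().encode(),
                       hashlib.sha256).hexdigest()[:12]
        return f"email:{tok}"
    return EMAIL_RE.sub(repl, line)

a = scrub("login failed for alice@example.com")
b = scrub("password reset for Alice@Example.com")
# same address (case-normalized) -> same token in both lines
```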

9) Anti-patterns

Storing source PII next to tokens/keys.
Trusting a single "super access" without multi-factor approval and logging.
Releasing "de-identified" datasets without risk metrics and formal guarantees.
Relying on hashing email/phone alone, without salt/context.
Anonymizing "once and forever" without revisiting when external sources change (new leaks increase linkage risk).
Assuming k-anonymity is enough for texts, time series, or geo-tracks - those need DP, truncation, and synthetic data.

10) Use cases (including fintech/gaming)

Anti-fraud and behavioral features: deterministic tokens for linking sessions and devices, while sensitive fields go into a separate zone.
Regional reporting: k-anonymization of quasi-identifiers (age bands, region clusters, payment method type), DP noise on revenue metrics.
A/B tests and marketing: user tokens, soft audiences via DP clipping, and minimal audit logs.
Data sharing with providers: only through an egress gate with anonymization profiles and legal restrictions on incremental reconstruction.

11) Mini recipes (pseudocode)

Deterministic token (email) with domain salt


function email_token(email, domain_key, context):
    norm = normalize(email)            // lower, trim, punycode
    salt = HMAC(domain_key, context)   // context bound to the use case
    return BASE32(HMAC(salt, norm))    // stable, non-brute-forceable token

FPE for PAN (approx)


cipher = FPE_AES_FF1(kid="pay_v2")
enc_pan = cipher.encrypt(pan, tweak=merchant_id)
store(enc_pan, kid="pay_v2")

k-anonymization with suppression of rare classes


groups = groupBy(dataset, [age_band, region3, device_class])
filtered = filter(groups, count >= k)        // keep classes of size >= k
suppressed = replaceRare(groups, with="*")   // mask rare combinations

DP aggregation of metrics


function dp_sum(values, epsilon, sensitivity=1):
noise = Laplace(0, sensitivity/epsilon)
return sum(values) + noise

12) Testing and observability

Unit tests for policies: token reproducibility, correct 'kid' rotation, impossibility of detokenization without rights.
Privacy CI: for each PR, static analysis of schemas and code for PII leaks (tag/log/export checks).
Metrics: share of columns with PII tags, number of detokenizations by purpose, minimum k per dataset, ε consumption.
Alerts: a spike in detokenization attempts, the appearance of "thin" classes (k falls below the threshold), exports without an anonymization profile.

13) Legal and process layer (high-level)

DPIA/TRA: privacy impact assessment for new data flows.
Data retention: TTLs and deletion policies for surrogates and registries.
Subject requests: the ability to provide a copy of the data without exposing internal tokenization keys/logic.
Contracts with partners: prohibition of re-identification, restrictions on joins with external datasets, mandatory privacy metrics.

14) Architect checklist

1. Are PII and quasi-identifiers defined and tagged in the schemas?
2. Does the ingress Privacy Gateway apply policies deterministically and log their versions?
3. Is the token vault isolated (KMS/HSM, RBAC, audit, limits)?
4. Are the zones separated: operational, analytical, ML, egress?
5. Are risk metrics (k, l, t, ε) and threshold SLOs configured?
6. Is there a key-rotation plan and reversible token migration?
7. Does external export go through an anonymization profile and DP noise?
8. Are logs and traces free of PII?
9. Are red-team re-identification simulations run regularly?
10. Is there a documented runbook for a key leak/compromise incident?

15) Related Architecture and Protocols Section Patterns

Tokenization and Key Management

At Rest/In Transit Encryption

Geo-routing and localization

Observability: logs, metrics, traces (without PII)

SLO/SLA for privacy and compliance

Conclusion

Anonymization and pseudonymization are not a one-off operation on a column but a systemic architectural capability: policies, services, keys, audit, risk metrics, and development culture. By combining strong pseudonymization for business processes with formal privacy guarantees (DP, k-/l-/t-criteria) for analytics and data exchange, you turn privacy from a "brake on innovation" into a competitive advantage and a mandatory quality layer of your platform.
