Key management and rotation
Keys are the platform's roots of trust. A reliable key management system (KMS/HSM + processes + telemetry) turns cryptography from a one-time integration into a routine operation: keys are rotated regularly, their use is transparent, compromises stay contained, and clients get through a key change without downtime.
1) Goals and principles
Crypto agility: the ability to change the algorithm/key length without large migrations.
Least exposure: private keys never leave the KMS/HSM; signing and decryption are performed remotely.
Short-lived artifacts: tokens and session keys live for minutes to hours, not weeks.
Dual-key/dual-cert windows: rotations that fail safe.
Regional & tenant isolation: keys are divided by region and tenant.
Full auditability: immutable operation log, HSM attestation, access control.
2) Key classification
Root CA/Master Key: used extremely rarely, kept in the HSM, used to issue intermediate keys or to wrap data keys.
Operational: JWT/event signing, TLS, webhook signing, config/PII encryption.
Session/ephemeral: DPoP, mTLS binding, ECDH-derived keys for a channel/session.
Integration: partner keys (public) and HMAC secrets.
Data Keys (DEK): protected by envelope encryption under a KEK, never stored in plaintext.
3) Key identification and usage policy
Each key has a 'kid' (it identifies the key in tokens/headers):
```yaml
key:
  kid: "eu-core-es256-2025-10"
  alg: "ES256"  # or EdDSA, RSA-PSS, AES-GCM, XChaCha20-Poly1305
  purpose: ["jwt-sign","webhook-sign"]
  scope: ["tenant:brand_eu","region:EE"]
  status: "active"  # active | next | retiring | revoked
  created_at: "2025-10-15T08:00:00Z"
  valid_to: "2026-01-15T08:00:00Z"
```
Rules: "one purpose, one key" (minimal sharing), with an explicit scope and validity period.
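A minimal sketch of how a service might enforce such a record before signing. `KeyRecord` and `can_sign` are hypothetical names, not part of any KMS API, and the rule that only an 'active' key may sign is an assumption of this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class KeyRecord:
    kid: str
    alg: str
    purpose: tuple
    scope: tuple
    status: str  # active | next | retiring | revoked
    created_at: datetime
    valid_to: datetime


def can_sign(key: KeyRecord, purpose: str, scope: str, now: datetime) -> bool:
    """Allow signing only with an active, in-date key for a declared purpose and scope."""
    return (
        key.status == "active"  # assumption: 'next' and 'retiring' keys do not sign
        and purpose in key.purpose
        and scope in key.scope
        and key.created_at <= now < key.valid_to
    )


key = KeyRecord(
    kid="eu-core-es256-2025-10",
    alg="ES256",
    purpose=("jwt-sign", "webhook-sign"),
    scope=("tenant:brand_eu", "region:EE"),
    status="active",
    created_at=datetime(2025, 10, 15, 8, 0, tzinfo=timezone.utc),
    valid_to=datetime(2026, 1, 15, 8, 0, tzinfo=timezone.utc),
)
now = datetime(2025, 11, 1, tzinfo=timezone.utc)
print(can_sign(key, "jwt-sign", "region:EE", now))
```

A request for an undeclared purpose or scope, or one made after `valid_to`, is refused without ever reaching the KMS.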
4) Key lifecycle (KMS/HSM)
1. Generate: in HSM/KMS, with export policy = denied.
2. Publish: for asymmetric keys, a JWKS/certificate carrying the 'kid'.
3. Use: remote operations (sign/decrypt) under controlled IAM.
4. Rotate: introduce the 'next' key and enable dual-accept.
5. Retire: move the old key to 'retiring', then 'revoked'.
6. Destroy: destroy the key material (with a purge record) after the dispute window.
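The lifecycle above can be read as a small state machine. The sketch below is one possible encoding: the `destroyed` state, the direct `active -> revoked` shortcut for forced rotations, and all names are assumptions made for illustration.

```python
# Allowed status transitions; anything not listed is rejected.
ALLOWED = {
    "next": {"active", "revoked"},
    "active": {"retiring", "revoked"},  # 'revoked' directly only on compromise
    "retiring": {"revoked"},
    "revoked": {"destroyed"},
    "destroyed": set(),
}


class InvalidTransition(ValueError):
    """Raised when a status change would skip or reverse a lifecycle step."""


def transition(current: str, target: str) -> str:
    """Return the new status, or raise if the lifecycle does not permit it."""
    if target not in ALLOWED.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target


status = "next"
for step in ("active", "retiring", "revoked", "destroyed"):
    status = transition(status, step)
print(status)
```

Keeping the table explicit makes it easy to audit that a revoked key can never become active again.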
5) Rotation: Strategies
Scheduled: calendar-based (for example, every 1-3 months for JWT signing keys, every 6-12 months for TLS certificates).
Rolling: consumers are switched gradually (the JWKS already contains the new key; the issuer starts signing with it once caches have warmed up).
Forced (security): immediate rotation upon compromise; short dual-accept window, aggressive expiration of artifacts.
Staggered per region/tenant: so that a bad rotation does not take down every region at once.
The golden rule: publish first, then start signing with the new key, and revoke the old key only after the artifacts signed with it have expired.
6) Dual-key window
We publish JWKS with the old and new 'kid'.
Verifiers accept both.
After N minutes/hours the issuer starts signing with the new key.
We monitor the share of verifications against the old and the new 'kid'.
Once the target share is reached, we retire the old key.
```yaml
jwks:
  keys:
    - kid: "eu-core-es256-2025-10"  # new
      alg: "ES256"
      use: "sig"
      crv: "P-256"
      x: "<...>"
      y: "<...>"
    - kid: "eu-core-es256-2025-07"  # old
      alg: "ES256"
      use: "sig"
      # ...
```
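The decision to retire the old key can be driven by the share of verifications that still use it. A minimal sketch follows; the 1% threshold, the minimum sample size, and the function names are illustrative assumptions, not values from this document.

```python
from collections import Counter


def old_kid_usage_ratio(verified_kids: list, old_kid: str) -> float:
    """Share of successful verifications that still used the old kid."""
    counts = Counter(verified_kids)
    total = sum(counts.values())
    return counts[old_kid] / total if total else 0.0


def may_retire_old(verified_kids: list, old_kid: str,
                   threshold: float = 0.01, min_samples: int = 1000) -> bool:
    """Retire only with enough observations and a low enough old-key share."""
    if len(verified_kids) < min_samples:
        return False
    return old_kid_usage_ratio(verified_kids, old_kid) <= threshold


OLD, NEW = "eu-core-es256-2025-07", "eu-core-es256-2025-10"
window = [NEW] * 995 + [OLD] * 5  # kids seen in the latest observation window
print(old_kid_usage_ratio(window, OLD), may_retire_old(window, OLD))
```

The sample-size guard avoids retiring a key on the strength of a quiet period with almost no traffic.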
7) Signature and validation policies
Default algorithms: signature ES256/EdDSA; RSA-PSS where required.
'none' and weak algorithms are rejected; verifiers enforce an algorithm allowlist.
Clock skew: we allow ±300 s and log deviations.
Key pinning (internal services) and a short JWKS cache TTL (30-60 s).
8) Envelope encryption and KDF
Store data like this:
```
ciphertext   = AEAD_Encrypt(DEK, plaintext, AAD = tenant|region|table|row_id)
DEK          = KMS.Decrypt(KEK, EncryptedDEK)  // on access
EncryptedDEK = KMS.Encrypt(KEK, DEK)           // on write
```
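The bookkeeping around these operations, including re-wrapping a DEK under a new KEK, can be sketched with a test double. `FakeKms` is a hypothetical in-memory stand-in that hands out opaque handles and performs no real encryption, and the AEAD step is left out because Python's standard library has no AEAD cipher. The point is only to show which fields change.

```python
import secrets


class FakeKms:
    """In-memory stand-in for a KMS: opaque handles, NO real encryption."""

    def __init__(self):
        self._vault = {}  # handle -> (kek_id, dek)

    def encrypt(self, kek_id: str, dek: bytes) -> str:
        handle = secrets.token_hex(16)
        self._vault[handle] = (kek_id, dek)
        return handle

    def decrypt(self, kek_id: str, handle: str) -> bytes:
        stored_kek, dek = self._vault[handle]
        if stored_kek != kek_id:
            raise PermissionError("DEK is wrapped under a different KEK")
        return dek


def rewrap(kms: FakeKms, record: dict, new_kek_id: str) -> dict:
    """Re-wrap the DEK under a new KEK; the data ciphertext is left untouched."""
    dek = kms.decrypt(record["kek_id"], record["encrypted_dek"])
    return {
        **record,
        "kek_id": new_kek_id,
        "encrypted_dek": kms.encrypt(new_kek_id, dek),
    }


kms = FakeKms()
dek = secrets.token_bytes(32)
record = {
    "kek_id": "kek-2025-07",                      # hypothetical KEK identifiers
    "encrypted_dek": kms.encrypt("kek-2025-07", dek),
    "ciphertext": b"<AEAD output, not computed here>",
}
rotated = rewrap(kms, record, "kek-2025-10")
print(rotated["kek_id"], rotated["ciphertext"] == record["ciphertext"])
```

Only `kek_id` and `encrypted_dek` change, which is why a KEK rotation does not require re-encrypting the stored data.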
KEK (Key Encryption Key) is stored in KMS/HSM, rotated regularly.
A DEK is created per object/batch; when the KEK rotates, we re-wrap the DEKs (fast, with no re-encryption of the data).
For streams: ECDH + HKDF to derive short-lived channel keys.
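The HKDF half of that derivation fits in a few lines of standard-library Python. This is a sketch of HKDF with HMAC-SHA256 as defined in RFC 5869; the ECDH exchange itself needs a cryptography library and is not shown, so `shared_secret` below is a placeholder, and the `info` labels are made up for illustration.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with HMAC-SHA256: extract, then expand."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output is too long")
    # Extract: concentrate the input keying material into a pseudorandom key.
    prk = hmac.new(salt or b"\x00" * HASH_LEN, ikm, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) || info || i)
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


shared_secret = bytes(32)  # placeholder for the ECDH output, NOT a real secret
salt = b"session-nonce"    # hypothetical per-session salt
send_key = hkdf_sha256(shared_secret, salt, b"channel:client->server")
recv_key = hkdf_sha256(shared_secret, salt, b"channel:server->client")
print(len(send_key), send_key != recv_key)
```

Distinct `info` labels yield independent keys per direction from one shared secret, so a key is never reused for two purposes.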
9) Regionality and multi-tenant
Keys and JWKS are regionalized: 'eu-core', 'latam-core' are different sets of keys.
IAM and audit are separated by tenant/region; keys do not cross data-residency boundaries.
The 'kid' is encoded with a trust-domain prefix: 'eu-core-es256-2025-10'.
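A verifier can use the prefix to refuse keys from another trust domain. The sketch assumes the layout `<trust-domain>-<alg>-<yyyy>-<mm>`, inferred from the example 'kid'; `parse_kid` and `accept_kid` are hypothetical helpers.

```python
from typing import NamedTuple


class KidParts(NamedTuple):
    trust_domain: str
    alg: str
    year: int
    month: int


def parse_kid(kid: str) -> KidParts:
    """Split from the right so that the trust domain may itself contain hyphens."""
    domain, alg, year, month = kid.rsplit("-", 3)
    return KidParts(domain, alg, int(year), int(month))


def accept_kid(kid: str, local_domain: str) -> bool:
    """Reject keys whose trust-domain prefix differs from this verifier's domain."""
    try:
        return parse_kid(kid).trust_domain == local_domain
    except ValueError:  # malformed kid: too few parts or a non-numeric date
        return False


print(parse_kid("eu-core-es256-2025-10"))
print(accept_kid("latam-core-es256-2025-10", "eu-core"))
```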
10) Integration secrets (HMAC, API keys)
Store in the KMS-backed Secret Store, issue via short-lived client secrets (rotation policy ≤ 90 days).
Support for two active secrets (dual-secret) during rotation.
For webhooks - timestamp + HMAC body signature; time window ≤ 5 min.
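A sketch of webhook verification with a timestamp window and two active secrets. The signed message layout (`timestamp.body`) and the function names are assumptions of this example; real integrations should follow whatever layout the sender documents.

```python
import hashlib
import hmac

MAX_AGE_S = 300  # 5-minute acceptance window


def sign_webhook(secret: bytes, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over '<timestamp>.<body>' (layout is an assumption)."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(active_secrets: list, timestamp: int, body: bytes,
                   signature: str, now: int) -> bool:
    """Accept a fresh request signed with any currently active secret."""
    if abs(now - timestamp) > MAX_AGE_S:
        return False  # stale or future-dated: possible replay
    return any(
        hmac.compare_digest(sign_webhook(secret, timestamp, body), signature)
        for secret in active_secrets
    )


old_secret, new_secret = b"old-secret-demo", b"new-secret-demo"  # demo values only
body = b'{"event":"payout.created"}'
sig = sign_webhook(old_secret, 1_700_000_000, body)

# During rotation both secrets are active, so the old signature still verifies.
print(verify_webhook([new_secret, old_secret], 1_700_000_000, body, sig, now=1_700_000_060))
```

Once the sender has switched to the new secret, dropping the old one from `active_secrets` completes the rotation.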
11) Access control and processes
IAM matrix: who can 'generate', 'sign', 'decrypt', 'rotate', 'destroy' (minimum roles).
4-eye principle: sensitive operations require two confirmations.
Change windows: scheduled windows for enabling a new key, tried first in canary regions.
Runbooks: procedure templates for scheduled and forced rotations.
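The IAM matrix and the 4-eye principle can be combined into one authorization check. In this sketch the roles, the choice of which operations count as sensitive, and the reading of "two confirmations" as two distinct identities are all assumptions.

```python
# Role -> operations it may request (hypothetical roles).
IAM_MATRIX = {
    "key-admin": {"generate", "rotate", "destroy"},
    "signer-service": {"sign"},
    "pii-service": {"decrypt"},
}
SENSITIVE_OPS = {"rotate", "destroy"}  # assumption: these need two people


def authorize(op: str, role: str, confirmations: list) -> bool:
    """Check the role against the matrix, then the two-person rule if sensitive."""
    if op not in IAM_MATRIX.get(role, set()):
        return False
    if op in SENSITIVE_OPS:
        return len(set(confirmations)) >= 2  # two distinct identities
    return True


print(authorize("sign", "signer-service", []))
print(authorize("destroy", "key-admin", ["alice"]))
print(authorize("destroy", "key-admin", ["alice", "bob"]))
```

Deduplicating the confirmations means one person approving twice does not satisfy the rule.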
12) Observability and audit
Metrics:
- `sign_p95_ms`, `decrypt_p95_ms`, `jwks_skew_ms`,
- usage by 'kid', `old_kid_usage_ratio`,
- `invalid_signature_rate`, `decrypt_failure_rate`.
Audit:
- Each signing/decryption operation is logged with who/what/when/where/kid/purpose.
- History of key status changes and rotation/revocation requests.
- HSM attestation, key material access logs.
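One common way to make an audit trail tamper-evident is to chain entries by hash. The source does not prescribe this technique; the sketch below is an illustration with hypothetical field values, and a production log would also need protected storage.

```python
import hashlib
import json

GENESIS = "0" * 64  # prev_hash of the first entry


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_audit(log: list, event: dict) -> dict:
    """Append an event that commits to the hash of the previous entry."""
    body = {"prev_hash": log[-1]["hash"] if log else GENESIS, **event}
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash and check that each entry points at its predecessor."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev_hash"] != prev or _digest(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


log = []
append_audit(log, {"who": "signer-service", "what": "sign", "when": "2025-10-15T08:00:00Z",
                   "where": "eu-core", "kid": "eu-core-es256-2025-10", "purpose": "jwt-sign"})
append_audit(log, {"who": "key-admin", "what": "rotate", "when": "2025-10-15T09:00:00Z",
                   "where": "eu-core", "kid": "eu-core-es256-2025-07", "purpose": "jwt-sign"})
print(verify_chain(log))
```

Editing or removing any earlier entry changes its hash and breaks every later link, so the tampering shows up on verification.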
13) Playbooks (incidents)
1. Signature key compromise
Immediately revoke the old 'kid' (or move it to 'retiring' with a minimal window), publish a new JWKS, shorten token TTLs, force logout/invalidate refresh tokens, notify integration owners, run a retrospective audit.
2. Mass 'INVALID_SIGNATURE' errors after rotation
Check the JWKS cache and clock skew, restore dual-accept, extend the window, notify clients.
3. Increase in KMS/HSM latency
Enabling a local signature cache is not allowed; instead: batching/queueing at the issuer, autoscaling the HSM proxy, prioritizing critical flows.
4. Failure of one region
Activate regional isolation procedures; do not "pull" keys from other regions; degrade the features that depend on signatures in the failed region.
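Stale JWKS caches are a typical cause of the second playbook. A client-side cache can limit the damage by refetching when it sees an unknown 'kid' instead of waiting for the TTL. `JwksCache` and the injected `fetch` callable are hypothetical; a real client would also rate-limit refreshes so that bogus 'kid' values cannot flood the JWKS endpoint.

```python
import time
from typing import Callable, Optional


class JwksCache:
    """Cache JWKS keys by kid; refresh on TTL expiry or on an unknown kid."""

    def __init__(self, fetch: Callable[[], dict], ttl_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch, self._ttl, self._clock = fetch, ttl_s, clock
        self._keys = {}
        self._loaded_at = None

    def _refresh(self) -> None:
        self._keys = {key["kid"]: key for key in self._fetch()["keys"]}
        self._loaded_at = self._clock()

    def get(self, kid: str) -> Optional[dict]:
        expired = (self._loaded_at is None
                   or self._clock() - self._loaded_at > self._ttl)
        if expired or kid not in self._keys:
            self._refresh()  # NOTE: no rate limiting in this sketch
        return self._keys.get(kid)


# Demo: a fake publisher whose JWKS gains a key after the cache was filled.
published = {"keys": [{"kid": "eu-core-es256-2025-07"}]}
fetch_calls = []


def fake_fetch() -> dict:
    fetch_calls.append(1)
    return published


cache = JwksCache(fake_fetch)
cache.get("eu-core-es256-2025-07")
published["keys"].append({"kid": "eu-core-es256-2025-10"})
print(cache.get("eu-core-es256-2025-10"), len(fetch_calls))
```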
14) Testing
Contract: JWKS correctness, correct 'kid'/alg/use, client compatibility.
Negative: forged signature, obsolete 'kid', incorrect alg, clock skew.
Chaos: instant rotation, KMS unavailability, time drift.
Load: peak signatures (JWT/webhooks), peak decryptions (PII/payouts).
E2E: the dual-key window end to end: publish, verify, shift traffic, retire the old key.
15) Configuration Example (YAML)
```yaml
crypto:
  regions:
    - id: "eu-core"
      jwks_url: "https://sts.eu/.well-known/jwks.json"
      rotation:
        jwt_sign: { interval_days: 30, window_dual: "48h" }
        webhook: { interval_days: 60, window_dual: "72h" }
        kek: { interval_days: 90, action: "rewrap" }
      alg_policy:
        sign: ["ES256","EdDSA"]
        tls: ["TLS1.2+","ECDSA_P256"]
      publish:
        jwks_cache_ttl: "60s"
      audit:
        hsm_attestation_required: true
        two_person_rule: true
```
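A sketch of how a scheduler might turn `interval_days` and `window_dual` into concrete dates. The reading used here (the next key is published when the interval elapses, and the old key is retired when the dual window closes) is an assumption; `parse_duration` and `plan_rotation` are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone

UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse durations written like '48h' or '60s'."""
    value, unit = int(text[:-1]), text[-1]
    if unit not in UNITS:
        raise ValueError(f"unknown duration unit: {unit!r}")
    return timedelta(**{UNITS[unit]: value})


def plan_rotation(activated_at: datetime, interval_days: int, window_dual: str) -> dict:
    """Return when to publish the next key and when to retire the current one."""
    publish_next_at = activated_at + timedelta(days=interval_days)
    return {
        "publish_next_at": publish_next_at,
        "retire_old_at": publish_next_at + parse_duration(window_dual),
    }


plan = plan_rotation(
    datetime(2025, 10, 15, 8, 0, tzinfo=timezone.utc),
    interval_days=30,
    window_dual="48h",
)
print(plan["publish_next_at"].isoformat(), plan["retire_old_at"].isoformat())
```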
16) Example of JWKS and markers in artifacts
JWT header fragment:
```json
{ "alg":"ES256", "kid":"eu-core-es256-2025-10", "typ":"JWT" }
```
JWKS (public part):
```json
{ "keys":[
  {"kty":"EC","use":"sig","crv":"P-256","kid":"eu-core-es256-2025-10","x":"...","y":"..."},
  {"kty":"EC","use":"sig","crv":"P-256","kid":"eu-core-es256-2025-07","x":"...","y":"..."}
]}
```
17) Anti-patterns
Keys that live "for years" and are shared across all regions.
Big-bang rotation without dual-accept.
Exporting private keys from the KMS/HSM "for speed."
Mixing purposes: signing JWTs and encrypting data with the same key.
Missing HSM logs/attestation and IAM restrictions.
No re-wrap mechanism for DEKs during KEK rotation.
Hand-managed secrets in environment variables instead of a Secret Store.
18) Pre-production checklist
- All private keys are in the KMS/HSM; the IAM matrix and the 4-eye principle are configured.
- Algorithm policies, key lengths, and lifetimes are approved.
- The dual-key process is enabled, with per-'kid' share monitoring.
- JWKS is published with a short TTL and cache warming; clients accept ≥ 2 keys.
- Envelope encryption: the KEK rotates, DEKs are re-wrapped without downtime.
- Regional isolation and separate key sets per tenant.
- Playbooks for compromise, rolling, and forced rotation exist; drills have been run.
- Metrics (`old_kid_usage_ratio`, `invalid_signature_rate`) and alerts are enabled.
- The contract/negative/chaos/load/E2E test suite has passed.
- Documentation for integrations: how to handle a 'kid' change, which windows and error codes to expect.
Conclusion
Key management is an operational discipline: KMS/HSM as the source of truth, regular and safe rotations with a dual-key window, regional and tenant isolation, envelope encryption, and observability. Following these rules yields a cryptographic setup that scales, withstands incidents, and is easy to explain to an auditor, while developers and integrators get through any key change painlessly.