S2S authentication
S2S authentication proves which service/workload is making a request and grants it the minimum necessary rights for a limited time. Unlike user flows, there is no person involved, so short-lived credentials, cryptographic binding to the workload/channel, and clear observability are critical.
1) Goals and principles
Zero Trust by default: do not trust the network, only workload attestation and cryptography.
Short-lived credentials: minutes, not days/months.
Context binding: tenant/region/license/audience/scopes.
Centralized issuance, decentralized verification: STS/IdP + local verification.
Least privilege and explicit delegation: only the necessary scopes and audiences.
Painless rotation: dual-key/dual-cert windows and automation.
2) Threat model (minimum)
Theft of long-lived secrets (API keys, long-lived refresh tokens).
Service spoofing within the VPC/cluster.
Cross-region attacks when segmentation is broken.
Replay and traffic substitution via a proxy.
Supply-chain/container image substitution.
Configuration errors (overly broad firewall/mesh rules, a single JWKS shared by everyone).
3) Basic S2S patterns
3.1 mTLS (mutual certificates)
Identity is proven by the channel.
Certificates are short-lived (hours to a day) and come from an internal PKI; issuance and rotation are managed by the mesh/sidecar or a SPIRE agent.
Good for "neighbors" in the same trust domain and for binding tokens.
3.2 Service JWTs (STS)
Identity is proven by the message.
Short-lived access JWT (2-5 min) with `aud`, `scp`, `tenant`, and `region` claims.
Signed by KMS/HSM; public keys are published via JWKS with a `kid` and rotation.
Verified locally (no network call to the IdP).
3.3 SPIFFE/SPIRE (SVID)
Universal workload identity: `spiffe://trust-domain/ns/<ns>/sa/<sa>`.
Automatic issuance/rotation of X.509 and JWT SVIDs; integration with Istio/Linkerd.
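To make the identity format above concrete, here is a minimal sketch that splits a SPIFFE ID of the `ns/<ns>/sa/<sa>` shape into its parts. The `WorkloadId` type and both helper functions are hypothetical names introduced for illustration; real deployments would normally rely on a SPIFFE library rather than hand-rolled parsing.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class WorkloadId:
    """Hypothetical container for the parts of a SPIFFE ID."""
    trust_domain: str
    namespace: str
    service_account: str


def parse_spiffe_id(spiffe_id: str) -> WorkloadId:
    """Split spiffe://<trust-domain>/ns/<ns>/sa/<sa> into its components."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe" or not parsed.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    parts = parsed.path.strip("/").split("/")
    # This sketch only accepts the ns/<ns>/sa/<sa> path layout used in the text.
    if len(parts) != 4 or parts[0] != "ns" or parts[2] != "sa":
        raise ValueError(f"unexpected path layout: {parsed.path!r}")
    if not parts[1] or not parts[3]:
        raise ValueError(f"empty namespace or service account: {parsed.path!r}")
    return WorkloadId(parsed.netloc, parts[1], parts[3])


def same_trust_domain(a: str, b: str) -> bool:
    """True when both SPIFFE IDs belong to the same trust domain."""
    return parse_spiffe_id(a).trust_domain == parse_spiffe_id(b).trust_domain
```

A gateway could use a check like `same_trust_domain` to decide whether a call is a "neighbor" call or a cross-domain call that must go through a trusted gateway.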
3.4 OAuth 2.1 Client Credentials / Token Exchange (RFC 8693)
Machine clients obtain a token from the STS; for actions on behalf of a user, use OBO (token exchange).
Combine: mTLS for the channel, JWT for the message, SPIFFE for stable identities.
4) Reference architecture
    [KMS/HSM]                  [Policy Store / PDP]
        │
[STS/IdP (issuer)] ── JWKS ──[Gateway/PEP] ─────[Services/PEP]
        │                          │                  │
    SVID/JWT                       │                  │
  (SPIRE/Istio)                mTLS/DPoP          mTLS/DPoP
        │                          │                  │
[Workload/Sidecar]─────────────────┴──────────────────┘
Issuer (STS/IdP): issues short-lived service JWTs/SVIDs and publishes the JWKS.
Gateway (PEP): terminates network connections, validates mTLS/JWT, enriches the context, and queries the PDP.
Services (PEP): defense in depth, with a cache of PDP decisions.
SPIRE/mesh: automatic certificates and SVIDs for mTLS.
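The PDP decision cache held by a PEP can be sketched as a small TTL cache that is also cleared on a policy-change event. This is an illustration only: the `DecisionCache` class, its method names, and the cache key layout are assumptions, not part of the architecture described here.

```python
import time


class DecisionCache:
    """Hypothetical TTL cache for PDP allow/deny decisions held by a PEP."""

    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (decision, stored_at)

    def get(self, key):
        """Return the cached decision, or None on a miss or an expired entry."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        decision, stored_at = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]
            return None
        return decision

    def put(self, key, decision):
        """Store a boolean decision together with the time it was cached."""
        self._entries[key] = (bool(decision), self._clock())

    def invalidate_all(self):
        """Call this when a policy-change event arrives."""
        self._entries.clear()
```

A key such as `(subject, audience, tenant, action)` keeps decisions for different tenants apart; on a cache miss the PEP asks the PDP and stores the answer with `put`.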
5) Service JWT format (example)
```json
{
  "iss": "https://sts.core",
  "sub": "svc.catalog",
  "aud": ["svc.search"],
  "exp": 1730390100,
  "iat": 1730389800,
  "tenant": "brand_eu",
  "region": "EE",
  "scp": ["catalog:read:public", "catalog:read:tenant"],
  "mtls": { "bound": true, "spiffe": "spiffe://core/ns/prod/sa/catalog" }
}
```
Here `sub` is the service identity and `aud` is the target service/domain.
Signed with ES256/EdDSA; the `kid` identifies the active key.
Optional channel binding: a flag, the certificate hash, or the SVID.
6) Issuance Policies (STS) and Verification
Issuance:
- The subject is taken from the SVID/client certificate/client registry.
- Lifetime is 2-5 min with no refresh tokens; request a new token from the STS instead.
- Scopes/audiences are taken from the Policy Store (GitOps), not from the client request.
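The issuance rules can be sketched as a function that builds token claims from policy rather than from what the caller asks for. The `POLICY` dictionary is a hypothetical in-memory stand-in for the Policy Store, and `build_claims` is an illustrative name; signing is deliberately left out.

```python
import time

# Hypothetical stand-in for the Policy Store; in the text this is GitOps-managed.
POLICY = {
    "svc.catalog": {
        "audiences": {"svc.search", "svc.wallet"},
        "scopes": ["catalog:read:public", "catalog:read:tenant"],
        "ttl_seconds": 300,
    }
}


def build_claims(subject, audience, tenant, region, now=None, issuer="https://sts.core"):
    """Build service-token claims; scopes come from policy, never from the caller."""
    policy = POLICY.get(subject)
    if policy is None:
        raise PermissionError(f"unknown subject: {subject}")
    if audience not in policy["audiences"]:
        raise PermissionError(f"{subject} may not call {audience}")
    issued_at = int(time.time() if now is None else now)
    return {
        "iss": issuer,
        "sub": subject,
        "aud": [audience],
        "iat": issued_at,
        "exp": issued_at + policy["ttl_seconds"],  # short lifetime, no refresh token
        "tenant": tenant,
        "region": region,
        "scp": list(policy["scopes"]),
    }
```

Note that the caller can only name the audience it wants to reach; the subject would come from the authenticated SVID or client certificate, and the scopes are whatever the policy grants.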
Verification:
1. Check mTLS (optional) and the validity of the certificate chain.
2. Check the JWT signature against the JWKS (selected by `kid`).
3. Check `exp`/`nbf`/`iss`/`aud` and tenant/region/license.
4. Enrich the context and query the PDP (RBAC/ABAC/ReBAC).
5. Cache the PDP decision (TTL 30-120 s) with event-based invalidation.
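Step 3 of the verification list could look roughly like the following. This is a sketch under stated assumptions: the signature is taken to have been verified already, the 30-second skew is an arbitrary choice, and mapping a tenant/region mismatch to `POLICY_DENY` is one possible reading of the error codes used in this document.

```python
import time


def validate_claims(claims, *, issuer, audience, tenant, region, now=None, skew=30):
    """Return an error code string, or None when the claims pass.

    Assumes the JWT signature was already checked against the JWKS.
    """
    now = time.time() if now is None else now

    exp = claims.get("exp")
    if claims.get("iss") != issuer or not isinstance(exp, (int, float)):
        return "INVALID_TOKEN"
    if now > exp + skew:
        return "EXPIRED_TOKEN"
    nbf = claims.get("nbf")
    if isinstance(nbf, (int, float)) and now + skew < nbf:
        return "INVALID_TOKEN"

    # "aud" may be a single string or a list of strings.
    aud = claims.get("aud")
    if isinstance(aud, str):
        audiences = [aud]
    elif isinstance(aud, list):
        audiences = aud
    else:
        audiences = []
    if audience not in audiences:
        return "AUD_MISMATCH"

    # The token's tenant/region must match the resource being accessed.
    if claims.get("tenant") != tenant or claims.get("region") != region:
        return "POLICY_DENY"
    return None
```

Returning a code rather than raising keeps the mapping to HTTP responses in one place at the PEP.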
7) Multi-tenant and regions (trust domains)
Separate trust domains: `spiffe://eu.core`, `spiffe://latam.core`.
Separate JWKS/PKI per region; cross-region traffic goes only through trusted gateways.
Include `tenant`/`region`/`license` in the claims and check them against the resource being accessed.
Segment logs/audits by tenants and regions.
8) Mesh/sidecar and no-mesh mode
Istio/Linkerd: mTLS out of the box, policy-enforcement at the L4/L7 level, integration with SPIRE.
Without a mesh: a client library plus mutual TLS in the application; rotation is harder to manage, so automate it with an agent.
9) Keys, JWKS and Rotation
Private keys live only in KMS/HSM; signing is done via a remote call.
Rotation every N days; dual-key: both the old and the new key are accepted, and the issuer starts signing with the new key once caches have warmed up.
Monitoring: share of traffic per `kid`, and clients stuck on the old key.
```yaml
issuer:
  jwks:
    alg: ES256
    rotation_days: 30
    publish_cache_ttl: 60s
sts:
  access_ttl: 5m
  audience_policies:
    - subject: "svc.catalog"
      allow: ["svc.search", "svc.wallet"]
      scopes: ["catalog:read:"]
tenancy:
  claims: ["tenant", "region", "license"]
  jwks_per_region: true
```
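The dual-key window can be illustrated with a small key ring: a new key is published and accepted by verifiers first, promoted to the signing key later, and the old key is retired last. To stay dependency-free, this sketch uses HMAC-SHA256 as a stand-in for ES256/EdDSA signing in a KMS/HSM, and the `KeyRing` class is a hypothetical name.

```python
import hashlib
import hmac


class KeyRing:
    """Hypothetical dual-key ring; HMAC stands in for asymmetric signing."""

    def __init__(self):
        self._keys = {}          # kid -> secret; every key here is accepted
        self.signing_kid = None  # the one key the issuer signs with

    def add(self, kid, secret):
        """Publish a key so verifiers accept it; signing does not switch yet."""
        self._keys[kid] = secret
        if self.signing_kid is None:
            self.signing_kid = kid

    def promote(self, kid):
        """Switch signing to an already published key (after caches warm up)."""
        if kid not in self._keys:
            raise KeyError(kid)
        self.signing_kid = kid

    def retire(self, kid):
        """Stop accepting an old key once traffic on it has drained."""
        if kid == self.signing_kid:
            raise ValueError("cannot retire the active signing key")
        self._keys.pop(kid, None)

    def sign(self, payload):
        """Return (kid, mac) for the payload using the active signing key."""
        if self.signing_kid is None:
            raise RuntimeError("no signing key configured")
        secret = self._keys[self.signing_kid]
        return self.signing_kid, hmac.new(secret, payload, hashlib.sha256).hexdigest()

    def verify(self, kid, payload, mac):
        """Accept the MAC only if the kid is still published and the MAC matches."""
        secret = self._keys.get(kid)
        if secret is None:
            return False
        expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, mac)
```

The order add, promote, retire is what avoids a burst of 401 responses: tokens signed with the old key keep verifying until it is retired.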
10) Channel binding (DPoP/mTLS-bound)
mTLS-bound tokens: add the client certificate hash to the JWT and verify it on receipt.
DPoP: for HTTP clients without mTLS, sign each request with a DPoP key and place the DPoP key thumbprint in the access token.
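The mTLS-bound check can be sketched as comparing a certificate thumbprint with a claim in the token. This example assumes the RFC 8705 convention, where the token carries `cnf` with an `x5t#S256` member holding the base64url-encoded SHA-256 hash of the DER-encoded client certificate; the example token earlier in this document uses its own `mtls` claim, so treat the claim name as an assumption. The function names are illustrative.

```python
import base64
import hashlib
import hmac


def cert_thumbprint(cert_der: bytes) -> str:
    """Base64url (unpadded) SHA-256 hash of a DER-encoded certificate."""
    digest = hashlib.sha256(cert_der).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def is_bound_to_cert(claims: dict, cert_der: bytes) -> bool:
    """True when the token's cnf.x5t#S256 matches the presented certificate."""
    cnf = claims.get("cnf")
    if not isinstance(cnf, dict):
        return False
    expected = cnf.get("x5t#S256")
    if not isinstance(expected, str):
        return False
    actual = cert_thumbprint(cert_der)
    # Constant-time comparison on bytes.
    return hmac.compare_digest(expected.encode("utf-8"), actual.encode("utf-8"))
```

The receiver takes `cert_der` from the TLS layer for the current connection, so a token copied to another client fails the check.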
11) Errors and response policy
Standardize codes:
- `401 INVALID_TOKEN`/`EXPIRED_TOKEN`/`AUD_MISMATCH`.
- `401 MTLS_REQUIRED`/`MTLS_CERT_INVALID`.
- `403 INSUFFICIENT_SCOPE`/`POLICY_DENY`.
- `429 RATE_LIMITED`.
The response contains a machine-readable `error_code` and `as_of` (key/policy version).
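A minimal sketch of how a PEP might turn these codes into responses is shown below. The status mapping follows the list above; the `error_response` helper and the string format of `as_of` are assumptions made for illustration.

```python
import json

# Status codes as listed in this section.
STATUS_BY_CODE = {
    "INVALID_TOKEN": 401,
    "EXPIRED_TOKEN": 401,
    "AUD_MISMATCH": 401,
    "MTLS_REQUIRED": 401,
    "MTLS_CERT_INVALID": 401,
    "INSUFFICIENT_SCOPE": 403,
    "POLICY_DENY": 403,
    "RATE_LIMITED": 429,
}


def error_response(error_code, as_of):
    """Return (http_status, json_body) for a standardized auth error."""
    status = STATUS_BY_CODE.get(error_code)
    if status is None:
        raise ValueError(f"unknown error code: {error_code}")
    body = json.dumps({"error_code": error_code, "as_of": as_of}, sort_keys=True)
    return status, body
```

Rejecting unknown codes keeps the set of error codes closed, which makes dashboards and client-side handling predictable.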
12) Observability and audit
Metrics:
- `s2s_auth_p95_ms`, `verify_jwt_p95_ms`, `jwks_skew_ms`,
- `invalid_token_rate`, `aud_mismatch_rate`, `insufficient_scope_rate`,
- traffic share per `kid`, the proportion of mTLS-bound requests.
Logs:
- `subject`, `aud`, `tenant`, `region`, `scp`, `kid`, `sid/svid`, `decision`, `policy_version`, `trace_id`.
Audit:
- Token issuance, key rotation, policy changes, rejected requests.
13) Performance
Verify JWTs locally; cache the JWKS (TTL 30-60 s) and refresh it in the background.
For X.509 chains, use CA pinning and cache OCSP/CRL responses.
Move expensive validation I/O to the gateway/sidecar.
Prefetch tokens/certificates (10-20 seconds before expiration).
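The prefetch rule can be sketched as a client-side token cache that asks for a new token shortly before the current one expires. Note that this simplified version refreshes lazily on access rather than in a background task; the `TokenCache` class and the `fetch` callable, which is assumed to return a `(token, exp)` pair from the STS, are hypothetical.

```python
import time


class TokenCache:
    """Hypothetical client-side cache that renews a token ahead of expiry."""

    def __init__(self, fetch, lead_seconds=15, clock=time.time):
        self._fetch = fetch        # callable returning (token, exp_unix_seconds)
        self._lead = lead_seconds  # how early to renew, e.g. 10-20 s
        self._clock = clock
        self._token = None
        self._exp = 0.0

    def get(self):
        """Return a token, fetching a new one if inside the lead window."""
        if self._token is None or self._clock() >= self._exp - self._lead:
            self._token, self._exp = self._fetch()
        return self._token
```

With a 5-minute TTL and a 15-second lead, a busy client fetches roughly one token every 285 seconds and never sends a token that is about to expire in flight.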
14) Testing
Contract/interop: different programming languages/libraries, clock skew ± 300 s.
Negative: expired/forged token, incorrect `aud`, wrong region/tenant, broken certificate chain.
Chaos: sudden `kid` rotation, JWKS unavailability, mass expiration, mTLS breakage.
Load: peak issuance on the STS, verification spike on the gateway.
E2E: mTLS-only, JWT-only, combined mode, Token Exchange (OBO).
15) Playbooks (runbooks)
1. Signing key compromise
Immediately revoke the `kid`, issue a new key, shorten token TTL, run an audit, find clients stuck on the old key, and force-deny the old `kid`.
2. Mass `INVALID_TOKEN`
Check the JWKS cache, clock skew, and token expiry (TTL too short); temporarily widen the skew tolerance and warm up the JWKS.
3. mTLS failures
Check the CA chain, SVID validity dates, and host time; trigger an emergency reissue via SPIRE/Istio; enable fallback routes only within the region.
4. `AUD_MISMATCH` growth
Audience policy drift: compare the STS policy with actual calls, temporarily add the required `aud`, and schedule a fix to the call architecture.
5. STS unavailable/slow
Accept already issued tokens for a grace period beyond their TTL, enable prefetch/earlier refresh, and scale out the STS.
16) Typical errors
Long-lived API keys/secrets in env/code.
A single shared JWKS/PKI "for all regions and for all time."
No binding (mTLS/DPoP) → a stolen token is easy to reuse.
Overly broad `aud` values and "admin" scopes by default.
Rotation without dual-key period → mass 401.
Checking tokens only at the gateway (no defense in depth).
Opaque failures (no `error_code` or `reason`): hard to debug and hard for teams to learn from.
17) Mini Configuration Templates
PEP (gateway) rules:
```yaml
auth:
  require_mtls: true
  jwks:
    url: https://sts.core/.well-known/jwks.json
    cache_ttl: 60s
  claims:
    required: ["iss", "sub", "aud", "exp", "tenant", "region"]
    tenant_in_header: "x-tenant"
  pdp:
    endpoint: "opa:8181/v1/data/policy/allow"
    decision_cache_ttl: 60s
```
STS policy (fragment):
```yaml
subjects:
  - id: "svc.catalog"
    spiffe: "spiffe://core/ns/prod/sa/catalog"
    audiences: ["svc.search", "svc.wallet"]
    scopes: ["catalog:read:"]
    ttl: "5m"
```
18) Pre-production checklist
- Short service JWT (≤5 min), local verification, JWKS cache.
- mTLS (or DPoP) enabled; priority - mTLS-bound tokens.
- SPIFFE/SPIRE or equivalent for auto-issuance/rotation of certificates.
- STS with audience/scope policies; issuance by trusted identity only.
- Trust domains and JWKS separated by region; `tenant`/`region`/`license` claims are checked.
- PDP/PEP integrated, decision cache plus event-based invalidation.
- Dual-key windows, monitoring of traffic per `kid`, alerts on invalid-token and `aud` mismatch rates.
- Full S2S logs/audit, performance/error metrics enabled.
- Playbooks for key compromise, STS outage, and mTLS failure.
- Contract/negative/chaos/load/E2E test suite passed.
Conclusion
S2S authentication is a combination of channel trust (mTLS), message trust (short-lived JWTs), and stable workload identity (SPIFFE), managed by a centralized STS and verified locally. Add trust domain separation, strict audiences/scopes, automatic rotation, and observability, and you have a design that is reliable, explainable, and scales along with the platform and its geography.