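Operations Roadmap
1) Why it is needed
An Ops Roadmap turns the scattered tasks of SRE, platform, support, and domain teams into one transparent plan: what effect on SLOs, cost, and incidents we deliver each quarter, and at what price (people, time, budget). This reduces chaos, makes technical debt easier to manage, and speeds up delivery of value to the business.
Goals:
- Align initiatives around measurable outcomes (SLO, MTTR, Cost/RPS, Risk).
- Coordinate priorities across the platform, domains, and external providers.
- Budget resources and record "what we are not doing" (explicit trade-offs).
- Maintain a single source of truth on performance and risks.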
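2) Roadmap principles
1. Outcome-first: every initiative is tied to an outcome metric (not "implement X" but "reduce MTTR by 20%").
2. SLO-aware: initiatives that affect the SLOs of critical paths (deposits/bets/games/KYC) get higher priority.
3. Data-driven: rely on incidents, postmortems, anomalies, and Capacity/FinOps dashboards.
4. Time-boxed and reversible: small increments, hypothesis validation, fast rollback.
5. Single source of truth: one artifact, regular reviews, and public status.
6. No hidden work: work outside the roadmap is allowed only for "fires", and only under agreed rules.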
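3) Roadmap framework: levels and artifacts
Vision (12-18 months): 3-5 operational themes (reliability, scale, cost, security, automation).
Pillars (6-12 months): thematic program blocks (for example "SLO coverage for 100% of critical paths", "active-active across 2 regions").
Quarterly plan (Q): concrete initiatives with metrics, owners, dependencies, and budget.
Iterations (2-3 weeks): tasks/epics and actual progress.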
ID: OPS-23
4) Prioritization: How to compare the incomparable
4.1 RICE (Reach, Impact, Confidence, Effort)
Reach: affected users/transactions/geo.
Impact: expected contribution to SLO/MTTR/Cost.
Confidence: how well the estimates are supported (data/pilots).
Effort: person-weeks/calendar window/dependencies.
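As an illustration of how the four RICE components combine, here is a minimal sketch assuming the common formula Reach × Impact × Confidence / Effort. The `Initiative` class, the IDs `OPS-A`/`OPS-B`, and all numbers are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class Initiative:
    id: str
    reach: float       # affected users/transactions per period
    impact: float      # expected contribution on an agreed scale (assumed: 0.25 to 3)
    confidence: float  # 0..1, how well the estimates are supported
    effort: float      # person-weeks


def rice_score(item: Initiative) -> float:
    """RICE = Reach * Impact * Confidence / Effort."""
    if item.effort <= 0:
        raise ValueError("effort must be positive")
    return item.reach * item.impact * item.confidence / item.effort


# Hypothetical backlog; the numbers are illustrative only.
backlog = [
    Initiative("OPS-B", reach=800_000, impact=2.0, confidence=0.9, effort=2),
    Initiative("OPS-A", reach=5_000_000, impact=3.0, confidence=0.7, effort=6),
]

# Highest score first.
ranked = sorted(backlog, key=rice_score, reverse=True)
for item in ranked:
    print(item.id, round(rice_score(item)))
```

The absolute score has no meaning on its own; it is only useful for ranking initiatives that were estimated on the same scales.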
4.2 WSJF (Scaled)
WSJF = Cost of Delay / Job Size, where Cost of Delay = SLO Risk + Revenue Impact + Compliance + Incident Rate and Job Size = duration/effort.
Suitable for mixed initiatives (technical debt, security, platform features).
Rule: initiatives with high SLO risk and a high cost of delay come first, even if the effect is "invisible" in the UI.
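The WSJF ranking can be sketched as follows, assuming the standard definition (Cost of Delay divided by Job Size) and assuming each Cost of Delay component is scored on a relative 1-10 scale. The function name, the scale, and the two example initiatives are hypothetical.

```python
def wsjf(slo_risk: float, revenue_impact: float, compliance: float,
         incident_rate: float, job_size: float) -> float:
    """Weighted Shortest Job First: Cost of Delay / Job Size."""
    if job_size <= 0:
        raise ValueError("job_size must be positive")
    cost_of_delay = slo_risk + revenue_impact + compliance + incident_rate
    return cost_of_delay / job_size


# Hypothetical scores on an assumed relative 1-10 scale.
candidates = {
    "ui-theme-refresh": dict(slo_risk=1, revenue_impact=3, compliance=1,
                             incident_rate=1, job_size=3),
    "kafka-lag-guardrails": dict(slo_risk=8, revenue_impact=5, compliance=1,
                                 incident_rate=8, job_size=5),
}

scores = {name: wsjf(**parts) for name, parts in candidates.items()}
order = sorted(scores, key=scores.get, reverse=True)
print(order, scores)
```

In this made-up example the guardrails initiative wins despite being the larger job, because its SLO risk and incident rate dominate the cost of delay.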
5) Relationship with OKR, SLO and incidents
Platform-level OKR:
KR1: "Reduce Change Failure Rate from 18% to 12% by the end of Q2."
KR2: "Increase Pre-Incident Detect Rate from 35% to 60%."
SLO matrix: target p95/p99/Success Rate/Availability for each domain.
Incident analytics: each of the top 3 incident causes from the last quarter must have a mitigating initiative in the current one.
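The incident-analytics rule can be checked mechanically. The sketch below assumes that each incident record carries a single root-cause label and that each initiative declares which causes it addresses; both conventions, and all labels, are assumptions made for the example.

```python
from collections import Counter


def uncovered_top_causes(incident_causes, addressed_causes, top_n=3):
    """Return the top-N incident causes that no current initiative addresses."""
    top = [cause for cause, _ in Counter(incident_causes).most_common(top_n)]
    return [cause for cause in top if cause not in addressed_causes]


# Hypothetical root-cause labels from last quarter's incidents.
last_quarter = (
    ["provider-timeout"] * 5 + ["bad-deploy"] * 3 + ["kafka-lag"] * 2 + ["dns"]
)
# Causes targeted by initiatives in the current quarterly plan (hypothetical).
addressed = {"provider-timeout", "kafka-lag"}

gaps = uncovered_top_causes(last_quarter, addressed)
print(gaps)
```

A non-empty result is a signal for quarterly planning: either add an initiative for the cause or record explicitly why it is not being addressed.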
6) Resource and budget planning
FTE-matrix: by squads and competencies (SRE, Observability, Data, Integrations).
Provider calendar: maintenance/quota windows (impact on dates).
CapEx/OpEx: licenses/cluster expansion vs. team hours.
Buffer: about 15-20% for unplanned "fires" and regulatory tasks.
"Not doing" policy: a list of dropped or postponed initiatives with reasons.
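To show how the buffer and the "not doing" list interact, here is a rough sketch. It assumes a 13-week quarter, a 20% buffer, and a simple greedy fill in priority order; the effort figures are invented, and a real plan would also account for competencies and provider windows.

```python
def plannable_weeks(fte: float, weeks_in_quarter: int = 13,
                    buffer: float = 0.2) -> float:
    """Person-weeks left for planned work after reserving the buffer."""
    return fte * weeks_in_quarter * (1 - buffer)


def fit_plan(prioritized, capacity):
    """Greedy fill: walk initiatives in priority order, defer what does not fit."""
    planned, not_doing, used = [], [], 0.0
    for name, effort in prioritized:
        if used + effort <= capacity:
            planned.append(name)
            used += effort
        else:
            not_doing.append(name)
    return planned, not_doing


# Hypothetical squad of 4 FTE; efforts are in person-weeks, already in priority order.
capacity = plannable_weeks(fte=4)
prioritized = [
    ("active-active-games", 20),
    ("slo-burn-payments", 18),
    ("kafka-lag-guardrails", 6),
    ("finops-right-sizing", 3),
]
planned, not_doing = fit_plan(prioritized, capacity)
print(capacity, planned, not_doing)
```

Anything that lands in `not_doing` should go into the published list together with the reason, here simply "does not fit the quarter's capacity".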
7) Managing dependencies and risks
Dependency map: who blocks whom (service/provider/data/team).
Risk register: risk, probability/impact, owner, mitigation plan/plan B.
Change freeze: periods when major changes are prohibited (prime-time events/tournaments).
Feature flags/canaries: mandatory for initiatives that affect traffic.
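A dependency map can be turned into a delivery order with a topological sort. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); the graph contents are illustrative, and `canary-guardrails` is a hypothetical initiative name.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is blocked by the items in its set (illustrative data).
blocked_by = {
    "canary-guardrails": {"observability-baseline", "feature-flags-core"},
    "feature-flags-core": {"observability-baseline"},
    "observability-baseline": set(),
}


def delivery_order(graph):
    """Return items so that every blocker comes before what it blocks."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


order = delivery_order(blocked_by)
print(order)
```

A circular dependency raises an error instead of producing an order, which is usually the more useful outcome at planning time.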
8) Quarterly cycle (rhythms)
Q-0 (preparation, 2 weeks): data collection (SLO, incidents, costs), review of themes, preliminary prioritization.
Q planning: owners defend their initiatives, resources and risks are reconciled, and the Q plan and the "not doing" list are fixed.
Weekly sync: status, blockers, adjustments; maximum 30 minutes.
Monthly review: checking effects on metrics, possible re-scope.
Q retro: compare plan vs. actual, update principles/templates.
9) Roadmap view formats
Outcome View: grouped by target outcome (SLO, Cost, Risk).
Domain View: Payments/Bets/Games/KYC/Platform.
Timeline View: quarterly, with dependency and freeze markers.
Budget View: FTE/CapEx/OpEx by initiative and theme.
Example of a quarterly slice (summary):
Initiative            Outcome               Metrics               Term   Owner          Risk
--------------------  --------------------  --------------------  -----  -------------  ------
Active-Active Games   RTO ≤ 5 min           Availability 99.95%   Q1–Q2  platform-sre   High
SLO-burn on Payments  −30% late incidents   Pre-Incident↑, MTTR↓  Q1     observability  Medium
Kafka Lag Guardrails  −50% lag storms       Lag p95↓, DLQ↑        Q1     streaming      Medium
FinOps Right-sizing   −15% cost/RPS         Cost/RPS↓             Q2     finops         Low
10) Roadmap Success Metrics (KPIs)
Delivery Predictability: percentage of initiatives completed on time (target ≥ 80%).
SLO Coverage: % of critical paths with active SLOs/alerts.
Incident Trend: −X% P1/P2 incidents QoQ.
Change Failure Rate: targeted quarter-over-quarter decrease.
Cost Efficiency: Cost/RPS, Cost/transaction - downward trend.
Risk Burn-down: the number of "red" risks and their total weight.
Stakeholder NPS: satisfaction of domain teams with the quality of the Roadmap.
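Several of these KPIs reduce to simple ratios. The sketch below is one possible way to compute three of them; the risk weight (probability × impact) and the "red" threshold are assumptions, as are all names and numbers.

```python
def delivery_predictability(completed_on_time: int, planned: int) -> float:
    """Percentage of planned initiatives completed on time."""
    return 100.0 * completed_on_time / planned


def slo_coverage(critical_paths: dict) -> float:
    """Percentage of critical paths that have an active SLO/alert."""
    return 100.0 * sum(critical_paths.values()) / len(critical_paths)


def risk_burn_down(risks, red_threshold: float = 4.0):
    """Count "red" risks and sum their weights (assumed: probability * impact)."""
    red = [p * i for _, p, i in risks if p * i >= red_threshold]
    return len(red), round(sum(red), 2)


# Hypothetical quarter.
paths = {"deposits": True, "bets": True, "games": True, "kyc": False}
register = [("psp-outage", 0.5, 8), ("kafka-lag", 0.2, 5), ("cert-expiry", 0.9, 9)]

print(delivery_predictability(8, 10))  # compare with the >= 80% target
print(slo_coverage(paths))
print(risk_burn_down(register))
```

Tracking the same three numbers every quarter is what makes the trend, rather than any single value, visible.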
11) Roadmap launch checklist
[] Themes/pillars and 3-5 target outcomes per year are defined.
[] Initiative catalog is linked to metrics and owners.
[] Prioritization methodology (RICE/WSJF) and scoring scales are adopted.
[] Resources are verified: FTE, provider windows, budgets.
[] Q plan and "not doing" list are fixed.
[] Outcome/Domain/Budget dashboards and alerts on deviations are set up.
[] Review schedule is set: weekly/monthly/quarterly.
12) Anti-patterns
List of tasks without outcomes: "make X" instead of "achieve Y by metric."
Hidden initiatives and private arrangements outside of a single artifact.
Eternal epics: no time-box, no verifiable milestones.
Priority "by loudness": resources go to the loudest request rather than the most valuable one.
No "not doing" list: expectations become unmanageable and trust erodes.
No link to incidents/SLOs: "cosmetic" improvements instead of real ones.
13) Templates (fragments)
Initiative Template (YAML):
id: OPS-42
title: "Guardrails for canary releases"
theme: "Reliability"
quarter: "2025-Q1"
owner: "platform-release"
stakeholders: ["payments", "bets", "games"]
outcome: "Reduce post-release regressions by 40%"
metrics:
  - name: change_failure_rate
    target: "<= 12%"
  - name: post_deploy_regression_rate
    target: "-40% QoQ"
slo_impact: ["api_p99 <= 300ms@99.9", "availability >= 99.95%"]
effort_weeks: 6
rice:
  reach: 5000000  # transactions
  impact: 3.0
  confidence: 0.7
  effort: 6
dependencies: ["observability-baseline", "feature-flags-core"]
risks:
  - name: "False-positive gates"
    mitigation: "Baseline/tuning, pilot on 10% of traffic"
budget:
  fte: 3
  capex: 0
milestones:
  - name: design
    eta: "2025-01-20"
  - name: pilot-10%
    eta: "2025-02-10"
  - name: rollout-100%
    eta: "2025-03-05"
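An initiative described this way can be linted before it enters the catalog. The sketch below works on an already-parsed dictionary (parsing the YAML itself is left out to stay dependency-free) and assumes English field names such as `outcome`; the list of required fields and the checks themselves are a hypothetical minimum, not a fixed schema.

```python
# Assumed minimum set of fields; adjust to the template actually in use.
REQUIRED = ("id", "owner", "quarter", "outcome", "metrics", "milestones")


def validate_initiative(doc: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing field: {key}" for key in REQUIRED if not doc.get(key)]
    for metric in doc.get("metrics") or []:
        if not metric.get("name") or not metric.get("target"):
            problems.append(f"metric without name/target: {metric}")
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    etas = [m.get("eta", "") for m in doc.get("milestones") or []]
    if etas != sorted(etas):
        problems.append("milestones are not in chronological order")
    return problems


good = {
    "id": "OPS-42",
    "owner": "platform-release",
    "quarter": "2025-Q1",
    "outcome": "Reduce post-release regressions by 40%",
    "metrics": [{"name": "change_failure_rate", "target": "<= 12%"}],
    "milestones": [
        {"name": "design", "eta": "2025-01-20"},
        {"name": "pilot-10%", "eta": "2025-02-10"},
    ],
}
bad = {**good, "outcome": "", "metrics": [{"name": "mttr"}]}

print(validate_initiative(good))
print(validate_initiative(bad))
```

The check for a non-empty `outcome` enforces the outcome-first principle at the point where initiatives are written down.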
Quarterly report template (Markdown):
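Q1 Ops Roadmap: Report
Outcome summary: SLO Coverage 92% (+7 p.p.), MTTR −18%, Cost/RPS −9%
Completed: 8/10 initiatives (80%)
Carried over: OPS-31 → Q2 (dependency on provider PSP-X)
Incidents: P1=2 (−1 QoQ), main cause: retries on provider timeouts
Follow-ups: tune circuit breakers, reserve PSP-Y quota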
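14) Integration with processes
Incident management: every postmortem → an initiative/improvement in the Roadmap.
Changes/releases: major initiatives ship only behind flags/canaries.
Capacity/FinOps: monthly sync on headroom and cost trends.
Security/compliance: quarterly requirements and audit checkpoints.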
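15) 30/60/90 (quick start)
30 days: collect the incident/metrics baseline, form themes, describe 10-15 initiatives in YAML, choose RICE/WSJF, fix the Q plan.
60 days: launch the Outcome/Domain/Budget dashboards, hold the first mid-quarter review, adjust priorities based on data.
90 days: sum up the quarter, update principles and scoring scales, recalculate the annual pillars.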
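16) Communication and transparency
Monthly stakeholder review: 30 minutes, focused on outcomes and risks.
Asynchronous updates: short entries with before/after metrics.
Single Roadmap channel: statuses, changes, prioritization decisions.
Red card rule: any team can trigger a priority review by attaching data (SLO/incidents/cost).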
17) FAQ
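Q: What if everything is "on fire" and there is no time for the Roadmap?
A: Include a 15-20% "firefighting buffer" and a minimum Q plan of 3 initiatives that cover the main incident causes. Any new "hot" work comes in only through re-prioritization.
Q: How do you prove the value of "invisible" initiatives (observability, automation)?
A: Track Change Failure Rate, MTTR, Pre-Incident Detect Rate, rollbacks, and night-time pages. Show the before/after trend.
Q: How should technical debt be handled?
A: Debt is also an initiative with an outcome: "−X% incidents of class N", "−Y% cost/RPS", "+Z p.p. SLO Coverage". Debt without a measurable outcome does not enter the plan.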