hj: December 2021

This book reads more like a collection of blog posts or excerpts from SRE (Site Reliability Engineering) experts; the authors come from various high-tech companies with SRE experience. In this reading note, I simply list some statements I find useful and interesting for my team's day-to-day job.

SRE in six words: measure, analyze, decide, act, reflect and repeat
Reliability stack is made up of three components: SLI (service level indicators), SLO (service level objectives), and error budgets.
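To make the relationship between SLI, SLO, and error budget concrete, here is a minimal sketch of the arithmetic. The function names, the 30-day window, and the availability-style SLI (good requests over total requests) are my own assumptions for illustration, not something the book prescribes.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed minutes of unavailability in the window for a given SLO target."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent, given request counts.

    The SLI assumed here is good_events / total_events.
    A negative result means the budget is overspent.
    """
    if total_events == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% SLO leaves no budget at all.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad
```

For example, a 99.9% SLO over 30 days leaves about 43.2 minutes of budget, and 500 failed requests out of 1,000,000 would spend roughly half of it.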
Here are ways to improve resilience:

Load reduction - throttling, load shedding/prioritization, queuing, load balancing
Latency reduction - caching, regional replication
Load adaptation - autoscaling, overprovisioning
Resilience - timeouts, circuit breakers, bulkheads, retries, fail-overs, fallback
Meta-techniques - better tooling, scale up or fail over faster
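As one illustration of the techniques listed above, here is a minimal sketch of a circuit breaker with a fallback. The class name, thresholds, and the injectable clock are hypothetical choices of mine; production libraries usually add per-call timeouts, metrics, and thread safety, which this sketch leaves out.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Closed -> open after N consecutive failures, then half-open
    (one trial call is allowed) once the cool-down has elapsed."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock  # injectable so the breaker can be tested without sleeping
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # After the cool-down, report "not open" so a trial call goes through.
        return self._clock() - self._opened_at < self.reset_after_s

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # fail fast without touching the dependency
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip (or re-trip) the breaker
            return fallback()
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap a flaky dependency as `breaker.call(fetch_from_backend, serve_cached_copy)`, where both function names are placeholders for your own code.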

Observability is for figuring out where in your systems to find the code you need to debug.
Tools such as OpenTracing might show the full path of a management call as well as a user request, possibly exposing unintended interactions.
Use NIST's Cybersecurity Framework of Identify, Protect, Detect, Respond, and Recover as a foundation for your own security journey.
You can't do SRE well without investing in a culture of communication. Writing is good for reliability: the more precise, the better.
Trust that your manager has your back. They support you if things don't go to plan, celebrate if you succeed, suggest ideas if you're stuck, find opportunities if you're bored, and gather help if you're overloaded.
Successful SRE adoption is about so much more than automating your operations. It's cultural.
Success in SRE requires deep emotional understanding, influence, and organizational context to advocate for change and foster a blameless engineering culture.
With incident response, start small. A pretty good starting point is to think in terms of three roles: incident manager, expert/operator, and communications.
By starting out small and getting some quick wins under your belt, you'll be able to demonstrate the positive benefits of SRE through incremental change and reduce daily toil for yourself or other engineers.
You can follow a structured methodological process to pinpoint the problem, avoiding mistakes and cognitive biases: triage, operational definition, making the mental model of the system explicit, iterating on the model, reconstructing and validating, next steps.
At startups, SRE is often an afterthought behind shiny new features. This could be because product or market fit is higher priority, or SLOs aren't clearly aligned with customers' needs.
How do you measure SRE return on investment? The measures can take many forms, from SLO improvements, toil reduction, and achieving OKRs (objectives and key results), through to client satisfaction surveys.
It's okay not to know, and it's okay to be wrong. One of LinkedIn's core values is "Take intelligent risks". "I don't know, but I will find out and get back to you."
Stop the bleeding: keep the focus unrelentingly on prioritizing mitigation. The first impulse should be to keep the ongoing conversation focused solely on recovering the current situation.
Runbook creation, maintenance, and review should be a whole-team activity.
Ideally, a playbook should only contain:

Why do I care? Severity and qualification of the user-visible impact.
What can I look at? Console, logs, and inspection tools.
What can I do? Mitigation tooling.
Whom can I escalate to? Developers, back-end teams, or a dedicated incident response team.
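The four questions above map naturally onto a small record type. This is only a sketch of how a team might check playbook entries for completeness; the class, field, and method names are hypothetical and not taken from the book.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlaybookEntry:
    alert: str
    impact: str = ""                                         # Why do I care?
    where_to_look: List[str] = field(default_factory=list)   # What can I look at?
    mitigations: List[str] = field(default_factory=list)     # What can I do?
    escalation: List[str] = field(default_factory=list)      # Whom can I escalate to?

    def missing_sections(self) -> List[str]:
        """Return the names of the sections that are still empty."""
        missing = []
        if not self.impact.strip():
            missing.append("impact")
        if not self.where_to_look:
            missing.append("where_to_look")
        if not self.mitigations:
            missing.append("mitigations")
        if not self.escalation:
            missing.append("escalation")
        return missing
```

A check like `entry.missing_sections()` could run in a review step so that incomplete entries are caught before an on-caller needs them.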

The key to reliability is the ability to make the system better (recover) quickly.
Taking a holistic approach will help reduce the MTTR (mean time to repair). A few minutes of downtime might be overlooked by your users, but a few hours can lead to loss of user trust, bad press coverage, and potential loss of revenue.
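To track whether MTTR is actually going down, it helps to compute it from incident records. The sketch below assumes each incident is stored as a (detected_at, resolved_at) pair and measures repair time from detection to resolution; both the record shape and that definition are my assumptions, and teams differ on where they start the clock.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Incident = Tuple[datetime, datetime]  # (detected_at, resolved_at), a hypothetical record shape


def mean_time_to_repair(incidents: List[Incident]) -> timedelta:
    """Average duration from detection to resolution across incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so sum() adds timedeltas, not ints
    )
    return total / len(incidents)
```

Two incidents lasting 10 and 50 minutes give an MTTR of 30 minutes, which shows how a single long outage dominates the average.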
It's not that the tools or the processes don't matter, because they do, but often the biggest obstacle to creating a DevOps culture is ourselves and how we work with our teams.
Establishing an SRE mindset and its practices is foundational to the long-term, sustainable success of any SRE team. Culture change is about people: identify culture carriers who are adept at empowering others and building trust.
Treat your security team the same way you wish that one product team treated you.
Separating SRE from dev teams leads to a few problems, including elitism, knowledge constraints, separation from the work, and sponsorship. Work closely with the security team to resolve CVEs (Common Vulnerabilities and Exposures).
We can recognize that heroes do their best work as part of a team, and true heroes don't need a hero culture to do good.
The best source for ideas to improve on call is the on-callers themselves. You can hold frequent retrospectives and reflect on ways you can make on call better.
Building happy, healthy on-call rotations is a superpower and one that you can gain by taking the time and effort to incentivize people, reduce the pain points, provide mentorship, and iterate rapidly.
Study human factors and team culture to reduce pager fatigue:

Technical literacy and hands-on experience
Good communication and collaboration
Build a culture of accountability and ownership, and celebrate achievements
Establish an effective feedback loop to ensure that everyone's voice is heard
Demonstrate high levels of empathy, looking proactively at opportunities to support each other.

If your service is meeting its SLOs and SLAs, but your on call is unhealthy, your service won't stay healthy for long. Unhealthy on-call rotations will lead to fatigue, burnout, and attrition.
Test your disaster plan. A disaster plan can only keep you safe if it works.
Make your engineering blog a priority. An engineering blog is not only great for recruiting, it is a key component of a strong learning organization.
There has long been tension between SRE and product. The former wants to optimize for reliability first and foremost, whereas the latter wants to ship, which leads to change, which in turn leads to breakage. If we can build empathy between product and SRE teams, it will lead not only to a healthier relationship but also to a win-win outcome for everyone.
Conway's Law (1967): Any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure.
Companies with local, distributed, and remote teams have different challenges. Creating and managing remote teams still needs the same design and process as creating local teams, but you also need to think more about the culture you want to promote.
Roadmaps set out what a team should do in the long term, why those objectives are important, and the relative priorities of those goals.
As Eisenhower said, "What is important is seldom urgent and what is urgent is seldom important." Don't get so bogged down in the day-to-day distractions all SREs face that you neglect planning and executing the long-term projects that will have the greatest impact on your organization.
Google's SRE book defines toil as "the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows."
Tools that support code compliance include metric analysis. Integrity analysis tools include Polyspace, CodeSonar, Frama-C, and Facebook Infer.
Set up chaos engineering to test how your services deal with failure and track those results as you would postmortem action items. Load test your systems to understand how they scale and address bottlenecks you will encounter in the next year or two. Take near misses and surprises seriously and track them along with action items.
Incidents: a window into gaps, including tooling gaps, operational expertise gaps, and resource gaps.

hj

Friday, December 31, 2021

Reading notes - 97 Things Every SRE Should Know
