Friday, December 31, 2021

Reading Notes - 97 Things Every SRE Should Know

This book is a collection of short essays from SRE (Site Reliability Engineering) practitioners at various high-tech companies. In these reading notes, I list the statements I found useful and interesting for my team's day-to-day job.

  1. SRE in six words: measure, analyze, decide, act, reflect and repeat
  2. The reliability stack is made up of three components: SLIs (service level indicators), SLOs (service level objectives), and error budgets.
  3. Here are ways we improve resilience
    1. Load reduction - throttling, load shedding/prioritization, queuing, load balancing
    2. Latency reduction - caching, regional replication
    3. Load adaptation - autoscaling, overprovisioning
    4. Resilience - timeouts, circuit breakers, bulkheads, retries, fail-overs, fallback
    5. Meta-techniques - better tooling, scale up or fail over faster
  4. Observability is for figuring out where in your systems to find the code you need to debug. 
  5. Tools such as OpenTracing might show the full path of a management call as well as a user request, possibly exposing unintended interactions.
  6. Use NIST's CyberSecurity framework of Identify, Protect, Detect, Respond, and Recover as a foundation for your own security journey.
  7. You can't do SRE well without investing in a culture of communication. Writing is good for reliability, the more precise the better. 
  8. Trust that your manager has your back. They support you if things don't go to plan, celebrate if you succeed, suggest ideas if you're stuck, find opportunities if you're bored, and gather help if you're overloaded.
  9. Successful SRE adoption is about so much more than automating your operations. It's cultural.
  10. Success in SRE requires deep emotional understanding, influence, and organizational context to advocate for change and foster a blameless engineering culture.
  11. With incident response, start small. A pretty good starting point is to think in terms of three roles: incident manager, expert/operator, and communications.
  12. By starting out small and getting some quick wins under your belt, you'll be able to demonstrate the positive benefits of SRE through incremental change and reduce daily toil for yourself or other engineers.
  13. You can follow a structured methodological process to pinpoint the problem, avoiding mistakes and cognitive biases: triage, operational definition, making the mental model of the system explicit, iterating on the model, reconstructing and validating, next steps.
  14. At startups, SRE is often an afterthought behind shiny new features. This could be because product or market fit is higher priority, or SLOs aren't clearly aligned with customers' needs.
  15. How to measure SRE return on investment? These can take many forms, from SLO improvements, toil work reduction, and achieving OKRs (objectives and key results), through to client satisfaction surveys.
  16. It's okay not to know, and it's okay to be wrong. One of LinkedIn's core values is "Take intelligent risks". "I don't know, but I will find out and get back to you."
  17. Stop the bleeding: keep the focus unrelentingly on prioritizing mitigation. The first impulse should be to keep the ongoing conversation focused solely on recovering the current situation.
  18. Runbook creation, maintenance, and review should be a whole-team activity. 
  19. Ideally, a playbook should only contain:
    1. Why do I care? severity and qualification of the user-visible impact.
    2. What can I look at? Console, logs, and inspection tools.
    3. What can I do? Mitigation tooling
    4. Whom can I escalate to? Developers, back-end teams, or a dedicated incident response team
  20. The key to reliability is the ability to make the system better (recover) quickly. 
  21. Taking a holistic approach will help reduce the MTTR (mean time to repair). A few minutes of downtime might be overlooked by your users, but a few hours can lead to loss of user trust, bad press coverage, and potential loss of revenue.
  22. It's not that the tools or the processes don't matter, because they do, but often the biggest obstacle to creating a DevOps culture is ourselves and how we work with our teams.
  23. Establishing an SRE mindset and its practices is foundational to the long-term, sustainable success of any SRE team. Culture change is about people: identify culture carriers who are adept at empowering others and building trust.
  24. Treat your security team the same way you wish that one product team treated you.
  25. Separating SRE from dev teams leads to a few problems, including elitism, knowledge constraints, separation from the work, and sponsorship. Work closely with the security team to resolve CVEs (Common Vulnerabilities and Exposures).
  26. We can recognize that heroes do their best work as part of a team, and true heroes don't need a hero culture to do good.
  27. The best source for ideas to improve on call is the on-callers themselves. You can hold frequent retrospectives and reflect on ways you can make on call better. 
  28. Building happy, healthy on-call rotations is a superpower and one that you can gain by taking the time and effort to incentivize people, reduce the pain points, provide mentorship, and iterate rapidly.
  29. Study of human factors and team culture to improve pager fatigue
    1. Technical literacy and hands-on experience
    2. Good communication and collaboration
    3. To build a culture of accountability and ownership, but also to celebrate achievements
    4. Establish an effective feedback loop to ensure that everyone's voice was heard
    5. Demonstrate high levels of empathy, looking proactively at opportunities to support each other.
  30. If your service is meeting its SLOs and SLAs, but your on call is unhealthy, your service won't stay healthy for long. Unhealthy on calls will lead to fatigue, burnout and attrition.
  31. Test your disaster plan. A disaster plan can only keep you safe if it works.
  32. Make your engineering blog a priority. An engineering blog is not only great for recruiting, it is a key component of a strong learning organization.
  33. There's been a tension between SRE and product. The former want to optimize for reliability first and foremost, whereas the latter want to ship, which leads to change, which leads to breakage. If we can build empathy between product and SRE teams, it will not only lead to a healthier relationship, it will also be a stronger win-win outcome for everyone.
  34. Conway's Law (1967): Any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure.
  35. Companies with local, distributed, and remote teams have different challenges. Creating and managing remote teams still needs the same design and process as creating local teams, but also think more about the culture you want to promote.
  36. Roadmaps set out what a team should do in the long term, why those objectives are important, and the relative priorities of those goals. 
  37. As Eisenhower said, "What is important is seldom urgent and what is urgent is seldom important." Don't get so bogged down in the day-to-day distractions all SREs face that you neglect planning and executing the long-term projects that will have the greatest impact on your organization. 
  38. Google's SRE book defines toil as "the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows."
  39. Tools that support code compliance include metric analysis. Integrity analysis tools include Polyspace, CodeSonar, Frama-C, and Facebook Infer.
  40. Set up chaos engineering to test how your services deal with failure and track those results as you would postmortem action items. Load test your systems to understand how they scale and address bottlenecks you will encounter in the next year or two. Take near misses and surprises seriously and track them along with action items.
  41. Incidents: a window into gaps, including tooling gap, operational expertise gap, and resource gaps.
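
Item 2 in the list above names SLOs and error budgets. As a rough illustration of how the two relate, here is a minimal sketch for an availability SLO. The 30-day window, the function names, and the example numbers are assumptions for illustration, not figures from the book.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime (in minutes) an availability SLO tolerates over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative once it is blown)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days leaves roughly 43 minutes of budget.
    print(round(error_budget_minutes(0.999, 30), 1))
    print(round(budget_remaining(0.999, 30, downtime_minutes=21.6), 2))
```

When the remaining fraction approaches zero, that is the usual signal to slow down risky changes and spend time on reliability work.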

Wednesday, December 22, 2021

Respond to "Thank You"

  • You are welcome (Welcome) --> try to use others rather than this standard one. :-)
  • I am happy to help. (I’m always happy to help)
  • No worries.
  • Not a problem (No problem)
  • My pleasure
  • You got it
  • Don't mention it
  • Anytime
  • It was nothing. (It’s nothing)
  • Think nothing of it
  • Sure
  • That’s okay
  • Absolutely
  • Of course

Sunday, November 7, 2021

Interest crediting methods

Indexed annuities and indexed universal life insurance (IUL) come with many variations of crediting methods, so it can be difficult to figure out which is best when purchasing an indexed product. Several crediting rates and crediting methods work together to generate the interest credited to any given contract.

However, no matter which crediting method you choose, it protects your account during a market downturn. If the final result is negative, no indexed interest is credited and your contract value remains unchanged (minus fees and costs).

Annual Point-to-Point
This is the simplest of the crediting methods. It may be a good choice if you want to minimize the effects of mid-year market volatility, because only the beginning and ending index values are compared.

  • On your contract anniversary, the beginning index value is compared to the ending index value.
  • The percentage of change in the index is calculated.
  • If the ending index value is higher than the beginning index value, a participation rate, a cap, or a spread is applied to determine the amount of indexed interest you will receive. 

Monthly sum (Monthly point-to-point)
Monthly sum is the most volatility-sensitive crediting method. It can provide interest in steady "up" markets, but it can be adversely affected by large monthly decreases.

  • Calculate twelve monthly percentage changes in the selected stock market index.
  • Apply the product’s cap rate to each of the twelve monthly percentage changes.
  • Add the twelve monthly capped percentage changes together to determine the annual interest amount to be credited. If the final sum is positive, you'll receive that amount as indexed interest.
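
The monthly sum steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the sample index values, and the 1.5% monthly cap are made up, and real contracts may differ in details.

```python
def monthly_sum_credit(start_value, month_end_values, monthly_cap):
    """Cap each monthly change, add the capped changes, and floor the sum at zero."""
    total = 0.0
    previous = start_value
    for value in month_end_values:
        change = (value - previous) / previous
        # The cap limits gains only; a monthly loss counts in full.
        total += min(change, monthly_cap)
        previous = value
    # A negative sum means no indexed interest is credited.
    return max(total, 0.0)


if __name__ == "__main__":
    # Hypothetical month-end index values for one contract year.
    values = [101, 102, 103, 104, 99, 100, 101, 102, 103, 104, 105, 106]
    print(monthly_sum_credit(100, values, monthly_cap=0.015))
```

The example shows why the method is sensitive to volatility: one large monthly drop is counted in full, while each monthly gain is limited by the cap.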

Monthly average
Monthly average can help reduce volatility by averaging monthly highs and lows over the course of the year. It may be a good choice in turbulent markets.

  • The index values at the end of each month are tracked for one year.
  • At the end of the year, those index values are added together and then divided by 12 to determine the monthly average.
  • The starting index value is subtracted from the monthly average, and the result is divided by the starting index value.
  • If the final result is positive, a participation rate, a cap, or a spread is applied to determine the amount of indexed interest you will receive.
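
As a worked sketch of the monthly average steps above: the function names and sample values are assumptions, and the example applies a participation rate only, although a cap or a spread could be applied at the same point.

```python
def monthly_average_change(start_value, month_end_values):
    """Average the month-end index values, then compare with the starting value."""
    average = sum(month_end_values) / len(month_end_values)
    return (average - start_value) / start_value


def monthly_average_credit(start_value, month_end_values, participation_rate):
    """Credit a share of the averaged change; a negative result credits nothing."""
    change = monthly_average_change(start_value, month_end_values)
    return max(change, 0.0) * participation_rate


if __name__ == "__main__":
    # Hypothetical index that rises by one point each month from 100.
    values = [100 + month for month in range(1, 13)]
    print(monthly_average_change(100, values))
    print(monthly_average_credit(100, values, participation_rate=0.8))
```

Note that the index ends the year 12% higher, but the averaged change is only about half of that, which is the smoothing effect the method is meant to provide.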

Trigger is a conservative crediting method that can still do better than a fixed rate.

A trigger method has a fixed rate; when a certain criterion is met, that fixed rate is used to calculate the interest. For example, if the annual point-to-point index change is positive (whether the increase is 0.5% or 15%), the fixed rate (say 6%) is used to calculate the annual interest amount to be credited.
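
A minimal sketch of the trigger rule described above. The function name is hypothetical, and treating only a strictly positive change as the trigger is an assumption; some products may also trigger on a flat index.

```python
def trigger_credit(start_value, end_value, trigger_rate):
    """Credit the fixed trigger rate whenever the index ends above where it started."""
    return trigger_rate if end_value > start_value else 0.0


if __name__ == "__main__":
    # The size of the gain does not matter, only its sign.
    print(trigger_credit(100, 100.5, trigger_rate=0.06))
    print(trigger_credit(100, 115, trigger_rate=0.06))
    print(trigger_credit(100, 95, trigger_rate=0.06))
```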

Indexed products have certain components (crediting rates) that help determine how much indexed interest you can receive in a given year.

  • Cap: If the return of the index you select exceeds the cap, the cap is used to calculate your interest.
  • Participation Rate: A participation rate determines what percentage of the index increase will be used to calculate your indexed interest.
  • Spread: Spread and margin rate are synonymous terms that refer to the first portion of gain that would not be credited to your account. The indexed interest rate credited is determined by subtracting the spread from the index's gain during a specified period.
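
To compare the three crediting rates side by side, here is a small sketch that applies each one to the same annual point-to-point index return. The function names and the sample rates (8% cap, 50% participation, 3% spread) are made up for illustration.

```python
def apply_cap(index_return, cap):
    """Credit the index return up to the cap; never less than zero."""
    return max(0.0, min(index_return, cap))


def apply_participation(index_return, participation_rate):
    """Credit a fixed share of a positive index return."""
    return max(0.0, index_return) * participation_rate


def apply_spread(index_return, spread):
    """Credit whatever is left after the spread is subtracted; never less than zero."""
    return max(0.0, index_return - spread)


if __name__ == "__main__":
    index_return = 0.12  # hypothetical 12% annual point-to-point gain
    print(apply_cap(index_return, cap=0.08))
    print(apply_participation(index_return, participation_rate=0.5))
    print(apply_spread(index_return, spread=0.03))
```

All three functions return zero for a negative index return, which matches the downside protection described earlier.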

Friday, October 29, 2021

Reading Notes - Think and Grow Rich

This landmark bestseller by Napoleon Hill has sold over 15 million copies. It deserves a second read, and it outlines the thirteen steps to riches.

  1. Desire: The starting point of all achievement
  2. Faith: Visualizing and believing in the attainment of desire
  3. Autosuggestion: The medium for influencing the subconscious mind
  4. Specialized Knowledge: Personal experiences or observations
  5. Imagination: The workshop of the mind
  6. Organized Planning: The crystallization of desire into action
  7. Decision: The mastery of procrastination
  8. Persistence: The sustained effort necessary to induce faith
  9. Power of the master mind: The driving force
  10. The mystery of sex transmutation
  11. The subconscious mind: The connecting link
  12. The brain: A broadcasting and receiving station for thought
  13. The sixth sense: The door to the temple of wisdom

The major attributes of leadership

  1. Unwavering courage
  2. Self-control
  3. A keen sense of justice
  4. Definiteness of decision
  5. Definiteness of plans
  6. The habit of doing more than paid for
  7. A pleasing personality
  8. Sympathy and understanding
  9. Mastery of detail
  10. Willingness to assume full responsibility
  11. Cooperation

The major causes of failure in leadership

  1. Inability to organize details
  2. Unwillingness to render humble service
  3. Expectation of pay for what they know instead of what they do with what they know
  4. Fear of competition from followers
  5. Lack of imagination
  6. Selfishness
  7. Intemperance
  8. Disloyalty
  9. Emphasis of the authority of leadership
  10. Emphasis of title

 QQS formula

  1. Quality of service means the performance of every detail, in connection with your position, in the most efficient manner possible, with the object of greater efficiency always in mind
  2. Quantity of service means the HABIT of giving all the service of which you are capable, at all times, with the purpose of increasing the amount of service as you develop greater skill. 
  3. Spirit of service means the HABIT of agreeable, harmonious conduct that will induce cooperation from associates and fellow employees.

Capital consists not of money alone, but more particularly of highly organized, intelligent groups of people who plan ways and means of using money efficiently for the good of the public, and profitably for themselves.

Capitalistic society guarantees every person the opportunity to provide useful service and to collect riches in proportion to the value of the service.

Procrastination, the opposite of decision, is a common enemy that practically everybody must conquer.

In the story of the Declaration of Independence it will not be difficult to detect at least six of these principles: Desire, Decision, Faith, Persistence, The Master Mind, and Organized Planning.

The major weakness of all educational systems is that they neither teach nor encourage the habit of definite decision.

Those who have cultivated the habit of persistence seem to enjoy insurance against failure. No matter how many times they are defeated, they finally arrive up near the top of the ladder.

Persistence is a state of mind, so it can be cultivated. Like all states of mind, persistence is based upon definite causes including:

  1. Definiteness of purpose
  2. Desire
  3. Self-reliance
  4. Definiteness of plans
  5. Accurate knowledge
  6. Cooperation
  7. Willpower
  8. Habit

Riches do not respond to wishes. They respond only to definite plans, backed by definite desires, through constant persistence.

The Master Mind may be defined as: coordination of knowledge and effort, in a spirit of harmony, between two or more people, for the attainment of a definite purpose.

That power, when successfully used in the pursuit of money, must be mixed with faith. It must be mixed with desire. It must be mixed with persistence. It must be applied through a plan, and that plan must be set into action.

The most powerful of all human emotions is that of sex. Highly sexed people always have a plentiful supply of magnetism. This energy may be communicated to others through the following media:

  1. The handshake
  2. The tone of voice
  3. Posture and carriage of the body 
  4. The vibrations of thought
  5. Body adornment

The emotions of love, sex and romance are sides of the eternal triangle of achievement-building genius. Nature creates geniuses through no other force.

The seven major positive emotions: Desire, Faith, Love, Sex, Enthusiasm, Romance, Hope
The seven major negative emotions: Fear, Jealousy, Hatred, Revenge, Greed, Superstition, Anger

Six ghosts of fear

  1. The fear of poverty
  2. The fear of criticism
  3. The fear of ill health
  4. The fear of loss of love
  5. The fear of old age
  6. The fear of death

Without doubt, the most common weakness of all human beings is the habit of leaving their minds open to the negative influence of other people.

You have absolute control over but one thing, and that is your thoughts.

Saturday, October 23, 2021

Small towns across the US

Redfin selected 10 small towns

Portland, Maine
Sedona, Arizona
Carmel-by-the-Sea, California
Estes Park, Colorado
Hilo, Hawaii
St. Augustine, Florida
Port Townsend, Washington
Nantucket, Massachusetts
Montpelier, Vermont
Saugatuck, Michigan

Sunday, October 17, 2021

5 symptoms to slow down and heal

In MyTown magazine, I read a very helpful article from nervous system consultant GuruNischan. My notes are below.

  1. You are wired but tired - you're exhausted, tired, fatigued, but cannot rest, relax, or get deep sleep on a regular basis.
  2. Anxious ambition - you maintain an active schedule and don't feel "right" without accomplishing something or giving service.
  3. You over-stimulate yourself with substances, workout routines, or over-discipline or lack of discipline.
  4. Something feels different inside you. A strange shift that you can't explain.
  5. Life disaster - everything is falling apart.

It's a four step L.O.V.E. process:

  1. Locate yourself
  2. Observe your inner space
  3. Variance. Vibration. Velocity.
  4. Excavate Excess Energy

Weekend Getaways (Trip)

When I read the magazine "MyTown silicon valley", I came up with the idea of putting getaway options in a blog post, so that I can refer to it in the future. California is such a beautiful state, with lots of places for weekend retreats: beaches, mountains, gardens, wilderness, trails, and much more.

Vision Quest Ranch
400 River Road, Salinas, CA 93908

Monterey Zoo

Listen to the lions and tigers roaring only yards from your canvas walled hotel suite. Each African tent style bungalow is creatively decorated and equipped with comfortable furniture and complete bathroom facilities. 

~1.5 hours drive from San Jose

Great Wolf Lodge Water Park | Northern California
2500 Daniels St, Manteca, CA 95337

It was opened in June 2021. When you’re not splashing in our indoor water park, there is plenty of fun to be had on dry land. Embark on a magical quest at our interactive adventure game MagiQuest, win some prizes at the arcade, and discover new skills with our variety of attractions.

~1.5 hours drive from San Jose

Fort Worth, Texas

Fort Worth Stock Yards
Hotel Drover
97 West (in-house restaurant)
John Wayne Museum
World's largest Honky-tonk
Provender Hall
Sidesaddle Saloon

~3 hours flight from SJC

Mar Vista Farm + Cottages,
35101 S Highway 1, Gualala, CA 95445

Mar Vista Farm + Cottages instills that sense in nearly everyone who visits our magical site on the “secret coast” in Mendocino County, where life slows down so you can forge stronger connections with nature and the ones you love.

Stock up before getting to Gualala; with the well-stocked kitchen, you can definitely cook up some good meals.

~3.5 hours drive from San Jose

Safari West
3115 Porter Creek Rd, Santa Rosa, CA 95404

Discover wildest Africa in the heart of wine country! At Safari West, every day means adventure as we journey out in search of herds of wildebeest, romping rhinos and towering giraffes. From ring-tailed lemurs to the dazzling zebra, nearly 900 animals from over 90 unique species roam through our 400-acre preserve.

~2 hours drive from San Jose

Filoli Historic House & Garden
86 Cañada Rd, Woodside, CA 94062

Connecting our rich history with a vibrant future through beauty, nature and shared stories. Introduce yourself to Filoli, where you can explore 16 acres of formal gardens, step back in time in the historic house museum, and hike through the lush and varied natural communities of the estate. 

~30 minutes drive from San Jose

Mount Hermon Adventures
17 Conference Dr, Felton, CA 95018

For over a decade, we’ve been helping people adventure outdoors, creating lasting memories with loved ones.
We respect and care for our amazing forest, have expert local guides, and draw together like-minded adventurers who share your passions.

~40 minutes drive from San Jose


Bodega Bay
103 Coast Highway 1
Bodega Bay, CA 94923

Sonoma county coast, The lodge at Bodega Bay, Drake's upscale restaurant, Bodega Head Trail, Doran Beach, Sebastopol Cookie Company, The Barlow, Eat Oysters.  Hotel, food and seaside view.

~2.5 hours drive from San Jose



MacArthur Place Hotel & Spa
The Lodge at Bodega Bay
Sonoma Plaza
Cornerstone gardens
Free wine tastings (Adastra wine, Korbel Champagne Cellars, Sonoma Portworks)
Layla (located in MacArthur Place Hotel & Spa)
Drake's (located at The Lodge at Bodega Bay)
Taste of the Himalayas (Samosas, momos, lamb tandoori, naan - Indian food)


Zachari Dunes on Mandalay Beach, Curio Collection by Hilton
2101 Mandalay Beach Rd, Oxnard, CA 93035
Beach town, Oxnard
Ox & Ocean onsite restaurant


More to come



Saturday, September 18, 2021

FedRAMP containerization security

There are several practices for making containers secure enough to pass a FedRAMP audit.

  • Image Hardening
  • CI/CD Pipeline
  • Asset Management and Inventory Reporting
  • Vulnerability Scanning
  • Encryption of data in transit and data at rest
  • Network separation
  • Authentication and authorization
  • Audit logging
  • System backups

Saturday, September 4, 2021

NAT 101

NAT (Network Address Translation) is a way to map multiple local private addresses to a public one before transferring the information. Organizations that want multiple devices to employ a single IP address use NAT, as do most home routers.

NAT traversal has two prerequisites. First, the protocol should be based on UDP. You can do NAT traversal with TCP, but it adds another layer of complexity to an already quite complex problem. Second, you need direct control over the network socket that’s sending and receiving network packets. Direct socket access may be tough depending on your situation. One workaround is to run a local proxy. Your protocol speaks to this proxy, and the proxy does both NAT traversal and relaying of your packets to the peer.

There are two obstacles to having NAT traversal just work: stateful firewalls and NAT devices.

Stateful firewalls have limited memory, meaning that we need periodic communication to keep connections alive. If no packets are seen for a while (a common value for UDP is 30 seconds), the firewall forgets about the session, and we have to start over. To avoid this, we use a timer and must either send packets regularly to reset the timers, or have some out-of-band way of restarting the connection on demand.

For UDP, the rule is very simple: the firewall allows an inbound UDP packet if it previously saw a matching outbound packet. In other words, packets must flow out before packets can flow back in.
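
The UDP rule and the idle timeout described above can be modelled with a toy class. This is a simplified sketch, not how any particular firewall is implemented: the class name, the explicit `now` argument, and the choice that only outbound packets refresh the timer are all assumptions.

```python
class StatefulUdpFirewall:
    """Toy model: inbound UDP is allowed only for recently seen outbound flows."""

    def __init__(self, idle_timeout=30.0):
        self.idle_timeout = idle_timeout
        self._sessions = {}  # (local endpoint, remote endpoint) -> last outbound time

    def outbound(self, local, remote, now):
        """Record an outbound packet; this opens or refreshes the session."""
        self._sessions[(local, remote)] = now

    def allow_inbound(self, remote, local, now):
        """Allow a reply only if a matching outbound packet was seen recently."""
        last_seen = self._sessions.get((local, remote))
        if last_seen is None:
            return False
        if now - last_seen > self.idle_timeout:
            # The firewall has forgotten the session; the peers must start over.
            del self._sessions[(local, remote)]
            return False
        return True


if __name__ == "__main__":
    firewall = StatefulUdpFirewall(idle_timeout=30.0)
    me, peer = ("10.0.0.2", 5000), ("203.0.113.9", 3478)
    print(firewall.allow_inbound(peer, me, now=0))   # nothing sent yet
    firewall.outbound(me, peer, now=0)
    print(firewall.allow_inbound(peer, me, now=10))  # within the timeout
    print(firewall.allow_inbound(peer, me, now=45))  # session expired
```

Sending a small keepalive packet through `outbound` more often than the timeout is what keeps the session alive in this model.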

A NAT device is anything that does any kind of Network Address Translation, i.e. altering the source or destination IP address or port. NATs let us have many devices sharing a single IP address, so despite the global shortage of IPv4 addresses, we can scale the internet further with the addresses at hand. Multiple NATs on a single layer allow for higher availability or capacity, but function the same as a single NAT.

There are 4 types of NATs: "Full Cone", "Restricted Cone", "Port-Restricted Cone" and "Symmetric" NATs based on the matrix of Endpoint-dependent/independent firewall and Endpoint-dependent/independent NAT mapping.
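
The matrix mentioned above can be written as a small lookup table. The string labels are hypothetical, and the mapping reflects my reading of the commonly used cone terminology; combinations that have no traditional name return None.

```python
from typing import Optional

# Key: (NAT mapping behaviour, firewall filtering behaviour).
NAT_TYPES = {
    ("endpoint-independent", "endpoint-independent"): "Full Cone",
    ("endpoint-independent", "address-dependent"): "Restricted Cone",
    ("endpoint-independent", "address-and-port-dependent"): "Port-Restricted Cone",
    ("endpoint-dependent", "address-and-port-dependent"): "Symmetric",
}


def classify_nat(mapping: str, firewall: str) -> Optional[str]:
    """Return the traditional NAT name for a mapping/firewall combination."""
    return NAT_TYPES.get((mapping, firewall))


if __name__ == "__main__":
    print(classify_nat("endpoint-independent", "address-dependent"))
    print(classify_nat("endpoint-dependent", "address-and-port-dependent"))
```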

For details, check out

When talking about NAT or WebRTC, we always need to talk about STUN, TURN, and ICE.

STUN (Session Traversal Utilities for NAT)
That’s fundamentally all that the STUN protocol is: your machine sends a "what’s my endpoint from your point of view?" request to a STUN server, and the server replies with "here’s the ip:port that I saw your UDP packet coming from."

TURN (Traversal Using Relays around NAT)
The idea is that you authenticate yourself to a TURN server on the internet, and it tells you "okay, I’ve allocated ip:port, and will relay packets for you." You tell your peer the TURN ip:port, and we’re back to a completely trivial client/server communication scenario. 

ICE (Interactive Connectivity Establishment)
The protocol specifies a stunningly elegant algorithm for figuring out the best way to get a connection. For instance, if two peers are on the same WiFi network with no firewalls in between, a direct connection requires no extra effort.

In short, ICE finds the best connectivity path, a STUN server is used to get an external network address, and TURN servers are used to relay traffic if a direct (peer-to-peer) connection fails. Every TURN server supports STUN: a TURN server is a STUN server with added relaying functionality built in. TURN servers support authentication parameters, while STUN servers do not.


Saturday, August 28, 2021

Architecture design patterns

I came across this great article and could not agree more with the patterns the author discusses. Whatever names we give them, these patterns are essential for making a complicated distributed system resilient and avoiding cascading failures.

The bulkhead resilience pattern enables developers to design a system with multiple independent subsystems and services, each running in its own private machine or container. The goal is loosely coupled microservices.

Backpressure is a resilience approach that configures individual application systems and services to autonomously push back incoming workloads that exceed their current throughput capacity. This pattern can often help manage throughput naturally, without piling an unfair or unregulated number of requests on any single component.
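
One simple way to get backpressure is a bounded queue that refuses new work when it is full, so the caller has to slow down or retry later. The sketch below uses Python's standard `queue` module; the class name and the limit of two pending jobs are illustrative assumptions.

```python
import queue


class BoundedWorkQueue:
    """Accept work only while there is room; otherwise push back on the caller."""

    def __init__(self, max_pending):
        self._queue = queue.Queue(maxsize=max_pending)

    def submit(self, job):
        """Return True if the job was accepted, False if the caller must back off."""
        try:
            self._queue.put_nowait(job)
            return True
        except queue.Full:
            return False

    def take(self):
        """Remove and return the oldest pending job."""
        return self._queue.get_nowait()


if __name__ == "__main__":
    work = BoundedWorkQueue(max_pending=2)
    print(work.submit("a"), work.submit("b"), work.submit("c"))  # third is rejected
    work.take()
    print(work.submit("c"))  # room again after a job is consumed
```

Rejecting early keeps the pending queue, and therefore the latency, bounded instead of letting an overload build up silently.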

The circuit breaker pattern can stop temporary outages from becoming cascading failures that run rampant across large swaths of the software stack. In the event of an overload, the circuit opens, rejects any new requests, and halts the pending message queue. When workload stress levels and throughput drop back down to an acceptable level, the circuit closes and starts accepting requests again.
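
Here is a minimal sketch of a circuit breaker. It is a common failure-count variant rather than the article's exact design: the circuit opens after a number of consecutive failures and allows a trial call once a reset timeout has passed. The class names, thresholds, and the injectable `clock` are assumptions for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the backend."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; request rejected")
            # Timeout elapsed: let this one call through as a trial.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and forget earlier failures.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller wraps each backend request in `breaker.call(...)` and treats `CircuitOpenError` as a signal to fail fast or serve a fallback instead of waiting on an overloaded dependency.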

Batch processing of records usually builds up abrupt, performance-dampening spikes of stress on services, databases and all other related components. The batch-to-stream pattern forces the batch to submit to load balancers and trigger the appropriate remediating mechanisms when throughput exceeds acceptable rates. This workload throttling technique can safeguard a consistent rate of push and ensure that the load balancer distributes jobs appropriately.

When a component or service fails completely, the graceful degradation pattern helps maintain continuity using a fallback mechanism that allows alternative components to automatically pick up the work.

Sunday, August 1, 2021

Reading Notes - Alone on the Wall

Alex Honnold and David Roberts wrote this book "Alone on the wall" to recount the most astonishing achievements of Honnold's extraordinary life and career, brimming with lessons on living fearlessly, taking risks, and maintaining focus even in the face of extreme danger. 

On June 3, 2017, Alex Honnold became the first person to free solo Yosemite's El Capitan. The following are my reading notes on this book.

  • Free soloing a big wall is all about preparation.
  • He's pushed the most extreme and dangerous form of climbing far beyond the limits of what anyone thought was possible.
  • Free soloing means climbing without a rope, a partner, or any hardware (pitons, nuts, cams, bolts) to attach oneself to the wall.
  • If you fall, you die.
  • George Leigh Mallory's response in 1923 when asked why he wanted to climb Everest: "Because it is there".
  • Free climbing means that the leader uses his protection only to safeguard against a fall.
  • Rock routes climbed free are rated on a scale of difficulty, called the Yosemite Decimal System (YDS) that ranges from 5.1 to 5.15.
  • The hardest climbs in the world are rated 5.15c.
  • Visualizing everything that could happen.
  • What's the point in having an amazing vehicle if you are afraid to drive it?
  • The universe shrinks down to me and the rock. You don't take a single hold for granted.
  • Climbing down is almost always harder than climbing up.
  • - the world's best rock climbing guide info
  • Doubt is the biggest danger in soloing. As soon as you hesitate, you're screwed.
  • Yosemite triple crown: three of the park's biggest walls: Mt Watkins, El Capitan, Half Dome.
  • We will all continue climbing in the ways that we find most inspiring, whether or not we're sponsored, the mountains are calling, and we must go.
  • I could suddenly travel full-time without feeling like I had to come back for something.
  • I considered a potential free solo of El Cap to be the holy grail of climbing.
  • Being uncomfortable was a lot better than being incapable.
  • The season was over, but I was more motivated than ever.
  • I knew it was possible. I knew that I could do it.
  • That moment made me hope that I will have even a fraction of his enthusiasm for climbing when I'm his age, or just his passion for life.
  • It was a good reminder to stay humble and keep asking for help. Each of my friends had something to teach me or remind me on the route.
  • Real soloing requires me to climb exactly as I normally would with a rope; it's climbing that has only one solution.
  • I was fully prepared and knew exactly what to do. It was the time for execution.
  • I was overwhelmed with happiness and gratefulness.

Saturday, July 31, 2021

Cascading Failures

This article lists 6 anti-patterns behind cascading failures. After reading it, I think we do have some of these anti-patterns in our system.

  1. Accepting unbounded numbers of incoming requests
  2. Dangerous client retry behavior
  3. Crashing on bad input — the ‘Query of Death’
  4. Proximity-based failover and the domino effect
  5. Work prompted by failure
  6. Long startup times 

Cascading failures are failures that involve a vicious cycle or a feedback loop. A system in cascading failure won’t self-heal; it’ll only be restored through human intervention. Now let's look at past incidents and outages of ours that were cascading failures.

Incident 1: The client sent unnecessary API calls, polling to list meetings, which overloaded the backend system and then impacted starting and joining meetings.

Setting a limit on the load on each instance of your service is so important. As we can control the client, we should also avoid polling to be kind to the system.

Incident 2: The healthcheck API detected an error after 3 retries and then restarted the pods. The error was due to a network blip, but the consecutive retries had no exponential back-off mechanism.

In this case, turning off the health checks really can help. Health checks, liveness, readiness checks, whichever you're calling them. This applies both in your load balancer systems and your orchestration systems. The way to reduce this feedback loop is to delay replication, because sometimes your failure is transient, or to use a token bucket algorithm, which limits the amount of inflight replication to put a brake on that reinforcing loop and prevent the feedback cycle from running away.
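
A token bucket of the kind mentioned above can be sketched as follows. Here it is imagined as a brake on automated restarts or re-replication; the class name, the explicit `now` argument, and the sample rate are assumptions, not the article's implementation.

```python
class TokenBucket:
    """Allow an action only while tokens remain; tokens refill at a steady rate."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self._tokens = float(capacity)
        self._last = now

    def try_acquire(self, now, cost=1.0):
        """Return True and spend tokens if the action may proceed at time `now`."""
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


if __name__ == "__main__":
    # Hypothetical limit: bursts of 2 restarts, then one restart per second.
    restarts = TokenBucket(rate=1.0, capacity=2)
    print([restarts.try_acquire(now=0.0) for _ in range(3)])
    print(restarts.try_acquire(now=1.0))
```

Once the bucket is empty, further restarts are simply skipped until tokens refill, which keeps a transient failure from triggering a runaway restart loop.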

Incident 3: Listening to a voicemail and hanging up in the middle could crash the recording service. This was a code bug, so it eventually crashed all servers. Unfortunately, these servers were not properly monitored.

Avoid crashing on bad input or user behaviors. If you allow a request to your system to cause a crash, that means that a request to your system can reduce the capacity of your service. Fuzz testing is a practical technique that can be really helpful for detecting unintentional or intentional crashes on inputs.

Incident 4: ElasticSearch was upgraded from 7.1 to 7.9 but had a performance issue: search latency suddenly increased 10x. The slow ES caused latency in the web service, which quickly used up all Tomcat threads because there was no "circuit breaker" in the web service (which is a client in this scenario). After all HTTP threads were used up, the web application failed, so customers could not access the web portal.

Circuit breaker is very good because it's very protective of backend services in overload while still allowing fast retries in the common case of just one or two backends overloaded. Client retries can easily go exponential. That's a really good reason to limit the number of retries. Use exponential backoff, so 100 milliseconds, 200 milliseconds, 400 milliseconds, and so on. Jitter your retries. Jitter is just going to smear the excess load of retries over time.
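
Bounded retries with exponential backoff and jitter could look roughly like this. It is a sketch under assumptions: the function names, the defaults (100 ms base, doubling, 5 retries, 5 s cap), and the choice of "full jitter" are mine for illustration.

```python
import random
import time


def backoff_delays(base=0.1, factor=2.0, max_retries=5, cap=5.0, rng=random.random):
    """Yield one sleep duration per retry, jittered over a growing window."""
    for attempt in range(max_retries):
        window = min(cap, base * factor ** attempt)  # 0.1, 0.2, 0.4, ... up to cap
        yield rng() * window                         # random point in [0, window)


def call_with_retries(operation, **backoff_kwargs):
    """Run `operation`; on failure, wait and retry a bounded number of times."""
    sleep = backoff_kwargs.pop("sleep", time.sleep)
    last_error = None
    # The first attempt runs immediately, followed by at most `max_retries` retries.
    for delay in [0.0, *backoff_delays(**backoff_kwargs)]:
        sleep(delay)
        try:
            return operation()
        except Exception as err:  # a real client would retry only retryable errors
            last_error = err
    raise last_error
```

The hard cap on retries keeps client load from growing without bound, and the random factor spreads retries from many clients over time instead of letting them arrive in synchronized waves.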

Incident 5: The server added a SQL query that didn't use a proper index, which caused a full table scan. The query was very slow and held one DB connection each time it ran. Because connections were not released quickly, DB connection capacity shrank until all DB connections were used up. The health check API then detected DB timeouts after retries and restarted the pods. Restarting did not solve the problem because all DB connections were still in use, so human intervention was needed.

We took the backend DNS out of service, added the DB index, and reintroduced traffic to bring the service back. The other way to do it is typically to use some mechanism to block the client's ability to connect.

Incident 6: We have not met this, but the DR (disaster recovery) design using Proximity-based failover is not indeed an anti-pattern.  We have a geographically distributed service, and we have a setup that lets the system fail over from one failed region to another region.

If you have this pattern, you want to think about imposing maximum capacities, or really overprovisioning to deal with the possible implications of this failure. That is why we reserve 50% capacity for possible DR in terms of registration and concurrent sessions. Current challenges are ongoing calls and RPS when failing over to the DR region.

Friday, July 30, 2021


STIR/SHAKEN is a technology framework designed to reduce fraudulent robocalls and illegal phone number spoofing. STIR stands for Secure Telephony Identity Revisited. SHAKEN stands for Secure Handling of Asserted information using toKENs. Its goal is to prevent fraudsters from scamming consumers and businesses through robocalls and illegal phone number spoofing, while making sure that legitimate calls reach the recipients.

The FCC has adopted rules requiring service providers to deploy a STIR/SHAKEN solution by June 30, 2021.

STIR/SHAKEN is a great tool to help providers restore confidence in the calls they're connecting. It is not a one-stop solution for preventing telecom fraud. In fact, its main use is to verify that a call being made is in fact from the owner of the telephone number.

To verify the ownership of the phone number, there are 3 types of attestation:

  1. A (Full) - The service provider knows the customer and their ownership to use the phone number.
  2. B (Partial) - The service provider knows the customer, but not the source of the phone number.
  3. C (Gateway) - The service provider has originated the call onto the network but can’t authenticate the call source and phone number, e.g., an international gateway.

On the terminating end, calls with attestation A or B are usually considered good calls, but they are not guaranteed to avoid being blocked, because analytics engines and fraud systems consider other factors besides the attestation result.

Sunday, May 23, 2021

Scores for Customer Happiness

The culture is to deliver happiness to customers, but how do we evaluate this? From a support and marketing perspective, there are three major types of customer surveys (CSAT, CES and NPS) that help executives make decisions.

Customer Satisfaction Score (CSAT) is the most straightforward of the customer satisfaction survey methodologies, and it measures customer satisfaction with a business, purchase, or interaction. It's calculated by asking a question, such as "How satisfied were you with your experience?"

A CSAT score of 80% is a good indicator of success, although it will vary by industry. Customer Satisfaction surveys are not designed to give you a comprehensive view of customer perception, but they're very helpful for pinpointing issues, especially if you use CSAT scores to grade different parts of your business.

Customer Effort Score (CES) is a single-item metric that measures how much effort a customer has to exert to get an issue resolved, a request fulfilled, a product purchased/returned or a question answered.

There's no definitive industry standard for customer effort score. However, customer effort score is recorded on a numeric scale, so a higher score would represent a better user experience. For a standard seven-point scale, responses of five or higher would be considered good scores.

Net Promoter Score (NPS) is a widely used market research metric that typically takes the form of a single survey question asking respondents to rate the likelihood that they would recommend a company, product, or a service to a friend or colleague.

NPS measures the loyalty of customers to a company. NPS scores are measured with a single-question survey and reported as a number from -100 to +100; a higher score is desirable. Based on global NPS standards, any score above 0 is considered "good", with 50 and above classified as excellent, and 70 or higher as world class. The NPS survey sorts customers into Promoters (9-10), Passives (7-8), and Detractors (0-6). No company has yet scored an NPS of 100.
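
The score itself is the percentage of Promoters minus the percentage of Detractors, which is why it ranges from -100 to +100. A small sketch of that arithmetic follows; the function name and the sample ratings are made up for illustration.

```python
def net_promoter_score(ratings):
    """Compute NPS from 0-10 survey ratings: % promoters minus % detractors."""
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)   # ratings of 9-10
    detractors = sum(1 for r in ratings if r <= 6)  # ratings of 0-6
    # Passives (7-8) count toward the total but toward neither group.
    return 100.0 * (promoters - detractors) / len(ratings)


sample = [10, 9, 8, 7, 6, 0, 10, 9, 5, 10]  # 5 promoters, 2 passives, 3 detractors
score = net_promoter_score(sample)          # (5 - 3) / 10 * 100 = 20.0
```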

Saturday, May 15, 2021

Security awareness training notes

The following are terms related to security risks from social engineering.

  • Social engineering: the art of manipulating, influencing, or deceiving you into taking some action or divulging confidential information.
  • Phishing: Acquire sensitive information such as usernames and passwords
  • Spear phishing: phishing aimed at a specific target, using social media and personalized messages
  • Vishing: voice phishing, using scam recorded messages
  • Smishing: phish you using text messages
  • Pretexting: the practice of presenting oneself as someone else in order to obtain private information
  • Tailgating: trying to gain unauthorized access to physical locations
  • Ransomware: malicious software that allows a hacker to deny access to all of your files or network until a ransom is paid.
  • Spyware: installed software to spy and collect data
  • Bot: malicious software running in the background, usually causing the system to slow down or crash
  • Malicious app: link/attachment to install bad app on mobile phones

Three things to remember

  • Stop, look, and think before taking action
  • Don't open links or attachment in suspicious emails
  • Don't use public wifi

Email sender authentication

Speaking of authentication in email delivery, we usually talk about SPF, DKIM and DMARC. Putting aside these fancy acronyms, authentication basically proves that the sender is the legitimate sender and that the email was not tampered with in transit.

SPF (Sender Policy Framework) verifies the email is coming from an authorized server. An SPF record is a DNS TXT record specifying which IPs or servers are allowed to send email from that domain.

DKIM (DomainKeys Identified Mail) proves the email has not been changed in transit and the sender owns the DKIM domain. The sender signs each message, and the receiver verifies the signature with a public key that is also published as a DNS TXT record, which builds trust between the sender and the receiver.

DMARC (Domain-based Message Authentication, Reporting, and Conformance) is an added authentication method that uses both SPF and DKIM to verify whether or not an email was actually sent by the owner of the "From" domain. For DMARC to pass, at least one of SPF or DKIM must pass and be aligned with the "From" domain. Gmail and Microsoft apply DMARC in their filtering, following the domain's None, Quarantine, or Reject policy.
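
A DMARC policy is published as a DNS TXT record under the `_dmarc` label of the domain, as a list of `tag=value` pairs. The sketch below parses such a record to read the policy; the parser is a simplified illustration, and the `example.com` record and report address are invented.

```python
def parse_dmarc_record(txt):
    """Split a DMARC TXT record into a dict of tag -> value."""
    tags = {}
    for part in txt.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        tags[name.strip().lower()] = value.strip()
    if tags.get("v") != "DMARC1":
        raise ValueError("not a DMARC record")
    return tags


# Hypothetical record that would live at _dmarc.example.com
record = "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@example.com"
policy = parse_dmarc_record(record)["p"]  # one of: none, quarantine, reject
```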

Friday, April 23, 2021

Pre-tax and after-tax


  • With a W-2 income, you don't have many options, but you can put pre-tax money into 401k, IRA, FSA and HSA (with HDHP). 
  • The middle class might also have a business (a home business, LLC or other small business), which gives the flexibility to deduct expenses from business income.
  • If you have rental property, the rental-related expenses and costs can also be deductible.

After-tax, there are many ways to invest, such as stocks, funds, bonds, and futures, but here are some popular long-term ones:

  • Property - own one or many rental properties to have income, and leverage estate tax exemption
  • Life Insurance - Indexed life insurance is getting popular nowadays; it usually takes 10 years to see positive account cash value
  • Annuity - low yield, but no fee or cost, retirement supplement (former employer 401k can be converted to annuity)
  • 529 plan - for education cost, but not recommended
  • Roth IRA - but have income limit

Owning a house is a good option, especially in California with Prop. 19 introduced in 2021.

Sunday, April 11, 2021

Stock plan recap

Even though the IRS postponed the tax deadline to May 17th this year due to the COVID-19 pandemic, it is time to prepare the 2020 tax return. The employer stock plan is part of the tax return preparation, so I would like to recap the different equity types to refresh my memory. (Tax is once a year; my memory doesn't like this repeat cycle.)

When working on stock plan related tax preparation, you need the following forms:
Employer: W-2, 3922, 3921
Broker: 1099-B, Stock Plan Transactions Supplement
IRS: 1040, 1040 Schedule D, 8949

ESPP - tax on sale date
To qualify for favorable tax treatment under Section 423, you must hold the stock for more than two years from the grant date and more than one year from the date of purchase. However, for the gain or loss on your sale to be taxed as long term, you only need to hold the stock for more than one year after purchase.
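
The two holding-period tests above can be checked with simple date arithmetic. The sketch below is a simplification that only encodes the "more than two years from grant, more than one year from purchase" rule as stated here; the function names and sample dates are hypothetical.

```python
from datetime import date


def add_years(d, years):
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def classify_espp_sale(grant, purchase, sale):
    """Return (qualifying_disposition, long_term) for an ESPP sale."""
    held_over_one_year = sale > add_years(purchase, 1)
    qualifying = sale > add_years(grant, 2) and held_over_one_year
    return qualifying, held_over_one_year


# Held more than a year after purchase, but sold within two years of the grant:
result = classify_espp_sale(date(2019, 1, 1), date(2019, 7, 1), date(2020, 8, 1))
# result == (False, True): long-term gain, but not a qualifying disposition
```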

Non-qualified Stock Option (NQ) - tax on exercise date and sale date
Typically, for non-qualified stock options ordinary income is recognized at the time of exercise and a capital gain or loss is recognized at the time of sale. If you sell your exercised shares within one year of exercise, your capital gain or loss is considered short term. However, if you hold the exercised shares for more than a year before selling them, the gain or loss is considered long term and long-term capital gains may be taxed at a lower rate.

Incentive Stock Option (ISO) - tax on sale date, AMT on exercise date
For ISOs, ordinary income is not recognized until you sell the resulting shares. In general, selling stock in a disqualifying disposition will trigger ordinary (compensation) income. If the shares are held for more than one year from the exercise date and more than two years from the grant date, any gain will typically be taxed as long-term capital gains.

There are no taxes due when you receive an ISO grant, and while you generally do not have to pay taxes when you exercise your ISO to purchase company stock, you may have to include the difference between the stock’s fair market value at exercise and the exercise price in your alternative minimum tax (AMT) computation. The inclusion of this amount in your AMT computation may result in additional ordinary income tax in the year of exercise. If the AMT inclusion does not increase your tax liability, taxes are generally due when you sell your shares from an ISO exercise.

Restricted Stock Units (RSU) - tax on vest date and sale date
Your restricted or performance stock awards are considered ordinary (compensation) income to you and taxable in the year your grant vests and/or shares are delivered to you, unless there is a deferral feature. In the case of a Section 83(b) election, the holding period begins on the date when the stock is awarded, rather than the date when it vests or is released to you.

Below is education material for various equity types:
NQ -

Friday, February 26, 2021

Kubernetes 101

Immutable infrastructure is a practice where servers, once deployed, are never modified.

Containers offer a way to package code, runtime, system tools, system libraries, and configs altogether. This shipment is a lightweight, standalone executable.

Kubernetes provides the ability to run dynamically scaling, containerized applications and to manage them through an API.

Kubernetes has become the standard for running containerized applications in the cloud, with the main Cloud Providers (AWS, Azure, GCE, IBM and Oracle) now offering managed Kubernetes services.

K8s objects

  • Pod. A group of one or more containers.
  • Service. An abstraction that defines a logical set of pods as well as the policy for accessing them.
  • Volume. An abstraction that lets us persist data. (containers are ephemeral, data is deleted when container is deleted)
  • Namespace. A segment of the cluster dedicated to a certain purpose, for example a certain project
  • Node. A virtual host on which containers/pods are running

K8s controllers

  • ReplicaSet (RS). Ensures that the desired number of pods is running.
  • Deployment. Offers declarative updates for pods and RS.
  • StatefulSet. A workload API object that manages stateful applications, such as databases.
  • DaemonSet. Ensures that all or some worker nodes run a copy of a pod.
  • Job. Creates one or more pods, runs a certain task(s) to completion, then deletes the pod(s).
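
What these controllers have in common is a reconcile loop: compare the desired state with what is actually running and act on the difference. The toy function below models that idea for a ReplicaSet-style replica count; it is not Kubernetes code, and the `reconcile` function and pod names are invented.

```python
def reconcile(desired_replicas, running_pods):
    """Return the actions a toy controller would take to reach the desired count."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few pods: create the missing ones (names would be generated later).
        return [("create", None)] * diff
    # Too many pods (or exactly enough): delete any surplus, here the last listed.
    return [("delete", name) for name in running_pods[desired_replicas:]]


actions = reconcile(3, ["web-a"])  # two pods short, so two "create" actions
```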

A Docker container image is an executable image containing everything you need to run your application: application code, libraries, a runtime, environment variables and configuration files. At runtime, a container image becomes a container which runs everything that is packaged into that image.

Key k8s features that make containerized applications scale efficiently:

  • Horizontal scaling. Scale your application as needed from command line or UI.
  • Automated rollouts and rollbacks.
  • Service discovery and load balancing.
  • Storage orchestration.
  • Secret and configuration management.
  • Self-healing.
  • Batch execution.
  • Automatic binpacking.

Kafka 101

When I interview candidates, they usually talk about Kafka, so I ask them about Kafka architecture. More often than not, they cannot answer properly or completely, so here I summarize some key concepts of Kafka.

A Kafka cluster typically consists of multiple brokers. Kafka brokers use ZooKeeper to maintain the cluster state. ZooKeeper also performs Kafka broker leader election.

Producers in Kafka push messages to brokers. Consumers in Kafka consume messages. Because Kafka brokers are stateless, each consumer uses the partition offset to keep track of how many messages it has consumed.

Kafka has four core APIs: Producer API, Consumer API, Streams API, and Connector API.

A Kafka topic is a logical channel to which producers publish messages and from which consumers receive messages. In a Kafka cluster, a topic is identified by its name, which must be unique. There is no limit on the number of topics.

Topics are split into partitions and also replicated across brokers. There is no limit on the number of partitions. Within one partition, messages are stored in sequence, and each message is assigned an incremental id, also called the offset.

Topic replication takes place at the partition level only. For a given partition, only one broker can be the leader; the other brokers hold in-sync replicas.

If we add a key to a message, all messages with the same key are guaranteed to end up in the same partition. With this, Kafka offers a message ordering guarantee per key. Without a key, messages are written to partitions randomly.
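
Key-based partitioning boils down to hashing the key and taking the result modulo the partition count. The sketch below shows the idea with `zlib.crc32` as a stand-in hash (Kafka's Java client uses murmur2); `choose_partition` and the sample key are illustrative names only.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a message key to a partition; equal keys always map to the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Every message keyed "user-42" lands in the same partition, so they stay in order.
first = choose_partition("user-42", 6)
second = choose_partition("user-42", 6)
```

Because the result depends on the partition count, adding partitions to a topic later changes where existing keys are routed.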

A consumer group can have multiple consumer processes/instances running.

Reading notes - What It Takes

 In this book, Blackstone CEO Stephen Schwarzman talked about lessons in the pursuit of excellence.

Thursday, January 28, 2021

SIP main functions

  1. User location - locate the end user geographically
  2. User availability - available, busy, DND presence information
  3. User capability - determine the media being used and parameters associated with the media
  4. Session setup - establish session parameters for both caller and callee
  5. Session management - transfer a call, end a call, or modify session parameters