From QAos to Chaos Engineering

2022/10/19
The Cloudcast

Benjamin Wilms (@MrBWilms, co-founder/CEO of @Steadybit) talks about the importance of resilience for SREs, DevOps, and developers through chaos engineering platforms.

**SHOW:** 661

**CLOUD NEWS OF THE WEEK** - http://bit.ly/cloudcast-cnotw

CHECK OUT OUR NEW PODCAST - "CLOUDCAST BASICS"

SHOW SPONSORS:

  • Datadog Synthetic Monitoring: Frontend and Backend Modern Monitoring
  • Ensure frontend issues don’t impair user experience by detecting user-facing issues with API and browser tests with a free 14-day Datadog trial. Listeners of The Cloudcast will also receive a free Datadog T-shirt.
  • Granulate, an Intel company - Autonomous, continuous workload optimization
  • gProfiler from Granulate - Production profiling, made easy
  • CDN77 - Content Delivery Network Optimized for Video
  • 85% of users stop watching a video because of stalling and rebuffering. Rely on CDN77 to deliver a seamless online experience to your audience. Ask for a free trial with no duration or traffic limits.

SHOW NOTES:

  • Steadybit (homepage)
  • Steadybit wants developers involved in chaos engineering before production (TechCrunch)

**Topic 1 -** Benjamin, give everyone a quick introduction.

**Topic 2 -** Let’s start with the concept of chaos engineering. In its simplest form, chaos engineering intentionally and randomly takes down parts of a test or production environment (typically after software has shipped) so that teams, typically SREs/ops/dev, are forced to make the applications more resilient over time. It’s not a matter of if systems will go down; it’s a matter of when. Benjamin, you have a consulting background in this area that ultimately led to founding Steadybit. What were the limitations of this approach?

**Topic 3 -** What you’re talking about is a more proactive approach to downtime. I’ll call this resilience engineering, and it requires a shift in mindset in an organization. How do you get developers on board to embrace the need? Are we asking developers to share responsibility for outages with the SRE organization?

**Topic 4 -** On the surface, the obvious benefit is reduced downtime. That can be hard to quantify in business value. Outages can be measured; a lack of outages is harder to quantify. Does this become an issue in convincing an organization to embrace this methodology?

**Topic 5 -** When you say we are going to move chaos engineering into the CI/CD pipeline, what does that mean? Is this code that is added? Testing simulations that have to be passed? Are the failures of databases or nodes real-time or simulated? What are the common use cases?

FEEDBACK?

  • Email: show at the cloudcast dot net
  • Twitter: @thecloudcastnet