Azure Chaos Studio | How to build fault tolerant apps

9 min read · Feb 24, 2022

Test the resilience of your apps by introducing faults to simulate real-world outages with Azure Chaos Studio. John Engel-Kemnetz, Senior Program Manager for Azure Chaos Studio, joins Jeremy Chapman to show how you can quickly identify failures in your applications — like additional load, high latency, permission issues, and full on outages — to avoid unnecessary downtime.

With Azure Chaos Studio, we’ve delivered a fully-managed experimentation platform to quickly discover hard-to-find issues. Whether in late-stage development or in production, you can apply controlled chaos to your apps and gather the information needed to resolve issues.

Introduce faults to simulate the effects of a real-world outage
Guard against VM host outages by automatically redirecting traffic to healthy VMs and zones
Run resiliency experiments as part of your automated release process

QUICK LINKS:

01:27 — See how Azure Chaos Studio works

04:32 — Permission model

06:18 — How to prevent real-life outages

08:08 — Choose the right fault

09:18 — Automated release process

10:47 — Wrap up

Link References:

Run resilience architecture reviews to identify resilience gaps using https://aka.ms/thereview

Access our Quickstart tutorials at https://aka.ms/TryChaos

Unfamiliar with Microsoft Mechanics?

We are Microsoft’s official video series for IT. You can watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft.

Subscribe to our YouTube: https://www.youtube.com/c/MicrosoftMechanicsSeries?sub_confirmation=1
Join us on the Microsoft Tech Community: https://techcommunity.microsoft.com/t5/microsoft-mechanics-blog/bg-p/MicrosoftMechanicsBlog
Watch or listen via podcast here: https://microsoftmechanics.libsyn.com/website

Video Transcript:

- Up next, we take a closer look at Azure Chaos Studio, which allows you to test the resilience of your apps by deliberately introducing faults to simulate real-world outages, ultimately giving you the insights you need to make your apps more resilient. I’m joined today by John Engel-Kemnetz from the team responsible for Azure Chaos Studio. Welcome to the show.

- It’s great to be here.

- So I love it, chaos as a service. You know, but it does seem a little bit counterintuitive in terms of introducing kind of controlled chaos as part of your apps and services. But with the cloud and the shift to DevOps-style accountability, testing applications in development and also in production is probably more important than ever. So how does Azure Chaos Studio then help here?

- You’re right, the old days of manual testing and having lots of resources are gone, and people need a way to quickly identify failures in their applications, like load, additional latency, permission issues, and full on outages. So with Azure Chaos Studio we’ve delivered a fully-managed experimentation platform to quickly discover hard-to-find issues, whether your apps are in late-stage development or in production. It gives you a way to apply controlled chaos to your apps and the information needed to learn from any resulting issues, so that you can resolve them.

- So this seems like a really great approach in terms of introducing and planning for resiliency and avoiding any unnecessary downtime. So can you show us though how all of this works then with Chaos Studio, showing us maybe a simple example?

- Sure, so I’m going to start with a simple cloud app that runs across a few different VMs and services in Azure, and it uses a load balancer to distribute traffic across the VMs. If I hit the site in a browser, you’ll see that we’ve identified the host name and now I’m on demo0. And if I refresh, it’s assigning me a different host, demo1. And in Application Insights, I can see my app health, and right now we’re at 100%. So let’s jump into Chaos Studio to make sure that our resources are onboarded. I’ve already onboarded the resources of this application as Targets in Chaos Studio, as you can see here. Let me show you how I add a resource. I’ll just add the final VM in my app. Now I want to add an experiment to see what happens when one or more hosts fail. A chaos experiment is an Azure resource that describes the faults that should be run, and the resources those faults should be run against. To create one, I add the usual subscription and resource group details, give it a name and define the location. And on the next step, we’ll create the experiment itself, which is organized into steps that run in sequence one right after the other. Each step has one or more branches that run in parallel, and subsequent steps only run once the previous step is complete. Even though we are simulating the impacts of a host or service outage, this is real. And when I add a VM shutdown fault, it’s actually taking down the VM. I’m going to give this first step a more descriptive name, and I’ll do the same for the branch. And here’s where I’ll add a fault. A fault is a failure that Chaos Studio can inject into your application. When I expand the dropdown, you can see a list of available fault types. I’ll choose VM shutdown in this case. Next, the parameters allow you to customize the impact of the fault. In this case, the parameter is the duration of the shutdown, which I’ll set to 10 minutes.
And, optionally, I can make an abrupt shutdown, like what would happen if your VM host suddenly lost power. Next, remember the targets I showed earlier? Well, this is where you’d select them for your experiment. I’ll choose all three Windows VMs in this case, because I want to see what would happen if an entire site or zone went down. All of these VMs happen to be running in the same zone. It’s like pulling the power plug from a shared VM host if you’re running Hyper-V or VMware. I’ll hit Add. And when I move to the next screen, it shows me a summary of the experiment, along with the steps and branches. Now in this simple experiment, I only have one step and branch, but I could have defined more. So I’ll create the experiment.
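
The experiment described above (one step, one branch, one VM shutdown fault lasting 10 minutes, aimed at three VMs) can be sketched as the kind of JSON body that Resource Manager accepts. This is a minimal sketch and not the exact template from the demo: the field names, the fault URN, and the `selectorId` casing follow the public Microsoft.Chaos experiment schema as best recalled and should be checked against the API version you use. The subscription ID, resource group, selector name, step and branch names, and the third VM name (demo2) are placeholders invented for illustration.

```python
import json


def build_vm_shutdown_experiment(vm_resource_ids, location, duration="PT10M", abrupt=True):
    """Return a dict shaped like a Microsoft.Chaos/experiments resource body.

    Assumption: field names mirror the public experiment schema; verify them
    against the API version you deploy with.
    """
    targets = [
        {
            "type": "ChaosTarget",
            # An onboarded VM exposes a child target resource under Microsoft.Chaos.
            "id": f"{vm_id}/providers/Microsoft.Chaos/targets/Microsoft-VirtualMachine",
        }
        for vm_id in vm_resource_ids
    ]
    fault = {
        "type": "continuous",
        "name": "urn:csci:microsoft:virtualMachine:shutdown/1.0",
        "duration": duration,  # ISO 8601 duration; PT10M means 10 minutes
        "parameters": [{"key": "abruptShutdown", "value": str(abrupt).lower()}],
        "selectorId": "AllDemoVMs",  # hypothetical selector name
    }
    return {
        "location": location,
        # The experiment gets its own identity, which later needs role assignments.
        "identity": {"type": "SystemAssigned"},
        "properties": {
            "selectors": [{"type": "List", "id": "AllDemoVMs", "targets": targets}],
            # Steps run in sequence; branches inside a step run in parallel.
            "steps": [
                {
                    "name": "Take down the zone",
                    "branches": [{"name": "Shut down all VMs", "actions": [fault]}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder resource IDs; replace with the IDs of your own onboarded VMs.
    prefix = ("/subscriptions/00000000-0000-0000-0000-000000000000"
              "/resourceGroups/demo-rg/providers/Microsoft.Compute/virtualMachines/")
    vms = [prefix + name for name in ("demo0", "demo1", "demo2")]
    print(json.dumps(build_vm_shutdown_experiment(vms, "westus2"), indent=2))
```

Adding a second entry to `steps` would model a follow-on step that starts only after the shutdown step completes, and adding a second entry to `branches` would run another fault in parallel with it.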

- So now your experiment’s ready, but it looked like it was pretty easy in terms of taking some components down in real life. So you probably don’t want just anyone maybe accidentally or maliciously being able to do this.

- Yeah, you don’t want just anyone wielding this power. So Chaos Studio has a robust permission model that helps you to avoid this kind of thing. In fact, when you create an experiment, it creates its own identity that must be given permission to each target resource. So now I need to go to each virtual machine and give the experiment the permission it needs to run. So in the VM properties, under Access Control, I’ll add a role assignment. And here I need to apply the Virtual Machine Contributor role to my experiment. So once I choose that role, under Select Members, I can select my chaos experiment, and I’ll go through a few confirmations to create it. And I’ll do that for each VM. Then once that’s complete, I can run my experiment. So back in Chaos Studio in my experiments, I’ll choose the one I want to run and start it. Now I just need to confirm once more, and it will run. And I can check the status by going into Details. I’ll expand the branch and go into the fault and expand Running Targets. And if I go back to my app and hit F5 to refresh, you’ll see the connection to the server is no longer available, because all three of the VMs are shut down. So now, I’ll go to app insights, and we can see right here that our availability just went from 100% to 0%.
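
The per-VM role assignment shown in the portal can also be scripted. The sketch below only builds the Azure CLI commands rather than running them, one `az role assignment create` call per VM, granting the Virtual Machine Contributor role to the experiment's identity. The helper name, the principal ID, and the resource IDs are hypothetical placeholders; in practice the principal ID would be read from the experiment resource after it is created.

```python
import shlex


def role_assignment_commands(experiment_principal_id, vm_resource_ids,
                             role="Virtual Machine Contributor"):
    """Build one `az role assignment create` command per target VM.

    Hypothetical helper: it returns argument lists and does not call Azure.
    """
    commands = []
    for vm_id in vm_resource_ids:
        commands.append([
            "az", "role", "assignment", "create",
            "--assignee-object-id", experiment_principal_id,
            # The experiment's managed identity is a service principal.
            "--assignee-principal-type", "ServicePrincipal",
            "--role", role,
            "--scope", vm_id,  # scope the permission to this one VM
        ])
    return commands


if __name__ == "__main__":
    # Placeholder values for illustration only.
    principal = "11111111-1111-1111-1111-111111111111"
    prefix = ("/subscriptions/00000000-0000-0000-0000-000000000000"
              "/resourceGroups/demo-rg/providers/Microsoft.Compute/virtualMachines/")
    for cmd in role_assignment_commands(principal, [prefix + n for n in ("demo0", "demo1", "demo2")]):
        print(shlex.join(cmd))
```

Scoping each assignment to a single VM, rather than to the resource group, keeps the experiment's identity limited to the resources it is meant to disrupt.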

- All right, so now this outage actually means that customers aren’t going to be able to hit your site. So now you’ve kind of got the cause and effect of running this experiment. So how do you prevent this then from happening in real life?

- Right, that outage wouldn’t be acceptable for any sort of mission-critical service. So what we need to do next is isolate the failure and come up with a plan so that it won’t impact us when we’re in production. So we’ll start with a little investigation. In Application Insights, I can see the availability dropped to zero as the VMs shut down. And in network traffic, we can see the load balancer did its job to try to direct traffic to healthy nodes. But as you can see, data path availability for all VMs dropped to zero during the experiment. So there was nowhere left for the load balancer to direct traffic. This is where you might want to have your app and VMs in the region use availability zones, so that if a zone loses power, or if networking or cooling goes down, traffic is automatically redirected to healthy VMs and zones. This way, your customers won’t feel it if a zone goes down. So I’ve modified the app and now its VMs are spread across three availability zones in Azure, within a single region. And to save time, I’ve already reconfigured the experiment with the same step and branch against my now zone-aware app to take down just one zone, and I’ll run it again. And if I go back to my dashboard, you’ll see the data path availability for the VM in zone one went down to zero, but the VMs in the other zones are still running. And in app insights, the availability of the app is staying at 100%, even though one of the zones is down.
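
The difference between the two runs can be illustrated with a toy model. This is only a sketch under a simplifying assumption: the load balancer sends traffic to any healthy VM, so the app counts as available as long as at least one VM sits outside the failed zone. The function names and the third VM name (demo2) are made up for illustration, and real availability numbers depend on health probes, capacity, and timing that this model ignores.

```python
def healthy_backends(vm_zones, failed_zones):
    """Return the VMs that are not in a failed zone, in sorted order."""
    return sorted(vm for vm, zone in vm_zones.items() if zone not in failed_zones)


def app_availability(vm_zones, failed_zones):
    """Toy model: 1.0 if the load balancer has any healthy VM left, else 0.0."""
    return 1.0 if healthy_backends(vm_zones, failed_zones) else 0.0


# First experiment: every VM lives in the same zone, and that zone goes down.
single_zone = {"demo0": "1", "demo1": "1", "demo2": "1"}

# Second experiment: the VMs are spread across three zones, and zone 1 goes down.
zone_aware = {"demo0": "1", "demo1": "2", "demo2": "3"}

if __name__ == "__main__":
    print(app_availability(single_zone, {"1"}))
    print(app_availability(zone_aware, {"1"}), healthy_backends(zone_aware, {"1"}))
```

The model makes the point of the experiment explicit: the load balancer worked in both runs, but only the zone-aware layout left it a healthy backend to route to.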

- This is great because now you’ve basically taken all the learnings from that experiment, applied them to make sure that your VMs are available. So we saw the fault library earlier. So how would I know though, which fault to choose as I start to create experiments?

- That’s right, every application’s failure scenarios are unique and you use the fault library to replicate those failure scenarios. Failure scenarios might be network connectivity issues, noisy neighbors, or even live site tooling gaps. So if your apps have quantifiable, common failure patterns, that’s a good place to start. We recommend looking to a few sources to determine where you might have resilience gaps that Chaos could expose. For example, looking at trends in past live site incidents to find common causes for failures, doing resilience architecture reviews to identify resilience gaps using aka.ms/thereview, or testing against documented reliability recommendations in the Azure Well-Architected Framework. And our faults in our fault library are pretty broad, as you can see, so they can address a lot of the common issues that might be revealed.

- And the information that you learn in this case from this app can be applied to other apps. So as kind of a long time deployment automation person, can I use the verification validation steps that you’ve done in Chaos Studio as part of a normal release management process for maybe a new app or an app update?

- Absolutely, everything we’ve done today can be done via API calls. And these should look pretty familiar if you’re using the Resource Manager APIs now. From any experiment in Chaos Studio, we can export the Resource Manager template, as you can see here, and you’ll use this JSON file as part of your deployment automation in GitHub, Azure DevOps, or whatever DevOps tooling you use. Here is an example with the GitHub workflow and deployment pipeline YAML file. In the initial steps, we are standing up all the required services. Then it is deploying the chaos experiment. After that, it triggers the experiment. And if the app remains healthy during and after the experiment, the workflow succeeds and goes into production. And importantly, the process of testing doesn’t stop with app or update deployment. In fact, you can use Chaos Studio to perform ongoing production validation, like this example Logic App that runs every eight hours on a schedule to continually test services for additional peace of mind, and to ensure that you continue to hit your required SLAs.
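
The gate at the end of that workflow (trigger the experiment, watch app health, pass only if the app stayed healthy) can be sketched as a small polling loop. This is not the workflow from the video. The three callables are hypothetical hooks that a pipeline would wire to the Chaos Studio start and status APIs and to an availability query, and the status strings and the 99% threshold are assumptions to adjust for your own setup.

```python
import time

# Assumed terminal status names; check them against the status API you call.
TERMINAL_STATUSES = ("Success", "Failed", "Cancelled")


def run_chaos_gate(start_experiment, get_status, get_availability,
                   min_availability=0.99, poll_seconds=30, max_polls=60,
                   sleep=time.sleep):
    """Start an experiment and return True only if the app stayed healthy.

    start_experiment, get_status and get_availability are hypothetical hooks
    supplied by the caller; this function makes no Azure calls itself.
    """
    start_experiment()
    lowest = 1.0
    for _ in range(max_polls):
        # Track the worst availability seen while the fault is active.
        lowest = min(lowest, get_availability())
        status = get_status()
        if status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)
    else:
        return False  # the experiment never finished within the polling budget
    # Sample once more after the experiment ends, then apply the threshold.
    lowest = min(lowest, get_availability())
    return status == "Success" and lowest >= min_availability
```

A pipeline step would exit with a nonzero code when the function returns False, which blocks promotion to production. The same function could run on a schedule for ongoing production validation.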

- That’s cool, it’s really going to help, I think, validate any changes to your apps from user demands maybe, and also traffic against them. So now what’s the team who’s building Azure Chaos Studio working on next?

- Right, so as we showed, there’s a couple of faults available in our fault library today, but that list is going to continue to expand to meet those different failure scenarios. We’re also looking to get as prescriptive as possible with what to test for as well as break/fix recommendations, based on the observations from your experiments.

- For anyone who’s watching right now and wants to try out Chaos Studio for themselves, what do you recommend?

- So the service is currently in public preview, and you can start trying it by onboarding resources and running experiments against them. And you can access our QuickStart tutorials at aka.ms/TryChaos.

- Thanks, John, for joining us today and also giving us a healthy dose of chaos. Of course, keep watching Microsoft Mechanics for the latest in tech updates. Subscribe to our channel if you haven’t already. And thank you for watching.