AutoGuardian: How Okta solved flaky tests

As companies scale, their products become increasingly complex, making it essential to enforce rigorous testing and ensure stability. Okta is no stranger to that challenge. We run hundreds of thousands of tests on each change in our continuous integration (CI) system to catch issues earlier in the development lifecycle. 

Our operational scale is enormous, and our services receive over 500,000 commits annually. Even so, we’ve faced the issue of test flakiness in our monolithic codebase. Flaky tests drove the first-run pass rate of our mainline commits below 40%. We needed a reliable way to detect when an issue arose, unblock other engineers, and immediately contact the corresponding teams to investigate.

The manual process

Historically, a partial solution to this problem was to have an on-call engineer watch whether commits on our main branch passed or failed tests on our CI. If a test failed on main, we’d have to analyze logs and stack traces to determine a failure’s validity and then consult with the appropriate team. 

Every Monday, the on-call engineer reviewed the past week’s failures, manually compiled a list of Jira tickets, gathered data for the failures, and sent an email to the Engineering teams. However, this approach had obvious flaws.
 

  1. A failure could be missed due to human error.
  2. It was time-consuming: the engineer had to figure out the root cause and then track down who had the knowledge and context to fix the issue.
  3. The on-call engineer had to spend many hours training. Fulfilling this role cost us four months of engineering time each year.
  4. It was neither reliable nor scalable.
  5. It required a lot of collective knowledge. For example, discovering failures involved numerous legacy, hand-stitched SQL queries.

What is an urgent failure?

To make this process reliable, we needed a strict way to determine what constituted an urgent versus a non-urgent break; we’ll refer to an urgent break as a P0. You might wonder why we don’t immediately address every single test that fails on main. That would happen in an ideal environment (which we are now closer to), but it’s no surprise that out of the hundreds of thousands of tests we run, quite a few fail inconsistently (these are said to be “flaky”). It would be unreasonable and impractical to tell engineers from other teams to drop everything and fix hundreds of flaky tests.

Our first attempt was to flag P0s based on a percentage criterion of how often a test failed. However, a percentage is a lagging indicator: if a test broke completely, there would be a significant delay before it crossed the threshold. We needed a way to detect genuine breakages immediately while incrementally resolving flaky tests. After creating a program to run countless simulations, we eventually reached a solution.

Our initial solution: File a ticket for a test method as a P0 if it fails in two commits in the last five runs or if it fails more than 25% of the time in at least 100 runs.
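That rule is simple enough to sketch in a few lines. The Python below is an illustrative implementation only, not AutoGuardian’s actual code; the function and constant names are ours, while the thresholds mirror the criterion above.

```python
RECENT_WINDOW = 5            # look at the last five runs
RECENT_FAILURES = 2          # two failures in that window -> P0
MIN_RUNS = 100               # minimum history for the percentage rule
FAILURE_RATE_THRESHOLD = 0.25

def is_p0(results: list[bool]) -> bool:
    """Decide whether a test method's history warrants a P0 ticket.

    results: chronological run history for one test method,
    where True means that run failed.
    """
    # Rule 1: the test failed in two of the last five runs.
    recent = results[-RECENT_WINDOW:]
    if sum(recent) >= RECENT_FAILURES:
        return True
    # Rule 2: the test failed more than 25% of the time
    # across at least 100 runs.
    if len(results) >= MIN_RUNS:
        failure_rate = sum(results) / len(results)
        if failure_rate > FAILURE_RATE_THRESHOLD:
            return True
    return False
```

The two rules are complementary: the five-run window catches a test that just broke outright, while the percentage rule catches a chronically flaky test even when its most recent runs happened to pass.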

Now that we have clearly defined the problem scope, we can finally automate it.

Introducing AutoGuardian

AutoGuardian is our service that periodically monitors tests and handles failures. The chart below summarizes its responsibilities.

[Chart: a summary of AutoGuardian’s responsibilities]
AutoGuardian’s benefits have been transformative, significantly enhancing our team’s daily workflow. It identifies and reports issues, streamlines communication, and connects to a separate service that prevents our CI from running failing tests for other developers.

Test exclusion is crucial: it ensures that developers aren’t blocked from merging by broken tests and helps us cut costs by avoiding unnecessary runs of tests we already know are faulty. In short, AutoGuardian empowers our team to focus on progress rather than getting mired in troubleshooting, making our development process more efficient and effective.

AutoGuardian is now a critical service that we rely on, delivering estimated annual cost savings of over $1,000,000. And this was only the initial design. We’ve since improved it further by adding issue aggregation that groups issues with similar stack traces, automatic tightening of the P0 criterion, and enhanced data reporting. As a result, we’re now catching over 1900% more flaky tests than before AutoGuardian, boosting our mainline pass rate to over 80%. This transformation streamlines the development process and helps reduce commit runtimes by 50%, allowing our engineers to focus on innovation rather than troubleshooting and fostering a more productive developer experience.
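To give a rough feel for the stack-trace grouping idea, here is a hypothetical sketch (not AutoGuardian’s implementation): normalize away details that vary between otherwise-identical failures, such as line numbers and memory addresses, then use a hash of the normalized trace as the group key.

```python
import hashlib
import re

def normalize(stack_trace: str) -> str:
    # Drop line numbers (Foo.java:42 -> Foo.java) and replace memory
    # addresses with a placeholder, since these vary between
    # otherwise-identical failures.
    trace = re.sub(r":\d+", "", stack_trace)
    trace = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", trace)
    return trace

def group_key(stack_trace: str) -> str:
    # Failures with the same key are aggregated into one issue.
    return hashlib.sha1(normalize(stack_trace).encode()).hexdigest()
```

Two failures that differ only in volatile details (a line number shifted by an unrelated change, a different heap address) then map to the same issue, while genuinely different exceptions stay separate.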

Revolutionize your processes with automation

Most tech companies have some form of an on-call process, and we’ve provided just one example of why you should strive to automate as much as possible. When something has to be done manually, always ask yourself and others, “Why?” Embracing automation saves valuable developer time and enhances the resilience, reliability, and security of your product.

Have questions about this blog post? Reach out to us at [email protected]. Explore more insightful Engineering Blogs from Okta to expand your knowledge.

Ready to join our passionate team of exceptional engineers? Visit our career page.