Migrating off legacy Tokio at scale

The engine that drives millions of Okta Workflows executions each day was an early adopter of the async Rust ecosystem. While Tokio is the natural choice for an async runtime today, we launched Workflows before the release of Tokio 1.0, and well before the best practices and patterns for async Rust had really solidified. 

This article details how we migrated the Workflows engine, a codebase of over 100,000 source lines of code, from continuations-based Tokio 0.1 to async / await Tokio 1.0, without service interruptions or pausing feature development.

Continuations-based futures

In the early days, asynchronous execution of Rust futures was expressed by attaching functions, known as continuations, to values that would only become available later. These continuations would be chained together, each executed asynchronously by a runtime such as Tokio, until the chain eventually resolved to the desired value. Before the migration, we had more than a hundred thousand lines of code that looked something like this:
 

[Image: Code before migration]

This was the state of async Rust until async / await syntax was stabilized. For Workflows, being an early adopter meant that we had to ship well before that stabilization, and while this style of code is hard to read and maintain, we pushed it as far as it could go.

It served us well, allowing us to grow Okta Workflows to where it is today. Fast-forward to about a year ago: the cracks had become impossible to ignore, and it was finally time to begin the migration to modern async / await syntax and Tokio 1.0.

Async / await based futures

With Tokio 1.0, asynchronous control flow could be written with the (now long-stable) async / await syntax, which greatly simplifies the code and reduces boilerplate. The example above could be written like this:

 

[Image: Same code with Tokio 1.0]

Even if you don’t have much familiarity with Rust, the readability and simplicity benefits of using async / await are clear. For those who are familiar with Rust: because async / await code can borrow across await points, we no longer needed to pass ownership of the Flow object from one continuation to the next, and so we no longer needed to store it in the FlowError. Carrying the Flow inside the error had been crucial to relaying errors to our customers, and it was a source of many bugs whenever a code path forgot to store the Flow in an error. This would be just one class of bugs that migrating to async / await would help us eliminate.
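
To make the ownership point concrete, here is a minimal, std-only sketch. Everything in it is an assumption made for illustration: the fields of `Flow` and `FlowError`, the `StepError` type, the `step_*` and `run_*` functions, and the tiny `block_on` helper are hypothetical and are not the Workflows Engine's actual code. Synchronous `Result` chaining stands in for the continuation combinators.

```rust
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical stand-in for the engine's Flow object.
#[derive(Debug, PartialEq)]
struct Flow {
    id: u32,
    steps_run: u32,
}

// Continuation style: every step takes ownership of the Flow, so the error
// has to carry it back. The Option models a code path that can forget to.
#[derive(Debug)]
struct FlowError {
    flow: Option<Flow>,
    message: String,
}

fn step_owned(mut flow: Flow, fail: bool) -> Result<Flow, FlowError> {
    if fail {
        return Err(FlowError { flow: Some(flow), message: "step failed".to_string() });
    }
    flow.steps_run += 1;
    Ok(flow)
}

// Each continuation receives the Flow from the previous one and hands it on.
fn run_owned(flow: Flow, fail_second: bool) -> Result<Flow, FlowError> {
    step_owned(flow, false).and_then(|flow| step_owned(flow, fail_second))
}

// async / await style: steps borrow the Flow, so the error stays small and
// the caller still owns the Flow when something goes wrong.
#[derive(Debug)]
struct StepError {
    message: String,
}

async fn step_borrowed(flow: &mut Flow, fail: bool) -> Result<(), StepError> {
    if fail {
        return Err(StepError { message: "step failed".to_string() });
    }
    flow.steps_run += 1;
    Ok(())
}

async fn run_borrowed(flow: &mut Flow, fail_second: bool) -> Result<(), StepError> {
    step_borrowed(&mut *flow, false).await?; // borrow held across an await point
    step_borrowed(flow, fail_second).await
}

// Minimal executor for this sketch only: it busy-polls, which is acceptable
// here because none of the futures above ever actually wait.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

In the owned version, the caller can only get the `Flow` back on failure if every error path remembers to fill in `FlowError::flow`. In the borrowed version, `block_on(run_borrowed(&mut flow, true))` returns an error and `flow` is still in the caller's hands, with its first step recorded.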

The big lift 

While the benefits of migrating to Tokio 1.0 were clear, doing so would be a daunting task. The migration was non-trivial and would require a complete rewrite of the Workflows Engine. A stop-the-world, ground-up rewrite would have been impractical and risky, from both a business and a customer reliability standpoint. We couldn’t pause product development to undertake a full rewrite; indeed, doing so would be a fatal mistake for any software product.

Further complicating matters, around the time we were investigating the rewrite, we had new features coming down the pipeline. For example, we would ship the ability to cancel a Workflow only a few months later, concurrently with this effort to migrate to Tokio 1.0. A rewrite of this magnitude would need to be done bit by bit, as if replacing the wheels on a moving train.

Compatibility layers and feature flags

Okta Workflows uses the hyper HTTP library to talk to the web and make calls to external services such as Google Drive or Amazon S3, all part of what makes Workflows so powerful. This foundational library was one of the first pieces of the Workflows Engine to be ported: we swapped hyper 0.12, which was built on Tokio 0.1, for hyper 0.14, which is built on Tokio 1.0 and async / await. The experience helped inform many of our approaches and architectural decisions during the rest of the migration.

The futures library that underpins most of the Rust async ecosystem has a set of compatibility shims to help migrate from continuations-based futures to async / await based futures. However, these shims are unable to take into account runtime differences. Our continuations-based legacy code could only be executed on the Tokio 0.1 runtime, but this new HTTP-handling code would need to be executed on the Tokio 1.0 runtime, and never the twain shall meet: Tokio 1.0 will hang forever when passed a compatibility future meant to run on a Tokio 0.1 executor.

Therefore, an extra level of indirection was required. In the early days of “asyncification,” when a flow used an HTTP connector card in Workflows, flow processing would initially happen within a Tokio 0.1 context. To make the HTTP request, the Tokio 0.1 code would call into a shim, which would spawn the request on a Tokio 1.0 runtime running in a separate thread and wait for the response to be sent back over a channel. This way we could avoid running futures meant for one runtime on the other. We would later take this approach to its limits, with every Flow function invocation and every call to Redis going through similar shims, as we gradually replaced Tokio 0.1 implementations and libraries with ones compatible with Tokio 1.0.
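
The shape of such a shim can be sketched with nothing but the standard library. This is a simplified illustration under stated assumptions, not our production code: `HttpRequest`, `HttpResponse`, `BridgeHandle`, and `spawn_new_runtime_thread` are hypothetical names, the worker thread handles each job inline where the real one would host a Tokio 1.0 runtime, and the caller blocks on the reply channel where the legacy side would poll it as a future.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical request and response types for the sketch.
struct HttpRequest {
    url: String,
}

struct HttpResponse {
    status: u16,
    body: String,
}

// A job is the request plus a single-use channel for sending the reply back.
type Job = (HttpRequest, mpsc::Sender<HttpResponse>);

// Handle held by the legacy side; cloning it lets many callers share one bridge.
#[derive(Clone)]
struct BridgeHandle {
    jobs: mpsc::Sender<Job>,
}

// Starts the dedicated thread that stands in for "the other runtime".
fn spawn_new_runtime_thread() -> BridgeHandle {
    let (jobs, inbox) = mpsc::channel::<Job>();
    thread::spawn(move || {
        // Assumption: in the setup described above, this thread would own a
        // Tokio 1.0 runtime and spawn each request onto it. Here every job
        // is answered inline with a canned response.
        for (request, reply) in inbox {
            let response = HttpResponse {
                status: 200,
                body: format!("fetched {}", request.url),
            };
            // The caller may have gone away; ignore a closed reply channel.
            let _ = reply.send(response);
        }
    });
    BridgeHandle { jobs }
}

impl BridgeHandle {
    // Called from the legacy side. Returns None if the bridge thread is gone.
    fn call(&self, request: HttpRequest) -> Option<HttpResponse> {
        let (reply_tx, reply_rx) = mpsc::channel();
        self.jobs.send((request, reply_tx)).ok()?;
        reply_rx.recv().ok()
    }
}
```

The key property is that no future ever crosses the boundary: only plain data (the request and the response) moves between the two sides, so neither runtime is asked to drive a future that belongs to the other.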

 

[Image: Process of migration from Tokio 0.1 to Tokio 1.0]

Compatibility shims were unidirectional; calling async / await code from legacy futures code required a different shim than calling legacy futures code from async / await code. 
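
For the opposite direction, a shim has to turn work done on the legacy side into something async / await code can await. The following std-only sketch shows one way that could look; it is an assumption-laden illustration, not the toolkit we built. `LegacyResult`, `call_legacy`, `modern_caller`, and the small `block_on` executor are hypothetical names, and a plain thread stands in for the legacy runtime.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};
use std::thread;

// State shared between the legacy worker and the awaiting side.
struct Shared<T> {
    value: Option<T>,
    waker: Option<Waker>,
}

// A future that completes once the legacy side has produced its value.
struct LegacyResult<T> {
    shared: Arc<Mutex<Shared<T>>>,
}

impl<T> Future for LegacyResult<T> {
    type Output = T;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<T> {
        let mut shared = self.shared.lock().unwrap();
        match shared.value.take() {
            Some(value) => Poll::Ready(value),
            None => {
                // Remember who to wake when the value arrives.
                shared.waker = Some(cx.waker().clone());
                Poll::Pending
            }
        }
    }
}

// Runs `work` on a separate thread (standing in for the legacy runtime) and
// returns a future for its result. If `work` panics, the future never resolves.
fn call_legacy<T, F>(work: F) -> LegacyResult<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let shared = Arc::new(Mutex::new(Shared { value: None, waker: None }));
    let for_worker = Arc::clone(&shared);
    thread::spawn(move || {
        let value = work();
        let mut shared = for_worker.lock().unwrap();
        shared.value = Some(value);
        if let Some(waker) = shared.waker.take() {
            waker.wake();
        }
    });
    LegacyResult { shared }
}

// Example async / await caller that awaits legacy work.
async fn modern_caller() -> u32 {
    let legacy_value = call_legacy(|| -> u32 { 20 + 1 }).await;
    legacy_value * 2
}

// Minimal executor for the sketch: parks the thread until the waker fires.
struct ThreadWaker(thread::Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => thread::park(),
        }
    }
}
```

Compared with a shim that blocks on a channel, this direction needs a waker handshake so the awaiting task can yield instead of tying up a worker thread, which is one reason the two directions end up as different pieces of code.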

To help ensure that customers wouldn’t notice a drop in reliability or a difference in behavior, we made heavy use of feature flags to swap between legacy and Tokio 1.0 implementations of code paths at runtime. If we ever encountered an issue, we could quickly swap back to the prior implementation while we fixed the new code. Compatibility shims provided a useful firebreak between the two worlds until we were ready to rewrite the legacy side in async / await syntax. By running both legacy and Tokio 1.0 paths in parallel, then retiring the legacy code path, we were able to gradually replace the foundation upon which the Workflows Engine was built until there was nothing left of Tokio 0.1.
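
A runtime switch of this kind can be as small as an atomic boolean consulted at the call site. The sketch below is illustrative only: `FeatureFlags`, `http_get`, and the two implementations are hypothetical names, and a real flag system would typically load its values from a configuration service instead of setting them in process.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical flag set; one boolean per migrated code path.
struct FeatureFlags {
    use_tokio1_http: AtomicBool,
}

impl FeatureFlags {
    fn new(use_tokio1_http: bool) -> Self {
        Self { use_tokio1_http: AtomicBool::new(use_tokio1_http) }
    }

    // Flips the flag at runtime, with no restart or redeploy.
    fn set_tokio1_http(&self, enabled: bool) {
        self.use_tokio1_http.store(enabled, Ordering::Relaxed);
    }
}

// Stand-ins for the two implementations of the same operation.
fn http_get_legacy(url: &str) -> String {
    format!("legacy path: {}", url)
}

fn http_get_tokio1(url: &str) -> String {
    format!("tokio1 path: {}", url)
}

// Single entry point; callers never know which implementation served them.
fn http_get(flags: &FeatureFlags, url: &str) -> String {
    if flags.use_tokio1_http.load(Ordering::Relaxed) {
        http_get_tokio1(url)
    } else {
        http_get_legacy(url)
    }
}
```

Because the flag is read on every call, flipping it back takes effect on the next request, which is what makes a quick rollback possible while the new path is being fixed.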

Challenges 

During the process of moving to Tokio 1.0, we built up a small toolkit for quickly writing compatibility shims between the Tokio 0.1 world and the Tokio 1.0 world. This became a crucial tool as we continued to ship new features despite the rewrite effort.

While core parts of the Workflows Engine were undergoing this rewrite, features that shipped in the interim would otherwise have had to be written twice: once for Tokio 0.1, and again for Tokio 1.0. Having a toolkit to create compatibility shims allowed us to write all new code against Tokio 1.0 and bridge it to our existing Tokio 0.1 codebase. Sometimes the shims were not enough; for the aforementioned flow cancellation feature, for example, parts of the implementation had to be written twice to support code paths in both runtimes.

There were some core differences between the legacy Tokio 0.1 and Tokio 1.0 runtimes that we needed to handle. With Tokio 0.1, every future was heap allocated, but with async / await, futures would now be stack-allocated by default. If we were not careful, we would quickly overflow the stack in a worker thread. By increasing the default worker thread stack size and carefully choosing futures to heap allocate, we were able to mitigate this issue.
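
The effect is easy to observe in a small experiment. In the hedged sketch below, all function names are hypothetical and the 16 KiB buffer is an arbitrary stand-in for real per-step state. It compares the size of a future that awaits a large child inline with one that boxes the child first, and shows a worker thread started with a larger stack using `std::thread::Builder` (Tokio's runtime builder has a comparable `thread_stack_size` setting).

```rust
use std::future::Future;
use std::mem::size_of_val;
use std::pin::Pin;
use std::thread;

async fn tick() {}

// Keeps a 16 KiB buffer alive across an await point, so the buffer becomes
// part of this future's state.
async fn big_step() -> usize {
    let scratch = [1u8; 16 * 1024];
    tick().await;
    scratch.iter().map(|b| *b as usize).sum()
}

// Awaiting inline embeds big_step's whole state inside this future.
async fn inline_caller() -> usize {
    big_step().await
}

// Boxing moves big_step's state to the heap; this future keeps only a pointer.
async fn boxed_caller() -> usize {
    let step: Pin<Box<dyn Future<Output = usize> + Send>> = Box::pin(big_step());
    step.await
}

// Returns (inline size, boxed size) in bytes, without polling either future.
fn future_sizes() -> (usize, usize) {
    (size_of_val(&inline_caller()), size_of_val(&boxed_caller()))
}

// Runs `work` on a thread whose stack is `stack_bytes` large.
fn run_on_big_stack<T, F>(stack_bytes: usize, work: F) -> T
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(work)
        .expect("failed to spawn worker thread")
        .join()
        .expect("worker thread panicked")
}
```

Calling `future_sizes()` shows the inline caller is at least as large as the buffer it transitively holds, while the boxed caller is much smaller. Deeply nested inline awaits multiply this effect, which is why selectively boxing the largest futures and giving workers more stack both help.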

With the legacy Tokio 0.1 code path, multiple executors worked in parallel to achieve high throughput. In the initial implementation with Tokio 1.0, we used only a single executor to handle all flow executions. This turned out to be only roughly 15% slower than having multiple executors, which is impressive in itself! We eventually switched back to having multiple Tokio 1.0 runtimes to deliver the level of throughput our customers expect, but we now have a couple of new knobs to tweak in the future to push beyond what was possible with Tokio 0.1 and the legacy code path.

A faster, cleaner codebase

Today at Workflows, we are largely no longer reliant on legacy Tokio 0.1. Our codebase is faster, easier to read, and more maintainable than ever before, giving us more opportunities to optimize Flow execution performance and deliver new features to our customers more quickly.

 

[Image: Graph showing net lines of code over time]

Moving to async / await allowed us to remove over 20,000 lines of code from the Workflows Engine.

We have finally left the long-neglected world of Tokio 0.1 and can now stay up to date with our foundational dependencies, which is crucial to our effort to make Okta the most secure company in the world. And of course, it is a lot more enjoyable for our entire team to work every day on a clean, readable codebase than on one with thousands of lines of boilerplate.

Learn more about how to automate critical IT and security tasks at scale with Okta Workflows.
