Disclaimer: The details in this post have been derived from the official documentation shared online by the Atlassian Engineering Team. All credit for the technical details goes to the Atlassian Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Most companies struggle to migrate a handful of databases without causing downtime. Atlassian routinely migrates about 1,000 databases every day as part of its regular operations, and its users never notice. However, in 2024, they tackled something far more ambitious: migrating 4 million databases with minimal user impact.

The numbers alone are staggering. Atlassian's Jira platform uses a one-database-per-tenant approach, spreading 4 million PostgreSQL databases across 3,000 server instances in 13 AWS regions worldwide. Each database contains everything for a Jira tenant: issues, projects, workflows, and custom fields.

Their goal was to move everything from AWS RDS PostgreSQL to AWS Aurora PostgreSQL. The benefits were compelling:
The constraints were equally challenging. They needed to keep downtime under 3 minutes per tenant, control infrastructure costs during the migration, and complete the entire project within months rather than years.

In this article, we will look at how Atlassian carried out this migration and the challenges they faced.

Migration Strategy

Atlassian's team chose what seemed like a textbook approach for migrating from RDS to Aurora. This process is also known as “conversion”. See the diagram below:

Here's how the process was designed to work:
See the diagram below for a detailed look at the conversion process.

Lastly, they also used feature flags that let them instantly redirect database connections. Instead of waiting for applications to slowly discover the new database endpoints (which could take time), feature flags allowed them to override the tenant database endpoints immediately across all application servers.

The most impressive part was that they managed to keep the actual cutover time to less than 3 minutes, even for their largest instances. This was possible because the heavy lifting (data synchronization) happened ahead of time through the read replica, and the feature flags enabled instant endpoint switching.

The Advantages of Aurora

The key advantages of using Aurora were as follows:
The File Count Limit Problem

Just when everything seemed to be going smoothly, Atlassian hit a wall that nobody saw coming. During their testing phase, AWS support contacted them with concerning news: one of their large test RDS instances had successfully synchronized all its data to Aurora, but the new Aurora cluster had failed to start up properly.

From Atlassian's perspective, everything looked fine in the AWS console. The replica appeared healthy and was still replicating. However, AWS's internal monitoring had detected that the Aurora instance's startup process had timed out and failed, even though the surface indicators suggested everything was working normally.

To understand what went wrong, we need to know how PostgreSQL stores data on disk. In PostgreSQL, every high-level database object gets stored as at least one file on the server's disk:
However, Jira applications have particularly complex database schemas because they need to handle:
All this complexity means that a single Jira database needs approximately 5,000 files on disk to store all its tables, indexes, and sequences. With 4,000 tenants per cluster, this adds up to about 20 million files per Aurora cluster.

When an Aurora instance starts up, it performs various status checks to ensure everything is healthy. One of these checks involves enumerating (counting and cataloging) all the files in the cluster volume. This process is normally quick and invisible to users. However, when there are 20+ million files, the enumeration takes a very long time.

Aurora has internal timeout thresholds for how long startup processes are allowed to run. With Atlassian's massive file counts, the enumeration took longer than Aurora's startup timeout threshold, causing the instance to fail during boot-up.

To handle this problem, the team had only two options to reduce file counts:
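The arithmetic behind the file count problem is easy to reproduce. The Python sketch below uses the article's approximate figures (about 5,000 files per Jira database and 4,000 tenants per cluster). The `file_budget` value is purely hypothetical, since the source does not state the file count that Aurora can safely enumerate at startup.

```python
FILES_PER_TENANT = 5_000     # approximate figure quoted in the article
TENANTS_PER_CLUSTER = 4_000  # tenant density quoted in the article


def files_in_cluster(tenants: int, files_per_tenant: int = FILES_PER_TENANT) -> int:
    """Estimated number of on-disk files for a cluster hosting `tenants` databases."""
    return tenants * files_per_tenant


def max_tenants_for_budget(file_budget: int, files_per_tenant: int = FILES_PER_TENANT) -> int:
    """Largest tenant count whose estimated file total stays within `file_budget`."""
    return file_budget // files_per_tenant


def tenants_to_drain(current_tenants: int, file_budget: int,
                     files_per_tenant: int = FILES_PER_TENANT) -> int:
    """How many tenants must leave the instance before it fits the file budget."""
    allowed = max_tenants_for_budget(file_budget, files_per_tenant)
    return max(0, current_tenants - allowed)


if __name__ == "__main__":
    # Hypothetical safe file count; the real threshold is not given in the source.
    file_budget = 5_000_000
    print("files per cluster:", files_in_cluster(TENANTS_PER_CLUSTER))
    print("tenants to drain:", tenants_to_drain(TENANTS_PER_CLUSTER, file_budget))
```

Under this assumed budget, an instance would have to shed 3,000 of its 4,000 tenants before conversion. The real number depends on the threshold that applied in practice.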
The solution they developed would become known as "draining".

The Draining Solution

Faced with the file count limitation, Atlassian developed an elegant solution that turned the problem into an opportunity. Instead of fighting the constraints, they embraced a new workflow that would improve their infrastructure efficiency. See the diagram below:

Step 1: Drain

"Draining" means moving tenant databases off an RDS instance until it reaches a safe file count threshold that Aurora can handle during startup. Instead of 4,000 tenants per instance (20 million files), they needed to reduce this to a much smaller number.

The draining process used their existing database migration tools, but at a completely different scale. They prioritized moving the smallest tenants with the least usage first because:
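A smallest-first drain plan can be sketched as follows. This is a minimal illustration under stated assumptions, not Atlassian's tooling: the `Tenant` fields, the `usage_score` measure, and the `file_budget` threshold are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tenant:
    name: str
    size_gb: float      # hypothetical: size of the tenant database
    usage_score: float  # hypothetical: any activity measure, lower means less used
    file_count: int     # files backing this tenant's tables, indexes, sequences


def plan_drain(tenants: list[Tenant], file_budget: int) -> list[Tenant]:
    """Choose tenants to move off an instance, smallest and least used first,
    stopping as soon as the files left behind fit within `file_budget`."""
    remaining_files = sum(t.file_count for t in tenants)
    to_move: list[Tenant] = []
    for tenant in sorted(tenants, key=lambda t: (t.size_gb, t.usage_score, t.name)):
        if remaining_files <= file_budget:
            break
        to_move.append(tenant)
        remaining_files -= tenant.file_count
    return to_move


if __name__ == "__main__":
    instance = [
        Tenant("a", 0.2, 1.0, 5_000),
        Tenant("b", 50.0, 90.0, 5_000),
        Tenant("c", 1.0, 5.0, 5_000),
        Tenant("d", 0.5, 2.0, 5_000),
    ]
    plan = plan_drain(instance, file_budget=10_000)
    print([t.name for t in plan])
```

In this toy instance the two smallest tenants are selected and the large, busy tenant stays in place for the conversion.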
Step 2: Convert

Once an RDS instance was drained to safe file count levels, they could proceed with the standard RDS to Aurora conversion process:
With fewer tenants, this conversion process became much more reliable and predictable.

Step 3: Refill

Instead of leaving the new Aurora clusters half-empty, they immediately started using them as destinations for draining other RDS instances. AWS had confirmed that once an RDS instance was successfully converted to Aurora, they could re-populate it with tenant databases up to sensible limits.

This meant they didn't need to permanently reduce their tenant density. They just needed to reduce it temporarily during the conversion process.

The Daisy Chain Approach

This three-step process created what Atlassian called a "daisy chain" of conversions:
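The drain, convert, refill cycle can be modelled with a small simulation. The sketch below is an assumption-laden toy, not a description of Atlassian's scheduler: `safe_for_conversion` and `refill_limit` are hypothetical thresholds, instances are converted strictly one at a time, and drained tenants go to already-converted clusters first and to temporary "parked" capacity only when no cluster has room.

```python
def simulate_daisy_chain(rds_tenants: list[int], safe_for_conversion: int,
                         refill_limit: int) -> tuple[list[int], int, int]:
    """Return (tenants per Aurora cluster, peak parked tenants, tenants still parked)."""
    if safe_for_conversion > refill_limit:
        raise ValueError("safe_for_conversion must not exceed refill_limit")

    aurora: list[int] = []  # tenant counts of clusters converted so far
    parked = 0              # tenants held on temporary extra capacity
    peak_parked = 0

    for tenants in rds_tenants:
        # Step 1: drain the instance down to the safe tenant count.
        excess = max(0, tenants - safe_for_conversion)
        kept = tenants - excess
        for i, current in enumerate(aurora):
            moved = min(refill_limit - current, excess)
            aurora[i] += moved  # Step 3 for earlier clusters: refill them
            excess -= moved
        parked += excess
        peak_parked = max(peak_parked, parked)
        # Step 2: convert the drained instance; it becomes a new destination.
        aurora.append(kept)

    # Finally, move parked tenants back into clusters that still have room.
    for i, current in enumerate(aurora):
        moved = min(refill_limit - current, parked)
        aurora[i] += moved
        parked -= moved
    return aurora, peak_parked, parked


if __name__ == "__main__":
    clusters, peak, left = simulate_daisy_chain([4_000] * 4, 1_000, 4_000)
    print(clusters, peak, left)
```

With four instances of 4,000 tenants each and these assumed thresholds, the peak temporary capacity equals the excess of a single instance (3,000 tenants), and every cluster ends up refilled to 4,000.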
This approach had a huge advantage: minimal infrastructure overhead. Instead of needing to provision massive amounts of temporary infrastructure to hold displaced tenants, they only needed enough extra capacity to accommodate the tenants from one or two instances being converted at any given time. Through careful optimization and tooling improvements, they achieved remarkable scaling:
Conclusion

In the end, Atlassian had accomplished something remarkable in the world of database migrations. The final statistics paint a picture of an operation that succeeded on every measure.

Here’s a quick look at the migration scale:
The final Aurora infrastructure looked dramatically different from where they started, but in all the right ways:
Atlassian's Aurora migration represents more than just a successful database platform upgrade. Their collaboration with AWS throughout this process led to a better understanding of Aurora's operational limits and likely influenced future improvements to the platform.

The project delivered on all its original objectives: significant cost savings through better resource utilization, improved reliability through Aurora's superior SLA, and enhanced performance through auto-scaling capabilities. But perhaps most importantly, they accomplished all of this while maintaining their commitment to minimal user impact.

References: