The Reddit Engineering Team completed one of the most demanding infrastructure migrations in the company’s history. It moved its entire Apache Kafka fleet, comprising over 500 brokers and more than a petabyte of live data, from Amazon EC2 virtual machines onto Kubernetes. The migration was done with zero downtime and without asking a single client application to change how it connected to Kafka.

In this article, we will look at how the migration was broken down, the challenges the engineering team faced, and how they completed it successfully.

Disclaimer: This post is based on publicly shared details from the Reddit Engineering Team. Please comment if you notice any inaccuracies.

The Role of Kafka at Reddit

To put things into perspective, let us first understand what Apache Kafka is. Apache Kafka is an open-source message streaming platform. Applications called producers write messages into Kafka topics, which are split into partitions, and other applications called consumers read those messages out. Kafka sits in the middle and stores those messages reliably, even if the producer and consumer are running at completely different times. A single Kafka server is called a broker, and a collection of brokers working together forms a cluster.

At Reddit, Apache Kafka is not a peripheral tool.
It sits underneath hundreds of business-critical services, processing tens of millions of messages every second. If Kafka went down, large portions of Reddit would break.

Why Reddit Wanted to Move Away from EC2

Before the migration, Reddit managed its Kafka brokers on Amazon EC2 instances using a combination of Terraform, Puppet, and custom scripts. Operators handled upgrades, configuration changes, and machine replacements by running commands directly from their laptops. This worked while the fleet was small, but as it grew, the process became increasingly slow, error-prone, and expensive.

Reddit needed a more scalable and reliable way to operate Kafka. Kubernetes, paired with a tool called Strimzi, offered that path. Kubernetes is an open-source platform for running and managing containerized applications. Instead of manually provisioning and maintaining individual servers, developers describe what should be running, and Kubernetes handles deployment, scaling, and recovery automatically.

Strimzi is a Cloud Native Computing Foundation project built specifically for running Kafka on Kubernetes. It provides a declarative way to manage Kafka clusters: developers describe what they want in a configuration file, and Strimzi handles deployment, upgrades, and maintenance. This promised fewer manual interventions and more predictable operations.

The Four Constraints That Shaped the Migration

Reddit did not jump straight into moving brokers. Before writing a single line of migration code, Reddit identified four hard constraints that ruled out entire categories of approaches. The constraints are as follows:
Phase 1: Taking Control of the Naming Layer

The first phase of the migration did not touch Kafka at all. Reddit introduced a DNS facade, which is a set of DNS records that act as an intermediate layer between client applications and the actual Kafka brokers. DNS is the system that translates human-readable names into the addresses of servers. By creating new, infrastructure-controlled DNS names that initially pointed to the same EC2 brokers, Reddit changed nothing from the perspective of client applications.

Reddit then rolled out these new connection strings across more than 250 services using automated tooling that generated batch pull requests to update configuration files. Once all clients were talking through this DNS layer, Reddit could change where those names pointed, from EC2 to Kubernetes, without modifying any client code.

Phase 2: Making Room for New Brokers

Each Kafka broker is identified by a unique numeric ID. Strimzi assigns broker IDs starting at 0 by default. However, Reddit’s existing EC2 brokers already occupied those low numbers.

To free up that ID space, Reddit doubled the cluster size by adding new EC2 brokers with higher IDs, then terminated the original low-numbered brokers. This shifted all data onto the higher-numbered brokers and opened up IDs 0, 1, 2, and so on for Strimzi-managed brokers to use. See the diagram below:

Phase 3: Running a Mixed Cluster

This was the most technically complex phase. Reddit needed Strimzi brokers running on Kubernetes to join the same cluster as the existing EC2 brokers and communicate with them directly. Strimzi does not support this out of the box, so Reddit created a fork of the Strimzi operator. The changes Reddit made were deliberately small and targeted:
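A precondition for the mixed cluster in Phase 3 is that EC2 and Strimzi brokers never share an ID, which is what the reshuffle in Phase 2 guarantees. The Python sketch below illustrates that ID arithmetic only. The function names are hypothetical, the four-broker fleet is made up, and the choice of a contiguous block of temporary IDs is an assumption, since the source only says the new EC2 brokers had "higher IDs".

```python
from typing import List


def temporary_ec2_ids(original_ids: List[int]) -> List[int]:
    """Pick IDs for the temporary EC2 brokers, all above the current maximum.

    Assumption: a contiguous block right above the highest existing ID,
    with one temporary broker per original broker (the cluster doubles).
    """
    if not original_ids:
        raise ValueError("cluster has no brokers")
    start = max(original_ids) + 1
    return list(range(start, start + len(original_ids)))


def strimzi_default_ids(count: int) -> List[int]:
    """Strimzi numbers brokers from 0 upward by default."""
    return list(range(count))


def assert_disjoint(live_ids: List[int], joining_ids: List[int]) -> None:
    """Raise if a joining broker would reuse the ID of a live broker."""
    clash = sorted(set(live_ids) & set(joining_ids))
    if clash:
        raise ValueError(f"broker IDs already in use: {clash}")


original = [0, 1, 2, 3]                       # hypothetical EC2 fleet
temporary = temporary_ec2_ids(original)       # higher-numbered EC2 brokers
strimzi = strimzi_default_ids(len(original))  # IDs Strimzi would hand out

# Once the originals are terminated, only the temporary brokers are live,
# so Strimzi brokers can join with the low IDs without a clash.
assert_disjoint(temporary, strimzi)
```

Calling `assert_disjoint(original, strimzi)` instead raises a `ValueError`, which mirrors why Strimzi brokers could not simply join the cluster before the low IDs were freed.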
Phase 4: Gradually Shifting Data and Traffic

With both sets of brokers running inside the same cluster, Reddit used Cruise Control to incrementally move partition leadership and replicated data from EC2 brokers to the Kubernetes brokers. Partition leadership determines which broker is responsible for serving reads and writes for a given piece of data. Kafka stores copies of each partition on multiple brokers for redundancy, and the number of copies is called the replication factor. Moving data meant reassigning both the leadership and the replicas to the new set of brokers, one partition at a time.

Reddit monitored this process continuously. Over roughly a week, partition leadership on EC2 declined steadily while leadership on Strimzi climbed in parallel, and network traffic followed the same pattern. At every point, Reddit could pause or reverse the process if something looked wrong. See the dashboard view below:
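The idea of moving replicas and leadership in small, pausable steps can be sketched in Python. This is a simplified illustration and not Cruise Control's algorithm or API: the function names, the round-robin placement, the topic names, and the broker IDs are all assumptions made for the example. The first entry of each replica list is treated as the preferred leader.

```python
from typing import Dict, Iterator, List, Set, Tuple

Partition = Tuple[str, int]              # (topic, partition number)
Assignment = Dict[Partition, List[int]]  # replica list, preferred leader first


def plan_target(current: Assignment, old: Set[int], new: List[int]) -> Assignment:
    """Swap every old broker in each replica list for a new broker."""
    target: Assignment = {}
    for index, partition in enumerate(sorted(current)):
        replicas = current[partition]
        used = {b for b in replicas if b not in old}  # replicas that stay put
        offset = index  # rotate the starting broker so leaders spread out
        moved: List[int] = []
        for broker in replicas:
            if broker not in old:
                moved.append(broker)
                continue
            # Try each new broker at most once, skipping ones already used.
            for _ in range(len(new)):
                candidate = new[offset % len(new)]
                offset += 1
                if candidate not in used:
                    used.add(candidate)
                    moved.append(candidate)
                    break
            else:
                raise ValueError(f"not enough new brokers for {partition}")
        target[partition] = moved
    return target


def batches(current: Assignment, target: Assignment, size: int) -> Iterator[Assignment]:
    """Yield the changed partitions in small groups, one group per step."""
    changed = [p for p in sorted(target) if target[p] != current[p]]
    for start in range(0, len(changed), size):
        yield {p: target[p] for p in changed[start:start + size]}


# Hypothetical cluster: brokers 100-102 are on EC2, brokers 0-2 on Kubernetes.
current = {
    ("events", 0): [100, 101, 102],
    ("events", 1): [101, 102, 100],
    ("votes", 0): [102, 100, 101],
}
target = plan_target(current, old={100, 101, 102}, new=[0, 1, 2])
steps = list(batches(current, target, size=2))
```

Because each step is a self-contained set of reassignments, an operator could stop after any step, inspect the dashboards, and either continue or apply the previous assignment again, which is the pause-or-reverse property the phase relies on.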
Phase 5: Migrating the Control Plane

ZooKeeper had managed Kafka’s metadata throughout the entire broker migration. Reddit made a deliberate choice not to change the control plane until after the data plane was fully stable on Kubernetes. This separation of concerns reduced the risk of compounding failures.

Once all EC2 brokers were terminated and all data and traffic were running on Kubernetes, Reddit executed the migration from ZooKeeper to KRaft. KRaft is Kafka’s built-in metadata management system that eliminates the need for ZooKeeper. See the diagram below:

Since Strimzi and Kafka both provide documented steps for this migration, and because the rest of the system had already settled, this final phase was comparatively straightforward.

Phase 6: Cleaning Up and Handing Off to Standard Strimzi

After both the data plane and the control plane were fully running on Kubernetes, Reddit removed all the configuration overrides that the forked Strimzi operator had introduced. Control of the clusters was handed off to the standard, unmodified Strimzi operator. The EC2 infrastructure was decommissioned.

Conclusion

Reddit’s migration is a good example of how large-scale infrastructure changes do not have to be dramatic, high-risk events. By breaking the work into small, reversible, well-understood steps and by respecting the constraints the system imposed, Reddit moved a petabyte-scale platform to Kubernetes without a single moment of downtime.

Some key lessons from Reddit’s migration journey were as follows: