When we set out to rebuild the engine at the heart of our managed Apache Kafka service, we knew we needed to address the requirements that characterize successful cloud-native platforms. These systems must be multi-tenant from the ground up, scale easily to serve thousands of customers, and be managed largely by data-driven software rather than human operators. They should also provide strong isolation and security across customers with unpredictable workloads, in an environment in which engineers can continue to innovate rapidly.
We presented our Kafka engine redesign last year. Much of what we designed and implemented will apply to other teams building massively distributed cloud systems, such as databases or storage systems. We want to share what we learned with the wider community in the hope that these lessons can benefit those working on similar projects.
Key considerations for the Kafka engine redesign
Our high-level objectives were likely similar to ones that you will have for your own cloud-based systems: improve performance and elasticity, increase cost-efficiency both for ourselves and our customers, and provide a consistent experience across multiple public clouds. We also had the added requirement of staying 100% compatible with current versions of the Kafka protocol.
Our redesigned Kafka engine, called Kora, is an event streaming platform that runs tens of thousands of clusters in 70+ regions across AWS, Google Cloud, and Azure. You may not be operating at this scale immediately, but many of the techniques described below will still be applicable.
Here are five key innovations that we implemented in our new Kora design. If you’d like to go deeper on any of these, we published a white paper on the topic that won Best Industry Paper at the International Conference on Very Large Data Bases (VLDB) 2023.
Using logical ‘cells’ for scalability and isolation
To build systems that are highly available and horizontally scalable, you need an architecture built from scalable and composable building blocks. Concretely, the work done by a scalable system should grow linearly with system size. The original Kafka architecture does not fulfill this criterion, because many aspects of load increase non-linearly with system size.
For instance, as the cluster size increases, the number of connections increases quadratically, since all clients typically need to talk to all the brokers. Similarly, the replication overhead also increases quadratically, since each broker would typically have followers on all other brokers. The end result is that adding brokers causes a disproportionate increase in overhead relative to the additional compute/storage capacity that they bring.
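To make that fan-out concrete, here is a rough back-of-the-envelope model. It is an illustration only, not Kafka's actual connection or replication accounting, and the client count is an assumed input:

```python
# Rough illustration (not Kafka's actual accounting) of all-to-all fan-out.
def flat_cluster_overhead(brokers: int, clients: int) -> dict:
    """Every client talks to every broker, and every broker can host
    follower replicas for partitions led by any other broker."""
    return {
        "client_connections": clients * brokers,       # grows with cluster size
        "replication_links": brokers * (brokers - 1),  # grows quadratically
    }

for n in (8, 16, 32):
    print(n, flat_cluster_overhead(brokers=n, clients=1000))
# Doubling the brokers roughly quadruples the replication links (and, if the
# client population grows with the cluster, total connections grow
# quadratically too), while the added compute/storage capacity only doubles.
```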
A second challenge is ensuring isolation between tenants. In particular, a misbehaving tenant can negatively impact the performance and availability of every other tenant in the cluster. Even with effective limits and throttling, there will likely always be some load patterns that are problematic. And even with well-behaved clients, a node's storage may become degraded. With partitions spread randomly across the cluster, that degradation would affect all tenants and potentially all applications.
We solved these challenges using a logical building block called a cell. We divide the cluster into a set of cells that cross-cut the availability zones. Tenants are isolated to a single cell, meaning the replicas of each partition owned by that tenant are assigned to brokers in that cell. This also implies that replication is isolated to the brokers within that cell. Adding brokers to a cell carries the same problem as before at the cell level, but now we have the option of creating new cells in the cluster without an increase in overhead. Furthermore, this gives us a way to handle noisy tenants. We can move the tenant’s partitions to a quarantine cell.
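The placement idea can be sketched in a few lines of code. This is a hypothetical illustration, with made-up names and a simple least-loaded-cell policy, not Kora's actual placement logic:

```python
# Hypothetical sketch of cell-based placement (illustrative names and policy,
# not Kora's actual code): a tenant is pinned to one cell, and its partition
# replicas live only on that cell's brokers, spread across availability zones.

class Cell:
    def __init__(self, name, brokers):   # brokers: {zone: [broker_id, ...]}
        self.name = name
        self.brokers = brokers
        self.load = 0                     # partitions placed so far, a crude load proxy

def assign_tenant(cells, tenant_cells, tenant):
    """Pin a new tenant to the least-loaded cell; existing tenants stay put."""
    if tenant not in tenant_cells:
        tenant_cells[tenant] = min(cells, key=lambda c: c.load)
    return tenant_cells[tenant]

def place_partition(cell, replication_factor=3):
    """Place one replica per availability zone, using only this cell's brokers."""
    zones = list(cell.brokers)[:replication_factor]
    replicas = [cell.brokers[z][cell.load % len(cell.brokers[z])] for z in zones]
    cell.load += 1
    return replicas

cells = [Cell("cell-1", {"az1": [1, 2], "az2": [3, 4], "az3": [5, 6]}),
         Cell("cell-2", {"az1": [7, 8], "az2": [9, 10], "az3": [11, 12]})]
tenant_cells = {}
cell = assign_tenant(cells, tenant_cells, "tenant-a")
print(cell.name, place_partition(cell))   # replicas never leave the tenant's cell
```

Because replication traffic stays inside a cell, adding a new cell adds capacity without adding cross-cluster overhead, and a noisy tenant can be moved by repointing its cell assignment.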
To gauge the effectiveness of this solution, we set up an experimental 24-broker cluster with six broker cells (see full configuration details in our white paper). When we ran the benchmark, the cluster load—a custom metric we devised for measuring the load on the Kafka cluster—was 53% with cells, compared to 73% without cells.
Balancing storage types to optimize for warm and cold data
A key benefit of cloud is that it offers a variety of storage types with different cost and performance characteristics. We take advantage of these different storage types to provide optimal cost-performance trade-offs in our architecture.
Block storage provides both the durability and the flexibility to control various dimensions of performance, such as IOPS (input/output operations per second) and latency. However, low-latency disks get costly as they grow in size, making them a bad fit for cold data. In contrast, object storage services such as Amazon S3, Microsoft Azure Blob Storage, and Google Cloud Storage are low cost and highly scalable, but have higher latency than block storage. They also get expensive quickly if you need to do lots of small writes.
By tiering our architecture to optimize use of these different storage types, we improved performance and reliability while reducing cost. This stems from the way we separate storage from compute, which we do in two primary ways: using object storage for cold data, and using block storage instead of instance storage for more frequently accessed data.
This tiered architecture allows us to improve elasticity—reassigning partitions becomes a lot easier when only warm data needs to be reassigned. Using EBS volumes instead of instance storage also improves durability as the lifetime of the storage volume is decoupled from the lifetime of the associated virtual machine.
Most importantly, tiering allows us to significantly improve cost and performance. The cost is reduced because object storage is a more affordable and reliable option for storing cold data. And performance improves because once data is tiered, we can put warm data in highly performant storage volumes, which would be prohibitively expensive without tiering.
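A minimal sketch of the tiering decision, assuming a simple age-based cutoff (the real policy weighs more signals than this, and the constant below is an arbitrary assumption), might look like the following:

```python
import time
from dataclasses import dataclass

# Minimal sketch of the tiering idea (hypothetical policy, not Kora's actual
# logic): recent "warm" segments stay on block storage attached to brokers,
# while older "cold" segments are offloaded to object storage.

HOTSET_SECONDS = 6 * 3600   # assumption: keep roughly the last 6 hours warm

@dataclass
class Segment:
    base_offset: int
    last_modified: float     # epoch seconds
    tiered: bool = False     # already copied to object storage?

def should_tier(segment: Segment, now: float) -> bool:
    """Segments older than the hotset are candidates for offload to object storage."""
    return not segment.tiered and (now - segment.last_modified) > HOTSET_SECONDS

def storage_for_read(segment: Segment, now: float) -> str:
    """Serve warm reads from block storage; fall back to object storage for cold data."""
    if segment.tiered and (now - segment.last_modified) > HOTSET_SECONDS:
        return "object-storage"   # cheap and scalable, but higher latency
    return "block-storage"        # low latency, costs more per GB

day_old = Segment(base_offset=0, last_modified=time.time() - 24 * 3600, tiered=True)
print(storage_for_read(day_old, time.time()))   # -> object-storage
```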
Using abstractions to unify the multicloud experience
For any service that plans to operate on multiple clouds, providing a unified, consistent customer experience across clouds is essential, and this is challenging to achieve for several reasons. Cloud services are complex, and even when they adhere to standards there are still variations across clouds and instances. The instance types, instance availability, and even the billing model for similar cloud services can vary in subtle but impactful ways. For example, Azure block storage doesn’t allow for independent configuration of disk throughput/IOPS and thus requires provisioning a large disk to scale up IOPS. In contrast, AWS and GCP allow you to tune these variables independently.
Many SaaS providers punt on this complexity, leaving customers to worry about the configuration details required to achieve consistent performance. This is clearly not ideal, so for Kora we developed ways to abstract away the differences.
We introduced three abstractions that allow customers to distance themselves from the implementation details and focus on higher-level application properties. These abstractions can help to dramatically simplify the service and limit the questions that customers need to answer themselves.
- The logical Kafka cluster is the unit of access control and security. This is the same entity that customers manage, whether in a multi-tenant environment or a dedicated one.
- Confluent Kafka Units (CKUs) are the units of capacity (and hence cost) for Confluent customers. A CKU is expressed in terms of customer visible metrics such as ingress and egress throughput, and some upper limits for request rate, connections, etc.
- Lastly, we abstract away the load on a cluster in a single unified metric called cluster load. This helps customers decide if they want to scale up or scale down their cluster.
With abstractions like these in place, your customers don’t need to worry about low-level implementation details, and you as the service provider can continuously optimize performance and cost under the hood as new hardware and software options become available.
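To make the cluster load abstraction concrete, here is a hypothetical sketch of how such a rolled-up metric could be derived. The inputs, weights, and the blend of average and maximum are illustrative assumptions, not Kora's actual formula:

```python
# Hypothetical sketch of a single "cluster load" number rolled up from
# lower-level utilizations (made-up weights, not Kora's actual formula).

def broker_load(cpu: float, disk_util: float, net_util: float, req_util: float) -> float:
    """Each input is a 0.0-1.0 utilization; weight the scarcer resources higher."""
    weights = {"cpu": 0.4, "disk": 0.2, "net": 0.2, "req": 0.2}
    return (weights["cpu"] * cpu + weights["disk"] * disk_util +
            weights["net"] * net_util + weights["req"] * req_util)

def cluster_load(per_broker_loads: list[float]) -> float:
    """Blend the average with the hottest broker so imbalance stays visible."""
    avg = sum(per_broker_loads) / len(per_broker_loads)
    return 0.5 * avg + 0.5 * max(per_broker_loads)

loads = [broker_load(0.7, 0.5, 0.4, 0.6), broker_load(0.3, 0.2, 0.2, 0.3)]
print(f"cluster load: {cluster_load(loads):.0%}")   # one number a customer can act on
```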
Automating mitigation loops to combat degradation
Failure handling is crucial for reliability. Even in the cloud, failures are inevitable, whether that’s due to cloud-provider outages, software bugs, disk corruption, misconfigurations, or some other cause. These can be complete or partial failures, but in either case they must be addressed quickly to avoid compromising performance or access to the system.
Unfortunately, if you’re operating a cloud platform at scale, detecting and addressing these failures manually is not an option. It would take up far too much operator time and can mean that failures are not addressed quickly enough to maintain service level agreements.
To address this, we built a solution that handles all such cases of infrastructure degradation. Specifically, we built a feedback loop consisting of a degradation detector component that collects metrics from the cluster and uses them to decide whether any component is malfunctioning and whether any action needs to be taken. This allows us to address hundreds of degradations each week without requiring any manual operator engagement.
We implemented several feedback loops that track a broker's performance and take action when needed. When a problem is detected, the broker is marked with a distinct health state, and each state is handled with its own mitigation strategy. Three of these feedback loops address local disk issues, external connectivity issues, and broker degradation, and each follows the same three stages (sketched in code after the list):
- Monitor: A way to track each broker’s performance from an external perspective. We run frequent probes to track this.
- Aggregate: In some cases, we aggregate metrics to ensure that the degradation is noticeable relative to the other brokers.
- React: Kafka-specific mechanisms to either exclude a broker from the replication protocol or to migrate leadership away from it.
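A stripped-down version of such a loop, with hypothetical probes, thresholds, and health states rather than Kora's production logic, might look like this:

```python
import statistics

# Stripped-down sketch of a monitor -> aggregate -> react loop
# (hypothetical thresholds and health states, not Kora's production logic).

HEALTHY, DEGRADED = "healthy", "degraded"

def monitor(probe_results: dict[int, list[float]]) -> dict[int, float]:
    """Monitor: external probes record, e.g., produce latency per broker (ms)."""
    return {broker: statistics.median(samples) for broker, samples in probe_results.items()}

def aggregate(latencies: dict[int, float], factor: float = 3.0) -> dict[int, str]:
    """Aggregate: a broker counts as degraded only relative to its peers."""
    baseline = statistics.median(latencies.values())
    return {b: DEGRADED if lat > factor * baseline else HEALTHY
            for b, lat in latencies.items()}

def react(health: dict[int, str]) -> list[str]:
    """React: demote leadership away from degraded brokers (placeholder action)."""
    return [f"move partition leadership off broker {b}"
            for b, state in health.items() if state == DEGRADED]

probes = {1: [5.0, 6.0, 5.5], 2: [5.2, 4.8, 5.1], 3: [40.0, 55.0, 48.0]}
print(react(aggregate(monitor(probes))))   # broker 3 stands out and is mitigated
```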
Our automation detects and mitigates thousands of partial degradations every month across all three major cloud providers, saving valuable operator time while ensuring minimal impact to customers.
Balancing stateful services for performance and efficiency
Balancing load across servers in any stateful service is a difficult problem, and one that directly impacts the quality of service that customers experience. An uneven distribution of load means customers are limited by the latency and throughput offered by the most loaded server. A stateful service will typically have a set of keys, and you’ll want to distribute those keys so that the overall load is spread evenly across servers and clients receive the maximum performance from the system at the lowest cost.
Kafka, for example, runs brokers that are stateful and balances the assignment of partitions and their replicas to various brokers. The load on those partitions can spike up and down in hard-to-predict ways depending on customer activity. This requires a set of metrics and heuristics to determine how to place partitions on brokers to maximize efficiency and utilization. We achieve this with a balancing service that tracks a set of metrics from multiple brokers and continuously works in the background to reassign partitions.
Rebalancing of assignments needs to be done judiciously. Too-aggressive rebalancing can disrupt performance and increase cost due to the additional work these reassignments create. Too-slow rebalancing can let the system degrade noticeably before fixing the imbalance. We had to experiment with a lot of heuristics to converge on an appropriate level of reactiveness that works for a diverse range of workloads.
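As an illustration of that trade-off, here is a toy version of such a heuristic. The imbalance threshold and the smallest-partition-first choice are arbitrary assumptions for the sketch, not the behavior of our balancing service:

```python
# Toy load-balancing heuristic (arbitrary threshold, not the actual balancing
# service): move partitions off the hottest broker onto the coolest one, but
# only when the imbalance is worth the cost of moving data.

IMBALANCE_THRESHOLD = 0.15   # ignore small skews to avoid churn

def plan_moves(broker_load: dict[int, float],
               partition_load: dict[tuple[int, str], float]) -> list[tuple[str, int, int]]:
    moves = []
    loads = dict(broker_load)
    while True:
        hot = max(loads, key=loads.get)
        cold = min(loads, key=loads.get)
        if loads[hot] - loads[cold] <= IMBALANCE_THRESHOLD:
            return moves                       # balanced enough; stop reassigning
        # pick the smallest partition on the hot broker that narrows the gap
        candidates = {p: l for (b, p), l in partition_load.items() if b == hot}
        partition, load = min(candidates.items(), key=lambda kv: kv[1])
        if load >= loads[hot] - loads[cold]:
            return moves                       # any move would overshoot; stop
        loads[hot] -= load
        loads[cold] += load
        partition_load[(cold, partition)] = partition_load.pop((hot, partition))
        moves.append((partition, hot, cold))

brokers = {1: 0.80, 2: 0.40}
partitions = {(1, "orders-3"): 0.10, (1, "clicks-7"): 0.25, (2, "logs-1"): 0.40}
print(plan_moves(brokers, partitions))   # -> [('orders-3', 1, 2)]
```

Even in this toy form, the threshold and overshoot checks capture the core tension: react too eagerly and you pay for constant data movement; react too slowly and customers feel the skew.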
The impact of effective balancing can be substantial. One of our customers saw an approximately 25% reduction in their load when rebalancing was enabled for them. Similarly, another customer saw a dramatic reduction in latency due to rebalancing.
The benefits of a well-designed cloud-native service
If you’re building cloud-native infrastructure for your organization with either new code or using existing open source software like Kafka, we hope the techniques described in this article will help you to achieve your desired outcomes for performance, availability, and cost-efficiency.
To test Kora’s performance, we did a small-scale experiment on identical hardware comparing Kora and our full cloud platform to open-source Kafka. We found that Kora provides much greater elasticity, with 30x faster scaling; more than 10x higher availability, as compared to the fault rate of our self-managed customers or other cloud services; and significantly lower latency than self-managed Kafka. While Kafka is still the best option for running an open-source data streaming system, Kora is a great choice for those looking for a cloud-native experience.
We’re incredibly proud of the work that went into Kora and the results we have achieved. Cloud-native systems can be highly complex to build and manage, but they have enabled the huge range of modern SaaS applications that power much of today’s business. We hope your own cloud infrastructure projects continue this trajectory of success.
Prince Mahajan is principal engineer at Confluent.