Why partition your data in Kafka?
This post shows how the partitioning strategy used by your producers is driven by what your consumers will do with the data. If you have enough traffic that you need multiple application instances, you'll need to partition your data, and how you partition determines how load is balanced across the downstream application. The producer clients decide which topic partition each record goes to, but it is the behavior of the consumer applications that drives that decision logic. If at all possible, uncorrelated/random partitioning is the best approach to use.
However, you may need to partition the data by an attribute if:
- Consumers of the topic need to aggregate by some attribute of the data.
- Consumers need some kind of ordering guarantee.
- Another resource is a bottleneck, so the data must be sharded.
- You want to concentrate data for the efficiency of storage and/or indexing.
Random partitioning of Kafka data
We use random partitioning for the input topic of our most CPU-intensive application, the matching service. Every matching service instance must know about all registered queries in order to match any event. While the volume of events is large, the number of registered queries is limited, so a single application instance can hold all of them in memory, for now at least.
Random partitioning results in the most even spread of load across consumers, which makes scaling easier. It's ideal for stateless or "embarrassingly parallel" services.
This is what you get when you use the default partitioner without manually specifying a partition or a message key. From version 2.4 onwards, Kafka's default partitioner uses a "sticky" algorithm, which groups all messages in a batch into the same randomly chosen partition.
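As a rough illustration, here is a minimal Java producer sketch. The topic name "events" and the broker address are assumptions; the point is that no key or partition is specified, so partition selection is left entirely to the default partitioner described above.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class RandomPartitioningProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key and no explicit partition: the default partitioner chooses,
            // and from Kafka 2.4 it "sticks" each batch to one random partition,
            // spreading load evenly over time.
            producer.send(new ProducerRecord<>("events", "some event payload"));
        }
    }
}
```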
Partition by aggregate
However, on the topic consumed by the query aggregation service, we must partition by query identifier, since we need all of the events that we are aggregating to end up in the same place.
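A minimal sketch of what keyed production looks like with the Java client follows; the topic name and helper method are hypothetical, but the idea is simply that the query ID becomes the record key, so the default partitioner hashes every event for a given query to the same partition.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class QueryKeyedSend {
    // Sends an event keyed by its query ID so that all events for the same query
    // hash to the same partition. "aggregated-events" is an assumed topic name.
    static void sendForQuery(KafkaProducer<String, String> producer,
                             String queryId, String eventPayload) {
        producer.send(new ProducerRecord<>("aggregated-events", queryId, eventPayload));
    }
}
```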
After deploying the first version of the service, we noticed that the top 1.5 percent of queries accounted for almost 90 percent of the events processed for aggregation. As you can imagine, this resulted in some very hot partitions for the unlucky consumers.
So we split the aggregation service into two stages. The first stage can be partitioned randomly and performs a partial aggregation of the data; the second stage partitions the final results per time window by query ID. This approach condenses the larger streams in the first aggregation stage, making load balancing much easier at the second stage.
Of course, this approach comes with a resource cost. Writing an extra hop to Kafka and splitting the service in two means we spend more on network and service costs.
Planning for resource bottlenecks and storage efficiency
It's critical to consider resource bottlenecks and storage efficiency when deciding on a partitioning scheme.
Resource bottleneck: One of our services relies on several databases that have been split into shards, so those databases are the bottleneck. We partition its topic according to how the databases are sharded, which yields a result similar to the diagram in our partition-by-aggregate example. Each consumer instance only needs to connect to the database shard behind its own partition, so problems with other database shards will not affect that instance or its ability to consume from its partition. And if the application keeps any database-related state in memory, it will be a smaller share. Naturally, this form of partitioning is prone to hotspots.
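If the shard mapping can't be expressed as a simple record key, one option is a custom producer partitioner. The sketch below is an assumption-heavy illustration: the shard count and the shardForKey logic are placeholders for however your databases actually split their data.

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class ShardAlignedPartitioner implements Partitioner {
    private static final int SHARD_COUNT = 8; // assumption: 8 database shards

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        // Map the database shard directly onto a partition so each consumer
        // instance only ever talks to the one shard behind its partition.
        int shard = shardForKey((String) key);
        return shard % cluster.partitionCountForTopic(topic);
    }

    private int shardForKey(String key) {
        // Placeholder: reuse whatever hashing the databases use for sharding.
        return Math.floorMod(key.hashCode(), SHARD_COUNT);
    }

    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}
```

The producer would opt into it via the standard partitioner.class configuration property.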
Storage efficiency: The source topic in our query processing system shares a topic with the service that stores the event data permanently. It reads all of the same data using a separate consumer group. The data on this topic is partitioned by the customer account to which it belongs. For efficiency of storage and access, we concentrate an account's data into as few nodes as possible. While many accounts are small enough to fit on a single node, some accounts must be spread across multiple nodes. If an account grows too large, we have custom logic to spread it across nodes and to shrink the node count back down when needed.
Consumer partition assignment
When a consumer joins or leaves a consumer group, the brokers rebalance the partitions across consumers, meaning Kafka handles load balancing with respect to the number of partitions per application instance for you. This is great, and it's a key feature of how Kafka works with consumer groups. Almost all of our services use consumer groups.
By default, when a rebalance happens, all consumers drop their partitions and are reassigned new ones (known as the "eager" protocol). If your application has state associated with the consumed data, as our aggregator service does, you need to drop that state and start fresh with data from the newly assigned partitions.
To reduce partition shuffling on stateful services, you can use the StickyAssignor. This assignor tries to keep partition numbers assigned to the same instance as long as it remains in the group, while still distributing partitions evenly across members. Because partitions are always revoked at the start of a rebalance, the consumer client code must track whether it has kept, lost, or gained partitions if changes in assignment matter to the application's logic. This is the approach we take for our aggregator service.
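Here is a minimal sketch of a consumer that opts into the StickyAssignor and uses a rebalance listener to track what it kept, lost, or gained. The topic name, group ID, and state-handling hooks are assumptions for illustration.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.StickyAssignor;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class StickyAggregatorConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "aggregator");              // assumed group ID
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Keep partitions on the same instance across rebalances where possible.
        // (Swapping in CooperativeStickyAssignor.class.getName() would enable the
        // incremental "cooperative" protocol discussed below.)
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, StickyAssignor.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("aggregated-events"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // With the eager protocol this is every partition the instance held;
                // compare with the next onPartitionsAssigned call to see what was kept.
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Rebuild or reuse per-partition aggregation state here.
            }
        });
        // ... poll loop elided
    }
}
```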
A couple of other options are worth pointing out. The CooperativeStickyAssignor is available from Kafka 2.4 onwards. Instead of all partitions being revoked at the start of a rebalance, the consumer's rebalance listener is only notified of the partitions actually revoked and assigned as the rebalance proceeds. The tradeoff is that rebalances as a whole can take longer, and we needed to optimise our application to reduce the time a rebalance takes when a partition moves. That's why, for our aggregator service, we stuck with the "eager" protocol used by the StickyAssignor. However, as of Kafka 2.5, you can continue consuming from partitions during a cooperative rebalance, so it may be worth revisiting.
Additionally, if consumers consistently identify themselves as the same member, you may be able to take advantage of static membership, which can avoid triggering a rebalance altogether.
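A sketch of what static membership looks like in configuration, assuming a hypothetical stable instance ID (for example, a pod ordinal); the remaining connection settings are omitted.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class StaticMembershipConfig {
    static Properties consumerProps(String instanceId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "aggregator"); // assumed group ID
        // Stable identity the broker uses to recognise this member across restarts,
        // so a quick bounce does not trigger a rebalance.
        props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, instanceId);
        // Allow enough time for the instance to return before it is evicted.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "60000");
        return props;
    }
}
```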
Instead of using a consumer group, you can assign partitions directly through the consumer client, which does not trigger rebalances. Of course, in that case you must balance the partitions yourself and make sure that all partitions are consumed. We do this in cases where we use Kafka to snapshot state: we directly assign the snapshot-topic partitions that correspond to the input-topic partitions our service reads from.
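A sketch of direct assignment, assuming a hypothetical "state-snapshots" topic whose partitions line up one-to-one with the input-topic partitions an instance already owns.

```java
import java.util.List;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SnapshotReader {
    static void assignMatchingPartitions(KafkaConsumer<String, String> consumer,
                                         List<Integer> inputPartitionsOwned) {
        List<TopicPartition> snapshotPartitions = inputPartitionsOwned.stream()
                .map(p -> new TopicPartition("state-snapshots", p)) // assumed topic name
                .collect(Collectors.toList());
        // assign() bypasses group management: no rebalances, but it is up to us to
        // make sure every partition is covered by exactly one instance.
        consumer.assign(snapshotPartitions);
        consumer.seekToBeginning(snapshotPartitions); // replay the snapshot from the start
    }
}
```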
Conclusion
The shape of your data and the processing your applications do will determine your partitioning strategy. As you scale, you may need to adapt your strategies to handle new volumes and shapes of data. Consider where the resource bottlenecks are in your architecture, and spread load across your data pipelines accordingly. Whether it's CPU, database traffic, or storage, the principle is the same: make the best use of your limited and expensive resources.
About Enteros
Enteros offers a patented database performance management SaaS platform. It proactively identifies root causes of complex business-impacting database scalability and performance issues across a growing number of RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.