sunshine

I thought the shade was meant for me, to keep me unseen, unheard. Unloved. but it was only the shadows of his trees, who give him a safe harbor from the sun; I had cried a thousand tears in darkness…

Smartphone

独家优惠奖金 100% 高达 1 BTC + 180 免费旋转




Stateful Stream Processing Frameworks

Lately we have been adding some more complex event processing functionality, such as duplicate message detection. Specifically, we are receiving messages from several sources. Those sources may or may not report the same message. When two sources report the same message within 15 minutes, we want to flag the second message as duplicate so we can skip it in some further downstream processing.

Such functionality can be implemented in Akka Streams relatively easily. We build up a small history of the (hashes) of messages we have seen together with a timestamp and compare incoming messages to the history.

This works very well, however, we want our duplicate detection to also work after restarting the application. If we don’t take measures to persist and restore the history somehow, then our application will start with an empty history and will fail to detect some duplicates for up to 15 minutes.

In general, stream processing applications have the need to keep track of some state, persist and restore it. Akka Streams was not built with such high-level functionality in mind. There are several other stream processing frameworks out there that support state management first-class:

The monthly Codestar R&D days offered us the perfect chance to look around for the best way to tackle our stream processing requirements and learn about other frameworks. We have been experimenting with each of the three libraries a bit to see which one best meets our needs. Although we had limited time, here’s some of our findings:

Apache Samza

Kafka Streams

Flink

Samza and Kafka Streams are very similar, apparently sharing some ancestry at LinkedIn. The APIs are very similar, but Kafka Streams is much more powerful. One important limiting factor for us for Kafka Streams is the integration with external APIs. For example, calling remote machine learning systems or spatial database queries. Samza with its more elaborate ‘as-a-table’ support has better support for that.

Flink is more powerful but requires a different deployment model for scalability, which is too different from our current deployment model. We also found that the API is less functional than we would like it to be.

In the end we did not get our use case completely working with Samza due to time limitations.

For us the next steps are:

To be continued!

Add a comment

Related posts:

Lower Back Pain and Treatment

Have you ever had lower back pain? If so, what was your experience? Not pleasant at all, right? Well let’s talk about lower back pain. Almost every one of us would agree that the back pain was…

Do you believe in yourself?

This is going to be my first series and deal with positivity and false belief systems. The first part, which is this will deal with believing in yourself as the first step to living a positive life…

NavPay Cold Staking Spending Upgrade

The upgrade to enable spending of cold staking balances where NavPay was used as the spending key has been successfully deployed to NavPay’s production servers. The upgrade also requires a wallet…