Imagine you are building a message-driven system for image recognition, video processing, or big data analysis. How would you share large files (gigabytes of data) between services? Should you send them as messages through a messaging platform (e.g., Kafka, RabbitMQ, Azure Service Bus, or Redpanda)? There are good reasons not to.
Messaging Platforms Are Designed for Small Messages
Sending large messages via a messaging platform can result in high memory usage. Message brokers process messages in memory before making them available for consumption. In other words, every message must fit into RAM. That may be fine for a single client talking to a single broker, but imagine sending thousands of 1GB messages.
Many messaging platforms have hard limits on message size. For example, RabbitMQ's maximum message size used to be 2GB and was later reduced to 512MB. As a workaround, you could split large messages into smaller chunks on the sender's end and reassemble them on the receiver's end. However, this increases the memory footprint, since the receiver has to hold the chunks until it can reassemble the entire message. Splitting and reassembling each message also costs additional processing time. And if that still doesn't sound too bad, you'd have to handle the edge cases where a chunk is lost, duplicated, or fails to process. A tricky thing to get right? Yes. Easy to introduce bugs? For sure.
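To make the cost of the chunking workaround concrete, here is a minimal sketch in Python. The names `Chunk`, `split_message`, and `Reassembler` are hypothetical and not part of any messaging library, and the sketch assumes chunks may arrive out of order or more than once. It deliberately ignores chunks that never arrive, which would need timeouts and cleanup on top.

```python
from __future__ import annotations

import hashlib
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    message_id: str  # groups chunks that belong to the same payload
    index: int       # position of this chunk within the payload
    total: int       # number of chunks the receiver should expect
    checksum: str    # SHA-256 of the whole payload, used for verification
    data: bytes


def split_message(payload: bytes, chunk_size: int) -> list[Chunk]:
    """Split a payload into chunks of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    message_id = uuid.uuid4().hex
    checksum = hashlib.sha256(payload).hexdigest()
    # An empty payload still produces one (empty) chunk.
    pieces = [payload[i:i + chunk_size]
              for i in range(0, len(payload), chunk_size)] or [b""]
    return [Chunk(message_id, i, len(pieces), checksum, piece)
            for i, piece in enumerate(pieces)]


class Reassembler:
    """Buffers chunks in memory until every piece of a message has arrived."""

    def __init__(self) -> None:
        self._pending: dict[str, dict[int, Chunk]] = {}

    def add(self, chunk: Chunk) -> bytes | None:
        """Store a chunk; return the full payload once it is complete."""
        parts = self._pending.setdefault(chunk.message_id, {})
        parts[chunk.index] = chunk  # a duplicate delivery simply overwrites
        if len(parts) < chunk.total:
            return None  # still waiting; the buffered chunks stay in memory
        payload = b"".join(parts[i].data for i in range(chunk.total))
        del self._pending[chunk.message_id]
        if hashlib.sha256(payload).hexdigest() != chunk.checksum:
            raise ValueError("checksum mismatch for message " + chunk.message_id)
        return payload
```

Even this small version has to keep every partial message in `_pending`, so a lost chunk turns into a memory leak unless you add expiry logic.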
Compressing large messages might significantly reduce their size. Some messaging platforms (e.g., Kafka) support message compression with codecs such as gzip, Snappy, LZ4, or Zstandard. Unfortunately, the ones built on top of AMQP (e.g., RabbitMQ) don't.
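Where the broker does not compress messages for you, one option is to compress the payload in the client before publishing and decompress it after consuming. The sketch below uses only Python's standard `gzip` module. The helper names `compress_payload` and `decompress_payload` are hypothetical, and the actual savings depend heavily on the data: repetitive text shrinks well, while already-compressed images or video barely change.

```python
import gzip


def compress_payload(payload: bytes, level: int = 6) -> bytes:
    """Compress a message body before handing it to the broker client."""
    return gzip.compress(payload, compresslevel=level)


def decompress_payload(body: bytes) -> bytes:
    """Restore the original message body on the consumer side."""
    return gzip.decompress(body)


if __name__ == "__main__":
    # Highly repetitive sample data, chosen to show a best-case scenario.
    original = b'{"sensor": "camera-1", "label": "cat"}\n' * 10_000
    body = compress_payload(original)
    print(f"original: {len(original)} bytes, compressed: {len(body)} bytes")
    assert decompress_payload(body) == original
```

Note that compression only shrinks the message. The payload still passes through the broker, so the size limits and memory concerns above still apply to whatever is left.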
Since messaging platforms are designed for small messages, sending large messages might come with a few surprises. For instance, sending large messages via RabbitMQ might block consumers from performing heartbeat acknowledgements, which could result in the RabbitMQ server closing the consumer's connection.
Sending a Link to an External Store
Combining a reliable file store with a messaging platform addresses the issues described above. Instead of sending a large file via a messaging platform, you could store it in an external storage system (e.g., Amazon S3 or Azure Blob Storage) and send a message containing a reference that the receiver uses to retrieve the stored file. This pattern is known as "Claim Check".
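A minimal sketch of the Claim Check flow might look like the following. To stay runnable without any cloud account, it makes two loud assumptions: a local directory stands in for the external store, and Python's `queue.Queue` stands in for the messaging platform. `FileStore`, `publish_file`, `consume_file`, and the message fields are all hypothetical names chosen for illustration.

```python
from __future__ import annotations

import json
import queue
import shutil
import tempfile
import uuid
from pathlib import Path


class FileStore:
    """Stand-in for an external store; keeps each file under a generated key."""

    def __init__(self, root: Path) -> None:
        self._root = root
        self._root.mkdir(parents=True, exist_ok=True)

    def put(self, source: Path) -> str:
        key = uuid.uuid4().hex
        shutil.copyfile(source, self._root / key)  # copies without reading it all into RAM
        return key

    def open_stream(self, key: str):
        return (self._root / key).open("rb")


def publish_file(store: FileStore, broker: queue.Queue, source: Path) -> None:
    """Upload the file, then send only a small reference through the broker."""
    key = store.put(source)
    claim_check = {
        "file_key": key,
        "file_name": source.name,
        "size_bytes": source.stat().st_size,
    }
    broker.put(json.dumps(claim_check).encode("utf-8"))


def consume_file(store: FileStore, broker: queue.Queue, destination: Path) -> Path:
    """Read the reference from the broker and fetch the file from the store."""
    claim_check = json.loads(broker.get(timeout=1))
    target = destination / claim_check["file_name"]
    with store.open_stream(claim_check["file_key"]) as src, target.open("wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in blocks rather than all at once
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        source = base / "video.bin"
        source.write_bytes(b"\x00" * 5_000_000)  # pretend this is a large file
        inbox = base / "inbox"
        inbox.mkdir()
        store, broker = FileStore(base / "store"), queue.Queue()
        publish_file(store, broker, source)
        received = consume_file(store, broker, inbox)
        print(received.name, received.stat().st_size)
```

The message that travels through the broker stays a few hundred bytes at most, regardless of how large the file is. In a real system you would also decide who deletes the stored file and when.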
As good as this approach is, it requires an additional store for the files. Adding another storage component makes the overall architecture more complex. Moreover, it comes with additional costs for running and operating the store, especially when running applications outside the public cloud.
Summary
Most messaging platforms are not designed to share large files, while most file storage systems are designed for exactly that. Using an external file store to hold the files and a messaging platform to orchestrate data flows gives you the best of both worlds.
Thank you for reading. Is there something else that this post is missing? I'd love to hear about it in the comments.