MySQL's Binary Log is nice to have, but not the easiest thing to use. Change data capture (CDC) is my primary reason for touching it: when a user performs an action that causes an INSERT/UPDATE/DELETE, I want to know about it.
The typical approach is to emit CDC events from your application itself, but that is fragile: if a fatal error kills the application in the middle of a data manipulation statement, your CDC code never runs, even though the entry was written to MySQL and its Binary Log.
I've always read about Facebook doing this and never thought I'd do it myself. That changed this week when I started to explore the automation system for my product. As above, how can I make sure that automation will absolutely run in response to a change a user made? That's right: the binary log.
Sure, Redis has a notification mechanism you could use for CDC, but you'd still need to log the events yourself to process them later.
I decided to embark on the journey of capturing MySQL Binary Log data in an application written in Rust (with RocksDB) that lets consumers grab the data they want from it. I want to store the data on the same machine and in an iterable way, and RocksDB is very useful for that. Polling isn't an issue for me, though there is a way to notify all subscribers when new data is available, to short-circuit their poll waits.
One of the few constraints is that every MySQL instance must run this CDC application alongside it (it could evolve to have other capabilities later). A logical offset must be distributed to consumers rather than a binary log offset (RocksDB works very well as a log).
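To make the logical-offset idea concrete, here is a minimal sketch. It uses a `BTreeMap` as a stand-in for RocksDB (both are ordered key-value stores), with the common trick of encoding a monotonically increasing `u64` offset as big-endian bytes so byte order matches numeric order. The `EventLog` type and its methods are my own illustration, not part of any crate.

```rust
use std::collections::BTreeMap;

/// Models an ordered key-value store (a stand-in for RocksDB).
/// Keys are logical offsets encoded big-endian so that lexicographic
/// key order equals numeric offset order.
struct EventLog {
    next_offset: u64,
    store: BTreeMap<[u8; 8], Vec<u8>>,
}

impl EventLog {
    fn new() -> Self {
        EventLog { next_offset: 0, store: BTreeMap::new() }
    }

    /// Append a serialized event; returns the logical offset
    /// that gets handed to consumers (not a binlog file offset).
    fn append(&mut self, event: &[u8]) -> u64 {
        let offset = self.next_offset;
        self.store.insert(offset.to_be_bytes(), event.to_vec());
        self.next_offset += 1;
        offset
    }

    /// Consumers poll from their last acknowledged offset onward,
    /// iterating the store in key order.
    fn read_from(&self, offset: u64) -> Vec<(u64, &[u8])> {
        self.store
            .range(offset.to_be_bytes()..)
            .map(|(k, v)| (u64::from_be_bytes(*k), v.as_slice()))
            .collect()
    }
}

fn main() {
    let mut log = EventLog::new();
    log.append(b"insert into users");
    log.append(b"update users");
    // A consumer that has acknowledged offset 0 resumes from offset 1.
    let pending = log.read_from(1);
    assert_eq!(pending.len(), 1);
    assert_eq!(pending[0].0, 1);
}
```

Because the offset is decoupled from the binlog file name and position, a binlog rotation on the MySQL side never disturbs a consumer's cursor.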
The Binary Log is, obviously, binary data, so you need a parser. Thankfully, mysql_async has one that parses the stream into its respective Binary Log entries. Which entries are we interested in? The Rotate, TableMap, Query, and (Write|Update|Delete)Rows events.
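The dispatch over those event types looks roughly like the following. This is a sketch with a hand-rolled stand-in enum; the variant names and fields are illustrative and are not mysql_async's exact types.

```rust
/// Stand-in for the binlog event variants a parser like mysql_async's
/// yields. Names and fields here are illustrative, not the crate's API.
enum BinlogEvent {
    Rotate { next_log: String },
    TableMap { table_id: u64, schema: String, table: String },
    Query { sql: String },
    WriteRows { table_id: u64, rows: Vec<Vec<u8>> },
    UpdateRows { table_id: u64, rows: Vec<Vec<u8>> },
    DeleteRows { table_id: u64, rows: Vec<Vec<u8>> },
}

fn classify(event: &BinlogEvent) -> &'static str {
    match event {
        // Rotate: the stream moved to a new binlog file.
        BinlogEvent::Rotate { .. } => "rotate",
        // TableMap: maps a numeric table_id to schema.table for the
        // row events that follow it.
        BinlogEvent::TableMap { .. } => "table_map",
        // Query: carries statement text (DDL, etc.) worth inspecting
        // for schema changes.
        BinlogEvent::Query { .. } => "query",
        // The row events carry the actual changed row images.
        BinlogEvent::WriteRows { .. } => "insert",
        BinlogEvent::UpdateRows { .. } => "update",
        BinlogEvent::DeleteRows { .. } => "delete",
    }
}

fn main() {
    let e = BinlogEvent::WriteRows { table_id: 42, rows: vec![] };
    assert_eq!(classify(&e), "insert");
}
```

Every other event type coming off the stream can be ignored or, at most, used to advance the saved position.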
- The pub/sub application can go down at any time and needs to continue where it left off. It also needs to know table structure: each row event carries unparsed row value data, and parsing that effectively requires both the TableMap event and the information schema. What if the table structure changes in the middle of a run? The Query event's statement text has to be read to find which table was modified.
- Row value parsing is the unfortunate part, if you ask me. You have to maintain two connections if your end goal is to go from column[], row[] to {column x: row value x, ...}. It all has to be done in the application; there is no side-stepping it.
- It is not the most fun one will have, but fortunately there are existing CDC solutions that handle the nitty-gritty details for you, and I suggest evaluating those first.
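The column[], row[] to {column: value} step mentioned above is mechanically simple once you have the parsed pieces; the pain is in obtaining them. A minimal sketch, with made-up column and value types (real row values would be typed per the column definitions from the information schema):

```rust
use std::collections::HashMap;

/// Zip a table's column list with one decoded row's values into a
/// {column: value} map. Types are simplified to String for illustration;
/// a real implementation would carry typed values per column.
fn row_to_map(columns: &[String], values: &[String]) -> HashMap<String, String> {
    columns
        .iter()
        .cloned()
        .zip(values.iter().cloned())
        .collect()
}

fn main() {
    let columns = vec!["id".to_string(), "email".to_string()];
    let values = vec!["7".to_string(), "a@example.com".to_string()];
    let row = row_to_map(&columns, &values);
    assert_eq!(row["email"], "a@example.com");
}
```

The second connection exists only to query `information_schema` for the column names and types; the binlog stream itself never tells you column names.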