Introduction
Service Discovery, the phonebook of applications, is a foundation of service-oriented architecture. It answers the question: how does one service dynamically discover the endpoints of another?
It is not a new thing that came with microservices. ZooKeeper is a famous application for implementing service discovery regardless of your requirements. I give it good praise for the things you can do with it. If you don't have a lot of time to spend, I recommend it.
Use Case
Building on top of my last post about building a low-latency async RPC library, I needed a way for my applications to find the registered running services quickly at runtime, whether on a laptop or in a datacenter environment. From the get-go, you know there has to be a way to enter and query for entries. The application would have to provide the ip:port and the protocol, which together I label the service provider. Protocol is one of http, rpc server, or rpc publisher. It would also need health checking to make sure each ip:port:protocol is still valid. Finally, a way to be notified of any changes.
Building it
Just have to say, it was not the most fun thing to build, as you will drudge through some really boring parts. The most fun was thinking of interesting ways to make it work and building it that way.
First idea
First off, when you look at an application like this you think API, Storage/Leader, and some Workers to do the checking and notifying. Any application that wants to be discovered needs to expose a ZeroMQ PUB socket for heartbeating. Drawing inspiration from other applications influences the architecture. Yeah, we have 3 roles in this application. MICROSERVICES! Only 2 parts are public; I just have to figure out how to make them work together.
Evolving the idea as I go
I did end up going with this train of thought for most of the project, until I realized that subscribing to notifications directly from a Worker would be odd, since an application would then need to discover all of the Workers. Filtering is not a big deal, as ZeroMQ PUB/SUB provides this. I figured the subscriptions could go to the API then, since that can be horizontally scaled. Well, I do not want applications to connect to all APIs, so....
Worker strictly monitors application - The Workers do not know anything about an application; they know only the ip and port to subscribe to, which makes it impractical to make these nodes discoverable. Workers now publish a snapshot of service provider "presence". The API remains the only public node in this application.
Leader makes decisions and publishes them - The Storage/Leader node subscribes to the Workers' presence feeds, maps each ip:port to its service provider, makes decisions about changing "presence" state, and publishes those changes. The Leader needs to know about the Workers and send information to specific ones or broadcast to all, so there is a peer-to-peer clustering part of the RPC library I created to facilitate this, which is just sublime to me.
Ok, so now I can have any application subscribe to any one API, get notified that a provider's state has changed, and react by permanently removing the endpoint from its list of clients to keep in touch with.
API contacts/subscribes to Leader - The API serves as a subscriber replica of "presence" state. The Storage/Leader provides the API with any publicly requested information, such as service information and providers (ip:ports). A service must make a request to an API to register itself, and from then on it will be monitored by a Worker.
Simple but amazing. Oh, did I say I built it all in Rust?! Rust is great, but debugging is a terrible experience for me! It is getting better though.
What happens if any node goes down?
Thinking about this every step of the way influenced why there are 3 roles.
Worker Down - This disrupts the online/offline watching capability. The Storage/Leader node orphans those watch tasks until a redistribution process is triggered.
Leader Down - The worst failure that could happen. Each Worker keeps doing its thing and publishing data to its PUB socket. API calls will obviously fail. When the Leader comes back up, it will have to reload the service and provider information, ask each Worker for its task list, then redistribute if necessary, since some services could have gone offline and new ones come online. This is the ZeroMQ world, so the API will eventually reconnect.
API Down - The least catastrophic failure. The API is stateless, so it can go online or offline as necessary to scale up or down. Applications that lose their notification subscriptions just have to ask another API node.
Conclusion
Great experience. Definitely not done with everything; these are some of the targets for a Version 1 (alpha, really). I just wanted to get my ideas somewhat documented and coded. I'm sure I will find faults, and I will work through them like any other piece of software.
I likely will not release this thing to the public unless I document it to the point that many users can work with it, and put a license on it.
Why not use something else?
You might be wondering: yet another service discovery application? Why not use DNS, ZooKeeper, Consul's catalog, Istio, HAProxy, or some other method of registering services? If you ask this question, I'd ask back: are you motivated and curious enough to understand how things actually work? Also, this application isn't for you, coupled with the fact that I spend my own time doing what I want to do. 😀 Any determined, curious person could build this.