I understand why people want to move back to the shared (cloud) model, but understanding what needs to be ephemeral needs to be considered based off of requirements. Elasticsearch in a docker container scheduled by a compute scheduler? Is that not a bad idea for something as unstable as Elasticsearch? MemSQL in a docker container? The point of MemSQL is for fast data access in memory and I would think there is a legit case for it, so why not allocate MemSQL dedicated servers? The more important question one should answer is ‘Should important storage nodes have to compete for compute power against CRUD nodes?’ Most likely your answer would be no, so my suggestion is to use a physical dedicated server, but alas we have the cloud craze blazing ahead of common sense.
Let’s get right into it.
Persistent Storage Nodes
Databases, file storage and other persistent systems should not be ephemeral by any means. These are hard storage and require a fixed location unless you can figure out how to move data at a super fast rate in order to avoid catastrophic failure. Replicate only mitigates this issue (is it?), it is not designed to be a magic bullet for some how guaranteeing some crazy 99.9999999999% availability.
Should storage query layer clients be ephemeral? Depends on what the clients are used to do.
– For reading? Most likely not a problem as the initiating user can try again. This is bandwidth, throughput and latency, but it is easier to do a resume!
– For writing on the other hand, imagine transferring 100mb to be stored and the storage query layer client dies (for whatever reason), let’s assume that the storage engine cleans up the mess left behind (with no fragmentation) then what do you communicate to the initiating client (human or machine)? That’s right you can’t! An initiating client will likely try again and get connected (in the backend via proxy) to another storage query client. Why is this fine with people? Availability? Distributed? “Data store suffered a failure, please try your upload of 100mb again”. That is wasted bandwidth, throughput and latency!
A ephemeral node can go down at anything for no reason whatsoever. Is it worth the latency and decreased throughput to put storage write clients on an ephemeral node? I would argue against the worth given the randomness in death of nodes in the cluster. Write clients should be in a fixed location where the same loss/retry cycle can be tolerated with a load balancer.
I would like to know the architectures in distributed systems that account for death of nodes where a client is currently transferring data if there are any. Failure is usually handled with an error message and then retry of the idempotent operations. Idempotency is nice, but not for large files.
Unless you have a metric shit ton of nodes where you can replicate data enough and the in memory storage/file system is big enough to tolerate large node failures and random node failures that it would not matter if the nodes were cloud or fixed then sure cloud works, but I haven’t seen anywhere that has large enough of a deployment. Ephemeral would represent unstable nodes and fixed would represent stable long-term nodes as that point.
So how about math/compute-heavy nodes? Again, depends on the data structures and algorithms that are used. Naive will always restart in the face of failures and my guess is that most people use naive algorithms to do compute heavy operations. Another factor to consider is how loaded the nodes are going to be by not only your application but the others who are scheduled on the physical node. Do these nodes need a fixed location? Fixed if you use naive algorithms and dynamic if you use algorithms that break the problem up to distribute it.
Outward Facing Data Nodes
CRUD (REST, SOAP, etc.), UI, cache (memcache) and anything that basically serves data independently of other nodes and do not store data locally usually can live on a ephemeral node. HTTP clients usually retry on their own, so users/machines can deal with delay in the case of failure. Writes are by default considered idempotent, so if a write fails then user can retry, but this is not always the case.