When speaking at conferences, I often get asked questions about virtualisation and how fast databases will run on it (and even whether they are “supported” on virtualised systems). This is a complex question to answer, because it requires a very deep understanding of CPU caches, memory and I/O systems to fully describe the tradeoffs.
Let us first look at the political reasons for virtualising: operations teams, for very good reasons, often try to push developers towards virtualised systems – cloud is just the latest step in this ongoing trend. They try to provide an abstraction between application code and the nasty, physical logistics of data centers – making their job easier. The methods these operations teams employ take many forms: VLAN, SAN, private clouds and VMware/Hyper-V, to name a few examples. Virtualising increases their flexibility – and drives down their cost per machine, which looks great on the balance sheet. However, this flexibility comes at a very high cost. It has been said that:
“All non-trivial abstractions, to some degree, are leaky”
Joel Spolsky
In the case of virtualisation, the abstraction provided is very non-trivial indeed, and the leaking is sometimes equally extreme. Traditionally, the issue with virtualisation has been the slowdown of I/O or the network – though this has gotten a lot better with hardware support for virtual hosts (though SAN still haunts us). Over-provisioned memory is another good example of virtualisation wreaking havoc with performance. All of these seem to be surmountable though, and this is driving cloud forward.
However, lately it is becoming increasingly clear that scheduling, NUMA and L2/L3 cache misses are potentially an even larger problem and one that will surface once you take I/O out of the bottleneck club.
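To make the cache point concrete, here is a minimal C++ sketch (my own illustration, not taken from any paper or vendor documentation) of false sharing: two threads incrementing counters that happen to live on the same cache line fight over it through the coherence protocol, while padding the counters onto separate lines makes the contention disappear. The timings will vary wildly with CPU topology and with how the hypervisor happens to place your vCPUs.

// Sketch: false sharing vs. padded counters. Two threads each bump their own
// counter; the only difference between the two structs is cache-line placement.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

struct Packed {                          // both counters share one cache line
    volatile uint64_t a = 0;
    volatile uint64_t b = 0;
};

struct Padded {                          // counters forced onto separate 64-byte lines
    alignas(64) volatile uint64_t a = 0;
    alignas(64) volatile uint64_t b = 0;
};

template <typename T>
double run(T& c) {
    constexpr uint64_t kIters = 100'000'000;
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (uint64_t i = 0; i < kIters; ++i) c.a = c.a + 1; });
    std::thread t2([&] { for (uint64_t i = 0; i < kIters; ++i) c.b = c.b + 1; });
    t1.join();
    t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    Packed packed;
    Padded padded;
    std::cout << "shared cache line : " << run(packed) << " s\n";
    std::cout << "separate lines    : " << run(padded) << " s\n";
}

On most boxes the padded version is several times faster – the work is identical, the only difference is where the bytes land relative to a 64-byte cache line.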
As we grow our data centers to massive cloud scale and pay for compute power by the hour, every machine counts and will figure on the balance sheet. It should also be clear that an important optimisation will be to focus on the performance of the individual scale nodes – to make the best use of that expensive power.
This morning, I ran into some fascinating research in this area by Barret Rhoden, Kevin Klues, David Zhu and Eric Brewer, who take this to another level:
“Improving Per-Node Efficiency in the Datacenter with New OS Abstractions” (pdf)
To whet your appetite, here is a quote from the abstract (my highlight).
“We believe datacenters can benefit from more focus on per-node efficiency, performance, and predictability, versus the more common focus so far on scalability to a large number of nodes. Improving per-node efficiency decreases costs and fault recovery because fewer nodes are required for the same amount of work. We believe that the use of complex, general-purpose operating systems is a key contributing factor to these inefficiencies.”
A highly recommended read and a good primer on some of the things that concern me a lot these days.
Kejser’s Law
I think it is time for me to state my own law (or trivial insight, if you will) of computing. Though I stand here on the shoulders of giants, I will steal a bit of the fame. I think it is appropriate that I state one of the things I aim to show people at conferences:
“Any shared resource in a non-trivial workload,
will eventually become a bottleneck”
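To illustrate the law with the smallest shared resource I can think of, here is another C++ sketch (again my own, hypothetical example): a single atomic counter hammered by every thread, versus per-thread private counters that are combined once at the end. As the thread count grows, the shared counter stops scaling; the private counters keep going.

// Sketch: one shared counter vs. per-thread private counters, combined at the end.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

constexpr uint64_t kItersPerThread = 10'000'000;

double shared_counter(int threads) {
    std::atomic<uint64_t> counter{0};                 // the shared resource
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (uint64_t i = 0; i < kItersPerThread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : pool) th.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

double private_counters(int threads) {
    std::atomic<uint64_t> total{0};
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            volatile uint64_t local = 0;              // nothing shared inside the loop
            for (uint64_t i = 0; i < kItersPerThread; ++i) local = local + 1;
            total.fetch_add(local);                   // single combine step at the end
        });
    for (auto& th : pool) th.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    for (int threads : {1, 2, 4, 8})
        std::cout << threads << " threads: shared " << shared_counter(threads)
                  << " s, private " << private_counters(threads) << " s\n";
}

The shared resource here is just one cache line holding the atomic; in a real database it might be a log buffer, a lock manager or a spinlock – the shape of the curve is the same.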