Thursday, 6 March 2014

Reports exaggerated

I've been letting the blog rest recently, and not so recently as well.  The problem is not a lack of subjects, but a lack of time to do them any justice.  However it is quite sad to see that my last entry was in September 2012, so it is time to post again.

Of late I have been pondering what I have to say about :
  • Distributed MVCC and write-scaling
  • Different approaches to eventual consistency with replicated RDBMS
  • Various MySQL Cluster related topics
  • Various general rambling and unstructured topics
However, these will take some time to percolate and calcify.

In the meantime here are some things I have found interesting recently :
  • Learn You Some Erlang for Great Good
    I actually rediscovered this online book after watching some Joe Armstrong + Erlang videos, after watching some spoof video about bringing Erlang up to date.  All recommended  Erlang and Ndb Cluster share some Plex heritage, which can still be seen in their architectures today.  Since Plex, Erlang has mated with Prolog, and Ndb Cluster was involved in a car crash with C++.
  • HyperDex + Hyperdex Warp
    Something I discovered last year from Emin Gün Sirer's blog and have returned to since.  There are a number of nice ideas combined here (chain replication, value dependent chaining, hyperspace hashing, subspaces).  My favourites are the concept of 'spurious coordination' and their solution w.r.t. transaction consistency : ordering the route of the optimistic 'distributed commit' based on the affected keys.  I guess we need more independent analysis and evaluation to understand the strengths and weaknesses of these techniques.
  • Kronos
    This is a distributed HA 'event ordering system' from the same Cornell HyperDex team.  Thinking about distributed MVCC led me to thinking about efficiently maintaining a distributed partial ordering of events while avoiding 'spurious coordination'.  Kronos is an attempt to solve part of that problem in a kind of abstracted SOA way.  There is some nice detail in the paper about their dependency graph traversal optimisations, and how dependencies are immutable once discovered, so can be cached, replicated for read scale-out etc.  This could be a great systems building block.
  • Systems Performance book
    I am slowly reading this doorstop book from Brendan Gregg, an ex Solaris kernel engineer at Sun, now at Joyent.  It contains a great amount of recent practical information about Linux + Solaris performance analysis and optimisation.  Unix performance tools have always been a little opaque to me, with very little of how-to-approach a performance problem ever being documented.  This book covers many old and new tools, but also includes rare information on how to analyse problems with these tools, rather than just syntax and units of values returned.  Perhaps even better is his supreme confidence about tackling and solving any performance problem that foolishly catches his eye.  I guess that comes from experience, but maybe a little can be conveyed to his readers by this book.
  • Google Spanner, Galera Cluster, MoSQL, RAMCloud, NuoDB, OpenReplica
    Different approaches, ideas, hidden tradeoffs, strengths and weaknesses!
One of my favourite discoveries was this quote attributed to Charles Babbage :
"On two occasions I have been asked 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?'
I am not able rightly to comprehend the kind of confusion of ideas that could provoke such a question." 
Strange how often this response has been on my lips since ! 

Sunday, 30 September 2012

Saturday at MySQL Connect

The first day of the first MySQL Connect conference is done.  It's been a busy day!  Many attendees are interested in the new MySQL Server 5.6 release, but of course MySQL Cluster is the main draw here.  After a session from Oracle on the new features in 7.2, and early access features in 7.3, I attended Santo Leto's MySQL Cluster 'hands on lab'.  Despite having started more clusters than most, it felt like a new and exciting experience installing and running my cluster alongside everyone else.  The lab machines had some networking issues, but with Santo's help we seamlessly failed-over to some downloads he'd prepared earlier - very professional!

Afterwards it was my turn to talk on the subject of MySQL Cluster performance.  The quality of the questions was impressive - there seems to be a very capable crowd in San Francisco this weekend.  I was flattered that some audience members were taking pictures of my slides, but there's no need, you can download them here, and probably from Oracle somewhere too.

Finally we had a 'Birds of a Feather' session where a number of users asked questions about replication, backups, failover + arbitration, system recovery etc...  Great to get the chance to explain some things, hear about user experiences, and get some feedback about user experience.

There's been a lot of good content so far, and interesting discussions, but the highlight for me has been meeting with colleagues old and new, from MySQL, Sun and Oracle times.  More of the same tomorrow, I'd better get some sleep!

Wednesday, 15 August 2012

Session at MySQL Connect

I will double my all-time total of public speaking engagements this September at the MySQL Connect conference in San Fransisco.

The title of my session is "Delivering Breakthrough Performance with MySQL Cluster", and it's between 17:30 and 18:30 on Saturday 29th September.

The content is not finalised yet, so if there's something you would like to hear about which fits with the abstract, then comment below.  If it doesn't fit in with the abstract then we will just have to talk about it afterwards over some drinks.

This is the first year of MySQL Connect, but there's some great content.  I'm looking forward to meeting with users, customers, other MySQL engineers, and hopefully get the chance to learn something!

See you there.

Wednesday, 7 March 2012

The CAP theorem and MySQL Cluster

tldr; A single MySQL Cluster prioritises Consistency in Network partition events. Asynchronously replicating MySQL Clusters prioritise Availability in Network partition events.

I was recently asked about the relationship between MySQL Cluster and the CAP theorem. The CAP theorem is often described as a pick two out of three problem, such as choosing from good, cheap, fast. You can have any two, but you can't have all three. For CAP the three qualities are 'Consistency', 'Availability' and 'Partition tolerance'. CAP states that in a system with data replicated over a network only two of these three qualities can be maintained at once, so which two does MySQL Cluster provide?

Standard 'my interpretation of CAP' section

Everyone who discusses CAP like to rehash it, and I'm no exception. Daniel Abadi has the best CAP write-up that I've read so far, which reframes CAP as a decision about whether to ultimately prioritise availability or data consistency in the event of a network partition. This is how I think of CAP. He also discusses related system behaviour in normal operation which I'll return to later.

While this reframing clarifies CAP, the terms network partition, availability and consistency also need some definition.

Network replicated database

CAP is only really relevant in the context of a network replicated database (or filesystem or state machine). A network replicated database stores copies of data in multiple different systems (database nodes), connected by a network. Data can be read and updated. Updates are propagated to all nodes with replicas via the network. Database clients connect to database nodes via the network to read data and make updates. Replication may occur to improve availability, to improve request latency, or to improve read bandwidth.


The network replicated database exists to provide services such as Read and Write on the data it stores. Its availability can be measured as the ability of any client to perform any service on any data item.

This Service Availability can be compromised by :
  • Failure of client nodes
  • Network failures between clients and database nodes
  • Network failures between database nodes
  • Failure of database nodes
Client node and networking failures cannot really be considered a property within the control of a database system, so I consider their effects out of the scope of CAP. However, where clients connect to a database node, and that database node is isolated from other database nodes, whether or not those clients are given service is within the scope of CAP.

Service Availability is not binary, it can partially degrade, perhaps by affecting :
  • A subset of all clients
  • A subset of all stored data
  • A subset of request types

The shades of grey within the definition of availability are responsible for most of the arguments around CAP. If we take a strict view - either all services available on all data for all clients, or nothing, then availability is fragile and hard to maintain. If we take a more flexible approach then some service availabilty can be preserved even with a completely decimated network. In the loosest definition, if any client receives any service on any data, then the system is still available. Rather than choose one position, I regard availability as a range from 100% down to 0% for a full outage. Anything in the middle is reduced availability, but it does not mean that the system is not serving its purpose adequately.


For consistency to be satisfied, the multiple replicas of data in a network replicated database should behave as though there were only one copy of the data. Simultaneous reads of the same data item from clients connected to different database nodes must always return the same result. Where two or more updates to the same data item are submitted simulteneously, they must be serialised, or one must be rejected, or they must be merged so that a single value results. This one-copy model makes it simple for database clients to use the network replicated database as if it were a single database system with one atomically read/written copy of their data.

If one copy consistency is relaxed, then different database nodes may observably have different values for the same data item simultaneously. Over time the data copies may be aligned, but clients accessing the data must beware that reads may not return the results of the most recently accepted writes. This behaviour may be described as eventual consistency. Providing eventual consistency allows a network replicated database to maximise availability, but pushes the problem of dealing with transient inconsistencies up the stack to user applications. Furthermore there are varying qualities of eventual consistency, with varying guarantees and levels of application support available.

Network Partitions

Network partitions isolate subsets of the nodes of a network replicated database. The interesting property of a network partition is that each node subset cannot tell whether the other node subset(s) are :
  1. dead
  2. alive but isolated from clients
  3. alive and reachable by clients but isolated from us
Not knowing the state of the other subset(s) is what forces a system to decide between maximising service availability and maximising consistency. The interesting case is 3) where some database nodes (potentially containing all or some of the data) are alive elsewhere and have clients connected to them. If those clients are allowed to make writes on data copies stored on those database nodes, then we must lose one copy consistency as we cannot supply those new values in response to a read of our local copy. If those clients are not allowed to make writes then we have degraded service availability for them. Which is it to be? This is the unavoidable choice at the centre of the CAP theorem. Stated this way it seems less of a theorem and more of a fact.

Back to MySQL Cluster - which does it provide?

A single MySQL Cluster prioritises data consistency over availability when network partitions occur.

A pair of asynchronously replicating MySQL Clusters prioritise service availability over data consistency when network partitions occur.

So you can have it both ways with MySQL Cluster - Great!

Single MySQL Cluster - CP

Within a single MySQL Cluster, data is synchronously replicated between database nodes using two-phase commit. Nodes are monitored using heartbeats, and failed or silent nodes are promptly isolated by live and responsive nodes. Where a network partition occurs, live nodes in each partition regroup and decide what to do next :
  • If there are not enough live nodes to serve all of the data stored - shutdown
    Serving a subset of user data (and risking data consistency) is not an option
  • If there are not enough failed or unreachable nodes to serve all of the data stored - continue and provide service
    No other subset of nodes can be isolated from us and serving clients
  • If there are enough failed or unreachable nodes to serve all of the data stored - arbitrate.
    There could be another subset of nodes regrouped into a viable cluster out there.

Arbitration occurs to avoid the split brain scenario where a cluster could theoretically split in two (or more), with each half (or third, or quarter) accepting writes and diverging from the others. In other words, arbitration occurs to preserve consistency.

Arbitration involves :
  • Database nodes agree on an arbitrator in advance
  • During node or network failure handling, no data writes are committed.
  • When arbitration is required due to node failures or network issues, viable node subsets (potential clusters) request permission from the previously agreed arbitrator to provide service.
  • Each request to the arbitrator will result in either : Yes, No or timeout
  • Anything other than Yes results in node shutdown.
  • The arbitrator only says Yes once per election round (First come first served). Therefore the arbitrator only says yes to one potential cluster in a partitioned network.

Note that arbitration is not the same as achieving a quorum. A cluster with three replicas and an arbitrator node can survive the loss of two data nodes as long as the arbitrator remains reachable to the last survivor. The arbitrator role is lightweight as it is not involved in normal traffic. I am surprised that the lightweight arbitrator pattern is not more common.

How does a single MySQL Cluster degrade service availability as a result of network partitions?

Where some subset of data nodes are isolated and shut-down :
  • Those nodes are 100% out of service, until they restart and can rejoin the cluster
    They will attempt to do so automatically
  • Any clients connected only to those nodes are out of service
    By default clients attempt to connect to all data nodes, so partial connectivity issues needn't degrade client availability.
  • The remaining live nodes are 100% in-service
  • Clients connected to the remaining live nodes are 100% in service
Where no subset of data nodes is live
  • All clients experience 100% service loss, until the data nodes restart and can rejoin the cluster
    They will attempt to do so automatically.

A single MySQL Cluster does not degrade to partial data access, or read only modes as a result of network partitions. It does not sacrifice consistency.

How can MySQL Cluster be described as highly available if it sacrifices availability for consistency in the event of a network partition?

Availability is not binary - many types of network partition can erode availability, for some clients, but do not extinguish it. Some set of clients continue to receive 100% service. Only double failures in the network can cause a network partition resulting in full service loss.
Furthermore, network partitions are not the only risks to availability, software errors, power failures, upgrades, overloads are other potential sources of downtime which Cluster is designed to overcome.

Asynchronously replicating clusters - AP

Where two Clusters are asynchronously replicating via normal MySQL Replication, in a circular configuration, reads and writes can be performed locally at both clusters. Data consistency within each cluster is guaranteed as normal, but data consistency across the two clusters is not. On the other hand, availability is not compromised by network partitioning of the two clusters. Each cluster can continue to accept read and write requests to all of the data from any connected client.

Eventual consistency between the clusters is possible when using conflict resolution functions such as NDB$EPOCH_TRANS, NDB$EPOCH, NDB$MAX etc.

How does consistency degrade between replicating MySQL Clusters during a network partition?

This depends on the conflict resolution function chosen, and how detected conflicts are handled. Some details of consistency guarantees provided by NDB$EPOCH et al are described here.

What about normal operation?

Abadi's post introduced his PACELC acronym, standing for something like :

 if (network Partition)
trade-off Availability vs Consistency;
trade-off Latency vs Consistency;

My first comment has to be that it's bad form to put the common case in an else branch!
However, it is certainly true that the properties during normal operation are usually more important than what happens during a network partition. The ELC section is stating that while all database nodes are present, a network replicated database can choose between minimising request Latency, or maintaining Consistency. In theory this normal operation latency-vs-consistency tradeoff could be completely independent to the Network Partitioning availability-vs-consistency tradeoff, e.g. you could have any of :
  1. PA EL (Partition - Availability, Else - Latency minimisation)
  2. PA EC (Partition - Availability, Else - Consistency)
  3. PC EL (Partition - Consistency, Else - Latency minimisation)
  4. PC EC (Partition - Consistency, Else - Consistency)

The common cases are 1 + 4, where we choose either consistency at all times, or Maximum Availability and Minimum Latency. Case 2 is a system which aims for consistency, but when a network partition occurs, aims for Availability. Case 3 is a system which aims for minimal request Latency, and when a partition occurs aims for consistency.

Examples of systems of each type :
  1. Any eventually consistent system, especially with local-database-node updates + reads
  2. Best-effort consistent systems that degrade in failure modes (e.g. MySQL semi-synchronous replication)
  3. ???
  4. Always consistent systems (e.g. single database instance, single MySQL Cluster)

I am not aware of systems meeting case 3 where normally they minimise latency over consistency, but start choosing consistency after a network partition. Maybe this category should be called 'repentant systems'?

The problem for systems in Cases 1 or 2 - anywhere where Latency minimisation or Availability is chosen over consistency - is the need for user applications to deal with potential inconsistencies. It is not enough to say that things will 'eventually' be consistent. It's important to describe how inconsistent they can be, whether the temporary inconsistencies are values which were once valid, how those values relate to other, connected values etc.

There are certainly applications which can operate correctly with practical eventually consistent databases, but it's not well known how to design applications and schemas to cope with the transient states of an eventually consistent database. The first ORM framework to opaquely support an underlying eventually consistent database may actually be worth the effort to use! A reasonable approach is to design schemas with associated read/modification 'protocols' as if they were abstract data types (ADTs). These ADTs can then have strengths and weaknesses, properties and limitations which make sense in some parts of an application schema where the need to support eventual consistency overcomes the inherent effort and limitations.

Stonebraker and others have commented on network partitions being a minor concern for a well designed datacentre-local network, where redundancy can be reliably implemented. Also the latency cost of maintaining consistency is lower as physical distances are smaller and hop counts are lower. This results in 'CP' systems being attractive at the data centre scale as the need to sacrifice availability due to network partition is rarely dominant, and the latency implications during normal operation are bearable. Perhaps this highlights the need in these theoretical discussions to illustrate theoretically problematic latencies and availabilities with real numbers.

At a wider network scale, latencies are naturally higher, implying that bandwidth is lower. The probability of network partitions of some sort may also increase, due to the larger number of components (and organisations) involved. The factors combine to make 'AP' systems more palatable. The everyday latency cost of consistency is higher, and losing availability due to potentially more frequent network partitions may not be acceptable. Again, real numbers are required to illuminate whether the achievable latencies and probable availability impacts are serious enough to warrant changing applications to deal with eventually consistent data. For a particular application there may or may not be a point at which an AP system would meet its requirements better.

Consistent systems can be scaled across many nodes and high latency links, but the observed operation latency, and the necessary impacts to availability implied by link failure set a natural ceiling on the desirable scale of a consistent system. Paraphrasing John Mashey, "Bandwidth improves, latency is forever". Applications that find the latency and availability constraints of a single consistent system unacceptable, must subdivide their datasets into smaller independent consistency zones and manage potential consistency shear between them.

Finally (another excessively long post), I think the technical and actual merits of widely distributed 'CP' systems are not well known as they have not been commonly available. Many different database systems support some form of asynchronous replication, but few offer synchronous replication, fewer still offer to support it over wide areas with higher latency and fluctuating links. As this changes, the true potential and weaknesses of these technologies, backed by real numbers, will start to appear.

Edit 7/3/12 : Fix bad link