Monday 28 March 2011

MySQL Cluster online scaling

Most people looking at a diagram showing the Cluster architecture soon want to know if the system can scale online. Api nodes such as MySQLD processes can be added online, and the storage capacity of existing data nodes can be increased online, but it was not always possible to add new data nodes to the cluster without an initial system restart requiring a backup and restore.

An online add node and data repartitioning feature was finally implemented in MySQL Cluster 7.0. It's not clear how often users actually do scale their Clusters online, but it certainly is a cool thing to be able to do.

There are two parts to the feature :
  1. Online add an empty data node to an existing cluster
  2. Online rebalance existing data across the existing and new data nodes

Adding an empty data node to a cluster sounds trivial, but is actually fairly complex given the cluster's distributed configuration, ring heartbeating etc. Stewart Smith did some preparatory work on this a few years ago, and this was revisited for the feature.

Rebalancing existing table data to make use of the new storage capacity is more challenging. How does this work? More importantly, how does this work online, while transactions are starting and committing, and queries are running?

As an aside, the definition of 'Online' used here is that multiple distributed clients continue to start and commit transactions reading and writing data to the cluster. Some things, like concurrent DDL may be blocked during these operations.

To understand the online data rebalancing mechanism, we need to go into more detail on the native data distribution mechanisms introduced in the last post.

Ndb native partitioning variants

There are currently three variants of Ndb's native partitioning function :
  • Linear Key
  • Key
  • HashMap
Linear Key is used where the table is created with PARTITION BY LINEAR KEY. Key is used where the table is created with PARTITION BY KEY *prior to Cluster 7.0. HashMap is used where the table is created with PARTITION BY KEY in releases starting at Cluster 7.0.

These partitioning functions can be decomposed into two functions, where the first, an MD5 hash of the partition/distribution key, is fixed. MD5 is no longer considered secure, but for the purpose of 'balancing' rows across fragments it is more than adequate.

The variant part of these algorithms is how the MD5 hash of the key is mapped to a partition/fragment number. This has important implications for how repartitioning works.

fragment_num( dist_key_cols ) = mapping_fn( md5 ( dist_key_cols ) )

Linear Key Mapping Function


The Linear Key scheme is part of an old design for online repartitioning. The old design was intended to minimise the amount of data transfer required when repartitioning tables by ensuring that an existing partition could be cleanly 'split in two' so that half of its data could be migrated. This is a good policy, but Linear Key had the downside of requiring a power-of-2 number of partitions.

Where a non power-of-2 number of data nodes existed, this causes problems. Additionally, if repartitioning with this scheme had ever been implemented, it would have required fragments to be 'split in two' to expand the system. The power-of-2 requirement of this scheme was listed as a Cluster limitation in the early days, though this has not been a problem since the non-linear Key scheme became default.

Key Mapping Function

The Key scheme is simpler than Linear Key, and simply divides the hash result modulo the number of fragments of the table to determine which fragment a row resides in. This removes the power-of-2 restriction that Linear Key required, so rows can be evenly balanced across any number of nodes. However, it is not amenable to online reorganisation, as changing the number of table fragments changes the modulo division value, which can result in most of the resulting partition values changing. This means that a reorganisation using this scheme could result in excessive data transfer.

e.g.

mapping_fn_key(x) = x % num_frags

md5(dist_key_cols) = 23

num_frags = 4, 23 % 4 = 3
num_frags = 6, 23 % 6 = 5
num_frags = 8, 23 % 8 = 7


As the system expands, too much data is being moved. This is expensive, slow, requires extra storage etc.

HashMap Mapping Function

In MySQL Cluster 7.0, the HashMap distribution scheme was added and became the default. It is used when PARTITION BY KEY() is explicitly given, or implied if no partitioning specification is given.
The HashMap scheme uses a mapping table from md5 hash result to fragment number. The hash result used is 32-bits, which would require a large lookup table, so we first shrink it down to something more manageable (n) by modulo division by n. n = 240 is the default number, though the implementation supports any modulo value.
The resulting number is then used to lookup a table to get the fragment id which will store the row.

e.g.

mapping_fn_hashmap(x) = lookup_tab [x % mod_val ]

fragment_number = lookup [ md5( dist_key_cols ) % mod_val ]

fragment_number = lookup [ md5( dist_key_cols ) % 240 ]

The lookup table adds another layer of indirection between the hash result (which is fixed for any given key), and the fragment number, whose range can increase over time.

Assuming mod_val is 240, and we start with 4 fragments, then the 240 entry lookup table will have 60 entries with 0, 60 with 1, 60 with 2 and 60 with 3. As an aside, these will be sequenced as 0,1,2,3,0,1,2,3,0.... so that the actual default distribution will be exactly the same as with the KEY scheme.

If we want to spread the table data over 6 fragments, then we can change the table to use a new hashmap lookup table, where 2/6 of the existing 240 values are changed to refer to the new fragment numbers. The other 4/6 are unchanged. Expanding again to 8 fragments, we can change to another new hashmap, where 2/8 of the existing 240 values are changed, and the other 6/8 are unchanged. In each case, the minimum amount of data is affected to maintain balance.

Changing the hashmap is easy, the real work is in moving the data while it's being operated upon, but what the hashmap gives is a way to move only the minimum amount of data required when adding nodes. Only the data that has to move is moved, the rest stays where it is. The data distribution randomisation given by the MD5 function is unaffected, so system balance is maintained.

Choice of HashMap mod_val

As the default mod_val of 240 is significantly higher than common fragment counts, and because it factors well, most configurations will remain well balanced, despite being reorganised.

e.g. Assuming 2-node increments, a minimum with NoOfReplicas=2
240 / 2 = 120
240 / 4 = 60
240 / 6 = 40
240 / 8 = 30
240 / 10 = 24
240 / 12 = 20
240 / 14 = 17.1
240 / 16 = 15
240 / 18 = 13.3
240 / 20 = 12
240 / 22 = 10.9
240 / 24 = 10

240 factors cleanly into whole numbers (of lookup table entries) meaning that data should be well balanced across the table fragments when the data is repartitioned. Where there is not an integer result (e.g. 240/14), we would have most partitions with 17 lookup entries, and two with 18 lookup entries. The imbalance between them would be (18/17)-1 = 6%. If this were problematic, then a different mod_val could be used. A higher mod_val gives smaller partition imbalances, but requires
more memory to store. If necessary, the lookup table could be expanded in size by any integer factor (e.g. 2,3,4..) online to make it large enough to factor better for some desired data node count.


Moving rows online

The HashMap gives fine grained control of data placement, but how does the reorg happen online?

Table Reorganisation is similar Node recovery in some ways, in that the data is copied via fragment scans of the existing fragments, while at the same time, synchronous triggers are used to forward changes made to the existing fragment rows. The triggers and scans only copy data for rows which are to be moved,

e.g. where

new_hashmap_lookup[ md5( dist_key_cols ) % 240 ]
!=
old_hashmap_lookup[ md5( dist_key_cols ) % 240 ]


Other rows are left where they are.

With this mechanism, the new fragments are populated with rows from the existing fragments while read and write transactions continue. Once the fragment scans complete, the new fragments continue to be maintained by the synchronous triggers.

A future GCP boundary is chosed to be the 'cutover' point, and at this GCI, the new HashMap starts getting used for new transaction processing, and the new fragments start being used. Triggers are setup to propagate changes from the new fragments back to the pre-existing fragments, so that any older transactions using the old hashmap definition will see consistent data changes.

Once all transactions using rows from the pre-existing fragments have committed, the synchronous triggers are dropped, and the pre-existing fragments are scanned again, deleting the moved rows. Once this step completes, the reorganisation
is done.

Primary and Unique key operations in Ndb are short lived, and at Hashmap cutover, it doesn't take long until all old operations have committed. However, ordered index and table scans are slower and may not complete for some time. Both old and new row copies are maintained until all scans started using the old distribution have completed, so that ongoing transactions need not be aborted as part of the online reorg.

At the same time as the hashmaps are cutover at a GCI boundary, any NdbApi event subscribers listening to data change events on the table, for example attached MySQLDs recording Binlogs, start receiving events for the moved rows from the new fragments, and stop receiving them from the old fragments.

Transient storage use

When adding data nodes, the reorganisation uses no extra storage space on existing data nodes. On new data nodes, only the space used for the moved data is used. After the reorganisation completes, the space formerly used on pre-existing data nodes can be used for new data, so the system capacity is increased.

Transactional behaviour

Table reorganisation can take some time when there is a lot of data to move. A node or cluster failure during a reorganisation could leave the system in a transient state which would be difficult to recover from. One of the internal infrastructure changes in Cluster 7.0 was making all DDL operations transactional. This means
that they are atomic w.r.t. failures, including node and system failures. This applies to CREATE/DROP/ALTER of TABLE/INDEX/TABLESPACE etc.
This also applies to table reorganisation as it is a form of ALTER TABLE. If the reorganisation fails, or a node fails, or the cluster fails at some point during the reorganisation, then as part of system recovery, the reorganisation will be rolled back, or completed, if it had committed at the time of failure.

So that covers online table reorganisation. I've been meaning to write about it for some time, though somehow these entries always seem to be more like adverts than technical info.

No comments: