Thursday 10 September 2009

MySQL Cluster development

MySQL Cluster is the name given to one or more MySQL Server processes, connected to an Ndb Cluster database. From the point of view of the MySQL Server processes, the Ndb Cluster is a Storage Engine, implementing transactional storage of tables containing rows. From the point of view of the Ndb Cluster database, the MySQL Server processes are API nodes, performing DDL and DML transactions on tables stored in the cluster. Both exist independently – Ndb Cluster can be used without attached MySQL Server processes, but almost all users of Ndb Cluster connect at least one MySQL Server for DDL and administration.
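
As a minimal sketch of this split (the table and data below are hypothetical), a table created through any attached MySQL Server using the NDBCLUSTER engine is stored in the data nodes rather than in that MySQL Server, and so is visible to every other connected API node:

    -- Issued on any MySQL Server attached to the cluster
    CREATE TABLE subscriber (
      msisdn VARCHAR(15) NOT NULL PRIMARY KEY,
      name   VARCHAR(64)
    ) ENGINE=NDBCLUSTER;

    INSERT INTO subscriber VALUES ('447700900123', 'Alice');

    -- From a second MySQL Server connected to the same cluster the row
    -- is immediately visible, since the data lives in the data nodes
    -- rather than in either MySQL Server
    SELECT name FROM subscriber WHERE msisdn = '447700900123';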

Ndb stands for Network DataBase. This is a telecoms phrase where Network usually refers to a fixed or wireless telephone network, rather than the database topology definition of the term. Ndb was originally designed as a platform for implementing databases required to operate telecoms networks - HLR, VLR, Number Portability, Fraud Detection etc. At the time Ndb was first designed, Network Databases were generally implemented in-house on exotic 'switch' hardware by telecoms equipment vendors, often with hard-coded schemas and very inflexible query capabilities. These databases were expensive to develop and maintain, but had superb reliability and exceptional performance on minimal spec. hardware. The aim of the original Ndb design was to couple these desirable properties with more general purpose database functionality and deliver the result on a more standard hardware and OS stack.

I first discovered Ndb Cluster around 2001, when looking at potential designs for the next generation of an existing HLR database. I read the paper by Mikael Ronström in Ericsson Review (No. 4, 1997), which gives a good overview of the Ndb functionality. This paper describes functionality in the present tense when in fact some of the features described are yet to be implemented in 2009! This sort of optimism and vision has helped Ndb to survive and thrive over the years. The Ericsson Review paper was written while Ndb was one of multiple telecoms-database projects at Ericsson. Since then, the Ndb product and team have been spun out as a separate company and then sold to MySQL AB in 2003 in the aftermath of the dot-com bust.

Ndb was originally designed for:
  • High throughput – sustaining tens to hundreds of thousands of transactions per second
  • Low latency – bounded transaction latencies which can be reliably factored into end-to-end latency budgets, implying main-memory storage
  • High update-to-read ratio – 50/50 as the norm
  • Transactional properties: Atomicity, Consistency, Isolation, Durability
  • Fault tolerance + HA – No single point of failure, automatic failover and recovery with minimal user or application involvement. Online upgrade. N-way synchronous and asynchronous replication. Fail-fast fault isolation.
  • Persistence – disk checkpointing and logging with automated recovery
  • Scalability – Parallel query execution. A distributed system can utilise the resources of more than one machine. Capacity can be expanded horizontally by adding further systems.

In the original Ndb design, high-volume, low-latency transactions are submitted directly to the cluster using simple access primitives on the client. More complex queries are submitted to a separate query processor which itself uses combinations of the simpler primitives to access the cluster. An early example of a higher-level query processor was created by Martin Sköld, who extended an Object Oriented query processor to create 'QDB', which could perform queries against data stored in Ndb. Numerous high level front-end processors have been implemented since.

Using MySQLD as a higher-level query-processing front end, we arrive at the architecture of MySQL Cluster, with MySQLD providing SQL-based access to data stored in the cluster. In this sense MySQLD and Ndb Cluster are a perfect fit and were designed for each other before they first met! Although MySQLD is the default and most prominent front end to Ndb Cluster, a number of others exist, including several open and closed-source LDAP servers (OpenLDAP, OpenDS), several Java APIs and an Apache module giving HTTP access to data stored in Ndb.

The separation of low level, simple, fast access and higher level, more flexible access allows MySQL Cluster to offer many benefits of a full RDBMS without always incurring the drawback of over-generality. This fits well with many large transaction processing systems, where most heavy transaction processing does not require the full flexibility of the RDBMS, but some less frequent analysis does. Separating the central database engine (which in Ndb is referred to as the kernel) from the query processing layer can also help with workload management – even the most complex queries are subdivided into manageable components and resources can be shared fairly.

The original Ndb design was not aimed at:
  • Disk resident storage
    Where data larger than the aggregate system memory capacity can be stored on disk. This functionality was later added in the MySQL 5.1 timeframe (see the sketch after this list).
  • Complex query processing
    Where multiple tables are joined. This was always possible, but not always efficient. Improving the efficiency of MySQL and Ndb on complex query processing is ongoing work - as it is in all actively developed RDBMSs, for some definition of complex :).
  • Storing large rows
    Ndb currently has a per-row size limit of around 8kB, ignoring Blob and Text column types.
  • One size fits all
    Being a drop-in replacement for an existing MySQL engine such as MyISAM or InnoDB
    Many initial users were not aware of the history of Ndb, and expected it to be (MySQL + InnoDB/MyISAM) + 'Clustering'. Issuing 'ALTER TABLE xxx ENGINE=ndbcluster;' appeared to be all that was required to gain fault tolerance, but the performance of queries on the resulting tables was not always as expected!
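
As a sketch of the disk resident storage mentioned in the first point above (file names, sizes and the table itself are illustrative only), MySQL Cluster 5.1 allows the non-indexed columns of a table to be placed in a tablespace on disk:

    -- Undo log group and tablespace backing the disk-resident columns
    CREATE LOGFILE GROUP lg1
      ADD UNDOFILE 'undo1.log'
      INITIAL_SIZE 64M
      ENGINE NDBCLUSTER;

    CREATE TABLESPACE ts1
      ADD DATAFILE 'data1.dat'
      USE LOGFILE GROUP lg1
      INITIAL_SIZE 256M
      ENGINE NDBCLUSTER;

    -- Non-indexed columns of this table are stored on disk;
    -- indexed columns (here, the primary key) remain in memory
    CREATE TABLE call_record (
      id      BIGINT NOT NULL PRIMARY KEY,
      details VARCHAR(2000)
    ) TABLESPACE ts1 STORAGE DISK ENGINE=NDBCLUSTER;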

Since the initial integration of Ndb Cluster with MySQLD, starting in 2003, there have been many improvements to bring Ndb closer in behaviour to the most popular MySQL engines, and to optimise MySQLD for Ndb's strengths, including:
  • Support for AUTO_INCREMENT columns and tables without primary keys
  • Synchronisation of schemas across connected MySQLD instances
  • Support for MySQL character sets and collations
  • Storage and retrieval of Blob and Text columns
  • Support for pushed-down filter conditions (see the example after this list)
  • Support for batching of operations
  • Integration with MySQL asynchronous replication
  • 'Distribution awareness' in MySQLD for efficiency
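
To make the pushed-down filter conditions above concrete (reusing the hypothetical subscriber table from earlier), EXPLAIN shows when a WHERE predicate on a non-indexed column is shipped to the data nodes rather than evaluated row-by-row in MySQLD:

    -- 5.0/5.1-era switch; later servers control this via optimizer_switch
    SET engine_condition_pushdown = 1;

    EXPLAIN SELECT * FROM subscriber WHERE name = 'Alice'\G
    -- Extra: Using where with pushed condition
    -- The filter on name runs in the data nodes, so only matching
    -- rows are returned to MySQLD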

These improvements have required work in the Ndb table handler – the code which maps MySQL storage engine API calls from the generic SQL layer to the underlying storage engine. Some improvements have also required enhancements to the storage engine API and the Server, for example a new API to expose conditions (WHERE or HAVING clause predicates) to the storage engine, enabling it to perform more efficient filtering. These changes add complexity to MySQLD and the storage engine API, but as they are implemented generically, they can be reused by other engines. The pushed conditions API is now being used by the Spider engine for similar reasons to Ndb – to push filtering functionality as close to the data as possible. The Batched Key Access (BKA) improvements made to the MySQLD join executor benefit Ndb, but also benefit MyISAM and InnoDB to a lesser extent. This functionality push-down pattern – increasing the granularity and complexity of the work items which can be passed to the storage engine – will continue and benefit all storage engines.

The next large step to be taken by the MySQL Server team in this direction is referred to as Query Fragment Pushdown, where MySQLD can pass parts of queries to a storage engine for execution. Storage engines which support SQL natively could perhaps use their own implementation-aware optimisation and execution engines to efficiently evaluate query fragments. For Ndb, we are designing composite primitives at the NdbApi level for evaluating query fragments more efficiently - in parallel and closer to the data. This work will increase the number of query types that Ndb can handle efficiently, increasing the number of applications where Ndb is a good fit.
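
As an illustration of the kind of query fragment being targeted (the subscriber and account tables are hypothetical, continuing the earlier sketch), consider a lookup join between two cluster tables; today MySQLD drives such a join row-by-row (or, with BKA, batch-by-batch) against the data nodes, whereas a pushed fragment could be evaluated in parallel close to the data:

    -- A lookup join between two NDBCLUSTER tables. With query fragment
    -- pushdown the whole join could be shipped to the cluster as a
    -- single unit of work.
    SELECT s.name, a.balance
    FROM subscriber s
    JOIN account a ON a.msisdn = s.msisdn
    WHERE s.msisdn = '447700900123';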

For an in-depth description of the original Ndb requirements, design approach and some specific design solutions, Mikael's PhD thesis is the place to go. It is probably the best source of information on the design philosophy of Ndb Cluster. However, as it is a frozen document it does not reflect the current state of the system, and as it is an academic work, it does not describe the lower-level, more software-engineering-oriented aspects of the system implementation.

I hope to cover some of these aspects in a future post.
