Saturday 2 April 2011

Journey upriver to the dark heart of ha_ndbcluster

Unlike most other MySQL storage engines, Ndb does not perform all of its work in the MySQLD process. The Ndb table handler maps Storage Engine Api calls onto NdbApi calls, which eventually result in communication with data nodes. In terms of layers, we have SQL -> Handler Api -> NdbApi -> Communication. At each of these layer boundaries, the mapping between operations at the upper layer to operations at the lower layer is non trivial, based on runtime state, statistics, optimisations etc.

The MySQL status variables can be used to understand the behaviour of the MySQL Server in terms of user commands processed, and also how these map to some of the Storage Engine Handler Api calls.

Status variables tracking user commands start with 'Com_'

mysql> show status like 'Com\_%';
+---------------------------+-------+
| Variable_name | Value |
+---------------------------+-------+
| Com_admin_commands | 0 |
| Com_assign_to_keycache | 0 |
| Com_alter_db | 0 |
| Com_alter_db_upgrade | 0 |
| Com_alter_event | 0 |
| Com_alter_function | 0 |
| Com_alter_procedure | 0 |
| Com_alter_server | 0 |
| Com_alter_table | 0 |
| Com_alter_tablespace | 0 |
| Com_analyze | 0 |
| Com_backup_table | 0 |
| Com_begin | 0 |
| Com_binlog | 0 |
| Com_call_procedure | 0 |
| Com_change_db | 1 |
| Com_change_master | 0 |
| Com_check | 0 |
| Com_checksum | 0 |
.........
| Com_stmt_reset | 0 |
| Com_stmt_send_long_data | 0 |
| Com_truncate | 0 |
| Com_uninstall_plugin | 0 |
| Com_unlock_tables | 0 |
| Com_update | 1 |
| Com_update_multi | 0 |
| Com_xa_commit | 0 |
| Com_xa_end | 0 |
| Com_xa_prepare | 0 |
| Com_xa_recover | 0 |
| Com_xa_rollback | 0 |
| Com_xa_start | 0 |
+---------------------------+-------+
144 rows in set (0.01 sec)

mysql>

Status variables tracking Handler (Storage engine) Api calls start with 'Handler_'.

mysql> show status like 'Handler\_%';
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| Handler_commit | 1 |
| Handler_delete | 0 |
| Handler_discover | 0 |
| Handler_prepare | 0 |
| Handler_read_first | 0 |
| Handler_read_key | 0 |
| Handler_read_next | 0 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 0 |
| Handler_read_rnd_next | 21 |
| Handler_rollback | 0 |
| Handler_savepoint | 0 |
| Handler_savepoint_rollback | 0 |
| Handler_update | 4 |
| Handler_write | 14 |
+----------------------------+-------+
15 rows in set (0.00 sec)

mysql>

The 'Com_%' and 'Handler_%' variables are maintained by the Server for all storage engines. The server maintains these on a per-session, and global basis. By default the session status is shown, but the GLOBAL keyword shows the global view, aggregated across all sessions.

mysql> show global status like 'Handler\_%';
+----------------------------+--------+
| Variable_name | Value |
+----------------------------+--------+
| Handler_commit | 167 |
| Handler_delete | 494041 |
| Handler_discover | 0 |
| Handler_prepare | 0 |
| Handler_read_first | 3 |
| Handler_read_key | 1 |
| Handler_read_next | 0 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 0 |
| Handler_read_rnd_next | 561132 |
| Handler_rollback | 6 |
| Handler_savepoint | 0 |
| Handler_savepoint_rollback | 0 |
| Handler_update | 24 |
| Handler_write | 43442 |
+----------------------------+--------+
15 rows in set (0.00 sec)

mysql>

The SHOW STATUS command is handy for a quick check, but for more interesting analysis, the INFORMATION_SCHEMA tables SESSION_STATUS and GLOBAL_STATUS contain the same data, and support all SQL queries and views.

mysql> select * from information_schema.session_status where Variable_name like 'Handler\_%';
+----------------------------+----------------+
| VARIABLE_NAME | VARIABLE_VALUE |
+----------------------------+----------------+
| HANDLER_COMMIT | 1 |
| HANDLER_DELETE | 0 |
| HANDLER_DISCOVER | 0 |
| HANDLER_PREPARE | 0 |
| HANDLER_READ_FIRST | 0 |
| HANDLER_READ_KEY | 0 |
| HANDLER_READ_NEXT | 0 |
| HANDLER_READ_PREV | 0 |
| HANDLER_READ_RND | 0 |
| HANDLER_READ_RND_NEXT | 85 |
| HANDLER_ROLLBACK | 0 |
| HANDLER_SAVEPOINT | 0 |
| HANDLER_SAVEPOINT_ROLLBACK | 0 |
| HANDLER_UPDATE | 4 |
| HANDLER_WRITE | 89 |
+----------------------------+----------------+
15 rows in set (0.40 sec)

mysql>

Unfortunately the entries in the information_schema are in shouty 1960s CAPITALS. Also note that some Handler calls are made as part of using the information_schema 'database' to fetch the data. This is Heisenberg's principle in action!

To shine some light into the depths of the Ndb storage engine, a set of new status variables has been added to recent cluster-7.0 and cluster-7.1 releases. These status variables track activity at the NdbApi and data node communication layers of the stack. Currently they are divided into 4 subsets :
  • Global counters
    show status like 'ndb_api_%_count';
  • Session counters
    show status like 'ndb_api_%_session';
  • Slave counters
    show status like 'ndb_api_%_slave';
  • Binlog injector counters
    show status like 'ndb_api_%_injector';
Unfortunately the mysql-5.1 server does not allow Storage Engines to differentiate GLOBAL or SESSION status variables, so the Global and Session specific versions of these variables are differentiated by name, and are visible from both GLOBAL and SESSION views of status, so it doesn't matter which you look at.

Global counters
mysql> show status like 'ndb_api_%_count';
+------------------------------------+------------+
| Variable_name | Value |
+------------------------------------+------------+
| Ndb_api_wait_exec_complete_count | 10 |
| Ndb_api_wait_scan_result_count | 19295 |
| Ndb_api_wait_meta_request_count | 67 |
| Ndb_api_wait_nanos_count | 6361966340 |
| Ndb_api_bytes_sent_count | 415704 |
| Ndb_api_bytes_received_count | 116921552 |
| Ndb_api_trans_start_count | 14 |
| Ndb_api_trans_commit_count | 3 |
| Ndb_api_trans_abort_count | 1 |
| Ndb_api_trans_close_count | 14 |
| Ndb_api_pk_op_count | 6 |
| Ndb_api_uk_op_count | 0 |
| Ndb_api_table_scan_count | 11 |
| Ndb_api_range_scan_count | 0 |
| Ndb_api_pruned_scan_count | 0 |
| Ndb_api_scan_batch_count | 25850 |
| Ndb_api_read_row_count | 103371 |
| Ndb_api_trans_local_read_row_count | 51625 |
| Ndb_api_event_data_count | 0 |
| Ndb_api_event_nondata_count | 0 |
| Ndb_api_event_bytes_count | 0 |
+------------------------------------+------------+
21 rows in set (0.00 sec)

mysql>

These counts are aggregated across all MySQL clients in the Server accessing tables in Ndb, since this MySQL Server was started. Note that this *does not* include accesses by clients of other MySQL Servers, or other NdbApi clients.

Session counters
mysql> show status like 'ndb_api_%_session';
+--------------------------------------------+----------+
| Variable_name | Value |
+--------------------------------------------+----------+
| Ndb_api_wait_exec_complete_count_session | 0 |
| Ndb_api_wait_scan_result_count_session | 38 |
| Ndb_api_wait_meta_request_count_session | 2 |
| Ndb_api_wait_nanos_count_session | 11064398 |
| Ndb_api_bytes_sent_count_session | 872 |
| Ndb_api_bytes_received_count_session | 230764 |
| Ndb_api_trans_start_count_session | 1 |
| Ndb_api_trans_commit_count_session | 0 |
| Ndb_api_trans_abort_count_session | 0 |
| Ndb_api_trans_close_count_session | 1 |
| Ndb_api_pk_op_count_session | 0 |
| Ndb_api_uk_op_count_session | 0 |
| Ndb_api_table_scan_count_session | 1 |
| Ndb_api_range_scan_count_session | 0 |
| Ndb_api_pruned_scan_count_session | 0 |
| Ndb_api_scan_batch_count_session | 51 |
| Ndb_api_read_row_count_session | 204 |
| Ndb_api_trans_local_read_row_count_session | 136 |
+--------------------------------------------+----------+
18 rows in set (0.00 sec)

mysql>

These counts are for the current session (MySQL client connection)

Slave counters
mysql> show status like 'ndb_api_%_slave';
+------------------------------------------+-------+
| Variable_name | Value |
+------------------------------------------+-------+
| Ndb_api_wait_exec_complete_count_slave | 0 |
| Ndb_api_wait_scan_result_count_slave | 0 |
| Ndb_api_wait_meta_request_count_slave | 0 |
| Ndb_api_wait_nanos_count_slave | 0 |
| Ndb_api_bytes_sent_count_slave | 0 |
| Ndb_api_bytes_received_count_slave | 0 |
| Ndb_api_trans_start_count_slave | 0 |
| Ndb_api_trans_commit_count_slave | 0 |
| Ndb_api_trans_abort_count_slave | 0 |
| Ndb_api_trans_close_count_slave | 0 |
| Ndb_api_pk_op_count_slave | 0 |
| Ndb_api_uk_op_count_slave | 0 |
| Ndb_api_table_scan_count_slave | 0 |
| Ndb_api_range_scan_count_slave | 0 |
| Ndb_api_pruned_scan_count_slave | 0 |
| Ndb_api_scan_batch_count_slave | 0 |
| Ndb_api_read_row_count_slave | 0 |
| Ndb_api_trans_local_read_row_count_slave | 0 |
+------------------------------------------+-------+
18 rows in set (0.00 sec)

mysql>

Hopefully you are seeing a pattern here. These counters are the NdbApi operations performed by the Slave SQL thread as part of replicating Binlogs into Ndb tables. These counts will only increase from zero if the MySQLD is acting, or has acted as a Slave, and has accessed tables stored in Ndb.

Binlog Injector counters
mysql> show status like 'ndb_api_%_injector';
+--------------------------------------+-------+
| Variable_name | Value |
+--------------------------------------+-------+
| Ndb_api_event_data_count_injector | 0 |
| Ndb_api_event_nondata_count_injector | 0 |
| Ndb_api_event_bytes_count_injector | 0 |
+--------------------------------------+-------+
3 rows in set (0.01 sec)

mysql>

These counts track the data change events received by the Ndb Binlog Injector thread. The Binlog Injector is responsible for recording Cluster changes in the Binlog, but even when Binlogs are not being written, it receives events related to schema changes and other system management functions, so these counts can be non zero on Servers which are not writing a Binlog.

Counter definitions
As you've hopefully noticed, there is naming overlap between each set of status counters. The same events are being counted, and recorded globally, per-session, against the Slave SQL thread and against the Ndb Binlog injector thread.

So what do these different counts actually mean?

Ndb_api_wait_exec_complete_count[_session|_slave]

The number of times a user thread has blocked waiting for some batch of primary key, or secondary unique hash key operations to complete. From the point of view of the MySQL Server, this is idle time, waiting for data nodes to send some response. An alternative name for it could be 'round trip count', and minimising it through operation batching is a good way to reduce response time and increase throughput.

Ndb_api_wait_scan_result_count[_session|_slave]

The number of times a user thread has blocked waiting for some scan operation to complete. It could be waiting for a batch of scan results, or waiting for an acknowledgement of a scan close. In any case, it indicates time spent waiting on communication related to scan processing.

Ndb_api_wait_meta_request_count[_session|_slave]

The number of times a user thread has blocked waiting for some metadata operation to complete. This is quite a catch-all term, which can include DDL (Create/Drop table etc), and some transaction initialisation steps.

Ndb_api_wait_nanos_count[_session|_slave]

The number of nanoseconds a user thread has blocked in one of the three scenarios above. This is kind of an 'IO_WAIT' time for Ndb operations. It tracks how long the thread was blocked waiting for the data nodes to complete their operations. The resolution is nanoseconds, but this requires support from the operating system. On operating systems with lower resolution, this count will be coarser, and some operations may complete with zero observed wait time.

Ndb_api_bytes_sent_count[_session|_slave]

The number of bytes sent to the Ndb data nodes. This includes all request types, rows inserted etc. It does not include regular heartbeating as that generally adds too much noise to make the counters useful.

Ndb_api_bytes_received_count[_session|_slave]

The number of bytes received from the Ndb data nodes. This includes all request types, rows read etc. It does not include regular heartbeating as that generally adds too much noise to make the counters useful.

Ndb_api_trans_start_count[_session|_slave]

The number of NdbApi transactions started. Note that NdbApi transactions are not always immediately on a BEGIN statement as an optimisation.

Ndb_api_trans_commit_count[_session|_slave]

The NdbApi transactions which have been explicitly committed. Not all transactions started are committed or aborted, some are started, and then closed.

Ndb_api_trans_abort_count[_session|_slave]

The NdbApi transactions which have been explicitly aborted.

Ndb_api_trans_close_count[_session|_slave]

The NdbApi transactions which have been closed. It should closely track the number which have been started.

Ndb_api_pk_op_count[_session|_slave]

The number of Primary Key (pk) operations which have been executed. Eack pk operation affects zero or one rows. This includes read, insert, update, write, delete. Note that operations on tables with Blobs can also generate pk and uk operations.

Ndb_api_uk_op_count[_session|_slave]

The number of Unique key (uk) operations which have been executed. Each uk operation affects zero or one rows. This includes read, update, write, delete by unique key.

Ndb_api_table_scan_count[_session|_slave]

The number of table scans which have been started. Table scans can have pushed-down filters, so although they must access all data in a table, they might not return it all to the MySQL Server. Also, started scans may be stopped before all table data is accessed.

Ndb_api_range_scan_count[_session|_slave]

The number of range (Ordered Index) scans which have been started. Range scans take bounds and pushed-down filters, so may access and/or return zero to all rows to the MySQL Server.

Ndb_api_pruned_scan_count[_session|_slave]

The number of scans which have been successfully pruned to one partition of the scanned table/index. See my previous Blog entries for details about scan partition pruning.

Ndb_api_scan_batch_count[_session|_slave]

The number of batches of rows returned to the MySQL Server from scans. Each scanned table/index fragment returns matching rows in batches, whose size is controlled by the batchsize parameters. NdbApi handles one batch from each fragment at a time. Fetching the next batch from a fragment requires a round-trip to the data nodes, although multiple fragments can be asked for their next batch in one trip. Minimising the Ndb_api_scan_batch_count, by increasing batchsize and improving scan selectivity can improve throughput, latency and efficiency.

Ndb_api_read_row_count[_session|_slave]

The number of rows returned to the MySQL Server from Primary Key read, Unique Key reads and Table and Index scans.

Ndb_api_trans_local_read_row_count[_session|_slave]

The number of rows returned to the MySQL Server from the same data node where the reading transaction has its Transaction Coordinator (TC). Where the TC and the data read reside on the same data node, one hop in the data reading control protocols is avoided. This is the goal of transaction hinting and distribution awareness, and its effectiveness can be checked by comparing Ndb_api_trans_local_read_row_count to Ndb_api_trans_read_row_count. The higher the proportion of local reads, the better.

Ndb_api_event_data_count[_injector]

The number of data change events (row insert, delete, update notifications) received from the data nodes. On a Binlogging MySQL Server, this count can give a measure of the rate of data change in a cluster in terms of rows/second.

Ndb_api_event_nondata_count[_injector]

The number of non-data events (table alter/drop notifications etc) received from the data nodes.

Ndb_api_event_bytes_count[_injector]

The total number of bytes of event data (data and nondata events) received from the data nodes. This gives another measure of the Cluster change rate in terms of bytes/second.

At NdbApi level
As the names suggest, all of these counters are reflecting things happening in the NdbApi implementation. The data collection is built into NdbApi, and is therefore also available to any NdbApi client, not just the MySQL Server.

Using the counters
During implementation of these counters, I found it easiest to create a temporary table to store 'base' values for the counters, and define a view containing the difference between the current values and the base values. This made it easy to see the effect on the counters of various different SQL statements, slave operations etc.

Setting up a baseline and a view
Please excuse my schoolboy SQL :

mysql> create table test.counts_base(variable_name varchar(255) primary key, variable_value bigint);
Query OK, 0 rows affected (0.04 sec)

mysql> replace into test.counts_base (variable_name, variable_value) select * from information_schema.session_status where variable_name like 'ndb_api%';
Query OK, 60 rows affected (0.01 sec)
Records: 60 Duplicates: 0 Warnings: 0

mysql> create view test.counts_diff as select test.counts_base.variable_name, information_schema.session_status.variable_value - test.counts_base.variable_value as diff from test.counts_base, information_schema.session_status where test.counts_base.variable_name = information_schema.session_status.variable_name and (information_schema.session_status.variable_value - test.counts_base.variable_value) > 0;
Query OK, 0 rows affected (0.05 sec)

mysql> select * from test.counts_diff where variable_name like 'ndb_api%_session';
Empty set (0.05 sec)

mysql>


Looking at the effects of a SQL statement
First check that the baseline is ok :

mysql> select * from test.counts_diff where variable_name like 'ndb_api%_session';
Empty set (0.05 sec)

Now run the statement :

mysql> select count(1) from demo_table;
+----------+
| count(1) |
+----------+
| 103343 |
+----------+
1 row in set (0.00 sec)

Now look at the difference :

mysql> select * from test.counts_diff where variable_name like 'ndb_api%_session';
+--------------------------------------------+--------+
| variable_name | diff |
+--------------------------------------------+--------+
| NDB_API_WAIT_SCAN_RESULT_COUNT_SESSION | 4 |
| NDB_API_WAIT_META_REQUEST_COUNT_SESSION | 1 |
| NDB_API_WAIT_NANOS_COUNT_SESSION | 619143 |
| NDB_API_BYTES_SENT_COUNT_SESSION | 116 |
| NDB_API_BYTES_RECEIVED_COUNT_SESSION | 268 |
| NDB_API_TRANS_START_COUNT_SESSION | 1 |
| NDB_API_TRANS_CLOSE_COUNT_SESSION | 1 |
| NDB_API_TABLE_SCAN_COUNT_SESSION | 1 |
| NDB_API_SCAN_BATCH_COUNT_SESSION | 2 |
| NDB_API_READ_ROW_COUNT_SESSION | 2 |
| NDB_API_TRANS_LOCAL_READ_ROW_COUNT_SESSION | 1 |
+--------------------------------------------+--------+
11 rows in set (0.05 sec)

mysql>

The select count(1) statement blocked 4 times on scan results, sent 116 bytes and received 268 bytes of data. Two batches of rows were received, and two rows were received in total. One of these rows was from the same node as the transaction's transaction coordinator.
This indicates that select count(1) is optimised in the Ndb handler !

Let's try a more tricky select count. First we must reset the baseline to get a 'clean' difference.

mysql> replace into test.counts_base (variable_name, variable_value) select * from information_schema.session_status where variable_name like 'ndb_api%';
Query OK, 120 rows affected (0.01 sec)
Records: 60 Duplicates: 60 Warnings: 0

mysql> select * from test.counts_diff where variable_name like 'ndb_api%_session';
Empty set (0.05 sec)

mysql> select count(length(string_value) > 10) from demo_table;
+----------------------------------+
| count(length(string_value) > 10) |
+----------------------------------+
| 103343 |
+----------------------------------+
1 row in set (6.77 sec)

mysql> select * from test.counts_diff where variable_name like 'ndb_api%_session';
+--------------------------------------------+------------+
| variable_name | diff |
+--------------------------------------------+------------+
| NDB_API_WAIT_SCAN_RESULT_COUNT_SESSION | 21139 |
| NDB_API_WAIT_NANOS_COUNT_SESSION | 5322402052 |
| NDB_API_BYTES_SENT_COUNT_SESSION | 441628 |
| NDB_API_BYTES_RECEIVED_COUNT_SESSION | 109026900 |
| NDB_API_TRANS_START_COUNT_SESSION | 1 |
| NDB_API_TRANS_CLOSE_COUNT_SESSION | 1 |
| NDB_API_TABLE_SCAN_COUNT_SESSION | 1 |
| NDB_API_SCAN_BATCH_COUNT_SESSION | 25836 |
| NDB_API_READ_ROW_COUNT_SESSION | 103343 |
| NDB_API_TRANS_LOCAL_READ_ROW_COUNT_SESSION | 51735 |
+--------------------------------------------+------------+
10 rows in set (0.05 sec)

mysql>

This select count waited for scan results ~21 thousand times, sent ~440kB of data, and received ~109MB of data from the data nodes. 103,343 rows were read in ~25 thousand scan batches, and roughly half came from the same node as the tranaction coordinator. 5.3 seconds of the 6.77 second runtime were spent waiting for responses from the data nodes.

Other uses
Hopefully this gives some notion of the possibilities with these new counters. Some other ideas :
  • Debug slow queries
  • Optimise data distribution and table partitioning
  • Get real post-execution costs for queries, DML etc.
  • Understand how data transfer for Blobs is batched
  • Check bulk insert/deletes are functioning efficiently
  • Verify Ndb slave batching is in operation
  • Draw cool graphs