MariaDB Galera Cluster 10.3 -> 10.4 upgrade test results: an overview

MariaDB Galera Cluster 10.4.3 RC was recently announced.

I would like to summarize the issues reported during upgrade tests from version 10.3 to 10.4.
In general, there are three upgrade strategies:

* Provider upgrade – upgrade only the Galera provider library.
* Bulk upgrade – shut down the whole cluster and upgrade all nodes at once.
* Rolling upgrade – shut down and upgrade nodes one by one, in sequence.

Smooth and safe upgrade practices get a lot of attention, and that is a valid concern for all users.
Let’s see how things stand on the rolling-upgrade side.
Here is the list of FIXED issues:

https://jira.mariadb.org/browse/MDEV-18480 -> CLOSED
https://jira.mariadb.org/browse/MDEV-18471 -> CLOSED
https://jira.mariadb.org/browse/MDEV-18422 -> CLOSED
https://jira.mariadb.org/browse/MDEV-18557 -> CLOSED
https://jira.mariadb.org/browse/MDEV-18558 -> CLOSED
https://jira.mariadb.org/browse/MDEV-18587 -> CLOSED
https://jira.mariadb.org/browse/MDEV-18588 -> CLOSED
https://jira.mariadb.org/browse/MDEV-18631 -> CLOSED

There are also OPEN issues:

https://jira.mariadb.org/browse/MDEV-18552 -> OPEN
https://jira.mariadb.org/browse/MDEV-18571 -> OPEN
https://jira.mariadb.org/browse/MDEV-18742 -> OPEN
https://github.com/codership/wsrep-lib/issues/73 -> OPEN

The automation of the rolling-upgrade process is quite straightforward.
* Generate config files (.cnf) for all nodes, with the respective MariaDB path and version.
Say we have 5 nodes; then there will be 5 config files – mynode1_10.3.cnf, mynode2_10.3.cnf, etc.
For the upgrade we also need 10.4 config files – mynode1_10.4.cnf, mynode2_10.4.cnf, etc.
The main difference between these config files is the Galera provider path – Galera 3 for 10.3 and Galera 4 for 10.4.
* Start the nodes with the old 10.3 MariaDB binary and Galera 3, using the respective config files mentioned above.
* After a successful cluster start, shut down and upgrade the nodes, beginning from the last one – in our case node5.
So: shut down node5, start it with the 10.4 binary and the new config (mynode5_10.4.cnf instead of mynode5_10.3.cnf),
and run mysql_upgrade once it starts successfully with the new binary.
Repeat this in a loop until all cluster members are upgraded – node4 next, then node3, node2, and finally node1.
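
The steps above can be sketched in shell. This is only an illustration: the install paths (/usr/lib/galera-3, /opt/mariadb-10.4), datadir layout, and upgrade_plan.sh file name are assumptions, not the actual test harness; the upgrade commands are written to a plan file for review rather than executed.

```shell
#!/bin/sh
# 1. Generate per-node config files for both versions; the key difference
#    is the Galera provider path (Galera 3 vs Galera 4).
#    All paths below are illustrative assumptions.
for i in 1 2 3 4 5; do
  for ver in 10.3 10.4; do
    case "$ver" in
      10.3) provider=/usr/lib/galera-3/libgalera_smm.so ;;
      10.4) provider=/usr/lib/galera-4/libgalera_smm.so ;;
    esac
    cat > "mynode${i}_${ver}.cnf" <<EOF
[mysqld]
datadir=/data/node${i}
socket=/tmp/mysql-node${i}.sock
wsrep_on=ON
wsrep_provider=${provider}
wsrep_node_name=node${i}
EOF
  done
done

# 2. Emit the rolling-upgrade plan, last node first (node5 .. node1):
#    shut down the old server, start the 10.4 binary with the new config,
#    then run mysql_upgrade.
: > upgrade_plan.sh
for i in 5 4 3 2 1; do
  {
    echo "mysqladmin --socket=/tmp/mysql-node${i}.sock shutdown"
    echo "/opt/mariadb-10.4/bin/mysqld --defaults-file=mynode${i}_10.4.cnf &"
    echo "mysql_upgrade --socket=/tmp/mysql-node${i}.sock"
  } >> upgrade_plan.sh
done
```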

But there is another question – should the nodes carry any load during the upgrade? In other words, should we run DDL/DML after upgrading each node? For example:
* Upgrade node5, then run some load on node1 (which is still on the old version) – check whether data replicates from old to new.
* Upgrade node5, then run some load on node5 itself – check whether data replicates from new to old.
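
A minimal smoke check for both directions could look like this. The commands are only printed, not executed here; the socket paths follow this post’s naming, and the "smoke" database and table are made-up names for illustration.

```shell
#!/bin/sh
# Sketch: mixed-version replication smoke check.
# Socket paths and the "smoke" schema are illustrative assumptions.
OLD_SOCK=/tmp/mysql-node1.sock   # node1 is still on 10.3
NEW_SOCK=/tmp/mysql-node5.sock   # node5 is already on 10.4

# Old -> new: write on the old node, read on the upgraded one.
echo "mysql --socket=${OLD_SOCK} -e 'CREATE DATABASE smoke; CREATE TABLE smoke.t (id INT PRIMARY KEY); INSERT INTO smoke.t VALUES (1)'"
echo "mysql --socket=${NEW_SOCK} -e 'SELECT COUNT(*) FROM smoke.t'"

# New -> old: write on the upgraded node, read on the old one.
echo "mysql --socket=${NEW_SOCK} -e 'INSERT INTO smoke.t VALUES (2)'"
echo "mysql --socket=${OLD_SOCK} -e 'SELECT COUNT(*) FROM smoke.t'"
```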

Based on these scenarios, here is an easy way to lose node5: MDEV-18422.
Simply upgrade node5, run DDL on node1, get the crash – and then be happy, because it is already fixed.

[Warning] WSREP: no corresponding NBO begin found for NBO end source: 2222148c-2450-11e9-90e3-be38054a3929 version: 4 local: 0 flags: 5 conn_id: 12 trx_id: -1 tstamp: 1432598153916; state: REPLICATING:0->CERTIFYING:3034 seqnos (l: 3, g: 1, s: 0, d: 0) WS pa_range: 0; state history: REPLICATING:0->CERTIFYING:3034
190130  9:37:28 [ERROR] mysqld got signal 6 ;

Another interesting crash, which prevented completing a full rolling upgrade, is MDEV-18480.
After upgrading node5, it is node4’s turn; once node4 starts, you lose node5:

2019-02-05 16:13:56 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
mysqld: /home/shako/Galera_Tests/MariaDB/storage/innobase/trx/trx0rseg.cc:92: void trx_rseg_update_wsrep_checkpoint(trx_rsegf_t*, const XID*, mtr_t*): Assertion `xid_seqno > wsrep_seqno' failed.
190205 16:13:56 [ERROR] mysqld got signal 6 ;

Fixed.

Another quite critical issue was data inconsistency.
All nodes had already been upgraded, and Streaming Replication was enabled on each of them.
Then it was time to create a database and drop it on node5. Unfortunately, it was dropped from all other nodes but not from node5 itself – MDEV-18587.
Fixed.
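
The scenario behind MDEV-18587 can be sketched as a couple of client calls. Again, the commands are printed rather than executed, and the socket paths and database name d1 are illustrative assumptions.

```shell
#!/bin/sh
# Sketch of the MDEV-18587 scenario: create and drop a database on node5,
# then check every node. On the affected build, the database disappeared
# everywhere except on node5 itself. Paths and names are illustrative.
echo "mysql --socket=/tmp/mysql-node5.sock -e 'CREATE DATABASE d1; DROP DATABASE d1'"

CHECKS=$(for i in 1 2 3 4 5; do
  printf "mysql --socket=/tmp/mysql-node%s.sock -e \"SHOW DATABASES LIKE 'd1'\"\n" "$i"
done)
echo "$CHECKS"   # after the fix, no node should still report d1
```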

In general we love segfaults, but not in a cluster 🙂 Interestingly, if you enable log-bin and log-slave-updates on all nodes of the cluster and try to do a rolling upgrade, the last upgraded node – node1 – will crash – MDEV-18588.
Fixed.

The hall of fame has a permanent member called sysbench. Here is the crash hit while running sysbench prepare – MDEV-18631:

$ sysbench oltp_read_write --mysql-user=root --mysql-socket=/tmp/mysql-node1.sock --tables=1 --table-size=10 prepare
sysbench 1.0.16 (using bundled LuaJIT 2.1.0-beta2)
Creating table 'sbtest1'...
Inserting 10 records into 'sbtest1'
FATAL: mysql_drv_query() returned error 2013 (Lost connection to MySQL server during query) for query 'INSERT INTO sbtest1(k, c, pad) VALUES

The output from the crashed node:

mysqld: /home/shako/Galera_Tests/MariaDB/wsrep-lib/src/transaction.cpp:658: int wsrep::transaction::after_statement(): Assertion `client_state_.mode() == wsrep::client_state::m_local' failed.
190218 20:01:09 [ERROR] mysqld got signal 6 ;

Sysbench also stars in the following OPEN issues:
MDEV-18742
MDEV-18571
MDEV-18552

Several other issues were found along the way that are not directly related to the upgrade tests. In general, though, upgrade tests are a great playground for getting your hands dirty with all sorts of issues.
This post will likely be updated as new issues or fixes come in.

Author: Shahriyar Rzayev

Azerbaijan MySQL User Group and Python user group leader. QA Engineer, bug hunter by nature and true Pythonista
