mysqldiskusage – source code examination

As you know there is a great toolset named “MySQL Utilities”, which you can use for solving various administrative tasks. mysqldiskusage utility is for calculating MySQL Server’s disk usage and generating informative reports. Of course this project is open source and everybody could review the source code. A few words about how mysqldiskusage calculates database disk usage will be crucial for understanding algorithm. The source tree is: mysql-utilities-1.5.4/scripts/ If you open this Python file you will see (line 169-175) :

# We do database disk usage by default. 
    diskusage.show_database_usage(servers[0], datadir, args, options) 
except UtilError: _, e, _ = sys.exc_info() 
    print("ERROR: %s" % e.errmsg) 

By default it shows database disk usage and calling another function named show_database_usage from mysql-utilities-1.5.4/mysql/utilities/command/ file. Now if we open up this file and search for show_database_usage function, you should see that in turn this function uses another function named _build_db_list. From _build_db_list it gets back all necessary information as in code stated clearly(line 550-562):

# Get list of databases with sizes and formatted when necessary columns, rows     
db_total = _build_db_list(server, res, dblist, datadir, fmt == "grid", have_read, verbosity, include_empty or do_all, is_remote) 
if not quiet: 
    print "# Database totals:" print_list(sys.stdout, fmt, columns, rows, no_headers) 
if not quiet: _print_size("nTotal database disk usage = ", db_total) 

Now we know that all calculations are happened in _build_db_list function. If you search and find this function(begins from line 360) you can see that, in fact mysqldiskusage is calculating database disk usage as follows:

It finds (data_length + index_length) from information_schema.tables per database manner

Then it sum ups (data_length + index_length) with misc_files variable data which is in fact returned by _get_db_dir_size function. But what is this misc_files? Logically misc_files must be .opt and .frm files. So misc_files must not be an “.MYD” , “.MYI”, “.IBD”, “general_log”, “slow_log”.

So in fact mysqldiskusage calculates database disk usage as -> (data_length + index_length)[size in bytes] + (.opt+.frm)[size in bytes]. First of all we must insist on not using information_schema for accurate disk usage calculation because of simple rule: “InnoDB preallocates pages(16Kib) for further table usage, but data_length column will not show these pages”

As proof of concept let’s create sample empty table:

mysql> create database test; 
Query OK, 
1 row affected (0,01 sec) 
mysql> use test; 
Database changed 
select data_length, index_length from information_schema.tables where table_schema='test' and table_name='t1'; 
| data_length | index_length | 
| 16384 | 0 | 
1 row in set (0,00 sec) 

If we consider that our exact table size is 16384 bytes, we are in wrong direction. In fact if we use OS commands we could see that the exact size of table is 98304 bytes:

[root@node1 ~]# ls -lt /var/lib/mysql/test/ 
total 208 
-rw-rw----. 1 mysql mysql 98304 Apr 18 11:44 t1.ibd 

So when we create an InnoDB table it is preallocates 6 pages(16Kib*6 = 98304) but only 1 page shown up from data_length column. Now let’s come back to our misc_files or exactly _get_db_dir_size() function. From source code we can see that there is no a check for “.IBD” files:

for item in os.listdir(folder): 
    name, ext = os.path.splitext(item) 
    if ext.upper() not in (".MYD", ".MYI") and  name.upper() not in ('SLOW_LOG', 'GENERAL_LOG'): 
        itemfolder = os.path.join(folder, item) 

Because of this calculation is wrong for databases as we see from output:

[root@node1 ~]# mysqldiskusage --server=root_pass -vvv 
# Source on localhost: ... connected. 
# Database totals: 
+---------------------+--------------+--------------+--------------+--------------+ | db_name | db_dir_size | data_size | misc_files | total | +---------------------+--------------+--------------+--------------+--------------+ | employees | 242.523.049 | 205.979.648 | 242.523.049 | 448.502.697 | 

As you see it sum ups data_size with misc_files and gets back total as 448.502.697 bytes. But in fact our employees database is exactly 242.523.049 bytes. And of course the: Total database disk usage = 450.940.391 bytes or 430,05 MB is wrong as well. For further exploration and how to patch source code see related BUG REPORT #76703.

Author: Shahriyar Rzayev

Azerbaijan MySQL User Group leader.

1 thought on “mysqldiskusage – source code examination”

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s