Issue
How do I remove a large number of table rows in MySQL to help my Drupal site run faster?
Resolution
There are many strategies and tools that can make a Drupal site run faster: Varnish, memcache, views caching, InnoDB tables, removing data (files and database data), and proper table indexes for the queries your site runs. This article covers two of those possibilities: indexing and removing data.
Indexing
Adding good indexes for your queries is very important. Keep in mind, however, that a large number of rows and a large data size make building and maintaining an index slower, and the index uses more space on the server. That trade-off is worthwhile only if you actually need access to all of those rows.
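For example, on a hypothetical table (my_table, my_column, and idx_my_column are placeholders, not names from your site), adding an index and checking that a query can use it might look like this:
-- Add an index that matches the WHERE clause of a frequent query.
ALTER TABLE my_table ADD INDEX idx_my_column (my_column);
-- Check whether MySQL chooses the new index for that query (see the "key" column in the output).
EXPLAIN SELECT * FROM my_table WHERE my_column = 'some_value';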
Removing data
Sometimes the better method is to remove as many rows from the table as possible, leaving only the active necessary rows for the most common queries. Once a table has a significant number of rows, it becomes difficult to remove them. For example, with a large database of millions of rows of data in a table, running the following query might lock the database server:
DELETE FROM [my_table] WHERE [my_column] IS NOT NULL;
Even with an index on [my_table].[my_column], this query will overutilize disk I/O, and other queries will stack up behind it. In MySQL version 5.0, even the TRUNCATE TABLE or DROP TABLE statements can cause problems. There are now alternative tools that make this process easier.
pt-archiver
The Percona Toolkit (formerly known as Maatkit) comes with a script called pt-archiver that solves this problem. For an explanation of effective purging, see the How to write efficient archiving and purging jobs in SQL blog post by Baron Schwartz.
To use pt-archiver on Acquia Cloud, you'll need to know how to tunnel over SSH, and you'll need your database connection information. The fastest way to get this information is to sign in to Acquia and check the Databases tab of your site's workflow. You can also obtain it by using a Drush command similar to the following:
drush [@mysitename.dev] sql-connect
This outputs a connection string similar to the following:
mysql --database=[site_name_dev] --host=[srv-1234] --port=3306 --user=[site_name_dev] --password=[1234mnop456]
To use the Percona Toolkit:
- Obtain and install the toolkit. From the local command line:
wget http://percona.com/get/percona-toolkit.tar.gz
tar xvfz percona-toolkit.tar.gz
cd percona-toolkit-2.2.14
- From the local command line, set up the tunnel (see the previous instructions for finding the server names, and replace them as necessary):
ssh -f -L 1111:server.prod.hosting.acquia.com:3306 mysitename.prod@server.prod.hosting.acquia.com -N
- Run the archiver:
bin/pt-archiver --source h=127.0.0.1,u=myusername,D=mydatabasename,t=watchdog,p=mypassword,P=1111 --purge --where "type='system'" --no-check-charset
Before running this command, you might have results like this:
mysql> select type, count(*) from watchdog group by type;
+----------------+----------+
| type           | count(*) |
+----------------+----------+
| access denied  |        4 |
| actions        |       12 |
| content        |        2 |
| cron           |       83 |
| manual         |        1 |
| page not found |       16 |
| php            |       18 |
| system         |       41 |
| testing        |        8 |
| user           |       22 |
+----------------+----------+
10 rows in set (0.00 sec)
And after:
mysql> select type, count(*) from watchdog group by type;
+----------------+----------+
| type           | count(*) |
+----------------+----------+
| access denied  |        4 |
| actions        |       12 |
| content        |        2 |
| cron           |       83 |
| manual         |        1 |
| page not found |       16 |
| php            |       18 |
| testing        |        8 |
| user           |       22 |
+----------------+----------+
9 rows in set (0.00 sec)
The following is an example of using the pt-archiver utility to remove millions of rows of data from a table:
bin/pt-archiver --source h=srv-1234,D=mysitenamedev,t=mytable,p=r4nd0mstr1ng --purge --where "mycolumn IS NOT NULL"
The --purge parameter instructs pt-archiver to remove the rows from the database. You can substitute a --dest or --file parameter to have pt-archiver move the rows instead: --dest copies them to a separate table, which can be in a different database or on a different host, and --file dumps the results to a file suitable for use with the LOAD DATA INFILE SQL statement to rebuild the data at any location.
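As a sketch of the file-based approach (the file path, connection values, and table names here are placeholders, and this assumes the destination table already exists wherever you reload the data), archiving to a file and reloading it later might look like this:
# Archive matching rows to a file; pt-archiver writes it in a format that LOAD DATA INFILE can read back.
bin/pt-archiver --source h=127.0.0.1,u=myusername,D=mydatabasename,t=mytable,p=mypassword,P=1111 --file /tmp/mytable-archive.txt --where "mycolumn IS NOT NULL" --limit=1000 --commit-each
# Later, rebuild the rows on another MySQL host; LOCAL reads the file from the machine where the
# mysql client runs (local_infile must be enabled on the server).
mysql --local-infile=1 --host=[other-host] --user=myusername -p mydatabasename -e "LOAD DATA LOCAL INFILE '/tmp/mytable-archive.txt' INTO TABLE mytable"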
To make the tool more effective, you might add a few parameters, such as --limit=1000 and --commit-each. Together, these two parameters make the command work in batches of 1,000 rows at a time and commit the transaction after each batch.
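For example, adding those two options to the earlier example command (the connection values are still placeholders):
bin/pt-archiver --source h=srv-1234,D=mysitenamedev,t=mytable,p=r4nd0mstr1ng --purge --where "mycolumn IS NOT NULL" --limit=1000 --commit-each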
To view the utility's progress while it's running and to get statistics at the end of the process, you might also add --progress 10000 --statistics. The --progress option with this value instructs pt-archiver to print a progress line every 10,000 rows, and the --statistics option makes the utility display a summary of the run when it finishes.
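Continuing the same example, the command with both the batching and the reporting options added would look like this:
bin/pt-archiver --source h=srv-1234,D=mysitenamedev,t=mytable,p=r4nd0mstr1ng --purge --where "mycolumn IS NOT NULL" --limit=1000 --commit-each --progress 10000 --statistics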
Even with the pt-archiver tool, it's important to test in a development environment and during low-traffic periods to make sure it isn't unnecessarily impacting production sites. You might first run the command with the --dry-run parameter, which reports what would happen without actually running the process. This helps you spot mistyped parameters and other problems.
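A dry run of the same example command looks like this; pt-archiver prints the queries it would execute and exits without changing any data:
bin/pt-archiver --source h=srv-1234,D=mysitenamedev,t=mytable,p=r4nd0mstr1ng --purge --where "mycolumn IS NOT NULL" --limit=1000 --commit-each --dry-run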
You should also actively monitor disk I/O, CPU usage, memory usage, system load, and the mytop processlist on the server. This will assure you that nothing is causing queries to stall and that the server is running efficiently.
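As a rough sketch of that monitoring using standard tools (the connection details below are the placeholders from the tunnel example above, and mytop can show the same processlist interactively):
# Watch disk I/O, CPU, memory, and load in separate terminals while the purge runs (sysstat package).
iostat -x 5
vmstat 5
# Check for stalled or stacked-up queries through the tunnel.
mysql --host=127.0.0.1 --port=1111 --user=myusername -p -e "SHOW FULL PROCESSLIST"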