How do I remove a large number of table rows in MySQL to help my Drupal site run faster?
There are many strategies and tools for making a Drupal site run faster: Varnish, memcache, Views caching, InnoDB tables, removing unneeded data (both files and database rows), and proper table indexes for the queries you run. These are some of the possibilities you might want to consider.
For queries, adding good indexes is very important. However, a large row count and data size also make indexing a table slower and consume more space on the server. That trade-off is worthwhile if you genuinely need access to all of those rows.
Sometimes the better approach is to remove as many rows from the table as possible, leaving only the rows that the most common queries actually need. Once a table has grown to a significant number of rows, removing them becomes difficult. For example, on a table with millions of rows, running the following query might lock up the database server:
DELETE FROM [my_table] WHERE [my_column] IS NOT NULL;
Even with an index on [my_table].[my_column], this will saturate disk I/O, and other queries will stack up behind it. In MySQL 5.0, even the TRUNCATE TABLE or DROP TABLE statements can cause problems. Fortunately, there are now tools that make this process easier.
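If you can't use an external tool, one manual workaround is to delete in small batches so each statement holds its locks only briefly. This is a minimal sketch, assuming an index on my_column; the batch size of 1,000 is only a starting point:
-- Delete a small batch; repeat until the statement affects 0 rows.
DELETE FROM my_table WHERE my_column IS NOT NULL LIMIT 1000;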
The Percona Toolkit (formerly known as Maatkit) comes with a script called pt-archiver that solves this problem. For an explanation of effective purging, see the How to write efficient archiving and purging jobs in SQL blog post by Baron Schwartz.
To use pt-archiver on Acquia Cloud, you'll need to know how to tunnel over SSH, and you'll need your database connection information. The fastest way to get this information is to sign in to Acquia and check the Databases tab of your site's workflow. You can also obtain it by using a Drush command similar to the following:
drush [@mysitename.dev] sql-connect
This returns a connection string similar to the following:
mysql --database=[site_name_dev] --host=[srv-1234] --port=3306 --user=[site_name_dev] --password=[1234mnop456]
To use the Percona Toolkit:
- Obtain and install the toolkit from the local command line:
tar xvfz percona-toolkit.tar.gz
cd percona-toolkit-2.2.14
- Set up the tunnel (see the previous instructions for finding the server names, and replace as necessary; a connectivity check you can run first appears after this list):
ssh -f -L 1111:server.prod.hosting.acquia.com:3306 firstname.lastname@example.org -N
- Run the archiver:
bin/pt-archiver --source h=127.0.0.1,u=myusername,D=mydatabasename,t=watchdog,p=mypassword,P=1111 --purge --where "type='system'" --no-check-charset
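Before running the archiver, you can verify that the tunnel works. This is a quick check, reusing the placeholder credentials from above; if it opens a mysql> prompt, the forwarded port is reaching the database server:
# Connect through the tunnel on the forwarded local port.
mysql --host=127.0.0.1 --port=1111 --user=myusername --password=mypassword mydatabasename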
Before running this command, you might have results like this:
mysql> select type, count(*) from watchdog group by type;
+----------------+----------+
| type           | count(*) |
+----------------+----------+
| access denied  |        4 |
| actions        |       12 |
| content        |        2 |
| cron           |       83 |
| manual         |        1 |
| page not found |       16 |
| php            |       18 |
| system         |       41 |
| testing        |        8 |
| user           |       22 |
+----------------+----------+
10 rows in set (0.00 sec)
After the command completes, the system rows are gone:
mysql> select type, count(*) from watchdog group by type;
+----------------+----------+
| type           | count(*) |
+----------------+----------+
| access denied  |        4 |
| actions        |       12 |
| content        |        2 |
| cron           |       83 |
| manual         |        1 |
| page not found |       16 |
| php            |       18 |
| testing        |        8 |
| user           |       22 |
+----------------+----------+
9 rows in set (0.00 sec)
The following is an example of using the pt-archiver utility to remove millions of rows of data from a table:
bin/pt-archiver --source h=srv-1234,D=mysitenamedev,t=mytable,p=r4nd0mstr1ng --purge --where "mycolumn IS NOT NULL"
The --purge parameter instructs pt-archiver to remove the rows from the database. You can instead use the --dest parameter to have pt-archiver move the rows to a separate table, which can live in a different database or on a different host, or the --file parameter to dump the rows to a file suitable for use with the LOAD DATA INFILE SQL statement to rebuild the data at any location.
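The following sketch shows the file-based variant of the earlier command (the output path is a placeholder; pt-archiver expands %Y, %m, and %d to the current date, and %D and %t to the database and table names):
# Copy matching rows to a flat file as they are removed from the source table.
bin/pt-archiver --source h=srv-1234,D=mysitenamedev,t=mytable,p=r4nd0mstr1ng --file '/tmp/%Y-%m-%d-%D.%t' --where "mycolumn IS NOT NULL"
The resulting file can later be reloaded with a statement such as LOAD DATA INFILE '[/path/to/archive/file]' INTO TABLE mytable;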
To make the tool more effective, you might add a few parameters, such as --limit 1000 --commit-each. These two parameters together will run the command in batches of 1,000 rows at a time and commit the transaction between batches.
To view the progress of the utility as it runs and to get some statistics at the end of the process, you might also add --progress 10000 --statistics. The --progress option with its value instructs pt-archiver to print a progress line every 10,000 rows, and the --statistics option has the utility display a summary afterwards with statistics on what the run accomplished.
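Putting these options together, a batched run with progress reporting might look like this sketch (same placeholder connection values as the earlier example):
bin/pt-archiver --source h=srv-1234,D=mysitenamedev,t=mytable,p=r4nd0mstr1ng --purge --where "mycolumn IS NOT NULL" --limit 1000 --commit-each --progress 10000 --statistics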
Even with the pt-archiver tool, it's important to test in a development environment and during low-traffic periods to make sure it doesn't unnecessarily impact production sites. You might first run it with the --dry-run parameter, which reports what would happen without actually running the process. This helps you spot mistyped parameters and other problems.
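For example, you might start with the following (again using the placeholder connection values from above):
# Print the queries pt-archiver would run without executing them.
bin/pt-archiver --source h=srv-1234,D=mysitenamedev,t=mytable,p=r4nd0mstr1ng --purge --where "mycolumn IS NOT NULL" --dry-run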
You should also actively monitor the disk I/O, CPU usage, memory usage, system load, and the mytop processlist on the server. This will assure you that nothing is causing queries to stall and that the server is running efficiently.
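If mytop isn't installed, the same processlist information is available directly from MySQL. For example, from a second terminal, reusing the placeholder tunnel and credentials from above:
# List every running query so you can spot long-running or stalled statements.
mysql --host=127.0.0.1 --port=1111 --user=myusername --password=mypassword -e "SHOW FULL PROCESSLIST;"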