Replication Checksumming Through Encryption

Problem

A problem we occasionally see is relay log corruption, which is most frequently caused by network errors. At present, the replication IO thread does not checksum incoming data (checksumming is currently scheduled for MySQL 6.x). In the meantime, there is a relatively easy workaround: encrypt the replication connection. An encrypted connection has to verify the integrity of every packet it receives, so data damaged in transit is rejected instead of being written to the relay log.

Solution 1: Replication over SSH Tunnel

This is the easiest to set up. Run the following on the Slave:

shell> ssh -f user@master.server -L 4306:master.server:3306 -N

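This sets up the tunnel: port 4306 on the Slave is now forwarded over SSH to master.server:3306. Now alter the Slave to connect through the tunnel. Use 127.0.0.1 rather than localhost as the host, because on Unix MySQL treats localhost as a request to connect through the local socket file and ignores the port:

mysql> STOP SLAVE;
mysql> CHANGE MASTER TO master_host='127.0.0.1', master_port=4306;
mysql> START SLAVE;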

Everything else stays the same. Your Slave is still connecting to the same Master, just in a different manner.

This solution does have a couple of downsides, however:

  • If the SSH tunnel goes down, it won’t reconnect automatically. A small script that restarts the connection whenever it fails fixes this, and adding that script to your init.d setup opens the tunnel automatically on server startup.
  • If you use MySQL Enterprise Monitor, it won’t be able to recognize that the Master and Slave belong together.
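
The restart script mentioned in the first point could look roughly like the sketch below. It is only one way to do it, not a tested production script: the function names, the environment variables, and the "start" argument are made up for this example, and the host and ports simply mirror the tunnel command above. It relies on standard OpenSSH client options so that ssh exits when the link dies, and then restarts it in a loop.

```shell
#!/bin/sh
# Hypothetical watchdog for the replication tunnel (names are illustrative).
# Defaults mirror the example tunnel; override them through the environment.
TUNNEL_SPEC="${TUNNEL_SPEC:-4306:master.server:3306}"
TUNNEL_HOST="${TUNNEL_HOST:-user@master.server}"
RETRY_DELAY="${RETRY_DELAY:-10}"

# Print the ssh command line used for the tunnel.
#   BatchMode=yes             never hang on a password prompt (use key auth)
#   ExitOnForwardFailure=yes  exit if local port 4306 cannot be bound
#   ServerAlive*              exit when the master stops answering keepalives
tunnel_command() {
    echo "ssh -N -o BatchMode=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=15 -o ServerAliveCountMax=3 -L $TUNNEL_SPEC $TUNNEL_HOST"
}

# Run the tunnel in the foreground and restart it whenever ssh exits.
keep_tunnel_open() {
    while true; do
        $(tunnel_command)
        echo "tunnel exited with status $?; retrying in ${RETRY_DELAY}s" >&2
        sleep "$RETRY_DELAY"
    done
}

# Only start the loop when called with "start", so the file can be sourced.
if [ "${1:-}" = "start" ]; then
    keep_tunnel_open
fi
```

Saved as, say, replication-tunnel.sh (a hypothetical name), it would be launched with "sh replication-tunnel.sh start" from whatever init mechanism you use. Note that ssh runs without -f here, so the loop can see when it exits.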

Solution 2: Replication with SSL

Replication with SSL can be trickier to set up, but it removes the two downsides of the previous solution. Luckily, the MySQL Documentation Team have done all the hard work for you.
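
For orientation, the Slave side of that setup comes down to a CHANGE MASTER TO statement with the SSL options. The sketch below is a small shell helper that prints such a statement; the function name and the certificate paths are hypothetical, and it assumes the Master has already been started with SSL enabled and that certificates have been created as the documentation describes. Note that it points the Slave back at master.server:3306 rather than at the tunnel.

```shell
#!/bin/sh
# Hypothetical helper: print the statements that switch the Slave to SSL.
# Arguments: CA certificate, Slave certificate, Slave private key (paths).
ssl_change_master_sql() {
    cat <<EOF
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST='master.server',
  MASTER_PORT=3306,
  MASTER_SSL=1,
  MASTER_SSL_CA='$1',
  MASTER_SSL_CERT='$2',
  MASTER_SSL_KEY='$3';
START SLAVE;
EOF
}

# Example use (paths are placeholders); review the output before running it:
#   ssl_change_master_sql /etc/mysql/ca-cert.pem \
#       /etc/mysql/client-cert.pem /etc/mysql/client-key.pem | mysql -u root -p
```

Afterwards, SHOW SLAVE STATUS on the Slave should report Master_SSL_Allowed: Yes.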

Conclusion

If you’re seeing corruption problems in your Relay Log, but not in your Master Binary Log, try Solution 1. It’s quick to set up and will show whether encryption solves your problem. If it works, set up Solution 2. It will take a little bit of fiddling around, but is certainly worth the effort.

9 comments

  1. But the real source of corrupt relay logs in my experience is rarely network corruption. I realize that there are people for whom the network is usually the problem, but I’ve seen a ton of problems that are caused by bugs in the replication code, and your suggestions don’t mitigate that. Read the comments on http://bugs.mysql.com/bug.php?id=25737

    1. I guess we have differing experiences, then.

      The comments on Bug #25737 also seem to agree that network-related corruption is the main cause. That said, I agree that this is not a perfect workaround; checksums would be much better.

  2. Well, I neglected to mention that in my builds, I changed the compression level (from 6 to 1?) so that there is less CPU load on the server. There really should be a my.cnf parameter for that.

    1. Any chance that LZO could be used? It is much more lightweight for compression/decompression of discrete blocks of data.

  3. Question… how did you know that it was the network that was causing the error? I mean, how did you debug this? I suspect that we might be having some network issues related to the relay logs, but I’m just not sure how to show that it is a network issue.

    1. An easy way to test the network link is to FTP a reasonably large file from one server to the other. FTP doesn’t do any checksumming, so it is susceptible to network errors. Once the transfer is complete, check the md5sum of the file on each server.

      You probably also want to check your messages log for hardware errors.
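
The md5sum comparison can be scripted so the transfer is easy to repeat a few times. This is a minimal sketch, assuming GNU md5sum is installed on both servers; the function names and the file path in the usage comment are invented for the example.

```shell
#!/bin/sh
# Print only the MD5 digest of a file (md5sum prints "digest  filename").
file_md5() {
    md5sum "$1" | cut -d ' ' -f 1
}

# Succeed if the file's digest equals the digest reported by the other server.
md5_matches() {
    [ "$(file_md5 "$1")" = "$2" ]
}

# Example use on the receiving server (path and digest are placeholders):
#   md5_matches /tmp/testfile.bin "<digest printed by md5sum on the sender>" \
#       && echo "transfer intact" || echo "file was corrupted in transit"
```

A single clean transfer proves little if the errors are intermittent, so it is worth repeating the test several times.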
