Replication Checksumming Through Encryption

Problem

We occasionally see Relay Log corruption, which is most frequently caused by network errors. At present, the replication IO thread does not checksum incoming data (this is currently scheduled for MySQL 6.x). In the meantime, there is a relatively easy workaround: encrypt the replication connection. Encrypted connections verify the integrity of each packet, so data corrupted in transit is detected instead of being silently written to the Relay Log.

Solution 1: Replication over SSH Tunnel

This is the easiest to set up. You simply need to run the following on the Slave:

shell> ssh -f user@master.server -L 4306:master.server:3306 -N

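This sets up the tunnel: port 4306 on the Slave is now a tunnelled link to master.server:3306. Now you just need to point the Slave at the tunnel. Use 127.0.0.1 rather than localhost as the host, because on Unix the MySQL client code treats localhost as a request to connect through the local socket file and ignores the port:

mysql> STOP SLAVE;
mysql> CHANGE MASTER TO master_host='127.0.0.1', master_port=4306;
mysql> START SLAVE;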

Everything else stays the same. Your Slave is still connecting to the same Master, just in a different manner.
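
To confirm that the Slave really is replicating through the tunnel, you can inspect the output of SHOW SLAVE STATUS. The following is a minimal sketch, not an official tool: the function name check_tunnel_replication is made up for this example, and it only looks at the Master_Port, Slave_IO_Running and Slave_SQL_Running fields.

```shell
# Hypothetical helper: reads `SHOW SLAVE STATUS\G` output on stdin and
# checks that the Slave is replicating through the expected port.
check_tunnel_replication() {
    expected_port=$1
    awk -v port="$expected_port" '
        $1 == "Master_Port:"       { seen_port = $2 }
        $1 == "Slave_IO_Running:"  { io = $2 }
        $1 == "Slave_SQL_Running:" { sql = $2 }
        END {
            if (seen_port == port && io == "Yes" && sql == "Yes") {
                print "OK: replicating via port " port
                exit 0
            }
            print "PROBLEM: Master_Port=" seen_port \
                  " Slave_IO_Running=" io " Slave_SQL_Running=" sql
            exit 1
        }'
}
```

On the Slave it could be used as `mysql -e 'SHOW SLAVE STATUS\G' | check_tunnel_replication 4306`, assuming the mysql client picks up your credentials from an option file.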

This solution does have a couple of downsides, however:

If the SSH tunnel goes down, it won’t automatically reconnect. This can be fixed with a small script that restarts the connection if it fails. The script can be added to your init.d setup, so the tunnel opens automatically on server startup.
If you use MySQL Enterprise Monitor, it won’t be able to recognize that the Master/Slave pair go together.
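
The restart script mentioned in the first point could look something like this. It is only a sketch under a few assumptions: the variable names (TUNNEL_CMD, RETRY_DELAY, MAX_ATTEMPTS) and the timeout values are invented for this example, and it assumes key-based SSH login so that no password prompt is needed. Note that it runs ssh without -f, so ssh stays in the foreground and the loop can notice when the tunnel exits.

```shell
#!/bin/sh
# Sketch: keep the replication tunnel open, restarting it when it drops.
# Host names and ports repeat the example above; the option values and
# variable names are assumptions to adjust for your environment.

TUNNEL_CMD=${TUNNEL_CMD:-"ssh -N -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 -o ServerAliveCountMax=3 -L 4306:master.server:3306 user@master.server"}
RETRY_DELAY=${RETRY_DELAY:-5}     # seconds to wait before reconnecting
MAX_ATTEMPTS=${MAX_ATTEMPTS:-0}   # 0 means retry forever

keep_tunnel_open() {
    attempts=0
    while :; do
        # ssh returns only when the tunnel closes or fails to open.
        # TUNNEL_CMD is left unquoted on purpose so it splits into words.
        $TUNNEL_CMD
        status=$?
        attempts=$((attempts + 1))
        echo "tunnel exited with status $status (attempt $attempts)" >&2
        if [ "$MAX_ATTEMPTS" -gt 0 ] && [ "$attempts" -ge "$MAX_ATTEMPTS" ]; then
            echo "giving up after $attempts attempts" >&2
            return 1
        fi
        sleep "$RETRY_DELAY"
    done
}

# Start the loop only when asked to, e.g. `sh keep-tunnel.sh run`.
if [ "${1:-}" = "run" ]; then
    keep_tunnel_open
fi
```

Saved under a name of your choosing (keep-tunnel.sh here is just a placeholder), it can be started from an init script with the `run` argument. The ServerAlive options make ssh exit when the link goes quiet, which is what lets the loop reconnect.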

Solution 2: Replication with SSL

Replication with SSL can be trickier to set up, but it removes the two downsides of the previous solution. Luckily, the MySQL Documentation Team have done all the hard work for you.

Step 1: Create the certificates
Step 2: Set up the servers to recognize the certificates
Step 3: Change the Slave to use SSL
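
As a rough illustration of Step 3, the Slave is switched to SSL with the MASTER_SSL options of CHANGE MASTER TO. The sketch below just prints the statements so you can review them before running anything. The function name ssl_change_master_sql and the certificate paths in the usage line are hypothetical, and it assumes Steps 1 and 2 are already done (certificates created, and the Master started with its ssl-ca, ssl-cert and ssl-key options set).

```shell
# Hypothetical helper: print the Slave-side statements for Step 3.
# Arguments: CA certificate, Slave certificate, Slave private key.
ssl_change_master_sql() {
    ca=$1
    cert=$2
    key=$3
    cat <<EOF
STOP SLAVE;
CHANGE MASTER TO
  MASTER_SSL = 1,
  MASTER_SSL_CA = '$ca',
  MASTER_SSL_CERT = '$cert',
  MASTER_SSL_KEY = '$key';
START SLAVE;
EOF
}
```

For example, `ssl_change_master_sql /etc/mysql/ca-cert.pem /etc/mysql/client-cert.pem /etc/mysql/client-key.pem` prints the statements, which can then be pasted into a mysql session on the Slave. If the Slave was previously pointed at an SSH tunnel, remember to set master_host and master_port back to the real Master as well.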

Conclusion

If you’re seeing corruption problems in your Relay Log, but not in your Master Binary Log, try Solution 1. It’s quick to set up and will tell you whether encryption solves your problem. If it works, set up Solution 2. It will take a little bit of fiddling around, but is certainly worth the effort.

9 comments

Xaprb says:

February 17, 2009 at 2:48 am

But the real source of corrupt relay logs in my experience is rarely the network corruption. I realize that there are people for whom the network is usually the problem, but I’ve seen a ton of problems that are caused by bugs in the replication code, and your suggestions don’t mitigate that. Read the comments on http://bugs.mysql.com/bug.php?id=25737
1. Gary Pendergast says:
  
  February 17, 2009 at 3:01 am
  
  I guess we have differing experiences, then.
  
  The comments on Bug #25737 also seem to agree that network-related corruption is the main cause. That said, I agree that this is not a perfect workaround, checksums would be much better.
Mark Callaghan says:

February 17, 2009 at 2:46 pm

The next Google patch adds checksums to binlog events. Can’t you also get some protection by using compression between master and slave?
1. Gary Pendergast says:
  
  February 17, 2009 at 8:24 pm
  
  Next Google patch sounds good, I’ll have a play around with it when it comes out.
  
  slave_compressed_protocol would probably work as well. The only reason I hesitate to recommend it is because I haven’t tested the overhead. SSH and SSL are both pretty good for keeping CPU usage down.
Mark Callaghan says:

February 17, 2009 at 9:04 pm

Well, I neglected to mention that in my builds, I changed the compression level (from 6 to 1?) so that there is less CPU load on the server. There really should be a my.cnf parameter for that.
1. Antony T Curtis says:
  
  February 17, 2009 at 11:14 pm
  
  Any chance that LZO could be used as that is much more lightweight for compression/decompression of discrete blocks of data.
2. Gary Pendergast says:
  
  February 17, 2009 at 11:36 pm
  
  Nice idea, I’ve created Bug #42949 for it. Any chance you could attach your patch for it, as a reference?
bichonfrise74 says:

February 18, 2009 at 1:00 am

Question… how did you know that it was network that was causing the error? I mean how did you debug this? I suspect that we might be having some network issues related to the relay logs but I’m just not sure how to show that it is a network issue.
1. Gary Pendergast says:
  
  February 18, 2009 at 2:37 am
  
  An easy way to test the network link is to FTP a reasonably large file from one server to the other. FTP doesn’t do any checksumming, so is susceptible to network errors. Once the transfer is complete, check the md5sum of the file on each server.
  
  You probably also want to check your messages log for hardware errors.
