Replication Checksumming Through Encryption

February 16, 2009 · Posted in MySQL 

Problem

A problem we occasionally see is Relay Log corruption, which is most frequently caused by network errors. At this point in time, the replication IO thread does not perform checksumming on incoming data (currently scheduled for MySQL 6.x). In the meantime, we have a relatively easy workaround: encrypt the replication connection. Encrypted connections have to verify the integrity of each packet, so data corrupted in transit is detected instead of being silently written to the Relay Log.

Solution 1: Replication over SSH Tunnel

This is the easiest to set up. You simply need to run the following on the Slave:

shell> ssh -f user@master.server -L 4306:master.server:3306 -N

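This sets up the tunnel. slave.server:4306 is now a tunnelled link to master.server:3306. So now, you just need to alter the Slave to go through the tunnel. Use 127.0.0.1 rather than localhost, because on Unix the MySQL client library treats the name localhost as a request to connect through the local socket file, which would bypass the tunnel:

mysql> STOP SLAVE;
mysql> CHANGE MASTER TO master_host='127.0.0.1', master_port=4306;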
mysql> START SLAVE;

Everything else stays the same. Your Slave is still connecting to the same Master, just in a different manner.
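
To confirm that the Slave really is replicating through the tunnel, one option is to inspect the output of SHOW SLAVE STATUS. The following is only a sketch: the function name check_tunnel_replication is made up for this example, it assumes the tunnel port 4306 from the ssh command above, and it reads the \G-formatted status text on standard input.

```shell
# Hypothetical helper: reads `SHOW SLAVE STATUS\G` output on stdin and
# succeeds only if the I/O thread is running against the expected port.
check_tunnel_replication() {
    expected_port=${1:-4306}   # local end of the SSH tunnel (assumed)
    awk -v port="$expected_port" '
        {
            # Each line looks like "   Field_Name: value"; strip the padding.
            line = $0
            sub(/^[ \t]+/, "", line)
            split_at = index(line, ": ")
            if (split_at == 0) next
            name  = substr(line, 1, split_at - 1)
            value = substr(line, split_at + 2)
            if (name == "Master_Port")      seen_port  = value
            if (name == "Slave_IO_Running") io_running = value
        }
        END {
            if (seen_port == port && io_running == "Yes") exit 0
            exit 1
        }
    '
}

# Usage (assumes the mysql client can log in on the Slave):
#   mysql -e 'SHOW SLAVE STATUS\G' | check_tunnel_replication 4306 && echo "tunnel in use"
```

A non-zero exit status here only means that one of the two fields did not have the expected value; the full SHOW SLAVE STATUS output is still the place to look for the actual error.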

This solution does have a couple of downsides, however:

  • If the SSH tunnel goes down, it won’t automatically reconnect. This can be fixed with a small script that restarts the connection if it fails. The script can be added to your init.d setup, so the tunnel opens automatically on server startup.
  • If you use MySQL Enterprise Monitor, it won’t be able to recognize that the Master/Slave pair go together.
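
For the first downside, a restart script can be as simple as a loop around ssh. The following is a minimal sketch, not a hardened init script: the function name keep_tunnel, the variable names and the retry delay are all assumptions made for this example. It runs ssh in the foreground (no -f) so that the loop notices when the tunnel exits.

```shell
#!/bin/sh
# Hypothetical tunnel watchdog. Host names and ports match the example above;
# ServerAliveInterval and ExitOnForwardFailure are standard OpenSSH options
# that make ssh exit when the link dies or the port forward cannot be set up.
TUNNEL_CMD=${TUNNEL_CMD:-"ssh -N -o ServerAliveInterval=30 -o ExitOnForwardFailure=yes -L 4306:master.server:3306 user@master.server"}
RETRY_DELAY=${RETRY_DELAY:-5}    # seconds to wait before reconnecting (assumed)

keep_tunnel() {
    max_restarts=${1:-0}         # 0 means keep retrying forever
    attempts=0
    while :; do
        attempts=$((attempts + 1))
        $TUNNEL_CMD              # blocks for as long as the tunnel is up
        echo "tunnel exited with status $?; attempt $attempts" >&2
        if [ "$max_restarts" -gt 0 ] && [ "$attempts" -ge "$max_restarts" ]; then
            return 1             # give up after the requested number of tries
        fi
        sleep "$RETRY_DELAY"
    done
}

# Usage: call `keep_tunnel` (no argument) from an init.d script, in the background.
```

This assumes key-based SSH authentication, since nobody is there to type a password when the tunnel is restarted.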

Solution 2: Replication with SSL

Replication with SSL can be trickier to set up, but it removes the two downsides of the previous solution. Luckily, the MySQL Documentation Team have done all the hard work for you.
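
As a rough illustration of the Slave side only, the sketch below prints the statements that switch an existing Slave to SSL. The function name print_ssl_change_master and the certificate paths are placeholders invented for this example. It also assumes the Master already has SSL enabled and that the certificates have been created as described in the documentation.

```shell
# Hypothetical helper: prints the statements that switch a Slave to SSL.
# The default certificate paths are placeholders; substitute your own files.
print_ssl_change_master() {
    ca=${1:-/etc/mysql/certs/ca-cert.pem}
    cert=${2:-/etc/mysql/certs/client-cert.pem}
    key=${3:-/etc/mysql/certs/client-key.pem}
    cat <<EOF
STOP SLAVE;
CHANGE MASTER TO
  MASTER_SSL = 1,
  MASTER_SSL_CA = '$ca',
  MASTER_SSL_CERT = '$cert',
  MASTER_SSL_KEY = '$key';
START SLAVE;
EOF
}

# Usage: review the output first, then pipe it to the mysql client on the Slave:
#   print_ssl_change_master | mysql
```

If the Slave was previously pointed at an SSH tunnel, the master host and port would also need to be set back to the real Master in the same CHANGE MASTER TO statement.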

Conclusion

If you’re seeing corruption problems in your Relay Log, but not in your Master Binary Log, try Solution 1. It’s quick to set up and will determine if encryption is the solution to your problem. If it works, set up Solution 2. It will take a little bit of fiddling around, but is certainly worth the effort.

Comments

9 Responses to “Replication Checksumming Through Encryption”

  1. Xaprb on February 17th, 2009 2:48 am

    But the real source of corrupt relay logs in my experience is rarely network corruption. I realize that there are people for whom the network is usually the problem, but I’ve seen a ton of problems that are caused by bugs in the replication code, and your suggestions don’t mitigate that. Read the comments on http://bugs.mysql.com/bug.php?id=25737

  2. Gary Pendergast on February 17th, 2009 3:01 am

    I guess we have differing experiences, then.

    The comments on Bug #25737 also seem to agree that network-related corruption is the main cause. That said, I agree that this is not a perfect workaround; checksums would be much better.

  3. Mark Callaghan on February 17th, 2009 2:46 pm

    The next Google patch adds checksums to binlog events. Can’t you also get some protection by using compression between master and slave?

  4. Gary Pendergast on February 17th, 2009 8:24 pm

    Next Google patch sounds good, I’ll have a play around with it when it comes out.

    slave_compressed_protocol would probably work as well. The only reason I hesitate to recommend it is because I haven’t tested the overhead. SSH and SSL are both pretty good for keeping CPU usage down.

  5. Mark Callaghan on February 17th, 2009 9:04 pm

    Well, I neglected to mention that in my builds, I changed the compression level (from 6 to 1?) so that there is less CPU load on the server. There really should be a my.cnf parameter for that.

  6. Antony T Curtis on February 17th, 2009 11:14 pm

    Any chance that LZO could be used, as that is much more lightweight for compression/decompression of discrete blocks of data?

  7. Gary Pendergast on February 17th, 2009 11:36 pm

    Nice idea, I’ve created Bug #42949 for it. Any chance you could attach your patch for it, as a reference?

  8. bichonfrise74 on February 18th, 2009 1:00 am

    Question… how did you know that it was network that was causing the error? I mean how did you debug this? I suspect that we might be having some network issues related to the relay logs but I’m just not sure how to show that it is a network issue.

  9. Gary Pendergast on February 18th, 2009 2:37 am

    An easy way to test the network link is to FTP a reasonably large file from one server to the other. FTP doesn’t do any checksumming of its own, so it is susceptible to network errors. Once the transfer is complete, check the md5sum of the file on each server.

    You probably also want to check your messages log for hardware errors.
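
    The md5sum comparison can be scripted along the following lines. This is only a sketch: the function name verify_md5 is made up for this example, and the expected digest is whatever md5sum printed for the file on the sending server.

```shell
# Hypothetical helper: checks a transferred file against the MD5 digest
# that md5sum reported for the original on the other server.
verify_md5() {
    file=$1
    expected=$2                  # digest copied from the sending server
    actual=$(md5sum < "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: checksums match"
    else
        echo "MISMATCH: got $actual, expected $expected" >&2
        return 1
    fi
}

# Usage sketch (file name and size are arbitrary):
#   dd if=/dev/urandom of=testfile.bin bs=1M count=512   # on the first server
#   md5sum testfile.bin                                  # note the digest
#   ...transfer testfile.bin with FTP in binary mode...
#   verify_md5 testfile.bin <digest from the first server>
```

    The transfer needs to be in binary mode, otherwise the FTP client may rewrite line endings and the digests will differ even on a healthy link. Repeating the transfer a few times is worthwhile, since intermittent corruption may not show up on every run.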
