How we used iptables to replicate UDP traffic when upgrading our Graylog cluster

Kishore Nallan / August 2, 2018

At Zapier, we're big fans of Graylog, and rely heavily on its logs to help us track down tricky bugs and correlate customer issues with specific outages.

Recently, we needed to upgrade our Graylog cluster. Since we had to do this with zero downtime, we decided to provision a brand new Graylog cluster in parallel, run both for a week so the new cluster could accumulate enough historical context, and then tear down the old cluster.

This plan required us to index every log message sent to the previous-generation Graylog cluster in the new cluster as well. These log messages are sent as UDP datagrams in the GELF logging format. Curious why we've configured Graylog to use UDP instead of TCP? We send nearly 10K messages every second to Graylog, and UDP scales much better for us. With TCP, we would have to worry about slow connections, timeouts, and back-pressure, any of which could impact application performance.

When we started exploring how we could mirror these log messages between the two Graylog clusters, we considered three main approaches:

  1. Let the logging client send every message to both clusters.
  2. Use an intermediate proxy server which distributes the incoming UDP traffic to both clusters.
  3. Use iptables on the previous generation Graylog cluster to clone and forward the UDP packets to the new cluster.

We ruled out option 1 since it would add extra overhead and complexity to the logging client, and we wanted to do that only as a last resort. Option 2 looked promising, but we couldn't find a reliable UDP proxy that could handle our scale.

That eventually led us to explore how we could use native iptables rules on Linux to achieve this. The idea of using iptables appealed to us since we didn't have to install additional software or make any architectural changes. It took us some time to get this right, and since we did not come across anyone else documenting this approach, I thought it would be good to share what worked for us in this post.

[Diagram: Graylog migration]

The above diagram shows how the setup works in practice. Each old-generation Graylog host clones the incoming UDP datagrams it receives and forwards the copies to a host in the next-generation cluster.

Let's now discuss the actual iptables rules that we configured on the old-generation Graylog hosts to achieve this.

Cloning the incoming UDP packet

iptables -t mangle -A PREROUTING -i eth0 -p udp --dport 12201 -m state \
--state NEW,ESTABLISHED,RELATED -j TEE --gateway 127.0.0.1

We use the TEE target in the mangle table to clone the incoming UDP packets on port 12201 (Graylog's UDP port) and route the copies to the local loopback address.

Sending the cloned packets to a host on the new cluster

Let's assume that the IP addresses of the new Graylog hosts are 10.0.10.200, 10.0.10.201 and 10.0.10.202. Here's how we would configure iptables to distribute the incoming packets to these hosts:

iptables -t nat -A PREROUTING -i eth0 -p udp --dport 12201 -m statistic --mode nth --every 3 \
--packet 0 -m state --state NEW,ESTABLISHED,RELATED -j DNAT --to-destination 10.0.10.200:12201

iptables -t nat -A PREROUTING -i eth0 -p udp --dport 12201 -m statistic --mode nth --every 2 \
--packet 0 -m state --state NEW,ESTABLISHED,RELATED -j DNAT --to-destination 10.0.10.201:12201

iptables -t nat -A PREROUTING -i eth0 -p udp --dport 12201 -m state --state NEW,ESTABLISHED,RELATED \
-j DNAT --to-destination 10.0.10.202:12201

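The statistic module in nth mode spreads the packets evenly across the three hosts. The first rule matches one out of every three packets and sends it to 10.0.10.200. The second rule only sees the packets the first rule skipped, so matching every second one of those sends another third of the total to 10.0.10.201. The last rule has no statistic match and sends whatever remains to 10.0.10.202.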
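The same pattern extends to any number of hosts: the rule for the first host uses --every N, the next one --every N-1, and so on, with the last rule carrying no statistic match at all. Here is a minimal dry-run sketch of that idea. It only prints the rules rather than applying them, and gen_dnat_rules, IFACE and PORT are hypothetical names chosen for illustration, not something from our actual setup.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints DNAT rules for a list of hosts instead of applying them.
# gen_dnat_rules, IFACE and PORT are hypothetical names used for illustration.

IFACE="${IFACE:-eth0}"
PORT="${PORT:-12201}"

gen_dnat_rules() {
  local remaining=$#
  local host match
  for host in "$@"; do
    match=""
    # Every rule except the last takes 1 of every `remaining` packets that
    # reach it; the last rule takes whatever is left.
    if [ "$remaining" -gt 1 ]; then
      match="-m statistic --mode nth --every $remaining --packet 0 "
    fi
    echo "iptables -t nat -A PREROUTING -i $IFACE -p udp --dport $PORT ${match}-m state --state NEW,ESTABLISHED,RELATED -j DNAT --to-destination $host:$PORT"
    remaining=$((remaining - 1))
  done
}

# Prints three rules equivalent to the ones shown above.
gen_dnat_rules 10.0.10.200 10.0.10.201 10.0.10.202
```

Printing the rules first makes it easy to review them before piping the output to a shell as root.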

A couple of gotchas

  1. Be sure to verify that IP forwarding is enabled. You can enable it with sysctl -w net.ipv4.ip_forward=1.
  2. If you are on AWS, you have to disable source/destination checks on the Graylog hosts.
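Both gotchas can be checked from a shell. The sketch below reads the kernel's forwarding flag from /proc; forwarding_status is a hypothetical helper name, and the instance ID in the AWS CLI comment is a placeholder, not a real one.

```shell
#!/usr/bin/env bash
# forwarding_status is a hypothetical helper name used for illustration.
# It reads the IPv4 forwarding flag (1 = enabled) from the given file,
# defaulting to the standard Linux location under /proc.

forwarding_status() {
  local file="${1:-/proc/sys/net/ipv4/ip_forward}"
  if [ "$(cat "$file" 2>/dev/null)" = "1" ]; then
    echo "enabled"
  else
    echo "disabled"
  fi
}

echo "IPv4 forwarding: $(forwarding_status)"

# To enable forwarding until the next reboot (requires root):
#   sysctl -w net.ipv4.ip_forward=1
#
# On AWS, the source/destination check can be switched off per instance with
# the AWS CLI (the instance ID is a placeholder):
#   aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --no-source-dest-check
```

Note that sysctl -w does not survive a reboot; persisting the setting (for example in /etc/sysctl.conf) is left out of this sketch.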

That's it: a copy of all UDP traffic arriving on port 12201 on the old Graylog cluster is now sent to the new cluster! This approach worked flawlessly for us, with no observable performance impact or additional load on the nodes.
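One way to check the mirroring end to end is to send a recognizable test message to an old-generation host and then search for it in the new cluster. The following is a rough sketch of that, assuming a bash build with /dev/udp support; gelf_payload and send_gelf are hypothetical helper names, and the hostname in the usage comment is made up.

```shell
#!/usr/bin/env bash
# gelf_payload and send_gelf are hypothetical helper names used for illustration.

gelf_payload() {
  # GELF 1.1 requires the "version", "host" and "short_message" fields.
  # No JSON escaping is done here, so keep the arguments free of quotes
  # and backslashes.
  printf '{"version":"1.1","host":"%s","short_message":"%s"}' "$1" "$2"
}

send_gelf() {
  # Usage: send_gelf <graylog-host> <port> <source-name> <message>
  # Relies on bash's built-in /dev/udp redirection, so no extra tools are needed.
  gelf_payload "$3" "$4" > "/dev/udp/$1/$2"
}

# Example (not run here; the hostname is a placeholder):
#   send_gelf old-graylog.example.internal 12201 mirror-test "udp mirror check"
```

Because the rules above match on -i eth0, the test message has to be sent from a different machine so that it arrives on that interface; a message sent from the Graylog host to itself would not be cloned.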

