Since Zalando’s earliest days, members of our technology team have been able to use OpenVPN to work anywhere, anytime. Back then (the late 2000s), we had only a single (though state-of-the-art) instance to work with, and a lot of manual maintenance to perform. Last year, in light of a years-long period of explosive growth, our team realized that we needed to build something scalable, fully redundant, and easier-to-maintain for hundreds of users. In other words, our own OpenVPN cluster!
The above diagram illustrates the structure of our new cluster, which I built using typical network models as guides. Achieving greater reliability was the primary goal behind this design, which makes it possible to scale up easily and add as many VPN servers as we need. It’s also easier to add many more IPs. Our team decided to use six servers in order to achieve site reliability and redundancy in each side. For better scaling, we put the servers behind a load balancer for the external datacenters (DC1 + DC2). The internal servers (DC3) are not load-balanced because they are running on high-availability hosts. The config looks like this:
remote 10.1.1.1 1194
remote 10.1.1.2 1194
remote 220.127.116.11 1194
remote 18.104.22.168 1194
On the server, the user config is very simple — we only add the IP address:
vpnXX:/etc/openvpn/ccd# cat username
ifconfig-push 192.168.178.10 192.168.178.9
The rest of the config is similar for all the servers in the cluster:
# some encryption hardening
push "topology net30"
port XXXXX #chose your port
key keys/XXX.key # This file should be kept secret
ifconfig 192.168.178.1 192.168.178.2
keepalive 3 10
management x.x.x.x port
push "dhcp-option DNS 22.214.171.124"
push "dhcp-option DOMAIN <whatever>"
push "route 126.96.36.199 255.255.255.255"
push “route X.X.X.X Y.Y.Y.Y”
Our configuration offers the same IP to the user every time, which greatly simplifies rule setting. However, a new problem arises in such a cluster design: Basically, the network in the datacenter is forced to know which server the user is currently logged on. To handle cases like these, we’ve created two simple scripts in case the cluster can self learn where the user is login. route_add.sh will add the logged-in user using a static route to the VPN server; route_delete.sh will remove it after logout. Two examples:
if [ ! -z "$ifconfig_pool_remote_ip" ]; then
ip route add "$ifconfig_pool_remote_ip" dev $dev
ip route add "<IPv6>::$ifconfig_pool_remote_ip" dev $dev
ip route del "$ifconfig_pool_remote_ip"
After adding the route to the server’s routing table, the only necessary thing to do is to announce it. We use a dynamic routing protocol in the backend of our datacenter, and Quagga on every VPN server to announce the routes. Depending on your infrastructure, you can use one of the preferred routing protocols: Routing Information Protocol (RIP), Border Gateway Protocol (BGP) or Open Shortest Path First (OSPF).
It’s important to activate routing on the VPN server. You can do this by typing:
echo 1 > /proc/sys/net/ipv4/ip_forward
redistribute connected # this is the important line
network X.X.X.X area 0.0.0.0
area 0.0.0.0 authentication message-digest
The user can log on and off to every server, and the IP address will be announced to the entire network.
For security and access control, we chose to use Firewall Builder, an open-source iptables manager that allows us to build a ruleset and install it on all servers in the cluster simultaneously. Since making improvements to our user management system earlier this year, we’ve reduced the complexity of iptables from hundreds of rules/10,000 lines to 50 rules and a more viewable interface.
With this set up, we have about 800 users (150 in parallel) and haven’t faced any performance issues at anytime, from anywhere!