In my last post, I spelled out my requirements for a home router: dependable and not requiring babysitting or monthly rebooting, but flexible enough to let me run and control dnsmasq, tcpdump, and VLANs.
When I realized I was seeing so much weirdness at once from my OpenWrt router as to be circumstantial evidence against OpenWrt itself, I mentioned this to my officemate and he said “why don’t you stop screwing around and install full-blown Linux?” Sure, I thought, but that brings up two problems: it sounded like a huge time suck, and where am I going to find appropriate hardware to use as a router?
With help from friends I eventually solved the second one, not without paying a heavy tax in terms of the first (and not in the way I expected): this is that story.
On the time suck question, it seemed like I would have to learn a lot of new things to set up and, possibly, maintain a lot of tasks that I was accustomed to having OpenWrt do for me. I already knew how to install and administer Linux for standard desktop or server use, but I’d never myself configured any advanced networking topologies and my few interactions with iptables had been painful, so configuring NAT and firewalling and routing and dealing with multiple network interfaces was daunting (and this box is by definition going to be exposed to the Internet, so I’d better get the firewall right). I poked around and found shorewall, which exists basically to configure the parts that I didn’t already know how to do, and the more I read about it, the more it seemed a good match for what I was trying to do.
On the hardware question, I wanted something small and quiet and low-power, which would fit in my server rack and stay on all the time without running up my power bill or generating so much heat that it either fails or needs a leaf blower of a fan. (That basically describes most consumer routers, for which, generally, the closest thing you can find to a standard Linux distribution supporting them is OpenWrt. Ahem.)
I also wanted it to have multiple network interfaces, as a router should. (This may or may not be relevant to the hardware decision, though; read on.) A router needs a minimum of two interfaces by definition: one for each network it routes between, so at the minimum, one for the LAN and one for the WAN. The scenario I had in mind was more complicated, with two separate LANs (one for my family and one for guests who just want to get their tablet on the internet), and leaving room for the possibility of multiple internet providers, so I’d need at least 3 interfaces, with the option to expand to 4 or more in the future. Now, these don’t all have to be physical interfaces built into the router. If the router has USB ports, you can add more that way; also if you have other physical infrastructure supporting VLANs, you can multiplex several networks over one physical port. (Again as a comparison to OpenWrt: standard consumer routers that OpenWrt runs on tend to have 5 ports and 2 network interfaces; one network interface is connected directly to one port labeled WAN, and the other interface is connected through a switch chip to the other 4 ports, which by default are bridged onto a single VLAN but which can be configured for 4 separate VLANs if that floats your boat.)
After getting some advice from friends and discussing it ad nauseum, I ended up buying a fit-pc2i, notable because it’s a standard x86 PC (so I can choose really any standard Linux distribution or even Windows to run on it), in a tiny passively cooled case, drawing 6W, and with 2 physical network interfaces. (I didn’t like the idea of depending on a bunch of USB network adapters, and I wasn’t sure I could rely only on VLAN support to get extra ports, so I wanted a 2nd port for insurance. Now that I’ve used it, I think a single reliable physical network interface + VLAN support would work out fine.) Those 2 network ports are not enough for my scenario, so I also bought a Cisco SG200-08 switch, which I use solely to add ports, turning 1 into 8.
Having made these decisions, I bought the fit-pc2i and SG200, installed Linux (Ubuntu 11.10 Server) and dnsmasq and shorewall, configured VLANs between the router and switch so that various switch ports acted like they were connected to additional eth1.x interfaces in the router, and started testing things. It worked fine until I tried a speed test (from a client connected through the new router which was connected behind my old router); the speed test promptly hung from the client’s point of view, and I couldn’t access the new router over the network at all. I power cycled the new router, tried again, same result. I poked around log files, tried to enable the Linux NMI watchdog, and generally looked for clues without finding anything until I visited the fit-pc forums and read “solution for freezes when scp/ftp/nfs with most Linux dist”. This pointed the blame squarely at the Realtek network interfaces, and suggested an alternate driver as a solution. Once I started investigating fixes for this, I got really pessimistic at first: a Google query for “r8169 freeze” shows a dismaying number of hits, many in distribution-specific bug reports going back years and years. I’d been under the assumption that networking is Linux’s lifeblood and that wired networking has long been a solved problem — wireless network hardware flaky under Linux, sure, any network hardware flaky under Windows, sure, but wired network hardware flaky under Linux? That was a rude surprise.
Long story shorter, the in-tree driver (open source and provided with Linux kernels) for this class of Realtek hardware is named r8169. It actually supports a family of Realtek chips named RTL8111/RTL8168, of which there are apparently many variants with important programming differences even inside the same PCI ID, so using lspci won’t necessarily tell you enough about which one you have. Realtek also has their own driver, also ostensibly open source but not included in the standard Linux kernel, called r8168. For years now, you can find blog posts saying “I had such and such a problem with r8169 and I switched to r8168 and it worked better.” So naturally, I tried r8168, and found it didn’t work at all. Upon further investigation, it has completely broken VLAN support (at least on my hardware, in the 8.027 driver that was current at the time, in the phase of the moon that obtained at the time): on a non-virtual interface it worked fine (and without freezing the kernel); frames that should have an 802.1Q tag added or removed had it done incorrectly, and would either (outgoing) get ignored by the switch, or (incoming) get ignored by the kernel. After spending hours running 3 instances of tcpdump (on the fitpc on the raw interface, on the fitpc on the virtual interface, and on a separate machine plugged into a switch port on the SG200 mapped to the same VLAN), I could characterize the problem: outgoing frames were transmitted with no tag, and get dropped by the switch. Incoming frames with a tag actually had it stripped and were dispatched properly. I found out about “ethtool -K” to control hardware acceleration of VLAN tagging (does this really benefit from hardware acceleration? More than it loses from the possibility of someone screwing it up?), disabled VLAN tag hardware acceleration in both directions, and found the opposite problem. Just by luck, at this point I re-enabled hardware acceleration for VLAN tagging only on the RX path, and things started working. But only on certain ports.
As a recap of what I found to be broken with r8169 and 802.1Q: as the driver loads by default, it improperly tags packets on the TX path. If I use “ethtool -K txvlan off”, TX works but RX packets are ignored. If I use “ethtool -K txvlan off rxvlan off” followed by “ethtool -K txvlan off rxvlan on”, TX and RX both work, but flakily — some ports and protocols work, some don’t, and I don’t know why but I’ve spent too much time staring at packet traces and I don’t care any more. The driver is broken out of the box, can be made to almost work by enabling and disabling VLAN tag acceleration in the right order through an order-dependent set of transitions reminiscent of port knocking, but still doesn’t entirely work, and I’m not going to trust it.
Then, back in r8169-land, I found an Ubuntu bug report, Network problem with the r8169 driver and RTL8111/8168B, in response to which people said the 3.1 kernel driver seems to work better than the 3.0 kernel driver, and Leann Ogasawara produced a 3.0 kernel with the 3.1 r8169 driver grafted in for people to try. So I tried it, and: lo and behold, while my repro scenario would still provoke a nasty warning and stack trace in system.log, there was no freeze.
At this point, I reported my findings to both the r8168 maintainers (Realtek) and the r8169 drivers (Linux netdev mailing list and Francois Romieu). Realtek didn’t respond at all. Francois did reply, saying he’d been fixing a bunch of problems in this area recently, and the 3.2 driver should work even better (this was last December, in the final throes of the 3.2 kernel release). I grabbed a 3.2RC7 kernel, installed it under the Ubuntu 11.10 install I was using, and it worked fine. No warnings, no backtraces, no freezes.
I haven’t touched the configuration since; after another week or so of testing I installed the fitpc + Ubuntu 11.10 + the 3.2RC7 kernel as our main router, and we haven’t had any problems with it. Hopefully, the Ubuntu 12.04 release (which already uses a 3.2 kernel) will install and run fine, and I won’t have to worry about this for another another 5 years since 12.04 is an LTS release.
Lessons learned here:
- VLANs are cool, and I don’t really need more than one physical interface on the router the way I’m using it. I recommend the VLAN + separate switch as port splitter technique. But you do want a gigabit network interface if you’re going to do that.
- Realtek was the bane of my existence for a few weeks in December. It looks like I just had bad timing, and if I’d done the same setup in April using Ubuntu 12.04 with a Linux 3.2 kernel I wouldn’t have had to learn any of this r8168/r8169 business. But given the history, I wouldn’t recommend their products. I went so far as to reconsider the whole fitpc choice. But in this form factor, it seems all the alternatives (including other fitpc products) use Realtec NICs.
- shorewall makes configuration of NAT routing, firewalling, and traffic shaping much easier, in my opinion, than raw iptables and tc.
- Aside from dealing with the Realtek issue, this was less of a time suck than I was expecting.
- Including dealing with the Realtek issue, this was more of a time suck than I was expecting.
- I’m happy with the result, though. Treating the fitpc-2i and SG200 as one unit, I have something that’s about the size and power consumption of the OpenWrt router, except now it’s got a 1 GHz x86 CPU, 1GB of RAM, 32GB of flash storage, 9 individually addressable network ports, and is still entirely solid state. Those hardware specs only matter inasmuchas they give me plenty of breathing room for future expansion (I don’t think my actual usage was taxing the much-lower-speced OpenWrt router), but the real bonus is it’s stable: OpenVPN, dnsmasq and miniupnpd are all behaving as they ought to.