Streamlined, High-Speed Virtualized Packet I/O

An overview of the performance optimizations.

Looking for Xen examples? Check out the Getting started section for more details.

Architecture


Our goal is to build a multi-tenant, high-performance software middlebox platform on commodity hardware.

To achieve isolation and multi-tenancy, we must rely on hypervisor virtualization, which adds an extra software layer between the hardware and the middlebox software and could potentially hurt throughput or increase delay.

To minimize these effects, paravirtualization is preferable to full virtualization: paravirtualization makes minor changes to the guest OSes, greatly reducing the overheads inherent in full virtualization, such as VM exits or the need for instruction emulation.

We base our work on top of Xen, since its support for paravirtualized modes makes it possible to build a low-delay, high-throughput platform.

High-level architecture overview.

Xen is typically split into a privileged virtual machine or domain called Domain-0 (usually running Linux), and a set of guest domains comprising the users’ virtual machines (also known as DomUs). In addition, Xen includes the notion of a driver domain VM, which hosts the device drivers; in most cases, though, Dom0 acts as the driver domain.

Our platform consists of a fast backend switch, a new netback driver, and corresponding new netfront drivers for MiniOS and Linux.

Xen Network I/O Optimizations


The Xen network I/O pipe has a number of components and mechanisms that add overhead but that are not fundamental to the task of getting packets in and out of VMs. To optimize this, it would be ideal to have a more direct path between the backend NIC and switch on one side and the actual VMs on the other. Conceptually, we would like to directly map ring packet buffers from the device driver or back-end switch all the way into the VMs’ memory space, much like certain fast packet I/O frameworks do between kernel and user space in non-virtualized environments.

More specifically, we replaced the standard but sub-optimal Open vSwitch backend switch with the high-speed VALE switch; this switch exposes per-port ring packet buffers, which are the ones we map into VM memory space. We observe that in our model the VALE switch and the netfront driver transfer packets to each other directly, so that the netback driver becomes a redundant component of the data plane. As a result, we remove it from the pipe, but keep it as a control-plane driver for tasks such as communicating ring buffer addresses (grants) to the netfront driver. Finally, we revamp the netfront driver to map the ring buffers into its memory space.
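
For illustration, the sketch below shows what this ring mapping looks like for a regular (non-virtualized) netmap user: a process registers against a VALE port over /dev/netmap and mmaps the shared region that holds the rings and packet buffers. In our platform the same rings are exported to the guest through Xen grants instead; the port name vale0:vm1 is only an example and error handling is omitted.

    /* Sketch: attaching to a VALE switch port with the standard netmap API.
     * This only illustrates the ring-mapping model; "vale0:vm1" is an
     * example port name and all error handling is omitted. */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <net/if.h>
    #include <net/netmap.h>
    #include <net/netmap_user.h>

    int main(void)
    {
        struct nmreq req;
        int fd = open("/dev/netmap", O_RDWR);

        memset(&req, 0, sizeof(req));
        req.nr_version = NETMAP_API;
        strncpy(req.nr_name, "vale0:vm1", sizeof(req.nr_name) - 1);
        ioctl(fd, NIOCREGIF, &req);               /* attach to the VALE port */

        /* Map the shared region holding the rings and packet buffers. */
        void *mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        struct netmap_if *nifp = NETMAP_IF(mem, req.nr_offset);
        struct netmap_ring *tx = NETMAP_TXRING(nifp, 0);

        printf("tx ring: %u slots, %u-byte buffers\n",
               tx->num_slots, tx->nr_buf_size);
        return 0;
    }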

Backend driver

As mentioned, we redesigned the netback driver to turn it (mostly) into a control-plane-only driver. Our modified driver is in charge of allocating memory for the receive and transmit packet rings and their buffers, and of setting up memory grants for these so that the VM’s netfront driver can map them into its memory space (a sketch of this grant set-up is shown after the list below). We use the Xen store to communicate the rings’ memory grants to the VMs, and use the rings themselves to tell the VM about the ring buffers’ grants; the latter is because the buffer grants are numerous and would overload the Xen store with entries. On the data plane side, the driver is only in charge of:

  • Setting up the kthreads that will handle packet transfers between switch and netfront driver.
  • Proxying event channel notifications between the netfront driver and switch to signal the availability of packets.
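
As an illustration of this control-plane role, here is a minimal sketch, assuming the Linux grant-table and XenStore APIs (gnttab_grant_foreign_access, xenbus_printf), of how a backend can grant a ring page to the guest and advertise the grant reference over the Xen store. The "ring-ref" node name, the share_ring_page helper and the page-to-frame conversion (virt_to_gfn, which differs across kernel versions) are illustrative rather than our exact implementation, and error handling is omitted.

    /* Sketch: grant a backend-allocated ring page to the guest and publish
     * the grant reference via the Xen store. Names such as "ring-ref" are
     * illustrative; error handling is omitted. */
    #include <linux/gfp.h>
    #include <xen/grant_table.h>
    #include <xen/xenbus.h>
    #include <xen/page.h>

    static int share_ring_page(struct xenbus_device *dev)
    {
        void *ring = (void *)get_zeroed_page(GFP_KERNEL);
        int ref;

        /* Allow the frontend domain (dev->otherend_id) to map this page. */
        ref = gnttab_grant_foreign_access(dev->otherend_id,
                                          virt_to_gfn(ring), 0 /* writable */);

        /* Publish the grant reference so the netfront driver can find it. */
        return xenbus_printf(XBT_NIL, dev->nodename, "ring-ref", "%d", ref);
    }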

Note that the driver is no longer involved in actual packet transfer, and we no longer use vifs or OS-specific data structures such as sk_buffs for packet processing. Further, as already suggested by the Xen community, we adopt a 1:1 model for mapping kernel threads to CPU cores; this avoids unfairness issues. Finally, we split event channels: the standard netback driver uses a single event channel (a Xen interrupt) to notify the availability of packets for both transmit and receive. Instead, we implement separate Tx and Rx event channels that can be serviced by different CPU cores.
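
The fragment below sketches this split against the Linux Xen APIs (xenbus_alloc_evtchn, bind_evtchn_to_irqhandler): two event channels are allocated instead of one and bound to separate handlers, whose IRQs can then be pinned to different cores. The handler bodies and names and the setup_split_evtchn helper are placeholders, not our actual driver code.

    /* Sketch: separate Tx and Rx event channels, each with its own handler,
     * so the two directions can be serviced by different CPU cores.
     * Error handling is omitted; the handler bodies are placeholders. */
    #include <linux/errno.h>
    #include <linux/interrupt.h>
    #include <xen/xenbus.h>
    #include <xen/events.h>

    static irqreturn_t tx_interrupt(int irq, void *data) { return IRQ_HANDLED; }
    static irqreturn_t rx_interrupt(int irq, void *data) { return IRQ_HANDLED; }

    static int setup_split_evtchn(struct xenbus_device *dev)
    {
        int tx_evtchn, rx_evtchn, tx_irq, rx_irq;

        /* One event channel per direction instead of a single shared one. */
        xenbus_alloc_evtchn(dev, &tx_evtchn);
        xenbus_alloc_evtchn(dev, &rx_evtchn);

        tx_irq = bind_evtchn_to_irqhandler(tx_evtchn, tx_interrupt,
                                           0, "netback-tx", dev);
        rx_irq = bind_evtchn_to_irqhandler(rx_evtchn, rx_interrupt,
                                           0, "netback-rx", dev);

        /* tx_irq and rx_irq can now be given different CPU affinities. */
        return (tx_irq < 0 || rx_irq < 0) ? -EINVAL : 0;
    }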

Frontend drivers

Considerably redesigning the netback driver meant breaking the corresponding netfront drivers. To fix this, we rewrote the MiniOS and Linux netfront drivers to be compliant with the new back-end implementation. These netfront drivers use netmap rings almost the way a userspace netmap application would. The key difference is synchronization, since we are no longer communicating between user space and the kernel but between domains: transmits are done asynchronously, unless there are no more slots available in the ring. While the backend is processing packets, the netfront driver uses its private part of the ring, ensuring it does not use the same memory region and indices as the backend. The figure below shows the improvement for a Linux guest, where we achieve 10 Gb/s rates for most packet sizes.

Linux performance
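
To make the frontend transmit path more concrete, the sketch below shows the kind of loop run over a netmap TX ring: a frame is copied into the next free slot, the ring’s head and cur pointers are advanced, and the backend is only notified once the ring fills up. The notify_backend() call is a placeholder for kicking the Tx event channel (a userspace netmap application would use poll() or ioctl(NIOCTXSYNC) instead), and the xmit helper is ours, not part of the netmap API.

    /* Sketch: asynchronous transmit over a netmap TX ring. The backend is
     * only notified (event-channel "kick") when the ring runs out of free
     * slots; notify_backend() is a placeholder for that notification. */
    #include <stdint.h>
    #include <string.h>
    #include <net/if.h>
    #include <net/netmap.h>
    #include <net/netmap_user.h>

    extern void notify_backend(void);   /* placeholder: kick Tx event channel */

    static int xmit(struct netmap_ring *ring, const void *frame, uint16_t len)
    {
        if (nm_ring_empty(ring)) {      /* no free slots: ask backend to drain */
            notify_backend();
            return -1;                  /* caller retries later */
        }

        uint32_t i = ring->cur;
        struct netmap_slot *slot = &ring->slot[i];

        /* len is assumed to fit in one buffer (<= ring->nr_buf_size). */
        memcpy(NETMAP_BUF(ring, slot->buf_idx), frame, len);
        slot->len = len;

        /* Publish the slot; no notification needed yet. */
        ring->head = ring->cur = nm_ring_next(ring, i);

        if (nm_ring_space(ring) == 0)   /* ring now full: wake the backend */
            notify_backend();
        return 0;
    }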

MiniOS

MiniOS is a tiny, paravirtualized operating system distributed with the Xen sources, and it forms the basis for our ClickOS VMs. MiniOS has a single address space, so there is no kernel/user-space separation, and a cooperative scheduler, which reduces context-switch costs. MiniOS does not have SMP support, but this does not pose a problem for our platform: our model is to have large numbers of tiny VMs rather than a few large VMs using several CPU cores each.

MiniOS TX performance