Using VirtualBox, I set up a Linux server with the intention of hosting multiple virtual machine instances, in order to make as much use of the server as possible. Networking proved to be a big problem. The ISP allocates up to eight public IP addresses, so each machine can have its own address. There are different ways to make the virtual machines available to the public internet.

Bridged network

This basically works like a virtual Ethernet switch that connects your virtual instances and one designated physical network card together. Each guest has a different MAC address and behaves like a distinct device on the network. Herein lies the problem: my ISP registers the MAC address of the physical network adapter with its network switches, and as soon as the switch sees an Ethernet frame coming from my port with a different MAC address, it will not only discard that frame, but also shut down the port completely and permanently. This is meant to prevent ARP spoofing, and is a reasonable precaution. Anyway, whatever you do, DO NOT connect VirtualBox instances in bridged mode to the public network adapter. This only works in normal Ethernet setups, not in data centers with managed switches and monitoring against ARP spoofing. I got my networking back after a call to support, but because the port was completely disabled, I could only access the server through a serial terminal for which the ISP provides SSH access. Without this way to disable the bridged-mode configuration, the ARP spoofing detection would have kicked in again the minute the port was re-enabled and disabled it once more.

Bridged network №2

The setup is basically the same: the instances are all in bridged mode, connected to the public network adapter, but this time each machine gets the same MAC address as the physical adapter of the hypervisor. This kind of works, because the ISP's switch never sees a foreign MAC address, and the internal VirtualBox architecture simply distributes all incoming frames to all connected virtual instances. However, it still creates some problems:

  • All of the machines' virtual network adapters (virtio, by the way) are effectively in permanent promiscuous mode. Although only IP packets destined for a particular machine are actually picked up, firewalls constantly log dropped packets with the wrong destination address. It also opens up a security hole and reduces performance, because every machine, including the hypervisor itself, inspects the traffic of all the others.
  • Connections between the machines were really slow. I never exactly figured out why, but because the whole setup was so ridiculous (there has to be a reason why each network card usually gets its own unique MAC address), I didn't bother with it. There was also a lot of spurious ARP traffic, probably because the machines couldn't agree on which MAC address belonged to which IP.

NAT

This is the default mode when creating a new VirtualBox instance. I never bothered using it, because in tests it was awfully slow, so slow in fact that it was unusable (this may have changed). It also creates a lot of problems. The basic setup would be to have the hypervisor hold all the public IPs, which raises the first question: which IP should the hypervisor itself use, and which are reserved for your virtual machines? The virtual instances then get private IP addresses on a host-only adapter, and each additional public IP is forwarded to a private one. While this probably works, and the speed issue could be resolved by using iptables instead of the VirtualBox built-in NAT mode, an "identity problem" remains, because none of the virtual instances really knows its real public address. This will seriously sabotage protocols like FTP that need to know their own public IP address.
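For completeness, the iptables variant would look roughly like this. This is only a sketch: [SECOND_PUBLIC_IP] stands for one of the additional public addresses, and 192.168.56.10 is an assumed private address of a guest on the host-only network.

# rewrite the destination of incoming packets, and the source of outgoing replies
iptables -t nat -A PREROUTING -d [SECOND_PUBLIC_IP] -j DNAT --to-destination 192.168.56.10
iptables -t nat -A POSTROUTING -s 192.168.56.10 -j SNAT --to-source [SECOND_PUBLIC_IP]
# packets must be forwarded between the adapters
echo 1 > /proc/sys/net/ipv4/ip_forward

Note that this doesn't fix the identity problem: the guest still only sees its private address.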

Proxy ARP

After a lot of research, I found a way that actually works and behaves well. Each virtual instance is configured with its public IP address and correct DNS and gateway settings, as if it were directly connected to the external network. Each virtual network adapter also has its own unique MAC address. The only difference is that they are connected to a host-only, internal network adapter which can be created with VirtualBox.

Now the machines can talk to each other, but not to the internet, and they cannot be reached from the outside, because nobody knows they are actually there. To mitigate this, the almost ancient proxy_arp is enabled. It lets one host answer ARP requests on behalf of another, effectively impersonating that host's Ethernet interface. Enabling it is simple:

echo 1 > /proc/sys/net/ipv4/conf/eth0/proxy_arp
echo 1 > /proc/sys/net/ipv4/conf/vboxnet0/proxy_arp

Be advised that the host-only networking adapter only comes up after at least one virtual instance has been started and connected to the network. Alternatively, you can force it up with a simple command, which makes it easier to put everything together into a boot script:

ifconfig vboxnet0 [HYPERVISOR_IP] netmask [YOUR_NETMASK] up

Replace [HYPERVISOR_IP] with the primary public IP address of your hypervisor and [YOUR_NETMASK] with its netmask, which will often be 255.255.255.255.

After issuing this command, the hypervisor will answer ARP requests not only for its own address, but also for the addresses it can reach via the other adapter, relaying the traffic back and forth between the two. This way the external router knows that it can reach the additional public addresses through the physical network adapter of the hypervisor.

The only thing left to do is to enable routing and add routes so that IP packets actually get forwarded in both directions. You might also need to tweak your iptables configuration to let the traffic through, as the hypervisor now acts as a transparent Ethernet bridge and a stateful firewall at the same time.

route add [FIRST_VIRTUAL_IP] vboxnet0
route add [SECOND_VIRTUAL_IP] vboxnet0
route add [THIRD_VIRTUAL_IP] vboxnet0
echo 1 > /proc/sys/net/ipv4/ip_forward

As usual, replace the [...] with your actual IPs and add as many routes as you have IP addresses allocated for virtual machines. As an optional step, you can configure each virtual instance with direct routes to the other instances, so as to avoid the round trip to the ISP's gateway for internal traffic. This only improves performance and makes no other difference.
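Since the host-only interface, the proxy_arp flags and the routes are all lost on reboot, it helps to collect everything into the boot script mentioned above. A minimal sketch, using the same placeholders as before:

#!/bin/sh
# bring up the host-only interface with the hypervisor's primary public IP
ifconfig vboxnet0 [HYPERVISOR_IP] netmask [YOUR_NETMASK] up
# answer ARP requests on behalf of the virtual instances
echo 1 > /proc/sys/net/ipv4/conf/eth0/proxy_arp
echo 1 > /proc/sys/net/ipv4/conf/vboxnet0/proxy_arp
# route the additional public IPs to the host-only network and enable forwarding
route add [FIRST_VIRTUAL_IP] vboxnet0
route add [SECOND_VIRTUAL_IP] vboxnet0
route add [THIRD_VIRTUAL_IP] vboxnet0
echo 1 > /proc/sys/net/ipv4/ip_forward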

The only remaining problem was the ISP's preferred IP setup, with netmask 255.255.255.255 and the default gateway at 10.255.255.1. Most Linux servers won't accept this configuration out of the box (I'll write an article on how to make it work), and some firewall products like Microsoft Forefront Threat Management Gateway (TMG) won't even run with it. The hypervisor setup could be changed to give the virtual hosts proper IP addresses and netmasks (i.e. a netmask where the host address and the default gateway share the same subnet).
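For reference, the usual trick on Linux is to add a host route to the gateway before the default route; a rough sketch, assuming eth0 and a placeholder public address (details will go into the mentioned article):

ip addr add [PUBLIC_IP]/32 dev eth0
# make the off-subnet gateway reachable as a directly connected host
ip route add 10.255.255.1 dev eth0
ip route add default via 10.255.255.1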

Virtualization disk storage concerns

Posted on 01 May 2012 at 18:34

I am using a dedicated Linux server as a virtualization hypervisor. VirtualBox is a free, multi-platform type 2 (user-mode) hypervisor that works well with many different guest operating systems. One problem, though, was less than acceptable disk IO performance in the guests. The server has a RAID5 of three off-the-shelf 7,200 rpm hard disks and should deliver around 100 MB/s for sequential reads. What I saw was erratic behavior, from high transfer rates (especially when reading non-allocated blocks) down to very low ones.
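To put rough numbers on this, a quick sequential-read check can be done with hdparm on the host and with dd inside the guest; this is only an illustration, and the device names are placeholders:

# on the host: timed buffered reads straight from the disk device
hdparm -t /dev/sda
# in the guest: read 1 GB directly, bypassing the page cache
dd if=/dev/sda of=/dev/null bs=1M count=1024 iflag=direct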

The default and pretty much standard method for creating a virtual hard disk is using a dynamic image (a VBoxManage sketch follows after the list below). Dynamic images have a size attribute, and this is how much space they report to the guest OS. However, they start as small files and only grow when the guest actually allocates, i.e. writes, a block. This creates a number of problems:

  • The disk image has to maintain a table or tree of allocated blocks, because allocations can occur anywhere: at the beginning, in the middle or at the end of the emulated block device.
  • As an operating system is installed, the image will rapidly grow to more than 10 GB, but in small increments. In a real-world file system, growing a large file in small chunks will inevitably lead to fragmentation of the file.
  • Because the files start small, the total size of all virtual hard disks can easily exceed the available physical space.
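For reference, this is roughly how a dynamic and a fixed-size image are created with VBoxManage; the file names and the size (in MB) are placeholders:

# dynamic image (the default) vs. fixed-size image of 40 GB
VBoxManage createhd --filename disk-dynamic.vdi --size 40960 --variant Standard
VBoxManage createhd --filename disk-fixed.vdi --size 40960 --variant Fixed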

The first problem is partially solved by allocating large chunks and using an efficient algorithm to map the emulated physical block device to locations in the disk file. However, as this is implementation specific, we don't have a lot of control over how it works. It's basically a simple file system inside a file.

The second problem is much worse. With several guest systems running in parallel, fragmentation will occur, especially on file systems with little free space. Because the files can get pretty big, defragmentation is problematic, as it produces a lot of IO and requires a contiguous stretch of free space the size of the disk file. xfs_bmap shows what this looks like for one of my dynamic images:

[root@hypervisor ROOT1]# xfs_bmap ROOT1.vdi
ROOT1.vdi:
0: [0..59391]: 4280160..4339551
1: [59392..175103]: 4841920..4957631
2: [175104..401023]: 5656128..5882047
3: [401024..856831]: 39002944..39458751
4: [856832..1772415]: 50868288..51783871
[...]
226: [319714816..319991295]: 914358336..914634815
227: [319991296..322039295]: 916455488..918503487
228: [322039296..322094591]: 918552640..918607935
229: [322094592..323094015]: 920649792..921649215
230: [323094016..323804671]: 922746944..923457599
231: [323804672..324185511]: 924844096..925224935

To sum up what happens when a guest tries to access a file: the guest has to walk its own file system trees to find the physical location of the file on the emulated disk. We'll simply assume that this incurs no overhead. Now that the guest knows the location, it issues read commands to the emulated hard disk controller. The hypervisor then has to check its internal entries in the virtual disk image file to locate the position where the requested data blocks are stored. Then the hypervisor's file system has to locate the real physical position on the hard drive. As we know, a mechanical hard disk usually doesn't deliver more than 100 IOPS, because its access time of about 10 ms allows for roughly 100 random requests per second. Under the right circumstances, even a sequential read by the guest, which would be pretty fast on a physical machine, will lead to dozens of different places being accessed, and throughput collapses accordingly.

Further problems arise if you try to mitigate the situation. You might be inclined to defragment the guest file system. Not only is this a very slow process; because of the nature of the dynamic disk image, it also won't result in large files occupying contiguous space on the physical disk. You can also try to defragment the host file system, but this again produces a high IO load and still doesn't put contiguous blocks where the guest would expect them.

The third problem doesn't seem like a problem at all. It is a kind of disk over-commitment, as usually deployed with RAM or with sparse files. However, there is a very distinct difference between memory over-commitment and dynamic images: unused space can never be reclaimed in dynamic images. That is, almost: images can be compacted. However, when a file is deleted, most file systems only remove the entries in their file system structures and usually don't touch the data itself. The hypervisor, on the other hand, doesn't know anything about the file system of the guest, so it cannot reclaim that unused space. In order for the host to be able to reclaim the space, the guest has to blank out the data blocks it doesn't need anymore. This is usually done by first defragmenting inside the guest and then writing a big file full of zeroes until the disk is full. However, this is NOT recommended for production systems, as a full or nearly full hard disk is the worst nightmare for any kind of server, be it database, web or file server. At the very least, shut down services that might try to allocate space on the disk. The whole process also has to be supervised and takes quite a while, especially if the dynamic disk has a large maximum size. With a virtual disk image, it will also cause the image file to grow to its maximum size on the host, causing further fragmentation there.

A more intricate way would be to mount the virtual disks on the hypervisor, or in a separate recovery OS, and have a file-system-aware tool zero out the unused space; however, this requires the guest to be shut down. It would also be possible to do this online, but even the Microsoft tool sdelete simply chooses to fill up the disk with a useless file. Guest extensions could theoretically signal the unused space to the host, in the same fashion that TRIM signals unused space to SSDs. But never fill up your hard disks, especially not C:\ and especially not on a server, least of all a production system.

After the unused space has been cleared, the disk image can be compacted offline, meaning further downtime. The process is also very lengthy, but usually ends in a very compact image, bringing us back to where we started: an ever growing, fragmentation-prone file system in a file system in a file system. I will not document this process, as it is documented elsewhere and in my opinion a big waste of time, just to save a few GBs that will get filled up again eventually.

First solution: fixed-sized images

To overcome these problems, I tried to convert the dynamic images to fixed-size ones. You can use VBoxManage to convert between dynamic and fixed-size images; however, be advised that the process cannot easily be aborted, as VBoxSVC does all the work, and Ctrl-C'ing the VBoxManage instance doesn't do anything at all.

VBoxManage clonehd [ old-fixed-VDI ] [ new-dynamic-VDI ] --variant Standard
VBoxManage clonehd [ old-dynamic-VDI ] [ new-fixed-VDI ] --variant Fixed

This has to happen offline, so again, downtime. It's lengthy, it requires at least as much free space as the fixed-size image will occupy, and it again leads to fragmentation. If you do get the resulting fixed-size image down to an acceptable degree of fragmentation, no further fragmentation will occur, the overhead of the dynamic image is gone, and there is a good chance that contiguous blocks of data in the guest are represented by contiguous blocks in the host file system, so unnecessary disk seeking is avoided. In my experiments, however, xfs_bmap reported hundreds of fragments for the fixed-size disk file, and IO performance still wasn't anywhere near where I wanted it to be. To sum it up, it was just a large waste of time, and by large I mean several hours. You basically end up with the worst of both worlds: an immensely large image file, still fragmented, no possibility of over-commitment and still no real increase in performance, as each disk access in the guest still has to go through the host file system.

Second solution: enter LVM

As I had a lot of unused space and the host was set up with LVM, I tried the raw disk approach. You basically copy the raw contents to a logical volume and create a proxy image file which redirects to this raw volume. First you have to convert the disk image (dynamic or fixed-size, it doesn't matter) to a raw file:

VBoxManage internalcommands converttoraw file.vdi file.raw

Then you set up a new logical volume in LVM with at least the size of the raw file. After dd'ing the contents from the raw file over to the volume, the proxy image file is created:

dd if=file.raw of=/dev/mapper/vgXY-ABC
VBoxManage internalcommands createrawvmdk -filename ~/rawdisk.vdi -rawdisk /dev/mapper/vgXY-ABC

There is one caveat: VirtualBox and its instances usually run as non-root, so you need to chown/chgrp the block device representing your logical volume before this works. You also have to run all the above commands either as root or with sudo.
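Putting the pieces together, the whole conversion might look roughly like this. This is only a sketch: the volume group vgXY, the logical volume name ABC, the size and the vbox user/group are placeholders.

VBoxManage internalcommands converttoraw file.vdi file.raw
# create a logical volume at least as large as the raw file
lvcreate -L 40G -n ABC vgXY
dd if=file.raw of=/dev/mapper/vgXY-ABC bs=1M
# create the proxy image that redirects to the raw volume
VBoxManage internalcommands createrawvmdk -filename ~/rawdisk.vdi -rawdisk /dev/mapper/vgXY-ABC
# allow the non-root VirtualBox user to open the block device
chown vbox:vbox /dev/mapper/vgXY-ABC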

The result

As predicted, disk operations in the guest now behave much more as if it were operating on a physical disk. You still have the option to grow the disk with LVM, with minimal fragmentation, and the erratic behavior is gone. Average read speeds of 100 MB/s are normal. Instances boot faster and are more responsive under concurrent load.

As the disk is now hosted on an LVM volume, we can do some neat tricks. One technique to back up a virtual machine online is to use VirtualBox snapshots, aka states. This technique is not recommended, because while the system itself might be in a consistent state, including the RAM contents, it will not account for client connections. A restored instance could have incomplete or still-locked files and other issues. Dynamic images with multiple and possibly nested states/snapshots also introduce additional overhead. I have also had situations where a saved state could not be restored due to changes on the host. Cold-starting the instance is then like having pulled the plug while the machine was running, which can lead to data and/or file system corruption.

Strategies for redundancy and reliability are clustering and online backups – usually because large-scale backups cannot be restored in reasonable time. However, for a full backup, and if a short downtime can be accommodated, one way is to simply shut down the guest (or hibernate it if you want to save the memory state), create an LVM volume snapshot, copy the snapshot to your backup media and then remove the snapshot. Creating an LVM snapshot while the guest is running is possible, but can lead to data and/or file system corruption if outstanding caches have not been flushed to disk.
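A sketch of that backup cycle, reusing the volume names from above and assuming a backup target mounted under /backup:

# shut down or hibernate the guest, then take a snapshot with 5 GB of copy-on-write space;
# the guest can be restarted right afterwards
lvcreate -s -L 5G -n ABC-snap /dev/vgXY/ABC
# copy the frozen state of the volume to the backup media
dd if=/dev/vgXY/ABC-snap of=/backup/ABC.raw bs=1M
# discard the snapshot once the copy is done
lvremove /dev/vgXY/ABC-snap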

Conclusion

IO performance could be increased significantly by using a raw LVM volume instead of a dynamic image. As the images had already grown to nearly their maximum size through use, not much space is wasted. However, the process was lengthy, so I advise creating production instances directly on raw volumes and using dynamic images only for testing, or when moving, duplicating, etc. is likely, as smaller files take less time to handle. Another side effect is that file system corruption on the host doesn't affect the guest volumes, as they no longer exist as (fragmented) files. The raw volumes can usually be recovered with the LVM tool suite and then recreated on the same or a different machine, as long as the partition table and the LVM configuration are intact.

One remaining problem is that after restarting the host computer, LVM gives the block devices the default root/root permissions, leaving VirtualBox unable to access them. I have yet to decide how to resolve this, either by making the ownership permanent, by having a script change the ownership at boot time or by giving the vbox user account more rights. But because an unattended shutdown/reboot is currently not possible anyway (it would require the hypervisor to send ACPI requests to the guests, wait for them to shut down, which can take minutes, especially if all instances shut down simultaneously, then allow the host to shut down, and finally restart the instances when the machine comes back up), this isn't an issue right now. I usually shut down all instances manually, and then the hypervisor. A UPS for the hypervisor is advised, because pulling the plug on XFS-based file systems with the write cache enabled will usually lead to data loss.
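Of the ownership options mentioned above, the boot-time script is probably the simplest; a minimal sketch, assuming a vbox account and the volume from the earlier examples, appended to something like /etc/rc.local:

# restore ownership of the raw volume so VirtualBox can open it after a reboot
chown vbox:vbox /dev/mapper/vgXY-ABC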

Because drive space is cheap, I would at least stick with fixed-size images or raw volumes for production systems. Both can be enlarged if really necessary, and current trends towards SANs, iSCSI and distributed file systems with transparent sparse file support make dynamic disk images redundant – either because they substitute the feature, or because virtual machines can access external disk space more easily without touching the disk image itself. The only niche where dynamic disk images still seem to fit is on capacity-limited SSDs, where seek time is no longer an issue. But then again, a grown disk image needs considerable space and time to reclaim unused space, so guest operating systems either need to be aware of the virtualization, or user-mode tools have to be developed to support online shrinking.