I am using a dedicated Linux server as a virtualization hypervisor. VirtualBox is a free, multi-platform type 2 user-mode hypervisor that works well with many different guest operating systems. One problem though was a less than acceptable disk IO performance in the guest. The server has a RAID5 of three off-the-shelf 7.200 rpm HDs, and should deliver around 100 MB/s with sequential reading. What I saw was erratic behavior, from high transfer rates (especially when reading non-allocated blocks) down to very low ones.
The default and pretty much standard method for creating a virtual hard disk is using a dynamic image. Dynamic images have a size attribute, and this is how much space they report to the guest OS. However, they start as small files and only get larger when the guest actually allocates, i.e. writes, a block. This creates a number of problems:
- The disk image has to maintain a table or tree of allocated blocks, because allocations can occur anywhere at the beginning, the middle or the end of the emulated block device.
- As an operating system is installed, the image will rapidly grow to more than 10 GB, but in small increments. In a real world file system, growing a large file in small chunks will inevitably lead to fragmentation of the file.
- Having only small files leads to having a total of virtual hard disks much larger than the available physical space.
The first problem is partially solved by allocating large chunks and using an efficient algorithm to find the association between the emulated physical block device and the disk file. However, as this is implementation specific, we don't have a lot of control about how this works. It's basically a simple file system inside a file.
The second problem is much worse. Given several guest systems running in parallel, fragmentation will occur, especially on file systems with little space. Because the files can get pretty big, defragmentation is problematic, as it produces a lot of IO and requires a continuous free space slot the size of the disk file.
[root@hypervisor ROOT1]# xfs_bmap ROOT1.vdi
0: [0..59391]: 4280160..4339551
1: [59392..175103]: 4841920..4957631
2: [175104..401023]: 5656128..5882047
3: [401024..856831]: 39002944..39458751
4: [856832..1772415]: 50868288..51783871
226: [319714816..319991295]: 914358336..914634815
227: [319991296..322039295]: 916455488..918503487
228: [322039296..322094591]: 918552640..918607935
229: [322094592..323094015]: 920649792..921649215
230: [323094016..323804671]: 922746944..923457599
231: [323804672..324185511]: 924844096..925224935
To sum up what is happening when a guest tries to access a file: the guest has to access it's own file system trees to find the physical location of the file on the emulated disk. We'll simply assume that this incurs no overhead. Now that the guest knows the location, it will issue read commands to the emulated hard disk controller. The hypervisor now has to check it's internal entries in the virtual disk image file to locate the position where the requested data blocks are stored. Then the hypervisor file system has to locate the real physical position on the hard drive. As we know, a mechanical hard disk usually doesn't deliver more than 100 IOPS due to it's disk access times of about 10 ms. Under the right circumstances, even a sequential read by the guest, which would be pretty fast on a physical machine, will lead to dozens of different places being accessed.
Further problems arise if you try to mitigate the situation. You might be inclined to defragment the guest file system. But not only is this a very slow process, but because of the nature of the dynamic disk image, it won't lead to large files being in a contiguous space on the physical disk. You can also try to defragment the host file system, but still, this produces a high IO load, and due to the nature of the dynamic disk image, still no contiguous blocks where the guest would expect them.
The third problem doesn't seem like a problem at all. It is some kind of disk over-commitment usually deployed with RAM or with sparse files. However, there is a very distinct difference between memory over-commitment and dynamic images: unused space can never be reclaimed in dynamic images. That is, almost: images can be compacted. However, most file systems will only delete nodes in their file system structure, but will usually not touch the data itself. On the other side, the hypervisor doesn't know anything about the file system of the guest, so it cannot reclaim that unused space. In order for the host to be able to reclaim the space, the guest has to blank out data blocks it doesn't need anymore. This is usually done by first defragmenting on the guest, and then writing a big file full of zeroes until the disk is full. However, this is NOT recommended for production systems, as a full or nearly full hard disk is the worst nightmare for any kind of server, be it database, web or file server. At least shutdown services that might try to allocate memory from the disk. The whole process also has to be supervised the whole time and takes some time, especially if the dynamic disk has a large maximum size. With a virtual disk image, it will also lead to the image file growing to the maximum size on the host, causing further fragmentation on the host. A more intricate way would be to mount the virtual disks on the hypervisor or in a separate recovery OS and have a file system aware tool zero-out the unused space, however this requires the guest to be shut down. It would also be possible to do this online, but even the Microsoft tool sdelete simply chooses to fill up the disk with a useless file. Guest extensions could theoretically signal the unused space to the host, in the same fashion that TRIM signals unused space to SSDs. But never fill up your hard disks, especially not C:\ and especially not on a server, especially not a production system. After unused space has been cleared, the disk image can be compacted offline, meaning further downtime. The process is also very lengthy, but usually ends in a very compact image, bringing us back to where we started: an ever growing, fragmentation-prone file system in a file system in a file system. I will not document this process as it is documented elsewhere and in my opinion a big waste of time, just to save a few GBs that will get filled up again eventually.
First solution: fixed-sized images
To overcome the problems, I tried to convert the dynamic images to fixed-sized ones. You can use VBoxManage to convert between dynamic and fixed-size, however be advised that the process cannot easily be aborted, as VBoxSVC does all the work, and Ctrl-C'ing the VBoxManage-instance doesn't do anything at all.
VBoxManage clonehd [ old-fixed-VDI ] [ new-dynamic-VDI ] --variant Standard
VBoxManage clonehd [ old-dynamic-VDI ] [ new-fixed-VDI ] --variant Fixed
This has to happen offline, so again, downtime. It's lengthy, it requires you to have at least as much free space available as the fixed-size image will take, and again leads to fragmentation. If you get the resulting fixed-size image down to an acceptable degree of fragmentation, no further fragmentation will occur, the overhead of the dynamic image will be gone, and there are good chances that contiguous blocks of data in the guest will be represented by contiguous blocks in the host file system, so unnecessary disk seeking will be avoided. In my experiments however, xfs_bmap reported hundreds of fragments for the fixed-size disk file, and IO performance still wasn't anywhere near I wanted it to be. To sum it up, it was just a large waste of time, and with large I mean several hours. You basically now have the worst of both worlds: an immensely large image file, still fragmentation, no way for over-commitment and still no real increase in performance, as each disk access in the guest still has to go through the host file system.
Second solution: enter LVM
As I had a lot of unused space, and the host was set up with LVM, I tried the raw disk approach. You basically copy the raw contents to a volume, and create a proxy image file which redirects to this raw volume. You first have to convert the disk image, dynamic or fixed-size doesn't matter, to a raw file:
VBoxManage internalcommands converttoraw file.vdi file.raw
Then you set up a new logical volume in LVM with at least the size of the raw file. After dd'ing the contents from the raw file over to the volume, the proxy image file is created:
dd if=file.raw of=/dev/mapper/vgXY-ABC
VBoxManage internalcommands createrawvmdk -filename ~/rawdisk.vdi -rawdisk /dev/mapper/vgXY-ABC
There is one caveat: VirtualBox and it's instances usually run as non-root, so you need to chown/chgrp the block device representing your logical volume before this works. You also have to run all the above commands either as root or with sudo.
As predicted, disk operations by the guest now much more behave like it would operate on a physical disk. You still have the option to grow the disk with LVM, with minimal fragmentation, and the erratic behavior is gone. Average read speads of 100 MB/s are normal. Instances boot faster and are more responsive with concurrent load.
As the disk is now hosted on an LVM volume, we can do some neat tricks. One technique to backup a virtual machine online is by using VirtualBox snapshots aka states. This technique is not recommended, because while the system itself might be in a consistent state, including the RAM content, it will not account for client connections. A restored instance could have incomplete or still-locked files and other issues. Dynamic images with multiple and possibly nested states/snapshots will introduce additional overhead. I also had situations where a saved state could not be restored due to changes in the host. Cold-starting the instance will then be like having pulled the plug while the machine was running, which can lead to data and/or file system corruption.
Strategies for redundancy and reliability are clustering and online backups – usually because large-scale backups cannot be restored in reasonable time. However, for a full backup, and if a short downtime can be accommodated, one way is to simply shutdown (or hibernate if you want to save the memory state) the guest, create a LVM volume snapshot, then copy the snapshot to your backup media, after which the snapshot can be removed. Creating a LVM snapshot while the guest is running is possible, but can lead to data and/or file system corruption if outstanding caches have not been flushed to disk.
IO performance could be significantly increased by using a raw LVM volume instead of a dynamic image. As the images already had grown to nearly their maximum size due to usage, not much space is wasted. However, the process was lengthy, so I advice to create production instances directly with raw volumes, and use dynamic images only for testing and when moving, duplicating, etc. is likely, as smaller files incur less time. Another side-effect is the fact that file system corruption in the host doesn't affect the guest volumes, as they don't exist as (fragmented) files anymore. The raw volumes usually can be restored with the LVM tool suite and then recreated on the same or a different machine, as long as the partition table and the LVM configuration is intact.
One problem still is the fact that after restarting the host computer, LVM will give the default root/root permissions to the block devices, ending in VirtualBox not being able to access them. I have yet to decide how to resolve this problem, either by having permanent ownership, by having a script changing ownership at boot time or by giving the vbox user account more rights. But because an unattended shutdown/reboot is currently not possible (it would require the hypervisor sending ACPI requests to the guest, then waiting for shutdown, which can take minutes, especially if all instances are to be shut down simultaneously, then allowing the host to shutdown, and then restart the instances when the machine comes back up), this isn't an issue now. I usually shutdown all instances manually, and then the hypervisor. A UPS for the hypervisor is advised because pulling the plug on XFS-based file systems with write cache enabled will usually lead to data loss.
Because drive space is cheap, I would at least stick with fixed-sized images or raw volumes for production systems. Both can be enlarged if really necessary, and current trends towards SANs, iSCSI and distributed file systems with transparent sparse file support make dynamic disk images redundant – usually because they substitute the feature, or because virtual machines can access external hard disk space more easily without actually affecting the disk image itself. The only niche where dynamic disk images seem to fit are on capacity-limited SSDs, where seeking time is not an issue any more. But then again, a grown disk image needs considerable space and time to reclaim unused space, so guest operating systems need to be either aware of the virtualization, or user-mode tools have to be developed to help online shrinking.