The Virtualization Micro Conference was held at the Linux Plumbers Conference 2010 on Thursday, November 4th. We had a good group of people and many useful discussions. The slides from the presentations, as well as the notes taken during the discussions, are available below.
The track focused on general Linux virtualization; it was not limited to a specific hypervisor. In addition, there were a couple of virtualization-related presentations in the main Linux Plumbers presentation track. Please see the main schedule for information on those presentations.
The schedule for the Virtualization Micro Conference at Linux Plumbers 2010 was as follows:
by Kevin O'Connor
Design dominated by 16 bit BIOS interfaces
Motivation: bring Windows/etc. support to Coreboot
Bochs BIOS sucks
Bochs BIOS provides a useful indication of what BIOS calls actually matter
Debug: -chardev stdio,id=foo -device isa-debugcon,chardev=foo
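For example, a full invocation might look like this (a sketch; iobase 0x402 is the I/O port SeaBIOS writes its debug output to, and the disk image name is just a placeholder):
  qemu -hda guest.img -chardev stdio,id=foo -device isa-debugcon,iobase=0x402,chardev=foo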
SeaBIOS: converts a lot of Bochs BIOS assembly code to C
As much code as possible is written in 32-bit mode
BIOS fidelity matters when dealing with foreign option ROMs (PCI passthrough)
Q: Does virtual time impact BIOS?
A: at a high frequency, time is unreliable on bare metal too
Q: sleep state support
A: basic APM
Next Steps:
by Khoa Huynh and Stefan Hajnoczi
Storage subsystem; 30k IOPS
8 x 24-disk arrays
Perf is around 85% of native
Reaching 27k IOPS within a guest
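(Not from the talk: one hypothetical way to measure random-read IOPS from inside a guest is fio against a virtio disk, e.g. /dev/vdb; the device name and job parameters below are assumptions, not the presenters' setup.)
  fio --name=randread --filename=/dev/vdb --rw=randread --bs=4k --direct=1 --ioengine=libaio --iodepth=32 --runtime=60 --time_based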
by Stefano Stabellini and Anthony Perard
map cache; domain-0 prefers to run as 32-bit; qemu cannot map all of the guest memory at once
pv drivers totally avoid qemu
stub domains; break the domain-0 bottleneck by running qemu in a small guest
double scheduling problem: xen schedules domain-0, domain-0 has to schedule the right qemu
PCI passthrough
KVM is rewriting its device passthrough interface to use VFIO
potential area to collaborate
xen migration: treats qemu save data as an opaque blob (just another device state)
by Alexander Graf
Xenner can run Xen pv guests in KVM
Originally created by Gerd Hoffmann (kraxel)
Makes migration Xen → KVM easy
Xenner differences to Xen architecture:
Xenstored, domain builder, blkback, netback all live inside QEMU (no Dom0)
Xenner has a guest kernel mode stub, hypercall shim that translates to QEMU emulated I/O
DomU runs in guest user mode
Pure TCG QEMU can be used to run Xenner guests, KVM not necessary
pv clock is passed through from KVM to Xenner guest, not emulated today by QEMU
Discussion:
Should all the xen infrastructure be inside QEMU or should it be separate?
Cross-platform discussion: cross-platform Xen is not realistic (use-case missing?)
Xenner on Xen HVM scenario
Alex suggests sharing code for xenstored, Stefano is for reusing code or converting blk/netback to virtio inside Xenner guest code
Anthony suggests basic xenstored inside Xenner guest code, also do virtio-blk/net PCI instead of Xen blk/net to QEMU
by Andre Przywara
Non-Uniform Memory Access (NUMA) architectures present performance challenges
NUMA should be invisible, but can easily hurt performance
ACPI tables describe memory topology
If you exceed a node's resources in your guest, then performance suffers
QEMU supports guest memory topology (command-line option) but no binding on host
Xen patches still under development
Q: How does hotplug work with NUMA?
A: NUMA config is static, ACPI tables must account for all possible configs.
Not sure what current level of support is, although ACPI probably supports this.
lmbench unpinned shows high variability in benchmark results
With numactl, results are much better and more predictable (see the example below)
Avi mentions that initial placement causes memory to be allocated on a node early on, but memory migration would be needed to move it later
Pinning is mentioned as a bad solution for more dynamic workloads (versus fixed HPC workloads)
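For illustration (not the exact commands from the talk): the guest-visible topology can be described with QEMU's -numa option, while host-side placement has to be done with external tools such as numactl; the sizes and node numbers below are arbitrary.
  qemu -smp 4 -m 4096 -numa node,cpus=0-1,mem=2048 -numa node,cpus=2-3,mem=2048 ...
  numactl --cpunodebind=0 --membind=0 qemu ...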
Discussion items:
Handle NUMA inside QEMU or use external tools?
QEMU normally leverages the Linux kernel and tools; Anthony suggests using external tools
Memory inside QEMU may not be easily visible/accessible to external code
by Jeremy Fitzhardinge
Introduced by Thomas for virtualized envs.
Old-style locks: byte locks
Ticket locks (2.6.24 in general Linux code)
Lock-holder preemption issues in virtualized environments make guests slower
Ticket locks in a virtualized environment have lock-claim scheduling issues (can have a big impact - 90% of time spent spinning on ticket locks)
Old-style Xen PV spinlocks use byte locks with enhancements to kick other VCPUs that are waiting for the spinlock. Drawbacks: bigger spinlock_t structure; some performance hit on bare metal (on some architectures); different from the generic Linux spinlock implementation.
New: use the generic ticket lock (present since ~2.6.24) and enhance the slow path with VCPU kicking. An extra enhancement tracks the exact number of waiters in the slow path (drawbacks: expands the spinlock_t struct further, plus some bit lookup).
Q: "Yield the VCPU instead of kicking?" "Q:Influence the hypervisor
to give the other guest waiting for the spinlock the turn."
A: Keep it simple.
Q: "Influence the hypervisor scheduler with spinlock data."
A: Impact unclear.
Q: "Dynamic spinlocks."
A: Linux upstream will throw it out.
Q:" How much benefit do we get for VCPU kicking when we get notified from hardware due to PAUSE-lock detection"
A: Avi, extremely much.
Q: "What if no waiting count and just kick all VCPUs." ?
A: Back to lock-claim scheduling issue.
A :"Fairness for spinlocks (generic). Lock value from 0x100 to diff."
A: Under Xen either one byte-lock or ticketlock would potentially.
The problem is more of the generic spinlocks.
A:"spinlock data exposed to hypevisor?"
A: "Xen does use it (but does not have the owner). But rare case. Spinlock
are used for short locks."
A: "per_cpu(); kick vcpu; then pause."
A: With multiple spinlocks no good.
A: "pSereis, 390, do they have different virtualized spinlock
implementations: "
A: .. run out of time.
by Dan Magenheimer
Memory is being used in different ways, and it is also getting more power-hungry.
With all that memory available, OSes grab all of it and get fat. We want a continuous way to slim OSes down and give memory to other OSes that need it, while also keeping performance better than it is now.
Step 1: the OS needs to be able to give memory back. Different solutions:
Partitioning
Host Swapping (think swap in virtualized env.)
Page sharing (KSM) - not used in generic cases, but in the cloud space (1000s of guests) - but it takes time.
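(For reference, a sketch of how KSM is typically enabled on a KVM host via sysfs; this was not covered in detail in the talk.)
  echo 1 > /sys/kernel/mm/ksm/run            # start the KSM scanner
  cat /sys/kernel/mm/ksm/pages_sharing       # how many pages are currently being shared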
Q: "OS when asked how much memory, always say ALL. How do you know
which memory it actually does not need?
A: Answered later in slides.
Solution 2B:
The OS can give memory back (balloon driver, virtual hot-plug memory). The real problem (how much memory does the OS actually need?) is not solved; these are just mechanisms.
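(Illustration only: with a virtio balloon device, the host can ask a guest to give memory back through the QEMU monitor; the 512 MB target below is arbitrary.)
  qemu -m 1024 -balloon virtio ...
  (qemu) balloon 512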
Solution C?:
Policies (Citrix Memory Control, the KVM Memory Overcommitment Manager). These don't have input from the OS.
Solution D:
Self-ballooning (a feedback system in the OS). Issues: refaults, additional disk writes, OOMs.
Q: "OS whichh are hostile can use this. It does not play nice with
other guests?"
A: "Later in slides."
Q: The Linux kernel has something similar to this, called the page cache. Why have this?
A: "Because we are in a virtualized environment. The performance numbers will explain it."
Q: "Why not have this in Linux baremetal?"
A:"Looking at it now."
Q: "Why would a guest participate?"
A: They might not, but web-hosting providers very well might. Amazon EC2 uses a similar pricing model to charge more for overcommit and give a discount for under-utilization.
Q: "Some vendors don't want this b/c they depend on undercommited guests. " ..
Q: "Compression of cache data, and dynamic page cache across all
guests. With these two things why can't it be solved in hypervisor?
A: Various people have tried to do it and it did not work. We found important places in the Linux kernel (along with Chris Mason) that can benefit from this, and it works.
Q: "Benchmarks with KVM, strong cap on page cache. Similar numbers."
A: There is work on swap compression in the Linux kernel (2.6.36 time-frame, posted by N.).
Q: "Why not implement a cache in blkback?
A: But it does not work with DIRECT_IO, so doing this in Xen blkback can
not be done.
Q: "Large guest, what about this impact? "
A: Not done.
by Leonid Grossman
Existing trends: everybody wants to virtualize. Memory and CPU: done.
I/O: much harder, and it brings interesting challenges.
Solutions:
software para-virtualized storage
SR-IOV hardware (hard to migrate)
Virtual I/O has existed for years (enterprise MPIO).
Paravirtualized I/O: CPU utilization goes up, and it lacks SLAs, isolation, and hardware features.
SLA is important for multiple guests, but with paravirtualized I/Os everybody shares the pipe.
Solutions: separate the traffic (network on one NIC, spread guest usage across other NICs), but there are not enough PCI slots and adapters.
Hardware vendors' solution: SR-IOV or PCI passthrough (Direct Hardware Access) of multifunction adapters (see the sketch below). Due to its newness, SR-IOV is not supported by Microsoft and VMware.
ESX MF IOV: PCI passthrough of a multifunction adapter to guests to complement the paravirtualized networking. One function can be passed to a guest instead of the whole PCI device. Performance numbers are close to native with PCI passthrough, and you get the bells and whistles of hardware features (IPsec, GOS, iSCSI offload).
The hypervisor elects which guests are privileged.
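(A rough sketch of what this looks like on a KVM host of that era; the max_vfs parameter is specific to certain Intel drivers, and the PCI address is made up.)
  modprobe ixgbe max_vfs=7                   # create SR-IOV virtual functions
  qemu -device pci-assign,host=01:10.0 ...   # assign one VF to a guest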
IEEE Virtual Bridging work aims to define a spec which can set QoS, etc. on ports (VFs in SR-IOV).
SR-IOV will be supported on PCIe Gen 3, with Virtual Ethernet Bridging supported. The multifunction adapters have switch chipsets to re-route traffic on the PCIe card instead of having to go out to an external switch. This can give 60+ Gbps.
Some PCIe cards are not that useful for virtualization. An example is iSCSI TOE cards.
Solutions: move offloading
Q: "Security? Say pass in firewire card to sniff other memory cards. "
A: Qualification by vendor to discourage this.
by Dhaval Giani
Want QoS for throughput and latency.
Graph (iperf, 2 VMs, UP, with different workloads): CPU usage vs. network throughput (MB/s).
Issue: two VMs running at the same time, one network-intensive, one compute-intensive. The CPU-intensive guest affects the network-intensive VM (by 10%).
Tried V-Bus. Failed (not stable).
Solution: a separate network device exclusive to the guest. QEMU has a tap device that does this (see the example below). But it is not done in different contexts, nor are there multiple CPU workqueues.
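(A sketch of giving a guest its own tap device, optionally backed by vhost-net; the interface and id names are made up, and vhost=on assumes vhost-net support in the host kernel.)
  qemu -netdev tap,id=net0,ifname=tap0,script=no,downscript=no,vhost=on -device virtio-net-pci,netdev=net0 ...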
A: "Is this an accounting problem." No, we are trying to isolate the performance.
Q: It did not work as expected, for an unknown reason. Any ideas? Would vhost with separate threads help? vbus did the same stuff. Tried raising the softirq priority; no luck.
A: There is common Linux code that deals with the packet. If you take netfilter and the bridge firewall out, does that not make it faster?
A: No idea. But it might not be feasible to take all of this out.
Q: Why does netfilter not work?
A: ".." (note taker did not hear the answer).
Q: Have you tried separate NICs to separate the traffic?
A: "We did not have two NICs, and the problem space was to use one and be better at giving data to the guest."
Q: Is the basic problem the length of time it takes to get data to the guest?
A: Yes, that is what we want to fix, to have better QoS.
Q: Latency with ICMP decreased over time?
A: The CPU cache is warmed up and we get the benefit of the cache path, which gives better isolation.
Increasing responsiveness means more CPU utilization (accounting, etc).
Solution #3: a hint from the NIC about which virtual MAC the packet is for; the MAC is already in there. Perhaps the final answer is threaded vhost. Nobody is sure whether the patches are per-device or per-CPU; it looks as if a per-device thread would solve this issue.
One thread per VCPU for NUMA performance. The benefit is to have _all_ VCPUs transmit/receive in parallel. The solutions that exist are not generic enough.
by Joerg Roedel
Status: 2.6.31 initial merge, then bug fixes and performance improvements; 2.6.37-rc1 emulates Nested Paging.
Supported hypervisors: everything except VMware (there is some code to make this work; it needs revisiting).
Costs: thousands of extra cycles, 5-10x that of the non-emulated case.
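(For reference, a sketch of how nested SVM is typically exercised on an AMD host; the nested module parameter and the +svm CPU flag are the standard KVM/QEMU knobs, the rest is assumed.)
  modprobe kvm-amd nested=1                  # allow guests to use SVM themselves
  qemu -enable-kvm -cpu qemu64,+svm ...      # expose SVM to the first-level guest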
Benchmarks: 1 VM, 1 VCPU, Windows XP in KVM + Nested Paging. PCMark05 numbers: HDD (IDE) is 50% slower, memory is close to non-emulated (throughput numbers). CPU varies by workload, but is still close to non-emulated (decompression is the big hitter).
But random memory latency shows a 20→80% performance drop (4KB→16MB).
These numbers should not be so bad. Perhaps the warm-up of memory should be done on _all_ memory regions.
To find the problem, look at Nested Paging faults...
HDD benchmark results vary: 0%→80% performance depending on the workload (virus scan is horrible).
VirtIO vs. IDE combinations: virtio on virtio is slow (emulated in QEMU); it gets half the throughput of virtio or IDE non-emulated (drive options sketched below).
Perhaps alignment issues are creeping in? More I/O because 4KB requests get broken up?
A strange beast: run KVM, then QEMU with Xen (the second guest is a Xen Dom0), and try that.
Perhaps try with the VirtIO PCI Windows driver; that might solve the 4KB I/O splitting issue?
VirtIO drivers are unsigned for Windows 7 (hacks exist) and XP on Fedora (virt package?)
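(For context, the two disk configurations being compared are selected roughly like this; the image name is a placeholder.)
  qemu -drive file=disk.img,if=virtio ...    # paravirtual virtio-blk disk
  qemu -drive file=disk.img,if=ide ...       # emulated IDE disk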
-
Folks are getting tired by now so not quite lucid.
Benchmark (kernel compile on SMP): KVM running Linux, then KVM running Linux inside that, then compile. Various workloads (4, 2, 1 CPUs). Performance drops 20%, but that is not tragic. Surprisingly, when scaling (2, 4, 1 CPUs) the performance ratio between the different guests stays the same.