The Virtualization Micro Conference was held at the Linux Plumbers Conference 2010 on Thursday, November 4th. We had a good group of people and many useful discussions. The slides from the presentations, as well as the notes taken during the discussions, are available below.
The track focused on general Linux virtualization; it was not limited to a specific hypervisor. In addition, there were a couple of virtualization-related presentations in the main Linux Plumbers presentation track. Please see the main schedule for information on those presentations.
The schedule for the Virtualization Micro Conference at Linux Plumbers 2010 was as follows:
by Kevin O'Connor
Design dominated by 16 bit BIOS interfaces
Motivation: bring Windows/etc. support to Coreboot
Bochs BIOS sucks
Bochs BIOS provides a useful indication of what BIOS calls actually matter
Debug: -chardev stdio,id=foo -device isa-debugcon,chardev=foo
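For example, a full invocation might look like this (a sketch; iobase 0x402 is the I/O port SeaBIOS writes its debug output to, and the disk image name is just a placeholder):
  qemu -hda guest.img -chardev stdio,id=foo -device isa-debugcon,iobase=0x402,chardev=foo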
SeaBIOS: converts a lot of Bochs BIOS assembly code to C
As much code as possible is written in 32-bit mode
BIOS fidelity matters when dealing with foreign option ROMs (PCI passthrough)
Q: Does virtual time impact BIOS?
A: at a high frequency, time is unreliable on bare metal too
Q: sleep state support
A: basic APM
Next Steps:
by Khoa Huynh and Stefan Hajnoczi
Storage subsystem; 30k IOPS
8 x 24-disk arrays
Perf is around 85% of native
Reaching 27k IOPS within a guest
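(Not from the talk: one hypothetical way to measure random-read IOPS from inside a guest is fio against a virtio disk, e.g. /dev/vdb; the device name and job parameters below are assumptions, not the presenters' setup.)
  fio --name=randread --filename=/dev/vdb --rw=randread --bs=4k --direct=1 --ioengine=libaio --iodepth=32 --runtime=60 --time_based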
by Stefano Stabellini and Anthony Perard
map cache; domain-0 prefers to run as 32-bit; qemu cannot map all of the guest memory at once
pv drivers totally avoid qemu
stub domains; break the domain-0 bottleneck by running qemu in a small guest
double scheduling problem: xen schedules domain-0, domain-0 has to schedule the right qemu
PCI passthrough
KVM is rewriting its device passthrough interface to use VFIO
potential area to collaborate
xen migration: treats qemu save data as an opaque blob (just another device state)
by Alexander Graf
Xenner can run Xen pv guests in KVM
Originally created by Gerd Hoffmann (kraxel)
Makes migration Xen → KVM easy
Xenner differences to Xen architecture:
Xenstored, domain builder, blkback, netback all live inside QEMU (no Dom0)
Xenner has a guest kernel mode stub, hypercall shim that translates to QEMU emulated I/O
DomU runs in guest user mode
Pure TCG QEMU can be used to run Xenner guests, KVM not necessary
pv clock is passed through from KVM to Xenner guest, not emulated today by QEMU
Discussion:
Should all the xen infrastructure be inside QEMU or should it be separate?
Cross-platform discussion: cross-platform Xen is not realistic (use-case missing?)
Xenner on Xen HVM scenario
Alex suggests sharing code for xenstored, Stefano is for reusing code or converting blk/netback to virtio inside Xenner guest code
Anthony suggests basic xenstored inside Xenner guest code, also do virtio-blk/net PCI instead of Xen blk/net to QEMU
by Andre Przywara
Non-Uniform Memory Access (NUMA) architectures present performance challenges
NUMA should be invisible, but can easily hurt performance
ACPI tables describe memory topology
If you exceed a node's resources in your guest, then performance suffers
QEMU supports guest memory topology (command-line option) but no binding on host
Xen patches still under development
Q: How does hotplug work with NUMA?
A: NUMA config is static, ACPI tables must account for all possible configs.
Not sure what current level of support is, although ACPI probably supports this.
lmbench unpinned shows high variability in benchmark results
With numactl, results are much better and more predictable (see the example below)
Avi mentions that initial placement causes memory to be allocated on a node early on, but memory migration would be needed to move it later
Pinning is mentioned as a bad solution for more dynamic workloads (versus fixed HPC workloads)
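For illustration (not the exact commands from the talk): the guest-visible topology can be described with QEMU's -numa option, while host-side placement has to be done with external tools such as numactl; the sizes and node numbers below are arbitrary.
  qemu -smp 4 -m 4096 -numa node,cpus=0-1,mem=2048 -numa node,cpus=2-3,mem=2048 ...
  numactl --cpunodebind=0 --membind=0 qemu ...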
Discussion items:
Handle NUMA inside QEMU or use external tools?
QEMU normally leverages the Linux kernel and tools; Anthony suggests using external tools
Memory inside QEMU may not be easily visible/accessible to external code
by Jeremy Fitzhardinge
Introduced by Thomas for virtualized envs.
Old-style locks: byte locks
Ticket locks (2.6.24 in general Linux code)
Lock-holder preemption issues in virtualized environments make guests slower
Ticket locks in a virtualized environment have lock-claim scheduling issues (can have a big impact - 90% of time spent spinning on ticket locks)
Old-style Xen PV spinlocks use byte locks with enhancements to kick other VCPUs that are waiting for the spinlock. Drawbacks: bigger spinlock_t structure; some performance hit on bare metal (on some architectures); different from the generic Linux spinlock implementation.
New: use the generic ticket lock (present since ~2.6.24) and enhance the slow path with VCPU kicking. An extra enhancement tracks the exact number of waiters in the slow path (drawbacks: expands the spinlock_t struct further, plus some bit lookup).
Q: "Yield the VCPU instead of kicking?" "Q:Influence the hypervisor
to give the other guest waiting for the spinlock the turn."
A: Keep it simple.
Q: "Influence the hypervisor scheduler with spinlock data."
A: Impact unclear.
Q: "Dynamic spinlocks."
A: Linux upstream will throw it out.
Q:" How much benefit do we get for VCPU kicking when we get notified from hardware due to PAUSE-lock detection"
A: Avi, extremely much.
Q: "What if no waiting count and just kick all VCPUs." ?
A: Back to lock-claim scheduling issue.
A :"Fairness for spinlocks (generic). Lock value from 0x100 to diff."
A: Under Xen either one byte-lock or ticketlock would potentially.
The problem is more of the generic spinlocks.
A:"spinlock data exposed to hypevisor?"
A: "Xen does use it (but does not have the owner). But rare case. Spinlock
are used for short locks."
A: "per_cpu(); kick vcpu; then pause."
A: With multiple spinlocks no good.
A: "pSereis, 390, do they have different virtualized spinlock
implementations: "
A: .. run out of time.
by Dan Magenheimer
Memory is being used in different ways, and it is also getting more power-hungry.
With all that memory available, OSes grab all of it and get fat. We want a continuous way to slim OSes down and give memory to other OSes that need it, while also keeping performance better than it is now.
Step 1: the OS needs to be able to give memory back. Different solutions:
Partitioning
Host Swapping (think swap in virtualized env.)
Page sharing (KSM) - not used in generic cases, but in the cloud space (1000s of guests) - but it takes time.
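(For reference, a sketch of how KSM is typically enabled on a KVM host via sysfs; this was not covered in detail in the talk.)
  echo 1 > /sys/kernel/mm/ksm/run            # start the KSM scanner
  cat /sys/kernel/mm/ksm/pages_sharing       # how many pages are currently being shared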
Q: "OS when asked how much memory, always say ALL. How do you know
which memory it actually does not need?
A: Answered later in slides.
Solution 2B:
The OS can give memory back (balloon driver, virtual hot-plug memory). The real problem (how much memory does the OS actually need?) is not solved; these are just mechanisms.
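(Illustration only: with a virtio balloon device, the host can ask a guest to give memory back through the QEMU monitor; the 512 MB target below is arbitrary.)
  qemu -m 1024 -balloon virtio ...
  (qemu) balloon 512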
Solution C?:
Policies (Citrix Memory Control, the KVM Memory Overcommitment Manager). These don't have input from the OS.
Solution D:
Self-ballooning (a feedback system in the OS). Issues: refaults, additional disk writes, OOMs.
Q: "OS whichh are hostile can use this. It does not play nice with
other guests?"
A: "Later in slides."
Q: The Linux kernel has something similar to this, called the page cache. Why have this?
A: "Because we are in a virtualized environment. The performance numbers will explain it."
Q: "Why not have this in Linux baremetal?"
A:"Looking at it now."
Q: "Why would a guest participate?"
A: They might not, but web-hosting providers very well might. Amazon EC2 uses a similar pricing model to charge more for overcommit and give a discount for under-utilization.
Q: "Some vendors don't want this b/c they depend on undercommited guests. " ..
Q: "Compression of cache data, and dynamic page cache across all
guests. With these two things why can't it be solved in hypervisor?
A: Various people have tried to do it and it did not work. We found important places in the Linux kernel (along with Chris Mason) that can benefit from this, and it works.
Q: "Benchmarks with KVM, strong cap on page cache. Similar numbers."
A: There is work on swap compression in the Linux kernel (2.6.36 time-frame, posted by N.).
Q: "Why not implement a cache in blkback?
A: But it does not work with DIRECT_IO, so doing this in Xen blkback can
not be done.
Q: "Large guest, what about this impact? "
A: Not done.
by Leonid Grossman
Existing trends: everybody wants to virtualize. Memory and CPU: done.
I/O: much harder, and it brings interesting challenges.
Solutions:
software para-virtualized storage
SR-IOV hardware (hard to migrate)
Virtual I/O has existed for years (enterprise MPIO).
Paravirtualized I/O: CPU utilization goes up, and it lacks SLAs, isolation, and hardware features.
SLA is important for multiple guests, but with paravirtualized I/Os everybody shares the pipe.
Solutions: separate the traffic (network on one NIC, spread guest usage across other NICs), but there are not enough PCI slots and adapters.
Hardware vendors' solution: SR-IOV or PCI passthrough (Direct Hardware Access) of multifunction adapters (see the sketch below). Due to its newness, SR-IOV is not supported by Microsoft and VMware.
ESX MF IOV: PCI passthrough of a multifunction adapter to guests to complement the paravirtualized networking. One function can be passed to a guest instead of the whole PCI device. Performance numbers are close to native with PCI passthrough, and you get the bells and whistles of hardware features (IPsec, GOS, iSCSI offload).
The hypervisor elects which guests are privileged.
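(A rough sketch of what this looks like on a KVM host of that era; the max_vfs parameter is specific to certain Intel drivers, and the PCI address is made up.)
  modprobe ixgbe max_vfs=7                   # create SR-IOV virtual functions
  qemu -device pci-assign,host=01:10.0 ...   # assign one VF to a guest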
IEEE Virtual Bridging work aims to define a spec which can set QoS, etc. on ports (VFs in SR-IOV).
SR-IOV will be supported on PCIe Gen 3, with Virtual Ethernet Bridging supported. The multifunction adapters have switch chipsets to re-route traffic on the PCIe card instead of having to go out to an external switch. This can give 60+ Gbps.
Some PCIe cards are not that useful for virtualization. An example is iSCSI TOE cards.
Solutions: move offloading
Q: "Security? Say pass in firewire card to sniff other memory cards. "
A: Qualification by vendor to discourage this.
by Dhaval Giani
Want QoS for throughput and latency.
Graph (iperf, 2 VMs, UP, with different workloads): CPU usage vs. network throughput (MB/s).
Issue: two VMs running at the same time, one network-intensive, one compute-intensive. The CPU-intensive guest affects the network-intensive VM (by 10%).
Tried V-Bus. Failed (not stable).
Solution: a separate network device exclusive to the guest. QEMU has a tap device that does this (see the example below). But it is not done in different contexts, nor are there multiple CPU workqueues.
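(A sketch of giving a guest its own tap device, optionally backed by vhost-net; the interface and id names are made up, and vhost=on assumes vhost-net support in the host kernel.)
  qemu -netdev tap,id=net0,ifname=tap0,script=no,downscript=no,vhost=on -device virtio-net-pci,netdev=net0 ...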
A: "Is this an accounting problem." No, we are trying to isolate the performance.
Q: It did not work as expected, for an unknown reason. Any ideas? Would vhost with separate threads help? vbus did the same stuff. Tried raising the softirq priority; no luck.
A: There is common Linux code that deals with the packet. If you take netfilter and the bridge firewall out, does that not make it faster?
A: No idea. But it might not be feasible to take all of this out.
Q: Why does netfilter not work?
A: ".." (note taker did not hear the answer).
Q: Have you tried separate NICs to separate the traffic?
A: "We did not have two NICs, and the problem space was to use one and be better at giving data to the guest."
Q: Is the basic problem the length of time it takes to get data to the guest?
A: Yes, that is what we want to fix, to have better QoS.
Q: Latency with ICMP decreased over time?
A: The CPU cache is warmed up and we get the benefit of the cache path, which gives better isolation.
Increasing responsiveness means more CPU utilization (accounting, etc).
Solution #3: a hint from the NIC about which virtual MAC the packet is for; the MAC is already in there. Perhaps the final answer is threaded vhost. Nobody is sure whether the patches are per-device or per-CPU; it looks as if a per-device thread would solve this issue.
One thread per VCPU for NUMA performance. The benefit is to have _all_ VCPUs transmit/receive in parallel. The solutions that exist are not generic enough.
by Joerg Roedel
Status: 2.6.31 initial merge, then bug fixes and performance improvements; 2.6.37-rc1 emulates Nested Paging.
Supported hypervisors: everything except VMware (there is some code to make this work; it needs revisiting).
Costs: thousands of extra cycles, 5-10x that of the non-emulated case.
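(For reference, a sketch of how nested SVM is typically exercised on an AMD host; the nested module parameter and the +svm CPU flag are the standard KVM/QEMU knobs, the rest is assumed.)
  modprobe kvm-amd nested=1                  # allow guests to use SVM themselves
  qemu -enable-kvm -cpu qemu64,+svm ...      # expose SVM to the first-level guest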
Benchmarks: 1 VM, 1 VCPU, Windows XP in KVM + Nested Paging. PCMark05 numbers: HDD (IDE) is 50% slower, memory is close to non-emulated (throughput numbers). CPU varies by workload, but is still close to non-emulated (decompression is the big hitter).
But random memory latency shows a 20→80% performance drop (4KB→16MB).
These numbers should not be so bad. Perhaps the warm-up of memory should be done on _all_ memory regions.
To find the problem, look at Nested Paging faults...
HDD benchmark results vary: 0%→80% performance depending on the workload (virus scan is horrible).
VirtIO vs. IDE combinations: virtio on virtio is slow (emulated in QEMU); it gets half the throughput of virtio or IDE non-emulated (drive options sketched below).
Perhaps alignment issues are creeping in? More I/O because 4KB requests get broken up?
A strange beast: run KVM, then QEMU with Xen (the second guest is a Xen Dom0), and try that.
Perhaps try with the VirtIO PCI Windows driver; that might solve the 4KB I/O splitting issue?
VirtIO drivers are unsigned for Windows 7 (hacks exist) and XP on Fedora (virt package?)
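(For context, the two disk configurations being compared are selected roughly like this; the image name is a placeholder.)
  qemu -drive file=disk.img,if=virtio ...    # paravirtual virtio-blk disk
  qemu -drive file=disk.img,if=ide ...       # emulated IDE disk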
-
Folks are getting tired by now so not quite lucid.
Benchmark (kernel compile on SMP): KVM running Linux, then KVM running Linux inside that, then compile. Various workloads (4, 2, 1 CPUs). Performance drops 20%, but that is not tragic. Surprisingly, when scaling (2, 4, 1 CPUs) the performance ratio between the different guests stays the same.