The Linux Plumbers 2012 Microconference - Virtualization track focuses on general Linux Virt and related technologies. There are quite a few active projects that encompass virtualization, and this forum will be available for developers to meet and collaborate.
The structure will be similar to the previous two years (2010 and 2011): about 30 minutes per subject, including discussion. This microconference has been approved for a double track.
Slides
threats to virt system
3 things to worry about
protecting guest against malicious hosts
host has full access to guest resources
host can modify guest state at will, without the guest knowing it
how to solve?
no real concrete solutions that are perfect
guest needs to be able to verify / attest host state
guests need to be able to protect data when offline
protect hosts against malicious guests
just assume all guests are going to be malicious
more than just qemu isolation
how?
multi-layer security
restrict guest access to guest-owned resources
h/w passthrough – make sure devices are tied to those guests
limit available kernel interfaces
system calls, netlink, /proc, /sys, etc.
if a guest doesn't need an access, don't give it!
libvirt+svirt
MAC in host to provide separation, etc.
addresses netlink, /proc, /sys
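The sVirt-style separation above can be sketched as a toy model (illustrative Python, not the real SELinux/sVirt implementation; the label format and category numbers are assumptions): each guest's qemu process gets a unique set of MCS categories, and a resource is accessible only if its categories match.

```python
# Toy model of sVirt-style MAC separation (illustrative only; real sVirt
# uses SELinux labels such as system_u:system_r:svirt_t:s0:c10,c20).

def make_label(categories):
    """A guest label is modeled here as just its set of MCS categories."""
    return frozenset(categories)

def may_access(process_label, resource_label):
    # A qemu process may touch a disk image only if the image carries
    # exactly the categories assigned to that guest.
    return process_label == resource_label

guest_a = make_label({10, 20})   # hypothetical categories c10,c20
guest_b = make_label({30, 40})   # hypothetical categories c30,c40
disk_a = make_label({10, 20})    # image labeled for guest A

assert may_access(guest_a, disk_a)       # A reaches its own disk
assert not may_access(guest_b, disk_a)   # B is blocked by the MAC policy
```

The point of the per-guest categories is that even if one qemu is compromised, the host's MAC policy still walls it off from every other guest's resources.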
(discussion) aside: how to use libvirt without a GUI?
protecting guest against hostile networks
guests vulnerable directly and indirectly
direct: buggy apache
indirect: host attacked
qos issue on loaded systems
(discussion) blue pill vulnerability – how to mitigate?
somebody pulling rug beneath you, happens even after boot
you'll need h/w support?
yes, TPM
UEFI, secure boot
what about post-boot security threats?
let's say we booted securely. other mechanisms you can enable: IMA – extends the root of trust higher; signed hashes, signed binaries.
unfortunately, details beyond scope for a 20-min talk
Slides
glusterfs features
replication, striping, distribution, geo-replication/sync, online volume extension
future
is it possible to export LUNs to gluster clients?
creating a VM image means creating a LUN
exploit per-vm storage offload – all this using a block device translator
export LUNs as files; also export files as LUNs.
(discussion) why not use raw files directly instead of adding all this overhead? This looks like a perf disaster (ip network, qemu block layer, etc.) – combination of stuff increasing latency, etc.
Slides
current state
kvm emulates local apic and io-apic
all reads/writes intercepted
interrupts can be queued from user or kernel
IPI costs are high
guest vapic backing page
store local apic contents for one vcpu
writes to accelerated registers don't cause intercepts
writes to non-accelerated registers do
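The backing-page behavior can be modeled as a toy sketch (illustrative Python; the register offsets and the exact accelerated set are assumptions, not the hardware's actual list):

```python
# Toy model of a vAPIC backing page: hardware applies writes to some
# registers directly to the per-vcpu backing page, no VM exit needed;
# everything else still intercepts so the hypervisor can emulate it.

ACCELERATED = {0x80, 0xB0, 0x380}   # e.g. TPR, EOI, timer initial-count (assumed)

def apic_write(backing_page, reg, value):
    """Return 'no-exit' if the write lands in the backing page,
    or 'vmexit' if the hypervisor must intercept and emulate."""
    if reg in ACCELERATED:
        backing_page[reg] = value
        return "no-exit"
    return "vmexit"

page = {}
assert apic_write(page, 0x80, 0) == "no-exit"     # accelerated: no intercept
assert apic_write(page, 0xF0, 0x1FF) == "vmexit"  # non-accelerated: intercept
```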
Slides
intel: have a bitmap, and they decide whether to exit or not.
amd: hardcoded. apic timer counter, for example.
interrupt window: when the hypervisor wants to inject an interrupt, the guest may not be running. the hypervisor has to enter the VM; when the guest is ready to receive the interrupt, it comes back with a vmexit. problem: the more interrupts need injecting, the more vmexits, and the busier the guest becomes. so: they wanted to eliminate these exits.
read case: if you have something in advance (apic page), hyp can just point to that instead of this exit dance
more than 50% exits are interrupt-related or apic related.
virt-interrupt delivery
extend tpr virt to other apic registers
eoi - no need for vm exits (using new bitmap)
but for eoi behaviour, intel/amd can have common interface.
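The per-vector EOI exit control can be sketched as a toy model (illustrative Python; the bit layout is an assumption based on the bitmap idea above, not the architectural encoding):

```python
# Toy model of a 256-bit EOI-exit bitmap: one bit per interrupt vector.
# Bit set => the hypervisor still wants a VM exit on EOI of that vector
# (e.g. level-triggered interrupts that need io-apic emulation);
# bit clear => the EOI completes in hardware with no exit.

def eoi_causes_exit(bitmap, vector):
    return bool(bitmap[vector // 8] & (1 << (vector % 8)))

bitmap = bytearray(32)            # 256 vectors
bitmap[5] |= 1 << 0               # request exits for vector 40 only

assert eoi_causes_exit(bitmap, 40)
assert not eoi_causes_exit(bitmap, 41)
```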
virt api can have common infra, but data structures are totally different.
intel spec will be available in a month or so (update: already available now).
amd spec should be available in a month too.
VMFUNC
this can hide info from other vcpus
secure channel between guest and host; can do whatever hypervisor wants.
vcpu executes the vmfunc instruction in a special thread
Slides
benchmarkings
developed a new benchmark tool, autonuma-benchmark
comparing to gravity measurement
put all memory in single node
perf numbers
also includes comparison with alternative approach, sched numa.
graphs show autonuma is better than schednuma, which is better than vanilla kernel
using printks right now for development; there's a lot of info – everything you need to see why the algorithm is doing what it's doing.
good to have in production so that admins can see
overall, all such stats can be easily exported; they're already available via printk, but have to be moved to something more structured and standard.
Slides
solution proposed
core should be hypervisor-independent
should co-operate on a h/w-independent level – e.g. memory hotplug, tmem, movable pages to reduce fragmentation
selfballooning ready
support for hugepages
standard api and abi if possible
arch-specific parts should communicate with underlying hypervisor and h/w if needed
crazy idea
replace ballooning with mem hot-unplug support
however, ballooning operates on single pages whereas hotplug/unplug works on groups of pages that are arch-dependent.
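The granularity mismatch above is easy to quantify (sizes here are typical x86_64 values, assumed for illustration):

```python
# Ballooning moves single pages; memory hotplug/unplug moves whole
# sections, which are much larger and arch-dependent.

PAGE = 4 * 1024               # balloon granularity: one 4 KiB page
SECTION = 128 * 1024 * 1024   # typical x86_64 hotplug section: 128 MiB

pages_per_section = SECTION // PAGE
assert pages_per_section == 32768   # one unplug = tens of thousands of pages
```

So replacing ballooning with hot-unplug trades fine-grained control for much coarser chunks, which is the crux of the "crazy idea" debate.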
Slides
ARM architecture virtualization extensions
recent introduction in arch
new hypervisor mode PL2
traditionally secure state and non-secure state
Hyp mode is in non-secure side
higher privilege than kernel mode
adds second stage translation; adds extra level of indirection between guests and physical mem
ability to trap accesses to most system registers
can handle irqs, fiqs, async aborts
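The second-stage translation can be sketched as a toy model (illustrative Python dictionaries, not real ARM page tables): stage 1 is guest-controlled (VA to IPA), stage 2 is host-controlled (IPA to PA), which is the extra level of indirection mentioned above.

```python
# Toy two-stage translation. Addresses are assumptions for illustration.

stage1 = {0x1000: 0x8000}     # guest VA -> intermediate physical address (IPA)
stage2 = {0x8000: 0x4_0000}   # IPA -> host physical address

def translate(va):
    ipa = stage1[va]          # walk the guest's own page tables (stage 1)
    return stage2[ipa]        # then the host's stage-2 tables

assert translate(0x1000) == 0x4_0000
```

Because only stage 2 is host-owned, the guest can manage its page tables freely while the host still decides which physical memory it can ever reach.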
KVM/ARM
uses HYP mode to context switch from host to guest and back
exits guest on physical interrupt firing
access to a few privileged system registers
WFI (wait for interrupt)
etc.
on guest exit, control restored to host
no nesting; arch isn't ready for that.
MM
host in charge of all MM
has no stage2 translation itself (saves tlb entries)
guests are in total control of page tables
becomes easy to map a real device into the guest physical space
for emulated devices, accesses fault, generates exit, and then host takes over
4k pages only
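The emulated-device path above can be modeled as a toy dispatch (illustrative Python; the mapped region is an assumption): accesses with a stage-2 mapping go straight through, unmapped ones fault and exit to the host.

```python
# Toy model of the MMIO path: guest physical addresses with a stage-2
# mapping (RAM or passed-through devices) complete with no host
# involvement; anything else faults, causing an exit so the host can
# emulate the device access.

MAPPED_RANGES = [range(0x0, 0x1000_0000)]   # assumed RAM/passthrough region

def guest_access(gpa):
    for r in MAPPED_RANGES:
        if gpa in r:
            return "direct"        # stage-2 hit
    return "exit-to-host"          # stage-2 fault -> host emulates

assert guest_access(0x100) == "direct"
assert guest_access(0x2000_0000) == "exit-to-host"
```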
instruction emulation
trap on mmio
most instructions described in HSR
added complexity due to having to handle multiple ISAs (ARM, Thumb)
interrupt handling
redirect all interrupts to hyp mode only while running a guest. This only affects physical interrupts.
leave it pending and return to host
will the pending interrupt kick in when returning to guest mode?
No, it will be handled in host mode. Basically, we use the redirection to HYP mode to exit the guest, but keep the handling on the host.
booting protocol
if you boot in HYP mode and enter a non-kvm kernel, it gracefully drops back to SVC.
if a kvm-enabled kernel is booted into, it automatically goes into HYP mode.
if a kvm-enabled kernel is booted in HYP mode, it installs a HYP stub and drops back to SVC. the only goal of this stub is to provide a hook for KVM (or another hypervisor) to install itself.
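The boot decisions described above can be sketched as a toy state machine (illustrative Python; the return strings are placeholders, not real boot states):

```python
# Toy model of the ARM boot protocol described in the notes.

def boot(entered_in_hyp, kvm_enabled):
    if not entered_in_hyp:
        # Never saw HYP mode at entry: virtualization can't be used later.
        return "svc, no virtualization"
    if not kvm_enabled:
        return "svc"                         # gracefully drop out of HYP
    # KVM-enabled kernel entered in HYP: park a stub in HYP, drop to SVC;
    # KVM later uses the stub as its hook to install itself in HYP mode.
    return "install hyp stub, then svc"

assert boot(True, True) == "install hyp stub, then svc"
assert boot(True, False) == "svc"
assert boot(False, True) == "svc, no virtualization"
```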
Slides
challenges
traditional way: port xen, and port hypercall interface to arm
from Linux side, using PVOPS to modify setpte, etc., is difficult
one type of guest
like pv guests
like hvm guests
exploit nested paging
same entry point on native and xen
use device tree to discover xen presence
simple device emulation can be done in xen
exploit h/w
running xen in hyp mode
no pv mmu
hypercall
generic timer
status
xen and dom0 boot
vm creation and destruction work
pv console, disk, network work
xen hypervisor patches almost entirely upstream
linux side patches should go in next merge window
open issues
acpi
will have to add ACPI parsers, etc., in addition to device tree
Linux's ACPI code is about 110,000 lines – it would all have to be merged
Slides
are we there yet? almost
what is vfio?
what's next?
qemu integration
legacy pci interrupts
libvirt support
iommu groups changed the way we do device assignment
sysfs entry point; move device to vfio driver
do you pass group by file descriptor?
lots of discussion on how to do this
existing method needs name for access to /sys
how can we pass file descriptors from libvirt for groups and containers to work in different security models?
The difficulty is in how qemu assembles the groups and containers. On the qemu command line, we specify an individual device, but that device lives in a group, which is the unit of ownership in vfio and may or may not be connectable to other containers. We need to figure out the details here.
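The ownership relationships can be sketched as a toy model (illustrative Python classes standing in for the real VFIO ioctl/file-descriptor API; names and the fake fd strings are assumptions): the group, not the device, is the unit of ownership, and device descriptors are only handed out through a group attached to a container.

```python
# Toy model of VFIO ownership: container -> groups -> devices.

class Container:
    """Stands in for an open /dev/vfio/vfio container fd."""
    def __init__(self):
        self.groups = set()

class Group:
    """Stands in for an open /dev/vfio/<group> fd owning a set of devices."""
    def __init__(self, devices):
        self.devices = set(devices)
        self.container = None

    def attach(self, container):
        self.container = container
        container.groups.add(self)

    def get_device_fd(self, device):
        if self.container is None:
            raise PermissionError("group not attached to a container")
        if device not in self.devices:
            raise KeyError(device)
        return f"fd:{device}"     # placeholder for a real file descriptor

grp = Group({"0000:01:00.0"})
try:
    grp.get_device_fd("0000:01:00.0")   # fails: no container yet
except PermissionError:
    pass
grp.attach(Container())
assert grp.get_device_fd("0000:01:00.0") == "fd:0000:01:00.0"
```

This is why passing pre-opened group/container file descriptors from libvirt is attractive: libvirt can hold the ownership while qemu only ever sees descriptors it was explicitly handed.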
error reporting
better ability to inject AER etc to guest
maybe another ioctl interrupt
If we do get PCIe AER errors showing up at a device, what is the guest going to be able to do (for instance, can it reset links)?
Slides
xen solution – prototype
create a bridge in domU
guest sees a pv device and a real device
guest changes needed for bridge
migration is guest-visible, since real device goes away and comes back (hotplug)
vmware way
writing a new driver for each n/w device they want to support
this new driver calls into vmxnet
binary blob is mapped into your address space
migration is guest-visible
alex way
emulate real device in qemu
e.g. expose emulated igbvf if passing through igbvf
need to write migration code for each adapter as well
is it a good idea?
how much effort really?
no one wants a single-vendor/card dependency across an entire datacenter
Slides
current situation: one kernel thread per vhost
if we create a lot of VMs and a lot of virtio-net devices, perf doesn't scale
instead of having a thread for every vhost device, create a vhost thread per cpu
add some numa-awareness scheduling – pick best cpu based on load
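The NUMA-aware placement can be sketched as a toy policy (illustrative Python; "least-loaded CPU, preferring the guest's node" is an assumed heuristic, not the proposal's exact algorithm):

```python
# Toy model of per-cpu vhost thread selection with NUMA awareness.

def pick_vhost_cpu(load_by_cpu, node_of_cpu, guest_node):
    """Prefer the least-loaded CPU on the guest's NUMA node;
    fall back to any CPU if the node has none."""
    local = [c for c in load_by_cpu if node_of_cpu[c] == guest_node]
    candidates = local or list(load_by_cpu)
    return min(candidates, key=lambda c: load_by_cpu[c])

loads = {0: 0.9, 1: 0.2, 2: 0.1, 3: 0.5}
nodes = {0: 0, 1: 0, 2: 1, 3: 1}

assert pick_vhost_cpu(loads, nodes, guest_node=0) == 1   # least-loaded local
assert pick_vhost_cpu(loads, nodes, guest_node=1) == 2
```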
perf graphs
for 1 VM, as the number of netperf instances increases, per-cpu vhost doesn't shine.
another tweak: use 2 threads per cpu: perf is better
tried tcp, udp, inter-guest, all netperf tests, etc.
RFC
should they continue?
strong objections?
overlay networks
when migrating across domains (subnets), have to re-number IP addresses
solution is to have a set of tunnels
every end-user can view their domain/tunnel as a single virtual network
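The tunnel idea can be sketched as a toy mapping (illustrative Python; addresses are made up): the guest keeps its virtual IP across migration, and only the virtual-IP-to-physical-endpoint entry in the tunnel table changes, so no renumbering is needed.

```python
# Toy model of an overlay/tunnel mapping table.

tunnel_table = {"10.0.0.5": "192.168.1.7"}   # virtual IP -> physical host

def migrate(vip, new_physical_host):
    # The guest's address is untouched; only the encapsulation target moves.
    tunnel_table[vip] = new_physical_host

migrate("10.0.0.5", "172.16.2.9")            # move across subnets
assert tunnel_table["10.0.0.5"] == "172.16.2.9"
```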
Slides
open issues
need flexibility in qemu to start w/o devices
modern qemu better
one qemu uses this all, others have functionality, but don't use it