Virtualization

The Linux Plumbers 2011 Virtualization track is focusing on general free software Linux Virtualization. It is not focused on a specific hypervisor, but will focus on general virtualization issues and in particular collaboration amongst projects. This would include KVM, Xen, QEMU, containers etc.

The structure will be similar to what was used in 2010, i.e. 30 minutes per subject, including discussion.

Schedule

The schedule of the 2011 Virtualization Plumbers Micro Conference was as follows. Note that the presentation slides can be found on the Plumbers page by following the links to the abstracts.

Discussion notes

The following are the notes taken in the Etherpad during the discussions.

Seeking a path to zero-copy paravirtualised networking in the Linux kernel

Xen PV network transmit operation:

  • Back end either maps the guest pages or copies them. Mapping is needed to implement zero copy, but there is one issue: the network stack does not do proper page reference counting.
  • The page life cycle tracking issue in the network layer affects any subsystem that gives pages to the network layer (such as NFS).
  • Basic requirement for a general solution: even if we give a page to the network layer, we need to retain ownership of it. Implemented using the fragment API and destructor infrastructure.
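The ownership requirement above can be illustrated with a toy model. This is only a sketch in Python, not the kernel fragment API: the names `SharedPage`, `get` and `put` are hypothetical. The owner lends a page out, each fragment holds a reference, and a destructor callback tells the owner when the last reference is gone.

```python
class SharedPage:
    """Toy model of a page lent to the network layer (hypothetical names)."""

    def __init__(self, data, destructor):
        self.data = data
        self._refs = 0                  # references held by in-flight fragments
        self._destructor = destructor   # owner's callback, run on last put()

    def get(self):
        """Take a reference, e.g. when a fragment pointing at the page is built."""
        self._refs += 1
        return self

    def put(self):
        """Drop a reference; notify the owner once nobody uses the page."""
        if self._refs <= 0:
            raise RuntimeError("put() without a matching get()")
        self._refs -= 1
        if self._refs == 0:
            self._destructor(self)


released = []                            # pages the owner may now unmap or reuse
page = SharedPage(b"payload", released.append)
page.get()                               # fragment in the original packet
page.get()                               # fragment in a clone of that packet
page.put()                               # original packet freed: still in use
still_owned_after_first_put = (released == [])
page.put()                               # clone freed: destructor runs
```

In this model the back end would only unmap the guest page from inside the destructor, which is the point at which the network layer no longer references it.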

Questions

  1. Problem should not be relevant to KVM. Discuss on the mailing list whether KVM's implementation is affected.
  2. What about performance? The effect of calling the destructors has not been measured yet. It is a correctness issue, so we have to live with it.
  3. Might double the overhead of allocating a skb. Need to measure how it affects memory usage.
  4. James: could we get rid of the new page struct member by using an indexed array?
  5. Patches well received by the netdev guys.

Virtio on Xen

* Started as a Google Summer of Code project.

* Need to figure out why “pv guest, with virtio nic” is so slow.

Questions

- Trying to switch Xen over to virtio?

  1. It was not the speaker's initial intent.
  2. Ian: could do it for simple devices such as the random generator and such.
  3. Jes: If we could build Xen APIs on top of virtio it would be good for the whole FOSS virtualization community.
  4. Ian: that is not our current goal.

Yabusame: Postcopy Live Migration for Qemu/KVM

Proposed by Takahiro Hirofuchi, AIST. Presented by Isaku Yamahata, VA Linux Systems Japan K.K.

  • precopy: copy mem before switching execution. (status quo)
  • postcopy: switch execution first, then copy pages on demand and in the background

precopy can result in the same page being copied multiple times, as it is repeatedly dirtied. Migration time depends on how fast memory is re-dirtied (memory update intensity).
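The dependence on the re-dirty rate can be made concrete with a small deterministic model. This is a sketch under stated assumptions, not QEMU's algorithm: `dirty_ratio` (pages re-dirtied per page sent), `stop_threshold` and `max_rounds` are hypothetical parameters chosen for illustration.

```python
def precopy(total_pages, dirty_ratio, stop_threshold, max_rounds=30):
    """Simulate iterative precopy.

    Returns (rounds, total_pages_sent, pages_sent_while_paused).
    Assumption: while one round is on the wire, the guest re-dirties
    dirty_ratio pages for every page sent in that round.
    """
    to_copy = total_pages        # first round sends all of guest RAM
    total_sent = 0
    rounds = 0
    while to_copy > stop_threshold and rounds < max_rounds:
        total_sent += to_copy
        rounds += 1
        # Pages dirtied during this round must be sent again next round.
        to_copy = min(total_pages, int(to_copy * dirty_ratio))
    # Final stop-and-copy: the guest is paused while the remainder is sent.
    total_sent += to_copy
    return rounds, total_sent, to_copy


slow_writer = precopy(1000, 0.5, 10)   # converges after a few rounds
fast_writer = precopy(1000, 1.0, 10)   # never converges, hits max_rounds
```

In this model any non-zero `dirty_ratio` makes the total number of pages sent exceed the size of guest RAM, and a ratio of 1.0 or more never reaches the threshold, so the loop only ends at `max_rounds`.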

  • postcopy switch time deterministic 200-300ms, independent of RAM size. Performance loss after switch, but should be short.
  • Early-stage proof-of-concept code
  • qemu-kvm cannot handle 100% itself, since others may have modified guest RAM on source machine. Must hook guest RAM access during postcopy phase
  • postcopy vulnerable to either machine failing during transition period. Checkpointing likely needed. Lockstep? (a talk in cloud MC (Remus) talked about HA snapshotting that may be relevant). James described Stratus lockstep solution, duplicating inputs (e.g. incoming web requests) and verifying outputs match.
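The need to hook guest RAM access during the postcopy phase can be sketched with a toy model. This is not the Yabusame implementation: the class `PostcopyRAM` and its methods are hypothetical, and a dictionary stands in for guest memory. Each page crosses the wire exactly once, either on first access or via the background copy.

```python
class PostcopyRAM:
    """Toy model of destination-side guest RAM during postcopy (hypothetical)."""

    def __init__(self, source_pages):
        self._source = dict(source_pages)  # pages still only on the source host
        self._local = {}                   # pages already on the destination
        self.demand_fetches = 0            # accesses that had to wait for a page
        self.transfers = 0                 # total pages pulled from the source

    def _pull(self, pfn):
        self._local[pfn] = self._source.pop(pfn)
        self.transfers += 1

    def read(self, pfn):
        # Hook: an access to a missing page stalls until it is fetched.
        if pfn not in self._local:
            self.demand_fetches += 1
            self._pull(pfn)
        return self._local[pfn]

    def write(self, pfn, value):
        self.read(pfn)                     # page must be present before it is modified
        self._local[pfn] = value

    def background_copy(self, max_pages):
        # Push remaining pages without waiting for the guest to touch them.
        for pfn in sorted(self._source)[:max_pages]:
            self._pull(pfn)

    def done(self):
        return not self._source


ram = PostcopyRAM({0: "a", 1: "b", 2: "c"})
first_read = ram.read(2)     # demand fetch
ram.write(2, "C")            # already local, no second transfer
ram.background_copy(10)      # the rest arrives in the background
```

Until `done()` is true, the guest's state is split across both hosts, which is why a failure of either machine during the transition is a problem.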

VFIO: PCI device assignment breaks free of KVM

  • paravirt: bypassing qemu improves latency & performance
  • even better: assign a device directly to the VM via SR-IOV. close to bare-metal performance and a compatibility win
  • downside: guest pinned in host memory, VM tied to phys host device
  • PCI config, BARs, interrupts mapped or fwded to guest
  • Current implementation not ideal
  • VFIO: high-perf userspace driver. KVM not required
  • kernel module. configs iommu, other stuff. iommu issues can be thorny. VFIO-NG coming soon.
  • IOMMU2 will help, can page-fault

New filesystem freeze API

Idea is to suspend writes to a filesystem before snapshot / resume after.
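As a minimal sketch of the freeze / snapshot / thaw sequence, the existing VFS ioctls FIFREEZE and FITHAW from `<linux/fs.h>` can be wrapped in a context manager. The helper name `frozen` and its injectable `ioctl` parameter are assumptions made for illustration; a real freeze needs Linux and CAP_SYS_ADMIN.

```python
import fcntl
import os
from contextlib import contextmanager

# Request numbers from <linux/fs.h>:
# FIFREEZE = _IOWR('X', 119, int), FITHAW = _IOWR('X', 120, int)
FIFREEZE = 0xC0045877
FITHAW = 0xC0045878


@contextmanager
def frozen(mountpoint, ioctl=fcntl.ioctl):
    """Freeze the filesystem containing `mountpoint` for the body of the block.

    `ioctl` is injectable so the call sequence can be exercised without
    privileges; by default the real fcntl.ioctl is used.
    """
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        ioctl(fd, FIFREEZE, 0)       # suspend writes
        try:
            yield fd
        finally:
            ioctl(fd, FITHAW, 0)     # always resume, even if the snapshot failed
    finally:
        os.close(fd)
```

A caller would write `with frozen("/mnt/data"): take_snapshot()`, where the mount point and `take_snapshot` are placeholders. If the process dies between the two ioctls the filesystem stays frozen.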

  • Often used for backup - get a consistent snapshot
  • Suspend writes – includes VFS I/O and MMAP I/O (mmap since v3.0).
  • Was XFS specific – now in VFS: ioctls FIFREEZE and FITHAW.
  • wart: Can umount a frozen fs. But cannot thaw a block device
  • wart: Cannot check status. Comment from audience: Can call thaw and check result. Not ideal
  • New usecase: snapshot of virtual machine from hypervisor
  • Needs in-guest support (Linux: virtagent. Windows: VSS)
  • Problem: what if agent dies within guest. cannot check state on restart
  • JamesB: Why does dm-snapshot not meet this usecase?
    • Answer: use _inside_ guest. guest agent creates snapshot, but this is intrusive. Guest does not have visibility into many host-level I/O – e.g. snapshots in filers etc.
  • New API: ioctls FIISFROZEN (query FS), BLKISFROZEN (query blkdev, could -EBUSY from umount instead)
  • An issue with returning EBUSY: lazy umount non-trivial
  • New ioctl: FIGETFREEZEFD – freeze and return fd (handle). thaw when fd is closed. automagically thawed, solves issues with agent going away. On this fd can do FS_FREEZE_FD, FS_THAW_FD and FS_ISFROZEN_FD. Can add freeze=true/false parameter to FIGETFREEZEFD.
  • JamesB: What if agent is alive but unresponsive?
    • Jes: cannot both open comms channel to host.
  • JamesB: Can do multiple FIGETFREEZEFD which is problematic.
    • Could make exclusive or reference count (make union of two requests)? Careful not to allow hostile process to wedge things. Need a per-fd reference not a global one so cannot release other caller's freezes.
  • Access control: Use CAPABILITY (CAP_SG?) and permission to open a path within the fs (needed for FIGETFREEZEFD).
  • Need to be careful when snapshotting filesystem containing agent binary. mlock etc. small dedicated binary for ease of review (rather than function of larger binary)
  • Need for polling/notification interface for applications? Cannot freeze/thaw but can monitor.
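The per-fd reference semantics discussed for FIGETFREEZEFD can be modelled in a few lines. This is a toy model of the proposal as described in the notes, not a kernel interface: `FreezableFS` and `FreezeHandle` are hypothetical names, and the methods only mirror FS_FREEZE_FD, FS_THAW_FD and FS_ISFROZEN_FD in spirit. The filesystem stays frozen while any handle holds a freeze, and closing a handle drops only that handle's request.

```python
class FreezableFS:
    """Toy filesystem whose freeze state is the union of per-handle requests."""

    def __init__(self):
        self._frozen_by = set()      # handles currently requesting a freeze

    def get_freeze_fd(self, freeze=True):
        """Model of FIGETFREEZEFD with the optional freeze=true/false parameter."""
        handle = FreezeHandle(self)
        if freeze:
            handle.freeze()
        return handle

    def is_frozen(self):
        return bool(self._frozen_by)


class FreezeHandle:
    """Model of the returned fd: the reference is per handle, never global."""

    def __init__(self, fs):
        self._fs = fs
        self._closed = False

    def _check_open(self):
        if self._closed:
            raise ValueError("handle is closed")

    def freeze(self):
        self._check_open()
        self._fs._frozen_by.add(self)

    def thaw(self):
        self._check_open()
        self._fs._frozen_by.discard(self)   # cannot release another caller's freeze

    def is_frozen(self):
        self._check_open()
        return self._fs.is_frozen()

    def close(self):
        # Closing, including when the agent dies, thaws this handle's request.
        if not self._closed:
            self._fs._frozen_by.discard(self)
            self._closed = True


fs = FreezableFS()
agent = fs.get_freeze_fd()           # guest agent freezes for a snapshot
backup = fs.get_freeze_fd()          # second caller freezes as well
agent.thaw()                         # the agent's thaw does not undo the backup's freeze
frozen_after_agent_thaw = fs.is_frozen()
backup.close()                       # last reference gone: filesystem thaws
```

The model leaves out access control and says nothing about how a hostile process holding a handle open should be dealt with, both of which were raised as open questions.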

Contact

Proposal added by Jes Sorensen Jes.Sorensen@redhat.com

 
2011/virtualization.txt · Last modified: 2011/09/27 09:01 by jsorensen
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported