Microconference - Scaling

The Linux Plumbers 2013 Microconference - Scaling track focuses on scalability, both upwards and downwards as well as up and down the stack. There are quite a few active projects that are working to enhance performance and scalability both in the Linux kernel and in user-space projects. In fact, one of the purposes of this forum is for developers from different projects to meet and collaborate. After all, for the user to see good performance and scalability, all relevant projects must perform and scale well.

The structure will be similar to that of previous years (2011 and 2012): about 30 minutes per subject, with discussion.

Possible Topics (please add...)

  • Scaling userspace
    • Userspace RCU library: new APIs and data structures (Mathieu Desnoyers)
  • Scaling the kernel
    • Validating RCU usage (dhaval, paulmck)
    • per-CPU atomics (pjt)
    • kernel supported M:N concurrency (pjt)
  • Validation of parallel hardware
  • Hardware Transactional Memory
    • Eliding locks (andi kleen)
  • Adaptive Ticks
  • Scalability and energy efficiency
  • Scalability to large memory sizes
    • We are nearing having terabytes of memory in a single NUMA node, which means things like the LRU lists or rmap chains can be tens of millions of entries long. (dhansen)
    • Faults are expensive, don't parallelize well, and we currently have no good way to avoid them for the page cache.
  • etc…

Schedule

The schedule of the 2013 Scalability Plumbers Micro Conference is as follows. Note presentation slides can be found on the Plumbers page by following the links to the abstracts:

  • Session 1 - Presenter 1
  • Session 2 - Presenter 2
  • etc…

Key Attendees

  • Paul McKenney
  • Mathieu Desnoyers
  • Steven Rostedt
  • Adrian Sutton
  • Frederic Weisbecker
  • Andi Kleen
  • Thomas Gleixner
  • Ingo Molnar
  • Peter Zijlstra
  • Paul Turner
  • Samy Al Bahra
  • Morten Rasmussen
  • Eric Dumazet
  • Darren Hart
  • Rik van Riel
  • Dave Hansen
  • Tim Chen

Discussion notes

http://etherpad.osuosl.org/lpc2013-scaling

 Linux Plumbers Conference 2013

Scaling Microconference

Volunteer needed for taking notes.

Andi Kleen on Transactional Synchronization Extensions

Current status of lock elision in Linux http://linuxplumbersconf.org/2013/ocw/sessions/1161 (slides) Andi Kleen

Speculative execution:
    Blocking or non-blocking
    Blocking has added latency when transferring locks
    

Intel TSX

  User-controlled execution mode in CPU (Haswell)

  HLE adds instruction prefixes for atomic ops
      On failure, falls back to taking the lock
      nop on old CPUs

  RTM
      new XBEGIN/XEND instructions
      Explicit abort handler
      
  

Lock scaling results depend on the workload.

An elided lock:
  fast path is non-blocking, similar to a recursive reader lock
  individual cache lines act as locks
  may always fall back to taking the real lock
  uses the standard locking model

Overview of the Linux implementation:
  TSX perf profiling supported (needed to understand speculation)
  TSX lock elision:
      elide kernel locks
      glibc mutex elision (elide application locks)
      various custom locks elided in applications
      libitm (gcc)
  applications with non-scalable locking are the primary target

Lock adaptation:
  RTM locks with an adaptive abort handler; skips elision on failure
  safety net: simple algorithm, state stored in the lock, tunables

Glibc mutex elision:
  in glibc 2.18; needs to be globally enabled
  any program using pthread mutexes can elide
  missing: tunables (in-program and global), rwlocks
  obscure POSIX requirements are a problem:
      deadlock requirements for nested locks
      lost nesting support for trylocks and write locks
      lost adaptive spinning locks

Future work:
  recursive locks, rwlocks
  improved adaptive algorithms
  tuning interface
  more than POSIX: C++11 locking
  new interface to elide condition variables
  better fast path without dynamic dispatch
  adaptive spinning as the default

Kernel elision:
  eliding mutex, spin, rw, bitspin, rwsem, and custom locks
  only a win in some areas with big locks
  occasional losses due to too fine-grained locking
  may benefit from lock coarsening

Full dynticks status http://linuxplumbersconf.org/2013/ocw/sessions/1143 Frederic Weisbecker

Low-frequency tick: more throughput, fewer interrupts, less CPU stolen, less cache thrashed
High-frequency tick: better latency, finer timer and scheduler granularity, precision

Duty of the tick:

  timekeeping (walltime, xtime, gettimeofday)
  jiffies
  timer wheel
  cpu time stats
  scheduler
  RCU

dynticks: remove the tick when possible – idle

problems: icache and dcache periodically thrashed; tick steals CPU multiple times per second

Who's affected? HPC – extreme throughput. Real time – extreme latency.

CPU time accounting

  poll driven
  listen to ring boundaries
      syscalls
      exception
      irqs

dynticks is only enabled if there is only a single runnable task per CPU
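For reference, full dynticks as merged around this time (3.10/3.11) is gated on kernel configuration plus boot parameters. A typical setup, assuming an 8-CPU machine where CPU 0 is kept as the timekeeping CPU, would look roughly like:

```
# Kernel configuration:
CONFIG_NO_HZ_FULL=y
CONFIG_RCU_NOCB_CPU=y

# Boot command line: run CPUs 1-7 tickless (each needs a single
# runnable task for the tick to actually stop) and offload their
# RCU callbacks:
nohz_full=1-7 rcu_nocbs=1-7
```

CPU 0 cannot be in the nohz_full set because at least one CPU must keep the tick for timekeeping duties.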

User-level threads... with threads. http://linuxplumbersconf.org/2013/ocw/sessions/1653 Paul Turner

Three models:
  1:1 (kernel threading) → the current ubiquitous implementation
  N:1 (user threading) → single kernel context; no kernel awareness of user-level threading
  M:N (hybrid) → kernel assisted

Parallel programming models: synchronous, delegated events, message passing / event loops

Model: threads per request
  + simple
  + good data locality
  - realized parallelism within a request
  - latency predictability varies inversely with load

Model: asynchronous workers
  + greater control of work partitioning
  + improved latency predictability
  + lower overheads achievable
  - complex programming model
  - encapsulation of control and data-flow
  - loss of data locality

Scalability Issues in Linux Kernel http://linuxplumbersconf.org/2013/ocw/sessions/1299 Dave Hansen, Tim Chen

Writes on shared data:
  writes to shared structures (spinlocks, r/w locks, atomic counters) are expensive
  cache line bouncing: even a very short hold time on a lock is expensive
  example: an ext4 inode lock on a sorted LRU list for reclaim put pressure on the page cache, with >90% lock contention

Lock stat:
  scales poorly due to lockdep infrastructure (a build took 30% longer on a 60-core system)
  heavyweight; would be useful on production systems for debugging

Magic numbers: batch sizes, memory pool sizes, hash table sizes

Multi-threaded ops:
  mmap_sem and page_table_lock contention
  contention when many signals are sent to individual threads
  files_open contending on file_lock

VM scaling:
  mmap_sem: page fault has significant page allocation & clearing cost
  fork operations contend on the root anon_vma
  hugepages and other similar things (TLB flushing) are hacks and aren't actually very scalable

vmsplice with transparent huge pages Robert Jennings rcj@linux.vnet.ibm.com

RFC on list for page flipping with vmsplice in addition to copy

http://marc.info/?l=linux-kernel&m=137477297209750&w=2
http://marc.info/?l=linux-fsdevel&m=137477295109744&w=2
http://marc.info/?l=linux-fsdevel&m=137477295109743&w=2

QEMU adding migration feature to move VM to new QEMU executable (for applying patches). Requires:

page flipping rather than copy (can't afford to double memory usage)
speed to minimize downtime

RFC adds page flipping for a narrow case (4K, aligned, single mapping, non-THP, etc.). A KVM host would like to use THP; currently THPs would be copied.

Moving 4K pages is much slower than desired (~5GB/s)

Some room to improve in the scope of current code but larger
improvement is required

THP support for vmsplice page flipping would address environment in which KVM operates and provide a significant speedup.

BOF: Finding RCU Bugs Dhaval Giani, Paul McKenney, Frederic Weisbecker

Few people can pinpoint RCU bugs, both correctness bugs and performance bugs. Approaches discussed:
  Detecting pointer leaks
  Converting RCU into an rwlock
  RCU watchpoints by poisoning RCU pointers, expecting a GPF
  Adding a tag into bits of the pointer within the rcu_dereference operation; check it when we enter a GPF

Discussion: - hpa:

  1. there is a reason why those pointers are invalid. Will probably be used in the future.
  2. there are other options: use userspace pointers to trace kernel accesses on SMAP

- Using the bits (not just putting garbage data): right now, it can catch bugs. - Could add a level of indirection? Would be better if you don't have to. - Move this pointer info into another pointer.

  1. no garbage collection

- Use a modified page table entry, per-cpu cr3 (x86), set at rcu_dereference. - Does it require every instruction to be page aligned? No. - Scheme used for trapping bad mallocs/frees. - ASAN does something similar. - Alan: what about bugs caused by the sequence of use of rcu_defer/rcu_assign?

  1. Detected.
  2. Has it detected bugs? Still under development; a similar tool has detected bugs.

- kernel address sanitizer

  1. discovered 10 bugs and is working

- pte modification: limits you to use-cases with page tables

  1. RCU bugs that happen on powerpc only code

- people like the idea of unlimited “watchpoint”. Not actually a watchpoint, but poisoned pointer.

Per-CPU atomics in userspace Andrew Hunter & Paul Turner

- Keep data at the per-CPU level rather than the per-thread level, for systems with many threads.

- By moving the memory allocator free list to per-CPU rather than per-thread, it better handles cases where memory is allocated by one thread, handed off to another thread, then freed.

- How do you abort a transaction? The kernel handles it.

- restart logic handled in user-space, simple check in kernel (infrequent abort vs frequent test trade-off)

- combining all regions for all such restart code, so the kernel only has to check one range

- initial use-case is malloc per thread

- You could do this with transactions (real HW transactions) in some cases; could be complementary (for more complex structures, for instance).

- similar scheme to sys_membarrier(), could possibly share code

Contact

Proposal added by Dhaval Giani dhaval.giani@gmail.com (with Paul McKenney as the chief advisor)

 
2013/scaling.txt · Last modified: 2013/09/20 16:32 by 204.57.119.28
 