The Linux Plumbers 2013 Microconference - Scaling track focuses on scalability, both upwards and downwards as well as up and down the stack. There are quite a few active projects that are working to enhance performance and scalability both in the Linux kernel and in user-space projects. In fact, one of the purposes of this forum is for developers from different projects to meet and collaborate. After all, for the user to see good performance and scalability, all relevant projects must perform and scale well.
The structure will be similar to the one followed in previous years (2011 and 2012): about 30 minutes per subject, with discussion.
The schedule of the 2013 Scalability Plumbers Micro Conference is as follows. Note presentation slides can be found on the Plumbers page by following the links to the abstracts:
http://etherpad.osuosl.org/lpc2013-scaling
Linux Plumbers Conference 2013
Scaling Microconference
Volunteer needed for taking notes.
Andi Kleen on Transactional Synchronization Extensions (TSX)
Current status of lock elision in Linux http://linuxplumbersconf.org/2013/ocw/sessions/1161 (slides) Andi Kleen
Speculative execution: blocking or non-blocking? Blocking has added latency when transferring locks.
Intel TSX
User-controlled execution mode in the CPU (Haswell):
- HLE adds instruction prefixes for atomic ops; on failure it falls back; a nop on old CPUs
- RTM: new XBEGIN/XEND instructions with an explicit abort handler
Lock scaling results depend on the workload.
An elided lock:
- fast path is non-blocking, similar to a recursive reader lock
- individual cache lines act as locks
- may always fall back to the real lock
- uses the standard locking model
Overview of the Linux implementation:
- TSX perf profiling supported (needed to understand speculation)
- TSX lock elision: elide kernel locks
- glibc mutex elision: elide application locks
- various custom locks elided in applications; libitm (gcc)
- applications with non-scalable locking are the primary target
Lock adaptation:
- RTM locks with an adaptive abort handler; skips elision on failure (a safety net)
- simple algorithm; state stored in the lock; tunables
glibc mutex elision (in glibc 2.18):
- needs to be globally enabled
- any program using pthread mutexes can elide
- missing: tunables (per-program and global), rwlocks
- obscure POSIX requirements are a problem: deadlock requirements for nested locks; lost nesting support for trylocks and write locks; lost adaptive spinning locks
Future work:
- recursive locks, rwlocks
- improved adaptive algorithms
- tuning interface
- more than POSIX: C++11 locking, a new interface to elide condition variables
- better fast path without dynamic dispatch
- adaptive spinning as the default
Kernel elision:
- eliding mutex, spin, rw, bitspin, rwsem, and custom locks
- only a win in some areas with big locks
- occasional losses due to too-fine-grained locking; may benefit from lock coarsening
Full dynticks status http://linuxplumbersconf.org/2013/ocw/sessions/1143 Frederic Weisbecker
Low-frequency tick: more throughput, fewer interrupts, less CPU time stolen, less cache trashing.
High-frequency tick: better latency, timer and scheduler granularity, precision.
Duty of the tick:
- timekeeping (walltime, xtime, gettimeofday), jiffies
- timer wheel
- CPU time stats
- scheduler
- RCU
dynticks: remove the tick when possible – idle
Problems: icache/dcache periodically trashed; the tick steals CPU time multiple times per second.
Who's affected? HPC (extreme throughput); real time (extreme latency).
CPU time accounting
Poll-driven: listen at ring boundaries (syscalls, exceptions, IRQs).
Dynticks is only enabled if there is only a single runnable task on the CPU.
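As a concrete illustration (not from the talk), full dynticks is enabled via a kernel config option plus a boot parameter naming the CPUs to run tickless; the CPU set below is just an example:

```
# Kernel configuration (Linux 3.10+):
CONFIG_NO_HZ_FULL=y          # full dynticks support
CONFIG_RCU_NOCB_CPU=y        # offload RCU callbacks from tickless CPUs

# Boot parameter: run CPUs 1-7 tickless; CPU 0 remains the
# timekeeping CPU.  A tickless CPU stops its tick only while it
# has a single runnable task.
nohz_full=1-7
```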
User-level threads... with threads. http://linuxplumbersconf.org/2013/ocw/sessions/1653 Paul Turner
Three models:
- 1:1 (kernel threading) → the current ubiquitous implementation
- N:1 (user threading) → a single kernel context; no kernel awareness of user-level threading
- M:N (hybrid) → kernel-assisted
Parallel programming models: synchronous; delegated events; message passing / event loops.
Model: threads per request
+ simple
+ good data locality
- realized parallelism within a request
- latency predictability varies inversely with load
Asynchronous workers:
+ greater control of work partitioning
+ improved latency predictability
+ lower overheads achievable
- complex programming model
- encapsulation of control and data flow
- loss of data locality
Scalability Issues in Linux Kernel http://linuxplumbersconf.org/2013/ocw/sessions/1299 Dave Hansen, Tim Chen
Writes on shared data:
- writes to shared structures are expensive: spinlocks, r/w locks, atomic counters
- cache-line bouncing: even a very short hold time on a lock is expensive
- example: an ext4 inode lock on a sorted LRU list for reclaim put pressure on the page cache (>90% lock contention)
Lock stat:
- scales poorly due to the lockdep infrastructure; a build took 30% longer on a 60-core system
- heavyweight, but would be useful on production systems for debugging
Magic numbers: batch sizes, memory pool sizes, hash table sizes.
Multi-threaded ops:
- mmap_sem and page_table_lock contention
- contention when many signals are sent to individual threads
- files_open contending on file_lock
VM scaling:
- mmap_sem
- page faults have significant page allocation & clearing cost
- fork operations contend on the root anon_vma
- hugepages and other similar things (TLB flushing) are hacks and aren't actually very scalable
vmsplice with transparent huge pages Robert Jennings rcj@linux.vnet.ibm.com
An RFC is on the list for page flipping with vmsplice, in addition to copying.
http://marc.info/?l=linux-kernel&m=137477297209750&w=2 http://marc.info/?l=linux-fsdevel&m=137477295109744&w=2 http://marc.info/?l=linux-fsdevel&m=137477295109743&w=2
QEMU is adding a migration feature to move a VM to a new QEMU executable (for applying patches). This requires:
- page flipping rather than copying (can't afford to double memory usage)
- speed, to minimize downtime
The RFC adds page flipping for a narrow case (4K, aligned, single mapping, non-THP, etc.). The KVM host would like to use THP; currently THPs would be copied.
Moving 4K pages is much slower than desired (~5GB/s)
There is some room to improve within the scope of the current code, but a larger improvement is required.
THP support for vmsplice page flipping would address the environment in which KVM operates and provide a significant speedup.
BOF: Finding RCU Bugs Dhaval Giani, Paul McKenney, Frederic Weisbecker
Few people can pinpoint RCU bugs, both correctness bugs and performance bugs. Ideas discussed:
- detecting pointer leaks
- converting RCU into an rwlock
- RCU watchpoints by poisoning RCU pointers, expecting a GPF
- adding data (tagging) into pointer bits within the rcu_dereference operation
- checking when we enter a GPF
Discussion:
- hpa:
- Using the bits (not just putting garbage data): right now, it can catch bugs.
- Could add a level of indirection? Would be better if you don't have to.
- Move this pointer info into another pointer.
- Use a modified page table entry, per-CPU cr3 (x86), set at rcu_dereference.
- Does it require every instruction to be page-aligned? No.
- Scheme used for trapping bad mallocs/frees
- ASAN does something similar
- Alan: what about bugs caused by the sequence of use of rcu_defer/rcu_assign?
- kernel address sanitizer
- pte modification: limits you to use-cases with page tables
- People like the idea of an unlimited “watchpoint”: not actually a watchpoint, but a poisoned pointer.
Per-CPU atomics in userspace Andrew Hunter & Paul Turner
- Keep data at the per-CPU level rather than the per-thread level, for systems with many threads.
- By moving the memory allocator free list to per-CPU rather than per-thread, it better handles cases where memory is allocated by one thread, handed off to another thread, then freed.
- How do you abort a transaction? The kernel handles it.
- restart logic handled in user-space, simple check in kernel (infrequent abort vs frequent test trade-off)
- combining all regions for all such restart code, so the kernel only has to check one range
- initial use-case is malloc per thread
- You could do this with transactions (real HW transactions); in some cases they could be complementary (for more complex structures, for instance).
- similar scheme to sys_membarrier(), could possibly share code
Proposal added by Dhaval Giani dhaval.giani@gmail.com (with Paul McKenney as the chief advisor)