<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:blogger='http://schemas.google.com/blogger/2008' xmlns:georss='http://www.georss.org/georss' xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-5789291509148224079</id><updated>2026-03-08T14:11:39.974-07:00</updated><category term="linux"/><category term="kernel"/><category term="linux kernel"/><category term="performance"/><category term="scalability"/><category term="performance-goodies"/><category term="operating systems"/><category term="development"/><category term="x86"/><category term="kvm"/><category term="partition tables"/><category term="architecture"/><category term="block devices"/><category term="concurrency"/><category term="conference"/><category term="cpu"/><category term="efi"/><category term="gpt"/><category term="labels"/><category term="locks"/><category term="memory management"/><category term="plumbers"/><category term="virtualization"/><category term="C"/><category term="C programming"/><category term="Intel"/><category term="LPC 2015"/><category term="SMP"/><category term="TLB"/><category term="VMX"/><category term="algorithms"/><category term="assembler"/><category term="associateve"/><category term="attributes"/><category term="auditing"/><category term="barriers"/><category term="books"/><category term="caches"/><category term="caching"/><category term="compute express link"/><category term="computer science"/><category term="contention"/><category term="cpuid"/><category term="critique"/><category term="cxl"/><category term="data structures"/><category term="disks"/><category term="dos"/><category term="ept"/><category term="fdisk"/><category term="foss.in"/><category term="futex"/><category term="fuzzy testing"/><category term="google summer 
of code"/><category term="hardware"/><category term="hash tables"/><category term="india"/><category term="limits"/><category term="linux inode filename filesystem symlinks"/><category term="load acquire"/><category term="lpc"/><category term="lslk"/><category term="lslocks"/><category term="master boot record"/><category term="mbr"/><category term="memory"/><category term="memory model"/><category term="mmu"/><category term="numa"/><category term="paging"/><category term="partx"/><category term="prlimit"/><category term="process"/><category term="processor"/><category term="research"/><category term="resources"/><category term="security"/><category term="shadow pages"/><category term="store release"/><category term="stressing software"/><category term="sun"/><category term="synchronization"/><category term="system call"/><category term="systems"/><category term="tags"/><category term="target fuzzing"/><category term="tasks"/><category term="translations"/><category term="trinity"/><category term="ulimit"/><category term="unix"/><category term="userspace mutexes"/><category term="util-linux"/><category term="v4.14"/><category term="v4.15"/><category term="v4.16"/><category term="v4.17"/><category term="v4.18"/><category term="v4.19"/><category term="v4.20"/><category term="v5.0"/><category term="v5.1"/><category term="v5.2"/><title type='text'>Davidlohr Bueso</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default?redirect=false'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' 
href='http://www.blogger.com/feeds/5789291509148224079/posts/default?start-index=26&amp;max-results=25&amp;redirect=false'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>27</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-3709268265882614643</id><published>2023-12-01T11:14:00.000-08:00</published><updated>2023-12-01T11:14:50.019-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="compute express link"/><category scheme="http://www.blogger.com/atom/ns#" term="cxl"/><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="lpc"/><category scheme="http://www.blogger.com/atom/ns#" term="memory"/><category scheme="http://www.blogger.com/atom/ns#" term="plumbers"/><title type='text'>LPC 2023: CXL Microconference</title><content type='html'>&lt;p style=&quot;text-align: justify;&quot;&gt;The&amp;nbsp;&lt;a href=&quot;https://lpc.events/event/17/sessions/160/#20231113&quot;&gt;Compute Express Link (CXL) microconference&lt;/a&gt;&amp;nbsp;was held, for a second straight time, at this year&#39;s Linux Plumbers Conference. 
The goals for the track were to openly discuss current on-going development efforts around the core driver, as well as experimental memory management topics which lead to accommodating kernel infrastructure for new technology and use cases.&lt;/p&gt;&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLnHpM3hQsrkhKM7R4pj0vDJUZKJhfwPNPO1vpFtHIteNAyqUKmvRg8V9vSi1s0vTxZpCWdOkkx3FpuWcULEtfosEyfoDyKeWzwi44a4FZ_XzRMTiq4XMz9ZREeGRECbJm158ljiJVt4DgZZ04W8cMzSvO6wO_K5WFPDaZ4BDxzsSWg75Y8rogzoivgzz6/s4032/IMG_1436.jpg&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;3024&quot; data-original-width=&quot;4032&quot; height=&quot;197&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLnHpM3hQsrkhKM7R4pj0vDJUZKJhfwPNPO1vpFtHIteNAyqUKmvRg8V9vSi1s0vTxZpCWdOkkx3FpuWcULEtfosEyfoDyKeWzwi44a4FZ_XzRMTiq4XMz9ZREeGRECbJm158ljiJVt4DgZZ04W8cMzSvO6wO_K5WFPDaZ4BDxzsSWg75Y8rogzoivgzz6/w320-h197/IMG_1436.jpg&quot; title=&quot;CXL session at LPC23&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;CXL session at LPC23&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: justify;&quot;&gt;(i)&amp;nbsp;&lt;a href=&quot;https://lpc.events/event/17/contributions/1454/&quot; style=&quot;text-align: left;&quot;&gt;CXL Emulation in QEMU - Progress, status and most importantly what next?&lt;/a&gt;&lt;span style=&quot;text-align: 
left;&quot;&gt;&amp;nbsp;The cxl qemu maintainers presented the current state of the emulation, for which significant progress has been made, extending support beyond basic enablement. During this year, features such as volatile devices,&amp;nbsp;CDAT, poison and injection infrastructure have been added upstream qemu, while several others are in the process, such as CCI/mailbox, Scan Media and dynamic capacity. There was also further highlighting of the latter, for which DCD support was presented along with extent management&amp;nbsp;issues found in the 3.0 spec. Similarly, Fabric Management was another important topic, continuing the debate about qemu&#39;s role in FM development, which is still quite early. Concerns about the production (beyond testing) use cases for CCI kernel support were discussed, as well as semantics and interfaces that constrain qemu, such as host and switch coupling and differences with BMC behavior.&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: justify;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: justify;&quot;&gt;(ii)&amp;nbsp;&lt;a href=&quot;https://lpc.events/event/17/contributions/1453/&quot; style=&quot;text-align: left;&quot;&gt;CXL Type-2 core support&lt;/a&gt;&lt;span style=&quot;text-align: left;&quot;&gt;. The state and purpose of existing experimental support for type 2 (accelerators) devices was presented, for both the kernel and qemu sides. The kernel support led to preliminary&amp;nbsp;abstraction improvement work being upstreamed, facilitating actual accelerator&amp;nbsp;integration with the cxl core driver. However, the rest is merely guess work and the floor is open for an actual hardware backed proposal. 
In addition, HDM-DB support would also be welcomed as a step forward.&amp;nbsp;&lt;/span&gt;The qemu side is very basic and designed to just exercise&amp;nbsp;core checks, for which it&#39;s emulation should be limited, specially in light of cxl_test.&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: justify;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: justify;&quot;&gt;(iii)&amp;nbsp;&lt;a href=&quot;https://lpc.events/event/17/contributions/1457/&quot;&gt;Plumbing challenges in Dynamic capacity device&lt;/a&gt;. An in-depth coverage and discussion, from a kernel side, of the state of DCD support and considerations around corner cases. Semantics of releasing DC for full partial extents (ranges) are two different beasts altogether. Releasing all the already given&amp;nbsp; memory can simply require memory being offline and be done, avoiding unnecessary complexity in the kernel. Therefore the kernel can perfectly well reject the request, and FM design should keep that into consideration. Partial extents, on the other hand, are unsupported for the sake of simplicity, at least until a solid industry use case comes along. Forced DC removal of online memory semantics were also discussed, emphasizing that such DC memory is not guaranteed to ever be given back by the kernel, mapped or not. Forcing the event, the hardware does not care and the kernel has most likely crashed anyway. Support for extent tagging was another topic, establishing the need for supporting it, coupling a device to a tag domain, being a sensible use case. 
For now at least, the implementation can be kept to to simply enumerate tags and the necessary attributes to leave the memory matching to userspace, instead of more complex surgeries to create DAX devices on specific extents, dealing with sparse regions.&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: justify;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: justify;&quot;&gt;(iv)&amp;nbsp;&lt;a href=&quot;https://lpc.events/event/17/contributions/1452/&quot; style=&quot;text-align: left;&quot;&gt;Adding RAS Support for CXL Port Devices&lt;/a&gt;&lt;span style=&quot;text-align: left;&quot;&gt;. Starting with a general overview of RAS, t&lt;/span&gt;&lt;span style=&quot;text-align: left;&quot;&gt;his touched on the current state for support in CXL 1.1 and 2.0.&amp;nbsp; Special handling is required for RCH:&amp;nbsp;&lt;/span&gt;&lt;span&gt;d&lt;/span&gt;&lt;span&gt;ue to the RCRB implementation, the RCH downstream port does not have a BDF&lt;/span&gt;&lt;span&gt;, needed for AER error handling; this work was merged in v6.7. As for CXL Virtual Hierarchy implementation, it is left still open, potentially things could move away from the PCIe port service driver model, which is not entirely liked. There are however, clear requirements: not-CXL specific (AER is a PCIe protocol, used by CXL.io); implement driver callback logic specific to that technology or device, giving flexibility to handle that specific need; and allow enable/disable on a per-device granularity. 
There were discussions around the order for which a registration handler is added in the PCI port driver, noting that it made sense to go top-down from the port and searching children, instead of written from a lower level.&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: justify;&quot;&gt;&lt;span style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: justify;&quot;&gt;(v)&amp;nbsp;&lt;a href=&quot;https://lpc.events/event/17/contributions/1455/&quot; style=&quot;text-align: left;&quot;&gt;Shared CXL 3 memory: what will be required?&lt;/a&gt;&amp;nbsp;Overview of the state, semantics and requirements for supporting shared fabric attached memory (FAM). A strong enablement use case is leveraging applications that already handle data sets in files. In addition appropriate workload candidates will fit the &quot;master writer, multiple readers&quot; read-only model for which this sort of machinery would make sense. Early results show that the benefits can out-weigh costly remote CXL memory access such as fitting larger data sets in FAM that would otherwise be possible in a single host. Similarly this avoids cache-coherency costs by simply never modifying the memory. A number of concrete data science and AI usecases were presented. 
Shared FAM is meant to be mmap-able, file-backed, special purpose memory, for which a FAMFS prototype is described, overcoming limitations of just using DAX device/FSDAX, such as distributing metadata in a shareable way.&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: justify;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: justify;&quot;&gt;(vi)&amp;nbsp;&lt;a href=&quot;https://lpc.events/event/17/contributions/1458/&quot; style=&quot;text-align: left;&quot;&gt;CXL Memory Tiering for heterogenous computing&lt;/a&gt;&lt;span style=&quot;text-align: left;&quot;&gt;. Discusses the pros and cons of interleaving heterogeneous (ie: DRAM and CXL) memory through hardware and/or software for bandwidth optimization. Hardware interleaving is simple to configure through the BIOS, but limited by not allowing the OS to manage allocations, otherwise hiding the NUMA topology (single node) as well as being a static configuration. The software interleaving solves these limitations with hardware and relies on weighted nodes for allocation distribution when doing the initial mapping (vma). Several interfaces have been posted, which incrementally are converging into a NUMA node based interface. The caveat is to have a single (configurable) system-wide set of weights, or to allow more flexibility, such as hierarchically through cgroups - something which has not been particularly sold yet. 
Combining both hardware and software models relies on within a socket, splitting channels among respective DDR and CXL NUMA nodes for which software can explicitly (numactl) set the interleaving - it is still restrained however by being static as the BIOS is in charge of setting the number of NUMA nodes.&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: justify;&quot;&gt;&lt;span style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: justify;&quot;&gt;(vii)&amp;nbsp;&lt;a href=&quot;https://lpc.events/event/17/contributions/1456/&quot; style=&quot;text-align: left;&quot;&gt;A move_pages() equivalent for physical memory&lt;/a&gt;&lt;span style=&quot;text-align: left;&quot;&gt;. Through an experimental interface, this focused on the semantics of tiering and device driven page movement. There are currently various mechanisms for access detection, such as PMU-based, fault hinting for page promotion and idle bit page monitoring; each with its set of limitations, while runtime overhead is a universal concern. Hardware mechanisms could help with the burden but the problem is that devices only know physical memory and must therefore do expensive reverse mapping lookups; nor are there any interfaces for this, and it is difficult to with out hardware standardization. 
A good starting point would be to keep the suggested move_phys_pages as an interface, but not have it be an actual syscall.&lt;/span&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/3709268265882614643/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2023/12/lpc-2023-cxl-microconference.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/3709268265882614643'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/3709268265882614643'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2023/12/lpc-2023-cxl-microconference.html' title='LPC 2023: CXL Microconference'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLnHpM3hQsrkhKM7R4pj0vDJUZKJhfwPNPO1vpFtHIteNAyqUKmvRg8V9vSi1s0vTxZpCWdOkkx3FpuWcULEtfosEyfoDyKeWzwi44a4FZ_XzRMTiq4XMz9ZREeGRECbJm158ljiJVt4DgZZ04W8cMzSvO6wO_K5WFPDaZ4BDxzsSWg75Y8rogzoivgzz6/s72-w320-h197-c/IMG_1436.jpg" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-1793401371820330574</id><published>2019-09-10T12:26:00.002-07:00</published><updated>2019-09-10T12:26:59.854-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="linux kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="operating systems"/><category scheme="http://www.blogger.com/atom/ns#" 
term="performance"/><category scheme="http://www.blogger.com/atom/ns#" term="performance-goodies"/><category scheme="http://www.blogger.com/atom/ns#" term="scalability"/><category scheme="http://www.blogger.com/atom/ns#" term="v5.2"/><title type='text'>Linux v5.2: Performance Goodies</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
locking/rwsem: optimize trylocking for the uncontended case&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
This applies the idea that in most cases, a rwsem will be uncontended (single threaded). For example, experimentation showed that page fault paths really expect this. The change itself makes the code basically not read in a cacheline in a tight loop over and over. Note however that this can be a double edged sword, as microbenchmarks have shown performance deterioration upon high amounts of tasks, albeit mainly pathological workloads.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ddb20d1d3aed8f130519c0a29cd5392efcc067b8&quot;&gt;ddb20d1d3aed&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a338ecb07a338c9a8b0ca0010e862ebe598b1551&quot;&gt;a338ecb07a33&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
lib/lockref: limit number of cmpxchg loop retries&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Unbounded loops are rather frowned upon, especially ones doing CAS operations. As such, Linus suggested adding an arbitrary upper bound to the loop to force the slowpath (spinlock fallback), which was seen to improve performance on an ad-hoc testcase on hardware that incurs the loop retry game.&lt;/div&gt;
&lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=893a7d32e8e04ca4d6c882336b26ed660ca0a48d&quot;&gt;893a7d32e8e0&lt;/a&gt;]&lt;br /&gt;
&amp;nbsp; &lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
rcu: avoid unnecessary softirqs when system is idle&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Upon an idle system with no pending callbacks, rcu softirqs to process callbacks were being triggered repeatedly. Specifically the mismatch between &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;cpu_no_qs&lt;/span&gt; and &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;&lt;i&gt;core_need_rq&lt;/i&gt;&lt;/span&gt; was addressed.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=671a63517cf983ad8eaa324167165cef245ab744&quot;&gt;671a63517cf9&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
rcu: fix potential cond_resched() slowdowns&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
When using the &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;jiffies_till_sched_qs&lt;/span&gt; kernel boot parameter, a bug made&lt;i&gt; &lt;/i&gt;&lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;jiffies_to_sched_qs&lt;/span&gt; become uninitialized as zero and therefore impacts negatively on &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;cond_resched()&lt;/span&gt;.&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6973032a602ee678c98644a30d57ebf9c72dd6d3&quot;&gt;6973032a602e&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
mm: improve vmap allocation&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Doing a vmalloc can be quite slow at times, and with it being done with preemption disabled, can affect workloads that are sensitive to this. The problem lies in the fact that a new VA area is done over a busy list iteration until a suitable hole is found between two busy areas. The changes propose the always reliable red-black tree to keep blocks sorted by their offsets along with a list keeping the free space in order of increasing addresses. &lt;/div&gt;
&lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=68ad4a3304335358f95a417f2a2b0c909e5119c4&quot;&gt;68ad4a330433&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=68571be99f323c3c3db62a8513a43380ccefe97c&quot;&gt;68571be99f32&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
mm/gup: safe usage of get_user_pages_fast() with DAX&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Users of &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;get_user_pages_fast()&lt;i&gt; &lt;/i&gt;&lt;/span&gt;have potential performance benefits compared to its non-fast equivalent, by avoiding mmap_sem. However drivers such as rdma can pin these pages for a significant amount of time, where a number of issues come with the filesystem as referenced pages will block a number of critical operations and is known to &lt;a href=&quot;https://lwn.net/Articles/784574/&quot;&gt;mess up DAX&lt;/a&gt;. A new &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;FOLL_LONGTERM&lt;/span&gt; flag is added and checked accordingly; which also means that other users such as xdp can now also be converted to gup_fast.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=932f4a630a695212bdc7379b05f9bd0dafc5d968&quot;&gt;932f4a630a69&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b798bec4741bdd80224214fdd004c8e52698e425&quot;&gt;b798bec4741b&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=73b0140bf0fe9df90fb267c00673c4b9bf285430&quot;&gt;73b0140bf0fe&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7af75561e17132b20b5bc047d222f34b3e7a3e6e&quot;&gt;7af75561e171&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9fdf4aa156733e3f075a9d7d0b026648b3874afe&quot;&gt;9fdf4aa15673&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=664b21e717cfe4781137263f2555da335549210e&quot;&gt;664b21e717cf&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f3b4fdb18cb51bd6ca2c245fbe630ccbea95b3c9&quot;&gt;f3b4fdb18cb5&lt;/a&gt; ]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
lib/sort: faster and smaller&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Because &lt;i&gt;CONFIG_RETPOLINE&lt;/i&gt; has made indirect calls much more expensive, these changes reduce the number made by the library sort functions, &lt;i&gt;lib/sort&lt;/i&gt; and &lt;i&gt;lib/list_sort&lt;/i&gt;. A number of optimizations and clever tricks are used such as a more efficient bottom up heapsort and playing nicer with store buffers.&lt;/div&gt;
&lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=37d0ec34d111acfdb82b24e3de00d926c0aece4d&quot;&gt;37d0ec34d111&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22a241ccb2c19962a0fb02c98154aa93d3fc1862&quot;&gt;22a241ccb2c1&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8fb583c4258d08f0aff105aa2ae5157b7d414ea2&quot;&gt;8fb583c4258d&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=043b3f7b6388fca6be86ca82979f66c5723a0d10&quot;&gt;043b3f7b6388&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b5c56e0cdd62979dd538e5363b06be5bdf735a09&quot;&gt;b5c56e0cdd62&lt;/a&gt;]&lt;/div&gt;
&lt;br /&gt;
&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
ipc/mqueue: make msg priorities truly O(1)&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
By keeping the pointer to the tree&#39;s rightmost node, the process of consuming a message can be done in constant time, instead of logarithmic.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a5091fda4e3c202aeb1728a86d0fcd20fd0f4f5e&quot;&gt;a5091fda4e3c&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
x86/fpu: load FPU registers on return to userland&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
This is a large, 27-patch, cleanup and optimization to only load fpu registers on return to userspace, instead of upon every context switch. This means that tasks that remain in kernel space do not load the registers. Accessing the fpu registers in the kernel requires disabling preemption and bottom-halves for scheduler and softirqs, accordingly. &lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2722146eb78451b30e4717a267a3a2b44e4ad317&quot;&gt;2722146eb784&lt;/a&gt; ... &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a5eff7259790d5314eff10563d6e59d358cce482&quot;&gt;a5eff7259790&lt;/a&gt;]&lt;/div&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
x86/hyper-v: implement EOI optimization&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Avoid a vmexit on EOI. This was seen to slightly improve IOPS when testing nvme disks with raid and ext4.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ba696429d290690db967e5f49463df4b2c1314a4&quot;&gt;ba696429d290&lt;/a&gt;]&lt;/div&gt;
&amp;nbsp; &lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
btrfs: improve performance on fsync of files with multiple hardlinks&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
A fix to a performance regression seen in pgbench which can make fsync a full transaction commit in order to avoid losing hard links and new ancestors of the fsynced inode.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8aa330d2acb122563be87c42d82c5c8649cf658&quot;&gt;b8aa330d2acb&lt;/a&gt;]&lt;/div&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
fsnotify: fix unlink performance regression&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
This restores an unlink performance optimization that avoids &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;take_dentry_name_snapshot()&lt;/span&gt;.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4d8e7055a4058ee191296699803c5090e14f0dff&quot;&gt;4d8e7055a405&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
block/bfq: do not merge queues on flash storage with queuing&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Disable queue merging on non-rotational devices with internal queueing, thus boosting throughput on interleaved IO.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8cacc5ab3eacf5284bc9b0d7d5b85b748a338104&quot;&gt;8cacc5ab3eac&lt;/a&gt;]&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/1793401371820330574/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2019/09/linux-v52-performance-goodies.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/1793401371820330574'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/1793401371820330574'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2019/09/linux-v52-performance-goodies.html' title='Linux v5.2: Performance Goodies'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-4374100953442705541</id><published>2019-05-09T13:10:00.001-07:00</published><updated>2019-05-09T13:10:23.652-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="linux kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="operating systems"/><category scheme="http://www.blogger.com/atom/ns#" term="performance"/><category scheme="http://www.blogger.com/atom/ns#" term="performance-goodies"/><category scheme="http://www.blogger.com/atom/ns#" term="scalability"/><category scheme="http://www.blogger.com/atom/ns#" term="v5.1"/><title type='text'>Linux v5.1: Performance Goodies</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
sched/wake_q: reduce atomic operations for special users&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Some core users of &lt;i&gt;wake_qs&lt;/i&gt;,
 futex and rwsems were incurring double task reference counting - 
which was a side effect for safety reasons. This change levels the 
call&#39;s performance with the rest of the users.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=07879c6a3740fbbf3c8891a0ab484c20a12794d8&quot;&gt;07879c6a3740&lt;/a&gt;]&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
irq: Speedup for interrupt statistics in /proc/stat&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
On large systems with a large amount of interrupts the
readout of &lt;i&gt;/proc/stat &lt;/i&gt;takes a long time to sum up the interrupt
statistics.&amp;nbsp;
The reason for this is that interrupt statistics are accounted per cpu. So
the &lt;i&gt;/proc/stat &lt;/i&gt;logic has to sum up the interrupt stats for each interrupt. While applications shouldn&#39;t really be doing this to a point where it creates bottlenecks, the fix was fairly easy. &lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1136b0728969901a091f0471968b2b76ed14d9ad&quot;&gt;1136b0728969&lt;/a&gt;]&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
mm/swapoff: replace quadratic complexity with linear&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;i&gt;try_to_unuse() &lt;/i&gt;is of quadratic complexity, with a lot of wasted effort.
It unuses swap entries one by one, potentially iterating over all the
page tables for all the processes in the system for each one. With these changes, it now iterates over the system&#39;s mms once, unusing
all the affected entries as it walks each set of page tables.&lt;br /&gt;
&lt;br /&gt;
Improvements show time reductions for &lt;i&gt;swapoff&lt;/i&gt; being called on a swap partition containing about 6G of data, from 8 to 3 minutes. &lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c5bf121e4350a933bd431385e6fcb72a898ecc68&quot;&gt;c5bf121e4350&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b56a2d8af9147a4efe4011b60d93779c0461ca97&quot;&gt;b56a2d8af914&lt;/a&gt;]&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
&amp;nbsp;
mm: make pinned_vm an atomic counter&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
This reduces some of the bulky &lt;i&gt;mmap_sem&lt;/i&gt; games that are played when, mostly rdma, deals with the pinned pages counter. It also pivots on not relying on the lock for &lt;i&gt;get user pages&lt;/i&gt; operations.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=70f8a3ca68d3e1f3344d959981ca55d5f6ec77f7&quot;&gt;70f8a3ca68d3&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3a2a1e90564e3ad215aa5c6ddc0e741cd6208a93&quot;&gt;3a2a1e90564e&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b95df5e3e45914c679fa5d4ca08abdd1c98b9f50&quot;&gt;b95df5e3e459&lt;/a&gt;] &lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
drivers/async: NUMA aware async_schedule calls&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;div class=&quot;content&quot;&gt;
Asynchronous function calls reduce, primarily, kernel boot time by safely doing out of order operations, such as device discovery. This series improves the NUMA
locality by being able to
schedule device specific init work on specific NUMA nodes in order to
improve performance of memory initialization. Significant reductions in init times for persistent
memory were seen.&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3451a495ef244a88ed6317a035299d835554d579&quot;&gt;3451a495ef24&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed88747c6c4a2fc2f961a36d4c50cb0868c30229&quot;&gt;ed88747c6c4a&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ef0ff68351be4fd83bec2d797f0efdc0174a55a4&quot;&gt;ef0ff68351be&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8204e0c1113d6b7f599bcd7ebfbfde72e76c102f&quot;&gt;8204e0c1113d&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6be9238e5cb64741ff95c3ae440b112753ad93de&quot;&gt;6be9238e5cb6&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c37e20eaf4b21125898fd454f3ea6b212865d0a6&quot;&gt;c37e20eaf4b2&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8b9ec6b732775849f506aa6c2649e626e82a297c&quot;&gt;8b9ec6b73277&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=af87b9a7863c7bb47f8bd015c0ce4a37d70c5225&quot;&gt;af87b9a7863c&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=57ea974fb8717864e8b7ec679363c5a3298a165e&quot;&gt;57ea974fb871&lt;/a&gt;]&lt;/div&gt;
&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
lib/iov_iter: optimize page_copy_sane()&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;div class=&quot;content&quot;&gt;
This avoids cacheline misses when dereferencing a &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;struct page&lt;/span&gt;, via &lt;i&gt;compound_head()&lt;/i&gt;, when possible. Apparently the overhead was visible on TCP doing &lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;i&gt;recvmsg() &lt;/i&gt;&lt;/span&gt;calls dealing with GRO packets.&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6daef95b8c914866a46247232a048447fff97279&quot;&gt;6daef95b8c91&lt;/a&gt;]&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
fs/epoll: reduce lock contention in ep_poll_callback()&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
This patch
increases the bandwidth of events which can be delivered from sources to
the poller by adding poll items in a lockless way to the ready list; via clever ways of &lt;i&gt;xchg() &lt;/i&gt;while holding a reader &lt;i&gt;rwlock&lt;/i&gt; . This improves &lt;a href=&quot;https://github.com/rouming/test-tools/blob/master/stress-epoll.c&quot;&gt;scenarios&lt;/a&gt; with multiple threads generating IO events&lt;span class=&quot;pl-c&quot;&gt; which are delivered to a single threaded &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;epoll_wait()&lt;/span&gt;er.&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c141175d011f18252abb9aa8b018c4e93c71d64b&quot;&gt;c141175d011f&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3e320b61581ef7919269ca242ff13951ccfc763&quot;&gt;c3e320b61581&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a218cc4914209ac14476cb32769b31a556355b22&quot;&gt;a218cc491420&lt;/a&gt;]&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
fs/nfs: reduce cost of listing huge directories (readdirplus)&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
When listing very large directories via NFS, clients may take a long
time to complete. Most of the culprit is in various degrees of libc&#39;s &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;readdir(2)&lt;/span&gt; reading 32k files at a time. To improve performance and reduce the number of rpc calls, the NFS
readdirplus rpc will ask for more data (more than 32k); the data can
fill more than one page, and the cached pages can be used for the next readdir
call. Benchmarks show rpc calls decreasing by 85% while listing a directory with 300k files.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be4c2d4723a4a637f0d1b4f7c66447141a4b3564&quot;&gt;be4c2d4723a4&lt;/a&gt;]&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
fs/pnfs: Avoid read/modify/write when it is not necessary&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
When testing with &lt;i&gt;fio&lt;/i&gt;, Throughput of overwrite (both buffered and O_SYNC) is noticeably
improved.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=97ae91bbf3a70fc8cee3c9030564cfc892cc8cee&quot;&gt;97ae91bbf3a7&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cde04e90d5be46b4b6655b965b496e6b6f18e49&quot;&gt;2cde04e90d5b&lt;/a&gt;]&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/4374100953442705541/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2019/05/linux-v51-performance-goodies.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/4374100953442705541'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/4374100953442705541'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2019/05/linux-v51-performance-goodies.html' title='Linux v5.1: Performance Goodies'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-3604346424115444678</id><published>2019-05-09T13:10:00.000-07:00</published><updated>2019-05-09T13:10:08.902-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="linux kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="operating systems"/><category scheme="http://www.blogger.com/atom/ns#" term="performance"/><category scheme="http://www.blogger.com/atom/ns#" term="performance-goodies"/><category scheme="http://www.blogger.com/atom/ns#" term="scalability"/><category scheme="http://www.blogger.com/atom/ns#" term="v5.0"/><title type='text'>Linux v5.0: Performance Goodies</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
mm/page-alloc: reduce zone-&amp;gt;lock contention&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Contention in the page allocator was seen in a network traffic report, in which order-0 allocations are being freed directly back to the buddy allocator, instead of making use of percpu-pages in the &lt;i&gt;page_frag_free()&lt;/i&gt; call. Aside from eliminating the contention, it was seen to improve some microbenchmarks.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=65895b67ad27df0f62bfaf82dd5622f95ea29196&quot;&gt;65895b67ad27&lt;/a&gt;]&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
mm/mremap: improve scalability on large regions&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
When THP is disabled, &lt;i&gt;move_page_tables()&lt;/i&gt; can bottleneck a large &lt;i&gt;mremap()&lt;/i&gt; call, as it will copy each pte at a time. This patch speeds up the performance by copying at the PMD level when possible. Up to 20x speedups were seen when doing a 1Gb remap.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2c91bd4a4e2e530582d6fd643ea7b86b27907151&quot;&gt;2c91bd4a4e2e&lt;/a&gt;]&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
mm: improve anti-fragmentation&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Given sufficient time or an adverse workload, memory gets fragmented and the long-term success of high-order allocations degrades. 
Overall the series reduces external fragmentation causing events by over 94%
on 1 and 2 socket machines, which in turn impacts high-order allocation
success rates over the long term.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6bb154504f8b496780ec53ec81aba957a12981fa&quot;&gt;6bb154504f8b&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a921444382b49cc7fdeca3fba3e278bc09484a27&quot;&gt;a921444382b4&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0a79cdad5eb213b3a629e624565b1b3bf9192b7c&quot;&gt;0a79cdad5eb2&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1c30844d2dfe272d58c8fc000960b835d13aa2ac&quot;&gt;1c30844d2dfe&lt;/a&gt;]&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
mm/hotplug: optimize clear hw_poisoned_pages()&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
During hotplug remove, the kernel will loop for the respective number of pages looking for poisoned pages. Check the atomic hint in case there are none, and optimize the function.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5eb570a8d9248e0c1358078a59916d0e337e695b&quot;&gt;5eb570a8d924&lt;/a&gt;]&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
mm/ksm: Replace jhash2 with xxhash&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;i&gt;xxhash&lt;/i&gt; is an extremely fast non-cryptographic hash algorithm for checksumming, making it suitable to use in kernel samepage merging. On a custom KSM benchmark, throughput was seen to improve from 1569 to 8770 MB/s.&lt;br /&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&amp;nbsp;[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b9df58b79fa283fbedc0fb6a8e248599444bacc&quot;&gt;0b9df58b79fa&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=59e1a2f4bf83744e748636415fde7d1e9f557e05&quot;&gt;59e1a2f4bf83&lt;/a&gt;]&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
genirq/affinity: Spread IRQs to all available NUMA nodes &lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
If the number of NUMA nodes exceeds the number of MSI/MSI-X interrupts
which are allocated for a device, the interrupt affinity spreading code
fails to spread them across all nodes. NUMA nodes above the number of interrupts are all assigned
to hardware queue 0 and therefore NUMA node 0, which results in bad
performance and has CPU hotplug implications. Fix this by assigning via round-robin.&lt;/div&gt;
&lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b82592199032bf7c778f861b936287e37ebc9f62&quot;&gt;b82592199032&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
fs/epoll: Optimizations for epoll_wait()&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Various performance changes oriented towards improving the waiting side, such that contention on the epoll waitqueue spinlock (previously &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;ep-&amp;gt;lock&lt;/span&gt;) is reduced. This produces pretty good results for various concurrent &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;epoll_wait(2)&lt;/span&gt; benchmarks. &lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=74bdc129850c32eaddc625ce557da560303fbf25&quot;&gt;74bdc129850c&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4e0982a00564c80cb849a892043450860ef91e14&quot;&gt;4e0982a00564&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=76699a67f3041ff4c7af6d6ee9be2bfbf1ffb671&quot;&gt;76699a67f304&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=21877e1a5b520132f54515f8835c963056418b4c&quot;&gt;21877e1a5b52&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=c5a282e9635e9c7382821565083db5d260085e3e&quot;&gt;c5a282e9635e&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=abc610e01c663e25c41a3bdcbc4115cd7fbb047b&quot;&gt;abc610e01c66&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=86c051793b4c941ee4481725d57cf2a27f6b3aaf&quot;&gt;86c051793b4c&lt;/a&gt;]&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
lib/sbitmap: Various optimizations&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Two optimizations to the sbitmap core were introduced, which is used, for example, by the block-mq tags. The first optimizes wakeup checks and adds to the core api, while the second introduces batched clearing of bits, trading 64 atomic bitops for 2 &lt;i&gt;cmpxchg&lt;/i&gt; calls.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d2ee7122c73be6a3b6bfe90d237e8aed737cfaa&quot;&gt;5d2ee7122c73&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ea86ea2cdced20057da4d2c32965c1219c238197&quot;&gt;ea86ea2cdced&lt;/a&gt;]&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
&lt;/h4&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;h4 style=&quot;text-align: justify;&quot;&gt;
fs/locks: Avoid thundering herd wakeups&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
When one thread releases a lock on a given file, it wakes up all other threads that
are waiting (classic thundering-herd) - one will get the lock and the
others go to sleep.&amp;nbsp;
The overhead starts being noticeable with increasing thread counts. These changes create a tree of pending lock requests in which siblings
don&#39;t conflict and each lock request does conflict with its parent.
When a lock is released, only requests which don&#39;t conflict with each
other are woken.&lt;br /&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Testing shows that lock-acquisitions-per-second is now fairly stable even
as number of contending process goes to 1000.  Without this patch,
locks-per-second drops off steeply after a few 10s of processes. Micro-benchmarks can be found per the &lt;a href=&quot;https://github.com/mwilck/lockscale&quot;&gt;lockscale&lt;/a&gt; program, which tests &lt;code&gt;fcntl(..., F_OFD_SETLKW, ...) &lt;/code&gt;and &lt;code&gt;&lt;code&gt;flock(..., LOCK_EX) &lt;/code&gt;&lt;/code&gt;calls.&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d6367d6241371566597c9ab6efe4de0abf254eed&quot;&gt;d6367d624137&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5946c4319ebb39af17fb9d6a606c866ce9b88740&quot;&gt;5946c4319ebb&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=16306a61d3b7c433c7a127ec6224867b88ece687&quot;&gt;16306a61d3b7&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c0e15908979d269a8263b0c0a222b894b9f403e9&quot;&gt;c0e15908979d&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=fd7732e033e30b3a586923b57e338c859e17858a&quot;&gt;fd7732e033e3&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cb03f94ffb070b13bc0fa58b4ef4fdb558418d27&quot;&gt;cb03f94ffb07&lt;/a&gt;]&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
arm64/lib: improve crc32 performance for deep pipelines&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
This change replaces most branches with a branchless code path that overlaps 16 byte loads to process the first (length % 32) bytes, and processes the remainder using a loop that processes 32 bytes at a time.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=efdb25efc7645b326cd5eb82be5feeabe167c24e&quot;&gt;efdb25efc764&lt;/a&gt;]&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/3604346424115444678/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2019/05/linux-v50-performance-goodies.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/3604346424115444678'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/3604346424115444678'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2019/05/linux-v50-performance-goodies.html' title='Linux v5.0: Performance Goodies'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-810141038964555693</id><published>2019-02-24T15:53:00.001-08:00</published><updated>2019-02-24T15:53:36.200-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="linux kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="performance"/><category scheme="http://www.blogger.com/atom/ns#" term="performance-goodies"/><category scheme="http://www.blogger.com/atom/ns#" term="scalability"/><category scheme="http://www.blogger.com/atom/ns#" term="v4.20"/><title type='text'>Linux v4.20: Performance Goodies</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
With v4.20 out for almost the entire v5.0 rc-cycle, here are some of the more interesting performance related changes that made their way in.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
signal: Use a smaller struct siginfo in the kernel&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Reduces the memory footprint of &#39;&lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;struct siginfo&lt;/span&gt;&#39;, most of which is just reserved. Ultimately this shrinks the structure from spanning two cachelines to just one.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4ce5f9c9e7546915c559ffae594e6d73f918db00&quot;&gt;4ce5f9c9e754&lt;/a&gt;]&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
sched/fair: Fix cpu_util_wake() for &#39;execl&#39; type workloads&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Fix an exec() related performance regression, which was caused by incorrectly calculating load and migrating tasks on exec() when  they shouldn&#39;t be. &lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c469933e772132aad040bd6a2adc8edf9ad6f825&quot;&gt;c469933e7721&lt;/a&gt;]&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
locking/rwsem: Exit read lock slowpath if queue empty and no writer&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
This change presents a new heuristic for optimizing rw-semaphores, specifically in read-mostly scenarios. Before the patch, a reader could find itself in a situation when it was in the slowpath, due to an occasional writer thread, but the writer was then released, and only other readers are now present.&amp;nbsp; At that point the waitqueue was enlarged unnecessarily, causing other readers attempting to lock to see waiting readers. This directly improves some issues found when (ab)using &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;pread64()&lt;/span&gt; and XFS.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4b486b535c33ef354ecf02a2650919004fd7d2b0&quot;&gt;4b486b535c33&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
mm: mmap: zap pages with read mmap_sem in munmap&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
When a process unmaps a range of memory, the infamous &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;mmap_sem&lt;/span&gt; would
be held for the duration of the entire &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;munmap()&lt;/span&gt; call, which can be a long time for
big mappings (reportedly up to 18 seconds for a 320Gb mapping).&amp;nbsp; A two-phase approach was done to address this where the key is to unmap the vma first such that the semaphore can be taken exclusively at first then downgrade it such that it can be shared while doing the zapping and freeing of page tables.&lt;/div&gt;
&lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=dd2283f2605e3b3e9c61bcae844b34f2afa4813f&quot;&gt;dd2283f2605e&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b4cefb36051244bcb5651026d862c332a6cac7df&quot;&gt;b4cefb360512&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cb4922496ae40a775a1b17025eaa1060e8991253&quot;&gt;cb4922496ae4&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
net/tcp: optimize tcp internal pacing&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
When TCP implements its own pacing (when no fq packet scheduler is used), it is arming high resolution timer after a packet is sent. But in many cases (like TCP_RR kind of workloads), this high resolution timer expires before the application attempts to write the following packet. Setup the timer only when a packet is about to be sent, and if tcp_wstamp_ns is in the future,&amp;nbsp; showing a ~10% performance increase in TCP_RR workloads.&lt;/div&gt;
&lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=864e5c090749448e879e86bec06ee396aa2c19c5&quot;&gt;864e5c090749&lt;/a&gt;]&lt;br /&gt;
&amp;nbsp; &lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
fs: better member layout of struct super_block&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Re-organize &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;&#39;struct super_block&#39;&lt;/span&gt; to try and keep some frequently accessed fields on the same cache line as well as grouping the rarely accessed members. This was seen to address a regression on a concurrent &lt;i&gt;unlink&lt;/i&gt; intensive workload.&lt;/div&gt;
&lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=99c228a994ec8b1580c43631866fd2c5440f5bfd&quot;&gt;99c228a994ec&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
fs/fuse: improved scalability &lt;/h4&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Two changes with performance-visible effects went in. The first series changes some of the protections for background requests. This allows async reads to avoid taking the fuseconn lock. Secondly, a hash table was implemented for processing requests, which was seen to address ~20% of time spent in &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;request_find()&lt;/span&gt; under some workloads with Virtuozzo storage over rdma.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e287179afe2190faa7b97915cb89215dde5e044b&quot;&gt;e287179afe21&lt;/a&gt;  &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a23f2b8adbe4bd584f936f7ac17a99750eed9d7&quot;&gt;2a23f2b8adbe&lt;/a&gt;  &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2b30a533148af4f3865c0dcd619ad93ab3f4ba52&quot;&gt;2b30a533148a&lt;/a&gt;  &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ae2dffa39485c6fd4f22321814c7287c274b473a&quot;&gt;ae2dffa39485&lt;/a&gt;  &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=63825b4e1da5a3cba79d835a5925e5daf7db3a77&quot;&gt;63825b4e1da5&lt;/a&gt; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c59fd85e4fd07fdf0ab523a5e9734f5338d6aa19&quot;&gt;c59fd85e4fd0&lt;/a&gt;  &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be2ff42c5d6ebc8552c82a7d1697afae30510ed9&quot;&gt;be2ff42c5d6e&lt;/a&gt;]&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/810141038964555693/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2019/02/linux-v420-performance-goodies.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/810141038964555693'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/810141038964555693'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2019/02/linux-v420-performance-goodies.html' title='Linux v4.20: Performance Goodies'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-7780282064355244285</id><published>2018-10-25T11:19:00.000-07:00</published><updated>2018-10-25T11:19:31.587-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="linux kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="performance"/><category scheme="http://www.blogger.com/atom/ns#" term="performance-goodies"/><category scheme="http://www.blogger.com/atom/ns#" term="scalability"/><category scheme="http://www.blogger.com/atom/ns#" term="v4.19"/><title type='text'>Linux v4.19: Performance Goodies</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
This post marks one year since I began doing these kernel performance goodies write ups,&amp;nbsp; &lt;a href=&quot;https://blog.stgolabs.net/2017/11/linux-v414-performance-goodies.html&quot;&gt;starting from v4.14&lt;/a&gt;. And this week Greg released Linux
 v4.19, so here are some of the changes related to software optimizations, performance and scalability topics across various subsystems.&lt;br /&gt;
&lt;br /&gt;
&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
epoll: loosen irq safety when possible&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The epoll code uses an irq-safe spinlock to protect concurrent operations to the ready-event linked list. However, with the exception of the callback done from the wakequeues, the calls to the spinlock are never done in irq context, and therefore there is really no need to save and restore interrupts each time the lock is acquired and released. For example, on x86, a &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;POPF&lt;/span&gt; (irqrestore) instruction can be quite expensive as it changes all the flags and therefore potentially heavy on dependencies. These changes yield some measurable results on a range of &lt;i&gt;epoll_wait(2)&lt;/i&gt; microbenchmarks, around 7-20% in raw throughput. This is unsurprising as &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;PUSHF + POPF&lt;/span&gt; is&amp;nbsp; more expensive than &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;STI + CLI&lt;span style=&quot;font-family: inherit;&quot;&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=002b343669c474151954266e7fcf727bf7faa851&quot;&gt;002b343669c4&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=304b18b8d6af796c8ece221d34c92aeb1559789b&quot;&gt;304b18b8d6af&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=92e641784055998879942d39c74d4f84fa750968&quot;&gt;92e641784055&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=679abf381a18e945457b01921f667cee9e656a7f&quot;&gt;679abf381a18&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
sched/numa:&amp;nbsp; migrate pages to local nodes quicker early in the lifetime of a task&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Automatic NUMA Balancing uses a multi-stage pass to decide whether a page should migrate to a local node. This filter avoids excessive ping-ponging if a page is shared or used by threads that migrate cross-node frequently. Threads inherit both page tables and the preferred node ID from the parent. This means that threads can trigger hinting faults earlier than a new task which delays scanning for a number of seconds. As it can be load balanced very early in its lifetime there can be an unnecessary delay before it starts migrating thread-local data. This patch migrates private pages faster early in the lifetime of a thread using the sequence counter as an identifier of new tasks.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=37355bdc5a129899f6b245900a8eb944a092f7fd&quot;&gt;37355bdc5a12&lt;/a&gt;]&lt;br /&gt;
&amp;nbsp; &lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
rcu: check if GP already requested&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
This commit makes &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;rcu_nocb_wait_gp()&lt;/span&gt; check to see if the current CPU already knows about the needed grace period having already been requested.&amp;nbsp; If so, it avoids acquiring the corresponding leaf rcu_node structure&#39;s lock, thus decreasing contention.&amp;nbsp; This optimization is intended for cases where either multiple leader rcu kthreads are running on the same CPU or these kthreads are running on a non-offloaded (e.g., housekeeping) CPU.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab5e869c1f7aa30a1210f5e8a277758b0599609f&quot;&gt;ab5e869c1f7a&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
cpufreq/schedutil: take into account time spent in irq&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Time being spent in interrupt handlers was not being accounted for in the CPU utilization when selecting an operating performance point. This can be a significant amount of time which is reported in the normal context time window. The new CPU utilization accounting yields a 10% performance boost on &lt;i&gt;iperf&lt;/i&gt; workloads.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9033ea11889f88f243445495f72441e22256d5e9&quot;&gt;9033ea11889f&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
mm/page_alloc: enlarge zone&#39;s batch size&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The page allocator will first try to use a percpu set of pages, then, if all are used up, ask the Buddy for a batch of pages. The size of this batch can have a number of consequences, including performance. The last time this magic number was increased was 13 years ago, and there have been numerous hardware improvements since then. As such, a recent study with allocator-intensive benchmarks shows that doubling the size of the batch can yield improvements on larger/modern machines.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8a759b5703519d37fa5b752f825cbfc06b57906&quot;&gt;d8a759b57035&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
mm: skip invalid pages a block at a time in zero_resv_unavail()&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The role of zero_resv_unavail() is to make sure that every struct page that is allocated but is not backed by memory that is accessible by kernel is zeroed and not in some uninitialized state. Since struct pages are allocated in blocks we can skip pageblock_nr_pages at a time, when the first one is found to be invalid. This optimization may help since now on x86 every hole in e820 maps is marked as reserved in memblock, and thus will go through this function.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=720e14ebec642bc56c44e5e60a2d595900e5bbf0&quot;&gt;720e14ebec64&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
kvm, x86: implement paravirt &quot;send IPI&quot; hypercall&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Replace sending IPIs one by one for xAPIC physical mode by a single hypercall (vmexit). This patchset lets a guest send multicast IPIs, with at most 128 destinations per hypercall in 64-bit mode and 64 vCPUs per hypercall in 32-bit mode. An IPI microbenchmark shows non-trivial performance improvements for broadcast IPIs (send IPI to all online CPUs and force them to take/drop a spinlock).&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4180bf1b655a791a0a6ef93a2ffffc762722c782&quot;&gt;4180bf1b655a&lt;/a&gt;]&lt;/div&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
arm64: use queued spinlocks&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Similar
 to x86, replace the old ticket spinlocks with fair qspinlocks and make 
use of MCS features as well as better performance under virtualization. 
This is particularly suitable for larger multicore machines.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c11090474d70590170cf5fa6afe85864ab494b37&quot;&gt;c11090474d70&lt;/a&gt;]&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/7780282064355244285/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2018/10/linux-v419-performance-goodies.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/7780282064355244285'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/7780282064355244285'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2018/10/linux-v419-performance-goodies.html' title='Linux v4.19: Performance Goodies'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-6461425334288193951</id><published>2018-10-15T13:19:00.002-07:00</published><updated>2018-10-15T13:19:36.873-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="linux kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="performance"/><category scheme="http://www.blogger.com/atom/ns#" term="performance-goodies"/><category scheme="http://www.blogger.com/atom/ns#" term="scalability"/><category scheme="http://www.blogger.com/atom/ns#" term="v4.18"/><title type='text'>Linux v4.18: Performance Goodies</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Linux v4.18 has been out two months now, making this post a bit late, but still in time before the next release. Also so much drama in the CoC to care about performance topics :P As always, it comes with a series of performance enhancements and optimizations across subsystems.&lt;/div&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
locking: avoid pointless TEST instructions&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
A number of places within locking primitives have been optimized to avoid superfluous &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;test&lt;/span&gt; instructions for the CAS return by relying on &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;try_cmpxchg&lt;span style=&quot;font-family: &amp;quot;helvetica neue&amp;quot; , &amp;quot;arial&amp;quot; , &amp;quot;helvetica&amp;quot; , sans-serif;&quot;&gt;,&lt;/span&gt;&lt;span style=&quot;font-family: &amp;quot;verdana&amp;quot; , sans-serif;&quot;&gt; generating slightly better code for x86-64&lt;/span&gt;&lt;/span&gt; (for arm64 there is really no difference). Such have been the cases for mutex fastpath (uncontended case) and queued spinlocks.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c427f69564e2a844c5fcf2804042609342513da0&quot;&gt;c427f69564e2&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ae75d9089ff7095d1d1a12c3cd86b21d3eaf3b15&quot;&gt;ae75d9089ff7&lt;/a&gt;]&lt;/div&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
locking/mcs: optimize cpu spinning&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Some architectures, such as arm64,&amp;nbsp; can enter low-power standby state (spin-waiting) instead of purely spinning on a condition. This is applied to the MCS spin loop, which in turn directly helps queued spinlocks. On x86, this can also be cheaper than spinning on &lt;i&gt;smp_load_acquire()&lt;/i&gt;.&lt;br /&gt;
&lt;pre id=&quot;pre_2011-10-28 14:06:00-9999999997-p&quot; style=&quot;white-space: pre;&quot;&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7f56b58a92aaf2cab049f32a19af7cc57a3972f2&quot;&gt;7f56b58a92aa&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
mm/mremap: reduce amount of TLB shootdowns&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
It was discovered that on a heavily dominated &lt;i&gt;mremap&lt;/i&gt; workload, the amount of TLB flushes was excessive, causing overall performance issues. By removing the &lt;i&gt;LATENCY_LIMIT&lt;/i&gt; magic number to handle TLB flushes on a PMD boundary instead of every 64 pages,&amp;nbsp; the amount of shootdowns can be reduced by a factor of 8 in the ideal case.&amp;nbsp; The &lt;i&gt;LATENCY_LIMIT &lt;/i&gt;was almost certainly used originally to limit the PTL hold times but the latency savings are likely shadowed by the cost of IPIs in many cases.&lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=37a4094e828f3c7673aa9c60f8b2b9d1019db81b&quot;&gt;37a4094e828f&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
mm: replace mmap_sem to protect cmdline and environ procfs files&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Reducing (ab)users of the mmap_sem is always good for general address space performance. Introduce a new mm-&amp;gt;arg_lock to protect against races when handling &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;/proc/$PID/{cmdline,environ}&lt;/span&gt; files, this removes (mostly) the semaphore&#39;s requirements.&lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=88aa7cc688d48ddd84558b41d5905a0db9535c4b&quot;&gt;88aa7cc688d4&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
mm/hugetlb: make better use of page clearing optimization&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Pass the fault address (the address of the sub-page being accessed) to the nopage fault handler to better use the general huge page clearing optimization. This allows the accessed sub-page to be cleared last, avoiding eviction of its cache lines while the other sub-pages are being cleared. Performance improvements were reported for the &lt;i&gt;vm-scalability.anon-w-seq&lt;/i&gt;&amp;nbsp; workload under hugetlbfs, improving throughput by ~30%.&lt;/div&gt;
&lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=285b8dcaacfc36b0468aaa03e3c628006ae31381&quot;&gt;285b8dcaacfc&lt;/a&gt;]&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
sched: don&#39;t schedule threads on pre-empted vCPUs&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
It can be determined whether a vCPU is
running to prioritize CPUs when scheduling threads. If a
vCPU has been pre-empted, it will incur the extra cost of VMENTER and
the time it actually spends to be running on the host CPU. If we had
other vCPUs which were actually running on the host CPU and idle we
should schedule threads there.&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=247f2f6f3c706b40b5f3886646f3eb53671258bf&quot;&gt;247f2f6f3c70&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=943d355d7feef380e15a95892be3dff1095ef54b&quot;&gt;943d355d7fee&lt;/a&gt;]&lt;/div&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
sched/numa: Stagger NUMA balancing scan periods for new threads&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
It is redundant and counter productive for threads sharing an address space to change the protections to trap NUMA faults. Potentially
only one thread is required but that thread may be idle or it may not have
any locality concerns and pick an unsuitable scan rate.

This patch uses independent scan period but they are staggered based on
the number of address space users when the thread is created. &lt;br /&gt;
&lt;br /&gt;
The intent
is that threads will avoid scanning at the same time and have a chance
to adapt their scan rate later if necessary. This reduces the total scan
activity early in the lifetime of the threads.

The difference in headline performance across a range of machines and
workloads is marginal, but the system CPU usage is reduced, as well as overall
scan activity. &lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1378447598432513d94ce2c607c412dc4f260f31&quot;&gt;137844759843&lt;/a&gt;]&lt;/div&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
block/bfq: postpone rq preparation to insert or merge&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
A lock contention point is removed (see patch for details and justification) by postponing request preparation to insertion or merging, as no lock needs to be grabbed any longer in the &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;prepare_request&lt;/span&gt; hook. &lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18e5a57d79878b205d39b2f160082d9098e9bfd6&quot;&gt;18e5a57d7987&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
btrfs: improve rmdir performance for large directories&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
When checking if a directory can be deleted, instead of ensuring all its children have been processed,&amp;nbsp; this optimization keeps track of the directory index offset of the child last checked in the last call to &lt;i&gt;can_rmdir()&lt;/i&gt;, and then use it as the starting point for future calls. The changes were shown to yield massive performance benefits; for test directory with two million files being deleted the runtime is reduced from half an hour to less than two seconds. &lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0f96f517dcaa58346c32be094aecd610b7d3c008&quot;&gt;0f96f517dcaa&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
KVM: VMX: Optimize tscdeadline timer latency&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Add
 the advance tscdeadline expiration support to which the tscdeadline 
timer is emulated by VMX preemption timer to reduce the hypervisor 
latency (&lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;handle_preemption_timer -&amp;gt; vmentry&lt;/span&gt;).
 The guest can also set an expiration that is very small in that case we
 set delta_tsc to 0, leading to an immediately vmexit when delta_tsc is 
not bigger than advance ns. This patch can reduce ~63% latency for 
kvm-unit-tests/tscdeadline_latency when testing busy waits. &lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c5ce8235cffa00c207e24210329094d7634bb467&quot;&gt;c5ce8235cffa&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
net/sched: NOLOCK qdisc performance enhancements and&amp;nbsp; fixes&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
There have been various performance related core changes to the &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;NOLOCK&lt;/span&gt; qdisc code. The first begins with reducing the atomic operations of &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;__QDISC_STATE_RUNNING&lt;/span&gt;. The bit is flipped twice per packet in the uncontended scenario with packet rate below the line rate: on packet dequeue and on the next, failing dequeue attempt. The changes simplify the qdisc and move the bit manipulation into the &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;qdisc_run_{begin,end}&lt;/span&gt; helpers, so that the bit is now flipped only once per packet, with measurable performance improvement in the uncontended scenario.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Later, the above is actually replaced by using a sequence spinlock instead of the atomic approach to address &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;pfifo_fast&lt;/span&gt; performance regressions. There is also a reduction in the &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;Qdisc&lt;/span&gt; struct memory footprint (spanning a cacheline less).&lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=96009c7d500efdd5534e83b2e3eb2c58d4b137ae&quot;&gt;96009c7d500e&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=021a17ed796b62383f7623f4fea73787abddad77&quot;&gt;021a17ed796b&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e9be0e993d95adbe5efe0e0f03b2a3e71f5bb2b6&quot;&gt;e9be0e993d95&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
lib/idr: improve scalability by reducing IDA lock granularity&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Improve the scalability of the IDA by using the per-IDA &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;xa_lock&lt;/span&gt; rather than the global &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;simple_ida_lock&lt;/span&gt;.&amp;nbsp; IDAs are not typically used in performance-sensitive locations, but since we have this lock anyway, we can use it. &lt;/div&gt;
&lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b94078e69533ba237e2c229bca61bae47e6fafcc&quot;&gt;b94078e69533&lt;/a&gt;]&amp;nbsp; &lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
x86-64: micro-optimize __clear_user()&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Uses immediate constants and saves two registers.&lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1153933703d927b3d4874c0bc801de32b1b58be9&quot;&gt;1153933703d9&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
arm64: select ARCH_HAS_FAST_MULTIPLIER&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
It is probably safe to assume that all Armv8-A implementations have a multiplier whose efficiency is comparable or better than a sequence of three or so register-dependent arithmetic instructions. Select ARCH_HAS_FAST_MULTIPLIER to get ever-so-slightly nicer codegen in the few dusty old corners which care.&lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e75bef2a4fe259b779765a85589e92657d26fdc9&quot;&gt;e75bef2a4fe2&lt;/a&gt;]&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/6461425334288193951/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2018/10/linux-v418-performance-goodies.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/6461425334288193951'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/6461425334288193951'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2018/10/linux-v418-performance-goodies.html' title='Linux v4.18: Performance Goodies'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-811100010210661002</id><published>2018-06-05T07:51:00.003-07:00</published><updated>2018-06-05T07:51:41.602-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="linux kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="performance"/><category scheme="http://www.blogger.com/atom/ns#" term="scalability"/><category scheme="http://www.blogger.com/atom/ns#" term="v4.17"/><title type='text'>Linux v4.17: Performance Goodies</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
With Linux v4.17 now released, there are some interesting performance changes that went worth looking at. As always, the
 term &#39;&lt;i&gt;performance&lt;/i&gt;&#39; can be vague in 
that some gains in one area can negatively affect another so take everything with a grain of salt.&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
sysvipc: introduce STAT_ANY commands&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
There was a permission discrepancy when consulting shm ipc object metadata
between &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;/proc/sysvipc/shm&lt;/span&gt; (0444) and getting stat info (such as via &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;SHM_STAT shmctl&lt;/span&gt; command).  The
latter does permission checks for the object vs S_IRUGO.  As such there can
be cases where EACCES is returned via syscall but the info is displayed
anyways in the procfs files.

While this might have security implications via info leaking (albeit no
writing to the shm metadata), this behavior goes way back and showing all
the objects regardless of the permissions was most likely an oversight - so
we are stuck with it.&lt;/div&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Some
applications require getting the procfs info (without root privileges) and
can be rather slow in comparison with a syscall -- up to 500x in some
reported cases. For this, the new &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;{SEM,SHM,MSG}_STAT_ANY &lt;/span&gt;commands have been introduced.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c21a6970ae727839a2f300cd8dd957de0d0238c3&quot;&gt;c21a6970ae72&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a280d6dc77eb6002f269d58cd47c7c7e69b617b6&quot;&gt;a280d6dc77eb&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23c8cec8cf679b10997a512abb1e86f0cedc42ba&quot;&gt;23c8cec8cf67&lt;/a&gt;]&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
kvm: x86 paravirtualization hints and KVM_HINTS_DEDICATED&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
When
 dealing with CPU virtualization, many in-kernel heuristics and 
optimizations revolve around the overcommited scenario.&amp;nbsp; By introducing &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;KVM_HINTS_DEDICATED&lt;/span&gt;,
 the hypervisor administrator can select this option when there are 
pinned 1:1 virtual to physical CPU scenarios; particularly reducing the 
paravirt overhead in locking and TLB flushing as the vCPU is most 
unlikely to get preempted. In these cases, native qspinlock may perform 
better than pvqspinlock as it disables paravirt spinlock slowpath 
optimizations. There is an older Xen equivalent available as a kernel 
parameter: &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;xen_nopvspin&lt;/span&gt;. &lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b2798ba0b8769b42f00899b44a538b5fcecb480d&quot;&gt;b2798ba0b876&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=34226b6b70980a8f81fff3c09a2c889f77edeeff&quot;&gt;34226b6b7098&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6beacf74c25711d5ee83412a3abc839af8ce6697&quot;&gt;6beacf74c257&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
sched: rework idle loop&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Rework
 the idle loop in order to prevent CPUs from spending too much time in 
shallow idle states by making it stop the scheduler tick before putting 
the CPU into an idle state only if the idle duration predicted by the 
idle governor is long enough.  It reduces idle power on some systems by 
10% or more and may improve performance of workloads in which the idle 
loop overhead matters. This required the code to be reordered to invoke 
the idle governor before stopping the tick, among other things.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0e7767687fdabfc58d5046e7488632bf2ecd4d0c&quot;&gt;0e7767687fda&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2aaf709a518d26563b80fd7a42379d7aa7ffed4a&quot;&gt;2aaf709a518d&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed98c34919985a9f87c3edacb9a8d8c283c9e243&quot;&gt;ed98c3491998&lt;/a&gt;]&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
&amp;nbsp;mm: pcpu pages optimizations around zone lock&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Two optimizations around &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;zone-&amp;gt;lock&lt;/span&gt; in &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;free_pcpupages_bulk()&lt;/span&gt;
 that yield around a 5% performance improvement in page-fault benchmarks
 (will-it-scale in this case). The first reduces the scope of the lock when 
freeing a batch of pages back to buddy. Considering the per-cpu 
semantics, the lock was unnecessarily held while pages are chosen from 
the pcpu page&#39;s migratetype
list.&lt;br /&gt;
&lt;br /&gt;
The second improvement adds a prefetch to the 
to-be-freed page&#39;s buddy outside of&amp;nbsp; the lock in hope that accessing the
 buddy&#39;s page structure later with the lock held will be faster. 
Normally prefetching is frowned upon, particularly for microbenchmarks, 
however in the particular case the prefetched pointer will always be 
used.&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0a5f4e5b45625e75db85b4968fc4c232d8091143&quot;&gt;0a5f4e5b4562&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=97334162e4d79f866edd7308aac0ab3ab7a103f7&quot;&gt;97334162e4d7&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
mm: lockless list_lru_count_one()&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
During slab reclaim for a memcg, &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;shrink_slab()&lt;/span&gt; iterates over all
registered shrinkers in the system, trying to count and consume
objects related to the cgroup.  In case of memory pressure, the operation was bottlenecked while trying to acquire the &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;nlru-&amp;gt;lock&lt;/span&gt;.
 By applying RCU to the data structure, the lookup can be done without 
taking the lock, which translates into the overall contention pretty much 
disappearing.&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0c7c1bed7e13dbb545375c231e6ba1dca5e8d725&quot;&gt;0c7c1bed7e13&lt;/a&gt;]&lt;/div&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
memory hotplug optimizations&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Such
 optimizations reduce the number of times struct pages are traversed 
during a memory hotplug operation, from three to one. Among other 
benefits, the memory hotplug is made similar to the boot memory 
initialization
   path because it initializes struct pages only in one
   function. Finally, this improves memory hotplug performance because 
the cache is not being evicted several times and also reduces loop 
branching overhead.&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0dc12e86b3197a14a908d4fe7cb35b73dda82b5&quot;&gt;d0dc12e86b31&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
procfs: miscellaneous optimizations&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Access to various files within procfs have been optimized by replacing calls to &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;seq_printf()&lt;/span&gt; with lower cost alternatives. Changes show some performance benefits for ad-hoc microbenchmarks.&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0e3dc019143104a6e676287b1e453cccd7add404&quot;&gt;0e3dc0191431&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8cfa67b4d9a9d9a6061f3cfd0e0ed16e66e45984&quot;&gt;8cfa67b4d9a9&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d1be35cb6f96975d792a1535d3fe9b75239065ee&quot;&gt;d1be35cb6f96&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f66406638fffe874c56e7e41106167c5235f251e&quot;&gt;f66406638fff&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=48dffbf82d2f17bc6dd3c2b7fd733738ea567914&quot;&gt;48dffbf82d2f&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0f02231222b313d1b49278cd2e3c7e7406fea6d&quot;&gt;d0f02231222b&lt;/a&gt;]&lt;/div&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
btrfs: relax barrier when unlocking an extent buffer&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Serializing checks for active waitqueue requires a barrier as it can race with&amp;nbsp; the waiter side. Such is the case with &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;btrfs_tree_unlock()&lt;/span&gt;,
 which was abusing the barrier semantics on architectures where atomic 
operations are ordered, such as x86. A performance improvement is 
immediately noticeable by optimizing barrier usage while maintaining the
 necessary semantics.&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2e32ef87b074cb8098436634b649b4b2b523acbe&quot;&gt;2e32ef87b074&lt;/a&gt;]&lt;/div&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
x86/pti: leave kernel text global for no PCID&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
From the patch: Global pages are bad for hardening because they potentially let an
exploit read the kernel image via a Meltdown-style attack.

But, global pages are good for performance because they reduce TLB
misses when making user/kernel transitions, especially when PCIDs
are not available, such as on older hardware, or where a hypervisor
has disabled them for some reason.&lt;br /&gt;
&lt;br /&gt;
This change implements a basic, sane policy: If PCIDs are available, only map a minimal amount of kernel text global.  If no
PCIDs, map all kernel text global. This translates into a considerable throughput increase on an &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;lseek&lt;/span&gt; microbenchmark.&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c06c7740d191b9055cb9be920579d5ecdd26303&quot;&gt;8c06c7740d19&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
lib/raid6/altivec: Add vpermxor implementation for raid6 Q syndrome&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
This enhancement uses the &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;vpermxor&lt;/span&gt; instruction to optimize the raid6 Q
syndrome. This instruction was made available with POWER8, ISA version
2.07. It allows for both vperm and vxor instructions to be done in a
single instruction. The benchmark results show a 35%
speed increase over the best existing algorithm for powerpc (altivec).&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=751ba79cc552c146595cd439b21c4ff8998c3b69&quot;&gt;751ba79cc552&lt;/a&gt;]&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/811100010210661002/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2018/06/linux-v417-performance-goodies_5.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/811100010210661002'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/811100010210661002'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2018/06/linux-v417-performance-goodies_5.html' title='Linux v4.17: Performance Goodies'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-5909398934385689842</id><published>2018-05-07T10:53:00.001-07:00</published><updated>2018-05-07T10:53:14.743-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="linux kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="performance"/><category scheme="http://www.blogger.com/atom/ns#" term="scalability"/><category scheme="http://www.blogger.com/atom/ns#" term="v4.16"/><title type='text'>Linux v4.16: Performance Goodies</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;
Linux v4.16 was released a few weeks ago and continues the mitigation of meltdown and spectre bugs for x86-64, as well as for arm64 and IBM s390. While v4.16 is not the most exciting kernel version in terms of performance and scalability, the following is an unsorted and incomplete list of changes that went in which I have cherry-picked. As always, the term &#39;&lt;i&gt;performance&lt;/i&gt;&#39; can be vague in 
that some gains in one area can negatively affect another so take everything with a grain of salt.&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
sched: reduce migrations and spreading of load to multiple CPUs&lt;/h4&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
The scheduler decisions are biased towards reducing latency of searches but tends to spread load across an entire socket, unnecessarily. On low CPU usage, this means the load on each individual CPU is low which can be good but cpufreq decides that utilization on individual CPUs is too low to increase P-state and overall throughput suffers.&lt;/div&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;div itemprop=&quot;articleBody&quot; style=&quot;text-align: justify;&quot;&gt;
When a cpufreq driver is completely under the control of the OS, it can be compensated for. For example, &lt;i&gt;intel_pstate&lt;/i&gt; can decide to boost apparent cpu utilization if a task recently slept on a CPU for idle. However, if hardware-based cpufreq is in play (e.g. hardware P-states HWP) then very poor decisions can be made and the OS cannot do much about it. This only gets worse as HWP becomes more prevalent, sockets get larger and the p-state for individual cores can be controlled. Just setting the performance governor is not an answer given that plenty of people really do worry about power utilization and still want a reasonable balance between performance and power. Experiments show performance benefits for network benchmarks running on localhost (at ~10% on netperf RR for UDP
and TCP, depending on the machine). Hackbench also has some small improvements with ~6-11%, depending on machine and thread count.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=89a55f56fd1cdbe7e69d4693fc5790af9a6e1501&quot;&gt;89a55f56fd1c&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3b76c4a33959ca98a573cd9c94c8690d123912ca&quot;&gt;3b76c4a33959&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=806486c377e33ab662de6d47902e9e2a32b79368&quot;&gt;806486c377e3&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=32e839dda3ba576943365f0f5817ce5c843137dc&quot;&gt;32e839dda3ba&lt;/a&gt;]&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
printk: new locking scheme&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Problems around the kernel&#39;s &lt;a href=&quot;https://lwn.net/Articles/737822/&quot;&gt;printk()&lt;/a&gt; call aren&#39;t new and traditionally must overcome issues with the console lock. Considering that the kernel printing out to the console is very generic operation which can be called from virtually anywhere at any time, relying on any sort of lock can cause deadlocks. Similarly, the call to printk() must proceed regardless of the availability of the
console lock. As such, what would happen is that upon contention, the task buffers the output for the console lock owner to flush as when it releases the lock.&lt;br /&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
On large multi-core systems this scheme can lead to the console owner piling up a lot of unbound work before it can release the lock, triggering watchdog lockups. This was replaced with a new mechanism whereby, upon contention, the task will not delegate the work to the console lock owner and return, but it&#39;ll stay around spinning until the lock is available. The heuristics imply a console owner and waiter such that if
multiple CPUs are generating output, the console lock will circulate between them,
and none will end up printing output for too long.
&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=dbdda842fe96f8932bae554f0adf463c27c42bc7&quot;&gt;dbdda842fe96&lt;/a&gt;]&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
idr tree optimizations &lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
With the &lt;a href=&quot;http://lkml.kernel.org/r/20180207211918.GA11985@bombadil.infradead.org&quot;&gt;extensions and improvements&lt;/a&gt;
 of the ID allocation API, there is a performance enhancement for ID numbering 
schemes that don&#39;t start at 0; which, according to the patch, accounts 
for ~20% of all the kernel users. So by using the new idr functions with the
 &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;, Courier, monospace;&quot;&gt;_base()&lt;/span&gt; suffix users can immediately benefit by avoiding unnecessary iterations in the underlying radix tree.&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=6ce711f2750031d12cec91384ac5cfa0a485b60a&quot;&gt;6ce711f27500&lt;/a&gt;]&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
&amp;nbsp;arm64: 52-bit physical address support&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
With ARMv8.2 the physical address space is extended from 48 to 52-bit, thus tasks are now able to address up to &lt;span&gt;4 pebibytes (&lt;/span&gt;&lt;span&gt;PiB).&lt;/span&gt;&lt;br /&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=fa2a8445b1d3810c52f2a6b3a006456bd1aacb7e&quot;&gt;fa2a8445b1d3&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=193383043f14a398393dc18bae8380f7fe665ec3&quot;&gt;193383043f14&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=529c4b05a3cb2f324aac347042ee6d641478e946&quot;&gt;529c4b05a3cb&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=787fd1d019b269af7912249231dfe34a5fe3e7c8&quot;&gt;787fd1d019b2&lt;/a&gt;]&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/5909398934385689842/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2018/05/linux-v416-performance-goodies.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/5909398934385689842'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/5909398934385689842'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2018/05/linux-v416-performance-goodies.html' title='Linux v4.16: Performance Goodies'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-5818561405782090538</id><published>2018-03-20T10:37:00.003-07:00</published><updated>2018-03-20T10:37:53.337-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="linux kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="performance"/><category scheme="http://www.blogger.com/atom/ns#" term="v4.15"/><title type='text'>Linux v4.15: Performance Goodies</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
With the &lt;a href=&quot;https://lwn.net/Articles/741878/&quot;&gt;Meltdown&lt;/a&gt; and &lt;a href=&quot;https://lwn.net/Articles/743265/&quot;&gt;Spectre&lt;/a&gt; &lt;a href=&quot;https://lwn.net/Articles/744287/&quot;&gt;fiascos&lt;/a&gt;, performance isn&#39;t a very hot topic at the moment. In fact, with Linux v4.15 released, it is one of the rare times I&#39;ve seen security win over performance in such a one sided way. Normally security features are tucked away under a kernel config option nobody really uses. Of course the software fixes are also backported in one way or another, so this isn&#39;t really specific to the latest kernel release.&lt;br /&gt;
&lt;br /&gt;
All this said, v4.15 came out with a few performance enhancements across subsystems. The following is an unsorted and incomplete list of changes 
that went in. Note that the term &#39;&lt;i&gt;performance&lt;/i&gt;&#39; can be vague in 
that some gains in one area can negatively affect another, so take 
everything with a grain of salt and reach your own conclusions.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
epoll: scale nested calls&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Nested epolls are necessary to allow semantics where a file descriptor in the epoll interested-list is also an epoll instance. Such calls are not all that common, but some real world applications suffered severe performance issues in that it relied on global spinlocks, acquired throughout the callbacks in the epoll state machine. By removing them, we can speed up adding fds to the instance as well as polling, such that &lt;i&gt;epoll_wait()&lt;/i&gt; can improve by 100x, scaling linearly when increasing amounts of cores block on an event.&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=57a173bdf5baab48e8e78825c7366c634acd087c&quot;&gt;57a173bdf5ba,&lt;/a&gt;&amp;nbsp; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=37b5e5212a448bac0fe29d2a51f088014fbaaa41&quot;&gt;37b5e5212a44&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
pvspinlock: hybrid fairness paravirt semantics&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Locking under virtual environments can be tricky, balancing performance and fairness while avoiding artifacts such as starvation and lock holder/waiter preemption. The current paravirtual queued spinlocks, while free from starvation, can perform less optimally than an unfair lock in guests with CPU over-commitment. With Linux v4.15, guest spinlocks now combine the best of both worlds, with an unfair and a queued mode. The idea is that, upon contention, extend the lock stealing attempt in the slowpath (unfair mode) as long as there are queued MCS waiters present, hence improving performance while avoiding starvation. Kernel build experiments show that as a VM becomes more and more over-committed, the ratio of locks acquired in unfair mode increases.&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=11752adb68a388724b1935d57bf543897c34d80b&quot;&gt;11752adb68a3&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
mm,x86: avoid saving/restoring interrupts state in gup&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
When x86 was converted to use the generic &lt;i&gt;get_user_pages_fast() &lt;/i&gt;call a performance regression was introduced at a microbenchmark level. The generic &lt;i&gt;gup&lt;/i&gt; function attempts to walk the page tables without acquiring any locks, such as the mmap semaphore. In order to do this, interrupts must be disabled, which is where things went different between the arch-specific and generic flavors. The later must save and restore the current state of interrupt, introducing extra overhead when compared to a simple &lt;i&gt;local_irq_enable/disable()&lt;/i&gt;.&lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5b65c4677a57a1d4414212f9995aa0e46a21ff80&quot;&gt;5b65c4677a57&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
ipc: scale INFO commands&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Any syscall used to get info from sysvipc (such as &lt;i&gt;semctl(IPC_INFO)&lt;/i&gt; or &lt;i&gt;shmctl(SHM_INFO)&lt;/i&gt;) requires internally computing the last ipc identifier. For cases with large amounts of keys, this operation alone can consume a large amount of cycles as it looked up on-demand, in O(N). In order to make this information available in constant time, we keep track of it whenever a new identifier is added.&lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=15df03c87983660a4d1eedb4541778592bd97684&quot;&gt;15df03c87983&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
ext4:&amp;nbsp; improve smp scalability for inode generation&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The superblock&#39;s inode generation number was currently sequentially increased (from a randomly initialized value) and protected by a spinlock, making the usage pattern quite primitive and not very friendly to workloads that are generating files/inodes concurrently. The inode generation path was optimized to remove the lock altogether and simply rely on &lt;i&gt;prandom_u32()&lt;/i&gt; such that a fast/seeded pseudo random-number algorithm is used for computing the &lt;i&gt;i_generation&lt;/i&gt;.&lt;/div&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=232530680290ba94ca37852ab10d9556ea28badf&quot;&gt;232530680290&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/5818561405782090538/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2018/03/linux-v415-performance-goodies.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/5818561405782090538'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/5818561405782090538'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2018/03/linux-v415-performance-goodies.html' title='Linux v4.15: Performance Goodies'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-6088022618630093249</id><published>2017-11-20T07:50:00.001-08:00</published><updated>2017-11-20T07:50:20.482-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="linux kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="performance"/><category scheme="http://www.blogger.com/atom/ns#" term="v4.14"/><title type='text'>Linux v4.14: Performance Goodies</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Last week Linus released the v4.14 kernel with some noticeable performance changes. The following is an unsorted and incomplete list of changes that went in. Note that the term &#39;&lt;i&gt;performance&lt;/i&gt;&#39; can be vague in that some gains in one area can negatively affect another, so take everything with a grain of salt and reach your own conclusions.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
sysvipc: scale key management &lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
We began using relativistic hash tables for managing ipc keys, which greatly improves the current &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;O(N)&lt;/span&gt; lookups. As such, ipc_findkey() calls are significantly faster (+800% in some reaim file benchmarks) and we need not iterate all elements each time. Improvements are even seen in scenarios where the amount of keys is but a handful, so this is pretty much a win from any standpoint. &lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0cfb6aee70bddbef6ec796b255f588ce0e126766&quot;&gt;0cfb6aee70bd]&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&amp;nbsp; &lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
interval-tree: fast overlap detection&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
With the new extended rbtree api to cache the smallest (leftmost) node, instead of doing &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;O(logN)&lt;/span&gt; walks to the end of the tree, we have the pointer always available. This allows us to extend and complete the fast overlap detection for interval trees to speedup (sub)tree searches if the interval is completely to the left or right of the current tree&#39;s max interval. In addition, a number of other users that traverse rbtrees are updated to use the new rbtree_cached, such as epoll, procfs and cfq.&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit&amp;nbsp; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cd9e61ed1eebbcd5dfad59475d41ec58d9b64b6a&quot;&gt;cd9e61ed1eeb&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=410bd5ecb276593e7ec1552014083215d4a43c3a&quot;&gt;410bd5ecb276&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2554db916586b228ce93e6f74a12fd7fe430a004&quot;&gt;2554db916586&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b2ac2ea6296e7dd779168eb085b09d0fab9d1294&quot;&gt;b2ac2ea6296e&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f808c13fd3738948e10196496959871130612b61&quot;&gt;f808c13fd373&lt;/a&gt;]&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
sched: waitqueue bookmarks&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
A situation where constant NUMA migrations of a hot-page triggered a large number of page waiters being awoken exhibited some issues in the waitqueue implementation. In such cases, a large number of wakeups will occur while holding a spinlock, which causes significant unbounded latencies. Unlike wake_qs (used in futexes and locks), where batched wakeups are done without the lock, waitqueue bookmarks allow to pause and stop iterating the wake list such that another process has a chance to acquire the lock. Then it can resume where it left off.&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3510ca20ece0150af6b10c77a74ff1b5c198e3e2&quot;&gt;3510ca20ece&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2554db916586b228ce93e6f74a12fd7fe430a004&quot;&gt;2554db916586&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=11a19c7b099f96d00a8dec52bfbb8475e89b6745&quot;&gt;11a19c7b099f&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
&amp;nbsp;x86 PCID (Process Context Identifier)&lt;/h4&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
This is a 64-bit hardware feature that allows tagging TLBs such that upon context switching, only the required entries are flushed. Virtualization (VT-x) has &lt;a href=&quot;http://blog.stgolabs.net/2012/05/kvm-intel-associative-tlbs.html&quot;&gt;supported&lt;/a&gt; similar features for a while, via &lt;i&gt;vpid&lt;/i&gt;. On other archs it is called address space ID. Linux&#39;s support is somewhat special. In order to avoid the x86 limitations of 4096 IDs (or processes), the implementation actually uses a PCID to identify a
recently-used mm (process address space) on a per-cpu basis.  An mm has no fixed PCID
binding at all; instead, it is given a fresh PCID each time it&#39;s
loaded, except in cases where we want to preserve the TLB, in which
case we reuse a recent value. To illustrate, in a workload under kvm that ping-pongs two processes, dTLB misses were reduced by ~17x.&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f39681ed0f48498b80455095376f11535feea332&quot;&gt;f39681ed0f48&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b0579ade7cd82391360e959cc844e50a160e8a96&quot;&gt;b0579ade7cd8&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=94b1b03b519b81c494900cb112aa00ed205cc2d9&quot;&gt;94b1b03b519b&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=43858b4f25cf0adc5c2ca9cf5ce5fdf2532941e5&quot;&gt;43858b4f25cf&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cba4671af7550e008f7a7835f06df0763825bf3e&quot;&gt;cba4671af755&lt;/a&gt;,&amp;nbsp; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0790c9aad84901ca1bdc14746175549c8b5da215&quot;&gt;0790c9aad849&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=660da7c9228f685b2ebe664f9fd69aaddcc420b5&quot;&gt;660da7c9228f&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10af6235e0d327d42e1bad974385197817923dc1&quot;&gt;10af6235e0d3&lt;/a&gt;]&amp;nbsp;&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
&amp;nbsp;&lt;/h4&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
ORC (Oops Rewind Capability) Unwinder&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The much acclaimed replacement to frame pointers and the (out of tree) DWARF unwinder. Through simplicity, the end result is faster profiling, such as for perf. Experiments show a 20x performance increase using ORC vs DWARF while calling save_stack_trace 20,000 times via a single &lt;i&gt;vfs_write&lt;/i&gt;. With respect to frame pointers, the ORC unwinder is more accurate across interrupt entry frames and enables a 5-10% performance improvement across the entire kernel compared to frame pointers.&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ee9f8fce99640811b2b8e79d0d1dbe8bab69ba67&quot;&gt;ee9f8fce9964&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=39358a033b2e4432052265c1fa0f36f572d8cfb5&quot;&gt;39358a033b2e&lt;/a&gt;]&lt;br /&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
mm: choose swap device according to numa node&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;div class=&quot;content&quot;&gt;
If the system has more than one swap device and swap device has the node
information, we can make use of this information to decide which swap
device to use in &lt;i&gt;get_swap_pages() &lt;/i&gt;to get better performance. This change replaces a single
global swap_avail list with a per-numa-node list: each numa node sees its own priority based list of available swap devices. Swap
device&#39;s priority can be promoted on its matching node&#39;s swap_avail_list. This shows ~25% improvement for a 2-node box, benchmarking random writes on an mmaped region with SSDs attached to each node, ensuring swapping in and out.&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a2468cc9bfdff6139f59ca896671e5819ff5f94a&quot;&gt;a2468cc9bfdf&lt;/a&gt;]&lt;/div&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
mm: reduce cost of page allocator&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Upon page allocation, the per-zone statistics are updated, introducing overhead in the form of cacheline bouncing; responsible for ~30% of all CPU cycles&amp;nbsp; for allocating a single page. The networking folks have been known to complain about the performance degradation when dealing with the memory management subsystem, particularly the page allocator. The fact that these NUMA associated counters are rarely used allows the counter threshold that determines the frequency of updating the global counter with the percpu counters (hence cacheline bouncing) to be increased. This means hurting readers, but that&#39;s the point.&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3a321d2a3dde812142e06ab5c2f062ed860182a5&quot;&gt;3a321d2a3dde&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1d90ca897cb05cf38bd62f36756d219e02913b7d&quot;&gt;1d90ca897cb0&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=638032224ed762a29baca1fc37f1168efc2554ae&quot;&gt;638032224ed7&lt;/a&gt;]&lt;/div&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
archs: multibyte memset&lt;/h4&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
New calls memset16(), memset32() and memset64() are introduced, which are like memset(), but allow the caller to fill the destination with a value larger than a single byte. There are a number of places in the kernel that can benefit from using an optimized function rather than a loop; sometimes text size, sometimes speed, and sometimes both. When supported by the architecture, use a single instruction, such as &lt;i&gt;stosq&lt;/i&gt; (&lt;span class=&quot;st&quot;&gt;stores a quadword) in x86-64. Zram shows a 7% performance improvement on x86 with a 100Mb non-zero deduplicate data. If not available, default back to the slower loop implementation&lt;/span&gt;.&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commits&amp;nbsp; &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3b3c4babd898715926d24ae10aa64778ace33aae&quot;&gt;3b3c4babd898&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=03270c13c5ffaa6ac76fe70d0b6929313ca73d86&quot;&gt;03270c13c5ff&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4c51248533adcfb01ba704ce5993ecbad5cc4c99&quot;&gt;4c51248533ad&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=48ad1abef40226ce809e5b7d3a5898754c4b9a9a&quot;&gt;48ad1abef402&lt;/a&gt;]&lt;/div&gt;
&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
powerpc: improve TLB flushing&lt;/h4&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
A
 few optimisations were also added to the radix MMU TLB flushing, mostly
 to avoid unnecessary Page Walk Cache (PWC) flushes when the structure 
of the tree is not changing.&lt;/div&gt;
&lt;div style=&quot;text-align: right;&quot;&gt;
[Commit &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a46cc7a90fd8d95bfbb2b27080efe872a1a51db4&quot;&gt;a46cc7a90fd8&lt;/a&gt;, &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=424de9c6e3f89399fc11afc1f53f89c5329132da&quot;&gt;424de9c6e3f8&lt;/a&gt;]&lt;br /&gt;
&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
&lt;/h4&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
There are plenty of other performance optimizations out there, including ext4 parallel file creation and quotas, additional memset improvements in sparc, transparent hugepage migrations and swap improvements, ipv6 (ip6_route_output()) optimizations, etc. Again, the list here is partial and biased by me. For more list of features play with &#39;git log&#39; or visit lwn (&lt;a href=&quot;https://lwn.net/Articles/733175/&quot;&gt;part1&lt;/a&gt;, &lt;a href=&quot;https://lwn.net/Articles/733846/&quot;&gt;part2&lt;/a&gt;) and &lt;a href=&quot;https://kernelnewbies.org/Linux_4.14&quot;&gt;kernelnewbies&lt;/a&gt;.&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/6088022618630093249/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2017/11/linux-v414-performance-goodies.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/6088022618630093249'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/6088022618630093249'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2017/11/linux-v414-performance-goodies.html' title='Linux v4.14: Performance Goodies'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-3926339682891681060</id><published>2015-12-29T05:07:00.002-08:00</published><updated>2015-12-29T05:36:29.239-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="auditing"/><category scheme="http://www.blogger.com/atom/ns#" term="futex"/><category scheme="http://www.blogger.com/atom/ns#" term="fuzzy testing"/><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="linux kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="locks"/><category scheme="http://www.blogger.com/atom/ns#" term="security"/><category scheme="http://www.blogger.com/atom/ns#" term="stressing software"/><category scheme="http://www.blogger.com/atom/ns#" term="system call"/><category scheme="http://www.blogger.com/atom/ns#" term="target fuzzing"/><category scheme="http://www.blogger.com/atom/ns#" term="trinity"/><category 
scheme="http://www.blogger.com/atom/ns#" term="userspace mutexes"/><title type='text'>fu(zz)tex: targeted fuzzing of futexes</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The complexity of futexes, their non-trivial interactions and semantics, very much serve as a good candidate for applying fuzzy testing techniques to them. In general futex code is poorly understood and audited, both at a kernel implementation level and by the respective userland callers, normally trying to implement some sort of locking primitive. Unsurprisingly, bugs related to this call will often be subtle and nasty, sometimes with &lt;a href=&quot;http://www.cvedetails.com/google-search-results.php?q=futex&amp;amp;sa=Search&quot;&gt;security&lt;/a&gt; implications. Specifically for futexes, all system call fuzzers use generic and completely randomized inputs, which has only limited usefulness. This is even the case for Dave Jones&#39; &lt;a href=&quot;http://codemonkey.org.uk/projects/trinity&quot;&gt;trinity&lt;/a&gt; program, which has been extremely good at finding kernel bugs (and ruining my weekends more than once ;). Much of the success and popularity of this program is because not all the inputs are random and meaningful parameters are passed for many of the exercised syscalls. This is called targeted fuzzing, and has been proven to find more bugs than blindly random inputs, which in turn is more likely to produce logic that makes the kernel actually do something related to the call, as opposed to quickly erroring out due to some trivial bogus scenario. A nice example is the &lt;i&gt;perf_event_open(2)&lt;/i&gt; call, which was &lt;a href=&quot;http://web.eece.maine.edu/~vweaver/projects/perf_events/fuzzer/2015_perf_fuzzer_tr.pdf&quot;&gt;studied&lt;/a&gt; for targeted fuzzy testing with very good results.&lt;/div&gt;
&lt;h3 style=&quot;text-align: justify;&quot;&gt;
Extending Trinity &lt;/h3&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Reusing the already proven-to-work machinery of trinity, and extending it for ad-hoc futex work, is the obvious step for improving coverage, in the hope of tackling some of the issues previously described. While reading the code is always the definite answer, having a man-page that is &lt;i&gt;up-to-par&lt;/i&gt; with the call is quite essential; if we want programmers to make correct use of the tools we provide, that is. Fortunately, Michael Kerrisk has been doing a nice job of &lt;a href=&quot;https://git.kernel.org/cgit/docs/man-pages/man-pages.git/tree/man2/futex.2&quot;&gt;rewriting&lt;/a&gt; the current &lt;i&gt;futex.2&lt;/i&gt; page, which is so surprisingly crappy and incomplete, it&#39;s sad. This makes the task of correctly setting the input parameters for a certain purpose a little less tedious and error-prone:&lt;br /&gt;
&lt;br /&gt;
&lt;pre style=&quot;background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi64Lr_VvHGv8ertaDsasLVFpAl2QN8C251CJthu4PBll3Z8HPLLm3o2sPau-FaPEnvN3SrkmpSi-rmzXBtXboy8PvTcz3SqZ3NAFfNc_hM2KWi5dPz29TnaBLYzNNa3GPDl8DgnFhzYilW/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;&quot;&gt;&lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;&lt;code style=&quot;color: black; word-wrap: normal;&quot;&gt; &lt;/code&gt;&lt;/span&gt;&lt;code style=&quot;color: black; word-wrap: normal;&quot;&gt;&lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
&amp;nbsp;&amp;nbsp; &amp;nbsp; struct timespec __user *, utime, u32 __user *, uaddr2, u32, val3)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
&amp;nbsp;-- just imagine if &lt;i&gt;mmap.2&lt;/i&gt; were barely documented and stale.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
There are two immediately obvious op flags that are not being exercised at all (with the exception of randomly bumping into them, which is quite unlikely and hard to control):&lt;/div&gt;
&lt;ul style=&quot;text-align: justify;&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;FUTEX_CLOCK_RT:&lt;/span&gt; When set, the kernel treats the timeout as an absolute time based on &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;CLOCK_REALTIME&lt;/span&gt; as opposed to &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;CLOCK_MONOTONIC.&lt;/span&gt; This is only affected by &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;FUTEX_WAIT_BITSET&lt;/span&gt; and &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;FUTEX_REQUEUE_PI&lt;/span&gt; commands.&lt;/li&gt;
&lt;/ul&gt;
&lt;ul style=&quot;text-align: justify;&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;FUTEX_PRIVATE_FLAG:&lt;/span&gt; Refers to the user address space mapping, and applies to all operations. The main benefit is that kernel can directly use the virtual address without having to do any lookups or other overhead (vmas, gup, thp, etc.) imposed by shared mappings.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;text-align: justify;&quot;&gt;
Ever-changing task priorities&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The whole purpose of PI futexes are to address priority inheritance issues for systems with real time requirements. Randomly changing a processes priority will therefore better stress the system call instead of always using the default nice value, exercising priority boosting code in the kernel.&lt;/div&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
Fault/error injections&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
This year we &lt;a href=&quot;https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ab51fbab39d864f3223e44a2600fd951df261f0b&quot;&gt;added&lt;/a&gt; support for artificially triggering errors within the various futex paths faults and deadlock scenarios, via the &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;CONFIG_FAULT_INJECTION&lt;/span&gt; kernel framework along with the &lt;span style=&quot;font-family: &amp;quot;courier new&amp;quot; , &amp;quot;courier&amp;quot; , monospace;&quot;&gt;CONFIG_FAIL_FUTEX&lt;/span&gt; option. Trinity can make use of this feature by randomly toggling the process&#39; &lt;a href=&quot;https://www.kernel.org/doc/Documentation/fault-injection/fault-injection.txt&quot;&gt;make-it-fail&lt;/a&gt; file as well as selecting appropriate fault injection debugfs options.&lt;/div&gt;
&lt;/div&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
Feeding user-addresses&lt;/h4&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Perhaps the single most important argument that we can pass to the syscall is the user address (uaddr, or &#39;the futex&#39;), which will govern everything the kernel attempts to do with it, being private or shared address space. As such, it is not very useful to blindly feed it random addresses, even if trinity is setup by default, these inputs will sometimes be picked by previously &lt;i&gt;mmap-created&lt;/i&gt; shared memory playgrounds. However, at a futex level, this does not matter unless we are doing blocking calls (WAIT).&lt;br /&gt;
&lt;br /&gt;
So this has been reworked such that trinity now creates a number of locks in shared memory at startup, which has the owner PID and the actual futex. Upon a call, both fields of uaddr get either a random lock or a random address from the mmap playground, each with a 50% chance. The locks follow very simple semantics, where a successful &lt;i&gt;cmpxchg&lt;/i&gt; will allow the caller to acquire the lock without the kernel being involved (fastpath), otherwise we need to wait/block through the futex call.&lt;br /&gt;
&lt;br /&gt;
Because of how trinity is structured with callbacks for pre/post syscall invocation, there are a number of racy windows between when the lock is dealt (ie considered contended) with and when the fuzzer actually calls futex(2). As such, this must be taken with a grain of salt, but does exercise lots of real world situations, nonetheless.&lt;br /&gt;
&lt;h4 style=&quot;text-align: left;&quot;&gt;
Choosing operations&lt;/h4&gt;
The idea is to randomly perform different operations on the selected futex, such that combinations of wake, wait, requeue are done (both for regular and PI futexes). While passing informed, &lt;i&gt;not-so-random,&lt;/i&gt; parameters to the system call reduces the chance of shallow fuzzing, choosing the futex operation will determine the kind of work to be done on the uaddress. As such this part can further determine the usefulness of trinity regarding futexes. However, one cannot get too strict here as reducing the randomness will also limit the usefulness. For now the layout is a 25% chance when performing lock operations. On the other hand, for the case of mmap selected uaddress, the operation is left up to trinity to decide.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;h3 style=&quot;text-align: justify;&quot;&gt;
Evaluation and future work&lt;/h3&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Evaluating software that purposely tries to mess up other software is always twofold. For one, any new futex bug that is found indicates that modifying trinity was a good step towards better testing coverage. But unfortunately this creates a new headache for futex hackers, and a bug needs to be fixed (including any corresponding Linux distribution backporting, security and &lt;i&gt;-stable&lt;/i&gt; work). So any useful results which exhibit the presence of bugs can be bitter/sweet -- just think &lt;a href=&quot;http://www.brainyquote.com/quotes/quotes/e/edsgerdijk201165.html&quot;&gt;Dijkstra&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
One immediate way of evaluating the changes to trinity is to see the number of successful calls. While this can be a misleading metric, it does at least indicate whether or not many of the bogus parameter passing have been mitigated and replaced with smarter, more informed calls. Tests show that these changes have in fact boosted the amount of successful futex(2) returns; within a trinity run of 10,000 calls with 4 threads, we were able to go from ~470 to nearly ~4300, which is around a 10x improvement. This also means that it takes more time to run trinity as the kernel is doing actual work now with our futexes, not simply returning immediately due to bogus parameters and trivial error checks.&lt;br /&gt;
&lt;br /&gt;
In the future, it would be good to fuzz futexes with a memory-backed file (uaddress), instead of always relying on anonymous memory. While this is perhaps not so interesting from a futex standpoint (with the exception of hashing), it would be good when combining with other memory related calls which actually do things with the file. Another useful direction would be to further investigate operation selection policies. Different models will fuzz different parts of the futex subsystem, and perhaps (very probably, actually) I have not found the best one yet.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
This work was done as part of SUSE &lt;a href=&quot;https://hackweek.suse.com/13/projects/1064&quot;&gt;Hackweek 13&lt;/a&gt;, which allowed me to finally allocate some time to focus on this (although this writing is much overdue). So as always, lots of thanks to my employer.&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/3926339682891681060/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2015/12/fuzztex-targeted-fuzzing-of-futexes.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/3926339682891681060'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/3926339682891681060'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2015/12/fuzztex-targeted-fuzzing-of-futexes.html' title='fu(zz)tex: targeted fuzzing of futexes'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-3042473723192220095</id><published>2015-10-04T23:52:00.000-07:00</published><updated>2015-10-04T23:54:46.111-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="architecture"/><category scheme="http://www.blogger.com/atom/ns#" term="barriers"/><category scheme="http://www.blogger.com/atom/ns#" term="concurrency"/><category scheme="http://www.blogger.com/atom/ns#" term="cpu"/><category scheme="http://www.blogger.com/atom/ns#" term="linux kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="load acquire"/><category scheme="http://www.blogger.com/atom/ns#" term="memory model"/><category scheme="http://www.blogger.com/atom/ns#" term="performance"/><category scheme="http://www.blogger.com/atom/ns#" term="SMP"/><category scheme="http://www.blogger.com/atom/ns#" term="store release"/><category scheme="http://www.blogger.com/atom/ns#" term="synchronization"/><title type='text'>acquire/release semantics in the 
kernel</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
With the need for better scaling on increasingly larger multi-core 
systems, we&#39;ve continued to extend our CPU barriers in the kernel. Two important variants to prevent CPU reordering for lock-free shared memory synchronization are pairs of &lt;i&gt;load/acquire&lt;/i&gt; and &lt;i&gt;store/release&lt;/i&gt; &lt;a href=&quot;https://lwn.net/Articles/576486/&quot;&gt;barriers&lt;/a&gt;; also known as &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;LOCK/UNLOCK&lt;/span&gt; barriers. These enable threads to cooperate between each other.&lt;br /&gt;
&lt;br /&gt;
Multiple, yet pretty much equivalent, definitions of acquire/release semantics can be found all over the internet, but I like the version from the infamous &lt;i&gt;&#39;Documentation/memory-barriers.txt&#39;&lt;/i&gt; file for three reasons: (i) it is clear and concise, (ii) it explicitly warns that they are the minimum operations and not to assume anything about reordering of loads and stores before or after the acquire or release, respectively. Finally, (iii) it strongly mentions the need for pairing and thus portability:&lt;/div&gt;
&lt;blockquote class=&quot;tr_bq&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: &amp;quot;Trebuchet MS&amp;quot;,sans-serif;&quot;&gt;&amp;nbsp;&lt;i&gt;(5) ACQUIRE operations.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; This acts as a one-way permeable barrier.&amp;nbsp; It guarantees that all memory operations after the ACQUIRE operation will appear to happen after the ACQUIRE operation with respect to the other components of the system. ACQUIRE operations include LOCK operations and smp_load_acquire() operations.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Memory operations that occur before an ACQUIRE operation may appear to happen after it completes.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; An ACQUIRE operation should almost always be paired with a RELEASE operation.&lt;/i&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;i&gt;&lt;span style=&quot;font-family: &amp;quot;Trebuchet MS&amp;quot;,sans-serif;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;
&lt;i&gt;&lt;span style=&quot;font-family: &amp;quot;Trebuchet MS&amp;quot;,sans-serif;&quot;&gt;&amp;nbsp;(6) RELEASE operations.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; This also acts as a one-way permeable barrier.&amp;nbsp; It guarantees that all&amp;nbsp;&amp;nbsp; memory operations before the RELEASE operation will appear to happen before the RELEASE operation with respect to the other components of the system. RELEASE operations include UNLOCK operations and smp_store_release() operations.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Memory operations that occur after a RELEASE operation may appear to happen before it completes.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; The use of ACQUIRE and RELEASE operations generally precludes the need for other sorts of memory barrier (but note the exceptions mentioned in the subsection &quot;MMIO write barrier&quot;).&amp;nbsp; In addition, a RELEASE+ACQUIRE pair is -not- guaranteed to act as a full memory barrier.&amp;nbsp; However, after an ACQUIRE on a given variable, all memory accesses preceding any prior RELEASE on that same variable are guaranteed to be visible.&amp;nbsp; In other words, within a given variable&#39;s critical section, all accesses of all previous critical sections for that variable are guaranteed to have completed.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; This means that ACQUIRE acts as a minimal &quot;acquire&quot; operation and&amp;nbsp;&amp;nbsp;&amp;nbsp; RELEASE acts as a minimal &quot;release&quot; operation.&lt;/span&gt;&lt;/i&gt;&lt;/blockquote&gt;
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;float: left; margin-right: 1em; text-align: left;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXD54PB-3sO1A0ORDsL9rU-pRDP4fTOLvzYs7MX_8X8gj8YVnr9Xbl8cvlvjMMzRC_jAQeIqgrVDg81jUXa4-xwWOn5TL2lCzSotQFmPE5f_OB1xWWZFqixkPhmQJkcaYdfjpDLod5bD9O/s1600/acquire-release.png&quot; style=&quot;margin-left: auto; margin-right: auto;&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Thread B&#39;s ACQUIRE pairs with Thread A&#39;s RELEASE. &lt;a href=&quot;http://www.ibm.com/developerworks/library/j-jtp03304/&quot;&gt;Copyright&lt;/a&gt; (C) IBM.&lt;/td&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
In lock-speak, all this means is that nothing leaks from the critical region that is protected by the primitive in question. A thread attempting to take a lock will synchronize/pair the load (ACQUIRE), for instance via &lt;i&gt;Rmw&lt;/i&gt; (&lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;cmpxchg&lt;/span&gt;), when attempting to take the lock with the last store (RELEASE) when another thread is concurrently releasing the lock (for example, setting the counter to 0).&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
For v4.2, Will Deacon &lt;a href=&quot;https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=654672d4ba1a6001c365833be895f9477c4d5eab&quot;&gt;introduced&lt;/a&gt; more relaxed extensions of traditional atomic operations (including &lt;i&gt;Rmw&lt;/i&gt;) which allow finer grained control over, what used to be, full barriers semantics on both sides of the instruction. This is also true for just about all atomic functions that return a value to the caller, ie: &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;atomic_*_return()&lt;/span&gt;. As such &lt;a href=&quot;http://preshing.com/20120930/weak-vs-strong-memory-models/&quot;&gt;weakly ordered architectures&lt;/a&gt; can make use of these -- currently only arm64 makes use of them, but &lt;a href=&quot;https://lkml.org/lkml/2015/9/16/527&quot;&gt;efforts&lt;/a&gt; for PPC are being made. &lt;/div&gt;
&lt;blockquote class=&quot;tr_bq&quot;&gt;
&lt;blockquote class=&quot;tr_bq&quot;&gt;
&lt;i&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; - *_relaxed: No ordering guarantees. This is similar to what we have already for the non-return atomics (e.g. atomic_add).&lt;br /&gt;&amp;nbsp;&amp;nbsp; &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; - *_acquire: ACQUIRE semantics, similar to smp_load_acquire.&lt;br /&gt;&amp;nbsp;&amp;nbsp; &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; - *_release: RELEASE semantics, similar to smp_store_release.&lt;/i&gt;&lt;/blockquote&gt;
&lt;/blockquote&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
So we now have goodies such as &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;atomic_cmpxchg_acquire()&lt;/span&gt; or &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;atomic_add_return_relaxed()&lt;/span&gt;. Most recently, aiming for v4.4, &lt;a href=&quot;http://comments.gmane.org/gmane.linux.kernel/2050980&quot;&gt;I&#39;ve ported all our locks&lt;/a&gt; to make use of these optimizations, which can save almost half the 
amount of barriers in the kernel&#39;s locking code -- which is specially nice under low or regular contention scenarios, 
where the fastpaths are exercised. There are plenty of other examples of real world code making use of acquire/release semantics. Mostly by using &lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;smp_load_acquire()/smp_store_release()&lt;/span&gt;; other primitives&amp;nbsp; &lt;span style=&quot;font-family: inherit;&quot;&gt;also use these semantics for common building blocks &lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;(as esoteric as they can get, ie RCU).&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/3042473723192220095/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2015/10/acquirerelease-semantics-in-kernel.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/3042473723192220095'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/3042473723192220095'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2015/10/acquirerelease-semantics-in-kernel.html' title='acquire/release semantics in the kernel'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXD54PB-3sO1A0ORDsL9rU-pRDP4fTOLvzYs7MX_8X8gj8YVnr9Xbl8cvlvjMMzRC_jAQeIqgrVDg81jUXa4-xwWOn5TL2lCzSotQFmPE5f_OB1xWWZFqixkPhmQJkcaYdfjpDLod5bD9O/s72-c/acquire-release.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-7433466098575674607</id><published>2015-08-24T07:34:00.000-07:00</published><updated>2015-12-30T06:31:49.922-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="conference"/><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="LPC 2015"/><category scheme="http://www.blogger.com/atom/ns#" term="operating systems"/><category scheme="http://www.blogger.com/atom/ns#" term="performance"/><category scheme="http://www.blogger.com/atom/ns#" term="plumbers"/><category 
scheme="http://www.blogger.com/atom/ns#" term="research"/><category scheme="http://www.blogger.com/atom/ns#" term="scalability"/><title type='text'>LPC 2015: Performance and Scalability MC</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-size: small;&quot;&gt;This year I had the privilege of &lt;span style=&quot;font-family: inherit;&quot;&gt;leading the&lt;span style=&quot;font-family: inherit;&quot;&gt; &lt;/span&gt;&lt;/span&gt;&lt;a href=&quot;https://www.linuxplumbersconf.org/2015/ocw/events/LPC2015/tracks/453&quot;&gt;Performance and Scalability&lt;/a&gt; micro-conference for Linux Plumbers. The goals and motivation behind &lt;span style=&quot;font-family: inherit;&quot;&gt;organi&lt;span style=&quot;font-family: inherit;&quot;&gt;zing&lt;/span&gt;&lt;/span&gt; this track were threefold. First present relevant work-in-progress ideas that can improve performance in core kernel subsystems, and need some face to face discussion -- as such, this requires previous debate on lkml. Similarly, learn about real bottlenecks and issues people are running into. And finally, get to know more relevant academic (experimental) work going on in both the kernel and system-level userland. As such, the sessions were grouped as follows:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-size: small;&quot;&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;&lt;i&gt;(i)&lt;/i&gt; &lt;a href=&quot;http://backtrace.io/blog/blog/2015/03/13/workload-specialization/&quot;&gt;Fast Bounded-Concurrency Hash Tables&lt;/a&gt;. &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;Samy &lt;span style=&quot;font-family: inherit;&quot;&gt;B&lt;/span&gt;ahra introduced a novel non-blocking multi-reader/single writer hash table with strong forward&lt;/span&gt; &lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;progress guarantees for TSO. Because the common-case fastpath does not incur &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;in barriers or atomic operations&lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! 
important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;, this technique &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-size: small;&quot;&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-size: small;&quot;&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;allows nearly &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;perfect scaling&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;. While his work is done in userspace, he sees potential &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;for it in the kernel, such as the networking subsystem. 
In such situations, the use of RCU (readers being the common case) might also be used.&lt;/span&gt;&lt;/span&gt;&lt;br style=&quot;color: #2e3436; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot; /&gt;&lt;span style=&quot;font-size: small;&quot;&gt;&lt;br style=&quot;color: #2e3436; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot; /&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;&lt;i&gt;(ii)&lt;/i&gt; &lt;a href=&quot;http://linuxplumbersconf.org/2015/ocw//system/presentations/2913/original/mcs_tsx.pdf&quot;&gt;Improving Transactional Memory Performance with Queued Locking&lt;/a&gt;. &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;While transactional memory&amp;nbsp; works nicely in conflict-free setups, it ends up requiring common &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! 
important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;serialization otherwise. An option is to ret&lt;span style=&quot;font-family: inherit;&quot;&gt;r&lt;span style=&quot;font-family: inherit;&quot;&gt;y, however, when &lt;span style=&quot;font-family: inherit;&quot;&gt;the amount o&lt;span style=&quot;font-family: inherit;&quot;&gt;f threads executing in the CR is larger than the&lt;span style=&quot;font-family: inherit;&quot;&gt; amount of completed threads, you can get pile&lt;span style=&quot;font-family: inherit;&quot;&gt;ups. &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;Tim Chen &lt;span style=&quot;font-family: inherit;&quot;&gt;p&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;resented a solution based on applying a sort of &#39;aperture&#39; and using &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;principles based on MCS for fair queuing, &lt;span style=&quot;font-family: inherit;&quot;&gt;where it &lt;/span&gt;can be regulated based on &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline !
important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;metrics such as the number of threads in the critical region a&lt;span style=&quot;font-family: inherit;&quot;&gt;nd abort ra&lt;span style=&quot;font-family: inherit;&quot;&gt;te&lt;/span&gt;&lt;/span&gt;.&lt;/span&gt;&lt;br style=&quot;color: #2e3436; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot; /&gt;&lt;br style=&quot;color: #2e3436; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot; /&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;&lt;i&gt;(iii)&lt;/i&gt; &lt;a href=&quot;https://linuxplumbersconf.org/2015/ocw/proposals/2751&quot;&gt;How to &lt;span style=&quot;font-family: inherit;&quot;&gt;Apply Mutation Testing to RCU&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;. &lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! 
important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;Iftekhar Ahmed from OSU&lt;span style=&quot;font-family: inherit;&quot;&gt;, &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-size: small;&quot;&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;summarized his research in overcoming limitations of mutation&lt;/span&gt; &lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;testing to identify problems in RCU. As usual, working with Paul Mc&lt;span style=&quot;font-family: inherit;&quot;&gt;Kenney&lt;/span&gt;, they &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;have been able to identify a number of mutants along with making use of &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! 
important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;rcutorture for specific periods of time. They generated ~3300 mutants &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;from rcu and rcutorture is doing a good job identifying them. It w&lt;span style=&quot;font-family: inherit;&quot;&gt;ould be interesting to see this applied&lt;span style=&quot;font-family: inherit;&quot;&gt; along with f&lt;span style=&quot;font-family: inherit;&quot;&gt;uzzy test&lt;span style=&quot;font-family: inherit;&quot;&gt;ing which has &lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;already&lt;span style=&quot;font-family: inherit;&quot;&gt; &lt;/span&gt;uncovered several &lt;span style=&quot;font-family: inherit;&quot;&gt;bugs in RCU&lt;span style=&quot;font-family: inherit;&quot;&gt; in the past.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;table cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIrBy9WJ_TY9_2-G42U8oUl38EMktviaEfu_Uuy1Scdn7s53Jyc1gbaPeAkCFqALT8PBLxSe8NWwIkgNrgMfkiCgNBiiobLbP6onJAwOIEk_UzB8m5McAdgYsGy6LAOrHBbWqFaLGtAG55/s1600/20805464645_410a3218b0_k.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;300&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIrBy9WJ_TY9_2-G42U8oUl38EMktviaEfu_Uuy1Scdn7s53Jyc1gbaPeAkCFqALT8PBLxSe8NWwIkgNrgMfkiCgNBiiobLbP6onJAwOIEk_UzB8m5McAdgYsGy6LAOrHBbWqFaLGtAG55/s400/20805464645_410a3218b0_k.jpg&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Scaling track -- LPC&#39;15, Seattle.&lt;/td&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: justify;&quot;&gt;
&lt;span style=&quot;font-size: small;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-size: small;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&amp;nbsp;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;&lt;i&gt;(iv)&lt;/i&gt; &lt;a href=&quot;https://linuxplumbersconf.org/2015/ocw/proposals/3291&quot;&gt;Unfair Qu&lt;/a&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;a href=&quot;https://linuxplumbersconf.org/2015/ocw/proposals/3291&quot;&gt;eued Spinlocks and Transactional Locks&lt;/a&gt;.&lt;/span&gt; &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;Waiman Long has been working on extending &lt;span style=&quot;font-family: inherit;&quot;&gt;spinlocks&lt;/span&gt; and apply &lt;span style=&quot;font-family: inherit;&quot;&gt;them&lt;/span&gt;&lt;/span&gt; &lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;to solve issues with &lt;span style=&quot;font-family: inherit;&quot;&gt;transactional memory.&lt;/span&gt; He presented experiments based on rwlocks and &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! 
important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;transactional spinlock (new primitive) for transactional (reader) and &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;non-transactional (writer) executions. This talk &lt;span style=&quot;font-family: inherit;&quot;&gt;nicely complemented&lt;/span&gt; Tim Chen&lt;span style=&quot;font-family: inherit;&quot;&gt;&#39;s previous presenta&lt;span style=&quot;font-family: inherit;&quot;&gt;tion&lt;/span&gt;.&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt; He also touched on the qspinlock performance in virtualized&lt;/span&gt; &lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;environments and the challenges currently out there. 
&lt;span style=&quot;font-family: inherit;&quot;&gt;As we&lt;span style=&quot;font-family: inherit;&quot;&gt; alrea&lt;span style=&quot;font-family: inherit;&quot;&gt;dy have code for this,&lt;span style=&quot;font-family: inherit;&quot;&gt; it was much easier to discuss face to face. &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;Consensus in the &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;room was that kernel developers are not against improving pv spinlocks, &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;but what is determined is that we will not accept a 3rd primitive.&lt;/span&gt;&lt;br style=&quot;color: #2e3436; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot; /&gt;&lt;br style=&quot;color: #2e3436; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot; /&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! 
important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;&lt;i&gt;(v)&lt;/i&gt; &lt;a href=&quot;https://sslab.gtisc.gatech.edu/2015/cloud-scalability.html&quot;&gt;Do Virtual Machines Really Scale&lt;/a&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;. &lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;Sanidhya Kashyap&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-size: small;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;&lt;a href=&quot;https://sslab.gtisc.gatech.edu/author/sanidhya-kashyap.html&quot;&gt;&lt;/a&gt;
from GA&lt;span style=&quot;font-family: inherit;&quot;&gt; Tech&lt;/span&gt; showed us the state of scalability in the cloud where there is &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;a clear trend that services hit poor scalability after certain degrees &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;of contention/core-count. These are LHP issues and vmexits/enters cause &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;performance issues at high vcpu counts. He introduces oticket backed by &lt;/span&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;performing multiple wakeups at once when granting the lock. Good&lt;/span&gt; &lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! 
important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;feedback and suggestions to overcome some of the presented issues&lt;span style=&quot;font-family: inherit;&quot;&gt; with the approach&lt;span style=&quot;font-family: inherit;&quot;&gt;. This was a&lt;span style=&quot;font-family: inherit;&quot;&gt;n extra short BoF like of presentation, but &lt;span style=&quot;font-family: inherit;&quot;&gt;there was quite a bi&lt;span style=&quot;font-family: inherit;&quot;&gt;t of interest, and the &lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;appropriate&lt;/span&gt; people&lt;/span&gt;&lt;/span&gt; were in the&lt;span style=&quot;font-family: inherit;&quot;&gt; room.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-size: small;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;background-color: white; color: #2e3436; display: inline ! important; float: none; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;Overall I would say that all thr&lt;span style=&quot;font-family: inherit;&quot;&gt;e&lt;span style=&quot;font-family: inherit;&quot;&gt;e &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;objectives&lt;/span&gt; were met and the quality of the sessions were high, thus &lt;span style=&quot;font-family: inherit;&quot;&gt;meeting all expectations&lt;/span&gt; &lt;span style=&quot;font-family: inherit;&quot;&gt;(if not, please email me for feedback&lt;span style=&quot;font-family: inherit;&quot;&gt; ;-)&lt;/span&gt;&lt;/span&gt;. 
&lt;span style=&quot;font-family: inherit;&quot;&gt;In fact, there were some highly interesting and &lt;span style=&quot;font-family: inherit;&quot;&gt;relevant &lt;/span&gt;presentations t&lt;span style=&quot;font-family: inherit;&quot;&gt;hat, due to t&lt;span style=&quot;font-family: inherit;&quot;&gt;i&lt;span style=&quot;font-family: inherit;&quot;&gt;me constraints&lt;span style=&quot;font-family: inherit;&quot;&gt;, had to &lt;span style=&quot;font-family: inherit;&quot;&gt;be left out&lt;span style=&quot;font-family: inherit;&quot;&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/7433466098575674607/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2015/08/lpc-2015-performance-and-scalability-mc.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/7433466098575674607'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/7433466098575674607'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2015/08/lpc-2015-performance-and-scalability-mc.html' title='LPC 2015: Performance and Scalability MC'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIrBy9WJ_TY9_2-G42U8oUl38EMktviaEfu_Uuy1Scdn7s53Jyc1gbaPeAkCFqALT8PBLxSe8NWwIkgNrgMfkiCgNBiiobLbP6onJAwOIEk_UzB8m5McAdgYsGy6LAOrHBbWqFaLGtAG55/s72-c/20805464645_410a3218b0_k.jpg" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-976495088603038952</id><published>2014-01-20T13:32:00.002-08:00</published><updated>2014-01-22T19:49:26.260-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="algorithms"/><category scheme="http://www.blogger.com/atom/ns#" term="C"/><category scheme="http://www.blogger.com/atom/ns#" term="contention"/><category scheme="http://www.blogger.com/atom/ns#" term="data structures"/><category scheme="http://www.blogger.com/atom/ns#" term="hash tables"/><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category 
scheme="http://www.blogger.com/atom/ns#" term="numa"/><category scheme="http://www.blogger.com/atom/ns#" term="operating systems"/><category scheme="http://www.blogger.com/atom/ns#" term="performance"/><title type='text'>futexes and hash table collisions</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Hash tables are popular data structures that efficiently handle dictionary operations (search and insert/delete). The &lt;b&gt;Linux kernel&lt;/b&gt; relies on them for a number of subsystems, including major core kernel areas, such as dcache/inode lookups, workqueues, timers, the PID table, TCP/UDP and futexes. This last being used as common building blocks for implementing userspace locking primitives, &lt;i&gt;pthreads&lt;/i&gt; being, perhaps, the most popular user.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;Futexes make use of single, &lt;a href=&quot;http://en.wikipedia.org/wiki/Hash_table#Separate_chaining&quot;&gt;chained&lt;/a&gt;, hash table. The user space address (&lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;uaddr&lt;/span&gt;) is used by the kernel to generate a unique &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;futex_key&lt;/span&gt; to reference the futex. Each key is hashed to a bucket  (&lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;hb&lt;/span&gt;), which contains a single priority based linked list -- real-time tasks are queued in front of regular tasks, otherwise ordered as FIFO. To synchronize updates to the list, a &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;hb-&amp;gt;lock&lt;/span&gt; spinlock is used. N&lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;ote that collisions can occur where similar user addresses can hash to the same futex key, so 
a single list can contain tasks blocked on different futexes. &lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;There are a total of &lt;b&gt;256 hash buckets&lt;/b&gt; in the entire table. For a much more thorough futex architectural overview, refer to: &lt;/span&gt;&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://lwn.net/Articles/360699/&quot;&gt;A futex overview and update&lt;/a&gt;. Darren Hart, LWN.net. Nov, 2009.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.akkadia.org/drepper/futex.pdf&quot;&gt;Futexes Are Tricky&lt;/a&gt;. Ulrich Drepper. Nov, 2011.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lwn.net/images/conf/rtlws11/papers/proc/p10.pdf&quot;&gt;Requeue-PI: Making Glibc Condvars PI-Aware&lt;/a&gt;. Darren Hart, Dinakar Guniguntala.&lt;/li&gt;
&lt;/ul&gt;
&lt;br /&gt;
Operations on futexes can be classified as putting a task to sleep/block to &lt;b&gt;wait&lt;/b&gt; on a futex, or, the opposite, &lt;b&gt;wake&lt;/b&gt; up one or more blocked tasks. Both commands make use of the architecture (very briefly) described above. Of course, each of these operations require hashing the &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;uaddr&lt;/span&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: inherit;&quot;&gt;,&lt;/span&gt; and thus taking the &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;hb-&amp;gt;lock&lt;/span&gt; to access the list.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Bottlenecks&lt;/h3&gt;
The size of the hash table is evidently a major bottleneck in today&#39;s systems. Large systems, using many futexes, can be prone to high amounts of 
collisions, where these futexes hash to the same bucket and therefore lead to 
extra contention on the same &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;hb-&amp;gt;lock&lt;/span&gt;. Furthermore,
 cacheline bouncing occurs when we have multiple hash bucket 
spinlocks residing on the same cacheline and different futexes hash to 
adjacent buckets. If tasks operate on different futexes that are on 
the same list, the lock will become contended really fast.&lt;br /&gt;
&lt;br /&gt;
In addition, the entire hash table is allocated on a single NUMA node, which creates remote node memory accesses. As systems become more powerful, having NUMA aware algorithms and data structures is paramount to take advantage of today&#39;s hardware trends. Accessing the hash table from remote NUMA nodes can lead to higher memory 
latencies. &lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Optimizations &amp;amp; Results&lt;/h3&gt;
Upstream commit &lt;a href=&quot;http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a52b89ebb6d4499be38780db8d176c5d3a6fbc17&quot;&gt;a52b89eb &lt;/a&gt;deals with both bottlenecks. The hash table now contains 256 hash buckets per CPU as well as being NUMA aware. There was also some &lt;a href=&quot;https://lkml.org/lkml/2013/12/1/43&quot;&gt;discussion&lt;/a&gt; on scaling the table up by RAM as well, and furthermore hashing on the &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;uaddr&lt;/span&gt;&#39;s page node, thus reducing the cost of collisions. However this cannot be done as the pages can move between nodes at any point. In addition to enlarging the table, cacheline aligning the hash bucket structure also provided a nice optimization, as it avoids accesses across cacheline boundaries. The figure below (higher is better) shows the throughput of &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;uaddr&lt;/span&gt; hashing for the different optimizations, where each thread operates on 1024 futexes.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOUycv7Qjkj0yuDoW6DT9pGTups_6PWBFSsKYoUpEeSfp-h1DXs4PGMcbb9wAhseyw0GpYTapn24Mk58lNGcDavL9AzjAAdZKC9vQV5EPWyDVtXcSoKKp6uO4lGj6ZZjWWmN6hhZGokbsj/s1600/futex-uaddr-hash-scaling.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOUycv7Qjkj0yuDoW6DT9pGTups_6PWBFSsKYoUpEeSfp-h1DXs4PGMcbb9wAhseyw0GpYTapn24Mk58lNGcDavL9AzjAAdZKC9vQV5EPWyDVtXcSoKKp6uO4lGj6ZZjWWmN6hhZGokbsj/s1600/futex-uaddr-hash-scaling.png&quot; height=&quot;412&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Combining both cacheline aligning and larger, NUMA table provides the best results -- each percentage increase is added to the final value. As more futexes are dealt with, the more clear the benefits, with speedups from 78% to 800%.&lt;br /&gt;
&lt;br /&gt;
Of course, performance goes down as more futexes are added to the equation. This is unavoidable given the overall architectural designs that govern futexes. Hashing on 512 threads isn&#39;t as fast as on 32 threads, but the proportion of &lt;i&gt;baseline&lt;/i&gt; and &lt;i&gt;both&lt;/i&gt; clearly becomes larger.&lt;br /&gt;
&lt;br /&gt;
Another recent improvement is dealing with smarter wake-ups. Commit &lt;a href=&quot;http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b0c29f79ecea0b6fbcefc999e70f2843ae8306db&quot;&gt;b0c29f79&lt;/a&gt; avoids taking the &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;hb-&amp;gt;lock&lt;/span&gt; when there are no tasks waiting on the futex -- thus a free ride for &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;futex(2)&lt;/span&gt; calls returning 0. This extends the parallelism of futexes, allowing other calls to be processed concurrently instead of wasting time spinning on a potentially contended spinlock.&lt;br /&gt;
&lt;br /&gt;
These optimizations will be included in Linux 3.14. &lt;br /&gt;
&lt;br /&gt;
Special thanks to, among others, Thomas Gleixner, Darren Hart and Peter Zijlstra for entertaining discussion and taking the time to review this work.&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/976495088603038952/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2014/01/futexes-and-hash-table-collisions.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/976495088603038952'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/976495088603038952'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2014/01/futexes-and-hash-table-collisions.html' title='futexes and hash table collisions'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOUycv7Qjkj0yuDoW6DT9pGTups_6PWBFSsKYoUpEeSfp-h1DXs4PGMcbb9wAhseyw0GpYTapn24Mk58lNGcDavL9AzjAAdZKC9vQV5EPWyDVtXcSoKKp6uO4lGj6ZZjWWmN6hhZGokbsj/s72-c/futex-uaddr-hash-scaling.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-5144655601428418388</id><published>2013-09-28T13:43:00.000-07:00</published><updated>2014-01-18T21:23:50.884-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="efi"/><category scheme="http://www.blogger.com/atom/ns#" term="gpt"/><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="linux kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="master boot record"/><category scheme="http://www.blogger.com/atom/ns#" term="mbr"/><category scheme="http://www.blogger.com/atom/ns#" term="partition tables"/><title type='text'>Detecting 
Hybrid MBRs in the Linux Kernel</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
EFI&#39;s GPT disklabels present a number of benefits to the traditional MBR scheme. For instance, not having to deal with CHS addressing, better data integrity (including a backup header as data redundancy) and 64bit LBA addressing, allowing partitions to go beyond the 2Tb limit all the way up to 9.4 Zb.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
These nice features don&#39;t come free, however, having to deal with older legacy systems (normally BIOS-based) that only use MBR, and do not know about GPT. For example, users&amp;nbsp; who have an EFI system (say a Mac), dual booting with an older, non-EFI version of Windows. While OSX knows GPT and uses the GPT partition(s), Windows doesn&#39;t, so you cannot dual boot without creating a &lt;a href=&quot;http://www.rodsbooks.com/gdisk/hybrid.html&quot;&gt;hybrid MBR&lt;/a&gt; - the standard&lt;a href=&quot;http://en.wikipedia.org/wiki/GUID_Partition_Table&quot;&gt; protective MBR&lt;/a&gt; (pMBR) won&#39;t allow Windows to boot. This hybrid MBR will extend the regular pMBR (containing a 0xEE GPT partition) so that it contains up to three primary partitions that point to the same disk locations that the GPT partitions point to. Hybrid MBRs are &lt;u&gt;unofficial&lt;/u&gt; workarounds to the GPT specs, but necessary for backward compatibility. Furthermore, most bootloaders are now acknowledging this kind of scheme.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
In order for Linux to properly discover protective MBRs, it must be made
aware of devices that have hybrid MBRs. To this end, Linux v3.12 will now be able to &lt;a href=&quot;http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b05ebbbbeb67a420d06567c6b9618a9e644d6104&quot;&gt;detect&lt;/a&gt; these partitioning schemes.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Furthermore, the kernel &lt;a href=&quot;http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3e69ac344007bec5e3987ac86619e140fbc79b72&quot;&gt;will no longer require&lt;/a&gt; the GPT partition to begin at sector 1, enabling Linux to be more flexible when
probing for GPT disklabels. Linux was the only OS that enforced this, and apart from it not being enforced by UEFI, it caused Linux to potentially
fail to detect &lt;b&gt;valid&lt;/b&gt; partitions on the disk. 
For compatibility reasons, if the first partition is hybridized, the 0xEE
partition must be small enough to ensure that it only protects the GPT
data structures - as opposed to the whole disk in a protective MBR. Note that these changes do not affect already existing partitions. &lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/5144655601428418388/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2013/09/detecting-hybrid-mbrs-in-linux-kernel.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/5144655601428418388'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/5144655601428418388'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2013/09/detecting-hybrid-mbrs-in-linux-kernel.html' title='Detecting Hybrid MBRs in the Linux Kernel'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-1743149866934670924</id><published>2012-10-08T07:09:00.002-07:00</published><updated>2012-10-08T11:50:15.583-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="conference"/><category scheme="http://www.blogger.com/atom/ns#" term="critique"/><category scheme="http://www.blogger.com/atom/ns#" term="foss.in"/><category scheme="http://www.blogger.com/atom/ns#" term="india"/><title type='text'>FOSS.IN organization team critique</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Back in June I submitted a talk to &lt;a href=&quot;http://foss.in/&quot;&gt;FOSS.IN&lt;/a&gt;&amp;nbsp;2012 conference in Bangalore, India. Unfortunately my talk was not included in the list of accepted proposals, in&amp;nbsp;other words, it was rejected. But that&#39;s not the reason why I&#39;m writing, or why I am most&amp;nbsp;&lt;b&gt;disappointed&lt;/b&gt;&amp;nbsp;in how things were handled by FOSS.IN&#39;s organizing team.&amp;nbsp;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The program&#39;s &lt;a href=&quot;http://foss.in/participate/call-for-participation&quot;&gt;call for participation&lt;/a&gt;&amp;nbsp;stated that the list of accepted proposals would be published by August 6th. This, however, did not occur until &lt;a href=&quot;http://foss.in/2012/take-one-speakers-at-foss-in2012&quot;&gt;two months later&lt;/a&gt;, in early October. Working in academia, I am well aware that conference dates and deadlines can be changed, and one gets used to this, and takes it with a grain of salt. What I cannot understand, or accept, is the fact that FOSS.IN did not bother to inform anyone (specially those of us who took the time to submit a talk) that the deadlines were not going to be kept. A delay of two months is already incredible, not to mention the total lack of information as to when the accepted talks would be published. I had never seen such a thing from a conference, and hope to never see it again.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Furthermore, I was not even informed that my talk had been rejected. FOSS.IN already has an automated system to send people emails, I got one confirming my submission. So why wasn&#39;t I notified? Automatic emails are easy, fast and free.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Submitting a good talk to a conference takes time and careful preparation. I would expect a minimal amount of courtesy and professionalism by FOSS.IN.&amp;nbsp;People make plans around deadlines and it&#39;s&amp;nbsp;extremely&amp;nbsp;rude to keep them in the dark.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
This kind of&amp;nbsp;behavior&amp;nbsp;is simply&amp;nbsp;&lt;b&gt;unacceptable&lt;/b&gt;&amp;nbsp;and makes the entire conference look bad. Yes, it&#39;s quite a big and well known event within the free software community, but that&#39;s not a justification.&amp;nbsp;I am aware of Atul Chitnis&#39;&amp;nbsp;&lt;a href=&quot;http://atulchitnis.net/2012/public-statement-about-my-health/&quot;&gt;condition&lt;/a&gt;&amp;nbsp;and wish him all the best and hope he can overcome the illness. I was very sorry to learn about it. But&amp;nbsp;FOSS.IN is not a one man job.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
I am writing this as a &lt;b&gt;constructive&amp;nbsp;criticism&lt;/b&gt;, hoping that these unpleasant things do not reoccur in future events. Since this year was the first time I&amp;nbsp;proposed&amp;nbsp;a presentation I don&#39;t know if these issues were a one time thing or is what people have come to expect from the FOSS.IN teams. Still I wish the best of luck to you folks and hope that you have another great conference this year.&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/1743149866934670924/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2012/10/fossin-organization-team-critique.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/1743149866934670924'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/1743149866934670924'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2012/10/fossin-organization-team-critique.html' title='FOSS.IN organization team critique'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-34931413066027668</id><published>2012-09-27T14:15:00.000-07:00</published><updated>2012-09-28T07:35:28.257-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="block devices"/><category scheme="http://www.blogger.com/atom/ns#" term="C programming"/><category scheme="http://www.blogger.com/atom/ns#" term="development"/><category scheme="http://www.blogger.com/atom/ns#" term="dos"/><category scheme="http://www.blogger.com/atom/ns#" term="efi"/><category scheme="http://www.blogger.com/atom/ns#" term="fdisk"/><category scheme="http://www.blogger.com/atom/ns#" term="google summer of code"/><category scheme="http://www.blogger.com/atom/ns#" term="gpt"/><category scheme="http://www.blogger.com/atom/ns#" term="labels"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="partition tables"/><category scheme="http://www.blogger.com/atom/ns#" term="sun"/><category 
scheme="http://www.blogger.com/atom/ns#" term="util-linux"/><title type='text'>fdisk updates and GPT support</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;background-color: white; color: #333333; line-height: 20px; text-align: -webkit-auto;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;The fdisk tool is perhaps the most recognized disk partitioner in the world, as it has historically been present in Windows and all Unix flavors, among other OSs. While this tool has proven useful for its Linux variant, it has been subject to intense patching along its 20 years of existence, and it is a product of multiple authors, coding styles and concepts. Because of this, extending fdisk&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;background-color: white; color: #333333; line-height: 20px; text-align: -webkit-auto;&quot;&gt;,&lt;/span&gt;&lt;span style=&quot;background-color: white; color: #333333; line-height: 20px; text-align: -webkit-auto;&quot;&gt;&amp;nbsp;to keep up with modern day computing and disk needs is hard, time consuming and error prone. To address this, a serious effort,&amp;nbsp;&lt;/span&gt;initially&amp;nbsp;&lt;a href=&quot;http://code.google.com/soc/&quot;&gt;sponsored&lt;/a&gt;&amp;nbsp;by Google,&lt;/span&gt;&lt;span style=&quot;background-color: white; color: #333333; line-height: 20px; text-align: -webkit-auto;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&amp;nbsp;was started to redesign and update fdisk to fit the requirements of a modern disk partitioning program. Some include removing DOS compatibility mode, replacing the deprecated &lt;a href=&quot;http://en.wikipedia.org/wiki/Cylinder-head-sector&quot;&gt;CHS&lt;/a&gt; addressing with&amp;nbsp;&lt;a href=&quot;http://en.wikipedia.org/wiki/Logical_Block_Addressing&quot;&gt;LBA&lt;/a&gt;, GPT support, creating a generic driver-based API that can transparently handle different partition types and major code cleanups and refactoring, among others. 
While several things have been done, there is still a long ways to go.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;background-color: white; color: #333333; line-height: 20px; text-align: -webkit-auto;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;I&#39;m pleased to&amp;nbsp;announce&amp;nbsp;that fdisk&amp;nbsp;&lt;/span&gt;&lt;a href=&quot;http://git.kernel.org/?p=utils/util-linux/util-linux.git;a=commit;h=766d5156c43b784700d28d1c1141008b2bf35ed7&quot; style=&quot;font-family: inherit;&quot;&gt;can now work with GPT based disks&lt;/a&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;!!&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;

&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;GUID Partition Table (GPT) , developed by Intel in the late &#39;90s, is a standard for laying out partitioning on hard disks, now forming part of the&amp;nbsp;&lt;/span&gt;&lt;a href=&quot;http://www.uefi.org/&quot; style=&quot;font-family: inherit;&quot;&gt;UEFI&lt;/a&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&amp;nbsp;standard. Its increasing popularity is easily understandable, as it provides several benefits over the traditional&amp;nbsp;PC&amp;nbsp;master boot record &amp;nbsp;(&lt;/span&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Master_boot_record&quot; style=&quot;font-family: inherit;&quot;&gt;MBR&lt;/a&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;) scheme. Furthermore, people using Intel based Apple products (like &lt;/span&gt;&lt;i style=&quot;font-family: inherit;&quot;&gt;&lt;a href=&quot;http://www.apple.com/mac/&quot;&gt;macbooks&lt;/a&gt;&lt;/i&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;) will most likely be using GPT (with a hybrid MBR scheme). While the Internet is full of documents that go into the details of this format, there are a few benefits worth mentioning here:&lt;/span&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div&gt;
&lt;ul style=&quot;text-align: left;&quot;&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;GPT does not know anything about CHS addressing, and only uses LBA (64bit).&lt;/span&gt;&lt;/li&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;Because it uses 64bit LBAs, it can hold&lt;span style=&quot;background-color: white; line-height: 19px; text-align: -webkit-auto;&quot;&gt;&amp;nbsp;2&lt;/span&gt;&lt;sup style=&quot;background-color: white; line-height: 1em; text-align: -webkit-auto;&quot;&gt;64&lt;/sup&gt;&lt;span style=&quot;background-color: white; line-height: 19px; text-align: -webkit-auto;&quot;&gt;−1 sectors, typically&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;background-color: white; line-height: 19px; text-align: -webkit-auto;&quot;&gt;9.4&amp;nbsp;Zb with standard 512 byte sectors, way&lt;/span&gt;&amp;nbsp;above the 2Tb limit offered by MBR.&lt;/span&gt;&lt;/li&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;GPT uses 32bit CRC checksums to validate data integrity for its headers and partition entries. It also adds redundancy to its structures, having them present twice, once at the start and again at the end of the disk. This, of course, helps protect the system against disk errors and allows better recovery.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Some&amp;nbsp;considerations&amp;nbsp;about the implementation:&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;ul style=&quot;text-align: left;&quot;&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;We currently support probing, listing/adding/deleting/writing partitions, data integrity verification. Furthermore, fdisk can determine if there is a traditional protected, or hybrid MBR present.&lt;/li&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;For now, primary header corruption is not&amp;nbsp;recoverable&amp;nbsp;from the backup at the end of the disk.&lt;/li&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;background-color: white; text-align: -webkit-auto;&quot;&gt;Header&amp;nbsp;checksums&amp;nbsp;are&amp;nbsp;updated&amp;nbsp;upon&amp;nbsp;every&amp;nbsp;change&amp;nbsp;(ie:&amp;nbsp;add/delete&amp;nbsp;partitions),&amp;nbsp;this&amp;nbsp;allows&amp;nbsp;us&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;background-color: white; text-align: -webkit-auto;&quot;&gt;to&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;background-color: white; font-family: inherit; text-align: -webkit-auto;&quot;&gt;&amp;nbsp;mathematically&amp;nbsp;verify &lt;/span&gt;&lt;span style=&quot;background-color: white; font-family: inherit; text-align: -webkit-auto;&quot;&gt;the&amp;nbsp;changes&amp;nbsp;on-the-fly,&amp;nbsp;and&amp;nbsp;not&amp;nbsp;only&amp;nbsp;when&amp;nbsp;writing&amp;nbsp;to&amp;nbsp;disk,&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;background-color: white; font-family: inherit; text-align: -webkit-auto;&quot;&gt;like&lt;/span&gt;&lt;span style=&quot;background-color: white; font-family: inherit; text-align: -webkit-auto;&quot;&gt;&amp;nbsp;most&amp;nbsp;other &amp;nbsp;related&amp;nbsp;tools&amp;nbsp;do.&lt;/span&gt;&lt;/li&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;background-color: white; text-align: -webkit-auto;&quot;&gt;&lt;span style=&quot;font-family: monospace; font-size: x-small;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;When&amp;nbsp;creating&amp;nbsp;a&amp;nbsp;new&amp;nbsp;partition, all partition type &lt;a href=&quot;http://en.wikipedia.org/wiki/GUID_Partition_Table#Partition_type_GUIDs&quot;&gt;GUIDs&lt;/a&gt;&amp;nbsp;are&amp;nbsp;available.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;I&#39;d like to thank both Petr Uzel from SuSE and &lt;a href=&quot;http://karelzak.blogspot.com/&quot;&gt;Karel Zak&lt;/a&gt; from Red Hat for their time reviewing, testing and answering any doubts I had.&lt;/span&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;Enjoy!&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/34931413066027668/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2012/09/fdisk-updates-and-gpt-support.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/34931413066027668'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/34931413066027668'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2012/09/fdisk-updates-and-gpt-support.html' title='fdisk updates and GPT support'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-6159280493924136818</id><published>2012-05-26T11:26:00.001-07:00</published><updated>2012-10-03T09:54:13.307-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="architecture"/><category scheme="http://www.blogger.com/atom/ns#" term="associateve"/><category scheme="http://www.blogger.com/atom/ns#" term="caching"/><category scheme="http://www.blogger.com/atom/ns#" term="Intel"/><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="kvm"/><category scheme="http://www.blogger.com/atom/ns#" term="labels"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="tags"/><category scheme="http://www.blogger.com/atom/ns#" term="TLB"/><category scheme="http://www.blogger.com/atom/ns#" term="translations"/><category scheme="http://www.blogger.com/atom/ns#" term="virtualization"/><category 
scheme="http://www.blogger.com/atom/ns#" term="VMX"/><category scheme="http://www.blogger.com/atom/ns#" term="x86"/><title type='text'>kvm: Intel associative TLBs</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Traditional x86 architecture implicitly requires TLB flushing upon context switching (&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;CR3&lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&amp;nbsp;writes) so the new process-to-run&#39;s address space does not conflict with linear to physical translations cached by previous processes. When using shadow pages for MMU virtualization, it can be quite expensive to throw away.&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Intel introduced Virtual Processor ID (&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;vpid)&lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&amp;nbsp;into its VT-x technology in order to tag different processes and therefore avoid&amp;nbsp;&lt;/span&gt;unnecessary&lt;span style=&quot;font-family: inherit;&quot;&gt;&amp;nbsp;TLB flushes.&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;KVM uses a global bitmap to facilitate vpid management for all guests and all vCPUs, managing up to ~64000 unique identifiers. Upon virtual machine startup it will allocate a vpid for each vCPU with a first-come, first-serve policy. The data is protected by a &lt;/span&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;vmx_vpid_lock&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;spinlock.&lt;/span&gt;&lt;/div&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;br /&gt;
&lt;pre style=&quot;background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi64Lr_VvHGv8ertaDsasLVFpAl2QN8C251CJthu4PBll3Z8HPLLm3o2sPau-FaPEnvN3SrkmpSi-rmzXBtXboy8PvTcz3SqZ3NAFfNc_hM2KWi5dPz29TnaBLYzNNa3GPDl8DgnFhzYilW/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;&quot;&gt;&lt;code style=&quot;color: black; word-wrap: normal;&quot;&gt; static DECLARE_BITMAP(vmx_vpid_bitmap, VMX_NR_VPIDS);  
 static DEFINE_SPINLOCK(vmx_vpid_lock);  
 ...  
 static void allocate_vpid(struct vcpu_vmx *vmx)  
 {  
      int vpid;  
      vmx-&amp;gt;vpid = 0;  
      if (!enable_vpid)  
           return;  
      spin_lock(&amp;amp;vmx_vpid_lock);  
      vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);  
      if (vpid &amp;lt; VMX_NR_VPIDS) {  
           vmx-&amp;gt;vpid = vpid;  
           __set_bit(vpid, vmx_vpid_bitmap);  
      }  
      spin_unlock(&amp;amp;vmx_vpid_lock);  
 }  
&lt;/code&gt;&lt;/pre&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;Similarly, when the guest is shut down, it will free its corresponding vpid(s):&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;pre style=&quot;background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi64Lr_VvHGv8ertaDsasLVFpAl2QN8C251CJthu4PBll3Z8HPLLm3o2sPau-FaPEnvN3SrkmpSi-rmzXBtXboy8PvTcz3SqZ3NAFfNc_hM2KWi5dPz29TnaBLYzNNa3GPDl8DgnFhzYilW/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;&quot;&gt;&lt;code style=&quot;color: black; word-wrap: normal;&quot;&gt; static void free_vpid(struct vcpu_vmx *vmx)  
 {  
      if (!enable_vpid)  
           return;  
      spin_lock(&amp;amp;vmx_vpid_lock);  
      if (vmx-&amp;gt;vpid != 0)  
           __clear_bit(vmx-&amp;gt;vpid, vmx_vpid_bitmap);  
      spin_unlock(&amp;amp;vmx_vpid_lock);  
 }  
&lt;/code&gt;&lt;/pre&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
To invalidate different cached translations based on vpid, Intel added the &lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;invvpid &lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;instruction&lt;/span&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;. &lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;The specific invalidations are grouped as (for more information check the Intel reference manual vol. 3C 2.8 - Caching Translation Information):&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;ul style=&quot;text-align: left;&quot;&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;Individual address: the vCPU invalidates translations for a specific given address and VPID&lt;/span&gt;&lt;/li&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;Single context: the vCPU invalidates all tagged translations for a specific given VPID&lt;/span&gt;&lt;/li&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;All context: the vCPU invalidates all translations for all VPIDs (except the original, id 0)&lt;/span&gt;&lt;/li&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;Single context, retaining global translations: the vCPU invalidates all tagged translations for a specific given VPID, except global translations.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Whenever there&#39;s a TLB flush call or a vCPU reset (like when setting up the architecture at boot time), both part of standard x86 operations, the &lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;vpid_sync_context() &lt;/span&gt;function is called:&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;pre style=&quot;background-attachment: initial; background-clip: initial; background-color: #f0f0f0; background-image: initial; background-origin: initial; border-bottom-color: rgb(204, 204, 204); border-bottom-style: dashed; border-bottom-width: 1px; border-image: initial; border-left-color: rgb(204, 204, 204); border-left-style: dashed; border-left-width: 1px; border-right-color: rgb(204, 204, 204); border-right-style: dashed; border-right-width: 1px; border-top-color: rgb(204, 204, 204); border-top-style: dashed; border-top-width: 1px; height: auto; overflow-x: auto; overflow-y: auto; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; width: 646px;&quot;&gt;&lt;code style=&quot;font-family: arial; font-size: 12px; line-height: 20px; word-wrap: normal;&quot;&gt; &lt;/code&gt;&lt;span style=&quot;background-color: transparent; font-size: 12px; line-height: 20px;&quot;&gt;static inline void vpid_sync_context(struct vcpu_vmx *vmx)
{
 if (cpu_has_vmx_invvpid_single())
  vpid_sync_vcpu_single(vmx);
 else
  vpid_sync_vcpu_global();
}&lt;/span&gt;&lt;/pre&gt;
&lt;br /&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;line-height: 20px; white-space: pre;&quot;&gt;This function calls the corresponding invalidation type, previously described. The&amp;nbsp;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;line-height: 20px; white-space: pre;&quot;&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;vpid_sync_vcpu_single()&lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt; routine &lt;/span&gt;obviously must pass the &lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;vmx-&amp;gt;vpid &lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;in order to specify which id it&#39;s referring to.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;line-height: 20px; white-space: pre;&quot;&gt;Both global and single contexts end up calling&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt; __invvpid(),&lt;/span&gt; which does all the assembler work. &lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: monospace;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The VPID feature can be enabled/disabled by traditional kernel module parameters &amp;nbsp;at&amp;nbsp;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;/sys/module/kvm_intel/parameters/vpid&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
A while ago I proposed a &lt;a href=&quot;https://lkml.org/lkml/2012/3/11/94&quot;&gt;patch&lt;/a&gt; to enable tracing vpid management for simulating tagged TLB&amp;nbsp;behavior&amp;nbsp;and performance. Unfortunately tracing these events for experimentation/research did not suit mainstream enough to be&amp;nbsp;officially&amp;nbsp;merged. Understandable.&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/6159280493924136818/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2012/05/kvm-intel-associative-tlbs.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/6159280493924136818'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/6159280493924136818'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2012/05/kvm-intel-associative-tlbs.html' title='kvm: Intel associative TLBs'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-7091395055513332572</id><published>2012-05-03T15:02:00.002-07:00</published><updated>2012-09-28T07:40:18.981-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="concurrency"/><category scheme="http://www.blogger.com/atom/ns#" term="development"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="locks"/><category scheme="http://www.blogger.com/atom/ns#" term="lslk"/><category scheme="http://www.blogger.com/atom/ns#" term="lslocks"/><title type='text'>linux local system locks</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The &lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;lslk(8)&lt;/span&gt; program has been unmaintained and deprecated for over a decade now, since 2001. &amp;nbsp;I&#39;ve recently rewritten the tool, now called &lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;lslocks(8)&lt;/span&gt;&lt;span style=&quot;font-family: Arial, Helvetica, sans-serif;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;that allows an easier and up-to-date way of seeing all the current file held locks in a Linux system. This program will be shipping soon with standard system tools and available in your&amp;nbsp;&lt;/span&gt;favorite&lt;span style=&quot;font-family: inherit;&quot;&gt;&amp;nbsp;distribution.&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;Some important modifications include removing legacy Unix outputs and options, for example:&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;ul style=&quot;text-align: left;&quot;&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;Don&#39;t output inode number, whence and maj:min device numbers.&lt;/li&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;Don&#39;t provide nonblocking syscall options stat(2) and readlink(2).&amp;nbsp;&lt;/li&gt;
&lt;ul&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;The option to use nonblocking calls was previously intended for NFS partitions;&amp;nbsp;however this should be transparent to utility programs considering that&amp;nbsp;timeouts can occur generically in any context (fuse - sshfs, NFS, netdevs, etc).&lt;/li&gt;
&lt;/ul&gt;
&lt;/ul&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The command itself is quite straightforward - KISS:&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;
&lt;pre style=&quot;background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi64Lr_VvHGv8ertaDsasLVFpAl2QN8C251CJthu4PBll3Z8HPLLm3o2sPau-FaPEnvN3SrkmpSi-rmzXBtXboy8PvTcz3SqZ3NAFfNc_hM2KWi5dPz29TnaBLYzNNa3GPDl8DgnFhzYilW/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;&quot;&gt;&lt;code style=&quot;color: black; word-wrap: normal;&quot;&gt; $&amp;gt; lslocks   
 COMMAND      PID TYPE SIZE MODE M   START    END PATH  
 smbd       1379 POSIX  5B WRITE 0     0     0 /var/run/samba/smbd.pid  
 smbd       1379 POSIX 696B READ 0     4     4 /var/run/samba/messages.tdb  
 ...
 smbd       1379 POSIX 696B READ 0     4     4 /var/run/samba/gencache_notrans.tdb  
 smbd       1513 POSIX 696B READ 0     4     4 /var/run/samba/messages.tdb  
 (unknown)    1717 FLOCK  0B WRITE 0     0     0 /var/run  
 atd       1793 POSIX  5B WRITE 0     0     0 /var/run/atd.pid  
 sendmail-mta   2004 POSIX  52B WRITE 0     0     0 /var/run/sendmail/mta/sendmail.pid  
 nmbd       2292 POSIX  5B WRITE 0     0     0 /var/run/samba/nmbd.pid  
 nmbd       2292 POSIX 696B READ 0     4     4 /var/run/samba/messages.tdb  
 nmbd       2292 POSIX 108K READ 0     4     4 /var/run/samba/connections.tdb  
 cat       3221 POSIX  0B WRITE 0     0     0 /home/dave/.local/share/zeitgeist/fts.index/flintlock  
 zeitgeist-daemo 3211 POSIX 989K WRITE 0 1073741824 1073742335 /home/dave/.local/share/zeitgeist/activity.sqlite  
 chromium-browse 4306 POSIX 202K WRITE 0 1073741824 1073742335 /home/dave/.config/chromium/Default/Web Data  
 ...
&lt;/code&gt;&lt;/pre&gt;
&lt;pre style=&quot;background-attachment: initial; background-clip: initial; background-color: #f0f0f0; background-image: initial; background-origin: initial; background-position: initial initial; background-repeat: initial initial; border-bottom-color: rgb(204, 204, 204); border-bottom-style: dashed; border-bottom-width: 1px; border-image: initial; border-left-color: rgb(204, 204, 204); border-left-style: dashed; border-left-width: 1px; border-right-color: rgb(204, 204, 204); border-right-style: dashed; border-right-width: 1px; border-top-color: rgb(204, 204, 204); border-top-style: dashed; border-top-width: 1px; color: black; font-size: 12px; height: auto; line-height: 20px; overflow-x: auto; overflow-y: auto; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: left; width: 99%;&quot;&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
We can quickly see the command name and PID that currently cause a lock to be held, as well as its size and canonical path. The lock itself, can be&amp;nbsp;&amp;nbsp;FLOCK (created with &lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;flock(2)&lt;/span&gt;) or POSIX (created with &lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;fcntl(2)&lt;/span&gt; and &lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;lockf(2)&lt;/span&gt;) - I won&#39;t go over explaining the differences as it&#39;s not in the context of this post. The start and end are the relative byte offset of the lock.&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Enjoy!&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/7091395055513332572/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2012/05/linux-local-system-locks.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/7091395055513332572'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/7091395055513332572'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2012/05/linux-local-system-locks.html' title='linux local system locks'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-1595811657033117733</id><published>2012-03-05T10:52:00.001-08:00</published><updated>2012-09-28T07:39:14.814-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="hardware"/><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="kvm"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="memory management"/><category scheme="http://www.blogger.com/atom/ns#" term="mmu"/><category scheme="http://www.blogger.com/atom/ns#" term="x86"/><title type='text'>kvm: virtual x86 mmu setup</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
One of the initialization steps that KVM does when a virtual machine (VM) is started, is setting up the vCPU&#39;s memory management unit (MMU) to translate virtual (linear) addresses into physical ones within the guest&#39;s domain. For x86, which is what will be covered here, most of the corresponding code is in&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt; &amp;lt;kernel&amp;gt;/arch/x86/kvm/mmu.c&lt;/span&gt;.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;b&gt;&lt;i&gt;Disclaimer:&lt;/i&gt;&lt;/b&gt;&amp;nbsp;Although this document requires at least some basic knowledge of x86 paging and traditional virtual memory, I hope it can be useful for people that are interested in low-level virtualization, linux kernel and/or KVM internals in general.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The first step calls&amp;nbsp;&lt;b&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;kvm_mmu_setup()&lt;/span&gt;&lt;/b&gt;&amp;nbsp;which simply does some trivial asserting and calls&amp;nbsp;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;&lt;b&gt;init_kvm_mmu()&lt;/b&gt;&lt;/span&gt;:&lt;/div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;pre style=&quot;background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi64Lr_VvHGv8ertaDsasLVFpAl2QN8C251CJthu4PBll3Z8HPLLm3o2sPau-FaPEnvN3SrkmpSi-rmzXBtXboy8PvTcz3SqZ3NAFfNc_hM2KWi5dPz29TnaBLYzNNa3GPDl8DgnFhzYilW/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;&quot;&gt;&lt;code style=&quot;color: black; word-wrap: normal;&quot;&gt; static int init_kvm_mmu(struct kvm_vcpu *vcpu)  
 {  
      if (mmu_is_nested(vcpu))  
           return init_kvm_nested_mmu(vcpu);  
      else if (tdp_enabled)  
           return init_kvm_tdp_mmu(vcpu);  
      else  
           return init_kvm_softmmu(vcpu);  
 }  
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The first check is regarding nested MMUs, which is to run VMMs within guests, having yet another layer of indirection. This is part of the &lt;b&gt;Turtles project&lt;/b&gt; and won&#39;t be covered in this document, but it is well documented &lt;a href=&quot;http://www.mulix.org/pubs/turtles/h-0282.pdf&quot;&gt;elsewhere&lt;/a&gt;.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The &lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;tdp_enabled&lt;/span&gt;&amp;nbsp;(two dimensional paging) boolean variable determines whether or not hardware assisted paging (EPT or RVI/NPT) is enabled. &amp;nbsp;If true, it will use 2D paging, otherwise, the default option, shadow paging through software only support. Since KVM can be built as a kernel module, it uses the user&#39;s options to set the variable&#39;s value, with &lt;b&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;kvm_enable_tdp()&lt;/span&gt;&lt;/b&gt; and &lt;b&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;kvm_disable_tdp()&lt;/span&gt;&lt;/b&gt;. For example, users can check&amp;nbsp;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;/sys/module/kvm_intel/parameters/ept&lt;/span&gt;&amp;nbsp;to verify if EPT is enabled or not. Most distributions will load the module with it enabled, anyway:&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;pre style=&quot;background-attachment: initial; background-clip: initial; background-color: #f0f0f0; background-image: initial; background-origin: initial; border-bottom-color: rgb(204, 204, 204); border-bottom-style: dashed; border-bottom-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: dashed; border-left-width: 1px; border-right-color: rgb(204, 204, 204); border-right-style: dashed; border-right-width: 1px; border-top-color: rgb(204, 204, 204); border-top-style: dashed; border-top-width: 1px; color: black; font-size: 12px; height: auto; line-height: 20px; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; overflow-x: auto; overflow-y: auto; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: left; width: 646px;&quot;&gt;#&amp;gt; modprobe kvm_intel ept=1&lt;/pre&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Both &lt;b&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;init_kvm_tdp_mmu()&lt;/span&gt;&lt;/b&gt; and &lt;b&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;init_kvm_softmmu()&lt;/span&gt;&lt;/b&gt;&amp;nbsp;are responsible for setting up how the guest&#39;s page walking will be handled, by populating the &lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;walk_mmu&lt;/span&gt; structure. This structure abstracts the details of architecture-specific paging modes, allowing common operations like loading and setting CR3 for upper page level base pointer, flushing TLB entries (&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;invlpg&lt;/span&gt;) and page fault handling, among others.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Just like in traditional, non-virtualized environments, the guest&#39;s MMU must be capable of handling paging in 32bit, PAE, 64bit, optionally it can have paging disabled, so guest virtual addresses (gva) are the actual guest physical addresses (gpa), mapped 1:1. This is quite obvious since the guest does not know that its MMU is the one KVM presents to it, and not the real, physical one - making everything transparent - which is not the case for paravirtualization, like Xen.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;span style=&quot;font-size: large;&quot;&gt;&lt;u&gt;Hardware support initialization&lt;/u&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: inherit;&quot;&gt;Most logic is done in this single function:&lt;/span&gt;&lt;br /&gt;
&lt;pre style=&quot;background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;&quot;&gt;&lt;code&gt;static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
{
    struct kvm_mmu *context = vcpu-&amp;gt;arch.walk_mmu;

    context-&amp;gt;base_role.word = 0;
    context-&amp;gt;new_cr3 = nonpaging_new_cr3;
    context-&amp;gt;page_fault = tdp_page_fault;
    context-&amp;gt;free = nonpaging_free;
    context-&amp;gt;sync_page = nonpaging_sync_page;
    context-&amp;gt;invlpg = nonpaging_invlpg;
    context-&amp;gt;update_pte = nonpaging_update_pte;
    context-&amp;gt;shadow_root_level = kvm_x86_ops-&amp;gt;get_tdp_level();
    context-&amp;gt;root_hpa = INVALID_PAGE;
    context-&amp;gt;direct_map = true;
    context-&amp;gt;set_cr3 = kvm_x86_ops-&amp;gt;set_tdp_cr3;
    context-&amp;gt;get_cr3 = get_cr3;
    context-&amp;gt;get_pdptr = kvm_pdptr_read;
    context-&amp;gt;inject_page_fault = kvm_inject_page_fault;

    if (!is_paging(vcpu)) {
        context-&amp;gt;nx = false;
        context-&amp;gt;gva_to_gpa = nonpaging_gva_to_gpa;
        context-&amp;gt;root_level = 0;
    } else if (is_long_mode(vcpu)) {       
        context-&amp;gt;nx = is_nx(vcpu);
        reset_rsvds_bits_mask(vcpu, context, PT64_ROOT_LEVEL);
        context-&amp;gt;gva_to_gpa = paging64_gva_to_gpa;
        context-&amp;gt;root_level = PT64_ROOT_LEVEL;
    } else if (is_pae(vcpu)) {
        context-&amp;gt;nx = is_nx(vcpu);
        reset_rsvds_bits_mask(vcpu, context, PT32E_ROOT_LEVEL);
        context-&amp;gt;gva_to_gpa = paging64_gva_to_gpa;
        context-&amp;gt;root_level = PT32E_ROOT_LEVEL;
    } else {
        context-&amp;gt;nx = false;
        reset_rsvds_bits_mask(vcpu, context, PT32_ROOT_LEVEL);
        context-&amp;gt;gva_to_gpa = paging32_gva_to_gpa;
        context-&amp;gt;root_level = PT32_ROOT_LEVEL;
    }


    return 0;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;div&gt;
&lt;ol style=&quot;text-align: left;&quot;&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;The &lt;b&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;is_paging()&lt;/span&gt;&lt;/b&gt; function simply checks the vCPU&#39;s &lt;a href=&quot;http://www.sandpile.org/x86/mode.htm&quot;&gt;CR0.PG&lt;/a&gt; flag to see if paging is enabled or not - this will most likely be enabled!&lt;/li&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;The &lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;&lt;b&gt;is_long_mode()&lt;/b&gt;&lt;/span&gt; checks if the guest has a 64bit vCPU, by reading the&amp;nbsp;&lt;a href=&quot;http://www.sandpile.org/x86/mode.htm&quot;&gt;EFER.LMA&lt;/a&gt; (long mode active) bit, assuming, of course, CONFIG_X86_64 is set, since 64bit guests &lt;a href=&quot;http://www.linux-kvm.org/page/FAQ#Can_KVM_run_a_32-bit_guest_on_a_64-bit_host.3F_What_about_PAE.3F&quot;&gt;cannot&lt;/a&gt; run on 32bit hosts.&lt;/li&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;If PAE is enabled, then &lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace; font-weight: bold;&quot;&gt;is_pae()&lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&#39;s &lt;a href=&quot;http://www.sandpile.org/x86/mode.htm&quot;&gt;CR4.PAE&lt;/a&gt;&amp;nbsp;check will return&amp;nbsp;&lt;/span&gt;successfully&lt;span style=&quot;font-family: inherit;&quot;&gt;&amp;nbsp;and indicate that the physical address&amp;nbsp;&lt;/span&gt;extension&lt;span style=&quot;font-family: inherit;&quot;&gt;&amp;nbsp;is present, and the 32bit guest can reference more than 4Gb of address space.&lt;/span&gt;&lt;/li&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;Finally, if the above three checks fail, it&#39;s assumed that the guest works in standard 32bit mode.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
No matter what mode is set, no-execution bits, rsvds bits, what function will handle gva to gpa translation and the paging&#39;s root level is set:&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The &lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;-&amp;gt;nx&lt;/span&gt; flag refers to No-eXecution bits to separate areas of memory from being executed, avoiding buffer overflow attacks. This is obtained by checking vCPU&#39;s&amp;nbsp;&lt;a href=&quot;http://www.sandpile.org/x86/mode.htm&quot;&gt;EFER.NX&lt;/a&gt;&amp;nbsp;flag.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The &lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;-&amp;gt;gva_to_gpa&lt;/span&gt; is the function that will handle guest&#39;s virtual to physical translations, discussed &lt;a href=&quot;http://blog.stgolabs.net/2012/03/kvm-hardware-assisted-paging.html&quot;&gt;here&lt;/a&gt;. When paging is disabled, the gpa is returned, and for the other modes, &lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;gva_to_gpa()&lt;/span&gt; is the same function (defined in&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt; paging_tmpl.h&lt;/span&gt;), but varies according to the root level and paging mode.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The &lt;b style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;reset_rsvds_bits_mask()&amp;nbsp;&lt;/b&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;function just sets the reserved bits mask for the guest&#39;s page table entries.&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;b&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Finally, the page walker&#39;s &lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;-&amp;gt;root_level &lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;refers to the amount of hierarchical levels of guest&#39;s paging. With the standard 4k page size, 64bits will have four (PML4, PDP, PD, PTE), 32bits will have two (PD, PTE) and PAE will have three (PDP, PD, PTE). If paging is disabled, there obviously won&#39;t be any levels to walk.&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-size: large;&quot;&gt;&lt;u&gt;Software support initialization&lt;/u&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Unlike hardware support, most of the work for setting up software MMU and shadow page is done by &lt;b style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;kvm_init_shadow_mmu()&lt;/b&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;,&lt;/span&gt;&lt;b style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;&amp;nbsp;&lt;/b&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;while&lt;/span&gt;&lt;b style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt; init_kvm_softmmu()&lt;/b&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt; simply calls it and later sets control register 3, page directory pointer and how the VMM will emulate (inject) and propagate the page faults.&lt;/span&gt;&lt;/div&gt;
&lt;b&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;pre style=&quot;background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi64Lr_VvHGv8ertaDsasLVFpAl2QN8C251CJthu4PBll3Z8HPLLm3o2sPau-FaPEnvN3SrkmpSi-rmzXBtXboy8PvTcz3SqZ3NAFfNc_hM2KWi5dPz29TnaBLYzNNa3GPDl8DgnFhzYilW/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;&quot;&gt;&lt;code style=&quot;color: black; word-wrap: normal;&quot;&gt; static int init_kvm_softmmu(struct kvm_vcpu *vcpu)  
 {  
      int r = kvm_init_shadow_mmu(vcpu, vcpu-&amp;gt;arch.walk_mmu);  

      vcpu-&amp;gt;arch.walk_mmu-&amp;gt;set_cr3           = kvm_x86_ops-&amp;gt;set_cr3;  
      vcpu-&amp;gt;arch.walk_mmu-&amp;gt;get_cr3           = get_cr3;  
      vcpu-&amp;gt;arch.walk_mmu-&amp;gt;get_pdptr         = kvm_pdptr_read;  
      vcpu-&amp;gt;arch.walk_mmu-&amp;gt;inject_page_fault = kvm_inject_page_fault;  
      return r;  
 }  
&lt;/code&gt;&lt;/pre&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The &lt;b&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;kvm_init_shadow_mmu()&lt;/span&gt;&lt;/b&gt; function is quite similar to what was discussed above, based on the paging modes, it sets how the walker will work&amp;nbsp;&lt;b&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;paging32_init_context_common() &lt;/span&gt;&lt;/b&gt;and&amp;nbsp;&lt;b&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;paging64_init_context_common()&lt;/span&gt;&lt;/b&gt;, for 64bit and PAE systems.&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/1595811657033117733/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2012/03/kvm-virtual-x86-mmu-setup.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/1595811657033117733'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/1595811657033117733'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2012/03/kvm-virtual-x86-mmu-setup.html' title='kvm: virtual x86 mmu setup'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-5696484881751133998</id><published>2012-03-03T16:11:00.000-08:00</published><updated>2012-03-09T05:20:51.107-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="ept"/><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="kvm"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="memory management"/><category scheme="http://www.blogger.com/atom/ns#" term="paging"/><category scheme="http://www.blogger.com/atom/ns#" term="shadow pages"/><category scheme="http://www.blogger.com/atom/ns#" term="virtualization"/><category scheme="http://www.blogger.com/atom/ns#" term="x86"/><title type='text'>kvm: hardware assisted paging</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
CPU vendors began adding hardware virtual memory management unit (vMMU) support circa 2009, with Intel&#39;s VT-x (vmx flag) addition. Historically, the guest&#39;s physical (gpa) to host physical &amp;nbsp;(hpa) addresses were translated through software, using shadow page tables. These tables are kept synchronized with the guest&#39;s page tables, and are one of the main sources of overhead in virtual machines, as they incur expensive VM exits. A common way of keeping the shadow pages up to date is to write-protect the guest&#39;s pages, so that when they are changed, page faults are triggered and intercepted by the VMM, which emulates it (injecting the page) and updates the shadow ones accordingly. This, of course, is transparent to the guest. Another major problem, is that TLB semantics require flushes upon context switching, as newly assigned processes need to have it empty to cache entries&amp;nbsp;only belonging&amp;nbsp;to the process&#39;s address space. To overcome this, CPUs now incorporate tags into the TLB - also known as &lt;i&gt;vpid&lt;/i&gt;, which allow mappings that associate addresses to processes and thus reduce the amount of flushes.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
With hardware vMMUs, i&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: inherit;&quot;&gt;n order to avoid the VMM overhead with shadow paging, the guest is left alone to update its
page tables, while the hardware maintains its own&amp;nbsp;page tables which maps gpa to hpa. Intel calls these Extended Page Tables (EPT).&amp;nbsp;&lt;/span&gt;Having
two page tables now requires that when a guest translates an address, two levels must be walked (sometimes
referred to as 2D page walks). So hardware support can come at a greater cost for &lt;b&gt;programs with bad locality&lt;/b&gt; and cache unfriendly, than its software equivalent. When a TLB miss occurs, and the guest does a page walk, for each hierarchical level, the entire EPT must be walked as well, to obtain the hpa. For 64bit guests, this is worse than 32bit ones, &amp;nbsp;as the 64bit address space requires more levels (PML4, PDP, PD, PTE) of translation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
KVM&#39;s implementation of EPT is quite unique and uses both the guest&#39;s tables and the hardware&#39;s to translate addresses.&amp;nbsp;When a guest needs to translate virtual addresses to physical ones, the &lt;b&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;gva_to_gpa()&lt;/span&gt;&lt;/b&gt; function is called:&lt;br /&gt;
&lt;br /&gt;
&lt;pre style=&quot;background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi64Lr_VvHGv8ertaDsasLVFpAl2QN8C251CJthu4PBll3Z8HPLLm3o2sPau-FaPEnvN3SrkmpSi-rmzXBtXboy8PvTcz3SqZ3NAFfNc_hM2KWi5dPz29TnaBLYzNNa3GPDl8DgnFhzYilW/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;&quot;&gt;&lt;code style=&quot;color: black; word-wrap: normal;&quot;&gt; static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,  
                                struct x86_exception *exception)  
 {  
      struct guest_walker walker;  
      gpa_t gpa = UNMAPPED_GVA;  
      int r;  
      r = FNAME(walk_addr)(&amp;amp;walker, vcpu, vaddr, access);  
      if (r) {  
           gpa = gfn_to_gpa(walker.gfn);  
           gpa |= vaddr &amp;amp; ~PAGE_MASK;  
      } else if (exception)  
           *exception = walker.fault;  
      return gpa;  
 }  
&lt;/code&gt;&lt;/pre&gt;
&lt;br /&gt;
If the guest&#39;s walk fails and the gva-gpa mapping is not present, a page fault is raised, and &lt;b&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;tdp_page_fault()&lt;/span&gt;&lt;/b&gt; - two dimensional paging - is invoked through an EPT violation - &lt;b&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;handle_ept_violation()&lt;/span&gt;&lt;/b&gt; to translate gpa to hpa. A new page table entry is created and the shadow page code is reused through &lt;b&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;mmu_set_spte()&lt;/span&gt;&lt;/b&gt; and added to the beginning of the page list through &lt;b&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;pte_list_add()&lt;/span&gt;&lt;/b&gt;. This way, the next time the guest virtual address is accessed, it will already be in the guest&#39;s pages and&lt;b&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt; walk_addr()&lt;/span&gt;&lt;/b&gt; will be done successfully, and the gpa can be returned without further ado.&amp;nbsp;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/5696484881751133998/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2012/03/kvm-hardware-assisted-paging.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/5696484881751133998'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/5696484881751133998'/><link rel='alternate' type='text/html' 
href='http://blog.stgolabs.net/2012/03/kvm-hardware-assisted-paging.html' title='kvm: hardware assisted paging'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-2702944708370417142</id><published>2012-01-30T16:09:00.000-08:00</published><updated>2012-01-30T16:27:25.954-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="development"/><category scheme="http://www.blogger.com/atom/ns#" term="linux inode filename filesystem symlinks"/><title type='text'>inode to filename</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
We normally have a file&#39;s canonical/absolute path, and with that we can get just about any details from it, usually through &lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;stat(2)-family&lt;/span&gt;. What about when we have the inode number? I had to come up with this little ugly function to parse (luckily&amp;nbsp;I also had the PID) procfs and go comparing all the files... we can do better!&lt;br /&gt;
&lt;br /&gt;
I&#39;m hoping someone can tell me a more straightforward way of doing this - specially considering that something very similar will be in upcoming linux distros.&lt;br /&gt;
&lt;br /&gt;
&lt;a href=&quot;http://pastebin.com/5TxB4TMa&quot;&gt;http://pastebin.com/5TxB4TMa&lt;/a&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/2702944708370417142/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2012/01/inode-to-filename.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/2702944708370417142'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/2702944708370417142'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2012/01/inode-to-filename.html' title='inode to filename'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-4818443475957085176</id><published>2012-01-28T14:54:00.000-08:00</published><updated>2012-02-04T17:02:54.444-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="books"/><category scheme="http://www.blogger.com/atom/ns#" term="computer science"/><category scheme="http://www.blogger.com/atom/ns#" term="development"/><category scheme="http://www.blogger.com/atom/ns#" term="systems"/><category scheme="http://www.blogger.com/atom/ns#" term="unix"/><title type='text'>an (incomplete) list of indispensable systems books</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
If you&#39;re interested in a career in computer systems, here is an unsorted list of books you should get your hands on. Some are UNIX related, but hey, that&#39;s my area of knowledge and, in one way or another, they have all helped me grow as a computer scientist.&lt;br /&gt;
&lt;div&gt;
&lt;ul style=&quot;text-align: left;&quot;&gt;
&lt;li&gt;Kernighan, Brian and Ritchie, Dennis. &lt;a href=&quot;http://www.amazon.com/Programming-Language-2nd-Brian-Kernighan/dp/0131103628/ref=sr_1_1?ie=UTF8&amp;amp;qid=1327789528&amp;amp;sr=8-1&quot;&gt;The C Programming Language (2nd Ed.)&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Hennessy, John L. and Patterson, David A. &lt;a href=&quot;http://www.amazon.com/Computer-Architecture-Fifth-Quantitative-Approach/dp/012383872X/ref=sr_1_1?s=books&amp;amp;ie=UTF8&amp;amp;qid=1327789624&amp;amp;sr=1-1&quot;&gt;Computer Architecture: A Quantitative Approach&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Bach, Maurice J.&amp;nbsp;&lt;a href=&quot;http://www.amazon.com/Design-Operating-System-Prentice-Hall-Software/dp/0132017997/ref=sr_1_1?s=books&amp;amp;ie=UTF8&amp;amp;qid=1327789785&amp;amp;sr=1-1&quot;&gt;The Design of the UNIX Operating System&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Silberschatz, Abraham. and Galvin, Peter. and Gagne, Greg.&amp;nbsp;&lt;a href=&quot;http://www.amazon.com/Operating-System-Concepts-Abraham-Silberschatz/dp/0470128720/ref=sr_1_1?s=books&amp;amp;ie=UTF8&amp;amp;qid=1327789860&amp;amp;sr=1-1&quot;&gt;Operating Systems Concepts&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Stevens, Richard W. and Rago, Stephen A.&amp;nbsp;&lt;a href=&quot;http://www.amazon.com/Programming-Environment-Addison-Wesley-Professional-Computing/dp/0321525949/ref=sr_1_2?s=books&amp;amp;ie=UTF8&amp;amp;qid=1327790121&amp;amp;sr=1-2&quot;&gt;Advanced Programming in the UNIX Environment&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Raymond, Eric S.&amp;nbsp;&lt;a href=&quot;http://www.amazon.com/Art-UNIX-Programming-Eric-Raymond/dp/0131429019/ref=sr_1_8?s=books&amp;amp;ie=UTF8&amp;amp;qid=1327789785&amp;amp;sr=1-8&quot;&gt;The Art of UNIX Programming&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Duntemann, Jeff.&amp;nbsp;&lt;a href=&quot;http://www.amazon.com/Assembly-Language-Step---step-Programming/dp/0471375233/ref=sr_1_11?s=books&amp;amp;ie=UTF8&amp;amp;qid=1327790222&amp;amp;sr=1-11&quot;&gt;Assembly Language Step by Step (2nd Ed.)&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Kernighan, Brian and Pike, Rob.&amp;nbsp;&lt;a href=&quot;http://www.amazon.com/Practice-Programming-Brian-W-Kernighan/dp/020161586X/ref=sr_1_3?ie=UTF8&amp;amp;qid=1327790352&amp;amp;sr=8-3&quot;&gt;The Practice of Programming&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Love, Robert. &lt;a href=&quot;http://www.amazon.com/Linux-Kernel-Development-Robert-Love/dp/0672329468/ref=sr_1_sc_1?s=books&amp;amp;ie=UTF8&amp;amp;qid=1327790436&amp;amp;sr=1-1-spell&quot;&gt;Linux Kernel Development (3rd Ed.)&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/4818443475957085176/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2012/01/incomplete-list-of-indispensable.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/4818443475957085176'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/4818443475957085176'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2012/01/incomplete-list-of-indispensable.html' title='an (incomplete) list of indispensable systems books'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5789291509148224079.post-2374035049709283101</id><published>2011-12-18T10:36:00.000-08:00</published><updated>2012-09-28T07:36:48.518-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="assembler"/><category scheme="http://www.blogger.com/atom/ns#" term="attributes"/><category scheme="http://www.blogger.com/atom/ns#" term="caches"/><category scheme="http://www.blogger.com/atom/ns#" term="cpu"/><category scheme="http://www.blogger.com/atom/ns#" term="cpuid"/><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="processor"/><category scheme="http://www.blogger.com/atom/ns#" term="x86"/><title type='text'>linux and processor attributes</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
I was having some trouble finding my CPU&#39;s TLB page size and data entries a few days ago, and it&#39;s no mystery that Intel provides very poor specs in this specific area. I couldn&#39;t see it exported from Linux either (although it *does* list it in /proc/cpuinfo, depending on the L1/L2 cache sizes and attributes, but that&#39;s another story).&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
To overcome this I was forced to write my own little program that uses the &lt;b&gt;cpuid&lt;/b&gt; instruction (x86-family specific, introduced in the early 90s) to obtain processor attributes like vendor/model, cache sizes, flags, etc. Since this is the way the kernel actually gets the information, it might be useful to share some basic information of this feature... again, this is x86 only.&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
This instruction reads the EAX register to know what information the caller is asking for, and with 0 it will return all the available attributes; thus a smart implementation will use this first, then decide if the information we want is available. The outputs of the instruction are loaded into &lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;EAX&lt;/span&gt; (yes, this is input and output), &lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;EBX&lt;/span&gt;, &lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;ECX&lt;/span&gt; and &lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;EDX&lt;/span&gt;, and to use it all we need to know are the register bit offsets, well documented &lt;a href=&quot;http://www.sandpile.org/x86/cpuid.htm&quot;&gt;here&lt;/a&gt;.&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Calling cpuid in C is quite trivial, just specify the operation level, load it into &lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;EAX&lt;/span&gt; and return the out registers to return the value(s) through reference:&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;pre style=&quot;background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi64Lr_VvHGv8ertaDsasLVFpAl2QN8C251CJthu4PBll3Z8HPLLm3o2sPau-FaPEnvN3SrkmpSi-rmzXBtXboy8PvTcz3SqZ3NAFfNc_hM2KWi5dPz29TnaBLYzNNa3GPDl8DgnFhzYilW/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;&quot;&gt;&lt;code style=&quot;color: black; word-wrap: normal;&quot;&gt; void cpuid(unsigned int op, unsigned int *eax, unsigned int *ebx,  
       unsigned int *ecx, unsigned int *edx)  
 {  
  __asm__(  
  &quot;cpuid;&quot;  
  : &quot;=b&quot; (*ebx), &quot;=a&quot; (*eax),&quot;=c&quot; (*ecx),&quot;=d&quot; (*edx)  
  : &quot;1&quot; (op), &quot;c&quot;(0));  
 }  
&lt;/code&gt;&lt;/pre&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Now, say we want to get the size of the L2 cache and some power management information, so by looking at the &lt;a href=&quot;http://www.sandpile.org/x86/cpuid.htm&quot;&gt;reference&lt;/a&gt;&amp;nbsp;I know to load&amp;nbsp;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;0x80000006&amp;nbsp;&lt;/span&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: inherit;&quot;&gt;and &lt;/span&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;ECX&lt;/span&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: inherit;&quot;&gt; will hold my data in bits 31-16 for the L2 size in Kb, and the CPU thermal monitoring in bit 4 of EDX when loading level&amp;nbsp;&lt;/span&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;0x80000007&lt;/span&gt;. So we have:&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;pre style=&quot;background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi64Lr_VvHGv8ertaDsasLVFpAl2QN8C251CJthu4PBll3Z8HPLLm3o2sPau-FaPEnvN3SrkmpSi-rmzXBtXboy8PvTcz3SqZ3NAFfNc_hM2KWi5dPz29TnaBLYzNNa3GPDl8DgnFhzYilW/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;&quot;&gt;&lt;code style=&quot;color: black; word-wrap: normal;&quot;&gt;      unsigned int eax, ebx, ecx, edx;  
      cpuid(0x80000006, &amp;amp;eax, &amp;amp;ebx, &amp;amp;ecx, &amp;amp;edx);  
      printf(&quot;my L2 cache is %dKb\n&quot;, ecx&amp;gt;&amp;gt;16);  
      cpuid(0x80000007, &amp;amp;eax, &amp;amp;ebx, &amp;amp;ecx, &amp;amp;edx);  
      printf(&quot;my EPM thermal monitor is %u\n&quot;, (edx &amp;gt;&amp;gt; 4) &amp;amp; 1);  
&lt;/code&gt;&lt;/pre&gt;
&lt;div&gt;
&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: inherit;&quot;&gt;The kernel does exactly this to determine the processor(s) information, of course with a little more precaution and&amp;nbsp;&lt;/span&gt;optimization&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: inherit;&quot;&gt;, but in the end what you see in &lt;/span&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;/proc/cpuinfo&lt;/span&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: inherit;&quot;&gt; (and therefore &lt;/span&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;lscpu&lt;/span&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: inherit;&quot;&gt;) is a result of &lt;/span&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;cpuid&lt;/span&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: inherit;&quot;&gt;.&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: inherit;&quot;&gt;&lt;br /&gt;
&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: inherit;&quot;&gt;For further details read &lt;/span&gt;&lt;b&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;cpu_detect_cache_size()&lt;/span&gt;&lt;/b&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: inherit;&quot;&gt; and &lt;/span&gt;&lt;b&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;cpu_get_model_name()&lt;/span&gt;&lt;/b&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: inherit;&quot;&gt; in &lt;/span&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;arch/x86/kernel/cpu/common.c&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;&lt;br /&gt;
&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;happy hacking!&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;&lt;br /&gt;
&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span class=&quot;Apple-style-span&quot; style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace;&quot;&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://blog.stgolabs.net/feeds/2374035049709283101/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.stgolabs.net/2011/12/linux-and-processor-attributes.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/2374035049709283101'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5789291509148224079/posts/default/2374035049709283101'/><link rel='alternate' type='text/html' href='http://blog.stgolabs.net/2011/12/linux-and-processor-attributes.html' title='linux and processor attributes'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>