Design Notes
This page documents the internal design of Lace for contributors and advanced users who need to understand the concurrency model. It is not required reading for normal use.
Architecture overview
Each Lace worker thread owns a double-ended queue (deque) of tasks.
The owner pushes and pops from the head (private end) in a
wait-free manner. Thieves steal from the tail (public end) using
a lock-free compare-and-swap on a shared TailSplit word.
The deque is split into a private region (between split and head)
and a shared region (between tail and split). Only tasks in the
shared region are visible to thieves. The split point is adjusted
lazily via a movesplit flag: when a thief finds no stealable
tasks, it sets movesplit on the victim, and the victim moves
the split on its next SPAWN.
Each task slot contains a function pointer and an atomic thief
field. The thief field serves as a state machine:
THIEF_TASK(0x1) — task is live, not yet stolenTHIEF_EMPTY(0x0) — slot is empty / task cancelledTHIEF_COMPLETED(0x2) — task was stolen and has finishedany other pointer — task is being executed by that worker
Memory model
Lace is designed for relaxed-memory architectures (ARM, POWER) and uses explicit fences and atomic operations rather than relying on x86’s strong ordering. The key synchronisation points are:
SPAWN (publishing a task). After writing the task’s function
pointer, arguments, and thief field, a release fence ensures
these stores are visible before the split is advanced. Without this
fence, a thief on ARM could observe the updated split but read
stale task data.
owner writes: t->f, t->thief, t->args
owner executes: atomic_thread_fence(release)
owner writes: split (making the task stealable)
Steal (claiming a task). A thief reads the victim’s
TailSplit word with a relaxed load and attempts a CAS to
increment the tail. The CAS itself provides acquire-release
semantics on the TailSplit word. After a successful CAS, the
thief has exclusive ownership of that task slot. The thief reads
the task data — these reads are ordered after the CAS, so they
observe the stores made by the owner before the release fence in
SPAWN.
thief reads: victim->ts.v (relaxed)
thief executes: CAS(&victim->ts.v, old, new) — acquires on success
thief reads: t->f, t->args (ordered after CAS)
thief writes: t->thief = self (relaxed — informational only)
thief calls: t->f(...)
thief writes: t->thief = THIEF_COMPLETED (release)
The release store of THIEF_COMPLETED ensures that all
stores made during task execution are visible to the owner when
it observes the completion.
SYNC (fast path). When a task has not been stolen
(split <= head and movesplit == 0), the owner simply
executes it directly. No atomic operations are needed — the owner
is the only thread that accesses its private deque region.
SYNC (slow path / leapfrog). When a task has been stolen, the
owner enters leapfrog mode: it follows the thief pointer and
attempts to steal work from the thief’s deque, recursively.
The owner reads t->thief in a loop until it observes
THIEF_COMPLETED, at which point an acquire fence ensures
all stores made by the thief (including the task result) are
visible.
owner reads: t->thief (relaxed, in loop)
owner observes: t->thief == THIEF_COMPLETED
owner executes: atomic_thread_fence(acquire)
owner reads: t->d.res (ordered after fence)
Shrink shared region. When the owner detects that its shared
region is too large (during SYNC slow path), it moves the split
point closer to the tail. A seq_cst fence between the split
write and the tail read prevents the StoreLoad reordering that
could cause both owner and thief to believe they own the same
task slot:
owner writes: split = newsplit (relaxed)
owner executes: atomic_thread_fence(seq_cst)
owner reads: tail (relaxed)
TOGETHER / NEWFRAME handshake. The initiator writes a task
into the lace_newframe.t pointer with a release fence
before the CAS. Workers poll this pointer with a relaxed load
in lace_check_yield. When non-NULL, a worker executes an
acquire fence before reading the task data via memcpy.
This release-acquire pair ensures the task contents are fully
visible before any worker copies them.
xternal task submission
Non-Lace threads submit tasks via lace_run_task. The task is
placed into one of 64 atomic slots (a fixed-size array). Workers
check an atomic counter external_task_count periodically in the
steal loop; when non-zero, they scan the array and claim a task.
The scan uses a two-step approach: a relaxed atomic_load to
skip empty slots (read-only, no cache line invalidation), followed
by atomic_exchange only on non-NULL slots. This minimises
write traffic on the shared slot array when most slots are empty.
The submitting thread blocks on a per-task atomic_int done
field, using a futex wait. Before falling back to the futex, the
submitter briefly spins (256 PAUSE iterations) to avoid a kernel
transition when the task completes quickly (typical for fine-grained
BDD operations). The worker signals completion with an atomic store
and futex_wake. The ext_lace_task struct lives on the
submitting thread’s stack, so no heap allocation is needed and no
initialisation or destruction is required for the futex word.
When an external task is published, the submitter wakes one sleeping
worker via futex_wake. This ensures that even when all workers
are in deep futex sleep, external tasks are picked up promptly.
Multiple external threads can submit concurrently without contention on a single slot. The atomic counter keeps the fast path (no external tasks) to a single relaxed load.
Worker idle strategy
When a worker fails to find work, it progresses through idle stages of increasing aggressiveness to balance responsiveness against CPU and cache-bus overhead.
Stage 1: Yield (0–256 failed attempts). The worker calls
sched_yield() (or SwitchToThread() on Windows) on each
iteration. This gives the OS scheduler an opportunity to run other
threads while remaining responsive to new work. No PAUSE stage
precedes this: on modern out-of-order CPUs, PAUSE instructions
provide insufficient throttling and the core continues to drive
cache-coherence traffic at near-full rate.
Stage 2: Futex sleep with ramping timeout (256+ failed attempts).
The worker increments an n_sleeping counter, then enters
futex_wait on a shared wake_futex word. The timeout starts
at 100 µs and increases by 100 µs per futex cycle, capping at
1 ms. This ramp avoids two failure modes:
Too-short timeouts cause excessive futex syscall overhead (each
futex_wait/return is ~3–5 µs of kernel entry/exit).Too-long timeouts cause barrier-synchronized workloads (cholesky, LU) to miss the start of a new parallel phase.
The 1 ms cap was determined empirically: 10 ms caused measurable regressions in LU decomposition; 100 ms caused severe regressions in multiple structured benchmarks.
Wake signals. Workers can be woken from futex sleep by:
lace_run_task: always wakes one worker when an external task is submitted. This is the primary mechanism for external task responsiveness.Successful steal with remaining work (
LACE_STOLEN, notLACE_STOLEN_LAST): the thief knows the victim still has stealable tasks, so it wakes one sleeping worker to help. This creates a cascade: workers wake proportionally to available parallelism. TheLACE_STOLEN_LASTdistinction prevents spurious wakes when the stolen task was the last one.lace_stop: wakes all sleeping workers so they observe the quit flag and exit promptly.
The wake_futex word is incremented (fetch_add) before each
futex_wake call. This ensures the futex value has changed,
preventing the lost-wake race where a worker enters futex_wait
between the value change and the wake signal.
Steal result signalling. lace_steal returns one of:
LACE_STOLEN— task stolen, victim still has more work (tail < splitafter CAS). Triggers a wake signal.LACE_STOLEN_LAST— task stolen, but it was the last stealable task. No wake signal.LACE_BUSY— CAS failed, another thief was concurrent.LACE_NOWORK— victim has no stealable tasks.
Futex portability
Lace uses address-based wait/wake primitives (commonly called
“futex”) for the idle system and external task completion. These
operate on plain atomic_int values with no initialisation or
destruction required. The kernel maintains wait queues internally,
keyed by virtual address.
Platform implementations:
Linux:
futex(FUTEX_WAIT_PRIVATE)/futex(FUTEX_WAKE_PRIVATE)viasyscall(SYS_futex, ...). The_PRIVATEvariants avoid the overhead of shared-memory mapping lookups since all threads are in the same process.FreeBSD:
_umtx_op(UMTX_OP_WAIT_UINT_PRIVATE)/_umtx_op(UMTX_OP_WAKE_PRIVATE). Timeout is passed via astruct _umtx_timefor relative timeouts.macOS 14.4+:
os_sync_wait_on_address_with_timeout/os_sync_wake_by_address_any/os_sync_wake_by_address_all. This is the public, documented Apple API introduced in macOS 14.4.macOS < 14.4:
__ulock_wait/__ulock_wake. This is a private Darwin API used by the Rust standard library, Go runtime, and LLVM libc++. It has been stable since macOS 10.12.Windows:
WaitOnAddress/WakeByAddressSingle/WakeByAddressAllfromsynchapi.h(Windows 8+). Requires linking withSynchronization.lib(-lsynchronizationfor MinGW).
Barrier
Lace uses a custom sense-reversing barrier. A shared wait flag is
flipped by the last thread to arrive (the leader). Other threads spin
on this flag with acquire loads. A leaving counter tracks how
many threads have exited the barrier, used by lace_barrier_destroy
to wait for all threads to leave before tearing down the barrier.
The barrier provides full memory ordering: all memory operations by
any thread before the barrier are visible to all threads after the
barrier. This includes StoreLoad ordering — a store before the
barrier and a load after the barrier within the same thread cannot be
reordered. The guarantee follows from the acq_rel read-modify-write
on count: unlike a separate release-store / acquire-load pair,
an acq_rel RMW is indivisible, so it prevents both prior stores
from floating past it (release) and later loads from floating before
it (acquire). Cross-thread visibility follows from the transitive
chain: each fetch_add on count acquires all stores released
by prior fetch_add operations (via the modification order of
count), so the leader acquires all pre-barrier stores from all
threads. The leader then releases through the wait flip, which
the spinning threads acquire.
Stack and deque allocation
Worker thread stacks and task deques are allocated as virtual memory
mappings (mmap on Unix, VirtualAlloc on Windows). Pages are
committed lazily by the kernel on first access. This means:
A 64 MB deque (1M task slots × 64 bytes) consumes only as much physical memory as the actual recursion depth requires.
Worker stacks similarly consume physical memory proportional to actual call depth, not the reserved size.
On NUMA systems, first-touch policy places pages on the node where
the accessing thread runs. Since each worker is pinned before
executing any tasks, its stack and deque pages are allocated locally.
When hwloc is available, stacks are allocated with
hwloc_alloc_membind for explicit NUMA binding.
Each stack has a guard page (PROT_NONE / PAGE_NOACCESS) at
the low end. The guard page is excluded from the region passed to
pthread_attr_setstack, so pthreads does not attempt to use it.
Scratch arena
Each worker owns a private bump-allocated arena for task-local temporary storage (see Defining Tasks for the user-facing API). The arena exists because the three obvious alternatives all have problems for the fork-join workloads Lace targets:
alloca/ C99 VLAs run on the thread stack and overflow without warning past the guard page; MSVC does not support VLAs at all.malloc/freeintroduces allocator contention on workloads with many fine-grained tasks, dominating the cost of the task itself.A shared lock-free allocator would still bounce cache lines between workers on every allocation.
A per-worker arena avoids all three: the allocations are stack-like bump operations on private state, no cache lines are shared, and the backing memory is bounded by the worker’s recursion depth rather than by a fixed stack size.
Virtual memory layout
Each worker reserves a large virtual address range at startup (1 GiB
on 64-bit, 16 MiB on 32-bit, configurable). The reservation is purely
address space; on 64-bit systems address space is essentially free and
typical totals fit comfortably below any practical limit. On 32-bit
systems user-mode address space is limited (typically 2–3 GiB) so the
default is sized to fit a moderate number of workers without exhausting
the address space; an application that needs more should call
lace_set_scratch_size before lace_start.
The arena exposes four pointers per worker:
scratch_base— base of the reservationscratch_top— current allocation pointer (hot, written on everylace_scratch_alloc)scratch_committed_end— one past the highest currently-committed byte (hot, read on every allocation as a bound check)scratch_reserve_end— one past the end of the reservation (cold, only consulted on growth)
The two hot fields live adjacent to the deque pointers in the worker
struct, in the cache line that every SPAWN/SYNC already touches.
lace_scratch_alloc is then a load + add + compare + conditional
out-of-line branch + store — about four instructions on the fast path,
with branch prediction trivially correct (the slow path is taken at
most once per band’s worth of allocations).
Reserve / commit / decommit
Allocations bump scratch_top. When the bump crosses
scratch_committed_end, lace_scratch_grow commits additional
pages by calling mprotect(PROT_READ|PROT_WRITE) on POSIX or
VirtualAlloc(MEM_COMMIT, PAGE_READWRITE) on Windows. To avoid
syscalling on every small push/pop oscillation near the top, the grow
path commits a band of pages above the immediate request (1 MiB by
default, configurable). The fast path stays in the committed band
without further system calls until the band is exhausted.
The reserve uses PROT_NONE / PAGE_NOACCESS rather than
PROT_READ|PROT_WRITE with MAP_NORESERVE. This is more work
on the commit path but provides guard semantics: a runaway recursion
that overshoots the band hits a segfault rather than silently
faulting in zero pages. Reservation exhaustion is caught explicitly
in lace_scratch_grow with a diagnostic message before aborting.
The decommit primitive uses madvise(MADV_DONTNEED) followed by
mprotect(PROT_NONE) on POSIX, and VirtualFree(MEM_DECOMMIT)
on Windows. The MADV_DONTNEED returns the physical pages to the
OS immediately; the PROT_NONE ensures that any stray access past
the new top fails fast.
NUMA placement
The reservation is created in lace_init_worker on the worker’s own
thread, after CPU pinning. The reservation itself touches no pages;
all physical allocations happen later, on first access from the same
worker thread, so first-touch NUMA policy places the pages on the
worker’s local node. No explicit NUMA binding is needed.
Trim on deep idle
When a worker exhausts its steal attempts and transitions into the
futex-wait stage of the idle progression (see Worker idle strategy),
Lace decommits arena pages above scratch_top + band. The trim
fires exactly once per backoff sequence: idle_count resets on any
successful steal, so the next dry spell will trim again from a new
high-water mark. The trim is invoked from the same cold code path
that is already about to make a futex_wait syscall, so its cost
is invisible.
By construction the worker is between tasks when it idles, so
scratch_top equals scratch_base and the trim leaves only the
band committed — enough to absorb the first wave of allocations after
wake without faulting in pages.
Leak detection
The arena’s correctness depends on every lace_scratch_mark being
paired with a matching lace_scratch_reset on every return path.
A missing reset would not corrupt anything — the next outer reset
will eventually restore — but it can cause the arena to grow without
bound until the reservation is exhausted.
The same deep-idle transition that triggers the trim also performs a
leak check: if scratch_top != scratch_base at the point when all
task frames have unwound, Lace prints a one-line warning identifying
the worker and the leaked byte count, then resets scratch_top to
scratch_base. The warning is emitted once per worker per program
lifetime to avoid flooding logs.
This combination — warn loudly, recover quietly — keeps a leak from killing a long-running program while still making the bug visible during development.