Benchmarks

Lace includes a set of benchmark programs in the benchmarks/ directory. These are primarily intended for evaluating Lace’s work-stealing performance and comparing it against other frameworks.

Many of these benchmarks originate from well-known benchmark suites, such as the Cilk-5 benchmark suite. The implementations here are mostly adapted from Nowa (Schmaus et al, IPDPS 2021). UTS is from the Unbalanced Tree Search project (Olivier et al., LCPC 2006). The benchmarks cover a range of parallelism patterns:

fib: Recursive Fibonacci. Micro-benchmark for task creation overhead: tasks are near-empty, so performance is dominated by spawn/sync cost.
uts: Unbalanced Tree Search. Highly irregular, unpredictable workload; a standard stress-test for work-stealing schedulers.
nqueens: N-Queens solver. Spawns one task per candidate position at each row, producing a regular N-ary task tree. Moderate task granularity.
pi: Numerical integration for computing pi. Embarrassingly parallel with uniform task sizes.
integrate: Adaptive numerical integration. Recursion depth varies with the function being integrated.
cilksort: Recursive merge-sort (from Cilk-5, MIT/Frigo). Uniform binary task tree with a cutoff for switching to sequential sort.
quicksort: Parallel quicksort. Irregular partitioning due to pivot selection.
dfs: Parallel depth-first search. Regular task tree structure.
matmul: Recursive matrix multiplication. Splits along the largest dimension, with a 64-element base case.
rectmul: Rectangular matrix multiplication (MIT/Prokop, 1997). Variant of matmul for non-square matrices where n < m and dimensions are multiples of 16.
strassen: Strassen’s fast matrix multiplication (MIT, 1996). Seven-way recursive split with coarser granularity than naïve matmul.
lu: LU decomposition (MIT/Blumofe, 1996). Block-recursive Gaussian elimination with data dependencies between subproblems.
cholesky: Sparse Cholesky factorization (MIT/Frigo, 2000). Uses small blocks at the leaves of a quad-tree decomposition.
heat: Heat diffusion using Jacobi-type iteration (MIT/Strumpen, 1996). Cache-oblivious stencil computation with spatial and temporal tiling.
fft: Fast Fourier Transform (MIT/Frigo, 2000). Cooley-Tukey divide-and-conquer with a binary recursive task tree.
knapsack: 0/1 knapsack via branch-and-bound (MIT/Frigo, 2000). Irregular pruning makes the task tree unpredictable.

Running benchmarks

Build Lace as the top-level project with benchmarks enabled (the default):

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

Then run individual benchmarks from the build directory:

./build/benchmarks/fib-lace -w 0 42        # fibonacci(42) using all cores
./build/benchmarks/nqueens-lace -w 0 14    # 14 queens
./build/benchmarks/uts-lace -w 0 -t 0 -b 2000 -q 0.200014 -m 5 -r 7  # UTS T3L

Most benchmarks accept a -w N flag to set the number of worker threads. Run without arguments to see usage information. In addition, there is a bench.py program in the build/benchmarks directory that can be used to automatically run the benchmarks.

Profiling with statistics

To collect steal counts, task counts, split counts, and timing data, rebuild with the appropriate options:

cmake -B build -DLACE_COUNT_SPLITS=ON -DLACE_COUNT_STEALS=ON -DLACE_COUNT_TASKS=ON -DLACE_PIE_TIMES=ON
cmake --build build

Then call lace_count_report at the end of your program. Note that enabling statistics adds instrumentation overhead, so the reported wall-clock times will not be representative of production performance.