Conversation
Benchmark: weakavl branch vs RedBlackTree.
CPU: Qualcomm aarch64, 12 cores × 1 thread, WSL2.
Each variant built from the same source with
WeakAVL is 2–4× faster than RedBlack in steady-state runs across all tests. The large-allocation path exercises the internal free-range tree heavily; WeakAVL's bounded rotations per deletion directly reduce rebalancing cost.

I am concerned that the original redblacktree's performance number looks too high.
Flamegraphs: WeakAVL vs RedBlackTree (RelWithDebInfo, perf-large_alloc). Both collected on the same machine (Qualcomm aarch64, 12-core, WSL2).
WeakAVL (DefaultRBTree on weakavl branch) — 123 perf samples
RedBlackTree — 148 perf samples (+20%)
Full gist (both files): https://gist.github.com/SchrodingerZhu/98c5ac28a2c61c9be9d4d4e241981079

For some reason, the RedBlack tree is poorly branch-predicted.
So in the red-black tree, I wrote an invariant that we can check in the unit tests. Do you think a similar thing would make sense for the WAVL? I find invariants (even dynamically checked) help a lot with code understanding and correctness.
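For reference, a dynamically checked WAVL invariant could look something like the sketch below (toy Python over an assumed node layout with an explicit rank field, not snmalloc's actual representation): the WAVL rank rule says every parent-child rank difference is 1 or 2, and every leaf has rank 0.

```python
# Hypothetical sketch of a dynamically checked WAVL invariant; the Node layout
# and explicit rank field are assumptions, not snmalloc's representation.

class Node:
    def __init__(self, rank, left=None, right=None):
        self.rank = rank
        self.left = left
        self.right = right

def rank(n):
    # Missing children have rank -1 by convention.
    return -1 if n is None else n.rank

def check_wavl(n):
    """True iff the subtree at n satisfies the WAVL rank rule:
    every rank difference is 1 or 2, and every leaf has rank 0."""
    if n is None:
        return True
    if n.rank - rank(n.left) not in (1, 2):
        return False
    if n.rank - rank(n.right) not in (1, 2):
        return False
    if n.left is None and n.right is None and n.rank != 0:
        return False
    return check_wavl(n.left) and check_wavl(n.right)

assert check_wavl(Node(1, Node(0), Node(0)))       # valid three-node tree
assert not check_wavl(Node(1))                     # leaf with non-zero rank
assert not check_wavl(Node(3, Node(0), Node(0)))   # rank difference of 3
```

Running such a checker after every insert/remove in the unit tests would catch a broken rebalancing step immediately, the same way the red-black invariant check does.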
Also, this is super awesome. I just need to understand WAVL trees now.
Benchmark Comparison

Largest RSS increases (current vs main):

| Test | Main Avg RSS | Current Avg RSS | Delta | Delta (%) |
|---|---|---|---|---|
| sh8benchN | 233125.20 | 248553.60 | +15428.40 | +6.618% |
| glibc-simple | 5283.20 | 5444.00 | +160.80 | +3.044% |
| xmalloc-testN | 132026.40 | 133543.20 | +1516.80 | +1.149% |
| cache-scratch1 | 6576.00 | 6606.00 | +30.00 | +0.456% |
| mstressN | 874179.20 | 876738.00 | +2558.80 | +0.293% |
Largest RSS decreases (current vs main):

| Test | Main Avg RSS | Current Avg RSS | Delta | Delta (%) |
|---|---|---|---|---|
| larsonN-sized | 365257.20 | 353980.40 | -11276.80 | -3.087% |
| leanN | 259703.20 | 255440.40 | -4262.80 | -1.641% |
| rptestN | 233053.60 | 229976.40 | -3077.20 | -1.320% |
| lua | 69876.80 | 69570.40 | -306.40 | -0.438% |
| redis | 9330.00 | 9294.00 | -36.00 | -0.386% |
Largest Improvements (time)

| Test | Main Avg (s) | Current Avg (s) | Delta (s) | Delta (%) |
|---|---|---|---|---|
| xmalloc-testN | 0.2958 | 0.2813 | -0.0145 | -4.902% |
| cfrac | 3.4160 | 3.2640 | -0.1520 | -4.450% |
| espresso | 3.5200 | 3.3750 | -0.1450 | -4.119% |
| barnes | 2.1250 | 2.0390 | -0.0860 | -4.047% |
| rptestN | 0.9350 | 0.9046 | -0.0304 | -3.251% |
Largest Regressions (time)

| Test | Main Avg (s) | Current Avg (s) | Delta (s) | Delta (%) |
|---|---|---|---|---|
| sh6benchN | 0.1000 | 0.1030 | +0.0030 | +3.000% |
| sh8benchN | 0.2580 | 0.2610 | +0.0030 | +1.163% |
| cache-scratch1 | 1.1020 | 1.1080 | +0.0060 | +0.544% |
| lua | 1.1770 | 1.1800 | +0.0030 | +0.255% |
| alloc-test1 | 2.9060 | 2.9080 | +0.0020 | +0.069% |
Full Per-Test Averages

| Test | Main Avg Time (s) | Current Avg Time (s) | Time Delta (%) |
|---|---|---|---|
| alloc-test1 | 2.9060 | 2.9080 | +0.069% |
| alloc-testN | 6.2560 | 6.2000 | -0.895% |
| barnes | 2.1250 | 2.0390 | -4.047% |
| cache-scratch1 | 1.1020 | 1.1080 | +0.544% |
| cache-scratchN | 0.0790 | 0.0780 | -1.266% |
| cfrac | 3.4160 | 3.2640 | -4.450% |
| espresso | 3.5200 | 3.3750 | -4.119% |
| glibc-simple | 1.5110 | 1.5100 | -0.066% |
| glibc-thread | 0.3964 | 0.3932 | -0.807% |
| gs | 0.2380 | 0.2310 | -2.941% |
| larsonN-sized | 2.5606 | 2.5331 | -1.074% |
| leanN | 9.2100 | 8.9900 | -2.389% |
| lua | 1.1770 | 1.1800 | +0.255% |
| mstressN | 1.8170 | 1.7760 | -2.256% |
| redis | 2.2373 | 2.2334 | -0.174% |
| rocksdb | 6.5750 | 6.4660 | -1.658% |
| rptestN | 0.9350 | 0.9046 | -3.251% |
| sh6benchN | 0.1000 | 0.1030 | +3.000% |
| sh8benchN | 0.2580 | 0.2610 | +1.163% |
| xmalloc-testN | 0.2958 | 0.2813 | -4.902% |
So one difference in the implementation is that for the Red-Black tree, I kept the path around to make walking up cheaper, but this had some non-trivial initialisation costs. The change removes the initialisation cost, which is not required. I am tempted to lift the benchmark and this fix into its own PR, which I think would be good to merge, and then it gives a better point for comparison with the weak AVL tree.
This PR takes the test and the init fix: #818. If you could rerun the benchmarks with this fix, it would be great to know how it affects performance with respect to the WAVL trees. Also, thanks for bringing this up here. I enjoyed reading the WAVL paper. Super nice theory.
It is good to see that #818 actually fixes the performance issue of the red-black tree. I am away from my apartment, so I tested on my office machine (7773X). With it, WeakAVL only shows a marginal improvement at 100k iterations. At 800k, both trees stabilize and show almost identical performance.
Yeah, they produce compelling things quickly, but really do need checking.
Given the updated improvements, I think this is still a good thing to have. Do you want me to add a feature flag or just keep the current setup?
Pull request overview
This PR introduces an experimental Weak AVL (WAVL) tree implementation as an alternative to the existing Red-Black tree for performance testing. Based on the referenced benchmark commit from Microsoft's snmalloc project, the PR aims to evaluate performance differences between these two balanced binary search tree variants, particularly after observing unexpected performance gaps on battery-powered hardware.
Changes:
- Added new WeakAVL tree implementation with rank-based balancing using 1-bit parity encoding
- Refactored tree API from Red-Black specific naming (is_red/set_red) to generic naming (tree_tag/set_tree_tag) to support multiple tree types
- Introduced DefaultRBTree type alias that currently points to WeakAVLTree, allowing easy switching between implementations
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/snmalloc/ds_core/weakavltree.h | New WAVL tree implementation with parity-based rank tracking |
| src/snmalloc/ds_core/rankbalancetree.h | New header consolidating tree concepts and defining DefaultRBTree alias |
| src/snmalloc/ds_core/redblacktree.h | Renamed class to RedBlackTree; updated API to use tree_tag; refactored ChildRef and RBStep constructors |
| src/snmalloc/ds_core/ds_core.h | Updated include from redblacktree.h to rankbalancetree.h |
| src/snmalloc/mem/metadata.h | Added default constructor to BackendStateWordRef for compatibility |
| src/snmalloc/backend_helpers/largebuddyrange.h | Renamed RED_BIT to TREE_TAG_BIT; updated API calls |
| src/snmalloc/backend_helpers/smallbuddyrange.h | Updated API to use tree_tag functions |
| src/snmalloc/backend_helpers/buddy.h | Changed RBTree to DefaultRBTree |
| src/test/func/weakavl/weakavl.cc | New functional test for WeakAVL tree with randomized operations |
| src/test/func/redblack/redblack.cc | Updated to use RedBlackTree class name and new API |
| src/test/perf/large_alloc/large_alloc.cc | New performance benchmark for large allocations |
| docs/AddressSpace.md | Updated documentation to reference DefaultRBTree and TREE_TAG_BIT |
// Propogate double red up to rebalance.

The comment has a typo: "Propogate" should be "Propagate".

Suggested change:
- // Propogate double red up to rebalance.
+ // Propagate double red up to rebalance.
// Propogate double red up to rebalance.
// These notes were particularly clear for explaining insert
// https://www.cs.cmu.edu/~fp/courses/15122-f10/lectures/17-rbtrees.pdf
// https://www.cs.cmu.edu/~fp/courses/15122-f10/lectures/17-RedBlackTrees.pdf

The URL in the comment was changed from "17-rbtrees.pdf" to "17-RedBlackTrees.pdf". Please verify that this URL actually exists at the updated path, as broken documentation links reduce the value of code comments.

Suggested change:
- // https://www.cs.cmu.edu/~fp/courses/15122-f10/lectures/17-RedBlackTrees.pdf
+ // https://www.cs.cmu.edu/~fp/courses/15122-f10/lectures/17-rbtrees.pdf

This was wrongly introduced during my search-and-replace.
static H child_ref(K n, bool dir)
{
  // Rep uses true for left, false for right.
  return Rep::ref(!dir, n);
}

The comment states "Rep uses true for left, false for right", but the code inverts it with !dir. This means WeakAVLTree itself uses false for left and true for right, the opposite of the RedBlackTree implementation. This inconsistency could lead to confusion. Consider either: (1) documenting this difference prominently at the class level, or (2) maintaining the same convention as RedBlackTree for consistency across the codebase.
K parent(K n) const
{
  if (is_null(n))
    return Rep::null;

  K p = Rep::null;
  K cur = get_root();
  while (!is_null(cur) && !Rep::equal(cur, n))
  {
    p = cur;
    bool dir = Rep::compare(cur, n) ? Left : Right;
    cur = child(cur, dir);
  }
  return is_null(cur) ? Rep::null : p;
}

The parent() function performs a tree traversal from the root to find a node's parent, which is O(log n). This function is called extensively throughout the rebalancing operations (over 40 times in the implementation). Each rebalancing operation could therefore perform multiple O(log n) searches, making the overall cost of tree operations significantly worse than the theoretical O(log n) for WAVL trees. Consider either: (1) maintaining explicit parent pointers to achieve true O(log n) operations, or (2) documenting this performance characteristic clearly, or (3) refactoring to use a path-based approach similar to RedBlackTree to avoid repeated parent searches.

I am still auditing the path implementation.
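The path-based approach mentioned above can be sketched roughly like this (toy Python with an assumed key/left/right node layout, not the snmalloc code): record each (node, direction) step during the descent, so rebalancing can walk back up without re-searching from the root.

```python
# Toy sketch of path-based parent tracking; the Node layout is an assumption.

class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def descend_with_path(root, key):
    """Descend from root recording (node, direction) steps.
    Rebalancing can then pop this list to reach each parent in O(1),
    instead of an O(log n) re-search from the root per step."""
    path = []
    cur = root
    while cur is not None:
        direction = "left" if key < cur.key else "right"
        path.append((cur, direction))
        cur = getattr(cur, direction)
    return path

tree = Node(5, Node(3), Node(8))
path = descend_with_path(tree, 4)
assert [(n.key, d) for n, d in path] == [(5, "left"), (3, "right")]
```

This is essentially what the RedBlackTree implementation does with its explicit path; the trade-off discussed earlier in the thread is the initialisation cost of that path structure.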
I propose we add it with a feature flag, but default to the new one. Then if people have issues, they can compile with the old one. I don't see there being issues: the test harness for red-black trees is pretty comprehensive, and having a checked invariant really helps. I'll take a detailed pass over the code tomorrow.
Did the changes on mimalloc-bench persist after the "lies" from the AI?
I will also take a careful look at the changes.
Given that the AI-written code is quite fishy, I made a new one at #821 with manually audited code.
I suspect that the performance gain in this PR may still be related to the way the tree path is managed. I don't have time to investigate this further now. Feel free to play with the code in these two PRs or modify them for further experiments. The bench script is attached below:
#!/usr/bin/env python3
from __future__ import annotations
import argparse
import csv
import os
import re
import statistics
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable
POLICIES: list[tuple[str, str]] = [
("WeakAVLPolicy", "WeakAVL"),
("RedBlackPolicy", "RedBlack"),
]
ITERATION_SEQUENCE: list[int] = [1000, 10000, 100000, 200000, 300000, 400000, 500000, 600000, 700000, 800000]
CATEGORIES: list[tuple[str, str]] = [
("alloc_dealloc", "Alloc/dealloc"),
("batch_alloc_then_dealloc", "Batch alloc then dealloc"),
("alloc_touch_dealloc", "Alloc/touch/dealloc"),
]
CATEGORY_PREFIX_TO_KEY = {
"Alloc/dealloc 800KB x ": "alloc_dealloc",
"Batch alloc then dealloc 800KB x ": "batch_alloc_then_dealloc",
"Alloc/touch/dealloc 800KB x ": "alloc_touch_dealloc",
}
TIMING_RE = re.compile(r"^(?P<label>.+?):\s+(?P<ns>\d+)\s+ns\s*$")
@dataclass
class Sample:
policy: str
policy_short: str
iterations: int
run_index: int
category_key: str
category_name: str
ns: int
def benchmark_env() -> dict[str, str]:
env = os.environ.copy()
for key in (
"CMAKE_C_COMPILER_LAUNCHER",
"CMAKE_CXX_COMPILER_LAUNCHER",
"RUSTC_WRAPPER",
):
env.pop(key, None)
return env
def parse_args() -> argparse.Namespace:
script_dir = Path(__file__).resolve().parent
parser = argparse.ArgumentParser(
description=(
"Build snmalloc with two RBTree policies, run perf-large_alloc-fast "
"for a fixed sequence of iteration counts and repeats, save raw timings, "
"and produce violin plots."
)
)
parser.add_argument(
"--runs",
type=int,
required=True,
help="Number of repeated runs per policy for each iteration value.",
)
parser.add_argument(
"--source-dir",
type=Path,
default=(script_dir / "..").resolve(),
help="Path to the snmalloc source tree (default: ../).",
)
parser.add_argument(
"--build-root",
type=Path,
default=(script_dir / "policy-builds").resolve(),
help="Directory to hold per-policy build directories.",
)
parser.add_argument(
"--output-dir",
type=Path,
default=(script_dir / "policy-bench-results").resolve(),
help="Directory for CSV and plot outputs.",
)
parser.add_argument(
"--benchmark-target",
default="perf-large_alloc-fast",
help="Benchmark target name to build/run (default: perf-large_alloc-fast).",
)
return parser.parse_args()
def run_cmd(cmd: list[str], cwd: Path) -> str:
print(f"[cmd] (cwd={cwd}) {' '.join(cmd)}")
proc = subprocess.run(
cmd,
cwd=cwd,
text=True,
capture_output=True,
env=benchmark_env(),
check=False,
)
if proc.returncode != 0:
print(proc.stdout, end="")
print(proc.stderr, end="", file=sys.stderr)
raise RuntimeError(
f"Command failed with exit code {proc.returncode}: {' '.join(cmd)}"
)
return proc.stdout
def configure_and_build(
source_dir: Path, build_dir: Path, policy: str, benchmark_target: str
) -> None:
build_dir.mkdir(parents=True, exist_ok=True)
configure_cmd = [
"cmake",
"-S",
str(source_dir),
"-B",
str(build_dir),
"-DCMAKE_BUILD_TYPE=Release",
"-DCMAKE_C_COMPILER_LAUNCHER=",
"-DCMAKE_CXX_COMPILER_LAUNCHER=",
"-DSNMALLOC_DEFAULT_RBTREE_POLICY=" + policy,
]
run_cmd(configure_cmd, cwd=build_dir)
build_cmd = ["cmake", "--build", str(build_dir), "--target", benchmark_target]
run_cmd(build_cmd, cwd=build_dir)
def find_benchmark_binary(build_dir: Path, benchmark_target: str) -> Path:
candidates = [
build_dir / benchmark_target,
build_dir / f"{benchmark_target}.exe",
build_dir / "Release" / benchmark_target,
build_dir / "Release" / f"{benchmark_target}.exe",
]
for candidate in candidates:
if candidate.exists():
return candidate
raise FileNotFoundError(
f"Could not find benchmark binary for '{benchmark_target}' in {build_dir}"
)
def parse_timing_output(stdout: str) -> dict[str, int]:
found: dict[str, int] = {}
for raw_line in stdout.splitlines():
line = raw_line.strip()
match = TIMING_RE.match(line)
if not match:
continue
label = match.group("label")
for prefix, category_key in CATEGORY_PREFIX_TO_KEY.items():
if label.startswith(prefix):
found[category_key] = int(match.group("ns"))
break
expected_keys = {key for key, _ in CATEGORIES}
if found.keys() != expected_keys:
missing = sorted(expected_keys - set(found.keys()))
raise ValueError(
"Failed to parse all timing categories from benchmark output. "
f"Missing: {missing}. Output was:\n{stdout}"
)
return found
def run_benchmark_n_times(
binary: Path,
policy: str,
policy_short: str,
iterations: int,
runs: int,
) -> list[Sample]:
category_name_by_key = {key: name for key, name in CATEGORIES}
samples: list[Sample] = []
for run_index in range(1, runs + 1):
print(f"[run] {policy} iterations={iterations} run {run_index}/{runs}")
proc = subprocess.run(
['taskset', '-c', '0', str(binary), str(iterations)],
cwd=binary.parent,
text=True,
capture_output=True,
env=benchmark_env(),
check=False,
)
if proc.returncode != 0:
print(proc.stdout, end="")
print(proc.stderr, end="", file=sys.stderr)
raise RuntimeError(
f"Benchmark failed with exit code {proc.returncode}: {binary}"
)
parsed = parse_timing_output(proc.stdout)
for category_key, ns in parsed.items():
samples.append(
Sample(
policy=policy,
policy_short=policy_short,
iterations=iterations,
run_index=run_index,
category_key=category_key,
category_name=category_name_by_key[category_key],
ns=ns,
)
)
return samples
def write_csv(samples: Iterable[Sample], csv_path: Path) -> None:
csv_path.parent.mkdir(parents=True, exist_ok=True)
with csv_path.open("w", newline="", encoding="utf-8") as f:
writer = csv.writer(f)
writer.writerow(
[
"policy",
"policy_short",
"iterations",
"run",
"category_key",
"category",
"ns",
]
)
for s in samples:
writer.writerow(
[
s.policy,
s.policy_short,
s.iterations,
s.run_index,
s.category_key,
s.category_name,
s.ns,
]
)
def render_violin_plot(samples: list[Sample], plot_path: Path) -> None:
try:
import matplotlib.pyplot as plt
except ImportError as exc:
raise RuntimeError(
"matplotlib is required for plotting. Install it with: pip install matplotlib"
) from exc
plot_path.parent.mkdir(parents=True, exist_ok=True)
fig, axes = plt.subplots(1, len(CATEGORIES), figsize=(16, 5))
if len(CATEGORIES) == 1:
axes = [axes]
for ax, (category_key, category_title) in zip(axes, CATEGORIES):
series: list[list[int]] = []
positions: list[float] = []
tick_positions: list[float] = []
tick_labels: list[str] = []
pos = 1.0
gap = 1.0
for iterations in ITERATION_SEQUENCE:
tick_positions.append(pos + (len(POLICIES) - 1) / 2)
tick_labels.append(str(iterations/1000) + 'k')
for policy, policy_short in POLICIES:
vals = [
s.ns
for s in samples
if s.policy == policy
and s.category_key == category_key
and s.iterations == iterations
]
series.append(vals)
positions.append(pos)
pos += 1.0
pos += gap
violins = ax.violinplot(
series,
positions=positions,
widths=0.9,
showmeans=True,
showmedians=True,
showextrema=True,
)
colors = ["#1f77b4", "#ff7f0e"]
for i, body in enumerate(violins["bodies"]):
color = colors[i % len(colors)]
body.set_facecolor(color)
body.set_alpha(0.55)
ax.set_xticks(tick_positions)
ax.set_xticklabels(tick_labels)
ax.set_title(category_title)
ax.set_xlabel("Iterations")
ax.set_ylabel("Time (ns)")
ax.grid(axis="y", alpha=0.3)
legend_handles = [
plt.Line2D([0], [0], color=colors[i], lw=6, alpha=0.55)
for i in range(len(POLICIES))
]
ax.legend(
legend_handles,
[short for _, short in POLICIES],
loc="upper left",
fontsize=8,
frameon=False,
)
fig.suptitle(
"perf-large_alloc-fast: policy comparison across iteration groups", fontsize=13
)
fig.tight_layout()
fig.savefig(plot_path, dpi=180)
def print_summary(samples: list[Sample]) -> None:
print("\nSummary (ns):")
for category_key, category_name in CATEGORIES:
print(f"\n{category_name}")
for iterations in ITERATION_SEQUENCE:
print(f" iterations={iterations}")
for policy, policy_short in POLICIES:
vals = [
s.ns
for s in samples
if s.policy == policy
and s.category_key == category_key
and s.iterations == iterations
]
median = statistics.median(vals)
mean = statistics.fmean(vals)
print(
f" {policy_short:<8} mean={mean:>12.1f} "
f"median={median:>12.1f} min={min(vals):>12} max={max(vals):>12}"
)
def main() -> int:
args = parse_args()
if args.runs <= 0:
raise ValueError("--runs must be positive.")
source_dir = args.source_dir.resolve()
build_root = args.build_root.resolve()
output_dir = args.output_dir.resolve()
csv_path = output_dir / "large_alloc_policy_timings.csv"
plot_path = output_dir / "large_alloc_policy_violin.png"
all_samples: list[Sample] = []
for policy, policy_short in POLICIES:
policy_build_dir = build_root / policy_short.lower()
configure_and_build(source_dir, policy_build_dir, policy, args.benchmark_target)
binary = find_benchmark_binary(policy_build_dir, args.benchmark_target)
for iterations in ITERATION_SEQUENCE:
samples = run_benchmark_n_times(
binary, policy, policy_short, iterations, args.runs
)
all_samples.extend(samples)
write_csv(all_samples, csv_path)
render_violin_plot(all_samples, plot_path)
print_summary(all_samples)
print(f"\nWrote CSV: {csv_path}")
print(f"Wrote plot: {plot_path}")
return 0
if __name__ == "__main__":
try:
raise SystemExit(main())
except Exception as exc: # pylint: disable=broad-except
print(f"error: {exc}", file=sys.stderr)
raise SystemExit(1) |



Benchmark based on: bf7a152
I got some strange performance numbers on a battery-powered Surface. The original redblack tree is slower than expected. I have never seen such a large performance gap between two BST variants anywhere else.