Description
Labels: bug, high priority, cli, memory
Summary
Running toolbench analyze on large input files (≥ ~200MB) causes the process to crash with either an out-of-memory error or an unhandled exception. The crash appears to occur during parsing/aggregation in lib/analyze.js (or toolbench/analyze.py, depending on the implementation). Small files work fine.
Expected
toolbench analyze <file> should stream or chunk the input and complete successfully (or fail gracefully with a helpful error) for large files.
Actual
Process crashes with one of:
- Node: FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
- Python: MemoryError, or the process is killed by the OS (OOM killer)
- An unhandled exception and a non-zero exit code with no helpful message
Environment
- ToolBench commit / tag: HEAD (please replace with exact commit hash)
- OS: Ubuntu 22.04 LTS (also reproduced on macOS 12)
- Node.js: v18.16.0 (if applicable)
- Python: 3.11.4 (if applicable)
- RAM: 8GB
- Reproduced both locally and in CI (GitHub Actions)
Reproduction steps (minimal)
- Create a large test file (200MB+). Example command to generate a test file:
# Linux / macOS: create a 250MB test file
base64 /dev/urandom | head -c 250000000 > ./test-large.ndjson
- Run the analysis:
# CLI invocation
toolbench analyze ./test-large.ndjson --mode summary
- Observe the crash:
# Node.js example crash
$ toolbench analyze ./test-large.ndjson --mode summary
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
Aborted (core dumped)
Or for Python:
$ toolbench analyze ./test-large.ndjson --mode summary
Traceback (most recent call last):
File "toolbench/cli.py", line 42, in <module>
main()
File "toolbench/analyze.py", line 210, in analyze
results = aggregator.aggregate(all_items)
MemoryError
Quick root-cause hypothesis
The current implementation accumulates the entire parsed input into memory (e.g. all_items = [], or a full json.load()/file.read()), then runs in-memory aggregation. For very large files this pushes memory usage past what is available and crashes the process. The CLI should do one of the following:
- Stream-process input (line-by-line or chunked), keeping only aggregated state, or
- Use a bounded buffer / external temporary storage (SQLite or a temporary file) for intermediate results (a sketch of this option follows below), or
- Provide an option to use memory-limited mode.
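For the second option, here is a minimal sketch of spilling per-key partial aggregates to a temporary SQLite file instead of keeping raw items in memory. It assumes the third-party better-sqlite3 package, and the key/value fields and the count/total schema are purely illustrative placeholders, not the real toolbench aggregation:
// sketch: external temporary storage for intermediate results (option 2)
const os = require('os');
const path = require('path');
const Database = require('better-sqlite3'); // assumed third-party dependency
function createSqliteAggregator() {
  const dbPath = path.join(os.tmpdir(), `toolbench-partials-${process.pid}.db`);
  const db = new Database(dbPath);
  db.exec('CREATE TABLE IF NOT EXISTS partials (key TEXT PRIMARY KEY, count INTEGER, total REAL)');
  const upsert = db.prepare(
    'INSERT INTO partials (key, count, total) VALUES (?, 1, ?) ' +
    'ON CONFLICT(key) DO UPDATE SET count = count + 1, total = total + excluded.total'
  );
  return {
    add(item) {
      // only per-key partial aggregates ever touch disk; raw items are discarded
      upsert.run(String(item.key), Number(item.value) || 0);
    },
    finalize() {
      const rows = db.prepare('SELECT key, count, total FROM partials ORDER BY key').all();
      db.close();
      return rows;
    }
  };
}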
Suggested fix (Node.js example)
Change the implementation to stream the file instead of reading the whole file into memory.
Before (problematic pattern):
// lib/analyze.js (hypothetical)
const fs = require('fs');
function analyze(path) {
  const raw = fs.readFileSync(path, 'utf8'); // reads entire file
  const items = raw.split('\n').map(JSON.parse);
  // heavy in-memory aggregation
  return aggregate(items);
}
After (streaming approach using readline):
// lib/analyze.js
const fs = require('fs');
const readline = require('readline');
async function analyze(path) {
  const stats = createAggregator(); // small stateful object
  const rl = readline.createInterface({
    input: fs.createReadStream(path, { encoding: 'utf8' }),
    crlfDelay: Infinity
  });
  for await (const line of rl) {
    if (!line.trim()) continue;
    let item;
    try {
      item = JSON.parse(line);
    } catch (err) {
      // handle / log parse error per-line
      continue;
    }
    stats.add(item); // aggregator keeps only necessary summary/state
  }
  return stats.finalize();
}
module.exports = { analyze };
This avoids loading the entire file into memory.
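The createAggregator() helper is not shown in this issue; a minimal sketch of what it could look like is below. The count/sum/mean fields and the "value" property are hypothetical placeholders for the real summary logic, and the helper could live in a small module such as the lib/streaming-aggregator.js proposed at the end of this issue:
// illustrative only: a tiny stateful aggregator that never stores raw items
function createAggregator() {
  let count = 0;
  let sum = 0;
  return {
    add(item) {
      // update running totals; the hypothetical "value" field stands in for real metrics
      count += 1;
      sum += Number(item.value) || 0;
    },
    finalize() {
      return { count, sum, mean: count ? sum / count : 0 };
    }
  };
}
module.exports = { createAggregator };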
Suggested fix (Python example)
Use an iterator and avoid reading the entire file with json.load().
Before (problematic):
# toolbench/analyze.py
with open(path, 'r', encoding='utf-8') as fh:
    data = json.load(fh)  # loads whole file -> OOM
aggregator = Aggregator()
aggregator.aggregate(data)
After (streaming, NDJSON example):
# toolbench/analyze.py
import json

def analyze(path):
    aggregator = Aggregator()
    with open(path, 'r', encoding='utf-8') as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                item = json.loads(line)
            except json.JSONDecodeError:
                # optionally log and continue
                continue
            aggregator.add(item)
    return aggregator.result()
If input is not NDJSON, consider ijson for streaming JSON arrays:
# example with ijson for JSON arrays
import ijson
with open(path, 'rb') as fh:
    parser = ijson.items(fh, 'item')
    for item in parser:
        aggregator.add(item)
Add ijson to optional dependencies if needed.
Tests to add (unit / integration)
Node: jest integration test
Create tests/large-file.integration.test.js:
const fs = require('fs');
const { spawnSync } = require('child_process');
const tmp = require('tmp');
// generate small-ish file but conceptually large for CI
test('analyze handles streaming input without OOM', () => {
  const tmpFile = tmp.fileSync({ postfix: '.ndjson' });
  const lines = [];
  for (let i = 0; i < 10000; i++) {
    lines.push(JSON.stringify({ id: i, value: Math.random() }));
  }
  fs.writeFileSync(tmpFile.name, lines.join('\n'), 'utf8');
  const result = spawnSync('node', ['bin/toolbench', 'analyze', tmpFile.name], {
    encoding: 'utf8',
    maxBuffer: 1024 * 1024 * 10
  });
  expect(result.status).toBe(0);
  expect(result.stdout).toMatch(/summary/); // adapt to actual CLI output
});
Python pytest integration
tests/test_analyze_large.py:
import json
import subprocess
import sys

def test_analyze_handles_large_file(tmp_path):
    p = tmp_path / "test.ndjson"
    with p.open("w", encoding="utf-8") as fh:
        for i in range(20000):
            fh.write(json.dumps({"id": i, "v": i}) + "\n")
    proc = subprocess.run([sys.executable, "-m", "toolbench", "analyze", str(p)],
                          capture_output=True, text=True)
    assert proc.returncode == 0
    assert "summary" in proc.stdout.lower()  # adapt to actual output
Suggested PR checklist / reviewer notes
- Replace any readFileSync / read() + JSON.parse() of the entire file with the streaming approach.
- Add the unit/integration tests above to the CI matrix.
- Add a CLI flag --stream, or automatically detect stdin and stream (see the sketch after this list).
- Update README to document memory-safe mode and supported input formats (NDJSON / JSON array).
- If using ijson or another third-party streaming parser, add it to dependencies and provide a fallback.
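A minimal sketch of the --stream / stdin checklist item is below; the entrypoint layout and option handling are assumptions about how the CLI might be wired, not the current implementation:
// bin/toolbench (sketch): read from stdin when data is piped in, otherwise stream the file
const fs = require('fs');
const readline = require('readline');
function openInput(filePath) {
  if (!filePath && !process.stdin.isTTY) {
    // no file argument and data is being piped: stream stdin
    return process.stdin;
  }
  // a real CLI should error out with a helpful message if filePath is missing here
  return fs.createReadStream(filePath, { encoding: 'utf8' });
}
async function countLines(filePath) {
  const rl = readline.createInterface({ input: openInput(filePath), crlfDelay: Infinity });
  let lines = 0;
  for await (const line of rl) {
    if (line.trim()) lines += 1; // the real CLI would feed each line to the streaming analyze()
  }
  return lines;
}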
Logs / attachments
(Attach any verbose logs or profiler output, e.g. node --trace-gc or python -X tracemalloc, to help with root-cause analysis.)
Example: node --max-old-space-size=4096 bin/toolbench analyze test-large.ndjson
- If increasing the Node heap avoids the crash, that is further evidence of the in-memory accumulation pattern.
Temporary workarounds
- Split the input file into smaller chunks, run toolbench analyze on each, then merge the results externally (a splitting sketch follows below).
- Run on a machine with more memory / increase the Node heap with --max-old-space-size=8192 (not a real fix).
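If a splitting tool is not at hand, a throwaway Node script along these lines can produce the chunks; chunk size and file naming are arbitrary, and merging the per-chunk results still depends on the actual output format of toolbench analyze:
// split-ndjson.js (sketch): split a large NDJSON file into fixed-size chunk files
// usage: node split-ndjson.js ./test-large.ndjson 500000
const fs = require('fs');
const readline = require('readline');
const { once } = require('events');
async function split(inputPath, linesPerChunk) {
  const rl = readline.createInterface({
    input: fs.createReadStream(inputPath, { encoding: 'utf8' }),
    crlfDelay: Infinity
  });
  let chunkIndex = 0;
  let lineCount = 0;
  let out = fs.createWriteStream(`${inputPath}.chunk-${chunkIndex}`);
  for await (const line of rl) {
    if (lineCount >= linesPerChunk) {
      out.end();
      chunkIndex += 1;
      lineCount = 0;
      out = fs.createWriteStream(`${inputPath}.chunk-${chunkIndex}`);
    }
    // respect backpressure so the splitter itself stays memory-safe
    if (!out.write(line + '\n')) await once(out, 'drain');
    lineCount += 1;
  }
  out.end();
}
split(process.argv[2], Number(process.argv[3]) || 500000);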
Example minimal patch idea (pseudo)
- Create lib/streaming-aggregator.js with a small stateful aggregator API (add(item), finalize()).
- Modify the CLI entrypoint to choose the streaming path for files > 10MB or when --stream is passed (see the sketch below).
- Add tests and docs.
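A possible shape for that entrypoint change is sketched below; the 10MB threshold and --stream flag come from this issue's proposal, while the legacy module name is a hypothetical placeholder:
// CLI entrypoint (sketch): choose the streaming path for large inputs or when --stream is passed
const fs = require('fs');
const STREAM_THRESHOLD_BYTES = 10 * 1024 * 1024; // 10MB, as proposed above
async function runAnalyze(filePath, flags) {
  const { size } = fs.statSync(filePath);
  if (flags.stream || size > STREAM_THRESHOLD_BYTES) {
    // constant-memory path from the suggested fix above
    const { analyze } = require('./lib/analyze');
    return analyze(filePath);
  }
  // small files can keep the existing in-memory path for now
  const { analyzeInMemory } = require('./lib/analyze-legacy'); // hypothetical module name
  return analyzeInMemory(filePath);
}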