Crash: toolbench analyze OOM / crash on large input files #4

@smart-dev1028

Description

Labels: bug, high priority, cli, memory

Summary

Running toolbench analyze on large input files (≥ ~200MB) causes the process to crash with either an out-of-memory error or an unhandled exception. The crash appears to occur during parsing/aggregation in lib/analyze.js (or toolbench/analyze.py, depending on the implementation). Small files work fine.

Expected

toolbench analyze <file> should stream or chunk the input and complete successfully (or fail gracefully with a helpful error) for large files.

Actual

Process crashes with either:

  • Node: FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory, or
  • Python: MemoryError or process killed by OS (OOM killer), or
  • Unhandled exception and non-zero exit code with no helpful message.

Environment

  • ToolBench commit / tag: HEAD (please replace with exact commit hash)
  • OS: Ubuntu 22.04 LTS (also reproduced on macOS 12)
  • Node.js: v18.16.0 (if applicable)
  • Python: 3.11.4 (if applicable)
  • RAM: 8GB
  • Reproduced both locally and in CI (GitHub Actions)

Reproduction steps (minimal)

  1. Create a large test file (200MB+). Example command to generate a valid NDJSON test file:
# Linux / macOS: generate roughly 250MB of newline-delimited JSON
seq 1 1600000 | awk '{printf "{\"id\": %d, \"value\": \"%0128d\"}\n", $1, $1}' > ./test-large.ndjson
  2. Run the analysis:
# CLI invocation
toolbench analyze ./test-large.ndjson --mode summary
  3. Observe crash:
# Node.js example crash
$ toolbench analyze ./test-large.ndjson --mode summary
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
Aborted (core dumped)

Or for Python:

$ toolbench analyze ./test-large.ndjson --mode summary
Traceback (most recent call last):
  File "toolbench/cli.py", line 42, in <module>
    main()
  File "toolbench/analyze.py", line 210, in analyze
    results = aggregator.aggregate(all_items)
MemoryError

Quick root-cause hypothesis

The current implementation accumulates the entire parsed input in memory (e.g. building an all_items list, or calling json.load()/file.read() on the whole file) and then runs the aggregation in memory. For very large files this pushes memory usage past what is available and crashes the process. The CLI should either:

  • Stream-process the input (line-by-line or in chunks), keeping only aggregated state, or
  • Use a bounded buffer / external temporary storage (SQLite or a temporary file) for intermediate results (see the sketch after this list), or
  • Provide an explicit memory-limited mode.
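
As an illustration of the second option, an intermediate store backed by SQLite could look roughly like the sketch below. This assumes the better-sqlite3 package, which is not currently a dependency; the table layout and function names are made up for illustration.

// Hypothetical disk-backed buffer using better-sqlite3 (illustrative only, not existing code)
const Database = require('better-sqlite3');

function createDiskBackedBuffer(dbPath) {
  const db = new Database(dbPath);
  db.exec('CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, value REAL)');
  const insert = db.prepare('INSERT INTO items (value) VALUES (?)');

  return {
    // Push one numeric value to disk instead of holding it in a JS array.
    add(value) { insert.run(value); },
    // Let SQLite compute the summary so all rows never sit in process memory at once.
    summarize() {
      return db
        .prepare('SELECT COUNT(*) AS count, MIN(value) AS min, MAX(value) AS max, AVG(value) AS mean FROM items')
        .get();
    },
    close() { db.close(); }
  };
}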

Suggested fix (Node.js example)

Change the implementation to stream the file instead of reading the whole file into memory.

Before (problematic pattern):

// lib/analyze.js (hypothetical)
const fs = require('fs');

function analyze(path) {
  const raw = fs.readFileSync(path, 'utf8');  // reads entire file
  const items = raw.split('\n').map(JSON.parse);
  // heavy in-memory aggregation
  return aggregate(items);
}

After (streaming approach using readline):

// lib/analyze.js
const fs = require('fs');
const readline = require('readline');

async function analyze(path) {
  const stats = createAggregator(); // small stateful object
  const rl = readline.createInterface({
    input: fs.createReadStream(path, { encoding: 'utf8' }),
    crlfDelay: Infinity
  });

  for await (const line of rl) {
    if (!line.trim()) continue;
    let item;
    try {
      item = JSON.parse(line);
    } catch (err) {
      // handle / log parse error per-line
      continue;
    }
    stats.add(item); // aggregator keeps only necessary summary/state
  }

  return stats.finalize();
}

module.exports = { analyze };

This avoids loading the entire file into memory.
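
The streaming example above assumes a createAggregator() helper that isn't shown here. A minimal sketch of what such an aggregator might look like (names and fields are illustrative, not existing code):

// lib/streaming-aggregator.js (hypothetical; keeps only O(1) summary state)
function createAggregator() {
  let count = 0;
  let sum = 0;
  let min = Infinity;
  let max = -Infinity;

  return {
    // Fold one parsed record into the running summary.
    add(item) {
      const value = typeof item.value === 'number' ? item.value : 0;
      count += 1;
      sum += value;
      if (value < min) min = value;
      if (value > max) max = value;
    },
    // Produce the final summary once the whole stream has been consumed.
    finalize() {
      return {
        count,
        min: count ? min : null,
        max: count ? max : null,
        mean: count ? sum / count : null
      };
    }
  };
}

module.exports = { createAggregator };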


Suggested fix (Python example)

Use an iterator and avoid reading the entire file with json.load().

Before (problematic):

# toolbench/analyze.py
with open(path, 'r', encoding='utf-8') as fh:
    data = json.load(fh)   # loads whole file -> OOM
aggregator = Aggregator()
aggregator.aggregate(data)

After (streaming, NDJSON example):

# toolbench/analyze.py
import json

def analyze(path):
    aggregator = Aggregator()
    with open(path, 'r', encoding='utf-8') as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                item = json.loads(line)
            except json.JSONDecodeError:
                # optionally log and continue
                continue
            aggregator.add(item)
    return aggregator.result()

If the input is not NDJSON, consider ijson for streaming JSON arrays:

# example with ijson for JSON arrays
import ijson
with open(path, 'rb') as fh:
    parser = ijson.items(fh, 'item')
    for item in parser:
        aggregator.add(item)

Add ijson to optional dependencies if needed.


Tests to add (unit / integration)

Node: jest integration test
Create tests/large-file.integration.test.js:

const fs = require('fs');
const { spawnSync } = require('child_process');
const tmp = require('tmp');

// keep the file small for CI; the goal is to exercise the streaming path end-to-end
test('analyze handles streaming input without OOM', () => {
  const tmpFile = tmp.fileSync({ postfix: '.ndjson' });
  const lines = [];
  for (let i = 0; i < 10000; i++) {
    lines.push(JSON.stringify({ id: i, value: Math.random() }));
  }
  fs.writeFileSync(tmpFile.name, lines.join('\n'), 'utf8');

  const result = spawnSync('node', ['bin/toolbench', 'analyze', tmpFile.name], {
    encoding: 'utf8',
    maxBuffer: 1024 * 1024 * 10
  });

  expect(result.status).toBe(0);
  expect(result.stdout).toMatch(/summary/); // adapt to actual CLI output
});

Python pytest integration
tests/test_analyze_large.py:

import json
import subprocess
import sys

def test_analyze_handles_large_file(tmp_path):
    p = tmp_path / "test.ndjson"
    with p.open("w", encoding="utf-8") as fh:
        for i in range(20000):
            fh.write(json.dumps({"id": i, "v": i}) + "\n")

    proc = subprocess.run([sys.executable, "-m", "toolbench", "analyze", str(p)],
                          capture_output=True, text=True)
    assert proc.returncode == 0
    assert "summary" in proc.stdout.lower()  # adapt to actual output

Suggested PR checklist / reviewer notes

  • Replace any readFileSync()/read() + JSON.parse() of the entire file with the streaming approach.
  • Add unit/integration test above to CI matrix.
  • Add a --stream CLI flag, or automatically detect stdin input and stream it.
  • Update README to document memory-safe mode and supported input formats (NDJSON / JSON array).
  • If using ijson or another third-party streaming parser, add it to dependencies and provide fallback.

Logs / attachments

(Attach any verbose logs or profiler output, e.g. node --trace-gc or python -X tracemalloc, to help pin down the root cause.)

Example: node --max-old-space-size=4096 bin/toolbench analyze test-large.ndjson

  • If increasing the Node heap avoids the crash, that is further evidence that the entire input is being held in memory.

Temporary workarounds

  • Split input file into smaller chunks and run toolbench analyze on each, then merge results externally.
  • Run on a machine with more memory, or increase the Node heap with --max-old-space-size=8192 (a mitigation, not a real fix).

Example minimal patch idea (pseudo)

  1. Create lib/streaming-aggregator.js with a small stateful aggregator API (add(item), finalize()); the createAggregator() sketch above is one possible shape.
  2. Modify the CLI entrypoint to choose the streaming path for files > 10MB, when --stream is passed, or when input comes from stdin (see the sketch after this list).
  3. Add tests and docs.
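
A rough sketch of the dispatch in step 2, assuming a shouldStream() helper in the entrypoint (flag and function names are illustrative, not existing code):

// bin/toolbench (hypothetical excerpt): decide whether to take the streaming path
const fs = require('fs');

const STREAM_THRESHOLD_BYTES = 10 * 1024 * 1024; // ~10MB, per step 2 above

// True when the CLI should stream: --stream was passed, input is stdin,
// or the file on disk is larger than the threshold.
function shouldStream(path, flags) {
  if (flags.stream) return true;
  if (!path || path === '-') return true; // stdin: no size to check, always stream
  return fs.statSync(path).size > STREAM_THRESHOLD_BYTES;
}

// Usage inside the entrypoint, where analyze() is the streaming implementation
// from lib/analyze.js above and legacyAnalyze() is the current in-memory path:
//   const results = shouldStream(file, flags) ? await analyze(file) : legacyAnalyze(file);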
