Skip to content

Segfault in _pickle.c:batch_dict_exact when pickling dict with concurrent mutation (free-threading) #146452

@overlorde

Description

@overlorde

Bug report

Bug description

I've been testing the free-threaded build for thread safety issues in unaudited modules (ref #116738_pickle.c is unchecked). Found a segfault when pickle.dumps() runs concurrently with dict mutations on the same dict.

The crash happens in batch_dict_exact at Modules/_pickle.c:3496 — it iterates dict items via _PyDict_Next which gives borrowed references, but there's no critical section on the dict. If another thread pops or replaces a key mid-iteration, the borrowed reference goes stale and Py_INCREF hits a freed/NULL pointer.

Sometimes it raises RuntimeError: dictionary changed size during iteration cleanly, but other times the reference gets invalidated before the size check and it just segfaults.

Reproducer

import pickle
import threading

shared = {str(i): list(range(50)) for i in range(100)}

for trial in range(2000):
    barrier = threading.Barrier(3)

    def dumper():
        try:
            barrier.wait()
            pickle.dumps(shared)
        except Exception:
            pass

    def mutator():
        try:
            barrier.wait()
            for j in range(200):
                key = str(j % 100)
                shared[key] = list(range(j % 50))
                if j % 10 == 0:
                    shared.pop(str(j % 100), None)
                    shared[str(j % 100)] = [j]
        except Exception:
            pass

    threads = [
        threading.Thread(target=dumper),
        threading.Thread(target=dumper),
        threading.Thread(target=mutator),
    ]
    for t in threads: t.start()
    for t in threads: t.join()

Crashes within a few hundred iterations usually. I ran it 10 times, got 9 segfaults.

Tested on

3.14.0a4 free-threading (stock install, no sanitizers):

$ python3.14t reproducer.py
Segmentation fault (core dumped)

3.15.0a7+ free-threading (built with --with-address-sanitizer):

AddressSanitizer:DEADLYSIGNAL
==1334802==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000000c
    #0 _Py_atomic_load_uint32_relaxed Include/cpython/pyatomic_gcc.h:367
    #1 Py_INCREF Include/refcount.h:267
    #2 batch_dict_exact Modules/_pickle.c:3496
    #3 save_dict Modules/_pickle.c:3563
    #4 save Modules/_pickle.c:4568
    #5 dump Modules/_pickle.c:4768
    #6 _pickle_dumps_impl Modules/_pickle.c:7992

Root cause

  • batch_dict_exact calls PyDict_Next() which returns borrowed references to key and value
  • Between PyDict_Next() returning and Py_INCREF(key) / Py_INCREF(value), another thread can mutate the dict
  • The borrowed reference points to a freed object, Py_INCREF dereferences it and segfaults
  • The size check at line 3512 (PyDict_GET_SIZE(obj) != dict_size) only catches mutations that change dict size, and only after the batch — too late to prevent the stale reference
  • Same issue exists in both the single-item path (line 3474) and the batch loop (line 3494)

Possible fix

  • Wrap PyDict_Next() + Py_INCREF(key) + Py_INCREF(value) in Py_BEGIN_CRITICAL_SECTION(obj) / Py_END_CRITICAL_SECTION() to prevent dict mutation while converting borrowed refs to owned refs
  • Keep save() calls outside the critical section since they can reenter Python
  • Same approach as the set iteration path at line 3644 which already uses Py_BEGIN_CRITICAL_SECTION(obj) with _PySet_NextEntryRef
  • Both PyDict_Next call sites in batch_dict_exact need this (line 3474 and line 3494)

CPython versions tested on

3.14, CPython main branch

Operating systems tested on

Linux

Linked PRs

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    Status

    No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions