GH-49448: [C++][CI] Detect mismatching schema in differential IPC fuzzing #49451

Open
pitrou wants to merge 1 commit into apache:main from pitrou:gh49448-fuzz-ipc-differential-fix

@pitrou
Member

pitrou commented Mar 4, 2026

Rationale for this change

In #49311 we added differential fuzzing for the IPC file fuzzer, where we compare the contents as read by the IPC file reader and the IPC stream reader, respectively.

However, as the IPC file footer carries a duplicate definition of the file's schema, the fuzzer could mutate it and make it mismatch the IPC stream schema. The fuzz target would currently fail on such a file, even though it's technically not a bug in the IPC implementation.

This was detected by OSS-Fuzz in https://issues.oss-fuzz.com/issues/486127061

What changes are included in this PR?

Detect a mismatching schema between the IPC stream and the IPC file, and skip the batch comparison in that case.
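
As a rough illustration of the check described above, here is a minimal, self-contained sketch. It does not use Arrow's API: `Field`, `Schema`, `Batch`, `ReadResult`, `Outcome` and `CompareReads` are hypothetical stand-ins invented for this example, and the actual change in this PR may be structured differently.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-ins for illustration only; these are NOT Arrow's classes.
struct Field {
  std::string name;
  std::string type;
  bool operator==(const Field& other) const {
    return name == other.name && type == other.type;
  }
};
using Schema = std::vector<Field>;
using Batch = std::vector<int>;  // one toy "record batch"

// What one reader (file or stream) produced for a given input.
struct ReadResult {
  Schema schema;
  std::vector<Batch> batches;
};

enum class Outcome { kSkippedSchemaMismatch, kBatchesMatch, kBatchesDiffer };

// Differential check: only compare batches when both readers saw the same
// schema. A schema mismatch is treated as "nothing to compare", not as a bug.
Outcome CompareReads(const ReadResult& from_file, const ReadResult& from_stream) {
  if (!(from_file.schema == from_stream.schema)) {
    return Outcome::kSkippedSchemaMismatch;
  }
  return from_file.batches == from_stream.batches ? Outcome::kBatchesMatch
                                                  : Outcome::kBatchesDiffer;
}

// Small usage example (hypothetical data).
void DemoCompareReads() {
  const Schema original{Field{"a", "int32"}};
  const Schema mutated{Field{"a", "float"}};  // e.g. a fuzzer-mutated footer copy
  const std::vector<Batch> batches{Batch{1, 2, 3}};

  const ReadResult stream_read{original, batches};
  const ReadResult file_read_ok{original, batches};
  const ReadResult file_read_mutated{mutated, batches};

  assert(CompareReads(file_read_ok, stream_read) == Outcome::kBatchesMatch);
  assert(CompareReads(file_read_mutated, stream_read) ==
         Outcome::kSkippedSchemaMismatch);
}
```

In a fuzz target, only `kBatchesDiffer` would be reported as a failure under this scheme.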

Are these changes tested?

Yes, by a new regression test.

Are there any user-facing changes?

No.

@pitrou
Member Author

pitrou commented Mar 4, 2026

@github-actions crossbow submit fuzz

@github-actions

github-actions bot commented Mar 4, 2026

Revision: 7587b2d

Submitted crossbow builds: ursacomputing/crossbow @ actions-4d3562254e

Task Status
test-build-cpp-fuzz GitHub Actions

pitrou marked this pull request as ready for review March 4, 2026 14:26
@pitrou
Member Author

pitrou commented Mar 4, 2026

@WillAyd Would you like to review this?

@addisoncrump

> as the IPC file footer carries a duplicate definition of the file's schema, the fuzzer could mutate it and make it mismatch the IPC stream schema.

Shouldn't the program already reject the input in this case? What is the purpose of the duplicate definition otherwise?

Also, would it be possible to "fix" the footer to match the original schema?

@pitrou
Member Author

pitrou commented Mar 4, 2026

> Shouldn't the program already reject the input in this case? What is the purpose of the duplicate definition otherwise?

An IPC stream is meant to be read sequentially and therefore has the schema appearing at the start of the encoded stream.

An IPC file is basically an IPC stream + a file footer with dedicated metadata for random access (a bit like a ZIP file catalog). The IPC file footer contains a copy of the schema to reduce the number of I/O operations needed to start reading the file.

The IPC file reader reads directly from the end of the file, ignoring the schema that is stored at the start of the encoded IPC stream. Validating that the two schemas are identical would cost an extra I/O operation, and correct files have identical schemas anyway.
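
To make the two read paths concrete, here is a toy model. The byte layout below is invented for illustration and is not the Arrow IPC encoding; `MakeToyFile`, `SchemaSeenByStreamReader` and `SchemaSeenByFileReader` are hypothetical names. It only shows how a reader that starts from the footer can end up with a different schema than one that starts from the beginning, without either of them noticing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy container format, invented for this sketch (NOT the Arrow IPC format):
//   [schema_len: 1 byte][schema bytes][body bytes ...][schema copy][schema_len: 1 byte]
using Bytes = std::vector<uint8_t>;

Bytes MakeToyFile(const std::string& schema, const std::string& body) {
  assert(schema.size() < 256);  // the length must fit in one byte
  Bytes out;
  out.push_back(static_cast<uint8_t>(schema.size()));
  out.insert(out.end(), schema.begin(), schema.end());
  out.insert(out.end(), body.begin(), body.end());
  out.insert(out.end(), schema.begin(), schema.end());  // footer copy
  out.push_back(static_cast<uint8_t>(schema.size()));
  return out;
}

// "Stream" reader: sequential, so it uses the schema found at the start.
std::string SchemaSeenByStreamReader(const Bytes& file) {
  assert(!file.empty());
  const auto len = static_cast<std::ptrdiff_t>(file.front());
  assert(static_cast<std::ptrdiff_t>(file.size()) >= 1 + len);
  return std::string(file.begin() + 1, file.begin() + 1 + len);
}

// "File" reader: random access, so it jumps to the footer at the end and
// never looks at the leading copy of the schema.
std::string SchemaSeenByFileReader(const Bytes& file) {
  assert(!file.empty());
  const auto len = static_cast<std::ptrdiff_t>(file.back());
  assert(static_cast<std::ptrdiff_t>(file.size()) >= 1 + len);
  return std::string(file.end() - 1 - len, file.end() - 1);
}
```

If a single byte inside the footer copy is changed, `SchemaSeenByFileReader` returns the altered schema while `SchemaSeenByStreamReader` still returns the original one.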

> Also, would it be possible to "fix" the footer to match the original schema?

Ah, you mean use the same schema when comparing the contents? There's no way to tell the IPC file reader API to use a different schema for reading, because it doesn't make sense with valid IPC files.

Moreover, in some cases the different schema will not matter because only a field name changed, but as soon as a more important piece of information has changed (for example a field type, or an additional field etc.), then passing the wrong schema to the reader will just fail or return gibberish.

@addisoncrump

> There's no way to tell the IPC file reader API to use a different schema for reading, because it doesn't make sense with valid IPC files.

I mean in the sense of "fixing the footer"; if we know what it should be, we can correct it in advance. This way you can still do comparisons on the rest of the data.

You might look into LLVMFuzzerCustomMutator and LLVMFuzzerCustomCrossover (note that you must do both) to do this: https://github.com/google/fuzzing/blob/master/docs/structure-aware-fuzzing.md

Internally, you can call LLVMFuzzerMutate/LLVMFuzzerCrossover and then apply the fixup after the mutation from the fuzz engine.
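
A minimal sketch of the "mutate, then fix up" idea, under loud assumptions: it works on an invented toy layout (not the Arrow IPC format), and `FixupFooter`, `MutateThenFixup` and `Mutator` are hypothetical names. The `mutate` callback stands in for the fuzz engine's own mutation; in a real libFuzzer target this logic would presumably sit inside `LLVMFuzzerCustomMutator`, calling `LLVMFuzzerMutate` first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Toy layout, invented for this sketch (NOT the Arrow IPC format):
//   [schema_len: 1 byte][schema bytes][body bytes ...][schema copy][schema_len: 1 byte]
using Bytes = std::vector<uint8_t>;

// Overwrite the footer's schema copy with the leading schema so both agree.
// Returns false if the input is too short to hold both copies.
bool FixupFooter(Bytes* data) {
  assert(data != nullptr);
  if (data->empty()) return false;
  const std::size_t len = data->front();
  // Need: leading length byte + schema + footer copy + trailing length byte.
  if (data->size() < 2 * len + 2) return false;
  const auto n = static_cast<std::ptrdiff_t>(len);
  const auto footer_start = static_cast<std::ptrdiff_t>(data->size()) - 1 - n;
  // Source and destination cannot overlap thanks to the size check above.
  std::copy(data->begin() + 1, data->begin() + 1 + n,
            data->begin() + footer_start);
  data->back() = static_cast<uint8_t>(len);
  return true;
}

// Stand-in for the fuzz engine's mutation step (an assumption of this sketch).
using Mutator = std::function<void(Bytes*)>;

// Let the engine mutate freely, then repair the footer. Inputs that are too
// damaged to repair are left as they are.
void MutateThenFixup(Bytes* data, const Mutator& mutate) {
  mutate(data);
  FixupFooter(data);
}
```

With this approach a mutation of the leading schema is propagated to the footer, so mutated schemas can still reach the batch comparison instead of being skipped.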

@pitrou
Member Author

pitrou commented Mar 4, 2026

> You might look into LLVMFuzzerCustomMutator and LLVMFuzzerCustomCrossover (note that you must do both) to do this: https://github.com/google/fuzzing/blob/master/docs/structure-aware-fuzzing.md
>
> Internally, you can call LLVMFuzzerMutate/LLVMFuzzerCrossover and then apply the fixup after the mutation from the fuzz engine.

Hmm, thanks for the pointer. This looks intriguing. Implementing those hooks for this specific issue is certainly overkill, but it might be useful for other purposes. I'll try to think about that later :)

@addisoncrump

Correct me if I'm wrong here, but the batch comparison won't happen unless the schema is suitably comparable, right? That would mean that you very rarely actually compare batches, then, because the probability that you have mutated the schema at the top and bottom of the IPC file s.t. they are equal is basically zero. That's why I'm suggesting the fix-up, as it would prevent the failure to compare.

@pitrou
Member Author

pitrou commented Mar 4, 2026

> That would mean that you very rarely actually compare batches, then, because the probability that you have mutated the schema at the top and bottom of the IPC file s.t. they are equal is basically zero

The seed corpus starts with valid IPC files, i.e. files where the two copies are identical. I presume that in many cases the fuzzer would not mutate either of the copies, only the rest of the file? (I don't know which precise heuristics it uses, though.)

@addisoncrump

Sure, but that limits your search to just the schemas in the provided corpus... potentially undesirable.

@pitrou
Member Author

pitrou commented Mar 5, 2026

Ah, that's a good point. Hmm.
