
Eval bug: gpt-oss20b strange behaviour #18004

@egomarker

Name and Version

I've used several versions from b7296 to b7380

Operating systems

Mac

GGML backends

Metal

Hardware

M1 Max, M4 Max

Models

gpt-oss20b FP16

Problem description & steps to reproduce

I have a task that I use to test models with. It's a simple tool-use task: "see three function implementations in file 1, update them to use security measures implemented in file and add them to file 2". The usual success rate for gpt-oss20b was always 100% or very close.

I noticed yesterday that performance on this task has degraded significantly. I started narrowing it down to specific llama.cpp builds, and it looks like it broke around b7371.

Image

You can see that b7350, b7363 and b7370 made proper code inserts without bugs, while b7380 can't insert correct code.
I was not able to get any inserts from b7371 at all; it's as if the model is partially blind and barely "sees" the code. Sometimes it just claims the code is already there and stops. Sometimes it keeps calling the "read file" and "search in file" tools forever. Sometimes it inserts the same code several times (after checking whether the inserts went fine).
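For anyone who wants to repeat the narrowing-down on their own setup, it can be sketched as a binary search over build numbers. This is only a sketch: `passes` is a hypothetical callback (not part of llama.cpp) that runs the test task against a given build and reports success, and the search assumes a single good-to-bad transition. Because the task is non-deterministic, `passes` should probably run it several times per build before deciding.

```python
from typing import Callable, Sequence


def first_bad_build(builds: Sequence[int], passes: Callable[[int], bool]) -> int:
    """Return the first build for which `passes` is False.

    Assumptions (not verified here): `builds` is sorted ascending,
    builds[0] is known good, builds[-1] is known bad, and there is
    exactly one good-to-bad transition in between.
    """
    lo, hi = 0, len(builds) - 1  # lo: known good index, hi: known bad index
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(builds[mid]):
            lo = mid  # regression was introduced after this build
        else:
            hi = mid  # regression is at or before this build
    return builds[hi]
```

With the builds mentioned in this report, a stand-in `passes` such as `lambda b: b < 7371` makes the search stop at 7371.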

I don't know how to provide a reproducible example, because it involves several MCP servers and proprietary code. I hope the data I've provided is enough; I see that b7371 has some breaking changes, so hopefully the fix will be easy.

First Bad Commit

I think it's release b7371

Relevant log output

The logs look absolutely normal. I ran a diff on them and only two strings differ: "ggml_metal_library_init: loaded <time>" and "build: <build>".
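A minimal sketch of that log comparison, for reference: it masks the two lines that are expected to differ and diffs the rest. The regular expressions are assumptions based only on the two strings quoted above, so they may need adjusting to the real log format; `normalize` and `diff_logs` are illustrative names.

```python
import difflib
import re

# Lines expected to differ between builds. The patterns are assumptions
# derived from the two strings quoted in this report.
VOLATILE = [
    (re.compile(r"^(ggml_metal_library_init: loaded).*$"), r"\1 <time>"),
    (re.compile(r"^(build:).*$"), r"\1 <build>"),
]


def normalize(line: str) -> str:
    """Replace the build-specific part of a known volatile line with a placeholder."""
    for pattern, repl in VOLATILE:
        line = pattern.sub(repl, line)
    return line


def diff_logs(good: str, bad: str) -> list[str]:
    """Return unified-diff lines between two logs after masking volatile lines."""
    a = [normalize(line) for line in good.splitlines()]
    b = [normalize(line) for line in bad.splitlines()]
    return list(difflib.unified_diff(a, b, "good", "bad", lineterm=""))
```

An empty result means the logs match apart from the masked lines, which is the situation described here.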

Labels

bug
