Fix data loss when inserting duplicate values during a migration#1633
Fix data loss when inserting duplicate values during a migration#1633ggilder wants to merge 6 commits intogithub:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses a critical correctness issue in gh-ost binlog replay when adding unique indexes by switching DML replay semantics to avoid REPLACE-driven row deletion, and adds warning-based failure behavior for PanicOnWarnings during replay (plus a change intended to make panic-aborts recoverable for library usage).
Changes:
- Switch DML insert replay from
REPLACEtoINSERT IGNORE(SQL builder + tests). - Add
SHOW WARNINGSdetection inApplyDMLEventQueries()to fail migrations on unexpected warnings whenPanicOnWarningsis enabled. - Change panic-abort handling from
os.Exit-style fatal logging topanic(err)and add regression tests for duplicate/warning scenarios.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| go/sql/builder.go | Updates generated DML insert statement from REPLACE to INSERT IGNORE. |
| go/sql/builder_test.go | Updates expectations for the new insert statement form. |
| go/logic/applier.go | Adds warning detection/rollback path for replay when PanicOnWarnings is enabled. |
| go/logic/applier_test.go | Adds integration test ensuring duplicate on non-migration unique index triggers failure without data loss. |
| go/logic/applier_delete_insert_test.go | Adds integration tests around UPDATE→DELETE+INSERT behavior with INSERT IGNORE + warnings. |
| go/logic/migrator.go | Changes panic-abort handler from fatal exit to panic(err). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| var sqlWarnings []string | ||
| for rows.Next() { | ||
| var level, message string | ||
| var code int | ||
| if err := rows.Scan(&level, &code, &message); err != nil { |
There was a problem hiding this comment.
rows.Err() is checked before iterating, but errors that occur during iteration will only surface after the loop. Add a rows.Err() check after the for rows.Next() loop so iteration errors don't get ignored in PanicOnWarnings mode.
There was a problem hiding this comment.
Also copied from existing code — may be worth fixing in both places
|
Thanks for the PR @ggilder. This was a known issue when @grodowski introduced |
|
@meiji163 I'm not sure I understand the problem with Scenario 1: Binlog event first, then bulk copy Scenario 2: Bulk copy first, then binlog event Scenario 3: With updates |
…igration indexes
gh-ost could silently lose data when adding a unique index to a column during
a migration, even with the PanicOnWarnings flag enabled. This occurred when:
1. A migration adds a unique index to a column (e.g., email)
2. Rows with duplicate values are inserted into the original table after the
bulk copy phase completes (during postponed cutover)
3. These duplicate rows are applied via binlog replay to the ghost table
**Expected behavior:** Migration fails with clear error
**Actual behavior:** Original rows with duplicate values silently
deleted, data lost
Original table: id PRIMARY KEY, email (no unique constraint)
Ghost table: id PRIMARY KEY, email UNIQUE (being added)
Initial state (after bulk copy):
- Ghost table: (id=1, email='bob@example.com')
(id=2, email='alice@example.com')
During postponed cutover:
- INSERT (id=3, email='bob@example.com') into original table
Binlog replay attempts:
- INSERT (id=3, email='bob@example.com') into ghost table
- Duplicate email='bob@example.com' (conflicts with id=1)
- Row with id=1 silently deleted → DATA LOSS
The DMLInsertQueryBuilder used `REPLACE` statements for binlog event replay:
```sql
REPLACE INTO ghost_table (id, email) VALUES (3, 'bob@example.com');
REPLACE behavior:
- If duplicate PRIMARY KEY: deletes old row, inserts new row (no warning/error)
- If duplicate on OTHER unique index: deletes conflicting row, inserts new row
- NO warnings or errors generated, so PanicOnWarnings cannot detect the issue
The original design assumed REPLACE was needed to handle timing edge cases
where binlog replay might process a row before bulk copy, but this caused
silent data corruption when other unique indexes had duplicates.
Changed DMLInsertQueryBuilder to use INSERT IGNORE instead of REPLACE:
INSERT IGNORE INTO ghost_table (id, email) VALUES (3, 'bob@example.com');
INSERT IGNORE behavior:
- If duplicate on ANY unique index: skip insert, generate WARNING
- Does not delete existing rows
Added warning detection to ApplyDMLEventQueries() when PanicOnWarnings is
enabled:
- Checks SHOW WARNINGS after batch execution
- Ignores duplicates on migration unique key (expected - row already copied)
- FAILS migration for duplicates on other unique indexes
- Transaction rollback ensures no partial state
Edge Case: DELETE+INSERT Conversion
When an UPDATE modifies the migration unique key (e.g., PRIMARY KEY), gh-ost
converts it to DELETE+INSERT within a single transaction:
BEGIN;
DELETE FROM ghost WHERE id=2;
INSERT IGNORE INTO ghost VALUES (3, 'bob@example.com');
COMMIT;
If the INSERT encounters a duplicate on a non-migration unique index:
- With PanicOnWarnings: Warning detected, transaction rolled back, both
DELETE and INSERT undone → no data loss ✓
- Without PanicOnWarnings: DELETE succeeds, INSERT silently skips →
data loss. This further reinforces that PanicOnWarnings should default
to on.
|
I'm removing the code changes to replace |
9780c84 to
57b77b3
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| // Check for warnings when PanicOnWarnings is enabled | ||
| if this.migrationContext.PanicOnWarnings { | ||
| rows, err := tx.Query("SHOW WARNINGS") | ||
| if err != nil { |
There was a problem hiding this comment.
SHOW WARNINGS only reports warnings for the most recently executed statement. Since the DML batch is executed as a single multi-statement, warnings generated by earlier statements in the batch can be missed, which undermines PanicOnWarnings and can reintroduce silent data loss. Consider disabling multi-statement batching when PanicOnWarnings is enabled (execute each statement and check warnings after each), or otherwise capture warning counts per statement.
| this.migrationContext.Log.Warningf("Failed to read SHOW WARNINGS row") | ||
| continue |
There was a problem hiding this comment.
On rows.Scan failure, this logs and continues, which can cause real warnings to be ignored and the transaction to commit even though PanicOnWarnings is enabled. It’s safer to treat scan errors as fatal here (rollback/return error) so warning handling can’t silently degrade.
| this.migrationContext.Log.Warningf("Failed to read SHOW WARNINGS row") | |
| continue | |
| return rollback(err) |
There was a problem hiding this comment.
Copied from existing code; will consider fixing in both places
|
@meiji163 I've pushed up some commits to address Copilot feedback. I still need to make changes to address the issues around |
|
@ggilder Scenario such as this T0: Therefore the update is ignored by binlog applier |
|
@ggilder Aha I forgot that, then it does seem OK. I will do a few internal tests |
Related issue: #1636
Description
This PR fixes a critical data loss bug in gh-ost's
--panic-on-warningsmode that could silently delete rows when adding unique indexes to columns with duplicate values during binlog replay.Problem
gh-ost could silently lose data when migrations add unique indexes to columns, even with
--panic-on-warningsenabled. This occurred due to the use ofREPLACE INTOfor binlog event replay, which silently deletes conflicting rows instead of generating warnings.Example Scenario
Migration: Add unique index to
emailcolumnOriginal table:
id(PRIMARY KEY),email(no unique constraint)Ghost table:
id(PRIMARY KEY),email(UNIQUE - being added)Initial state (after bulk copy):
During postponed cutover:
INSERT (id=3, email='bob@example.com')into original tableREPLACE INTO ghost_table VALUES (3, 'bob@example.com')email='bob@example.com'conflicts with existing rowid=1Root Cause
The
REPLACEstatement behavior:Solution
Replace
REPLACEwithINSERT IGNORE+ Warning DetectionChanged DML replay from:
To:
INSERT IGNORE behavior:
Added warning detection in
ApplyDMLEventQueries()whenPanicOnWarningsis enabled:SHOW WARNINGSafter batch executionscript/cibuildreturns with no formatting errors, build errors or unit test errors.(edited to remove references to
Fatale/panicbehavior, this will be tackled in a separate PR)