[wip][spark] calling merge into on DE table should not update indexed columns #7028

steFaiz · 2026-01-13T10:08:47Z

Purpose

This PR adds a check to DataEvolutionMergeInto in Spark that prevents users from updating global-indexed fields.
Without it, indexed scans would return wrong results.

Linked issue: none

Tests

Please see org.apache.paimon.spark.sql.RowTrackingTestBase

API and Format

None

Documentation

Will be added ASAP

steFaiz · 2026-01-13T10:13:34Z

For now I simply compare the set of indexed fields against the set of updated fields. We could also push a Map<Partition, List<IndexedColumns>> (or something similar) down to the DataEvolutionPaimonWriter and do the check at the partition level.
But I'm not sure it's worth the effort. @JingsongLi What do you think?
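
To make the two variants concrete, here is a minimal sketch of the table-level check and the partition-level alternative. The class and method names (`IndexedColumnCheck`, `conflicts`, `checkTableLevel`, `checkPartitionLevel`) are hypothetical and not part of Paimon; partitions are represented as plain strings and column names are compared case-sensitively, both of which are simplifying assumptions.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Paimon's API. */
final class IndexedColumnCheck {

    /** Returns the updated columns that also carry a global index, in update order. */
    static Set<String> conflicts(List<String> updatedColumns, Set<String> indexedColumns) {
        Set<String> result = new LinkedHashSet<>(updatedColumns);
        result.retainAll(indexedColumns);
        return result;
    }

    /** Table-level check: reject the MERGE INTO if any updated column is indexed. */
    static void checkTableLevel(List<String> updatedColumns, Set<String> indexedColumns) {
        Set<String> bad = conflicts(updatedColumns, indexedColumns);
        if (!bad.isEmpty()) {
            throw new UnsupportedOperationException(
                    "MERGE INTO must not update global-indexed columns: " + bad);
        }
    }

    /**
     * Partition-level variant: only the indexed columns of the partition being
     * written are considered. Partitions without an entry have no indexed columns.
     */
    static void checkPartitionLevel(
            String partition,
            List<String> updatedColumns,
            Map<String, List<String>> indexedColumnsByPartition) {
        List<String> indexed = indexedColumnsByPartition.getOrDefault(partition, List.of());
        checkTableLevel(updatedColumns, Set.copyOf(indexed));
    }
}
```

The table-level check can run once at planning time, while the partition-level variant would have to run in the writer, where the touched partitions are known. It would allow updates that only touch partitions without an index, at the cost of shipping the map to every writer.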

JingsongLi

That's a very good question. I think we can create an option to describe what to do when data is updated. For example:

Indexed columns are not allowed to be updated.
Allow the update and remove the index from the updated column.
Allow the update without changing the index; the user is responsible for rebuilding the index.
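
As a rough sketch, the three behaviours could be modelled as an enum that the option value is parsed into. The enum name, its constants, and the kebab-case option values are all hypothetical; no such Paimon option exists in this PR.

```java
import java.util.Locale;

/** Hypothetical policy enum; names are illustrative, not an existing Paimon option. */
enum IndexedColumnUpdateAction {
    /** Indexed columns are not allowed to be updated. */
    THROW_ERROR,
    /** Apply the update and remove the index from the updated column. */
    DROP_INDEX,
    /** Apply the update and leave the index untouched; the user rebuilds it. */
    KEEP_INDEX;

    /** Parses a value such as "throw-error" or "DROP_INDEX". */
    static IndexedColumnUpdateAction parse(String value) {
        String normalized = value.trim().toUpperCase(Locale.ROOT).replace('-', '_');
        return valueOf(normalized); // throws IllegalArgumentException on unknown values
    }
}
```

Defaulting such an option to the strictest behaviour would match what this PR currently enforces.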

steFaiz · 2026-01-13T13:11:42Z

> That's a very good question. I think we can create an option to describe what to do when data is updated. For example:

@JingsongLi Thanks for the insightful advice! I've converted this PR to a draft and will work on this. Once I have a basic design, I'll create an issue and reopen this PR.

JingsongLi · 2026-01-19T01:43:57Z

@steFaiz You can investigate how other systems handle the relationship between data updates and indexing.

steFaiz · 2026-01-19T02:31:18Z

> @steFaiz You can investigate how other systems handle the relationship between data updates and indexing.

@JingsongLi Thanks for your suggestion! Apologies for the delayed response. I caught the flu and was out for about a week. I’m back now and will follow up on this today.

My understanding is that Lance currently takes a simple approach: when a small portion of the data changes, it directly marks the corresponding part of the index as invalid, and then relies on the user to explicitly trigger an update to rebuild/refresh it. In our case, the index isn’t updated frequently either, so I think our scenario is fairly similar to Lance’s.

JingsongLi · 2026-01-19T02:58:24Z

Sounds reasonable.

[spark] calling merge into on DE table should not update indexed columns. (commit b218294)

JingsongLi reviewed Jan 13, 2026

steFaiz changed the title ~~[spark] calling merge into on DE table should not update indexed columns~~ [wip][spark] calling merge into on DE table should not update indexed columns Jan 13, 2026

steFaiz marked this pull request as draft January 13, 2026 13:08

steFaiz mentioned this pull request Jan 19, 2026

[Feature] Properly deal with global index in MergeInto #7079 (Open, 2 tasks)
