@steFaiz
Contributor

@steFaiz steFaiz commented Jan 13, 2026

Purpose

This PR adds a check to DataEvolutionMergeInto in Spark that prevents users from updating fields covered by a global index.
Without this check, index-based scans would return wrong results.
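
As a rough illustration of the idea (not the actual Paimon code), the check amounts to intersecting the columns assigned by the merge with the columns that carry a global index, and failing if the intersection is non-empty. The function name and its arguments are hypothetical.

```python
def check_no_indexed_column_updated(updated_columns, indexed_columns):
    """Raise if any column being updated is covered by a global index.

    Both arguments are plain iterables of column names; the names are
    hypothetical and do not mirror Paimon's real API.
    """
    conflicts = sorted(set(updated_columns) & set(indexed_columns))
    if conflicts:
        raise ValueError(
            "MERGE INTO must not update global-indexed columns: "
            + ", ".join(conflicts)
        )


# Updating a non-indexed column passes silently.
check_no_indexed_column_updated(["price"], ["id", "name"])

# Touching an indexed column is rejected before any data is written.
try:
    check_no_indexed_column_updated(["name", "price"], ["id", "name"])
except ValueError as e:
    print(e)
```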

Linked issue: none

Tests

Please see org.apache.paimon.spark.sql.RowTrackingTestBase

API and Format

None

Documentation

Will be added ASAP

@steFaiz
Contributor Author

steFaiz commented Jan 13, 2026

For now, I simply compare the set of indexed fields against the set of updated fields. We could also push a Map<Partition, List<IndexedColumns>> (or something similar) down to the DataEvolutionPaimonWriter and check at the partition level.
But I'm not sure whether it's worth it. @JingsongLi, could you share your opinion?
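
A minimal sketch of what the partition-level variant might look like, assuming the writer receives a mapping from partition to the columns that have index files there. The mapping shape, the partition keys, and the function name are all assumptions made for illustration.

```python
def partitions_with_conflicts(indexed_by_partition, touched_partitions, updated_columns):
    """Return {partition: conflicting columns} for partitions the merge touches.

    indexed_by_partition: dict mapping a partition key to the list of columns
    that have index files in that partition (hypothetical shape).
    """
    updated = set(updated_columns)
    conflicts = {}
    for partition in touched_partitions:
        hit = sorted(updated & set(indexed_by_partition.get(partition, [])))
        if hit:
            conflicts[partition] = hit
    return conflicts


indexed = {"dt=2026-01-12": ["name"], "dt=2026-01-13": []}
# Only partitions whose index actually covers an updated column are reported.
print(partitions_with_conflicts(indexed, ["dt=2026-01-12", "dt=2026-01-13"], ["name"]))
```

Compared with the table-level check, this would let a merge proceed when it only touches partitions that have not been indexed yet, at the cost of carrying the extra mapping to the writer.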

Contributor

@JingsongLi JingsongLi left a comment

That's a very good question. I think we can create an option to describe what to do when data is updated. For example:

  1. Indexed columns are not allowed to be updated.
  2. The update is applied and the index on the updated column is removed.
  3. The update is applied without changing the index, and the user is responsible for rebuilding the index.
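
One way such an option could be modeled, purely as a sketch: an enum with one value per behavior listed above, plus a function that decides whether the update may proceed and which indexes would need to be dropped. The option values and names are hypothetical and not an existing Paimon option.

```python
from enum import Enum


class IndexedColumnUpdateAction(Enum):
    # Hypothetical option values, one per behavior in the list.
    THROW_ERROR = "throw-error"
    DROP_INDEX = "drop-index"
    IGNORE = "ignore"


def plan_update(action, updated_columns, indexed_columns):
    """Return (allowed, indexes_to_drop) for an update under the given action."""
    conflicts = sorted(set(updated_columns) & set(indexed_columns))
    if not conflicts or action is IndexedColumnUpdateAction.IGNORE:
        # Nothing indexed is touched, or the user takes over rebuilding.
        return True, []
    if action is IndexedColumnUpdateAction.DROP_INDEX:
        # Proceed, but the indexes on the updated columns must be removed.
        return True, conflicts
    # THROW_ERROR: reject the update.
    return False, []
```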

@steFaiz steFaiz changed the title [spark] calling merge into on DE table should not update indexed columns [wip][spark] calling merge into on DE table should not update indexed columns Jan 13, 2026
@steFaiz steFaiz marked this pull request as draft January 13, 2026 13:08
@steFaiz
Contributor Author

steFaiz commented Jan 13, 2026

> That's a very good question. I think we can create an option to describe what to do when data is updated. For example:

@JingsongLi Thanks for your insightful advice! I've converted this PR to a draft and will work on this. I'll create an issue once I have a basic design, and then reopen this PR.

@JingsongLi
Contributor

@steFaiz You can investigate how other systems handle the relationship between data updates and indexing.

@steFaiz
Contributor Author

steFaiz commented Jan 19, 2026

> @steFaiz You can investigate how other systems handle the relationship between data updates and indexing.

@JingsongLi Thanks for your suggestion! Apologies for the delayed response. I caught the flu and was out for about a week. I’m back now and will follow up on this today.

My understanding is that Lance currently takes a simple approach: when a small portion of the data changes, it directly marks the corresponding part of the index as invalid, and then relies on the user to explicitly trigger an update to rebuild/refresh it. In our case, the index isn’t updated frequently either, so I think our scenario is fairly similar to Lance’s.
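
To make that concrete, here is a toy model of the behavior as I described it above. It is not Lance's API; the class and method names are made up, and "fragment" just stands for whatever unit of data an index part covers.

```python
class FragmentedIndex:
    """Toy index split into fragments, each of which can be marked stale."""

    def __init__(self, fragment_ids):
        self.valid = set(fragment_ids)
        self.stale = set()

    def mark_updated(self, fragment_ids):
        # An update only invalidates the index parts for fragments it touched.
        touched = set(fragment_ids) & self.valid
        self.valid -= touched
        self.stale |= touched

    def usable_for(self, fragment_id):
        # Readers may use the index only where it is still valid and must
        # fall back to a plain scan for stale fragments.
        return fragment_id in self.valid

    def rebuild(self):
        # Explicit, user-triggered refresh of the stale parts.
        self.valid |= self.stale
        self.stale.clear()
```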

@JingsongLi
Contributor

> > @steFaiz You can investigate how other systems handle the relationship between data updates and indexing.
>
> @JingsongLi Thanks for your suggestion! Apologies for the delayed response. I caught the flu and was out for about a week. I’m back now and will follow up on this today.
>
> My understanding is that Lance currently takes a simple approach: when a small portion of the data changes, it directly marks the corresponding part of the index as invalid, and then relies on the user to explicitly trigger an update to rebuild/refresh it. In our case, the index isn’t updated frequently either, so I think our scenario is fairly similar to Lance’s.

Sounds reasonable.
