[wip][spark] calling merge into on DE table should not update indexed columns #7028
For now, I simply check the updated fields against all indexed fields. I think we could also push a
JingsongLi
left a comment
That's a very good question. I think we can create an option to describe what to do when data is updated. For example:
- Index columns are not allowed to be updated.
- Apply the update and remove the index from the updated column.
- Apply the update without changing the index; the user is responsible for rebuilding the index.
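The three behaviors listed above could be modeled roughly as follows. This is only a sketch: the enum, its values, and `plan_update` are hypothetical names made up for illustration, not existing Paimon options or APIs.

```python
from enum import Enum


class IndexedColumnUpdatePolicy(Enum):
    """Hypothetical option values; none of these names exist in Paimon."""

    REJECT = "reject"          # indexed columns are not allowed to be updated
    DROP_INDEX = "drop-index"  # apply the update and remove the affected index
    KEEP_INDEX = "keep-index"  # apply the update; the user rebuilds the index


def plan_update(updated_cols, indexed_cols, policy):
    """Return the indexed columns whose index should be dropped.

    Raises ValueError when the policy forbids updating indexed columns.
    """
    touched = sorted(set(updated_cols) & set(indexed_cols))
    if not touched:
        return []  # no indexed column is updated, nothing to do
    if policy is IndexedColumnUpdatePolicy.REJECT:
        raise ValueError(f"Cannot update indexed columns: {touched}")
    if policy is IndexedColumnUpdatePolicy.DROP_INDEX:
        return touched
    return []  # KEEP_INDEX: the index is left in place and may be stale
```

With `KEEP_INDEX`, nothing is dropped, so queries that use the index may see stale results until it is rebuilt.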
@JingsongLi Thanks for your insightful advice! I've converted this PR to a draft and will work on this. I will create an issue once I have a basic design, and then reopen this PR.
@steFaiz You can investigate how other systems handle the relationship between data updates and indexing.
@JingsongLi Thanks for your suggestion! Apologies for the delayed response. I caught the flu and was out for about a week. I’m back now and will follow up on this today. My understanding is that Lance currently takes a simple approach: when a small portion of the data changes, it directly marks the corresponding part of the index as invalid, and then relies on the user to explicitly trigger an update to rebuild/refresh it. In our case, the index isn’t updated frequently either, so I think our scenario is fairly similar to Lance’s.
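As a rough illustration of the approach described above (mark the affected part of the index invalid, rebuild only on explicit request), here is a toy model. The `FragmentIndex` class and the per-fragment granularity are assumptions for the sketch; it does not reflect Lance's or Paimon's actual data structures.

```python
class FragmentIndex:
    """Toy index split into per-fragment parts (hypothetical structure)."""

    def __init__(self, fragment_ids):
        # Every part starts out valid.
        self.valid = {fid: True for fid in fragment_ids}

    def invalidate(self, changed_fragment_ids):
        """Mark the index parts covering changed data as invalid."""
        for fid in changed_fragment_ids:
            if fid in self.valid:
                self.valid[fid] = False

    def stale_fragments(self):
        """Fragments that a reader must scan without the index."""
        return sorted(fid for fid, ok in self.valid.items() if not ok)

    def rebuild(self):
        """Explicit, user-triggered refresh; returns the rebuilt fragments."""
        rebuilt = self.stale_fragments()
        for fid in rebuilt:
            self.valid[fid] = True
        return rebuilt
```

In this model a reader would use the index only for valid fragments and fall back to a plain scan for the stale ones until `rebuild()` is called.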
Sounds reasonable.
Purpose
This PR adds a check to DataEvolutionMergeInto in Spark that prevents users from updating global-indexed fields.
Otherwise, the indexed-scan results would be wrong.
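To see why the results would be wrong, consider a toy model in which the index maps each column value to row ids. Everything below (the dict-based table, `build_index`, `indexed_scan`, `full_scan`) is a simplified stand-in invented for illustration, not Paimon code.

```python
# Toy table: row id -> value of the indexed column.
rows = {0: "a", 1: "b", 2: "a"}


def build_index(table):
    """Map each value to the list of row ids that hold it."""
    index = {}
    for row_id, value in table.items():
        index.setdefault(value, []).append(row_id)
    return index


def indexed_scan(index, value):
    """Look up matching row ids through the index only."""
    return index.get(value, [])


def full_scan(table, value):
    """Look up matching row ids by reading the data itself."""
    return [row_id for row_id, v in table.items() if v == value]


index = build_index(rows)

# An update changes the indexed column but leaves the index untouched.
rows[2] = "b"

stale = indexed_scan(index, "b")  # still reflects the old data
fresh = full_scan(rows, "b")      # reflects the data after the update
```

After the update, `stale` misses row 2 while `fresh` includes it, and a lookup for `"a"` through the index still returns row 2 even though it no longer matches.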
Linked issue: none
Tests
Please see org.apache.paimon.spark.sql.RowTrackingTestBase
API and Format
None
Documentation
Will be added ASAP