-
Notifications
You must be signed in to change notification settings - Fork 127
feat: implement DuckDB filesystem integration for Vortex file handling #6198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
Merging this PR will improve performance by 10.18%
Performance Changes
Comparing Footnotes
|
0ax1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking the time to look into this. There's a couple of issues around thread safety / locking, error handling and memory leaks.
If you used AI to assist in writing the code for your PR please mention this in your PR description (see: https://github.com/vortex-data/vortex/blob/develop/CONTRIBUTING.md).
| |_| {} | ||
| ); | ||
|
|
||
| #[derive(Clone, Copy)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need a separate type (SendableClientContext) given that Send + Sync is set on ClientContext directly now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it separates ownership/lifetime semantics from thread-safe handle passing.
ClientContext is a wrapper! type with owned state and Drop logic. It is not Copy and should not be casually moved across async boundaries or spawned tasks.
On the other hand, SendableClientContext is a tiny, Copy wrapper around the raw duckdb_vx_client_context pointer with no destructor. It’s just a handle, which is ideal for passing into async or spawn_blocking code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. Let's add a SAFETY comment on the wrapper stating that the same guarantees apply as for ClientContext.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
vortex-duckdb/src/scan.rs
Outdated
| vortex_bail!("GCS glob expansion not yet implemented") | ||
| } | ||
| _ => { | ||
| let paths = glob(file_glob.as_ref()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we also hook into the duckdb_fs_glob for local paths? Or asked differently, should we make any assumptions at all here about globbing details of duckdb_fs_glob but rather unconditionally delegate to duckdb? Whether or not gcs is supported is now a Duckdb implementation detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion is to just do this let file_urls = duckdb_fs_glob(..).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I will move the local file handling logic into duckdb_fs_glob
0ax1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there, thanks for taking the time to address the comments!
vortex-duckdb/src/scan.rs
Outdated
| fn pushdown_complex_filter( | ||
| bind_data: &mut Self::BindData, | ||
| expr: &duckdb::Expression, | ||
| expr: &crate::duckdb::Expression, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep duckdb::Expression
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I’ll rename it to just Expression and add use crate::duckdb::Expression; at the top of the file.
| typedef struct { | ||
| const char **entries; | ||
| size_t count; | ||
| } duckdb_vx_string_list; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only use this as a URI list, and this currently lives in filesystem.hpp. Let's rename this to duckdb_vx_uri_list to be explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
Signed-off-by: Ruoshi <me@ruoshi.li>
Uh oh!
There was an error while loading. Please reload this page.