vortex-datafusion column size statistic #6412

@connortsui20

I don't think this statistic is currently used anywhere upstream, so it's hard for me to tell how they intended it to be used.
The current rustdoc is:

    /// Estimated size of this column's data in bytes for the output.
    ///
    /// Note that this is not the same as the total bytes that may be scanned,
    /// processed, etc.
    ///
    /// E.g. we may read 1GB of data from a Parquet file but the Arrow data
    /// the node produces may be 2GB; it's this 2GB that is tracked here.
    ///
    /// Currently this is accurately calculated for primitive types only.
    /// For complex types (like Utf8, List, Struct, etc), this value may be
    /// absent or inexact (e.g. estimated from the size of the data in the source Parquet files).
    ///
    /// This value is automatically scaled when operations like limits or
    /// filters reduce the number of rows (see [`Statistics::with_fetch`]).

Personally, I'm happy to leave it absent or inexact until we have more clarity on that.
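
For illustration only, here is a rough sketch of what "exact for primitive types, absent otherwise" from the rustdoc above could look like. Everything in it is an assumption made for the example: `Precision` is a local stand-in shaped like DataFusion's `Precision<usize>`, and `ColumnType`, `fixed_width`, `column_byte_size`, and `scale_to_rows` are hypothetical names, not Vortex or DataFusion APIs. The proportional scaling is just one plausible reading of "automatically scaled" and is not necessarily what `Statistics::with_fetch` does.

```rust
// Local stand-in shaped like DataFusion's `Precision<usize>`, defined here
// only so the sketch compiles without any dependencies.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

// Hypothetical, heavily simplified column type (not Arrow's or Vortex's).
#[derive(Debug, Clone, Copy)]
enum ColumnType {
    Int32,
    Int64,
    Float64,
    Utf8,
}

/// Byte width of a single value, if the type is fixed-width.
fn fixed_width(ty: ColumnType) -> Option<usize> {
    match ty {
        ColumnType::Int32 => Some(4),
        ColumnType::Int64 | ColumnType::Float64 => Some(8),
        // Variable-length data: no per-value width to multiply by.
        ColumnType::Utf8 => None,
    }
}

/// Output size of a column in bytes: exact for fixed-width types when the
/// row count is known, absent in every other case.
fn column_byte_size(ty: ColumnType, num_rows: Option<usize>) -> Precision {
    match (fixed_width(ty), num_rows) {
        (Some(width), Some(rows)) => Precision::Exact(width * rows),
        _ => Precision::Absent,
    }
}

/// Assumed proportional scaling when a limit or filter keeps `kept_rows`
/// out of `total_rows`. The result is downgraded to `Inexact`.
fn scale_to_rows(size: Precision, total_rows: usize, kept_rows: usize) -> Precision {
    if total_rows == 0 || kept_rows >= total_rows {
        return size; // nothing to scale
    }
    match size {
        Precision::Exact(bytes) | Precision::Inexact(bytes) => {
            // Widen to u128 so the intermediate product cannot overflow.
            let scaled = bytes as u128 * kept_rows as u128 / total_rows as u128;
            Precision::Inexact(scaled as usize)
        }
        Precision::Absent => Precision::Absent,
    }
}
```

Under these assumptions, an `Int64` column with a known row count gets an exact size, a `Utf8` column stays absent, and any scaled value is reported as inexact, which matches the "absent or inexact" fallback suggested above.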

Originally posted by @AdamGS in #6309 (comment)

Labels: ext/datafusion (Relates to the DataFusion integration)