DNS / hostname resolution failure for S3 read_parquet() read with high thread count #403

@ghukill

What happens?

When I increase the thread count (my failing example is 64 threads on a 10-core machine), reading Parquet files from S3 fails with the following error:

IOException: IO Error: Could not resolve hostname error for HTTP GET

This is with DuckDB version 1.5.1.

A thread count of ~8 always works; a thread count of 32 eventually fails, but takes longer to do so.

NOTE: this worked fine with versions <=1.4. We had tested with as many as 128-256 threads, and 64 was a nice performance sweet spot for this particular operation.
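For anyone trying to narrow down where the failures start, here is a rough sketch of a thread-count sweep. The helper names (`first_failing_thread_count`, `run_s3_count`) are my own and not part of DuckDB, and the bucket/path placeholders have to be filled in. Since the failure is intermittent, each thread count is tried a few times; a clean sweep does not prove a given count is safe.

```python
from typing import Callable, Iterable, Optional


def first_failing_thread_count(
    run_with_threads: Callable[[int], None],
    thread_counts: Iterable[int],
    attempts: int = 3,
) -> Optional[int]:
    """Return the smallest thread count for which any attempt raises, else None."""
    for n in sorted(thread_counts):
        for _ in range(attempts):
            try:
                run_with_threads(n)
            except Exception as exc:  # broad on purpose: we only want to see what fails
                print(f"threads={n}: failed with {type(exc).__name__}: {exc}")
                return n
        print(f"threads={n}: ok ({attempts} attempts)")
    return None


def run_s3_count(n_threads: int) -> None:
    """Run the query from this issue with a given thread count.

    Hypothetical helper: needs the duckdb package, AWS credentials, and real
    values in place of the <bucket>/<path> placeholders.
    """
    import duckdb  # imported lazily so the sweep helper works without it

    conn = duckdb.connect()
    conn.execute("install httpfs; load httpfs;")
    conn.execute(f"SET threads = {int(n_threads)};")
    conn.execute("""
        select count(*) from read_parquet(
            's3://<bucket>/<path>/**/*.parquet',
            hive_partitioning=true,
            filename=true
        );
    """).fetchall()
```

With credentials in place, `first_failing_thread_count(run_s3_count, [8, 16, 32, 64])` would report the first thread count that raised during the sweep.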

To Reproduce

I'm setting AWS credentials via environment variables (but I also saw this with the SSO credential chain).

This works ✅ :

import duckdb

conn = duckdb.connect()
conn.execute("install httpfs; load httpfs;")

print(conn.query("""
   select count(*) from read_parquet(
       's3://<bucket>/<path>/**/*.parquet',
       hive_partitioning=true,
       filename=true
   )
   limit 3;
"""))

This throws an error 🚫 :

import duckdb

conn = duckdb.connect()
conn.execute("install httpfs; load httpfs;")

conn.execute("SET threads = 64;")  # <-- the only change from the working example

print(conn.query("""
   select count(*) from read_parquet(
       's3://<bucket>/<path>/**/*.parquet',
       hive_partitioning=true,
       filename=true
   )
   limit 3;
"""))

Error:

---------------------------------------------------------------------------
IOException                               Traceback (most recent call last)
Cell In[3], line 8
      4 conn.execute("install httpfs; load httpfs;")
      6 conn.execute("SET threads = 64;")
----> 8 print(conn.query("""
      9    select count(*) from read_parquet(
     10        's3://<bucket>/<path>/**/*.parquet',
     11        hive_partitioning=true,
     12        filename=true
     13    )
     14    limit 3;
     15 """))

IOException: IO Error: Could not resolve hostname error for HTTP GET to 'https://<bucket>.s3.us-east-1.amazonaws.com/<path>/year%3D2025/month%3D02/day%3D03/3fbf80ad-afad-40a4-b3bb-0cbe7fba6076-0.parquet'
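As a stopgap while this is open, one possible workaround is to retry the query at progressively lower thread counts when the hostname error shows up. This is only a sketch under my own assumptions: `run_with_thread_fallback` is a hypothetical helper, the default counts (64, 32, 8) come from the numbers reported above, and matching on the error message text is a guess at how to recognize this failure. It does not fix the underlying problem.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def _looks_like_dns_failure(exc: Exception) -> bool:
    # Assumption: the failure is recognizable by the message seen in this issue.
    return "Could not resolve hostname" in str(exc)


def run_with_thread_fallback(
    run_query: Callable[[int], T],
    thread_counts: Sequence[int] = (64, 32, 8),
    is_retryable: Callable[[Exception], bool] = _looks_like_dns_failure,
) -> T:
    """Call run_query(n) for each thread count in order until one succeeds.

    Errors that is_retryable() rejects are re-raised immediately. If every
    thread count fails, the last error is raised.
    """
    last_error: Optional[Exception] = None
    for n in thread_counts:
        try:
            return run_query(n)
        except Exception as exc:
            if not is_retryable(exc):
                raise
            last_error = exc
    if last_error is None:
        raise ValueError("thread_counts must not be empty")
    raise last_error
```

Here `run_query` would be a function that opens a connection, runs `SET threads = <n>;`, executes the `read_parquet()` query, and returns its result.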

OS:

OSX

DuckDB Package Version:

1.5.1

Python Version:

3.12.6

Full Name:

Graham Hukill

Affiliation:

MIT Libraries

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

No - I cannot easily share my data sets due to their large size

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration to reproduce the issue?

  • Yes, I have
