Skip to content

Conversation

@markjschreiber
Copy link
Contributor

Adhoc S3 Buckets Implementation

Overview

This implementation adds support for adhoc S3 bucket searching to the genomics file search tool, allowing users to search buckets that are not part of the standard configuration.

Changes Made

1. Model Updates (models/search.py)

  • Added adhoc_s3_buckets: Optional[List[str]] field to GenomicsFileSearchRequest
  • Added comprehensive validation for the new field:
    • Format validation using existing S3 utilities
    • Automatic path normalization

2. Validation Utilities (utils/validation_utils.py)

  • Added validate_adhoc_s3_buckets() function
  • Leverages existing validate_bucket_access() from s3_utils
  • Gracefully handles validation failures (logs warnings, continues with valid buckets)

3. Search Orchestrator (search/genomics_search_orchestrator.py)

  • Added _get_all_s3_bucket_paths() method to combine configured and adhoc buckets
  • Added _search_s3_with_timeout_for_buckets() for bucket-specific searches
  • Added _search_s3_paginated_with_timeout_for_buckets() for paginated searches
  • Updated cache key generation to include adhoc buckets
  • Updated both regular and paginated search flows

4. Tool Function (tools/genomics_file_search.py)

  • Added adhoc_s3_buckets parameter with comprehensive documentation
  • Updated logging to include adhoc bucket information
  • Updated docstring to document new functionality

Features

Security & Validation

  • Format Validation: All bucket paths are validated for proper S3 URI format
  • Access Validation: Bucket access is tested before searching
  • Graceful Degradation: Invalid/inaccessible buckets are skipped with warnings
  • Rate Limiting: Maximum 50 adhoc buckets per request

Performance

  • Caching: Adhoc buckets are included in cache keys for proper isolation
  • Parallel Processing: Adhoc buckets use the same optimized parallel search logic
  • Pagination: Full support for both regular and storage-level pagination

Compatibility

  • Backward Compatible: Existing functionality unchanged
  • Optional Parameter: Adhoc buckets are completely optional
  • Consistent API: Uses same search logic and response format

Usage Examples

Basic Usage

search_genomics_files(
    file_type="fastq",
    search_terms=["sample123"],
    adhoc_s3_buckets=["s3://my-research-bucket/genomics/"]
)

Multiple Adhoc Buckets

search_genomics_files(
    file_type="bam",
    search_terms=["alignment", "sorted"],
    adhoc_s3_buckets=[
        "s3://project-a-bucket/alignments/",
        "s3://project-b-bucket/bam-files/",
        "s3://temp-analysis-bucket/"
    ]
)

With Pagination

search_genomics_files(
    file_type="vcf",
    enable_storage_pagination=True,
    pagination_buffer_size=1000,
    adhoc_s3_buckets=["s3://large-variant-dataset/"]
)

Error Handling

  • Invalid bucket paths: Validation errors with specific error messages
  • Access denied: Logged as warnings, search continues with accessible buckets
  • Network errors: Handled gracefully, search continues with other buckets
  • Too many buckets: Clear validation error with limit information

Testing

The implementation includes comprehensive validation and testing:

  • 17 new test cases covering all functionality
  • All existing tests updated and passing (1015 total tests)

Benefits

  1. Flexibility: Search any accessible S3 bucket without configuration changes
  2. Security: Proper validation ensures only accessible buckets are searched
  3. Performance: Leverages existing optimization and caching
  4. Reliability: Graceful error handling prevents search failures
  5. Consistency: Same search quality and features as configured buckets

Checklist

If your change doesn't seem to apply, please leave them unchecked.

  • I have reviewed the contributing guidelines
  • I have performed a self-review of this change
  • Changes have been tested
  • Changes are documented

Is this a breaking change? (Y/N)
N

RFC issue number:

Checklist:

  • Migration process documented
  • Implement warnings (if it can live side by side)

Acknowledgment

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

@awslabs awslabs deleted a comment from codecov bot Dec 15, 2025
@awslabs awslabs deleted a comment from codecov bot Dec 15, 2025
@codecov
Copy link

codecov bot commented Dec 15, 2025

Codecov Report

❌ Patch coverage is 91.76471% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.76%. Comparing base (b4522cb) to head (7e36add).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...wslabs/aws_healthomics_mcp_server/models/search.py 82.60% 2 Missing and 2 partials ⚠️
..._mcp_server/search/genomics_search_orchestrator.py 94.00% 3 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1948   +/-   ##
=======================================
  Coverage   90.76%   90.76%           
=======================================
  Files         860      860           
  Lines       64522    64600   +78     
  Branches    10464    10476   +12     
=======================================
+ Hits        58563    58636   +73     
- Misses       3653     3656    +3     
- Partials     2306     2308    +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

alxawan
alxawan previously approved these changes Dec 15, 2025
Copy link

@alxawan alxawan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@scottschreckengaust scottschreckengaust added ready-for-merge Folks believe this is ready to merge waiting-for-codeowners Code owners are needed to review and removed ready-for-merge Folks believe this is ready to merge labels Dec 16, 2025
@scottschreckengaust scottschreckengaust moved this from To triage to In progress in awslabs/mcp Project Dec 16, 2025
@markjschreiber markjschreiber dismissed stale reviews from a-li and alxawan via 2723564 December 16, 2025 16:42
@markjschreiber markjschreiber requested a review from a-li December 16, 2025 16:47
Copy link

@a-li a-li left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link

@alxawan alxawan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@markjschreiber markjschreiber added this pull request to the merge queue Dec 16, 2025
Merged via the queue into awslabs:main with commit a9b921c Dec 16, 2025
145 checks passed
@markjschreiber markjschreiber deleted the feat/search-adhoc-s3 branch December 16, 2025 18:10
@github-project-automation github-project-automation bot moved this from In progress to Done in awslabs/mcp Project Dec 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

waiting-for-codeowners Code owners are needed to review

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants