Introduces a new `shared/` module structure to support application separation and code reuse across multiple Streamlit apps.

## Structure

### shared/utils/
- `clustering.py`: Dimensionality reduction (PCA, t-SNE, UMAP) and K-means clustering with multi-backend support (sklearn, FAISS, cuML)
- `io.py`: File I/O utilities for embeddings and data persistence
- `models.py`: Shared data models and type definitions

### shared/services/
- `clustering_service.py`: High-level clustering workflow orchestration
- `embedding_service.py`: Image embedding generation using various models
- `file_service.py`: File discovery and validation services

### shared/components/
- `clustering_controls.py`: Streamlit UI controls for backend selection, seed configuration, and worker settings
- `summary.py`: Cluster summary statistics and representative images
- `visualization.py`: Scatter plot visualization with Altair

### shared/lib/
- `progress.py`: Progress tracking utilities for long-running operations

## Backend Support
- sklearn: Default CPU backend for all operations
- FAISS: Optional GPU/CPU accelerated K-means clustering
- cuML: Optional RAPIDS GPU acceleration for dim reduction and clustering with automatic fallback on unsupported architectures

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Introduces `apps/embed_explore/` as a self-contained Streamlit app for interactive image embedding exploration and clustering.

## Application Structure

### apps/embed_explore/
- `app.py`: Main application entry point with two-column layout (sidebar controls + main visualization area)

### apps/embed_explore/components/
- `sidebar.py`: Complete sidebar UI with embedding and clustering sections, model selection, and backend configuration
- `summary.py`: Cluster statistics display and representative images
- `visualization.py`: Interactive scatter plot with image preview panel

## Features
- Directory-based image loading with supported format filtering
- Multiple embedding model support (DINOv2, OpenCLIP, etc.)
- Configurable dimensionality reduction (PCA, t-SNE, UMAP)
- K-means clustering with adjustable cluster count
- Interactive Altair scatter plot with click-to-preview
- Cluster summary statistics with representative samples

## Usage
Run as standalone app: `streamlit run apps/embed_explore/app.py`

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Updates existing components to use the new shared module and removes legacy code that has been superseded by the app separation.

## Removed
- `app.py`: Legacy monolithic entry point (replaced by apps/)
- `components/clustering/`: Entire directory moved to shared/ and apps/
- `pages/01_Clustering.py`: Now available as standalone embed_explore app

## Updated Imports
- `components/precalculated/sidebar.py`: Uses shared.services and shared.components for clustering functionality
- `pages/02_Precalculated_Embeddings.py`: Uses shared.components for visualization and summary rendering

## pyproject.toml Changes
- Entry points updated:
  - `emb-embed-explore` → apps.embed_explore.app:main
  - `list-models` → shared.utils.models:list_available_models
- Package includes: shared/, apps/
- Dependencies:
  - streamlit>=1.50.0 (updated for new API)
  - numpy<=2.2.0 (compatibility constraint)
- Version path: shared/__init__.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… code

Add new standalone app:
- apps/precalculated/ - Precalculated embeddings explorer with dynamic cascading filters, CUDA auto-detection, and console logging

Features:
- Dynamic filter generation based on parquet columns
- Cascading/dependent filters with AND logic
- Auto backend selection (cuml when CUDA available)
- Console logging for clustering operations
- Image caching to prevent re-fetch on reruns
- State management for record details panel

Clean up legacy code:
- Remove pages/02_Precalculated_Embeddings.py (monolithic page)
- Remove components/ directory (old component structure)
- Remove services/ directory (old services, now in shared/)
- Remove utils/ directory (old utils, now in shared/)
- Remove list_models.py (replaced by entry point)
- Move taxonomy_tree.py to shared/utils/

Update shared module:
- Add taxonomy tree functions to shared/utils/
- Add VRAM error handling utilities to clustering.py
- Fix import paths in summary.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add .interactive() to scatter plots in both apps:
- Scroll wheel to zoom
- Drag to pan
- Double-click to reset

Note: Zoom state resets on app rerun (known Streamlit limitation)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add toggleable density heatmap overlay using Altair's mark_rect with 2D binning. This helps visualize point concentration in crowded areas of the scatter plot while keeping individual points visible. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Streamlit doesn't support selections on multi-view (layered) Altair charts. When density heatmap is shown, disable on_select and show a note to the user that point selection is temporarily unavailable. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Off: normal 0.7 opacity, selection enabled
- Opacity: low 0.15 opacity so overlapping points show density naturally, selection still works
- Heatmap: 2D binned density layer behind points (selection disabled due to Streamlit limitation with layered charts)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add grid resolution slider (10-80 bins) when Heatmap mode is selected
- Replace truncated metadata display with full-width dataframe table
- Show complete UUID and all field values without truncation
- Use compact column layout for density options

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Display Cluster and UUID prominently as markdown (full values, no truncation), then show remaining metadata fields in a scrollable table. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Both apps now use shared/components/visualization.py for scatter plot
- Shared visualization has all features: zoom/pan, density modes, configurable bins
- Dynamic tooltip building works for any data columns
- Added data_version tracking for selection validation
- Moved embed_explore's render_image_preview to separate file
- App-specific visualization.py files now re-export from shared

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add shared/utils/logging_config.py for centralized logging setup
- Add logging to clustering utilities (backend selection, timing)
- Add logging to ClusteringService (workflow steps, timing)
- Add logging to EmbeddingService (model loading, generation stats)
- Add logging to FileService (file operations, timing)
- Replace print() fallback messages with proper logger.warning()
- Fix use_container_width deprecation: use width="stretch" instead

Logging now tracks:
- Which backend is selected (sklearn/cuML/FAISS)
- Operation timing for performance monitoring
- Fallback events when GPU operations fail

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add shared/utils/backend.py with centralized backend utilities:
  - check_cuda_available(): Cached CUDA detection via PyTorch/CuPy
  - resolve_backend(): Auto-resolve to cuML/FAISS/sklearn based on hardware
  - is_oom_error(), is_cuda_arch_error(), is_gpu_error(): Error classification
- Update embed_explore to use robust error handling:
  - Auto-resolve backends based on available hardware
  - Automatic fallback to sklearn on GPU errors
  - Consistent logging of backend selection
- Update precalculated to use shared backend utilities:
  - Remove duplicate check_cuda_available/resolve_backend functions
  - Replace print() with logger calls for consistency

Both apps now have identical backend selection and fallback behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
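The auto-resolution described in this commit can be illustrated with a small sketch. It is not the project's `resolve_backend()`: the function name `resolve_kmeans_backend`, its parameters, and the extra "is the package installed" check are assumptions made for illustration. The 500-sample threshold and the cuML, FAISS, sklearn ordering come from the KMeans row of the pipeline table quoted in the review further down.

```python
from importlib.util import find_spec


def resolve_kmeans_backend(requested: str, n_samples: int, has_cuda: bool) -> str:
    """Pick a KMeans backend name (hypothetical helper, not the project's code)."""
    if requested != "auto":
        # An explicit user choice is returned unchanged.
        return requested
    big_enough = n_samples > 500  # threshold from the quoted pipeline table
    # Assumption: a backend only counts if its package can be located.
    if has_cuda and big_enough and find_spec("cuml") is not None:
        return "cuml"
    if big_enough and find_spec("faiss") is not None:
        return "faiss"
    return "sklearn"
```

For small inputs the sketch always lands on sklearn, whatever is installed, which mirrors the idea that GPU start-up cost is not worth paying for a few hundred points.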
…logging

- Clustering summary is now computed once when clustering runs and stored in session state (clustering_summary, clustering_representatives)
- Summary component displays cached results instead of recomputing on every render (zoom, pan, point selection no longer trigger recompute)
- Added logging for image retrieval:
  - URL fetch timing and size
  - Timeout and error handling with warnings
  - Debug logging for image display
- Removed ClusteringService import from summary component (uses cache)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Visualization logging:
- Log density mode changes (Off/Opacity/Heatmap)
- Log heatmap bin changes
- Log point selection with cluster info
- Log chart render with point count and settings

Image I/O logging (fixed to work with caching):
- Separate cached fetch from logging wrapper
- Log fetch start, success (with size), and failures
- Log image open with dimensions
- Track last displayed image to avoid duplicate logs

All logs use [Visualization] and [Image] prefixes for easy filtering.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause: embed_explore was using its local summary.py which called ClusteringService.generate_clustering_summary() on every render instead of the shared version that uses cached session state results.

Fix:
- Update embed_explore/app.py to import from shared.components.summary
- Update local summary.py to re-export from shared for backwards compat
- Add ISSUES.md to track known issues

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…clip)

Heavy libraries are now only imported when explicitly needed:
- FAISS: loaded when FAISS backend is selected or auto-resolved
- torch/open_clip: loaded when embedding generation is triggered
- cuML: loaded when cuML backend is selected

Changes:
- shared/utils/clustering.py: lazy-load sklearn, UMAP, FAISS, cuML
- shared/utils/models.py: lazy-load open_clip
- shared/services/embedding_service.py: lazy-load torch and open_clip
- shared/components/clustering_controls.py: cache backend availability check
- shared/utils/backend.py: cache FAISS and cuML availability checks

This significantly improves app startup time by avoiding unnecessary imports during module load.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reverting commit d34c33e as the lazy loading implementation made startup performance worse instead of better. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Issue tracking moved to GitHub Issues: - Slow startup: #12 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Delete stale lib/ directory (duplicated in shared/lib/), remove unused imports (pandas from models.py, Counter from taxonomy_tree.py), remove dead error detection functions from clustering __init__, add logs/ to gitignore, and add print_available_models() entry point to models.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ClusteringService.run_clustering_safe() to encapsulate GPU-to-CPU fallback logic, replacing ~100 lines of duplicated error handling in both app sidebars. Enhance logging format with funcName:lineno, add persistent file handler (logs/emb_explorer.log), switch error handlers to logger.exception() for tracebacks, and add data loading/filter logging to precalculated sidebar. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
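The GPU-to-CPU fallback that this commit centralizes can be sketched as a small wrapper. This is an illustration only: `run_with_cpu_fallback` and its callable arguments are hypothetical names, and the sketch falls back on every exception, which is a simplification of whatever error classification `run_clustering_safe()` actually applies.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("fallback_sketch")


def run_with_cpu_fallback(
    gpu_fn: Callable[[], T], cpu_fn: Callable[[], T]
) -> tuple[T, str]:
    """Try the GPU path first; on any exception, log it and rerun on the CPU."""
    try:
        return gpu_fn(), "gpu"
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("GPU path failed; falling back to CPU")
        return cpu_fn(), "cpu"
```

Returning the name of the path that actually ran lets the caller report the fallback in the UI or logs instead of hiding it.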
Wrap scatter chart in @st.fragment so zoom/pan only reruns the chart fragment, not the full page. Only trigger st.rerun(scope="app") when the selected point actually changes. Run cuML UMAP in an isolated subprocess with L2-normalized embeddings to prevent SIGFPE crashes (NN-descent numerical instability with large-magnitude embeddings). Falls back to sklearn UMAP automatically if the subprocess fails. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
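The subprocess isolation mentioned here rests on a general idea: a crash signal such as SIGFPE kills only the child process, and the parent sees a non-zero return code. Below is a minimal standard-library sketch of that idea. The helper names `run_isolated` and `reduce_with_fallback` are hypothetical, and passing a code snippet as a string stands in for however the project actually ships embeddings to its worker.

```python
import subprocess
import sys


def run_isolated(snippet: str, timeout: float = 60.0) -> tuple[bool, str]:
    """Run a Python snippet in a child interpreter so a crash cannot kill the caller."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, ""
    # A non-zero code covers ordinary failures; on POSIX a negative code
    # means the child was terminated by a signal.
    return proc.returncode == 0, proc.stdout.strip()


def reduce_with_fallback(gpu_snippet: str, cpu_result: str) -> str:
    """Use the child's output if it succeeded, otherwise the CPU result."""
    ok, out = run_isolated(gpu_snippet)
    return out if ok else cpu_result
```

The cost of this design is process start-up and data transfer on every call, which is usually acceptable for a long-running reduction step.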
Add GPU acceleration section to README explaining optional GPU support with CUDA 12/13 install commands. Create docs/DATA_FORMAT.md documenting expected parquet schema for precalculated app. Split pyproject.toml GPU extras into gpu-cu12/gpu-cu13 groups and add pynvml dependency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Apply L2 normalization to all embeddings before clustering and dimensionality reduction via _prepare_embeddings(). This prevents cuML UMAP SIGFPE crashes from large-magnitude vectors and is appropriate for CLIP-family contrastive embeddings. Log input norms, non-finite values, and embedding shapes at each pipeline step. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
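L2 normalization scales every embedding to unit Euclidean length, so downstream distance computations no longer see large magnitudes. The sketch below is not the project's `_prepare_embeddings()`. It uses plain Python lists so that it runs without NumPy, and the handling of zero-length and non-finite rows is an assumption.

```python
import math


def l2_normalize(rows: list[list[float]], eps: float = 1e-12) -> list[list[float]]:
    """Scale each row to unit Euclidean length; near-zero rows are left unchanged."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        if not math.isfinite(norm):
            # Assumption: non-finite input is rejected rather than silently passed on.
            raise ValueError("embedding contains NaN or infinite values")
        out.append([v / norm for v in row] if norm > eps else list(row))
    return out
```

After normalization, squared Euclidean distance between two rows equals `2 - 2 * cosine_similarity`, which is one reason the step suits contrastive embeddings that are compared by cosine similarity.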
Document the full embedding pipeline (preparation, KMeans, dim reduction, visualization) with backend details and fallback chain. Link from README. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add comprehensive test suite covering shared utilities, clustering logic, backend detection, PyArrow filters, taxonomy tree, and logging config. All tests pass on both CPU (login nodes) and GPU (Pitzer V100). Also addresses valid Copilot PR review comments: remove unused variables and imports, simplify lambda in reduce(), add comments to silent except clauses, document numpy cap and faiss-gpu-cu12 in cu13 section. Fix real bug found by tests: build_taxonomic_tree() NaN handling — np.nan is truthy so `val or 'Unknown'` didn't work; replaced with pd.isna(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
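The NaN bug described in this commit is easy to reproduce: NaN is a non-zero float, so it is truthy and `val or 'Unknown'` returns the NaN itself. The sketch below shows the failure and one possible fix. The commit uses `pd.isna()`; this version substitutes `math.isnan` so it runs with the standard library only, and `label_or_unknown` is a hypothetical name.

```python
import math


def label_or_unknown(val: object) -> object:
    """Return 'Unknown' for None, empty strings, and NaN; otherwise the value."""
    if val is None or val == "":
        return "Unknown"
    if isinstance(val, float) and math.isnan(val):
        return "Unknown"
    return val


nan = float("nan")
buggy = nan or "Unknown"       # NaN is truthy, so `or` keeps the NaN
fixed = label_or_unknown(nan)  # explicit NaN check yields "Unknown"
```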
Move sklearn, umap, faiss, cuml, cupy, torch, and open_clip imports from module-level into the functions that use them. Use importlib.util.find_spec() for instant package detection in backend.py and clustering_controls.py dropdown population. Also remove eager re-exports from __init__.py files that were forcing all submodules to load at package import time (shared/utils/__init__.py was importing open_clip via models.py on every startup). App startup drops from ~5.7s to ~1.2s (4.7x faster). Heavy libraries now load on-demand when user clicks "Run Clustering" or "Generate Embeddings". 98 tests pass with no regressions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
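The two techniques in this commit, detecting a package without importing it and deferring the import to first use, can be sketched as follows. The helper names are hypothetical, and the assumption that sklearn is always present exists only to keep the example short.

```python
import importlib
from importlib.util import find_spec


def is_installed(name: str) -> bool:
    """Check whether a top-level package can be located without importing it."""
    return find_spec(name) is not None


def lazy_import(name: str):
    """Import a module on first use; later calls are served from sys.modules."""
    return importlib.import_module(name)


def available_backends() -> list[str]:
    """Populate a backend dropdown without paying any heavy import cost."""
    backends = ["sklearn"]  # assumed always present in this sketch
    backends += [b for b in ("faiss", "cuml") if is_installed(b)]
    return backends
```

`find_spec` only consults the import machinery's finders for a top-level name, so it returns quickly even for packages that take seconds to import.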
The previous constraint excluded valid patch releases (2.2.1+). numba 0.61.x requires numpy<2.3, so align the specifier accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
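The difference between the two specifiers can be checked with plain tuple comparison. This is a simplified sketch that handles only dotted numeric release versions and ignores the rest of the PEP 440 rules (pre-releases, local versions); the function names are hypothetical.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn '2.2.1' into (2, 2, 1); numeric release segments only."""
    return tuple(int(part) for part in version.split("."))


def allowed_old(version: str) -> bool:
    """Old constraint: numpy<=2.2.0."""
    return parse(version) <= (2, 2, 0)


def allowed_new(version: str) -> bool:
    """New constraint: numpy<2.3."""
    return parse(version) < (2, 3)
```

With these definitions `2.2.1` is rejected by the old rule and accepted by the new one, while `2.3.0` is rejected by both.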
Provide project context (HPC/GPU fallback chain, optional deps) and suppress false-positive patterns from Copilot reviews: self-referencing extras, graceful-degradation except clauses, Streamlit scope parameter, CUDA backward-compat FAISS builds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rove tests

- Split main() into CLI launcher and app() layout in both Streamlit apps so pyproject.toml console_scripts invoke the Streamlit server correctly
- Replace regex text filter with pc.match_substring() for safer literal matching
- Fix test_returns_false_without_gpu to patch flags instead of mocking itself
- Generalize run_gpu_tests.sh (require VENV_DIR, placeholder account)
- Update tests/README.md to reflect @pytest.mark.gpu is reserved for future use

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
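Why literal matching is safer than a regex filter for user-typed text can be shown with the standard library alone. The commit itself uses `pc.match_substring()` from PyArrow compute; the sketch below swaps in `re` and the `in` operator on a made-up list of names to show the behavioral difference, and both function names are hypothetical.

```python
import re

names = ["Quercus sp.", "Quercus spp", "Felis (cat)"]


def regex_filter(values: list[str], pattern: str) -> list[str]:
    """Treat the user's text as a regular expression."""
    return [v for v in values if re.search(pattern, v)]


def literal_filter(values: list[str], needle: str) -> list[str]:
    """Treat the user's text as a plain substring."""
    return [v for v in values if needle in v]


# "." matches any character in a regex, so "sp." also hits "spp".
loose = regex_filter(names, "sp.")
exact = literal_filter(names, "sp.")
```

A regex filter can also fail outright on ordinary input: an unbalanced `(` raises `re.error`, whereas the literal filter simply finds the row containing it.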
Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Agent Code Review Experience

Interesting code reviewing process using GH Copilot with this PR. The first couple of attempts were not impressive at all. Some simple mistakes were spotted, such as confusing dependencies, using the wrong date or time ... The suggestions made were not well aligned with improving software robustness and setting directions for continuous development.

Later found the GH documentation suggesting placing a dedicated document to provide project context, expectations, and emphasis. Added that to

Also tried the Claude code review skill, which kick-starts 5 agents using different models that perform code review sequentially, each with a different testing specification. It runs locally. You can enable the
egrace479 left a comment
A few small suggestions, primarily based on our earlier discussion. Looking forward to trying it out with you on Monday!
| # emb-explorer | ||
| **emb-explorer** is a Streamlit-based visual exploration and clustering tool for image datasets and pre-calculated image embeddings. | ||
| Visual exploration and clustering tool for image embeddings. |
| Visual exploration and clustering tool for image embeddings. | |
| Visual exploration and clustering tool for image embeddings. Users can either bring pre-calculated embeddings to explore, or use the interface to embed their images and then explore those embeddings. |
Add a little extra description to prepare for the two features below.
| <h3>🔍 Explore Pre-calculated Embeddings</h3> | ||
| </td> | ||
| <td width="50%" align="center"><b>Embed & Explore</b></td> | ||
| <td width="50%" align="center"><b>Precalculated Embeddings</b></td> |
| <td width="50%" align="center"><b>Precalculated Embeddings</b></td> | |
| <td width="50%" align="center"><b>Precalculated Embedding Exploration</b></td> |
From our earlier discussion: See how this fits or try some other variation to capture the pre-made embeddings --> exploration as the only difference from the embed & explore option. Then just match it with the bolded feature option at line 30.
| ### Example Data | ||
| If running the app on a remote compute node (e.g., HPC cluster), you'll need to set up port forwarding to access the Streamlit interface from your local machine. | ||
| An example dataset (`data/example_1k.parquet`) is provided with BioCLIP 2 embeddings for testing. |
Can you add a README to that folder that describes the example data -- what it is, where it's from, etc.?
| ## Step 1: KMeans Clustering | ||
| Clusters the full high-dimensional embeddings (e.g., 768-d for BioCLIP). |
| Clusters the full high-dimensional embeddings (e.g., 768-d for BioCLIP). | |
| Clusters the full high-dimensional embeddings (e.g., 768-d for BioCLIP 2). |
| | KMeans | cuML if GPU + >500 samples, else FAISS if available + >500 samples, else sklearn | | ||
| | Dim. Reduction | cuML if GPU + >5000 samples, else sklearn | | ||
| Any GPU error (architecture mismatch, missing libraries, OOM) triggers an |
| Any GPU error (architecture mismatch, missing libraries, OOM) triggers an | |
| Any GPU error (architecture mismatch, missing libraries, out of memory (OOM)) triggers an |
| | Column | Type | Feature Enabled | | ||
| |--------|------|-----------------| | ||
| | `identifier` or `image_url` or `url` or `img_url` or `image` | `string` (URL) | **Image preview** in the detail panel. The app tries these column names in order and displays the first valid HTTP(S) image URL found. | | ||
| | `kingdom`, `phylum`, `class`, `order`, `family`, `genus`, `species` | `string` | **Taxonomic tree** summary. Any subset works; missing levels default to "Unknown". At minimum `kingdom` must be present and non-null for a row to appear in the tree. | |
So it'll show the taxonomic tree if just kingdom and species are provided (as an extreme example), and just backfill the in-between levels with "Unknown"?
This PR breaks our old monolithic `app.py` into two focused Streamlit apps under `apps/`

Additionally, we (who's we? me and claude lol) made these improvements:

- … `sklearn` and let you know what happened (in the console output)

Changes made later during PR draft:

- … (`shared/utils/backend.py`) auto-falls back through cuML → FAISS → sklearn. OOM/CUDA arch errors re-raise; everything else gracefully drops to CPU.
- `cuML` UMAP now runs in an isolated subprocess to dodge SIGFPE crashes. All embeddings are L2-normalized before clustering/reduction.
- … `<=2.2.0` to `<2.3` for numba compatibility.
- … `BACKEND_PIPELINE.md`, `DATA_FORMAT.md`, `.github/copilot-instructions.md`, and a simplified test README with SLURM instructions.