
Feature/app separation #13

Open
NetZissou wants to merge 32 commits into main from feature/app-separation

Conversation


@NetZissou NetZissou commented Feb 11, 2026

This PR breaks our old monolithic app.py into two focused Streamlit apps under apps/

  • Embed & Explore — import your own images from a local directory, then embed and cluster them
  • Precalculated Embeddings — jump straight into exploring precomputed data

Additionally, we (me and Claude) made these improvements:

  • Improved clustering visualization by enabling zoom-in/out and heatmap options
  • Improved the metadata filtering interface for the precalculated embeddings app, with dynamic filters generated from the parquet schema
  • Added GPU-to-CPU fallback for clustering — if you hit an OOM or CUDA error, it automatically retries on CPU with sklearn and reports what happened in the console output
  • Pulled shared code into common modules (components, services) so both apps stay DRY!
  • Updated the README to clearly explain both workflows with simpler install/usage instructions

Changes made later during PR draft:

  • Unified backend detection (shared/utils/backend.py) auto-falls back through cuML → FAISS → sklearn. OOM/CUDA arch errors re-raise; everything else gracefully drops to CPU. cuML UMAP now runs in an isolated subprocess to dodge SIGFPE crashes. All embeddings are L2-normalized before clustering/reduction.
  • Lazy-loading heavy libraries (torch, cuML, FAISS) cuts startup time by ~4.7× — no more slow cold start
  • Relaxed the numpy cap from <=2.2.0 to <2.3 for numba compatibility.
  • 98 tests passing on both CPU and GPU nodes.
  • Added BACKEND_PIPELINE.md, DATA_FORMAT.md, .github/copilot-instructions.md, and a simplified test README with SLURM instructions.
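The fallback behavior described above can be sketched roughly as follows. Function names here (`run_kmeans_gpu`, `run_kmeans_safe`) are illustrative, not the exact API in `shared/utils/backend.py`:

```python
# Sketch of the GPU -> CPU fallback chain; helper names are hypothetical.
import logging

logger = logging.getLogger(__name__)

def is_gpu_error(exc: Exception) -> bool:
    """Heuristic: treat CUDA/OOM-style messages as GPU failures."""
    msg = str(exc).lower()
    return any(key in msg for key in ("cuda", "out of memory", "cublas"))

def run_kmeans_safe(embeddings, n_clusters, backend="cuml"):
    """Try the requested GPU backend; fall back to sklearn on GPU errors."""
    if backend in ("cuml", "faiss"):
        try:
            return run_kmeans_gpu(embeddings, n_clusters, backend)  # hypothetical GPU path
        except Exception as exc:
            if not is_gpu_error(exc):
                raise  # non-GPU bugs should surface, not be silently masked
            logger.warning("GPU backend %s failed (%s); retrying on CPU", backend, exc)
    from sklearn.cluster import KMeans  # lazy import keeps startup fast
    return KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)
```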

NetZissou and others added 20 commits January 26, 2026 14:13
Introduces a new `shared/` module structure to support application
separation and code reuse across multiple Streamlit apps.

## Structure

### shared/utils/
- `clustering.py`: Dimensionality reduction (PCA, t-SNE, UMAP) and
  K-means clustering with multi-backend support (sklearn, FAISS, cuML)
- `io.py`: File I/O utilities for embeddings and data persistence
- `models.py`: Shared data models and type definitions

### shared/services/
- `clustering_service.py`: High-level clustering workflow orchestration
- `embedding_service.py`: Image embedding generation using various models
- `file_service.py`: File discovery and validation services

### shared/components/
- `clustering_controls.py`: Streamlit UI controls for backend selection,
  seed configuration, and worker settings
- `summary.py`: Cluster summary statistics and representative images
- `visualization.py`: Scatter plot visualization with Altair

### shared/lib/
- `progress.py`: Progress tracking utilities for long-running operations

## Backend Support
- sklearn: Default CPU backend for all operations
- FAISS: Optional GPU/CPU accelerated K-means clustering
- cuML: Optional RAPIDS GPU acceleration for dim reduction and clustering
  with automatic fallback on unsupported architectures

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Introduces `apps/embed_explore/` as a self-contained Streamlit app for
interactive image embedding exploration and clustering.

## Application Structure

### apps/embed_explore/
- `app.py`: Main application entry point with two-column layout
  (sidebar controls + main visualization area)

### apps/embed_explore/components/
- `sidebar.py`: Complete sidebar UI with embedding and clustering
  sections, model selection, and backend configuration
- `summary.py`: Cluster statistics display and representative images
- `visualization.py`: Interactive scatter plot with image preview panel

## Features
- Directory-based image loading with supported format filtering
- Multiple embedding model support (DINOv2, OpenCLIP, etc.)
- Configurable dimensionality reduction (PCA, t-SNE, UMAP)
- K-means clustering with adjustable cluster count
- Interactive Altair scatter plot with click-to-preview
- Cluster summary statistics with representative samples

## Usage
Run as standalone app:
  streamlit run apps/embed_explore/app.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Updates existing components to use the new shared module and removes
legacy code that has been superseded by the app separation.

## Removed
- `app.py`: Legacy monolithic entry point (replaced by apps/)
- `components/clustering/`: Entire directory moved to shared/ and apps/
- `pages/01_Clustering.py`: Now available as standalone embed_explore app

## Updated Imports
- `components/precalculated/sidebar.py`: Uses shared.services and
  shared.components for clustering functionality
- `pages/02_Precalculated_Embeddings.py`: Uses shared.components for
  visualization and summary rendering

## pyproject.toml Changes
- Entry points updated:
  - `emb-embed-explore` → apps.embed_explore.app:main
  - `list-models` → shared.utils.models:list_available_models
- Package includes: shared/, apps/
- Dependencies:
  - streamlit>=1.50.0 (updated for new API)
  - numpy<=2.2.0 (compatibility constraint)
- Version path: shared/__init__.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… code

Add new standalone app:
- apps/precalculated/ - Precalculated embeddings explorer with dynamic
  cascading filters, CUDA auto-detection, and console logging

Features:
- Dynamic filter generation based on parquet columns
- Cascading/dependent filters with AND logic
- Auto backend selection (cuml when CUDA available)
- Console logging for clustering operations
- Image caching to prevent re-fetch on reruns
- State management for record details panel

Clean up legacy code:
- Remove pages/02_Precalculated_Embeddings.py (monolithic page)
- Remove components/ directory (old component structure)
- Remove services/ directory (old services, now in shared/)
- Remove utils/ directory (old utils, now in shared/)
- Remove list_models.py (replaced by entry point)
- Move taxonomy_tree.py to shared/utils/

Update shared module:
- Add taxonomy tree functions to shared/utils/
- Add VRAM error handling utilities to clustering.py
- Fix import paths in summary.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add .interactive() to scatter plots in both apps:
- Scroll wheel to zoom
- Drag to pan
- Double-click to reset

Note: Zoom state resets on app rerun (known Streamlit limitation)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add toggleable density heatmap overlay using Altair's mark_rect with 2D
binning. This helps visualize point concentration in crowded areas of
the scatter plot while keeping individual points visible.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Streamlit doesn't support selections on multi-view (layered) Altair
charts. When density heatmap is shown, disable on_select and show
a note to the user that point selection is temporarily unavailable.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Off: normal 0.7 opacity, selection enabled
- Opacity: low 0.15 opacity so overlapping points show density naturally,
  selection still works
- Heatmap: 2D binned density layer behind points (selection disabled due
  to Streamlit limitation with layered charts)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add grid resolution slider (10-80 bins) when Heatmap mode is selected
- Replace truncated metadata display with full-width dataframe table
- Show complete UUID and all field values without truncation
- Use compact column layout for density options

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Display Cluster and UUID prominently as markdown (full values, no
truncation), then show remaining metadata fields in a scrollable table.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Both apps now use shared/components/visualization.py for scatter plot
- Shared visualization has all features: zoom/pan, density modes, configurable bins
- Dynamic tooltip building works for any data columns
- Added data_version tracking for selection validation
- Moved embed_explore's render_image_preview to separate file
- App-specific visualization.py files now re-export from shared

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add shared/utils/logging_config.py for centralized logging setup
- Add logging to clustering utilities (backend selection, timing)
- Add logging to ClusteringService (workflow steps, timing)
- Add logging to EmbeddingService (model loading, generation stats)
- Add logging to FileService (file operations, timing)
- Replace print() fallback messages with proper logger.warning()
- Fix use_container_width deprecation: use width="stretch" instead

Logging now tracks:
- Which backend is selected (sklearn/cuML/FAISS)
- Operation timing for performance monitoring
- Fallback events when GPU operations fail

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
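The centralized setup might look roughly like this; the format string and logger name are assumptions, not the exact contents of `shared/utils/logging_config.py`:

```python
import logging

def setup_logging(level=logging.INFO):
    """Illustrative stand-in for shared/utils/logging_config.py."""
    fmt = "%(asctime)s %(name)s [%(levelname)s] %(funcName)s:%(lineno)d %(message)s"
    logging.basicConfig(level=level, format=fmt)
    return logging.getLogger("emb_explorer")

logger = setup_logging()
logger.info("backend selected: %s", "sklearn")  # replaces ad-hoc print() calls
```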
- Add shared/utils/backend.py with centralized backend utilities:
  - check_cuda_available(): Cached CUDA detection via PyTorch/CuPy
  - resolve_backend(): Auto-resolve to cuML/FAISS/sklearn based on hardware
  - is_oom_error(), is_cuda_arch_error(), is_gpu_error(): Error classification

- Update embed_explore to use robust error handling:
  - Auto-resolve backends based on available hardware
  - Automatic fallback to sklearn on GPU errors
  - Consistent logging of backend selection

- Update precalculated to use shared backend utilities:
  - Remove duplicate check_cuda_available/resolve_backend functions
  - Replace print() with logger calls for consistency

Both apps now have identical backend selection and fallback behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
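A sketch of the cached CUDA probe and auto-resolution described above — simplified (FAISS omitted) and not the exact implementation in `shared/utils/backend.py`:

```python
from functools import lru_cache
import importlib.util

@lru_cache(maxsize=1)
def check_cuda_available() -> bool:
    """Cached CUDA probe: try PyTorch first, then CuPy (illustrative version)."""
    if importlib.util.find_spec("torch") is not None:
        import torch
        return torch.cuda.is_available()
    if importlib.util.find_spec("cupy") is not None:
        import cupy
        try:
            return cupy.cuda.runtime.getDeviceCount() > 0
        except Exception:
            return False  # CuPy installed but no usable device
    return False

def resolve_backend(requested: str = "auto") -> str:
    """Auto-pick cuML when CUDA is present, else fall back to sklearn."""
    if requested != "auto":
        return requested
    return "cuml" if check_cuda_available() else "sklearn"
```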
…logging

- Clustering summary is now computed once when clustering runs and stored
  in session state (clustering_summary, clustering_representatives)
- Summary component displays cached results instead of recomputing on
  every render (zoom, pan, point selection no longer trigger recompute)
- Added logging for image retrieval:
  - URL fetch timing and size
  - Timeout and error handling with warnings
  - Debug logging for image display
- Removed ClusteringService import from summary component (uses cache)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Visualization logging:
- Log density mode changes (Off/Opacity/Heatmap)
- Log heatmap bin changes
- Log point selection with cluster info
- Log chart render with point count and settings

Image I/O logging (fixed to work with caching):
- Separate cached fetch from logging wrapper
- Log fetch start, success (with size), and failures
- Log image open with dimensions
- Track last displayed image to avoid duplicate logs

All logs use [Visualization] and [Image] prefixes for easy filtering.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause: embed_explore was using its local summary.py which called
ClusteringService.generate_clustering_summary() on every render instead
of the shared version that uses cached session state results.

Fix:
- Update embed_explore/app.py to import from shared.components.summary
- Update local summary.py to re-export from shared for backwards compat
- Add ISSUES.md to track known issues

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…clip)

Heavy libraries are now only imported when explicitly needed:
- FAISS: loaded when FAISS backend is selected or auto-resolved
- torch/open_clip: loaded when embedding generation is triggered
- cuML: loaded when cuML backend is selected

Changes:
- shared/utils/clustering.py: lazy-load sklearn, UMAP, FAISS, cuML
- shared/utils/models.py: lazy-load open_clip
- shared/services/embedding_service.py: lazy-load torch and open_clip
- shared/components/clustering_controls.py: cache backend availability check
- shared/utils/backend.py: cache FAISS and cuML availability checks

This significantly improves app startup time by avoiding unnecessary
imports during module load.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reverting commit d34c33e as the lazy loading implementation
made startup performance worse instead of better.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Issue tracking moved to GitHub Issues:
- Slow startup: #12

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@NetZissou NetZissou self-assigned this Feb 11, 2026
@NetZissou NetZissou marked this pull request as draft February 11, 2026 13:48
NetZissou and others added 6 commits February 11, 2026 11:33
Delete stale lib/ directory (duplicated in shared/lib/), remove unused
imports (pandas from models.py, Counter from taxonomy_tree.py), remove
dead error detection functions from clustering __init__, add logs/ to
gitignore, and add print_available_models() entry point to models.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ClusteringService.run_clustering_safe() to encapsulate GPU-to-CPU
fallback logic, replacing ~100 lines of duplicated error handling in
both app sidebars. Enhance logging format with funcName:lineno, add
persistent file handler (logs/emb_explorer.log), switch error handlers
to logger.exception() for tracebacks, and add data loading/filter
logging to precalculated sidebar.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap scatter chart in @st.fragment so zoom/pan only reruns the chart
fragment, not the full page. Only trigger st.rerun(scope="app") when
the selected point actually changes.

Run cuML UMAP in an isolated subprocess with L2-normalized embeddings
to prevent SIGFPE crashes (NN-descent numerical instability with
large-magnitude embeddings). Falls back to sklearn UMAP automatically
if the subprocess fails.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
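The subprocess isolation could be sketched like this — a throwaway interpreter so a native crash (SIGFPE inside cuML's NN-descent) cannot take down the Streamlit process. The JSON transport and worker body are assumptions, not the repo's actual mechanism:

```python
import json
import subprocess
import sys

# Worker source run in a child interpreter; cuml is only importable on RAPIDS nodes.
_WORKER = """
import json, sys
data = json.load(sys.stdin)
from cuml.manifold import UMAP
coords = UMAP(n_components=2).fit_transform(data)
json.dump([[float(v) for v in row] for row in coords], sys.stdout)
"""

def run_umap_isolated(embeddings):
    proc = subprocess.run(
        [sys.executable, "-c", _WORKER],
        input=json.dumps(embeddings),
        capture_output=True, text=True, timeout=600,
    )
    if proc.returncode != 0:  # crash or import failure: caller falls back to sklearn UMAP
        raise RuntimeError(f"cuML UMAP subprocess failed: {proc.stderr.strip()[:200]}")
    return json.loads(proc.stdout)
```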
Add GPU acceleration section to README explaining optional GPU support
with CUDA 12/13 install commands. Create docs/DATA_FORMAT.md documenting
expected parquet schema for precalculated app. Split pyproject.toml GPU
extras into gpu-cu12/gpu-cu13 groups and add pynvml dependency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Apply L2 normalization to all embeddings before clustering and
dimensionality reduction via _prepare_embeddings(). This prevents
cuML UMAP SIGFPE crashes from large-magnitude vectors and is
appropriate for CLIP-family contrastive embeddings. Log input norms,
non-finite values, and embedding shapes at each pipeline step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
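The normalization step is simple enough to sketch; this is an approximation of `_prepare_embeddings()`, not the exact code:

```python
import numpy as np

def prepare_embeddings(x: np.ndarray) -> np.ndarray:
    """L2-normalize rows before clustering/reduction. Unit-norm vectors keep
    cuML UMAP's NN-descent numerically stable, and for CLIP-family contrastive
    embeddings cosine similarity becomes a plain dot product."""
    x = np.asarray(x, dtype=np.float32)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows untouched instead of dividing by 0
    return x / norms
```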
Document the full embedding pipeline (preparation, KMeans, dim
reduction, visualization) with backend details and fallback chain.
Link from README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


NetZissou and others added 2 commits February 12, 2026 12:44
Add comprehensive test suite covering shared utilities, clustering logic,
backend detection, PyArrow filters, taxonomy tree, and logging config.
All tests pass on both CPU (login nodes) and GPU (Pitzer V100).

Also addresses valid Copilot PR review comments: remove unused variables
and imports, simplify lambda in reduce(), add comments to silent except
clauses, document numpy cap and faiss-gpu-cu12 in cu13 section.

Fix real bug found by tests: build_taxonomic_tree() NaN handling — np.nan
is truthy so `val or 'Unknown'` didn't work; replaced with pd.isna().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
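The NaN bug mentioned above is easy to demonstrate; `rank_or_unknown` is an illustrative helper, not the function's actual name:

```python
import numpy as np
import pandas as pd

# The bug: np.nan is truthy, so `val or "Unknown"` keeps the NaN.
val = np.nan
assert bool(val)                  # NaN is truthy!
assert (val or "Unknown") is val  # the fallback never triggers

# The fix in build_taxonomic_tree(): test with pd.isna() instead.
def rank_or_unknown(val):
    return "Unknown" if pd.isna(val) else val
```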
Move sklearn, umap, faiss, cuml, cupy, torch, and open_clip imports
from module-level into the functions that use them. Use
importlib.find_spec() for instant package detection in backend.py
and clustering_controls.py dropdown population.

Also remove eager re-exports from __init__.py files that were forcing
all submodules to load at package import time (shared/utils/__init__.py
was importing open_clip via models.py on every startup).

App startup drops from ~5.7s to ~1.2s (4.7x faster). Heavy libraries
now load on-demand when user clicks "Run Clustering" or "Generate
Embeddings". 98 tests pass with no regressions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
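The pattern behind the 4.7× win might look like this — `find_spec()` reads package metadata without importing anything, and the heavy imports move inside the functions that need them. Names are illustrative, not the exact API:

```python
import importlib.util

OPTIONAL_BACKENDS = ("cuml", "faiss")

def available_backends() -> list:
    """Populate the backend dropdown without importing heavy libraries;
    find_spec() only scans metadata, so this is near-instant."""
    found = [b for b in OPTIONAL_BACKENDS if importlib.util.find_spec(b) is not None]
    return found + ["sklearn"]  # sklearn is the always-available default

def run_kmeans(x, k, backend="sklearn"):
    if backend == "faiss":
        import faiss  # deferred: loaded only when the user actually picks FAISS
        kmeans = faiss.Kmeans(x.shape[1], k)
        kmeans.train(x)
        _, labels = kmeans.index.search(x, 1)
        return labels.ravel()
    from sklearn.cluster import KMeans  # likewise deferred
    return KMeans(n_clusters=k, n_init="auto").fit_predict(x)
```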


NetZissou and others added 2 commits February 12, 2026 16:59
The previous constraint excluded valid patch releases (2.2.1+).
numba 0.61.x requires numpy<2.3, so align the specifier accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Provide project context (HPC/GPU fallback chain, optional deps) and
suppress false-positive patterns from Copilot reviews: self-referencing
extras, graceful-degradation except clauses, Streamlit scope parameter,
CUDA backward-compat FAISS builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


…rove tests

- Split main() into CLI launcher and app() layout in both Streamlit apps
  so pyproject.toml console_scripts invoke the Streamlit server correctly
- Replace regex text filter with pc.match_substring() for safer literal matching
- Fix test_returns_false_without_gpu to patch flags instead of mocking itself
- Generalize run_gpu_tests.sh (require VENV_DIR, placeholder account)
- Update tests/README.md to reflect @pytest.mark.gpu is reserved for future use

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
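The motivation for swapping the regex filter for `pc.match_substring()` — user-typed text treated as a pattern can blow up on metacharacters — can be shown with stdlib `re` alone (the records below are made up):

```python
import re

records = ["Danaus plexippus", "C. elegans (wild)", "Felis catus"]
query = "elegans ("  # user-typed text containing a regex metacharacter

# Regex approach: the unescaped "(" is a syntax error in a pattern.
try:
    [r for r in records if re.search(query, r)]
    regex_ok = True
except re.error:
    regex_ok = False

# Literal substring matching (what pc.match_substring() does) just works.
literal_hits = [r for r in records if query in r]
```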
@NetZissou

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

NetZissou commented Feb 13, 2026

Agent Code Review Experience

Interesting code reviewing process using GH copilot with this PR.

The first couple of attempts were not impressive at all. It made simple mistakes, such as confusing dependencies or getting dates and times wrong, and its suggestions were not well aligned with improving software robustness or setting direction for continued development.

I later found GitHub documentation suggesting a dedicated document that provides project context, expectations, and emphasis. After adding it as .github/copilot-instructions.md, the code review suggestions from GH Copilot became much more relevant.

I also tried the Claude code review skill, which kicks off 5 agents using different models that perform code review sequentially, each with a different testing specification. It runs locally. You can enable the comment option so the skill comments on the PR, much like Copilot does when it completes a review and leaves a comment. It also uses the same code review documentation we prepared for GH Copilot.

@NetZissou NetZissou linked an issue Feb 13, 2026 that may be closed by this pull request
@NetZissou NetZissou marked this pull request as ready for review February 13, 2026 01:26

@egrace479 egrace479 left a comment


A few small suggestions, primarily based on our earlier discussion. Looking forward to trying it out with you on Monday!

# emb-explorer

**emb-explorer** is a Streamlit-based visual exploration and clustering tool for image datasets and pre-calculated image embeddings.
Visual exploration and clustering tool for image embeddings.

Suggested change
Visual exploration and clustering tool for image embeddings.
Visual exploration and clustering tool for image embeddings. Users can either bring pre-calculated embeddings to explore, or use the interface to embed their images and then explore those embeddings.

Add a little extra description to prepare for the two features below.

<h3>🔍 Explore Pre-calculated Embeddings</h3>
</td>
<td width="50%" align="center"><b>Embed & Explore</b></td>
<td width="50%" align="center"><b>Precalculated Embeddings</b></td>

Suggested change
<td width="50%" align="center"><b>Precalculated Embeddings</b></td>
<td width="50%" align="center"><b>Precalculated Embedding Exploration</b></td>

From our earlier discussion: See how this fits or try some other variation to capture the pre-made embeddings --> exploration as the only difference from the embed & explore option. Then just match it with the bolded feature option at line 30.

### Example Data

If running the app on a remote compute node (e.g., HPC cluster), you'll need to set up port forwarding to access the Streamlit interface from your local machine.
An example dataset (`data/example_1k.parquet`) is provided with BioCLIP 2 embeddings for testing.

Can you add a README to that folder that describes the example data -- what it is, where it's from, etc.?


## Step 1: KMeans Clustering

Clusters the full high-dimensional embeddings (e.g., 768-d for BioCLIP).

Suggested change
Clusters the full high-dimensional embeddings (e.g., 768-d for BioCLIP).
Clusters the full high-dimensional embeddings (e.g., 768-d for BioCLIP 2).

| KMeans | cuML if GPU + >500 samples, else FAISS if available + >500 samples, else sklearn |
| Dim. Reduction | cuML if GPU + >5000 samples, else sklearn |

Any GPU error (architecture mismatch, missing libraries, OOM) triggers an

Suggested change
Any GPU error (architecture mismatch, missing libraries, OOM) triggers an
Any GPU error (architecture mismatch, missing libraries, out of memory (OOM)) triggers an

| Column | Type | Feature Enabled |
|--------|------|-----------------|
| `identifier` or `image_url` or `url` or `img_url` or `image` | `string` (URL) | **Image preview** in the detail panel. The app tries these column names in order and displays the first valid HTTP(S) image URL found. |
| `kingdom`, `phylum`, `class`, `order`, `family`, `genus`, `species` | `string` | **Taxonomic tree** summary. Any subset works; missing levels default to "Unknown". At minimum `kingdom` must be present and non-null for a row to appear in the tree. |

So it'll show the taxonomic tree if just kingdom and species are provided (as an extreme example), backfilling the in-between ranks with "Unknown"?
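Per the documented behavior ("Any subset works; missing levels default to 'Unknown'"), the backfill can be sketched in a few lines; `backfill_ranks` is a hypothetical helper, not the app's actual function:

```python
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")

def backfill_ranks(record: dict) -> list:
    """Build a full 7-level path, substituting 'Unknown' for any missing rank,
    so kingdom + species alone still yields a complete tree path."""
    path = []
    for rank in RANKS:
        v = record.get(rank)
        path.append("Unknown" if v in (None, "") else v)
    return path
```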



Development

Successfully merging this pull request may close these issues.

Slow app startup due to heavy library imports
