@janbro janbro commented Dec 23, 2025

Add Secret Scanning to Publish Workflow

Summary

  • Scans the VSIX for potential secrets before any other processing and blocks publishing when any are found.
  • Uses the Aho-Corasick algorithm for O(n) multi-pattern matching, with keyword-routed regex and entropy checks to reduce false positives and improve performance.
  • Ships Gitleaks-based rules plus a custom override file; rules are loaded from YAML with global allowlist merging.
  • Supports optional runtime rule generation from the latest Gitleaks TOML.
  • Runs scans in parallel on Spring's async executor, with bounded resources, timeouts, and archive guardrails.

Technical Changes

Core Scanning Infrastructure

  • Scanning Stack: SecretScanningService, SecretScanner, SecretScannerFactory, SecretRuleLoader, SecretRule, SecretScanResult, SecretFinding, EntropyCalculator, AhoCorasick
  • Aho-Corasick Optimization: Keywords, stopwords, inline suppressions, and file extensions are matched with Aho-Corasick in a single O(n) pass over the input (n = input length) instead of O(n×m) iteration over m patterns
  • Runtime Rule Generation: GitleaksRulesGenerator optionally downloads and converts gitleaks.toml to YAML at startup
  • Publish Integration: Integrated into ExtensionService.doPublish, failing fast with redacted findings when secrets are detected
private void doPublish(...) {
    // Scan for secrets before processing the extension
    var scanResult = secretScanningService.scanForSecrets(extensionFile);
    if (scanResult.isSecretsFound()) {
        // Redact and report findings
        throw new ErrorResultException(errorMessage.toString());
    }
    try (var processor = new ExtensionProcessor(extensionFile)) {
        ...
    }
}
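
To illustrate the multi-pattern matching described above, here is a minimal Aho-Corasick sketch. It is not the PR's AhoCorasick class; the class name KeywordMatcherSketch and its methods are hypothetical and only show how a trie with failure links finds every keyword in one pass over the text.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative matcher (hypothetical name), not the PR's implementation. */
final class KeywordMatcherSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>();
        Node fail; // longest proper suffix that is also a trie prefix
    }

    private final Node root = new Node();

    KeywordMatcherSketch(List<String> keywords) {
        // Phase 1: insert every keyword into the trie.
        for (String keyword : keywords) {
            Node node = root;
            for (char c : keyword.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(keyword);
        }
        // Phase 2: breadth-first pass to compute failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Character, Node> edge : current.next.entrySet()) {
                Node child = edge.getValue();
                Node f = current.fail;
                while (f != null && !f.next.containsKey(edge.getKey())) {
                    f = f.fail;
                }
                child.fail = (f == null) ? root : f.next.get(edge.getKey());
                // Inherit matches that end at the failure target.
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Returns every keyword that occurs anywhere in the text. */
    Set<String> findAll(String text) {
        Set<String> found = new TreeSet<>();
        Node node = root;
        for (char c : text.toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            found.addAll(node.outputs);
        }
        return found;
    }
}
```

The trie is built once and then only read, which fits the factory approach of doing the expensive initialization a single time and sharing the result across scans.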

Rule Loading & Configuration

  • YAML-Based Configuration: Global allowlists (paths, regexes, stopwords, file-extensions) loaded from rule YAML files
  • Merge Strategy:
    • Rules are overridden by ID (last file wins)
    • Global allowlists are merged from all files (accumulative)
  • Flexible Paths: Supports classpath: and filesystem paths with configurable load order
  • Default Rules: Ships with secret-scanning-rules-gitleaks.yaml (optionally auto-generated) and secret-scanning-custom-rules.yaml
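
The merge strategy above can be sketched as follows. The types RuleFile, Rule, and Merged are hypothetical stand-ins rather than the PR's SecretRuleLoader API, and only stopwords are shown; the other allowlist kinds would presumably accumulate the same way.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative merge of several rule files (hypothetical types). */
final class RuleMergeSketch {
    record Rule(String id, String regex) {}
    record RuleFile(List<Rule> rules, List<String> stopwords) {}
    record Merged(Map<String, Rule> rulesById, Set<String> stopwords) {}

    static Merged merge(List<RuleFile> filesInLoadOrder) {
        Map<String, Rule> rulesById = new LinkedHashMap<>();
        Set<String> stopwords = new LinkedHashSet<>();
        for (RuleFile file : filesInLoadOrder) {
            for (Rule rule : file.rules()) {
                // Same ID in a later file replaces the earlier rule.
                rulesById.put(rule.id(), rule);
            }
            // Allowlist entries accumulate across all files.
            stopwords.addAll(file.stopwords());
        }
        return new Merged(rulesById, stopwords);
    }
}
```

With this ordering, listing the custom rules file after the Gitleaks-derived file lets it override individual rules while still contributing to the shared allowlist.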

Archive Safety & Performance

  • Archive Guardrails:
    • Entry count limit (max-entry-count: 5000)
    • Total uncompressed size limit (max-total-uncompressed-bytes: 100MB)
    • Path traversal protection (isSafePath)
  • Performance Optimizations:
    • Skip oversized files (max-file-size-bytes: 5MB)
    • Skip very long lines (max-line-length: 10000)
    • Keyword pre-filtering before regex matching
    • Periodic timeout checks (timeout-check-every-n-lines: 100)
    • Max findings cap (max-findings: 200)
  • Exclusions: Configurable path patterns, file extensions, and content patterns
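
As a rough illustration of the archive guardrails, the sketch below checks entry names and enforces the entry-count and total-size limits while streaming a ZIP. The class ArchiveGuardSketch and the body of isSafePath are assumptions; the PR's actual checks may be stricter (for example around Windows drive prefixes).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Illustrative guardrails (hypothetical class), not the PR's code. */
final class ArchiveGuardSketch {
    /** Rejects absolute paths, backslashes, and any ".." segment. */
    static boolean isSafePath(String entryName) {
        if (entryName.isEmpty() || entryName.startsWith("/") || entryName.contains("\\")) {
            return false;
        }
        for (String segment : entryName.split("/")) {
            if (segment.equals("..")) {
                return false;
            }
        }
        return true;
    }

    /** Streams the archive and fails as soon as a limit is exceeded. */
    static void checkArchive(InputStream in, int maxEntryCount, long maxTotalBytes)
            throws IOException {
        int entries = 0;
        long totalBytes = 0;
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (++entries > maxEntryCount) {
                    throw new IOException("Too many archive entries");
                }
                if (!isSafePath(entry.getName())) {
                    throw new IOException("Unsafe entry path: " + entry.getName());
                }
                // Count bytes actually read instead of trusting declared sizes.
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    totalBytes += read;
                    if (totalBytes > maxTotalBytes) {
                        throw new IOException("Uncompressed size limit exceeded");
                    }
                }
            }
        }
    }
}
```

Counting decompressed bytes as they are read, rather than relying on the sizes stored in the ZIP headers, is what makes the total-size limit effective against highly compressed archives.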

Architectural Decisions

  • Factory Pattern: SecretScannerFactory performs expensive one-time initialization (builds Aho-Corasick tries), produces immutable SecretScanner instances
  • Service Orchestration: SecretScanningService handles runtime concerns (enabled check, async execution, result aggregation)
  • Retroactive Scanning Support: Factory always initializes when rules are configured; enabled check moved to service layer to allow retroactive scans independent of publish-time scanning

Design Considerations

False-Positive Controls

  • Global Stopwords: Exact string matches using Aho-Corasick (e.g., "example", "test", "dummy")
  • Global Allowlist Regexes: RE2 patterns for known safe content
  • Rule-Level Allowlists: Per-rule exclusion patterns
  • Inline Suppressions: Comment markers to skip lines (e.g., secret-scanner:ignore)
  • Path/Extension Exclusions: Skip common build artifacts and binary files
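
A minimal sketch of how these content-level controls might be applied to a single candidate match is shown below. The class FalsePositiveFilterSketch is hypothetical, plain String.contains stands in for Aho-Corasick, and java.util.regex stands in for RE2; the order of the checks is also an assumption.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative false-positive filter (hypothetical class). */
final class FalsePositiveFilterSketch {
    private final List<String> suppressionMarkers;
    private final List<String> stopwords;
    private final List<Pattern> allowlistRegexes;

    FalsePositiveFilterSketch(List<String> suppressionMarkers,
                              List<String> stopwords,
                              List<String> allowlistRegexes) {
        this.suppressionMarkers = suppressionMarkers;
        this.stopwords = stopwords;
        // Compile once so repeated checks do not pay the compile cost.
        this.allowlistRegexes = allowlistRegexes.stream().map(Pattern::compile).toList();
    }

    /** Returns true when the candidate should be dropped, not reported. */
    boolean shouldIgnore(String line, String candidate) {
        for (String marker : suppressionMarkers) {
            if (line.contains(marker)) {
                return true; // inline suppression on this line
            }
        }
        for (String stopword : stopwords) {
            if (candidate.contains(stopword)) {
                return true; // candidate contains a known placeholder word
            }
        }
        for (Pattern pattern : allowlistRegexes) {
            if (pattern.matcher(candidate).find()) {
                return true; // candidate matches a known-safe pattern
            }
        }
        return false;
    }
}
```

Path and extension exclusions are not shown here because they apply per file, before any line is read.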

Performance & Safety

  • Aho-Corasick Algorithm: O(n) multi-pattern matching for keywords, stopwords, suppressions, extensions
  • Keyword Routing: Only run expensive regex on lines containing relevant keywords
  • Entropy Calculation: Validates that candidate matches have sufficient randomness before they are reported
  • Bounded Resources: Thread pool limits, timeout guards, findings caps
  • Archive Limits: Entry-count and total-uncompressed-size limits prevent excessive processing of oversized archives
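
For the entropy check, a common choice is Shannon entropy over the characters of the candidate string. The sketch below assumes that definition; the PR's EntropyCalculator may compute it differently, and the class name EntropySketch is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative Shannon entropy in bits per character (assumed metric). */
final class EntropySketch {
    static double shannonEntropy(String value) {
        if (value.isEmpty()) {
            return 0.0;
        }
        // Count how often each character occurs.
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : value.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        // Sum -p * log2(p) over the observed character frequencies.
        double entropy = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / value.length();
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }
}
```

A string of one repeated character scores 0 and a string of four distinct characters scores 2 bits per character, so a per-rule minimum threshold can discard low-randomness matches such as repeated or placeholder values.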

Resilience

  • Async Execution: Spring's default async executor for parallel file scanning
  • Redaction: Secrets redacted in error messages (first/last N chars only)
  • Feature Toggle: Can be completely disabled via ovsx.secret-scanning.enabled: false
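
The redaction behaviour (first/last N chars only) could look roughly like this. The helper name redact and the decision to mask short values completely are assumptions, not taken from the PR.

```java
/** Illustrative redaction helper (hypothetical), keeps N chars on each end. */
final class RedactionSketch {
    static String redact(String secret, int visibleChars) {
        // Short values are masked entirely so the visible ends cannot
        // reveal most or all of the secret.
        if (visibleChars <= 0 || secret.length() <= 2 * visibleChars) {
            return "*".repeat(secret.length());
        }
        return secret.substring(0, visibleChars)
                + "*".repeat(secret.length() - 2 * visibleChars)
                + secret.substring(secret.length() - visibleChars);
    }
}
```

Keeping a few characters on each end lets a publisher recognise which credential was flagged without the error message exposing the full value.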

Configuration

Runtime Rule Generation (Optional)

ovsx:
  secret-scanning:
    auto-generate-rules: true
    force-regenerate-rules: false
    generated-rules-path: '/app/data/secret-scanning-rules-gitleaks.yaml'

Core Scanning Configuration

ovsx:
  secret-scanning:
    enabled: true
    rules-path: 'classpath:scanning/secret-scanning-custom-rules.yaml'
    
    # File & Archive Limits
    max-file-size-bytes: 5242880              # 5 MB per file
    max-entry-count: 5000                     # Max entries in archive
    max-total-uncompressed-bytes: 104857600   # 100 MB total
    max-findings: 200                         # Stop after N secrets
    
    # Line Processing
    inline-suppressions: 'secret-scanner:ignore,@supress-secret'
    max-line-length: 10000
    long-line-no-space-threshold: 1000
    keyword-context-chars: 100
    log-allowlisted-value-preview-length: 10
    
    # Timeout & Performance
    timeout-seconds: 5
    timeout-check-every-n-lines: 100
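
To show how timeout-seconds, timeout-check-every-n-lines, and max-line-length might interact, here is a small sketch of a line loop that consults the clock only every N lines. The class TimeoutCheckSketch and its exception type are hypothetical, and the per-line matching is replaced by a counter.

```java
import java.time.Duration;
import java.util.List;

/** Illustrative periodic timeout check (hypothetical class). */
final class TimeoutCheckSketch {
    static final class ScanTimeoutException extends RuntimeException {
        ScanTimeoutException(String message) {
            super(message);
        }
    }

    /** Returns the number of lines that were actually scanned. */
    static int scanLines(List<String> lines, Duration timeout,
                         int checkEveryNLines, int maxLineLength) {
        long deadline = System.nanoTime() + timeout.toNanos();
        int scanned = 0;
        for (int i = 0; i < lines.size(); i++) {
            // Reading the clock on every line would add overhead,
            // so only check the deadline every N lines.
            if (i % checkEveryNLines == 0 && System.nanoTime() - deadline >= 0) {
                throw new ScanTimeoutException("Scan timed out at line " + i);
            }
            if (lines.get(i).length() > maxLineLength) {
                continue; // skip very long lines
            }
            scanned++; // placeholder for keyword and regex matching
        }
        return scanned;
    }
}
```

A smaller check interval reacts to the timeout sooner at the cost of more clock reads; the value of 100 in the configuration above sits between those extremes.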

Secret Rule Definitions and Global Allowlists (YAML Files)

allowlist:
  # Regex patterns for file paths to exclude
  paths:
    - "node_modules/"
    - ".git/"
    - "test/"
  
  # Regex patterns for content to exclude
  regexes:
    - "^test$"
    - "^example$"
  
  # Exact strings to exclude (stopwords)
  stopwords:
    - "example"
    - "test"
    - "dummy"
  
  # File extensions to exclude
  file-extensions:
    - ".png"
    - ".jpg"
    - ".zip"

rules:
  - id: test-rule-1
    description: Test rule with global allowlist
    regex: "secret[0-9]{3}"
    keywords:
      - secret
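
Using the example rule above, keyword routing can be sketched as follows: the regex only runs on lines that contain one of the rule's keywords. The class KeywordRoutingSketch is hypothetical, String.contains stands in for the Aho-Corasick lookup, and java.util.regex stands in for RE2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative keyword-routed matching (hypothetical class). */
final class KeywordRoutingSketch {
    record Rule(String id, Pattern regex, List<String> keywords) {}

    /** Returns findings for one line, formatted as "ruleId:match". */
    static List<String> scanLine(String line, List<Rule> rules) {
        List<String> findings = new ArrayList<>();
        for (Rule rule : rules) {
            // Cheap pre-filter: skip the regex when no keyword is present.
            boolean routed = rule.keywords().stream().anyMatch(line::contains);
            if (!routed) {
                continue;
            }
            Matcher matcher = rule.regex().matcher(line);
            while (matcher.find()) {
                findings.add(rule.id() + ":" + matcher.group());
            }
        }
        return findings;
    }
}
```

For a rule built from test-rule-1, the line "value = secret123" produces one finding, while "value = 123" never reaches the regex because it contains no keyword.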

@janbro janbro force-pushed the yeeth/secretscanning branch from 04ba2f7 to 77151c6 December 23, 2025 20:43
@netomi netomi (Contributor) commented Dec 24, 2025

If the secret scanning service is not relevant for the specific test, I would mock the whole bean rather than setting it up for that test. This feels unnecessary and leads to duplicate work that you have to repeat if something changes.
