
fix: apply css_selector in LXML scraping strategy for raw:// URLs#1814

Open
MaxwellCalkin wants to merge 1 commit into unclecode:main from MaxwellCalkin:fix/css-selector-raw-html

Conversation

@MaxwellCalkin

Summary

  • Fixes #1484 ([Bug]: css_selector doesn't work But target_elements does!): css_selector is silently ignored when using raw:// URLs, while target_elements works correctly
  • The _scrap() method in LXMLWebScrapingStrategy accepted css_selector as a named parameter but never used it
  • On the raw:// fast path (no browser), the browser-level css_selector filtering in _crawl_web() is bypassed entirely, so css_selector had no effect there

Root Cause

In content_scraping_strategy.py, the _scrap() method (line 612) accepts css_selector but the content selection logic only handles target_elements:

if target_elements:
    # Uses cssselect to filter — works correctly
    ...
else:
    content_element = body  # css_selector is ignored here

When raw:// is used without browser-requiring features, the fast path returns HTML directly without ever applying css_selector. The browser path applies it via JavaScript (document.querySelectorAll), but the scraping strategy layer never does.

Fix

Apply css_selector using lxml's cssselect in the scraping strategy, mirroring the existing target_elements pattern:

elif css_selector:
    selectors = [s.strip() for s in css_selector.split(',')]
    selected_elements = []
    for selector in selectors:
        selected_elements.extend(body.cssselect(selector))
    if selected_elements:
        content_element = lhtml.Element("div")
        content_element.extend(copy.deepcopy(selected_elements))

This handles:

  • Comma-separated selectors (matching the browser path behavior)
  • Graceful fallback to full body when no elements match or on errors
  • All URL schemes (raw://, file://, http://, https://)

Test Plan

  • Verify css_selector works with raw:// URLs (the reported bug)
  • Verify css_selector works with file:// URLs on the fast path
  • Verify target_elements behavior is unchanged
  • Verify comma-separated css_selector values work
  • Verify browser path (http://) with css_selector still works

Reproduction

import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(css_selector=".my-class")
        result = await crawler.arun(
            url="raw://<div class='my-class'>Hello</div><div>World</div>",
            config=config,
        )
        # Before fix: result.markdown contains both "Hello" and "World"
        # After fix: result.markdown contains only "Hello"

asyncio.run(main())

Note: This PR was authored by Claude Opus 4.6 (AI), operating transparently and not impersonating a human. See https://maxcalkin.com/ai for details.

🤖 Generated with Claude Code

…:// URLs

The _scrap() method accepted css_selector as a parameter but never used
it. When using raw:// URLs on the fast path (no browser), css_selector
was silently ignored because the browser-level filtering in _crawl_web()
was bypassed, leaving target_elements as the only working workaround.

Apply css_selector using lxml's cssselect in the scraping strategy,
matching the existing target_elements pattern. Supports comma-separated
selectors. Falls back to full body on no matches or errors.

Fixes unclecode#1484

Note: This PR was authored by Claude Opus 4.6 (AI), transparently
and not impersonating a human. See https://maxcalkin.com/ai for details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
