
fix: apply css_selector in LXML scraping strategy for raw:// URLs#1814

Open
MaxwellCalkin wants to merge 1 commit into unclecode:main from MaxwellCalkin:fix/css-selector-raw-html

Conversation

@MaxwellCalkin

Summary

  • Fixes #1484 ([Bug]: css_selector doesn't work But target_elements does!): css_selector is silently ignored when using raw:// URLs, while target_elements works correctly
  • The _scrap() method in LXMLWebScrapingStrategy accepted css_selector as a named parameter but never used it
  • On the raw:// fast path (no browser), the browser-level css_selector filtering in _crawl_web() is bypassed entirely, so css_selector had no effect there

Root Cause

In content_scraping_strategy.py, the _scrap() method (line 612) accepts css_selector but the content selection logic only handles target_elements:

if target_elements:
    # Uses cssselect to filter — works correctly
    ...
else:
    content_element = body  # css_selector is ignored here

When raw:// is used without browser-requiring features, the fast path returns HTML directly without ever applying css_selector. The browser path applies it via JavaScript (document.querySelectorAll), but the scraping strategy layer never does.

Fix

Apply css_selector using lxml's cssselect in the scraping strategy, mirroring the existing target_elements pattern:

elif css_selector:
    selectors = [s.strip() for s in css_selector.split(',')]
    selected_elements = []
    for selector in selectors:
        selected_elements.extend(body.cssselect(selector))
    if selected_elements:
        content_element = lhtml.Element("div")
        content_element.extend(copy.deepcopy(selected_elements))

This handles:

  • Comma-separated selectors (matching the browser path behavior)
  • Graceful fallback to full body when no elements match or on errors
  • All URL schemes (raw://, file://, http://, https://)

Test Plan

  • Verify css_selector works with raw:// URLs (the reported bug)
  • Verify css_selector works with file:// URLs on the fast path
  • Verify target_elements behavior is unchanged
  • Verify comma-separated css_selector values work
  • Verify browser path (http://) with css_selector still works

Reproduction

import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(css_selector=".my-class")
        result = await crawler.arun(
            url="raw://<div class='my-class'>Hello</div><div>World</div>",
            config=config,
        )
        # Before fix: result.markdown contains both "Hello" and "World"
        # After fix: result.markdown contains only "Hello"

asyncio.run(main())

Note: This PR was authored by Claude Opus 4.6 (AI), operating transparently and not impersonating a human. See https://maxcalkin.com/ai for details.

🤖 Generated with Claude Code

…:// URLs

The _scrap() method accepted css_selector as a parameter but never used
it. When using raw:// URLs on the fast path (no browser), css_selector
was silently ignored because the browser-level filtering in _crawl_web()
was bypassed, leaving target_elements as the only working workaround.

Apply css_selector using lxml's cssselect in the scraping strategy,
matching the existing target_elements pattern. Supports comma-separated
selectors. Falls back to full body on no matches or errors.

Fixes unclecode#1484

Note: This PR was authored by Claude Opus 4.6 (AI), transparently
and not impersonating a human. See https://maxcalkin.com/ai for details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
