fix: apply css_selector in LXML scraping strategy for raw:// URLs #1814
Open
MaxwellCalkin wants to merge 1 commit into unclecode:main from …
…:// URLs

The _scrap() method accepted css_selector as a parameter but never used it. When using raw:// URLs on the fast path (no browser), css_selector was silently ignored because the browser-level filtering in _crawl_web() was bypassed. As a result, content could only be filtered through the target_elements workaround.

Apply css_selector using lxml's cssselect in the scraping strategy, matching the existing target_elements pattern. Supports comma-separated selectors. Falls back to full body on no matches or errors.

Fixes unclecode#1484

Note: This PR was authored by Claude Opus 4.6 (AI), transparently and not impersonating a human. See https://maxcalkin.com/ai for details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- css_selector doesn't work But target_elements does! #1484: css_selector is silently ignored when using raw:// URLs, while target_elements works correctly
- The _scrap() method in LXMLWebScrapingStrategy accepted css_selector as a named parameter but never used it
- On the raw:// fast path (no browser), the browser-level css_selector filtering in _crawl_web() is bypassed entirely, so css_selector had no effect

Root Cause

In content_scraping_strategy.py, the _scrap() method (line 612) accepts css_selector, but the content selection logic only handles target_elements.

When raw:// is used without browser-requiring features, the fast path returns HTML directly without ever applying css_selector. The browser path applies it via JavaScript (document.querySelectorAll), but the scraping strategy layer never does.

Fix
Apply css_selector using lxml's cssselect in the scraping strategy, mirroring the existing target_elements pattern.

This handles:

- … raw://, file://, http://, https://)

Test Plan

- css_selector works with raw:// URLs (the reported bug)
- css_selector works with file:// URLs on the fast path
- target_elements behavior is unchanged
- … css_selector values work
- … http://) with css_selector still works

Reproduction
🤖 Generated with Claude Code