How to configure user_data_dir when combining Crawlee, Playwright, and Camoufox #1776

@ninhthuanntnt

I initialized my project with the Crawlee CLI, using Playwright and Camoufox, and I am trying to configure user_data_dir so that login sessions persist between runs.

Although the profile data is correctly created in the specified user_data_dir after logging in to Google, the session is not reused on subsequent runs. When I restart the crawler, it redirects to the Google login page instead of restoring the previous authenticated session.

from camoufox import AsyncNewBrowser
from crawlee import Request
from crawlee._utils.context import ensure_context
from crawlee.browsers import PlaywrightBrowserPlugin, PlaywrightBrowserController, BrowserPool
from crawlee.crawlers import PlaywrightCrawler
from typing_extensions import override

from .constants.HandlerType import HandlerType
from .routes import router


class CamoufoxPlugin(PlaywrightBrowserPlugin):
    """Example browser plugin that uses Camoufox Browser, but otherwise keeps the functionality of
    PlaywrightBrowserPlugin."""

    def __init__(self, user_data_dir: str = None):
        super().__init__()
        self.user_data_dir = user_data_dir

    @ensure_context
    @override
    async def new_browser(self) -> PlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError('Playwright browser plugin is not initialized.')

        # Ask Camoufox for a persistent context whose profile lives in user_data_dir.
        persistent_context = await AsyncNewBrowser(
            self._playwright,
            persistent_context=True,
            headless=False,
            user_data_dir=self.user_data_dir,
        )

        return PlaywrightBrowserController(
            browser=persistent_context.browser,
            max_open_pages_per_browser=1,  # Increase, if camoufox can handle it in your use case.
            header_generator=None,  # This turns off the crawlee header_generation. Camoufox has its own.
        )


async def main() -> None:
    """The crawler entry point."""
    crawler = PlaywrightCrawler(
        max_request_retries=0,
        max_requests_per_crawl=10,
        request_handler=router,
        fingerprint_generator=None,
        browser_pool=BrowserPool(plugins=[CamoufoxPlugin('xxx')]),
    )

    await crawler.run(
        [
            Request.from_url(
                url='https://accounts.google.com/',
                label=HandlerType.GOOGLE_LOGIN,
                user_data={'email': 'xxx', 'password': 'xxx'},
            )
        ]
    )
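
To narrow down where the session gets lost, it may help to check whether the Google cookies were actually written to the profile between runs. The snippet below is only a diagnostic sketch, not part of the crawler. It assumes the Camoufox profile follows the usual Firefox layout, with a cookies.sqlite file containing a moz_cookies table, which I have not verified for Camoufox specifically. The helper name persisted_cookies is hypothetical.

```python
import sqlite3
from pathlib import Path


def persisted_cookies(profile_dir: str, host_suffix: str) -> list[tuple[str, str]]:
    """Return (host, name) pairs of on-disk cookies whose host ends with host_suffix.

    Assumption: the profile uses the Firefox layout (cookies.sqlite / moz_cookies).
    """
    db_path = Path(profile_dir) / 'cookies.sqlite'
    if not db_path.is_file():
        # No cookie database was written to this profile directory.
        return []

    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            'SELECT host, name FROM moz_cookies WHERE host LIKE ? ORDER BY host, name',
            (f'%{host_suffix}',),
        ).fetchall()
    finally:
        conn.close()
    return [(host, name) for host, name in rows]
```

Running something like print(persisted_cookies('xxx', 'google.com')) while the browser is closed would show whether any Google cookies survived in the directory. An empty list would suggest the login happened in a context that does not write to user_data_dir. A populated list would point at the next run not loading that profile.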
