A Python-based tool to scrape product data from Amazon for specified search queries. It extracts product titles, prices, reviews, and image links, saving results in JSON files. The script is designed for simplicity, scalability, and evasion of detection.
- Anti-Detection Measures:
  - Utilizes `undetected-chromedriver` to bypass Amazon's anti-bot mechanisms.
  - Avoids using Amazon's homepage to prevent CAPTCHA triggers.
  - Mimics human-like behavior with randomized typing and action delays.
- Scalability:
  - Supports multiple search queries from a JSON file.
  - Handles pagination for scraping multiple result pages per query.
- Data Organization:
  - Saves scraped data for each query in separate JSON files in the `scraped_data` folder.
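The scalability and data-organization points above can be sketched roughly as follows. This is an illustration, not the script's actual code: the helper names `build_search_urls`, `output_path`, and `save_results`, the `k`/`page` URL parameters, and the file-naming scheme are all assumptions.

```python
import json
import re
from pathlib import Path
from urllib.parse import urlencode

# Assumed search endpoint; requesting a results URL directly also skips the homepage.
SEARCH_URL = "https://www.amazon.com/s"


def build_search_urls(query, max_pages=3):
    """Return one search-results URL per page for a query (hypothetical helper)."""
    urls = []
    for page in range(1, max_pages + 1):
        params = urlencode({"k": query, "page": page})
        urls.append(f"{SEARCH_URL}?{params}")
    return urls


def output_path(query, out_dir="scraped_data"):
    """Map a query to its own JSON file, e.g. 'gaming chairs' -> gaming_chairs.json."""
    slug = re.sub(r"[^a-z0-9]+", "_", query.lower()).strip("_")
    return Path(out_dir) / f"{slug}.json"


def save_results(query, products, out_dir="scraped_data"):
    """Write the scraped products for one query to its own JSON file."""
    path = output_path(query, out_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        json.dump(products, fh, indent=2, ensure_ascii=False)
    return path
```

Keeping URL construction and file naming in small pure functions makes both easy to check without launching a browser.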
- Python 3.8 or later
- Google Chrome
- ChromeDriver matching your Chrome version
- Clone the repository:

      git clone https://github.com/username/repo-name.git
      cd repo-name

- Install dependencies:

      pip install -r requirements.txt

- Add search queries:
  - Create a `user_queries.json` file in the root directory with search terms:

        ["laptops", "wireless headphones", "gaming chairs"]

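A minimal sketch of how the query file might be read and validated before scraping starts. The function name `load_queries` and the validation rules are assumptions for illustration; the actual script may handle this differently.

```python
import json
from pathlib import Path


def load_queries(path="user_queries.json"):
    """Read search terms from a JSON array of strings (hypothetical helper)."""
    with Path(path).open(encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError(f"{path} must contain a JSON array of search terms")
    # Drop non-strings and blanks so every remaining entry is a usable query.
    queries = [q.strip() for q in data if isinstance(q, str) and q.strip()]
    if not queries:
        raise ValueError(f"{path} contains no usable search terms")
    return queries
```

Failing early on a malformed file avoids starting a browser session that has nothing to search for.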
Run the script with:

    python amazon_scraper.py

The scraped data will be saved as JSON files in the `scraped_data/` directory.

project-folder/
├── amazon_scraper.py # Main script containing all functionality
├── user_queries.json # Input file with search terms
├── scraped_data/ # Directory to save scraped data
├── requirements.txt # List of dependencies
- The scraper runs in `--headless` mode for efficiency.
- Ensure ChromeDriver matches your Chrome browser version.
- Debugging messages are logged to the console for troubleshooting.
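As a rough sketch of how headless mode might be configured with `undetected-chromedriver`: the helper names `build_chrome_args` and `create_driver` are hypothetical, and the `--window-size` flag is an added assumption rather than something the script is known to set.

```python
def build_chrome_args(headless=True):
    """Return Chrome command-line flags for a session (hypothetical helper)."""
    # Assumption: a fixed window size keeps page layout consistent between runs.
    args = ["--window-size=1920,1080"]
    if headless:
        args.append("--headless")
    return args


def create_driver(headless=True):
    """Start a Chrome session with the flags above."""
    # Imported lazily so the flag logic can be used without the package installed.
    import undetected_chromedriver as uc

    options = uc.ChromeOptions()
    for arg in build_chrome_args(headless):
        options.add_argument(arg)
    return uc.Chrome(options=options)
```

Calling `create_driver(headless=False)` opens a visible browser window, which can help when debugging selectors.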
The script uses the following libraries:
- `undetected-chromedriver`
- `selenium`
- `os` and `json` for file operations
- `time` and `random` for human-like delays
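The human-like delays built on `time` and `random` could look something like the sketch below. The function names `random_pause` and `type_like_human` and the delay ranges are assumptions, not values taken from the script.

```python
import random
import time


def random_pause(min_s=0.5, max_s=2.0, sleep=time.sleep):
    """Sleep for a random duration between min_s and max_s and return it."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay


def type_like_human(send_keys, text, min_s=0.05, max_s=0.25, sleep=time.sleep):
    """Send text one character at a time, pausing randomly after each key."""
    for char in text:
        send_keys(char)
        random_pause(min_s, max_s, sleep)
```

With Selenium, `send_keys` would typically be the bound method of an input element, for example `type_like_human(search_box.send_keys, "laptops")`, where `search_box` is a located `WebElement`.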
Install all dependencies with:

    pip install -r requirements.txt

This project is licensed under the MIT License.
Contributions are welcome! Feel free to open issues or submit pull requests.