List of libraries, tools and APIs for web scraping and data processing.
Updated Oct 21, 2021 - Makefile
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
PHP Curl Class makes it easy to send HTTP requests and integrate with web APIs
Selenium-python but lighter: Helium is the best Python library for web automation.
Web Scraping Framework
A Devtools driver for web automation and scraping
Learn Python for the next 30 (or so) Days.
General Assembly's 2015 Data Science course in Washington, DC
Snoop: an open-source intelligence (OSINT) reconnaissance tool (OSINT world)
The Python Code Tutorials
Collection of scripts corresponding to LucidProgramming YouTube tutorials
DataHen Till is a companion tool to your existing web scraper that instantly makes it scalable, maintainable, and more unblockable, with minimal code changes on your scraper. Integrates with any scraper in 5 minutes.
Faster requests on Python 3
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Next.js server to query websites with GraphQL
A JavaScript library for generating random user agents with data that's updated daily.
Random User-Agent middleware based on fake-useragent
UI.Vision: Open-Source RPA Software (formerly Kantu) - Modern Robotic Process Automation with Selenium IDE++
Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).
A framework for creating semi-automatic web content extractors
ACHE is a web crawler for domain-specific search.
htmldate and courlan is provided (available options etc.)
Hello,
Thanks for the new update in the personal_info section.
I found that the attribute 'certifications' returns an empty list [].
Test URL: https://www.linkedin.com/in/an-nguyen-9b3248122/
Results:
`{'personal_info': {'name': 'An Nguyen',
'headline': 'Data Scientist/Machine Learning Engineer',
'company': 'PERSOL PROCESS & TECHNOLOGY CO., LTD.',
'school': 'National Chiao Tung University',
NBA Stats API via Basketball Reference
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
Detailed web scraping tutorials for dummies with financial data crawlers on Reddit WallStreetBets, CME (both options and futures), US Treasury, CFTC, LME, MacroTrends, SHFE and alternative data crawlers on Tomtom, BBC, Wall Street Journal, Al Jazeera, Reuters, Financial Times, Bloomberg, CNN, Fortune, The Economist
The main examples on the Apify SDK webpage, GitHub repo, and CLI templates should demonstrate how to manipulate the DOM and retrieve data from it.
Also add one example of scraping with Apify SDK + jQuery to https://sdk.apify.com/docs/examples/basiccrawler
Feedback from: https://medium.com/better-programming/do-i-need-python-scrapy-to-build-a-web-scraper-7cc7cac2081d
I lost an hour trying to make