The Wayback Machine - http://web.archive.org/web/20211116082506/https://github.com/topics/web-scraping

web-scraping

Here are 2,590 public repositories matching this topic...

lorien / awesome-web-scraping

List of libraries, tools and APIs for web scraping and data processing.

Updated Oct 21, 2021
Makefile

alirezamika / autoscraper

A Smart, Automatic, Fast and Lightweight Web Scraper for Python

python crawler machine-learning scraper automation ai scraping artificial-intelligence web-scraping scrape webscraping webautomation

Updated Feb 3, 2021
Python

apify / apify-js

Open issue: Update main examples to include DOM manipulation

1

mtrunkat commented Sep 17, 2019

The main examples on the Apify SDK webpage, GitHub repo, and CLI templates should demonstrate how to manipulate the DOM and retrieve data from it.

Also add one example of scraping with Apify SDK + jQuery to https://sdk.apify.com/docs/examples/basiccrawler

Feedback from: https://medium.com/better-programming/do-i-need-python-scrapy-to-build-a-web-scraper-7cc7cac2081d

I lost an hour trying to make

good first issue

Open issue: Improve error messages

1

Open issue: Handle ENOMEM gracefully in memory snapshotter in AutoscaledPool

1

php-curl-class / php-curl-class

PHP Curl Class makes it easy to send HTTP requests and integrate with web APIs

Updated Oct 3, 2021
PHP

mherrmann / selenium-python-helium

Selenium-python but lighter: Helium is the best Python library for web automation.

python firefox chrome webdriver selenium python3 web-scraping helium web-automation selenium-python

Updated Sep 3, 2021
Python

lorien / grab

Web Scraping Framework

python framework spider asynchronous network http-client web-scraping pycurl urllib3

Updated Feb 22, 2021
Python

go-rod / rod

A Devtools driver for web automation and scraping

testing go golang scraper automation web chrome-devtools headless devtools crawling web-scraping cdp chrome-headless rod chrome-devtools-protocol devtools-protocol gorod

Updated Nov 12, 2021
Go

codingforentrepreneurs / 30-Days-of-Python

Learn Python for the next 30 (or so) Days.

python api flask automation tutorial csv jupyter rest-api selenium pandas python3 web-scraping selenium-webdriver fastapi

Updated Nov 8, 2021
HTML

justmarkham / DAT8

General Assembly's 2015 Data Science course in Washington, DC

python data-science machine-learning natural-language-processing course clustering naive-bayes linear-regression scikit-learn jupyter-notebook pandas data-visualization web-scraping data-analysis ensemble-learning logistic-regression decision-trees regular-expressions data-cleaning model-evaluation

Updated Apr 18, 2016
Jupyter Notebook

tidyverse / rvest

Simple web scraping for R

html r web-scraping

Updated Oct 28, 2021
R

snooppr / snoop

Snoop: a reconnaissance tool based on open data (OSINT world)

Updated Nov 16, 2021
Python

x4nth055 / pythoncode-tutorials

The Python Code Tutorials

python python-tutorials machine-learning natural-language-processing computer-vision text-classification tutorials python3 web-scraping face-detection scapy network-analysis network-programming programming-tutorial ethical-hacking network-security socket-programming scapy-tutorials

Updated Nov 13, 2021
Jupyter Notebook

vprusso / youtube_tutorials

Collection of scripts corresponding to LucidProgramming YouTube tutorials

python python3 web-scraping youtube-tutorial python-tutorial ctci-solutions lucidprogramming python3-tutorial technical-interview

Updated Feb 10, 2021
Python

DataHenHQ / till

DataHen Till is a companion tool to your existing web scraper that instantly makes it scalable, maintainable, and more unblockable, with minimal code changes on your scraper. Integrates with any scraper in 5 minutes.

crawler scraper scraping mitm proxy-server web-scraping man-in-the-middle

Updated Oct 28, 2021
Go

juancarlospaco / faster-than-requests

Faster requests on Python 3

Updated Sep 13, 2021
Nim

postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

ruby crawler scraper web spider web-crawler web-scraper web-scraping web-spider spider-links

Updated Jun 23, 2021
Ruby

dinubs / coolqlcool

Nextjs server to query websites with GraphQL

javascript graphql schema nextjs web-scraping

Updated Aug 13, 2021
JavaScript

intoli / user-agents

A JavaScript library for generating random user agents with data that's updated daily.

javascript user-agent random randomization navigator web-scraping browsers browser-automation user-agent-spoofer

Updated Nov 16, 2021
JavaScript

alecxe / scrapy-fake-useragent

Random User-Agent middleware based on fake-useragent

python web-scraping scrapy

Updated Sep 17, 2020
Python
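
The idea behind a random User-Agent middleware can be sketched with the standard library alone. This is a minimal illustration, not the scrapy-fake-useragent API: the `USER_AGENTS` pool and the `with_random_user_agent` helper are hypothetical names, and the strings in the pool are only examples.

```python
import random

# Hypothetical, hand-written pool of User-Agent strings (illustrative only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/94.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/15.1 Safari/605.1.15",
]


def with_random_user_agent(headers=None, rng=random):
    """Return a copy of `headers` with a randomly chosen User-Agent set.

    `rng` only needs a `choice` method, so a seeded `random.Random`
    instance can be passed in for reproducible runs.
    """
    merged = dict(headers or {})
    merged["User-Agent"] = rng.choice(USER_AGENTS)
    return merged


if __name__ == "__main__":
    # Each call may pick a different User-Agent for the outgoing request.
    print(with_random_user_agent({"Accept": "text/html"}))
```

A real middleware would apply the same step to every outgoing request, typically drawing from a much larger and regularly refreshed pool.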

A9T9 / RPA

UI.Vision: Open-Source RPA Software (formerly Kantu) - Modern Robotic Process Automation with Selenium IDE++

opencv automation webassembly web-scraping autohotkey browser-extension imacros selenium-ide browser-automation visual-recognition sikulix web-automation ui-tests uipath data-driven-tests

Updated Sep 4, 2021
JavaScript

rushter / selectolax

Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).

css python parser html5 web-scraping modest-engine

Updated Nov 13, 2021
Cython

AlexMathew / scrapple

A framework for creating semi-automatic web content extractors

python crawler tutorial extractor scraping web-scraper selector css-selector web-scraping scrapy scrapers beautifulsoup xpath-expression lxml selector-expression

Updated Oct 24, 2020
Python

VIDA-NYU / ache

ACHE is a web crawler for domain-specific search.

web-crawler web-scraping hacktoberfest web-spider focused-crawler domain-specific-search web-search

Updated Nov 9, 2021
Java

adbar / trafilatura

Open issue: Docs update & extension

adbar commented Oct 29, 2021

- last update was for version 0.9.1, take latest changes into account
- review and extend general instructions for command-line interface
- check if all required info on htmldate and courlan is provided (available options etc.)
- add example on how to handle cookies, inspiration:
  - urllib3/urllib3#2140
  - https://github.com/urllib3/urllib3/pu

enhancement good first issue up for grabs

Open issue: Test trafilatura on further web pages and report bugs

1

austinoboyle / scrape-linkedin-selenium

Open issue: Certifications return empty []

2

anntdiv commented Jun 10, 2021

Hello,
Thanks for the new update to the personal_info section.
I found that the 'certifications' attribute returns an empty list [].
Test url: https://www.linkedin.com/in/an-nguyen-9b3248122/
Results:
`{'personal_info': {'name': 'An Nguyen',
'headline': 'Data Scientist/Machine Learning Engineer',
'company': 'PERSOL PROCESS & TECHNOLOGY CO., LTD.',
'school': 'National Chiao Tung University',

help wanted good first issue

Open issue: Companyscraper doesn't work and returns error 'NoneType'

3

Open issue: Scrape linkedin posts

3

jaebradley / basketball_reference_web_scraper

NBA Stats API via Basketball Reference

python nba web-scraper web-scraping basketball-reference

Updated Aug 10, 2021
HTML

infinitbyte / gopa

[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn

lightweight elasticsearch crawler spider web-crawler scraping crawling web-scraping web-spider

Updated May 19, 2021
Go

sangaline / wayback-machine-scraper

A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.

python web-scraping command-line-tool wayback-machine wayback-archiver archive-dot-org

Updated Feb 15, 2021
Python
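
Scraping time series from the Wayback Machine depends on snapshot URLs of the form `http://web.archive.org/web/<14-digit timestamp>/<original URL>`, the same shape as the archive URL of this page. The sketch below only illustrates that URL layout; it is not code from wayback-machine-scraper. The names `Snapshot`, `parse_snapshot_url`, and `build_snapshot_url` are hypothetical, and it assumes a plain `YYYYMMDDhhmmss` timestamp with no modifier suffix.

```python
import re
from datetime import datetime
from typing import NamedTuple


class Snapshot(NamedTuple):
    captured_at: datetime  # capture time encoded in the snapshot URL
    original_url: str      # URL of the page that was archived


# Assumes a bare 14-digit timestamp; other URL variants are not handled.
_SNAPSHOT_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_snapshot_url(url: str) -> Snapshot:
    """Split a Wayback Machine snapshot URL into capture time and original URL."""
    match = _SNAPSHOT_RE.match(url)
    if match is None:
        raise ValueError(f"not a recognized snapshot URL: {url!r}")
    timestamp, original = match.groups()
    return Snapshot(datetime.strptime(timestamp, "%Y%m%d%H%M%S"), original)


def build_snapshot_url(original_url: str, captured_at: datetime) -> str:
    """Build the snapshot URL for `original_url` at the given capture time."""
    return f"http://web.archive.org/web/{captured_at:%Y%m%d%H%M%S}/{original_url}"


if __name__ == "__main__":
    snap = parse_snapshot_url(
        "http://web.archive.org/web/20211116082506/https://github.com/topics/web-scraping"
    )
    print(snap.captured_at.isoformat(), snap.original_url)
```

Sorting parsed snapshots by `captured_at` gives the time-ordered sequence that a time series scraper would iterate over.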

csu / quora-api

An unofficial API for Quora.

python api flask rest-api web-api web-scraping quora quora-api

Updated Oct 9, 2016
Python

je-suis-tm / web-scraping

Detailed web scraping tutorials for dummies with financial data crawlers on Reddit WallStreetBets, CME (both options and futures), US Treasury, CFTC, LME, MacroTrends, SHFE and alternative data crawlers on Tomtom, BBC, Wall Street Journal, Al Jazeera, Reuters, Financial Times, Bloomberg, CNN, Fortune, The Economist

web-scraper web-scraping newsletter reuters bloomberg futures web-scrapers scrapper financial-data news-websites data-scraping news-scraper futures-historical-data data-scraper sraping python-web-scraper financial-times options-data wall-street-journal wallstreetbets

Updated Jun 28, 2021
Python
