web-archiving

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

spark internet-archive warc web-archiving webarchive archivespark spark-framework

Updated Dec 13, 2019
Scala

N0taN3rd / wail

🐋 One-Click User Instigated Preservation

electron warc web-archiving high-fidelity-preservation browser-based-presrevation

Updated Feb 3, 2019
JavaScript

oduwsdl / warrick

Recover lost websites from the Web Infrastructure

memento recovery web-archiving memento-rfc

Updated Mar 22, 2020
HTML

internetarchive / fatcat

Perpetual Access To The Scholarly Record

rust scholarly-communication web-archiving digital-library

Updated Sep 12, 2020
Python

N0taN3rd / node-warc

Parse And Create Web ARChive (WARC) files with node.js

warc web-archiving webarchive web-archives webarchiving warc-files chrome-remote-interface pupeteer

Updated Sep 4, 2020
JavaScript

cocrawler / cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

python warc web-archiving cdx web-archives commoncrawl cdx-api

Updated Sep 2, 2020
Python

xarantolus / Collect

A server to collect & archive websites that also supports video downloads

self-hosted video-downloader archive webinterface web-archiving website-scraper website-archive

Updated Jul 19, 2020
TypeScript

oduwsdl / MemGator

A Memento Aggregator CLI and Server in Go

memento web-archiving timemap memento-rfc

Updated Sep 2, 2020
Go

webrecorder / replayweb.page

Serverless Web Archive Replay directly in the browser

service-worker warc web-archiving wayback-machine web-archive replay-web-page web-replay

Updated Sep 14, 2020
JavaScript

pirate / internet-archiving-talk

🎭 An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.

slideshow wget talks warc censorship web-archiving ethics internet-archiving archivebox

Updated Jun 2, 2020
JavaScript

webrecorder / cdxj-indexer

CDXJ Indexing of WARC/ARCs

warc web-archiving

Updated Aug 30, 2020
Python

nla / outbackcdx

Web archive index server based on RocksDB

web-archiving wayback

Updated Aug 15, 2020
Java

Rhizome-Conifer / conifer-deploy

Conifer setup and deployment via Ansible

ansible-playbook web-archiving webrecorder

Updated Jun 15, 2020
Shell

webrecorder / dat-share

A prototype server to swarm multiple DATs for Webrecorder

hyperdrive dat web-archiving dat-protocol

Updated Apr 27, 2019
JavaScript

nla / httrack2warc

Converts HTTrack crawls to WARC files

web-archiving

Updated Mar 9, 2020
Java

internetarchive / pdf_trio

A PDF classifier ensemble with REST API service

pdf tensorflow scholarly-communication web-archiving digital-library

Updated Jun 11, 2020
Python

helgeho / HadoopConcatGz

A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

spark hadoop warc web-archiving webarchive

Updated Feb 7, 2018
Java

ukwa / ukwa-manage

Shepherding our web archives from crawl to access.

hdfs warc web-archiving wayback webarchive cdx

Updated Sep 7, 2020
Jupyter Notebook
