
We have encountered an issue with `submodule.update` that reproduces in a rather special circumstance:

Setup
Have a git repo, say `test`, containing one submodule, say `test_submodule`. `test_submodule` must have no `master` branch (let's say the default branch is called `develop` instead).

Check out a revision of `test` that points to the tip of `develop` for `test_submodule`, but do not `git submodule init` the submodule. The directory hierarchy should look like this:

Now, create a `Repo` for `test`, and call `repo.submodules[0].update(init=True)`.

Expected results
The tip of `develop` is checked out in `test_submodule` and the index is clean.

Actual results
`test_submodule` is pointing at the tip of `develop`, but all of the files are staged for deletion. Additionally, there is a warning printed: `Failed to checkout tracking branch refs/heads/master`

My analysis of why this happens
Before any bad behavior happens, `submodule.update` clones the submodule repository with `-n`, which means the clone does not actually check out the commit the clone ends up with. Remember this for later.

After cloning, we begin the process of updating the submodule to match what the parent repo specifies. `submodule.update` takes an optional `branch` argument, which defaults to `None`. When `branch` is `None`, we assume the branch is `master`. Here, we try to find this branch and point HEAD to it, but of course in the repro case this fails because the branch does not exist. This is crucial, because it means we skip the line that marks the repo as "not checked out" by pointing the branch to the "NULL" sha.

Now, recall that a requirement of the repro is that `test` points to the tip of `develop` for `test_submodule`. Since we did not move the repo to the NULL sha beforehand, the repo is already at the desired sha when we arrive at this conditional. Therefore, the condition evaluates to false, and we skip all the code that actually checks out files.

Finally, recall our `-n` clone from the first paragraph. Since we did not check out any code after cloning, the repo is left in an un-checked-out state, which is exactly the "Actual results" state described above.

Closing thoughts
We can work around this issue simply by adding a dummy `master` branch to `test_submodule`, but this should not be required. Ideally, `submodule.update` would not require a valid branch to operate correctly, since `git submodule update --init` works just fine without one.
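
To make the analysis and the workaround easier to follow, here is a rough toy model of the control flow as I understand it. This is not GitPython's actual code: `simulate_update`, `NULL_SHA`, and all parameter names are hypothetical and exist only for illustration, and shas are modeled as plain strings.

```python
# Toy model of the control flow described in the analysis above.
# NOT GitPython's code; every name here is hypothetical.

NULL_SHA = "0" * 40  # stand-in for the "NULL" sha mentioned in the analysis


def simulate_update(target_sha, cloned_head_sha, submodule_branches, branch="master"):
    """Return (head_sha, files_checked_out, warnings) after a modeled update."""
    warnings = []

    # Step 1: clone with -n. HEAD points at a commit, but no files are written.
    head_sha = cloned_head_sha
    files_checked_out = False

    # Step 2: look up the tracking branch. Only if it exists is the repo
    # marked as "not checked out" by moving it to the NULL sha.
    if branch in submodule_branches:
        head_sha = NULL_SHA
    else:
        warnings.append(f"Failed to checkout tracking branch refs/heads/{branch}")

    # Step 3: check out files only if the repo is not already at the target.
    if head_sha != target_sha:
        head_sha = target_sha
        files_checked_out = True

    return head_sha, files_checked_out, warnings


tip_of_develop = "a" * 40  # placeholder sha for the tip of develop

# Repro case: the submodule has no master branch.
buggy = simulate_update(tip_of_develop, tip_of_develop, {"develop"})

# Workaround case: a dummy master branch exists.
workaround = simulate_update(tip_of_develop, tip_of_develop, {"develop", "master"})

print("repro:      files checked out =", buggy[1], "| warnings =", buggy[2])
print("workaround: files checked out =", workaround[1], "| warnings =", workaround[2])
```

In this model the repro case ends at the right sha with nothing checked out, while the dummy `master` branch forces the checkout step to run. A fix along these lines would presumably need step 3 to account for the `-n` clone regardless of whether step 2 succeeded, but that is only a guess at the shape of a fix.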