You only need to run script/setup once to download the data.
# clone this repository
git clone https://github.com/github/CodeSearchNet.git
cd CodeSearchNet/
# download data (~3.5GB) from S3; build and run the Docker container
script/setup
# this will drop you into the shell inside a Docker container
script/console
# optional: log in to W&B to see your training metrics,
# track your experiments, and submit your models to the benchmark
wandb login
# verify your setup by training a tiny model
python train.py --testrun
# see other command line options, try a full training run with default values,
# and explore other model variants by extending this baseline script
python train.py --help
python train.py
# generate predictions for model evaluation
python predict.py -r github/CodeSearchNet/0123456  # this is the org/project_name/run_id
Finally, you can submit your run to the community benchmark by following these instructions.
@article{husain2019codesearchnet,
title={{CodeSearchNet} challenge: Evaluating the state of semantic code search},
author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
journal={arXiv preprint arXiv:1909.09436},
year={2019}
}
The primary dataset consists of (comment, code) pairs from open source libraries. Concretely, a comment is a top-level function or method comment (e.g. docstrings in Python), and code is an entire function or method. Currently, the dataset contains Python, JavaScript, Ruby, Go, Java, and PHP code. Throughout this repo, we refer to the terms docstring and query interchangeably. We partition the data into train, validation, and test splits such that code from the same repository can only exist in one partition. Currently this is the only dataset on which we train our model. Summary statistics about this dataset can be found in this notebook.
For more information about how to obtain the data, see this section.
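The repository-level partitioning described above can be sketched in a few lines. This is a minimal illustration under assumed details, not the project's actual splitting code: the hash-based assignment and the 80/10/10 proportions are my assumptions.

```python
import hashlib

def assign_partition(repo_name, train=0.8, valid=0.1):
    """Deterministically map a repository to a split, so every
    function from one repo lands in exactly one partition.
    (Illustrative only; not the dataset's actual procedure.)"""
    # Hash the repo name to a stable bucket in [0, 1).
    h = int(hashlib.md5(repo_name.encode("utf-8")).hexdigest(), 16)
    bucket = (h % 10_000) / 10_000
    if bucket < train:
        return "train"
    if bucket < train + valid:
        return "valid"
    return "test"

records = [
    {"repo": "soimort/you-get", "func_name": "YouTube.get_vid_from_url"},
    {"repo": "soimort/you-get", "func_name": "YouTube.download"},
]
# Both functions come from the same repo, so they share a partition.
parts = {assign_partition(r["repo"]) for r in records}
assert len(parts) == 1
```

Keying the split on the repository rather than the individual function prevents near-duplicate code from leaking between train and test.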
Download the data by running script/setup:
script/setup
This will build Docker containers and download the datasets. By default, the data is downloaded into the resources/data/ folder inside this repository, with the directory structure described here.
The datasets you will download (most of them compressed) have a combined size of only ~3.5 GB.
To start the Docker container, run script/console:
script/console
This will land you inside the Docker container, starting in the /src directory. You can detach from/attach to this container to pause/continue your work.
For more about the data, see Data Details below, as well as this notebook.
The data is downloaded into the /resources/data folder of this repository, with this directory structure.
●code: the part of the original_string that is code
●code_tokens: tokenized version of code
●docstring: the top-level comment or docstring, if it exists in the original string
●docstring_tokens: tokenized version of docstring
●sha: this field is not being used [TODO: add note on where this comes from?]
●partition: a flag indicating which partition this datum belongs to, one of {train, valid, test, etc.}. This is not used by the model; instead we rely on directory structure to denote the partition of the data.
●url: the url for the code snippet including the line numbers
Code, comments, and docstrings are extracted in a language-specific manner, removing artifacts of that language.
{
'code': 'def get_vid_from_url(url):\n'
' """Extracts video ID from URL.\n'
' """\n'
" return match1(url, r'youtu\\.be/([^?/]+)') or \\\n"
" match1(url, r'youtube\\.com/embed/([^/?]+)') or \\\n"
" match1(url, r'youtube\\.com/v/([^/?]+)') or \\\n"
" match1(url, r'youtube\\.com/watch/([^/?]+)') or \\\n"
" parse_query_param(url, 'v') or \\\n"
" parse_query_param(parse_query_param(url, 'u'), 'v')",
'code_tokens': ['def',
'get_vid_from_url',
'(',
'url',
')',
':',
'return',
'match1',
'(',
'url',
',',
"r'youtu\\.be/([^?/]+)'",
')',
'or',
'match1',
'(',
'url',
',',
"r'youtube\\.com/embed/([^/?]+)'",
')',
'or',
'match1',
'(',
'url',
',',
"r'youtube\\.com/v/([^/?]+)'",
')',
'or',
'match1',
'(',
'url',
',',
"r'youtube\\.com/watch/([^/?]+)'",
')',
'or',
'parse_query_param',
'(',
'url',
',',
"'v'",
')',
'or',
'parse_query_param',
'(',
'parse_query_param',
'(',
'url',
',',
"'u'",
')',
',',
"'v'",
')'],
'docstring': 'Extracts video ID from URL.',
'docstring_tokens': ['Extracts', 'video', 'ID', 'from', 'URL', '.'],
'func_name': 'YouTube.get_vid_from_url',
'language': 'python',
'original_string': 'def get_vid_from_url(url):\n'
' """Extracts video ID from URL.\n'
' """\n'
" return match1(url, r'youtu\\.be/([^?/]+)') or \\\n"
                   " match1(url, r'youtube\\.com/embed/([^/?]+)') or \\\n"
                   " match1(url, r'youtube\\.com/v/([^/?]+)') or \\\n"
                   " match1(url, r'youtube\\.com/watch/([^/?]+)') or \\\n"
                   " parse_query_param(url, 'v') or \\\n"
" parse_query_param(parse_query_param(url, 'u'), "
"'v')",
'partition': 'test',
'path': 'src/you_get/extractors/youtube.py',
'repo': 'soimort/you-get',
'sha': 'b746ac01c9f39de94cac2d56f665285b0523b974',
'url': 'https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143'
}
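Records in this schema can be parsed with nothing but the standard library, since the released files store one JSON object per line (.jsonl). The sketch below uses a small in-memory payload standing in for a real file; the field values are abbreviated, but the field names match the schema above.

```python
import json

# A two-record .jsonl payload standing in for one of the released files.
payload = "\n".join(json.dumps(r) for r in [
    {"func_name": "YouTube.get_vid_from_url", "language": "python",
     "docstring_tokens": ["Extracts", "video", "ID", "from", "URL", "."]},
    {"func_name": "YouTube.download", "language": "python",
     "docstring_tokens": ["Download", "a", "video", "."]},
])

# Parse line by line, exactly as you would iterate over an open file.
records = [json.loads(line) for line in payload.splitlines()]
avg_len = sum(len(r["docstring_tokens"]) for r in records) / len(records)
print(f"{len(records)} records, avg docstring length {avg_len:.1f} tokens")
```

The same loop works unchanged over a real file object, which is why the line-delimited format is convenient for datasets too large to hold in memory at once.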
Summary statistics such as row counts and token length histograms can be found in this notebook.
/script/setup will automatically download these files into the /resources/data directory. Here are the links to the relevant files for visibility:
The S3 links follow this pattern:
https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/{python,java,go,php,javascript,ruby}.zip
For example, the link for the Java dataset is:
https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/java.zip
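Given that pattern, fetching one language's archive by hand is a short script. This is a hedged sketch, not part of the repository (script/setup already does this for you); the `download_and_extract` helper and the `resources/data` destination are assumptions.

```python
import io
import urllib.request
import zipfile

BASE = "https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2"

def dataset_url(language):
    """Build the download URL for one language archive,
    following the pattern documented above."""
    languages = {"python", "java", "go", "php", "javascript", "ruby"}
    if language not in languages:
        raise ValueError(f"unsupported language: {language}")
    return f"{BASE}/{language}.zip"

def download_and_extract(language, dest="resources/data"):
    """Fetch one archive and unpack it (network call; hypothetical helper)."""
    with urllib.request.urlopen(dataset_url(language)) as resp:
        zipfile.ZipFile(io.BytesIO(resp.read())).extractall(dest)

print(dataset_url("java"))
```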
The size of the dataset is approximately 20 GB. The various files and the directory structure are explained here.
The human annotations are provided as a .csv, with the following fields:
●Language: The programming language of the snippet.
●Query: The natural language query.
●GitHubUrl: The URL of the target snippet. This matches the URL key in the data (see here).
●Relevance: The 0-3 human relevance judgement, where "3" is the highest score (very relevant) and "0" is the lowest (irrelevant).
●Notes: A free-text field with notes that annotators optionally provided.
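A file with those fields loads directly with the standard csv module. The rows below are invented stand-ins for real annotations, used only to show the shape of the data; a common first step is filtering to snippets judged relevant.

```python
import csv
import io

# A miniature stand-in for the annotations .csv (rows are illustrative).
raw = """Language,Query,GitHubUrl,Relevance,Notes
python,parse a url,https://github.com/example/repo/blob/abc/u.py#L1-L9,3,
go,read a file,https://github.com/example/repo2/blob/def/f.go#L4-L20,0,too generic
"""

rows = list(csv.DictReader(io.StringIO(raw)))
# Keep only snippets judged relevant (score 2 or 3 on the 0-3 scale).
relevant = [r for r in rows if int(r["Relevance"]) >= 2]
print(len(relevant), relevant[0]["Query"])
```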
Our baseline models ingest a parallel corpus of (comments, code) pairs and learn to retrieve a code snippet given a natural language query. Specifically, comments are top-level function and method comments (e.g. docstrings in Python), and code is an entire function or method. Throughout this repo, we refer to the terms docstring and query interchangeably.
The query has a single encoder, whereas each programming language has its own encoder. The available encoders are Neural-Bag-Of-Words, RNN, 1D-CNN, Self-Attention (BERT), and a 1D-CNN+Self-Attention Hybrid.
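The retrieval setup can be illustrated with a toy joint-embedding scorer. Here sparse bag-of-words vectors and cosine similarity stand in for the trained neural encoders; the snippet names and token lists are invented for the example.

```python
import math
from collections import Counter

def embed(tokens):
    """Toy 'encoder': a sparse bag-of-words vector."""
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

snippets = {
    "get_vid_from_url": ["extract", "video", "id", "from", "url"],
    "read_config": ["load", "configuration", "file"],
}
query = embed(["extract", "video", "id"])
# Rank candidate code snippets by similarity to the query embedding.
best = max(snippets, key=lambda name: cosine(query, embed(snippets[name])))
print(best)
```

The real models replace `embed` with learned encoders (one for queries, one per programming language) but keep the same idea: map both sides into a shared vector space and rank by distance.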
The diagram below illustrates the general architecture of our baseline models:
A GPU is recommended for training (an AWS p3.2xlarge is sufficient).
Start the model run environment by running script/console:
script/console
This will drop you into the shell of a Docker container with all necessary dependencies installed, including the code in this repository, along with data that you downloaded earlier. By default, you will be placed in the src/ folder of this GitHub repository. From here you can execute commands to run the model.
Set up W&B (free for open source projects) per the instructions below if you would like to share your results on the community benchmark. This is optional but highly recommended.
The entry point to this model is src/train.py. You can see various options by executing the following command:
python train.py --help
To test if everything is working on a small dataset, you can run the following command:
python train.py --testrun
Now you are prepared for a full training run. Example commands to kick off training runs:
Training a neural-bag-of-words model on all languages
python train.py --model neuralbow
The above command will assume default values for the location(s) of the training data and a destination where you would like to save the output model. The default location for training data is specified in /src/data_dirs_{train,valid,test}.txt. These files each contain a list of paths where data for the corresponding partition exists. If more than one path is specified (separated by a newline), the data from all the paths will be concatenated together. For example, this is the content of src/data_dirs_train.txt:
$ cat data_dirs_train.txt
../resources/data/python/final/jsonl/train
../resources/data/javascript/final/jsonl/train
../resources/data/java/final/jsonl/train
../resources/data/php/final/jsonl/train
../resources/data/ruby/final/jsonl/train
../resources/data/go/final/jsonl/train
By default, models are saved in the resources/saved_models folder of this repository.
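Those newline-separated path lists are easy to consume in a few lines of Python. This is a sketch of the idea only; the helper name is mine and train.py's actual parsing may differ.

```python
from pathlib import Path

def read_data_dirs(path):
    """Return the non-empty data directories listed in a
    data_dirs_*.txt file, one path per line."""
    text = Path(path).read_text()
    return [line.strip() for line in text.splitlines() if line.strip()]

# Write a small stand-in file so the sketch is self-contained.
tmp = Path("data_dirs_train_example.txt")
tmp.write_text(
    "../resources/data/python/final/jsonl/train\n"
    "../resources/data/go/final/jsonl/train\n"
)
dirs = read_data_dirs(tmp)
print(dirs)
tmp.unlink()
```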
Training a 1D-CNN model on Python data only:
python train.py --model 1dcnn \
  /trained_models \
  ../resources/data/python/final/jsonl/train \
  ../resources/data/python/final/jsonl/valid \
  ../resources/data/python/final/jsonl/test
The above command overrides the default locations for saving the model to trained_models and also overrides the source of the train, validation, and test sets.
Additional notes:
Options for --model are currently listed in src/model_restore_helper.get_model_class_from_name.
Hyperparameters are specific to the respective model/encoder classes. A simple trick to discover them is to kick off a run without specifying hyperparameter choices, as that will print a list of all used hyperparameters with their default values (in JSON format).
/src directory in this repository.
If it's your first time using W&B on a machine, you will need to log in:
$ wandb login
You will be asked for your API key, which appears on your W&B profile settings page.
_licenses.pkl files.
The code and documentation for this project are released under the MIT License.