A collection of handy Bash One-Liners and terminal tricks for data processing and Linux system maintenance.
Updated May 17, 2022
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
Select, put and delete data from JSON, TOML, YAML, XML and CSV files with a single tool. Supports conversion between formats and can be used as a Go package.
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
Concurrent and multi-stage data ingestion and data processing with Elixir
Large-scale pretraining for dialogue
Extract Transform Load for Python 3.5+
Describe the bug
pa.errors.SchemaErrors.failure_cases only returns the first 10 failure_cases
Note: Please read [this guide](https://matthewrocklin.c
(1) Add docstrings to methods
(2) Convert .format() calls to f-strings for readability
(3) Make sure we are using Python 3.8 throughout
(4) zip extract_all() in ingest_flights.py can be simplified with a Path parameter
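Items (2) and (4) above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not code from the repository in question: the helper names `describe` and `extract_archive` and the message text are hypothetical. Only `zipfile.ZipFile.extractall` and `pathlib.Path` are standard-library calls.

```python
import zipfile
from pathlib import Path
from typing import List


def describe(name: str, count: int) -> str:
    # Hypothetical helper: the f-string equivalent of
    # "{} has {} rows".format(name, count)
    return f"{name} has {count} rows"


def extract_archive(archive: Path, dest: Path) -> List[str]:
    """Extract every member of `archive` into `dest`; return the member names."""
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        # extractall accepts a path-like object, so no str() conversion is needed
        zf.extractall(path=dest)
        return zf.namelist()
```

Passing `Path` objects end to end avoids manual string joins, and `typing.List` keeps the annotations compatible with Python 3.8.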
Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
Setting pretrained_model_name not only defines the model architecture but also loads the pre-trained checkpoint. We should have another hparam to control whether the pre-trained checkpoint is loaded.
A list about Apache Kafka
Hello Benito,
For a specific task I need a "bitwise exclusive or"-function, but I realized xidel doesn't have one. So I created a function for that.
I was wondering if, in addition to the EXPath File Module, you'd be interested in integrating the EXPath Binary Module as well. Then I can use bin:xor() instead (although for
Engine for ML/Data tracking, visualization, dashboards, and model UI for Polyaxon.
Harmonious distributed data analysis in Rust.
Advanced and Fast Data Transformation in R
Write unit test coverage for SafeDataset and SafeDataLoader, along with the functions in utils.py.
Super fast list of dicts to pre-formatted tables conversion library for Python 2/3
Machine Learning notebooks for refreshing concepts.
Elastic data processing with Apache Pulsar and Apache Flink
The exception in subject is thrown by the following code:
from datetime import date
from pysparkling.sql.session import SparkSession
from pysparkling.sql.functions import collect_set
spark = SparkSession.Builder().getOrCreate()
dataset_usage = [
('steven', 'UUID1', date(2019, 7, 22)),
]
dataset_usage_schema = 'id: string, datauid: string, access_date: date'
df = spa
An open-source framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud.
Manipulating VASP files with Python.
Python Adaptive Signal Processing
Is your feature request related to a problem? Please describe.
To prepare medical NER detection, we need to create a reader for the BC5CDR in the BLUE Benchmark: https://github.com/ncbi-nlp/BLUE_Benchmark
Describe the solution you'd like
Describe alternatives you've considered
A clear and concise
Is your feature request related to a problem?
Currently, if a user tries to access an index that is larger than the dataset length or tensor length, an internal error is thrown which is not easy to understand.
Description of the possible solution
We can catch the error and throw a more descriptive e