parquet
Here are 185 public repositories matching this topic...
Hello everyone,
Recently I tried to set up petastorm on my company's Hadoop cluster. However, as the cluster uses Kerberos for authentication, connecting with petastorm failed. I figured out that petastorm relies on pyarrow, which does support Kerberos authentication. As a workaround, I patched petastorm/petastorm/hdfs/namenode.py at line 250 and replaced it with:
driver = 'libhdfs'
return pyarrow.hdfs.connect(...)

Quilt is a self-organizing data hub for S3
Updated Feb 8, 2021 · Jupyter Notebook
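The petastorm workaround described above boils down to forcing pyarrow's JNI-based `libhdfs` driver, which picks up the Kerberos ticket obtained with `kinit`. A minimal sketch of what the patched connection code might look like (the host name, port, and wrapper function are hypothetical illustrations, not petastorm's actual API; it assumes a valid Kerberos ticket already exists):

```python
def connect_hdfs(host="default", port=0, user=None):
    """Connect to a Kerberized HDFS cluster.

    Forces the JNI-based 'libhdfs' driver, which honours the Kerberos
    ticket cache created by `kinit` (the alternative driver is what
    failed in the issue above).
    """
    # Imported lazily so this module loads even where pyarrow is absent.
    # Note: the pyarrow.hdfs module is deprecated in recent pyarrow releases.
    import pyarrow

    driver = "libhdfs"  # the one-line change made in namenode.py
    return pyarrow.hdfs.connect(host, port, user=user, driver=driver)

# Usage (needs a reachable cluster and a valid Kerberos ticket):
# fs = connect_hdfs("namenode.example.com", 8020)
# print(fs.ls("/"))
```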
Currently, there isn't a way to get the table properties in the SparkOrcWriter via the WriterFactory.
High performance distributed data processing engine
Updated Jul 29, 2020 · JavaScript
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Updated Feb 8, 2021 · Python
A tool for batch loading data files (json, parquet, csv, tsv) into ElasticSearch
Updated Dec 19, 2020 · Python
Over time, some things have leaked into the diff methods that make BigDiffy more cumbersome to use from code than via the CLI.
For example diffAvro here https://github.com/spotify/ratatool/blob/master/ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/BigDiffy.scala#L284
The user has to manually pass in a schema; otherwise they receive an uninformative error about a null schema.
Updated Jan 10, 2021 · C#
Manipulate arrays of complex data structures as easily as with NumPy.
Updated Feb 8, 2021 · Python
A fully asynchronous, pure JavaScript implementation of the Parquet file format
Updated Dec 29, 2020 · JavaScript
A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL
Updated Feb 1, 2019 · TypeScript
A SQLite vtable extension to read Parquet files
Updated Nov 11, 2020 · C++
Problem description
Our CI takes some time to run, and a significant chunk of that time is spent creating the required Python environment.
We can speed this up by caching each newly created environment, so that we don't need to create the same environment more than once. (We may have to evict old environments eventually.)
We can use https://github.com/actions/cache for this
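For illustration, a cache step along these lines could be added to the workflow before the environment is created. The cached path and the key's dependency files are assumptions; they would need to match the project's actual setup:

```yaml
# Hypothetical sketch: cache pip downloads keyed on the requirements files,
# so an unchanged environment is restored instead of rebuilt.
- name: Cache Python environment
  uses: actions/cache@v2
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements*.txt') }}
    restore-keys: |
      ${{ runner.os }}-pip-
```

The `restore-keys` prefix lets a run fall back to the most recent partial match when the exact key misses, which still avoids most of the environment-creation cost.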
Simple Windows desktop application for viewing & querying Apache Parquet files
Updated Feb 8, 2021 · C#
Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
Updated Nov 29, 2020 · Scala
Go package to read and write Parquet files. Parquet is a file format for storing nested data structures in a flat columnar layout. It can be used in the Hadoop ecosystem and with tools such as Presto and AWS Athena.
Updated Feb 4, 2021 · Go
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Updated Feb 8, 2021 · Python
Schema registry for CSV, TSV, JSON, Avro, and Parquet schemas. Supports schema inference and a GraphQL API.
Updated Mar 5, 2020 · Scala
GCS support for avro-tools, parquet-tools and protobuf
Updated Nov 5, 2020 · Java
Append class to all HashCodeBuilders in Gaffer for the below issue to minimise hash collisions.