The LeoFS Storage System (Erlang, updated Jun 2, 2020)
Upserts, Deletes And Incremental Processing on Big Data.
Home of the community managed version of Presto, the distributed SQL query engine for big data, under the auspices of the Presto Software Foundation.
Business Intelligence and Data Warehousing
A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of Fishtown Analytics)
Python idiomatic SDK for Cortex™ Data Lake.
Apache Spark Course Material
Apiary provides modules which can be combined to create a federated cloud data lake
A framework for automatically terminating idle AWS EMR clusters. It uses AWS CloudWatch and AWS Lambda to run a Python script that calls Boto3 to terminate EMR clusters that have been idle for a specified period of time.
A collection of Apache Hudi resources
A custom extractor designed to read parquet for Azure Data Lake Analytics
Delta Lake Examples
Microsoft Big Data, Data Scientist, and AI
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake, with SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Companion repository to the LinkedIn Learning course 'Cloud NoSQL for SQL Pros'
The AI-Driven Social Media Dashboard solution provides customers with an easy-to-deploy CloudFormation template that uses Amazon Translate, Amazon Comprehend, Amazon Kinesis, Amazon Athena, and Amazon QuickSight to build a natural-language-processing (NLP)-powered social media dashboard for tweets.
PySpark Streaming from Kafka (0.8.2) to HDFS
Apache Spark 3 - Structured Streaming Course Material
A tool to form a data lake on AWS from your data
Uses Spark to read data from S3, wrangle it, and write it back to S3 with a better schema
End-to-end big data project that aims to show how to implement the different big data layers, from the infrastructure layer to the end-user layer. [HADOOP][Spark][Kafka][Cassandra][Ansible][Jupyter][Docker]
Data lake project for the Sparkify music platform. Written with PySpark and run on an EMR cluster on AWS.
Udacity Nanodegree course projects.