data-engineering
Here are 693 public repositories matching this topic...
Updated Feb 6, 2021
Description
I have set up a custom, remote Prefect server.
However, when registering a flow, only localhost is displayed in the Flow URL:
$ prefect register flow --file ./myflow.py -p sandbox
Result check: OK
Flow URL: http://localhost:8080/default/flow/9235a237-f6bc-41c7-89bc-132db233b49e
└── ID: a09a47b0-1292-412f-bd70-89c8bf4dcf1e
Roadmap to becoming a data engineer in 2021
Updated Feb 18, 2021
Describe the bug
When trying to run the scaffolding (profiling) command, it fails because of commas in columns.
To Reproduce
Steps to reproduce the behavior:
- Run great_expectations suite scaffold scaffold-name on a datasource where commas are in a column
- Bug: pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 5323 saw 2
Expected behavior
D
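The tokenizing error reported above is consistent with a row in which an unquoted comma yields more fields than earlier rows did, although the snippet does not confirm the cause. A minimal sketch using Python's standard csv module (not Great Expectations or pandas; the sample rows are made up for illustration) shows how quoting changes the field count:

```python
import csv
import io

# Hypothetical sample data: the last row's value contains a comma.
unquoted = "name\nplain value\nvalue, with comma\n"
quoted = 'name\nplain value\n"value, with comma"\n'


def field_counts(text):
    """Return the number of fields the csv module sees on each line."""
    return [len(row) for row in csv.reader(io.StringIO(text))]


unquoted_counts = field_counts(unquoted)
quoted_counts = field_counts(quoted)

print(unquoted_counts)  # the unquoted comma splits the last row into two fields
print(quoted_counts)    # quoting keeps the value as a single field
```

If the source file cannot be re-exported with quoting, reading it with a different delimiter is another option, assuming the data allows it.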
Declarative streaming ETL for mundane tasks, written in Go
Updated Feb 17, 2021 - Go
A list of useful resources to learn Data Engineering from scratch
Updated Jan 13, 2021
Updated Feb 15, 2021 - JavaScript
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Updated Feb 15, 2021 - Python
Quilt is a self-organizing data hub for S3
Updated Feb 17, 2021 - Jupyter Notebook
Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Updated Feb 10, 2021 - Jupyter Notebook
Enable delete repository action from the UI
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Updated Mar 9, 2020 - Python
Problem description
When I use the function of concatenating multiple columns, I find that it does not handle null values as expected.
This is the current output
df.concatenate_columns(["cat_1","cat_2","cat_3"],"cat",sep=",")
| cat_1 | cat_2 |
|---|
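The issue text is cut off before it states the expected output, so the following is only a sketch of one possible null-aware behaviour: skip missing values when joining. It uses plain Python dictionaries rather than the library's concatenate_columns, and the helper name concat_non_null and the sample rows are hypothetical.

```python
def concat_non_null(row, columns, sep=","):
    """Join the values of `columns` in `row`, skipping entries that are None."""
    parts = [str(row[c]) for c in columns if row.get(c) is not None]
    return sep.join(parts)


# Hypothetical rows; None stands in for a null value.
rows = [
    {"cat_1": "a", "cat_2": "b", "cat_3": "c"},
    {"cat_1": "a", "cat_2": None, "cat_3": "c"},
    {"cat_1": None, "cat_2": None, "cat_3": None},
]

result = [concat_non_null(r, ["cat_1", "cat_2", "cat_3"]) for r in rows]
print(result)
```

Skipping nulls avoids doubled separators such as "a,,c"; whether an all-null row should produce an empty string or a null is a design choice this sketch does not settle.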
If they are not class methods, the method would be invoked for every test and a session would be created for each of those tests.
`class PySparkTest(unittest.TestCase):
    @classmethod
    def suppress_py4j_logging(cls):
        logger = logging.getLogger('py4j')
        logger.setLevel(logging.WARN)

    @classmethod
    def create_testing_pyspark_session(cls):
        return Sp
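The point about class methods can be illustrated with unittest's setUpClass hook, which runs once per test class instead of once per test. This is a minimal sketch, not the code from the issue: FakeSession is a hypothetical stand-in for an expensive resource such as a Spark session, and it simply counts how many times it is constructed.

```python
import io
import unittest


class FakeSession:
    """Hypothetical stand-in for an expensive resource (e.g. a Spark session)."""

    created = 0

    def __init__(self):
        FakeSession.created += 1


class SharedSessionTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Runs once for the whole class, so every test shares one session.
        cls.session = FakeSession()

    def test_one(self):
        self.assertIsNotNone(self.session)

    def test_two(self):
        self.assertIsNotNone(self.session)


def run_and_count():
    """Run the test class and report (tests run, sessions created)."""
    FakeSession.created = 0
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SharedSessionTest)
    outcome = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
    return outcome.testsRun, FakeSession.created


tests_run, sessions_created = run_and_count()
print(tests_run, sessions_created)
```

Moving the FakeSession() call into an instance-level setUp would create one session per test, which is the overhead the comment above describes.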
Data validation and organization of metadata for data frames and database tables
Updated Feb 18, 2021 - R
Accumulated knowledge and experience in the field of Data Engineering
Updated Jan 17, 2021
A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.
Updated Mar 5, 2020 - Python
Turn complex requirements to workflows without leaving the comfort of your technology stack.
Updated Feb 16, 2021 - Ruby
Updated Feb 7, 2021 - CSS
An Awesome List of Open-Source Data Engineering Projects
Updated Feb 13, 2021
Dataform is a framework for managing SQL based data operations in BigQuery, Snowflake, and Redshift
Updated Feb 10, 2021 - TypeScript
Egeria's open metadata labs use Python notebooks to drive sequences of REST API calls to Egeria's runtime platform, called the OMAG Server Platform. There is one function called printAssetUniverse that needs work. This function is designed to provide a data scientist with detailed information about an Asset (such as a file or a database). This includes name, description, its location, content,
Cascading is a feature rich API for defining and executing complex and fault tolerant data processing workflows on various cluster computing platforms. Please see https://github.com/cwensel/cascading for access to all WIP branches.
Updated Nov 29, 2018 - Java
Updated Apr 20, 2020 - Python
A daily digest of the articles or videos I've found interesting, that I want to share with you.
Updated Feb 13, 2021
A Data Engineering & Machine Learning Knowledge Hub
Updated Feb 13, 2021
A package to easily open an instance of a Google spreadsheet and interact with worksheets through Pandas DataFrames.
Updated Feb 16, 2021 - Python
Enterprise-grade, production-hardened, serverless data lake on AWS
Updated Feb 9, 2021 - Python
An automatic ML model optimization tool.
Updated Dec 4, 2020 - Python


BigQuery error is hard to read.
Expected results
In Explore, when creating a bad expression (say DATE_TRUNC(column_that_dont_exist, DAY)) in BigQuery, the DatabaseError is shown as an UnknownError. In SQL Lab, DatabaseErrors are surfaced properly and use a monospace font so that the formatting is preserved. For most databases, the formatting doesn't matter much, but for BigQ