data-engineering
Here are 1,060 public repositories matching this topic...
The Data Engineering Cookbook
Updated Jan 2, 2022
Roadmap to becoming a data engineer in 2021
Updated May 28, 2021
Current behavior
You get an error if you try to upload a file with a name that already exists:
```
azure.core.exceptions.ResourceExistsError: The specified blob already exists.
RequestId:5bef0cf1-b01e-002e-6
```
Proposed behavior
The task should take in an overwrite argument and pass it to [this line](https://github.com/PrefectHQ/prefect/blob/6cd24b023411980842fa77e6c0ca2ced47eeb83e/src/prefect/
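A minimal sketch of what this could look like, assuming the task ultimately calls the Azure SDK's `upload_blob`; the function and parameter names here are illustrative, not Prefect's actual API:
```python
from azure.storage.blob import BlobClient

def upload_to_blob(data: bytes, conn_str: str, container: str,
                   blob_name: str, overwrite: bool = False) -> None:
    # Hypothetical task body: forwarding `overwrite` to upload_blob
    # avoids ResourceExistsError when the blob already exists.
    blob = BlobClient.from_connection_string(
        conn_str, container_name=container, blob_name=blob_name
    )
    blob.upload_blob(data, overwrite=overwrite)
```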
Describe the bug
Data docs columns shrink to 1-character width when the batch is built from a long query.
To Reproduce
Steps to reproduce the behavior:
- Make a batch from a long query string (a repro sketch follows the screenshot below)
- Run validation
- Render the result to data docs
- See the screenshot
[Screenshot: Data documentation compiled by Great Expectations]
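A rough repro sketch, assuming the pre-0.13 batch-kwargs API; the datasource and suite names are placeholders:
```python
import great_expectations as ge

context = ge.data_context.DataContext()
long_query = "SELECT " + ", ".join(f"col_{i}" for i in range(50)) + " FROM some_table"

# Build a batch from the long query string, validate it, then render data docs
batch = context.get_batch(
    {"datasource": "my_sql_datasource", "query": long_query},
    "my_expectation_suite",
)
context.run_validation_operator("action_list_operator", assets_to_validate=[batch])
context.build_data_docs()  # the rendered columns shrink to 1-character width
```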
Under the hood, the Benthos csv input uses the standard encoding/csv package's csv.Reader struct.
The current implementation of csv input doesn't allow setting the LazyQuotes field.
We have a use case where we need to set the LazyQuotes field in order to make things work correctly.
Is your feature request related to a problem? Please describe.
I have a framework that handles the offline store. It creates the tables and indexes, reads data from different data sources, does some transformations, and then inserts into the offline store. As part of this, I can construct the entities, feature views, feature services, etc., as an instance of the ParsedRepo class for Feast. What I n
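For context, constructing these objects in code might look roughly like this (the names are placeholders, and Feast's exact API varies by version):
```python
from datetime import timedelta
from feast import Entity, Feature, FeatureView, FileSource, ValueType

# Hypothetical definitions; in the framework these would be built dynamically
driver = Entity(name="driver_id", value_type=ValueType.INT64)
source = FileSource(
    path="data/driver_stats.parquet",
    event_timestamp_column="event_timestamp",
)
driver_stats = FeatureView(
    name="driver_stats",
    entities=["driver_id"],
    ttl=timedelta(days=1),
    features=[Feature(name="avg_trips", dtype=ValueType.FLOAT)],
    batch_source=source,
)
```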
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Updated Jan 7, 2022 - Python
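Typical usage is a thin wrapper over pandas; for example (the bucket, table, and database names are placeholders):
```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# Write a Parquet dataset to S3
wr.s3.to_parquet(df, path="s3://my-bucket/dataset/", dataset=True)

# Read query results from Athena straight into a DataFrame
df2 = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")
```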
In the architecture page, under the Overview section, the overview image shows two LBs pointing to a lakeFS environment. This is no longer true and should show a single LB pointing to that environment.
A list of useful resources to learn Data Engineering from scratch
Updated Oct 29, 2021
Quilt is a self-organizing data hub for S3
Updated Jan 7, 2022 - Jupyter Notebook
Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Updated Jan 7, 2022 - Jupyter Notebook
A comprehensive list of 180+ YouTube Channels for Data Science, Data Engineering, Machine Learning, Deep learning, Computer Science, programming, software engineering, etc.
Updated Dec 31, 2021
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Updated Mar 9, 2020 - Python
click has a CliRunner to test CLI applications; however, it's limiting (e.g., monkeypatch doesn't work well). So we started to modify the test_cli.py tests to call the functions directly (e.g., `install.main(use_lock=True)`). But given this change, we are no longer testing that CLI args actually become the right function arguments (e.g., that passing --use-lock implies `install.main(use_lock=True)`).
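A minimal, self-contained sketch of the trade-off; the `install` command here is a hypothetical stand-in for the real one:
```python
import click
from click.testing import CliRunner

# Hypothetical stand-in for the real `install` command
@click.command()
@click.option("--use-lock", is_flag=True)
def install(use_lock):
    click.echo(f"use_lock={use_lock}")

def test_flag_reaches_function():
    # Invoking through CliRunner exercises argument parsing itself,
    # so --use-lock is verified end to end rather than assumed.
    result = CliRunner().invoke(install, ["--use-lock"])
    assert result.exit_code == 0
    assert "use_lock=True" in result.output
```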
If they were not class methods, the method would be invoked for every test, and a session would be created for each of those tests.
```python
import logging
import unittest

from pyspark.sql import SparkSession

class PySparkTest(unittest.TestCase):
    @classmethod
    def suppress_py4j_logging(cls):
        logger = logging.getLogger('py4j')
        logger.setLevel(logging.WARN)

    @classmethod
    def create_testing_pyspark_session(cls):
        # completing the truncated original: one local session shared by the class
        return SparkSession.builder.master('local[2]').getOrCreate()
```
Background
This thread is borne out of the discussion in #968, in an effort to make the documentation more beginner-friendly and more understandable.
One of the subtasks mentioned in that thread was to go through the function docstrings and include a minimal working example for each of the public functions in pyjanitor (an illustrative example follows below).
Criteria reiterated here for the benefit of discussion:
It sh
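As an illustration, a minimal working docstring example for pyjanitor's `clean_names` might look like this:
```python
import pandas as pd
import janitor  # noqa: F401 -- registers pyjanitor methods on DataFrame

df = pd.DataFrame({"First Name": ["anna", "bo"], "Total Score": [9.5, 8.0]})
df.clean_names()
#   first_name  total_score
# 0       anna          9.5
# 1         bo          8.0
```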
A Data Engineering & Machine Learning Knowledge Hub
Updated Jan 7, 2022
Accumulated knowledge and experience in the field of Data Engineering
Updated Jun 2, 2021
Data profiling, testing, and monitoring for SQL accessible data.
Updated Jan 6, 2022 - Python
A few projects related to Data Engineering, including Data Modeling, infrastructure setup on the cloud, Data Warehousing, and Data Lake development.
Updated Mar 5, 2020 - Python
In a lot of classes we use LoggerFactory to initialize the logger:
```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DefaultAuthorizer implements Authorizer {
    private static final Logger LOG = LoggerFactory.getLogger(DefaultAuthorizer.class);
}
```
This could be simplified to the following, with no need to initialize the logger through LoggerFactory:
```java
import lombok.extern.slf4j.Slf4j;

@Slf4j
public class DefaultAuthorizer implements Authorizer {
    // the `log` field is generated by Lombok at compile time
}
```
An Awesome List of Open-Source Data Engineering Projects
Updated Oct 25, 2021
Machine Learning automation and tracking
Updated Jan 6, 2022 - Python
Polyglot workflows without leaving the comfort of your technology stack.
Updated Nov 6, 2021 - Ruby
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
A large amount of output goes to the log; this should not happen by default.
Expected Behavior
Much less content in the output of the FVT and the build by default.
Switching on debug in the logging configuration should then show all the output.
Steps To Reproduce
Run the build.
Environment
Dataform is a framework for managing SQL-based data operations in BigQuery, Snowflake, and Redshift
Updated Dec 9, 2021 - TypeScript
Screenshot
I've added a red vertical ruler so that you can see the issue.
Description
As already explained in numerous issues, the use of the 'Inter' font is problematic: it does not allow dates to be aligned, for instance, and does not play nicely with numbers either.
In my supe