Python package for the pysparkonk8s Apache Airflow provider.
This project was completed as part of the CIT 650 "Intro To Big Data" course at Nile University.
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) can produce large simulated / synthetic data sets for tests, POCs, and other uses in Databricks environments, including Delta Live Tables pipelines.
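The core idea behind such a generator — producing rows from a declarative column specification — can be sketched in plain Python. This is only an illustration of the concept using the standard library, not the `dbldatagen` API; the schema and column names below are hypothetical.

```python
import random

def generate_rows(n_rows, schema, seed=42):
    """Generate synthetic rows from a simple column spec.

    Plain-Python illustration of spec-driven data generation,
    NOT the dbldatagen API; the spec format here is made up.
    """
    rng = random.Random(seed)  # seeded for reproducible test data
    rows = []
    for i in range(n_rows):
        row = {}
        for col, spec in schema.items():
            kind = spec["type"]
            if kind == "id":
                row[col] = i  # monotonically increasing key
            elif kind == "int":
                row[col] = rng.randint(spec["min"], spec["max"])
            elif kind == "choice":
                row[col] = rng.choice(spec["values"])
        rows.append(row)
    return rows

# Hypothetical schema: a surrogate key, a bounded price, a category.
schema = {
    "order_id": {"type": "id"},
    "price": {"type": "int", "min": 1, "max": 100},
    "category": {"type": "choice", "values": ["A", "B", "C"]},
}
rows = generate_rows(5, schema)
```

A real generator adds distributions, weights, and Spark parallelism on top of this pattern, but the declarative spec-to-rows mapping is the same.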
the portable Python dataframe library
Possibly the fastest DataFrame-agnostic quality check library in town.
Projects and studies in the data engineering area.
Sparkling Water provides H2O functionality inside a Spark cluster.
This project is an ETL pipeline that fetches market data from the Albion Online Data API, processes it with PySpark, and stores it in MongoDB. It demonstrates real-time data extraction, transformation using Spark, and efficient NoSQL storage, providing insights into market trends and historical prices for Albion Online items.
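The pipeline described above follows the classic extract-transform-load shape: fetch raw records, clean and reshape them, then write them to a store. A minimal plain-Python sketch of that shape, using an in-memory list in place of MongoDB and an illustrative payload — the field names are assumptions, not the real Albion Online Data API schema or the PySpark code:

```python
import json

# Hypothetical raw API payload, loosely shaped like market-price
# records; field names are illustrative only.
RAW = json.dumps([
    {"item_id": "T4_BAG", "city": "Caerleon", "sell_price_min": 1200},
    {"item_id": "T4_BAG", "city": "Lymhurst", "sell_price_min": 0},
    {"item_id": "T5_CAPE", "city": "Caerleon", "sell_price_min": 5400},
])

def extract(payload):
    """Extract: parse the raw JSON response into records."""
    return json.loads(payload)

def transform(records):
    """Transform: drop records with no listed price (0 stands for
    'no offer') and keep only the fields downstream consumers need."""
    return [
        {"item": r["item_id"], "city": r["city"], "price": r["sell_price_min"]}
        for r in records
        if r["sell_price_min"] > 0
    ]

def load(records, store):
    """Load: append the cleaned records to a store (a list standing
    in for a MongoDB collection here)."""
    store.extend(records)
    return len(records)

store = []
loaded = load(transform(extract(RAW)), store)
```

In the actual project, `transform` would be a set of Spark DataFrame operations and `load` a MongoDB write, but the three-stage structure is the same.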
💜🌈📊 A Data Engineering Project that implements an ETL data pipeline using Dagster, Apache Spark, Streamlit, MinIO, Apache Superset, Dbt 🌺
State of the Art Natural Language Processing
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
An open source, standard data file format for graph data storage and retrieval.
Simple and Distributed Machine Learning
Material for the course "Introduction to Apache Spark APIs for Data Processing" https://sparktraining.web.cern.ch/