Python package for the pysparkonk8s Apache Airflow provider.
This project was completed as part of the CIT 650 "Intro To Big Data" course at Nile University.
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) can produce large simulated / synthetic data sets for tests, POCs, and other uses in Databricks environments, including Delta Live Tables pipelines.
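The core idea behind such a generator — producing rows from a declarative column specification — can be sketched in plain Python. This is only an illustration of the concept using the standard library, not the `dbldatagen` API; the schema and column names below are hypothetical.

```python
import random

def generate_rows(n_rows, schema, seed=42):
    """Generate synthetic rows from a simple column spec.

    Plain-Python illustration of spec-driven data generation,
    NOT the dbldatagen API; the spec format here is made up.
    """
    rng = random.Random(seed)  # seeded for reproducible test data
    rows = []
    for i in range(n_rows):
        row = {}
        for col, spec in schema.items():
            kind = spec["type"]
            if kind == "id":
                row[col] = i  # monotonically increasing key
            elif kind == "int":
                row[col] = rng.randint(spec["min"], spec["max"])
            elif kind == "choice":
                row[col] = rng.choice(spec["values"])
        rows.append(row)
    return rows

# Hypothetical schema: a surrogate key, a bounded price, a category.
schema = {
    "order_id": {"type": "id"},
    "price": {"type": "int", "min": 1, "max": 100},
    "category": {"type": "choice", "values": ["A", "B", "C"]},
}
rows = generate_rows(5, schema)
```

A real generator adds distributions, weights, and Spark parallelism on top of this pattern, but the declarative spec-to-rows mapping is the same.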
the portable Python dataframe library
Possibly the fastest DataFrame-agnostic quality check library in town.
Projects and studies in the data engineering area.
Sparkling Water provides H2O functionality inside a Spark cluster.
This project is an ETL pipeline that fetches market data from the Albion Online Data API, processes it with PySpark, and stores it in MongoDB. It demonstrates real-time data extraction, transformation using Spark, and efficient NoSQL storage, providing insights into market trends and historical prices for Albion Online items.
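The pipeline described above follows the classic extract-transform-load shape: fetch raw records, clean and reshape them, then write them to a store. A minimal plain-Python sketch of that shape, using an in-memory list in place of MongoDB and an illustrative payload — the field names are assumptions, not the real Albion Online Data API schema or the PySpark code:

```python
import json

# Hypothetical raw API payload, loosely shaped like market-price
# records; field names are illustrative only.
RAW = json.dumps([
    {"item_id": "T4_BAG", "city": "Caerleon", "sell_price_min": 1200},
    {"item_id": "T4_BAG", "city": "Lymhurst", "sell_price_min": 0},
    {"item_id": "T5_CAPE", "city": "Caerleon", "sell_price_min": 5400},
])

def extract(payload):
    """Extract: parse the raw JSON response into records."""
    return json.loads(payload)

def transform(records):
    """Transform: drop records with no listed price (0 stands for
    'no offer') and keep only the fields downstream consumers need."""
    return [
        {"item": r["item_id"], "city": r["city"], "price": r["sell_price_min"]}
        for r in records
        if r["sell_price_min"] > 0
    ]

def load(records, store):
    """Load: append the cleaned records to a store (a list standing
    in for a MongoDB collection here)."""
    store.extend(records)
    return len(records)

store = []
loaded = load(transform(extract(RAW)), store)
```

In the actual project, `transform` would be a set of Spark DataFrame operations and `load` a MongoDB write, but the three-stage structure is the same.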
💜🌈📊 A Data Engineering Project that implements an ETL data pipeline using Dagster, Apache Spark, Streamlit, MinIO, Apache Superset, Dbt 🌺
State of the Art Natural Language Processing
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
An open source, standard data file format for graph data storage and retrieval.
Simple and Distributed Machine Learning
Material for the course "Introduction to Apache Spark APIs for Data Processing" https://sparktraining.web.cern.ch/