
Streamline your data workflows 🚀

Introduction

dataflow is an issue tracker designed for data science and data analysis projects. The platform organizes project management around the concept of flows and provides additional features tailored to data-related tasks. Whether you're a data scientist, analyst, or enthusiast, dataflow aims to streamline your workflow.

Please note that current development is focused on the backend, core architecture, and internal developer tooling, so a frontend will not be released in the near future. As such, this repository documents the application architecture, APIs, and other non-user-facing concepts.

Table Of Contents

  1. Installation
  2. Application Architecture
  3. Links

Installation

Dependencies

  1. Clone the repository and install backend dependencies:
git clone https://github.com/RyanHUNGry/dataflow.git && cd ./dataflow/backend && npm install
  2. Create an environment variables file:
cd backend && touch .env
  3. Fill out the following environment variables inside .env:
NODE_ENV=... # development, test, or production

PG_DEV_DATABASE=... # development database
PG_DEV_USERNAME=... 
PG_DEV_PASSWORD=...
PG_DEV_HOST=...

DEV_PORT=...

DEV_JWT_SECRET=...

AWS_PUBLIC_KEY=...
AWS_SECRET_KEY=...

Starting a Server

  1. Start a server (defaults to http://localhost:8000):
npm start
# run nodemon process for development
npm run watch
  2. Ping the API with a tool such as Postman.

Application Architecture

Environments

dataflow uses a traditional three-environment setup, with environment variables dictating development, test, and production settings.
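
As a rough illustration, configuration could be selected per environment along these lines (a minimal sketch; the module name and config shape are assumptions, not the actual dataflow code):

// config.js (hypothetical): load .env and pick settings based on NODE_ENV
require('dotenv').config();

const env = process.env.NODE_ENV || 'development';

const config = {
  development: {
    database: process.env.PG_DEV_DATABASE,
    username: process.env.PG_DEV_USERNAME,
    password: process.env.PG_DEV_PASSWORD,
    host: process.env.PG_DEV_HOST,
    port: process.env.DEV_PORT,
    jwtSecret: process.env.DEV_JWT_SECRET,
  },
  // test and production entries would follow the same pattern
};

module.exports = config[env];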

AWS RDS

dataflow uses an AWS RDS PostgreSQL instance for data storage. The instance contains three databases, one each for development, test, and production. Connections are made over the PostgreSQL protocol with SSL encryption.
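
A minimal Knex.js connection sketch using the development variables above (the ssl option shown is one common way to enable SSL against RDS and is an assumption about this project's exact settings):

// db.js (hypothetical): build a query builder against the RDS instance
const knex = require('knex')({
  client: 'pg',
  connection: {
    host: process.env.PG_DEV_HOST,
    user: process.env.PG_DEV_USERNAME,
    password: process.env.PG_DEV_PASSWORD,
    database: process.env.PG_DEV_DATABASE,
    ssl: { rejectUnauthorized: false }, // encrypt the connection to RDS
  },
});

module.exports = knex;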

AWS S3

dataflow uses AWS S3 buckets to store the datasets attached to a flow, as well as the summary statistics for each dataset. Each bucket contains three folders, one each for development, test, and production.
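
As a sketch of that folder convention (the bucket name, helper, and file fields are hypothetical), an upload keyed by the current environment might look like:

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// store a dataset under a folder named after the current environment,
// e.g. development/<flowId>/<filename>
async function uploadDataset(flowId, file) {
  return s3.putObject({
    Bucket: 'dataflow-datasets', // hypothetical bucket name
    Key: `${process.env.NODE_ENV}/${flowId}/${file.originalname}`,
    Body: file.buffer,
  }).promise();
}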

AWS Lambda

When a dataset is uploaded to S3, AWS Lambda runs a Python script that uses pandas to compute summary statistics and writes the results to a second bucket. The Lambda function infers the environment from the object's folder prefix.

dataflow API

The dataflow API is built with Node.js and Express.js. Passport.js provides the authentication middleware, using JWTs, and Knex.js serves as the query builder for the AWS RDS PostgreSQL databases. The NODE_ENV environment variable controls how the API connects to external services. The API listens on port 8000 by default.
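
A minimal sketch of how the JWT middleware could be wired up (the route, payload fields, and verify logic are assumptions rather than the actual dataflow code):

const express = require('express');
const passport = require('passport');
const { Strategy: JwtStrategy, ExtractJwt } = require('passport-jwt');

// validate bearer tokens signed with the configured secret
passport.use(new JwtStrategy(
  {
    jwtFromRequest: ExtractJwt.fromAuthHeaderAsBearerToken(),
    secretOrKey: process.env.DEV_JWT_SECRET,
  },
  (payload, done) => done(null, { id: payload.id })
));

const app = express();

// protect a route with the JWT strategy
app.get('/api/flows', passport.authenticate('jwt', { session: false }), (req, res) => {
  res.json({ user: req.user });
});

app.listen(process.env.DEV_PORT || 8000);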

Testing Suite for dataflow API

The dataflow API comes with full unit and integration test suites. Tests should be run with NODE_ENV set to test so that the correct connections to external services are used. The suites are built on Mocha, Chai, and Sinon.
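
A small sketch of the stack in use (the module under test is a stand-in for illustration only):

const { expect } = require('chai');
const sinon = require('sinon');

// hypothetical module under test
const flowService = { notify(cb) { cb('flow created'); } };

describe('flowService.notify', () => {
  it('invokes the callback with a status message', () => {
    const spy = sinon.spy();
    flowService.notify(spy);
    expect(spy.calledOnceWith('flow created')).to.be.true;
  });
});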

Containerization

Like most REST APIs, the dataflow API is stateless and keeps no local data, so containerizing it only requires installing the application itself and supplying the environment variables and credentials needed to connect to external services.
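
For example, a container could be built and run roughly as follows (the image name and Dockerfile location are hypothetical; passing the environment via --env-file is the point):

docker build -t dataflow-api ./backend
docker run --env-file ./backend/.env -p 8000:8000 dataflow-api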

CI/CD Pipeline

WIP

Local Architecture Diagram

Production Architecture Diagram

Links

  1. Production application: Docker Hub
  2. Production API: http://54.215.249.98:8000/
