
Streamline your data workflows 🚀

Introduction

dataflow is an issue tracker designed for data science and data analysis projects. The platform organizes project management around the concept of flows and provides additional features tailored to data-related tasks. Whether you're a data scientist, analyst, or enthusiast, dataflow aims to streamline your workflow.

Please note that current development is focused on the backend, core architecture, and internal developer tooling, so a frontend will not be released in the near future. As such, this repository documents the application architecture, APIs, and other non-user-facing concepts.

Table Of Contents

  1. Installation
  2. Application Architecture
  3. Links

Installation

Dependencies

  1. Clone the repository and install backend dependencies:
git clone https://github.com/RyanHUNGry/dataflow.git && cd ./dataflow/backend && npm install
  2. Create an environment variables file:
cd backend && touch .env
  3. Fill out the following environment variables inside .env:
NODE_ENV=... # development, test, or production

PG_DEV_DATABASE=... # development database
PG_DEV_USERNAME=... 
PG_DEV_PASSWORD=...
PG_DEV_HOST=...

DEV_PORT=...

DEV_JWT_SECRET=...

AWS_PUBLIC_KEY=...
AWS_SECRET_KEY=...

Starting a Server

  1. Start a server (defaults to http://localhost:8000):
npm start
# run nodemon process for development
npm run watch
  2. Ping the API with a tool such as Postman.

Application Architecture

Environments

dataflow uses a traditional three-environment setup, with environment variables dictating development, test, and production settings.
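
As a rough illustration, configuration could be selected per environment along these lines (a minimal sketch; the module name and config shape are assumptions, not the actual dataflow code):

// config.js (hypothetical): load .env and pick settings based on NODE_ENV
require('dotenv').config();

const env = process.env.NODE_ENV || 'development';

const config = {
  development: {
    database: process.env.PG_DEV_DATABASE,
    username: process.env.PG_DEV_USERNAME,
    password: process.env.PG_DEV_PASSWORD,
    host: process.env.PG_DEV_HOST,
    port: process.env.DEV_PORT,
    jwtSecret: process.env.DEV_JWT_SECRET,
  },
  // test and production entries would follow the same pattern
};

module.exports = config[env];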

AWS RDS

dataflow uses an AWS RDS PostgreSQL instance for data storage. The instance contains three databases, one each for development, test, and production. Connections are made over the PostgreSQL protocol with SSL encryption.
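
A minimal Knex.js connection sketch using the development variables above (the ssl option shown is one common way to enable SSL against RDS and is an assumption about this project's exact settings):

// db.js (hypothetical): build a query builder against the RDS instance
const knex = require('knex')({
  client: 'pg',
  connection: {
    host: process.env.PG_DEV_HOST,
    user: process.env.PG_DEV_USERNAME,
    password: process.env.PG_DEV_PASSWORD,
    database: process.env.PG_DEV_DATABASE,
    ssl: { rejectUnauthorized: false }, // encrypt the connection to RDS
  },
});

module.exports = knex;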

AWS S3

dataflow uses AWS S3 buckets to store the datasets attached to a flow, as well as the summary statistics for each dataset. Each bucket contains three folders, one each for development, test, and production.
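
As a sketch of that folder convention (the bucket name, helper, and file fields are hypothetical), an upload keyed by the current environment might look like:

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// store a dataset under a folder named after the current environment,
// e.g. development/<flowId>/<filename>
async function uploadDataset(flowId, file) {
  return s3.putObject({
    Bucket: 'dataflow-datasets', // hypothetical bucket name
    Key: `${process.env.NODE_ENV}/${flowId}/${file.originalname}`,
    Body: file.buffer,
  }).promise();
}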

AWS Lambda

When a dataset is uploaded to S3, AWS Lambda runs a Python script that uses pandas to compute summary statistics and writes the results to a second bucket. The Lambda function infers the environment from the object's folder prefix.

dataflow API

The dataflow API is built with Node.js and Express.js. Passport.js provides the authentication middleware, using JWTs, and Knex.js serves as the query builder for the AWS RDS PostgreSQL databases. The NODE_ENV environment variable controls how the API connects to external services. The API listens on port 8000 by default.
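
A minimal sketch of how the JWT middleware could be wired up (the route, payload fields, and verify logic are assumptions rather than the actual dataflow code):

const express = require('express');
const passport = require('passport');
const { Strategy: JwtStrategy, ExtractJwt } = require('passport-jwt');

// validate bearer tokens signed with the configured secret
passport.use(new JwtStrategy(
  {
    jwtFromRequest: ExtractJwt.fromAuthHeaderAsBearerToken(),
    secretOrKey: process.env.DEV_JWT_SECRET,
  },
  (payload, done) => done(null, { id: payload.id })
));

const app = express();

// protect a route with the JWT strategy
app.get('/api/flows', passport.authenticate('jwt', { session: false }), (req, res) => {
  res.json({ user: req.user });
});

app.listen(process.env.DEV_PORT || 8000);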

Testing Suite for dataflow API

The dataflow API comes with full unit and integration test suites. Tests should be run with NODE_ENV set to test so that the correct connections to external services are used. The suites are built on Mocha, Chai, and Sinon.
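
A small sketch of the stack in use (the module under test is a stand-in for illustration only):

const { expect } = require('chai');
const sinon = require('sinon');

// hypothetical module under test
const flowService = { notify(cb) { cb('flow created'); } };

describe('flowService.notify', () => {
  it('invokes the callback with a status message', () => {
    const spy = sinon.spy();
    flowService.notify(spy);
    expect(spy.calledOnceWith('flow created')).to.be.true;
  });
});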

Containerization

Like most REST APIs, the dataflow API is stateless and keeps no local data, so containerizing it only requires installing the application itself and supplying the environment variables and credentials needed to connect to external services.
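
For example, a container could be built and run roughly as follows (the image name and Dockerfile location are hypothetical; passing the environment via --env-file is the point):

docker build -t dataflow-api ./backend
docker run --env-file ./backend/.env -p 8000:8000 dataflow-api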

CI/CD Pipeline

WIP

Local Architecture Diagram

Production Architecture Diagram

Links

  1. Production application: Docker Hub
  2. Production API: http://54.215.249.98:8000/
