
mini-etl

A simple Docker-based ETL (Extract, Transform, Load) tool for processing files.

Quick Start

  1. Create a new pipeline folder.
  2. Create a subfolder called input inside it and place the files to process there.
  3. Create a pipeline.yaml file in the root of the pipeline folder.
  4. Run the pipeline using the commands below.

🎉 The processed files will be saved in the output/ folder!

To run the pipeline, use the following commands, where the argument to rerun.sh is the path to your pipeline folder (presumably relative to the repository root):

    ./scripts/pipelines/build.sh
    ./scripts/pipelines/rerun.sh relative/path/to/pipeline

Pipeline Anatomy

Each pipeline is a directory whose name is the pipeline name. It contains the following:

  • input/: The input directory containing the files to process.
  • output/: The output directory where the processed files are saved.
  • pipeline.yaml: The pipeline configuration file.
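
For example, a pipeline named my-pipeline (the name and the input file here are hypothetical) would be laid out as:

    my-pipeline/
    ├── pipeline.yaml
    ├── input/
    │   └── report.pdf
    └── output/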

Input Directory

The input/ directory holds the files to process. They can be of any type, and the pipeline processes every file in the directory.

Output Directory

The output/ directory is where the pipeline saves the processed files.

Pipeline Configuration (pipeline.yaml)

The root of the pipeline directory contains a pipeline.yaml file that defines the pipeline configuration. It is a YAML file with the following structure:

Example pipeline.yaml:

    pipeline:
    - name: PyPDF
      transformer: transformers/extract-from-pdf/pypdf.py

The pipeline key is a list of steps. Each step is a dictionary with the following keys:

  • name: The name of the step.
  • transformer: The path to the transformer script.
  • copy_src_files: (optional) A boolean indicating whether to copy the source files to the output directory. Defaults to false. See the example below.
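
For instance, a two-step configuration using copy_src_files might look like the following. The second step's name and transformer path are hypothetical, included only to illustrate the shape of a multi-step pipeline:

    pipeline:
    - name: PyPDF
      transformer: transformers/extract-from-pdf/pypdf.py
    - name: Cleanup
      transformer: transformers/cleanup/strip_whitespace.py
      copy_src_files: true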

Pipeline Execution

The pipeline processes the files in the input/ directory using the transformers defined in pipeline.yaml, executing them in the order they are listed.

Each transformer is given the path to an input directory and a path to an output directory. The transformer is responsible for reading the files in the input directory, processing them, and writing the processed files to the output directory.

The runner creates intermediate input/output directories for each step in the pipeline, named after the step.
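
Conceptually, the execution loop looks something like the Python sketch below. This is not the actual runner (the real tool is Docker-based), just an illustration of the contract described above; the pipeline folder name and the assumption that the final step writes straight to output/ are mine:

    import os
    import subprocess

    import yaml  # PyYAML

    pipeline_dir = "my-pipeline"  # hypothetical pipeline folder

    with open(os.path.join(pipeline_dir, "pipeline.yaml")) as f:
        steps = yaml.safe_load(f)["pipeline"]

    # The first step reads from input/; each later step reads from the
    # previous step's output directory.
    input_dir = os.path.join(pipeline_dir, "input")
    for i, step in enumerate(steps):
        last = i == len(steps) - 1
        # Intermediate directories are named after the step.
        output_dir = os.path.join(pipeline_dir, "output" if last else step["name"])
        os.makedirs(output_dir, exist_ok=True)
        subprocess.run(
            ["python", step["transformer"]],
            env={
                **os.environ,
                "PIPELINE_DIR": pipeline_dir,
                "INPUT_DIR": input_dir,
                "OUTPUT_DIR": output_dir,
            },
            check=True,
        )
        input_dir = output_dir  # the next step reads this step's output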

Transformers

Transformers are Python scripts that read the files in the input directory, process them, and write the results to the output directory.

Each transformer is given the following environment variables:

  • PIPELINE_DIR: The path to the pipeline directory.
  • INPUT_DIR: The path to the input directory.
  • OUTPUT_DIR: The path to the output directory.
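
A minimal transformer might look like the sketch below. Only the three environment variables are part of the documented contract; the uppercase-copy transform is just a placeholder:

    import os
    from pathlib import Path

    # The pipeline runner supplies these via environment variables.
    input_dir = Path(os.environ["INPUT_DIR"])
    output_dir = Path(os.environ["OUTPUT_DIR"])
    output_dir.mkdir(parents=True, exist_ok=True)

    for src in input_dir.iterdir():
        if src.is_file():
            # Placeholder transform: uppercase the text content.
            (output_dir / src.name).write_text(src.read_text().upper())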

Writing a Transformer

[todo]
