A serverless and event-driven approach to build data quality pipeline with AWS Lambda and Great Expectations

vittoriopolverino/great-expectations-lambda

🧙 Great Expectations Lambda

A serverless and event-driven approach to build data quality pipeline with AWS Lambda and Great Expectations


🧐 About

Great Expectations (GE) is an open-source, Python-based data quality framework. It lets engineers write tests against their data, review the resulting reports, and assess data quality. It is pluggable, so you can add new expectations and customize the final reports.

AWS Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers.
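To make "event-driven" concrete, here is a minimal sketch of a handler that reads the trigger payload. It assumes the function is triggered by S3 event notifications, which this README does not state; `extract_s3_objects` and the returned dictionary are hypothetical names, and the Great Expectations validation step is left as a placeholder.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if s3 is None:
            # Not an S3 record (assumption: other event sources are skipped).
            continue
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects


def handler(event, context):
    """Lambda entry point: list the objects that would be validated."""
    targets = extract_s3_objects(event)
    # Placeholder: the data quality checks would run here for each target.
    return {"targets": [f"s3://{bucket}/{key}" for bucket, key in targets]}
```

The handler only reports which objects it received, so it can be exercised with a hand-written event before any validation logic is wired in.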

Unfortunately, AWS Lambda imposes certain quotas and limits on the size of the deployment package:

  • 50 MB (zipped, for direct upload)
  • 250 MB (unzipped). This quota applies to all the files you upload, including layers and custom runtimes.

As a result, deploying GE on Lambda takes some ingenuity. We can work around these limits by packaging and deploying the Lambda function as a container image, which can be up to 10 GB in size.
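To see whether a set of dependencies fits the unzipped quota above, a small check like the following can help. It is only a sketch: the function names are hypothetical, and it assumes 1 MB means 1024 * 1024 bytes.

```python
import os

# Quotas quoted above, assuming 1 MB = 1024 * 1024 bytes.
ZIPPED_LIMIT_BYTES = 50 * 1024 * 1024
UNZIPPED_LIMIT_BYTES = 250 * 1024 * 1024


def directory_size_bytes(path):
    """Sum the sizes of all regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def fits_unzipped_quota(path, limit=UNZIPPED_LIMIT_BYTES):
    """Return True if the directory contents fit within the given limit."""
    return directory_size_bytes(path) <= limit
```

Pointing `fits_unzipped_quota` at the directory where the dependencies were installed shows quickly whether a zip-based deployment is feasible or a container image is needed.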


🏁 Getting Started

Install packages in the virtualenv:

pipenv install --dev

💻 Usage

Make sure Docker is installed:

docker --version

Run the following script to build the Docker image, run the container, and test the Lambda function locally (no AWS account needed):

script/docker.sh
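The contents of `script/docker.sh` are not shown here, so the following is only a sketch of how a locally running Lambda container can be invoked by hand. It assumes the image is built on an AWS Lambda base image (which bundles the Runtime Interface Emulator) and that the container's port 8080 is published on host port 9000; both the port mapping and the function names are assumptions.

```python
import json
import urllib.request


def build_invocation_request(payload, host="localhost", port=9000):
    """Build the POST request the Runtime Interface Emulator expects."""
    url = f"http://{host}:{port}/2015-03-31/functions/function/invocations"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke_local(payload, host="localhost", port=9000, timeout=30):
    """Send the event to the local container and return the decoded JSON reply."""
    request = build_invocation_request(payload, host, port)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, `invoke_local({})` sends an empty event to the function and returns the function's JSON response.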

Run the following script to export the HTML documentation generated by Great Expectations to a local folder. If no local path is specified, C:/great_expectations_data_docs/ is used as the default.

script/export_data_docs.sh example/local/path

Go to the exported folder and open the index.html file:

img/ge_data_docs.png


🚀 Deploy

I personally recommend the Serverless Framework for deploying Lambda functions. Alternatively, the infra folder contains a Terraform example that creates the AWS infrastructure.

I've also added an example script that tags and pushes the Docker image to Amazon ECR and then updates the Lambda function to use the newly pushed image:

script/naive_deploy.sh
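The last step of that flow, pointing the function at the new image, can be sketched in Python as follows. This is not the content of `script/naive_deploy.sh`; the helper names are hypothetical, and the account ID, region, repository, tag, and function name are placeholders you would supply. The only AWS call used is the Lambda `update_function_code` operation with its `FunctionName` and `ImageUri` parameters.

```python
def ecr_image_uri(account_id, region, repository, tag):
    """Build the URI of an image stored in a private Amazon ECR repository."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def update_lambda_image(lambda_client, function_name, image_uri):
    """Point an image-based Lambda function at a newly pushed image.

    lambda_client is expected to behave like boto3.client("lambda").
    """
    return lambda_client.update_function_code(
        FunctionName=function_name,
        ImageUri=image_uri,
    )
```

With boto3 installed and AWS credentials configured, passing `boto3.client("lambda")` as `lambda_client` performs the update; taking the client as a parameter keeps the sketch easy to try against a stub first.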

⛏️ Built Using


✏️ Authors
