
Demo project to transfer files from GCP to AWS using Airflow

This project was created to show Airflow's cloud integrations and its ability to process and transfer files across multiple cloud providers using built-in Airflow operators.

Purpose

Transfer the contents of a GCP bucket to an AWS S3 bucket using Airflow's built-in GoogleCloudStorageToS3Operator.
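
For reference, below is a minimal sketch of what such a DAG can look like on Airflow 1.10 (the version this guide targets, given the airflow initdb command used later). The actual demo DAG lives in airflow/dags/transfer_data_gcp_to_aws_dag.py; the variable names (GCP_BUCKET, AWS_BUCKET) and connection IDs (google_cloud_default, aws_default) match the configuration steps below, but the DAG id, schedule, and default arguments here are illustrative only.

    from datetime import datetime

    from airflow import DAG
    from airflow.models import Variable
    from airflow.contrib.operators.gcs_to_s3 import GoogleCloudStorageToS3Operator

    default_args = {
        'owner': 'airflow',
        'start_date': datetime(2020, 1, 1),
    }

    # Illustrative DAG id and schedule; the demo DAG in this repo may differ.
    with DAG(dag_id='transfer_data_gcp_to_aws',
             default_args=default_args,
             schedule_interval=None,   # trigger manually for the demo
             catchup=False) as dag:

        transfer_gcs_to_s3 = GoogleCloudStorageToS3Operator(
            task_id='transfer_gcs_to_s3',
            # Source GCS bucket name, without any prefix (see step 3.2.2)
            bucket=Variable.get('GCP_BUCKET'),
            # Destination S3 key prefix, e.g. s3://aws-destination-bucket/ (see step 3.2.1)
            dest_s3_key=Variable.get('AWS_BUCKET'),
            google_cloud_storage_conn_id='google_cloud_default',
            dest_aws_conn_id='aws_default',
            replace=False,  # skip objects that already exist in the destination
        )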

Requirements

  • AWS account
  • GCP account
  • Python 3.7 on installation host (virtualenv installed)

Before proceeding with the installation, make sure you're using a GCP service account with the right permissions to access bucket objects. You can use https://console.cloud.google.com/iam-admin/troubleshooter and check for the storage.objects.list permission.
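
Optionally, you can sanity-check that the service account can list objects with a short Python snippet like the one below. This is a minimal sketch: it assumes the google-cloud-storage client library is installed (it is not necessarily part of requirements.txt), and the key-file path and bucket name are placeholders to replace with your own.

    from google.cloud import storage

    # Placeholder path and bucket name; replace with your own values.
    client = storage.Client.from_service_account_json('/path/to/service-account-key.json')
    bucket = client.bucket('gcp-source-bucket')

    # Listing a few objects exercises the storage.objects.list permission.
    for blob in bucket.list_blobs(max_results=5):
        print(blob.name)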

Local setup

  • Step 1 - Create and activate a virtual environment (Python 3.7). The steps below assume you're located in the project root folder. You can skip this step if you open the project with PyCharm and create the virtual environment from there.

    • Step 1.1 - Create virtual environment (optional name, recommended: venv):

      virtualenv --python=/Library/Frameworks/Python.framework/Versions/3.7/bin/python3 venv   

      Note: confirm your Python installation path first (the one above is macOS-specific).

    • Step 1.2 - Activate virtual environment

      source venv/bin/activate

      Note: Once you have created and activated your virtual environment, set it as the Python interpreter in PyCharm.

    • Step 1.3 - Install dependencies (command-line instructions; this can also be done with PyCharm's tools, just make sure you're using the project virtual environment and not the global Python installation).

      pip install -r requirements.txt
  • Step 2 - Initialize Airflow and render the DAGs.

    • Step 2.1 - Define the Airflow home: open two terminals (at the project root) and execute the command below in both of them. This lets Airflow know which folder to use for the Airflow instance (the project application).

      export AIRFLOW_HOME=./airflow
    • Step 2.2 - Initialize Airflow: in one of the terminals where you defined AIRFLOW_HOME, execute the command below to initialize the Airflow instance (after running it, you'll see additional files related to the Airflow instance inside the $AIRFLOW_HOME directory).

      airflow initdb
    • Step 2.3 - Start Airflow: this step and the next one start two long-running processes, both of which require $AIRFLOW_HOME to be defined.

      • Step 2.3.1 - Start the Airflow scheduler: in one of the two sessions where AIRFLOW_HOME is defined, run the command below to start the Airflow scheduler:

        airflow scheduler
      • Step 2.3.2 - Start the Airflow webserver: the only thing left is to turn on the webserver so we can run our DAGs, so in the other terminal run the command below (note: you can specify a different port if you want).

        airflow webserver -p 8084
      • Step 2.3.3 - Open the Airflow webserver and verify the installation: at this point we only need to verify that everything is running as expected and that our DAGs (located in $AIRFLOW_HOME/dags/) are rendered on the dashboard. (Note: although the webserver should display them almost immediately, refresh the browser after a minute just to be sure; it shouldn't take longer than that.)

        Open: http://localhost:8084/admin/ 
  • Step 3 - Cloud accounts configuration.

    • Step 3.1 - Airflow connections: this step defines the cloud provider connections and lets the Airflow operators authenticate.

      • Step 3.1.1 - AWS connection: in this example we'll use the default AWS connection, aws_default. To configure it, follow these steps in the webserver:

        Admin -> Connections -> click the edit button for the aws_default connection.

        Validate that the connection type is Amazon Web Services and that the Extra field contains the region of your destination bucket.

        Now set Login to your AWS access key (AWS_ACCESS_KEY) and Password to your AWS secret access key (AWS_SECRET_ACCESS_KEY).

      • Step 3.1.2 - GCP connection: similar to the AWS connection, we'll use the default GCP connection (already created by the Airflow installation):

        Admin -> Connections -> click the edit button for the google_cloud_default connection.

        Validate that the connection type is Google Cloud Platform and set Scopes to https://www.googleapis.com/auth/cloud-platform.

        Now set Project Id to your GCP project and Keyfile JSON to the content of your service-account key (guide: https://cloud.google.com/iam/docs/creating-managing-service-account-keys).

    • Step 3.2 - Airflow variables: since the POC DAG (airflow/dags/transfer_data_gcp_to_aws_dag.py) has its source (GCP bucket) and destination (AWS bucket) parametrized, Airflow variables store their values. Create them through the UI as described below (or programmatically, as shown in the sketch at the end of this guide).

      • Step 3.2.1 - AWS_BUCKET variable: go to Admin -> Variables and create a new variable named AWS_BUCKET. The format of the bucket name matters; make sure it follows a valid format such as s3://aws-destination-bucket/.

      • Step 3.2.2 - GCP_BUCKET variable: go to Admin -> Variables and create a new variable named GCP_BUCKET. For this variable the GCP operator does not require a gs:// prefix on the bucket name; a valid value for GCP_BUCKET is gcp-source-bucket.

    • Step 3.3 - Local AWS config: AWS works with boto3, so we need to create a file for AWS account authentication. To do this, run:

      touch ~/.boto

    The content for .boto should be:

    [Credentials]
    aws_access_key_id = YOUR_ACCESS_KEY
    aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

And that's all: you can test the DAG with a manual run and then schedule it as needed.
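
As an optional alternative to creating the variables through the UI (step 3.2), they can also be set programmatically against the Airflow metadata database. A minimal sketch, assuming AIRFLOW_HOME is exported in the shell where you run it and that the bucket names below are placeholders:

    from airflow.models import Variable

    # Placeholder bucket names; replace them with your own buckets.
    # AWS_BUCKET keeps the s3:// prefix, GCP_BUCKET does not use one.
    Variable.set('AWS_BUCKET', 's3://aws-destination-bucket/')
    Variable.set('GCP_BUCKET', 'gcp-source-bucket')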
