
Data pipeline to process and analyse Twitter data in a distributed fashion using Apache Spark and Airflow in AWS environment

This repository shows the development of a scalable data pipeline in AWS that parallelises data processing with Apache Spark on Amazon EMR and orchestrates workflows with Apache Airflow. The data analysis part consists of a simple rule-based sentiment analysis and a topic analysis based on word frequencies, applying common NLP techniques.

The data pipeline feeds an existing web application which allows end users to analyse housing prices based on the locations of subway stations. More precisely, users see the average price of properties within a radius of less than 1 km around a particular subway station. The new data pipeline shown in this repository enriches the application with sentiment scores and topic analysis to give users a better sense of the common mood and an indication of what type of milieu lives in a particular area.

The Python-based web scraper that uses BeautifulSoup to fetch geo-specific housing prices from property websites across London is also provided in the repository but is not discussed in detail here.

(Figure: architecture overview of the data pipeline)

Prerequisites

  1. Git to clone the repository
  2. AWS account to run the pipeline in the cloud environment
  3. Twitter developer account with access to the standard API to fetch tweets
  4. Database - in this project, we use a PostgreSQL database. Database-related code snippets (e.g. uploading data to the database) might differ when using other databases
  5. Webhosting and domain with WordPress to run the client application
  6. wpDataTables - WordPress plugin to easily access databases and create production-ready tables and charts

0. Setup of the cloud environment in AWS

AWS provides Amazon Managed Workflows for Apache Airflow (MWAA), which makes it easy to run Apache Airflow on AWS.

  1. Go to the MWAA console and click create a new environment to follow the step-by-step configuration guide

  2. Select an existing S3 bucket or create a new one and define the path where the Airflow DAG (the script which executes all tasks you want to run for the data pipeline) should be loaded from. The bucket name must start with airflow-

  3. Upload requirements.txt, which contains the Python libraries needed to run the Airflow DAG. AWS will install them via pip install. Hint: if your DAG depends on libraries that are not available via pip, you can upload a plugins.zip in which you include your desired libraries as Python wheels

  4. Each environment runs in an Amazon Virtual Private Cloud (VPC) using private subnets in two availability zones. AWS recommends using a private network for webserver access. For simplicity, we select a public network that allows us to log in over the internet. Lastly, we let MWAA create a new security group

  5. For the environment class, we select mw1.small as it corresponds best to our DAG workload

  6. Activate Airflow task logs using the default setting. This provides logging information, which is especially helpful for debugging

  7. Keep the default settings

  8. Create a new role or use an existing one and complete the setup by clicking create new environment

1a. Variables and connections for the MWAA environment

MWAA provides variables to store and retrieve arbitrary content or settings as a simple key-value store within Airflow. They can be created directly from the user interface (Admin > Variables) or bulk uploaded via JSON files.

{
    "london-housing-webapp": {
        "bucket_name": "london-housing-webapp",
        "key1": "input/subway_station_information.csv", 
        "output_key": "api_output/twitter_results.parquet",
        "db_name": "postgres",
        "consumer_key": "{{TWITTER API KEY}}",
        "consumer_secret": "{{TWITTER API SECRET}}",
        "access_token": "{{TWITTER API ACCESS TOKEN}}",
        "access_token_secret": "{{{TWITTER API ACCESS TOKEN}}"
    }
}

A sample variables file containing all variables used in this project is provided in the repository.

Airflow also allows us to define so-called connection objects. In our case, we need a connection to AWS itself (Airflow acts as an external system to AWS) and one to the database in which the final results will be stored.
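
These connection IDs are then referenced by hooks and operators inside the DAG. Below is a minimal sketch of how the two connections might be used, assuming they were created under Admin > Connections with the IDs that appear later in this project (import paths may differ slightly between Airflow versions):

# minimal sketch: using the AWS and PostgreSQL connection objects
from airflow.hooks.S3_hook import S3Hook
from airflow.hooks.postgres_hook import PostgresHook

# S3 access through the AWS connection
s3 = S3Hook(aws_conn_id='aws_default_christopherkindl')
keys = s3.list_keys(bucket_name='london-housing-webapp', prefix='api_output/')

# database access through the PostgreSQL connection
pg_hook = PostgresHook(postgres_conn_id='postgres_id_christopherkindl', schema='postgres')
conn = pg_hook.get_conn()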

1b. General settings in the Airflow DAG

Define basic configuration information, such as schedule_interval or start_date, in the default_args and dag sections of the DAG. This is also where we incorporate our variables and connection objects:

default_args = {
    'start_date': datetime(2021, 3, 8),
    'owner': 'Airflow',
    'filestore_base': '/tmp/airflowtemp/',
    'email_on_failure': True,
    'email_on_retry': False,
    'aws_conn_id': 'aws_default_christopherkindl',
    'bucket_name': Variable.get('london-housing-webapp', deserialize_json=True)['bucket_name'],
    'postgres_conn_id': 'postgres_id_christopherkindl',
    'output_key': Variable.get('london-housing-webapp',deserialize_json=True)['output_key'],
    'db_name': Variable.get('london-housing-webapp', deserialize_json=True)['db_name']
}

dag = DAG('london-housing-webapp',
          description='fetch tweets via API, run sentiment and topic analysis via Spark, save results to PostgreSQL',
          schedule_interval='@weekly',
          catchup=False,
          default_args=default_args,
          max_active_runs=1)

2. Tasks in the Airflow DAG

Basic architecture

A typical Airflow DAG consists of different tasks that fetch, transform or otherwise process data. Heavy data analysis tasks should not run within the MWAA environment itself due to its modest workload capacity. ML tasks are called via Amazon SageMaker, whereas complex data analyses can be run in a distributed fashion on Amazon EMR. In our case, we run the data analysis on an Amazon EMR cluster using Apache Spark (via its Python API, PySpark).

We can either write custom functions (e.g. to request data via the Twitter API) or make use of predefined modules, which usually trigger external activities (e.g. a data analysis in Spark on Amazon EMR).

Example of a custom function which is then assigned to a PythonOperator to act as a task:

# custom function

def create_schema(**kwargs):
    '''
    1. connect to the PostgreSQL database
    2. create schema and tables in which the final data will be stored
    3. execute query
    '''
    pg_hook = PostgresHook(postgres_conn_id=kwargs['postgres_conn_id'], schema=kwargs['db_name'])
    conn = pg_hook.get_conn()
    cursor = conn.cursor()
    log.info('initialised connection')
    sql_queries = """

    CREATE SCHEMA IF NOT EXISTS london_schema;
    CREATE TABLE IF NOT EXISTS london_schema.sentiment(
        "tweets" varchar,
        "date" timestamp,
        "station" varchar(256),
        "sentiment" numeric
    );

    CREATE TABLE IF NOT EXISTS london_schema.topics(
        "date" timestamp,
        "topics" varchar
    );

    CREATE TABLE IF NOT EXISTS london_schema.data_lineage(
        "batch_nr" numeric,
        "job_nr" numeric,
        "timestamp" timestamp,
        "step_airflow" varchar(256),
        "source" varchar(256),
        "destination" varchar(256)
    );
    """

    # execute query
    cursor.execute(sql_queries)
    conn.commit()
    log.info("created schema and table")



# qualify as a task

create_schema = PythonOperator(
    task_id='create_schema',
    provide_context=True,
    python_callable=create_schema,
    op_kwargs=default_args,
    dag=dag,
)

The custom function above creates the schema and tables in the PostgreSQL database to which the final data will be uploaded. Note how op_kwargs=default_args allows the task to access the general configuration information defined in default_args.
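
The get_twitter_data task (which appears in the task order below) follows the same pattern. The following is a heavily simplified sketch, not the repository's actual implementation: it assumes tweepy 3.x as the Twitter client, pandas with pyarrow for writing the parquet file, and a column named station in the subway station CSV; all of these are assumptions.

# simplified sketch of the Twitter fetching task
# (assumptions: tweepy 3.x, pandas + pyarrow, a column named 'station' in the subway station CSV)
import io

import pandas as pd
import tweepy
from airflow.hooks.S3_hook import S3Hook
from airflow.models import Variable


def get_twitter_data(**kwargs):
    config = Variable.get('london-housing-webapp', deserialize_json=True)

    # authenticate against the standard Twitter API
    auth = tweepy.OAuthHandler(config['consumer_key'], config['consumer_secret'])
    auth.set_access_token(config['access_token'], config['access_token_secret'])
    api = tweepy.API(auth, wait_on_rate_limit=True)

    # read the subway station information from S3 (variable key1)
    s3 = S3Hook(aws_conn_id=kwargs['aws_conn_id'])
    csv_content = s3.read_key(key=config['key1'], bucket_name=kwargs['bucket_name'])
    stations = pd.read_csv(io.StringIO(csv_content))

    # fetch tweets per station (the query strategy here is illustrative only)
    rows = []
    for station in stations['station']:
        for tweet in api.search(q=station, lang='en', count=100):
            rows.append({'tweets': tweet.text, 'date': tweet.created_at, 'station': station})

    # write the results as parquet to the S3 key defined in the variables (output_key)
    df = pd.DataFrame(rows)
    df.to_parquet('/tmp/twitter_results.parquet', index=False)
    s3.load_file(filename='/tmp/twitter_results.parquet', key=kwargs['output_key'],
                 bucket_name=kwargs['bucket_name'], replace=True)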

Indicating the order of the tasks

Tasks can be executed sequentially or simultaneously. The order can be indicated with the following example syntax:

start_data_pipeline >> create_schema >> get_twitter_data >> create_emr_cluster >> step_adder
step_adder >> step_checker >> terminate_emr_cluster >> summarised_data_lineage_spark >> save_result_to_postgres_db
save_result_to_postgres_db >> end_data_pipeline
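
The start_data_pipeline and end_data_pipeline tasks in the sequence above are simple marker tasks; a sketch assuming Airflow's DummyOperator is used for them:

# marker tasks at the beginning and end of the DAG (assumed to be DummyOperators)
from airflow.operators.dummy_operator import DummyOperator

start_data_pipeline = DummyOperator(task_id='start_data_pipeline', dag=dag)
end_data_pipeline = DummyOperator(task_id='end_data_pipeline', dag=dag)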

All tasks of the DAG

(Figure: all tasks of the Airflow DAG)

See the entire Airflow DAG code of this project here

3. Run Spark on Amazon EMR

Permissions

Attach the following IAM policy (e.g. to the MWAA execution role) to allow MWAA to interface with Amazon EMR.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeStep",
                "elasticmapreduce:AddJobFlowSteps",
                "elasticmapreduce:RunJobFlow"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": [
                "arn:aws:iam::{{AWS ID}}:role/EMR_DefaultRole",
                "arn:aws:iam::{{AWS ID}}:role/EMR_EC2_DefaultRole"
            ]
        }
    ]
}

Motivation to use Spark for data analysis

We want to set up an infrastructure that allows big data analysis in a distributed fashion. Apache Spark, our big data framework of choice, has two main advantages: first, Spark’s in-memory data engine can perform tasks very efficiently thanks to its parallelisation logic. Second, Spark’s developer-friendly API removes much of the grunt work of distributed computing and can be accessed from various languages. In our case, we use PySpark, a high-level Python API for Spark; it is suitable for interacting with an existing cluster but does not contain tools to set up a new, standalone cluster.
The parallelisation logic of a distributed architecture is the main driver to speed up processing and, thus, enable scalability. Using Spark’s DataFrames or Resilient Distributed Datasets (RDDs) allows data computation to be distributed across the cluster.

Under the hood

We use Amazon’s big data platform EMR to run our Spark cluster. A Spark cluster consists of a master node that serves as the central coordinator and worker nodes on which the tasks/jobs are executed in parallel. It requires a distributed storage layer, which is HDFS (Hadoop Distributed File System) in our case. S3 object storage serves as our main data store and HDFS as intermediate, temporary storage from which the scripts read the Twitter data and to which they write the results. Temporary means that data written to HDFS disappears once the cluster is terminated. The reason to use HDFS is that it is substantially faster than writing the results directly to the S3 bucket.

Interaction between Airflow and Amazon EMR

Every step executed on the cluster is triggered by our Airflow DAG: first, we create the Spark cluster by providing specific configuration details and launch Hadoop for distributed data storage at the same time. For the cluster configuration, we use one master node and two worker nodes, all running on m5.xlarge instances (16 GB RAM, 4 CPU cores), given the relatively small dataset size. Next, a bootstrap action installs the non-standard Python libraries (vaderSentiment, NLTK for NLP pre-processing steps) on which the sentiment and topic analysis scripts depend. The bootstrap file is loaded from an S3 bucket and submitted in the form of a bash script.

Airflow offers pre-defined modules to quickly interact with Amazon EMR. The example below shows how an Amazon EMR cluster with Spark (PySpark) and Hadoop application is created using EmrCreateJobFlowOperator().

# define configuration setting for EMR cluster

JOB_FLOW_OVERRIDES = {
    "Name": "sentiment_analysis",
    "ReleaseLabel": "emr-5.33.0",
    "LogUri": "s3n://{{ BUCKET_NAME}}/logs/", # define path where logging files should be saved
    "BootstrapActions": [
        {'Name': 'install python libraries',
                'ScriptBootstrapAction': {
                # path where the .sh file to install non-standard python libaries is loaded from
                'Path': 's3://{{ BUCKET_NAME}}/scripts/python-libraries.sh'} 
                            }
                        ],
    "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}], # We want our EMR cluster to have HDFS and Spark
    "Configurations": [
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {
                    "PYSPARK_PYTHON": "/usr/bin/python3"
                    },
                }
            ],
        }
    ],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Master node",
                "Market": "SPOT",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1, # one master node
            },
            {
                "Name": "Core - 2", 
                "Market": "SPOT", # Spot instances are a "use as available" instances
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 2, # two worker nodes
            },
        ],
        "Ec2SubnetId": "{{SUBNET_ID}}",
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False, # this lets us programmatically terminate the cluster
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}


# qualify it as an Airflow task
create_emr_cluster = EmrCreateJobFlowOperator(
    task_id="create_emr_cluster",
    job_flow_overrides=JOB_FLOW_OVERRIDES,
    aws_conn_id="aws_default_christopherkindl",
    emr_conn_id="emr_default_christopherkindl",
    dag=dag,
)

Submitting Spark jobs

We can now submit our Spark job, which consists of the Python file for the sentiment analysis as well as data movement steps. Here, s3-dist-cp allows us to transfer data within S3 or between HDFS and S3. The Python file is loaded from the S3 bucket, while the scraped Twitter data is moved from the S3 bucket to HDFS for the sentiment analysis and back again once the analysis is done. The same procedure is applied to run the topic analysis. We add a step sensor that periodically checks whether the last step has completed, been skipped or been terminated. After the step sensor identifies the completion of the last step (moving the final data from HDFS to the S3 bucket), a final task terminates the cluster. This last task matters because AWS operates on a pay-per-use model (EMR is usually billed per second) and leaving unneeded resources running is wasteful anyway.

Hint: the PySpark scripts that run the analyses are not discussed in detail here; in-code comments should be sufficient to understand the concept of each analysis. We start a Spark session and point it to the Twitter data in HDFS as follows.

# code snippet of sentiment_analysis.py

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", type=str, help="HDFS input", default="/twitter_results")
    parser.add_argument("--output", type=str, help="HDFS output", default="/output")
    args = parser.parse_args()
    spark = SparkSession.builder.appName("SentimentAnalysis").getOrCreate()
    
    # refers to function defined above which can be found in the complete script
    sentiment_analysis(input_loc=args.input, output_loc=args.output)

To read more about Spark sessions and Spark contexts, check this post.
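
For orientation only, here is a heavily simplified sketch of what such a sentiment_analysis function could look like; it assumes the tweet text sits in a column named tweets (matching the target table) and applies VADER's SentimentIntensityAnalyzer through a PySpark UDF. The actual script in the repository may differ.

# simplified sketch of the sentiment_analysis function
# (assumptions: column 'tweets', VADER applied via a UDF; the real script may differ)
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


def sentiment_analysis(input_loc, output_loc):
    # reuse the Spark session created in __main__
    spark = SparkSession.builder.getOrCreate()

    # read the raw Twitter data that was copied from S3 to HDFS
    df = spark.read.parquet(input_loc)

    # rule-based sentiment scoring with VADER (compound score between -1 and 1)
    def vader_compound(text):
        return float(SentimentIntensityAnalyzer().polarity_scores(text or '')['compound'])

    sentiment_udf = F.udf(vader_compound, DoubleType())
    df = df.withColumn('sentiment', sentiment_udf(F.col('tweets')))

    # write the scored tweets back to HDFS; a later s3-dist-cp step copies them to S3
    df.write.mode('overwrite').parquet(output_loc)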

The figure below summarises the tasks that set up the EMR environment and execute the Spark jobs, followed by a code snippet that shows how the Spark jobs/steps are defined and added in the Airflow DAG.

(Figure: EMR-related tasks and Spark job execution flow)

Spark-specific jobs:

# define spark jobs that are executed on Amazon EMR

SPARK_STEPS = [
    {
        "Name": "move raw data from S3 to HDFS",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src=s3://{{BUCKET_NAME}}/api_output/twitter_results.parquet", # refer to the path where the tweets are stored
                "--dest=/twitter_results", # define destination path in HDFS
            ],
        },
    },
    {
        "Name": "run sentiment analysis",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode",
                "client",
                # refer to the path where the .py script for the data analysis is stored
                "s3://{{BUCKET_NAME}}/scripts/sentiment_analysis.py", 
            ],
        },
    },
    {
        "Name": "move final result of sentiment analysis from HDFS to S3",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src=/output", # path to store results temporarily in HDFS
                "--dest=s3://{{BUCKET_NAME}}/sentiment/", # final destination in S3 Bucket
            ],
        },
    },
    {
        "Name": "run topics analysis",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode",
                "client",
                # path where .py file for second analysis is stored
                "s3://{{BUCKET_NAME}}/scripts/topic_analysis.py", 
            ],
        },
    },
    {
        "Name": "move final result of topic analysis from HDFS to S3",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src=/output",
                "--dest=s3://{{BUCKET_NAME}}/topics/",
            ],
        },
    },
]

# qualify it as an Airflow task
step_adder = EmrAddStepsOperator(
    task_id="add_steps",
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
    aws_conn_id="aws_default_christopherkindl",
    steps=SPARK_STEPS,
    dag=dag,
)
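
The step sensor and termination tasks mentioned earlier (step_checker and terminate_emr_cluster in the task order) can be sketched as follows; the exact import paths depend on the Airflow version, and the sensor watches the last step returned by step_adder:

# watch the last submitted step, then terminate the cluster once it has finished
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor
from airflow.contrib.operators.emr_terminate_job_flow_operator import EmrTerminateJobFlowOperator

last_step = len(SPARK_STEPS) - 1  # index of the final s3-dist-cp step

step_checker = EmrStepSensor(
    task_id='watch_step',
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
    step_id="{{ task_instance.xcom_pull(task_ids='add_steps', key='return_value')["
    + str(last_step)
    + "] }}",
    aws_conn_id='aws_default_christopherkindl',
    dag=dag,
)

terminate_emr_cluster = EmrTerminateJobFlowOperator(
    task_id='terminate_emr_cluster',
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
    aws_conn_id='aws_default_christopherkindl',
    dag=dag,
)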

Key Airflow modules to interface with Amazon EMR:

  • EmrCreateJobFlowOperator(): to create EMR cluster with desired applications on it
  • EmrAddStepsOperator(): to define jobs
  • EmrStepSensor(): to watch steps
  • EmrTerminateJobFlowOperator(): to terminate EMR cluster

4. Launch Airflow DAG

Upload the final Airflow DAG to the corresponding path as explained in the MWAA environment setup guide. Go to the Airflow user interface and start the DAG by switching the button to On (Hint: use a date in the past to trigger the DAG immediately).

Useful housekeeping things to know:

  • Log files can be accessed by clicking on the coloured status squares which appear in the Tree view mode
  • When Spark steps are running, you can watch them directly in Amazon EMR (AWS management console > EMR > Clusters) and see how the steps are executed
  • Log files of Spark jobs are not shown in the Airflow-generated log files; they have to be enabled when configuring the EMR cluster by providing an S3 path (see the LogUri example in the section Interaction between Airflow and Amazon EMR)

A more detailed description can be found here.

5. Connect database to the web application

For simplification, this documentation does not cover the detailed development process of the website itself. Using the WordPress plugin wpDataTables allows us to easily access any common database (MySQL, PostgreSQL, etc). NoSQL databases are not supported.

Once installed, you can connect to a database (WordPress Website Admin Panel > wpDataTables > Settings > separate DB connection) and run a query (WordPress Website Admin Panel > wpDataTables > Create a Table/Chart) that is automatically transformed into a table or chart:

(Figure: creating a table/chart from a query in wpDataTables)

Using views to avoid complex queries on the client side

To improve website performance, we avoid writing a complex query on the client side and instead create a view within the schema that already combines both data sources (housing prices, sentiment data). The topic analysis data is accessed directly through its own query since, due to its generalised form, it does not require any transformation steps on the client side.

Hint: Views can be easily created using a database administration tool, such as pgAdmin
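
A sketch of such a view is shown below, wrapped in the same PostgresHook pattern used earlier to keep the example in Python (the same CREATE VIEW statement could just as well be pasted into pgAdmin). The housing-price table and its columns are purely illustrative, since they come from the web scraper and are not defined in this README.

# illustrative only: pre-join housing prices and sentiment scores in a view
# (london_schema.housing_prices and its columns are assumed names)
from airflow.hooks.postgres_hook import PostgresHook

pg_hook = PostgresHook(postgres_conn_id='postgres_id_christopherkindl', schema='postgres')
conn = pg_hook.get_conn()
cursor = conn.cursor()

cursor.execute("""
    CREATE OR REPLACE VIEW london_schema.station_overview AS
    SELECT h.station,
           h.average_price,
           AVG(s.sentiment) AS avg_sentiment
    FROM london_schema.housing_prices h
    JOIN london_schema.sentiment s ON s.station = h.station
    GROUP BY h.station, h.average_price;
""")
conn.commit()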

The figure below summarises the interaction between the client-side and the database.

(Figure: interaction between the client side and the database)
