- Scrape Historic Tweets
- Create labelled dataset using Amazon Comprehend
- Train a BERT model using the labelled Dataset
- Create a microservice using the trained model
- Scrape live data on a user specified topic
- Ingest data into Kafka
- Set up a Kafka producer to produce the event stream
- Set up Kafka consumers to process the events, extract essential information, and perform sentiment analysis on each tweet
- Stream this live data in a topic
- Read the live stream into Druid
- Flatten the data and store it as rows and columns in a database
- Visualize and Analyze data using Turnilo
- Dockerize various components
- Use Kubernetes (K8s) to manage containers
- Deploy to EC2
If you already have an account, skip this step.
Go to this link and follow the instructions. You will need a valid debit or credit card. You will not be charged; the card is only used to verify your identity.
Install the AWS CLI Version 1 for your operating system. Please follow the appropriate link below based on your operating system.
** Please make sure you add the AWS CLI version 1 executable to your command-line PATH.
Verify that the AWS CLI is installed correctly by running:
```
aws --version
```
- You should see something similar to:
```
aws-cli/1.17.0 Python/3.7.4 Darwin/18.7.0 botocore/1.14.0
```
You need to retrieve AWS credentials that allow your AWS CLI to access AWS resources.
- Sign into the AWS console. This simply requires that you sign in with the email and password you used to create your account. If you already have an AWS account, be sure to log in as the root user.
- Choose your account name in the navigation bar at the top right, and then choose My Security Credentials.
- Expand the Access keys (access key ID and secret access key) section.
- Press Create New Access Key.
- Press Download Key File to download a CSV file that contains your new AccessKeyId and SecretKey. Keep this file somewhere you can find it easily.
Now, you can configure your AWS CLI with the credentials you just created and downloaded.
- In your Terminal, run:
```
aws configure
```
  i. Enter your AWS Access Key ID from the file you downloaded.
  ii. Enter the AWS Secret Access Key from the file.
  iii. For Default region name, enter `us-east-1`.
  iv. For Default output format, enter `json`.
- Run `aws s3 ls` in your Terminal. If your AWS CLI is configured correctly, you should see nothing (because you do not have any existing AWS S3 buckets), or, if you have created S3 buckets before, they will be listed in your Terminal window.
** If you get an error, then please try to configure your AWS CLI again.
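The same credentials can be exercised from Python; a minimal sketch, assuming the boto3 library is installed (the helper names here are illustrative, not part of this project's code):

```python
def bucket_names(list_buckets_response):
    """Pull the bucket names out of an S3 ListBuckets response dict."""
    return [bucket["Name"] for bucket in list_buckets_response.get("Buckets", [])]

def check_s3_access():
    """Call AWS using the credentials configured by `aws configure`."""
    import boto3  # third-party; `pip install boto3`
    s3 = boto3.client("s3")
    return bucket_names(s3.list_buckets())  # [] if you have no buckets yet
```

Calling `check_s3_access()` should return the same bucket list that `aws s3 ls` prints.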
-
Create a free Twitter user account. This will allow you to access the Twitter developer portal.
-
Navigate to Twitter Dev Site, sign in, and create a new application. After that, fill out all the app details. Once you do this, you should have your access keys.
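Once you have your keys, authenticating from Python could look roughly like this: a sketch assuming the tweepy library, with placeholder credentials and helper names of our own choosing:

```python
def tweet_record(status_json):
    """Keep only the fields the pipeline needs from a raw tweet dict."""
    return {
        "id": status_json["id_str"],
        "text": status_json["text"],
        "user": status_json["user"]["screen_name"],
        "created_at": status_json["created_at"],
    }

def make_api(consumer_key, consumer_secret, access_token, access_token_secret):
    """Authenticate against Twitter with the keys from your app's dashboard."""
    import tweepy  # third-party; `pip install tweepy`
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    return tweepy.API(auth)
```

`make_api(...).verify_credentials()` is a quick way to confirm the four keys work.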
Follow the instructions for your operating system:
Install Docker Desktop. Use one of the links below to download the proper Docker application depending on your operating system. Create a DockerHub account if asked.
i. Execute the files "first.bat" and "second.bat", in order, as administrator.
ii. Restart your computer.
iii. Execute the following commands in a terminal, as administrator.
```
REG ADD "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion" /f /v EditionID /t REG_SZ /d "Professional"
REG ADD "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion" /f /v ProductName /t REG_SZ /d "Windows 10 Pro"
```
iv. Follow this link to install Docker.
v. Restart your computer; do not just log out.
vi. Execute the following commands in a terminal, as administrator.
```
REG ADD "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion" /v EditionID /t REG_SZ /d "Core"
REG ADD "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion" /v ProductName /t REG_SZ /d "Windows 10 Home"
```
Open a Terminal window and run `docker run hello-world` to make sure Docker is installed properly. You should see the following message:
```
Hello from Docker!
This message shows that your installation appears to be working correctly.
```
Finally, in the Terminal window, execute `docker pull tensorflow/tensorflow:2.1.0-py3-jupyter`.
Follow the instructions for your operating system.
Follow the instructions for your operating system.
If you already have a preferred text editor, skip this step.
Follow these instructions to install ZooKeeper and Kafka on your system.
Once done, you can use the following commands to run the Kafka server.
Start ZooKeeper:
```
bin/zookeeper-server-start.sh config/zookeeper.properties
```
Start Kafka:
```
bin/kafka-server-start.sh config/server.properties
```
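With the broker up, publishing an analysed tweet as an event could be sketched as follows, assuming the kafka-python client and an illustrative topic name `tweets` (neither is mandated by the project):

```python
import json

def tweet_event(text, sentiment, score):
    """Serialize one analysed tweet as a JSON-encoded Kafka event."""
    return json.dumps({"text": text, "sentiment": sentiment,
                       "score": score}).encode("utf-8")

def publish(event_bytes, topic="tweets", servers="localhost:9092"):
    """Send one event to the broker started above."""
    from kafka import KafkaProducer  # third-party; `pip install kafka-python`
    producer = KafkaProducer(bootstrap_servers=servers)
    producer.send(topic, event_bytes)
    producer.flush()
```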
Follow these instructions to install Druid on your system.
- Java 8 (8u92+) or later
- Linux, Mac OS X, or other Unix-like OS (Windows is not supported)
- Node.js 10.x or 8.x
- npm 6.5.0
Once you have the pre-requisite packages:
Install the Turnilo distribution using npm:
```
npm install -g turnilo
```
Connect to an existing Druid broker using the --druid command-line option. Turnilo will automatically introspect your Druid broker and figure out the available datasets.
```
turnilo --druid http[s]://druid-broker-hostname[:port]
```
- Docker Client
Use the following commands to install Superset (incubating):
```
git clone https://github.com/apache/incubator-superset/
cd incubator-superset
docker-compose up
```
Then open http://localhost:8088 to access the Superset portal.
- Install the requirements:
```
pip install -U -r requirements.txt
```
This command will install all the required packages and update any older ones.
- Now that we have our environment set up, we will create an S3 bucket.
Follow this link and create an S3 bucket.
- Scraping Tweets: To run the scraping pipeline, follow the detailed instructions in the Scraping Pipeline folder.
This pipeline scrapes historic tweets using the tweepy library, labels the dataset, and saves it to the S3 bucket.
Run the scraping pipeline using the following command:
```
python annotation_pipeline.py --environment=conda run
```
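The labelling step inside this pipeline relies on Amazon Comprehend; in isolation it could look roughly like this (a sketch assuming boto3, with helper names of our own, not the pipeline's actual code):

```python
def label(detect_sentiment_response):
    """Map a Comprehend DetectSentiment response to (label, confidence)."""
    sentiment = detect_sentiment_response["Sentiment"]  # e.g. "POSITIVE"
    confidence = detect_sentiment_response["SentimentScore"][sentiment.capitalize()]
    return sentiment, confidence

def label_tweet(text, region="us-east-1"):
    """Ask Amazon Comprehend to label one tweet."""
    import boto3  # third-party; `pip install boto3`
    comprehend = boto3.client("comprehend", region_name=region)
    return label(comprehend.detect_sentiment(Text=text, LanguageCode="en"))
```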
- Training Pipeline: To run the training pipeline, follow the detailed instructions in the Training Pipeline folder.
This pipeline reads the labelled dataset from S3 and trains an ML sentiment analysis model (BERT), which we then use to serve a Flask API. Run the training pipeline using the following command:
```
python training.py run
```
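Before fine-tuning, the labelled rows need to be split into training and evaluation sets; one self-contained way to sketch that step (this is illustrative, not what `training.py` necessarily does):

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle labelled rows and split them into train/test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]
```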
- Run the Flask App: You can use a Docker Hub image to run this app or run it locally; you will find detailed instructions on how to run the API here.
This is a sentiment analysis API, which takes a text input (a tweet, in our case) and returns a sentiment and its score. Run the API using the following command:
```
python app.py
```
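The shape of such an API can be sketched as below, assuming Flask and a `/predict` route (the route name and helpers are ours; `app.py` holds the real implementation wired to the trained model):

```python
def format_prediction(sentiment, score):
    """Shape the JSON body the API returns for one piece of text."""
    return {"sentiment": sentiment, "score": round(score, 4)}

def create_app(predict_fn):
    """Build a Flask app around any (text) -> (sentiment, score) function."""
    from flask import Flask, request, jsonify  # third-party; `pip install flask`
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        sentiment, score = predict_fn(request.get_json()["text"])
        return jsonify(format_prediction(sentiment, score))

    return app
```

`create_app(model.predict).run(port=5000)` would then serve the trained model.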
- Analysis Pipeline: This is a Kafka pipeline which ingests real-time tweets, performs sentiment analysis on them, and processes each tweet as an event. We then store these events in Druid, flatten the data, and use Turnilo for visualization.
Detailed instructions on how to run this pipeline can be found here.
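The consumer side of this pipeline could be sketched as follows, again assuming kafka-python and an illustrative `tweets` topic (field names match the producer sketch, not necessarily the project's schema):

```python
import json

def extract(event_bytes):
    """Decode one Kafka event and keep the fields Druid will ingest."""
    event = json.loads(event_bytes.decode("utf-8"))
    return {"text": event["text"], "sentiment": event["sentiment"],
            "score": event["score"]}

def consume(topic="tweets", servers="localhost:9092"):
    """Read events off the topic and print the flattened records."""
    from kafka import KafkaConsumer  # third-party; `pip install kafka-python`
    for message in KafkaConsumer(topic, bootstrap_servers=servers):
        print(extract(message.value))
```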
Now that we have our Kafka stream running, we will start Druid and configure it to ingest the Kafka stream.
To start Druid, use the following command:
```
./bin/start-micro-quickstart
```
Configure Druid to consume the Kafka stream using the following steps.
Once configured, Druid will ingest real-time data from Kafka and store it in a database.
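The configuration can also be submitted programmatically; a minimal sketch of a Kafka ingestion supervisor spec, assuming the requests library and our example field names (the micro-quickstart router listens on port 8888):

```python
def kafka_supervisor_spec(topic, bootstrap_servers, datasource):
    """A minimal Druid Kafka ingestion spec (field names per the Druid docs)."""
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "timestamp", "format": "auto"},
                "dimensionsSpec": {"dimensions": ["text", "sentiment", "score"]},
            },
            "ioConfig": {
                "topic": topic,
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
            },
        },
    }

def submit(spec, router="http://localhost:8888"):
    """POST the spec to the Druid supervisor endpoint via the router."""
    import requests  # third-party; `pip install requests`
    return requests.post(router + "/druid/indexer/v1/supervisor", json=spec)
```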
Now that we have our data in the Druid database, we use Turnilo for data visualization and analysis.
To start Turnilo, use the following command:
```
turnilo --druid DRUIDPORT
```
DRUIDPORT is the address where Druid is running, which is http://localhost:8888 by default.
Load the Superset Dashboard
Once you open Superset, load the Druid dataset into it using the following link.
Then select Import, and import the `analysis.json` file, which will start up the dashboard.
- Create a react web app as the front end of the system
- Currently we have our Kafka cluster and micro-service running on EC2; we would like to host our database in the cloud too, so that it is remotely accessible