In the era of information explosion, analysing big data on a local machine can take hours to produce useful trends or strategies. This project demonstrates how to use AWS services and Google Colab to process big data.
- Create an AWS account
- Connect Google Colab with an `.ipynb` file
- In AWS RDS, create a database
- Follow Reference 1 to set up a connection between local PostgreSQL and RDS
- Extract datasets from https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt
- Use Spark to run the ETL process to clean the data
- Use Spark's `.write.jdbc` method to load the data into PostgreSQL
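The final load step can be sketched as follows. This is a minimal sketch, not the project's exact notebook code: the RDS endpoint, database name, table name, and credentials are placeholders, and the PostgreSQL JDBC driver coordinates are an assumption.

```python
def jdbc_url(host, port, database):
    """Build the JDBC connection string that Spark's write.jdbc expects."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def load_reviews(tsv_path, host, database, table, user, password):
    """Read a reviews TSV, apply minimal cleaning, and load it into PostgreSQL."""
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("amazon-reviews-etl")
             # Assumed driver version; pin whatever matches your PostgreSQL server.
             .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
             .getOrCreate())

    # The Amazon review files are tab-separated with a header row.
    df = spark.read.csv(tsv_path, sep="\t", header=True, inferSchema=True)

    # Minimal cleaning: drop rows without a review id, de-duplicate on it.
    clean = df.dropna(subset=["review_id"]).dropDuplicates(["review_id"])

    clean.write.jdbc(
        url=jdbc_url(host, 5432, database),
        table=table,
        mode="append",
        properties={"user": user,
                    "password": password,
                    "driver": "org.postgresql.Driver"},
    )
```

A call would look like `load_reviews("amazon_reviews_us_Kitchen_v1_00.tsv.gz", "<rds-endpoint>", "reviews_db", "review_info", "postgres", "<password>")`, with the RDS endpoint and password taken from your own AWS console.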
```
Project
├── Image
│   ├── kitchen_review_info.png
│   ├── kitchen_customers.png
│   ├── kitchen_products.png
│   ├── kitchen_vine_info.png
│   ├── tools_customers.png
│   ├── tools_products.png
│   ├── tools_review_info.png
│   └── tools_vine_info.png
├── README.md
├── requirements.txt
├── reviews_us_Kitchen.ipynb
└── reviews_us_Tools.ipynb
```
- A Colab account (Colab Notebooks)
- An AWS account (S3 and RDS services)
Remember to closely monitor any AWS resources you choose to use. It is crucial to clean up and stop or shut down any AWS resources when you are done, to avoid accruing additional costs.
- S3 bucket permission setting:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "getobject",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "<bucket ARN>/*"
        }
    ]
}
```