Pick one of the 50 datasets. Use PySpark to perform the ETL process, connect to an AWS RDS instance, and load the transformed data into a PostgreSQL database accessed through pgAdmin.
-How many Vine reviews and non-Vine reviews were there?
-How many Vine reviews were 5 stars? How many non-Vine reviews were 5 stars?
-What percentage of Vine reviews were 5 stars? What percentage of non-Vine reviews were 5 stars?
1.) The vast majority of reviews (99.9%) are from non-Vine reviewers.
2a.) Percentage of 5-star app reviews among paid (Vine) reviews in the 'helpful' dataset: 25%
2b.) Percentage of 5-star app reviews among non-paid (non-Vine) reviews in the 'helpful' dataset: 49%
3.) Additionally, we would recommend testing a larger sample, because the 'helpful' filter reduced the review count considerably.
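
The counts and percentages reported above could be computed along the lines of the sketch below. It is a minimal illustration in plain Python, not the code used for this analysis: the sample rows are made up, the column names `vine` ("Y"/"N") and `star_rating` are assumptions about the dataset schema, and `five_star_summary` is a hypothetical helper. The same grouping could be expressed in PySpark or SQL.

```python
from collections import Counter

# Hypothetical sample rows, for illustration only (not real results).
# Assumed schema: `vine` is "Y" or "N", `star_rating` is an integer 1-5.
reviews = [
    {"vine": "Y", "star_rating": 5},
    {"vine": "Y", "star_rating": 3},
    {"vine": "N", "star_rating": 5},
    {"vine": "N", "star_rating": 5},
    {"vine": "N", "star_rating": 2},
    {"vine": "N", "star_rating": 4},
]


def five_star_summary(rows):
    """Return {vine_flag: (total_reviews, five_star_reviews, percent_five_star)}."""
    # Total reviews per Vine flag.
    totals = Counter(r["vine"] for r in rows)
    # Five-star reviews per Vine flag (missing flags count as 0).
    five = Counter(r["vine"] for r in rows if r["star_rating"] == 5)
    return {
        flag: (n, five[flag], 100.0 * five[flag] / n)
        for flag, n in totals.items()
    }


summary = five_star_summary(reviews)
for flag, (total, five_star, pct) in sorted(summary.items()):
    print(f"vine={flag}: {total} reviews, {five_star} five-star ({pct:.1f}%)")
```

Running the same summary on both the full dataset and the 'helpful' subset would show how much the filter shrinks each group, which is the concern raised in point 3.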