MSBX5420/Team-Torreys-Peak

MSBX 5420 - Spring 2020

Unstructured and Distributed Data Modeling and Analysis

Leeds School of Business, University of Colorado Boulder

Project Description

COVID-19 has had a strong impact on human lives. Our mission is to analyze how it has affected the newspaper industry.

Prerequisites

  • An AWS cluster with a PySpark (Python 3) environment
  • pandas, pyspark, and numpy for data processing
  • tmtoolkit and nltk for text mining and topic modeling
  • matplotlib, wordcloud, and seaborn for data visualization

Installing

pip install --user pandas pyspark tmtoolkit nltk numpy wordcloud matplotlib seaborn
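After installing, it can help to confirm that every package is visible to the Python interpreter you plan to use. The following is a minimal sketch, not part of the project code: the helper name `missing_packages` is hypothetical, and it assumes each package's import name matches the name used in the pip command above.

```python
from importlib.util import find_spec

# Import names for the packages listed in the pip command above
# (assumed to match the pip package names).
REQUIRED = [
    "pandas", "pyspark", "tmtoolkit", "nltk",
    "numpy", "wordcloud", "matplotlib", "seaborn",
]


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

If the script reports missing packages on the cluster, the pip command was probably run with a different Python interpreter than the one the notebooks use.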

Deployment

  • For local environment:

  • For cluster platform:

    • All files can be executed on any cloud service. As an example, we show how to run them on an AWS cluster.
    • Adjusting file paths for reading data can be tedious. If you want to run our code on AWS, we strongly recommend saving news.csv in the same folder as the .ipynb files. That way, all you need to do is change the path to:

    • You can use the Jupyter notebook interface on the AWS cluster to interact with our code.
    • You can also use spark-submit to submit our work to your own cluster and check the results through the provided Hadoop link. Note that spark-submit runs .py scripts, so export any .ipynb notebook to a .py script first:
    spark-submit --master yarn --deploy-mode cluster --num-executors 2 --executor-memory 1G --executor-cores 1 --driver-memory 1G /aws_cluster_hadoop_path/files_you_want_to_execute.py
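If you submit several scripts, or want to vary the resource settings, the spark-submit command can be assembled in Python rather than retyped. The sketch below is only an illustration: `build_spark_submit` is a hypothetical helper, its defaults mirror the flags shown in this README, and the script path is the same placeholder.

```python
import shlex


def build_spark_submit(script_path, num_executors=2, executor_memory="1G",
                       executor_cores=1, driver_memory="1G"):
    """Return the spark-submit command as a list of arguments.

    Defaults mirror the example command in this README; adjust them
    to match the resources available on your cluster.
    """
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        "--driver-memory", driver_memory,
        script_path,
    ]


if __name__ == "__main__":
    # Placeholder path taken from the README; replace it with your own.
    cmd = build_spark_submit("/aws_cluster_hadoop_path/files_you_want_to_execute.py")
    # Print the command for review instead of running it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string means the result can be passed straight to `subprocess.run(cmd)` on a machine where spark-submit is installed, without shell quoting problems.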

Documents

Contact Information

Author:

Instructor: