erjan/data_engineering_japan_visas_pyspark

data_engineering_japan_visas_pyspark

A data engineering project that visualizes the number of visas issued by Japan, by country and over time.

This is a small data engineering project for learning how to install an Apache Spark cluster on a server and how to work with that cluster from a local machine via PySpark.

Original tutorial: https://www.youtube.com/watch?v=f-IcM8mFmDc&t=160s

Visualized map:

Screenshot_8

2nd map: Screenshot_13

  1. Create a virtual environment in the local project folder: python -m venv japan-visa-de

  2. Download the Japan visa dataset (CSV file) from Kaggle: https://www.kaggle.com/datasets/yutodennou/visa-issuance-by-nationality-and-region-in-japan

  3. Create a VM on EC2 (t2.xlarge), download the SSH key, and move the key into the project folder using the "scp" command.

  4. Restrict permissions on your private key: chmod 400 private_key.pem

  5. Install Docker Compose via image.

  6. Run Docker Compose to bring up the Spark cluster.

  7. Add an inbound rule to the EC2 security group in AWS so that the Spark master web UI is reachable on port 9090.

  8. Write the PySpark code, upload it to the Spark cluster machine, and execute it with spark-submit.

  9. Download the results (visualized images/HTML) back to the local machine.
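
As a rough illustration of step 8, the following is a minimal sketch of what a PySpark job for this dataset might look like. It is not the code from this repository or the tutorial: the app name, the `normalize_column` helper, and the column names `country`, `year`, and `number_of_issued` are assumptions and should be checked against the actual CSV header.

```python
import re
import sys


def normalize_column(name: str) -> str:
    """Turn a raw CSV header into a snake_case name that is easy to use in Spark."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


def run(input_csv: str, output_dir: str) -> None:
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("japan-visa-sketch").getOrCreate()

    # Read the CSV with a header row and let Spark infer column types.
    df = spark.read.csv(input_csv, header=True, inferSchema=True)
    df = df.toDF(*[normalize_column(c) for c in df.columns])

    # Hypothetical column names; adjust them to match the real dataset.
    totals = (
        df.groupBy("country", "year")
        .agg(F.sum("number_of_issued").alias("visas_issued"))
        .orderBy("country", "year")
    )

    # Write the aggregated result so it can be downloaded and visualized later.
    totals.write.mode("overwrite").csv(output_dir, header=True)
    spark.stop()


if __name__ == "__main__" and len(sys.argv) == 3:
    run(sys.argv[1], sys.argv[2])
```

On the cluster machine, a script like this (saved under a hypothetical name such as `visa_job.py`) would be passed to spark-submit, with the input CSV path and output directory as its two arguments.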
