#86 Decoupling the scraper from the backend and editing the scraper to make it more versatile with different quarters #92

Open: wants to merge 37 commits into base: master

Changes shown below are from 3 of the 37 commits.

Commits (37)
84b7ca3
Making new scraper folder and editing docker file to accommodate
CTrando Mar 17, 2019
59321d7
#86: decouple scraper from backend
snowme34 Apr 1, 2019
0d19ed8
#86: Allowing the scraper to work with multiple quarters - backend is…
CTrando Apr 3, 2019
279b401
Fixing comment and pruning quarter insert script
CTrando Apr 4, 2019
c9afcc0
Making webreg upload script create database and quarters table for fu…
CTrando Apr 5, 2019
4e9ebc5
Adding necessary config file
CTrando Apr 5, 2019
db0f27e
add quarter parameter to application
snowme34 Apr 8, 2019
922b633
bug fix, add quarter variable, note frontend still not sending the ne…
snowme34 Apr 8, 2019
4cc3bb1
#86 trying to add default quarter support for backend
snowme34 Apr 9, 2019
490ca06
#86 fix import path
snowme34 Apr 9, 2019
e3f76a9
Fixing chrome version
CTrando Apr 15, 2019
47faaba
Merge branch 'scratch/issue86' of github.com:ucsdscheduleplanner/UCSD…
CTrando Apr 15, 2019
682a549
#86 trying to prune Dockerfile
snowme34 Apr 19, 2019
4452996
#86 working on pruning requirements.txt
snowme34 Apr 19, 2019
3c6a080
#86 add note for possible caching bug
snowme34 Apr 19, 2019
9ad923d
#86 disable WI19, no one cares
snowme34 Apr 19, 2019
498ce1e
#86 add mysql non-root user, based on config.ini
snowme34 Apr 19, 2019
29e49a2
Adding golang routes and adding back department functionality and cou…
CTrando Apr 26, 2019
6862d9a
Merge branch 'scratch/issue86' of github.com:ucsdscheduleplanner/UCSD…
CTrando Apr 26, 2019
2872b6a
Adding other github repos, they will not be submodules so will not ac…
CTrando Apr 26, 2019
03b4504
Finished moving backend to Golang
CTrando May 3, 2019
72ca1a6
Adding more tests and working out compatibility issues with frontend
CTrando May 3, 2019
db27989
Creating module for Go backend and restructure the code
snowme34 Jun 2, 2019
b771ba5
Add 2 functions in RoutesCommon to remove repeats; Add systematical w…
snowme34 Jun 3, 2019
1531f72
Add development config
snowme34 Jun 3, 2019
0eb55b0
Update RoutesCommon functions and error processing
snowme34 Jun 5, 2019
a7bde26
Cleanup Routes Code and update comments
snowme34 Jun 5, 2019
d8a97fa
Make a straightforward constructor for db struct and update the exist…
snowme34 Jun 6, 2019
fafe144
Add multiple constructors for db since lower-case struct fields are n…
snowme34 Jun 6, 2019
513216b
Test using mock db
snowme34 Jun 6, 2019
98c989a
refactor, fix error handling
snowme34 Aug 3, 2019
49ddc45
refactor more
snowme34 Aug 4, 2019
f6af1f6
refactor, restructure, add a ctx package with env
snowme34 Aug 5, 2019
9eb7de2
change embarrassing package name
snowme34 Aug 6, 2019
06feb39
refactor, add handler factory
snowme34 Aug 11, 2019
3dd064a
remove python code
snowme34 Aug 11, 2019
ddf6944
add Docker support for go backend
snowme34 Aug 19, 2019
Empty file removed backend/datautil/__init__.py
Empty file.
35 changes: 0 additions & 35 deletions backend/datautil/sql_to_postgres.py

This file was deleted.

5 changes: 0 additions & 5 deletions backend/docker-run.sh
@@ -1,10 +1,5 @@
#!/bin/bash

# download data or not
if [[ $SDSCHEDULE_SCRAPE -eq 1 ]]; then
python3 -u datautil/webreg_scrape_upload.py
fi

# uwsgi or flask
if [[ $ENV == "PROD" ]]; then
useradd app
Empty file removed backend/scraper/__init__.py
Empty file.
7 changes: 7 additions & 0 deletions docker-compose-production.yml
@@ -10,6 +10,13 @@ services:
volumes:
- "sdschedule-data:/var/lib/mysql"
restart: always
sdschedule-scraper:
container_name: sdschedule-scraper
build: scraper
depends_on:
- "sdschedule-database"
environment:
- "PYTHONUNBUFFERED=0"
sdschedule-backend:
container_name: sdschedule-backend
build: backend
8 changes: 7 additions & 1 deletion docker-compose.yml
@@ -9,6 +9,13 @@ services:
- "MYSQL_DATABASE=classes"
volumes:
- "sdschedule-data:/var/lib/mysql"
sdschedule-scraper:
container_name: sdschedule-scraper
build: scraper
depends_on:
- "sdschedule-database"
environment:
- "PYTHONUNBUFFERED=0"
sdschedule-backend:
container_name: sdschedule-backend
build: backend
@@ -19,7 +26,6 @@ services:
environment:
- "ENV=DEV"
- "PYTHONUNBUFFERED=0"
- "SDSCHEDULE_SCRAPE=${SDSCHEDULE_SCRAPE}"
sdschedule-frontend:
container_name: sdschedule-frontend
build: frontend
22 changes: 22 additions & 0 deletions scraper/Dockerfile
@@ -0,0 +1,22 @@
FROM joyzoursky/python-chromedriver:3.7

ENV DEBIAN_FRONTEND noninteractive

RUN apt-get update
RUN apt-get -y install locales
RUN sed -i -e 's/# en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen && locale-gen
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8

RUN apt-get -y install python3-lxml python3-pip
RUN apt-get -y install default-libmysqlclient-dev

WORKDIR /app
COPY ./requirements.txt /app/requirements.txt
RUN pip3 install -r requirements.txt
COPY . /app

ENV PYTHONPATH /app

CMD ["bash", "./docker-run.sh"]
4 changes: 4 additions & 0 deletions scraper/config/config.example.ini
@@ -0,0 +1,4 @@
[DB]
USERNAME=root
PASSWORD=password
ENDPOINT=sdschedule-database
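The example config above supplies the scraper's MySQL credentials: ENDPOINT matches the sdschedule-database service name from the Compose files, and a later commit adds a non-root MySQL user based on this file. None of the code that loads it appears in this diff, so the following is only a sketch, assuming a configparser-based loader and the mysqlclient package pinned in requirements.txt; the function name and paths are invented for illustration.

# Illustrative sketch only, not part of this PR: shows how the [DB] section
# of config.ini could be read and turned into a MySQL connection.
import configparser

import MySQLdb  # provided by the mysqlclient package in requirements.txt


def load_db_config(path="config/config.ini"):
    """Read the [DB] section and return connection parameters."""
    parser = configparser.ConfigParser()
    parser.read(path)
    db = parser["DB"]
    return {
        "user": db["USERNAME"],
        "passwd": db["PASSWORD"],
        "host": db["ENDPOINT"],  # e.g. the sdschedule-database container
    }


if __name__ == "__main__":
    # "classes" matches MYSQL_DATABASE in docker-compose.yml
    connection = MySQLdb.connect(db="classes", **load_db_config())
    print(connection.get_server_info())
    connection.close()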
2 changes: 2 additions & 0 deletions scraper/docker-run.sh
@@ -0,0 +1,2 @@
#!/bin/bash
python3 -u ./webreg_scrape_upload.py
Binary file added scraper/driver/chromedriver_linux
Binary file not shown.
File renamed without changes.
47 changes: 47 additions & 0 deletions scraper/requirements.txt
@@ -0,0 +1,47 @@
arrow==0.4.2
awsebcli==3.14.8
beautifulsoup4==4.6.3
blessed==1.15.0
botocore==1.12.71
bs4==0.0.1
cached-property==1.5.1
cement==2.8.2
certifi==2018.11.29
chardet==3.0.4
Click==7.0
colorama==0.3.9
docker==3.6.0
docker-compose==1.21.2
docker-pycreds==0.4.0
dockerpty==0.4.1
docopt==0.6.2
docutils==0.14
Flask==1.0.2
Flask-Caching==1.4.0
Flask-Compress==1.4.0
Flask-Cors==3.0.7
ics==0.4
idna==2.6
itsdangerous==1.1.0
Jinja2==2.10
jmespath==0.9.3
jsonschema==2.6.0
lxml==4.2.5
MarkupSafe==1.1.0
mysqlclient==1.3.14
pathspec==0.5.5
python-dateutil==2.7.5
pytz==2018.7
PyYAML==3.13
requests==2.18.4
selenium==3.8.0
semantic-version==2.5.0
six==1.12.0
SQLAlchemy==1.2.15
termcolor==1.1.0
texttable==0.9.1
urllib3==1.24.1
wcwidth==0.1.7
websocket-client==0.54.0
Werkzeug==0.14.1
uwsgi==2.0.17.1
@@ -1,14 +1,15 @@
import os
import shutil
import sqlite3
import requests
import sys
import traceback

import traceback
from threading import Thread, Lock
from requests.exceptions import Timeout

from settings import DATABASE_PATH, HOME_DIR, MAX_RETRIES
import requests
from requests.exceptions import Timeout

from settings import CAPES_URL, CAPES_HTML_PATH
from settings import DATABASE_PATH, MAX_RETRIES

CAPES_HOST = 'cape.ucsd.edu'
CAPES_ACCEPT = 'html'
@@ -17,7 +18,7 @@
class CAPESScraper:

def __init__(self):
# Read all departments from the SQL database
# Read all departments from the SQL database
self.database = sqlite3.connect(DATABASE_PATH)
self.cursor = self.database.cursor()
self.cursor.execute("SELECT DEPT_CODE FROM DEPARTMENT")
@@ -35,15 +36,15 @@ def __init__(self):
shutil.rmtree(CAPES_HTML_PATH)
os.makedirs(CAPES_HTML_PATH)

# Thread-safe way of marking that at least one thread has crashed
# Thread-safe way of marking that at least one thread has crashed
def set_crashed(self):
self.mutex.acquire()
self.mutex.acquire()
try:
self.crashed = True
finally:
self.mutex.release()

# Thread-safe way of checking if the program has crashed
# Thread-safe way of checking if the program has crashed
def has_crashed(self):
local_crashed = False
self.mutex.acquire()
@@ -63,7 +64,7 @@ def iter_departments(self):
pool = []
pool_size = os.cpu_count()
print("Initializing {} threads ...".format(pool_size))

# Allocate a pool of threads; each worker handles an equal subset of the work
for i in range(pool_size):
t = Thread(target=self.iter_departments_by_thread_handle_errors, args=[i, pool_size])
@@ -78,7 +79,7 @@ def iter_departments_by_thread_handle_errors(self, thread_id, num_threads):
# If a thread receives an error during execution, kill all threads & mark program as crashed
try:
self.iter_departments_by_thread(thread_id, num_threads)
except:
except:
print("Error encountered by thread {}. Gracefully exiting ...".format(thread_id), file=sys.stderr)
traceback.print_exc(file=sys.stderr)
self.set_crashed()
@@ -88,7 +89,7 @@ def iter_departments_by_thread(self, thread_id, num_threads):

# Iterate through each department that the thread is assigned to
for counter in range(thread_id, len(self.departments), num_threads):
# Exit if any part of the scraper has crashed
# Exit if any part of the scraper_impl has crashed
if self.has_crashed():
print("Thread {} is exiting gracefully ...".format(thread_id), file=sys.stderr)
return
@@ -118,9 +119,9 @@ def get_page_with_retries(self, page_url, thread_id):
'Accept': CAPES_ACCEPT,
'User-Agent': CAPES_USER_AGENT
})
return response
return response
except Timeout as timeout_exception:
retries += 1
retries += 1
print ("[T{0}] Failed to download page {1}.".format(thread_id, page_url))
if retries < max_retries:
print ("[T{0}] {1}/{2} attempts. Retrying ...".format(thread_id, retries, max_retries))
Expand All @@ -130,7 +131,7 @@ def get_page_with_retries(self, page_url, thread_id):

# Tries to store the given page contents into a file in our cache
def store_page(self, department, page_contents, thread_id):
# Cache page content appropriately
# Cache page content appropriately
with open(os.path.join(CAPES_HTML_PATH, department + '.html'), 'w') as f:
f.write(page_contents)
print('[T{0}] Saving'.format(thread_id), department, 'to', f.name, '...')
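The hunks above keep the CAPES scraper's threading scheme intact: a pool of os.cpu_count() workers shares one department list, with thread i taking indices i, i + num_threads, i + 2*num_threads, and so on, while a mutex-guarded crashed flag lets any worker signal the others to exit. A standalone sketch of that striped work assignment follows; the department list and the worker body are invented for the example.

# Standalone illustration of the striped work split used by the scraper;
# the departments and the print statement are placeholders.
import os
from threading import Thread

departments = ["CSE", "MATH", "PHYS", "BILD", "HIST", "ECON", "COGS"]
pool_size = os.cpu_count() or 4


def worker(thread_id, num_threads):
    # Thread t handles indices t, t + num_threads, t + 2 * num_threads, ...
    for index in range(thread_id, len(departments), num_threads):
        print("[T{}] scraping {}".format(thread_id, departments[index]))


threads = [Thread(target=worker, args=(i, pool_size)) for i in range(pool_size)]
for t in threads:
    t.start()
for t in threads:
    t.join()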
@@ -14,14 +14,14 @@
from selenium.webdriver.support.wait import WebDriverWait

from scraper.scraper_util import Browser
from settings import COURSES_HTML_PATH
from settings import COURSES_HTML_PATH, QUARTERS_TO_SCRAPE
from settings import DATABASE_PATH, DATABASE_FOLDER_PATH
from settings import SCHEDULE_OF_CLASSES_URL
from settings import TIMEOUT, DEPT_SEARCH_TIMEOUT

QUARTER_INSERT_SCRIPT = """let select = document.getElementById("selectedTerm");
let opt = document.createElement('option');
opt.value = "WI19";
opt.value = "SP19";
opt.innerHTML = "bad";
select.appendChild(opt);
document.getElementById("selectedTerm").value = "{}";
@@ -65,30 +65,32 @@ def join(self, **kwargs):
def scrape_departments(self):
with Browser() as self.browser:
while not self.work_queue.empty():
department = self.work_queue.get()
self.scrape_department(department)
work = self.work_queue.get()
department = work["department"]
quarter = work["quarter"]
self.scrape_department(department, quarter)
self.work_queue.task_done()

def scrape_department(self, department):
def scrape_department(self, department, quarter):
# If a thread receives an error during execution, kill all threads & mark program as crashed
try:
self._scrape_department(department)
self._scrape_department(department, quarter)
except:
print("Error encountered by thread {}. Gracefully exiting ...".format(self.thread_id), file=sys.stderr)
traceback.print_exc(file=sys.stderr)

def _scrape_department(self, department):
def _scrape_department(self, department, quarter):
try:
self.get_page(SCHEDULE_OF_CLASSES_URL)
WebDriverWait(self.browser, TIMEOUT).until(EC.presence_of_element_located((By.ID, 'selectedSubjects')))
self.search_department(department)
self.scrape_pages(department)
self.search_department(department, quarter)
self.scrape_pages(department, quarter)
except:
print("Thread {} is exiting gracefully ...".format(self.thread_id), file=sys.stderr)

def search_department(self, department):
def search_department(self, department, quarter):
# Script for running with the bug where we insert our own quarter code in the form
# browser.execute_script(QUARTER_INSERT_SCRIPT.format(QUARTER))
self.browser.execute_script(QUARTER_INSERT_SCRIPT.format(quarter))
dept_select = Select(self.browser.find_element_by_id("selectedSubjects"))
truncated_dept = department + (4 - len(department)) * " "
WebDriverWait(self.browser, TIMEOUT).until(
@@ -122,7 +124,7 @@ def get_page(self, page_url):
self.max_retries))
raise timeout_exception

def scrape_pages(self, department):
def scrape_pages(self, department, quarter):
# now I should be at the course pages
current_page = 1
base_url = self.browser.current_url
@@ -141,20 +143,21 @@ def scrape_pages(self, department, quarter):
return True

html = self.browser.page_source
self.save_page(department, html, current_page)
self.save_page(department, quarter, html, current_page)

current_page += 1
current_url = base_url + "?page={}".format(current_page)
self.get_page(current_url)

# Attempts to store the given page contents into a file in our cache
def save_page(self, department, page_contents, num_page):
def save_page(self, department, quarter, page_contents, num_page):
quarter_path = os.path.join(COURSES_HTML_PATH, quarter)
# Create department folder if it doesn't exist
department_path = os.path.join(COURSES_HTML_PATH, department)
department_path = os.path.join(quarter_path, department)
if not os.path.exists(department_path):
os.makedirs(department_path)
file_path = os.path.join(department_path, str(num_page) + '.html')
log_msg = '[T{0}] Saving {1} (#{2}) to {3}'.format(self.thread_id, department, num_page, file_path)
log_msg = '[T{0}] Saving {1} (Page #{2}) (Quarter {3}) to {4}'.format(self.thread_id, department, num_page, quarter, file_path)
writer.write(file_path, page_contents, log_msg)


@@ -170,8 +173,10 @@ def __init__(self):
# fetching the data returns a tuple with one element,
# so using list comprehension to convert the data
self.departments = [i[0] for i in self.cursor.fetchall()]
for department in self.departments:
self.department_queue.put(department)

for quarter in QUARTERS_TO_SCRAPE:
for department in self.departments:
self.department_queue.put({"department": department, "quarter": quarter})

# Recreate top level folder
if os.path.exists(COURSES_HTML_PATH):
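The main behavioural change in this file is the work queue: instead of enqueueing bare department codes, the scraper now pushes one {department, quarter} item per combination in QUARTERS_TO_SCRAPE, and save_page writes each page under a per-quarter directory. Below is a minimal sketch of that fan-out, with placeholder values standing in for settings.py and the DEPARTMENT table.

# Minimal sketch of the department x quarter fan-out added in this diff;
# the quarter list, departments, and output path are placeholders.
import os
from queue import Queue

QUARTERS_TO_SCRAPE = ["SP19", "FA19"]
departments = ["CSE", "MATH"]
COURSES_HTML_PATH = "courses_html"

work_queue = Queue()
for quarter in QUARTERS_TO_SCRAPE:
    for department in departments:
        work_queue.put({"department": department, "quarter": quarter})

# Workers pull items like this and save pages under
# <COURSES_HTML_PATH>/<quarter>/<department>/<page>.html
while not work_queue.empty():
    work = work_queue.get()
    target_dir = os.path.join(COURSES_HTML_PATH, work["quarter"], work["department"])
    os.makedirs(target_dir, exist_ok=True)
    work_queue.task_done()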
@@ -2,11 +2,9 @@
import sqlite3

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

from settings import DATABASE_PATH, DATABASE_FOLDER_PATH
from settings import DEPARTMENT_URL
from settings import HOME_DIR
from settings import DRIVER_PATH

class DepartmentScraper: