Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#86 Decoupling the scraper from the backend and editing the scraper to make it more versatile with different quarters #92

Open
wants to merge 37 commits into
base: master
Choose a base branch
from

Conversation

CTrando
Copy link
Collaborator

@CTrando CTrando commented Apr 3, 2019

No description provided.

@CTrando
Copy link
Collaborator Author

CTrando commented Apr 8, 2019

Make sure to work on a separate branch when you start doing the golang development. Don't worry about the frontend for now, if the request doesn't have a quarter default it to the latest one.

@snowme34
Copy link
Collaborator

Make sure to work on a separate branch when you start doing the golang development. Don't worry about the frontend for now, if the request doesn't have a quarter default it to the latest one.

For the "latest one", since the backend and scraper do not share the same config currently, I just created another config entry for the backend. Don't know if that works

@snowme34
Copy link
Collaborator

For the current error:

sdschedule-scraper     | Traceback (most recent call last):
sdschedule-scraper     |   File "./webreg_scrape_upload.py", line 51, in <module>
sdschedule-scraper     |     main()
sdschedule-scraper     |   File "./webreg_scrape_upload.py", line 23, in main
sdschedule-scraper     |     department_scraper = DepartmentScraper()
sdschedule-scraper     |   File "/app/scraper_impl/department_scraper.py", line 20, in __init__
sdschedule-scraper     |     self.browser = webdriver.Chrome(chrome_options=options, executable_path=DRIVER_PATH)
sdschedule-scraper     |   File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/chrome/webdriver.py", line 75, in __init__
sdschedule-scraper     |     desired_capabilities=desired_capabilities)
sdschedule-scraper     |   File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 154, in __init__
sdschedule-scraper     |     self.start_session(desired_capabilities, browser_profile)
sdschedule-scraper     |   File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 243, in start_session
sdschedule-scraper     |     response = self.execute(Command.NEW_SESSION, parameters)
sdschedule-scraper     |   File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 311, in execute
sdschedule-scraper     |     self.error_handler.check_response(response)
sdschedule-scraper     |   File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 237, in check_response
sdschedule-scraper     |     raise exception_class(message, screen, stacktrace)
sdschedule-scraper     | selenium.common.exceptions.SessionNotCreatedException: Message: session not created: Chrome version must be >= 69.0.3497.0
sdschedule-scraper     |   (Driver info: chromedriver=72.0.3626.69 (3c16f8a135abc0d4da2dff33804db79b849a7c38),platform=Linux 4.9.0-8-amd64 x86_64)
sdschedule-scraper     | 

I did some search and realized that this is a problem with versions of chrome and chrome-driver.

The driver in the driver directory seems to mismatch with the version needed by the container.

And I don't quite understand the reason for using this image.

@snowme34
Copy link
Collaborator

We might need to prune the requirements.txt a little bit. I'm trying to figure it out

@snowme34
Copy link
Collaborator

snowme34 commented Apr 19, 2019

Still working on pruning.

Hard to test since each scraping takes at least 20min.

Maybe we want a feature that our scraper only scrapes for one department for testing.

Don't know if the new Dockerfile and requirements.txt are good

@snowme34
Copy link
Collaborator

snowme34 commented May 30, 2019

Correct me if I'm wrong: it seems that the code creates a new db connection each time data is requested, which should not be the case, right?

I can take a look at this

docker-compose.yml Outdated Show resolved Hide resolved
@@ -4,9 +4,11 @@ services:
sdschedule-database:
container_name: sdschedule-database
image: mariadb:latest
ports:
- 10800:3306 # TODEL: DEV port
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODEL: Temporary Development Code

backend/db/db.go Outdated Show resolved Hide resolved
@snowme34
Copy link
Collaborator

snowme34 commented Jun 6, 2019

I am confused how you tested the code previously. I believe testing should not use real database.

I fixed the tests except one, this is the output:

go test -v ./...                                                                                 
=== RUN   TestGetClassData                                                                         
--- FAIL: TestGetClassData (0.00s)                                                                 
    GetClassData_test.go:73: json: cannot unmarshal object into Go value of type []routes.Subclass 
    GetClassData_test.go:77: Subclasses for CSE 11 should definitely not be 0!                     
=== RUN   TestGetRequestClassDataFails                                                             
--- PASS: TestGetRequestClassDataFails (0.00s)                                                     
=== RUN   TestGetDepartmentSummary                                                                 
--- PASS: TestGetDepartmentSummary (0.00s)                                                         
=== RUN   TestGetDepartmentSummaryFailsOnPost                                                      
--- PASS: TestGetDepartmentSummaryFailsOnPost (0.00s)                                              
=== RUN   TestGetDepartments                                                                       
--- PASS: TestGetDepartments (0.00s)                                                               
=== RUN   TestGetDepartmentsOnPostFails                                                            
--- PASS: TestGetDepartmentsOnPostFails (0.00s)                                                    
=== RUN   TestGetInstructors                                                                       
--- PASS: TestGetInstructors (0.00s)                                                               
=== RUN   TestGetTypes                                                                             
--- PASS: TestGetTypes (0.00s)                                                                     
FAIL                                                                                               
FAIL    github.com/ucsdscheduleplanner/UCSD-Schedule-Planner/backend/test       0.314s             

I did not change the code for getClassData at all but the JSON thing does not seem to be working.

@snowme34
Copy link
Collaborator

snowme34 commented Jun 6, 2019

Brief:

  1. I employed principles including but not limited to SRP, DRY, OO, OCP and so on when typing on the keyboard

Sample code:

const logTagDepartment = "[Department]"

// GetDepartments is a pre-http.HandlerFunc for department route and will become a closure with *DatabaseStruct
func GetDepartments(writer http.ResponseWriter, request *http.Request, ds *db.DatabaseStruct) {

	if request.Method != "GET" {
		errInvalidMethod(logTagDepartment, writer, request)
		return
	}

	queryAndResponse(
		ds, logTagDepartment, writer, request,
		rowScannerOneString,
		"SELECT DISTINCT DEPT_CODE FROM DEPARTMENT",
		"DEPARTMENT")
}

Two functions calls is exactly what is needed.

  1. I did not change the way of query for GetClassData since I do not quite understand the reason behind using a for loop rather than using one query

  2. I did not put too much efforts on re-writing the testing files (no major programming design principles applied) but one test is still failing

  3. Perhaps instead of "db", we should call that file "ds" or "Env" for the environment of the whole server-side software

  4. I left several TODO and TODEL in the code

  5. I'm not extremely confident about the way to handle errors and logging (for routes)

  6. The next step is to write Dockerfile and try to integrate it with other componenets

@CTrando
Copy link
Collaborator Author

CTrando commented Jun 16, 2019

So if there is a test failing that is an issue because I made the tests such that the JSON format is what the frontend expects. Also I agree that it's not great to require a database connection in order to run the tests, but these integration tests are what's needed for full confidence, unit tests won't guarantee anything.

Copy link
Collaborator Author

@CTrando CTrando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So overall I do like the changes you've done. There are a lot of issues - I think you've fallen for premature optimization but everyone does. You've tried to make methods for each individual usecase that you might have and I think that is not a great way to proceed - sure if a specific pattern is done a lot we can make a method for it, but most of these cases we have only used once or twice.

Moreover some granularity is good. Instead of 10 custom error methods, you can do something like

err(ERR_STRING, type)

And you can have a bunch of different types of error messages instead of having a bunch of different types of error methods. Try to build up your intuition on this - it might take some time but always remember that you program to make stuff easier for yourself. So if you rip out a bunch of stuff to make things look nicer but when you want to add a new error message or a new usecase you have to write a bunch of boilerplate code then consider if you are making the right design choices.

backend/backend.go Outdated Show resolved Hide resolved
backend/backend.go Outdated Show resolved Hide resolved
backend/backend.go Outdated Show resolved Hide resolved
backend/backend.go Outdated Show resolved Hide resolved
backend/db/db.go Outdated Show resolved Hide resolved
backend/routes/Department.go Outdated Show resolved Hide resolved
backend/routes/Instructors.go Outdated Show resolved Hide resolved
backend/routes/RoutesCommon.go Outdated Show resolved Hide resolved
backend/routes/RoutesCommon.go Outdated Show resolved Hide resolved
docker-compose.yml Outdated Show resolved Hide resolved
@snowme34
Copy link
Collaborator

Thank you so much for your feedback!

I will try to follow them and try to dive deeper! I found this sql query interesting: https://stackoverflow.com/questions/1136380/sql-where-in-clause-multiple-columns

@snowme34
Copy link
Collaborator

snowme34 commented Aug 3, 2019

Let's use the username: "splanner", shall we?


I refactored things. It compiles

  • add several structs, like ErrorStruct and QueryStruct
  • separate queryAndResponse into query and response
  • rename files based on the recommendations here
  • more comments
  • rewrite the beast, class data route, to create a query on the fly

Next step is

  1. add env to remove awkward hard-coded variable
  2. use docker to add integration (test)
  3. more refactor!
  4. fix the awkward multiple query for class data by means of row constructor
  5. improve error according to https://tools.ietf.org/html/rfc7807
  6. use env instead of *db
  7. closure the query function for handlers with env.db
  8. add caching

@snowme34
Copy link
Collaborator

snowme34 commented Aug 4, 2019

This is a long pull request.

I finally figure out why the class data route test fails. The test expects a pure array of json. But the code returns a map like this: {"ABC 11": [...], "MATH 12": [...]}

  • unit tests are suggested to be put right next to the source code by effective go
  • test directory is fine with integration test

Let's first reach a consensus about out API:

  • contract
    • format
    • keys and values
  • duplicates
    • what happens if user requests same classes multiple times

Copy link
Collaborator Author

@CTrando CTrando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks really good. Don't merge this yet of course but I'm impressed by your code quality. Just a few things here and there. If the tests fail because they expect some format, that is the format that the frontend expects.

backend/backend.go Outdated Show resolved Hide resolved

// TODO: use env in handlers
// TODO: make a handler factory
http.HandleFunc("/api_course_nums", route.MakeHandler(route.GetCourseNums, db, route.LogPrefixCourseNums))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this LogPrefix?

backend-py/config/config.example.ini Outdated Show resolved Hide resolved
)

// LogPrefixClassData log prefix for ClassData route
const LogPrefixClassData = "[ClassData]"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is only one prefix can't you just make it named LogPrefix instead of having to put ClassData at the end? I realize you have this in other places too, but can't you do Package.LogPrefix instead of LogPrefixPackage in those other spots?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since they are all under the route package, they are all in the same name scope. I actually struggled with that too but decided to go with this

backend/route/classdata.go Outdated Show resolved Hide resolved
backend/route/types.go Outdated Show resolved Hide resolved
backend/store/db.go Outdated Show resolved Hide resolved
backend/util/util.go Show resolved Hide resolved
[]interface{}{"LTEN", "26", "LTEN", "26"},
},
{
fmt.Sprintf(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you are combining all statements together instead of making separate statements. This is probably for a performance increase, but have you noticed any markably different performance differences? If the code is must more reasonable doing each query separately and you don't have full confidence in your method of doing this, then I would say the performance drop is justified. However, if you are confident then I'm willing to go along with it.


}

func TestGetClassData(t *testing.T) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More tests for get class data and departments and such, these are the big ones that the web app uses.

Copy link
Collaborator

@snowme34 snowme34 Aug 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add after a consensus for REST API

@CTrando
Copy link
Collaborator Author

CTrando commented Aug 10, 2019

Ok sure, I can write something up on the contract.

@snowme34
Copy link
Collaborator

snowme34 commented Aug 10, 2019

Ok sure, I can write something up on the contract.

That will be great! Include but not limit to, api names (do we really need the api prefix? I have no idea), api format, api input format, api input check, cache-ability.

Should work as the first scratch, but somethings existed before, like caching, is not implemented yet.


About the query

Performance of simple query always triumphs. However, back and forth database calls should be generally avoided. That is sync calls between components, which is to put components in series.

But of course, there is never a perfect solution that works optimally for all situations. I think I will be confident if I read the front-end code which calls that api.

@CTrando
Copy link
Collaborator Author

CTrando commented Aug 12, 2019

Performance of simple query always triumphs. However, back and forth database calls should be generally avoided. That is sync calls between components, which is to put components in series.

Sure I agree with you that it will be more performant, but I'm saying there is a tradeoff between readability and maintainability in combining all the queries into one. By doing this way we add more edge cases and potential mishaps. That's why you should be confident in your current algorithm before switching to this, but I agree doing it in one query is more performant.

@CTrando
Copy link
Collaborator Author

CTrando commented Aug 12, 2019

Ok let's define the format.

We have

  • api_department?quarter={QUARTER} : Returns a list of departments for the given quarter, e.g ["CSE", "ECE", etc] - This is called when the page initially loads up to get a list of departments to show.
  • api_classpreview?quarter={QUARTER}&department={DEPARTMENT} : Returns a list of objects of the form:
    {courseNum: COURSE_NUM, data : {description: DESCRIPTION, instructors: [INSTRUCTORS], types: [TYPES]}}. It's used to show the user a preview of the class, for example it shows the user that 11 is a course number for department CSE. INSTRUCTORS and TYPES are a list of instructors and types such as [Joe Politz, Rick Ord], [LE, DI].
  • api_classdata: This one is up for debate. We can either make it take in only one class (a department + courseNum pair) and return the data for just that class, or we can have it such that it takes in a list of pairs and it gives results for that entire one. I think maybe it would be better to do it as multiple small requests because it would be easier to process on the backend and frontend.

@snowme34
Copy link
Collaborator

snowme34 commented Aug 12, 2019

Performance of simple query always triumphs. However, back and forth database calls should be generally avoided. That is sync calls between components, which is to put components in series.

Sure I agree with you that it will be more performant, but I'm saying there is a tradeoff between readability and maintainability in combining all the queries into one. By doing this way we add more edge cases and potential mishaps. That's why you should be confident in your current algorithm before switching to this, but I agree doing it in one query is more performant.

If we restrict the quarter to be identical per query (it's weird for people to schedule multiple quarters on one calendar), then my SQL building function will be much simpler (just creating correct number of question marks)


I was thinking using a api.example.com domain for apis would be better than the "api_" prefix


api_department?quarter={QUARTER}

[{"AAA", "BBB"}]

api_classreview?quarter={QUARTER}&department={DEPARTMENT}

[{courseNum: 99, data : {description: abyss, instructors: ["A", "B"], types: ["LE", "DI"]}}, {courseNum: 100, data : {description: void, instructors: ["A", "B"], types: ["LE", "DI"]}}]

api_classdata

POST with a list of requests

{
"quarter": "CLASS_DATA",
"classes": [{"department": "AAA", "courseNum": "1"}, {"department": "AAA", "courseNum": "2"}] 
}

I would like more details about the response format that fits the front-end 's needs well.

[...]

I couldn't figure out why it's easier to process with one class per request. For front end, we will send an array of promises and wait_all, otherwise it will be an overkill for performance. For back end, we will have exponential number of requests with more users since they will add more and more classes.

https://restfulapi.net/rest-api-design-tutorial-with-example/

P.S. about row-constructor performance, according to this page, it seems that as long as we create proper index for composite keys, there should be no huge difference. And it should be safe to conclude that doing the same thing with multiple queries will be slower than with one query.


Finally I came up with a solution that works! We just need to support 2 HTTP methods for "class_data", the GET with query strings returns one class. the POST with a json array returns an array. And we could potentially hard-code some upper limit about the post request and may even check the duplications inside. I'm looking forward to your input!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants