
Allow exact matches and create greater abstraction of the base scraper class. #114

Closed
markkvdb wants to merge 16 commits

Conversation

@markkvdb (Collaborator) commented Oct 6, 2020

Allow exact matches and create greater abstraction of the base scraper class

Description

  • Issue #80 (Improved search keyword encoding with support for exact phrase) requests support for exact search queries. This is now implemented by changing how the search URL is constructed: a new option (--exact-result on the CLI, exact_result in the YAML settings) activates the exact query, which in practice just means wrapping the query in quotes (see the sketch after this list).
  • The search URL construction has been adapted to simplify adding more locales. For non-English locales, Monster's search URL is slightly different; this can now be set with one line. Likewise, if a locale has unique search options, those can also be added with one line.
  • While working on the search URL construction I noticed that more of the functionality built into the individual scrapers could be moved to the base scraper. This follows from the observation that every scraper starts by collecting the number of jobs and pages (get_n_pages), then fetches all search pages (get_job_soup_page) and the listings on those pages (_parse_job_listings_to_bs4).
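
To illustrate the exact-query change, here is a minimal sketch of the quoting step. The option names (--exact-result, exact_result) are taken from this PR's description; the helper function below is hypothetical and only demonstrates the idea of wrapping the query in quotes before URL-encoding it.

```python
from urllib.parse import quote_plus


def build_search_query(keywords: str, exact_result: bool) -> str:
    """Build the keyword portion of a provider search URL.

    When exact_result is enabled, wrap the keywords in double quotes so the
    provider treats them as an exact phrase; otherwise pass them through
    unchanged. The result is URL-encoded either way.
    """
    query = f'"{keywords}"' if exact_result else keywords
    return quote_plus(query)


# build_search_query("data scientist", exact_result=True)  -> '%22data+scientist%22'
# build_search_query("data scientist", exact_result=False) -> 'data+scientist'
```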

Context of change

Please add options that are relevant and mark any boxes that apply.

  • Software (software that runs on the PC)
  • Library (library that runs on the PC)
  • Tool (tool that assists coding development)
  • Other

Type of change

Please mark any boxes that apply.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

The only genuinely new functionality is support for exact queries. Testing this consists of making sure that the CLI flag and the YAML entry are handled correctly, which was straightforward because I followed the same procedure as similar_results. The actual change in the code base is a single line that wraps the query in quotes, so little additional testing is required.
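
For reference, a minimal sketch of the kind of wiring these tests cover, modelled on the existing similar_results option. It assumes an argparse-based CLI and a dict-like YAML settings object; the helper names, the search: nesting, and the precedence rule shown here are illustrative, not the exact code in jobfunnel/config/cli.py and jobfunnel/config/settings.py.

```python
import argparse

DEFAULT_EXACT_RESULT = False  # default mirrors the DEFAULT_EXACT_RESULT added in this PR


def add_search_args(parser: argparse.ArgumentParser) -> None:
    """Register the exact-result flag alongside the other search options."""
    parser.add_argument(
        "--exact-result",
        action="store_true",
        default=DEFAULT_EXACT_RESULT,
        help="Only return results matching the query as an exact phrase.",
    )


def resolve_exact_result(cli_args: argparse.Namespace, yaml_settings: dict) -> bool:
    """CLI flag takes precedence; otherwise fall back to the YAML exact_result entry."""
    if cli_args.exact_result:
        return True
    return bool(
        yaml_settings.get("search", {}).get("exact_result", DEFAULT_EXACT_RESULT)
    )
```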

Checklist:

Please mark any boxes that have been completed.

  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • Any dependent changes have been merged and published in downstream modules.

@markkvdb markkvdb added this to the 3.0.1 milestone Oct 6, 2020
@markkvdb markkvdb linked an issue Oct 6, 2020 that may be closed by this pull request
@codecov-commenter

Codecov Report

Merging #114 into master will increase coverage by 1.80%.
The diff coverage is 37.50%.


@@            Coverage Diff             @@
##           master     #114      +/-   ##
==========================================
+ Coverage   35.72%   37.53%   +1.80%     
==========================================
  Files          22       22              
  Lines        1447     1420      -27     
==========================================
+ Hits          517      533      +16     
+ Misses        930      887      -43     
Impacted Files Coverage Δ
jobfunnel/config/manager.py 30.43% <ø> (+0.64%) ⬆️
jobfunnel/config/settings.py 66.66% <ø> (ø)
jobfunnel/backend/scrapers/base.py 36.01% <31.66%> (-3.48%) ⬇️
jobfunnel/backend/scrapers/monster.py 32.11% <38.09%> (+5.07%) ⬆️
jobfunnel/backend/scrapers/indeed.py 34.10% <40.00%> (+7.11%) ⬆️
jobfunnel/backend/scrapers/glassdoor.py 36.60% <42.85%> (+5.67%) ⬆️
jobfunnel/config/cli.py 89.47% <100.00%> (+0.14%) ⬆️
jobfunnel/config/search.py 77.14% <100.00%> (+0.67%) ⬆️
jobfunnel/resources/defaults.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a31e09f...616d351.

@PaulMcInnis (Owner) left a comment

I like the implementation of DEFAULT_EXACT_RESULT, but I disagree with extending the base scraper in a way that restricts it to the 'pages of results' type of job-provider website. I think we may want to try building separate abstract base classes per provider workflow to help with this.

I also think the names of the abstract methods need to be revisited and documented more thoroughly. In particular, I think we might want to use the words Stem and Complete to separate scraping a job from the listing vs. from the page dedicated to that job (in the method names and docstrings). I wonder if we could even build StemJob vs. CompleteJob?

Returns:
List[BeautifulSoup]: list of jobs soups we can use to make a Job
"""

@PaulMcInnis (Owner) commented:

While I am in favour of implementing abstract classes to go from job pages -> listings -> soups, I think we should put this workflow into its own class, such as BaseMultiPageScraper, so that we can write scrapers for static web pages that only have a single page (sort of like Monster).

I think this way we can handle the single-page, scrolling type of job sites with a BaseSinglePageScraper.
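
A rough sketch of the split being suggested here, assuming the existing base class is BaseScraper in jobfunnel/backend/scrapers/base.py; the class names come from this comment, while the method names and signatures below are only one possible shape, not code from this branch.

```python
from abc import abstractmethod
from typing import List, Tuple

from bs4 import BeautifulSoup

from jobfunnel.backend.scrapers.base import BaseScraper


class BaseMultiPageScraper(BaseScraper):
    """Providers that paginate results (the 'pages of results' workflow)."""

    @abstractmethod
    def extract_pages_and_total_listings(self, soup: BeautifulSoup) -> Tuple[int, int]:
        """Return (number of result pages, total number of listings)."""

    @abstractmethod
    def get_job_soups_page(self, page: int) -> List[BeautifulSoup]:
        """Return a soup for every job listing on the given result page."""


class BaseSinglePageScraper(BaseScraper):
    """Providers that serve all results on a single (scrolling) page."""

    @abstractmethod
    def get_job_soups(self) -> List[BeautifulSoup]:
        """Return a soup for every job listing on the one results page."""
```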

return max_pages

@abstractmethod
def _extract_pages_and_total_listings(self, soup: BeautifulSoup) -> Tuple[int, int]:
@PaulMcInnis (Owner) commented:

All abstract methods should have a name without the leading _; I had added that to indicate that those methods were private to the specific scraper class.

Additionally, all the stubs should have a detailed docstring explaining the expected implementation.
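
As an illustration of both points (drop the leading underscore, document the expected implementation), the stub quoted below might become something like the following inside the base scraper class, reusing the imports already present there (abstractmethod, Tuple, BeautifulSoup); the docstring wording is only an example.

```python
    @abstractmethod
    def extract_pages_and_total_listings(self, soup: BeautifulSoup) -> Tuple[int, int]:
        """Extract pagination info from the first page of search results.

        Implementations should parse the provider-specific element that
        reports the result count and return a tuple of:

            (number of result pages to scrape, total number of job listings)

        Args:
            soup: BeautifulSoup of the first page of search results.
        """
```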

def _extract_pages_and_total_listings(self, soup: BeautifulSoup) -> Tuple[int, int]:
"""Method to extract the total number of listings and pages."""

def _get_job_soups_page(self, page: int,
@PaulMcInnis (Owner) commented:

I'm confused about the difference between this and get_job_soups above.

@PaulMcInnis (Owner) commented:

Also, the docstring here refers to Indeed, but this is in the base scraper.


return list(job_soup_dict.values())

def _get_n_pages(self, max_pages: Optional[int] = None) -> int:
@PaulMcInnis (Owner) commented:

I would prefer get_num_result_pages

@@ -423,6 +416,132 @@ def _validate_get_set(self) -> None:
[field.name for field in excluded_fields]
)

def get_job_soups(self) -> List[BeautifulSoup]:
@PaulMcInnis (Owner) commented:

The naming here is also a bit confusing; perhaps we can call it get_job_listings_as_soup? The current name somewhat conflicts with _get_job_soups_page below.

@markkvdb (Collaborator, Author) replied:

Agree with the naming. It's a bit confusing as of now. Will think about more consistent and clearer naming.

@PaulMcInnis PaulMcInnis modified the milestones: 3.0.1, 3.0.2 Oct 11, 2020
@PaulMcInnis (Owner) commented:

Closing this for now due to inactivity, 100% open to revisiting this.

Development

Successfully merging this pull request may close these issues.

Improved search keyword encoding with support for exact phrase
3 participants