
Allow exact matches and create greater abstraction of the base scraper class. #114

Closed
markkvdb wants to merge 16 commits

Conversation

@markkvdb (Collaborator) commented Oct 6, 2020

Allow exact matches and create greater abstraction of the base scraper class

Description

  • Issue #80 (Improved search keyword encoding with support for exact phrase) requests support for exact search queries. This is now implemented by changing how the search URL is constructed: a new option (--exact-result on the CLI, exact_result in the YAML settings) activates the exact query, which in practice just means wrapping the query in quotes (see the sketch after this list).
  • The search URL construction has been adapted to simplify adding more locales. For non-English locales, Monster's search URL is slightly different; this can now be set with one line. Likewise, if a locale has unique search options, those can also be added with one line.
  • While working on the search URL construction I noticed that more of the functionality built into the individual scrapers could be moved to the base scraper. This follows from the observation that every scraper starts by collecting the number of jobs and pages (get_n_pages), then fetches all search pages (get_job_soup_page) and the listings on those pages (_parse_job_listings_to_bs4).
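
To illustrate the exact-query change, here is a minimal sketch of the quoting step. The option names (--exact-result, exact_result) are taken from this PR's description; the helper function below is hypothetical and only demonstrates the idea of wrapping the query in quotes before URL-encoding it.

```python
from urllib.parse import quote_plus


def build_search_query(keywords: str, exact_result: bool) -> str:
    """Build the keyword portion of a provider search URL.

    When exact_result is enabled, wrap the keywords in double quotes so the
    provider treats them as an exact phrase; otherwise pass them through
    unchanged. The result is URL-encoded either way.
    """
    query = f'"{keywords}"' if exact_result else keywords
    return quote_plus(query)


# build_search_query("data scientist", exact_result=True)  -> '%22data+scientist%22'
# build_search_query("data scientist", exact_result=False) -> 'data+scientist'
```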

Context of change

Please add options that are relevant and mark any boxes that apply.

  • Software (software that runs on the PC)
  • Library (library that runs on the PC)
  • Tool (tool that assists coding development)
  • Other

Type of change

Please mark any boxes that apply.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

The only genuinely new functionality is support for exact queries. Testing this consists of making sure that the CLI flag and the YAML entry are handled correctly, which was straightforward because I followed the same procedure as similar_results. The actual change in the code base is a single line that wraps the query in quotes, so little additional testing is required.
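
For reference, a minimal sketch of the kind of wiring these tests cover, modelled on the existing similar_results option. It assumes an argparse-based CLI and a dict-like YAML settings object; the helper names, the search: nesting, and the precedence rule shown here are illustrative, not the exact code in jobfunnel/config/cli.py and jobfunnel/config/settings.py.

```python
import argparse

DEFAULT_EXACT_RESULT = False  # default mirrors the DEFAULT_EXACT_RESULT added in this PR


def add_search_args(parser: argparse.ArgumentParser) -> None:
    """Register the exact-result flag alongside the other search options."""
    parser.add_argument(
        "--exact-result",
        action="store_true",
        default=DEFAULT_EXACT_RESULT,
        help="Only return results matching the query as an exact phrase.",
    )


def resolve_exact_result(cli_args: argparse.Namespace, yaml_settings: dict) -> bool:
    """CLI flag takes precedence; otherwise fall back to the YAML exact_result entry."""
    if cli_args.exact_result:
        return True
    return bool(
        yaml_settings.get("search", {}).get("exact_result", DEFAULT_EXACT_RESULT)
    )
```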

Checklist:

Please mark any boxes that have been completed.

  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • Any dependent changes have been merged and published in downstream modules.

@markkvdb markkvdb added this to the 3.0.1 milestone Oct 6, 2020
@markkvdb markkvdb linked an issue Oct 6, 2020 that may be closed by this pull request
@codecov-commenter

Codecov Report

Merging #114 into master will increase coverage by 1.80%.
The diff coverage is 37.50%.


@@            Coverage Diff             @@
##           master     #114      +/-   ##
==========================================
+ Coverage   35.72%   37.53%   +1.80%     
==========================================
  Files          22       22              
  Lines        1447     1420      -27     
==========================================
+ Hits          517      533      +16     
+ Misses        930      887      -43     
Impacted Files Coverage Δ
jobfunnel/config/manager.py 30.43% <ø> (+0.64%) ⬆️
jobfunnel/config/settings.py 66.66% <ø> (ø)
jobfunnel/backend/scrapers/base.py 36.01% <31.66%> (-3.48%) ⬇️
jobfunnel/backend/scrapers/monster.py 32.11% <38.09%> (+5.07%) ⬆️
jobfunnel/backend/scrapers/indeed.py 34.10% <40.00%> (+7.11%) ⬆️
jobfunnel/backend/scrapers/glassdoor.py 36.60% <42.85%> (+5.67%) ⬆️
jobfunnel/config/cli.py 89.47% <100.00%> (+0.14%) ⬆️
jobfunnel/config/search.py 77.14% <100.00%> (+0.67%) ⬆️
jobfunnel/resources/defaults.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a31e09f...616d351.

@PaulMcInnis (Owner) left a comment

I like the implementation of DEFAULT_EXACT_RESULT, but I disagree with extending the base scraper in a way that restricts it to the 'pages of results' type of job-provider website. I think we may want to try building separate abstract base classes per provider workflow to help with this.

I also think the names of the abstract methods need to be revisited and documented more thoroughly. In particular, I think we might want to use the words Stem and Complete to separate scraping a job from the listing vs. from the page dedicated to that job (in the method names and docstrings). I wonder if we could even build StemJob vs. CompleteJob?

Returns:
List[BeautifulSoup]: list of jobs soups we can use to make a Job
"""

@PaulMcInnis (Owner) commented:

While I am in favour of implementing abstract classes to go from job pages -> listings -> soups, I think we should put this workflow into its own class, such as BaseMultiPageScraper, so that we can write scrapers for static web pages that only have a single page (sort of like Monster).

I think this way we can handle the single-page, scrolling type of job sites with a BaseSinglePageScraper.
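
A rough sketch of the split being suggested here, assuming the existing base class is BaseScraper in jobfunnel/backend/scrapers/base.py; the class names come from this comment, while the method names and signatures below are only one possible shape, not code from this branch.

```python
from abc import abstractmethod
from typing import List, Tuple

from bs4 import BeautifulSoup

from jobfunnel.backend.scrapers.base import BaseScraper


class BaseMultiPageScraper(BaseScraper):
    """Providers that paginate results (the 'pages of results' workflow)."""

    @abstractmethod
    def extract_pages_and_total_listings(self, soup: BeautifulSoup) -> Tuple[int, int]:
        """Return (number of result pages, total number of listings)."""

    @abstractmethod
    def get_job_soups_page(self, page: int) -> List[BeautifulSoup]:
        """Return a soup for every job listing on the given result page."""


class BaseSinglePageScraper(BaseScraper):
    """Providers that serve all results on a single (scrolling) page."""

    @abstractmethod
    def get_job_soups(self) -> List[BeautifulSoup]:
        """Return a soup for every job listing on the one results page."""
```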

return max_pages

@abstractmethod
def _extract_pages_and_total_listings(self, soup: BeautifulSoup) -> Tuple[int, int]:
@PaulMcInnis (Owner) commented:

All abstract methods should have a name without the leading _; I had added that to indicate that those methods were private to the specific scraper class.

Additionally, all the stubs should have a detailed docstring explaining the expected implementation.
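
As an illustration of both points (drop the leading underscore, document the expected implementation), the stub quoted below might become something like the following inside the base scraper class, reusing the imports already present there (abstractmethod, Tuple, BeautifulSoup); the docstring wording is only an example.

```python
    @abstractmethod
    def extract_pages_and_total_listings(self, soup: BeautifulSoup) -> Tuple[int, int]:
        """Extract pagination info from the first page of search results.

        Implementations should parse the provider-specific element that
        reports the result count and return a tuple of:

            (number of result pages to scrape, total number of job listings)

        Args:
            soup: BeautifulSoup of the first page of search results.
        """
```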

def _extract_pages_and_total_listings(self, soup: BeautifulSoup) -> Tuple[int, int]:
"""Method to extract the total number of listings and pages."""

def _get_job_soups_page(self, page: int,
@PaulMcInnis (Owner) commented:

I'm confused about the difference between this and get_job_soups above.

@PaulMcInnis (Owner) commented:

Also, the docstring here refers to Indeed, but this is in the base scraper.


return list(job_soup_dict.values())

def _get_n_pages(self, max_pages: Optional[int] = None) -> int:
@PaulMcInnis (Owner) commented:

I would prefer get_num_result_pages

@@ -423,6 +416,132 @@ def _validate_get_set(self) -> None:
[field.name for field in excluded_fields]
)

def get_job_soups(self) -> List[BeautifulSoup]:
@PaulMcInnis (Owner) commented:

The naming here is also a bit confusing; perhaps we can call it get_job_listings_as_soup? The current name somewhat conflicts with _get_job_soups_page below.

@markkvdb (Collaborator, Author) replied:

Agree with the naming. It's a bit confusing as of now. Will think about more consistent and clearer naming.

@PaulMcInnis PaulMcInnis modified the milestones: 3.0.1, 3.0.2 Oct 11, 2020
@PaulMcInnis (Owner) commented:

Closing this for now due to inactivity, 100% open to revisiting this.

Development

Successfully merging this pull request may close these issues.

Improved search keyword encoding with support for exact phrase
3 participants