Allow exact matches and create greater abstraction of the base scraper class. #114
Conversation
Codecov Report
@@ Coverage Diff @@
## master #114 +/- ##
==========================================
+ Coverage 35.72% 37.53% +1.80%
==========================================
Files 22 22
Lines 1447 1420 -27
==========================================
+ Hits 517 533 +16
+ Misses 930 887 -43
Continue to review full report at Codecov.
I like the implementation of DEFAULT_EXACT_RESULT, but I disagree with extending the base scraper to restrict it to the 'pages of results' type of job-provider website. I think we may want to try building separate abstract base classes per provider workflow to help with this.
I also think the names of the abstract methods need to be revisited and documented more thoroughly. In particular, I think we might want to try using the words Stem and Complete to separate the scraping of a job from the listing vs. from the page dedicated to that job (in the method names and docstrings). I wonder if we could even build StemJob vs CompleteJob?
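One way to read the Stem/Complete idea, sketched as dataclasses — the names StemJob and CompleteJob come from the comment above, but the fields are purely illustrative assumptions, not anything in this PR:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StemJob:
    """Partial job data scraped from a search-results listing."""
    title: str
    company: str
    url: str


@dataclass
class CompleteJob(StemJob):
    """Job data completed by scraping the page dedicated to that job."""
    description: str = ""
    salary: Optional[str] = None
```

A scraper could then yield StemJob objects from the listings pass and upgrade each one to a CompleteJob in a second, per-job-page pass.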
Returns:
    List[BeautifulSoup]: list of job soups we can use to make a Job
"""
While I am in favour of implementing abstract classes to get job pages -> listings -> soups, I think we should put this workflow into its own class such as BaseMultiPageScraper, so that we can write scrapers for static web pages that only have a single page (sort of like Monster). I think this way we can do the single-page scroll type of job sites as a BaseSinglePageScraper.
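The split suggested here could look roughly like this — a minimal sketch; only the names BaseMultiPageScraper and BaseSinglePageScraper come from the comment, and the method names are illustrative assumptions:

```python
from abc import ABC, abstractmethod
from typing import List


class BaseScraper(ABC):
    """Scraping logic shared by every provider, regardless of page layout."""

    @abstractmethod
    def get_job_soups(self) -> List["BeautifulSoup"]:
        """Return one soup per job listing found for the query."""


class BaseMultiPageScraper(BaseScraper):
    """Providers that paginate results: pages -> listings -> soups."""

    @abstractmethod
    def get_num_result_pages(self) -> int:
        """Return how many result pages the current query produced."""


class BaseSinglePageScraper(BaseScraper):
    """Providers that serve all results on a single (scrolling) page."""
```

Each provider workflow then subclasses the base that matches its site layout, and only the shared query/session logic stays in BaseScraper.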
return max_pages

@abstractmethod
def _extract_pages_and_total_listings(self, soup: BeautifulSoup) -> Tuple[int, int]:
All abstract methods should have a name without the leading _; I had added that to indicate that those methods were private to the specific scraper class. Additionally, all the stubs should have a detailed docstring explaining the expected implementation.
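A stub following both suggestions (public name, detailed docstring) might look like this; the class name and docstring wording are illustrative, not the PR's actual code:

```python
from abc import ABC, abstractmethod
from typing import Tuple


class Scraper(ABC):

    @abstractmethod
    def extract_pages_and_total_listings(self, soup) -> Tuple[int, int]:
        """Read the result counts from the first page of search results.

        Implementations should locate the provider-specific element that
        reports the total hit count and derive the page count from it.

        Args:
            soup: parsed HTML (BeautifulSoup) of the first results page.

        Returns:
            Tuple[int, int]: (number of result pages, total listings).
        """
```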
def _extract_pages_and_total_listings(self, soup: BeautifulSoup) -> Tuple[int, int]:
    """Method to extract the total number of listings and pages."""
def _get_job_soups_page(self, page: int, |
I'm confused about the difference between this and get_job_soups above.
Also, the docstring here refers to Indeed, but this is the base scraper.
return list(job_soup_dict.values())
def _get_n_pages(self, max_pages: Optional[int] = None) -> int: |
I would prefer get_num_result_pages
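Whatever the final name, the logic such a method wraps is a ceiling division capped by the user's max-pages setting. A hedged sketch — the signature and parameter names are assumed here, not taken from the PR:

```python
import math
from typing import Optional


def get_num_result_pages(total_listings: int, listings_per_page: int,
                         max_pages: Optional[int] = None) -> int:
    """Number of result pages to fetch, optionally capped at max_pages."""
    n_pages = math.ceil(total_listings / listings_per_page)
    if max_pages is not None:
        n_pages = min(n_pages, max_pages)
    return n_pages
```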
@@ -423,6 +416,132 @@ def _validate_get_set(self) -> None:
    [field.name for field in excluded_fields]
)
def get_job_soups(self) -> List[BeautifulSoup]: |
The naming here is also a bit confusing; perhaps we can call it get_job_listings_as_soup? The current name somewhat conflicts with _get_job_soups_page below.
Agree with the naming. It's a bit confusing as of now. Will think about more consistent and clearer naming.
Closing this for now due to inactivity, 100% open to revisiting this.
Allow exact matches and create greater abstraction of the base scraper class

Description

- Adds an option (--exact-result for the CLI and exact_result in the YAML) which activates the exact query. Basically this just means adding quotes around the query.
- Abstracts the base scraper so that it first gets the number of search pages (get_n_pages) and then proceeds to obtain all search pages (get_job_soup_page) and the listings on those pages (_parse_job_listings_to_bs4).

Context of change
Please add options that are relevant and mark any boxes that apply.
Type of change
Please mark any boxes that apply.
How Has This Been Tested?
The only real new functionality is allowing for exact queries. Testing this consists of making sure that the CLI flag and YAML entry are properly handled. This was easy since I just followed the procedure of similar_results. The actual change in the code base to allow this was a single line to add quotes to the query, hence little testing is required.

Checklist:
Please mark any boxes that have been completed.
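For reference, the "single line to add quotes" described under "How Has This Been Tested?" boils down to something like the following — a hypothetical helper name; the PR applies the quoting inline rather than through a named function:

```python
def apply_exact_result(query: str, exact_result: bool) -> str:
    """Wrap the search query in double quotes when an exact match is requested."""
    if exact_result and not (query.startswith('"') and query.endswith('"')):
        return f'"{query}"'
    return query
```

Job boards generally interpret a double-quoted query as a phrase match, which is why this one change is enough to activate exact results.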