Adding question mark to the sample fails #40

Open
entrptaher opened this issue May 1, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@entrptaher

The following code:

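from bs4 import BeautifulSoup
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# prettify() re-indents the parsed document, adding whitespace around the text
training_file = BeautifulSoup("<p>with a question mark?</p>", features="lxml").prettify()
training_set = TrainingSet()
page = Page(training_file)
sample = Sample(page, {'title': 'with a question mark?'})
training_set.add_sample(sample)
scraper = train_scraper(training_set)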

throws this error:

mlscraper.samples.NoMatchFoundException: No match found on page (self.page=<Page self.soup.name='[document]' classes=None, text=with a que...>, self.value='with a question mark?'

But the same code works once the question mark is removed from both the HTML and the sample value:

training_file = BeautifulSoup("<p>with a question mark</p>", features="lxml").prettify()
training_set = TrainingSet()
page = Page(training_file)
sample = Sample(page, {'title': 'with a question mark'})
training_set.add_sample(sample)
scraper = train_scraper(training_set)

@entrptaher
Author

The same file worked with autoscraper without any issue.

lorey added the bug label on May 1, 2023
@lorey
Owner

lorey commented May 1, 2023

Thanks, that's very weird.

  • Which version are you using?
  • Since generate_all_value_matches just calls BeautifulSoup's find_all in the latest version, I have no explanation yet.

@lorey
Owner

lorey commented May 1, 2023

Does the same happen for <html><body><p>what?</p></body></html>?

lorey added the invalid label and removed the bug label on May 1, 2023
@lorey
Owner

lorey commented May 1, 2023

This is how it's meant to be called. I'm not sure what you're trying to achieve with prettify.

    training_set = TrainingSet()
    html = "<html><body><p>with a question mark?</p></body></html>"
    page = Page(html)
    sample = Sample(page, {
        'title': 'with a question mark?'})
    training_set.add_sample(sample)
    scraper = train_scraper(training_set)
    print(scraper)

lorey closed this as completed on May 1, 2023
@lorey
Owner

lorey commented May 1, 2023

Prettify creates whitespace that mlscraper currently is sensitive to. I know this is not perfect, but it's on the roadmap.
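
To illustrate the whitespace point, here is a minimal sketch using only the standard library. The string `prettified_text` is a hypothetical stand-in for the text node that prettify() leaves inside the <p> element; the exact indentation is an assumption, not captured output.

```python
# Hypothetical stand-in for the <p> text node after prettify();
# the exact indentation is assumed, not taken from real output.
prettified_text = "\n   with a question mark?\n  "
sample_value = "with a question mark?"

# An exact comparison fails because of the surrounding whitespace ...
exact_match = prettified_text == sample_value

# ... while a whitespace-tolerant comparison succeeds.
stripped_match = prettified_text.strip() == sample_value

print(exact_match, stripped_match)
```

Any matcher that compares text nodes has to decide how much surrounding whitespace to tolerate, which is why prettified input can behave differently from the raw HTML.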

@lorey
Owner

lorey commented May 1, 2023

Related: #15

@entrptaher
Author

entrptaher commented May 2, 2023

I found the issue here:

    def _generate_find_all(self, item):
        assert isinstance(item, str), "can only search for str at the moment"

        # text
        # - since text matches including whitespace, a regex is used
        target_regex = re.compile(r"^\s*%s\s*$" % html.escape(item))

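This generates the wrong regex, because the question mark is left unescaped and makes the preceding k optional instead of matching a literal "?":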

with a question mark?
re.compile('^\\s*with a question mark?\\s*$')
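
The behaviour of that pattern can be checked in isolation. This is a small sketch that rebuilds the regex the same way as the snippet above (outside of mlscraper, with the variable names chosen here for illustration) and tries it against three strings:

```python
import html
import re

item = "with a question mark?"

# Same construction as in _generate_find_all: the item is HTML-escaped
# but not regex-escaped, so "?" acts as a quantifier on "k".
pattern = re.compile(r"^\s*%s\s*$" % html.escape(item))

# The literal text, including "?", is NOT matched ...
matches_literal = pattern.match("with a question mark?") is not None

# ... but the text without "?" is, and so is the text without the "k".
matches_without_mark = pattern.match("with a question mark") is not None
matches_without_k = pattern.match("with a question mar") is not None

print(matches_literal, matches_without_mark, matches_without_k)
```

This is consistent with the report: the sample without the question mark trains fine, while the one with it finds no match.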

Using re.escape fixes this issue:

    def _generate_find_all(self, item):
        assert isinstance(item, str), "can only search for str at the moment"

        # text
        # - since text matches including whitespace, a regex is used
        target_regex = re.compile(r"^\s*%s\s*$" % html.escape(re.escape(item)))
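
A standalone sketch of the proposed fix, for checking it outside of mlscraper. The helper name `build_text_regex` is hypothetical and only wraps the one line from the snippet above; the padded string stands in for a text node with surrounding whitespace.

```python
import html
import re


def build_text_regex(item):
    """Hypothetical helper mirroring the proposed target_regex line."""
    assert isinstance(item, str), "can only search for str at the moment"
    # re.escape turns "?" into a literal, so it no longer acts as a quantifier.
    return re.compile(r"^\s*%s\s*$" % html.escape(re.escape(item)))


pattern = build_text_regex("with a question mark?")

# The literal text matches, with or without surrounding whitespace.
matches_plain = pattern.match("with a question mark?") is not None
matches_padded = pattern.match("\n   with a question mark?\n  ") is not None

# The text without the question mark no longer matches.
matches_without_mark = pattern.match("with a question mark") is not None

print(matches_plain, matches_padded, matches_without_mark)
```

The sketch keeps the escaping order from the proposal; whether html.escape is still needed, and whether it should run before or after re.escape, is not settled in this thread.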

@lorey
Owner

lorey commented May 2, 2023

Good catch!

lorey reopened this on May 2, 2023
lorey added the bug label and removed the invalid label on May 2, 2023