High memory usage with github page as sample #41

entrptaher · 2023-05-01T08:28:42Z

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# fetch the page to train
einstein_url = 'https://github.com/lorey/mlscraper/issues/38'
resp = requests.get(einstein_url)
assert resp.status_code == 200

# create a sample for Albert Einstein
# please add at least two samples in practice to get meaningful rules!
training_set = TrainingSet()
page = Page(resp.content)
sample = Sample(page, {'title': 'Scraper not found error'})
training_set.add_sample(sample)

# train the scraper with the created training set
scraper = train_scraper(training_set)

# scrape another page
resp = requests.get('https://github.com/lorey/mlscraper/issues/27')
result = scraper.get(Page(resp.content))
print(result)

lorey · 2023-05-01T14:11:54Z

Thanks for reporting.

General tip: Add more than one sample to see if it persists. Using only one sample cannot yield statistically sound heuristics. Will add a warning. It even says so in the comment you copied!

lorey · 2023-05-01T14:13:13Z

It's not a memory leak, I assume, it's just using a lot of memory. Usually potential CSS rules get reduced by applying them to every sample. If you add only one, that does not work.
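To illustrate why a second sample matters, here is a minimal sketch of the pruning idea described above. It is not mlscraper's actual implementation: the toy page model (tuples of tag, classes, text) and the helper names `candidate_rules`, `select`, and `surviving_rules` are assumptions made up for this example.

```python
from itertools import combinations

# Toy stand-in for a parsed page: a list of (tag, classes, text) tuples.
# This is NOT mlscraper's Page class, only an illustration.


def candidate_rules(element):
    """Return every tag/class combination that matches the element."""
    tag, classes, _text = element
    rules = set()
    for size in range(len(classes) + 1):
        for combo in combinations(sorted(classes), size):
            rules.add(tag + "".join("." + name for name in combo))
    return rules


def matches(rule, element):
    """Check whether a rule such as 'span.title' matches an element."""
    tag, *rule_classes = rule.split(".")
    return element[0] == tag and set(rule_classes) <= set(element[1])


def select(rule, page):
    """Return the texts of all elements on the page matched by the rule."""
    return [element[2] for element in page if matches(rule, element)]


def surviving_rules(samples):
    """Keep only rules that select exactly the labelled text on every sample."""
    survivors = None
    for page, target_text in samples:
        rules = set()
        for element in page:
            if element[2] == target_text:
                rules |= {
                    rule
                    for rule in candidate_rules(element)
                    if select(rule, page) == [target_text]
                }
        # Intersect with the rules that worked on the previous samples.
        survivors = rules if survivors is None else survivors & rules
    return survivors


page_a = [
    ("span", ("title", "js-issue"), "Scraper not found error"),
    ("span", ("author",), "entrptaher"),
]
page_b = [
    ("span", ("title",), "Improve version pinning"),
    ("span", ("author",), "lorey"),
]

one_sample = surviving_rules([(page_a, "Scraper not found error")])
two_samples = surviving_rules(
    [(page_a, "Scraper not found error"), (page_b, "Improve version pinning")]
)
print(sorted(one_sample))
print(sorted(two_samples))
```

With one sample, three candidate rules survive in this toy setup; the second sample cuts them down to `span.title`. On a real page with many classes and nesting levels, the unpruned candidate set grows much faster than in this example.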

lorey · 2023-05-01T14:40:10Z

Same code produces the following result for me within seconds:

<DictScraper self.scraper_per_key={'title': <ValueScraper self.selector=<CssRuleSelector self.css_rule='bdi'>, self.extractor=<TextValueExtractor>>}>
{'title': 'Improve version pinning'}

Please add dependencies (pip freeze, etc.) and further information to reproduce, so far I'm unable to understand your issue and it works for me with the latest version from the develop branch.

Even the readme clearly states:

If you want to check the new release, use pip install --pre mlscraper to test the release candidate. You can also install the latest (unstable) development version of mlscraper via pip install git+https://github.com/lorey/mlscraper#egg=mlscraper, e.g. to check new features or to see if a bug has been fixed already.

entrptaher · 2023-05-02T02:24:34Z

If I run this on Google Colab, I don't get high memory usage, but I get an 'is not in list' error. However, it still causes high memory usage locally with Python 3.10 and mlscraper (both the pre-release and develop versions).

Link: https://colab.research.google.com/drive/1frHuWVaAq-86FhhwCSyYlel-qBxaPDIs?usp=sharing

Tested with both Python 3.9 and 3.10, on Google Colab and locally on Ubuntu 22.04.

requirements.txt

beautifulsoup4==4.12.2
certifi==2022.12.7
charset-normalizer==3.1.0
idna==3.4
lxml==4.9.2
mlscraper==1.0.0rc3
more-itertools==9.1.0
requests==2.29.0
soupsieve==2.4.1
urllib3==1.26.15

Not sure what is going on. You seem to get a good result while I cannot, using the same code.
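When results differ between machines like this, a peak-memory figure makes reports easier to compare. The following is a rough sketch using the standard library's `tracemalloc`; the helper `peak_python_memory_mib` and the stand-in workload `build_candidates` are hypothetical names for this example. Note that `tracemalloc` only tracks allocations made through Python's allocator, so memory used inside C extensions such as lxml may not show up.

```python
import tracemalloc


def peak_python_memory_mib(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if func raises (e.g. MemoryError).
        tracemalloc.stop()
    return result, peak / (1024 * 1024)


def build_candidates(count):
    """Hypothetical stand-in workload; replace with the training call."""
    return [f"div:nth-child({index})" for index in range(count)]


result, peak = peak_python_memory_mib(build_candidates, 100_000)
print(f"{len(result)} candidates, peak {peak:.1f} MiB")
```

In the reproduction script above, the same helper could wrap the training step, for example `peak_python_memory_mib(train_scraper, training_set)`.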

lorey · 2023-05-02T07:40:08Z

Why don't you add a second example?

lorey · 2023-05-02T08:01:18Z

I was now able to reproduce this, will look into it if I find the time.

Seems like it does not stop generating CSS selectors although the tag is unique already.
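As a rough illustration of the behaviour one would expect instead, the sketch below generates selectors lazily, simplest first, and stops at the first one that is unique on the page. It is not the library's code: the toy page model (tuples of tag and classes) and the names `selectors_by_specificity`, `count_matches`, `first_unique_selector`, and `max_candidates` are assumptions for this example.

```python
from itertools import combinations, islice


def selectors_by_specificity(tag, classes):
    """Lazily yield selectors for one element, simplest first."""
    for size in range(len(classes) + 1):
        for combo in combinations(sorted(classes), size):
            yield tag + "".join("." + name for name in combo)


def count_matches(selector, page):
    """Count the elements on the toy page that the selector matches."""
    tag, *wanted = selector.split(".")
    return sum(
        1
        for element_tag, element_classes in page
        if element_tag == tag and set(wanted) <= set(element_classes)
    )


def first_unique_selector(element, page, max_candidates=1000):
    """Return the simplest selector matching only this element, or None.

    Generation stops as soon as a unique selector is found, and never
    examines more than max_candidates selectors.
    """
    tag, classes = element
    candidates = selectors_by_specificity(tag, classes)
    for selector in islice(candidates, max_candidates):
        if count_matches(selector, page) == 1:
            return selector
    return None


# Toy page: (tag, classes) tuples, not a real DOM.
page = [
    ("bdi", ("js-issue-title", "markdown-title")),
    ("span", ("author",)),
    ("span", ("author", "text-bold")),
]

print(first_unique_selector(page[0], page))
print(first_unique_selector(page[2], page))
print(first_unique_selector(page[1], page))
```

Because the generator is consumed lazily, the unique `bdi` tag is returned after a single candidate and the class combinations are never built. The `max_candidates` cap bounds the work when no unique selector exists, as with the plain `span.author` element here.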

siavashcsr · 2023-07-24T15:44:43Z

Any news on this issue? I run into the same problem with another website

drjgouveia · 2023-08-01T18:01:16Z

I'm running into the same problem :(

callumb123 · 2023-11-09T15:38:35Z

Has anyone managed to work around this yet? Tried a number of different sites with 5+ samples for each but always running out of memory.

ErikZhang-9762 · 2024-03-20T07:16:32Z

Same problem, I have 32 GB of memory but always run out of memory :(

lorey changed the title ~~Memory Leak with github page as sample~~ High memory usage with github page as sample May 1, 2023

lorey self-assigned this May 1, 2023

lorey added and removed the bug label May 1, 2023

lorey closed this as completed May 1, 2023

lorey added the invalid label May 1, 2023

lorey reopened this May 2, 2023
