Extending the web scraper. #18

Open
ba11b0y opened this issue Oct 4, 2017 · 13 comments

@ba11b0y
Contributor

ba11b0y commented Oct 4, 2017

There hasn't been much work on the web scraping part, and I am interested in working on it.
Since this is going to be a generic scraper, what I have in mind so far includes:

  1. A generic web scraper that scrapes all the images, links and text on a page (a rough sketch follows below).
  2. Maybe use scrapy for this.

Still a beginner, so any tips or corrections are welcome.
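
For concreteness, here is a minimal sketch of the kind of generic scraper described above, assuming requests and beautifulsoup4 are available (scrapy would be an alternative); the function name and return shape are illustrative only, not existing hackr code:

```python
# Minimal sketch of a generic scraper (illustrative only, not existing hackr code).
import requests
from bs4 import BeautifulSoup


def scrape(url):
    """Fetch a page and return its image links, hyperlinks, and raw text."""
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    return {
        "images": [img["src"] for img in soup.find_all("img", src=True)],
        "links": [a["href"] for a in soup.find_all("a", href=True)],
        "text": soup.get_text(separator=" ", strip=True),
    }
```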

@shubhodeep9
Member

@invinciblycool I like the idea. I would suggest putting together a detailed list of the missing components you find in the current scraper code; then we will assign you the work.

@ashwini0529
Member

@invinciblycool An XML format could be added.

@ba11b0y
Contributor Author

ba11b0y commented Oct 5, 2017

@ashwini0529 I have added the XML response to web.py. Let me know if any corrections are needed.
@shubhodeep9 I will update the detailed list as soon as my exams are over 😄

@ba11b0y
Contributor Author

ba11b0y commented Oct 5, 2017

@ashwini0529 @shubhodeep9 Couldn't resist the excitement 😄
Here are some features I have in mind that could be added:

  • If no JSON response is returned by the URL, only the page source is returned. We could have a better scraper which returns either:
  1. A dictionary or a JSON response:
{
  "assets":
  {
    "images":
    [
      "link of image1 on the page",
      "link of image2 on the page"
    ],
    "videos":
    [
      "link to embedded video1",
      "link to embedded video2"
    ]
  },
  "content":
  {
    "text": "all raw text from the page",
    "html": "all html from the page"
  }
}
  2. Or creates dedicated directories for the above keys of the dictionary and actually saves the content to the respective directory (inspired by httrack).
  • Another feature could be adding a specific scrape option. For example:
    web.scrape(url, scrape_content="images") returns all the links to images on the page, or saves the images locally.
    (A rough sketch of both ideas follows below.)
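
A rough sketch of how both ideas could fit together, assuming requests and beautifulsoup4; the web.scrape signature, the scrape_content parameter, and the dest_path saving behaviour are proposals, not existing hackr API:

```python
# Sketch only: proposed web.scrape behaviour, not existing hackr code.
import os
import requests
from bs4 import BeautifulSoup


def scrape(url, scrape_content=None, dest_path=None):
    """Return a dict of assets/content for `url`.

    If `scrape_content` is given (e.g. "images"), return only that category.
    If `dest_path` is given, also save the page text/html into a dedicated
    directory, httrack-style.
    """
    page = requests.get(url)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")

    result = {
        "assets": {
            "images": [img["src"] for img in soup.find_all("img", src=True)],
            "videos": [v["src"] for v in soup.find_all(["video", "iframe"], src=True)],
        },
        "content": {
            "text": soup.get_text(separator=" ", strip=True),
            "html": page.text,
        },
    }

    if dest_path:
        content_dir = os.path.join(dest_path, "content")
        os.makedirs(content_dir, exist_ok=True)
        for name, value in result["content"].items():
            with open(os.path.join(content_dir, name + ".txt"), "w", encoding="utf-8") as f:
                f.write(value)
        # Downloading the image/video files themselves is left out of this sketch.

    if scrape_content:
        return result["assets"].get(scrape_content, [])
    return result
```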

@ashwini0529
Member

Hey @invinciblycool, sounds like a great idea to start with. Go ahead. We can add more features. 🎉

@shubhodeep9
Member

@invinciblycool Add a TO-DO with your PR, and we will keep this issue alive until we feel satisfied, so that whenever someone gets a new idea on web scraping, they can add it to that TO-DO.

@ashwini0529
Member

Also, please add a [WIP] tag in your PR message. 😄

@ba11b0y
Contributor Author

ba11b0y commented Oct 5, 2017

@ashwini0529 Before I start working, could you make it clear whether the function should return a response or create folders and save the content locally? Thanks.
@shubhodeep9 Just confirming: should the TO-DO go with the PR or with the issue?

@ashwini0529
Member

Hey @invinciblycool, you can take a look at the QR code function; I think you can make something like that.
Probable usage, like it was for QRCode:
img = hackr.image.qrcode("https://github.com/pytorn/hackr", dest_path="/tmp/hackr_qrcode.png")
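
By analogy, a hypothetical usage for the scraper (the hackr.web.scrape call below is only a suggestion, not existing API) could look like:

```python
# Hypothetical usage, mirroring the qrcode helper above (not existing hackr API):
content = hackr.web.scrape("https://github.com/pytorn/hackr", dest_path="/tmp/hackr_scrape/")
```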

@ba11b0y
Contributor Author

ba11b0y commented Oct 6, 2017

I guess we agree on saving all the content locally, then.
Will start working on it ASAP.

@ashwini0529
Member

Hey @invinciblycool, any updates?

@ba11b0y
Contributor Author

ba11b0y commented Oct 20, 2017

Sorry for the delay, I will try to open a PR this week.
Happy Diwali BTW. ✨

@ashwini0529
Member

Perfect, @invinciblycool.
Happy hacking and Happy Diwali! 😄 🎇
