
This repository will assist you in scraping data from multiple websites. It identifies, downloads, and classifies the latest PDF files published on a website according to the user's requirements. This can be used to automate various operations involved in market research.


Erdos1729/webscrapping-identify-download-classify-published-pdfs-from-multiple-urls


Web scraping to identify and download the latest PDF documents, then classify them into pre-defined categories.

  • This repository will assist you in scraping data from multiple websites. It downloads the latest PDF files published on a website into a specific folder, according to the user's requirements. This can be used to automate various operations involved in market research.
  • Once the PDFs are downloaded, they are classified into oil / no_oil / foreign_language categories based on a string-based rule.
  • You can customize these classification rules to suit your needs.
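A string-based rule of this kind can be sketched as a small keyword check over the extracted text. The keyword set, the ASCII-ratio heuristic for spotting foreign-language documents, and the function name `classify` below are all illustrative assumptions, not the repository's actual rules:

```python
# Illustrative keyword list -- an assumption, replace with your own rules.
OIL_KEYWORDS = {"crude", "petroleum", "barrel", "refinery", "opec"}

def classify(text: str) -> str:
    """Classify extracted PDF text into oil / no_oil / foreign_language."""
    words = text.lower().split()
    if not words:
        return "foreign_language"
    # Heuristic assumption: a document whose words are mostly non-ASCII
    # is treated as foreign language.
    ascii_ratio = sum(w.isascii() for w in words) / len(words)
    if ascii_ratio < 0.5:
        return "foreign_language"
    # Any hit from the keyword list routes the document to "oil".
    if any(keyword in words for keyword in OIL_KEYWORDS):
        return "oil"
    return "no_oil"
```

Swapping the keyword list or adding per-category lists is enough to customize the rules without touching the download logic.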

Instructions

  • `pip install -r requirements`
  • Run `radar_automation.py`

Reference

I devised the solution using the following documentation:

  • [urllib], a package that collects several modules for working with URLs
  • [beautifulsoup4], to scrape information from web pages
  • [pdfminer], a text extraction tool for PDF documents
  • [NLTK], for natural language processing
  • Keyword-based search in the extracted text for rule-based classification
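As a rough illustration of how the first two pieces fit together, the sketch below finds PDF links on a page with BeautifulSoup and downloads them with urllib. The helper names `find_pdf_links` and `download_pdfs` are hypothetical, it assumes `beautifulsoup4` is installed per the requirements file, and the URL and folder are placeholders:

```python
import os
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve

from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

def find_pdf_links(html: str, base_url: str) -> list:
    """Return absolute URLs of every .pdf link found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"])
            for a in soup.find_all("a", href=True)
            if a["href"].lower().endswith(".pdf")]

def download_pdfs(page_url: str, folder: str) -> list:
    """Download every PDF linked from page_url into folder."""
    os.makedirs(folder, exist_ok=True)
    html = urlopen(page_url).read()
    saved = []
    for pdf_url in find_pdf_links(html, page_url):
        path = os.path.join(folder, os.path.basename(pdf_url))
        urlretrieve(pdf_url, path)
        saved.append(path)
    return saved
```

Text extraction from the downloaded files (via pdfminer) would then feed the keyword-based classifier.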
