baranbasaran/crawler

Scripts

This project includes the following scripts:

    scrapping.py: Scrapes data from a webpage
    preprocessing.py: Preprocesses the data from the scraping step
    inverted_index.py: Creates an inverted index from the preprocessed data
    TF-IDF.py: Calculates the TF-IDF values for the inverted index
    cosine_similarity.py: Calculates the cosine similarity between two documents
    IR.py: Implements a simple information retrieval system
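As an illustration of what the inverted index step produces, here is a minimal sketch. It is not the code in inverted_index.py; the function name build_inverted_index and the in-memory input format (a dict of document IDs to token lists) are assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a dict of {doc_id: term frequency}.

    `docs` is a hypothetical in-memory format: doc_id -> list of tokens.
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for token in tokens:
            # Count how often the term occurs in this document.
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return dict(index)


docs = {
    "D1": ["griffith", "college", "dublin"],
    "D2": ["college", "course", "menu", "course"],
}
index = build_inverted_index(docs)
print(index["college"])  # {'D1': 1, 'D2': 1}
print(index["course"])   # {'D2': 2}
```

Storing term frequencies in the postings keeps everything a later TF-IDF step needs in one structure.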

System requirements

    Python 3
    Scrapy library (install using pip install scrapy)

Language and version

All scripts are written in Python 3.

List of files

In addition to the scripts, this project includes the following files:

    stopwords.txt: A list of stopwords to be used in the preprocessing step
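A minimal sketch of how a stopword list can be applied during preprocessing. It assumes stopwords.txt holds one word per line and that tokens are lowercase alphanumeric runs; the actual preprocessing.py may tokenize differently or apply further steps such as stemming. The names load_stopwords and preprocess are hypothetical.

```python
import re


def load_stopwords(lines):
    """Build a stopword set from an iterable of lines (assumed one word per line)."""
    return {line.strip().lower() for line in lines if line.strip()}


def preprocess(text, stopwords):
    """Lowercase the text, split it into alphanumeric tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stopwords]


# In practice the lines would come from open("stopwords.txt").
stopwords = load_stopwords(["the", "of", "a", ""])
tokens = preprocess("The menu of the IRWS course", stopwords)
print(tokens)  # ['menu', 'irws', 'course']
```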

Instructions to run

To run the scripts in this project, follow these steps:

    Ensure that you have Python 3 and the Scrapy library installed on your machine.

    Open a terminal window and navigate to the directory where the scripts are located.

    Run the scripts in the following order:

        python scrapping.py "doc"
        python preprocessing.py "doc" "preprocess" "stopwords.txt"
        python inverted_index.py "preprocess" "indexing"
        python TF-IDF.py indexing.txt tfIdf
        python cosine_similarity.py tfIdf.txt D3 D5
        python IR.py doc "Griffith College Dublin IRWS course menu" "stopwords.txt"

Each script performs a specific task in the overall process, and the output of each script is used as input for the next script. For example, the output of the scrapping.py script is used as input for the preprocessing.py script.
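The run order above could also be driven from a single Python file. The following is a sketch only: the driver itself, the PIPELINE list, and the dry_run flag are not part of the repository; the script names and arguments are copied from the commands listed above.

```python
import subprocess
import sys

# Script names and arguments as listed in the instructions above.
PIPELINE = [
    ["scrapping.py", "doc"],
    ["preprocessing.py", "doc", "preprocess", "stopwords.txt"],
    ["inverted_index.py", "preprocess", "indexing"],
    ["TF-IDF.py", "indexing.txt", "tfIdf"],
    ["cosine_similarity.py", "tfIdf.txt", "D3", "D5"],
    ["IR.py", "doc", "Griffith College Dublin IRWS course menu", "stopwords.txt"],
]


def build_commands(python=sys.executable):
    """Prefix every step with the interpreter to use."""
    return [[python, *step] for step in PIPELINE]


def run_pipeline(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Passing each command as a list avoids shell quoting problems with the multi-word query string, and check=True keeps a failed step from feeding stale files to the next one.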

Input and output

Input

The input to each script varies, but generally includes file paths and data.

    scrapping.py: Takes a file path as input, indicating where the scraped data should be saved.
    preprocessing.py: Takes a file path for the input data and a file path for the output data, as well as a file path for a list of stopwords.
    inverted_index.py: Takes a file path for the input data and a file path for the output data.
    TF-IDF.py: Takes a file path for the input data and a file path for the output data.
    cosine_similarity.py: Takes a file path for the input data, and two document identifiers for the documents to compare.
    IR.py: Takes a file path for the input data, a query string, and a file path for a list of stopwords.

Output

The output of each script is either a file or a calculation.

    scrapping.py: Outputs the scraped data to a file.
    preprocessing.py: Outputs the preprocessed data to a file.
    inverted_index.py: Outputs the inverted index to a file.
    TF-IDF.py: Outputs the calculated TF-IDF values to a file.
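To make the TF-IDF and cosine similarity steps concrete, here is a minimal sketch. It assumes the common weighting tf * log(N / df) and an index that maps each term to per-document counts; the weighting and file formats actually used by TF-IDF.py and cosine_similarity.py may differ. The function names and sample data are hypothetical.

```python
import math


def tf_idf(index, n_docs):
    """Turn an inverted index (term -> {doc_id: tf}) into doc_id -> {term: weight}.

    Assumed weighting: raw term frequency times log(N / document frequency).
    """
    vectors = {}
    for term, postings in index.items():
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            vectors.setdefault(doc_id, {})[term] = tf * idf
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An all-zero vector is treated as similar to nothing.
    return dot / (norm_a * norm_b)


index = {
    "college": {"D1": 1, "D2": 1},
    "dublin": {"D1": 2},
    "menu": {"D2": 1, "D3": 1},
    "course": {"D3": 3},
}
vectors = tf_idf(index, n_docs=3)
print(cosine_similarity(vectors["D1"], vectors["D2"]))  # shares "college"
print(cosine_similarity(vectors["D1"], vectors["D3"]))  # no shared terms
```

Documents that share no terms score 0, and a document compared with itself scores 1, which is a quick sanity check for any implementation of this step.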
