
Link Crawler

This script allows you to crawl a website and collect links from its webpages based on a specified regex pattern. It can be useful for extracting links from websites for various purposes such as data scraping or analysis.

Prerequisites

Before running the script, make sure you have the following installed:

  • Python 3.x
  • requests library
  • bs4 (BeautifulSoup) library

The other modules the script uses (argparse, re, os, sys, base64, urllib.parse, shutil) are part of the Python standard library and require no separate installation.

You can install the required dependencies using pip:

pip install requests beautifulsoup4

Usage

To use the script, follow these steps:

  1. Clone or download the script file to your local machine.

  2. Open a terminal or command prompt.

  3. Navigate to the directory where the script is located.

  4. Run the following command:

    python link_crawler.py -u <url> -p <pattern> [-d] [-c]

    Replace <url> with the URL of the website you want to crawl, and <pattern> with the regex pattern to match the links.

    Optional flags:

    • -d or --domain: Keep the website domain on internal links. By default, the domain is stripped from internal links before the pattern is applied.
    • -c or --clear-directory: Clear the output directory if it already exists for this combination of pattern and domain. By default, if the same pattern and domain have already been searched, the search is skipped. (A minimal sketch of this argument parsing follows the list.)
  5. The script will crawl the website, collect links from its webpages, and display the results.

    • If links matching the regex pattern are found, the script will save them to a links.txt file in the corresponding directory.
    • If no links are found, the script will display a message accordingly.
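
For reference, here is a minimal sketch of how the command-line interface described above could be parsed with argparse. It only mirrors the documented options; the actual option handling inside link_crawler.py may differ.

import argparse

def parse_args():
    # Sketch of the CLI described above; the real script's internals may differ.
    parser = argparse.ArgumentParser(
        description="Crawl a website and collect links matching a regex pattern.")
    parser.add_argument("-u", "--url", required=True,
                        help="URL of the website to crawl")
    parser.add_argument("-p", "--pattern", required=True,
                        help="regex pattern that collected links must match")
    parser.add_argument("-d", "--domain", action="store_true",
                        help="keep the domain on internal links before matching")
    parser.add_argument("-c", "--clear-directory", action="store_true",
                        help="clear the output directory for this pattern/domain if it exists")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args.url, args.pattern, args.domain, args.clear_directory)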

Note: The script crawls webpages within the specified website by following links found in HTML tags such as <a>, <link>, <script>, <base>, <form>, and others (any tag that can contain a link). It searches the href, src, and data-src attributes of these tags to extract links.
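
As an illustration only (not the script's actual implementation), extracting href, src, and data-src attributes from every tag with BeautifulSoup and filtering them by a regex pattern might look roughly like this:

import re
import requests
from bs4 import BeautifulSoup

def extract_attribute_links(url, pattern):
    # Illustrative sketch: pull href/src/data-src values from every tag
    # and keep only those matching the given regex pattern.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    regex = re.compile(pattern)
    links = set()
    for tag in soup.find_all(True):  # True matches every tag (<a>, <link>, <script>, <form>, ...)
        for attr in ("href", "src", "data-src"):
            value = tag.get(attr)
            if value and regex.search(value):
                links.add(value)
    return links

# Example: print(extract_attribute_links("https://example.com", r".*"))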

Note: The script also finds links anywhere in the page source, even outside tag attributes.
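
A rough illustration of that behaviour, scanning the raw HTML for anything that looks like a URL (the regex below is only an example; the script's own matching is driven by the pattern you pass with -p):

import re
import requests

# Simple URL-like regex for illustration only.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def find_links_in_source(url):
    html = requests.get(url, timeout=10).text
    return sorted(set(URL_RE.findall(html)))

# Example: print(find_links_in_source("https://example.com"))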

Examples

Here are a few examples of how you can use the script:

  • Crawl a website and collect all links from its webpages:

    python link_crawler.py -u https://example.com -p ".*"

    This will crawl the example.com website, collect all links from its webpages, and save them to links.txt in the data/<host>/<pattern>/ directory.

  • Crawl a website and collect only specific links matching a pattern:

    python link_crawler.py -u https://example.com -p "https://example.com/downloads/.*"

    This will crawl the example.com website and collect only the links that match the pattern https://example.com/downloads/.*

  • Crawl a website and keep the domain in internal links:

    python link_crawler.py -u https://example.com -p ".*" -d

    This will crawl the example.com website, collect all links from its webpages, keep the domain on internal links, and save them to links.txt.

  • Clear the directory and crawl the website to collect fresh links:

    python link_crawler.py -u https://example.com -p ".*" -c

    This will clear the existing directory (if any) for the specified pattern and domain, then crawl the example.com website to collect fresh links. (A sketch of how the output directory might be handled follows these examples.)
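
How the output directory is named and cleared is an implementation detail of the script; the sketch below only illustrates the documented behaviour (one directory per host and pattern, skipped if it already exists unless -c is given). The data/ layout and the base64 encoding of the pattern into a filesystem-safe folder name are assumptions made here for illustration.

import base64
import os
import shutil
import sys
from urllib.parse import urlparse

def prepare_output_dir(url, pattern, clear_directory=False):
    # Illustrative only: build data/<host>/<encoded-pattern>/ and decide
    # whether to reuse, clear, or skip it. The real script's naming may differ.
    host = urlparse(url).netloc
    safe_pattern = base64.urlsafe_b64encode(pattern.encode()).decode()  # filesystem-safe folder name (assumed)
    out_dir = os.path.join("data", host, safe_pattern)

    if os.path.isdir(out_dir):
        if clear_directory:
            shutil.rmtree(out_dir)   # -c / --clear-directory: start fresh
        else:
            sys.exit("This pattern and domain were already searched; use -c to redo the search.")

    os.makedirs(out_dir, exist_ok=True)
    return out_dir

# Example: path = prepare_output_dir("https://example.com", r".*", clear_directory=True)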

License

This script is licensed under the MIT License. Feel free to modify and use it according to your needs.
