
script: detect broken "More information" links #12289

Open · vitorhcl opened this issue Feb 17, 2024 · 8 comments

@vitorhcl (Member) commented Feb 17, 2024

I wrote a Python script, 111 lines (now 123), to detect broken "More information" links; it is absurdly faster than the one-liner we have on the wiki because it uses asynchronous code (aiohttp, aiopath and aioconsole). Should I open a PR to put it in scripts/, or put it in the wiki?

Here it is:

#!/usr/bin/env python3
# SPDX-License-Identifier: MIT

import re
import asyncio

import aiohttp
from aiofile import AIOFile, Writer
from aioconsole import aprint
from aiopath import AsyncPath


async def find_md_files(search_path: AsyncPath) -> set[AsyncPath]:
    """Find all .md files in the platform directories under the search path."""
    md_files = set()
    async for path_dir in search_path.glob("*"):
        await aprint(path_dir.name)
        # Glob only inside this directory instead of re-scanning the whole tree
        # on every iteration.
        async for file in path_dir.glob("*.md"):
            md_files.add(file)
    return md_files


async def append_if_is_file(path_set: set[AsyncPath], path: AsyncPath):
    """Add the path to the set if it is a regular file."""
    if await path.is_file():
        path_set.add(path)


async def filter_files(md_files: set[AsyncPath]) -> set[AsyncPath]:
    """Filter out non-file paths from the set."""
    filtered_files = set()
    await asyncio.gather(
        *(append_if_is_file(filtered_files, path) for path in md_files)
    )
    return filtered_files


async def process_file(
    file: AsyncPath,
    writer: Writer,
    output_file: AsyncPath,
    session: aiohttp.ClientSession,
) -> None:
    """Extract the link of a single .md file and check it."""
    async with file.open("r") as f:
        try:
            content = await f.read()
        except Exception as e:
            # Report the file (last three path components) and the error, then skip it.
            await aprint(f"Could not read {'/'.join(file.parts[-3:])}: {e}")
            return

    url = extract_link(content)

    if url is not None:
        await check_url_and_write_if_bad(url, writer, output_file, session)


def extract_link(content: str) -> str | None:
    """Extract the URL from the '> More information: <...>' line, if present."""
    match = re.search(r"> More information: <(.+)>", content)
    return match.group(1) if match else None


async def check_url_and_write_if_bad(
    url: str, writer: Writer, output_file: AsyncPath, session: aiohttp.ClientSession
) -> None:
    """Check URL status and write bad URLs to a file."""
    await aprint(f"??? {url}")
    code = -1
    try:
        code = await check_url(url, session)
    except aiohttp.ClientError as exc:
        if hasattr(exc, "strerror"):
            await aprint(f"\033[31m{exc.strerror}\033[0m")
        elif hasattr(exc, "message"):
            await aprint(f"\033[31m{exc.message}\033[0m")
        else:
            await aprint(f"\033[31m{exc}\033[0m")
    await aprint(f"{code} {url}")

    if not 200 <= code < 400:
        await writer(f"{code}|{url}\n")


async def check_url(url: str, session: aiohttp.ClientSession) -> int:
    """Get the status code of a URL."""
    async with session.head(url) as response:
        return response.status


async def find_and_write_bad_urls(
    output_file: AsyncPath, search_path: str = "."
) -> None:
    """Find and write bad URLs to a specified file."""
    search_path = AsyncPath(search_path)
    await aprint("Getting pages...")
    md_files = await filter_files(await find_md_files(search_path))
    await aprint("Found all pages!")

    async with AIOFile(str(output_file), "a") as afp:
        writer = Writer(afp)
        async with aiohttp.ClientSession(
            trust_env=True, timeout=aiohttp.ClientTimeout(total=500)
        ) as session:
            await asyncio.gather(
                *(process_file(file, writer, output_file, session) for file in md_files)
            )
        await afp.fsync()


async def main():
    await find_and_write_bad_urls(AsyncPath("bad-urls.txt"), search_path="./pages")


if __name__ == "__main__":
    asyncio.run(main())

Edit: I forgot to remove 2 test lines 😅
Update: it now writes to bad-urls.txt sequentially using aiofile's Writer, and no longer writes partial text.

@vitorhcl (Member, Author)

Note: I'm not writing to /tmp/bad-urls.txt for Windows compatibility reasons, but users are free to change this in the script's main function.
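
If we ever do want a temp location, something like this stays portable (untested sketch using the standard tempfile module; not part of the script above):

import tempfile
from pathlib import Path

# Build a platform-independent temp path instead of hard-coding /tmp.
output_file = Path(tempfile.gettempdir()) / "bad-urls.txt"
# e.g. /tmp/bad-urls.txt on Linux, ...\AppData\Local\Temp\bad-urls.txt on Windows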

@sbrl (Member) commented Feb 18, 2024

Hey, this is cool! Does it handle rate limiting, so that hosts don't block it? This was a key issue with the design of the current script, IIRC.

@vitorhcl (Member, Author)

It doesn't handle that currently; sometimes I have to wait a while before running the script again, but IIRC it can complete an entire run without being blocked if it isn't preceded by many runs.

Either way, we can certainly add this to the code :)
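
Something along these lines could be a first, domain-agnostic step (untested sketch; check_url_throttled, the semaphore size and the delay are placeholders):

import asyncio

import aiohttp

REQUEST_SEMAPHORE = asyncio.Semaphore(10)  # placeholder cap on in-flight requests
REQUEST_DELAY = 0.5  # placeholder pause after each request, in seconds


async def check_url_throttled(url: str, session: aiohttp.ClientSession) -> int:
    """Like check_url, but limits concurrency and spaces requests out."""
    async with REQUEST_SEMAPHORE:
        async with session.head(url) as response:
            status = response.status
        # Keep holding the semaphore slot for a moment so requests are spread out.
        await asyncio.sleep(REQUEST_DELAY)
        return status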

@vitorhcl (Member, Author)

I think the right place for this script is scripts/, given that its complexity has increased a bit; keeping it there also makes contributions easier. That way it's possible to add other features too, like regex matching for URLs and automatically updating links that redirect.
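
For the URL matching, something this simple could be a starting point (untested sketch; extract_all_links and the pattern are placeholders, not part of the script above):

import re

# Placeholder pattern: matches http(s) URLs anywhere in a page, not just the
# "More information" line; a production URL regex would need more care.
URL_PATTERN = re.compile(r"https?://[^\s<>]+")


def extract_all_links(content: str) -> list[str]:
    """Hypothetical helper: find every URL mentioned in a page."""
    return URL_PATTERN.findall(content)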

@gutjuri (Member) commented Feb 18, 2024

Thanks, nice work! Unfortunately, the script errors on my machine with OSError: [Errno 24] Too many open files: 'pages/osx/lpstat.md'. Is there a way in asyncio to limit the amount of concurrency, i.e., so that the maximum number of concurrently running tasks is limited to e.g. 500?

I'm running Ubuntu, btw, so there is a limit on how many files can be opened simultaneously (however, I'm not sure of the exact limit).

EDIT: fixed it with a semaphore:

#!/usr/bin/env python3
# SPDX-License-Identifier: MIT

import re
import asyncio

import aiohttp
from aiofile import AIOFile, Writer
from aioconsole import aprint
from aiopath import AsyncPath

# Cap the number of concurrently processed files to avoid "Too many open files".
MAX_CONCURRENCY = 500

sem = asyncio.Semaphore(MAX_CONCURRENCY)

async def find_md_files(search_path: AsyncPath) -> set[AsyncPath]:
    """Find all .md files in the platform directories under the search path."""
    md_files = set()
    async for path_dir in search_path.glob("*"):
        await aprint(path_dir.name)
        # Glob only inside this directory instead of re-scanning the whole tree
        # on every iteration.
        async for file in path_dir.glob("*.md"):
            md_files.add(file)
    return md_files


async def append_if_is_file(path_set: set[AsyncPath], path: AsyncPath):
    """Add the path to the set if it is a regular file."""
    if await path.is_file():
        path_set.add(path)


async def filter_files(md_files: set[AsyncPath]) -> set[AsyncPath]:
    """Filter out non-file paths from the set."""
    filtered_files = set()
    await asyncio.gather(
        *(append_if_is_file(filtered_files, path) for path in md_files)
    )
    return filtered_files


async def process_file(
    file: AsyncPath,
    writer: Writer,
    output_file: AsyncPath,
    session: aiohttp.ClientSession,
) -> None:
    """Extract the link of a single .md file and check it."""
    async with sem:
        async with file.open("r") as f:
            try:
                content = await f.read()
            except Exception as e:
                # Report the file (last three path components) and the error, then skip it.
                await aprint(f"Could not read {'/'.join(file.parts[-3:])}: {e}")
                return

    url = extract_link(content)

    if url is not None:
        await check_url_and_write_if_bad(url, writer, output_file, session)


def extract_link(content: str) -> str | None:
    """Extract the URL from the '> More information: <...>' line, if present."""
    match = re.search(r"> More information: <(.+)>", content)
    return match.group(1) if match else None


async def check_url_and_write_if_bad(
    url: str, writer: Writer, output_file: AsyncPath, session: aiohttp.ClientSession
) -> None:
    """Check URL status and write bad URLs to a file."""
    await aprint(f"??? {url}")
    code = -1
    try:
        code = await check_url(url, session)
    except aiohttp.ClientError as exc:
        if hasattr(exc, "strerror"):
            await aprint(f"\033[31m{exc.strerror}\033[0m")
        elif hasattr(exc, "message"):
            await aprint(f"\033[31m{exc.message}\033[0m")
        else:
            await aprint(f"\033[31m{exc}\033[0m")
    await aprint(f"{code} {url}")

    if not 200 <= code < 400:
        await writer(f"{code}|{url}\n")


async def check_url(url: str, session: aiohttp.ClientSession) -> int:
    """Get the status code of a URL."""
    async with session.head(url) as response:
        return response.status


async def find_and_write_bad_urls(
    output_file: AsyncPath, search_path: str = "."
) -> None:
    """Find and write bad URLs to a specified file."""
    search_path = AsyncPath(search_path)
    await aprint("Getting pages...")
    md_files = await filter_files(await find_md_files(search_path))
    await aprint("Found all pages!")

    async with AIOFile(str(output_file), "a") as afp:
        writer = Writer(afp)
        async with aiohttp.ClientSession(
            trust_env=True, timeout=aiohttp.ClientTimeout(total=500)
        ) as session:
            await asyncio.gather(
                *(process_file(file, writer, output_file, session) for file in md_files)
            )
        await afp.fsync()


async def main():
    await find_and_write_bad_urls(AsyncPath("bad-urls.txt"), search_path="./pages")


if __name__ == "__main__":
    asyncio.run(main())

Also, manned.org seems to do rate limiting, so we should definitely implement this, ideally on a per-domain basis. For example, I get 503 (Service Unavailable) error codes for some existing manned.org pages.

@vitorhcl (Member, Author)

Thanks for adding this limit.

As for the per-domain rate limit, I have an idea: split the links into lists, where each list contains the links that belong to a specific domain, then alternate between the lists, issuing the requests in the right order and using asyncio.sleep to respect each domain's timeout.
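
Roughly like this (untested sketch of the grouping idea; group_by_domain, check_domain and the delay value are placeholders):

import asyncio
from collections import defaultdict
from urllib.parse import urlsplit

import aiohttp
from aioconsole import aprint

PER_DOMAIN_DELAY = 1.0  # placeholder: seconds between requests to the same domain


def group_by_domain(urls: list[str]) -> dict[str, list[str]]:
    """Split the URLs into one list per domain."""
    grouped = defaultdict(list)
    for url in urls:
        grouped[urlsplit(url).netloc].append(url)
    return grouped


async def check_domain(urls: list[str], session: aiohttp.ClientSession) -> None:
    """Check one domain's URLs sequentially, sleeping between requests."""
    for url in urls:
        async with session.head(url) as response:
            await aprint(f"{response.status} {url}")
        await asyncio.sleep(PER_DOMAIN_DELAY)


async def check_all_domains(urls: list[str], session: aiohttp.ClientSession) -> None:
    """One sequential checker per domain; different domains still run concurrently."""
    await asyncio.gather(
        *(check_domain(domain_urls, session)
          for domain_urls in group_by_domain(urls).values())
    )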

@sebastiaanspeck (Member) commented Feb 19, 2024

I think this script could also be useful for https://github.com/tldr-pages/tldr-maintenance to help contributors spot broken links, instead of having to check them locally. I proposed this idea a while ago, but back then the script in the wiki was not well suited to frequent runs.

@sbrl (Member) commented Feb 23, 2024

@vitorhcl that approach sounds good to me! I think we would want to bake that into the script, especially if we are to add it to tldr-maintenance. Perhaps for future edits you would like to open a pull request, so we can track the history as the script evolves rather than confining it to this issue?

Thanks so much again!
