
script: detect broken "More information" links #12289

Open · vitorhcl opened this issue Feb 17, 2024 · 8 comments

@vitorhcl (Member) commented Feb 17, 2024

I wrote a Python script, 111 lines (now 123), to detect broken "More information" links; it is absurdly faster than the one-liner we have on the wiki because it uses asynchronous code (aiohttp, aiopath and aioconsole). Should I open a PR to put it in scripts/, or put it in the wiki?

Here it is:

#!/usr/bin/env python3
# SPDX-License-Identifier: MIT

import re
import asyncio

import aiohttp
from aiofile import AIOFile, Writer
from aioconsole import aprint
from aiopath import AsyncPath


async def find_md_files(search_path: AsyncPath) -> set[AsyncPath]:
    """Find all .md files in the platform directories under the search path."""
    md_files = set()
    async for path_dir in search_path.glob("*"):
        await aprint(path_dir.name)
        # Glob only inside this directory instead of re-scanning the whole tree
        # on every iteration.
        async for file in path_dir.glob("*.md"):
            md_files.add(file)
    return md_files


async def append_if_is_file(path_set: set[AsyncPath], path: AsyncPath):
    """Add the path to the set if it is a regular file."""
    if await path.is_file():
        path_set.add(path)


async def filter_files(md_files: set[AsyncPath]) -> set[AsyncPath]:
    """Filter out non-file paths from the set."""
    filtered_files = set()
    await asyncio.gather(
        *(append_if_is_file(filtered_files, path) for path in md_files)
    )
    return filtered_files


async def process_file(
    file: AsyncPath,
    writer: Writer,
    output_file: AsyncPath,
    session: aiohttp.ClientSession,
) -> None:
    """Extract the link of a single .md file and check it."""
    async with file.open("r") as f:
        try:
            content = await f.read()
        except Exception as e:
            # Report the file (last three path components) and the error, then skip it.
            await aprint(f"Could not read {'/'.join(file.parts[-3:])}: {e}")
            return

    url = extract_link(content)

    if url is not None:
        await check_url_and_write_if_bad(url, writer, output_file, session)


def extract_link(content: str) -> str | None:
    """Extract the URL from the '> More information: <...>' line, if present."""
    match = re.search(r"> More information: <(.+)>", content)
    return match.group(1) if match else None


async def check_url_and_write_if_bad(
    url: str, writer: Writer, output_file: AsyncPath, session: aiohttp.ClientSession
) -> None:
    """Check URL status and write bad URLs to a file."""
    await aprint(f"??? {url}")
    code = -1
    try:
        code = await check_url(url, session)
    except aiohttp.ClientError as exc:
        if hasattr(exc, "strerror"):
            await aprint(f"\033[31m{exc.strerror}\033[0m")
        elif hasattr(exc, "message"):
            await aprint(f"\033[31m{exc.message}\033[0m")
        else:
            await aprint(f"\033[31m{exc}\033[0m")
    await aprint(f"{code} {url}")

    if not 200 <= code < 400:
        await writer(f"{code}|{url}\n")


async def check_url(url: str, session: aiohttp.ClientSession) -> int:
    """Get the status code of a URL."""
    async with session.head(url) as response:
        return response.status


async def find_and_write_bad_urls(
    output_file: AsyncPath, search_path: str = "."
) -> None:
    """Find and write bad URLs to a specified file."""
    search_path = AsyncPath(search_path)
    await aprint("Getting pages...")
    md_files = await filter_files(await find_md_files(search_path))
    await aprint("Found all pages!")

    async with AIOFile(str(output_file), "a") as afp:
        writer = Writer(afp)
        async with aiohttp.ClientSession(
            trust_env=True, timeout=aiohttp.ClientTimeout(total=500)
        ) as session:
            await asyncio.gather(
                *(process_file(file, writer, output_file, session) for file in md_files)
            )
        await afp.fsync()


async def main():
    await find_and_write_bad_urls(AsyncPath("bad-urls.txt"), search_path="./pages")


if __name__ == "__main__":
    asyncio.run(main())

Edit: I forgot to remove 2 test lines 😅
Update: it now writes to bad-urls.txt sequentially using aiofile's Writer, and no longer writes partial text.

@vitorhcl (Member, Author)

Note: I'm not writing to /tmp/bad-urls.txt for Windows compatibility reasons, but users are free to change this in the script's main function.
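
If we ever do want a temp location, something like this stays portable (untested sketch using the standard tempfile module; not part of the script above):

import tempfile
from pathlib import Path

# Build a platform-independent temp path instead of hard-coding /tmp.
output_file = Path(tempfile.gettempdir()) / "bad-urls.txt"
# e.g. /tmp/bad-urls.txt on Linux, ...\AppData\Local\Temp\bad-urls.txt on Windows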

@sbrl (Member) commented Feb 18, 2024

Hey, this is cool! Does it handle rate limiting, so that hosts don't block it? This was a key issue with the design of the current script, IIRC.

@vitorhcl (Member, Author)

It doesn't handle that currently; sometimes I have to wait a while before running the script again, but IIRC it can complete an entire run without being blocked if it isn't preceded by many runs.

Either way, we can certainly add this to the code :)
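
Something along these lines could be a first, domain-agnostic step (untested sketch; check_url_throttled, the semaphore size and the delay are placeholders):

import asyncio

import aiohttp

REQUEST_SEMAPHORE = asyncio.Semaphore(10)  # placeholder cap on in-flight requests
REQUEST_DELAY = 0.5  # placeholder pause after each request, in seconds


async def check_url_throttled(url: str, session: aiohttp.ClientSession) -> int:
    """Like check_url, but limits concurrency and spaces requests out."""
    async with REQUEST_SEMAPHORE:
        async with session.head(url) as response:
            status = response.status
        # Keep holding the semaphore slot for a moment so requests are spread out.
        await asyncio.sleep(REQUEST_DELAY)
        return status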

@vitorhcl (Member, Author)

I think the right place for this script is scripts/, given that its complexity has increased a bit; keeping it there also makes contributions easier. That way it's possible to add other features too, like regex matching for URLs and automatically updating links that redirect.
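
For the URL matching, something this simple could be a starting point (untested sketch; extract_all_links and the pattern are placeholders, not part of the script above):

import re

# Placeholder pattern: matches http(s) URLs anywhere in a page, not just the
# "More information" line; a production URL regex would need more care.
URL_PATTERN = re.compile(r"https?://[^\s<>]+")


def extract_all_links(content: str) -> list[str]:
    """Hypothetical helper: find every URL mentioned in a page."""
    return URL_PATTERN.findall(content)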

@gutjuri (Member) commented Feb 18, 2024

Thanks, nice work! Unfortunately, the script errors on my machine with OSError: [Errno 24] Too many open files: 'pages/osx/lpstat.md'. Is there a way in asyncio to limit the amount of concurrency, i.e., so that the maximum number of concurrently running tasks is limited to e.g. 500?

I'm running Ubuntu, btw, so there is a limit on how many files can be opened simultaneously (however, I'm not sure of the exact limit).

EDIT: fixed it with a semaphore:

#!/usr/bin/env python3
# SPDX-License-Identifier: MIT

import re
import asyncio

import aiohttp
from aiofile import AIOFile, Writer
from aioconsole import aprint
from aiopath import AsyncPath

# Cap the number of concurrently processed files to avoid "Too many open files".
MAX_CONCURRENCY = 500

sem = asyncio.Semaphore(MAX_CONCURRENCY)

async def find_md_files(search_path: AsyncPath) -> set[AsyncPath]:
    """Find all .md files in the platform directories under the search path."""
    md_files = set()
    async for path_dir in search_path.glob("*"):
        await aprint(path_dir.name)
        # Glob only inside this directory instead of re-scanning the whole tree
        # on every iteration.
        async for file in path_dir.glob("*.md"):
            md_files.add(file)
    return md_files


async def append_if_is_file(path_set: set[AsyncPath], path: AsyncPath):
    """Add the path to the set if it is a regular file."""
    if await path.is_file():
        path_set.add(path)


async def filter_files(md_files: set[AsyncPath]) -> set[AsyncPath]:
    """Filter out non-file paths from the set."""
    filtered_files = set()
    await asyncio.gather(
        *(append_if_is_file(filtered_files, path) for path in md_files)
    )
    return filtered_files


async def process_file(
    file: AsyncPath,
    writer: Writer,
    output_file: AsyncPath,
    session: aiohttp.ClientSession,
) -> None:
    """Extract the link of a single .md file and check it."""
    async with sem:
        async with file.open("r") as f:
            try:
                content = await f.read()
            except Exception as e:
                # Report the file (last three path components) and the error, then skip it.
                await aprint(f"Could not read {'/'.join(file.parts[-3:])}: {e}")
                return

    url = extract_link(content)

    if url is not None:
        await check_url_and_write_if_bad(url, writer, output_file, session)


def extract_link(content: str) -> str | None:
    """Extract the URL from the '> More information: <...>' line, if present."""
    match = re.search(r"> More information: <(.+)>", content)
    return match.group(1) if match else None


async def check_url_and_write_if_bad(
    url: str, writer: Writer, output_file: AsyncPath, session: aiohttp.ClientSession
) -> None:
    """Check URL status and write bad URLs to a file."""
    await aprint(f"??? {url}")
    code = -1
    try:
        code = await check_url(url, session)
    except aiohttp.ClientError as exc:
        if hasattr(exc, "strerror"):
            await aprint(f"\033[31m{exc.strerror}\033[0m")
        elif hasattr(exc, "message"):
            await aprint(f"\033[31m{exc.message}\033[0m")
        else:
            await aprint(f"\033[31m{exc}\033[0m")
    await aprint(f"{code} {url}")

    if not 200 <= code < 400:
        await writer(f"{code}|{url}\n")


async def check_url(url: str, session: aiohttp.ClientSession) -> int:
    """Get the status code of a URL."""
    async with session.head(url) as response:
        return response.status


async def find_and_write_bad_urls(
    output_file: AsyncPath, search_path: str = "."
) -> None:
    """Find and write bad URLs to a specified file."""
    search_path = AsyncPath(search_path)
    await aprint("Getting pages...")
    md_files = await filter_files(await find_md_files(search_path))
    await aprint("Found all pages!")

    async with AIOFile(str(output_file), "a") as afp:
        writer = Writer(afp)
        async with aiohttp.ClientSession(
            trust_env=True, timeout=aiohttp.ClientTimeout(total=500)
        ) as session:
            await asyncio.gather(
                *(process_file(file, writer, output_file, session) for file in md_files)
            )
        await afp.fsync()


async def main():
    await find_and_write_bad_urls(AsyncPath("bad-urls.txt"), search_path="./pages")


if __name__ == "__main__":
    asyncio.run(main())

Also, manned.org seems to do rate limiting, so we should definitely implement this, ideally on a per-domain basis. For example, I get 503 (Service Unavailable) error codes for some existing manned.org pages.

@vitorhcl (Member, Author)

Thanks for adding this limit.

As for the per-domain rate limit, I have an idea: split the links into lists, where each list contains the links that belong to a specific domain, then alternate between the lists, issuing the requests in the right order and using asyncio.sleep to respect each domain's timeout.
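
Roughly like this (untested sketch of the grouping idea; group_by_domain, check_domain and the delay value are placeholders):

import asyncio
from collections import defaultdict
from urllib.parse import urlsplit

import aiohttp
from aioconsole import aprint

PER_DOMAIN_DELAY = 1.0  # placeholder: seconds between requests to the same domain


def group_by_domain(urls: list[str]) -> dict[str, list[str]]:
    """Split the URLs into one list per domain."""
    grouped = defaultdict(list)
    for url in urls:
        grouped[urlsplit(url).netloc].append(url)
    return grouped


async def check_domain(urls: list[str], session: aiohttp.ClientSession) -> None:
    """Check one domain's URLs sequentially, sleeping between requests."""
    for url in urls:
        async with session.head(url) as response:
            await aprint(f"{response.status} {url}")
        await asyncio.sleep(PER_DOMAIN_DELAY)


async def check_all_domains(urls: list[str], session: aiohttp.ClientSession) -> None:
    """One sequential checker per domain; different domains still run concurrently."""
    await asyncio.gather(
        *(check_domain(domain_urls, session)
          for domain_urls in group_by_domain(urls).values())
    )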

@sebastiaanspeck (Member) commented Feb 19, 2024

I think this script could also be useful for https://github.com/tldr-pages/tldr-maintenance to help contributors spot broken links, instead of having to check them locally. I proposed this idea a while ago, but back then the script in the wiki was not well suited to frequent runs.

@sbrl (Member) commented Feb 23, 2024

@vitorhcl that approach sounds good to me! I think we would want to bake that into the script, especially if we are to add it to tldr-maintenance. Perhaps for future edits you would like to open a pull request, so we can track the history as the script evolves rather than confining it to this issue?

Thanks so much again!
