A small Python script for crawling a website and finding broken links.
The script starts from a given page, crawls all internal HTML pages within the same domain, collects links from <a href> tags, and checks whether those links are reachable.
What it does:
- crawls all internal pages within the domain of the starting URL;
- checks internal links on all discovered pages;
- checks external links only at the first level, meaning only where they appear on an internal page of the site;
- ignores mailto:, tel:, javascript:, data:, and anchor links;
- prints a report with pages and broken links at the end.
Python 3.10+ is required.
Install dependencies:
pip install requests beautifulsoup4 tqdm urllib3
Basic run:
python check_links.py https://example.com/
Example with additional parameters:
python check_links.py https://example.com/ --timeout 15 --delay 0.2 --max-pages 500
Arguments:
- start_url: starting URL, for example https://example.com/
- --timeout: HTTP request timeout in seconds, default is 10
- --delay: delay between page fetches in seconds, default is 0
- --max-pages: optional limit for the number of internal pages to crawl
- --user-agent: value for the User-Agent header
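For orientation, the command-line options listed above map naturally onto `argparse`. The sketch below is a guess at how such a parser might look, not the script's real one: `build_parser` is a hypothetical name, and the default `User-Agent` string is invented because the README does not state one.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        prog="check_links.py",
        description="Crawl a website and report broken links.",
    )
    parser.add_argument("start_url",
                        help="starting URL, for example https://example.com/")
    parser.add_argument("--timeout", type=float, default=10,
                        help="HTTP request timeout in seconds (default: 10)")
    parser.add_argument("--delay", type=float, default=0,
                        help="delay between page fetches in seconds (default: 0)")
    parser.add_argument("--max-pages", type=int, default=None,
                        help="optional limit for the number of internal pages to crawl")
    # Assumption: the real default User-Agent is not documented; this one is made up.
    parser.add_argument("--user-agent", default="check-links-sketch/0.1",
                        help="value for the User-Agent header")
    return parser


# Parse the same arguments as the "additional parameters" example.
args = build_parser().parse_args(
    ["https://example.com/", "--timeout", "15", "--delay", "0.2", "--max-pages", "500"]
)
```

Note that `--max-pages` becomes the attribute `args.max_pages`, since `argparse` converts dashes in option names to underscores.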
The script uses a queue to crawl pages inside the domain.
Each internal page is added to the queue only once and parsed only once. It uses two sets for that:
- seen_in_queue: prevents the same page from being added to the queue more than once;
- visited_pages: prevents an already visited page from being processed again.
This means the HTML of each internal page is fetched only once.
That said, the same URL may still be requested more than once. For example, an internal page may first receive a HEAD or GET request during link validation, and later a full GET when the crawler actually reaches that page. That is expected and keeps the logic simpler.
A link is included in the report if:
- the server returns an HTTP status of 400 or higher;
- the request fails with a network error;
- the server does not respond within the timeout.
The script is useful, but the internet remains the internet.
- Some websites do not handle HEAD correctly, so the script falls back to GET in those cases.
- If a site returns 200 OK for missing pages and shows a custom 404 page, this cannot be detected by status code alone.
- The script does not execute JavaScript, so it will not see links that appear only after client-side rendering.
- robots.txt and sitemap files are not used at the moment.
- URL canonicalization is basic: fragments such as #section are removed, and trailing / is normalized.
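As an illustration of the basic canonicalization mentioned in the last item, here is one possible normalization function. The README does not say in which direction the trailing slash is normalized, so the choices below (an empty path becomes `/`, and a trailing `/` is stripped from every other path) are assumptions, and `normalize_url` is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Drop the fragment and normalize the trailing slash (assumed policy)."""
    parts = urlsplit(url)
    path = parts.path or "/"              # "https://example.com" -> path "/"
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/") or "/"    # "/about/" -> "/about"
    # Rebuild the URL with an empty fragment; the query string is kept as-is.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))
```

With this policy, `https://example.com/about/#team` and `https://example.com/about` map to the same key, so the page is queued only once.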
Possible improvements:
- save the report as CSV or JSON;
- add parallel link checking;
- distinguish HTML pages and binary files more accurately;
- detect “soft 404” pages that return 200 even though the page does not really exist;
- add exclusions for paths, domains, or URL patterns.
Broken links report:
Page: https://example.com/about
- https://example.com/missing-page [status=404]
- https://external.example.org/old-link [status=410]
Page: https://example.com/contact
- https://example.com/form [error=HTTPSConnectionPool(host='example.com', ...)]
Total broken link entries: 3
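A report in this shape could be produced from a simple mapping of page URL to broken entries. The sketch below assumes such a data structure; `format_report` and the layout of `broken` are hypothetical and not taken from the script.

```python
def format_report(broken: dict[str, list[tuple[str, str]]]) -> str:
    """Render {page_url: [(link, reason), ...]} in the report layout shown above."""
    lines = ["Broken links report:"]
    total = 0
    for page, entries in broken.items():
        if not entries:
            continue  # pages without broken links are not listed
        lines.append(f"Page: {page}")
        for link, reason in entries:
            lines.append(f"- {link} [{reason}]")
            total += 1
    lines.append(f"Total broken link entries: {total}")
    return "\n".join(lines)


# Sample data taken from the first page of the example report.
report = format_report({
    "https://example.com/about": [
        ("https://example.com/missing-page", "status=404"),
        ("https://external.example.org/old-link", "status=410"),
    ],
    "https://example.com/": [],
})
print(report)
```

The total counts entries rather than unique URLs, so a link that is broken on two pages is counted twice.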
Add one if you feel like it. For now, it is just a useful little utility, not a museum artifact.