snakeye/integrity

integrity

A small Python script for crawling a website and finding broken links.

The script starts from a given page, crawls all internal HTML pages within the same domain, collects links from <a href> tags, and checks whether those links are reachable.
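
As a rough illustration of the link-collection step, the sketch below pulls href values out of <a> tags and resolves them against the page URL. It deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; the names LinkCollector and extract_links are hypothetical and are not taken from check_links.py.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links such as "/about" against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> list[str]:
    """Return every <a href> target found in html, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

With BeautifulSoup the same idea would likely be a loop over soup.find_all("a", href=True); the parser choice does not change the overall flow.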

What it does:

  • crawls all internal pages within the domain of the starting URL;
  • checks internal links on all discovered pages;
  • checks external links only one level deep: each external link found on an internal page is checked, but external pages themselves are never crawled;
  • ignores mailto:, tel:, javascript:, data:, and anchor links;
  • prints a report with pages and broken links at the end.
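
The filtering rules in the list above could be sketched roughly as follows. Both function names are hypothetical, and the sketch assumes that "internal" means an exact host match with the starting URL; the actual script may compare domains differently.

```python
from urllib.parse import urlsplit

# Link prefixes the README says are ignored.
SKIPPED_PREFIXES = ("mailto:", "tel:", "javascript:", "data:")


def should_check(href: str) -> bool:
    """Return False for links that are skipped instead of being checked."""
    href = href.strip()
    if not href or href.startswith("#"):
        # Empty values and pure anchor links never leave the current page.
        return False
    return not href.lower().startswith(SKIPPED_PREFIXES)


def is_internal(url: str, start_url: str) -> bool:
    """Assumption: a URL is internal if its host equals the start URL's host."""
    return urlsplit(url).netloc.lower() == urlsplit(start_url).netloc.lower()
```

Internal URLs would then be both checked and queued for crawling, while external ones would only be checked.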

Installation

Python 3.10+ is required.

Install dependencies:

pip install requests beautifulsoup4 tqdm urllib3

Usage

Basic run:

python check_links.py https://example.com/

Example with additional parameters:

python check_links.py https://example.com/ --timeout 15 --delay 0.2 --max-pages 500

Parameters

  • start_url - starting URL, for example https://example.com/
  • --timeout - HTTP request timeout in seconds, default is 10
  • --delay - delay between page fetches in seconds, default is 0
  • --max-pages - optional limit for the number of internal pages to crawl
  • --user-agent - value for the User-Agent header
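
For orientation, the parameters above map naturally onto argparse. The sketch below is an assumption about how such a parser might look, not a copy of the one in check_links.py; in particular, the default for --user-agent is not documented here, so it is left as None.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented parameters (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Crawl a website and report broken links."
    )
    parser.add_argument("start_url", help="starting URL, e.g. https://example.com/")
    parser.add_argument("--timeout", type=float, default=10.0,
                        help="HTTP request timeout in seconds")
    parser.add_argument("--delay", type=float, default=0.0,
                        help="delay between page fetches in seconds")
    parser.add_argument("--max-pages", type=int, default=None,
                        help="optional limit for the number of internal pages")
    # Assumption: the real default User-Agent is unknown, so None is used here.
    parser.add_argument("--user-agent", default=None,
                        help="value for the User-Agent header")
    return parser
```

Parsing the second usage example would then give timeout=15.0, delay=0.2 and max_pages=500.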

How it works

The script uses a queue to crawl pages inside the domain.

Each internal page is added to the queue only once and parsed only once. The script uses two sets for this:

  • seen_in_queue - prevents the same page from being added to the queue more than once;
  • visited_pages - prevents an already visited page from being processed again.

This means the HTML of each internal page is fetched only once.
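
A minimal sketch of this queue logic, assuming a hypothetical get_internal_links callable that fetches a page and returns the internal URLs found on it. The set names follow the description above; everything else is illustrative.

```python
from collections import deque
from typing import Callable, Iterable, Optional


def crawl(
    start_url: str,
    get_internal_links: Callable[[str], Iterable[str]],
    max_pages: Optional[int] = None,
) -> list[str]:
    """Visit internal pages breadth-first and return them in visit order."""
    queue = deque([start_url])
    seen_in_queue = {start_url}       # URLs that have ever been enqueued
    visited_pages: set[str] = set()   # URLs whose HTML has been fetched
    order: list[str] = []

    while queue:
        if max_pages is not None and len(visited_pages) >= max_pages:
            break
        page = queue.popleft()
        if page in visited_pages:
            # Second guard; seen_in_queue normally makes this unreachable.
            continue
        visited_pages.add(page)
        order.append(page)
        for link in get_internal_links(page):  # the only fetch of this page
            if link not in seen_in_queue:
                seen_in_queue.add(link)
                queue.append(link)
    return order
```

Because a URL enters seen_in_queue the moment it is enqueued, a page linked from many places is still fetched for parsing only once.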

That said, the same URL may still be requested more than once. For example, an internal page may first receive a HEAD or GET request during link validation, and later a full GET when the crawler actually reaches that page. This is expected and keeps the logic simpler.

What counts as a broken link

A link is included in the report if:

  • the server returns an HTTP status of 400 or higher;
  • the request fails with a network error;
  • the server does not respond within the timeout.
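
These three conditions could be expressed roughly as below. The send callable is a hypothetical stand-in for the HTTP layer: it takes a method, a URL and a timeout, and returns the status code. The sketch assumes that any HEAD response of 400 or higher triggers a GET retry, which may differ from the exact fallback rule in the script.

```python
from typing import Callable, Optional

# Hypothetical signature: send(method, url, timeout) -> HTTP status code.
Send = Callable[[str, str, float], int]


def check_link(url: str, send: Send, timeout: float = 10.0) -> Optional[str]:
    """Return None if the link looks fine, otherwise a short reason string."""
    try:
        status = send("HEAD", url, timeout)
        if status >= 400:
            # Assumption: some servers mishandle HEAD, so confirm with GET.
            status = send("GET", url, timeout)
    except OSError as exc:
        # requests.RequestException derives from OSError, as do the
        # built-in timeout and connection errors.
        return f"error={exc}"
    if status >= 400:
        return f"status={status}"
    return None
```

With requests, send could be a thin wrapper that calls requests.request(method, url, timeout=timeout) and returns the response's status_code.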

Limitations

The script is useful, but the internet remains the internet.

  • Some websites do not handle HEAD correctly, so the script falls back to GET in those cases.
  • If a site returns 200 OK for missing pages and shows a custom 404 page, this cannot be detected by status code alone.
  • The script does not execute JavaScript, so it will not see links that appear only after client-side rendering.
  • robots.txt and sitemap files are not used at the moment.
  • URL canonicalization is basic: fragments such as #section are removed, and trailing / is normalized.
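
To make the last point concrete, here is one possible reading of "basic" canonicalization. The README does not say in which direction the trailing slash is normalized, so this sketch assumes it is stripped everywhere except at the site root; the function name canonicalize is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Drop the fragment and normalize the trailing slash (assumed rules)."""
    parts = urlsplit(url)
    path = parts.path
    if path != "/":
        # Assumption: "/about/" and "/about" are treated as the same page.
        path = path.rstrip("/")
    if not path:
        # An empty path means the site root.
        path = "/"
    # Passing "" as the last component discards any #fragment.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))
```

Under these assumptions, https://example.com/about/#team and https://example.com/about end up as the same queue entry.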

Ideas for improvement

  • save the report as CSV or JSON;
  • add parallel link checking;
  • distinguish HTML pages and binary files more accurately;
  • detect “soft 404” pages that return 200 even though the page does not really exist;
  • add exclusions for paths, domains, or URL patterns.

Example output

Broken links report:

Page: https://example.com/about
  - https://example.com/missing-page [status=404]
  - https://external.example.org/old-link [status=410]

Page: https://example.com/contact
  - https://example.com/form [error=HTTPSConnectionPool(host='example.com', ...)]

Total broken link entries: 3
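
A report in this shape could be produced by something like the sketch below. The input structure (a dict mapping each page to a list of URL and reason pairs) and the name format_report are assumptions made for illustration.

```python
def format_report(broken: dict[str, list[tuple[str, str]]]) -> str:
    """Render broken links grouped by the page they were found on."""
    lines = ["Broken links report:", ""]
    total = 0
    for page, entries in broken.items():
        lines.append(f"Page: {page}")
        for url, reason in entries:
            # reason is a short string such as "status=404" or "error=...".
            lines.append(f"  - {url} [{reason}]")
            total += 1
        lines.append("")
    lines.append(f"Total broken link entries: {total}")
    return "\n".join(lines)
```

The total counts entries rather than unique URLs, so a link that is broken on two pages is counted twice.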

License

Add one if you feel like it. For now, it is just a useful little utility, not a museum artifact.
