Bulk URL Checker with uv: Validate Website Accessibility in Python

Learn how to build a powerful URL checker script using uv that validates multiple websites concurrently, detects broken links, and generates detailed reports.

Managing websites and ensuring all links are working properly is a crucial task for web developers, SEO specialists, and content managers. Broken links can hurt your SEO rankings, frustrate users, and damage your website’s credibility. With uv, creating a powerful URL checker script is straightforward and efficient.

In this comprehensive guide, we’ll build a feature-rich URL validation tool that can check hundreds of URLs concurrently, categorize different types of errors, generate detailed reports, and save problematic URLs for further investigation. Whether you’re auditing a website, validating external links, or monitoring API endpoints, this script has you covered.

New to uv?

If you’re new to uv or want to learn how to set up full Python projects, start with our comprehensive guide Getting Started with uv: Setting Up Your Python Project in 2025 before diving into this advanced script.

What Makes This URL Checker Special?

Unlike basic URL validation tools, our script offers enterprise-level features:

  • Concurrent Processing: Check multiple URLs simultaneously using ThreadPoolExecutor
  • Smart URL Handling: Automatically adds HTTPS protocol to URLs without schemes
  • Comprehensive Error Detection: Identifies timeouts, connection errors, and HTTP status codes
  • Detailed Reporting: Provides response times, status codes, and error categorization
  • File Input/Output: Read URLs from files and save problematic URLs for review
  • Progress Tracking: Real-time progress indicators during bulk checking
  • Flexible Configuration: Customizable timeout settings and worker thread counts
  • Cross-Platform: Works seamlessly on macOS, Windows, and Linux

The Complete URL Checker Script

Let’s start with the full script. Save this as url_checker.py:

#!/usr/bin/env -S uv run
# /// script
# dependencies = [
#     "requests",
# ]
# ///

import requests
from urllib.parse import urlparse
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
import sys

def check_url(url, timeout=10):
    """
    Check if a URL is accessible and return status information.

    Args:
        url (str): The URL to check
        timeout (int): Timeout in seconds (default: 10)

    Returns:
        dict: Contains url, status, error_type, and response_time
    """
    # Add https:// if no scheme is provided
    if not url.startswith(('http://', 'https://')):
        url = 'https://' + url

    start_time = time.time()

    try:
        response = requests.get(url, timeout=timeout, allow_redirects=True)
        response_time = time.time() - start_time

        return {
            'url': url,
            'status': 'OK',
            'status_code': response.status_code,
            'error_type': None,
            'response_time': round(response_time, 2)
        }

    except requests.exceptions.Timeout:
        return {
            'url': url,
            'status': 'TIMEOUT',
            'status_code': None,
            'error_type': 'Connection timeout',
            'response_time': timeout
        }

    except requests.exceptions.ConnectionError as e:
        return {
            'url': url,
            'status': 'CONNECTION_ERROR',
            'status_code': None,
            'error_type': f'Connection error: {str(e)[:100]}...',
            'response_time': time.time() - start_time
        }

    except requests.exceptions.RequestException as e:
        return {
            'url': url,
            'status': 'ERROR',
            'status_code': None,
            'error_type': f'Request error: {str(e)[:100]}...',
            'response_time': time.time() - start_time
        }

def read_urls_from_file(filename):
    """Read URLs from a text file, one per line."""
    urls = []
    try:
        with open(filename, 'r', encoding='utf-8') as file:
            for line in file:
                url = line.strip()
                if url and not url.startswith('#'):  # Skip empty lines and comments
                    urls.append(url)
        return urls
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found.")
        return []
    except Exception as e:
        print(f"Error reading file: {e}")
        return []

def check_urls_batch(urls, timeout=10, max_workers=10):
    """
    Check multiple URLs concurrently.

    Args:
        urls (list): List of URLs to check
        timeout (int): Timeout per request in seconds
        max_workers (int): Maximum number of concurrent threads

    Returns:
        list: List of results for each URL
    """
    results = []

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all tasks
        future_to_url = {executor.submit(check_url, url, timeout): url for url in urls}

        # Process completed tasks
        for i, future in enumerate(as_completed(future_to_url), 1):
            result = future.result()
            results.append(result)

            # Progress indicator
            print(f"Checked {i}/{len(urls)} URLs: {result['url']} - {result['status']}")

    return results

def main():
    # Configuration
    filename = input("Enter the filename containing URLs (or press Enter for 'urls.txt'): ").strip()
    if not filename:
        filename = 'urls.txt'

    timeout = input("Enter timeout in seconds (or press Enter for 10): ").strip()
    timeout = int(timeout) if timeout.isdigit() else 10

    print(f"\nReading URLs from '{filename}'...")
    urls = read_urls_from_file(filename)

    if not urls:
        print("No URLs found to check.")
        return

    print(f"Found {len(urls)} URLs to check.")
    print(f"Using timeout: {timeout} seconds")
    print("-" * 50)

    # Check all URLs
    results = check_urls_batch(urls, timeout=timeout)

    # Separate problematic URLs
    problematic_urls = [r for r in results if r['status'] != 'OK']
    working_urls = [r for r in results if r['status'] == 'OK']

    print("\n" + "=" * 50)
    print("SUMMARY")
    print("=" * 50)
    print(f"Total URLs checked: {len(results)}")
    print(f"Working URLs: {len(working_urls)}")
    print(f"Problematic URLs: {len(problematic_urls)}")

    if problematic_urls:
        print("\n" + "=" * 50)
        print("PROBLEMATIC URLs")
        print("=" * 50)

        # Group by error type
        timeout_urls = [r for r in problematic_urls if r['status'] == 'TIMEOUT']
        connection_error_urls = [r for r in problematic_urls if r['status'] == 'CONNECTION_ERROR']
        other_error_urls = [r for r in problematic_urls if r['status'] == 'ERROR']

        if timeout_urls:
            print(f"\nTIMEOUT ERRORS ({len(timeout_urls)}):")
            for result in timeout_urls:
                print(f"  - {result['url']}")

        if connection_error_urls:
            print(f"\nCONNECTION ERRORS ({len(connection_error_urls)}):")
            for result in connection_error_urls:
                print(f"  - {result['url']}")
                print(f"    Error: {result['error_type']}")

        if other_error_urls:
            print(f"\nOTHER ERRORS ({len(other_error_urls)}):")
            for result in other_error_urls:
                print(f"  - {result['url']}")
                print(f"    Error: {result['error_type']}")

        # Save problematic URLs to file
        with open('problematic_urls.txt', 'w', encoding='utf-8') as f:
            f.write("# Problematic URLs found during check\n")
            f.write(f"# Checked on: {time.strftime('%Y-%m-%d %H:%M:%S')}\n\n")

            if timeout_urls:
                f.write("# TIMEOUT ERRORS\n")
                for result in timeout_urls:
                    f.write(f"{result['url']}\n")
                f.write("\n")

            if connection_error_urls:
                f.write("# CONNECTION ERRORS\n")
                for result in connection_error_urls:
                    f.write(f"{result['url']}\n")
                f.write("\n")

            if other_error_urls:
                f.write("# OTHER ERRORS\n")
                for result in other_error_urls:
                    f.write(f"{result['url']}\n")

        print(f"\nProblematic URLs saved to 'problematic_urls.txt'")

    if working_urls:
        print(f"\nWORKING URLs ({len(working_urls)}):")
        for result in working_urls:
            print(f"  ✓ {result['url']} (Status: {result['status_code']}, Time: {result['response_time']}s)")

if __name__ == "__main__":
    print("URL Connection Checker")
    print("=" * 30)
    main()

How the Script Works

Our URL checker script is built around several key components that work together to provide comprehensive URL validation:

Core Functions Breakdown

  • check_url(): validates an individual URL; handles timeouts and connection errors and measures response time
  • read_urls_from_file(): loads URLs from a text file; skips comments and empty lines and handles missing files
  • check_urls_batch(): checks multiple URLs concurrently; uses ThreadPoolExecutor and prints progress as it goes
  • main(): orchestrates the whole run; interactive configuration, result categorization, and report output

Error Detection Categories

The script categorizes different types of URL problems:

  • OK: URL is accessible and returns a valid HTTP response
  • TIMEOUT: URL takes longer than the specified timeout to respond
  • CONNECTION_ERROR: Network-level issues (DNS resolution, connection refused)
  • ERROR: Other HTTP-related errors (invalid URLs, server errors)
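
To get a feel for these categories, a couple of direct calls to check_url() illustrate them. This snippet assumes check_url() from the script above is in scope (for example, paste it temporarily in place of the call to main()):

# Illustrative only: print the category check_url() assigns to two test URLs
for test_url in ('example.com', 'https://nonexistent-website-12345.com'):
    result = check_url(test_url, timeout=5)
    print(f"{result['url']} -> {result['status']} ({result['error_type']})")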

Running the Script

The beauty of using uv is that you can run this script immediately without any setup. Save the script as url_checker.py and follow these steps:

Prerequisites

Beyond uv itself, no installation is required. The script’s only dependency is the requests library, which uv installs automatically on the first run based on the inline # /// script block at the top of the file.

Basic Usage

1. Create a URL List File

First, create a text file with URLs to check. Save this as urls.txt:

# Website URLs to check
https://www.google.com
https://www.github.com
https://www.stackoverflow.com
https://nonexistent-website-12345.com
https://httpstat.us/500
https://httpstat.us/404
# Add more URLs here
bitdoze.com
example.com

2. Run the Script

# Basic execution
uv run url_checker.py

The script will prompt you for:

  • Filename: Press Enter to use urls.txt or specify a different file
  • Timeout: Press Enter for 10 seconds or specify a custom timeout

3. Example Output

URL Connection Checker
==============================
Enter the filename containing URLs (or press Enter for 'urls.txt'):
Enter timeout in seconds (or press Enter for 10):

Reading URLs from 'urls.txt'...
Found 8 URLs to check.
Using timeout: 10 seconds
--------------------------------------------------
Checked 1/8 URLs: https://www.google.com - OK
Checked 2/8 URLs: https://www.github.com - OK
Checked 3/8 URLs: https://www.stackoverflow.com - OK
Checked 4/8 URLs: https://nonexistent-website-12345.com - CONNECTION_ERROR
Checked 5/8 URLs: https://httpstat.us/500 - OK
Checked 6/8 URLs: https://httpstat.us/404 - OK
Checked 7/8 URLs: https://bitdoze.com - OK
Checked 8/8 URLs: https://example.com - OK

==================================================
SUMMARY
==================================================
Total URLs checked: 8
Working URLs: 7
Problematic URLs: 1

==================================================
PROBLEMATIC URLs
==================================================

CONNECTION ERRORS (1):
  - https://nonexistent-website-12345.com
    Error: Connection error: HTTPSConnectionPool(host='nonexistent-website-12345.com', port=443)...

Problematic URLs saved to 'problematic_urls.txt'

WORKING URLs (7):
  ✓ https://www.google.com (Status: 200, Time: 0.15s)
  ✓ https://www.github.com (Status: 200, Time: 0.23s)
  ✓ https://www.stackoverflow.com (Status: 200, Time: 0.18s)
  ✓ https://httpstat.us/500 (Status: 500, Time: 1.02s)
  ✓ https://httpstat.us/404 (Status: 404, Time: 0.98s)
  ✓ https://bitdoze.com (Status: 200, Time: 0.45s)
  ✓ https://example.com (Status: 200, Time: 0.32s)

Advanced Usage Examples

1. Custom Configuration

# Run with custom file and timeout
uv run url_checker.py
# When prompted:
# Filename: my_links.txt
# Timeout: 5
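
If you run the checker often, you may prefer passing the filename and timeout on the command line instead of answering prompts. A minimal, optional variant of the configuration step in main() could look like this (the argument handling is illustrative, not part of the script above):

import sys

# Optional replacement for the two input() prompts in main():
#   uv run url_checker.py my_links.txt 5
filename = sys.argv[1] if len(sys.argv) > 1 else 'urls.txt'
timeout = int(sys.argv[2]) if len(sys.argv) > 2 and sys.argv[2].isdigit() else 10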

2. Checking Different Types of URLs

Create specialized URL lists for different purposes:

API Endpoints (api_endpoints.txt):

https://api.github.com
https://jsonplaceholder.typicode.com/posts/1
https://httpbin.org/get
https://api.openweathermap.org/data/2.5/weather

Social Media Links (social_links.txt):

https://twitter.com/username
https://linkedin.com/in/profile
https://facebook.com/page
https://instagram.com/account

Internal Website Links (internal_links.txt):

https://yourwebsite.com/about
https://yourwebsite.com/contact
https://yourwebsite.com/blog
https://yourwebsite.com/products

3. Performance Optimization

For large URL lists, you can modify the script to use more concurrent workers:

# In the check_urls_batch function call, increase max_workers
results = check_urls_batch(urls, timeout=timeout, max_workers=20)

Performance Guidelines:

  • 1-50 URLs: 5-10 workers, roughly 10-30 seconds
  • 51-200 URLs: 10-15 workers, roughly 30-60 seconds
  • 201-500 URLs: 15-25 workers, roughly 1-3 minutes
  • 500+ URLs: 20-30 workers, 3+ minutes
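
For very large lists it can also help to process URLs in batches with a short pause between them, which keeps the load on your own connection and on remote servers predictable. A minimal wrapper around check_urls_batch() might look like this (the batch size and pause are illustrative defaults, not part of the original script):

import time

def check_urls_in_batches(urls, batch_size=100, pause_seconds=2,
                          timeout=10, max_workers=20):
    """Split a large URL list into batches and pause briefly between them."""
    all_results = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        all_results.extend(
            check_urls_batch(batch, timeout=timeout, max_workers=max_workers)
        )
        if start + batch_size < len(urls):
            time.sleep(pause_seconds)  # brief pause between batches
    return all_results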

Understanding the Output

Status Codes and Their Meanings

  • 200 (OK): page loads successfully; no action required
  • 301/302 (Redirect): page has moved; update the URL if the redirect is permanent
  • 404 (Not Found): page does not exist; remove or fix the URL
  • 500 (Server Error): problem on the website’s side; contact the site owner
  • Timeout: no response within the time limit; check the URL or increase the timeout
  • Connection Error: network or DNS issue; verify the URL spelling
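
One thing to keep in mind: as written, check_url() reports any URL that returns an HTTP response as OK, even a 404 or 500 (you can see this with the httpstat.us entries in the example output above). If you would rather have 4xx/5xx responses counted as problems, a small optional change inside check_url() could look like this:

# Optional: inside check_url(), right after the response arrives, treat HTTP
# error codes as problems instead of reporting them as 'OK'
if response.status_code >= 400:
    return {
        'url': url,
        'status': 'HTTP_ERROR',
        'status_code': response.status_code,
        'error_type': f'HTTP {response.status_code}',
        'response_time': round(response_time, 2)
    }

You would then want a matching HTTP_ERROR group in main() so these URLs get their own heading in the report and in problematic_urls.txt.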

Generated Files

The script creates a problematic_urls.txt file containing:

# Problematic URLs found during check
# Checked on: 2025-01-15 14:30:25

# TIMEOUT ERRORS
https://slow-website.com

# CONNECTION ERRORS
https://nonexistent-site.com
https://typo-in-url.co

# OTHER ERRORS
https://broken-ssl-site.com

Common Use Cases

1. Website Audit

Use the script to audit your website’s external links:

# Extract all external links from your website first
# Then check them with the script
uv run url_checker.py

2. Backlink Verification

Validate backlinks and external references:

# backlinks.txt
https://partner-site1.com/link-to-us
https://directory.com/our-listing
https://blog.com/article-mentioning-us

3. API Endpoint Monitoring

Monitor API endpoints for availability:

# api_health.txt
https://api.yourservice.com/health
https://api.yourservice.com/status
https://api.yourservice.com/version

4. Competitor Analysis

Check competitor websites for availability:

# competitors.txt
https://competitor1.com
https://competitor2.com
https://competitor3.com

Troubleshooting Common Issues

Issue 1: “File not found” Error

Problem: Script can’t find the URL file

Solution:

# Make sure the file exists in the same directory
ls -la urls.txt

# Or use absolute path
/full/path/to/urls.txt

Issue 2: Too Many Timeouts

Problem: Many URLs showing timeout errors

Solutions:

  • Increase timeout value (try 20-30 seconds)
  • Check your internet connection
  • Reduce concurrent workers to avoid overwhelming your network

Issue 3: SSL Certificate Errors

Problem: SSL-related connection errors

Solution: The script uses requests with default SSL verification. For testing purposes, you could modify the script to handle SSL issues, but this is not recommended for production use.
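
If you do need to check internal or staging URLs that use self-signed certificates, requests lets you disable verification through its verify parameter. Treat this strictly as a local-testing escape hatch:

# Local testing only: skips TLS certificate verification, which removes
# protection against man-in-the-middle attacks (requests will also emit an
# InsecureRequestWarning on every call)
response = requests.get(url, timeout=timeout, allow_redirects=True, verify=False)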

Script Customization Options

1. Add User-Agent Header

Some websites block requests without proper user agents:

# In the check_url function, modify the requests.get call:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, timeout=timeout, allow_redirects=True, headers=headers)

2. Add Response Size Tracking

# Add to the 'OK' return dictionary in check_url (the response object always
# exists at that point, so no fallback value is needed):
'content_length': len(response.content)

3. Export Results to CSV

import csv

# Add this function to save results as CSV
def save_to_csv(results, filename='url_check_results.csv'):
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['url', 'status', 'status_code', 'response_time', 'error_type']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for result in results:
            writer.writerow(result)
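
To wire this in, you could call it near the end of main(), once the batch check has finished (an optional one-line addition):

# Optional: after `results = check_urls_batch(...)` in main()
save_to_csv(results)
print("Full results saved to 'url_check_results.csv'")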

Best Practices

1. Respectful Checking

  • Use reasonable timeouts: Don’t set timeouts too low (minimum 5 seconds)
  • Limit concurrent requests: Don’t overwhelm servers with too many simultaneous requests
  • Add delays for large batches: Consider adding small delays between batches

2. File Organization

  • Use descriptive filenames: social_media_links.txt, api_endpoints.txt
  • Add comments: Use # to add context to your URL lists
  • Regular updates: Keep your URL lists current

3. Monitoring and Automation

  • Schedule regular checks: Use cron jobs or task schedulers
  • Set up alerts: Monitor the problematic_urls.txt file
  • Track trends: Keep historical data of URL health

Integration with Other Tools

1. Combine with Web Scraping

# Extract URLs from a webpage first (requires adding "beautifulsoup4" to the
# script's inline dependency list alongside "requests")
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def extract_links_from_page(url):
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Resolve relative hrefs against the page URL and keep only http(s) links
    links = [urljoin(url, a.get('href')) for a in soup.find_all('a', href=True)]
    return [link for link in links if link.startswith(('http://', 'https://'))]
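
To tie the two pieces together, the extracted links can be fed straight into the batch checker (a sketch that assumes extract_links_from_page() and check_urls_batch() live in the same file):

# Illustrative: scrape one page, then check every link found on it
links = extract_links_from_page('https://example.com')
results = check_urls_batch(links, timeout=10, max_workers=10)
broken = [r for r in results if r['status'] != 'OK']
print(f"{len(broken)} of {len(links)} links look broken")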

2. Integration with CI/CD

# GitHub Actions example
name: URL Health Check
on:
  schedule:
    - cron: "0 9 * * 1" # Every Monday at 9 AM

jobs:
  url-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install uv
        run: curl -LsSf https://astral.sh/uv/install.sh | sh
      - name: Check URLs
        # The script prompts for a filename and a timeout; pipe two empty
        # lines so it falls back to urls.txt and the 10-second default
        run: printf '\n\n' | uv run url_checker.py
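
One caveat for CI use: the workflow above only fails if the script itself crashes. To make broken links fail the job, you could have main() exit with a non-zero status when problems are found (an optional addition; sys is already imported at the top of the script):

# Optional: add at the end of main(), after problematic_urls is computed,
# so CI runs and schedulers treat broken links as a failed check
if problematic_urls:
    sys.exit(1)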

What’s Next?

Now that you’ve mastered URL checking with uv, you might want to explore more automation scripts:

  • Web Scraping: Build scrapers to extract URLs automatically
  • API Monitoring: Create scripts to monitor API health and performance
  • SEO Tools: Develop tools for SEO analysis and link building
  • Website Monitoring: Set up comprehensive website health monitoring

For more advanced Python automation with uv, check out our other guides.

Conclusion

The URL checker script demonstrates the power and simplicity of using uv for Python automation tasks. With just a few lines of code and zero configuration, you can validate hundreds of URLs, detect broken links, and generate comprehensive reports.

Key benefits of this approach:

  • Zero Setup: No virtual environments or dependency management needed
  • High Performance: Concurrent processing for fast results
  • Comprehensive Reporting: Detailed error categorization and timing information
  • Flexible Input: Support for file-based URL lists with comments
  • Actionable Output: Problematic URLs saved for easy follow-up

Whether you’re maintaining a website, conducting SEO audits, or monitoring API endpoints, this script provides a solid foundation that you can customize and extend for your specific needs. The combination of uv’s simplicity and Python’s powerful libraries makes automation tasks like this both accessible and powerful.

Ready to start checking your URLs? Save the script, create your URL list, and experience the efficiency of automated link validation!
