Open Source SEO Audit Tool: Script-LLM Hybrid Architecture for Professional Site Analysis
Introduction: Bridging the Gap Between Automation and Intelligence
For SEO professionals, site auditing has traditionally been a labor-intensive process requiring meticulous manual verification of dozens of technical factors. The challenge lies in distinguishing between tasks that can be automated with deterministic scripts and those requiring semantic understanding—a distinction that most existing tools fail to address adequately.
This open-source solution introduces a two-layer architecture that combines the reliability of Python scripts with the contextual intelligence of Large Language Models (LLMs). The result is a professional-grade SEO audit tool that eliminates the tedium of manual checks while avoiding the hallucination risks of purely AI-driven approaches.
The Problem: Why Traditional SEO Auditing Falls Short
SEO auditing encompasses a wide range of verification tasks, each with different requirements for accuracy and interpretation. Consider the following scenarios:
Deterministic Checks (80% of audit tasks):
- Verifying robots.txt existence and syntax compliance
- Validating sitemap.xml structure and URL coverage
- Checking canonical tag implementation
- Measuring page load performance scores
- Counting heading hierarchy (H1, H2, H3)
These tasks have clear, binary answers. A file either exists or it doesn't. A tag is either present or absent. Using an LLM for these checks introduces unnecessary risk—the model might confidently assert that a robots.txt file exists when it actually doesn't.
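A deterministic check of this kind reduces to plain parsing with a binary outcome. The sketch below is illustrative only (the function name and directive set are not taken from check-site.py, and real RFC 9309 parsing covers more cases): it flags any robots.txt line that matches no known directive.

```python
def parse_robots_directives(body: str) -> dict:
    """Deterministically scan a robots.txt body (a sketch, not full RFC 9309)."""
    known = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}
    counts = {name: 0 for name in known}
    unknown_lines = []
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        field, _, _value = line.partition(":")
        field = field.strip().lower()
        if field in known:
            counts[field] += 1
        else:
            unknown_lines.append(raw)
    return {
        "directives": counts,
        "unknown_lines": unknown_lines,
        "syntactically_clean": not unknown_lines,
    }
```

Either the body parses cleanly or it does not; there is nothing for a language model to judge here.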
Semantic Judgment Tasks (20% of audit tasks):
- Evaluating whether an H1 tag semantically matches the page's keyword intent
- Assessing whether a meta description provides compelling value propositions
- Determining if internal link distribution supports user navigation goals
- Judging whether content depth adequately covers the topic
These tasks require contextual understanding that scripts alone cannot provide. An LLM excels here, interpreting nuance and intent in ways that rigid rule-based systems cannot.
The Solution: A Two-Layer Architecture
The seo-audit-skill tool implements a two-layer approach:
Layer 1: Python Scripts for Deterministic Verification
The foundation consists of specialized Python scripts, each responsible for a specific category of SEO checks:
- check-site.py: Handles site-level verification: robots.txt parsing (RFC 9309 compliant), sitemap.xml validation, 404 response classification (distinguishing true 404s from soft 404s and homepage redirects), URL normalization (HTTPS migration, www consistency, trailing slashes), hreflang tag verification, JSON-LD Schema markup validation, E-E-A-T trust page detection, and PageSpeed Insights scores for mobile and desktop.
- check-page.py: Performs page-level analysis: URL slug optimization, Title tag length and keyword positioning, Meta Description quality, H1 tag verification, Canonical tag validation, image alt text completeness, word count, keyword placement, heading structure, and internal link distribution. (The full criteria for each check are listed in the Features section below.)
- check-schema.py: Extracts and validates JSON-LD structured data markup.
- check-pagespeed.py: Interfaces with Google PageSpeed Insights API to retrieve performance metrics.
- fetch-page.py: Retrieves raw HTML with built-in SSRF (Server-Side Request Forgery) protection.
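The SSRF protection mentioned for fetch-page.py typically means refusing to fetch any URL that resolves to an internal address. A minimal sketch of such a guard, assuming nothing about the script's actual implementation:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_fetch_target(url: str) -> bool:
    """Reject URLs that resolve to private, loopback, link-local, or reserved
    addresses (an illustrative SSRF guard, not fetch-page.py's actual code)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False  # unresolvable hosts are treated as unsafe
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        if addr.is_private or addr.is_loopback or addr.is_link_local or addr.is_reserved:
            return False
    return True
```

Note that a production guard also has to pin the resolved IP for the actual request, since a hostname can re-resolve between the check and the fetch.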
Each script outputs structured JSON to stdout, using exit codes to indicate status: 0 for pass/warning conditions, 1 for failures. This design enables seamless integration into automated pipelines and CI/CD workflows.
Layer 2: LLM Agent for Semantic Analysis
The LLM layer activates only when semantic judgment is required. By feeding the structured JSON output from Layer 1 into the LLM, the system provides context-aware recommendations without risking factual accuracy. The LLM interprets the data, identifies patterns, and generates human-readable insights—tasks where probabilistic reasoning adds genuine value.
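In practice this hand-off is just prompt construction over Layer 1's JSON. A hypothetical sketch (the tool's actual prompt lives in its SKILL.md and will differ):

```python
import json

def build_semantic_prompt(page_report: dict) -> str:
    """Compose a Layer-2 prompt from Layer-1 JSON output (illustrative only).

    The key idea: the LLM is told the facts are already verified, so it only
    interprets them rather than re-asserting anything factual.
    """
    return (
        "You are an SEO analyst. The facts below were verified by scripts; "
        "do not re-verify them, only interpret.\n\n"
        f"Verified data:\n{json.dumps(page_report, indent=2)}\n\n"
        "Judge whether the H1 matches the page's keyword intent and whether "
        "the meta description states a specific value proposition."
    )
```

Because every factual claim in the prompt originates from script output, the model's answers stay grounded even when its reasoning is probabilistic.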
Comprehensive Audit Coverage: Version 1.0 Features
The current release supports over 20 distinct SEO verification categories, available in two configurations:
Basic Version (seo-audit)
Designed for rapid daily audits, this version provides essential checks:
Site-Level Verification:
- robots.txt parsing with RFC 9309 standard compliance
- sitemap.xml structure and coverage validation
- 404 handling differentiation (true 404 vs. soft 404 vs. homepage redirect)
- URL normalization (HTTP→HTTPS migration, www consistency, trailing slash standardization)
- Internationalization support via hreflang tag verification
- Schema.org JSON-LD markup validation
- E-E-A-T trust signal detection (About, Contact, Privacy, Terms pages)
- PageSpeed Insights performance scoring (mobile and desktop variants)
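The 404-handling differentiation above can be expressed as a small pure classifier. A sketch, assuming the caller has already fetched a deliberately nonexistent URL (following redirects) and recorded the final status code and final URL; function and label names are illustrative:

```python
from urllib.parse import urlparse

def classify_404(status_code: int, requested_url: str, final_url: str) -> str:
    """Classify how a site answers a request for a nonexistent page (sketch)."""
    if status_code == 404:
        return "true_404"           # correct behavior: a real 404
    if final_url != requested_url and urlparse(final_url).path in ("", "/"):
        return "homepage_redirect"  # nonexistent page bounced to the site root
    if status_code == 200:
        return "soft_404"           # 200 OK served on a URL that should 404
    return "other"
```

Soft 404s matter because search engines may keep crawling and indexing pages that the site considers missing.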
Page-Level Analysis:
- URL slug optimization (lowercase formatting, hyphen usage, keyword presence, stop word detection)
- Title tag optimization (50-60 character length, keyword positioning strategy)
- Meta Description quality (120-160 characters, keyword alignment, specific value proposition clarity)
- H1 tag verification (single H1 requirement, keyword matching, semantic intent alignment)
- Canonical tag validation (self-referencing correctness, redirect chain matching)
- Image alt text completeness and descriptiveness
- Word count verification (minimum 500 words for substantive content)
- Keyword placement analysis (presence within first 100 words for SEO impact)
- Heading structure evaluation (H2 quantity optimization, H3/H2 hierarchical ratio, keyword distribution)
- Internal link distribution and anchor text diversity
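Several of these page-level rules are simple threshold checks. A sketch of the Title rule, using the 50-60 character window stated above; the helper name and the 10-character "leading keyword" cutoff are illustrative assumptions, not the tool's actual values:

```python
def check_title(title: str, keyword: str) -> dict:
    """Apply length and keyword-position rules to a Title tag (a sketch)."""
    length_ok = 50 <= len(title) <= 60          # window from the list above
    kw_pos = title.lower().find(keyword.lower())
    return {
        "length": len(title),
        "length_ok": length_ok,
        "keyword_present": kw_pos >= 0,
        # Earlier keyword placement is generally considered stronger.
        "keyword_leading": 0 <= kw_pos <= 10,
        "status": "pass" if length_ok and kw_pos >= 0 else "warn",
    }
```

Because the rule is pure arithmetic on strings, it belongs firmly in Layer 1; only the question of whether the title actually matches search intent goes to Layer 2.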
Full Version (seo-audit-full)
The comprehensive version includes additional advanced checks for enterprise-grade auditing requirements.
Installation and Usage: Two Flexible Approaches
The tool offers multiple installation methods to accommodate different workflows:
Method 1: Command-Line Interface (Recommended)
# Install the skill package
npx skills add JeffLi1993/seo-audit-skill
# Or install specific versions
npx skills add JeffLi1993/seo-audit-skill --skill seo-audit
npx skills add JeffLi1993/seo-audit-skill --skill seo-audit-full
Method 2: Claude Code Plugin Integration
# Add from marketplace
/plugin marketplace add JeffLi1993/seo-audit-skill
# Install the plugin
/plugin install seo-audit-skill
Once installed, auditing becomes a conversational experience:
audit this page: https://example.com
The system generates a comprehensive, structured report identifying issues, explaining why they matter, and providing actionable remediation steps.
Project Structure and Technical Implementation
The repository follows a clean, modular architecture:
seo-audit-skill/
├── seo-audit/
│ ├── SKILL.md # Skill definition and agent workflow
│ ├── references/REFERENCE.md # Field definitions and edge cases
│ ├── assets/report-template.html # HTML report output template
│ └── scripts/
│ ├── check-site.py # robots.txt + sitemap → JSON
│ ├── check-page.py # TDK + H1 + canonical + slug → JSON
│ ├── check-schema.py # JSON-LD extraction + validation → JSON
│ ├── check-pagespeed.py # PageSpeed Insights API → JSON
│ └── fetch-page.py # Raw HTML fetching with SSRF protection
└── seo-audit-full/
├── SKILL.md
├── references/REFERENCE.md
    └── assets/report-template.html
All dependencies are minimal: pip install requests is the only requirement beyond the Python standard library.
The Philosophy: Why This Approach Matters
In an era where AI tools proliferate rapidly, genuine expertise lies not in knowing how to use AI, but in understanding problems deeply enough to know when AI should—and should not—be applied.
The development process for this tool exemplifies this principle. Before writing a single line of code, the creator manually audited dozens of websites, identifying which checks were deterministic (suitable for scripts) and which required semantic interpretation (suitable for LLMs). This ground-up understanding enabled the design of an architecture that leverages each technology's strengths while mitigating their weaknesses.
The lesson extends beyond SEO: meaningful AI augmentation requires domain expertise first, tool selection second. Without the foundational work of understanding the problem space, AI-powered solutions risk producing mediocre results—automating the wrong things or introducing errors that wouldn't exist with traditional approaches.
Community and Contribution
This open-source project invites collaboration from the SEO and developer communities:
- Star the repository if you find it useful
- Report issues when you encounter bugs or edge cases
- Submit pull requests to contribute improvements or new features
- Share experiences and discuss SEO best practices
The tool is freely available under an open-source license, reflecting a commitment to advancing the state of SEO auditing for practitioners at all levels.
Conclusion: Automation with Intelligence
The seo-audit-skill project represents a thoughtful synthesis of deterministic automation and contextual intelligence. By respecting the boundaries between what scripts do best (factual verification) and what LLMs do best (semantic interpretation), it delivers professional-grade audits without the reliability concerns of pure AI approaches.
For SEO professionals drowning in manual audit tasks, this tool offers liberation from repetitive work while preserving the nuanced judgment that separates good SEO from great SEO. The result is more time for strategic thinking—the very work that no tool, however sophisticated, can ever automate.