Joseph Anady

Posted on May 24 • Originally published at thatdevpro.com

Content Audit Methodology

#seo #contentmarketing #analytics #audit

Originally published at thatdevpro.com. This framework reference is part of the 14-tier Engine Optimization stack from ThatDevPro, an SDVOSB-certified veteran-owned web + AI engineering studio. You are reading the dev.to mirror; the source-of-truth canonical version with embedded validation tools lives at the link above.

The Canonical Reference for Inventorying Every Published URL, Scoring Quality, and Deciding the Fate of Each Page Across Keep, Update, Consolidate, Redirect, and Delete

A comprehensive installation and audit reference for content auditing as an SEO and AEO discipline. Content audit is the recurring process of taking complete inventory of every URL a site has published, scoring each one against a quality and performance rubric, and routing each to one of five outcomes: keep with maintenance, update with targeted improvement, consolidate by merging into a stronger canonical, redirect by 301 to a topical successor, or delete by returning 410 Gone. Audit discipline is the highest leverage activity in mature SEO programs because pruning low quality content lifts perceived quality of the entire site and reallocates crawl budget toward pages that earn citation, rank, and convert. This document specifies the inventory methodology, the twelve criterion quality scorecard, the five way decision matrix, the consolidation and sunset protocols, the section level audit, the AI citation audit layer, the topical cluster audit, the audit cadence by site size, and the Bubbles hosted toolchain that runs the entire pipeline on a single Debian server with no CDN or proxy in the path. Dual purpose: installation manual and audit document.

Cross stack note: code samples are written in plain HTML and Bash. For React, Vue, Svelte, Next.js, Nuxt, SvelteKit, Astro, Hugo, 11ty, Remix, WordPress, Shopify, and Webflow equivalents, see framework-cross-stack-implementation.md. The audit pipeline substrate is Python 3.11, pandas 2.x, and Jupyter on the same Debian host that runs nginx.

1. Document Purpose

Content audit is the recurring process of inventorying every published URL on a site, scoring each one against a quality and performance rubric, and routing each to one of five fates: keep with maintenance, update with targeted improvement, consolidate by merging into a stronger canonical, redirect by 301 to a topical successor, or delete by returning 410 Gone. A site that audits once at launch and never again accumulates dead weight every quarter; a site that audits on the cadence in Section 13 keeps its portfolio in continuous alignment with current quality bars and query intent.

Audit discipline is the highest leverage activity in mature SEO programs. First, pruning low quality content lifts the perceived quality of the entire site. Ahrefs August 2025 measured organic click lift of 1 to 23 percent across 47 of 50 case studies on bottom 30 percent pruning. The lift is sitewide. Second, audit reallocates crawl budget toward pages that produce business value. Third, audit produces the data needed for every other content decision: the topical cluster audit shows where the cluster is incomplete, the AI citation audit shows which page patterns earn AI Overview citation, the section level audit shows where individual blocks need refresh while the surrounding article stays static.

The 2026 emphasis on Google's Helpful Content System and on AI Overview citation makes audit increasingly load bearing. A site with significant low quality content risks algorithmic demotion under HCS; the same site loses AI Overview citation to competitors whose content is denser, more current, and more cleanly entity declared.

1.1 Three Operating Modes

Mode A, Install Mode. Establish audit infrastructure on a site that has never had systematic audit. Sections 2 through 14 in order. The first full audit on a new client engagement is the highest value deliverable for the first 90 days.

Mode B, Audit Mode. Run a recurring audit on a site that already has audit discipline. Skip the inventory build; pull the prior audit's inventory; refresh metrics, rescore, reroute, produce the updated work queue. Mode B is what most ongoing client engagements run on a quarterly anchor.

Mode C, Hybrid Mode. Partial or stale inventory. Reconcile against current sitemap, GSC, and GA4, fill the gaps, proceed with Mode B scoring.

1.2 Conflict Resolution Rules

Conflict	Rule
Sitemap inventory shorter than GSC plus GA4 inventory	Critical. Merge all three. Section 3.
Quality scoring without traffic and engagement data	Reject. Section 4 requires both.
Page has zero traffic but holds backlinks	Do not delete. Section 8 specifies 301 to topical successor.
Two pages compete for the same query and both have traffic	Consolidate, do not refresh both. Section 7.
Page level decision is Update but only one section is decayed	Section level audit (Section 10).
Audit last run more than 12 months ago	Treat as new install. Mode A.
Inventory exceeds 5000 URLs and client wants annual full audit	Reject. Section 13 specifies continuous rolling.

1.3 Required Tools

Sitemap fetch via curl; Screaming Frog SEO Spider CLI on Linux (or Sitebulb headless) for full URL crawl; GSC Search Analytics API for indexed URL discovery and per URL performance; GA4 Data API for engagement and conversion attribution; Ahrefs or Semrush API for backlink and ranking data; Python 3.11 with pandas 2.x for inventory merge and scoring; Jupyter for analyst review; spreadsheet (Google Sheets or self hosted Nextcloud) for client shared output. No CDN, no proxy in the audit pipeline. The Bubbles host (169.155.162.118, Debian, nginx, 16 GB RAM) runs the entire stack.

1.4 Relationship to Neighboring Frameworks

Broader site audit: framework-initialaudit.md. Quarterly cadence: framework-ongoingaudit.md. Update or refresh execution: framework-contentrefresh.md. Topical cluster: framework-topicalauthority.md. Internal link: framework-internallinking.md. AI citation: framework-aicitations.md, framework-aioverviews.md. GSC data pull: framework-gscanalysis.md. Quality scoring inputs: framework-eeat.md, framework-hcs.md, framework-infogain.md. Health score: framework-sqrg.md.

2. Client Variables Intake

# CONTENT AUDIT FRAMEWORK CLIENT VARIABLES

# Business and Site Identity (REQUIRED)
business_name: ""
primary_domain: ""
business_industry: ""
ymyl_classification: ""              # full_ymyl, partial_ymyl, lite_ymyl, non_ymyl
cms_or_stack: ""
host_environment: ""                 # bubbles_nginx, valkyrie_nginx, third_party

# Portfolio Scale (REQUIRED)
total_indexable_pages: 0
total_content_pages_excluding_product: 0
total_product_or_listing_pages: 0
oldest_content_publication_year: 0
pages_published_more_than_24_months_ago: 0
pages_published_more_than_12_months_ago: 0
pages_with_zero_inbound_internal_links: 0

# Inventory Source State (REQUIRED)
sitemap_url: ""
sitemap_url_count: 0
gsc_property_verified: false
gsc_indexed_page_count: 0
ga4_property_verified: false
ga4_pages_with_traffic_last_12mo: 0
ahrefs_or_semrush_access: false

# Prior Audit State
prior_audit_exists: false
prior_audit_date: ""
prior_audit_inventory_count: 0
prior_audit_decisions_executed: 0
prior_audit_decisions_pending: 0

# Decision Routing Capacity (REQUIRED)
update_capacity_hours_per_quarter: 0
consolidation_capacity_pages_per_quarter: 0
redirect_or_delete_capacity_per_quarter: 0
section_level_audit_in_use: false

# AI Citation Layer (REQUIRED)
priority_queries_tracked_for_aio: 0
queries_currently_cited_in_aio: 0
ai_citation_audit_integrated: false

# Toolchain (REQUIRED)
crawler_tool: ""                     # screaming_frog_cli, sitebulb_headless, custom_python
gsc_api_credentials_provisioned: false
ga4_api_credentials_provisioned: false
ahrefs_api_credentials_provisioned: false
python_pandas_environment_ready: false
jupyter_notebook_location: ""
audit_csv_output_location: ""
client_shared_spreadsheet_location: ""

Audit routes to baseline frameworks when prerequisites fail. If gsc_property_verified is false, work routes to framework-gscanalysis.md Section 2 verification. If ga4_property_verified is false, work routes to GA4 setup before audit. If sitemap_url_count is zero or far off from gsc_indexed_page_count and ga4_pages_with_traffic_last_12mo, the sitemap is broken and work routes to sitemap repair. Audit against a broken substrate produces inventory gaps that bias every downstream decision.
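
The routing rules above can be sketched as a small pre-flight check. This is an illustration only, not part of the documented pipeline: the function name `route_prerequisites` and the `tolerance` of 0.5 used to decide that the sitemap count is "far off" are assumptions, because the framework gives no numeric threshold. The sketch also flags the sitemap when it is far off from either reference count.

```python
def route_prerequisites(v, tolerance=0.5):
    """Return the prerequisite routes that must run before the audit.

    `v` is a dict of the Section 2 client variables. `tolerance` is an
    assumed cutoff: the sitemap is flagged when its URL count differs from
    the GSC or GA4 count by more than this fraction.
    """
    routes = []
    if not v.get("gsc_property_verified", False):
        routes.append("framework-gscanalysis.md Section 2 verification")
    if not v.get("ga4_property_verified", False):
        routes.append("GA4 setup")

    sitemap_count = v.get("sitemap_url_count", 0)
    references = [
        v.get("gsc_indexed_page_count", 0),
        v.get("ga4_pages_with_traffic_last_12mo", 0),
    ]
    far_off = any(
        ref > 0 and abs(sitemap_count - ref) / ref > tolerance
        for ref in references
    )
    if sitemap_count == 0 or far_off:
        routes.append("sitemap repair")
    return routes


# Hypothetical intake values for illustration
variables = {
    "gsc_property_verified": True,
    "ga4_property_verified": False,
    "sitemap_url_count": 120,
    "gsc_indexed_page_count": 900,
    "ga4_pages_with_traffic_last_12mo": 640,
}
print(route_prerequisites(variables))
```

An empty list means the substrate is sound and the inventory build in Section 3 can start.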

3. The Content Inventory

The most common audit failure mode is incomplete inventory: working only from the sitemap and missing pages GSC has discovered through external links, or working only from GA4 and missing pages that have impressions but no clicks. Search Engine Journal March 2025 (200 mid market sites): combined inventory (sitemap union GSC union GA4) exceeded each individual list by 10 to 30 percent. The pages in the gap are typically orphans, legacy pages, tag and category archives, paginated results, parameter URLs, and pages the CMS publishes outside the sitemap.

3.1 The Four Source Inventory Build

Source 1: Sitemap fetch. Pull sitemap.xml from the canonical domain. If the site uses a sitemap index, follow the index and fetch each child. Resolve each <loc> to its canonical URL.

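SITEMAP="https://example.com/sitemap.xml"
DIR="/home/user/clients/[clientname]/audits"
extract_locs() { grep -oE '<loc>[^<]+</loc>' | sed -E 's/<\/?loc>//g'; }
ROOT_XML="$(curl -s "${SITEMAP}")"
if printf '%s\n' "${ROOT_XML}" | grep -q '<sitemapindex'; then
  # Sitemap index: fetch each child sitemap and collect its <loc> entries
  for CHILD in $(printf '%s\n' "${ROOT_XML}" | extract_locs); do curl -s "${CHILD}" | extract_locs; done
else
  printf '%s\n' "${ROOT_XML}" | extract_locs
fi | sort -u > "${DIR}/sitemap-urls.txt"

If sitemap.xml itself contains URLs rather than sub-sitemaps, the else branch extracts its <loc> entries directly. Fetching each listed URL as if it were a child sitemap would download the HTML pages and find no <loc> entries.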

Source 2: Full URL crawl. Run Screaming Frog SEO Spider CLI on Linux with content extraction enabled.

screamingfrogseospider --crawl https://example.com/ --headless --save-crawl --export-tabs "Internal:All" --output-folder /home/user/clients/[clientname]/audits/

For Sitebulb headless the equivalent invocation uses sitebulb run --url https://example.com/ --output /home/user/clients/[clientname]/audits/. Either tool produces a CSV with one row per URL and columns for status code, content type, indexability, response time, word count, H1, title, meta description, outbound link counts.

Source 3: GSC discovered URLs. Pull the GSC Search Analytics API for every URL with at least one impression over the last 16 months. Service account JSON stored at /home/user/clients/[clientname]/secrets/gsc.json with siteFullUser permission.

import csv
from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file("/home/user/clients/[clientname]/secrets/gsc.json", scopes=["https://www.googleapis.com/auth/webmasters.readonly"])
service = build("searchconsole", "v1", credentials=creds)
with open("/home/user/clients/[clientname]/audits/gsc-urls.csv", "w", newline="") as f:
    writer = csv.writer(f)  # handles quoting for URLs that contain commas or quotes
    writer.writerow(["url", "clicks", "impressions", "ctr", "position"])
    start_row = 0
    while True:  # each response is capped at rowLimit rows, so page with startRow until a short page returns
        request = {"startDate": "2025-01-14", "endDate": "2026-05-14", "dimensions": ["page"], "rowLimit": 25000, "startRow": start_row}
        rows = service.searchanalytics().query(siteUrl="https://example.com/", body=request).execute().get("rows", [])
        for row in rows:
            writer.writerow([row["keys"][0], row["clicks"], row["impressions"], row["ctr"], row["position"]])
        if len(rows) < 25000:
            break
        start_row += 25000

Source 4: GA4 historical URLs. Pull the GA4 Data API for every page path with at least one session over the last 12 months. Pages in GA4 but not in sitemap are orphans or unindexed pages receiving direct traffic.

import csv
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import DateRange, Dimension, Metric, RunReportRequest

GA4_PROPERTY_ID = "000000000"  # placeholder: replace with the client's numeric GA4 property ID
client = BetaAnalyticsDataClient.from_service_account_file("/home/user/clients/[clientname]/secrets/ga4.json")
request = RunReportRequest(
    property=f"properties/{GA4_PROPERTY_ID}",
    date_ranges=[DateRange(start_date="2025-05-14", end_date="2026-05-14")],
    dimensions=[Dimension(name="pagePath")],
    # "conversions" is the legacy metric name; if the API rejects it, use "keyEvents"
    metrics=[Metric(name="sessions"), Metric(name="averageSessionDuration"), Metric(name="conversions")],
    limit=100000)
response = client.run_report(request)
with open("/home/user/clients/[clientname]/audits/ga4-urls.csv", "w", newline="") as f:
    writer = csv.writer(f)  # handles quoting for paths that contain commas or quotes
    writer.writerow(["path", "sessions", "avg_session_duration", "conversions"])
    for r in response.rows:
        writer.writerow([r.dimension_values[0].value] + [m.value for m in r.metric_values])

3.2 The Merge and Deduplication

The four source CSV files merge into a single canonical inventory CSV. The merge key is the normalized URL (canonical protocol, canonical host, canonical trailing slash, no tracking parameters, no anchor fragments).

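import pandas as pd
from urllib.parse import urlsplit

DIR = "/home/user/clients/[clientname]/audits/"
sitemap = pd.read_csv(DIR+"sitemap-urls.txt", header=None, names=["url"])
# Screaming Frog names these columns "Address" and "Inlinks"; rename them to the names this pipeline uses
crawler = pd.read_csv(DIR+"crawler-urls.csv").rename(columns={"Address": "url", "Inlinks": "internal_inbound_count"})
gsc = pd.read_csv(DIR+"gsc-urls.csv")
ga4 = pd.read_csv(DIR+"ga4-urls.csv")

def normalize(url):
    # Lowercase, drop query string and fragment, add a trailing slash to extensionless paths
    parts = urlsplit(str(url).strip().lower())
    path = parts.path or "/"
    if not path.endswith("/") and "." not in path.rsplit("/", 1)[-1]: path += "/"
    return f"{parts.scheme}://{parts.netloc}{path}"

for df in (sitemap, crawler, gsc): df["url"] = df["url"].apply(normalize)
ga4["url"] = ga4["path"].apply(lambda p: normalize("https://example.com" + p))

# URLs that normalize to the same key must collapse to one row, or the outer merges multiply rows
sitemap, crawler = sitemap.drop_duplicates("url"), crawler.drop_duplicates("url")
gsc = gsc.groupby("url", as_index=False).agg(clicks=("clicks", "sum"), impressions=("impressions", "sum"), position=("position", "mean"))
gsc["ctr"] = gsc["clicks"] / gsc["impressions"]
ga4 = ga4.groupby("url", as_index=False).agg(sessions=("sessions", "sum"), avg_session_duration=("avg_session_duration", "mean"), conversions=("conversions", "sum"))
sitemap["in_sitemap"], crawler["in_crawler"], gsc["in_gsc"], ga4["in_ga4"] = True, True, True, True

inventory = sitemap.merge(crawler, on="url", how="outer").merge(gsc, on="url", how="outer").merge(ga4, on="url", how="outer")
# Outer merges leave NaN where a source never saw the URL; convert the flags to real booleans
flags = ["in_sitemap", "in_crawler", "in_gsc", "in_ga4"]
inventory[flags] = inventory[flags].notna()
inventory.to_csv(DIR+"inventory.csv", index=False)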

Every row is a URL with boolean flags for which sources discovered it plus the performance signals. URLs flagged in only one source are diagnostic: sitemap but not crawler suggests broken internal navigation; crawler but not sitemap suggests missing sitemap entries; GA4 but not crawler or sitemap suggests orphan pages or pages excluded by robots.
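
The diagnostic reading of the source flags can be sketched as a small classifier. This is a minimal sketch: the function name `diagnose_sources` and the label strings are illustrative, and each row is assumed to carry plain Python booleans for the flags (not NaN).

```python
def diagnose_sources(row):
    """Return a diagnostic label for one inventory row's source flags."""
    in_sitemap = bool(row.get("in_sitemap"))
    in_crawler = bool(row.get("in_crawler"))
    in_ga4 = bool(row.get("in_ga4"))

    if in_sitemap and not in_crawler:
        return "broken internal navigation suspected"
    if in_crawler and not in_sitemap:
        return "missing sitemap entry"
    if in_ga4 and not in_crawler and not in_sitemap:
        return "orphan or robots-excluded page"
    return "no mismatch flagged"


# Hypothetical rows for illustration
rows = [
    {"url": "https://example.com/a/", "in_sitemap": True, "in_crawler": False, "in_ga4": True},
    {"url": "https://example.com/b/", "in_sitemap": False, "in_crawler": True, "in_ga4": False},
    {"url": "https://example.com/c/", "in_sitemap": False, "in_crawler": False, "in_ga4": True},
    {"url": "https://example.com/d/", "in_sitemap": True, "in_crawler": True, "in_ga4": True},
]
for row in rows:
    print(row["url"], "->", diagnose_sources(row))
```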

3.3 Sanity Checks and Categorization

Healthy sitemap to combined ratio is 0.7 to 0.9. A ratio of exactly 1.0 (the combined inventory is no larger than the sitemap) indicates the merge failed; a combined inventory more than 2x the sitemap count (ratio below 0.5) indicates URL normalization broke. Each inventory row is categorized by content type (article, guide, landing, product, listing, author, legal, corporate, other), topic cluster (per site taxonomy), age bucket (new 0 to 6 months, young 6 to 12, mature 12 to 24, old 24 to 48, legacy 48 plus), and performance tier (top 10 percent, top 25 percent, median, bottom 25 percent, bottom 10 percent, zero traffic).
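
A minimal sketch of the ratio check and the age bucketing follows. Two readings are assumptions: "above 2x" is taken to mean the combined inventory is more than twice the sitemap count, and each age bucket is treated as half open, so a page exactly 6 months old lands in "young". The function names are illustrative.

```python
def sitemap_ratio_status(sitemap_count, combined_count):
    """Classify the sitemap to combined inventory ratio."""
    if combined_count <= 0:
        raise ValueError("combined inventory is empty")
    ratio = sitemap_count / combined_count
    if ratio >= 1.0:
        return "merge likely failed"
    if ratio < 0.5:  # combined inventory is more than 2x the sitemap
        return "URL normalization likely broke"
    if 0.7 <= ratio <= 0.9:
        return "healthy"
    return "outside healthy band, review sources"


# Upper bound in months (exclusive) and label, per Section 3.3
AGE_BUCKETS = [(6, "new"), (12, "young"), (24, "mature"), (48, "old")]


def age_bucket(age_months):
    """Map a page age in months to its age bucket."""
    for upper, label in AGE_BUCKETS:
        if age_months < upper:
            return label
    return "legacy"


print(sitemap_ratio_status(800, 1000))  # ratio 0.8
print(age_bucket(30))                   # between 24 and 48 months
```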

4. Quality Scoring Rubric 2026

Twelve criteria, each scored 0 to 5, total possible 60. The rubric captures both classic SEO quality signals and the 2026 AI citation signals.

4.1 The Twelve Criteria

C1. Traffic last 12 months. GSC organic clicks over the trailing 12 month window. 0 for zero clicks, 1 for under 50 per year, 2 for 50 to 500, 3 for 500 to 5000, 4 for 5000 to 25000, 5 for over 25000. Adjust bands by site size.

C2. Engagement metrics. GA4 average engagement time per session, scroll depth, bounce rate. 0 for zero or negative engagement (auto bounce, sub 10 second sessions); 1 to 5 banded against the site median.

C3. Conversion contribution. GA4 conversion count attributed to the page over the trailing 12 months. 0 for zero conversions; 1 to 5 banded against site conversion median.

C4. Expert review level. Manual rating of E-E-A-T markers: credentialed byline, declared reviewer for YMYL, first hand experience, primary source citations. Per framework-eeat.md rubric.

C5. Recency. dateModified relative to topic volatility. A page on tax filing dated 2022 is more decayed than a page on the Pythagorean theorem dated 2022. 5 for fully current, 3 for moderate currency, 0 for stale relative to topic. Cross reference framework-contentrefresh.md decay scorecard.

C6. Depth. Word count, section count, topical coverage breadth. Surfer SEO January 2026 (210000 URLs): pages over 2100 words earn featured snippet at 2.7 times the rate of pages under 1000 words. Depth is comprehensive coverage, not word padding. 5 for comprehensive multi section coverage, 3 for solid coverage with gaps, 0 for thin content.

C7. Originality. Information Gain per framework-infogain.md. 5 for multiple original contributions (data, first hand observation, contrarian finding, novel synthesis), 3 for at least one, 0 for entirely derivative.

C8. Accuracy. Factual accuracy of every numeric claim, citation, named source, dated reference. 5 for zero detected errors and current data, 3 for minor inaccuracies, 0 for systematic errors or invented statistics.

C9. Multimedia. Images with alt text and descriptive captions, videos with transcripts, diagrams, charts. 5 for rich multimedia matching content, 3 for adequate, 0 for prose only when topic warrants visuals.

C10. Internal link equity. Count of internal links pointing into the page from topically related pages. Princeton GEO SIGKDD 2024: AI citation probability rises sharply at three or more inbound links from topically related pages. 5 for 10 plus inbound, 3 for 3 to 9, 1 for 1 or 2, 0 for orphan.

C11. Backlink earnings. Referring domains and total inbound links from Ahrefs or Semrush. 5 for 20 plus referring domains, 3 for 5 to 19, 0 for zero.

C12. AI citation presence. Manual sampling on priority queries across Google AI Overview, ChatGPT Search, Perplexity, Claude Search, Bing Copilot. 5 for cited on multiple priority queries across multiple engines, 3 for cited on at least one, 0 for never cited.

4.2 Scoring Tiers

The total maps to a tier that feeds the Section 5 decision matrix. A: exemplar (50 to 60). B: strong (40 to 49). C: serviceable (30 to 39). D: weak, needs intervention (20 to 29). F: candidate for consolidation, redirect, or deletion (0 to 19). Per Search Engine Land November 2025 (47 SaaS audits): the median post audit portfolio distribution is 8 percent A, 22 percent B, 35 percent C, 25 percent D, 10 percent F. A portfolio with over 30 percent F has never had systematic audit; under 5 percent F has likely already had recent pruning.
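
The tier bands translate directly into a lookup. The sketch below uses illustrative names (`tier`, `tier_distribution`); the second helper computes the portfolio distribution so it can be compared against the benchmark percentages quoted above.

```python
from collections import Counter


def tier(total):
    """Map a 0 to 60 rubric total to its Section 4.2 tier."""
    if not 0 <= total <= 60:
        raise ValueError(f"total out of range: {total}")
    if total >= 50:
        return "A"
    if total >= 40:
        return "B"
    if total >= 30:
        return "C"
    if total >= 20:
        return "D"
    return "F"


def tier_distribution(totals):
    """Return the percentage of pages in each tier."""
    counts = Counter(tier(t) for t in totals)
    return {name: counts[name] * 100 / len(totals) for name in "ABCDF"}


# Hypothetical rubric totals for five pages
print(tier_distribution([55, 45, 35, 25, 10]))
```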

4.3 Scoring Time and Automation

Per URL: quick scoring 3 to 5 minutes, standard scoring (full 12 criterion) 12 to 18 minutes, deep scoring (with competitor comparison and multi engine AI sampling) 30 to 60 minutes. For a 1000 URL inventory, full standard scoring is 200 to 300 hours. Sampling becomes economically necessary above 2500 URLs (Section 13). The Python pipeline computes C1, C2, C3, C5 (date portion), C10, C11 directly from API data; the analyst concentrates on C4, C6, C7, C8, C9, C12. The hybrid model cuts per URL scoring time roughly in half.

import pandas as pd
inventory = pd.read_csv("/home/user/clients/[clientname]/audits/inventory.csv")

def score_traffic(c):
    # C1 bands from Section 4.1
    return 0 if c == 0 else (1 if c < 50 else (2 if c < 500 else (3 if c < 5000 else (4 if c < 25000 else 5))))
def score_internal_links(n):
    # C10 bands from Section 4.1
    return 0 if n == 0 else (1 if n < 3 else (3 if n < 10 else 5))

# URLs absent from GSC or the crawl carry NaN; every comparison against NaN is False,
# so without fillna(0) they would fall through to the top score of 5
inventory["C1"] = inventory["clicks"].fillna(0).apply(score_traffic)
inventory["C10"] = inventory["internal_inbound_count"].fillna(0).apply(score_internal_links)
inventory.to_csv("/home/user/clients/[clientname]/audits/inventory-scored.csv", index=False)

5. The Decision Matrix

The decision matrix routes each scored URL to one of five outcomes. The matrix is deterministic given the tier and per criterion thresholds; the analyst exercises judgment only on edge cases.

5.1 The Five Routes

Route	Definition	Reference
Keep	Performing and high quality. Schedule routine refresh.	5.2
Update	Decayed or partially decayed. Targeted improvement.	Section 6
Consolidate	Two or more pages compete for same query. Merge into canonical.	Section 7
Redirect	Dead but holds backlink equity. 301 to topical successor.	8.2
Delete	No traffic, no backlinks, no topical fit. 410 Gone.	8.1

5.2 Keep and Maintain Criteria

All must be true: Tier A or B (total 40 plus); C1 traffic 3 or higher; C5 recency 3 or higher; C8 accuracy 4 or higher; no cannibalization with another page on the same primary query. Keep pages receive scheduled refresh per framework-contentrefresh.md Section 6 cadence with no immediate structural intervention.

5.3 Update and Refresh Criteria

Triggers any: Tier B or C (total 30 to 49) with at least one criterion at 2 or below; Tier A with any criterion at 2 or below; page decayed in last 90 days (28 day click drop of 30 percent or more); page missing AI Overview citation it previously held. Routes to framework-contentrefresh.md Section 7 production workflow with the specific weak criteria as the trigger.
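
The decay trigger can be sketched as a comparison of two 28 day click windows. Which two windows to compare is an assumption here (the most recent 28 days against the 28 days before them); the function name `is_decayed` is illustrative. Integer arithmetic avoids float rounding at the exact 30 percent boundary.

```python
def is_decayed(previous_28d_clicks, latest_28d_clicks, threshold_pct=30):
    """True when clicks dropped by threshold_pct percent or more."""
    if previous_28d_clicks <= 0:
        return False  # no baseline to decay from
    drop = previous_28d_clicks - latest_28d_clicks
    return drop * 100 >= threshold_pct * previous_28d_clicks


print(is_decayed(1000, 700))  # a drop of exactly 30 percent triggers
print(is_decayed(1000, 750))  # a 25 percent drop does not
```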

5.4 Consolidate and Merge Criteria

Triggers: GSC Performance filtered by query shows two or more pages with 100 plus impressions on the same primary query; topical overlap pairs two pages on the same primary topic; one page outranks the other but the loser still earns clicks and backlinks. Routes to Section 7.

5.5 Redirect and Sunset Criteria

Triggers: Tier D or F (total under 30); C1 traffic 0 or 1; C2 engagement 0 or 1; C11 backlink 2 or higher OR C10 internal link 3 or higher; a topical successor exists on the site. Routes to Section 8.2 with internal link cleanup.

5.6 Delete and 410 Criteria

All four required: Tier F (total under 20); C1 traffic 0; C11 backlink 0 or 1; C10 internal link 0 or 1. Routes to Section 8.1.

5.7 The Per Criterion Threshold Table

Criterion	Keep min	Consolidate signal	Redirect signal	Delete signal
C1 traffic	3	2 to 4 (both pages)	0 or 1	0
C2 engagement	3	1 to 4	0 or 1	0
C3 conversions	2	1 to 4	0	0
C4 E-E-A-T	3	1 to 4	0 to 2	0 to 2
C5 recency	3	1 to 4	0 or 1	0
C6 depth	3	1 to 4	0 to 2	0 to 2
C7 originality	3	1 to 4	0 to 2	0 to 2
C8 accuracy	4	2 to 5	0 to 3	0 to 3
C9 multimedia	2	0 to 4	0 to 3	0 to 3
C10 internal links	3	1 to 4	3 or higher	0 or 1
C11 backlinks	2	1 to 4	2 or higher	0 or 1
C12 AI citation	2	1 to 4	0 to 2	0 to 2

The table is enforced by the Python pipeline; the analyst sees a routing recommendation per URL and overrides only on documented edge cases.
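
A minimal sketch of how Sections 5.2 through 5.6 could be enforced in code follows. It is not the pipeline's actual implementation. The precedence order (Consolidate, Delete, Redirect, Keep, Update) is an assumption, since the framework does not state one, and the `route` function, its flag parameters, and the "Analyst review" fallback are hypothetical names.

```python
def route(scores, cannibalized=False, has_successor=False,
          decayed=False, lost_aio_citation=False):
    """Route one URL from its C1..C12 scores (dict of 0 to 5 values)."""
    total = sum(scores.values())
    c1, c2, c5, c8 = scores["C1"], scores["C2"], scores["C5"], scores["C8"]
    c10, c11 = scores["C10"], scores["C11"]

    # 5.4: competing pages are merged rather than refreshed separately
    if cannibalized:
        return "Consolidate"
    # 5.6: all four conditions required
    if total < 20 and c1 == 0 and c11 <= 1 and c10 <= 1:
        return "Delete"
    # 5.5: weak page with link equity and a topical successor
    if (total < 30 and c1 <= 1 and c2 <= 1
            and (c11 >= 2 or c10 >= 3) and has_successor):
        return "Redirect"
    # 5.2: all conditions required, and no active decay signal
    if (total >= 40 and c1 >= 3 and c5 >= 3 and c8 >= 4
            and not (decayed or lost_aio_citation)):
        return "Keep"
    # 5.3: any trigger is enough
    weak_criterion = min(scores.values()) <= 2
    if decayed or lost_aio_citation or (total >= 30 and weak_criterion):
        return "Update"
    return "Analyst review"


def make_scores(default, **overrides):
    """Build a C1..C12 score dict with one default and named overrides."""
    scores = {f"C{i}": default for i in range(1, 13)}
    scores.update(overrides)
    return scores


print(route(make_scores(4)))                            # strong page
print(route(make_scores(0)))                            # dead page
print(route(make_scores(0, C11=3), has_successor=True)) # dead page with backlinks
print(route(make_scores(3, C6=2)))                      # serviceable page, thin depth
```

Pages that match no rule fall through to analyst review rather than to a default route, which keeps the override path explicit.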

6. Update versus Refresh Distinction

The terms update and refresh are often used interchangeably; this framework distinguishes them deliberately because the operational workflows are different.

6.1 The Distinction

Refresh is the rolling decay protection workflow run on healthy Keep tier pages on a scheduled cadence. The workflow lives in framework-contentrefresh.md. Refresh is preventive: a page currently performing well receives a substantive review and modest update at the cadence appropriate to its content type (weekly for news, quarterly for evergreen, semi annual for YMYL). The refresh keeps the page in the AI Overview candidate pool and defends against gradual decay.

Update is the targeted improvement workflow run on flagged Update tier pages after audit identifies a specific weakness. The workflow is similar to refresh (same dateModified discipline, same schema preservation, same changelog requirement) but reactive rather than preventive. An update addresses the specific criteria flagged by the audit rather than reviewing the whole page.

6.2 Why the Distinction Matters

Three operational reasons. Capacity allocation: refresh capacity is the quarterly anchor on the whole portfolio, update capacity is the weekly work queue from audit. Success measurement: refresh ROI is measured by 28 day pre versus 28 day post; update ROI is measured by criterion specific improvement. Audit log entry: refresh log entries reference the cadence trigger, update log entries reference the specific audit finding being addressed.

6.3 The Cross Reference

For Keep tier pages requiring scheduled refresh, route to framework-contentrefresh.md Section 7 production workflow with quarterly anchor cadence. For Update tier pages flagged by Section 5.3 criteria, route to that same workflow with the specific audit finding as the trigger documentation. The anti pattern: updating a Keep tier page that does not need work because "it has been six months." Refresh is calendar based; audit driven update is trigger based by definition.

7. Consolidation Methodology

Consolidation merges two or more pages competing for the same query into a single stronger canonical. Search Engine Journal April 2025 (114 audits): mean 12.4 cannibalization pairs per audit on portfolios over 500 pages.

7.1 Cannibalization Detection

Step 1: GSC query mining. Use the same GSC API client as Section 3 with dimensions=["query", "page"] over a 6 month window. Queries where two or more URLs each reach 100 impressions are candidates: pairs = df.groupby("query").filter(lambda g: (g["impressions"] >= 100).sum() >= 2). Summing impressions across the group instead would flag queries where one strong page and one negligible page appear together.

Step 2: Topical overlap grouping. Within the inventory categorization, group pages by topic cluster. Pairs within the same cluster sharing primary keywords above 50 percent are candidates.

Step 3: Backlink overlap analysis. Pull referring domains for each candidate URL from Ahrefs API. If the two URLs share more than 30 percent of referring domains, consolidation compounds their backlink profile.
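
Steps 1 and 3 can be sketched without pandas. The input shapes are assumptions: GSC rows as `(query, page, impressions)` tuples and referring domains as plain collections of hostnames. The framework does not say which page's profile the 30 percent share is measured against, so this sketch measures it against the smaller of the two.

```python
from collections import defaultdict


def cannibalization_candidates(rows, min_impressions=100):
    """Return queries where two or more pages each reach min_impressions."""
    pages_by_query = defaultdict(set)
    for query, page, impressions in rows:
        if impressions >= min_impressions:
            pages_by_query[query].add(page)
    return {q: sorted(p) for q, p in pages_by_query.items() if len(p) >= 2}


def referring_domain_overlap(domains_a, domains_b):
    """Share of the smaller referring domain set that both pages have."""
    a, b = set(domains_a), set(domains_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical data for illustration
gsc_rows = [
    ("s corp vs llc", "/s-corp-vs-llc/", 450),
    ("s corp vs llc", "/llc-or-s-corp/", 180),
    ("s corp vs llc", "/entity-faq/", 12),
    ("estimated taxes", "/quarterly-estimated-taxes-2026/", 900),
]
print(cannibalization_candidates(gsc_rows))

overlap = referring_domain_overlap(
    {"a.example", "b.example", "c.example", "d.example"},
    {"a.example", "b.example", "e.example"},
)
print(overlap > 0.30)
```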

7.2 The Five Comparison Axes

The page winning on three or more axes is the canonical; the other becomes the loser. Axes: traffic (12 month organic clicks); conversion (GA4 attributed conversions); backlinks (Ahrefs referring domains, weighted by authority); AI citation (manual sampling on target query); Information Gain (manual review of original contributions per framework-infogain.md). A page winning on traffic but losing on backlinks and AI citation is not automatically the canonical; the consolidation must preserve the loser's link equity and Information Gain by merging them into the canonical before the redirect.
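
The three of five rule can be sketched as a simple axis count. It assumes the analyst has already reduced each axis, including the manual AI citation and Information Gain reviews, to a comparable number; the key names and the `pick_canonical` function are illustrative.

```python
AXES = ("traffic", "conversion", "backlinks", "ai_citation", "information_gain")


def pick_canonical(page_a, page_b):
    """Return the URL that wins three or more axes, or None if neither does."""
    wins_a = sum(page_a[axis] > page_b[axis] for axis in AXES)
    wins_b = sum(page_b[axis] > page_a[axis] for axis in AXES)
    if wins_a >= 3:
        return page_a["url"]
    if wins_b >= 3:
        return page_b["url"]
    return None  # ties on too many axes; the analyst decides


# Hypothetical pair: page_a leads on traffic only
page_a = {"url": "/tax-tips-for-freelancers/", "traffic": 412, "conversion": 2,
          "backlinks": 3, "ai_citation": 0, "information_gain": 1}
page_b = {"url": "/quarterly-estimated-taxes-2026/", "traffic": 300, "conversion": 9,
          "backlinks": 14, "ai_citation": 2, "information_gain": 3}
print(pick_canonical(page_a, page_b))
```

Here the page with more traffic loses on the other four axes, which is the case the paragraph above warns about.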

7.3 The Merge Workflow

1. Identify canonical and loser per Section 7.2.
2. Inventory unique content in the loser the canonical lacks (H2 sections, FAQ entries, data tables, examples, case studies, expert quotes).
3. Merge unique content into the canonical. Add new sections where appropriate. Update H1 if merged scope warrants.
4. Update schema. Article headline matches new H1. FAQPage extends with new questions. HowTo extends if procedural content merged. dateModified updates to today.
5. Update internal link strategy to reflect the consolidated scope.
6. Configure 301: location = /old-loser-path/ { return 301 /canonical-path/; }
7. Update internal links sitewide pointing to the loser to point to the canonical directly.
8. Submit canonical to IndexNow (framework-contentrefresh.md Section 7.11) and GSC URL Inspection.
9. Remove loser URL from sitemap.xml.
10. Log consolidation in audit log with both URLs, comparison scores, canonical's pre consolidation 28 day baseline.

7.4 Internal Link Rewiring Script

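LOSER="/old-loser-path/"; CANONICAL="/canonical-path/"
SITE_ROOT="/var/www/sites/[domain]/"
# Read each file fully before writing it back; opening it for writing first would truncate it to zero bytes
grep -rlF "${LOSER}" "${SITE_ROOT}" --include="*.html" --include="*.md" | while IFS= read -r FILE; do
  python3 -c "import sys, pathlib; p = pathlib.Path(sys.argv[1]); p.write_text(p.read_text().replace(sys.argv[2], sys.argv[3]))" "${FILE}" "${LOSER}" "${CANONICAL}"
done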

Test on a staging copy before running against production.

7.5 The nginx Redirect Pattern

For a single redirect, the inline location block above. For dozens or hundreds, the map directive scales:

map $request_uri $consolidation_redirect {
    /old-loser-path-a/  /canonical-a/;
    /old-loser-path-b/  /canonical-b/;
    default             "";
}
server {
    listen 443 ssl http2;
    server_name example.com;
    location / {
        if ($consolidation_redirect != "") { return 301 $consolidation_redirect; }
        try_files $uri $uri/ /index.html;
    }
}
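
When the redirect list comes out of the audit CSV, the map block can be generated rather than typed by hand. The sketch below is a hypothetical helper (`build_redirect_map` is not part of the documented toolchain). It mirrors the map block above and rejects paths that could break the nginx syntax.

```python
def build_redirect_map(pairs):
    """Render an nginx map block from (loser_path, canonical_path) pairs."""
    redirects = {}
    for loser, canonical in pairs:
        for path in (loser, canonical):
            if not path.startswith("/") or any(ch in path for ch in " ;{}\t\n"):
                raise ValueError(f"unsafe path for nginx map: {path!r}")
        if loser == canonical:
            raise ValueError(f"redirect loop: {loser!r}")
        redirects[loser] = canonical

    # Pad the left column so the generated block stays readable
    width = max([len(p) for p in redirects] + [len("default")])
    lines = ["map $request_uri $consolidation_redirect {"]
    for loser in sorted(redirects):
        lines.append("    " + loser.ljust(width) + "  " + redirects[loser] + ";")
    lines.append("    " + "default".ljust(width) + '  "";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical consolidation pairs
print(build_redirect_map([
    ("/old-loser-path-b/", "/canonical-b/"),
    ("/old-loser-path-a/", "/canonical-a/"),
]))
```

Writing the output to a file and running nginx -t before reload keeps a malformed entry from taking the site down.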

7.6 Consolidation ROI Measurement

Capture canonical's 28 day pre consolidation baseline (clicks, impressions, average position, conversions). After publish and 301 propagation (1 to 4 weeks), capture 28 day post metrics. Ahrefs case study compilation (2024 to 2025): median 47 percent organic click lift on the canonical at 90 days post consolidation across 23 documented consolidations. If consolidation produces zero or negative lift: the merge was cosmetic, or the 301 failed. Validate:

curl -sI "https://example.com/old-loser-path/" | head -1     # Expect: HTTP/2 301
curl -sI "https://example.com/old-loser-path/" | grep -i "^location:"
curl -sI "https://example.com/canonical-path/" | head -1     # Expect: HTTP/2 200

GSC URL Inspection on both URLs: the loser should report "URL is not on Google" with the redirect to the canonical as the reason, the canonical should report "URL is on Google" with itself as the Google selected canonical.
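
The pre versus post comparison can be sketched as a per metric percent change. The dictionary keys and the `consolidation_lift` name are illustrative, and the figures below are hypothetical, not measured results. Average position is reported as a raw change because a lower number is better.

```python
def consolidation_lift(pre, post):
    """Compare 28 day pre and post windows for the canonical."""
    lift = {}
    for metric in ("clicks", "impressions", "conversions"):
        if pre[metric] > 0:
            lift[metric] = (post[metric] - pre[metric]) * 100 / pre[metric]
        else:
            lift[metric] = None  # no baseline, percent change undefined
    # Positive value means the page moved up the results
    lift["position_change"] = pre["position"] - post["position"]
    return lift


pre = {"clicks": 400, "impressions": 10000, "conversions": 0, "position": 8.4}
post = {"clicks": 588, "impressions": 12500, "conversions": 3, "position": 6.1}
result = consolidation_lift(pre, post)
print(result)
if result["clicks"] is not None and result["clicks"] <= 0:
    print("Zero or negative lift: check the merge and the 301")
```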

8. Sunset and Pruning Protocol

Sunset removes pages with no salvage value (410) or pages with backlink equity but no traffic (301 to topical successor). Pruning is the portfolio level discipline of recurring sunset.

8.1 The 410 Gone Workflow

410 is a cleaner signal than 404: "this page is gone permanently, do not re index." Trigger criteria (Section 5.6): Tier F, C1=0, C11=0 or 1, C10=0 or 1. Workflow: verify no internal links flow from priority pages (if any do, route to 301 or remove the internal links first); verify no significant backlinks (if any referring domain above DR 30 exists, route to 301 instead); configure location = /dead-page-path/ { return 410; }; remove URL from sitemap.xml; remove from navigation, related posts widgets, category archives; submit GSC URL Removal for sensitive URLs needing faster de indexing; log the deletion. Validate: curl -sI "https://example.com/dead-page-path/" | head -1 expects HTTP/2 410.

8.2 The 301 Redirect Workflow

301 is appropriate when the page has backlink equity worth preserving or where a clear topical successor exists. Trigger criteria (Section 5.5). Workflow: identify the closest topical successor (must genuinely cover the same or a parent topic; do not redirect to homepage or generic category archive, which Google treats as soft 404 per Intero Digital 2025 guidance); configure 301; remove the dead URL from sitemap.xml; update internal links sitewide to point to the successor directly; submit successor to IndexNow; submit successor for GSC URL Inspection; log the redirect.

8.3 The Noindex Alternative

For pages that should not be deleted or redirected but should not earn search visibility: noindex. Appropriate for legal pages (terms, privacy, dmca), low value archive pages with historical importance, category or tag archives that serve navigation but not search. Use <meta name="robots" content="noindex, follow">. follow allows link equity to flow through the page while the page itself does not rank. noindex, nofollow severs link equity.

8.4 The 1 Percent Rule and the Pruning Lift

Ahrefs August 2025 (50 case studies, mid market sites): pruning the bottom 30 percent of pages by 12 month organic clicks lifted overall site organic clicks by 1 to 23 percent in 47 of 50 cases. Mean lift was 7.4 percent at 90 days, 9.1 percent at 180 days. The lift is sitewide. Mechanism: the site's perceived quality rises as the weakest pages are removed, crawl budget reallocates to surviving pages, internal link equity concentrates on stronger pages. The 1 percent rule: even a site that prunes only the dead orphan pages at the bottom typically sees measurable lift within 90 days.

8.5 The Danger of Pruning Pages with Backlinks

The most common pruning mistake is 410 on a page that has backlinks. The backlinks become broken; link equity dissipates. Mitigation: before any 410, run an Ahrefs check; if any referring domain above DR 30 exists (the Section 8.1 threshold), route to 301 instead. For 301, the destination must be topically relevant (a 301 from a recipe page to the contact page transfers no equity; Google detects the mismatch and treats it as a soft 404). Pages with backlinks but no traffic almost always have a 301 destination on the site; the audit job is finding that destination, not deleting the page outright.

8.6 The Recurring Pruning Cadence

Pruning is not a one time activity. Quarterly: pruning sweep on pages in the bottom decile by trailing 12 month clicks; apply Section 5.6 criteria; route to 410 or 301. Annually: portfolio level pruning review; aggregate the year's pruning activity; assess lift; recalibrate the bottom decile threshold for the next year.
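
Selecting the quarterly sweep candidates can be sketched as a sort and slice. Rounding the decile size up, so that small portfolios still yield at least one candidate, is an assumption; `bottom_decile` is an illustrative name. The selected pages still have to pass the Section 5.6 criteria before any 410.

```python
def bottom_decile(pages):
    """Return the lowest 10 percent of (url, clicks_12mo) pairs by clicks."""
    ranked = sorted(pages, key=lambda page: page[1])
    cutoff = -(-len(ranked) // 10)  # ceiling division without float error
    return ranked[:cutoff]


# Hypothetical portfolio: 20 pages with 0, 50, 100, ... trailing 12 month clicks
portfolio = [(f"/page-{i}/", i * 50) for i in range(20)]
print(bottom_decile(portfolio))
```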

9. Page Level Audit Template

Eight columns populated for every URL, exported as the canonical audit deliverable.

9.1 The Eight Field Template

Field	Source	Definition
URL	Inventory CSV	Canonical URL
Primary topic	Analyst	One topic phrase from the site's topical taxonomy
Traffic	GSC API	12 month organic clicks
Engagement	GA4 API	Engagement time, scroll depth, conversion count
Last updated	dateModified	Most recent substantive update date
Decision	Section 5	Keep, Update, Consolidate, Redirect, Delete
Action	Workflow	Specific actions and section reference
Owner	Analyst/client	Person responsible for executing the action

9.2 The Markdown Table Format

| URL | Primary topic | Traffic | Engagement | Last updated | Decision | Action | Owner |
|---|---|---|---|---|---|---|---|
| /quarterly-estimated-taxes-2026/ | Tax compliance | 4218 clicks | 3m 12s, 71% scroll | 2026-02-03 | Keep | Refresh in Q3 per cadence | Amanda |
| /s-corp-vs-llc/ | Entity formation | 1102 clicks | 1m 48s, 38% scroll | 2024-11-12 | Update | Add 2026 tax law section, refresh FAQ | Amanda |
| /old-blog-post-2019/ | Legacy | 0 clicks | n/a | 2019-08-14 | Delete | 410 Gone, remove from sitemap | Joseph |
| /best-crm-tools-2023/ | Software comparison | 78 clicks | 0m 42s, 12% scroll | 2023-04-22 | Redirect | 301 to /best-crm-tools-2026/ | Joseph |
| /tax-tips-for-freelancers/ | Tax compliance | 412 clicks | 2m 04s, 54% scroll | 2024-06-08 | Consolidate | Merge into /quarterly-estimated-taxes-2026/ | Amanda |

9.3 The CSV Export

The markdown table exports to CSV with three additional column groups: per criterion scores (C1 to C12, one column each), total score, and next action date. Next action date is computed per decision as an offset from the audit date: Keep +90 days, Update +14, Consolidate +7, Redirect +7, Delete +3. The CSV uploads to Nextcloud or imports into a Google Sheet for client review.
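
The next action date arithmetic is simple enough to sketch directly. The offsets below are the ones listed above; the function and constant names are illustrative assumptions, and the real notebook may compute this differently:

```python
from datetime import date, timedelta

# Offsets in days from the audit date, per decision.
NEXT_ACTION_OFFSETS = {
    "Keep": 90,
    "Update": 14,
    "Consolidate": 7,
    "Redirect": 7,
    "Delete": 3,
}


def next_action_date(decision, audit_date):
    """Return the date by which the decision should next be acted on."""
    return audit_date + timedelta(days=NEXT_ACTION_OFFSETS[decision])


audit_day = date(2026, 5, 14)
for decision in NEXT_ACTION_OFFSETS:
    print(decision, next_action_date(decision, audit_day).isoformat())
```

An unknown decision label raises KeyError, which is deliberate here: a typo in the Decision column should fail loudly rather than produce a silent default.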

9.4 The Owner and Action Fields

Owner is the person responsible. Most often the client's content team for Update and Consolidate, the agency or Joseph for Redirect and Delete (technical operations). The owner column drives the work queue: the analyst exports the audit CSV filtered to "owner = Amanda" and sends it to Amanda as her quarterly action list.

Action is specific workflow steps. Action is not "update the page"; action is "add 2026 tax law section, refresh FAQ block to include new Q on safe harbor, update dateModified per Section 8 substantive standard." Specificity is what makes the audit actionable. Action references back to this framework's section numbers or to framework-contentrefresh.md section numbers.

9.5 The Audit Log

The audit log is the running record of every audit decision made on a site. One row per decision. The log persists across audits.

audit_log_entry:
  date: 2026-05-14
  audit_id: 2026-Q2
  url: https://example.com/old-blog-post-2019/
  decision: Delete
  rationale: Tier F. C1=0, C11=0, C10=0. No backlinks, no traffic, no internal link equity.
  action_taken: 410 Gone, removed from sitemap, GSC URL Removal submitted.
  executed_by: Joseph
  executed_date: 2026-05-15
  validation: {curl_status: 410, sitemap_removed: true, gsc_removal_submitted: true}

Stored at /home/user/clients/[clientname]/audits/[quarter]/audit-log.yaml (the layout in Section 14.3) and committed to the client's documentation system.

10. Section Level Audit

Many pages do not fit cleanly into a single route because parts are healthy and parts decayed. Section level audit is the workflow for those mixed pages.

10.1 When Section Level Audit Is Required

Three signals: page is Tier B or A overall but one specific criterion is at 2 or below; page has multiple H2 sections covering distinct sub topics, and GSC query mining shows traffic concentrated on one section's queries while others receive none; page is a pillar page with multiple cluster topic sections, and the cluster audit shows one section is incomplete while others are comprehensive.

10.2 The Section Inventory

For a page subject to section level audit, inventory every H2 and major H3:

page_url: https://example.com/quarterly-estimated-taxes-2026/
sections:
  - {heading: "What are quarterly estimated taxes", type: definition, word_count: 240, last_modified: 2026-02-03, section_score: 5, decision: keep}
  - {heading: "How to calculate your safe harbor amount", type: procedure, word_count: 620, last_modified: 2024-08-12, section_score: 2, decision: update}
  - {heading: "Common mistakes and penalties", type: list, word_count: 410, last_modified: 2023-11-08, section_score: 1, decision: update}
  - {heading: "FAQ", type: faqpage, word_count: 740, last_modified: 2026-02-03, section_score: 4, decision: keep}

Each section gets its own decision (keep, update, expand, remove). The page level decision becomes the union of section level decisions.
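
One way to read "union of section level decisions" is that any section needing a change makes the page an Update. The sketch below encodes that reading, which is an assumption rather than a rule the framework spells out; the section data mirrors the inventory above and the function name is hypothetical:

```python
def page_decision(section_decisions):
    """Collapse per-section decisions into a page level decision.

    Assumption: any section marked update, expand, or remove makes the page
    an Update; a page whose sections are all keep stays Keep.
    """
    changing = {"update", "expand", "remove"}
    return "Update" if any(d in changing for d in section_decisions) else "Keep"


sections = {
    "What are quarterly estimated taxes": "keep",
    "How to calculate your safe harbor amount": "update",
    "Common mistakes and penalties": "update",
    "FAQ": "keep",
}
print(page_decision(sections.values()))

# The refresh touches only the sections that are not keep.
to_refresh = [heading for heading, d in sections.items() if d != "keep"]
print(to_refresh)
```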

10.3 The Section Level Decisions

Four decisions per section: Keep (no change). Update (targeted refresh). Expand (section covers the topic shallowly; add depth, examples, original data). Remove (no longer relevant or outdated beyond salvage; delete from the page). Update and Expand both produce content changes that trigger dateModified update per Section 8 of framework-contentrefresh.md. Remove produces content reduction; the changelog entry documents the removal.

10.4 The Section Level Refresh Workflow

For a page where the page level decision is Update and section inventory shows two sections in Update and three in Keep, the refresh execution touches only the two Update sections. The Keep sections stay byte for byte identical. The dateModified updates because the page received substantive change; the changelog entry documents which sections changed (e.g., "May 14, 2026: Updated 'How to calculate your safe harbor amount' with 2026 IRS amounts. Rewrote 'Common mistakes and penalties' to include 2026 penalty rates."). Changelog specificity demonstrates the dateModified is honest: only the listed sections changed.

10.5 The Pillar Page Section Audit

Pillar pages (long form hubs with many H2 sections, each covering one cluster sub topic) benefit most from section level audit. A pillar page with 12 H2 sections may have eight at Tier A, two at Tier C, two at Tier F. The page level decision is Update; the section level decision is Update on the C sections and Remove on the F sections. Removed sections from a pillar page often become standalone pages if the section had standalone value (content extracts to a new URL; the pillar links out to it).

10.6 The Section Level Audit Cadence

Section level audit runs on the same cadence as page level audit for pillar pages. For standard pages, section level audit runs only when triggered by the page level decision being Update with criterion specificity. Default is page level; section level is the targeted expansion when page level is insufficient.

11. AI Citation Audit Layer

New in 2026: pages that earn AI Overview citations have different optimization profiles than pages earning featured snippets. The AI citation audit layer captures which pages are cited in which AI surfaces and identifies the page patterns earning citation.

11.1 Why AI Citation Is a Distinct Audit Layer

Three reasons. First, AI citation is increasingly decoupled from classic ranking; Ahrefs February 2026 (863000 keywords) found only 38 percent of AI Overview cited pages also rank in top 10 organic. Second, citation earning page patterns are distinctive: FAQ blocks and definition paragraphs disproportionately earn citation. Third, citation status is volatile; AI Overview content changes 70 percent of the time on re run. Sustained citation across a 28 day window is the meaningful signal.

11.2 The Priority Query Tracking Set

10 to 25 queries the site targets for AI citation, chosen by commercial value and topical relevance. For each priority query, manual weekly sampling captures whether the site appears in the AI Overview citation list. Each tracked query records target_page and weekly cited booleans over a 12 week rolling window. Sustained citation is at least 8 of 12 weeks cited.
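
The sustained citation rule can be checked mechanically once the weekly samples are recorded. A minimal sketch, assuming each tracked query stores its weekly cited booleans oldest first; the field names follow the description above, while the query string and function name are hypothetical:

```python
def is_sustained(weekly_cited, window=12, min_cited=8):
    """True when at least min_cited of the last `window` weekly samples were cited."""
    recent = weekly_cited[-window:]
    # Require a full window so a new query cannot qualify on partial data.
    return len(recent) == window and sum(recent) >= min_cited


tracked = {
    "query": "quarterly estimated taxes 2026",  # hypothetical priority query
    "target_page": "/quarterly-estimated-taxes-2026/",
    "weekly_cited": [True, True, False, True, True, True,
                     False, True, True, True, False, True],
}
print(tracked["query"], is_sustained(tracked["weekly_cited"]))
```

Requiring a full 12 week window is a design choice in this sketch; a query tracked for only eight weeks, all cited, is reported as not yet sustained.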

11.3 The Citation Earning Section Analysis

For each priority query where the page is cited, identify which section the AI Overview extracts. The cited snippet usually corresponds to one or two specific paragraphs or one FAQ entry. Manual inspection: run the priority query, observe the AI Overview, click into the citation, read the cited block, identify the section structure (FAQ entry, definition paragraph, numbered list, data table), catalog the pattern. Common high citation patterns: 40 to 60 word definition paragraphs immediately after the H2 question; FAQ block with question as <summary> and answer as <p> inside <details>; comparison tables with header row plus 3 to 5 data rows; numbered procedural lists with each step under 30 words.

11.4 The Pattern Replication Recommendation

Once the citation earning sections are cataloged, the audit recommends replicating those patterns sitewide. If FAQ blocks on the site's tax content earn citation but FAQ blocks on the site's other content are absent, the audit recommends adding parallel FAQ blocks. Pattern A: 40 to 60 word definition immediately after H2 question. Pattern B: FAQPage block with details/summary. Each pattern recommendation lists the URLs currently earning citation and the URLs targeted for replication.

11.5 The Citation Loss Trigger

If a page previously holding citation on a priority query loses citation for two consecutive weeks (per framework-aioverviews.md Section 9.7 stability rule), the page enters the Update queue with trigger "AI Overview citation loss." The Update is targeted: refresh the citation earning section, update dateModified, document the change in the changelog.
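
The trigger condition can be sketched as a check on the tail of the weekly series. This assumes an oldest first list of weekly cited booleans, as recorded for priority query tracking; the function names and the queue entry shape are hypothetical:

```python
def citation_loss_triggered(weekly_cited):
    """True when a cited week is followed by two consecutive uncited weeks
    at the end of the series (most recent sample last)."""
    if len(weekly_cited) < 3:
        return False
    previously_cited, week_before, latest = weekly_cited[-3:]
    return previously_cited and not week_before and not latest


def update_queue_entry(url):
    """Build the Update queue row for a page that lost its citation."""
    return {"url": url, "decision": "Update", "trigger": "AI Overview citation loss"}


history = [True, True, True, False, False]
if citation_loss_triggered(history):
    print(update_queue_entry("/quarterly-estimated-taxes-2026/"))
```

The check fires only in the week the second miss is recorded, so a page that stays uncited is queued once rather than every week.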

11.6 The Multi Engine Citation Audit

Beyond Google AI Overview, the audit samples ChatGPT Search, Claude Search, Perplexity, Bing Copilot, and Meta AI on the same priority queries. A per query sample captures citation status across all six surfaces. Pages cited on three or more engines are strong; pages cited on one or zero are candidates for the citation defense workflow. Cross reference framework-aicitations.md.

12. Topical Cluster Audit

The cluster level companion to the page level audit. Where page level asks "is this URL doing its job," cluster level asks "is this cluster complete and coherent."

12.1 The Cluster Taxonomy

The cluster taxonomy is the site's declared topical organization: a list of clusters, each with a defined scope and a list of sub topics. The taxonomy lives in /home/user/clients/[clientname]/topical-taxonomy.yaml.

clusters:
  - name: Tax compliance
    pillar_page: /tax-compliance/
    sub_topics: [quarterly estimated taxes, tax deadlines, safe harbor amounts, penalties and interest, filing extensions, amended returns]
    cluster_pages: [/quarterly-estimated-taxes-2026/, /tax-deadlines-2026/, /safe-harbor-amounts/, /tax-penalties-interest/, /filing-extensions/]

12.2 The Gap Analysis

For each cluster, compare declared sub_topics against cluster_pages. Sub topics without a corresponding page are gaps. The output is the cluster expansion queue: net new pages the site should write. Each gap row records sub_topic, recommended URL, priority, and rationale (typically GSC query mining showing impressions for queries with no site page targeting).
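
A first pass at the gap analysis can be automated by matching sub topic words against page slugs. This is a heuristic sketch under that assumption (the framework does not specify how sub topics map to pages), so an analyst should review its output; the function names and stopword list are illustrative:

```python
import re

STOPWORDS = {"and", "or", "of", "the", "for", "to"}


def tokens(text):
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def find_gaps(cluster):
    """Return sub topics whose words are not all present in any cluster page slug."""
    slugs = [tokens(page) for page in cluster["cluster_pages"]]
    return [
        topic for topic in cluster["sub_topics"]
        if not any(tokens(topic) <= slug for slug in slugs)
    ]


cluster = {
    "name": "Tax compliance",
    "sub_topics": ["quarterly estimated taxes", "tax deadlines", "safe harbor amounts",
                   "penalties and interest", "filing extensions", "amended returns"],
    "cluster_pages": ["/quarterly-estimated-taxes-2026/", "/tax-deadlines-2026/",
                      "/safe-harbor-amounts/", "/tax-penalties-interest/",
                      "/filing-extensions/"],
}
print(find_gaps(cluster))
```

On the Tax compliance cluster from Section 12.1 this flags "amended returns" as the only sub topic without a page. Slug matching misses pages whose URL wording differs from the taxonomy wording, which is why the result is a candidate list rather than a final expansion queue.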

12.3 Orphan Page Detection

Pages in the inventory that do not fit any declared cluster. If the page has traffic and quality, expand the taxonomy to include a new cluster or extend an existing one. If neither traffic nor quality, route to Section 5 decision matrix (likely Redirect or Delete). If marginal traffic but no cluster fit, route to consolidation with a topically adjacent page.

import pandas as pd
import yaml
inventory = pd.read_csv("/home/user/clients/[clientname]/audits/inventory-scored.csv")
with open("/home/user/clients/[clientname]/topical-taxonomy.yaml") as fh:
    taxonomy = yaml.safe_load(fh)
# Pillar pages count as cluster members so they are not flagged as orphans.
cluster_pages = {p for c in taxonomy["clusters"] for p in [c["pillar_page"], *c["cluster_pages"]]}
# Strip the origin so inventory URLs compare against the taxonomy's site relative paths.
paths = inventory["url"].str.replace("https://example.com", "", regex=False)
inventory["is_orphan"] = ~paths.isin(cluster_pages)
inventory[inventory["is_orphan"]].to_csv("/home/user/clients/[clientname]/audits/orphan-pages.csv", index=False)

12.4 Cluster Coherence Check

Within each cluster, check that cluster pages link to each other. The pillar page links out to every cluster page; every cluster page links back to the pillar page; cluster pages link laterally to topically adjacent cluster pages.

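# Checks pillar -> cluster and cluster -> pillar links in the rendered HTML.
# Lateral links between cluster pages are not checked here.
PILLAR="/tax-compliance/"
CLUSTER_PAGES=("/quarterly-estimated-taxes-2026/" "/tax-deadlines-2026/" "/safe-harbor-amounts/" "/tax-penalties-interest/" "/filing-extensions/")
SITE_ROOT="/var/www/sites/[domain]"   # no trailing slash: page paths already start with /

for CP in "${CLUSTER_PAGES[@]}"; do
  # -F matches the path as a literal string rather than a regular expression.
  grep -qF "${CP}" "${SITE_ROOT}${PILLAR}index.html" && echo "PASS pillar->${CP}" || echo "FAIL pillar missing ${CP}"
  grep -qF "${PILLAR}" "${SITE_ROOT}${CP}index.html" && echo "PASS ${CP}->pillar" || echo "FAIL ${CP} missing pillar"
done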

Cluster coherence failures route to framework-internallinking.md for remediation.

12.5 Cluster Health Score and Cross References

Compute a cluster health score from gap count, orphan count, coherence pass rate, average page level score across cluster pages. Clusters scoring under 30 are at risk; recommendations include gap filling, page level Update on lowest scoring cluster pages, coherence repair. Two clusters claiming overlapping sub topics is cross cluster cannibalization; the audit flags overlapping sub topics and recommends taxonomy revision before further work. The topical cluster audit feeds framework-topicalauthority.md. This framework identifies what work is needed; that framework specifies how the work is executed.
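
The cross cluster cannibalization check amounts to intersecting the declared sub topic lists. A minimal sketch, assuming overlap means an identical sub topic string appearing in two clusters; the second cluster below is hypothetical example data and the function name is illustrative:

```python
from itertools import combinations


def overlapping_sub_topics(clusters):
    """Return {(cluster_a, cluster_b): shared sub topics} for every overlapping pair."""
    overlaps = {}
    for a, b in combinations(clusters, 2):
        shared = set(a["sub_topics"]) & set(b["sub_topics"])
        if shared:
            overlaps[(a["name"], b["name"])] = sorted(shared)
    return overlaps


clusters = [
    {"name": "Tax compliance",
     "sub_topics": ["quarterly estimated taxes", "tax deadlines", "filing extensions"]},
    {"name": "Entity formation",  # hypothetical cluster for illustration
     "sub_topics": ["s corp vs llc", "filing extensions"]},
]
print(overlapping_sub_topics(clusters))
```

Exact string matching only catches sub topics declared with identical wording; near duplicates such as "extensions" and "filing extensions" still need an analyst's eye.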

13. Audit Cadence and Velocity

Audit cadence depends on portfolio size and content velocity. A 100 URL site needs less frequent audit than a 5000 URL publisher; a publisher writing 10 new articles per week needs more frequent audit than a static reference site.

13.1 The Size Tiers

Tier	URL count	Audit cadence
Small	100 to 500	Annual deep + quarterly light
Medium	500 to 5000	Quarterly section + monthly spot
Large	5000 plus	Continuous rolling via prioritization queue

13.2 The Small Site Cadence (100 to 500 URLs)

Annual deep audit: full inventory rebuild, full 12 criterion scoring, full decision routing, full audit log. 40 to 80 hours. Quarterly light audit: refresh inventory performance columns (C1, C2, C3, C5 dateModified, C11, C12), rescore the refreshed criteria, action any decision changes. 8 to 16 hours.

13.3 The Medium Site Cadence (500 to 5000 URLs)

Quarterly section audit: rotate through the site in four quarterly sections (Q1 audits one fourth organized by topic cluster or by age, Q2 the next fourth, and so on). By year end every URL has had one deep audit. Monthly spot audit: top 50 by traffic plus bottom 50 by Tier F risk plus any pages decayed in the last 28 days. 4 to 8 hours per month.

13.4 The Large Site Cadence (5000 plus URLs)

Continuous rolling audit: the inventory feeds a prioritization queue ranking every URL by audit priority; the analyst pulls from the top continuously. 8 to 20 hours per week sustained.

audit_priority = days_since_last_audit * traffic_value * volatility_score

Where traffic_value is GSC clicks times conversion rate and volatility_score is the absolute value of the 28 day click delta. Days since last audit is a multiplier, not a divisor, so staleness raises priority. The top of the queue is high traffic, high volatility pages that have gone longest without an audit.
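
The prioritization queue can be sketched as a sort over the inventory. This sketch uses days_since_last_audit as a multiplier so that pages left unaudited longest rise, which is the ordering the queue is meant to produce; the rows and field names below are hypothetical:

```python
def audit_priority(days_since_last_audit, clicks, conversion_rate, click_delta_28d):
    """Score a URL for the rolling audit queue; higher scores are audited first."""
    traffic_value = clicks * conversion_rate   # GSC clicks times conversion rate
    volatility_score = abs(click_delta_28d)    # absolute 28 day click delta
    return days_since_last_audit * traffic_value * volatility_score


# Hypothetical inventory rows for illustration only.
pages = [
    {"url": "/b/", "days": 30, "clicks": 4000, "cr": 0.02, "delta": -300},
    {"url": "/c/", "days": 200, "clicks": 100, "cr": 0.02, "delta": 5},
    {"url": "/a/", "days": 200, "clicks": 4000, "cr": 0.02, "delta": -300},
]
queue = sorted(
    pages,
    key=lambda p: audit_priority(p["days"], p["clicks"], p["cr"], p["delta"]),
    reverse=True,
)
print([p["url"] for p in queue])
```

Because the score is a pure product, a page with a zero click delta scores zero regardless of traffic; a production queue may want a small floor on volatility_score so stable high traffic pages are still audited eventually.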

13.5 Sampling for Very Large Sites

For sites over 25000 URLs, full audit is economically infeasible. Stratified sampling produces statistically valid portfolio insights from a fraction of the work. Stratify by performance tier; sample 100 URLs per stratum (600 total); score with the full rubric; project distribution across the full inventory based on stratum proportions. Individual URL decisions for un sampled URLs come from automated heuristics (per criterion thresholds applied to API derived scores).
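
The stratified sampling step can be sketched with the standard library alone. The rows, tier labels, and decision values below are synthetic placeholders, and the function names are hypothetical; in practice the decision column is filled in by scoring the sampled URLs with the full rubric:

```python
import random
from collections import Counter, defaultdict


def stratified_sample(inventory, per_stratum=100, seed=0):
    """Draw up to per_stratum rows from each performance tier."""
    rng = random.Random(seed)
    by_tier = defaultdict(list)
    for row in inventory:
        by_tier[row["tier"]].append(row)
    return {
        tier: rng.sample(rows, min(per_stratum, len(rows)))
        for tier, rows in by_tier.items()
    }


def project_decisions(sample, stratum_sizes):
    """Scale each stratum's sampled decision shares up to the stratum's full size."""
    projected = Counter()
    for tier, rows in sample.items():
        counts = Counter(row["decision"] for row in rows)
        for decision, count in counts.items():
            projected[decision] += stratum_sizes[tier] * count / len(rows)
    return projected


# Synthetic inventory: three tiers of different sizes.
inventory = (
    [{"url": f"/a/{i}/", "tier": "A", "decision": "Keep"} for i in range(1000)]
    + [{"url": f"/f/{i}/", "tier": "F", "decision": "Delete"} for i in range(3000)]
    + [{"url": f"/c/{i}/", "tier": "C", "decision": "Update"} for i in range(40)]
)
sample = stratified_sample(inventory)
sizes = Counter(row["tier"] for row in inventory)
projection = project_decisions(sample, sizes)
print({tier: len(rows) for tier, rows in sample.items()})
print(dict(projection))
```

A stratum smaller than the per stratum quota is taken whole, so small tiers are audited in full rather than sampled.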

13.6 Cadence Selection and Velocity

def propose_cadence(total_pages):
    """Map a site's URL count to the cadence tiers in Sections 13.1 and 13.5."""
    if total_pages < 500: return "annual_deep_plus_quarterly_light"
    if total_pages < 5000: return "quarterly_section_plus_monthly_spot"
    if total_pages < 25000: return "continuous_rolling"
    return "continuous_rolling_with_sampling"

Standard scoring is 12 to 18 minutes per URL after API automation, so a full time analyst audits 130 to 200 URLs per week. For a 5000 URL site, a full audit at this velocity takes 25 to 38 weeks. Continuous rolling cadence is sized to fit velocity: at roughly 100 URLs per week, the queue for a 5000 URL site cycles every 50 weeks.
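
The velocity figures follow from simple arithmetic. The sketch below reproduces them, assuming a 40 hour analyst week (an assumption; the text does not state the hours), with illustrative function names:

```python
MINUTES_PER_WEEK = 40 * 60  # assumption: a 40 hour analyst week


def urls_per_week(minutes_per_url):
    """Whole URLs one analyst can score in a week at the given pace."""
    return MINUTES_PER_WEEK // minutes_per_url


def weeks_to_cycle(total_urls, weekly_urls):
    """Weeks needed to audit every URL once at a steady weekly pace."""
    return total_urls / weekly_urls


fast, slow = urls_per_week(12), urls_per_week(18)
print(fast, slow)
print(weeks_to_cycle(5000, fast), round(weeks_to_cycle(5000, slow), 1))
print(weeks_to_cycle(5000, 100))
```

The slow end works out to about 133 URLs per week, which rounds down to the 130 quoted in the text and gives the 38 week upper bound.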

13.7 The Calendar Anchor

Regardless of cadence, the calendar anchor is the quarterly review. Every quarter the analyst produces a portfolio level report: count of pages audited, decisions executed, decisions pending, ROI on executed decisions (clicks gained, conversions gained, hours invested). The quarterly report is the client deliverable that justifies the audit retainer.

14. Bubbles Hosted Content Audit Toolchain

The entire content audit pipeline runs on the Bubbles host (169.155.162.118, Debian, nginx, 16 GB RAM). No third party CDN, no proxy, no SaaS audit platform in the path. Every component is open source or written in house.

14.1 The Architecture Overview

Every stage runs locally on the Bubbles host. Crawlers (Screaming Frog CLI), API clients (GSC, GA4, Ahrefs via google-api-python-client and requests), and the Python pandas pipeline produce three CSV stages: inventory.csv (canonical merged) feeds inventory-scored.csv (12 criterion scored) feeds audit-deliverable.csv (decisions and actions). Jupyter runs the analyst notebooks; the deliverable shares to client via Nextcloud or Google Sheet. The only external API calls are GSC, GA4, and Ahrefs; responses cache to local CSV files for offline analysis.

14.2 The Installation

One time setup:

sudo apt install -y python3 python3-pip python3-venv jupyter
python3 -m venv /home/user/audit-env
source /home/user/audit-env/bin/activate
pip install pandas openpyxl google-api-python-client google-auth google-analytics-data requests pyyaml

cd /opt
sudo wget https://download.screamingfrogseospider.com/seospider/screamingfrogseospider-22.0.tar.gz
sudo tar -xzf screamingfrogseospider-22.0.tar.gz && sudo mv screamingfrogseospider-22.0 screamingfrog
sudo ln -s /opt/screamingfrog/screamingfrogseospider /usr/local/bin/screamingfrogseospider

Jupyter serves on port 8888 bound to localhost; nginx reverse proxies authenticated access at https://audit.thatdeveloperguy.com/jupyter/ with htpasswd.

14.3 The Per Client Directory Structure

/home/user/clients/[clientname]/
  audits/2026-Q2/      (sitemap-urls.txt, crawler-urls.csv, gsc-urls.csv,
                        ga4-urls.csv, inventory.csv, inventory-scored.csv,
                        audit-deliverable.csv, audit-log.yaml, audit-report.md)
  audits/2026-Q1/      ...
  secrets/             (gsc.json, ga4.json, ahrefs-api-key.txt)
  topical-taxonomy.yaml
  notebooks/           (inventory-build.ipynb, scoring.ipynb,
                        decision-routing.ipynb, cluster-audit.ipynb,
                        quarterly-report.ipynb)

Each quarter has its own subdirectory. The audit log persists across quarters; each new quarter appends to the existing log.

14.4 The Jupyter Notebook Workflow

Five notebooks in sequence: inventory-build.ipynb pulls sitemap, runs Screaming Frog, pulls GSC and GA4 APIs, merges into inventory.csv. scoring.ipynb applies automated scoring then opens for analyst input on C4, C6, C7, C8, C9, C12. decision-routing.ipynb applies the Section 5 decision matrix. cluster-audit.ipynb runs Section 12 gap analysis, orphan detection, coherence check, cluster health score. quarterly-report.ipynb computes ROI on executed decisions and generates the quarterly executive report.

14.5 The Client Share Workflow

Two paths. Path A: Self hosted Nextcloud. The Bubbles host runs Nextcloud at https://cloud.thatdeveloperguy.com/; the deliverable CSV uploads to a client folder; the client logs in; review and comments happen in Nextcloud. No third party SaaS in the path. Path B: Shared Google Sheet. For clients already in Google Workspace, the deliverable CSV imports into a Google Sheet at a shared URL. The client opens, reviews, adds owner assignments inline. The sheet syncs back to CSV on a weekly cadence via the Google Sheets API.

14.6 The Automation Scripts

audit-init.sh creates a new quarter directory, copies the prior quarter's taxonomy and audit log forward:

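#!/usr/bin/env bash
# Usage: audit-init.sh <client> <quarter>
[ "$#" -eq 2 ] || { echo "usage: $0 <client> <quarter>" >&2; exit 1; }
CLIENT="$1"; QUARTER="$2"
CLIENT_DIR="/home/user/clients/${CLIENT}"
QUARTER_DIR="${CLIENT_DIR}/audits/${QUARTER}"
mkdir -p "${QUARTER_DIR}"
cp "${CLIENT_DIR}/topical-taxonomy.yaml" "${QUARTER_DIR}/"
# Carry forward the most recently modified audit log; start an empty one on a client's first audit.
LATEST=$(ls -t "${CLIENT_DIR}/audits/"*/audit-log.yaml 2>/dev/null | head -1)
if [ -n "${LATEST}" ]; then cp "${LATEST}" "${QUARTER_DIR}/audit-log.yaml"; else echo "audit_log_entries: []" > "${QUARTER_DIR}/audit-log.yaml"; fi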

audit-execute-deletes.sh reads audit-deliverable.csv, extracts URLs marked Delete, applies the 410 nginx config:

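#!/usr/bin/env bash
CLIENT="$1"; QUARTER="$2"
DELIVERABLE="/home/user/clients/${CLIENT}/audits/${QUARTER}/audit-deliverable.csv"
NGINX_410_MAP="/etc/nginx/redirects/${CLIENT}-410.map"
# The map is opened in append mode: the script needs write access to /etc/nginx/redirects/,
# and rerunning it for the same quarter appends duplicate entries.
python3 -c "
import pandas as pd
from urllib.parse import urlparse
df = pd.read_csv('${DELIVERABLE}')
with open('${NGINX_410_MAP}', 'a') as fh:
    for url in df[df['decision'] == 'Delete']['url']:
        # urlparse yields an empty path for a bare origin such as https://example.com
        path = urlparse(url).path or '/'
        fh.write(f'{path}  410;\n')
"
sudo nginx -t && sudo systemctl reload nginx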

audit-execute-redirects.sh follows the same pattern for 301 redirects from Redirect decision rows.

14.7 Performance Considerations

The Bubbles host shares 16 GB RAM across nginx, several FastAPI backends, MEGAMIND brain, and the audit pipeline. Screaming Frog can consume 4 GB on a 5000 URL crawl; run during off hours; do not run concurrently with brain. Pandas on a 25000 URL inventory with 30 columns produces a DataFrame around 200 MB; use chunked processing (pd.read_csv with chunksize) for inventories above 50000 URLs. The Bubbles host comfortably handles portfolios up to 25000 URLs.

14.8 The No CDN No Proxy Stance

The audit pipeline runs on a single Debian host with direct internet egress. No CDN sits in front of audit endpoints. Reasons: substrate honesty (every deliverable URL passes the same substrate test the audit applies to client sites); operational sovereignty (a CDN outage would break the audit pipeline); cost (Bubbles is paid for; CDN is recurring spend with no benefit at audit traffic volumes); client demonstration (the pipeline is itself a demonstration of the self hosted philosophy this practice recommends to clients).

14.9 Backup and Recovery

Daily rsync to the 4.5 TB external storage at /mnt/storage/audit-backups/:

SOURCE="/home/user/clients/"
DEST="/mnt/storage/audit-backups/$(date +%Y-%m-%d)/"
mkdir -p "${DEST}" && rsync -av --delete "${SOURCE}" "${DEST}"

Runs from cron at 03:00 daily. Recovery is rsync in reverse. Retention is 90 days, a horizon that balances disk usage against recoverability; the snippet above does not prune old snapshots, so expiry needs its own scheduled cleanup.

14.10 The Toolchain Audit Checklist

#	Component
T1	Python 3.11 venv at /home/user/audit-env
T2	Screaming Frog CLI at /usr/local/bin/screamingfrogseospider
T3	GSC service account JSON in /home/user/clients/[clientname]/secrets/
T4	GA4 service account JSON in /home/user/clients/[clientname]/secrets/
T5	Jupyter at audit.thatdeveloperguy.com/jupyter/ behind htpasswd
T6	Per client directory structure under /home/user/clients/[clientname]/
T7	Topical taxonomy YAML in /home/user/clients/[clientname]/
T8	Five notebook templates in notebooks/
T9	audit-init.sh, audit-execute-deletes.sh, audit-execute-redirects.sh executable
T10	Daily rsync backup to /mnt/storage/audit-backups/ active

Score out of 10, one point per component in place. World class toolchain: 10 of 10 with zero outages on the Bubbles host in the prior quarter.

End of Framework Document

v2.0. Created 2026-05-14. By ThatDeveloperGuy.

Content audit is the recurring discipline that takes inventory of every published URL, scores each against quality and performance, and routes each to keep, update, consolidate, redirect, or delete. Pruning low quality content lifts perceived quality of the entire site (Ahrefs August 2025: 1 to 23 percent organic click lift across 47 of 50 case studies on bottom 30 percent pruning) and reallocates crawl budget toward pages that earn citation, rank, and convert. The 2026 evolution: the AI citation audit layer makes audit a citation defense practice alongside the classic ranking defense; section level audit lets refresh execution target the specific decayed section. Sites running Section 13 cadence against the Section 14 toolchain produce continuous portfolio improvement without CDN or proxy dependencies.

Companions

framework-initialaudit.md, broader initial site audit context
framework-ongoingaudit.md, recurring quarterly and monthly cadence
framework-contentrefresh.md, refresh and update production workflow
framework-topicalauthority.md, cluster build and maintenance (Section 12)
framework-internallinking.md, internal link audit during consolidation
framework-aicitations.md, multi engine AI citation audit (Section 11)
framework-aioverviews.md, Google AI Overview citation defense
framework-gscanalysis.md, GSC data pull (Section 3)
framework-eeat.md, E-E-A-T scoring foundation (C4)
framework-hcs.md, Helpful Content System quality criteria
framework-infogain.md, Information Gain assessment (C7)
framework-sqrg.md, Search Quality Rater Guidelines
SEO-Search-Appearance.md, multi engine surface map
SERP-Optimization.md, feature targeting playbook
14 tier Engine Optimization Stack, Tier 2 Content Optimization Layer

From the ThatDevPro Engine Optimization framework library. Studio: ThatDevPro (SDVOSB veteran-owned web + AI engineering). Sister property: ThatDeveloperGuy. Source: https://www.thatdevpro.com/insights/framework-contentaudit/.