Originally published at thatdevpro.com. This framework reference is part of the 14-tier Engine Optimization stack from ThatDevPro, an SDVOSB-certified veteran-owned web + AI engineering studio. You are reading the dev.to mirror; the source-of-truth canonical version with embedded validation tools lives at the link above.
The Canonical Reference for Inventorying Every Published URL, Scoring Quality, and Deciding the Fate of Each Page Across Keep, Update, Consolidate, Redirect, and Delete
A comprehensive installation and audit reference for content auditing as an SEO and AEO discipline. Content audit is the recurring process of taking complete inventory of every URL a site has published, scoring each one against a quality and performance rubric, and routing each to one of five outcomes: keep with maintenance, update with targeted improvement, consolidate by merging into a stronger canonical, redirect by 301 to a topical successor, or delete by returning 410 Gone. Audit discipline is the highest leverage activity in mature SEO programs because pruning low quality content lifts perceived quality of the entire site and reallocates crawl budget toward pages that earn citation, rank, and convert. This document specifies the inventory methodology, the twelve criterion quality scorecard, the five way decision matrix, the consolidation and sunset protocols, the section level audit, the AI citation audit layer, the topical cluster audit, the audit cadence by site size, and the Bubbles hosted toolchain that runs the entire pipeline on a single Debian server with no CDN or proxy in the path. Dual purpose: installation manual and audit document.
Cross stack note: code samples are written in plain HTML and Bash. For React, Vue, Svelte, Next.js, Nuxt, SvelteKit, Astro, Hugo, 11ty, Remix, WordPress, Shopify, and Webflow equivalents, see framework-cross-stack-implementation.md. The audit pipeline substrate is Python 3.11, pandas 2.x, and Jupyter on the same Debian host that runs nginx.
1. Document Purpose
Content audit is the recurring process of inventorying every published URL on a site, scoring each one against a quality and performance rubric, and routing each to one of five fates: keep with maintenance, update with targeted improvement, consolidate by merging into a stronger canonical, redirect by 301 to a topical successor, or delete by returning 410 Gone. A site that audits once at launch and never again accumulates dead weight every quarter; a site that audits on the cadence in Section 13 keeps its portfolio in continuous alignment with current quality bars and query intent.
Audit discipline is the highest leverage activity in mature SEO programs. First, pruning low quality content lifts the perceived quality of the entire site. Ahrefs August 2025 measured organic click lift of 1 to 23 percent across 47 of 50 case studies on bottom 30 percent pruning. The lift is sitewide. Second, audit reallocates crawl budget toward pages that produce business value. Third, audit produces the data needed for every other content decision: the topical cluster audit shows where the cluster is incomplete, the AI citation audit shows which page patterns earn AI Overview citation, the section level audit shows where individual blocks need refresh while the surrounding article stays static.
The 2026 emphasis on Google's Helpful Content System and on AI Overview citation makes audit increasingly load bearing. A site with significant low quality content risks algorithmic demotion under HCS; the same site loses AI Overview citation to competitors whose content is denser, more current, and more cleanly entity declared.
1.1 Three Operating Modes
Mode A, Install Mode. Establish audit infrastructure on a site that has never had systematic audit. Sections 2 through 14 in order. The first full audit on a new client engagement is the highest value deliverable for the first 90 days.
Mode B, Audit Mode. Run a recurring audit on a site that already has audit discipline. Skip the inventory build; pull the prior audit's inventory; refresh metrics, rescore, reroute, produce the updated work queue. Mode B is what most ongoing client engagements run on a quarterly anchor.
Mode C, Hybrid Mode. Partial or stale inventory. Reconcile against current sitemap, GSC, and GA4, fill the gaps, proceed with Mode B scoring.
1.2 Conflict Resolution Rules
| Conflict | Rule |
|---|---|
| Sitemap inventory shorter than GSC plus GA4 inventory | Critical. Merge all three. Section 3. |
| Quality scoring without traffic and engagement data | Reject. Section 4 requires both. |
| Page has zero traffic but holds backlinks | Do not delete. Section 8 specifies 301 to topical successor. |
| Two pages compete for the same query and both have traffic | Consolidate, do not refresh both. Section 7. |
| Page level decision is Update but only one section is decayed | Section level audit (Section 10). |
| Audit last run more than 12 months ago | Treat as new install. Mode A. |
| Inventory exceeds 5000 URLs and client wants annual full audit | Reject. Section 13 specifies continuous rolling. |
1.3 Required Tools
Sitemap fetch via curl; Screaming Frog SEO Spider CLI on Linux (or Sitebulb headless) for full URL crawl; GSC Search Analytics API for indexed URL discovery and per URL performance; GA4 Data API for engagement and conversion attribution; Ahrefs or Semrush API for backlink and ranking data; Python 3.11 with pandas 2.x for inventory merge and scoring; Jupyter for analyst review; spreadsheet (Google Sheets or self hosted Nextcloud) for client shared output. No CDN, no proxy in the audit pipeline. The Bubbles host (169.155.162.118, Debian, nginx, 16 GB RAM) runs the entire stack.
1.4 Relationship to Neighboring Frameworks
Broader site audit: framework-initialaudit.md. Quarterly cadence: framework-ongoingaudit.md. Update or refresh execution: framework-contentrefresh.md. Topical cluster: framework-topicalauthority.md. Internal link: framework-internallinking.md. AI citation: framework-aicitations.md, framework-aioverviews.md. GSC data pull: framework-gscanalysis.md. Quality scoring inputs: framework-eeat.md, framework-hcs.md, framework-infogain.md. Health score: framework-sqrg.md.
2. Client Variables Intake
# CONTENT AUDIT FRAMEWORK CLIENT VARIABLES
# Business and Site Identity (REQUIRED)
business_name: ""
primary_domain: ""
business_industry: ""
ymyl_classification: "" # full_ymyl, partial_ymyl, lite_ymyl, non_ymyl
cms_or_stack: ""
host_environment: "" # bubbles_nginx, valkyrie_nginx, third_party
# Portfolio Scale (REQUIRED)
total_indexable_pages: 0
total_content_pages_excluding_product: 0
total_product_or_listing_pages: 0
oldest_content_publication_year: 0
pages_published_more_than_24_months_ago: 0
pages_published_more_than_12_months_ago: 0
pages_with_zero_inbound_internal_links: 0
# Inventory Source State (REQUIRED)
sitemap_url: ""
sitemap_url_count: 0
gsc_property_verified: false
gsc_indexed_page_count: 0
ga4_property_verified: false
ga4_pages_with_traffic_last_12mo: 0
ahrefs_or_semrush_access: false
# Prior Audit State
prior_audit_exists: false
prior_audit_date: ""
prior_audit_inventory_count: 0
prior_audit_decisions_executed: 0
prior_audit_decisions_pending: 0
# Decision Routing Capacity (REQUIRED)
update_capacity_hours_per_quarter: 0
consolidation_capacity_pages_per_quarter: 0
redirect_or_delete_capacity_per_quarter: 0
section_level_audit_in_use: false
# AI Citation Layer (REQUIRED)
priority_queries_tracked_for_aio: 0
queries_currently_cited_in_aio: 0
ai_citation_audit_integrated: false
# Toolchain (REQUIRED)
crawler_tool: "" # screaming_frog_cli, sitebulb_headless, custom_python
gsc_api_credentials_provisioned: false
ga4_api_credentials_provisioned: false
ahrefs_api_credentials_provisioned: false
python_pandas_environment_ready: false
jupyter_notebook_location: ""
audit_csv_output_location: ""
client_shared_spreadsheet_location: ""
Audit routes to baseline frameworks when prerequisites fail. If gsc_property_verified is false, work routes to framework-gscanalysis.md Section 2 verification. If ga4_property_verified is false, work routes to GA4 setup before audit. If sitemap_url_count is zero or far off from gsc_indexed_page_count and ga4_pages_with_traffic_last_12mo, the sitemap is broken and work routes to sitemap repair. Audit against a broken substrate produces inventory gaps that bias every downstream decision.
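The prerequisite routing above can be sketched as a small check over the intake variables. This is a minimal sketch, not part of the framework's pipeline: the function name `prerequisite_routes` and the `sitemap_tolerance` cut-off are hypothetical, and the 0.5 default is only one assumed reading of "far off".

```python
def prerequisite_routes(v, sitemap_tolerance=0.5):
    """Return the baseline work items that must finish before the audit starts.

    sitemap_tolerance is a hypothetical cut-off: the sitemap counts as
    "far off" when it holds fewer URLs than tolerance times the larger of
    the GSC indexed count and the GA4 pages-with-traffic count.
    """
    routes = []
    if not v["gsc_property_verified"]:
        routes.append("framework-gscanalysis.md Section 2 verification")
    if not v["ga4_property_verified"]:
        routes.append("GA4 setup")
    reference = max(v["gsc_indexed_page_count"], v["ga4_pages_with_traffic_last_12mo"])
    if v["sitemap_url_count"] == 0 or v["sitemap_url_count"] < sitemap_tolerance * reference:
        routes.append("sitemap repair")
    return routes


intake = {
    "gsc_property_verified": True,
    "ga4_property_verified": False,
    "sitemap_url_count": 120,
    "gsc_indexed_page_count": 900,
    "ga4_pages_with_traffic_last_12mo": 640,
}
print(prerequisite_routes(intake))
```

An empty list means the audit can start; any other result names the baseline work that comes first.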
3. The Content Inventory
The most common audit failure mode is incomplete inventory: working only from the sitemap and missing pages GSC has discovered through external links, or working only from GA4 and missing pages that have impressions but no clicks. Search Engine Journal March 2025 (200 mid market sites): combined inventory (sitemap union GSC union GA4) exceeded each individual list by 10 to 30 percent. The pages in the gap are typically orphans, legacy pages, tag and category archives, paginated results, parameter URLs, and pages the CMS publishes outside the sitemap.
3.1 The Four Source Inventory Build
Source 1: Sitemap fetch. Pull sitemap.xml from the canonical domain. If the site uses a sitemap index, follow the index and fetch each child. Resolve each <loc> to its canonical URL.
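SITEMAP="https://example.com/sitemap.xml"
DIR="/home/user/clients/[clientname]/audits"
extract_locs() { grep -oE '<loc>[^<]+</loc>' | sed -E 's/<\/?loc>//g'; }  # print every <loc> value read from stdin
ROOT_XML="$(curl -s "${SITEMAP}")"
if printf '%s' "${ROOT_XML}" | grep -q '<sitemapindex'; then  # sitemap index: fetch each child sitemap
  printf '%s' "${ROOT_XML}" | extract_locs | while read -r CHILD; do curl -s "${CHILD}" | extract_locs; done
else  # plain urlset: the root file already lists the page URLs
  printf '%s' "${ROOT_XML}" | extract_locs
fi | sort -u > "${DIR}/sitemap-urls.txt"
The script checks for a <sitemapindex> root element first. If sitemap.xml itself contains page URLs rather than sub-sitemaps, the else branch extracts its <loc> entries directly instead of fetching each page as if it were a child sitemap.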
Source 2: Full URL crawl. Run Screaming Frog SEO Spider CLI on Linux with content extraction enabled.
screamingfrogseospider --crawl https://example.com/ --headless --save-crawl --export-tabs "Internal:All" --output-folder /home/user/clients/[clientname]/audits/
For Sitebulb headless the equivalent invocation uses sitebulb run --url https://example.com/ --output /home/user/clients/[clientname]/audits/. Either tool produces a CSV with one row per URL and columns for status code, content type, indexability, response time, word count, H1, title, meta description, outbound link counts.
Source 3: GSC discovered URLs. Pull the GSC Search Analytics API for every URL with at least one impression over the last 16 months. Service account JSON stored at /home/user/clients/[clientname]/secrets/gsc.json with siteFullUser permission.
import csv
from google.oauth2 import service_account
from googleapiclient.discovery import build
creds = service_account.Credentials.from_service_account_file("/home/user/clients/[clientname]/secrets/gsc.json", scopes=["https://www.googleapis.com/auth/webmasters.readonly"])
service = build("searchconsole", "v1", credentials=creds)
request = {"startDate": "2025-01-14", "endDate": "2026-05-14", "dimensions": ["page"], "rowLimit": 25000, "startRow": 0}
with open("/home/user/clients/[clientname]/audits/gsc-urls.csv", "w", newline="") as f:
    writer = csv.writer(f)  # csv.writer quotes URLs that contain commas or quotes
    writer.writerow(["url", "clicks", "impressions", "ctr", "position"])
    while True:  # one call returns at most rowLimit rows, so page through with startRow
        rows = service.searchanalytics().query(siteUrl="https://example.com/", body=request).execute().get("rows", [])
        writer.writerows([r["keys"][0], r["clicks"], r["impressions"], r["ctr"], r["position"]] for r in rows)
        if len(rows) < request["rowLimit"]:
            break
        request["startRow"] += request["rowLimit"]
Source 4: GA4 historical URLs. Pull the GA4 Data API for every page path with at least one session over the last 12 months. Pages in GA4 but not in sitemap are orphans or unindexed pages receiving direct traffic.
import csv
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import DateRange, Dimension, Metric, RunReportRequest
GA4_PROPERTY_ID = "000000000"  # placeholder: replace with the client's numeric GA4 property ID
client = BetaAnalyticsDataClient.from_service_account_file("/home/user/clients/[clientname]/secrets/ga4.json")
request = RunReportRequest(
    property=f"properties/{GA4_PROPERTY_ID}",
    date_ranges=[DateRange(start_date="2025-05-14", end_date="2026-05-14")],
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="sessions"), Metric(name="averageSessionDuration"), Metric(name="conversions")],
    limit=100000)
response = client.run_report(request)
with open("/home/user/clients/[clientname]/audits/ga4-urls.csv", "w", newline="") as f:
    writer = csv.writer(f)  # csv.writer quotes paths that contain commas or quotes
    writer.writerow(["path", "sessions", "avg_session_duration", "conversions"])
    for r in response.rows:
        writer.writerow([r.dimension_values[0].value] + [m.value for m in r.metric_values])
3.2 The Merge and Deduplication
The four source CSV files merge into a single canonical inventory CSV. The merge key is the normalized URL (canonical protocol, canonical host, canonical trailing slash, no tracking parameters, no anchor fragments).
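import pandas as pd
from urllib.parse import urlsplit, urlunsplit
DIR = "/home/user/clients/[clientname]/audits/"
sitemap = pd.read_csv(DIR + "sitemap-urls.txt", header=None, names=["url"])
# Assumption: the crawl export was saved as crawler-urls.csv; a Screaming Frog "Address" column is renamed to url.
crawler = pd.read_csv(DIR + "crawler-urls.csv").rename(columns={"Address": "url"})
gsc = pd.read_csv(DIR + "gsc-urls.csv")
ga4 = pd.read_csv(DIR + "ga4-urls.csv")
def normalize(url):
    """Lowercase scheme and host, drop query string and fragment, add a trailing slash to extensionless paths."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    if not path.endswith("/") and "." not in path.rsplit("/", 1)[-1]:
        path += "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))
for df in (sitemap, crawler, gsc):
    df["url"] = df["url"].apply(normalize)
ga4["url"] = ga4["path"].apply(lambda p: normalize("https://example.com" + p))
flags = {"in_sitemap": sitemap, "in_crawler": crawler, "in_gsc": gsc, "in_ga4": ga4}
inventory = None
for flag, df in flags.items():
    # Normalization can collapse several raw URLs into one; keep the first row so the merge stays one row per URL.
    df = df.drop_duplicates("url").assign(**{flag: True})
    inventory = df if inventory is None else inventory.merge(df, on="url", how="outer")
for flag in flags:
    inventory[flag] = inventory[flag].eq(True)  # NaN left by the outer merge becomes False
inventory.to_csv(DIR + "inventory.csv", index=False)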
Every row is a URL with boolean flags for which sources discovered it plus the performance signals. URLs flagged in only one source are diagnostic: sitemap but not crawler suggests broken internal navigation; crawler but not sitemap suggests missing sitemap entries; GA4 but not crawler or sitemap suggests orphan pages or pages excluded by robots.
3.3 Sanity Checks and Categorization
Healthy sitemap to combined ratio is 0.7 to 0.9. A ratio of exactly 1.0 (combined count equals sitemap count) indicates the merge failed to add anything; a combined inventory more than twice the sitemap count (ratio below 0.5) indicates URL normalization broke. Each inventory row is categorized by content type (article, guide, landing, product, listing, author, legal, corporate, other), topic cluster (per site taxonomy), age bucket (new 0 to 6 months, young 6 to 12, mature 12 to 24, old 24 to 48, legacy 48 plus), and performance tier (top 10 percent, top 25 percent, median, bottom 25 percent, bottom 10 percent, zero traffic).
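The sanity check and the age bucketing can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function names are hypothetical, "above 2x" is read as a combined inventory more than twice the sitemap count, and a month is approximated as 30.44 days.

```python
from datetime import date


def sitemap_ratio_status(sitemap_count, combined_count):
    """Classify the sitemap-to-combined ratio against the bands in this section."""
    if combined_count == 0:
        return "empty inventory"
    ratio = sitemap_count / combined_count
    if ratio >= 1.0:
        return "merge likely failed"          # combined added nothing to the sitemap
    if ratio < 0.5:
        return "normalization likely broke"   # combined is more than 2x the sitemap
    if 0.7 <= ratio <= 0.9:
        return "healthy"
    return "review manually"                  # outside the healthy band, not a hard failure


def age_bucket(published, today):
    """Map a publication date to the five age buckets (month = 30.44 days, an approximation)."""
    months = (today - published).days / 30.44
    if months < 6:
        return "new"
    if months < 12:
        return "young"
    if months < 24:
        return "mature"
    if months < 48:
        return "old"
    return "legacy"


print(sitemap_ratio_status(800, 1000))
print(age_bucket(date(2024, 1, 1), date(2025, 3, 1)))
```

Ratios between the named bands return "review manually" rather than a verdict, since the text only defines the healthy range and the two failure signals.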
4. Quality Scoring Rubric 2026
Twelve criteria, each scored 0 to 5, total possible 60. The rubric captures both classic SEO quality signals and the 2026 AI citation signals.
4.1 The Twelve Criteria
C1. Traffic last 12 months. GSC organic clicks over the trailing 12 month window. 0 for zero clicks, 1 for under 50 per year, 2 for 50 to 500, 3 for 500 to 5000, 4 for 5000 to 25000, 5 for over 25000. Adjust bands by site size.
C2. Engagement metrics. GA4 average engagement time per session, scroll depth, bounce rate. 0 for zero or negative engagement (auto bounce, sub 10 second sessions); 1 to 5 banded against the site median.
C3. Conversion contribution. GA4 conversion count attributed to the page over the trailing 12 months. 0 for zero conversions; 1 to 5 banded against site conversion median.
C4. Expert review level. Manual rating of E-E-A-T markers: credentialed byline, declared reviewer for YMYL, first hand experience, primary source citations. Per framework-eeat.md rubric.
C5. Recency. dateModified relative to topic volatility. A page on tax filing dated 2022 is more decayed than a page on the Pythagorean theorem dated 2022. 5 for fully current, 3 for moderate currency, 0 for stale relative to topic. Cross reference framework-contentrefresh.md decay scorecard.
C6. Depth. Word count, section count, topical coverage breadth. Surfer SEO January 2026 (210000 URLs): pages over 2100 words earn featured snippet at 2.7 times the rate of pages under 1000 words. Depth is comprehensive coverage, not word padding. 5 for comprehensive multi section coverage, 3 for solid coverage with gaps, 0 for thin content.
C7. Originality. Information Gain per framework-infogain.md. 5 for multiple original contributions (data, first hand observation, contrarian finding, novel synthesis), 3 for at least one, 0 for entirely derivative.
C8. Accuracy. Factual accuracy of every numeric claim, citation, named source, dated reference. 5 for zero detected errors and current data, 3 for minor inaccuracies, 0 for systematic errors or invented statistics.
C9. Multimedia. Images with alt text and descriptive captions, videos with transcripts, diagrams, charts. 5 for rich multimedia matching content, 3 for adequate, 0 for prose only when topic warrants visuals.
C10. Internal link equity. Count of internal links pointing into the page from topically related pages. Princeton GEO SIGKDD 2024: AI citation probability rises sharply at three or more inbound links from topically related pages. 5 for 10 plus inbound, 3 for 3 to 9, 1 for 1 or 2, 0 for orphan.
C11. Backlink earnings. Referring domains and total inbound links from Ahrefs or Semrush. 5 for 20 plus referring domains, 3 for 5 to 19, 0 for zero.
C12. AI citation presence. Manual sampling on priority queries across Google AI Overview, ChatGPT Search, Perplexity, Claude Search, Bing Copilot. 5 for cited on multiple priority queries across multiple engines, 3 for cited on at least one, 0 for never cited.
4.2 Scoring Tiers
The total maps to a tier that feeds the Section 5 decision matrix. A: exemplar (50 to 60). B: strong (40 to 49). C: serviceable (30 to 39). D: weak, needs intervention (20 to 29). F: candidate for consolidation, redirect, or deletion (0 to 19). Per Search Engine Land November 2025 (47 SaaS audits): the median post audit portfolio distribution is 8 percent A, 22 percent B, 35 percent C, 25 percent D, 10 percent F. A portfolio with over 30 percent F has never had systematic audit; under 5 percent F has likely already had recent pruning.
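The tier bands above translate directly into a lookup. A minimal sketch; the function name `tier` is hypothetical and the input is assumed to be the integer sum of the twelve criterion scores.

```python
def tier(total):
    """Map a 0 to 60 rubric total to the tier letter used by the decision matrix."""
    if not 0 <= total <= 60:
        raise ValueError(f"total out of range: {total}")
    if total >= 50:
        return "A"  # exemplar
    if total >= 40:
        return "B"  # strong
    if total >= 30:
        return "C"  # serviceable
    if total >= 20:
        return "D"  # weak, needs intervention
    return "F"      # candidate for consolidation, redirect, or deletion


scores = {f"C{i}": 3 for i in range(1, 13)}  # illustrative scores, not real data
print(tier(sum(scores.values())))
```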
4.3 Scoring Time and Automation
Per URL: quick scoring 3 to 5 minutes, standard scoring (full 12 criterion) 12 to 18 minutes, deep scoring (with competitor comparison and multi engine AI sampling) 30 to 60 minutes. For a 1000 URL inventory, full standard scoring is 200 to 300 hours. Sampling becomes economically necessary above 2500 URLs (Section 13). The Python pipeline computes C1, C2, C3, C5 (date portion), C10, C11 directly from API data; the analyst concentrates on C4, C6, C7, C8, C9, C12. The hybrid model cuts per URL scoring time roughly in half.
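import pandas as pd
inventory = pd.read_csv("/home/user/clients/[clientname]/audits/inventory.csv")
def score_traffic(c):
    """C1 bands from Section 4.1, keyed on trailing 12 month clicks."""
    return 0 if c == 0 else (1 if c < 50 else (2 if c < 500 else (3 if c < 5000 else (4 if c < 25000 else 5))))
def score_internal_links(n):
    """C10 bands: 0 for orphan, 1 for 1 or 2 inbound, 3 for 3 to 9, 5 for 10 plus."""
    return 0 if n == 0 else (1 if n < 3 else (3 if n < 10 else 5))
# URLs missing from a source carry NaN after the outer merge. NaN fails every comparison and would fall through to a score of 5, so fill with 0 first.
inventory["C1"] = inventory["clicks"].fillna(0).apply(score_traffic)
# Assumption: internal_inbound_count was derived from the crawler export's inbound internal link count.
inventory["C10"] = inventory["internal_inbound_count"].fillna(0).apply(score_internal_links)
inventory.to_csv("/home/user/clients/[clientname]/audits/inventory-scored.csv", index=False)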
5. The Decision Matrix
The decision matrix routes each scored URL to one of five outcomes. The matrix is deterministic given the tier and per criterion thresholds; the analyst exercises judgment only on edge cases.
5.1 The Five Routes
| Route | Definition | Reference |
|---|---|---|
| Keep | Performing and high quality. Schedule routine refresh. | 5.2 |
| Update | Decayed or partially decayed. Targeted improvement. | Section 6 |
| Consolidate | Two or more pages compete for same query. Merge into canonical. | Section 7 |
| Redirect | Dead but holds backlink equity. 301 to topical successor. | 8.2 |
| Delete | No traffic, no backlinks, no topical fit. 410 Gone. | 8.1 |
5.2 Keep and Maintain Criteria
All must be true: Tier A or B (total 40 plus); C1 traffic 3 or higher; C5 recency 3 or higher; C8 accuracy 4 or higher; no cannibalization with another page on the same primary query. Keep pages receive scheduled refresh per framework-contentrefresh.md Section 6 cadence with no immediate structural intervention.
5.3 Update and Refresh Criteria
Any one of the following triggers an update: Tier B or C (total 30 to 49) with at least one criterion at 2 or below; Tier A with any criterion at 2 or below; page decayed in last 90 days (28 day click drop of 30 percent or more); page missing AI Overview citation it previously held. Routes to framework-contentrefresh.md Section 7 production workflow with the specific weak criteria as the trigger.
5.4 Consolidate and Merge Criteria
Triggers: GSC Performance filtered by query shows two or more pages with 100 plus impressions on the same primary query; topical overlap pairs two pages on the same primary topic; one page outranks the other but the loser still earns clicks and backlinks. Routes to Section 7.
5.5 Redirect and Sunset Criteria
All of the following are required: Tier D or F (total under 30); C1 traffic 0 or 1; C2 engagement 0 or 1; C11 backlink 2 or higher or C10 internal link 3 or higher; a topical successor exists on the site. Routes to Section 8.2 with internal link cleanup.
5.6 Delete and 410 Criteria
All four required: Tier F (total under 20); C1 traffic 0; C11 backlink 0 or 1; C10 internal link 0 or 1. Routes to Section 8.1.
5.7 The Per Criterion Threshold Table
| Criterion | Keep min | Consolidate signal | Redirect signal | Delete signal |
|---|---|---|---|---|
| C1 traffic | 3 | 2 to 4 (both pages) | 0 or 1 | 0 |
| C2 engagement | 3 | 1 to 4 | 0 or 1 | 0 |
| C3 conversions | 2 | 1 to 4 | 0 | 0 |
| C4 E-E-A-T | 3 | 1 to 4 | 0 to 2 | 0 to 2 |
| C5 recency | 3 | 1 to 4 | 0 or 1 | 0 |
| C6 depth | 3 | 1 to 4 | 0 to 2 | 0 to 2 |
| C7 originality | 3 | 1 to 4 | 0 to 2 | 0 to 2 |
| C8 accuracy | 4 | 2 to 5 | 0 to 3 | 0 to 3 |
| C9 multimedia | 2 | 0 to 4 | 0 to 3 | 0 to 3 |
| C10 internal links | 3 | 1 to 4 | 3 or higher | 0 or 1 |
| C11 backlinks | 2 | 1 to 4 | 2 or higher | 0 or 1 |
| C12 AI citation | 2 | 1 to 4 | 0 to 2 | 0 to 2 |
The table is enforced by the Python pipeline; the analyst sees a routing recommendation per URL and overrides only on documented edge cases.
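A routing function along these lines could apply the Section 5.2 to 5.6 criteria. This is a sketch, not the pipeline's actual code: the function name, the boolean inputs, and the fallback value "Review" are hypothetical, and the precedence order (Consolidate, Delete, Redirect, Update, Keep) is an assumption because the criteria above do not state one.

```python
CRITERIA = [f"C{i}" for i in range(1, 13)]


def route(scores, cannibalized=False, has_successor=False,
          decayed=False, lost_aio_citation=False):
    """Return one of the five routes, or "Review" when no rule matches.

    scores maps "C1".."C12" to integers 0..5. The boolean inputs stand in for
    signals the analyst or pipeline supplies (hypothetical names).
    """
    total = sum(scores[c] for c in CRITERIA)
    weakest = min(scores[c] for c in CRITERIA)
    if cannibalized:                                   # Section 5.4
        return "Consolidate"
    if (total < 20 and scores["C1"] == 0
            and scores["C11"] <= 1 and scores["C10"] <= 1):   # Section 5.6
        return "Delete"
    if (total < 30 and scores["C1"] <= 1 and scores["C2"] <= 1
            and (scores["C11"] >= 2 or scores["C10"] >= 3)
            and has_successor):                        # Section 5.5
        return "Redirect"
    if (total >= 30 and weakest <= 2) or decayed or lost_aio_citation:  # Section 5.3
        return "Update"
    if (total >= 40 and scores["C1"] >= 3
            and scores["C5"] >= 3 and scores["C8"] >= 4):     # Section 5.2
        return "Keep"
    return "Review"                                    # edge case for the analyst


dead_with_links = {c: 0 for c in CRITERIA}
dead_with_links["C11"] = 3
print(route(dead_with_links, has_successor=True))
```

Returning "Review" instead of forcing a route keeps the documented edge cases visible to the analyst rather than hiding them behind a default.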
6. Update versus Refresh Distinction
The terms update and refresh are often used interchangeably; this framework distinguishes them deliberately because the operational workflows are different.
6.1 The Distinction
Refresh is the rolling decay protection workflow run on healthy Keep tier pages on a scheduled cadence. The workflow lives in framework-contentrefresh.md. Refresh is preventive: a page currently performing well receives a substantive review and modest update at the cadence appropriate to its content type (weekly for news, quarterly for evergreen, semi annual for YMYL). The refresh keeps the page in the AI Overview candidate pool and defends against gradual decay.
Update is the targeted improvement workflow run on flagged Update tier pages after audit identifies a specific weakness. The workflow is similar to refresh (same dateModified discipline, same schema preservation, same changelog requirement) but reactive rather than preventive. An update addresses the specific criteria flagged by the audit rather than reviewing the whole page.
6.2 Why the Distinction Matters
Three operational reasons. Capacity allocation: refresh capacity is the quarterly anchor on the whole portfolio, update capacity is the weekly work queue from audit. Success measurement: refresh ROI is measured by 28 day pre versus 28 day post; update ROI is measured by criterion specific improvement. Audit log entry: refresh log entries reference the cadence trigger, update log entries reference the specific audit finding being addressed.
6.3 The Cross Reference
For Keep tier pages requiring scheduled refresh, route to framework-contentrefresh.md Section 7 production workflow with quarterly anchor cadence. For Update tier pages flagged by Section 5.3 criteria, route to that same workflow with the specific audit finding as the trigger documentation. The anti pattern: updating a Keep tier page that does not need work because "it has been six months." Refresh is calendar based; audit driven update is trigger based by definition.
7. Consolidation Methodology
Consolidation merges two or more pages competing for the same query into a single stronger canonical. Search Engine Journal April 2025 (114 audits): mean 12.4 cannibalization pairs per audit on portfolios over 500 pages.
7.1 Cannibalization Detection
Step 1: GSC query mining. Use the same GSC API client as Section 3 with dimensions=["query", "page"] over a 6 month window. Queries where two or more URLs each exceed 100 impressions are candidates: pairs = df.groupby("query").filter(lambda g: (g["impressions"] > 100).sum() >= 2). Summing impressions across the group instead would also flag queries where a single page carries all the volume.
Step 2: Topical overlap grouping. Within the inventory categorization, group pages by topic cluster. Pairs within the same cluster sharing primary keywords above 50 percent are candidates.
Step 3: Backlink overlap analysis. Pull referring domains for each candidate URL from Ahrefs API. If the two URLs share more than 30 percent of referring domains, consolidation compounds their backlink profile.
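Steps 1 and 3 can be sketched without any API calls, working on rows already pulled from GSC and on referring domain lists already pulled from the backlink tool. The function names are hypothetical, and measuring overlap against the smaller of the two backlink profiles is an assumption; the text does not say which denominator the 30 percent refers to.

```python
from collections import defaultdict


def cannibalization_candidates(rows, min_impressions=100):
    """Return {query: [pages]} for queries where 2+ pages each exceed min_impressions.

    rows is an iterable of dicts with "query", "page", and "impressions" keys.
    """
    pages_by_query = defaultdict(set)
    for row in rows:
        if row["impressions"] > min_impressions:
            pages_by_query[row["query"]].add(row["page"])
    return {q: sorted(p) for q, p in pages_by_query.items() if len(p) >= 2}


def referring_domain_overlap(domains_a, domains_b):
    """Share of the smaller backlink profile that also links to the other page."""
    a, b = set(domains_a), set(domains_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


rows = [  # illustrative rows, not real data
    {"query": "s corp vs llc", "page": "/s-corp-vs-llc/", "impressions": 150},
    {"query": "s corp vs llc", "page": "/llc-or-s-corp/", "impressions": 120},
    {"query": "tax tips", "page": "/tax-tips/", "impressions": 300},
    {"query": "tax tips", "page": "/old-tax-tips/", "impressions": 40},
]
print(cannibalization_candidates(rows))
print(referring_domain_overlap({"a.com", "b.com", "c.com", "d.com"}, {"c.com", "d.com", "e.com"}))
```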
7.2 The Five Comparison Axes
The page winning on three or more axes is the canonical; the other becomes the loser. Axes: traffic (12 month organic clicks); conversion (GA4 attributed conversions); backlinks (Ahrefs referring domains, weighted by authority); AI citation (manual sampling on target query); Information Gain (manual review of original contributions per framework-infogain.md). A page winning on traffic but losing on backlinks and AI citation is not automatically the canonical; the consolidation must preserve the loser's link equity and Information Gain by merging them into the canonical before the redirect.
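The three of five rule can be expressed as a simple win count. A minimal sketch: the field names are hypothetical, each axis is assumed to be reduced to a number where higher is better, and a tie on an axis counts for neither page.

```python
AXES = ("traffic", "conversion", "backlinks", "ai_citation", "information_gain")


def pick_canonical(page_a, page_b):
    """Return the URL of the page that wins 3+ axes, or None if neither does."""
    wins_a = sum(page_a[axis] > page_b[axis] for axis in AXES)
    wins_b = sum(page_b[axis] > page_a[axis] for axis in AXES)
    if wins_a >= 3:
        return page_a["url"]
    if wins_b >= 3:
        return page_b["url"]
    return None  # too many ties: the analyst decides


a = {"url": "/quarterly-estimated-taxes-2026/", "traffic": 4218, "conversion": 31,
     "backlinks": 12, "ai_citation": 2, "information_gain": 3}
b = {"url": "/tax-tips-for-freelancers/", "traffic": 412, "conversion": 4,
     "backlinks": 15, "ai_citation": 0, "information_gain": 4}
print(pick_canonical(a, b))
```

The numbers in the usage example are illustrative only. Whichever page loses still contributes its unique content and link equity through the merge workflow in Section 7.3.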
7.3 The Merge Workflow
- Identify canonical and loser per Section 7.2.
- Inventory unique content in the loser the canonical lacks (H2 sections, FAQ entries, data tables, examples, case studies, expert quotes).
- Merge unique content into the canonical. Add new sections where appropriate. Update H1 if merged scope warrants.
- Update schema. Article headline matches new H1. FAQPage extends with new questions. HowTo extends if procedural content merged. dateModified updates to today.
- Update internal link strategy to reflect the consolidated scope.
- Configure 301: location = /old-loser-path/ { return 301 /canonical-path/; }
- Update internal links sitewide pointing to the loser to point to the canonical directly.
- Submit canonical to IndexNow (framework-contentrefresh.md Section 7.11) and GSC URL Inspection.
- Remove loser URL from sitemap.xml.
- Log consolidation in audit log with both URLs, comparison scores, canonical's pre consolidation 28 day baseline.
7.4 Internal Link Rewiring Script
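LOSER="/old-loser-path/"; CANONICAL="/canonical-path/"
SITE_ROOT="/var/www/sites/[domain]/"
# -F matches the path literally; -Z with read -d '' keeps file names containing spaces intact.
grep -rlZF "${LOSER}" "${SITE_ROOT}" --include="*.html" --include="*.md" | while IFS= read -r -d '' FILE; do
  # Read the whole file before reopening it for writing; opening with "w" first would truncate it to empty.
  python3 -c 'import sys; p, old, new = sys.argv[1:4]; s = open(p).read(); open(p, "w").write(s.replace(old, new))' "${FILE}" "${LOSER}" "${CANONICAL}"
done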
Test on a staging copy before running against production.
7.5 The nginx Redirect Pattern
For a single redirect, the inline location block above. For dozens or hundreds, the map directive scales:
# The map block belongs in the http context. $uri excludes the query string, so inbound links carrying tracking parameters still match.
map $uri $consolidation_redirect {
    /old-loser-path-a/ /canonical-a/;
    /old-loser-path-b/ /canonical-b/;
    default "";
}
server {
    listen 443 ssl http2;
    server_name example.com;
    location / {
        if ($consolidation_redirect != "") { return 301 $consolidation_redirect; }
        try_files $uri $uri/ /index.html;
    }
}
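When the consolidation list lives in the audit CSV, the map entries can be generated rather than typed. The sketch below is an assumption about how that might look, not part of the documented toolchain: `map_entries` is a hypothetical helper that renders only the inner lines of the map block and rejects duplicate sources, redirect chains, and characters that would break nginx syntax.

```python
def map_entries(pairs):
    """Render (old_path, new_path) pairs as nginx map lines.

    Raises ValueError on duplicate sources, on chains or loops (a destination
    that is itself a source), and on paths that are unsafe to paste into nginx.
    """
    sources = [old for old, _ in pairs]
    if len(set(sources)) != len(sources):
        raise ValueError("duplicate source path")
    lines = []
    for old, new in pairs:
        for path in (old, new):
            if not path.startswith("/") or any(ch in path for ch in ' ;{}"\'\n\t$'):
                raise ValueError(f"unsafe path: {path!r}")
        if new in sources:
            raise ValueError(f"redirect chain or loop via {new!r}")
        lines.append(f"    {old} {new};")
    return "\n".join(lines)


print(map_entries([("/old-loser-path-a/", "/canonical-a/"),
                   ("/old-loser-path-b/", "/canonical-b/")]))
```

Rejecting chains up front keeps every loser pointing straight at its canonical, matching the single hop intent of the merge workflow.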
7.6 Consolidation ROI Measurement
Capture canonical's 28 day pre consolidation baseline (clicks, impressions, average position, conversions). After publish and 301 propagation (1 to 4 weeks), capture 28 day post metrics. Ahrefs case study compilation (2024 to 2025): median 47 percent organic click lift on the canonical at 90 days post consolidation across 23 documented consolidations. If consolidation produces zero or negative lift: the merge was cosmetic, or the 301 failed. Validate:
curl -sI "https://example.com/old-loser-path/" | head -1 # Expect: HTTP/2 301
curl -sI "https://example.com/old-loser-path/" | grep -i "^location:" # Expect: the canonical URL
curl -sI "https://example.com/canonical-path/" | head -1 # Expect: HTTP/2 200
GSC URL Inspection on both URLs: the loser should report "URL is not on Google" with canonical as destination, the canonical should report "URL is on Google" with consolidation metadata updated.
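The pre versus post comparison reduces to a percent change per metric. A minimal sketch with hypothetical names; average position is left out because a lower value is better there and a percent lift would read backwards.

```python
def percent_lift(pre, post):
    """Percent change from the 28 day baseline to the 28 day post window."""
    if pre <= 0:
        return None  # no baseline to compare against
    return (post - pre) * 100 / pre


def consolidation_lift(baseline, post):
    """Apply percent_lift to clicks, impressions, and conversions."""
    return {m: percent_lift(baseline[m], post[m])
            for m in ("clicks", "impressions", "conversions")}


baseline = {"clicks": 1000, "impressions": 40000, "conversions": 20}  # illustrative
post = {"clicks": 1470, "impressions": 52000, "conversions": 20}
print(consolidation_lift(baseline, post))
```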
8. Sunset and Pruning Protocol
Sunset removes pages with no salvage value (410) or pages with backlink equity but no traffic (301 to topical successor). Pruning is the portfolio level discipline of recurring sunset.
8.1 The 410 Gone Workflow
410 is the cleaner signal: "this page is gone permanently, do not re index." Trigger criteria (Section 5.6): Tier F, C1=0, C11=0 or 1, C10=0 or 1. Workflow: verify no internal links flow from priority pages (if any do, route to 301 or remove the internal links first); verify no significant backlinks (if any referring domain above DR 30 exists, route to 301 instead); configure location = /dead-page-path/ { return 410; }; remove URL from sitemap.xml; remove from navigation, related posts widgets, category archives; submit GSC URL Removal for sensitive URLs needing faster de indexing; log the deletion. Validate: curl -I "https://example.com/dead-page-path/" | head -1 expects HTTP/2 410.
8.2 The 301 Redirect Workflow
301 is appropriate when the page has backlink equity worth preserving or where a clear topical successor exists. Trigger criteria (Section 5.5). Workflow: identify the closest topical successor (must genuinely cover the same or a parent topic; do not redirect to homepage or generic category archive, which Google treats as soft 404 per Intero Digital 2025 guidance); configure 301; remove the dead URL from sitemap.xml; update internal links sitewide to point to the successor directly; submit successor to IndexNow; submit successor for GSC URL Inspection; log the redirect.
8.3 The Noindex Alternative
For pages that should not be deleted or redirected but should not earn search visibility: noindex. Appropriate for legal pages (terms, privacy, dmca), low value archive pages with historical importance, category or tag archives that serve navigation but not search. Use <meta name="robots" content="noindex, follow">. follow allows link equity to flow through the page while the page itself does not rank. noindex, nofollow severs link equity.
8.4 The 1 Percent Rule and the Pruning Lift
Ahrefs August 2025 (50 case studies, mid market sites): pruning the bottom 30 percent of pages by 12 month organic clicks lifted overall site organic clicks by 1 to 23 percent in 47 of 50 cases. Mean lift was 7.4 percent at 90 days, 9.1 percent at 180 days. The lift is sitewide. Mechanism: the site's perceived quality rises as the bottom decile is removed, crawl budget reallocates to surviving pages, internal link equity concentrates on stronger pages. The 1 percent rule: even a site that prunes only the dead orphan pages at the bottom typically sees measurable lift within 90 days.
8.5 The Danger of Pruning Pages with Backlinks
The most common pruning mistake is 410 on a page that has backlinks. The backlinks become broken; link equity dissipates. Mitigation: before any 410, run an Ahrefs check; if any referring domain above DR 20 exists, route to 301 instead. For 301, the destination must be topically relevant (a 301 from a recipe page to the contact page transfers no equity; Google detects the mismatch and treats as soft 404). Pages with backlinks but no traffic almost always have a 301 destination on the site; the audit job is finding that destination, not deleting the page outright.
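The pre 410 safety check can be sketched as a guard that runs before any deletion. The function name and inputs are hypothetical. The DR threshold is passed explicitly because Section 8.1 cites DR 30 and this section cites DR 20; the sketch does not pick between them.

```python
def sunset_route(referring_domain_ratings, dr_threshold, priority_inlinks=0):
    """Decide whether a dead page may return 410 or must be redirected.

    referring_domain_ratings: DR value of each referring domain (from the
    backlink tool). priority_inlinks: internal links still arriving from
    priority pages.
    """
    if priority_inlinks > 0:
        return "resolve internal links first"  # Section 8.1: remove them or route to 301
    if any(dr > dr_threshold for dr in referring_domain_ratings):
        return "301"                           # link equity worth preserving
    return "410"


print(sunset_route([12, 45], dr_threshold=30))
print(sunset_route([25], dr_threshold=20))
```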
8.6 The Recurring Pruning Cadence
Pruning is not a one time activity. Quarterly: pruning sweep on pages in the bottom decile by trailing 12 month clicks; apply Section 5.6 criteria; route to 410 or 301. Annually: portfolio level pruning review; aggregate the year's pruning activity; assess lift; recalibrate the bottom decile threshold for the next year.
9. Page Level Audit Template
Eight columns populated for every URL, exported as the canonical audit deliverable.
9.1 The Eight Field Template
| Field | Source | Definition |
|---|---|---|
| URL | Inventory CSV | Canonical URL |
| Primary topic | Analyst | One topic phrase from the site's topical taxonomy |
| Traffic | GSC API | 12 month organic clicks |
| Engagement | GA4 API | Engagement time, scroll depth, conversion count |
| Last updated | dateModified | Most recent substantive update date |
| Decision | Section 5 | Keep, Update, Consolidate, Redirect, Delete |
| Action | Workflow | Specific actions and section reference |
| Owner | Analyst/client | Person responsible for executing the action |
9.2 The Markdown Table Format
| URL | Primary topic | Traffic | Engagement | Last updated | Decision | Action | Owner |
|---|---|---|---|---|---|---|---|
| /quarterly-estimated-taxes-2026/ | Tax compliance | 4218 clicks | 3m 12s, 71% scroll | 2026-02-03 | Keep | Refresh in Q3 per cadence | Amanda |
| /s-corp-vs-llc/ | Entity formation | 1102 clicks | 1m 48s, 38% scroll | 2024-11-12 | Update | Add 2026 tax law section, refresh FAQ | Amanda |
| /old-blog-post-2019/ | Legacy | 0 clicks | n/a | 2019-08-14 | Delete | 410 Gone, remove from sitemap | Joseph |
| /best-crm-tools-2023/ | Software comparison | 78 clicks | 0m 42s, 12% scroll | 2023-04-22 | Redirect | 301 to /best-crm-tools-2026/ | Joseph |
| /tax-tips-for-freelancers/ | Tax compliance | 412 clicks | 2m 04s, 54% scroll | 2024-06-08 | Consolidate | Merge into /quarterly-estimated-taxes-2026/ | Amanda |
9.3 The CSV Export
The markdown table exports to CSV with three additions: per criterion scores (C1 to C12), total score, and next action date. Next action date is the audit date plus a per decision offset in days: Keep+90, Update+14, Consolidate+7, Redirect+7, Delete+3. The CSV uploads to Nextcloud or imports into a Google Sheet for client review.
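The next action date rule can be sketched as a lookup plus date arithmetic. The offsets are the ones listed above; the names `NEXT_ACTION_DAYS` and `next_action_date` are illustrative.

```python
from datetime import date, timedelta

# Days from audit date to next action, per decision (values from the text).
NEXT_ACTION_DAYS = {"Keep": 90, "Update": 14, "Consolidate": 7, "Redirect": 7, "Delete": 3}


def next_action_date(decision: str, audit_date: date) -> date:
    """Return the audit date shifted by the offset for the given decision."""
    return audit_date + timedelta(days=NEXT_ACTION_DAYS[decision])


audit_date = date(2026, 5, 14)
for decision in NEXT_ACTION_DAYS:
    print(decision, next_action_date(decision, audit_date).isoformat())
```

An unknown decision raises `KeyError`, which is deliberate in this sketch: a typo in the Decision column should stop the export, not produce a silent default.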
9.4 The Owner and Action Fields
Owner is the person responsible. Most often the client's content team for Update and Consolidate, the agency or Joseph for Redirect and Delete (technical operations). The owner column drives the work queue: the analyst exports the audit CSV filtered to "owner = Amanda" and sends it to Amanda as her quarterly action list.
Action is specific workflow steps. Action is not "update the page"; action is "add 2026 tax law section, refresh FAQ block to include new Q on safe harbor, update dateModified per the Section 8 substantive standard in framework-contentrefresh.md." Specificity is what makes the audit actionable. Action references back to this framework's section numbers or to framework-contentrefresh.md section numbers.
9.5 The Audit Log
The audit log is the running record of every audit decision made on a site. One row per decision. The log persists across audits.
audit_log_entry:
  date: 2026-05-14
  audit_id: 2026-Q2
  url: https://example.com/old-blog-post-2019/
  decision: Delete
  rationale: Tier F. C1=0, C11=0, C10=0. No backlinks, no traffic, no internal link equity.
  action_taken: 410 Gone, removed from sitemap, GSC URL Removal submitted.
  executed_by: Joseph
  executed_date: 2026-05-15
  validation: {curl_status: 410, sitemap_removed: true, gsc_removal_submitted: true}
Stored at /home/user/clients/[clientname]/audits/[quarter]/audit-log.yaml (for example audits/2026-Q2/audit-log.yaml, per Sections 14.3 and 14.6) and committed to the client's documentation system.
10. Section Level Audit
Many pages do not fit cleanly into a single route because parts are healthy and parts decayed. Section level audit is the workflow for those mixed pages.
10.1 When Section Level Audit Is Required
Three signals: page is Tier B or A overall but one specific criterion is at 2 or below; page has multiple H2 sections covering distinct sub topics, and GSC query mining shows traffic concentrated on one section's queries while others receive none; page is a pillar page with multiple cluster topic sections, and the cluster audit shows one section is incomplete while others are comprehensive.
10.2 The Section Inventory
For a page subject to section level audit, inventory every H2 and major H3:
page_url: https://example.com/quarterly-estimated-taxes-2026/
sections:
  - {heading: "What are quarterly estimated taxes", type: definition, word_count: 240,
     last_modified: 2026-02-03, section_score: 5, decision: keep}
  - {heading: "How to calculate your safe harbor amount", type: procedure, word_count: 620,
     last_modified: 2024-08-12, section_score: 2, decision: update}
  - {heading: "Common mistakes and penalties", type: list, word_count: 410,
     last_modified: 2023-11-08, section_score: 1, decision: update}
  - {heading: "FAQ", type: faqpage, word_count: 740,
     last_modified: 2026-02-03, section_score: 4, decision: keep}
Each section gets its own decision (keep, update, expand, remove). The page level decision becomes the union of section level decisions.
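The text does not define "union" formally. The sketch below assumes the page level decision is Update whenever any section decision differs from keep, which matches the examples in Sections 10.4 and 10.5. `summarize_page` is a hypothetical helper, and the section list mirrors the inventory above.

```python
sections = [
    {"heading": "What are quarterly estimated taxes", "decision": "keep"},
    {"heading": "How to calculate your safe harbor amount", "decision": "update"},
    {"heading": "Common mistakes and penalties", "decision": "update"},
    {"heading": "FAQ", "decision": "keep"},
]


def summarize_page(sections: list[dict]) -> tuple[str, list[str]]:
    """Return the assumed page level decision and the headings that need work."""
    to_change = [s["heading"] for s in sections if s["decision"] != "keep"]
    page_decision = "Update" if to_change else "Keep"
    return page_decision, to_change


decision, headings = summarize_page(sections)
print(decision, headings)
```

The list of headings doubles as the raw material for the changelog entry, since it names exactly the sections the refresh will touch.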
10.3 The Section Level Decisions
Four decisions per section: Keep (no change). Update (targeted refresh). Expand (section covers the topic shallowly; add depth, examples, original data). Remove (no longer relevant or outdated beyond salvage; delete from the page). Update and Expand both produce content changes that trigger dateModified update per Section 8 of framework-contentrefresh.md. Remove produces content reduction; the changelog entry documents the removal.
10.4 The Section Level Refresh Workflow
For a page where the page level decision is Update and section inventory shows two sections in Update and three in Keep, the refresh execution touches only the two Update sections. The Keep sections stay byte for byte identical. The dateModified updates because the page received substantive change; the changelog entry documents which sections changed (e.g., "May 14, 2026: Updated 'How to calculate your safe harbor amount' with 2026 IRS amounts. Rewrote 'Common mistakes and penalties' to include 2026 penalty rates."). Changelog specificity demonstrates the dateModified is honest: only the listed sections changed.
10.5 The Pillar Page Section Audit
Pillar pages (long form hubs with many H2 sections, each covering one cluster sub topic) benefit most from section level audit. A pillar page with 12 H2 sections may have eight at Tier A, two at Tier C, two at Tier F. The page level decision is Update; the section level decision is Update on the C sections and Remove on the F sections. Removed sections from a pillar page often become standalone pages if the section had standalone value (content extracts to a new URL; the pillar links out to it).
10.6 The Section Level Audit Cadence
Section level audit runs on the same cadence as page level audit for pillar pages. For standard pages, section level audit runs only when triggered by the page level decision being Update with criterion specificity. Default is page level; section level is the targeted expansion when page level is insufficient.
11. AI Citation Audit Layer
New in 2026: pages that earn AI Overview citations have different optimization profiles than pages earning featured snippets. The AI citation audit layer captures which pages are cited in which AI surfaces and identifies the page patterns earning citation.
11.1 Why AI Citation Is a Distinct Audit Layer
Three reasons. First, AI citation is increasingly decoupled from classic ranking; Ahrefs February 2026 (863000 keywords) found only 38 percent of AI Overview cited pages also rank in top 10 organic. Second, citation earning page patterns are distinctive: FAQ blocks and definition paragraphs disproportionately earn citation. Third, citation status is volatile; AI Overview content changes 70 percent of the time on re run. Sustained citation across a 28 day window is the meaningful signal.
11.2 The Priority Query Tracking Set
10 to 25 queries the site targets for AI citation, chosen by commercial value and topical relevance. For each priority query, manual weekly sampling captures whether the site appears in the AI Overview citation list. Each tracked query records target_page and weekly cited booleans over a 12 week rolling window. Sustained citation is at least 8 of 12 weeks cited.
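The sustained citation rule is a count over the rolling window. A minimal sketch, assuming the weekly samples are stored oldest first as booleans; `is_sustained` is an illustrative name.

```python
def is_sustained(weekly_cited: list[bool], window: int = 12, minimum: int = 8) -> bool:
    """True when at least `minimum` of the last `window` weekly samples were cited."""
    recent = weekly_cited[-window:]
    return sum(recent) >= minimum


# Twelve weekly samples for one priority query, oldest first.
weeks = [True, True, False, True, True, True, False, True, True, False, True, False]
print(is_sustained(weeks))
```

With fewer than 12 samples the function still counts what exists, so a newly tracked query cannot qualify as sustained until it has at least 8 cited weeks.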
11.3 The Citation Earning Section Analysis
For each priority query where the page is cited, identify which section the AI Overview extracts. The cited snippet usually corresponds to one or two specific paragraphs or one FAQ entry. Manual inspection: run the priority query, observe the AI Overview, click into the citation, read the cited block, identify the section structure (FAQ entry, definition paragraph, numbered list, data table), catalog the pattern. Common high citation patterns: 40 to 60 word definition paragraphs immediately after the H2 question; FAQ block with question as <summary> and answer as <p> inside <details>; comparison tables with header row plus 3 to 5 data rows; numbered procedural lists with each step under 30 words.
11.4 The Pattern Replication Recommendation
Once the citation earning sections are cataloged, the audit recommends replicating those patterns sitewide. If FAQ blocks on the site's tax content earn citation but FAQ blocks on the site's other content are absent, the audit recommends adding parallel FAQ blocks. Pattern A: 40 to 60 word definition immediately after H2 question. Pattern B: FAQPage block with details/summary. Each pattern recommendation lists the URLs currently earning citation and the URLs targeted for replication.
11.5 The Citation Loss Trigger
If a page previously holding citation on a priority query loses citation for two consecutive weeks (per framework-aioverviews.md Section 9.7 stability rule), the page enters the Update queue with trigger "AI Overview citation loss." The Update is targeted: refresh the citation earning section, update dateModified, document the change in the changelog.
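A sketch of the loss trigger, assuming the same oldest first list of weekly booleans used for tracking. "Previously holding citation" is interpreted here as "cited in at least one earlier week"; that interpretation and the name `citation_lost` are assumptions, and framework-aioverviews.md Section 9.7 remains the authority on the stability rule.

```python
def citation_lost(weekly_cited: list[bool]) -> bool:
    """True when an earlier week was cited and the two most recent weeks were not."""
    if len(weekly_cited) < 3:
        return False  # not enough history to show a prior citation plus two misses
    *earlier, previous_week, latest_week = weekly_cited
    return any(earlier) and not previous_week and not latest_week


print(citation_lost([True, True, False, False]))   # two consecutive misses after citation
print(citation_lost([True, True, False]))          # only one miss so far
```

As written the check keeps returning True while the page stays uncited, so a real queue would need to record that the Update was already triggered.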
11.6 The Multi Engine Citation Audit
Beyond Google AI Overview, the audit samples ChatGPT Search, Claude Search, Perplexity, Bing Copilot, and Meta AI on the same priority queries. A per query sample captures citation status across all six surfaces. Pages cited on three or more engines are strong; pages cited on one or zero are candidates for the citation defense workflow. Cross reference framework-aicitations.md.
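The per query sample can be reduced to a strength label as follows. The engine list and the two thresholds come from the text; the text does not label the case of exactly two engines, so this sketch returns `unclassified` for it. All names are illustrative.

```python
ENGINES = (
    "Google AI Overview", "ChatGPT Search", "Claude Search",
    "Perplexity", "Bing Copilot", "Meta AI",
)


def classify_citation_strength(cited: dict[str, bool]) -> str:
    """Label one page for one query from its per engine citation sample."""
    count = sum(1 for engine in ENGINES if cited.get(engine, False))
    if count >= 3:
        return "strong"
    if count <= 1:
        return "citation defense candidate"
    return "unclassified"  # exactly two engines: no label is given in the text


sample = {"Google AI Overview": True, "Perplexity": True, "ChatGPT Search": True}
print(classify_citation_strength(sample))
```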
12. Topical Cluster Audit
The cluster level companion to the page level audit. Where page level asks "is this URL doing its job," cluster level asks "is this cluster complete and coherent."
12.1 The Cluster Taxonomy
The cluster taxonomy is the site's declared topical organization: a list of clusters, each with a defined scope and a list of sub topics. The taxonomy lives in /home/user/clients/[clientname]/topical-taxonomy.yaml.
clusters:
  - name: Tax compliance
    pillar_page: /tax-compliance/
    sub_topics: [quarterly estimated taxes, tax deadlines, safe harbor amounts, penalties and interest, filing extensions, amended returns]
    cluster_pages: [/quarterly-estimated-taxes-2026/, /tax-deadlines-2026/, /safe-harbor-amounts/, /tax-penalties-interest/, /filing-extensions/]
12.2 The Gap Analysis
For each cluster, compare declared sub_topics against cluster_pages. Sub topics without a corresponding page are gaps. The output is the cluster expansion queue: net new pages the site should write. Each gap row records sub_topic, recommended URL, priority, and rationale (typically GSC query mining showing impressions for queries with no site page targeting).
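A rough sketch of the comparison using the Section 12.1 example cluster. The taxonomy does not record which page covers which sub topic, so this assumes a simple heuristic: a sub topic is covered when all of its words (minus a few stopwords) appear in some page slug. The heuristic, `STOPWORDS`, and the function names are assumptions; an explicit sub topic to URL mapping in the YAML would be more reliable.

```python
import re

STOPWORDS = {"and", "the", "of", "for"}  # assumed, ignored when matching


def slug_tokens(text: str) -> set[str]:
    """Lowercase word tokens from a sub topic phrase or a URL path."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def find_gaps(sub_topics: list[str], cluster_pages: list[str]) -> list[str]:
    """Return sub topics whose words do not all appear in any single page slug."""
    page_tokens = [slug_tokens(page) for page in cluster_pages]
    gaps = []
    for topic in sub_topics:
        needed = slug_tokens(topic) - STOPWORDS
        if not any(needed <= tokens for tokens in page_tokens):
            gaps.append(topic)
    return gaps


sub_topics = ["quarterly estimated taxes", "tax deadlines", "safe harbor amounts",
              "penalties and interest", "filing extensions", "amended returns"]
cluster_pages = ["/quarterly-estimated-taxes-2026/", "/tax-deadlines-2026/",
                 "/safe-harbor-amounts/", "/tax-penalties-interest/", "/filing-extensions/"]
print(find_gaps(sub_topics, cluster_pages))
```

On the example cluster the only sub topic without a matching page is "amended returns", which would become the first row of the expansion queue.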
12.3 Orphan Page Detection
Pages in the inventory that do not fit any declared cluster. If the page has traffic and quality, expand the taxonomy to include a new cluster or extend an existing one. If neither traffic nor quality, route to Section 5 decision matrix (likely Redirect or Delete). If marginal traffic but no cluster fit, route to consolidation with a topically adjacent page.
import pandas as pd
import yaml

CLIENT_DIR = "/home/user/clients/[clientname]"
inventory = pd.read_csv(f"{CLIENT_DIR}/audits/inventory-scored.csv")
with open(f"{CLIENT_DIR}/topical-taxonomy.yaml") as fh:
    taxonomy = yaml.safe_load(fh)
# Paths that belong to a cluster: every cluster page plus each pillar page.
known_paths = {p for c in taxonomy["clusters"] for p in [c["pillar_page"], *c["cluster_pages"]]}
# Strip the site origin so inventory URLs compare against taxonomy paths.
paths = inventory["url"].str.replace("https://example.com", "", regex=False)
inventory["is_orphan"] = ~paths.isin(known_paths)
inventory[inventory["is_orphan"]].to_csv(f"{CLIENT_DIR}/audits/orphan-pages.csv", index=False)
12.4 Cluster Coherence Check
Within each cluster, check that cluster pages link to each other. The pillar page links out to every cluster page; every cluster page links back to the pillar page; cluster pages link laterally to topically adjacent cluster pages.
#!/usr/bin/env bash
# Check pillar <-> cluster page links with a fixed string search of the rendered HTML.
PILLAR="/tax-compliance/"
CLUSTER_PAGES=("/quarterly-estimated-taxes-2026/" "/tax-deadlines-2026/" "/safe-harbor-amounts/" "/tax-penalties-interest/" "/filing-extensions/")
SITE_ROOT="/var/www/sites/[domain]"   # no trailing slash; page paths already start with /
for CP in "${CLUSTER_PAGES[@]}"; do
  if grep -qF "${CP}" "${SITE_ROOT}${PILLAR}index.html"; then echo "PASS pillar->${CP}"; else echo "FAIL pillar missing ${CP}"; fi
  if grep -qF "${PILLAR}" "${SITE_ROOT}${CP}index.html"; then echo "PASS ${CP}->pillar"; else echo "FAIL ${CP} missing pillar"; fi
done
Cluster coherence failures route to framework-internallinking.md for remediation.
12.5 Cluster Health Score and Cross References
Compute a cluster health score from gap count, orphan count, coherence pass rate, average page level score across cluster pages. Clusters scoring under 30 are at risk; recommendations include gap filling, page level Update on lowest scoring cluster pages, coherence repair. Two clusters claiming overlapping sub topics is cross cluster cannibalization; the audit flags overlapping sub topics and recommends taxonomy revision before further work. The topical cluster audit feeds framework-topicalauthority.md. This framework identifies what work is needed; that framework specifies how the work is executed.
13. Audit Cadence and Velocity
Audit cadence depends on portfolio size and content velocity. A 100 URL site needs less frequent audit than a 5000 URL publisher; a publisher writing 10 new articles per week needs more frequent audit than a static reference site.
13.1 The Size Tiers
| Tier | URL count | Audit cadence |
|---|---|---|
| Small | 100 to 500 | Annual deep + quarterly light |
| Medium | 500 to 5000 | Quarterly section + monthly spot |
| Large | 5000 plus | Continuous rolling via prioritization queue |
13.2 The Small Site Cadence (100 to 500 URLs)
Annual deep audit: full inventory rebuild, full 12 criterion scoring, full decision routing, full audit log. 40 to 80 hours. Quarterly light audit: refresh inventory performance columns (C1, C2, C3, C5 dateModified, C11, C12), rescore the refreshed criteria, action any decision changes. 8 to 16 hours.
13.3 The Medium Site Cadence (500 to 5000 URLs)
Quarterly section audit: rotate through the site in four quarterly sections (Q1 audits one fourth organized by topic cluster or by age, Q2 the next fourth, and so on). By year end every URL has had one deep audit. Monthly spot audit: top 50 by traffic plus bottom 50 by Tier F risk plus any pages decayed in the last 28 days. 4 to 8 hours per month.
13.4 The Large Site Cadence (5000 plus URLs)
Continuous rolling audit: the inventory feeds a prioritization queue ranking every URL by audit priority; the analyst pulls from the top continuously. 8 to 20 hours per week sustained.
audit_priority = days_since_last_audit * traffic_value * volatility_score
Where traffic_value is GSC clicks times conversion rate and volatility_score is the absolute value of the 28 day click delta. Priority rises with the number of days since the last audit, so the top of the queue is high traffic, high volatility pages that have not been audited recently.
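A small sketch of the prioritization queue. It multiplies by days since the last audit so that pages left longest without an audit score higher, which is the queue ordering the text describes. The field names and sample numbers are illustrative, not values from a real inventory.

```python
def audit_priority(page: dict) -> float:
    """Days since last audit times traffic value times 28 day click volatility."""
    traffic_value = page["gsc_clicks"] * page["conversion_rate"]
    volatility_score = abs(page["click_delta_28d"])
    return page["days_since_last_audit"] * traffic_value * volatility_score


def build_queue(pages: list[dict]) -> list[str]:
    """Return URLs ordered from highest to lowest audit priority."""
    return [p["url"] for p in sorted(pages, key=audit_priority, reverse=True)]


pages = [
    {"url": "/b/", "days_since_last_audit": 30, "gsc_clicks": 1000,
     "conversion_rate": 0.02, "click_delta_28d": -150},
    {"url": "/c/", "days_since_last_audit": 200, "gsc_clicks": 1000,
     "conversion_rate": 0.02, "click_delta_28d": 0},
    {"url": "/a/", "days_since_last_audit": 200, "gsc_clicks": 1000,
     "conversion_rate": 0.02, "click_delta_28d": -150},
]
print(build_queue(pages))
```

Because the terms multiply, a page with zero volatility or zero conversions scores zero however long it has gone unaudited; a production queue may want a floor on each term.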
13.5 Sampling for Very Large Sites
For sites over 25000 URLs, full audit is economically infeasible. Stratified sampling produces statistically valid portfolio insights from a fraction of the work. Stratify by performance tier; sample 100 URLs per stratum (600 total); score with the full rubric; project distribution across the full inventory based on stratum proportions. Individual URL decisions for un sampled URLs come from automated heuristics (per criterion thresholds applied to API derived scores).
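The sampling and projection steps can be sketched with the standard library. This assumes each URL already carries a performance tier label and that sampled URLs are scored by hand afterward; the function names, the fixed seed, and the example numbers are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(tier_by_url: dict[str, str], per_stratum: int = 100,
                      seed: int = 0) -> dict[str, list[str]]:
    """Draw up to `per_stratum` URLs at random from each performance tier."""
    strata = defaultdict(list)
    for url, tier in tier_by_url.items():
        strata[tier].append(url)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return {tier: rng.sample(sorted(urls), min(per_stratum, len(urls)))
            for tier, urls in strata.items()}


def project_counts(sample_decisions: dict[str, list[str]],
                   stratum_sizes: dict[str, int]) -> dict[str, float]:
    """Scale each stratum's sampled decision shares up to the stratum's full size."""
    projected = defaultdict(float)
    for tier, decisions in sample_decisions.items():
        weight = stratum_sizes[tier] / len(decisions)  # URLs represented per sampled URL
        for decision in decisions:
            projected[decision] += weight
    return dict(projected)


# Four scored samples standing in for a 400 URL stratum.
print(project_counts({"A": ["Keep", "Keep", "Update", "Keep"]}, {"A": 400}))
```

The projection only describes the portfolio; it does not assign a decision to any individual unsampled URL.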
13.6 Cadence Selection and Velocity
def propose_cadence(total_pages):
    # Thresholds follow the size tiers in 13.1 and the sampling cutoff in 13.5.
    if total_pages < 500:
        return "annual_deep_plus_quarterly_light"
    if total_pages < 5000:
        return "quarterly_section_plus_monthly_spot"
    if total_pages < 25000:
        return "continuous_rolling"
    return "continuous_rolling_with_sampling"
Standard scoring is 12 to 18 minutes per URL after API automation, so a full time analyst audits 130 to 200 URLs per week. For a 5000 URL site, a full audit at this velocity takes 25 to 38 weeks. Continuous rolling cadence is sized to fit velocity: at roughly 100 URLs per week, a 5000 URL queue cycles every 50 weeks.
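The velocity figures follow from simple arithmetic, shown here as a sketch. The 40 hour working week is an assumption (the text says only "full time"), and both function names are illustrative.

```python
def urls_per_week(minutes_per_url: float, hours_per_week: float = 40) -> int:
    """Whole URLs an analyst can score in one week (40 hour week assumed)."""
    return int(hours_per_week * 60 // minutes_per_url)


def weeks_to_cycle(total_urls: int, weekly_urls: int) -> float:
    """Weeks needed to audit every URL once at a steady weekly rate."""
    return total_urls / weekly_urls


print(urls_per_week(18), urls_per_week(12))        # slow and fast scoring pace
print(round(weeks_to_cycle(5000, 130), 1), weeks_to_cycle(5000, 200))
print(weeks_to_cycle(5000, 100))                   # continuous rolling at 100 per week
```

At 18 minutes per URL the weekly count comes out slightly above 130, so the quoted range reads as a rounded figure.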
13.7 The Calendar Anchor
Regardless of cadence, the calendar anchor is the quarterly review. Every quarter the analyst produces a portfolio level report: count of pages audited, decisions executed, decisions pending, ROI on executed decisions (clicks gained, conversions gained, hours invested). The quarterly report is the client deliverable that justifies the audit retainer.
14. Bubbles Hosted Content Audit Toolchain
The entire content audit pipeline runs on the Bubbles host (169.155.162.118, Debian, nginx, 16 GB RAM). No third party CDN, no proxy, no SaaS audit platform in the path. Every component is open source or written in house.
14.1 The Architecture Overview
Every stage runs locally on the Bubbles host. Crawlers (Screaming Frog CLI), API clients (GSC, GA4, Ahrefs via google-api-python-client and requests), and the Python pandas pipeline produce three CSV stages: inventory.csv (canonical merged) feeds inventory-scored.csv (12 criterion scored) feeds audit-deliverable.csv (decisions and actions). Jupyter runs the analyst notebooks; the deliverable shares to client via Nextcloud or Google Sheet. The only external API calls are GSC, GA4, and Ahrefs; responses cache to local CSV files for offline analysis.
14.2 The Installation
One time setup:
sudo apt install -y python3 python3-pip python3-venv jupyter
python3 -m venv /home/user/audit-env
source /home/user/audit-env/bin/activate
pip install pandas openpyxl google-api-python-client google-auth google-analytics-data requests pyyaml
cd /opt
sudo wget https://download.screamingfrogseospider.com/seospider/screamingfrogseospider-22.0.tar.gz
sudo tar -xzf screamingfrogseospider-22.0.tar.gz && sudo mv screamingfrogseospider-22.0 screamingfrog
sudo ln -s /opt/screamingfrog/screamingfrogseospider /usr/local/bin/screamingfrogseospider
Jupyter serves on port 8888 bound to localhost; nginx reverse proxies authenticated access at https://audit.thatdeveloperguy.com/jupyter/ with htpasswd.
14.3 The Per Client Directory Structure
/home/user/clients/[clientname]/
  audits/2026-Q2/        (sitemap-urls.txt, crawler-urls.csv, gsc-urls.csv,
                          ga4-urls.csv, inventory.csv, inventory-scored.csv,
                          audit-deliverable.csv, audit-log.yaml, audit-report.md)
  audits/2026-Q1/        ...
  secrets/               (gsc.json, ga4.json, ahrefs-api-key.txt)
  topical-taxonomy.yaml
  notebooks/             (inventory-build.ipynb, scoring.ipynb,
                          decision-routing.ipynb, cluster-audit.ipynb,
                          quarterly-report.ipynb)
Each quarter has its own subdirectory. The audit log persists across quarters; each new quarter appends to the existing log.
14.4 The Jupyter Notebook Workflow
Five notebooks in sequence: inventory-build.ipynb pulls sitemap, runs Screaming Frog, pulls GSC and GA4 APIs, merges into inventory.csv. scoring.ipynb applies automated scoring then opens for analyst input on C4, C6, C7, C8, C9, C12. decision-routing.ipynb applies the Section 5 decision matrix. cluster-audit.ipynb runs Section 12 gap analysis, orphan detection, coherence check, cluster health score. quarterly-report.ipynb computes ROI on executed decisions and generates the quarterly executive report.
14.5 The Client Share Workflow
Two paths. Path A: Self hosted Nextcloud. The Bubbles host runs Nextcloud at https://cloud.thatdeveloperguy.com/; the deliverable CSV uploads to a client folder; the client logs in; review and comments happen in Nextcloud. No third party SaaS in the path. Path B: Shared Google Sheet. For clients already in Google Workspace, the deliverable CSV imports into a Google Sheet at a shared URL. The client opens, reviews, adds owner assignments inline. The sheet syncs back to CSV on a weekly cadence via the Google Sheets API.
14.6 The Automation Scripts
audit-init.sh creates a new quarter directory, copies the prior quarter's taxonomy and audit log forward:
#!/usr/bin/env bash
# Usage: audit-init.sh CLIENT QUARTER
CLIENT="${1:?client name required}"; QUARTER="${2:?quarter required, e.g. 2026-Q2}"
CLIENT_DIR="/home/user/clients/${CLIENT}"
QUARTER_DIR="${CLIENT_DIR}/audits/${QUARTER}"
mkdir -p "${QUARTER_DIR}"
cp "${CLIENT_DIR}/topical-taxonomy.yaml" "${QUARTER_DIR}/"
# Most recently modified audit log from any existing quarter directory (empty if none).
LATEST=$(ls -t "${CLIENT_DIR}/audits/"*/audit-log.yaml 2>/dev/null | head -1)
if [ -n "${LATEST}" ]; then
  cp "${LATEST}" "${QUARTER_DIR}/audit-log.yaml"
else
  echo "audit_log_entries: []" > "${QUARTER_DIR}/audit-log.yaml"
fi
audit-execute-deletes.sh reads audit-deliverable.csv, extracts URLs marked Delete, applies the 410 nginx config:
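#!/usr/bin/env bash
# Usage: audit-execute-deletes.sh CLIENT QUARTER
# Run as a user that can write to /etc/nginx/redirects/.
CLIENT="$1"; QUARTER="$2"
DELIVERABLE="/home/user/clients/${CLIENT}/audits/${QUARTER}/audit-deliverable.csv"
NGINX_410_MAP="/etc/nginx/redirects/${CLIENT}-410.map"
python3 -c "
import pandas as pd
from urllib.parse import urlparse
df = pd.read_csv('${DELIVERABLE}')
# Append one map entry per Delete row; open the map once and close it cleanly.
with open('${NGINX_410_MAP}', 'a') as fh:
    for url in df[df['decision'] == 'Delete']['url']:
        path = urlparse(url).path or '/'  # a bare domain URL maps to /
        fh.write(f'{path} 410;\n')
"
sudo nginx -t && sudo systemctl reload nginx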
audit-execute-redirects.sh follows the same pattern for 301 redirects from Redirect decision rows.
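The redirect script itself is not shown in this framework, so the following is only a sketch of how one Redirect row might become a map entry. It assumes the destination is written in the Action field as "301 to /path/" (as in the Section 9.2 example) and that the redirect map uses a `source target;` line format parallel to the 410 map. Both assumptions, and the name `redirect_map_line`, are hypothetical.

```python
import re
from typing import Optional
from urllib.parse import urlparse


def redirect_map_line(url: str, action: str) -> Optional[str]:
    """Build 'source target;' from a Redirect row, or None if no target is found."""
    match = re.search(r"301 to (/[^\s,;]*)", action)
    if not match:
        return None  # leave the row for manual handling; never guess a destination
    source = urlparse(url).path or "/"
    return f"{source} {match.group(1)};"


print(redirect_map_line("https://example.com/best-crm-tools-2023/",
                        "301 to /best-crm-tools-2026/"))
```

Returning `None` when the Action text has no parsable destination keeps an incomplete row from producing a redirect to the wrong page.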
14.7 Performance Considerations
The Bubbles host shares 16 GB RAM across nginx, several FastAPI backends, MEGAMIND brain, and the audit pipeline. Screaming Frog can consume 4 GB on a 5000 URL crawl; run during off hours; do not run concurrently with brain. Pandas on a 25000 URL inventory with 30 columns produces a DataFrame around 200 MB; use chunked processing (pd.read_csv with chunksize) for inventories above 50000 URLs. The Bubbles host comfortably handles portfolios up to 25000 URLs.
14.8 The No CDN No Proxy Stance
The audit pipeline runs on a single Debian host with direct internet egress. No CDN sits in front of audit endpoints. Reasons: substrate honesty (every deliverable URL passes the same substrate test the audit applies to client sites); operational sovereignty (a CDN outage would break the audit pipeline); cost (Bubbles is paid for; CDN is recurring spend with no benefit at audit traffic volumes); client demonstration (the pipeline is itself a demonstration of the self hosted philosophy this practice recommends to clients).
14.9 Backup and Recovery
Daily rsync to the 4.5 TB external storage at /mnt/storage/audit-backups/:
SOURCE="/home/user/clients/"
DEST="/mnt/storage/audit-backups/$(date +%Y-%m-%d)/"
mkdir -p "${DEST}" && rsync -av --delete "${SOURCE}" "${DEST}"
Runs from cron at 03:00 daily. Recovery is rsync in reverse. The 90 day retention horizon balances disk usage against recoverability.
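The backup command above creates one dated directory per day but does not itself enforce the 90 day horizon. A sketch of how expired directories could be identified, assuming directory names follow the `date +%Y-%m-%d` format used by the backup; the function only lists candidates and deletes nothing.

```python
from datetime import date, timedelta


def expired_backups(dir_names: list[str], today: date,
                    retention_days: int = 90) -> list[str]:
    """Return dated backup directory names older than the retention horizon."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in dir_names:
        try:
            taken = date.fromisoformat(name)
        except ValueError:
            continue  # skip anything that is not a dated backup directory
        if taken < cutoff:
            expired.append(name)
    return sorted(expired)


names = ["2026-02-12", "2026-02-13", "2026-05-13", "lost+found"]
print(expired_backups(names, today=date(2026, 5, 14)))
```

In practice `dir_names` would come from listing /mnt/storage/audit-backups/, and removal would be a separate, reviewed step.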
14.10 The Toolchain Audit Checklist
| # | Component |
|---|---|
| T1 | Python 3.11 venv at /home/user/audit-env |
| T2 | Screaming Frog CLI at /usr/local/bin/screamingfrogseospider |
| T3 | GSC service account JSON in /home/user/clients/[clientname]/secrets/ |
| T4 | GA4 service account JSON in /home/user/clients/[clientname]/secrets/ |
| T5 | Jupyter at audit.thatdeveloperguy.com/jupyter/ behind htpasswd |
| T6 | Per client directory structure under /home/user/clients/[clientname]/ |
| T7 | Topical taxonomy YAML in /home/user/clients/[clientname]/ |
| T8 | Five notebook templates in notebooks/ |
| T9 | audit-init.sh, audit-execute-deletes.sh, audit-execute-redirects.sh executable |
| T10 | Daily rsync backup to /mnt/storage/audit-backups/ active |
Score out of 10. World class toolchain: 10 of 10, with zero outages on the Bubbles host in the prior quarter.
End of Framework Document
v2.0. Created 2026-05-14. By ThatDeveloperGuy.
Content audit is the recurring discipline that takes inventory of every published URL, scores each against quality and performance, and routes each to keep, update, consolidate, redirect, or delete. Pruning low quality content lifts perceived quality of the entire site (Ahrefs August 2025: 1 to 23 percent organic click lift across 47 of 50 case studies on bottom 30 percent pruning) and reallocates crawl budget toward pages that earn citation, rank, and convert. The 2026 evolution: the AI citation audit layer makes audit a citation defense practice alongside the classic ranking defense; section level audit lets refresh execution target the specific decayed section. Sites running Section 13 cadence against the Section 14 toolchain produce continuous portfolio improvement without CDN or proxy dependencies.
Companions
- framework-initialaudit.md, broader initial site audit context
- framework-ongoingaudit.md, recurring quarterly and monthly cadence
- framework-contentrefresh.md, refresh and update production workflow
- framework-topicalauthority.md, cluster build and maintenance (Section 12)
- framework-internallinking.md, internal link audit during consolidation
- framework-aicitations.md, multi engine AI citation audit (Section 11)
- framework-aioverviews.md, Google AI Overview citation defense
- framework-gscanalysis.md, GSC data pull (Section 3)
- framework-eeat.md, E-E-A-T scoring foundation (C4)
- framework-hcs.md, Helpful Content System quality criteria
- framework-infogain.md, Information Gain assessment (C7)
- framework-sqrg.md, Search Quality Rater Guidelines
- SEO-Search-Appearance.md, multi engine surface map
- SERP-Optimization.md, feature targeting playbook
- 14 tier Engine Optimization Stack, Tier 2 Content Optimization Layer
From the ThatDevPro Engine Optimization framework library. Studio: ThatDevPro (SDVOSB veteran-owned web + AI engineering). Sister property: ThatDeveloperGuy. Source: https://www.thatdevpro.com/insights/framework-contentaudit/.