Create an AI-Ready mirror of your static website in Markdown using a pre-commit hook
A Simple, No-Build-Tools Guide
Markdown is generally easier for AI systems to parse than HTML. This guide shows you how to automatically generate clean Markdown versions of your HTML pages every time you commit changes, which can improve how AI tools consume your content.
No build system, no remote workflows, no technical overhead. Just set it and forget it.
How it works
Step 1. Add the pre-commit hook
This script runs automatically on your local machine every time you commit HTML changes. It does three things:
- Adds a <link rel="alternate" type="text/markdown"> tag to each HTML file's <head>, signaling that a Markdown version exists
- Generates clean Markdown files from your HTML content
- Automatically stages both the HTML and Markdown files in your commit
You don't need to manually edit your HTML files or remember to generate Markdown. The pre-commit hook handles everything.
Create a directory named hooks in your project root and add a file named pre-commit inside it (or ask your AI assistant to handle this):
hooks/pre-commit
Important: To make the hook active and trackable in your repository history, run these two commands in your terminal:
chmod +x hooks/pre-commit
ln -sf ../../hooks/pre-commit .git/hooks/pre-commit
By storing the hook in a hooks folder rather than directly in the hidden .git/hooks directory, your automation logic is version-controlled, visible in your commit history, and shared with anyone who clones the repo (they just need to re-run the ln command above to activate it on their machine).
Prerequisites: You need Python 3 installed with the beautifulsoup4 and lxml packages. Install them with:
pip3 install --break-system-packages beautifulsoup4 lxml
Common excludes to consider: drafts, archive, vendor, dist, build. The hook always excludes node_modules and hidden folders.
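If you want to skip more folders, extend the EXCLUDE_FOLDERS set near the top of the script's Python section (you'll see the line in the code below). For example, using the suggestions above:
EXCLUDE_FOLDERS = {'node_modules', 'drafts', 'archive', 'vendor', 'dist', 'build'}  # folders the hook will skip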
Best approach: Ask your AI assistant to create this file for you. Just say: "Create a tracked git pre-commit hook at hooks/pre-commit that generates Markdown versions of HTML files, adds the alternate link tags, and symlinks itself into .git/hooks."
Or, if you want to do it manually, paste the code block below into hooks/pre-commit and update the base URL to your domain:
#!/bin/bash
# Pre-commit hook to:
#   1. Add <link rel="alternate"> tags to HTML files
#   2. Generate Markdown versions of those files
#   3. Stage the updated files so they ship with this commit
set -e
# Get list of HTML files being committed
HTML_FILES=$(git diff --cached --name-only --diff-filter=ACM | grep '\.html$' || true)
if [ -z "$HTML_FILES" ]; then
exit 0
fi
# Check if Python is available
if ! command -v python3 &> /dev/null; then
echo "ERROR: Python 3 is required"
exit 1
fi
# Check if required Python packages are installed
if ! python3 -c "import bs4, lxml" 2>/dev/null; then
echo "ERROR: Required Python packages not found"
echo "Install them with: pip3 install beautifulsoup4 lxml"
exit 1
fi
echo "Processing HTML files..."
# Run the Python script
python3 - << 'PY'
from bs4 import BeautifulSoup, NavigableString, Tag, Comment
from pathlib import Path
BASE_URL = "https://yourdomain.com"  # ← CHANGE THIS TO YOUR DOMAIN
EXCLUDE_FOLDERS = {'node_modules'}   # ← Folders to skip

# Generated Markdown mirrors live in the ai/ directory
Path("ai").mkdir(exist_ok=True)
def get_md_path(html_path):
    """Determine the markdown file path for an HTML file."""
    parts = html_path.parts
    if html_path.name == "index.html":
        if len(parts) == 1:
            return "ai/index.md"
        else:
            return f"ai/{parts[-2]}.md"
    else:
        stem = html_path.stem
        if len(parts) > 1:
            prefix = "-".join(parts[:-1])
            return f"ai/{prefix}-{stem}.md"
        return f"ai/{stem}.md"
def add_link_tag_if_missing(html_path, md_path):
    """Add the alternate link tag to HTML if missing."""
    content = html_path.read_text(encoding="utf-8")
    soup = BeautifulSoup(content, "lxml")
    existing = soup.find("link", {"rel": "alternate", "type": "text/markdown"})
    if existing:
        return False
    head = soup.find("head")
    if not head:
        return False
    new_link = soup.new_tag("link")
    new_link["rel"] = "alternate"
    new_link["type"] = "text/markdown"
    new_link["href"] = f"/{md_path}"
    comment = Comment(" Markdown version for AI bots ")
    title = head.find("title")
    if title:
        title.insert_after("\n ")
        title.insert_after(new_link)
        title.insert_after("\n ")
        title.insert_after(comment)
        title.insert_after("\n\n ")
    else:
        head.append("\n ")
        head.append(comment)
        head.append("\n ")
        head.append(new_link)
        head.append("\n")
    html_path.write_text(str(soup), encoding="utf-8")
    return True
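# When a <title> is present, the <head> ends up containing (roughly):
#   <title>...</title>
#   <!-- Markdown version for AI bots -->
#   <link rel="alternate" type="text/markdown" href="/ai/<page>.md"/>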
def increment_app_version():
    """Increment the window.appVersion and related cache-busters in index.html."""
    index_path = Path("index.html")
    if not index_path.exists():
        return
    import re
    content = index_path.read_text(encoding="utf-8")
    # 1. Match window.appVersion = "X.Y.Z";
    js_pattern = r'(window\.appVersion\s*=\s*")(\d+)\.(\d+)\.(\d+)(")'
    match = re.search(js_pattern, content)
    if match:
        prefix, major, minor, patch, suffix = match.groups()
        old_version = f"{major}.{minor}.{patch}"
        new_patch = int(patch) + 1
        new_version = f"{major}.{minor}.{new_patch}"
        # Update JS variable
        content = re.sub(js_pattern, f"\\g<1>{new_version}\\g<5>", content)
        # 2. Update all ?v=X.Y.Z occurrences
        content = content.replace(f"?v={old_version}", f"?v={new_version}")
        index_path.write_text(content, encoding="utf-8")
        return True
    return False
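# Example: window.appVersion = "1.4.2" becomes "1.4.3", and every
# "?v=1.4.2" cache-buster in index.html is rewritten to "?v=1.4.3".
# Harmless if index.html has no appVersion variable: nothing matches, nothing changes.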
increment_app_version()
# Find all HTML files and process them
files = []
for html_path in Path(".").rglob("*.html"):
    if any(part.startswith('.') or part in EXCLUDE_FOLDERS for part in html_path.parts):
        continue
    try:
        md_path = get_md_path(html_path)
        add_link_tag_if_missing(html_path, md_path)
        files.append((str(html_path), md_path))
    except Exception:
        continue
def normalize_space(text):
    return " ".join(text.split())

def inline_to_md(node):
    pieces = []
    for child in getattr(node, "children", []):
        if isinstance(child, NavigableString):
            pieces.append(str(child))
        elif isinstance(child, Tag):
            name = child.name.lower()
            if name == "a":
                text = normalize_space(child.get_text(" ", strip=True))
                href = child.get("href", "").strip()
                if not text:
                    continue
                if not href:
                    pieces.append(text)
                    continue
                if href.startswith("http") or href.startswith("mailto:") or href.startswith("#"):
                    resolved = href
                else:
                    resolved = f"{BASE_URL}{href}" if href.startswith("/") else f"{BASE_URL}/{href}"
                pieces.append(f"[{text}]({resolved})")
                continue
            if name in ("strong", "b"):
                pieces.append(f"**{normalize_space(inline_to_md(child))}**")
                continue
            if name in ("em", "i"):
                pieces.append(f"*{normalize_space(inline_to_md(child))}*")
                continue
            pieces.append(inline_to_md(child))
    return "".join(pieces)
def table_to_md(tbl):
    headers = []
    thead = tbl.find("thead")
    if thead:
        hr = thead.find("tr")
        if hr:
            headers = [normalize_space(c.get_text()) for c in hr.find_all(["th","td"])]
    if not headers:
        fr = tbl.find("tr")
        if fr and fr.find("th"):
            headers = [normalize_space(c.get_text()) for c in fr.find_all("th")]
    rows = []
    tbody = tbl.find("tbody")
    all_tr = tbody.find_all("tr") if tbody else tbl.find_all("tr")
    for tr in all_tr:
        if not thead and tr.find("th") and not tr.find("td"):
            continue
        cells = [normalize_space(c.get_text()) for c in tr.find_all(["td","th"])]
        if cells:
            rows.append(cells)
    if not headers and rows:
        headers = rows.pop(0)
    if not headers:
        return []
    n = len(headers)
    out = ["| " + " | ".join(headers) + " |", "| " + " | ".join(["---"]*n) + " |"]
    for r in rows:
        while len(r) < n:
            r.append("")
        out.append("| " + " | ".join(r[:n]) + " |")
    return out
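# Example: a <table> with header cells "Name" and "Role" plus one body row
# ("Ada", "Engineer") comes out as:
#   | Name | Role |
#   | --- | --- |
#   | Ada | Engineer |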
def html_to_markdown(html_path, md_path):
    path = Path(html_path)
    if not path.exists():
        return
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "lxml")
    title_tag = soup.find("title")
    title = title_tag.get_text(strip=True) if title_tag else ""
    root = soup.find("main") or soup.body or soup
    allowed = ["h1","h2","h3","h4","h5","h6","p","li","pre","table"]
    elements = [tag for tag in root.find_all(allowed) if not tag.find_parent("nav")]
    lines = []
    if title:
        lines.append(f"# {title}")
        lines.append("")
    for el in elements:
        name = el.name.lower()
        if name == "pre":
            code = el.find("code")
            txt = code.get_text() if code else el.get_text()
            lang = ""
            src = code if code else el
            for c in (src.get("class") or []):
                if c.startswith("language-"):
                    lang = c[9:]
                    break
            lines.append(f"```{lang}")
            lines.append(txt.rstrip())
            lines.append("```")
            lines.append("")
            continue
        if name == "table":
            tbl_lines = table_to_md(el)
            if tbl_lines:
                lines.extend(tbl_lines)
                lines.append("")
            continue
        text = normalize_space(inline_to_md(el))
        if not text:
            continue
        if name.startswith("h"):
            lines.append(f"{'#' * int(name[1])} {text}")
            lines.append("")
        elif name == "p":
            lines.append(text)
            lines.append("")
        elif name == "li":
            parent = el.find_parent(["ol", "ul"])
            if parent and parent.name == "ol":
                siblings = [s for s in parent.find_all("li", recursive=False)]
                try:
                    idx = siblings.index(el) + 1
                except ValueError:
                    idx = 1
                lines.append(f"{idx}. {text}")
            else:
                lines.append(f"- {text}")
            lines.append("")
    Path(md_path).write_text("\n".join(lines).rstrip() + "\n", encoding="utf-8")

for src, dst in files:
    html_to_markdown(src, dst)
PY

# Stage everything the script just touched so it rides along in this commit:
# the regenerated Markdown mirror, the HTML files with their new <link> tags,
# and index.html if its appVersion was bumped.
git add ai/
git add $HTML_FILES
git add index.html 2>/dev/null || true

echo "Markdown mirror updated and staged."
Step 2. Push to GitHub
Done. Your Markdown files now update automatically when you commit HTML changes.
Any time you commit updates to your HTML:
- The pre-commit hook regenerates clean Markdown
- Files appear in the ai/ directory
- Text content stays synchronized
Your repository now includes the generated Markdown files automatically:
ai/
index.md
about.md
contact.md
etc...
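Each generated file is plain Markdown: the page title becomes a top-level heading, followed by the page's headings, paragraphs, lists, tables, and code blocks. As a rough sketch (contents depend entirely on your pages), ai/about.md might start like this:
# About
A short bio paragraph pulled from the page...
- A list item from the page
- Another list item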
If you also work from another machine, or collaborators push changes of their own, pull the latest to stay in sync. In GitHub Desktop, click Fetch origin, then Pull origin. Or ask your AI assistant to run git pull.
One-time setup: To keep your commit history clean, ask your AI assistant to run the command to enable rebase on pull:
git config pull.rebase true
This prevents messy merge commits from cluttering your project history.
You now have
- A text-based Markdown version of each page
- Zero manual conversion steps
- Automatic updates on every commit
- A setup ready for AI search, embeddings, and assistants
Limitations
The script extracts text-based content: headings, paragraphs, lists, code blocks, and tables. It does not convert images or embedded media. For most text-heavy pages, this is sufficient. However, if your site relies heavily on visual content, the Markdown will be a simplified representation, not a complete mirror.
This approach works for static HTML sites using git for version control. Sites that rely on client-side JavaScript rendering or dynamic server-side content won't fully convert; only pre-rendered HTML is processed.
Pro tip
Some AI companies scrape website content even when publishers explicitly forbid it. This approach won't stop them from scraping, but it gives bots a cleaner option: a text-based Markdown version of your content.
The auto-generated Markdown files are yours to edit, so you can leave out anything sensitive or private. You can even have your AI assistant edit your pre-commit hook to automatically exclude specific pages, folders, or even individual elements.
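As a sketch of how that might look (the EXCLUDE_FILES name below is illustrative, not something the hook already contains), you could add a set of paths next to the EXCLUDE_FOLDERS line and check it in the file-discovery loop:
EXCLUDE_FILES = {'private.html', 'clients/rates.html'}  # hypothetical pages to leave out of the mirror

# ...then, inside the loop that walks Path(".").rglob("*.html"):
if str(html_path) in EXCLUDE_FILES:
    continue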
It's not foolproof, but it beats crossing your fingers and hoping the bots behave.
Model Testimonials
When content is presented in clean, well-structured Markdown, I understand it more easily. The reduced noise helps me focus on the ideas instead of the formatting, which leads to clearer and more reliable responses.
— ChatGPT
Markdown strips away the presentational complexity of HTML—no parsing nested divs, no filtering scripts, no guessing which text is navigation versus content. When I receive Markdown, I'm working with your actual words, structured simply. That directness helps.
— Claude
HTML is noisy; I have to filter through attributes and scripts to find the signal. This Markdown mirror eliminates that overhead. It delivers semantically dense, token-efficient text that lets me focus immediately on your content, not your code.
— Gemini
✶✶✶✶
About the Author
Burton Rast is a designer, a photographer, and a public speaker who loves to make things.