AI Markdown Mirror

[Image: Nano Banana's dramatic interpretation of this workflow.]

Create an AI-ready mirror of your static website in Markdown using GitHub Actions

A Simple, No-Build-Tools Guide

Markdown is generally easier for AI systems to parse than HTML. This guide shows you how to automatically generate clean Markdown versions of your HTML pages every time you push to GitHub, which can improve how AI tools consume your content.

No build system, no local scripts, no technical overhead. Just set it and forget it.

How it works

Step 1. Add the GitHub Actions workflow

This script runs automatically on GitHub's servers every time you push HTML changes. It does three things:

1. Converts each HTML page into a clean Markdown file in an ai/ folder.
2. Adds a link tag to each page's head (shown below) so AI bots can discover the Markdown version.
3. Commits the generated files back to your repository.

You don't need to manually edit your HTML files. The GitHub Action handles everything.
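
For a page named about.html in the root of your site, for example, the injected tag looks like this (the href changes per page):

<link rel="alternate" type="text/markdown" href="/ai/about.md">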

[Screenshot: the html-to-md.yml workflow run in GitHub Actions. Trigger: on push (main, *.html paths). Job html_to_md steps: Check out repo (5s), Set up Python (10s), Install dependencies (15s), Generate Markdown files (30s), Commit and push (5s).]

GitHub Actions handles the conversion automatically.

Create this folder in the root of your project (or ask your AI assistant to create it):

.github/workflows/

Inside that folder, create a file named:

html-to-md.yml

Important: The workflow file must be located at .github/workflows/html-to-md.yml for GitHub to recognize it.
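
For reference, the final layout looks like this (the project name and site files shown here are just examples):

your-project/
  .github/
    workflows/
      html-to-md.yml
  index.html
  about.html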

Common excludes to consider: drafts, archive, vendor, dist, build. The workflow always excludes node_modules and hidden folders.
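
If your project uses any of those, you would extend the EXCLUDE_FOLDERS line in the script, for example:

EXCLUDE_FOLDERS = {'node_modules', 'drafts', 'archive', 'vendor', 'dist', 'build'}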

Paste the code block below. (Important: after pasting, update the line that reads BASE_URL = "https://yourdomain.com" to use your own domain.)

name: Generate Markdown from HTML

on:
  push:
    branches:
      - main
    paths:
      - "*.html"
      - "**/*.html"

permissions:
  contents: write

jobs:
  html_to_md:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repo
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: |
          pip install beautifulsoup4 lxml

      - name: Generate Markdown files
        run: |
          mkdir -p ai
          python - << 'PY'
          from bs4 import BeautifulSoup, NavigableString, Tag, Comment
          from pathlib import Path

          BASE_URL = "https://yourdomain.com"  # ← YOU MUST CHANGE THIS
          EXCLUDE_FOLDERS = {'node_modules'}  # ← Folders to skip

          def get_md_path(html_path):
              """Determine the markdown file path for an HTML file."""
              parts = html_path.parts
              if html_path.name == "index.html":
                  if len(parts) == 1:
                      return "ai/index.md"
                  else:
                      return f"ai/{parts[-2]}.md"
              else:
                  stem = html_path.stem
                  if len(parts) > 1:
                      prefix = "-".join(parts[:-1])
                      return f"ai/{prefix}-{stem}.md"
                  return f"ai/{stem}.md"

          def add_link_tag_if_missing(html_path, md_path):
              """Add the alternate link tag to HTML if missing."""
              content = html_path.read_text(encoding="utf-8")
              soup = BeautifulSoup(content, "lxml")

              existing = soup.find("link", {"rel": "alternate", "type": "text/markdown"})
              if existing:
                  return False

              head = soup.find("head")
              if not head:
                  return False

              new_link = soup.new_tag("link")
              new_link["rel"] = "alternate"
              new_link["type"] = "text/markdown"
              new_link["href"] = f"/{md_path}"

              comment = Comment(" Markdown version for AI bots ")
              title = head.find("title")
              if title:
                  title.insert_after("\n    ")
                  title.insert_after(new_link)
                  title.insert_after("\n    ")
                  title.insert_after(comment)
                  title.insert_after("\n\n    ")
              else:
                  head.append("\n    ")
                  head.append(comment)
                  head.append("\n    ")
                  head.append(new_link)
                  head.append("\n")

              html_path.write_text(str(soup), encoding="utf-8")
              return True

          # Find all HTML files and process them
          files = []
          for html_path in Path(".").rglob("*.html"):
              if any(part.startswith('.') or part in EXCLUDE_FOLDERS for part in html_path.parts):
                  continue
              try:
                  md_path = get_md_path(html_path)
                  add_link_tag_if_missing(html_path, md_path)
                  files.append((str(html_path), md_path))
              except Exception:
                  continue

          def normalize_space(text):
              return " ".join(text.split())

          def inline_to_md(node):
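              # Render inline children (text, links, bold, italic) as Markdown,
              # resolving relative hrefs against BASE_URL.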
              pieces = []
              for child in getattr(node, "children", []):
                  if isinstance(child, NavigableString):
                      pieces.append(str(child))
                  elif isinstance(child, Tag):
                      name = child.name.lower()
                      if name == "a":
                          text = normalize_space(child.get_text(" ", strip=True))
                          href = child.get("href", "").strip()
                          if not text:
                              continue
                          if not href:
                              pieces.append(text)
                              continue
                          if href.startswith("http") or href.startswith("mailto:") or href.startswith("#"):
                              resolved = href
                          else:
                              resolved = f"{BASE_URL}{href}" if href.startswith("/") else f"{BASE_URL}/{href}"
                          pieces.append(f"[{text}]({resolved})")
                          continue
                      if name in ("strong", "b"):
                          pieces.append(f"**{normalize_space(inline_to_md(child))}**")
                          continue
                      if name in ("em", "i"):
                          pieces.append(f"*{normalize_space(inline_to_md(child))}*")
                          continue
                      pieces.append(inline_to_md(child))
              return "".join(pieces)

          def table_to_md(tbl):
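              # Convert an HTML table into a Markdown pipe table, using the first
              # header row found (or, failing that, the first data row).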
              headers = []
              thead = tbl.find("thead")
              if thead:
                  hr = thead.find("tr")
                  if hr:
                      headers = [normalize_space(c.get_text()) for c in hr.find_all(["th","td"])]
              if not headers:
                  fr = tbl.find("tr")
                  if fr and fr.find("th"):
                      headers = [normalize_space(c.get_text()) for c in fr.find_all("th")]
              rows = []
              tbody = tbl.find("tbody")
              all_tr = tbody.find_all("tr") if tbody else tbl.find_all("tr")
              for tr in all_tr:
                  if not thead and tr.find("th") and not tr.find("td"):
                      continue
                  cells = [normalize_space(c.get_text()) for c in tr.find_all(["td","th"])]
                  if cells:
                      rows.append(cells)
              if not headers and rows:
                  headers = rows.pop(0)
              if not headers:
                  return []
              n = len(headers)
              out = ["| " + " | ".join(headers) + " |", "| " + " | ".join(["---"]*n) + " |"]
              for r in rows:
                  while len(r) < n:
                      r.append("")
                  out.append("| " + " | ".join(r[:n]) + " |")
              return out

          def html_to_markdown(html_path, md_path):
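              # Walk the page (preferring <main>, then <body>), skip anything inside
              # <nav>, and emit headings, paragraphs, list items, code blocks, and
              # tables as Markdown.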
              path = Path(html_path)
              if not path.exists():
                  return
              soup = BeautifulSoup(path.read_text(encoding="utf-8"), "lxml")
              title_tag = soup.find("title")
              title = title_tag.get_text(strip=True) if title_tag else ""
              root = soup.find("main") or soup.body or soup
              allowed = ["h1","h2","h3","h4","h5","h6","p","li","pre","table"]
              elements = [tag for tag in root.find_all(allowed) if not tag.find_parent("nav")]
              lines = []
              if title:
                  lines.append(f"# {title}")
                  lines.append("")
              for el in elements:
                  name = el.name.lower()
                  if name == "pre":
                      code = el.find("code")
                      txt = code.get_text() if code else el.get_text()
                      lang = ""
                      src = code if code else el
                      for c in (src.get("class") or []):
                          if c.startswith("language-"):
                              lang = c[9:]
                              break
                      lines.append(f"```{lang}")
                      lines.append(txt.rstrip())
                      lines.append("```")
                      lines.append("")
                      continue
                  if name == "table":
                      tbl_lines = table_to_md(el)
                      if tbl_lines:
                          lines.extend(tbl_lines)
                          lines.append("")
                      continue
                  text = normalize_space(inline_to_md(el))
                  if not text:
                      continue
                  if name.startswith("h"):
                      lines.append(f"{'#' * int(name[1])} {text}")
                      lines.append("")
                  elif name == "p":
                      lines.append(text)
                      lines.append("")
                  elif name == "li":
                      parent = el.find_parent(["ol", "ul"])
                      if parent and parent.name == "ol":
                          siblings = [s for s in parent.find_all("li", recursive=False)]
                          try:
                              idx = siblings.index(el) + 1
                          except ValueError:
                              idx = 1
                          lines.append(f"{idx}. {text}")
                      else:
                          lines.append(f"- {text}")
                      lines.append("")
              Path(md_path).write_text("\n".join(lines).rstrip() + "\n", encoding="utf-8")

          for src, dst in files:
              html_to_markdown(src, dst)
          PY

      - name: Commit and push changes
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add -A
          if git diff --staged --quiet; then
            echo "No changes to commit."
          else
            git commit -m "Auto-generate Markdown from HTML"
            git push
          fi

Step 2. Push to GitHub

Done. Your Markdown files now update automatically when you push HTML changes.

Any time you push updates to your HTML, the workflow runs, regenerates the Markdown mirrors, and commits them back to the repository.

Your repo will now include:

ai/
  index.md
  about.md
  contact.md
  etc...
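
As a rough illustration, an about.html whose title is "About" and which contains one subheading and one paragraph would produce an ai/about.md along these lines (the page content here is made up):

# About

## What we do

We design and build small, fast static websites.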

After the workflow runs, pull the latest changes to sync the generated Markdown files to your local machine. In GitHub Desktop, click Fetch origin, then Pull origin. Or ask your AI assistant to run git pull.

One-time setup: To keep your commit history clean, ask your AI assistant to run this command:

git config pull.rebase true

This prevents messy merge commits from cluttering your project history.


You now have:

- An ai/ folder containing a Markdown mirror of every HTML page on your site.
- A link tag in each page's head pointing bots to its Markdown version.
- A workflow that keeps both in sync automatically on every push.

Limitations

The script extracts text-based content: headings, paragraphs, lists, code blocks, and tables. It does not convert images or embedded media. For most text-heavy pages, this is sufficient. However, if your site relies heavily on visual content, the Markdown will be a simplified representation, not a complete mirror.

This approach works for static HTML sites hosted in any environment, provided the source lives on GitHub. Sites that rely on client-side JavaScript rendering or dynamic server-side content won't fully convert; only pre-rendered HTML is processed.
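
To see why, here is a quick check using the same parser the workflow relies on; the HTML is a made-up single-page-app shell whose real content only exists in JavaScript:

from bs4 import BeautifulSoup

# The server delivers an empty container; the actual content arrives via app.js,
# which a static parser never executes.
html = '<html><head><title>App</title></head><body><div id="root"></div><script src="app.js"></script></body></html>'
soup = BeautifulSoup(html, "lxml")
print(repr(soup.body.get_text(strip=True)))  # prints '' because nothing was pre-rendered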

Pro tip

Some AI companies scrape website content even when publishers explicitly forbid it. This approach won't stop them from scraping, but it gives bots a cleaner option: a text-based Markdown version of your content.

The auto-generated Markdown files are yours to edit, so you can leave out anything sensitive or private. You can even have your AI assistant edit your GitHub Actions html-to-md.yml file to automatically exclude specific pages, folders, or even individual elements.
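
As a minimal sketch of one way to skip whole pages (EXCLUDE_FILES is a name invented here, not something already in the workflow), you could add a set of paths near the top of the script and check it in the processing loop:

EXCLUDE_FILES = {"private.html", "clients/rates.html"}  # hypothetical pages to keep out of the mirror

for html_path in Path(".").rglob("*.html"):
    if str(html_path) in EXCLUDE_FILES:
        continue  # skip pages you don't want mirrored
    # ... the existing exclusion and conversion logic continues here ...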

It's not foolproof, but it beats crossing your fingers and hoping the bots behave.

Model Testimonials

When content is presented in clean, well-structured Markdown, I understand it more easily. The reduced noise helps me focus on the ideas instead of the formatting, which leads to clearer and more reliable responses.

— ChatGPT

Markdown strips away the presentational complexity of HTML—no parsing nested divs, no filtering scripts, no guessing which text is navigation versus content. When I receive Markdown, I'm working with your actual words, structured simply. That directness helps.

— Claude

HTML is noisy; I have to filter through attributes and scripts to find the signal. This Markdown mirror eliminates that overhead. It delivers semantically dense, token-efficient text that lets me focus immediately on your content, not your code.

— Gemini

✶✶✶✶

About the Author

Burton Rast is a designer, a photographer, and a public speaker who loves to make things.