HTML to Markdown without scripts, styles, and navigation clutter
How to turn a web page into readable Markdown while keeping the article body, title, description, links, and tables.
Conversion · 9 min read · Updated 2026-06-07
Use this guide to extract the readable content of an HTML page into Markdown without page chrome.
Extract the readable page
Most HTML files include more than the content you want. They may contain headers, footers, menus, tracking scripts, CSS, inline SVG, hidden templates, and app state. A good Markdown conversion should focus on the main readable content.
The page title and meta description are still useful. They give the Markdown file context, especially when the output is used in a knowledge base or document archive.
Keep links understandable
Markdown links should preserve the anchor text and destination. If a page has repeated links like `Read more`, rewrite their anchor text after conversion so each link still makes sense outside the original web layout.
Check relative links if the Markdown will live in a different repository or site, because they resolve against the original page's URL and can break elsewhere. Convert them to absolute URLs when the destination matters.
Watch for JavaScript-rendered pages
If the source HTML does not contain the final article text and depends on JavaScript to render it, a static converter may only see the shell of the page. In that case, export the rendered article or use a source file that includes the real content.
Start by identifying the main content
HTML pages often contain far more than the article or documentation you want. A saved page can include navigation, cookie banners, related links, script templates, analytics snippets, inline styles, and hidden elements. The conversion goal is not to preserve the website shell. The goal is to keep the content someone came to read.
A good first check is to ask what the Markdown file should be useful for. If it is for an archive, keep the title, description, author, publication date, and source URL. If it is for documentation reuse, keep headings, code blocks, tables, and meaningful links. If it is for research notes, keep quotes and citations. The answer changes what should be removed.
Do not assume every visible element belongs in the Markdown. Menus, breadcrumbs, newsletter boxes, social share text, and legal footers usually become noise. They make a Markdown file look longer while making it less useful. Remove them unless they are part of the content being documented.
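As a rough illustration of dropping page chrome, the sketch below uses Python's standard `html.parser` to collect text while skipping everything inside tags that usually hold scripts, styles, and navigation. The `SKIP_TAGS` set and the names `ReadableTextExtractor` and `readable_text` are assumptions made for this example, not part of any particular converter; real pages usually need a site-specific list.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is treated as page chrome.
# This list is an assumption for the example; tune it per site.
SKIP_TAGS = {"script", "style", "noscript", "template", "svg",
             "nav", "header", "footer", "aside", "form"}


class ReadableTextExtractor(HTMLParser):
    """Collects text chunks that are not nested inside a skipped tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # how many skipped tags are currently open
        self.chunks = []      # readable text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.chunks.append(text)


def readable_text(html):
    """Return the list of text chunks found outside SKIP_TAGS."""
    parser = ReadableTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


page = (
    "<html><head><style>p{}</style></head><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<article><h1>Title</h1><p>Body text.</p></article>"
    "<script>var x = 1;</script><footer>Legal</footer>"
    "</body></html>"
)
print(readable_text(page))  # ['Title', 'Body text.']
```

One caveat with this choice: skipping every `header` element also drops a `header` placed inside the article itself, so remove that tag from the set if the page keeps its title there.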
Preserve title, description, and source context
The page title and meta description are easy to overlook, but they are valuable when the Markdown is stored outside the original site. Add them near the top of the file when the output is used in a document archive or knowledge base. A title alone may not explain why the page mattered; a description often captures the page's purpose in one sentence.
Keep the source URL when the Markdown will be cited, reviewed, or updated later. Without a source URL, someone has to guess where the content came from. If the page is internal or private, record enough source context to find it again without exposing sensitive links in public output.
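One possible way to carry the title, description, and source URL into the output is sketched below. It reads the `<title>` element and the `description` meta tag with Python's standard `html.parser`, then builds a short Markdown header. The names `MetadataExtractor` and `markdown_header`, and the header layout itself, are illustrative assumptions; the source URL is passed in by the caller because a saved HTML file does not reliably record it.

```python
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    """Reads the first <title> and the description meta tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_title = False
        self.title = ""
        self.description = ""

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title" and not self.title:
            # Only the first non-empty title counts; inline SVG may have its own.
            self.in_title = True
        elif tag == "meta" and (attr.get("name") or "").lower() == "description":
            self.description = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def markdown_header(html, source_url):
    """Build a Markdown header block: title, optional description, source."""
    parser = MetadataExtractor()
    parser.feed(html)
    parser.close()
    lines = [f"# {parser.title.strip()}", ""]
    if parser.description:
        lines += [f"> {parser.description}", ""]
    lines.append(f"Source: <{source_url}>")
    return "\n".join(lines)


sample = (
    "<html><head><title>Setup guide</title>"
    '<meta name="description" content="How to install the tool.">'
    "</head><body></body></html>"
)
print(markdown_header(sample, "https://example.com/docs/setup"))
```

For a private page, the same function can take an internal identifier instead of a link, so the public output does not expose a sensitive URL.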
Dates need care. A page may have a published date, a modified date, or no visible date. If the date affects trust, keep it. For technical documentation, a stale date can explain why instructions no longer match the product. For policy pages, the effective date may matter more than the scrape date.
Make links and code blocks useful outside the page
HTML links often depend on surrounding layout. Link text like `Learn more`, `Read this`, or `Click` is weak in Markdown because the reader may not see the original card or button. Rewrite important links so the anchor text names the destination or action. This improves accessibility and makes the Markdown easier to scan.
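Weak anchor text can be found mechanically before it is rewritten by hand. The sketch below scans converted Markdown for inline links whose text is on a small list of generic phrases. The `WEAK_TEXT` list and the function name `weak_links` are assumptions for this example, and the regular expression only handles simple `[text](url)` links without titles.

```python
import re

# Generic phrases that say nothing about the destination.
# The list is an assumption for the example; extend it as needed.
WEAK_TEXT = {"read more", "learn more", "read this", "click", "click here", "here", "more"}

# Matches simple inline links of the form [text](url).
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def weak_links(markdown):
    """Return (text, href) pairs whose anchor text is generic."""
    return [
        (text, href)
        for text, href in LINK_RE.findall(markdown)
        if text.strip().lower() in WEAK_TEXT
    ]


doc = "[Learn more](/pricing) about plans, or see [Pricing plans](/pricing)."
print(weak_links(doc))  # [('Learn more', '/pricing')]
```

The function only reports candidates. Choosing replacement text that names the destination still needs a person who knows what the link points to.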
Relative links are another common issue. A link such as `/docs/setup` works on the original site, but it may break inside a repository, chat transcript, or knowledge base. Convert important relative links to absolute URLs, or document the base URL at the top of the Markdown file.
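Resolving relative links against the page's original URL can be sketched with `urllib.parse.urljoin` from the Python standard library. The function name `absolutize_links` and the example base URL are assumptions for illustration, and the regular expression only covers simple `[text](url)` links (including the link part of images), not links that carry a title.

```python
import re
from urllib.parse import urljoin

# Matches simple inline links of the form [text](url).
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def absolutize_links(markdown, base_url):
    """Rewrite relative link targets as absolute URLs based on base_url."""

    def fix(match):
        text, href = match.group(1), match.group(2)
        # urljoin leaves already-absolute URLs unchanged.
        return f"[{text}]({urljoin(base_url, href)})"

    return LINK_RE.sub(fix, markdown)


base = "https://example.com/guide/intro"
print(absolutize_links("See [setup](/docs/setup) and the [FAQ](faq).", base))
# See [setup](https://example.com/docs/setup) and the [FAQ](https://example.com/guide/faq).
```

The base URL should be the address the page was originally served from. A wrong base produces links that look valid but point to the wrong place.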
Code blocks should keep language hints when possible. A fenced block marked as `go`, `bash`, or `tsx` is easier to read and can be highlighted by many tools. If a code block is split by HTML wrappers, comments, or line numbers, clean it before publishing. Code copied from a bad conversion can waste more time than the conversion saved.
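Many pages mark code with a class such as `language-go` or `lang-bash` on the `code` or `pre` element. The sketch below, which assumes that naming convention, turns such a class attribute into a fence language hint. The function names `fence_language` and `to_fence` are hypothetical, and pages that use a different class scheme need their own mapping.

```python
import re


def fence_language(class_attr):
    """Guess a fence language from a class like 'language-go' or 'hljs lang-bash'."""
    for name in (class_attr or "").split():
        match = re.fullmatch(r"(?:language|lang)-([\w+#-]+)", name)
        if match:
            return match.group(1).lower()
    return ""  # no hint found: emit a plain fence


def to_fence(code, class_attr):
    """Wrap code in a fenced block, adding the language hint when one is found."""
    fence = "`" * 3
    return f"{fence}{fence_language(class_attr)}\n{code.rstrip()}\n{fence}"


print(fence_language("language-go"))     # go
print(fence_language("hljs lang-bash"))  # bash
print(to_fence("echo hi\n", "language-bash"))
```

Returning an empty hint when nothing matches is a deliberate choice here: a plain fence is safer than a guessed language that highlights the code incorrectly.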
Watch for pages rendered by JavaScript
Some pages ship almost no article content in the HTML response. The browser runs JavaScript, fetches data, and builds the page after load. A static HTML conversion may capture only the app shell. If the Markdown output contains navigation but not the article, client-side rendering is the likely cause.
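A quick heuristic can flag a likely app shell before conversion. The sketch below counts visible text outside scripts and styles and reports a page as a probable shell when it loads scripts but carries very little text. The 500-character threshold and the names `ShellProbe` and `looks_like_app_shell` are assumptions for this example; the check is a hint, not proof.

```python
from html.parser import HTMLParser

# Content inside these tags is never visible article text.
IGNORED = ("script", "style", "noscript", "template")


class ShellProbe(HTMLParser):
    """Counts script tags and visible text characters."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.ignored_depth = 0
        self.text_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in IGNORED:
            self.ignored_depth += 1

    def handle_endtag(self, tag):
        if tag in IGNORED and self.ignored_depth > 0:
            self.ignored_depth -= 1

    def handle_data(self, data):
        if self.ignored_depth == 0:
            self.text_chars += len(data.strip())


def looks_like_app_shell(html, min_text_chars=500):
    """True when the page has scripts but less visible text than the threshold."""
    probe = ShellProbe()
    probe.feed(html)
    probe.close()
    return probe.script_tags > 0 and probe.text_chars < min_text_chars


shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
print(looks_like_app_shell(shell))  # True
```

Short pages with real content, such as a brief changelog entry, can trip the threshold, so treat a positive result as a reason to inspect the file and not as a final verdict.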
For those pages, look for a print view, export option, CMS source, documentation source file, or server-rendered version. If you control the site, export from the source system instead of scraping the rendered page. If you do not control it, save the final rendered article in a way that includes the actual text.
The cleanup rule is simple: keep content that would still help a reader if the website disappeared tomorrow. Remove everything that only helps operate the website interface. That rule keeps HTML-to-Markdown conversion focused and prevents the output from becoming a noisy dump of page chrome.
Try the converter
Use the converter after preparing your source file, then review headings, lists, tables, and links before publishing the Markdown.