
PDF to Markdown: what can and cannot be preserved

PDF is a display format, not a semantic document model. A text-based PDF can produce a useful Markdown draft, but conversion can go wrong on reading order, custom fonts, tables, forms, page furniture (repeated headers, footers, and page numbers), and scanned pages. Test text selection first, use OCR for image-only pages, prefer the original editable file when available, and manually verify names, dates, totals, clauses, units, and page references.

Last reviewed July 17, 2026

Who this guide is for

• People converting reports and manuals
• Documentation and migration teams
• Researchers and RAG builders
• Users deciding between OCR, PDF, and original editable files

How do I know whether a PDF has a usable text layer?

Select a paragraph in the PDF viewer and paste it into a plain-text editor. If the pasted text is empty, scrambled, or in a different order than on the page, the text layer is missing or unreliable.
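
The paste test can be roughly automated for batches of files. The sketch below is a heuristic, not a definitive check: the function name `text_layer_looks_usable` and both thresholds are assumptions chosen for illustration, and the input is whatever text you copied or extracted from the PDF. It flags empty or mostly unreadable text, but it cannot detect text that is readable yet in the wrong order, so a visual comparison is still needed.

```python
import unicodedata


def text_layer_looks_usable(pasted: str, min_chars: int = 40,
                            min_ratio: float = 0.7) -> bool:
    """Heuristic check on text copied or extracted from a PDF page.

    min_chars and min_ratio are illustrative thresholds, not standards.
    """
    stripped = pasted.strip()
    if len(stripped) < min_chars:
        # Empty or near-empty result: the page is probably image-only.
        return False
    readable = sum(
        1
        for ch in stripped
        if ch.isalnum()
        or ch.isspace()
        or unicodedata.category(ch).startswith("P")  # punctuation
    )
    # A low share of letters, digits, spaces, and punctuation suggests
    # broken glyph mapping rather than real text.
    return readable / len(stripped) >= min_ratio
```

A result of `True` only means the text is worth converting; it does not confirm that the reading order matches the page.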

When is OCR required?

OCR is required for scanned or image-only pages and may be required for custom-font PDFs whose glyphs do not map to normal Unicode. OCR recognizes characters; Markdown cleanup organizes the recognized text. Both steps require review.
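
One way to spot custom-font pages that may need OCR is to count characters that rarely appear in ordinary text. The sketch below assumes the extracted text is already available as a Python string; the function names and the 1% threshold are hypothetical. It counts Unicode replacement characters, private-use code points, unassigned code points, and stray control characters, which are common symptoms of glyphs that do not map to normal Unicode.

```python
import unicodedata
from collections import Counter


def suspicious_characters(text: str) -> Counter:
    """Count characters that often indicate broken glyph-to-Unicode mapping."""
    counts = Counter()
    for ch in text:
        category = unicodedata.category(ch)
        if ch == "\ufffd":
            counts["replacement"] += 1
        elif category == "Co":  # private-use area
            counts["private_use"] += 1
        elif category == "Cn":  # unassigned code point
            counts["unassigned"] += 1
        elif category == "Cc" and ch not in "\n\r\t\f":
            counts["control"] += 1
    return counts


def needs_ocr_review(text: str, max_share: float = 0.01) -> bool:
    """Flag text that is empty or has too many suspicious characters.

    max_share is an illustrative threshold, not an established value.
    """
    if not text.strip():
        return True
    return sum(suspicious_characters(text).values()) / len(text) > max_share
```

A flagged page is a candidate for OCR, not proof that OCR will do better; both outputs still need review against the page.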

Why does reading order break?

PDFs store positioned fragments. Multi-column pages, sidebars, captions, footnotes, forms, and headers can be emitted in a sequence that differs from how a person reads the page.
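
To illustrate why emitted order and reading order differ, here is a minimal sketch that reorders fragments on a two-column page. The `Fragment` type, the `order_two_columns` function, and the `gutter_x` parameter are all hypothetical, and the sketch assumes `y` is measured from the top of the page (native PDF coordinates usually start at the bottom left, so a real extractor may need the axis flipped). It handles only a clean two-column layout; sidebars, captions, and footnotes need additional rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    x: float   # left edge of the fragment
    y: float   # distance from the top of the page (assumption)
    text: str


def _top_then_left(fragment: Fragment) -> tuple[float, float]:
    return (fragment.y, fragment.x)


def order_two_columns(fragments: list[Fragment], gutter_x: float) -> list[str]:
    """Read the whole left column top to bottom, then the right column."""
    left = [f for f in fragments if f.x < gutter_x]
    right = [f for f in fragments if f.x >= gutter_x]
    ordered = sorted(left, key=_top_then_left) + sorted(right, key=_top_then_left)
    return [f.text for f in ordered]


# Fragments as an extractor might emit them: row by row across both columns.
emitted = [
    Fragment(50, 100, "A1"), Fragment(320, 100, "B1"),
    Fragment(50, 120, "A2"), Fragment(320, 120, "B2"),
]
print(order_two_columns(emitted, gutter_x=300))
```

Choosing `gutter_x` by hand keeps the sketch simple; detecting columns automatically is a harder problem and is where many converters go wrong.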

Why are PDF tables difficult?

Many tables are drawn as lines and positioned text rather than a real grid. Merged cells, repeated headers, wrapped labels, footnotes, and totals can shift into the wrong column.
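
The sketch below shows, under simplifying assumptions, how positioned words can be assigned to columns when a table is only lines and text. The function names are hypothetical, `boundaries` is a list of x positions you choose by inspecting the page (each one marks where the next column starts), and the words are assumed to be already grouped into one visual row. A wrapped label or a slightly shifted total that crosses a boundary lands in the wrong cell, which is exactly the failure described above.

```python
from bisect import bisect_right


def row_to_cells(words: list[tuple[float, str]],
                 boundaries: list[float]) -> list[str]:
    """Assign (x, text) words from one visual row to columns.

    boundaries must be sorted; N boundaries produce N + 1 columns.
    """
    cells: list[list[str]] = [[] for _ in range(len(boundaries) + 1)]
    for x, text in sorted(words):
        cells[bisect_right(boundaries, x)].append(text)
    return [" ".join(cell) for cell in cells]


def to_markdown_row(cells: list[str]) -> str:
    """Render one table row, escaping pipes inside cell text."""
    escaped = [cell.replace("|", "\\|") for cell in cells]
    return "| " + " | ".join(escaped) + " |"


row = [(40.0, "Office"), (90.0, "supplies"), (300.0, "120.00")]
print(to_markdown_row(row_to_cells(row, boundaries=[250.0])))
```

Merged cells and repeated headers are not handled here; irregular tables are usually faster to rebuild by hand and then verify.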

What should be checked in high-stakes PDFs?

Compare every name, date, amount, currency, unit, negative sign, clause number, account identifier, signature label, and page reference with the original. Use a private workflow for sensitive material.
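
Part of this comparison can be supported by a script that lists numeric values present in the source text but missing from the converted Markdown, and the reverse. This is a minimal sketch with hypothetical function names and a deliberately simple number pattern (optional minus sign, optional thousands commas, optional decimal part). It treats a hyphen before a digit as a minus sign, so dates and ranges produce tokens such as `-07`; that is harmless as long as both sides are tokenized the same way. It does not cover names, clauses, or currencies, and it does not replace reading the original.

```python
import re
from collections import Counter

# Optional ASCII hyphen or Unicode minus, digits with optional thousands
# separators, optional decimal part. An illustrative pattern only.
NUMBER = re.compile(r"[-\u2212]?\d+(?:,\d{3})*(?:\.\d+)?")


def numeric_tokens(text: str) -> Counter:
    """Count numeric tokens, normalizing the Unicode minus to a hyphen."""
    return Counter(
        match.group().replace("\u2212", "-") for match in NUMBER.finditer(text)
    )


def diff_numbers(source: str, converted: str) -> dict[str, Counter]:
    """Return numbers missing from, or unexpected in, the converted text."""
    src, out = numeric_tokens(source), numeric_tokens(converted)
    return {"missing": src - out, "unexpected": out - src}


report = diff_numbers(
    "Total due: -1,250.00 EUR on 3 March",
    "Total due: 1,250.00 EUR on 3 March",  # negative sign lost in conversion
)
print(report)
```

An empty report means the same numeric tokens appear on both sides, not that they appear in the right places.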

When is the original editable file better?

DOCX, XLSX, PPTX, or source HTML usually exposes more semantic structure. Use the PDF only when the original is unavailable or the visual page remains important evidence.

Before and after example

Before

Two-column page
Left section A ... Right sidebar ...
Table drawn with lines
Footer: Confidential 4

After Markdown

## Section A

...

> Sidebar: ...

| Item | Amount |
|---|---:|
| Total | ... |

Review checklist

• Test text selection before upload.
• Identify scans, columns, sidebars, forms, and repeated page furniture.
• Compare every section boundary with the page.
• Rebuild irregular tables and verify totals row by row.
• Check custom-font punctuation and non-Latin characters.
• Use OCR, the original source, or a private workflow when required.
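
The totals check in the list above can be partly scripted once a table has been rebuilt in Markdown. The following sketch assumes a simple pipe table whose first column is a label and whose last column is an amount, with a single row labeled `Total`; the function names and that label are assumptions. It uses `Decimal` to avoid floating-point rounding, does not handle escaped pipes inside cells, and raises `decimal.InvalidOperation` if an amount cell is not numeric.

```python
from decimal import Decimal


def parse_amount_rows(markdown_table: str) -> list[tuple[str, Decimal]]:
    """Return (label, amount) pairs from the body rows of a pipe table."""
    rows = []
    # Skip the header row and the delimiter row.
    for line in markdown_table.strip().splitlines()[2:]:
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        rows.append((cells[0], Decimal(cells[-1].replace(",", ""))))
    return rows


def total_matches(markdown_table: str, total_label: str = "Total") -> bool:
    """Check that the item amounts add up to the single total row."""
    rows = parse_amount_rows(markdown_table)
    items = [amount for label, amount in rows if label != total_label]
    totals = [amount for label, amount in rows if label == total_label]
    return len(totals) == 1 and sum(items, Decimal(0)) == totals[0]


table = """
| Item | Amount |
|---|---:|
| Paper | 12.50 |
| Ink | 30.00 |
| Total | 42.50 |
"""
print(total_matches(table))
```

A matching total shows the column is internally consistent; each amount still has to be compared with the original page.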

Risk boundary

Converted Markdown remains a draft. Use an approved private workflow for confidential, regulated, customer, financial, medical, legal, credential-containing, or private source material. Keep the original source available until all material details are verified.

Frequently asked questions

Can a scanned PDF convert without OCR?

Not reliably. A scan contains images rather than selectable text.

Why is the output from a clean-looking PDF scrambled?

The visual layout can be clean while the underlying fragments have no reliable reading order.

Can page headers be removed automatically?

Some can, but repeated text may also be meaningful. Review removal rather than deleting blindly.
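
A reviewable approach is to list repeated lines as candidates instead of deleting them. The sketch below assumes the text has already been split into pages, each a list of lines; the function name and the 60% threshold are hypothetical. Digits are replaced with `#` so that a footer such as "Confidential 4" and "Confidential 5" counts as the same line.

```python
import re
from collections import Counter


def furniture_candidates(pages: list[list[str]],
                         min_share: float = 0.6) -> list[str]:
    """List normalized lines that appear on at least min_share of pages.

    The result is a review list; nothing is removed from the pages.
    """
    seen = Counter()
    for lines in pages:
        # A set counts each distinct line once per page.
        normalized = {
            re.sub(r"\d+", "#", line.strip()) for line in lines if line.strip()
        }
        seen.update(normalized)
    threshold = min_share * len(pages)
    return sorted(line for line, count in seen.items() if count >= threshold)


pages = [
    ["Confidential 4", "Section A", "Body text"],
    ["Confidential 5", "More body text"],
    ["Confidential 6", "Total 10"],
]
print(furniture_candidates(pages))
```

Because digits are normalized, a repeated line that carries a meaningful number, such as a section title with a clause number, can also be flagged, which is another reason to review before removing.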

Are custom fonts a problem?

They can map visible glyphs to unusual character codes, causing missing or substituted text.

Can legal or financial PDFs be trusted after conversion?

Only after source comparison in an approved workflow. Conversion is not authoritative transcription.

Should I use PDF or DOCX?

Use DOCX when available because it usually preserves headings, lists, and tables more clearly.
