Guide

What OCR means for PDF text extraction

OCR and PDF text extraction are related, but they are not the same workflow. One reads text that already exists in the file. The other tries to recognize letters from images.

Text extraction reads an existing text layer

Extract PDF Text works when the PDF already stores real text behind the page layout. This is common for digitally generated PDFs from office apps, browsers, and reporting systems.

OCR is for scanned or image PDFs

If each page is just a scan or photo, there may be no text layer to extract. OCR tries to detect letters from the image itself and generate text after the fact.

Why the results differ

Text extraction is usually cleaner because it reads structured text already inside the file. OCR is more error-prone because it must infer each letter from pixels, so its accuracy depends on image quality, skew, noise, and how well the recognizer models the document's language.
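One common way to put a number on that difference is a character error rate: the edit distance between a trusted reference string and the recognized string, divided by the reference length. The sketch below is illustrative only. The function names are hypothetical, the sample strings are made up rather than real OCR output, and nothing here describes how PDFresh works internally.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep when equal)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference, recognized):
    """Edit distance normalized by the reference length."""
    if not reference:
        return 0.0 if not recognized else 1.0
    return edit_distance(reference, recognized) / len(reference)


# Made-up example: the recognizer confused "I" and "1" with a lowercase "l".
reference = "Invoice 1001"
recognized = "lnvoice l00l"
print(character_error_rate(reference, recognized))  # 0.25
```

Text read from an existing text layer would normally match the reference exactly and score 0.0, while look-alike characters such as I, l, and 1 are a typical source of OCR errors.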

How to tell which case you have

If you can select and copy text in a normal viewer, the file probably has a text layer. If the whole page behaves like one image or selection fails entirely, OCR is more likely what you need.
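The same check can be approximated in code once some extractor has returned the text for each page. The sketch below is a rough heuristic, not PDFresh's implementation: `classify_pages`, `needs_ocr`, and the `min_chars` threshold of 20 are all assumptions chosen for illustration, and the input is simply a list of strings, one per page.

```python
def classify_pages(page_texts, min_chars=20):
    """Label each page 'text' or 'image-only' from already-extracted text.

    page_texts: one string per page, as returned by any text extractor.
    min_chars: hypothetical threshold; a page with fewer non-whitespace
    characters is treated as having no usable text layer.
    """
    labels = []
    for text in page_texts:
        visible = sum(1 for ch in text if not ch.isspace())
        labels.append("text" if visible >= min_chars else "image-only")
    return labels


def needs_ocr(page_texts, min_chars=20):
    """Return True when the file has pages but none has a usable text layer."""
    labels = classify_pages(page_texts, min_chars)
    return bool(labels) and all(label == "image-only" for label in labels)


# A two-page file where only the first page carries real text.
pages = ["Quarterly report for the finance team", "  \n "]
print(classify_pages(pages))  # ['text', 'image-only']
print(needs_ocr(pages))       # False
```

A mixed result like this one suggests a file with some scanned pages, where plain extraction recovers part of the content and the remaining pages would need OCR.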

What PDFresh supports today

PDFresh currently covers browser-side extraction from PDFs that already contain text. It does not perform OCR on image-only PDFs in the current tool flow.

Related tools and guides

Open Extract PDF Text for text-layer PDFs. If the problem is actually page selection or cleanup, continue with "How to Extract Pages from a PDF" or "The Difference Between Splitting a PDF and Deleting Pages". For copy failures with other causes, read "Why PDF Text Cannot Be Copied".