OCR Support for Scanned Documents

Content indexing and entity detection are supported for scanned documents in PDF format that have the following attributes:

  • English language documents

  • Typewritten text, including text created by dot matrix printers and typewriters

  • Images that include text

  • Documents with up to 50 pages

The scanned document is processed by using optical character recognition (OCR).
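
The attributes above can be checked before a document is submitted. The following is a minimal sketch only, assuming you already hold basic metadata for each file. The ScannedDocument record, its field names, and the ocr_eligibility_issues function are hypothetical names used for illustration and are not part of the product.

```python
from dataclasses import dataclass
from typing import List

MAX_PAGES = 50  # page limit stated in the attribute list


@dataclass
class ScannedDocument:
    """Hypothetical metadata record; not a product API."""
    file_format: str          # for example "pdf"
    language: str             # assumed to be an ISO 639-1 code, for example "en"
    page_count: int
    typewritten: bool = True  # False for handwritten material


def ocr_eligibility_issues(doc: ScannedDocument) -> List[str]:
    """Return the reasons a document falls outside the supported attributes."""
    issues = []
    if doc.file_format.lower() != "pdf":
        issues.append("not a PDF")
    if doc.language.lower() != "en":
        issues.append("not English")
    if doc.page_count > MAX_PAGES:
        issues.append(f"more than {MAX_PAGES} pages")
    if not doc.typewritten:
        issues.append("text is not typewritten")
    return issues
```

An empty list means the document matches every listed attribute. It does not guarantee that OCR succeeds, because scan quality also affects the result.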


The following considerations apply when OCR is performed on scanned documents:

  • The quality of the original document affects the accuracy of the OCR process. For example, paper documents that are oxidized or discolored may not be processed.

  • Dot patterns behind text affect the accuracy of the OCR process. For example, text that a dot matrix printer prints over a background dot pattern may not be processed.

  • Handwritten text is not supported.


You can use the Collection Report to view entity detection failures for a data source that includes scanned documents. For information about the Collection Report, see Viewing the Collection Report.

