Content indexing and entity detection are supported for scanned documents in PDF format with the following attributes:
English language documents
Typewritten text including text created by dot matrix printers and typewriters
Images that include text
Documents with up to 50 pages
The scanned document is processed by using optical character recognition (OCR).
When OCR is performed on scanned documents, the following considerations apply:
The quality of the original document affects the accuracy of the OCR process. For example, paper documents that are oxidized or discolored might not be processed.
Dot patterns behind text affect the accuracy of the OCR process. For example, text that a dot matrix printer prints over a dot-pattern background might not be processed.
Handwritten text is not supported.
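The supported attributes and OCR considerations above can be restated as a simple pre-check that runs before documents are added to a data source. The following Python sketch is illustrative only and is not part of the product: the `ScannedDocument` record, its field names, and the `unsupported_reasons` function are hypothetical, and the sketch assumes that you already know each document's format, language, page count, and whether its text is handwritten.

```python
from dataclasses import dataclass
from typing import List

MAX_PAGES = 50  # page limit stated in the list of supported attributes


@dataclass
class ScannedDocument:
    # Hypothetical metadata record; these field names are assumptions
    # made for this example and do not come from the product.
    file_format: str   # for example, "pdf"
    language: str      # for example, "en"
    page_count: int
    handwritten: bool  # True if the text is handwritten, not typewritten


def unsupported_reasons(doc: ScannedDocument) -> List[str]:
    """Return the reasons a document falls outside the documented limits."""
    reasons = []
    if doc.file_format.lower() != "pdf":
        reasons.append("not a PDF")
    if doc.language.lower() != "en":
        reasons.append("not English")
    if doc.page_count > MAX_PAGES:
        reasons.append("more than 50 pages")
    if doc.handwritten:
        reasons.append("handwritten text")
    return reasons
```

An empty list means that the document meets the documented attributes. It does not mean that OCR will succeed, because paper quality and background dot patterns still affect accuracy.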
You can use the Collection Report to view entity detection failures from a data source that includes scanned documents. For information about the Collection Report, see Viewing the Collection Report.