You can use the upload wizard to load documents in a library or corpus:
- During the creation wizard of corpus, categorization, extraction and thesaurus projects.
- When managing a library, whether when creating it or when you want to add documents to it, also from the Documents tab.
These are the steps of the wizard:
In the Upload documents dialog, select Add files then select the files to upload. Multiple selection with
Supported file formats are those managed by the Apache Tika toolkit.
ZIP files are treated as containers of files—possibly nested in sub-folders—of any other supported format unless they are internally structured as document exchange archives.
A document is not uploaded if:
- No text is extracted from the file.
- Text mainly consist of nonsense words.
- In case of automatic language recognition, text language is unrecognized or not supported.
File size exceeds:
- 200 MB for
- 1 MB for
- 100 MB for other file types.
- 200 MB for
After you confirm your selection, three tabs are displayed: Documents, Settings and PDF management, with the Settings tab selected.
You can select Add files again to add more files.
Review documents and settings.
In the Settings tab:
Set the OCR extraction strategy. OCR is applied to image files and PDF files containing images.
You can choose between:
- Smart: OCR is used only if the extraction of text from the file without OCR gets less than the specified number of characters.
- Always: OCR is always performed if file format is suitable.
- Never: OCR is not used to extract characters.
Turn off Autodetect language if you want to disable automatic language detection, then choose the documents' language from the drop-down list.
- Turn off Autodetect encoding if you want to disable automatic character encoding detection, then choose the documents' encoding from the drop-down list.
If you are not adding documents to a corpus, you can also save the uploaded documents as a new corpus that you can later use as a source of documents for libraries. To do so:
- Check Save as corpus.
- Enter the name of the new corpus in the text box beside the checkbox.
The Documents tab lists the files to upload.
To remove a document from the list, select the X icon beside the file name.
In the PDF management tab, you can set the options for PDF files.
Turn on Pdf document view if you want to process PDF file with expert.ai Extract technology. This allows text to be extracted along with the graphic layout of each page rather than plain text. With this option active, you will then be able to view and annotate the documents in the detail view within a rendering that reproduces the graphic layout of the original. With Pdf document view you have the following options:
- Enable or disable table and title detection to toggle the detection of tables and titles.
Enable or disable OCR extraction to toggle OCR extraction for scanned pages or images.
When PDF document view is turned on, the options set in the Settings tab—including OCR extraction—are ignored for PDF files.
Reading order mode to choose the algorithm used to extract text from pages:
- standard: the algorithm tries to find the way in which a human would read the text blocks on the page. Best for multi-column or mixed layout pages.
- vertical: the algorithm considers the page as single-column.
- auto: the algorithm classifies each page based on its layout then automatically chooses between standard and vertical to extract text.
Select Upload (or Create, if in a corpus creation wizard) to start the upload process.
During upload, documents' text goes through a Natural Language Understanding (NLU) analysis and documents get indexed on the extracted features.