
Upload documents

You can upload documents from:

  • The creation wizard.
  • The libraries management.
  • The Documents tab of corpora and projects.

To start the process:

  • Select Upload in the creation wizard.

Or:

The steps are then the same for each of the activities listed above:

  1. In the Upload documents dialog, select Add files. You can upload individual files from a folder or a set of files packed in a .zip archive.
  2. In the Settings tab, displayed by default, you can set the OCR strategy based on Tesseract technology:
    1. Select the OCR extraction strategy:
      • Smart: try the conversion without OCR. If the number of characters obtained is lower than the entered threshold value, the full OCR conversion is performed.
      • Always: always use OCR.
      • Never: never use OCR.
    2. Switch off Autodetect language to disable the automatic language detection. Select the preferred language manually from the Select language drop-down list.
  3. In the PDF Management tab, you can set the PDF view and Extract process management:

    1. Switch on Pdf document view to process the documents with the expert.ai Extract technology and view the PDF files so that you can work on them in their original format. If it is switched off, all the documents are converted with the Apache Tika toolkit and used as .txt files.
      • Select or deselect Enable or disable table and title detection to enable or disable table and title detection.
      • Select or deselect Enable or disable OCR extraction to enable or disable OCR extraction.
      • Select Reading order mode:
        • standard: runs the reading algorithm (with some variants) to find the best possible reading order. It is useful for documents with two or more columns or with a mixed layout.
        • auto: classifies individual pages and automatically selects standard or vertical depending on the classification.
        • vertical: forces reading, when possible, from left to right and from top to bottom. It was introduced to handle particular layouts (for example, glossary pages).
    2. Switch off Autodetect language to disable the automatic language detection. Select the preferred language manually from the Select language drop-down list.
    3. Switch off Autodetect encoding to disable the automatic character encoding detection. Manually select the preferred encoding from the Select encoding drop-down list.
    4. Select Save as corpus to save your library as a corpus and type a name for it.
  4. In the Documents tab:

    • Review the list of documents and folders to upload. To remove an item, select the X button to the right of the file name.
  5. Select Upload, or Create if you are creating a corpus.

Note

The Background tasks icon displays the upload status.
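
The three OCR extraction strategies in the Settings tab can be read as a simple decision rule. The Python sketch below only illustrates that rule. It is not the platform's implementation: the function name convert_document and the callables extract_text and run_ocr are hypothetical, and counting every character (whitespace included) against the threshold is an assumption.

```python
from typing import Callable


def convert_document(
    strategy: str,
    extract_text: Callable[[], str],
    run_ocr: Callable[[], str],
    threshold: int = 0,
) -> str:
    """Return the document text according to the chosen OCR strategy.

    All names are hypothetical and serve only to illustrate the rule.
    """
    if strategy == "never":
        # Never: plain conversion only, OCR is not used.
        return extract_text()
    if strategy == "always":
        # Always: OCR is used regardless of the document content.
        return run_ocr()
    if strategy == "smart":
        # Smart: try the conversion without OCR first.
        text = extract_text()
        # Assumption: every character counts, whitespace included.
        if len(text) < threshold:
            # Too few characters were obtained: perform the full OCR conversion.
            return run_ocr()
        return text
    raise ValueError(f"unknown OCR strategy: {strategy!r}")
```

With a threshold of 10, for example, a scanned PDF whose plain conversion yields only three characters would go through full OCR, while a document that yields ten or more characters would keep the plain conversion.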

Supported formats and limits

Supported document formats are those managed by the Apache Tika toolkit. Documents are automatically converted to plain text files during upload.

Documents are ignored if:

  • They are empty.
  • They mainly consist of nonsense words.
  • Their language is not recognized or not supported (when automatic language detection is enabled).
  • They exceed the following size limits:

    • 200 MB for .zip files.
    • 1 MB for .txt files.
    • 100 MB for other file types.
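
To avoid having files silently ignored, you may want to check their size before uploading. The following Python sketch applies the limits listed above to a file name and size. The helper name exceeds_limit is hypothetical, and treating 1 MB as 1024 × 1024 bytes is an assumption, since the exact definition used by the platform is not stated here.

```python
from pathlib import Path

# Assumption: 1 MB = 1024 * 1024 bytes; the platform may count differently.
MB = 1024 * 1024

# Limits from the list above, keyed by lowercase file extension.
SIZE_LIMITS = {
    ".zip": 200 * MB,
    ".txt": 1 * MB,
}
DEFAULT_LIMIT = 100 * MB  # all other file types


def exceeds_limit(filename: str, size_bytes: int) -> bool:
    """Return True if a file of this size would exceed its upload limit."""
    suffix = Path(filename).suffix.lower()
    limit = SIZE_LIMITS.get(suffix, DEFAULT_LIMIT)
    return size_bytes > limit


def oversized(files: dict[str, int]) -> list[str]:
    """Return the names of the files that exceed their limit, in input order."""
    return [name for name, size in files.items() if exceeds_limit(name, size)]
```

For local files, the size can be obtained with Path(name).stat().st_size and passed to exceeds_limit. The check covers only the size limits; empty documents and unsupported languages are not detected by this sketch.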