Create a corpus
The procedure for creating a corpus depends on the number of items already present in the dashboard.
Procedures
To create a corpus:
- Go to the main dashboard.
If no projects or corpora have been created yet
- Select the Corpus card.
Or:
- Select the plus button and choose New Corpus. This creation procedure is always available.
If projects already exist but no corpora
- Select Corpora in the left menu, then Create your first corpus.
If corpora already exist
- Select New Corpus in the Create item area.
Or:
- Select Corpora in the left menu, then select New Corpus in the Create corpus area.
Common preliminary steps
- In the dialog, enter the mandatory name, select the technology version from the Tech version drop-down list and enter an optional description.
- Select Add files. You can upload single files from a folder or a set of zipped files.
- The selected files or folders are displayed in a list and can be deleted by clicking on the X button at the right of the file name.
Select Show advanced settings:
- If you want to disable automatic language detection: turn off Autodetect language and choose the language from the Select language drop-down list.
- If you want to disable automatic character encoding detection: turn off Autodetect encoding and choose the encoding from the Select encoding drop-down list.
- When done, select Hide advanced settings.
- Select Create.
Note
Watch the Background tasks to check if the corpus is ready and properly created.
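If your documents are spread across a folder, it can be convenient to bundle them into a single zipped file before selecting Add files. The following is a minimal sketch of one way to do that with the Python standard library. The function name `zip_folder` and the choice to skip empty files are assumptions made for illustration; they are not part of the product.

```python
import zipfile
from pathlib import Path
from typing import List


def zip_folder(folder: Path, archive: Path) -> List[str]:
    """Bundle every non-empty file under `folder` into `archive`.

    Hypothetical helper: returns the archive member names that were added.
    """
    archive_resolved = archive.resolve()
    added: List[str] = []
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(folder.rglob("*")):
            if not path.is_file():
                continue  # skip directories
            if path.resolve() == archive_resolved:
                continue  # never add the archive to itself
            if path.stat().st_size == 0:
                continue  # skip empty files
            name = path.relative_to(folder).as_posix()
            zf.write(path, arcname=name)
            added.append(name)
    return added
```

For example, `zip_folder(Path("my_docs"), Path("my_docs.zip"))` would produce an archive that can then be selected in the Add files dialog.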
Supported formats and limits
Supported document formats are those managed by the Apache Tika toolkit. Documents are automatically converted to plain text files during upload.
Documents are ignored if:
- They are empty.
- They mainly consist of nonsense words.
- (In case of automatic language recognition) Their language is unrecognized or not supported.
- They exceed the following size limits:
  - 50MB for .zip files.
  - 50MB for .txt files.
  - 50MB for other file types.
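Two of the conditions above (empty files and the size limit) can be checked locally before uploading. The sketch below shows one way to do so. The names `precheck` and `MAX_BYTES` are hypothetical, and the code assumes that "50MB" means 50 × 1024 × 1024 bytes, which this page does not specify. The language and nonsense-word checks happen on the server and are not reproduced here.

```python
from pathlib import Path
from typing import Optional

# Assumption: "50MB" is interpreted as 50 MiB; the exact definition is not documented here.
MAX_BYTES = 50 * 1024 * 1024


def precheck(path: Path, limit: int = MAX_BYTES) -> Optional[str]:
    """Return a reason the file would likely be ignored, or None if it passes.

    Hypothetical helper: covers only the empty-file and size conditions.
    """
    size = path.stat().st_size
    if size == 0:
        return "empty"
    if size > limit:
        return "exceeds size limit"
    return None
```

A typical use would be to loop over the files you intend to upload and print any for which `precheck` returns a reason.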
Access the created corpus:
- Select Go to corpus in the temporary notification bar in the lower right corner.
Or:
- Select the corpus in the main dashboard.
If the number of documents shown does not immediately match the number you uploaded, refresh the page.