Create a thesaurus project
Start the wizard
- Go to the main dashboard.
- Select the plus button on the main toolbar and then Thesaurus project.
Alternative ways to start the wizard from the main dashboard are:
- If there are no projects, select the Thesaurus project card.
- If there are other projects, but no thesaurus, select Thesaurus on the left menu, then Create your first project.
If there are other thesaurus projects:
- Select Thesaurus on the left menu, then select New Thesaurus project in the Create project area in the right column.
- Select New Thesaurus project in the Create item area.
Set main properties
In the New Thesaurus project dialog, enter the project name, select the technology version and optionally enter a description for the project.
When done, select Create.
First step: select languages
To quit the wizard at any step:
- Select Exit wizard from the toolbar.
- Select the expert.ai icon in the upper left corner.
Then, in the Save changes dialog, you can select:
- Cancel to close the dialog.
- Delete project to quit the wizard and delete the project.
- Save to save the project and quit the wizard, so that the project can be opened from the main dashboard at a later time to complete the wizard.
In the Project language page of the wizard, select the project languages.
The first language you select is automatically marked as the favorite. If you select multiple languages, select the star beside another language to make it the favorite instead.
Select Next when done.
Second step: define resources
In the Project resources page, select how to create the resources.
- Create project resources to create concepts by hand.
- Import Thesaurus to import a SKOS definition file in RDF/XML format.
In the first case, select Next: the Create Thesaurus window appears.
Under Input concepts, enter one or more words representing a concept and then press
You can repeat the step above as many times as you like, but only one concept is necessary to create the project as you can add more concepts later.
Select Next when done.
In the second case, open the file and wait until the import process is completed.
Defined concepts are displayed in the Resources and Edit Concept panels, where you can edit them. Editing concepts at this stage is optional: you can do it later, if needed, once the project has been created.
Select Next when done.
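If you plan to use the Import Thesaurus option, the following Python sketch shows what a small SKOS definition in RDF/XML format can look like and how it could be generated. It relies only on the standard SKOS and RDF namespaces. The function name build_skos, the base URI http://example.org/thesaurus/, the "en" language tag and the sample concepts are illustrative assumptions, and the exact set of SKOS properties that the import accepts is not described here.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Use the conventional prefixes when the tree is serialized.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("skos", SKOS_NS)


def build_skos(concepts, base_uri="http://example.org/thesaurus/", lang="en"):
    """Return an RDF/XML string with one skos:Concept per entry.

    `concepts` maps a hypothetical concept id to a dict with a "prefLabel"
    and, optionally, "altLabels" (list of strings) and "broader" (concept id).
    """
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    for concept_id, data in concepts.items():
        node = ET.SubElement(
            root,
            f"{{{SKOS_NS}}}Concept",
            {f"{{{RDF_NS}}}about": base_uri + concept_id},
        )
        pref = ET.SubElement(node, f"{{{SKOS_NS}}}prefLabel", {XML_LANG: lang})
        pref.text = data["prefLabel"]
        for alt in data.get("altLabels", []):
            alt_node = ET.SubElement(node, f"{{{SKOS_NS}}}altLabel", {XML_LANG: lang})
            alt_node.text = alt
        if "broader" in data:
            # Link to the parent concept by URI.
            ET.SubElement(
                node,
                f"{{{SKOS_NS}}}broader",
                {f"{{{RDF_NS}}}resource": base_uri + data["broader"]},
            )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = {
        "vehicle": {"prefLabel": "vehicle"},
        "car": {"prefLabel": "car", "altLabels": ["automobile"], "broader": "vehicle"},
    }
    print(build_skos(sample))
```

The returned string can be saved to a file with a UTF-8 encoding and then opened from the wizard.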
Third step: create a library
A training library is not required for this type of project. Libraries are instead used to test the results of the models and can be useful as a source of inspiration for choosing concepts and their labels.
In the Project library page you start creating the default document library for the project. You need to have at least one document in the library and you can add more later.
- Enter the library name in Library name or confirm the suggested name, then select Next.
In Corpora and folders, select the source for the library. You can select an existing corpus or upload documents from the file system.
If you choose an existing corpus, just select it from the list.
You can use these tools to find it:
- Use the search bar to look for a corpus. Your search must contain at least three characters.
- Select Show table view to view your corpora in a table format.
- Select Show card view to view your corpora in a card format.
- When in card view, you can sort items by selecting one of the options from the drop-down menu.
- When in table view, you can sort items by selecting the desired column header.
Displayed corpora are those related to the Tech version selected previously in the New Thesaurus project dialog.
If you choose to upload documents:
- Select Upload on the toolbar. The Upload documents dialog appears.
- Select Add files, then browse the file system to open the files to upload. Selecting multiple files is allowed and you can repeat this step multiple times to add files from more folders.
In the Settings tab you can:
- Enable Pdf document view (disabled by default) to process PDF documents with the Extract technology so that you can work on them in their original format.
- Select or deselect Enable or disable table and title detection to enable or disable table and title detection.
- Select or deselect Enable or disable OCR extraction to enable or disable OCR extraction.
- Disable Autodetect language (enabled by default) to disable the automatic language detection. If the automatic language detection is disabled, select the preferred language from the Select language drop-down list.
- Disable Autodetect encoding (enabled by default) to disable the automatic character encoding detection. If the automatic character encoding detection is disabled, select the preferred encoding from the Select encoding drop-down list.
- Save your library as a corpus by selecting Save as corpus and typing a name.
In the Documents tab, review the selected documents and remove any you do not want by selecting the X button to the right of the file name.
- Select Upload.
When the upload is complete, a temporary corpus is created and made available in the Corpora and folders window.
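If you disable Autodetect encoding in the Settings tab, you need to know the encoding of your files before picking one from the drop-down list. The following Python sketch is one possible way to check this locally, before uploading. The helper names and the candidate list (utf-8, cp1252, latin-1) are assumptions for illustration and are not tied to the encodings offered by the platform.

```python
def decodes_as(data: bytes, encoding: str = "utf-8") -> bool:
    """Return True if the bytes decode cleanly with the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def first_matching_encoding(data: bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the bytes, or None.

    Order matters: latin-1 accepts any byte sequence, so it only works
    as a last-resort fallback and should stay at the end of the list.
    """
    for encoding in candidates:
        if decodes_as(data, encoding):
            return encoding
    return None


if __name__ == "__main__":
    with open("document.txt", "rb") as handle:  # hypothetical file name
        print(first_matching_encoding(handle.read()))
```

A clean decode does not prove that the encoding is the right one, so treat the result as a hint and check a sample of the decoded text.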
Supported formats and limits
Supported document formats are those managed by the Apache Tika toolkit. Documents are automatically converted to plain text files during upload.
Documents are ignored if:
- They are empty.
- They mainly consist of nonsense words.
- Their language is unrecognized or not supported (in case of automatic language recognition).
- They exceed the following values:
- 200 MB for
- 1 MB for
- 100 MB for other file types.
- 200 MB for
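To avoid surprises during upload, you can screen files locally for the two conditions above that are easy to verify: empty files and files over the size limit. The following Python sketch assumes the 100 MB limit stated for other file types and treats 1 MB as 1,000,000 bytes, which is an assumption. The function name precheck is hypothetical, and limits for specific file types would have to be passed in explicitly.

```python
from pathlib import Path

MB = 1_000_000  # assumption: decimal megabytes


def precheck(paths, limit_bytes=100 * MB):
    """Split paths into files likely to be accepted and files likely to be ignored.

    Returns (ok, skipped), where `ok` is a list of Path objects and
    `skipped` maps each rejected Path to a short reason.
    """
    ok, skipped = [], {}
    for path in map(Path, paths):
        size = path.stat().st_size
        if size == 0:
            skipped[path] = "empty"
        elif size > limit_bytes:
            skipped[path] = "too large"
        else:
            ok.append(path)
    return ok, skipped


if __name__ == "__main__":
    # Hypothetical folder containing the documents to upload.
    accepted, rejected = precheck(Path("documents").glob("*"))
    for path, reason in rejected.items():
        print(f"{path}: {reason}")
```

This check does not cover the other conditions, such as nonsense text or unsupported languages, which are evaluated during the upload itself.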
Final step: summary
The last step of the wizard sums up project information.
The number of stars in Thesaurus Quality is a measure of the project quality at the end of the wizard in terms of coverage of defined concepts in the default library.
Select Open project to end the wizard process and open the project.
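The idea of concept coverage behind Thesaurus Quality can be illustrated with a small Python sketch. This is not the formula used by the platform, which is not documented here, and it does not attempt to map the result to a number of stars. It assumes a naive definition: a concept counts as covered when at least one of its labels appears, as a case-insensitive substring, in at least one document. The function name concept_coverage and the sample data are hypothetical.

```python
def concept_coverage(concepts, documents):
    """Return the fraction of concepts with at least one label found in the documents.

    `concepts` maps a concept name to a list of labels; `documents` is a
    list of plain-text strings. Matching is a simple case-insensitive
    substring test, which is an assumption made for this illustration.
    """
    if not concepts:
        return 0.0
    lowered = [doc.lower() for doc in documents]
    covered = sum(
        1
        for labels in concepts.values()
        if any(label.lower() in doc for label in labels for doc in lowered)
    )
    return covered / len(concepts)


if __name__ == "__main__":
    sample_concepts = {
        "car": ["car", "automobile"],
        "boat": ["boat"],
        "plane": ["airplane"],
    }
    sample_documents = ["The automobile stopped.", "A boat sailed past."]
    print(concept_coverage(sample_concepts, sample_documents))
```

Under this naive definition, adding documents that mention uncovered concepts, or adding labels that match how the documents name them, raises the coverage.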