Create a categorization project
Platform provides the following ways to create a new categorization project:
- A four-step Categorization wizard.
- Importing a CPK.
Create a categorization project with the Categorization wizard
The wizard procedure depends on the number of items already present in the dashboard.
- Go to the main dashboard.
If there are no projects or corpora created
- Select the Categorization project card.
Or:
- Select the plus button and then Categorization project. This creation procedure is always available.
If there are existing projects or corpora but no categorization projects
- Select Categorization in the left menu, then Create your first project.
If there are existing categorization projects
- Select Categorization in the left menu, then select New Categorization project in the Create project area.
Or:
- Select New Categorization project in the Create item area.
Common preliminary steps
- In the New Categorization project dialog, enter the mandatory categorization project name in Project name, select the technology version from the Tech version drop-down menu and enter the optional description in Description.
- Select Create.
First step: Project language
In the Project language page, select the project language for your analysis engine, then Next.
Second step: Project resources
In the Project resources page:
- Select Create project Taxonomy to create the taxonomy from scratch, then Next.
- In the Create Taxonomy window, enter a category and then press Enter to add the next one. When done, select Next.
- The taxonomy is then displayed in the Resources tab. It is possible to edit it. Select Next to go on.
Or:
- Select Import Taxonomy to import an existing taxonomy in XML format.
- Select an XML file from the file system.
- The taxonomy is then displayed in the Resources tab. It is possible to edit it. Select Next to go on.
Or:
- Select Building magic Taxonomy, where related documents are grouped and automatically categorized, then select Next.
- Select the project library as described in Third step: Project library.
- In the Magic Taxonomy dialog, switch on Manual configuration if you want to set the number of nodes and the mode, such as Strict mode or Soft mode.
Info
The creation algorithm automatically suggests the document annotations.
- Select Next. The taxonomy is then displayed in the Resources tab. It is possible to edit it.
- Select Next to go directly to the Fourth step: Summary.
Third step: Project library
During this step, available only for Create project Taxonomy or Import Taxonomy procedures, you provide one of the libraries needed to train and test the ML model. More libraries can be added later.
In the Project library dialog:
- Enter the library name in Library name (optional step).
- Select:
- Generic library to create a generic library.
Or:
- Training library to create a training library.
Or:
- Test library to create a test library.
- Select Next to go on.
- In Corpora and folders, select the source for the library. You can select an existing corpus or upload documents from the file system.
If you choose an existing corpus:
- Select the corpus, then select Next.
Info
If you want to use a corpus, you can use these tools to find it:
- Use the search bar to look for a corpus. Your search must contain at least three characters.
- Select Show table view to view your corpora in a table format.
- Select Show card view to view your corpora in a card format.
- When in card view, you can sort items by selecting one of the options from the drop-down menu.
- When in table view, you can sort items by selecting the desired column header.
The information displayed for existing corpora is the same as that displayed in the Corpus info sub-panel of the main dashboard.
Warning
The corpora displayed are those related to the Tech version selected previously in the New Categorization project dialog.
If you choose to upload documents:
- Select Upload > Add files to add the files you need. Multiple selection is allowed.
- In the Settings tab:
- Switch on Pdf document view to process PDF files with the expert.ai Extract technology and view them in their original format. If it is switched off, all the documents are converted with the Apache Tika toolkit and used as txt files.
- Select or deselect Enable or disable table and title detection to enable or disable table and title detection.
- Select or deselect Enable or disable OCR extraction to enable or disable OCR extraction.
- Switch off Autodetect language to disable automatic language detection. Select the preferred language manually from the Select language drop-down list.
- Switch off Autodetect encoding to disable automatic character encoding detection. Manually select the preferred encoding from the Select encoding drop-down list.
- Select Save as corpus to save your library as a corpus and type a name for it.
- In the Documents tab:
- Display the list of documents and folders to upload. They can be deleted by clicking on the X button at the right of the file name.
- Select Upload.
When the upload is complete, a temporary uploaded corpus is created and made available in the window.
Supported document formats
Supported document formats are those managed by the Apache Tika toolkit. Documents are automatically converted to plain text files during upload.
Documents are ignored if:
- They are empty.
- They mainly consist of nonsense words.
- Their language is unrecognized or not supported (in case of automatic language recognition).
- They exceed the following values:
- 200 MB for .zip files.
- 1 MB for .txt files.
- 100 MB for other file types.
- Select Next.
Info
It is also possible to upload an annotated library.
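The size limits listed under Supported document formats can be checked locally before an upload. The following is a minimal sketch, not part of Platform: the function names are hypothetical, and it assumes that 1 MB means 1024 × 1024 bytes, which the documentation does not specify. It covers only the empty-file and size rules, not the language or nonsense-word checks.

```python
from pathlib import Path

# Assumption: 1 MB = 1024 * 1024 bytes (the documentation does not say
# whether the limits are binary or decimal megabytes).
MB = 1024 * 1024

# Size limits as listed in the documentation, keyed by file extension.
SIZE_LIMITS = {".zip": 200 * MB, ".txt": 1 * MB}
DEFAULT_LIMIT = 100 * MB  # all other file types


def size_limit_for(name: str) -> int:
    """Return the documented size limit in bytes for a file name."""
    return SIZE_LIMITS.get(Path(name).suffix.lower(), DEFAULT_LIMIT)


def would_be_ignored(name: str, size_bytes: int) -> bool:
    """True if the file is empty or exceeds the limit for its type."""
    return size_bytes == 0 or size_bytes > size_limit_for(name)


def check_folder(folder: str) -> list[str]:
    """List files under `folder` that would likely be ignored on upload."""
    return [
        str(p)
        for p in sorted(Path(folder).rglob("*"))
        if p.is_file() and would_be_ignored(p.name, p.stat().st_size)
    ]
```

For example, `check_folder("my_documents")` returns the paths of the files in a hypothetical `my_documents` folder that are empty or too large, so that they can be split or removed before selecting Upload.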
Fourth step: Summary
The last step shows the project details entered in the previous steps, such as the project name, the project language, the tech version, and the library.
Select Open project to end the wizard process and start working on the project.
Info
To quit the wizard at any time:
- Select Exit wizard in the upper right corner or select the expert.ai icon in the upper left corner.
- In the save changes dialog you can select:
- Cancel to quit.
- Delete project to delete the project.
- Save to save the project at that step. You can then reopen it from the main dashboard and continue with the wizard at a later time.
Create a categorization project by importing a CPK
Note
It is not possible to import a CPK in the Categorization wizard.
- Go to the main dashboard.
- Select:
a. The plus button, then Import CPK.
Or:
b. Import CPK in the Create item area.
Or:
c. Categorization in the left column, then Import CPK in the Create project area.
- In the Create a new project from a CPK dialog:
- Select Browse files to upload the CPK file from the file system.
Warning
Maximum CPK file size is 2 GB.
- If you selected option a. or b. in the previous step, choose Categorization under Type of project.
- Enter the mandatory categorization project name in Project name.
- Select the technology version from the Tech version drop-down menu.
- Enter the optional description in Description.
- Select Create project to start the import.
- Select Check resources in Project resources to check them, then Next to go on, or select Skip this step to go on directly.
- Create the library as described above in Third step: Project library.
- Select Open project in the Project summary page to end the creation process and start to work on the project.
Note
In projects imported from a CPK, the taxonomy cannot be changed. This is marked in the dashboard with a padlock.