Document classification

Document classification determines what the document text is about by mapping it to the nodes of a taxonomy.

Here is an example of performing document classification on a short English text:

This example is based on the Python SDK available from the expert.ai developer portal.

The SDK's API client gets user credentials from two environment variables:

EAI_USERNAME
EAI_PASSWORD
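Before creating the client, it can help to confirm that both variables are actually set. The following is a minimal sketch using only the Python standard library; the helper name missing_credentials is hypothetical and is not part of the SDK.

```python
import os

# Names of the environment variables the SDK client reads, as listed above.
REQUIRED_VARIABLES = ("EAI_USERNAME", "EAI_PASSWORD")


def missing_credentials(environ=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; it is not part of the SDK.
    """
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Set these variables first:", ", ".join(missing))
    else:
        print("Both credential variables are set.")
```

Passing a plain dictionary instead of os.environ makes the check easy to try without touching the real environment.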

Set those variables with your account credentials before running the sample program below.

The program prints a JSON representation of the results, followed by a tab-separated list of each category's ID and hierarchy.

from expertai.client import ExpertAiClient
eai = ExpertAiClient()

text = "Michael Jordan was one of the best basketball players of all time. Scoring was Jordan's stand-out skill, but he still holds a defensive NBA record, with eight steals in a half."
language = 'en'

response = eai.iptc_media_topics_classification(body={"document": {"text": text}}, params={'language': language})

# Output JSON representation

print("JSON representation:")
print(response.json)


# Tab separated list of categories' id and hierarchy.

print("\nTab separated list of categories' id and hierarchy:")
document = response.json["data"]

for category in document["categories"]:
    print(category["id"], category["hierarchy"], sep="\t")
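Because each category's hierarchy is a list, the loop above prints it in Python's list notation. One possible way to render it as a readable path is sketched below, assuming every category carries the id and hierarchy fields used above. The helper name format_category and the " > " separator are illustrative choices, and the sample category is copied from the sample response shown on this page.

```python
def format_category(category, separator=" > "):
    """Return 'id<TAB>path' for one category dictionary."""
    path = separator.join(category["hierarchy"])
    return f"{category['id']}\t{path}"


# Sample category with the same shape as the entries in
# response.json["data"]["categories"].
sample_category = {
    "id": "20000851",
    "hierarchy": ["Sport", "Competition discipline", "Basketball"],
}

print(format_category(sample_category))
```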

This example is based on the Java SDK available from the expert.ai developer portal.

In the code below, replace yourusername and yourpassword with your account credentials.

The program prints a JSON representation of the results and the list of categories' id and hierarchy.

import ai.expert.nlapi.security.Authentication;
import ai.expert.nlapi.security.Authenticator;
import ai.expert.nlapi.security.BasicAuthenticator;
import ai.expert.nlapi.security.Credential;
import ai.expert.nlapi.v1.API;
import ai.expert.nlapi.v1.Categorizer;
import ai.expert.nlapi.v1.CategorizerConfig;
import ai.expert.nlapi.v1.message.ResponseDocument;
import ai.expert.nlapi.v1.model.DataModel;

public class Main {

    public static Authentication createAuthentication() throws Exception {
        Authenticator authenticator = new BasicAuthenticator(new Credential("yourusername", "yourpassword"));
        return new Authentication(authenticator);
    }

    public static Categorizer createCategorizer() throws Exception {
        return new Categorizer(CategorizerConfig.builder()
                .withVersion(API.Versions.V1)
                .withTaxonomy(API.Taxonomies.IPTC)
                .withLanguage(API.Languages.en)
                .withAuthentication(createAuthentication())
                .build());
    }

    public static void main(String[] args) {
        try {
            String text = "Michael Jordan was one of the best basketball players of all time. Scoring was Jordan's stand-out skill, but he still holds a defensive NBA record, with eight steals in a half.";

            Categorizer categorizer = createCategorizer();

            ResponseDocument categorization = categorizer.categorize(text);


            // Output JSON representation

            System.out.println("JSON representation:");
            categorization.prettyPrint();


            // Tab separated list of categories' id and hierarchy.

            System.out.println("Tab separated list of categories' id and hierarchy:");
            DataModel data = categorization.getData();

            data.getCategories().stream().forEach(c -> System.out.println(c.getId() + "\t" + c.getHierarchy()));
        }
        catch(Exception ex) {
            ex.printStackTrace();
        }
    }
}

The following curl command posts a document to the document classification resource of the API's REST interface.
Run the command from a shell after replacing token with the actual authorization token.

curl -X POST https://nlapi.expert.ai/v1/categorize/iptc/en \
    -H 'Authorization: Bearer token' \
    -H 'Content-Type: application/json; charset=utf-8' \
    -d '{
  "document": {
    "text": "Michael Jordan was one of the best basketball players of all time. Scoring was Jordan'\''s stand-out skill, but he still holds a defensive NBA record, with eight steals in a half."
  }
}'
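The same request can be assembled in Python with only the standard library. The sketch below builds the request without sending it, so it runs without a valid token. The helper name build_request is hypothetical; the URL, headers and body shape are taken from the curl command above.

```python
import json
import urllib.request

# Endpoint taken from the curl command above.
URL = "https://nlapi.expert.ai/v1/categorize/iptc/en"


def build_request(text, token):
    """Build (but do not send) the POST request made by the curl command."""
    body = json.dumps({"document": {"text": text}}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json; charset=utf-8",
    }
    return urllib.request.Request(URL, data=body, headers=headers, method="POST")


request = build_request(
    "Michael Jordan was one of the best basketball players of all time.",
    "token",
)
print(request.get_method(), request.full_url)

# To actually send it, replace "token" with a valid authorization token and run:
# with urllib.request.urlopen(request) as reply:
#     print(json.load(reply))
```

Building the body with json.dumps avoids the shell quoting that the apostrophe in the text requires in the curl command.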

The server returns a JSON object like the one below.
For more information see the following pages in the reference section:

{
    "data": {
        "categories": [
            {
                "frequency": 70.62,
                "hierarchy": [
                    "Sport",
                    "Competition discipline",
                    "Basketball"
                ],
                "id": "20000851",
                "label": "Basketball",
                "namespace": "iptc_en_1.0",
                "positions": [
                    {
                        "end": 14,
                        "start": 0
                    },
                    {
                        "end": 53,
                        "start": 35
                    },
                    {
                        "end": 139,
                        "start": 136
                    }
                ],
                "score": 4005.0,
                "winner": true
            }
        ],
        "content": "Michael Jordan was one of the best basketball players of all time. Scoring was Jordan's stand-out skill, but he still holds a defensive NBA record, with eight steals in a half.",
        "language": "en",
        "version": "sensei: 3.1.0; disambiguator: 14.5-QNTX-2016"
    },
    "success": true
}
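In this sample, the positions array appears to hold character offsets into content, with an exclusive end. That reading is inferred from the sample itself and is not stated on this page. Under that assumption, the following sketch recovers the text behind each position; the helper name matched_text is hypothetical, and the dictionary is a trimmed copy of the response above.

```python
def matched_text(data, category):
    """Return the substrings of data['content'] covered by a category's positions.

    Assumes start/end are character offsets with an exclusive end.
    """
    content = data["content"]
    return [content[p["start"]:p["end"]] for p in category["positions"]]


# Trimmed copy of the "data" object from the sample response above.
data = {
    "content": "Michael Jordan was one of the best basketball players of all time. "
               "Scoring was Jordan's stand-out skill, but he still holds a defensive "
               "NBA record, with eight steals in a half.",
    "categories": [
        {
            "id": "20000851",
            "label": "Basketball",
            "positions": [
                {"start": 0, "end": 14},
                {"start": 35, "end": 53},
                {"start": 136, "end": 139},
            ],
        }
    ],
}

for category in data["categories"]:
    print(category["label"], matched_text(data, category), sep="\t")
```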

The following curl command posts a document to the document classification resource of the API's REST interface.
Open a command prompt in the folder where you installed curl and run the command after replacing token with the actual authorization token.

curl -X POST https://nlapi.expert.ai/v1/categorize/iptc/en -H "Authorization: Bearer token" -H "Content-Type: application/json; charset=utf-8" -d "{\"document\": {\"text\": \"Michael Jordan was one of the best basketball players of all time. Scoring was Jordan's stand-out skill, but he still holds a defensive NBA record, with eight steals in a half.\"}}"

The server returns the same JSON object shown above for the shell version of the command.