Detectors

Introduction

A detector is an API resource that carries out information detection on a text.

In the scope of the REST interface, the detector name together with the language code identifies the specific API resource.
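As a rough illustration of how the two parts could combine into one resource address, here is a minimal sketch. The `/detect/{detector}/{language}` path pattern and the `detector_resource_url` helper are assumptions made for this example and are not stated in this article; check the reference section for the actual path.

```python
# Base URL as it appears in the self-documentation section of this article.
BASE_URL = "https://nlapi.expert.ai/v2"


def detector_resource_url(detector: str, language: str) -> str:
    """Build a resource URL from detector name and language code.

    Assumption: the "/detect/{detector}/{language}" pattern is illustrative.
    """
    return f"{BASE_URL}/detect/{detector}/{language}"


print(detector_resource_url("pii", "en"))
# prints https://nlapi.expert.ai/v2/detect/pii/en
```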

The table below shows the available detectors and the languages they support.

Detector name English Spanish French German Italian
pii
writeprint

You can also use the dynamic self-documentation resource of the API to discover available detectors and supported languages.

To learn how to request this kind of resource, read the dedicated article in the reference section.

pii detector

The pii detector identifies personal data, also known as personally identifiable information (PII), and returns it together with its positions in the text as linked data in JSON-LD format (see also https://json-ld.org/).

The detector's output allows you to determine if a document contains potentially sensitive data and possibly create a new version of the text in which the PII is de-identified.
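De-identification from positions can be sketched as follows. This is only an illustration: the `(start, end)` tuples with an exclusive `end`, the `deidentify` helper, and the mask string are assumptions, not the shape of the detector's actual JSON-LD output, and overlapping spans are not handled.

```python
from typing import List, Tuple


def deidentify(text: str, spans: List[Tuple[int, int]], mask: str = "[REDACTED]") -> str:
    """Replace each (start, end) character span with a mask (end is exclusive)."""
    result = text
    # Process spans from last to first so earlier offsets remain valid.
    for start, end in sorted(spans, reverse=True):
        result = result[:start] + mask + result[end:]
    return result


text = "Contact John Smith at john@example.com."
name, email = "John Smith", "john@example.com"
# Hypothetical positions, computed here instead of read from an API response.
spans = [
    (text.index(name), text.index(name) + len(name)),
    (text.index(email), text.index(email) + len(email)),
]
print(deidentify(text, spans))
# prints Contact [REDACTED] at [REDACTED].
```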

These are the information types pii can detect:

Information type | Notes
Personal attributes | Of a real person or a fictional character
Postal address |
Bank account |
IP address |
E-mail address |
URL |
Financial product | Credit or debit card
Phone number |

These are the properties of each information type:

Information type | Property | Linked data reference
Personal attributes | Full name of the person | https://schema.org/Person
Personal attributes | First name | https://schema.org/givenName
Personal attributes | Last name | https://schema.org/familyName
Personal attributes | Age | https://schema.org/Number
Personal attributes | Gender | https://schema.org/gender
Personal attributes | Nationality | https://schema.org/nationality
Personal attributes | Date of birth | https://schema.org/birthDate
Personal attributes | Place of birth | https://schema.org/birthPlace
Personal attributes | Date of death | https://schema.org/deathDate
Personal attributes | Place of death | https://schema.org/deathPlace
Personal attributes | Any date or time related to the person | https://schema.org/Date
Postal address | Full address | https://schema.org/Text
Postal address | Street name and house number | https://schema.org/streetAddress
Postal address | Country | https://schema.org/addressCountry
Postal address | Postal code | https://schema.org/postalCode
Postal address | Locality | https://schema.org/addressLocality
Postal address | Region | https://schema.org/addressRegion
Postal address | PO box number | https://schema.org/postOfficeBoxNumber
Bank account | IBAN code | https://schema.org/PropertyValue
Bank account | IBAN code country | https://schema.org/Country
IP address | Address | https://schema.org/Text
E-mail address | Address | https://schema.org/email
URL | URL | https://schema.org/URL
Financial product | Number of the credit/debit card | https://schema.org/Text
Financial product | Card Verification Value (CVV) or Card Verification Code (CVC) | https://schema.org/Number
Financial product | Card expiration date | https://schema.org/Date
Phone number | Number | https://schema.org/telephone

Tip

To play with the JSON-LD object and get ideas for its possible uses, take a look at the JSON-LD playground site, where you can paste the JSON-LD object returned by the pii detector.

writeprint detector

The writeprint information detection resources, one for each supported language, perform a stylometric analysis of the document that ranges from readability and vocabulary richness to verb types, document structure and grammar. The detector can also identify the markers of several languages for specific purposes.

Stylometric data is provided in the form of 60 indexes (see below) which, taken together, make up a complete fingerprint of the document: its writeprint.
Comparing several documents on the basis of their writeprints highlights author invariants, which makes the detector a tool for authorship analysis.
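One simple way to compare two writeprints is to treat each as a vector of index values and measure the distance between them. The sketch below is an assumption-laden illustration, not the method used by the API: the `writeprint_distance` helper, the index names, and the numbers are all hypothetical, and a real comparison would probably normalize each index first because the indexes have very different scales.

```python
import math
from typing import Dict


def writeprint_distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Euclidean distance over the indexes that both writeprints contain."""
    shared = sorted(set(a) & set(b))
    return math.sqrt(sum((a[key] - b[key]) ** 2 for key in shared))


# Hypothetical per-sentence mean values for three documents.
doc_a = {"commas_per_sentence": 1.8, "adjectives_per_sentence": 2.1}
doc_b = {"commas_per_sentence": 1.7, "adjectives_per_sentence": 2.0}
doc_c = {"commas_per_sentence": 0.2, "adjectives_per_sentence": 5.5}

# A smaller distance suggests a more similar writing style.
print(writeprint_distance(doc_a, doc_b) < writeprint_distance(doc_a, doc_c))
```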

Writeprint information is returned as a JSON-LD object embedded in a broader JSON object.

Readability indexes

These are the readability indexes:

Index name | Description | Output data
Coleman-Liau | The Coleman-Liau index, whose value approximates the U.S. grade level thought necessary to comprehend the text. | Value and degree of difficulty
Gulpease | The Gulpease index, based on word length and best suited for the Italian language. | Value and degree of difficulty
Automated Readability Index (ARI) | The Automated Readability Index, which, like the Coleman-Liau index, approximates the U.S. grade level needed to comprehend the text and is best suited for the English language. | Value and degree of difficulty
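For orientation, the three indexes have widely published formulas, which the sketch below computes from raw counts. How the detector counts letters, words, and sentences, and how it maps a value to a degree of difficulty, is not stated here, so the values returned by the API may differ from these. The function names are chosen for this example only.

```python
def coleman_liau(letters: int, words: int, sentences: int) -> float:
    """Coleman-Liau index: approximate U.S. grade level."""
    letters_per_100_words = letters / words * 100
    sentences_per_100_words = sentences / words * 100
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8


def automated_readability_index(characters: int, words: int, sentences: int) -> float:
    """ARI: approximate U.S. grade level from characters per word and words per sentence."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


def gulpease(letters: int, words: int, sentences: int) -> float:
    """Gulpease index: higher values mean easier text (scale is roughly 0 to 100)."""
    return 89 + (300 * sentences - 10 * letters) / words


# Example counts for a text of 100 words, 500 letters, and 5 sentences.
print(round(coleman_liau(500, 100, 5), 2))
print(round(automated_readability_index(500, 100, 5), 2))
print(round(gulpease(500, 100, 5), 2))
```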

Spelling

The following indexes measure aspects related to the spelling of words and the presence of particular punctuation characters.

Index name | Notes
Sentences starting with a capital letter (ratio) | The ratio of sentences in which the first word starts with an uppercase letter to the total number of sentences.
Sentences starting with a small letter (ratio) | The ratio of sentences in which the first word starts with a lowercase letter to the total number of sentences.
Emoticons per sentence |
Dots per sentence | Dots in addition to the period at the end of the sentence can indicate concise language, because abbreviations use them.
Multiple dots per sentence | Ellipses and longer sequences of dots.
Question marks per sentence | Indicative of the ratio of questions to the total number of sentences.
Multiple question marks per sentence |
Exclamation marks per sentence |
Multiple exclamation marks per sentence |
Exclamation mark question mark sequences per sentence |
Commas per sentence |
Colons per sentence |
Semicolons per sentence |
Single quotation marks per sentence |
Double quotation marks per sentence |

Text subdivision

The following indexes count the occurrences or measure the length of certain subdivisions of the text, from sentences to characters.

Index name | Notes
Sentences |
Tokens | Words are tokens, but consecutive words recognized as a unit (like credit card or red carpet) and punctuation marks are also tokens.
Token length per sentence |
Characters per sentence |
Atoms per sentence | Words and punctuation marks are both tokens (see above) and atoms, except in the case of consecutive words recognized as a unit. In that case, the constituent words are atoms, while the multi-word unit is a single token.
Tokens per sentence |
Phrases per sentence |
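The difference between tokens and atoms can be illustrated with a small sketch. Representing a multi-word token as a single space-separated string, and the `count_atoms` helper itself, are assumptions made for this example; they do not reflect how the API represents tokens.

```python
from typing import List


def count_atoms(tokens: List[str]) -> int:
    """Count atoms: each constituent word of a multi-word token is one atom."""
    return sum(len(token.split()) for token in tokens)


# "credit card" is recognized as a unit: one token, two atoms.
tokens = ["She", "paid", "by", "credit card", "."]
print(len(tokens), count_atoms(tokens))
# prints 5 6
```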

Grammar

The following indexes count the occurrences per sentence of the different parts of speech.

Index name
Adjectives per sentence
Adverbs per sentence
Articles per sentence
Auxiliaries per sentence
Conjunctions per sentence
Nouns per sentence
Proper nouns per sentence
Punctuation per sentence
Prepositions per sentence
Pronouns per sentence
Particles per sentence
Verbs per sentence

Phrase types

The indexes that follow count the number of different phrases per sentence.

Index name
Adjective phrases per sentence
Conjunction phrases per sentence
Adverb phrases per sentence
Noun phrases per sentence
Nominal predicates per sentence
Preposition phrases per sentence
Relative phrases per sentence
Verb phrases per sentence

Language variety and errors

The following indexes measure various aspects of the text that are indicative of greater or lesser variety of the language and the presence of the most common errors.

Index name | Notes
Different types of verbs | This index is related to the meaning of verbs and their hypernymy relationship with other verbs. In this hierarchical relationship, to limp is a "son" of to walk, which in turn has the archetypal pure concept of verb of movement as its farthest ancestor. This index is the number of different archetypal verbs expressed in the text and is indicative of the variety of the language.
Different types of verbs per sentence | The number of distinct archetypal verbs (see the index above) is computed for each sentence and this is the mean value.
Named entities per sentence | This index is based on the named entity recognition (NER) capability of the API.
Unknown concepts per sentence | An unknown concept is a word that's not mapped to a concept in the expert.ai Knowledge Graph.
Function words per sentence | Function words have little or no lexical meaning, but help create fluent and more readable sentences.
Commonly misspelled words per sentence | This index is based on a list of the most common writing errors.
Most common words per sentence | This index is based on a list of the most common terms used in writing.
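The idea behind the Different types of verbs index can be sketched with a toy hypernym hierarchy. Everything here is hypothetical: the `HYPERNYM` table stands in for the expert.ai Knowledge Graph, and the verbs and archetype labels are invented for illustration.

```python
from typing import Dict, Iterable

# Toy child -> parent hypernym links (hypothetical data).
HYPERNYM: Dict[str, str] = {
    "limp": "walk",
    "stroll": "walk",
    "walk": "move",
    "shout": "speak",
    "speak": "communicate",
}


def archetype(verb: str) -> str:
    """Follow hypernym links up to the farthest ancestor."""
    while verb in HYPERNYM:
        verb = HYPERNYM[verb]
    return verb


def different_types_of_verbs(verbs: Iterable[str]) -> int:
    """Count the distinct archetypal verbs expressed by a list of verbs."""
    return len({archetype(verb) for verb in verbs})


# "limp" and "stroll" share the archetype "move"; "shout" maps to "communicate".
print(different_types_of_verbs(["limp", "stroll", "shout"]))
# prints 2
```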

Language for specific purposes

The following indexes count the presence of terms associated with language for specific purposes.

Index name
Academic language words per sentence
Business language words per sentence
Crime language words per sentence
Layman language words per sentence
Legal language words per sentence
Military language words per sentence
Political language words per sentence
Social media language words per sentence

Values

Mean, standard deviation and absolute mean deviation are returned for all the "per sentence" indexes. For all but Token length per sentence, the total number of occurrences in the entire text of the document is also returned.

For indexes that are simple counters, such as Sentences, the total count of elements in the text is returned.
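The values returned for a "per sentence" index can be reproduced from per-sentence counts roughly as follows. This is a sketch under assumptions: it uses the population standard deviation and reads "absolute mean deviation" as the mean of the absolute deviations from the mean, neither of which is confirmed by this article. The `per_sentence_stats` helper and its key names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def per_sentence_stats(counts: List[int]) -> Dict[str, float]:
    """Summarize one index given its number of occurrences in each sentence."""
    average = mean(counts)
    return {
        "total": sum(counts),
        "mean": average,
        # Assumption: population (not sample) standard deviation.
        "standard_deviation": pstdev(counts),
        # Assumption: mean of absolute deviations from the mean.
        "absolute_mean_deviation": mean([abs(c - average) for c in counts]),
    }


# Example: commas counted in each of four sentences.
stats = per_sentence_stats([1, 2, 3, 4])
print(stats["total"], stats["mean"], stats["absolute_mean_deviation"])
```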

Self-documentation resource

The API provides a self-documentation resource to discover available detectors and their features. It has this path:

detectors

Therefore, the complete URL is:

https://nlapi.expert.ai/v2/detectors

It must be requested with the GET method.
It returns the list of available detectors along with their supported languages, as in the table above.

In the reference section of this manual you will find all the information you need to get detectors information using the API's RESTful interface, specifically:

Even if you use the API through a client that hides the REST interface, whether it is made by you or offered by expert.ai, the last piece of information is useful as it helps understand the data returned by the API.

Here is an example of getting detectors information:

This example is based on the Python client you can find on GitHub.

The client gets user credentials from two environment variables:

EAI_USERNAME
EAI_PASSWORD

Set those variables with your account credentials before running the sample program below.

The program prints the list of detectors with the languages they support.

from expertai.nlapi.cloud.client import ExpertAiClient
client = ExpertAiClient()

output = client.detectors()

# Detectors

print("Detectors:\n")

for detector in output.detectors:
    print(detector.name)
    print("\tLanguages:")
    for language in detector.languages:
        print("\t\t{}".format(language.code))
    print("\tContract: {}".format(detector.contract))

This example is based on the Java client you can find on GitHub.

The client gets user credentials from two environment variables:

EAI_USERNAME
EAI_PASSWORD

Set those variables with your account credentials before running the sample program below.

The program prints the JSON response.

import ai.expert.nlapi.security.Authentication;
import ai.expert.nlapi.security.Authenticator;
import ai.expert.nlapi.security.BasicAuthenticator;
import ai.expert.nlapi.security.DefaultCredentialsProvider;
import ai.expert.nlapi.v2.API;
import ai.expert.nlapi.v2.message.TaxonomiesResponse;
import ai.expert.nlapi.v2.InfoAPI;
import ai.expert.nlapi.v2.InfoAPIConfig;

public class Main {

    public static Authentication createAuthentication() throws Exception {
        DefaultCredentialsProvider credentialsProvider = new DefaultCredentialsProvider();
        Authenticator authenticator = new BasicAuthenticator(credentialsProvider);
        return new Authentication(authenticator);
    }

    public static void main(String[] args) {
        try {
            InfoAPI infoAPI = new InfoAPI(InfoAPIConfig.builder()
               .withAuthentication(createAuthentication())
               .withVersion(API.Versions.V2)
               .build());

            TaxonomiesResponse taxonomies = infoAPI.getTaxonomies();
            taxonomies.prettyPrint();
        }
        catch(Exception ex) {
            ex.printStackTrace();
        }
    }
}

The following curl command gets the detectors documentation resource of the API's REST interface.
Run the command from a shell after replacing token with the actual authorization token.

curl -X GET https://nlapi.expert.ai/v2/detectors \
    -H 'Authorization: Bearer token'

The server returns a JSON object.

The following curl command gets the detectors documentation resource of the API's REST interface.
Open a command prompt in the folder where you installed curl and run the command after replacing token with the actual authorization token.

curl -X GET https://nlapi.expert.ai/v2/detectors -H "Authorization: Bearer token"

The server returns a JSON object.