Writeprint detection

Introduction

The writeprint information detection resources, one for each of the supported languages, perform a stylometric analysis of a text that ranges from readability and vocabulary richness to verb types, document structure and grammar. The detector can also identify the markers of several languages for specific purposes.

Stylometric data is provided in the form of 60 indexes (see below) which, taken together, make up a complete fingerprint of the document, i.e. its writeprint.
Comparing a number of documents on the basis of their writeprints highlights the literary attributes of their authors, which makes the detector an authorship analysis tool.

Writeprint information is returned as a JSON-LD object embedded in a broader JSON object.

Readability indexes

These are the readability indexes:

| Index name | Description | Output data |
| --- | --- | --- |
| Coleman-Liau | The Coleman-Liau index, whose value approximates the U.S. grade level thought necessary to comprehend the text. | Value and degree of difficulty |
| Gulpease | The Gulpease index, based on word length and best suited for the Italian language. | Value and degree of difficulty |
| Automated Readability Index (ARI) | The Automated Readability Index, which, like the Coleman-Liau index, approximates the U.S. grade level needed to comprehend the text and is best suited for the English language. | Value and degree of difficulty |
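
As an illustration of what these three indexes measure, the sketch below applies their commonly published formulas to a text. It is not the API's implementation: the regex-based counting of words, letters and sentences is a simplifying assumption, the function names are hypothetical, and the mapping from a value to a degree of difficulty is not reproduced here.

```python
import re


def text_counts(text):
    """Return (letters, alphanumerics, words, sentences) using naive rules.

    Assumption: words are runs of word characters and every run of
    '.', '!' or '?' closes a sentence. A real analyzer is more careful.
    """
    words = len(re.findall(r"\w+", text))
    if words == 0:
        raise ValueError("the text contains no words")
    letters = sum(ch.isalpha() for ch in text)
    alphanumerics = sum(ch.isalnum() for ch in text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return letters, alphanumerics, words, sentences


def coleman_liau(text):
    letters, _, words, sentences = text_counts(text)
    letters_per_100_words = letters / words * 100
    sentences_per_100_words = sentences / words * 100
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8


def automated_readability_index(text):
    _, characters, words, sentences = text_counts(text)
    return 4.71 * characters / words + 0.5 * words / sentences - 21.43


def gulpease(text):
    letters, _, words, sentences = text_counts(text)
    return 89 + (300 * sentences - 10 * letters) / words


sample = "The cat sat on the mat. It was happy."
print(coleman_liau(sample), automated_readability_index(sample), gulpease(sample))
```

Coleman-Liau and ARI grow with text difficulty, while Gulpease works the other way round: higher values indicate easier text. Very short samples such as the one above can fall outside the usual ranges.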

Spelling

The following indexes measure aspects related to the spelling of words and the presence of particular punctuation characters.

| Index name | Notes |
| --- | --- |
| Sentences starting with a capital letter (ratio) | The ratio of sentences in which the first word starts with an uppercase letter to the number of sentences. |
| Sentences starting with a small letter (ratio) | The ratio of sentences in which the first word starts with a lowercase letter to the number of sentences. |
| Emoticons per sentence | |
| Dots per sentence | The presence of dots in addition to the period at the end of the sentence can be indicative of a concise language because of abbreviations. |
| Multiple dots per sentence | Such as ellipsis points and longer sequences. |
| Question marks per sentence | Indicative of the ratio of questions to the total number of sentences. |
| Multiple question marks per sentence | |
| Exclamation marks per sentence | |
| Multiple exclamation marks per sentence | |
| Exclamation mark, question mark sequences per sentence | |
| Commas per sentence | |
| Colons per sentence | |
| Semicolons per sentence | |
| Single quotation marks per sentence | |
| Double quotation marks per sentence | |
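
To make the definitions above concrete, here is a minimal sketch of how the capital-letter ratio and a few punctuation counts could be obtained. The sentence splitter, the regular expressions and the function names are all assumptions made for the example; the actual detector relies on its own linguistic analysis and may count differently.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): a sentence ends at '.', '!' or '?'
    # followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def start_ratio(sentences, predicate):
    # Ratio of sentences whose first character satisfies the predicate.
    hits = sum(1 for s in sentences if predicate(s[0]))
    return hits / len(sentences)


def counts_per_sentence(sentences, pattern):
    # Number of non-overlapping matches of the pattern in each sentence.
    return [len(re.findall(pattern, s)) for s in sentences]


sentences = split_sentences("Is this real?! Yes, it is. no way, really??")

capital_ratio = start_ratio(sentences, str.isupper)
small_ratio = start_ratio(sentences, str.islower)
commas = counts_per_sentence(sentences, r",")
multiple_question_marks = counts_per_sentence(sentences, r"\?{2,}")
mixed_sequences = counts_per_sentence(sentences, r"(?:\?!|!\?)[!?]*")

print(capital_ratio, small_ratio, commas, multiple_question_marks, mixed_sequences)
```

The per-sentence lists produced this way are the raw material for the statistics described in the Values section.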

Text subdivision

The following indexes count the occurrences or measure the length of certain subdivisions of the text, from sentences to characters.

| Index name | Notes |
| --- | --- |
| Sentences | |
| Tokens | Words are tokens, but consecutive words recognized as a unit (like credit card or red carpet) and punctuation marks are also tokens. |
| Token length per sentence | |
| Characters per sentence | |
| Atoms per sentence | Words and punctuation marks are both tokens (see above) and atoms, except in the case of consecutive words recognized as a unit. In that case, the constituent words are atoms, while the multi-word unit is a single token. |
| Tokens per sentence | |
| Phrases per sentence | |
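
The difference between tokens and atoms can be sketched as follows. The list of multi-word units is a hypothetical stand-in for the lexicon used by the real analyzer, and the regex-based splitting is an assumption made only for illustration.

```python
import re

# Hypothetical lexicon of two-word units; the real analyzer has its own.
MULTIWORD_UNITS = {("credit", "card"), ("red", "carpet")}


def atoms(sentence):
    # Every word and every punctuation mark is an atom.
    return re.findall(r"\w+|[^\w\s]", sentence)


def tokens(sentence):
    # Same as atoms, except that a known multi-word unit becomes one token.
    items = atoms(sentence)
    result = []
    i = 0
    while i < len(items):
        pair = tuple(word.lower() for word in items[i:i + 2])
        if pair in MULTIWORD_UNITS:
            result.append(" ".join(items[i:i + 2]))
            i += 2
        else:
            result.append(items[i])
            i += 1
    return result


sentence = "She paid with a credit card."
print(len(atoms(sentence)), atoms(sentence))
print(len(tokens(sentence)), tokens(sentence))
```

In this example the sentence has seven atoms but six tokens, because credit and card are two atoms that form a single token.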

Grammar

The following indexes count the occurrences per sentence of the different parts of speech.

Index name
Adjectives per sentence
Adverbs per sentence
Articles per sentence
Auxiliaries per sentence
Conjunctions per sentence
Nouns per sentence
Proper nouns per sentence
Punctuation per sentence
Prepositions per sentence
Pronouns per sentence
Particles per sentence
Verbs per sentence

Phrase types

The following indexes count the occurrences per sentence of the different types of phrases.

Index name
Adjective phrases per sentence
Conjunction phrases per sentence
Adverb phrases per sentence
Noun phrases per sentence
Nominal predicates per sentence
Preposition phrases per sentence
Relative phrases per sentence
Verb phrases per sentence

Language variety and errors

The following indexes measure various aspects of the text that are indicative of greater or lesser variety of the language and the presence of the most common errors.

| Index name | Notes |
| --- | --- |
| Different types of verbs | This index is related to the meaning of verbs and their hypernymy relationship with other verbs. In this hierarchical relationship, to limp is a "child" of to walk, which in turn has the archetypal pure concept of verb of movement as its farthest ancestor. The index is the number of different archetypal verbs expressed in the text and is indicative of the variety of the language. |
| Different types of verbs per sentence | The number of distinct archetypal verbs (see the index above) is computed for each sentence, and the mean value is returned. |
| Named entities per sentence | This index is based on the named entity recognition (NER) capability of the API. |
| Unknown concepts per sentence | An unknown concept is a word that's not mapped to a concept in the expert.ai Knowledge Graph. |
| Function words per sentence | Function words have little or no lexical meaning, but help create fluent and more readable sentences. |
| Commonly misspelled words per sentence | This index is based on the most common writing errors. |
| Most common words per sentence | This index is based on a list of the most common writing terms. |
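
The idea behind the "Different types of verbs" index can be sketched by walking a hypernym chain up to its farthest ancestor and counting the distinct ancestors found. The hierarchy below is a toy, hypothetical one (including the labels "move" and "communicate" for the archetypes); the real index relies on the expert.ai Knowledge Graph.

```python
from statistics import mean

# Toy child -> parent hypernym links, invented for this example.
HYPERNYM = {
    "limp": "walk",
    "stroll": "walk",
    "walk": "move",
    "run": "move",
    "whisper": "speak",
    "speak": "communicate",
}


def archetype(verb):
    # Follow the links upwards until a verb without a parent is reached.
    seen = set()
    while verb in HYPERNYM and verb not in seen:
        seen.add(verb)
        verb = HYPERNYM[verb]
    return verb


def different_verb_types(verbs):
    # Number of distinct archetypal verbs among the given verbs.
    return len({archetype(v) for v in verbs})


# Verbs found in each sentence of a hypothetical two-sentence text.
verbs_by_sentence = [["limp", "run"], ["whisper", "walk"]]

all_verbs = [v for verbs in verbs_by_sentence for v in verbs]
print(different_verb_types(all_verbs))
print(mean(different_verb_types(verbs) for verbs in verbs_by_sentence))
```

Here to limp, to walk and to run all lead back to the same archetype, so they count as one type of verb, while to whisper contributes a second one.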

Language for specific purposes

The following indexes count the occurrences per sentence of terms associated with languages for specific purposes.

Index name
Academic language words per sentence
Business language words per sentence
Crime language words per sentence
Layman language words per sentence
Legal language words per sentence
Military language words per sentence
Political language words per sentence
Social media language words per sentence

Values

Mean, standard deviation and absolute mean deviation are returned for all the "per sentence" indexes. For all but Token length per sentence, the total number of occurrences in the entire text of the document is also returned.

For indexes that are simple counters (e.g. Sentences), the total count of the elements in the text is returned.
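
As a rough illustration of the values returned for a "per sentence" index, the sketch below derives them from a list of per-sentence counts. Several points are assumptions: the dictionary keys are hypothetical and do not reflect the API's field names, the standard deviation is taken as the population one, and the absolute mean deviation is read as the mean of the absolute deviations from the mean.

```python
from statistics import mean, pstdev


def summarize(per_sentence_counts):
    """Summarize a 'per sentence' index from its per-sentence counts."""
    average = mean(per_sentence_counts)
    return {
        "mean": average,
        # Assumption: population standard deviation.
        "standard_deviation": pstdev(per_sentence_counts),
        # Assumption: mean of the absolute deviations from the mean.
        "absolute_mean_deviation": mean(
            abs(count - average) for count in per_sentence_counts
        ),
        # Total occurrences in the whole text.
        "total": sum(per_sentence_counts),
    }


# Hypothetical "Commas per sentence" counts for a four-sentence text.
print(summarize([0, 1, 1, 2]))
```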

Useful resources