Writeprint detection
Introduction
The writeprint information detection resources (one for each of the supported languages) perform a stylometric analysis of a text, ranging from readability and vocabulary richness to verb types, document structure and grammar. The detector can also identify the markers of several specific-purpose languages.
Stylometric data is provided in the form of 60 indexes (see below) which, taken as a whole, make up a complete fingerprint of the document, i.e. its writeprint.
Comparing several documents on the basis of their writeprints highlights the literary attributes of their authors, which makes the detector an authorship analysis tool.
Writeprint information is returned as a JSON-LD object embedded in a broader JSON object.
Readability indexes
These are the readability indexes:
Index name | Description | Output data |
---|---|---|
Coleman-Liau | The Coleman-Liau index, whose value approximates the U.S. grade level thought necessary to comprehend the text. | Value and degree of difficulty |
Gulpease | The Gulpease index, based on word length and best suited for the Italian language. | Value and degree of difficulty |
Automated Readability Index (ARI) | The Automated Readability Index, which, like the Coleman-Liau index, approximates the U.S. grade level needed to comprehend the text. It is best suited for the English language. | Value and degree of difficulty |
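As a rough illustration of what these indexes measure, the sketch below applies the commonly published formulas for the three indexes to simple counts of sentences, words and letters. It is not the detector's implementation: the function names and the naive regex-based counting in `text_counts` are assumptions made for this example, and the API may count text units differently and return different values.

```python
import re

def text_counts(text):
    """Naive counts (an assumption): sentences end with ., ! or ?; words are runs of letters, digits or apostrophes."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text)
    letters = sum(ch.isalnum() for word in words for ch in word)
    return len(sentences), len(words), letters

def coleman_liau(sentences, words, letters):
    """Published Coleman-Liau formula, based on letters and sentences per 100 words."""
    letters_per_100 = letters / words * 100
    sentences_per_100 = sentences / words * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8

def automated_readability_index(sentences, words, letters):
    """Published ARI formula, based on characters per word and words per sentence."""
    return 4.71 * (letters / words) + 0.5 * (words / sentences) - 21.43

def gulpease(sentences, words, letters):
    """Published Gulpease formula; higher values indicate easier text."""
    return 89 + (300 * sentences - 10 * letters) / words

if __name__ == "__main__":
    counts = text_counts("The cat sat on the mat. It purred!")
    print(coleman_liau(*counts), automated_readability_index(*counts), gulpease(*counts))
```

Note that Coleman-Liau and ARI grow with difficulty (higher grade level), while Gulpease grows with ease of reading, so the values are not directly comparable across indexes.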
Spelling
The following indexes measure aspects related to the spelling of words and the presence of particular punctuation characters.
Index name | Notes |
---|---|
Sentences starting with a capital letter (ratio) | The ratio of sentences in which the first word starts with an uppercase letter to the number of sentences. |
Sentences starting with a small letter (ratio) | The ratio of sentences in which the first word starts with a lowercase letter to the number of sentences. |
Emoticons per sentence | |
Dots per sentence | Dots in addition to the period at the end of the sentence can indicate concise language, because they often come from abbreviations. |
Multiple dots per sentence | Sequences of dots, such as ellipsis points and longer sequences. |
Question marks per sentence | Indicative of the ratio of questions to the total number of sentences. |
Multiple question marks per sentence | |
Exclamation marks per sentence | |
Multiple exclamation marks per sentence | |
Exclamation mark, question mark sequences per sentence | |
Commas per sentence | |
Colons per sentence | |
Semicolons per sentence | |
Single quotation marks per sentence | |
Double quotation marks per sentence | |
Text subdivision
The following indexes count the occurrences or measure the length of certain subdivisions of the text, from sentences to characters.
Index name | Notes |
---|---|
Sentences | |
Tokens | Single words are tokens, and so are punctuation marks and consecutive words recognized as a unit (like credit card or red carpet). |
Token length per sentence | |
Characters per sentence | |
Atoms per sentence | Words and punctuation marks are both tokens (see above) and atoms, except in the case of consecutive words recognized as a unit. In that case, the constituent words are atoms, while the multi-word unit is a single token. |
Tokens per sentence | |
Phrases per sentence | |
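The difference between atoms and tokens can be sketched as follows. This is only an illustration under stated assumptions: the `MULTIWORD_UNITS` set and the `atoms_and_tokens` function are hypothetical, the splitting is a plain regex, and only two-word units are handled. The detector recognizes multi-word units with its own linguistic resources.

```python
import re

# Hypothetical list of multi-word units, used only for this example.
MULTIWORD_UNITS = {("credit", "card"), ("red", "carpet")}

def atoms_and_tokens(sentence):
    """Split a sentence into atoms, then merge known two-word units into single tokens."""
    # Atoms: every word and every punctuation mark on its own.
    atoms = re.findall(r"\w+|[^\w\s]", sentence)
    tokens = []
    i = 0
    while i < len(atoms):
        pair = tuple(atom.lower() for atom in atoms[i:i + 2])
        if pair in MULTIWORD_UNITS:
            # Two atoms recognized as a unit become one token.
            tokens.append(" ".join(atoms[i:i + 2]))
            i += 2
        else:
            tokens.append(atoms[i])
            i += 1
    return atoms, tokens

if __name__ == "__main__":
    atoms, tokens = atoms_and_tokens("She paid with a credit card.")
    print(len(atoms), atoms)
    print(len(tokens), tokens)
```

For the sentence in the example, the sketch yields seven atoms and six tokens, because credit and card are two atoms but a single token.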
Grammar
The following indexes count the occurrences per sentence of the different parts of speech.
Index name |
---|
Adjectives per sentence |
Adverbs per sentence |
Articles per sentence |
Auxiliaries per sentence |
Conjunctions per sentence |
Nouns per sentence |
Proper nouns per sentence |
Punctuation per sentence |
Prepositions per sentence |
Pronouns per sentence |
Particles per sentence |
Verbs per sentence |
Phrase types
The indexes that follow count the number of different phrases per sentence.
Index name |
---|
Adjective phrases per sentence |
Conjunction phrases per sentence |
Adverb phrases per sentence |
Noun phrases per sentence |
Nominal predicates per sentence |
Preposition phrases per sentence |
Relative phrases per sentence |
Verb phrases per sentence |
Language variety and errors
The following indexes measure various aspects of the text that are indicative of greater or lesser variety of the language and the presence of the most common errors.
Index name | Notes |
---|---|
Different types of verbs | This index is related to the meaning of verbs and their hypernymy relationship with other verbs. In this hierarchical relationship, to limp is a "son" of to walk, which in turn has the archetypal pure concept of verb of movement as its farthest ancestor. The index is the number of different archetypal verbs expressed in the text and is indicative of the variety of the language. |
Different types of verbs per sentence | The number of distinct archetypal verbs (see the index above) is computed for each sentence and then averaged over all the sentences. |
Named entities per sentence | This index is based on the API's named entity recognition (NER) capability. |
Unknown concepts per sentence | An unknown concept is a word that's not mapped to a concept in the expert.ai Knowledge Graph. |
Function words per sentence | Function words have little or no lexical meaning, but help create fluent and more readable sentences. |
Commonly misspelled words per sentence | This index is based on the most common writing errors. |
Most common words per sentence | This index is based on a list of the most common writing terms. |
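The idea behind the two "different types of verbs" indexes can be sketched with a toy hierarchy. Everything in the block is an assumption made for illustration: the `HYPERNYM` map is a hypothetical stand-in for the hypernymy links stored in the Knowledge Graph, and the function names are invented for this example.

```python
from statistics import mean

# Hypothetical hypernym links (verb -> parent); roots are the archetypal verbs.
HYPERNYM = {
    "limp": "walk",
    "walk": "verb of movement",
    "run": "verb of movement",
    "whisper": "speak",
    "speak": "verb of communication",
}

def archetype(verb):
    """Follow hypernym links up to the farthest ancestor."""
    while verb in HYPERNYM:
        verb = HYPERNYM[verb]
    return verb

def different_verb_types(verbs):
    """Number of distinct archetypal verbs among the given verbs."""
    return len({archetype(verb) for verb in verbs})

def different_verb_types_per_sentence(verbs_by_sentence):
    """Mean of the per-sentence counts of distinct archetypal verbs."""
    return mean(different_verb_types(verbs) for verbs in verbs_by_sentence)

if __name__ == "__main__":
    sentences = [["limp", "run"], ["walk", "whisper"]]
    all_verbs = [verb for verbs in sentences for verb in verbs]
    print(different_verb_types(all_verbs))
    print(different_verb_types_per_sentence(sentences))
```

In this toy data, limp, walk and run all collapse to the same archetype, so a text that uses many verbs of the same family still scores low on variety.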
Language for specific purposes
The following indexes count the presence of terms associated with language for specific purposes.
Index name |
---|
Academic language words per sentence |
Business language words per sentence |
Crime language words per sentence |
Layman language words per sentence |
Legal language words per sentence |
Military language words per sentence |
Political language words per sentence |
Social media language words per sentence |
Values
Mean, standard deviation and absolute mean deviation are returned for all the "per sentence" indexes. For all but Token length per sentence, the total number of occurrences in the entire text of the document is also returned.
For indexes that are simple counters (e.g. Sentences), the total count of elements in the text is returned.
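The aggregate values for a "per sentence" index can be reproduced from the per-sentence counts, as in the sketch below. It rests on assumptions: the standard deviation is taken to be the population standard deviation, the absolute mean deviation is taken to be the mean of the absolute deviations from the mean, and the dictionary keys are hypothetical names, not the field names of the API response.

```python
from statistics import mean, pstdev

def per_sentence_stats(counts):
    """Aggregate per-sentence counts of one index (e.g. commas in each sentence)."""
    avg = mean(counts)
    return {
        "mean": avg,
        # Assumption: population standard deviation; the API may use the sample formula.
        "standard_deviation": pstdev(counts),
        # Assumption: mean of the absolute deviations from the mean.
        "absolute_mean_deviation": mean(abs(count - avg) for count in counts),
        # Total number of occurrences in the whole text.
        "total": sum(counts),
    }

if __name__ == "__main__":
    # Commas found in each of four sentences (made-up data).
    print(per_sentence_stats([2, 0, 1, 1]))
```

A low mean with a high deviation suggests that the feature is concentrated in a few sentences rather than spread evenly across the text.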
Useful resources
- How to request information detection API resources.
- How to interpret the output of the writeprint detector.