PII Knowledge Model
Overview
The PII Knowledge Model (display name: PII EN v#) detects and extracts personally identifiable information (PII) and returns it in two alternative formats, one of which is linked data in JSON-LD format (see also https://json-ld.org/).
The PII model does not de-identify PII in the text. To do that, use the PII Pseudonymization and PII Anonymization Knowledge Models. Alternatively, you can post-process the text yourself, using the PII model output to determine where the PII is and what type it is, and then replace it with placeholders or pseudonyms.
Information types
These are the information types the PII model can detect:
Information type | Notes |
---|---|
Personal attributes | Of a real person or a fictional character |
Postal address | |
Bank account | |
IP address | |
E-mail address | |
URL | |
Financial product | Credit or debit card |
Phone number | |
These are the properties of each information type:
Simple vs. composite information
The PII model detects both simple and composite information.
Simple information, like a phone number or an e-mail address, has only one property. Composite information has two or more properties, like a postal address, which is composed of a street name, a locality, a ZIP code and a region.
Output structure
The model output has the same structure as the output of any other model and is affected by the functional properties of the workflow block.
The distinctive parts of the output are the results of information extraction, that is, the extractions array and the extraData object. To get extraData, turn on the Output rules extra data functional option of the workflow block.
The extractions array represents PII as extracted records, while the JSON-LD property of the extraData object is a JSON-LD representation of the same information.
extraData object
The only property of the extraData object is JSON-LD, for example:

```json
"extraData": {
  "JSON-LD": {
    "@context": {
      ...
    },
    "@graph": [
      {
        "@id": "https://schema.org/email?email=m.gut%40bfu.edu",
        "@type": "https://schema.org/email",
        "email": "m.gut@bfu.edu",
        "matches": [
          {
            "end": 211,
            "name": "email",
            "start": 197,
            "value": "m.gut@bfu.edu"
          }
        ]
      },
      {
        "@id": "https://schema.org/telephone?telephone=(210)%20617-5256",
        "@type": "https://schema.org/telephone",
        "matches": [
          {
            "end": 153,
            "name": "telephone",
            "start": 138,
            "value": "(210) 617-5256"
          }
        ],
        "telephone": "(210) 617-5256"
      },
      {
        "@id": "https://schema.org/telephone?telephone=(210)%20949-3006",
        "@type": "https://schema.org/telephone",
        "matches": [
          {
            "end": 181,
            "name": "telephone",
            "start": 166,
            "value": "(210) 949-3006"
          }
        ],
        "telephone": "(210) 949-3006"
      },
      {
        "@id": "https://schema.org/PostalAddress?address=7400%20Merton%20Minter%20Blvd.%2C%20San%20Antonio%2C%20TX%2C%2078229-4404",
        "@type": "https://schema.org/PostalAddress",
        "address": "7400 Merton Minter Blvd., San Antonio, TX, 78229-4404",
        "addressCountry": "United States of America",
        "addressLocality": "San Antonio",
        "addressRegion": "Texas",
        "matches": [
          {
            "end": 88,
            "name": "streetAddress",
            "start": 64,
            "value": "7400 Merton Minter Blvd."
          },
          {
            "end": 123,
            "name": "postalCode",
            "start": 112,
            "value": "78229-4404"
          },
          {
            "end": 123,
            "name": "address",
            "start": 64,
            "value": "7400 Merton Minter Blvd., 111E, San Antonio, TX 78229-4404"
          },
          {
            "end": 111,
            "name": "addressLocality",
            "start": 96,
            "value": "San Antonio, TX"
          },
          {
            "end": 111,
            "name": "addressRegion",
            "start": 96,
            "value": "San Antonio, TX"
          },
          {
            "end": 111,
            "name": "addressCountry",
            "start": 96,
            "value": "San Antonio, TX"
          }
        ],
        "postalCode": "78229-4404",
        "streetAddress": "7400 Merton Minter Blvd."
      },
      {
        "@id": "https://schema.org/Person?person=Mark%20Gutenberg",
        "@type": "https://schema.org/Person",
        "birthDate": "1984-12-08",
        "birthPlace": "Hamburg",
        "familyName": "Gutenberg",
        "gender": "M",
        "givenName": "Mark",
        "matches": [
          {
            "end": 54,
            "name": "familyName",
            "start": 39,
            "value": "Mark Gutenberg"
          },
          {
            "end": 54,
            "name": "gender",
            "start": 39,
            "value": "Mark Gutenberg"
          },
          {
            "end": 54,
            "name": "givenName",
            "start": 39,
            "value": "Mark Gutenberg"
          },
          {
            "end": 54,
            "name": "person",
            "start": 39,
            "value": "Mark Gutenberg"
          },
          {
            "end": 260,
            "name": "birthPlace",
            "start": 243,
            "value": "HAMBURG, GERMANY"
          },
          {
            "end": 282,
            "name": "birthDate",
            "start": 272,
            "value": "12/8/1984"
          }
        ],
        "person": "Mark Gutenberg"
      }
    ]
  }
}
```
The value of the JSON-LD property is the JSON-LD object.
What characterizes the JSON-LD format is that it provides linked data. Specifically, PII information types and properties are linked to schema.org public vocabulary definitions.
For example, the type of the information representing a postal address corresponds to the https://schema.org/PostalAddress definition, and the type's properties correspond to schema.org definitions too.
For the description of the JSON-LD format refer to the official documentation.
The @graph property of the JSON-LD object contains the actual PII. @graph is an array, each item of which represents a piece of simple or composite information.
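As an illustration of how the @graph array might be consumed, the following Python sketch groups its items by information type. It assumes the model output has already been parsed into a dictionary; the function name `group_graph_by_type` and the `sample_output` variable are hypothetical and the sample data is abridged from the example above.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_graph_by_type(output: dict) -> dict:
    """Group @graph items by the last path segment of their @type URL."""
    # extraData is only present when the "Output rules extra data"
    # functional option is on, so fall back to empty containers.
    json_ld = output.get("extraData", {}).get("JSON-LD", {})
    grouped = defaultdict(list)
    for item in json_ld.get("@graph", []):
        # "https://schema.org/PostalAddress" -> "PostalAddress"
        type_name = urlparse(item["@type"]).path.rsplit("/", 1)[-1]
        grouped[type_name].append(item)
    return dict(grouped)


# Abridged sample (assumption: output already parsed from JSON).
sample_output = {
    "extraData": {
        "JSON-LD": {
            "@graph": [
                {
                    "@type": "https://schema.org/telephone",
                    "telephone": "(210) 617-5256",
                },
                {
                    "@type": "https://schema.org/telephone",
                    "telephone": "(210) 949-3006",
                },
                {
                    "@type": "https://schema.org/PostalAddress",
                    "postalCode": "78229-4404",
                },
            ]
        }
    }
}

grouped = group_graph_by_type(sample_output)
for type_name, items in grouped.items():
    print(type_name, len(items))
```

Grouping by the schema.org type keeps the code independent of the property names, which differ from one information type to another.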
These are all the PII that may be present:
* dateTime is an array, since there can be more than one value associated with the person.
The matches array of each information item contains the occurrences of the properties in the text.
Each item of the array corresponds to a property. Item properties are:
- name: property name
- start: zero-based index of the first character of the occurrence in the text
- end: zero-based index of the first character after the occurrence in the text
- value: the portion of text from which the property value was taken
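The post-processing approach mentioned in the overview, replacing PII with placeholders, can be sketched with the start and end positions of the matches items. This is a minimal Python sketch, not the behavior of the Pseudonymization or Anonymization models. It assumes that `text[start:end]` is the occurrence, as the description of start and end suggests; if the offsets in your output behave differently, adjust the slicing. The function name and the `[NAME]` placeholder format are hypothetical.

```python
def replace_with_placeholders(text: str, graph: list) -> str:
    """Replace every matched occurrence with a [NAME] placeholder."""
    spans = []
    for item in graph:
        for match in item.get("matches", []):
            spans.append((match["start"], match["end"], match["name"]))

    # Process left to right; at the same start, the widest span comes first,
    # so a whole address wins over the street address nested inside it.
    spans.sort(key=lambda span: (span[0], -(span[1] - span[0])))

    parts = []
    cursor = 0
    for start, end, name in spans:
        if start < cursor:
            continue  # overlaps a span that was already replaced
        parts.append(text[cursor:start])
        parts.append(f"[{name.upper()}]")
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


# Made-up text and offsets, for illustration only.
text = "Call (210) 617-5256 or write to m.gut@bfu.edu."
graph = [
    {"matches": [{"start": 5, "end": 19, "name": "telephone",
                  "value": "(210) 617-5256"}]},
    {"matches": [{"start": 32, "end": 45, "name": "email",
                  "value": "m.gut@bfu.edu"}]},
]

redacted = replace_with_placeholders(text, graph)
print(redacted)
```

When several properties share the same span, as in the person example above, this sketch keeps only the first one it finds. A real implementation would need its own rule for choosing the placeholder.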
extractions array
These are all the templates and related fields:
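Information type | Template | Field |
---|---|---|
Personal attributes | PII_PERSON | person |
Personal attributes | PII_PERSON | givenName |
Personal attributes | PII_PERSON | familyName |
Personal attributes | PII_PERSON | age |
Personal attributes | PII_PERSON | gender |
Personal attributes | PII_PERSON | nationality |
Personal attributes | PII_PERSON | birthDate |
Personal attributes | PII_PERSON | birthPlace |
Personal attributes | PII_PERSON | deathDate |
Personal attributes | PII_PERSON | deathPlace |
Personal attributes | PII_PERSON | dateTime |
Postal address | PII_ADDRESS | address |
Postal address | PII_ADDRESS | streetAddress |
Postal address | PII_ADDRESS | addressCountry |
Postal address | PII_ADDRESS | addressLocality |
Postal address | PII_ADDRESS | addressRegion |
Postal address | PII_ADDRESS | postalCode |
Postal address | PII_ADDRESS | postOfficeBoxNumber |
Bank account | PII_BANKACCOUNT | IBAN |
Bank account | PII_BANKACCOUNT | IBANcountry |
IP address | PII_IP | IP |
E-mail address | PII_EMAIL | email |
URL | PII_URL | URL |
Financial product (credit/debit card) | PII_FINANCIALPRODUCT | creditDebitNumber |
Financial product (credit/debit card) | PII_FINANCIALPRODUCT | CVV |
Financial product (credit/debit card) | PII_FINANCIALPRODUCT | expirationDate |
Phone number | PII_TELEPHONE | telephone |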
Note
If you are familiar with Platform extraction projects, the template key in this model's output corresponds to the concept of group, and template fields correspond to classes.
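The templates and fields listed above can be kept in a lookup table when processing extracted records. The Python sketch below copies the table into a dictionary and uses the field count to tell simple from composite information, following the definition given earlier (one property versus two or more). The names `TEMPLATES`, `is_composite` and `unexpected_fields` are hypothetical helpers, not part of the model.

```python
# Template -> (information type, fields), copied from the table above.
TEMPLATES = {
    "PII_PERSON": ("Personal attributes", [
        "person", "givenName", "familyName", "age", "gender", "nationality",
        "birthDate", "birthPlace", "deathDate", "deathPlace", "dateTime",
    ]),
    "PII_ADDRESS": ("Postal address", [
        "address", "streetAddress", "addressCountry", "addressLocality",
        "addressRegion", "postalCode", "postOfficeBoxNumber",
    ]),
    "PII_BANKACCOUNT": ("Bank account", ["IBAN", "IBANcountry"]),
    "PII_IP": ("IP address", ["IP"]),
    "PII_EMAIL": ("E-mail address", ["email"]),
    "PII_URL": ("URL", ["URL"]),
    "PII_FINANCIALPRODUCT": ("Financial product", [
        "creditDebitNumber", "CVV", "expirationDate",
    ]),
    "PII_TELEPHONE": ("Phone number", ["telephone"]),
}


def is_composite(template: str) -> bool:
    """True when the template has two or more fields."""
    _, fields = TEMPLATES[template]
    return len(fields) > 1


def unexpected_fields(template: str, field_names: list) -> list:
    """Return the field names that the table does not list for the template."""
    _, fields = TEMPLATES[template]
    return [name for name in field_names if name not in fields]


for template, (info_type, _) in TEMPLATES.items():
    kind = "composite" if is_composite(template) else "simple"
    print(f"{template}: {info_type} ({kind})")
```

A check like `unexpected_fields` can help catch a mismatch between the code and the model version in use, since the field list may change between versions.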