Knowledge Graph Extension (KGE) file formats
Introduction
This article describes the characteristics of the source files that can be used to extend the Knowledge Graph.
You need to use plain text files, which can be .txt files or comma-separated values (CSV) files with the .csv extension.
The former are used only to add syncons corresponding to named entities, while the latter allow for more operations:
- addition of syncons of various types
- creation of links between syncons
- addition of user data to syncons
.txt files
Use .txt
files if all you need to do is add named entities to the Knowledge Graph.
Files with the .txt
extension must contain a line for each entity you want to add. For each entity a Knowledge Graph syncon will be created.
The syncon lemmas, that is the main lemma plus any synonyms, must be entered in each line.
For example, a file with these contents:
Ferdinand Lewis Alcindor Jr.,Kareem Abdul-Jabbar
Larry Bird
Dr. J,Julius Erving
Karl Malone,The Mailman
Earvin Johnson,Magic Johnson
corresponds to five syncons, each of which represents a distinct entity.
To separate the lemmas, you can use a comma (,), a pipe (|) or a semicolon (;).
If a file contains lemmas with a comma, like ACME, Incorporated or Los Angeles, CA, use another of the possible characters to separate lemmas.
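As a minimal sketch of the rule above, the following Python snippet picks a separator that does not occur inside any lemma and builds the lines of a .txt file. The function names are hypothetical, and the choice of a single separator for the whole file is an assumption made for simplicity, not something the import procedure is stated to require.

```python
# Separators allowed in .txt files, in the order the article lists them.
SEPARATORS = [",", "|", ";"]


def pick_separator(entities):
    """Return the first allowed separator that appears in none of the lemmas."""
    for sep in SEPARATORS:
        if not any(sep in lemma for lemmas in entities for lemma in lemmas):
            return sep
    raise ValueError("every allowed separator occurs in at least one lemma")


def to_txt_lines(entities):
    """Build one line per entity by joining its lemmas with a safe separator."""
    sep = pick_separator(entities)
    return [sep.join(lemmas) for lemmas in entities]


# Each inner list holds the lemmas of one entity (one future syncon).
entities = [
    ["ACME, Incorporated", "ACME"],
    ["Larry Bird"],
]
lines = to_txt_lines(entities)
print("\n".join(lines))
```

Because one lemma contains a comma, the sketch falls back to the pipe character for the whole file.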
.csv files
File extension
CSV files must have the .csv
extension.
Headings
The first row is dedicated to field headings; all the other rows specify the operations to be performed to extend the Knowledge Graph.
Each row can be considered a "record" with "fields" corresponding to the headings based on their position. The possible fields are: operation, id, type, gloss, lemma, weight, freq, parent, link, domain1, domain2, userkey, uservalue.
Excel options and separator character
It is common to produce CSV files by exporting single sheets from an Excel document.
The Knowledge Graph extension procedure allows importing CSV files where the field separator is one of the following characters:
- , (comma): this is the default value when exporting Excel sheets to CSV from a computer with a US or UK English locale. Values containing commas, like Apple, Inc., are enclosed in double quotes ("Apple, Inc.").
- ; (semicolon): this is the default value when exporting Excel sheets to CSV from a computer with a locale in which the comma is the decimal separator (e.g., Italian, French, German).
- | (pipe)
- \t (tab): this is the default separator when saving tab-separated values (TSV). If the Excel export procedure generates a file with the .txt extension, it must be changed to .csv.
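Before importing a file, it can be handy to check which of the four separators it actually uses. The sketch below is only a heuristic based on counting characters in the heading row; it is not a description of how the extension procedure itself detects the separator, and the function name is hypothetical.

```python
# The four field separators accepted for CSV files, as listed above.
ALLOWED_SEPARATORS = [",", ";", "|", "\t"]


def detect_separator(header_line):
    """Guess the separator as the allowed character occurring most often in the heading row."""
    counts = {sep: header_line.count(sep) for sep in ALLOWED_SEPARATORS}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("no allowed separator found in the heading row")
    return best


print(repr(detect_separator("operation;id;lemma;parent")))
print(repr(detect_separator("operation\tid\tuserkey\tuservalue")))
```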
In general, an Excel sheet can be exported to a valid CSV file by selecting one of these formats in the Save As... dialog:
- CSV UTF-8 (Comma delimited) (*.csv): in this case the file will be encoded as UTF-8 with BOM with columns separated by the character defined by the specific locale.
- CSV (Comma delimited) (*.csv): this is the same as above, but the file will be encoded as ANSI (e.g., Windows-1252) and therefore cannot represent characters outside that code page (e.g., Russian, Chinese, Japanese, etc.), so it should be used only when the input contains just Latin characters.
- Text (Tab delimited) (*.txt): this saves the file as TSV (tab-separated) with the .txt extension, so the extension must be changed to .csv. The file will be ANSI encoded, as for the choice above.
The order in which the field headings are listed in the first row of the file is not important. Fields corresponding to unexpected headings are ignored by the import procedure.
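For files produced by a script rather than exported from Excel, the record layout can be sketched with Python's standard csv module. The parent ID 12345, the user data key and the values below are placeholders invented for illustration, and leaving unused fields empty is an assumption about what the import procedure tolerates.

```python
import csv
import io

# Only the headings actually needed; their order is not significant.
FIELDS = ["operation", "id", "lemma", "parent", "link", "userkey", "uservalue"]

rows = [
    # New syncon with an explicit ID; 12345 is a placeholder parent ID.
    {"operation": "ADDSYN", "id": "6000001", "lemma": "Larry Bird", "parent": "12345"},
    # User data attached to the syncon created by the previous record.
    {"operation": "ADDUSERDATA", "id": "6000001", "userkey": "team", "uservalue": "Boston Celtics"},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=FIELDS,
    delimiter=";",        # one of the accepted separators
    restval="",           # fields missing from a record are left empty
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To produce an actual file, the same writer can be pointed at a file opened with `open(path, "w", encoding="utf-8", newline="")` instead of the in-memory buffer.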
operation
It is the only field that is always required. Allowed values are:
- ADDSYN: causes the creation of a new syncon.
- ADDLINK: causes the creation of a link of type link between syncon parent and syncon id.
- ADDUSERDATA: causes the addition of user data with key userkey and value uservalue to syncon id.
Other fields may be optional, ignored or required on the basis of the operation type according to the following scheme.
operation | id | type | gloss | lemma | weight | freq | parent | link | domain1 | domain2 | userkey | uservalue |
ADDSYN | Optional | Optional | Optional | Required | Optional | Optional | Required | Optional | Optional | Optional | Optional | Optional |
ADDLINK | Required | Ignored | Ignored | Ignored | Ignored | Ignored | Required | Required | Ignored | Ignored | Ignored | Ignored |
ADDUSERDATA | Required | Ignored | Ignored | Ignored | Ignored | Ignored | Ignored | Ignored | Ignored | Ignored | Required | Required |
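The "Required" cells of the scheme above can be turned into a small pre-import check. This is a sketch with hypothetical names, not part of the extension procedure; it follows the table literally, so link is treated as required for ADDLINK.

```python
# Required fields per operation, copied from the scheme above.
REQUIRED = {
    "ADDSYN": ["lemma", "parent"],
    "ADDLINK": ["id", "parent", "link"],
    "ADDUSERDATA": ["id", "userkey", "uservalue"],
}


def missing_fields(record):
    """Return the required fields that are absent or empty in a record (a dict)."""
    operation = (record.get("operation") or "").strip()
    if operation not in REQUIRED:
        raise ValueError(f"unknown operation: {operation!r}")
    return [
        name
        for name in REQUIRED[operation]
        if not (record.get(name) or "").strip()
    ]


print(missing_fields({"operation": "ADDSYN", "lemma": "Larry Bird"}))
print(missing_fields({"operation": "ADDLINK", "id": "6000001",
                      "parent": "12345", "link": "supernomen/subnomen"}))
```

Records read with `csv.DictReader` can be passed to `missing_fields` directly, since each row is a dictionary keyed by heading.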
id
This is the numeric ID of the syncon involved in the operation. It is required for ADDLINK and ADDUSERDATA operations, while it's optional for the ADDSYN operation. When omitted for ADDSYN, the ID is automatically set to the first available value, starting from 6000000.
If specified for ADDSYN operations, which generate new syncons, it must be a positive integer, with no thousands separator, between 6000000 and 9000000.
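The constraints on explicit IDs for new syncons can be checked with a few lines of Python. This is a sketch with a hypothetical function name; treating both bounds as inclusive is an assumption, since the text only says "between".

```python
MIN_NEW_ID = 6000000
MAX_NEW_ID = 9000000  # assumed inclusive


def parse_new_syncon_id(text):
    """Return the ID as an int, or None when the field is empty (automatic assignment)."""
    text = text.strip()
    if not text:
        return None
    # Plain ASCII digits only: rejects "6,000,001", "6.000.001" and "-5".
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a plain positive integer: {text!r}")
    value = int(text)
    if not MIN_NEW_ID <= value <= MAX_NEW_ID:
        raise ValueError(f"ID {value} is outside {MIN_NEW_ID}..{MAX_NEW_ID}")
    return value


print(parse_new_syncon_id("6000001"))
print(parse_new_syncon_id(""))
```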
type
The word class of the syncon being added.
It is optional for the ADDSYN
operation, the default value being NPR
. It is ignored for other operations.
gloss
The gloss, that is a human-readable definition of the syncon. It is optional for the ADDSYN
operation, the default value being "no gloss". This field is ignored for other operations.
lemma
One or more lemmas for the syncon.
This field is required for the ADDSYN
operation and is ignored for other operations.
Multiple lemmas must be separated with a comma (,).
Each lemma must be shorter than 255 characters.
If any lemma contains non-ANSI characters, the CSV file must be UTF-8 encoded.
If a lemma contains commas, they must be escaped by prefixing them with a backslash character (\), for example: Los Angeles\, CA. If a backslash is part of the lemma's value, it must be escaped by doubling it, for example: AC\\DC [1].
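The escaping rules for the lemma field can be sketched as follows. The function names are hypothetical, and checking the length limit on the unescaped lemma is an assumption, as the article does not say whether escape characters count.

```python
MAX_LEMMA_LENGTH = 255  # each lemma must be shorter than this


def escape_lemma(lemma):
    """Escape backslashes first, then commas, so added backslashes are not doubled."""
    return lemma.replace("\\", "\\\\").replace(",", "\\,")


def join_lemmas(lemmas):
    """Build the value of the lemma field from a list of raw lemmas."""
    for lemma in lemmas:
        if len(lemma) >= MAX_LEMMA_LENGTH:
            raise ValueError(f"lemma too long ({len(lemma)} characters)")
    return ",".join(escape_lemma(lemma) for lemma in lemmas)


print(join_lemmas(["Los Angeles, CA", "L.A."]))
print(escape_lemma("AC\\DC"))
```

The order of the two replacements matters: escaping commas first would introduce backslashes that the second step would then double.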
weight
The weight for the lemmas of the syncon.
It is optional for the ADDSYN operation and ignored for other operations.
The weight is an integer between 1 and 8. If multiple weights are specified, they must be separated with commas. The specified weights correspond to the lemmas based on their position in the sequence of comma separated values.
The default value in the case of ADDSYN operations is 1 for the first lemma of the list, which becomes the main lemma for the syncon, and 2 for any other lemma for which the weight is not specified.
If more weights than lemmas are provided, excess values are ignored.
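The positional matching between weights and lemmas, including the defaults and the handling of excess values, can be illustrated with this sketch. The function name is hypothetical, and the snippet models the described behaviour rather than the actual import code.

```python
def resolve_weights(lemmas, weights):
    """Pair each lemma with its weight, filling in defaults where none is given."""
    resolved = []
    for position in range(len(lemmas)):
        if position < len(weights):
            weight = weights[position]
            if not 1 <= weight <= 8:
                raise ValueError(f"weight {weight} is outside 1..8")
        else:
            # Default: 1 for the first (main) lemma, 2 for every other lemma.
            weight = 1 if position == 0 else 2
        resolved.append(weight)
    # Weights beyond the number of lemmas are never read, i.e. ignored.
    return resolved


print(resolve_weights(["Julius Erving", "Dr. J", "The Doctor"], [3]))
print(resolve_weights(["Larry Bird"], [4, 5, 6]))
```

The same positional logic applies to the freq field, with a range of 1 to 99 and a default of 1 for every lemma.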
freq
The frequency for the lemmas of this syncon.
It is optional for the ADDSYN operation and ignored for other operations.
The frequency is an integer between 1 and 99. If multiple frequencies are specified, they must be separated with commas. The specified frequencies correspond to the lemmas based on their position in the sequence of comma separated values.
The default value in the case of ADDSYN
operations is 1 for any lemma for which the frequency is not specified.
If more frequencies than lemmas are provided, excess values are ignored.
parent
The parent syncon ID.
It is required for ADDSYN and ADDLINK operations and ignored for ADDUSERDATA.
It can be the ID of a syncon added during the same Knowledge Graph extension operation, in which case its ADDSYN operation comes earlier in the file or in another imported file, or the ID of a syncon that is already present in the base Knowledge Graph.
The value of this field is used to link the syncon specified by id
with the syncon specified by parent
via the link specified by link
.
link
The name of the link used to connect the syncon specified by id
with that specified by parent
.
This field is optional for the ADDSYN
and ADDLINK
operations, the default value being superverbum/subverbum
if type
is VER
, supernomen/subnomen
otherwise.
The field is ignored in the case of ADDUSERDATA
operations.
In the case of ADDSYN
operations, the link can only be supernomen/subnomen
or superverbum/subverbum
, while for ADDLINK
operations the value of the field can be the name of one of the links already present in the Knowledge Graph and visible in the Knowledge Graph tool window, or a custom value. When a custom value is used, the extension procedure displays a warning message so that the user can confirm the intent to create a new link rather than a misspelling of an existing one, such as supernomen/subnomem.
Custom link names cannot contain spaces or underscore characters (_
).
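A pre-import check in the same spirit as that warning can be sketched with Python's difflib module. The list of known links below contains only the two names mentioned in this article and stands in for the full set visible in the Knowledge Graph tool window; the function names and the similarity cutoff are assumptions.

```python
import difflib

# Placeholder list: only the link names mentioned in this article.
KNOWN_LINKS = ["supernomen/subnomen", "superverbum/subverbum"]


def is_valid_custom_link_name(name):
    """Custom link names cannot be empty or contain whitespace or underscores."""
    return bool(name) and not any(ch.isspace() or ch == "_" for ch in name)


def possible_misspellings(name, known_links=KNOWN_LINKS):
    """Return existing link names that closely resemble a name not found in the list."""
    if name in known_links:
        return []
    return difflib.get_close_matches(name, known_links, n=3, cutoff=0.8)


print(is_valid_custom_link_name("plays_for"))
print(possible_misspellings("supernomen/subnomem"))
```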
domain1
A Knowledge Graph domain, with an optional frequency, to be added to the syncon as an attribute. The field is optional for ADDSYN
operations, it's ignored for other operations. The default value is empty, which means "no domain".
The value has this syntax:
domain[,frequency]
where domain
is one of the Knowledge Graph domains and frequency
is an integer number between 1 and 100. The default value for frequency is 50.
The sum of frequencies for domain1
and domain2
must be lower than or equal to 100.
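The domain[,frequency] syntax and the constraint on the sum of the two frequencies can be sketched as follows. The domain names in the example are placeholders rather than real Knowledge Graph domains, the function names are hypothetical, and the parser assumes that domain names do not contain commas.

```python
DEFAULT_FREQUENCY = 50


def parse_domain(value):
    """Split 'domain[,frequency]' into (domain, frequency); an empty value means no domain."""
    value = value.strip()
    if not value:
        return None
    name, separator, frequency_text = value.partition(",")
    frequency = int(frequency_text) if separator else DEFAULT_FREQUENCY
    if not 1 <= frequency <= 100:
        raise ValueError(f"frequency {frequency} is outside 1..100")
    return name.strip(), frequency


def check_domains(domain1, domain2):
    """Parse both fields and verify that the frequencies sum to at most 100."""
    parsed = [d for d in (parse_domain(domain1), parse_domain(domain2)) if d]
    total = sum(frequency for _, frequency in parsed)
    if total > 100:
        raise ValueError(f"frequencies sum to {total}, which exceeds 100")
    return parsed


print(check_domains("basketball,60", "sport,40"))
print(check_domains("basketball", ""))
```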
domain2
Second Knowledge Graph domain for the syncon. It is like domain1
, but since a syncon can have up to two domains, this field allows for setting both.
userkey
The name of the user data item that will be added to the syncon. The field is required for ADDUSERDATA
operations, is optional for ADDSYN
operations and it's ignored for ADDLINK
operations.
If not specified for an ADDSYN
operation, it defaults to empty, that is "no user data", but it is required if uservalue
has been specified.
uservalue
The value of the user data specified by userkey
. The field is required for ADDUSERDATA
operations, is optional for ADDSYN
operations and it's ignored for ADDLINK
operations.
If not specified for an ADDSYN
operation, it defaults to empty, that is "no user data", but it is required if userkey
has been specified.
[1] Indeed, it's AC/DC, but we needed an example :-)