Knowledge Graph Extension (KGE) file formats

Introduction

This article describes the characteristics of the source files that can be used to extend the Knowledge Graph.

You need to use plain text files, which can be .txt files or comma-separated values (CSV) files with the .csv extension.
The former are used only to add syncons corresponding to named entities, while the latter allow for more operations:

  • Addition of syncons of various types
  • Creation of links between syncons
  • Addition of user data to syncons

.txt files

Use .txt files if all you need to do is add named entities to the Knowledge Graph.
Files with the .txt extension must contain a line for each entity you want to add. For each entity a Knowledge Graph syncon will be created.
The syncon lemmas, that is the main lemma plus any synonyms, must be entered in each line.
For example, a file with these contents:

Ferdinand Lewis Alcindor Jr.,Kareem Abdul-Jabbar
Larry Bird
Dr. J,Julius Erving
Karl Malone,The Mailman
Earvin Johnson,Magic Johnson

corresponds to five syncons, each of which represents a distinct entity.

To separate the lemmas, you can use comma (,), pipe (|) or semicolon (;).
If a file contains lemmas with a comma, like ACME, Incorporated or Los Angeles, CA, use another of the possible characters to separate lemmas.
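
When a .txt file is generated by a script, the separator can be chosen automatically. The following is a minimal sketch, not part of the product: the helper names pick_separator and to_txt_lines are hypothetical, and the sketch assumes that a single separator is used throughout the file.

```python
ALLOWED_SEPARATORS = (",", "|", ";")


def pick_separator(entities):
    """Return the first allowed separator that occurs in no lemma."""
    for sep in ALLOWED_SEPARATORS:
        if not any(sep in lemma for lemmas in entities for lemma in lemmas):
            return sep
    raise ValueError("every allowed separator occurs in at least one lemma")


def to_txt_lines(entities):
    """Build one line per entity, joining its lemmas with a safe separator."""
    sep = pick_separator(entities)
    return [sep.join(lemmas) for lemmas in entities]


# Each inner list holds the main lemma followed by its synonyms.
entities = [
    ["ACME, Incorporated", "ACME"],
    ["Larry Bird"],
]
lines = to_txt_lines(entities)
print("\n".join(lines))
```

Because one lemma contains a comma, the sketch falls back to the pipe for the whole file.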

.csv files

File extension

CSV files must have the .csv extension.

Headings

The first row is dedicated to field headings; all the other rows specify the operations to be performed to extend the Knowledge Graph. Each row can be considered a "record" whose "fields" correspond to the headings based on their position. The possible fields are described below.

Excel options and separator character

It is common to produce CSV files by exporting single sheets from an Excel document.
The Knowledge Graph extension procedure allows importing CSV files where the field separator is one of the following characters:

  • , (comma): the default when exporting Excel sheets to CSV on a computer with a US or UK English locale. Values containing commas, like Apple, Inc., are enclosed in double quotes ("Apple, Inc.").
  • ; (semicolon): the default when exporting Excel sheets to CSV on a computer whose locale uses the comma as the decimal separator (e.g., Italian, French, German).
  • | (pipe)
  • \t (tab): the default separator when saving tab-separated values (TSV). If the Excel export procedure generates a file with the .txt extension, the extension must be changed to .csv.

In general, an Excel sheet can be exported to a valid CSV file by selecting one of these formats in the Save As... dialog:

  • CSV UTF-8 (Comma delimited) (*.csv): in this case the file will be encoded as UTF-8 with BOM with columns separated by the character defined by the specific locale.
  • CSV (Comma delimited) (*.csv): same as above, but the file is ANSI encoded (e.g., Windows-1252) and therefore cannot represent characters outside that code page (e.g., Russian, Chinese, Japanese), so use it only when the input contains Latin characters only.
  • Text (Tab delimited) (*.txt): this saves the file as TSV (tab-separated) with the .txt extension, so the file must be renamed to .csv. The file is ANSI encoded, as with the previous option.
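
For files produced by a script rather than by Excel, the same separator and quoting rules can be met with Python's standard csv module. This is a minimal sketch under stated assumptions: the function write_rows and the sample row (including the parent ID 12345) are hypothetical and only illustrate the mechanics.

```python
import csv
import io

ALLOWED_DELIMITERS = {",", ";", "|", "\t"}


def write_rows(stream, header, rows, delimiter=","):
    """Write a header row plus records, quoting only values that need it."""
    if delimiter not in ALLOWED_DELIMITERS:
        raise ValueError(f"unsupported field separator: {delimiter!r}")
    writer = csv.writer(
        stream,
        delimiter=delimiter,
        quoting=csv.QUOTE_MINIMAL,  # quotes values that contain the separator
        lineterminator="\n",
    )
    writer.writerow(header)
    writer.writerows(rows)


header = ["operation", "lemma", "parent", "gloss"]
rows = [["ADDSYN", "Larry Bird", "12345", "Basketball player, born 1956"]]

buffer = io.StringIO()
write_rows(buffer, header, rows)
text = buffer.getvalue()
print(text)
```

When writing to disk instead of an in-memory buffer, opening the file with open(path, "w", newline="", encoding="utf-8-sig") produces UTF-8 with BOM, which matches the first Excel option above.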

The order in which the field headings are listed in the first row of the file is not important. Fields corresponding to unexpected headings are ignored by the import procedure.

operation

It is the only field that is always required. Allowed values are:

  • ADDSYN: causes the creation of a new syncon.
  • ADDLINK: causes the creation of a link of type link between syncon parent and syncon id.
  • ADDUSERDATA: causes the addition of user data with key userkey and value uservalue to syncon id.

Other fields may be optional, ignored or required on the basis of the operation type according to the following scheme.

Field       ADDSYN     ADDLINK    ADDUSERDATA
id          Optional   Required   Required
type        Optional   Ignored    Ignored
gloss       Optional   Ignored    Ignored
lemma       Required   Ignored    Ignored
weight      Optional   Ignored    Ignored
freq        Optional   Ignored    Ignored
parent      Required   Required   Ignored
link        Optional   Required   Ignored
domain1     Optional   Ignored    Ignored
domain2     Optional   Ignored    Ignored
userkey     Optional   Ignored    Required
uservalue   Optional   Ignored    Required
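
The scheme above can be used to check a file before importing it. The sketch below is a hypothetical pre-check, not the import procedure itself: the name missing_fields is invented for illustration, and it assumes that an empty or blank cell counts as a missing value.

```python
# Required fields per operation, taken from the scheme above.
REQUIRED = {
    "ADDSYN": {"lemma", "parent"},
    "ADDLINK": {"id", "parent", "link"},
    "ADDUSERDATA": {"id", "userkey", "uservalue"},
}


def missing_fields(record):
    """Return the sorted names of required fields that are absent or blank."""
    operation = (record.get("operation") or "").strip()
    if operation not in REQUIRED:
        raise ValueError(f"unknown operation: {operation!r}")
    return sorted(
        name
        for name in REQUIRED[operation]
        if not (record.get(name) or "").strip()
    )


# A record is a mapping from heading to cell value, as csv.DictReader yields.
record = {"operation": "ADDLINK", "id": "6000000"}
print(missing_fields(record))
```

Records in this shape can be obtained by reading the file with csv.DictReader, which maps each row to its headings regardless of column order.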

id

This is the numeric ID of the syncon involved in the operation. It is required for ADDLINK and ADDUSERDATA operations and optional for the ADDSYN operation. When omitted for ADDSYN, the ID is set automatically to the first available value, starting from 6000000. If specified for ADDSYN operations, which generate new syncons, it must be a positive integer between 6000000 and 9000000, written without thousands separators.
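
The constraints on IDs for new syncons can be checked as follows. This is a sketch, assuming that both bounds are inclusive (the text only says "between"); parse_new_syncon_id is a hypothetical helper name.

```python
MIN_NEW_ID = 6_000_000
MAX_NEW_ID = 9_000_000  # assumption: the upper bound is inclusive


def parse_new_syncon_id(text):
    """Return the ID as an int, or None when the cell is blank (auto-assigned)."""
    text = text.strip()
    if not text:
        return None
    # Plain ASCII digits only: rejects signs, decimals and thousands separators.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a plain positive integer: {text!r}")
    value = int(text)
    if not MIN_NEW_ID <= value <= MAX_NEW_ID:
        raise ValueError(f"ID {value} is outside {MIN_NEW_ID}..{MAX_NEW_ID}")
    return value


print(parse_new_syncon_id("6000001"))
```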

type

The word class of the syncon being added. It is optional for the ADDSYN operation, the default value being NPR. It is ignored for other operations.

gloss

The gloss, that is a human-readable definition of the syncon. It is optional for the ADDSYN operation, the default value being "no gloss". This field is ignored for other operations.

lemma

One or more lemmas for the syncon. This field is required for the ADDSYN operation and is ignored for other operations.
Multiple lemmas must be separated with a comma (,).
Each lemma must be shorter than 255 characters.
If any lemma contains non-ANSI characters, the CSV file must be UTF-8 encoded.
If a lemma contains commas, they must be escaped by prefixing them with a backslash character (\), for example: Los Angeles\, CA. If a backslash is part of the lemma's value, it must be escaped by doubling it, for example: AC\\DC [1].
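
The escaping rules can be sketched as a pair of helper functions. The names escape_lemma, join_lemmas and split_lemmas are hypothetical, and the sketch assumes that the 255-character limit applies to the lemma before escaping.

```python
def escape_lemma(lemma):
    """Double backslashes first, then prefix commas with a backslash."""
    return lemma.replace("\\", "\\\\").replace(",", "\\,")


def join_lemmas(lemmas):
    """Build the value of the lemma field from a list of raw lemmas."""
    for lemma in lemmas:
        if len(lemma) >= 255:
            raise ValueError(f"lemma too long: {lemma[:20]!r}...")
    return ",".join(escape_lemma(lemma) for lemma in lemmas)


def split_lemmas(field):
    """Inverse of join_lemmas: split on unescaped commas and unescape."""
    lemmas, current, i = [], [], 0
    while i < len(field):
        ch = field[i]
        if ch == "\\" and i + 1 < len(field):
            current.append(field[i + 1])  # keep the escaped character as-is
            i += 2
        elif ch == ",":
            lemmas.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    lemmas.append("".join(current))
    return lemmas


field = join_lemmas(["Los Angeles, CA", "LA"])
print(field)
```

The backslash must be escaped before the comma; doing it in the opposite order would double the backslashes that were just added in front of commas.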

weight

The weight for the lemmas of the syncon. It is optional for the ADDSYN operation and ignored for other operations.
The weight is an integer between 1 and 8. If multiple weights are specified, they must be separated with commas. The specified weights correspond to the lemmas based on their position in the sequence of comma-separated values.

In ADDSYN operations, the default value is 1 for the first lemma of the list, which becomes the main lemma of the syncon, and 2 for any other lemma for which the weight is not specified. If more weights than lemmas are provided, excess values are ignored.
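
The positional matching and the defaults can be illustrated with a short sketch. The function resolve_weights is hypothetical, and it assumes that explicit weights are given for the leading lemmas only, with no blank positions in between.

```python
def resolve_weights(lemmas, weights=()):
    """Pair each lemma with its weight, applying the documented defaults."""
    resolved = []
    for position in range(len(lemmas)):
        if position < len(weights):
            weight = weights[position]
            if not 1 <= weight <= 8:
                raise ValueError(f"weight out of range 1..8: {weight}")
        else:
            # Default: 1 for the first (main) lemma, 2 for the others.
            weight = 1 if position == 0 else 2
        resolved.append(weight)
    # Weights beyond the number of lemmas are never read, so they are ignored.
    return resolved


print(resolve_weights(["Julius Erving", "Dr. J"], [3]))
```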

freq

The frequency for the lemmas of this syncon. It is optional for the ADDSYN operation and ignored for other operations.
The frequency is an integer between 1 and 99. If multiple frequencies are specified, they must be separated with commas. The specified frequencies correspond to the lemmas based on their position in the sequence of comma separated values.

The default value in the case of ADDSYN operations is 1 for any lemma for which the frequency is not specified. If more frequencies than lemmas are provided, excess values are ignored.

parent

The parent syncon ID. It is required for ADDSYN and ADDLINK operations and ignored for ADDUSERDATA.
It can be either the ID of a syncon added during the same Knowledge Graph extension operation, whose ADDSYN row therefore appears earlier in the same file or in another imported file, or the ID of a syncon already present in the base Knowledge Graph.
The value of this field is used to link the syncon specified by id with the syncon specified by parent via the link specified by link.

link

The name of the link used to connect the syncon specified by id with that specified by parent.
This field is optional for the ADDSYN operation and required for the ADDLINK operation, the default value being superverbum/subverbum if type is VER, supernomen/subnomen otherwise.
The field is ignored in the case of ADDUSERDATA operations.

In the case of ADDSYN operations, the link can only be supernomen/subnomen or superverbum/subverbum. For ADDLINK operations, the value of the field can be either the name of one of the links already present in the Knowledge Graph and visible in the Knowledge Graph tool window, or a custom value. In the latter case the extension procedure displays a warning message to make sure the user actually meant to create a new link and did not just misspell the name of an existing one, as in supernomen/subnomem.

Custom link names cannot contain spaces or underscore characters (_).
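
The link rules can be pre-checked before import. The sketch below is an illustration only: the function names are hypothetical, KNOWN_LINKS lists just the two link names mentioned in this article, and the use of difflib to spot likely misspellings is a stand-in technique, not a description of how the extension procedure produces its warning.

```python
import difflib

# Only the two link names cited in this article; a real graph has more.
KNOWN_LINKS = ["supernomen/subnomen", "superverbum/subverbum"]


def default_link(syncon_type="NPR"):
    """Default link for ADDSYN, which depends on the syncon type."""
    return "superverbum/subverbum" if syncon_type == "VER" else "supernomen/subnomen"


def is_valid_custom_link(name):
    """Custom link names cannot contain spaces or underscores."""
    return bool(name) and " " not in name and "_" not in name


def suggest_existing_link(name, known=KNOWN_LINKS):
    """Return a known link that the name closely resembles, or None."""
    if name in known:
        return None
    matches = difflib.get_close_matches(name, known, n=1, cutoff=0.8)
    return matches[0] if matches else None


print(suggest_existing_link("supernomen/subnomem"))
```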

domain1

A Knowledge Graph domain, with an optional frequency, to be added to the syncon as an attribute. The field is optional for ADDSYN operations and ignored for other operations. The default value is empty, which means "no domain".
The value has this syntax:

domain[,frequency]

where domain is one of the Knowledge Graph domains and frequency is an integer number between 1 and 100. The default value for frequency is 50. The sum of frequencies for domain1 and domain2 must be lower than or equal to 100.
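
The syntax and the constraint on the sum can be sketched as follows. The helper names and the domain names used here (basketball, sport) are hypothetical, and the sketch assumes that domain names contain no commas.

```python
DEFAULT_DOMAIN_FREQUENCY = 50


def parse_domain(value):
    """Parse 'domain[,frequency]' into (domain, frequency), or None if empty."""
    value = value.strip()
    if not value:
        return None
    name, comma, frequency_text = value.partition(",")
    frequency = int(frequency_text) if comma else DEFAULT_DOMAIN_FREQUENCY
    if not 1 <= frequency <= 100:
        raise ValueError(f"frequency out of range 1..100: {frequency}")
    return name.strip(), frequency


def check_domains(domain1="", domain2=""):
    """Parse both fields and enforce that the frequencies sum to at most 100."""
    parsed = [d for d in (parse_domain(domain1), parse_domain(domain2)) if d]
    total = sum(frequency for _, frequency in parsed)
    if total > 100:
        raise ValueError(f"domain frequencies sum to {total}, which exceeds 100")
    return parsed


print(check_domains("basketball,70", "sport,30"))
```

Note that with the default frequency of 50, two domains given without explicit frequencies sum to exactly 100, which is still allowed.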

domain2

Second Knowledge Graph domain for the syncon. It is like domain1, but since a syncon can have up to two domains, this field allows for setting both.

userkey

The key (name) of the user data item that will be added to the syncon. The field is required for ADDUSERDATA operations, optional for ADDSYN operations and ignored for ADDLINK operations.
If not specified for an ADDSYN operation, it defaults to empty, that is "no user data", but it is required if uservalue has been specified.

uservalue

The value of the user data specified by userkey. The field is required for the ADDUSERDATA operation, optional for the ADDSYN operation and ignored for the ADDLINK operation.
If not specified for an ADDSYN operation, it defaults to empty, that is "no user data", but it is required if userkey has been specified.

Examples

These files are examples:

The first file uses the comma as the field separator and double quotes as text delimiters around values that contain commas, such as the value of the lemma column, which can contain multiple lemmas.
The file contains three rows in addition to the header row, all referring to the same syncon. In the first row the operation is ADDSYN and causes the addition of the syncon. In the second row, the operation is ADDLINK and causes the creation of a link between the newly added syncon and another syncon of the Knowledge Graph, in addition to the supernomen/subnomen link already created due to the first row. In the third row, the operation is ADDUSERDATA and it determines the addition of a key-value pair of data to the syncon.
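
A file with this three-row structure can be sketched with Python's csv module. Every value below is invented for illustration and is not taken from the actual example file: the IDs 12345 and 67890 stand for syncons assumed to exist already, teammate is a hypothetical custom link name, and source/manual is a hypothetical user data pair.

```python
import csv
import io

HEADER = ["operation", "id", "lemma", "parent", "link", "userkey", "uservalue"]

ROWS = [
    # Adds the syncon; the link defaults to supernomen/subnomen.
    {"operation": "ADDSYN", "id": "6000001",
     "lemma": "Julius Erving,Dr. J", "parent": "12345"},
    # Links the new syncon to a second, already existing syncon.
    {"operation": "ADDLINK", "id": "6000001",
     "parent": "67890", "link": "teammate"},
    # Attaches a key-value pair to the new syncon.
    {"operation": "ADDUSERDATA", "id": "6000001",
     "userkey": "source", "uservalue": "manual"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=HEADER, restval="", lineterminator="\n")
writer.writeheader()
writer.writerows(ROWS)
csv_text = buffer.getvalue()
print(csv_text)
# operation,id,lemma,parent,link,userkey,uservalue
# ADDSYN,6000001,"Julius Erving,Dr. J",12345,,,
# ADDLINK,6000001,,67890,teammate,,
# ADDUSERDATA,6000001,,,,source,manual
```

Fields that an operation ignores are simply left empty, and only the lemma value is quoted because it is the only one that contains the field separator.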

The second file contains more than 2800 lines, all with operation ADDSYN, so each line adds a new concept to the graph. Each new concept is linked via the supernomen/subnomen link to an already existing concept. Also note the setting of domain1 and domain2.


  1. Indeed it's AC/DC, but we needed an example :-)