Knowledge Graph customization file formats

Introduction

This article describes the characteristics of the source files that can be used to customize the Knowledge Graph.

You need to use plain text files, which can be .txt files or comma-separated values (CSV) files with the .csv extension.
The former are used only to add syncons corresponding to named entities, while the latter allow for more operations:

addition of syncons of various types
creation of relations between syncons
addition of user data to syncons

.txt files

Use .txt files if all you need to do is add named entities to the Knowledge Graph.
Files with the .txt extension must contain a line for each entity you want to add. For each entity a Knowledge Graph syncon will be created.
Each line must contain the lemmas of the syncon, that is, the main lemma plus any synonyms.
For example, a file with these contents:

Ferdinand Lewis Alcindor Jr.,Kareem Abdul-Jabbar
Larry Bird
Dr. J,Julius Erving
Karl Malone,The Mailman
Earvin Johnson,Magic Johnson

corresponds to five syncons, each of which represents a distinct entity.

To separate the lemmas, you can use one of these characters:

Comma (,)
Pipe (|)
Semicolon (;)

If a file contains lemmas with a comma, like ACME, Incorporated or Los Angeles, CA, use another of the possible characters to separate lemmas.
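
As a minimal sketch of the rule above, the following Python snippet picks a separator that does not occur inside any lemma and renders the contents of a .txt file. The helper names (pick_separator, render_txt) are hypothetical, and the sketch assumes that a single separator is used for the whole file.

```python
# The three separator characters allowed in .txt files.
SEPARATORS = [",", "|", ";"]

def pick_separator(entities):
    """Return the first allowed separator that appears in no lemma."""
    for sep in SEPARATORS:
        if not any(sep in lemma for lemmas in entities for lemma in lemmas):
            return sep
    raise ValueError("every allowed separator occurs inside some lemma")

def render_txt(entities):
    """Build the file contents: one line per entity, lemmas joined by the separator."""
    sep = pick_separator(entities)
    return "\n".join(sep.join(lemmas) for lemmas in entities) + "\n"

# Each inner list holds the lemmas of one entity.
entities = [
    ["ACME, Incorporated", "ACME"],
    ["Los Angeles, CA", "L.A."],
    ["Larry Bird"],
]
print(render_txt(entities), end="")
```

Because two lemmas contain a comma, the sketch falls back to the pipe character for every line.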

.csv files

File extension

CSV files must have the .csv extension.

Headings

The first row is dedicated to field headings; all the other rows specify the operations to be performed to extend the Knowledge Graph. Each row can be considered a "record" with "fields" that correspond to the headings by position. The possible fields are: operation, id, type, gloss, lemma, weight, freq, parent, link, domain1, domain2, userkey, uservalue.

Excel options and separator character

It is common to produce CSV files by exporting single sheets from an Excel document.
The Knowledge Graph extension procedure allows importing CSV files where the field separator is one of the following characters:

Comma (,): this is the default value when exporting Excel sheets to CSV from a computer with a US or UK English locale. Values containing commas, like Apple, Inc., are enclosed in double quotes ("Apple, Inc.").
Semicolon (;): this is the default value when exporting Excel sheets to CSV from a computer with a locale in which the comma is the decimal separator (e.g., Italian, French, German).
Pipe (|)
Tab (\t): this is the default separator when saving tab-separated values (TSV). If the Excel export procedure generates a file with the .txt extension, it must be changed to .csv.

In general, an Excel sheet can be exported to a valid CSV file by selecting one of these formats in the Save As... dialog:

CSV UTF-8 (Comma delimited) (*.csv): in this case the file will be encoded as UTF-8 with BOM with columns separated by the character defined by the specific locale.
CSV (Comma delimited) (*.csv): this is the same as above, but the file will be ANSI encoded (e.g., Windows-1252) and therefore cannot contain characters outside that code page (e.g., Russian, Chinese, Japanese), so it should be used only when the input contains just Latin characters.
Text (Tab delimited) (*.txt): this saves the file as TSV (tab-separated) with the .txt extension, so the file must be renamed to change the extension to .csv. The file will be ANSI encoded, as with the previous option.
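
When checking an exported file before importing it, it can help to see which of the four separators it actually uses. The following Python sketch guesses the separator from the heading row and parses the records. It is only an illustration: the helper names are hypothetical, the parent value 12345 is a placeholder, and the import procedure may detect the separator differently.

```python
import csv
import io

# The four field separators accepted for CSV files.
CANDIDATES = [",", ";", "|", "\t"]

def detect_separator(header_line):
    """Pick the candidate separator that occurs most often in the heading row."""
    counts = {sep: header_line.count(sep) for sep in CANDIDATES}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("no known separator found in the heading row")
    return best

def read_records(text):
    """Parse CSV text into a list of dicts keyed by heading."""
    text = text.lstrip("\ufeff")  # drop a UTF-8 BOM if one survived decoding
    sep = detect_separator(text.splitlines()[0])
    return list(csv.DictReader(io.StringIO(text), delimiter=sep))

# A semicolon-separated export with a BOM; 12345 is a placeholder parent ID.
sample = "\ufeffoperation;lemma;parent\nADDSYN;Larry Bird;12345\n"
records = read_records(sample)
print(records[0]["lemma"])
```

When reading from disk, opening the file with encoding="utf-8-sig" removes the BOM during decoding, which makes the lstrip call unnecessary.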

The order in which the field headings are listed in the first row of the file is not important. Fields corresponding to unexpected headings are ignored by the import procedure.
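
The two rules above can be illustrated with a short Python sketch: records are read by heading name, so column order does not matter, and any unknown heading is dropped. The normalize helper is hypothetical, the notes column and the parent value 12345 are placeholders, and exact, case-sensitive heading matching is an assumption.

```python
import csv
import io

# The headings recognized by the import procedure.
KNOWN_FIELDS = {
    "operation", "id", "type", "gloss", "lemma", "weight", "freq",
    "parent", "link", "domain1", "domain2", "userkey", "uservalue",
}

def normalize(csv_text):
    """Read records by heading name and drop fields with unknown headings."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {key: value for key, value in row.items() if key in KNOWN_FIELDS}
        for row in reader
    ]

# Same record, different column order; "notes" is not a known heading.
a = "operation,lemma,parent,notes\nADDSYN,Larry Bird,12345,internal comment\n"
b = "parent,operation,lemma\n12345,ADDSYN,Larry Bird\n"
print(normalize(a) == normalize(b))
```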

operation

It is the only field that is always required. Allowed values are:

ADDSYN: causes the creation of a new syncon.
ADDLINK: causes the creation of a relation of type link between syncon parent and syncon id.
ADDUSERDATA: causes the addition of user data with key userkey and value uservalue to syncon id.

The other fields are required, optional or ignored depending on the operation, as shown in the following scheme.

operation    id        type      gloss     lemma     weight    freq      parent    link      domain1   domain2   userkey   uservalue
ADDSYN       Optional  Optional  Optional  Required  Optional  Optional  Required  Optional  Optional  Optional  Optional  Optional
ADDLINK      Required  Ignored   Ignored   Ignored   Ignored   Ignored   Required  Required  Ignored   Ignored   Ignored   Ignored
ADDUSERDATA  Required  Ignored   Ignored   Ignored   Ignored   Ignored   Ignored   Ignored   Ignored   Ignored   Required  Required
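
The scheme above lends itself to a simple pre-import check. The following Python sketch lists the required fields that are missing from a record. The REQUIRED mapping is transcribed from the scheme; the missing_fields helper is hypothetical, and treating a blank value as missing is an assumption.

```python
# Required fields per operation, transcribed from the scheme above.
REQUIRED = {
    "ADDSYN": {"lemma", "parent"},
    "ADDLINK": {"id", "parent", "link"},
    "ADDUSERDATA": {"id", "userkey", "uservalue"},
}

def missing_fields(record):
    """Return the sorted names of required fields that are absent or blank."""
    operation = record.get("operation", "")
    if operation not in REQUIRED:
        raise ValueError(f"unknown operation: {operation!r}")
    return sorted(
        field for field in REQUIRED[operation]
        if not (record.get(field) or "").strip()
    )

# An ADDSYN record without a parent is incomplete.
print(missing_fields({"operation": "ADDSYN", "lemma": "Larry Bird"}))
```

Fields marked as ignored are deliberately not checked, since their values have no effect on the operation.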

id

This is the numeric ID of the syncon involved in the operation. It is required for ADDLINK and ADDUSERDATA operations and optional for the ADDSYN operation. When it is omitted for ADDSYN, the default value is 6000000 plus the ordinal position of the syncon in the sequence of all the syncons being added by the Knowledge Graph extension operation.
If specified for ADDSYN operations, which generate new syncons, it must be a positive integer between 6000000 and 9000000, written with no thousands separator.
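
The ID rules for ADDSYN can be sketched in Python as follows. Two points are assumptions, not statements of the source: that the position of the first added syncon counts as 1, and that both ends of the allowed range are inclusive. The function names are hypothetical.

```python
BASE_ID = 6000000
MAX_ID = 9000000

def default_id(position):
    """Default ID for the syncon at the given position (assumed to start at 1)."""
    return BASE_ID + position

def parse_explicit_id(text):
    """Validate an ID written in the id field of an ADDSYN record."""
    # Only plain ASCII digits: this rejects thousands separators, signs and blanks.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a plain positive integer: {text!r}")
    value = int(text)
    if not BASE_ID <= value <= MAX_ID:  # inclusive bounds are an assumption
        raise ValueError(f"ID out of range: {value}")
    return value

print(default_id(1))
print(parse_explicit_id("6000042"))
```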

type

The word class of the syncon being added. It is optional for the ADDSYN operation, the default value being NPR. It is ignored for other operations.

gloss

The gloss, that is a human-readable definition of the syncon. It is optional for the ADDSYN operation, the default value being "no gloss". This field is ignored for other operations.

lemma

One or more lemmas for the syncon. This field is required for the ADDSYN operation and is ignored for other operations.
Multiple lemmas must be separated with a comma (,).
Each lemma must be shorter than 255 characters.
If any lemma contains non-ANSI characters, the CSV file must be UTF-8 encoded.
If a lemma contains commas, they must be escaped by prefixing them with a backslash character (\), for example: Los Angeles\, CA. If a backslash is part of the lemma value, it must be escaped by doubling it, for example: AC\\DC¹.
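
The escaping rules above can be sketched in Python as a pair of helpers, one that builds the value of the lemma field and one that splits it back into lemmas. The function names are hypothetical, and applying the 255-character limit to the unescaped lemma is an assumption.

```python
def escape_lemma(lemma):
    """Double backslashes first, then prefix commas with a backslash."""
    return lemma.replace("\\", "\\\\").replace(",", "\\,")

def join_lemmas(lemmas):
    """Build the lemma field value from a list of plain lemmas."""
    for lemma in lemmas:
        if len(lemma) >= 255:  # assumed to apply to the unescaped text
            raise ValueError(f"lemma too long: {lemma[:20]!r}...")
    return ",".join(escape_lemma(lemma) for lemma in lemmas)

def split_lemmas(field):
    """Split a lemma field value on unescaped commas and undo the escaping."""
    lemmas, current, i = [], [], 0
    while i < len(field):
        ch = field[i]
        if ch == "\\" and i + 1 < len(field):
            current.append(field[i + 1])  # keep the escaped character literally
            i += 2
        elif ch == ",":
            lemmas.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    lemmas.append("".join(current))
    return lemmas

print(join_lemmas(["Los Angeles, CA", "L.A."]))  # prints: Los Angeles\, CA,L.A.
```

When the file itself is comma-separated, the whole field still needs CSV quoting because it contains commas; Python's csv writer applies that quoting automatically.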

weight

The weight for the lemmas of the syncon. It is optional for the ADDSYN operation and ignored for other operations.
The weight is an integer between 1 and 8. If multiple weights are specified, they must be separated with commas. The specified weights correspond to the lemmas based on their position in the sequence of comma-separated values.

The default value in the case of ADDSYN operations is 1 for the first lemma of the list (which becomes the main lemma of the syncon) and 2 for any other lemma for which the weight is not specified. If more weights than lemmas are provided, excess values are ignored.

freq

The frequency for the lemmas of this syncon. It is optional for the ADDSYN operation and ignored for other operations.
The frequency is an integer between 1 and 99. If multiple frequencies are specified, they must be separated with commas. The specified frequencies correspond to the lemmas based on their position in the sequence of comma-separated values.

The default value in the case of ADDSYN operations is 1 for any lemma for which the frequency is not specified. If more frequencies than lemmas are provided, excess values are ignored.
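
The positional rules for weight and freq follow the same pattern, so one Python sketch can cover both. The helper names are hypothetical, and treating an empty position (as in 3,,5) as "not specified" is an assumption.

```python
def align(lemmas, raw, default_for):
    """Pair each lemma with the value at the same position, or with a default."""
    given = [value.strip() for value in raw.split(",")] if raw.strip() else []
    result = []
    for index, _ in enumerate(lemmas):
        if index < len(given) and given[index]:
            result.append(int(given[index]))
        else:
            result.append(default_for(index))
    return result  # values beyond the last lemma are never read, so they are ignored

def resolve_weights(lemmas, raw=""):
    """Weights: 1..8, default 1 for the first lemma and 2 for the others."""
    weights = align(lemmas, raw, lambda index: 1 if index == 0 else 2)
    if any(not 1 <= weight <= 8 for weight in weights):
        raise ValueError("weights must be between 1 and 8")
    return weights

def resolve_freqs(lemmas, raw=""):
    """Frequencies: 1..99, default 1 for every lemma."""
    freqs = align(lemmas, raw, lambda index: 1)
    if any(not 1 <= freq <= 99 for freq in freqs):
        raise ValueError("frequencies must be between 1 and 99")
    return freqs

lemmas = ["Julius Erving", "Dr. J", "The Doctor"]
print(resolve_weights(lemmas, "3"))  # prints: [3, 2, 2]
print(resolve_freqs(lemmas, "10,20,30,40"))  # prints: [10, 20, 30]
```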

parent

The parent syncon ID. It is required for ADDSYN and ADDLINK operations and ignored for ADDUSERDATA.
It can be either the ID of a syncon added during the same Knowledge Graph extension operation, whose ADDSYN operation therefore comes earlier in the same file or in another imported file, or the ID of a syncon that is already present in the base Knowledge Graph.
The value of this field is used to link the syncon specified by id with the syncon specified by parent via the link specified by link.

link

The name of the link used to connect the syncon specified by id with that specified by parent.
This field is optional for the ADDSYN and ADDLINK operations, the default value being superverbum/subverbum if type is VER, supernomen/subnomen otherwise.
The field is ignored in case of ADDUSERDATA operations.
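
For ADDSYN records, the default link described above can be sketched in Python as follows. The function names are hypothetical, and falling back to NPR when type is blank mirrors the documented default for the type field.

```python
def default_link(syncon_type="NPR"):
    """Default link name for a syncon of the given type."""
    if syncon_type == "VER":
        return "superverbum/subverbum"
    return "supernomen/subnomen"

def resolve_link(record):
    """Use the link field when present, otherwise the default for the type."""
    return record.get("link") or default_link(record.get("type") or "NPR")

print(resolve_link({"operation": "ADDSYN", "type": "VER"}))
print(resolve_link({"operation": "ADDSYN"}))
```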

domain1

A Knowledge Graph domain, with an optional frequency, to be added to the syncon as an attribute. The field is optional for ADDSYN operations and ignored for other operations. The default value is empty, which means "no domain".
The value has this syntax:

domain[,frequency]

where:

domain is one of the Knowledge Graph topics
frequency is an integer number between 1 and 100. The default value for frequency is 50.

The sum of the frequencies for domain1 and domain2 must not exceed 100.
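
The domain syntax and the constraint on the sum can be sketched in Python as follows. The function names are hypothetical, the domain names basketball and sport are placeholders and not necessarily actual Knowledge Graph topics, and the sketch assumes that domain names contain no commas.

```python
def parse_domain(value):
    """Split 'domain[,frequency]' into (domain, frequency); None if empty."""
    value = value.strip()
    if not value:
        return None  # empty means "no domain"
    name, _, freq_text = value.partition(",")
    frequency = int(freq_text) if freq_text.strip() else 50  # documented default
    if not 1 <= frequency <= 100:
        raise ValueError(f"frequency out of range: {frequency}")
    return name.strip(), frequency

def check_domains(domain1, domain2=""):
    """Parse both domain fields and enforce the limit on the summed frequencies."""
    parsed = [d for d in (parse_domain(domain1), parse_domain(domain2)) if d]
    total = sum(frequency for _, frequency in parsed)
    if total > 100:
        raise ValueError(f"frequencies add up to {total}, which exceeds 100")
    return parsed

print(check_domains("basketball,70", "sport,30"))
```

Note that two domains given without a frequency each default to 50, which adds up to exactly 100 and is therefore still accepted.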

domain2

Second Knowledge Graph domain for the syncon. It is like domain1, but since a syncon can have up to two domains, this field allows for setting both.

userkey

The name of the user data item that will be added to the syncon. The field is required for ADDUSERDATA operations, optional for the ADDSYN operation and ignored for the ADDLINK operation.
If not specified for an ADDSYN operation, it defaults to empty, that is "no user data", but it is required if uservalue has been specified.

uservalue

The value of the user data specified by userkey. The field is required for the ADDUSERDATA operation, optional for ADDSYN operations and ignored for ADDLINK operations.
If not specified for an ADDSYN operation, it defaults to empty, that is "no user data", but it is required if userkey has been specified.
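
The pairing rule for userkey and uservalue can be sketched in Python as follows. The check_user_data helper is hypothetical, the key and value shown are placeholders, and treating a blank value as "not specified" is an assumption.

```python
def check_user_data(record):
    """Return (userkey, uservalue), or None when the record carries no user data."""
    operation = record.get("operation")
    key = (record.get("userkey") or "").strip()
    value = (record.get("uservalue") or "").strip()
    if operation == "ADDLINK":
        return None  # both fields are ignored for ADDLINK
    if operation == "ADDUSERDATA" and not (key and value):
        raise ValueError("ADDUSERDATA requires both userkey and uservalue")
    if bool(key) != bool(value):
        raise ValueError("userkey and uservalue must be specified together")
    return (key, value) if key else None

# Placeholder key and value on an ADDSYN record.
print(check_user_data({"operation": "ADDSYN", "userkey": "team", "uservalue": "Celtics"}))
```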

¹ Indeed it's AC/DC, but we needed an example :-)