Polymorph Data Language
- Polymorph Data Language Design Goals
- PDL - A Stream of Fields
- PDL Field Types
- Comments
- boolean
- int
- float
- bytes
- utf8
- utc
- object
- table
- key
- id + ref
- PDL + PDE Use Cases
- PDL vs. JSON
- PDL Allows for a More Concise Syntax Than JSON
- PDL Supports More Data Types Than JSON
- PDL Supports Typed Nulls
- PDL Has a Concise Syntax for Tabular Data.
- PDL Tables Can Represent Object Trees More Concisely
- PDL Supports Cyclic Object Graphs
- PDL Supports Streams of Instructions
- PDL Supports Comments
- PDL Can be Converted to a Compact, Fast to Read Binary Encoding
Polymorph Data Language (PDL) is a textual data language which can be translated into the binary Polymorph Data Encoding (PDE). This enables you to write Polymorph Data Language in a text editor and convert it to PDE - or convert PDE to PDL to read it in an editor. Thus, PDL makes it easier to work with PDE.
Just like PDE is designed to be an alternative to MessagePack, CBOR and Protobuf, PDL is designed to be an alternative to JSON, YAML, XML and CSV.
Additionally, PDL can be converted into PDE if a binary format is needed or wanted. Since PDL is compatible with PDE, PDL has all the same features and advantages as PDE, such as compact tabular data encoding, compact hierarchical data encoding via nested tables, support for cyclic object graphs, support for objects with or without property names (keys) etc. Plus, you can of course have comments in a PDL script (which is not really possible / standard in JSON and CSV).
Additionally, PDL is faster to parse than JSON (according to my tests at least) and is designed explicitly to be parseable in parallel - which can bring even more speedups on large PDL scripts.
Before showing the final PDL syntax(es), let me first explain why they look the way they do. Without that explanation, you might just find them unnecessarily verbose. But there is a reason for every single character used in the PDL syntaxes.
Polymorph Data Language Design Goals
The goals of Polymorph Data Language are:
- Be as expressive as Polymorph Data Encoding - meaning it should be able to express the same data constructs.
- Be easy to read and write.
- Be as concise as possible.
- Minify well.
- Be fast to parse.
Unfortunately, it is not easy to achieve all of these goals at the same time. In some cases I have to make trade-offs between what is easy for a human to read (and write), and what is easy for a computer to read (tokenize).
To explore various trade-offs between readability and tokenizability - I am currently experimenting with multiple syntaxes for the Polymorph Data Language. I will explain these variations and their trade-offs in the following sections.
Readability vs. Parseability Syntax Trade-offs
According to my experiments so far, the two primary factors in parser speed are:
- The number of characters to parse (syntax conciseness)
- Branch predictability during parsing (syntax uniformity)
The fewer characters a parser needs to parse, the less work it needs to do. Thus, a more concise syntax will tend to be faster to parse than a more verbose syntax - but only to a certain limit.
The more predictable the branches (if- and switch-statements) of a parser are, the less often the CPU instruction pipeline has to be flushed (cleared - as in "thrown out"), and thus the more parallel execution of instructions can be achieved. This results in higher parser speeds.
The dilemma is, however, that a more concise syntax sometimes results in a parser with less branch predictability. In other words, a more verbose syntax sometimes enables a parser to have higher branch predictability. Even if the CPU needs to process more bytes when tokenizing a more verbose (but more uniform, and thus more predictable) syntax, it may still be able to tokenize that script faster than an equivalent script using a less verbose (but less uniform, and thus less predictable) syntax.
To explain the connection between syntax and tokenizer speed, I will dive a bit deeper into this topic in the following paragraphs. Please keep in mind that this explanation is based on my current knowledge of parser and tokenizer techniques. I could have missed something. If you can see that I have, I would appreciate you notifying me (e.g. via LinkedIn or Twitter).
Designing For Branch Predictability
Branch predictability has quite a high impact on the speed of your code. If the CPU meets a branch (e.g. an if- or switch-statement) that it cannot predict, it causes a flush of the instruction pipeline - which results in degraded instruction execution speed until the pipeline is full again.
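To make the effect concrete, here is a small, self-contained Java sketch (my own illustration - not part of the PDL code base) that runs the exact same loop twice over the same bytes: once in random order, where the branch is unpredictable, and once sorted, where the branch is highly predictable. It is only a rough demonstration - a real measurement would use a benchmark harness such as JMH - but the sorted run is typically noticeably faster even though it does the same work.

import java.util.Arrays;
import java.util.Random;

public class BranchPredictabilityDemo {

    public static void main(String[] args) {
        byte[] data = new byte[1_000_000];
        new Random(42).nextBytes(data);

        // Rough JIT warm-up (a real benchmark would use a harness such as JMH).
        for (int i = 0; i < 10; i++) { sumNonNegative(data); }

        // Unpredictable: whether a random byte is >= 0 is a coin flip,
        // so the CPU mispredicts the branch roughly half the time.
        long t0 = System.nanoTime();
        long sumUnpredictable = sumNonNegative(data);
        long t1 = System.nanoTime();

        // Predictable: after sorting, the branch is false for the first part of
        // the array and true for the rest - easy for the branch predictor.
        Arrays.sort(data);
        long t2 = System.nanoTime();
        long sumPredictable = sumNonNegative(data);
        long t3 = System.nanoTime();

        System.out.println("unpredictable: " + (t1 - t0) + " ns (sum " + sumUnpredictable + ")");
        System.out.println("predictable  : " + (t3 - t2) + " ns (sum " + sumPredictable + ")");
    }

    private static long sumNonNegative(byte[] data) {
        long sum = 0;
        for (byte b : data) {
            if (b >= 0) {   // identical branch in both runs - only its predictability differs
                sum += b;
            }
        }
        return sum;
    }
}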
My current CPU is a "12th Gen Intel(R) Core(TM) i9-12900H, 2500 Mhz, 14 Cores, 20 Threads (Logical Cores)".
This CPU is capable of reading 10+ GB of data per second from memory using a single CPU core - when simply reading (and adding) the bytes one by one in a loop. The data read was around 1 KB in length, and it was read over and over again, so to be fair, it was probably read from the CPU cache most of the time.
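The baseline loop I am referring to is essentially just this kind of code (a reconstruction for illustration - not the exact benchmark code):

public class ByteSumBaseline {

    // The kind of baseline loop described above: read (and add) every byte
    // of a buffer, one by one, using a single thread.
    public static long sumBytes(byte[] buffer) {
        long sum = 0;
        for (int i = 0; i < buffer.length; i++) {
            sum += buffer[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] buffer = new byte[1024];               // ~1 KB, like the test data
        for (int i = 0; i < buffer.length; i++) {
            buffer[i] = (byte) i;
        }

        long iterations = 1_000_000;                  // read the same buffer repeatedly,
        long start = System.nanoTime();               // so it stays in the CPU cache
        long sum = 0;
        for (long i = 0; i < iterations; i++) {
            sum += sumBytes(buffer);
        }
        long end = System.nanoTime();

        double seconds     = (end - start) / 1_000_000_000.0;
        double gbPerSecond = (buffer.length * (double) iterations) / seconds / 1_000_000_000.0;
        System.out.println("sum=" + sum + ", ~" + gbPerSecond + " GB/s");
    }
}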
When benchmarking the tokenization speed of different tokenizer implementations - both for the same and for different PDL syntaxes - I tokenize a PDL script of around the same size (near 1 KB). These scripts are also tokenized over and over again, so they are probably also stored in the CPU cache most of the time. Thus, the bytes-per-second comparison is reasonably fair. Since adding all the bytes of a memory area one by one is very simple, it is also very fast to do - so we cannot expect a tokenizer to ever run as fast as that using a single core.
The various PDL tokenizers are able to process around 1.1 to 2.5 GB per second. That is a good deal lower than the 10+ GB per second when simply adding the bytes together.
The main performance loss comes from unpredictable branches. These branches typically occur in two places in the tokenizer:
- When determining the token type.
- When trying to find the end of the token.
By designing a PDL syntax (Parser Optimized (PO)) where it is easy to determine the token type and easy to find the end of the token, the branch unpredictability can be reduced - and the resulting tokenization speed increases.
The way I have done that is by having the first character of every token mark the token type, and by having every token end with the same character - a semicolon (;). Here are a few examples of such tokens:
+123; -384; %123.45; /12345.6789; "This is text;
This of course makes the syntax a bit more verbose - but much simpler and faster to tokenize. The above syntax can be tokenized using the following tokenizer loop:
public class PdlPovTokenizer {

    private int nextCodePoint = 0;

    public void tokenize(Utf8Buffer buffer, TokenizerListener listener) {
        buffer.skipWhiteSpace();
        while(buffer.hasMoreBytes()){
            int tokenStartOffset = buffer.tempOffset;
            buffer.tempOffset++;                          // the first character is never the end character
            while(buffer.hasMoreBytes()){
                if(buffer.nextCodepointAscii() == ';'){   // every token ends with a semicolon
                    break;
                }
            }
            listener.token(tokenStartOffset, buffer.tempOffset, buffer.buffer[tokenStartOffset]);
            buffer.skipWhiteSpace();
        }
    }

    public void tokenizeMinified(Utf8Buffer buffer, TokenizerListener listener){
        while(buffer.hasMoreBytes()){
            int tokenStartOffset = buffer.tempOffset;
            buffer.tempOffset++;
            while(buffer.hasMoreBytes()){
                if(buffer.nextCodepointAscii() == ';'){
                    break;
                }
            }
            listener.token(tokenStartOffset, buffer.tempOffset, buffer.buffer[tokenStartOffset]);
        }
    }
}
It looks pretty simple, doesn't it? It is reasonably fast.
Furthermore - if you know that the PDL you are parsing is minified - meaning there are no white spaces between the tokens - you can leave out the calls to buffer.skipWhiteSpace(), as shown in the method tokenizeMinified(). In that mode, my CPU is able to tokenize around 2.5 billion bytes per second (tested with a 1014 byte script that is tokenized repeatedly - so it is stored in the CPU cache).
Readability Trade-offs
Unfortunately, the syntax does not look so pretty when it comes to representing objects and tables (arrays). Look at this:
{; .field1; "Value 1; .field2; +12345; }; [; .column1; .column1; "Value 1.1; "Value 1.2; "Value 2.1; "Value 2.2; ];
Notice how the Object field markers {; and }; both have to be ended with a semicolon. The same is true for the Table field markers [; and ];.
Even if it is possible to get used to reading these verbose markers, it would still be nicer if we could just use the { and } and [ and ] like we do in JSON, and in many programming languages.
It turns out - we can - but the tokenizer logic gets a bit bigger, and it executes a bit slower. I call this syntax variant Parser Friendly (PF). So - not as optimized for parsing as possible, but still friendly to parsers.
First, let's see the language example above using these simpler Object and Table markers:
{ .field1; "Value 1; .field2; +12345; } [ .column1; .column1; "Value 1.1; "Value 1.2; "Value 2.1; "Value 2.2; ]
This syntax looks a bit more familiar - but the downside is that the same character no longer marks the end of every token. This makes finding the end of a token a bit more complicated.
Luckily, we can store the end marker characters in a lookup table - using the first character of the token as the key / offset. Here is how that could look in code:
public class PdlPfvTokenizer {

    private static final byte[] tokenEndCharacters = new byte[128];
    static {
        // By default a token ends with a semicolon.
        for(int i=0; i < tokenEndCharacters.length; i++){
            tokenEndCharacters[i] = ';';
        }
        // Single-character tokens end with the same character they start with.
        tokenEndCharacters['{'] = '{';
        tokenEndCharacters['}'] = '}';
        tokenEndCharacters['['] = '[';
        tokenEndCharacters[']'] = ']';
        tokenEndCharacters['<'] = '<';
        tokenEndCharacters['>'] = '>';
        tokenEndCharacters['('] = '(';
        tokenEndCharacters[')'] = ')';
    }

    public void tokenize(Utf8Buffer buffer, TokenizerListener listener) {
        buffer.skipWhiteSpace();
        while(buffer.hasMoreBytes()){
            int tokenStartOffset = buffer.tempOffset;
            int endCharacter = tokenEndCharacters[buffer.buffer[tokenStartOffset]];
            while(buffer.hasMoreBytes()){
                if(buffer.nextCodepointAscii() == endCharacter){
                    break;
                }
            }
            listener.token(tokenStartOffset, buffer.tempOffset, buffer.buffer[tokenStartOffset]);
            buffer.skipWhiteSpace();
        }
    }

    public void tokenizeMinified(Utf8Buffer buffer, TokenizerListener listener){
        while(buffer.hasMoreBytes()){
            int tokenStartOffset = buffer.tempOffset;
            int endCharacter = tokenEndCharacters[buffer.buffer[tokenStartOffset]];
            while(buffer.hasMoreBytes()){
                if(buffer.nextCodepointAscii() == endCharacter){
                    break;
                }
            }
            listener.token(tokenStartOffset, buffer.tempOffset, buffer.buffer[tokenStartOffset]);
        }
    }
}
As you can see, the tokenizer code is still reasonably simple. However, the table lookup has a performance cost. Also, you have to check whether the first character of a token is also the last character of that token. This is not necessary with the syntax variant where all tokens end with a semicolon - with that variant, you know the first character of a token is never also its last character.
The performance of the above tokenizer drops to around 2.0 billion bytes per second on my CPU.
However, this syntax requires a few bytes less to represent the same information than the syntax where all tokens end with a semicolon. In my test, the script that was 1014 bytes using the most verbose (PO) syntax only required 926 bytes using the slightly less verbose (PF) syntax. Thus, tokenizing 2.0 billion PF bytes per second actually corresponds to roughly 2.2 billion bytes per second of the equivalent PO syntax (2.0 x 1014 / 926 ≈ 2.2). Still slower, but it might be an acceptable trade-off for a more readable syntax. Fewer bytes also means fewer bytes to store or transport - in case that is relevant.
Parallel Parsing
A nice by-product of the Parser Optimized (PO) and Parser Friendly (PF) syntaxes is that they are reasonably easy to parse, or at least to tokenize, in parallel.
The reason these syntaxes are easy to tokenize in parallel is that it is reasonably easy to jump to an arbitrary offset of a PDL script (in either syntax), find the beginning of the next token, and start tokenizing from there.
Imagine you have 4 MB of data to parse, and a CPU with 4 cores. You can then divide the input data into 4 blocks - one for each CPU core. However, you cannot count on a new token starting exactly on each 1 MB boundary. Thus, each CPU core, except the core tokenizing the block from offset 0, will have to start from its boundary offset, search forward until it finds the end of the current token, and start tokenizing from the following token.
Similarly, all CPU cores, except the CPU core tokenizing the last block, will have to tokenize beyond their block's end boundary - to make sure they tokenize the token that the CPU core tokenizing the following block skips over (as explained in the previous paragraph).
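Here is a minimal sketch of that block-splitting scheme (my own illustration, not the actual PDL tokenizer). It assumes a minified PO-syntax script - every token ends with a semicolon and there is no white space between tokens - and the Token record, class and method names are made up for the example:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPoTokenizerSketch {

    /** A token is just a [start, end) slice of the script buffer. */
    public record Token(int start, int end) {}

    // Tokenize a minified PO-syntax script (every token ends with ';',
    // no white space between tokens) using the given number of worker threads.
    public static List<List<Token>> tokenizeParallel(byte[] script, int workers)
            throws InterruptedException, ExecutionException {

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<List<Token>>> futures = new ArrayList<>();
        int blockSize = (script.length + workers - 1) / workers;

        for (int w = 0; w < workers; w++) {
            final int blockStart = w * blockSize;
            final int blockEnd   = Math.min(blockStart + blockSize, script.length);
            futures.add(pool.submit(() -> tokenizeBlock(script, blockStart, blockEnd)));
        }

        List<List<Token>> tokensPerBlock = new ArrayList<>();
        for (Future<List<Token>> future : futures) {
            tokensPerBlock.add(future.get());
        }
        pool.shutdown();
        return tokensPerBlock;
    }

    private static List<Token> tokenizeBlock(byte[] script, int blockStart, int blockEnd) {
        List<Token> tokens = new ArrayList<>();
        int offset = blockStart;

        // Every block except the first skips the token that straddles its start
        // boundary - the previous block is responsible for tokenizing it.
        if (blockStart > 0 && script[blockStart - 1] != ';') {
            while (offset < script.length && script[offset] != ';') { offset++; }
            offset++;   // first byte after the ';' = start of the next token
        }

        // Tokenize every token that *starts* before blockEnd - even if it ends after it.
        while (offset < script.length && offset < blockEnd) {
            int tokenStart = offset;
            while (offset < script.length && script[offset] != ';') { offset++; }
            offset++;   // include the terminating ';'
            tokens.add(new Token(tokenStart, Math.min(offset, script.length)));
        }
        return tokens;
    }
}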
PDL - A Stream of Fields
Now that I have explained the logic behind the syntaxes of tokens in PDL, let me dive deeper into how PDL looks.
A PDL script is a stream of fields. This means that there is no requirement to have a single PDL field as the "root" field of a PDL script - like you have in JSON or XML. You can have as many PDL fields at the "root level" as you want.
Each field corresponds to a field in the binary Polymorph Data Encoding (PDE). PDE fields can be either atomic - meaning the field contains a single value, or composite - meaning the field can contain nested fields.
An atomic field can be represented by a single PDL token. Here are some atomic PDL field examples:
+123; -384; %123.45; /12345.6789; "This is text;
A composite field will have a begin-token, possibly some nested fields and an end token. Here are some composite PDL field examples:
# Parser Optimized (PO) syntax: {; .key1; "Value 1; .key2; +123; }; [; .col1; .col2; "Val1; -998; "Val2; +458; ];
# Parser Friendly (PF) syntax: { .key1; "Value 1; .key2; +123; } [ .col1; .col2; "Val1; -998; "Val2; +458; ]
PDL Field Types
As mentioned earlier, the PDL field types are the same as the field types in PDE - Polymorph Data Encoding. These field types are:
- Comments
- Boolean
- Integer - positive
- Integer - negative
- Float - 32 bit
- Float - 64 bit
- Bytes
- UTF-8
- UTC
- ID
- Reference
- Key
- Object
- Table
Note: The ID field only exists in PDL - not in PDE. That is because in PDE the Reference field will simply reference a relative number of bytes back in the script buffer (pointing to the field located there) - so no explicit ID field is necessary in PDE.
Note: PDL also supports comments - but these do not have a PDE equivalent. When translating PDL into PDE, all comments are stripped - so the generated PDE does not contain the comments.
Note: I am currently experimenting with a "metadata" field type - so if you see any references to that, just keep in mind that it is still experimental. It is intended for providing metadata for other fields in the stream of fields.
Note: PDL and PDE support typed null values. In some data languages and encodings (e.g. JSON) a null value has no type information associated with it. In other words, you don't know if the null value is of type string, integer, float, object, array etc. In PDL and PDE all fields have a type and a value. The value of a field can be null, but the field still has type information associated with it. This is useful when introspecting PDL or PDE messages whose schema you do not know upfront.
Comments
PDL supports comments in PDL scripts. The syntax for comments is the same in both PO and PF syntaxes. Here is how a PDL comment looks:
# PDL comment - ends with a semicolon;
boolean
The boolean PDL instruction is used to represent a boolean PDE field. You can write a boolean PDL field using either a full instruction syntax, or using its abbreviated single-token syntax. Here is how the boolean instruction syntax looks:
#PO syntax; *boolean;(;+0;);
#PF syntax; *boolean;(+0;)
The argument can be either a 0 (meaning false), or a 1 (meaning true). Here is how that looks:
#PO syntax; *boolean;(;+0;); *boolean;(;+1;);
#PF syntax; *boolean;(+0;) *boolean;(+1;)
If you leave out the argument completely, the instruction will represent a boolean PDE field with a null value. Here is how that looks:
#PO syntax; *boolean;(;);
#PF syntax; *boolean;()
If a boolean instruction has no arguments, you can leave out the parentheses, like this:
*boolean;
The above example will represent a boolean instruction with a null value.
You can also represent boolean fields via their single-token syntax. This syntax is the same for both PO and PF syntaxes. Here are how the boolean fields look using single-token syntax:
!0; # boolean field with value false; !1; # boolean field with value true; !; # boolean field with value null;
int
The int PDL instruction is used to represent an int (integer) PDE field. You can write an integer PDL field using either a full instruction syntax, or using its abbreviated single-token syntax. Here is how the int instruction syntax looks:
# PO syntax; *int;(;+0;);
# PF syntax; *int;(+0;)
The argument can be any PDL literal that can be translated into an int of up to 8 bytes in length. The most common literal to use as the argument is probably an integer literal. Here are a few examples:
# PO syntax; *int;(;+123;); *int;(;+9999;); *int;(;-987654321;);
# PF syntax; *int;(+123;) *int;(+9999;) *int;(-987654321;)
It is also possible (or will be) to use e.g. a hexadecimal literal as the argument, like this:
# PO syntax; *int;(;:a8f1;);
# PF syntax; *int;(:a8f1;)
The hexadecimal literal will be translated into bytes (e.g. :a8f1; becomes the two bytes 0xA8 and 0xF1), and the value of those bytes will be used as the value of the int PDE field.
If you leave out the argument completely, the instruction will represent an int PDE field with a null value. Here is how that looks:
# PO syntax; *int;(;);
# PF syntax; *int;()
If an int instruction has no arguments, you can leave out the parentheses, like this:
*int;
The above example will represent an int instruction with a null value.
Second, it is possible to represent an int instruction using only an integer literal itself (a single-token integer field syntax). Here is how that looks:
+0; +1; -1;
The above integer literals (when located outside the arguments of an int instruction) will be interpreted as int instructions.
float
The float PDL instruction is used to represent a float PDE field. You can write a float PDL field using either a full instruction syntax, or using its abbreviated single-token syntax. Here is how the float instruction syntax looks:
# PO syntax; *float;(;%123.45;);
# PF syntax; *float;(%123.45;)
The argument should be either an integer literal or a floating point literal. Thus, the following two examples will both be interpreted as representing a floating point field:
#PO syntax; *float;(;+123;); *float;(;%123.45;);
#PF syntax; *float;(+123;) *float;(%123.45;)
Since PDE supports encoding floating points as either 32 bit or 64 bit floating point numbers, PDL has a way for you to specify what encoding you want to use. The default encoding is 32 bit, by the way.
To force the use of a 32 bit floating point encoding you can use the % character in front of the literal value. Here is how that looks:
# PO Syntax; *float;(; %123.45; );
To force the use of a 64 bit floating point encoding you can use the / character in front of the literal value. Here is how that looks:
#PO Syntax; *float;(; /123.45; );
If you leave out the argument completely, the instruction will represent a float PDE field with a null value. Here is how that looks:
# PO syntax; *float;(;);
# PF syntax; *float;()
If a float instruction has no arguments, you can leave out the parentheses, like this:
# PO and PF syntax; *float;
The above example will represent a float instruction with a null value.
Second, it is possible to represent a float instruction using only a floating point literal itself. Here is how that looks:
# PO and PF syntax; %123.45; /12345.6789;
The above floating point literals (when located outside the argument of a float instruction) will be interpreted as float instructions. Note that both PO and PF use the same syntax for float literals.
bytes
The bytes PDL instruction is used to represent a bytes PDE field. You can write a bytes PDL field using either a full instruction syntax, or using its abbreviated single-token syntax. Here is how the bytes instruction syntax looks:
# PO syntax; *bytes;(;:a148d7f9;);
# PF syntax; *bytes;(:a148d7f9;)
The argument should be any literal value that can be translated into raw bytes. PDL has three literal types that are easily converted into bytes: Hexadecimal literals, base64 literals and UTF-8 literals. Here are examples that use each of these literals as argument:
# PO syntax; *bytes;(;:a148d7f9;); *bytes;(;|VGhpcyBpcyBiYXNlIDY0IHRleHQ=;); *bytes;(;"Hello world in bytes;);
# PF syntax; *bytes;(:a148d7f9;) *bytes;(|VGhpcyBpcyBiYXNlIDY0IHRleHQ=;) *bytes;("Hello world in bytes;)
If you leave out the argument completely, the instruction will represent a bytes PDE field with a null value. Here is how that looks:
# PO syntax; *bytes;(;);
# PF syntax; *bytes;()
If a bytes instruction has no arguments, you can leave out the parentheses, like this:
# PO + PF syntax; *bytes;
The above example will represent a bytes instruction with a null value.
Second, it is possible to represent a bytes instruction using only a hexadecimal or base64 literal itself. Here is how that looks:
:a148d7f9; |VGhpcyBpcyBiYXNlIDY0IHRleHQ=;
The above hexadecimal and base64 literals (when located outside the argument of a bytes instruction) will be interpreted as bytes instructions.
utf8
The utf8 PDL instruction is used to represent a utf8 PDE field. You can write a utf8 PDL field using either a full instruction syntax, or using its abbreviated single-token syntax. Here is how the utf8 instruction looks:
# PO syntax; *utf8;(;"This is UTF-8 encoded text;);
# PF syntax; *utf8;("This is UTF-8 encoded text;)
The argument should be a UTF-8 literal.
If you leave out the argument completely, the instruction will represent a utf8 PDE field with a null value. Here is how that looks:
# PO syntax; *utf8;(;);
# PF syntax; *utf8;()
If a utf8 instruction has no arguments, you can leave out the parentheses, like this:
# PO + PF syntax; *utf8;
The above example will represent a utf8 instruction with a null value.
Second, it is possible to represent a utf8 instruction using only the utf8 literal itself. Here is how that looks:
# PO + PF syntax "hello world;
The above utf8 literal (when located outside the argument of a utf8 instruction) will be interpreted as a utf8 instruction.
utc
The utc PDL instruction is used to represent a utc PDE field - which can contain a UTC date + time. A UTC value has no time zone - only a date + time in UTC. You can convert a UTC time to any time zone yourself. You can write a utc PDL field using either a full instruction syntax, or using its abbreviated single-token syntax. Here is how the utc instruction looks:
# PO syntax; *utc;(;@2023-12-31T23:59:59.999;);
# PF syntax; *utc;(@2023-12-31T23:59:59.999;)
The argument should be a UTC literal.
You can leave out parts of the date + time literal to specify a date and time at a coarser granularity. Here are examples of all valid UTC literals:
# PO syntax; *utc;(;@2023-12-31T23:59:59.999;); *utc;(;@2023-12-31T23:59:59;); *utc;(;@2023-12-31T23:59;); *utc;(;@2023-12-31T23;); *utc;(;@2023-12-31;); *utc;(;@2023-12;); *utc;(;@2023;);
# PF syntax; *utc;(@2023-12-31T23:59:59.999;) *utc;(@2023-12-31T23:59:59;) *utc;(@2023-12-31T23:59;) *utc;(@2023-12-31T23;) *utc;(@2023-12-31;) *utc;(@2023-12;) *utc;(@2023;)
If you leave out the argument completely, the instruction will represent a PDE utc field with a null value. Here is how that looks:
# PO Syntax; *utc;(;);
# PF Syntax; *utc;()
If a utc instruction has no arguments, you can leave out the parentheses, like this:
# PO + PF syntax; *utc;
The above example will represent a utc instruction with a null value.
It is possible to express a utc instruction using only the utc literal value itself. Here is how that looks:
# PO + PF syntax; @2023-12-31T23:59:59.999; @2023-12-31T23:59:59; @2023-12-31T23:59; @2023-12-31T23; @2023-12-31; @2023-12; @2023;
When listed like this (outside of a utc instruction argument), the above utc literals will be interpreted as utc instructions.
object
The object PDL instruction is used to represent an object PDE field. You can write an object PDL field using either a full instruction syntax, or using its abbreviated single-token syntax. Here is how the object instruction looks:
# PO syntax; *object;(;<; >;);
# PF syntax; *object;(< >)
This example shows an object instruction without any nested instructions inside its body (inside the <; + >; or < + > characters).
Since an object PDE field is a composite field, an object PDL instruction is also a composite instruction. This means that the object PDL instruction can contain nested PDL instructions inside its body (inside the <; + >; or < + > characters).
A common use case is to have object properties nested inside an object instruction (meaning inside an object PDE field). In PDL and PDE a property is represented by a key instruction / key field followed by a value instruction / value field. This key - value pair of instructions / fields forms a property with a name (the key field) and a value (the value field). Here is an example:
# PO syntax; *object;(;<; *key;(;"firstName;); *utf8;(;"John;); *key;(;"lastName;); *utf8;(;"Doe;); >;);
# PF syntax; *object;(< *key;("firstName;) *utf8;("John;) *key;("lastName;) *utf8;("Doe;) >)
It is also possible to only nest value instructions inside an object instruction. Here is how that looks:
# PO syntax; *object;(;<; *utf8;(;"John;); *utf8;(;"Doe;); >;);
# PF syntax; *object;(< *utf8;("John;) *utf8;("Doe;) >)
Without nested key instructions / key fields you only have the index of each field to identify what it represents. In the example above, you would need to know that the first nested instruction represents the first name, and the second nested instruction represents the last name.
You can nest any instruction including object and table instructions inside an object instruction - recursively - as deeply nested as you like. This way you can represent advanced object graphs.
If you leave out the body completely, the instruction will represent a PDE object field with a null value. Here is how that looks:
# PO syntax; *object;(;);
# PF syntax; *object;()
It is possible to abbreviate the syntax of object instructions in PDL to represent object instructions using a more concise syntax.
First of all, if an object instruction has no arguments and no body, you can leave out the parentheses, like this:
# PO + PF syntax; *object;
The above example represents an object instruction with a null value.
Here is how you would represent an empty object field:
# PO syntax; *object;(;<;>;);
# PF syntax; *object;(<>)
An empty object is not the same as an object with a null value.
It is also allowed to write only 'o' instead of 'object'. Here is how that looks:
# PO syntax; *o; *o;(;); *o;(;<;>;);
# PF syntax; *o; *o;() *o;(<>)
The first two examples represent object instructions with a null value. The last example represents an object instruction with an empty body.
Finally, it is possible to abbreviate an object instruction using only the instruction body characters to delimit it, like this:
# PO syntax; {;}; {; *key;(;"firstName;); *utf8;(;"John;); *key;(;"lastName;); *utf8;(;"Doe;); };
# PF syntax; {} { *key;("firstName;) *utf8;("John;) *key;("lastName;) *utf8;("Doe;) }
Remember, the key and utf8 instructions can also be abbreviated further - to make the total object syntax even more concise. Here is the second example above using the abbreviated syntax for its nested instructions:
# PO syntax; {; .firstName; "John; .lastName; "Doe; };
# PF syntax; { .firstName; "John; .lastName; "Doe; }
The abbreviated object notation using the { and } (or {; and }; ) characters is also referred to as an "object literal".
table
The table PDL instruction is used to represent a table PDE field. Here is how the table instruction looks:
# PO syntax; *table;(;<;>;);
# PF syntax; *table;(<>)
Since a table PDE field is a composite field, a table PDL instruction is also a composite instruction. This means that the table PDL instruction can contain nested PDL instructions inside its body (inside the <; + >; or < + > characters).
The table PDE field, and thus the table PDL instruction, is used to represent tabular data, such as you find in a CSV file, or in the result of a database query. More precisely, a table instruction consists logically of rows and columns.
To identify the columns of a table instruction you use a series of key instructions as the first instructions nested inside the table instruction. Here is how that looks:
# PO syntax; *table;(;<; *key;(;"col1;); *key;(;"col2;); *key;(;"col3;); >;);
# PF syntax; *table;(< *key;("col1;) *key;("col2;) *key;("col3;) >)
The above example defines a table with 3 columns named col1, col2 and col3.
The rows with column values are represented by the value instructions (fields) following the first series of key instructions. Here is an example of how rows of instructions in a PDL table look:
# PO syntax; *table;(;<; *key;(;"col1;); *key;(;"col2;); *key;(;"col3;); *utf8;(;"val1;); *int;(;+123;); *utc;(;@2030-01-01;); *utf8;(;"val2;); *int;(;+456;); *utc;(;@2031-10-12;); >;);
# PF syntax; *table;(< *key;("col1;) *key;("col2;) *key;("col3;) *utf8;("val1;) *int;(+123;) *utc;(@2030-01-01;) *utf8;("val2;) *int;(+456;) *utc;(@2031-10-12;) >)
You can nest any instruction including object and table instructions inside a table instruction - recursively - as deeply nested as you like. This way you can represent advanced object and table graphs.
If you leave out the first series of key instructions inside a table instruction, the table is interpreted to be just an array - meaning a 1-dimensional list of instructions, or a table with 1 column without a column name. Here is how that could look:
# PO syntax; *table;(;<; *int;(;+123;); *int;(;+456;); *int;(;+789;); >;);
# PF syntax; *table;(< *int;(+123;) *int;(+456;) *int;(+789;) >)
If you leave out the body completely, the instruction will represent a PDE table field with a null value. Here is how that looks:
# PO syntax; *table;(;);
# PF syntax; *table;()
It is possible to abbreviate the syntax of table instructions in PDL to represent table instructions using a more concise syntax.
First of all, if a table instruction has no arguments and no body, you can leave out the parentheses, like this:
# PO + PF syntax; *table;
It is also allowed to write only 't' instead of 'table'. Here is how that looks:
# PO syntax; *t; *t;(;); *t;(;<;>;);
# PF syntax; *t; *t;() *t;(<>)
The first two examples represent table instructions with a null value. The last example represents a table instruction with an empty body.
Finally, it is possible to abbreviate a table instruction using only the characters [ and ] to delimit it, like this:
# PO syntax; [;]; [; *key;(;"firstName;); *key;(;"lastName;); *utf8;(;"John;); *utf8;(;"Doe;); ];
# PF syntax; [] [ *key;("firstName;) *key;("lastName;) *utf8;("John;) *utf8;("Doe;) ]
Remember, the key and utf8 instructions can also be abbreviated further - to make the total table syntax even more concise. Here is the second example above using the abbreviated syntax for its nested instructions:
# PO syntax; [; .firstName; .lastName; "John; "Doe; ];
# PF syntax; [ .firstName; .lastName; "John; "Doe; ]
The abbreviated table notation using the [ and ] characters is also referred to as a "table literal".
key
The key PDL instruction is used to represent a key PDE field. Here is how the key instruction looks:
# PO syntax; *key;(;"prop1;);
# PF syntax; *key;("prop1;)
PDL key fields are typically used inside an object field to represent property names, or inside a table field to represent a column name. However, you can use them in any way you see fit.
If you leave out the argument completely, the instruction will represent a PDE key field with a null value. A key field with a null value has no particular built-in meaning - but it is possible to represent in case you find a need for it. Here is how that looks:
# PO syntax; *key;(;);
# PF syntax; *key;()
It is possible to abbreviate the key syntax for simple key values. Here are two key examples - one using the full PO syntax and the other using the abbreviated syntax for a key with the same value:
# PO Syntax; *key;(;"property1;); # PO + PF syntax; .property1;
The abbreviated syntax uses a . in front of the key value. The key value no longer has to be delimited by the argument body characters ( and ), nor by quotes. All characters after the . and until the next white space character will be considered part of the key value.
The limitation to the above abbreviated syntax is, that you cannot use white space characters as part of the key value. In most cases, however, you would not be using white space characters as part of key values anyway, so this limitation is not too serious.
The abbreviated key notation is also referred to as a "key literal".
id + ref
The id and ref PDL instructions are used to represent references between different PDL instructions (PDE fields), typically to represent references between objects and cyclic object graphs. Here is a simple example of an object with a nested object that references its parent:
# PO syntax; *id;(;+0;); {; .name; "Parent; .child; {; .parent; *ref;(;+0;); }; };
# PF syntax; *id;(+0;) { .name; "Parent; .child; { .parent; *ref;(+0;) } }
The id instruction identifies the PDL instruction following it. When a ref instruction references an id instruction it actually references the PDL instruction following the id instruction, not the id instruction itself.
The value inside the ref instruction argument must be the same as the value inside the id instruction argument it is referencing.
Note, that the id instruction will not actually be converted to a PDE field. There is no id PDE field. Instead, the id instruction identifies the PDL instruction following it. In PDE, the ref field will contain a relative byte offset backwards in the PDE data - containing the relative offset in bytes from the beginning of the ref field to the beginning of the field being referenced.
I am currently experimenting with an abbreviated (single-token) syntax for ID and Ref fields too. They will look like this ( $idVal; + &idVal; ) :
# PO + PF syntax; $0; # single-token syntax for ID; { .name; "Parent; .child; { .parent; &0; # single-token syntax for Ref; } }
PDL + PDE Use Cases
To give you an idea about what you could use PDL and PDE for, I have listed a few of the use cases I plan to use it for myself.
- Event logs
- RSS type feeds
- Web server visit logs
- Time series data logs
- Application logs
- Web service messages
I will give a few details about each use case in the following sections.
Event Logs
Event logs are streams of events written to a log (stream). Each record in the log is an independent entry in the log, with its own log offset. Here is an example of a log of events in PDL PF:
{ .eventType; "order; .time; @2030-07-01T13:00:00; .product; "Mouse; } { .eventType; "order; .time; @2030-07-01T13:30:00; .product; "Mouse Pad; } { .eventType; "complaint; .time; @2030-07-01T14:30:00; .orderId; :4e34f8a1: .text; "Bla. bla.; }
RSS Type Feeds
You could use PDL or PDE to model RSS feeds. I plan to add an RSS feed in PDL PF format as an alternative to my current RSS XML format. Just for fun!
{ .channelTitle; "Jenkov.com News; .link; "https://jenkov.com/rss.xml; .description; "Bla. bla.; .language; "en; } { .itemTitle; "New article about... ; .link; "https://jenkov.com/new-article.html; .description; "Bla. bla; .guid; "ArticleXYZ123; .permaLink; !1; .pubDate; @2023-12-31; }
Web Server Visit Logs
Web server visit logs typically consist of a time, the URL visited and some information about the visitor - such as the kind of browser and device the user was using. Here is an example of a web server visit log in PDL PF:
{ .time; @2023-11-29T01:34:46; .uri; "/java/introduction.html; } { .time; @2023-11-29T01:37:12; .uri; "/java/for-loops.html; } { .time; @2023-11-29T02:12:32; .uri; "/java/while-loops.html; }
You could add more fields to the above log records, if you needed them. This is just an example.
Time Series Data Logs
Time series data typically consists of multiple measurements from the same source over time. For instance, the temperature of a city measured regularly and recorded as a time series of temperatures. Here is such an example of time series data in PDL PF:
{ .city; "Copenhagen; .time; @2030-07-01T13:00:00; .temperature; %21.4; } { .city; "Copenhagen; .time; @2030-07-01T14:00:00; .temperature; %22.4; } { .city; "Copenhagen; .time; @2030-07-01T15:00:00; .temperature; %23.4; } { .city; "Copenhagen; .time; @2030-07-01T16:00:00; .temperature; %22.9; } { .city; "Copenhagen; .time; @2030-07-01T17:00:00; .temperature; %21.9; } { .city; "Copenhagen; .time; @2030-07-01T18:00:00; .temperature; %21.0; }
Application Logs
You could use PDL / PDE as the encoding for application logs. That way the logs are machine-readable, and you can take advantage of the data structures available in PDL. Here is an example of how an application log in PDL format could look:
{ .time; @2030-07-01T18:00:00; .severity; "WARN; .code; "DB_TEMP_NOT_AVAILABLE; .text; "The database is...; } { .time; @2030-07-01T18:15:00; .severity; "ERR; .code; "DB_PERM_NOT_AVAILABLE; .text; "The database is still...; }
Web Services Messages
Web services typically communicate via some data format. Currently, XML (SOAP), JSON, MessagePack, CBOR and Protobuf are popular data encodings. However, these data formats have the limitations I have mentioned elsewhere on this page.
Instead of these data formats I would like to use either PDL or PDE. Both formats offer some advantages over the above formats (in my opinion). For instance, PDL / PDE offers a compact representation of tabular data, such as the results of a database query (SQL) or a list of search results etc.
Here is an example of tabular data represented in a reasonably compact notation:
[ .col1; .col2; .col3; "Jane; "Collins; +654; "John; "Collins; +123; "Kyle; "Smith; +8855; ]
PDL vs. JSON
The most obvious differences between Polymorph Data Language (PDL) and JSON are:
- PDL allows for a more concise syntax than JSON.
- PDL supports more data types than JSON : Bytes, UTC dates and 32 / 64 bit floats.
- PDL supports typed null values.
- PDL has a concise syntax for tabular data.
- PDL tables can represent object trees more concisely.
- PDL supports cyclic object graphs.
- PDL supports streams of instructions (fields).
- PDL supports comments
- PDL can be converted to a compact, fast to read binary encoding.
The following sections will explore the above claims in more detail.
PDL Allows for a More Concise Syntax Than JSON
Most of the PDL and JSON literals are approximately the same in size, with some PDL literals being a bit shorter. The main difference occurs when representing objects or tabular data (covered in a later section). Here is an example of a JSON object and the corresponding PDL object using its most concise syntax:
{"prop1":"value1","prop2":123,"prop3":123.45} {.prop1 "value1" .prop2 123 .prop3 123.45}
As you can see, the PDL version is 3 characters shorter. That is caused by 1 character being saved per property name.
If you know what property each of the fields of an object corresponds to based solely on their index within the object, PDL enables you to represent that object as only its property values. Here is an example comparing all three notations again (the last one is without property names (key instructions)).
{"prop1":"value1","prop2":123,"prop3":123.45} {.prop1 "value1" .prop2 123 .prop3 123.45} {"value1" 123 123.45}
As you can see, the last syntax now becomes even more compact. But, now you will need some external knowledge of what each of these values represents. This may not always suit your use case, but at least you have the option when it does.
PDL Supports More Data Types Than JSON
PDL supports more primitive data types than JSON. In PDL you can represent binary data in bytes instructions, UTC date+time in utc instructions, and you have the option to specify the size of floating points to either 32 or 64 bits (mostly useful when converting PDL to binary PDE).
PDL Supports Typed Nulls
In JSON, if a property is set to null you don't know what the type of that field is. In PDL you can specify the type of the null value, so you can state that the instruction represents a null value of a certain type. Here are some examples - with the first line being the JSON equivalent:
{"f1":null,"f2":null,"f3":null,"f4":null} {.f1 int .f2 utf8 .f3 float .f4 o}
As you can see in the second example, even if all the property values inside the object are null, you know exactly what type each property is.
PDL Has a Concise Syntax for Tabular Data.
The PDL table instruction enables a more compact representation than JSON for tabular data. To be fair, you could approximate PDL's notation in JSON so it gets a lot closer - but you would have to apply the semantic interpretation of that yourself. I will still show you how, though.
Here is first an array of objects in JSON followed by their potential representations using a PDL table:
[ {"firstName":"Hannah","lastName":"Dodger","street":"Highstreet 45"}, {"firstName":"John","lastName":"Nayer","street":"Lowstreet 3"} {"firstName":"Raphael","lastName":"Delgado","street":"Sky Lane 1"} ] [ .firstName .lastName .street "Hannah" "Dodger" "Highstreet 45" "John" "Nayer" "Lowstreet 3" "Raphael" "Delgado" "Sky Lane 1" ] [ .firstName .lastName .street "Hannah""Dodger""Highstreet 45" "John""Nayer""Lowstreet 3" "Raphael""Delgado""Sky Lane 1" ]
The second of the PDL notations (the last table in the example) just shows how concisely the tabular data could actually be represented in the given case. All unnecessary white space characters between the instructions of each row have been removed.
The PDL notation only contains the column names once - for all rows in the table. For every row, the repetition of the property names that the corresponding JSON objects would require is avoided.
Also, the PDL notation does not need object begin + end characters for each row in the table. The number of key instructions sets how many columns the table has - and the following values will be grouped together into "rows" based on that number.
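To illustrate that grouping rule, here is a small Java sketch (my own illustration - not part of any PDL implementation) that turns a table's column names and flat list of values into rows:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TableRowGrouping {

    // Group a table's flat list of values into rows, using the number of
    // leading key instructions (column names) as the row width.
    public static List<Map<String, Object>> toRows(List<String> columnNames, List<Object> values) {
        int columns = columnNames.size();
        List<Map<String, Object>> rows = new ArrayList<>();
        for (int i = 0; i < values.size(); i += columns) {
            Map<String, Object> row = new LinkedHashMap<>();
            for (int c = 0; c < columns; c++) {
                row.put(columnNames.get(c), values.get(i + c));
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Corresponds to: [ .firstName; .lastName; "Jane; "Collins; "John; "Nayer; ]
        List<String> columns = List.of("firstName", "lastName");
        List<Object> values  = List.of("Jane", "Collins", "John", "Nayer");
        System.out.println(toRows(columns, values));
        // => [{firstName=Jane, lastName=Collins}, {firstName=John, lastName=Nayer}]
    }
}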
A compact way to model something similar in JSON would be:
[ ["firstName","lastName","street"], ["Hannah","Dodger","Highstreet 45"], ["John","Nayer","Lowstreet 3"], ["Raphael","Delgado","Sky Lane 1"] ]
This notation gets close to the conciseness of a PDL table - though still slightly more verbose due to nested array start + end characters.
PDL Tables Can Represent Object Trees More Concisely
You can nest any PDL instruction inside a table row. Thus, you can also nest objects or tables inside a table instruction. Nesting tables within tables can be used to create a more concise object tree notation. Here is an example showing a JSON object tree followed by a PDL object tree represented using nested tables:
{ "name":"Aya", "children":[ {"name":"Gretchen", "children":[ {"name":"Rami", children:[]}, {"name":"Fana", children:[]}, ] }, {"name":"Hansel", "children":[ {"name":"Gordia", children:[]}, {"name":"Victor", children:[]}, ] } ] }
{ .name "Aya" .children [ .name .children "Gretchen" [ .name .children "Rami" [] "Fana" [] ] "Hansel" [ .name .children "Gordia" [] "Victor" [] ] ] }
As you can see, children of a parent that all have the same property names (e.g. are of the same class) can be encoded using tables, so the property names of the children are only listed once for the whole list of children. Thus, the more children each node has in its list, the bigger the saving compared to an object based representation in JSON (or PDL).
PDL Supports Cyclic Object Graphs
In PDL it is possible to represent cyclic object graphs. For instance, a child of a parent object can reference that parent in PDL. This is not possible in JSON.
Here is an example of a parent node having a child that references its parent:
id(0) { .name "mother" .parent o .children [ .name .parent .children "child1" ref(0) [] "child2" ref(0) [] ] }
Notice the id(0) instruction before the first object instruction. This is the id of the parent to reference.
The two ref(0) instructions in the children reference the parent field identified by the id(0) instruction.
PDL Supports Streams of Instructions
A JSON document needs to have a single root element which has to be either a JSON object or array.
In PDL there is no such requirement. You can list as many PDL instructions in a PDL file as you like. Here is an example:
"Hello" "World" 123 { .name "Diana" } @2030-01-01
The above list of instructions is referred to as a "stream" of instructions.
In PDE, each field at the root level of the stream is considered a "record" with its own record offset - which you can use to refer to that record. This enables a lot of fun streaming functionality in PDE.
PDL Supports Comments
In JSON you cannot embed comments. In PDL you can use both single-line and multi-line comments.
PDL Can be Converted to a Compact, Fast to Read Binary Encoding
Polymorph Data Language (PDL) can be converted to a compact, fast to read binary encoding called Polymorph Data Encoding. See Polymorph Data Encoding for more details.
Polymorph Data Encoding (PDE) has a lot of nice features when it comes to stream reading and writing at high speeds - such as the ability to navigate PDE directly in its binary form without converting it to objects first (nice when searching through large streams of PDE fields).
You can also convert PDE to PDL so you can open it in a text editor and look at the data.
Granted, you could convert JSON to BSON, MessagePack or CBOR and work with it there - but the reverse might not be 100% true. At least, converting binary data from these formats to JSON is a bit challenging, as you would have to represent it as a string using e.g. hexadecimal or base64 encoding.
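For instance, using Java's standard java.util.Base64 API, raw bytes headed for a JSON document have to take a detour through a string encoding - and back again on the receiving side:

import java.util.Base64;

public class BytesInJson {

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xA1, 0x48, (byte) 0xD7, (byte) 0xF9 };

        // JSON has no bytes type, so the bytes must first become a string...
        String encoded = Base64.getEncoder().encodeToString(raw);
        System.out.println("\"data\": \"" + encoded + "\"");    // "data": "oUjX+Q=="

        // ...and the receiver has to decode the string back into bytes.
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println(decoded.length + " bytes restored");
    }
}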