RION Encoding
Jakob Jenkov
This text explains the binary encoding of RION. RION is a binary data format which is flexible enough to encode a wide variety of data.
So far the RION encoding specification is in version 1.0. RION 1.0 only contains what we know for sure makes sense to keep in RION: the field types that we are actually using and which have significant functions. Everything else will be decided in later versions, once we gain more experience with the current field type set.
Also, we haven't yet fully settled on the "extended types". We have some ideas for a few fields that could be encoded as extended types, but we have not yet analyzed these in detail.
RION Encoding
RION encoded data consists of one or more RION fields. Each field is encoded using a type-length-value (TLV) encoding. This means each field starts with a field type followed by the length of the field value and finally the field value itself.
To avoid allocating a fixed number of bytes to represent the length of a field, RION actually has two length parts. The first length part tells the number in bytes of the length-of-value counter. The second length part is the bytes making up the length-of-value counter. Thus, RION really uses a TLLV (type, length-of-length, length-of-value, value) encoding. This TLLV format enables RION to both contain very large field values as well as encode small field values compactly.
The RION field parts are summarized in this list:
- Type
- Length of Length - Number of Length Bytes
- Length Bytes
- Value
RION Fields
All RION data types are encoded as RION fields. Some RION fields contain a single, binary encoded value (raw bytes, a number, text etc.). We sometimes refer to such fields as primitive fields, atomic fields, single value fields etc. Other RION fields contain other RION fields nested inside them. We refer to such fields as composite fields, or complex fields. Here is a list of the core field types in RION 1.0:
- Bytes
- Boolean
- Int Positive
- Int Negative
- Float
- UTF-8
- UTF-8 Short
- UTC Date-Time
- -
- -
- Array (*)
- Table
- Object
- Key
- Key Short
- Extended
The two types represented with dashes (-) are two field type IDs that are reserved for the future, and thus not yet defined. The Extended field type is also not 100% defined, but we are working on that. The rest of the fields are pretty much finalized by now. The field types will be explained in more detail later in this text.
Basic Field Encoding
The basic encoding for a RION field consists of:
- 1 lead byte containing:
- Field type (4 MSB bits)
- Length of length - number of length bytes (4 LSB bits)
- 0..15 length bytes
- 0..2^120 value bytes
The 4 bits reserved in the lead byte (the 4 most significant bits (MSB)) for the field type give a total of 16 different core field types with field type IDs from 0 to 15.
The 4 bits reserved in the lead byte (the 4 least significant bits) for the length of length (number of length bytes) give a possible range of 0 to 15 length bytes, meaning 0 to 15 bytes to encode a number that tells the length of the field value.
A range of 0 to 15 length bytes can represent a field length range of 0 to 2^120 - 1 bytes, since 15 bytes hold 120 bits. In other words, a RION field can contain values that are just under 2^120 bytes long. If you need to encode larger blocks of data than that, you would need to break it up into multiple fields.
This is the basic encoding of an RION field. As you will see later, individual fields can use slightly different encodings (variations of the above) to encode data more compactly.
The Lead Byte
The lead byte of an RION field contains the field type and the number of length bytes that follow the lead byte.
The field type takes up the top 4 bits of the lead byte (the 4 most significant bits). This gives a total of 16 different core field types. By combining fields of these 16 field types you can make pretty complex objects. One of these 16 core field types is an extension type so more types can be defined later, to add more field types to the RION field set. Extended field types are explained later.
The number of length bytes takes up the bottom 4 bits of the lead byte (the 4 least significant bits). This gives a number of length bytes from 0 to 15.
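The split of the lead byte into two 4-bit halves can be sketched with a few bit operations. This is a minimal sketch; the class and method names are hypothetical and not taken from any RION library.

```java
// Hypothetical helper, not part of RION Ops for Java.
class RionLeadByte {

    /** Field type: the 4 most significant bits of the lead byte. */
    static int fieldType(int leadByte) {
        return (leadByte >> 4) & 0x0F;
    }

    /** Number of length bytes: the 4 least significant bits of the lead byte. */
    static int lengthLength(int leadByte) {
        return leadByte & 0x0F;
    }

    /** Combines a field type (0..15) and a length-length (0..15) into one lead byte. */
    static int leadByte(int fieldType, int lengthLength) {
        return ((fieldType & 0x0F) << 4) | (lengthLength & 0x0F);
    }

    public static void main(String[] args) {
        int lead = 0x51; // field type 5 with 1 length byte
        System.out.println(fieldType(lead));     // 5
        System.out.println(lengthLength(lead));  // 1
    }
}
```

The masks also work when the lead byte is passed as a (signed) Java byte, because the sign-extended upper bits are masked away.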
The Length Bytes
The length bytes make up a 0 to 15 byte long number. This number is encoded using network byte order, meaning the most significant byte comes first, and the least significant byte comes last.
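Reading and writing the length bytes in network byte order could look roughly like the sketch below. It assumes at most 8 length bytes so the result fits in a Java long (an 8-byte length with the top bit set would have to be treated as unsigned); the names are hypothetical.

```java
// Hypothetical helper, not part of RION Ops for Java.
class RionLengthBytes {

    /** Reads lengthLength big-endian bytes starting at offset into a long. */
    static long readLength(byte[] source, int offset, int lengthLength) {
        if (lengthLength < 0 || lengthLength > 8) {
            throw new IllegalArgumentException("This sketch supports 0 to 8 length bytes");
        }
        long length = 0;
        for (int i = 0; i < lengthLength; i++) {
            // Most significant byte comes first.
            length = (length << 8) | (source[offset + i] & 0xFF);
        }
        return length;
    }

    /** Writes length as lengthLength big-endian bytes starting at offset. */
    static void writeLength(long length, int lengthLength, byte[] dest, int offset) {
        for (int i = 0; i < lengthLength; i++) {
            int shift = 8 * (lengthLength - 1 - i);
            dest[offset + i] = (byte) (length >>> shift);
        }
    }
}
```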
The Value Bytes
After the length bytes come the value bytes (if any). The order of the value bytes depends on what the bytes represent. Numbers are encoded into the value bytes using network byte order, meaning the most significant byte comes first, and the least significant byte comes last. If you write your own data types into the value bytes, you can choose whatever byte order makes sense for that data type.
Encoding Variations
As mentioned earlier, RION fields can be encoded using variations of the basic field encoding. In total there are 6 variations, when including the basic encoding in that number. The encodings are:
- Normal
- Short
- Tiny
- Extended Normal
- Extended Short
- Extended Tiny
There is no explicit indicator in a RION field telling what encoding variation it uses. You have to know that based on the field type. Thus, a given field type specifies which encoding variation it uses, and it always uses the same encoding variation.
Normal, Short and Tiny
The first three RION field encodings work like this:
A normal field encoding consists of 1 lead byte, 0..15 length bytes and 0..2^120 value bytes.
A short field encoding consists of 1 lead byte and 0..15 value bytes.
A tiny field encoding consists of 1 lead byte only. The value of the field is contained in the lower 4 bits of the lead byte.
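Since the three encodings interpret the lower 4 bits of the lead byte differently, the total size of a field depends on which encoding its field type uses. A minimal sketch of that calculation follows; the enum and method names are hypothetical, and the caller is assumed to already know the encoding from the field type.

```java
// Hypothetical helper, not part of RION Ops for Java.
class RionFieldSize {

    enum Encoding { NORMAL, SHORT, TINY }

    /** Total number of bytes (lead byte included) of the field starting at offset. */
    static long totalLength(byte[] src, int offset, Encoding encoding) {
        int lowNibble = src[offset] & 0x0F;
        switch (encoding) {
            case TINY:
                return 1;               // the value lives inside the lead byte
            case SHORT:
                return 1 + lowNibble;   // low nibble = number of value bytes
            default: {
                // NORMAL: low nibble = number of length bytes (big-endian).
                long valueLength = 0;
                for (int i = 0; i < lowNibble; i++) {
                    valueLength = (valueLength << 8) | (src[offset + 1 + i] & 0xFF);
                }
                return 1 + lowNibble + valueLength;
            }
        }
    }
}
```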
Extended Normal, Short and Tiny
Each of the three encodings mentioned above exists in an "extended" version, where the lead byte is followed by 1 or 2 type bytes (exactly how many type bytes can follow is not yet 100% defined).
In the extended versions the type byte(s) come directly after the lead byte, before any length bytes and value bytes.
Null Values
A field which has the number of length bytes (4 least significant bits of lead byte) set to 0 is assumed to have a value of null. Such a field will have no length bytes, and no value. A field with a null value is thus only 1 byte long - the lead byte. All field types can assume the value null.
Extended fields can also assume the value null. An extended field type with the value null will consist of the lead byte + 1 or 2 bytes specifying the extended field type. It will have no length bytes, and no value bytes.
Primitive and Composite Fields
RION's many field types can be divided into two groups: Primitive fields and composite fields.
Primitive RION fields contain some kind of primitive data - a single value. This could be a boolean, byte, short, int, long, float, double, a byte sequence, UTF-8 string etc. Primitive fields can use any of the 6 field encodings.
Composite RION fields contain other RION fields inside them. Examples of composite fields are Object, Table and Array. The Bytes field type is theoretically a primitive type because it contains raw bytes, but in practice you could nest serialized RION fields inside a Bytes field too.
Since complex field types usually contain other fields inside them, their length is often longer than 15 bytes. Therefore composite field types only use the Normal and Extended Normal field encodings.
Core Field Types
The core field types are the 15 field types that use the field codes 0 to 14. These field types are encoded using either the Normal, Short or Tiny field encoding. A core field thus only has a single lead byte specifying its type.
RION contains the following core field types:
Type | Code | Encoding | Description |
---|---|---|---|
Bytes | 0 | Normal | A sequence of raw bytes. |
Boolean | 1 | Tiny | Can contain the value 0 (= null), 1 (= true) or 2 (= false). |
Int64-Positive | 2 | Short | An up to 8 byte long positive (unsigned) integer. |
Int64-Negative | 3 | Short | An up to 8 byte long negative integer. |
Float | 4 | Short | Contains either a 32 bit or 64 bit floating point number. |
UTF-8 | 5 | Normal | Contains a variable length sequence of UTF-8 encoded characters. |
UTF-8-Short | 6 | Short | Contains a variable length sequence of UTF-8 encoded characters of maximally 15 bytes. |
UTC-Date-Time | 7 | Short | Contains a date + optionally a time in UTC (no time zone). |
Reserved | 8 | - | Not yet assigned. |
Reserved | 9 | - | Not yet assigned. |
Array (*) | 10 | Normal | A list of RION fields. Could be anything. The elements are not related to each other, like they are in objects and tables. |
Table | 11 | Normal | A list of the exact same type of objects. The key fields (property names) of the objects are only included once, but all value fields of all properties of all objects are included in the table. |
Object | 12 | Normal | A sequence of key and value fields making up an object with property names and property values. |
Key | 13 | Normal | A key - e.g. a property name in an Object, or the key of a key,value pair in a hashtable. |
Key-Short | 14 | Short | Like Key, but represented as a short field, meaning it can be used for all keys that are 15 bytes or less long. |
Extended | 15 | * | Signals that this is an extended field, meaning the type of the field is read from the 1-2 type bytes following the lead byte. The encoding used (Normal, Short, Tiny) depends on the extended field type. |
Each of these field types and their encodings will be explained in more detail later in this text.
Extended Field Types
The purpose of the Extended field type is to enable you to extend the set of core RION fields with additional field types. We refer to such field types as extended field types. For instance, you might need a special field type that is not covered by the core set of field types. Then you can create an extended field type to represent fields of that type. Extended field types are also a way to extend the standard RION field types with future field types.
Extended field types use one of the extended encodings (Extended Normal, Extended Short and Extended Tiny). Extended encodings use 1 or more extra type byte(s) after the lead byte. These extra type bytes contain the extended field type. The extended field type is the actual field type of an extended field. The lead byte just contains the field type "extended", so it is necessary to look at the following type byte(s) to find the exact extended field type.
Please note that we have not settled 100% on the exact encoding of the extended field type bytes yet. The current implementation uses only a single extended field type byte after the lead byte, but that only allows for 256 extra field types. That may or may not be enough. Alternatively we have been looking at various variable-length extended field type encodings, but they typically have the disadvantage that they quickly end up requiring 2 or more type bytes to fully represent extended field types. That makes extended fields take up even more bytes, hurting the "compactness" of extended fields. This is something we will need to study in more detail before we settle on a final encoding of extended field type bytes.
In RION 1.0 there are no predefined extended field types. Therefore the table below is empty (for now). The "code" column in the following table is not the field type code in the lead byte, but the code in the type byte(s) following the lead byte.
Type | Code | Encoding | Description |
---|---|---|---|
Extended Field Types - Suggestions
The following extended field types are suggested field types, but which are not yet in use, and not yet implemented in RION Ops for Java. These field types will most likely change (!!!) so don't rely on them.
Type | Code | Encoding | Description |
---|---|---|---|
Complex-Type-Id | * | | Contains a longer complex type id, e.g. a Java class name. Not 100% finalized. |
Copy | * | Extended Short | Represents a reference to an RION field located earlier in the same RION data. Used e.g. to represent an object reference to another object, so circular object references can be represented. |
Reference-Back | * | Extended Short | Represents a reference to an RION field located earlier in the same RION data. Used e.g. to represent an object reference to another object, so circular object references can be represented. |
Reference-Forward | * | Extended Short | Represents a reference to an RION field located later in the same RION data. Used e.g. to represent an object reference to another object, so circular object references can be represented. Not 100% finalized. |
Cache-Reference | * | Extended Normal | Represents a reference to an RION field located in the cache of the other party communicating via the same network connection. Intended to be used in conjunction with IAP. Not 100% finalized. |
* | Extended Short | Represents a short reference (key <= 15 bytes) to an RION field located in the cache of the other party communicating via the same network connection. Intended to be used in conjunction with IAP. Not 100% finalized. | |
UTC-Time | * | Extended Short | Contains a time of day in UTC time (no time zones), or a duration. Not 100% finalized. |
Core Field Type Encodings
The core RION field types are those 15 field types that are not extended types. Extended types require 1 or 2 extra field type bytes, remember?
Bytes
The bytes field type is the most basic field in RION. A bytes field just contains an opaque sequence of bytes. You have no information about what these bytes represent. The bytes field type can be used to transfer files, voice data and other similar byte sequences, where knowing the exact data format is not necessary in order to transfer it across a network. The bytes field type can also be used as fallback when no other RION field types match the data you want to send.
A bytes field uses a Normal field encoding, so it consists of a lead byte, 0 to 15 length bytes and 0 to 2^120 value bytes.
Here is a RION Bytes field example (in hexadecimal notation):
01 05 0001020304
This example shows a RION Bytes field with the field type 0, length-length of 1 (01 lead byte), 1 length byte with the value 05 (value is 5 bytes long), and the 5 bytes (hex) 00 01 02 03 04.
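Writing such a Bytes field could look roughly like this. It is a sketch under assumptions: the class name is hypothetical, and an empty (non-null) value is written with 1 length byte of 0, by analogy with how empty UTF-8 strings are described in this text.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of RION Ops for Java.
class RionBytesWriter {

    static final int BYTES = 0; // core field type code for Bytes

    /** Encodes a Bytes field (Normal encoding). A null array becomes a null field. */
    static byte[] writeBytes(byte[] value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value == null) {
            out.write(BYTES << 4); // 0 length bytes means null
            return out.toByteArray();
        }
        // Assumption: use at least 1 length byte so an empty value is not read as null.
        int lengthLength = 1;
        for (int n = value.length >>> 8; n != 0; n >>>= 8) {
            lengthLength++;
        }
        out.write((BYTES << 4) | lengthLength);
        for (int i = lengthLength - 1; i >= 0; i--) {
            out.write((value.length >>> (8 * i)) & 0xFF); // most significant byte first
        }
        out.write(value, 0, value.length);
        return out.toByteArray();
    }
}
```

Calling `writeBytes(new byte[] {0, 1, 2, 3, 4})` produces the 7 bytes of the example above: 01 05 0001020304.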
Boolean
The Boolean field type uses a Tiny field encoding. It can contain either a value of 0 (null), 1 (true) or 2 (false).
Here are 3 RION Boolean field examples (in hexadecimal notation):
10 11 12
These three examples represent the three valid values for a Boolean field: 10 = null, 11 = true, 12 = false. The upper 4 bits (1) are the Boolean field type, and the lower 4 bits hold the value 0, 1 or 2.
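A Boolean field fits in a single byte, so encoding and decoding it is only a matter of setting and reading the lower 4 bits. A small sketch with hypothetical names:

```java
// Hypothetical helper, not part of RION Ops for Java.
class RionBoolean {

    static final int BOOLEAN = 1; // core field type code for Boolean

    /** Encodes null, true or false as a single Tiny lead byte. */
    static byte write(Boolean value) {
        int nibble = (value == null) ? 0 : (value ? 1 : 2);
        return (byte) ((BOOLEAN << 4) | nibble);
    }

    /** Decodes the lower 4 bits of a Boolean lead byte. */
    static Boolean read(byte leadByte) {
        switch (leadByte & 0x0F) {
            case 0:  return null;
            case 1:  return Boolean.TRUE;
            case 2:  return Boolean.FALSE;
            default: throw new IllegalArgumentException("Invalid Boolean value");
        }
    }
}
```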
Int64-Positive
The Int64 Positive field type can contain 0 to 8 bytes making up a max 64 bit unsigned number.
The Int64 Positive is a short field. Thus the length of the field value is written directly into the 4 least significant bits of the lead byte (a length of 0 means a null value).
Here is a RION Int64-Positive field example (in hexadecimal notation):
22 FFFF
This example shows a RION Int64-Positive field type 2, with a length of 2 (lead byte 22), and the value bytes FFFF (65,535 in decimal).
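The following sketch writes an Int64-Positive field using as few value bytes as possible. Two assumptions are made here: only non-negative Java long values are supported (so the top bit of a full 64 bit unsigned number is out of reach), and the value 0 is written with 1 value byte because a length of 0 means null. The names are hypothetical.

```java
// Hypothetical helper, not part of RION Ops for Java.
class RionIntPositive {

    static final int INT_POS = 2; // core field type code for Int64-Positive

    static byte[] write(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be >= 0");
        }
        // Assumption: 0 still takes 1 value byte, since length 0 means null.
        int length = 1;
        for (long v = value >>> 8; v != 0; v >>>= 8) {
            length++;
        }
        byte[] field = new byte[1 + length];
        field[0] = (byte) ((INT_POS << 4) | length);
        for (int i = 0; i < length; i++) {
            field[1 + i] = (byte) (value >>> (8 * (length - 1 - i))); // big-endian
        }
        return field;
    }

    /** Reads the value of a non-null Int64-Positive field starting at offset. */
    static long read(byte[] src, int offset) {
        int length = src[offset] & 0x0F;
        long value = 0;
        for (int i = 0; i < length; i++) {
            value = (value << 8) | (src[offset + 1 + i] & 0xFF);
        }
        return value;
    }
}
```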
Int64-Negative
The Int64 Negative field type can contain 0 to 8 bytes making up a max 64 bit unsigned number. Negative integers are encoded as positive integers using this simple formula:
encoded = absolute(negativeValue + 1);
You can calculate the encoded value like this:
encoded = -(negativeValue + 1);
Using this encoding -1 is encoded as 0, and -2^7 (= -128) is encoded as 2^7 - 1 (= 127) .
The reason that negative integers are sent across the wire as positive numbers is that two's complement negative numbers always take up the maximum number of bytes possible, because two's complement negative numbers need the most significant bit (MSB) set to 1 to mark the number as negative. That means 32 bits for a 32 bit integer and 64 bits for a 64 bit integer. By converting the negative values to positive values we can represent negative numbers with fewer bytes.
The Int64 Negative uses a Short field encoding. Thus the length of the field value is written directly into the 4 least significant bits of the lead byte (a length of 0 means a null value).
Here is a RION Int64-Negative field example (in hexadecimal notation):
32 FFFF
This example shows a RION Int64-Negative field type 3, with length 2 (lead byte 32), and the value bytes FFFF (65,535 in decimal). The value actually represents the negative value (hex) -010000 (-65,536 decimal).
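The formula above can be applied in both directions as sketched below. The class name is hypothetical, and as with positive integers this sketch assumes that an encoded value of 0 (representing -1) is written with 1 value byte.

```java
// Hypothetical helper, not part of RION Ops for Java.
class RionIntNegative {

    static final int INT_NEG = 3; // core field type code for Int64-Negative

    static byte[] write(long negativeValue) {
        if (negativeValue >= 0) {
            throw new IllegalArgumentException("value must be < 0");
        }
        long encoded = -(negativeValue + 1); // -1 -> 0, -128 -> 127
        int length = 1;
        for (long v = encoded >>> 8; v != 0; v >>>= 8) {
            length++;
        }
        byte[] field = new byte[1 + length];
        field[0] = (byte) ((INT_NEG << 4) | length);
        for (int i = 0; i < length; i++) {
            field[1 + i] = (byte) (encoded >>> (8 * (length - 1 - i))); // big-endian
        }
        return field;
    }

    /** Reads a non-null Int64-Negative field and reverses the formula. */
    static long read(byte[] src, int offset) {
        int length = src[offset] & 0x0F;
        long encoded = 0;
        for (int i = 0; i < length; i++) {
            encoded = (encoded << 8) | (src[offset + 1 + i] & 0xFF);
        }
        return -encoded - 1;
    }
}
```

Adding 1 before negating keeps the calculation free of overflow even for the smallest possible long value.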
Float
The Float field contains either 4 bytes or 8 bytes making up either a 32 bit or 64 bit floating point number. The Float field uses a short field encoding. That means that the number of bytes the Float field contains is stored directly in the 4 least significant bits of the lead byte. The Float field thus has no explicit length bytes.
The bits in the 32 bit or 64 bit floating point number correspond to the bits returned by Java's Float.floatToIntBits() and Double.doubleToLongBits() functions, which follow the IEEE 754 floating-point "single format" bit layout and the IEEE 754 floating-point "double format" bit layout.
Here are two RION Float field examples (in hexadecimal notation):
44 FFFFFFFF
48 AAAAAAAAFFFFFFFF
The first example contains a field type of 4, a length of 4 (lead byte 44), and a value of FFFFFFFF representing a 32 bit floating point number.
The second example contains a field type of 4, a length of 8 (lead byte 48), and a value of AAAAAAAAFFFFFFFF representing a 64 bit floating point number.
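A sketch of writing and reading Float fields with java.nio.ByteBuffer, which uses big-endian (network) byte order by default. The class name is hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of RION Ops for Java.
class RionFloat {

    /** 32 bit float: lead byte 0x44 followed by 4 value bytes. */
    static byte[] write(float value) {
        return ByteBuffer.allocate(5)
                .put((byte) 0x44)
                .putInt(Float.floatToIntBits(value))
                .array();
    }

    /** 64 bit float: lead byte 0x48 followed by 8 value bytes. */
    static byte[] write(double value) {
        return ByteBuffer.allocate(9)
                .put((byte) 0x48)
                .putLong(Double.doubleToLongBits(value))
                .array();
    }

    /** Reads a non-null Float field of either size, returned as a double. */
    static double read(byte[] src, int offset) {
        int length = src[offset] & 0x0F;
        ByteBuffer buffer = ByteBuffer.wrap(src, offset + 1, length);
        if (length == 4) return buffer.getFloat();
        if (length == 8) return buffer.getDouble();
        throw new IllegalArgumentException("Float fields have 4 or 8 value bytes");
    }
}
```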
UTF-8
The UTF-8 field can contain a variable length sequence of UTF-8 encoded characters. The UTF-8 field is a normal length field, meaning it has separate length bytes. Thus, the number of length bytes used to represent the length of the field is written into the 4 least significant bits of the lead byte. After the lead byte come the length bytes, and after that the UTF-8 encoded characters.
A null UTF-8 field is encoded with a lead byte that has 0 written into the 4 least significant bits of the lead byte. A null UTF-8 field is thus just 1 byte long.
An empty string is different from a null string. A UTF-8 field containing an empty string should have the length-length (the 4 least significant bits) set to the value 1 (= 1 length byte). Then the lead byte should be followed by a single length byte with the value 0. An empty string UTF-8 field should have no value bytes. Thus, an empty string UTF-8 field consists of 2 bytes: the lead byte + 1 length byte with the value 0.
Here is a RION UTF-8 field example (in hexadecimal notation):
51 0B 48656C6C6F20776F726c64
This example shows a field type of 5 (UTF-8), a length-length of 1 (1 length byte) (lead byte 51), a single length byte with the value 0B (11 decimal), and 11 value bytes with the value 48656C6C6F20776F726c64 - representing the text "Hello world" .
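The three cases described above (null, empty string, and ordinary text) can be sketched in one small writer. The class name is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of RION Ops for Java.
class RionUtf8 {

    static final int UTF_8 = 5; // core field type code for UTF-8

    static byte[] write(String text) {
        if (text == null) {
            return new byte[] { (byte) (UTF_8 << 4) }; // null: 0 length bytes
        }
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // An empty string still gets 1 length byte (with the value 0).
        int lengthLength = 1;
        for (int n = utf8.length >>> 8; n != 0; n >>>= 8) {
            lengthLength++;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write((UTF_8 << 4) | lengthLength);
        for (int i = lengthLength - 1; i >= 0; i--) {
            out.write((utf8.length >>> (8 * i)) & 0xFF); // most significant byte first
        }
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }
}
```

With this sketch, `write("Hello world")` yields the 13 bytes of the example above, `write("")` yields 51 00, and `write(null)` yields the single byte 50.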
UTF-8-Short
The UTF-8-Short field is like the UTF-8 field except it can only contain up to 15 bytes (not UTF-8 characters - bytes!). The UTF-8-Short field uses a short encoding, meaning the number of bytes contained in the UTF-8-Short field is written into the 4 least significant bits of the lead byte. A length of 0 means null.
A UTF-8-Short field is 1 byte shorter than the same string encoded using a UTF-8 field. Short strings are often transmitted over the wire, so having a more compact field type for short strings is often useful. Examples of short strings are telephone numbers, email addresses (short ones), zip codes, city names, first names, last names, product codes, serial numbers, hash values (short ones) etc.
Since a UTF-8-Short field never has any explicit length bytes, you cannot encode an empty string as a UTF-8-Short. An empty string is encoded using 1 length byte with the value 0. Thus, empty strings can only be encoded using UTF-8 fields.
Here is a RION UTF-8-Short field example (in hexadecimal notation):
6B 48656C6C6F20776F726c64
This example shows a field type of 6 (UTF-8-Short), a length of B (11 decimal) (lead byte 6B), and 11 value bytes with the value 48656C6C6F20776F726c64 - representing the text "Hello world" .
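Whether a string fits in a UTF-8-Short field depends on its encoded byte length, not its character count. The sketch below (hypothetical names) checks that before writing.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of RION Ops for Java.
class RionUtf8Short {

    static final int UTF_8_SHORT = 6; // core field type code for UTF-8-Short

    /** True if the text encodes to 1..15 UTF-8 bytes (empty strings do not fit). */
    static boolean fitsShort(String text) {
        int byteCount = text.getBytes(StandardCharsets.UTF_8).length;
        return byteCount >= 1 && byteCount <= 15;
    }

    static byte[] write(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        if (utf8.length < 1 || utf8.length > 15) {
            throw new IllegalArgumentException("Needs 1 to 15 UTF-8 bytes, use a UTF-8 field instead");
        }
        byte[] field = new byte[1 + utf8.length];
        field[0] = (byte) ((UTF_8_SHORT << 4) | utf8.length); // length goes in the lead byte
        System.arraycopy(utf8, 0, field, 1, utf8.length);
        return field;
    }
}
```

A string of 8 characters that each take 2 bytes in UTF-8 is 16 bytes long and therefore does not fit, even though it is well under 15 characters.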
UTC-Date-Time
The UTC-Date-Time field contains a UTC date (year, month, day) and optionally a time (hours, minutes, seconds, milliseconds / microseconds / nanoseconds).
The UTC-Date-Time will have no time zone information. All dates written into a UTC-Date-Time field should be represented as UTC date and time. Conversion to and from different time zones should happen when the UTC-Date-Time field is written and read. Do not transfer local times in the UTC-Date-Time field.
The UTC-Date-Time field uses a Short field encoding. It uses a binary date format which is similar to the textual ISO date format. The binary date format only uses half the bytes the textual ISO date format uses, and is faster to read and write.
The UTC-Date-Time field encodes date and time information like this:
Year | 2 bytes - values from 0 to 65535. |
Month | 1 byte - values from 1 to 12. |
Day of month | 1 byte - values from 1 to 31. |
Hour of day | 1 byte - values from 0 to 23. |
Minutes | 1 byte - values from 0 to 59. |
Seconds | 1 byte - values from 0 to 59 (60 when leap seconds occur). |
Milliseconds | 2 bytes - values from 0 to 999. |
Microseconds | 3 bytes - values from 0 to 999,999. |
Nanoseconds | 4 bytes - values from 0 to 999,999,999. |
The length of a UTC-Date-Time field specifies how much date and time information the field contains. If the length is 2 bytes, then the UTC-Date-Time field only contains a year. If the length is 3 bytes, then year + month, if the length is 4 bytes, then year + month + day etc.
As you can see, a date with year, month, day, hour, minutes and seconds takes 7 bytes to encode. Compare that to the same compressed ISO date string: 20150301235959 . The compressed ISO date string is 14 bytes. Exactly double of the RION encoding. In fact, a correct ISO date encoding must have a T between date and time, plus a Z at the end to signal "no time zone". That is a total of 16 bytes.
RION's date-time encoding is also more compressed when it comes to milliseconds, and it gives you the option to send microseconds and nanoseconds too. Something you cannot do with the ISO date format (as far as we know).
By the way, you should only provide sub-second time as either milliseconds, microseconds or nanoseconds. In other words, only as either 2, 3 or 4 bytes. Not as 2 + 3 + 4 bytes (this is not valid).
As you can see, there is no valid length of 8 bytes. It's either 2, 3, 4, 5, 6, 7, 9, 10 or 11. Right now a length of 8 bytes has no special meaning (it is simply invalid) but it could be used to represent a 64-bit integer (long) with the number of milliseconds since 1970. Similarly, a length of 12 bytes could be used to represent a 64-bit integer containing seconds, and a 32-bit integer containing nanoseconds (like Java's new date format does). These 8 and 12 byte representations are not yet decided on, though. You can express just as fine grained time without the 8 and 12 byte modes. They would only make it easier to convert to the internal date / time representations of your programming language.
Here is a RION UTC-Date-Time field example (in hexadecimal notation):
77 07E40101000000
This example shows a field type of 7 (UTC-Date-Time), a length of 7 (lead byte 77), and 7 value bytes with the value 07E40101000000 - representing the UTC date-time 2020-01-01 00:00:00 .
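A sketch of writing the 7 byte variant (year through seconds) from a java.time.OffsetDateTime. The value is converted to UTC first, as the text above asks for. The class name is hypothetical, and sub-second precision is left out to keep the sketch short.

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

// Hypothetical helper, not part of RION Ops for Java.
class RionUtcDateTime {

    /** Encodes year, month, day, hour, minute and second as 7 value bytes. */
    static byte[] write(OffsetDateTime dateTime) {
        // Convert to UTC before writing; local times must not be transferred.
        OffsetDateTime utc = dateTime.withOffsetSameInstant(ZoneOffset.UTC);
        int year = utc.getYear();
        return new byte[] {
            (byte) 0x77,                     // field type 7, length 7
            (byte) (year >>> 8), (byte) year, // year: 2 bytes, big-endian
            (byte) utc.getMonthValue(),
            (byte) utc.getDayOfMonth(),
            (byte) utc.getHour(),
            (byte) utc.getMinute(),
            (byte) utc.getSecond()
        };
    }
}
```

For instance, 2020-01-01 01:00:00 at offset +01:00 is the same instant as 2020-01-01 00:00:00 UTC, so it is written as 77 07E40101000000.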
Array (*)
The Array field is a normal length field just like Table and Object. The Array field is intended to contain lists of data of the same kind (but you could mix field types if you need / want that).
Please note: It is possible to model an array as a single column Table field, instead of via an Array field. Hence we are considering whether RION needs an explicit Array field type at all, or if the Table field type is enough. Therefore, stick to the Table field type if you can. The RION Array field type might be removed in the future.
The difference between the Array and Table field is that the Array field does not have any Key / Key-Short sequence in the beginning to represent the names of the columns like a Table has. An Array field just contains the value fields themselves. Each value field inside an Array field is considered independent from any other field in the same Array field. This is different from how Table and Object associate Key / Key-Short fields with value fields.
A non-null RION Array must contain a RION Int64-Positive field holding the number of elements (element count) in the array. This field must be the very first field inside the RION Array, before any of the element fields. Knowing the number of elements in an array makes it easier to allocate an array of the correct size when reading a RION Array into objects (e.g. Java objects).
Here is a RION Array field example (in hexadecimal notation):
A1 0B
21 03      // element count field
22 FFFF    // 1. element
22 0123    // 2. element
22 4567    // 3. element
This example shows a field type of A (10 decimal) (RION Array field type), a length-length of 1 (lead byte A1), a single length byte with the value 0B (11 decimal), and then 4 nested RION fields.
The first nested RION field is the element count field - a RION Int64-Positive field with the value 3, telling that the Array contains 3 elements (the nested RION fields following the element count field).
The following 3 RION fields are Int64-Positive fields with different numeric values.
The characters from // and to the right on each line are just comments. They are not actually included in the RION encoding of the above example.
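Assembling an Array field from already encoded element fields could be sketched as follows. To stay short, this sketch assumes fewer than 256 elements and a body shorter than 256 bytes, so that 1 count byte and 1 length byte are enough. The names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of RION Ops for Java.
class RionArrayWriter {

    /** Wraps already encoded element fields in an Array field. */
    static byte[] write(byte[]... encodedElements) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        body.write(0x21);                   // Int64-Positive with 1 value byte
        body.write(encodedElements.length); // element count comes first
        for (byte[] element : encodedElements) {
            body.write(element, 0, element.length);
        }
        if (encodedElements.length > 255 || body.size() > 255) {
            throw new IllegalArgumentException("This sketch only handles small arrays");
        }
        ByteArrayOutputStream field = new ByteArrayOutputStream();
        field.write(0xA1);                  // Array field type, 1 length byte
        field.write(body.size());           // length of everything nested inside
        field.write(body.toByteArray(), 0, body.size());
        return field.toByteArray();
    }
}
```

Passing the three Int64-Positive fields 22 FFFF, 22 0123 and 22 4567 gives back the 13 bytes of the example above.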
Table
The Table field is a normal length field just like the Object field. The Table field is intended for tabular data, similar to a CSV file, or lists of objects of the same type.
A Table field must contain an Int64-Positive as the very first element inside the RION Table. This Int64-Positive field must be located before any of the Key / Key-Short fields used to identify the columns of the table. This Int64-Positive must contain the number of rows in the Table field. Knowing the number of rows in the table makes it possible to allocate the right size array for the table elements before reading the elements.
After the row count Int64-Positive field a Table field should contain a sequence of Key or Key-Short fields which are the "column" names of the data in the table. After the sequence of Key / Key-Short fields should come a sequence of other RION fields. The RION fields following the Key / Key-Short fields are matched to the Key / Key-Short fields by their index. The first field belongs to the column of the first Key / Key-Short field, the second field belongs to the column of the second Key / Key-Short field etc.
There is no marker between the "rows" of a table. When the same number of fields as there are Key / Key-Short fields in the table have been read or written, that is interpreted as an implicit "row" boundary. For instance, if there are 10 Key / Key-Short fields, then every 10 fields following the Key / Key-Short fields belong to the same "row".
Tables are a compact way to send tabular data like CSV files, or lists of objects where all the objects are of the same type, and thus have the same Key / Key-Short fields representing their properties. The resulting size of a RION table compared to the corresponding array of objects formatted as JSON is often down to 1/3 or even 1/4, and can go even lower, depending on the type of data you are sending across, and the length of the property names in the objects.
Tables can contain both primitive and complex RION fields as values in the rows. Thus, you could even use a Table with nested Object and Table fields to represent a complex object graph more compactly.
A Table can contain a Complex-Type-Id field containing the type (e.g. Java class name) of the rows of the table. If used, the Complex-Type-Id field should be the very first field nested inside the Table field. However, the Complex-Type-Id field is optional.
Here is a RION Table field example (in hexadecimal notation):
B1 29
21 03                        // element count field
E3 010101 E3 020202 E3 030303  // 3 Key-Short column fields
22 FFFF 22 ABCD 22 0123      // 1. row with 3 Int64-Positive fields
22 0123 22 4567 22 89AB      // 2. row with 3 Int64-Positive fields
22 A0B1 22 C2D3 22 E4F5      // 3. row with 3 Int64-Positive fields
This example shows a field type of B (11 decimal) (RION Table field type), a length-length of 1 (1 length byte), a length byte with the value 29 (41 decimal), a row count element with the value 3 (3 rows in Table), 3 Key-Short fields specifying "keys" (i.e. column names) of each of the columns in this Table, and then 3 x 3 fields representing the values for each row and column of this Table. Thus, this Table has 3 rows and 3 columns.
The characters from // and to the right on each line are just comments. They are not actually included in the RION encoding of the above example.
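The implicit row boundaries can be recovered while reading by counting the Key fields first and then grouping the value fields. The sketch below returns the byte offset of each value field, grouped by row. It assumes the table only contains core field types and lengths that fit in an int; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of RION Ops for Java.
class RionTableReader {

    /** Total size in bytes of the core field starting at offset. */
    static int fieldSize(byte[] src, int offset) {
        int type = (src[offset] >> 4) & 0x0F;
        int low = src[offset] & 0x0F;
        switch (type) {
            case 1:
                return 1; // Tiny (Boolean)
            case 0: case 5: case 10: case 11: case 12: case 13: { // Normal
                int valueLength = 0;
                for (int i = 0; i < low; i++) {
                    valueLength = (valueLength << 8) | (src[offset + 1 + i] & 0xFF);
                }
                return 1 + low + valueLength;
            }
            default:
                return 1 + low; // Short
        }
    }

    static boolean isKey(byte lead) {
        int type = (lead >> 4) & 0x0F;
        return type == 13 || type == 14; // Key or Key-Short
    }

    /** Offsets of the value fields of a non-null Table, one inner list per row. */
    static List<List<Integer>> rows(byte[] table) {
        int end = fieldSize(table, 0);
        int pos = 1 + (table[0] & 0x0F); // skip lead byte and length bytes
        pos += fieldSize(table, pos);    // skip the row count field
        int columns = 0;
        while (pos < end && isKey(table[pos])) {
            columns++;
            pos += fieldSize(table, pos);
        }
        List<List<Integer>> rows = new ArrayList<>();
        List<Integer> row = new ArrayList<>();
        while (pos < end) {
            row.add(pos);
            pos += fieldSize(table, pos);
            if (row.size() == columns) { // implicit row boundary
                rows.add(row);
                row = new ArrayList<>();
            }
        }
        return rows;
    }

    /** Small utility to turn a hex string into bytes for experiments. */
    static byte[] hex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }
}
```

Applied to the example table above, `rows(...)` returns 3 rows with 3 offsets each.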
Object
The Object field is a normal length field meaning it consists of a lead byte, 0..15 length bytes and 0..2^120 value bytes. The number of length bytes is stored in the 4 least significant bits of the lead byte.
Inside an object you can nest other RION fields in any order you like. Thus, an Object is a mixed bag of whatever you want it to be. However, the Object field does impose a certain interpretation of certain fields and their order. This interpretation is explained in the following sections.
Note that an Object field does not have an element count field nested inside it. Only Array and Table have that.
To mimic object properties (property name + property value pairs) use a Key or Key-Short field followed by a primitive field. The Key or Key-Short field represents the property name, and the primitive field represents the property value.
By the time you start writing an Object field you may not know its final length in bytes. To work around that problem simply reserve a number of length bytes that you know for sure will be enough to represent the final length of the object. For instance, if you know for sure that the Object field will be less than 65,536 bytes, just reserve 2 length bytes before you start writing the fields inside the Object. Then, when you have finished writing all the fields inside the Object, jump back up and insert the length into the reserved length bytes.
Of course this strategy means that you need to write the whole RION file to a buffer before you can commit it to disk or send it over the network. However, knowing the length of a field upfront is a big advantage when reading a field, so this is one of the trade-offs we have made between read speed and write speed. Anyways, writing RION data is pretty fast and RION can be very compact compared to other formats (like JSON), so this little write delay is not as big a problem as it would be with other more verbose data formats.
Here is a RION Object field example (in hexadecimal notation):
C1 15 E3 010101 22 FFFF E3 020202 22 ABCD E3 030303 22 0123
This example shows a field type of C (12 decimal) (RION Object), a length-length of 1 (1 length byte), a length byte with the value of 15 (21 decimal), and then 3 Key-Short + Int64-Positive field pairs, making up the body of this RION Object field.
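The "reserve length bytes, then jump back" strategy described earlier in this section could be sketched with a java.nio.ByteBuffer like this. The sketch always reserves 2 length bytes, so its output starts with C2 and a 2 byte length rather than the C1 and 1 byte length used in the example above. The class name is hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of RION Ops for Java.
class RionObjectWriter {

    /** Writes an Object field around already encoded nested fields. */
    static byte[] writeObject(byte[]... encodedFields) {
        // 1 lead byte + 2 length bytes + at most 65,535 body bytes.
        ByteBuffer buffer = ByteBuffer.allocate(3 + 0xFFFF);
        buffer.put((byte) 0xC2);          // Object field type, 2 length bytes
        int lengthPosition = buffer.position();
        buffer.putShort((short) 0);       // placeholder for the length
        int bodyStart = buffer.position();
        for (byte[] field : encodedFields) {
            buffer.put(field);            // throws if the body exceeds the buffer
        }
        int bodyLength = buffer.position() - bodyStart;
        // Jump back and fill in the real length (big-endian by default).
        buffer.putShort(lengthPosition, (short) bodyLength);
        byte[] result = new byte[buffer.position()];
        buffer.flip();
        buffer.get(result);
        return result;
    }
}
```

The whole field has to sit in a buffer until the length is known, which is the write-side cost mentioned in the text.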
Key
A Key field is a normal length field that represents a property name of an Object or a column name in a Table. You could also use a Key field to represent a key in a hashtable.
A Key field can contain whatever you need it to, but it is common to use a sequence of UTF-8 characters (e.g. a property name in a Java class).
Here is a RION Key field example (in hexadecimal notation):
D1 04 6E616D65
This example shows a field type of D (13 decimal) (RION Key) , a length-length of 1 (lead byte D1), a length byte containing the value 04 (4 decimal), and the value 6E616D65 representing the value "name" (in ASCII). Thus, this example represents the key "name".
Key-Short
A Key-Short field is similar to a Key field except a Key-Short field can only contain up to 15 bytes as value. The length of the field value is encoded directly into the 4 least significant bits of the Key-Short lead byte.
Here is a RION Key-Short field example (in hexadecimal notation):
E4 6E616D65
This example shows a field type of E (14 decimal) (RION Key-Short), a length of 4 (lead byte E4) and the value 6E616D65 representing the value "name" (in ASCII). Thus, this example represents the short key "name".
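The Key-Short layout can be sketched in Java as shown below. The class RionKeyShort and its methods are hypothetical names used only for illustration, and the key is assumed to be UTF-8 text.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: encodes and decodes RION Key-Short fields. */
final class RionKeyShort {

    static byte[] encode(String name) {
        byte[] value = name.getBytes(StandardCharsets.UTF_8);
        if (value.length > 15) {
            throw new IllegalArgumentException("Key-Short can hold at most 15 bytes, use a Key field instead");
        }
        byte[] field = new byte[1 + value.length];
        field[0] = (byte) (0xE0 | value.length); // type E (Key-Short) + value length
        System.arraycopy(value, 0, field, 1, value.length);
        return field;
    }

    static String decode(byte[] field) {
        int lead = field[0] & 0xFF;
        if ((lead >> 4) != 0xE) {
            throw new IllegalArgumentException("Not a Key-Short field");
        }
        int valueLength = lead & 0x0F; // the 4 least significant bits hold the value length
        return new String(field, 1, valueLength, StandardCharsets.UTF_8);
    }
}
```

With this sketch, encode("name") yields E4 6E 61 6D 65, which is one byte shorter than the equivalent Key field because the length byte is gone.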
Extended RION Field Encodings
As mentioned earlier in this RION encoding document, RION can contain a set of fields that are encoded using extended encoding. Extended encoding means that the field type id in the lead byte will have the value "Extended" (15). The lead byte of an extended field is followed by 1 or 2 type ID bytes.
If the value of the first byte following the lead byte is between 16 and 127, then it is a single-byte field type id. We have not yet decided how to encode 2-byte field type ids, because RION currently has no extended fields.
The length-length bits of the lead byte (least significant 4 bits) mean the same as for core RION fields. They signal the length in bytes of the length (byte count) of the field value. Extended fields can also come in short and tiny encodings. In these encodings the length-length bits change meaning to the length in bytes of the field value (for extended short encodings) or to contain the value itself (extended tiny encodings). Note, that extended short and extended tiny fields still have a field type id byte following the lead byte.
If an extended field contains length bytes (extended normal encodings do), the length bytes follow the field type id byte(s).
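Since RION currently has no extended fields, the following Java sketch is purely illustrative. It shows how the header of an extended normal field with a single-byte type id could be read according to the layout described above. The class RionExtendedHeader, the example type id 16 and the most-significant-byte-first order of the length bytes are all assumptions.

```java
/** Hypothetical sketch: reads the header of an extended normal field with a 1-byte type id. */
final class RionExtendedHeader {
    final int typeId;       // extended field type id (16 to 127)
    final long valueLength; // number of value bytes
    final int valueOffset;  // index of the first value byte

    private RionExtendedHeader(int typeId, long valueLength, int valueOffset) {
        this.typeId = typeId;
        this.valueLength = valueLength;
        this.valueOffset = valueOffset;
    }

    static RionExtendedHeader parse(byte[] data) {
        int lead = data[0] & 0xFF;
        if ((lead >> 4) != 0xF) {
            throw new IllegalArgumentException("Lead byte does not have the Extended type (15)");
        }
        int lengthLength = lead & 0x0F; // same meaning as for core fields

        int typeId = data[1] & 0xFF;
        if (typeId < 16 || typeId > 127) {
            // 2-byte type ids are not yet defined, so this sketch rejects everything else.
            throw new IllegalArgumentException("Not a single-byte extended type id: " + typeId);
        }

        // The length bytes follow the type id byte (assumed most significant byte first).
        int position = 2;
        long valueLength = 0;
        for (int i = 0; i < lengthLength; i++) {
            valueLength = (valueLength << 8) | (data[position++] & 0xFF);
        }
        return new RionExtendedHeader(typeId, valueLength, position);
    }
}
```

For the made-up bytes F1 10 03 AA BB CC this sketch reports type id 16 and a value length of 3, with the value starting at index 3.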
Change Log
Here is a short log of changes to the RION encoding.
Array Field Type Might Be Removed
Closer analysis has revealed that it is possible to model the same data structure as a RION Array using a RION Table. An array is essentially a table with a single column. Since arrays can be modeled as single column tables, we might as well remove the Array field type, and reserve that field type code for another, more useful field type in the future.
Element Count Now an Int64-Positive
On April 8th 2017 the encoding of the element count of RION Array fields was changed, and an element count was made mandatory for RION Table fields too.
The element count of RION Array fields used to be an extended field type. It is now a mandatory Int64-Positive field. That means that the first field inside a non-null RION Array field must be an Int64-Positive field.
RION Table fields now also have a mandatory element count field as an Int64-Positive field. This field must also be the very first field inside the RION Table field, before any of the key fields.
Field Allocations Changed
On May 20th 2016 we made changes to the RION field encodings. We hope that this version will be the final encoding for RION v. 1.0. We have not changed the 6 encoding types (Normal, Short, Tiny, Extended Normal, Extended Short, Extended Tiny), but the allocation of field types to type codes has changed.
The field types "Copy" and "Complex Type ID Short" have been moved to extended field types. Furthermore, since they are not actually used by RION Ops, IAP or Stream Ops, we have temporarily "suspended" them. They will most likely return in a later version of RION.
Array, Table and Object have new field type codes, but keep their encodings.
The field type "Tiny" has been renamed back to "Boolean" and should now only be used to contain boolean values. The encoding for the Boolean field type is still the Tiny encoding type (1 byte), not to be confused with the former Tiny field type.
Two core field type codes are now unused - reserved for future core field types that need compact encodings.
Future versions / extensions of the RION encoding should be fully backwards compatible with this version, meaning we expect no further changes to the current field types and their type code allocations.
Negative Integer Encoding Changed
On January 13th 2017 we changed the encoding of negative integers (Int64-Negative). Rather than being encoded as the absolute value of the negative integer value, negative integers are now encoded as the absolute value minus 1.
The reason for subtracting 1 is to allow all negative numbers to be encoded. With the absolute-only encoding, the largest negative 64-bit integer value, -2^63, has the absolute value 2^63, which does not fit into a signed 64-bit value and would thus be hard to decode using a standard long (64 bit) variable. With the absolute(val) - 1 encoding all N-byte negative values can be encoded within N bytes.
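The effect of the absolute(val) - 1 encoding can be sketched in Java like this. The class RionNegativeInt is a hypothetical helper that only produces and reads the value bytes (without the lead byte), and it assumes the value bytes are stored most significant byte first.

```java
/** Hypothetical sketch: value bytes of a negative integer using the abs(value) - 1 encoding. */
final class RionNegativeInt {

    /** Returns the value bytes for a negative long, using as few bytes as possible. */
    static byte[] encode(long value) {
        if (value >= 0) {
            throw new IllegalArgumentException("Value must be negative");
        }
        // abs(value) - 1 computed as -(value + 1), which cannot overflow even for Long.MIN_VALUE.
        long encoded = -(value + 1);

        int length = 1;
        while (length < 8 && (encoded >>> (8 * length)) != 0) {
            length++;
        }
        byte[] bytes = new byte[length];
        for (int i = length - 1; i >= 0; i--) {
            bytes[i] = (byte) encoded;
            encoded >>>= 8;
        }
        return bytes;
    }

    /** Reverses encode(): reads the value bytes and restores the negative long. */
    static long decode(byte[] bytes) {
        long encoded = 0;
        for (byte b : bytes) {
            encoded = (encoded << 8) | (b & 0xFF);
        }
        return -encoded - 1;
    }
}
```

For example, -1 is encoded as the single byte 00, -256 still fits in the single byte FF, and Long.MIN_VALUE (-2^63) is encoded as the 8 bytes 7F FF FF FF FF FF FF FF, which a signed long can hold without overflow.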