RION Design Goals
Jakob Jenkov
RION is a binary data format which is flexible enough to encode a wide variety of data. When designing RION we had the following design goals:
- Fast
- Compact
- Expressive - Versatile
- Self describing
- Typed null values
- Partial parsability
- Arbitrary hierarchical navigation
- Cyclic Object Graphs
- Suitable as network protocol message format
- Independent of the network protocol
- Routable
- Easy to allocate memory for
- Easy to handle for servers
- Easy to handle for small devices
Fast
Performance is an important design goal for RION. Since we were reinventing a data format anyway, why not try to make it as fast as possible? We have tried that with RION, and our initial measurements look promising.
We have implemented a toolkit for working with RION called RION Ops for Java. This toolkit is available as open source. Our current performance measurements are based on the performance of RION Ops. To compare RION performance to JSON we have used the Jackson JSON parser which is one of the fastest JSON parsers out there.
Being a binary format, RION is naturally faster to read and write than textual formats. Booleans, integers, floating points and binary data are faster to read and write in binary form than in textual form. We have seen RION Ops run up to 10 times as fast as Jackson when reading and writing the same Java objects from / to JSON. On average though, expect a speed increase somewhere between 50% and 200%.
The speed improvement depends on the type and size of the data being serialized. The speed difference so far seems to be largest with small objects and types that don't serialize so well to text, like boolean and floating point variables. Jackson is pretty fast at serializing integers, so there the speed improvement is somewhere between 0% and 50% on average.
The exception is when reading and writing text - in which case RION should perform about the same as textual formats like JSON and XML. But even with text RION Ops has some built-in classes that can make it faster to read and write text. These techniques could also be used in a JSON parser - but they don't seem to be so far.
Read Speed vs Write Speed
In a few cases we have had to make trade-offs between read speed / flexibility and write speed / flexibility. In these situations we have typically looked at what the gain / loss is for both read and write speed and flexibility.
In cases where the speed gain for one action was significantly bigger than the speed loss for the other action, we have decided in favour of the speed gain.
In cases where the speed gained by one side is about equal to the speed lost by the other side, we have decided in favour of increased read speed. We have done that for the two following reasons:
First, we assume that on average RION messages will be read as many or more times than they are written. For example, you could write tabular data (similar to a CSV file) into RION files, and then read those files many times later. This could be the case with data files as well as with log files (sometimes at least).
This is also true of systems that route RION messages between a sender and a final receiver. A RION message can be read as a single, opaque block of bytes and thus forwarded really fast by intermediate nodes that don't need to process the data in the message.
Second, RION write speed is already higher than RION read speed, so by deciding in favour of read speed in 50-50 cases, the speed difference between reading and writing is evened out a bit.
Compact
In addition to being fast to read and write, RION was designed to be compact in serialized form. A compact data format can be transmitted faster over the network.
You might claim that compactness is not that important because you can just ZIP compress the data sent over the network. Textual data formats like JSON and XML compress quite well, so the actual difference in size of the data transmitted would be a lot smaller if JSON, XML and RION were all ZIP compressed.
However, if you send compressed data over a TLS connection (encrypted connection), your data communication might very well be vulnerable to the BREACH and CRIME attacks. Therefore it is currently (Nov. 2015) recommended to turn off compression when sending data over a TLS connection. Then, all of a sudden data compactness matters again.
The compression-over-TLS problem will probably be solved in the future. But even when it is, a compact data format is still an advantage, although a smaller one, as long as this compactness does not impact performance too much. It is still faster to unzip a smaller amount of data than a larger amount, and it is also easier to handle for the clients and servers (less memory required).
On average RION objects are 10-20% smaller than the corresponding JSON messages. How much exactly depends on what data is being sent. For instance, larger integers take more characters to encode as text than smaller integers. The same is true in RION.
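To make the point about integer sizes concrete, here is a minimal sketch that compares how many characters a non-negative integer takes as decimal text with the minimum number of bytes needed to hold the same value in binary. The class and method names are made up for this illustration. The sketch only counts raw value bytes and ignores whatever type and length overhead a real RION field adds, so it does not show the actual RION encoding.

```java
final class IntegerSizes {

    // Minimum number of bytes needed to hold a non-negative value in binary.
    static int binaryBytes(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("non-negative values only");
        }
        int bits = 64 - Long.numberOfLeadingZeros(value);
        return Math.max(1, (bits + 7) / 8);
    }

    // Number of characters in the decimal text form of the value.
    static int decimalChars(long value) {
        return Long.toString(value).length();
    }
}
```

For example, the value 1000000 takes 7 characters as text, while 3 bytes are enough to hold it in binary. Both numbers grow as the value grows, just at different rates.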
Compact Tabular Data
Sending tabular data, e.g. lists of objects over the network is a common use case. When serializing an array of objects to JSON, each object is serialized as property name + property value pairs. That means that the property names are repeated for every object.
To avoid the repetition of property names when serializing arrays of the same type of objects, RION has a special Table field type. The RION table data type only contains the property names once. After the property names the property values of all objects in the array are included in the same sequence as the property names. RION tables are thus similar in structure to CSV files with a single header row.
Additionally, there is no per-row indicator overhead. The fields for each row are just nested inside the Table in one long sequence of fields. The number of column fields in the Table specifies how many columns each row has.
RION Tables are much more compact than JSON object arrays. We have seen data sizes of less than 1/3 of their JSON counterparts. Exactly how much you save depends on the length of the property names.
RION tables are faster to write because you don't have to write the property names more than once. RION tables are also faster to read because the property values can be mapped to properties in the Java objects using an index rather than a property name. Using the index of the property value saves the reading of a property name + a hash table lookup per property. And being more compact, RION tables are also faster to transmit over a network.
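The table idea can be sketched as a small in-memory model: the column names are stored once, the values of all rows follow in one flat sequence, and a value is found by row and column index. The class TableSketch and its methods are hypothetical names chosen for this illustration. They are not part of RION Ops, and the sketch says nothing about how a Table field is laid out in bytes.

```java
import java.util.ArrayList;
import java.util.List;

final class TableSketch {

    // Column names are stored once, like a single CSV header row.
    private final String[] columns;

    // Values of all rows, one after the other, with no per-row marker.
    private final List<Object> values = new ArrayList<>();

    TableSketch(String... columns) {
        this.columns = columns;
    }

    void addRow(Object... row) {
        if (row.length != columns.length) {
            throw new IllegalArgumentException("expected " + columns.length + " values");
        }
        for (Object value : row) {
            values.add(value);
        }
    }

    // The column count alone tells us where each row starts and ends.
    int rowCount() {
        return values.size() / columns.length;
    }

    // Lookup by index: no property name is read and no hash table is consulted.
    Object get(int row, int column) {
        return values.get(row * columns.length + column);
    }
}
```

A reader that knows column 1 maps to a given property of its Java object can copy get(row, 1) straight into that property for every row.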
Compact Tree Structures
You can nest any RION field inside a Table field, including other Table fields. Thus, you can use nested Table fields to model a compact tree structure, where lists of "child" objects only have the column names repeated once per list.
Compact Object Structures
If you need maximum compactness of a single RION Object field (a set of key + value pairs), you can leave out the keys and only include the values. This corresponds to serializing a JSON object without property names, but only the property values themselves. You will need to know what property value corresponds to which property name yourself, though. You can do so via the index of the property value within the RION Object field. E.g. property value 0 is ID, value 1 is first name, value 2 is last name, value 3 is email etc.
Expressive - Versatile
RION was designed from the beginning to be very expressive. The less users would need to resort to other data formats the better. If we can all just exchange data in RION, we can use the same data parsers and generators. RION is capable of expressing everything that you can express in CSV, JSON and XML. That means that you could actually convert a CSV, JSON or XML file to a RION file without losing information.
Yes, that is also what they said about XML, but for XML it turned out not to work out. Textual data formats are naturally bad fits for sending binary data like numbers and files. We believe RION changes that, so now it is our turn to commit hubris with RION by claiming it as a general purpose data format.
RION is designed to be able to model these commonly used data types and data structures:
- Raw bytes
- Boolean fields
- Integer fields
- Floating point fields
- UTF-8 fields
- UTC Date-Time fields
- Binary file
- Stream of fields (unbounded)
- Array of fields (bounded)
- Tables (like CSV files)
- Map (key, value pairs)
- Objects with properties (key, value pairs)
- Object graphs with objects inside objects etc. (like JSON)
- Objects (elements), text and binary fields mixed (like XML)
RION has several primitive, single-value fields used to encode commonly occurring simple data types, like integers, floating points, UTF-8 strings etc.
RION also contains three composite field types: Array, Table and Object. These field types can have other RION fields nested inside them. Thus, you can create composite data structures by nesting e.g. Objects within Arrays, or Objects within Tables, or Tables within Tables etc.
Of course the built-in RION data types aren't right for every kind of data out there. In case you need to send something that RION does not support (e.g. MPEG, JPEG etc.), you can just send it as raw bytes (which RION supports). This has been a priority from the beginning - that users can default to opaque byte sequences when transmitting data that RION is not explicitly designed to encode. It is also possible to define your own object types. More on that later.
Self Describing
We wanted RION files / objects to be self describing. It should be possible to parse an RION file / object without having a schema for it, in the same way you can with a JSON file. This is possible with RION. You may not be able to see the exact semantic meaning of the data being transmitted, but you can see what fields and data types an RION object contains without the use of a schema.
Self describing data formats tend to be easier to work with during development, as you don't need to know the schema for the data to investigate it.
Typed Null Values
In some data formats the value null is its own "type", meaning a null value has no explicit type information attached to it, except it is null. In other words, you cannot see if it is a null integer, a null string, a null floating point, a null byte array, a null date-time etc. All you can see is, that the value is null.
To make RION as self describing as possible it should preferably have typed null values. In the current encoding every single field type can take on the value null without losing information about its type.
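As a rough in-memory illustration of what a typed null means, the sketch below keeps the type of a field separate from its value, so the type survives even when the value is null. The class TypedField and the names in its Type enum are assumptions made for this example. They mirror some of the field types listed earlier, but they are not the RION Ops API and they do not show how null is encoded in bytes.

```java
import java.util.Objects;

final class TypedField {

    // Hypothetical type names, loosely following the field types listed earlier.
    enum Type { BOOLEAN, INT64, FLOAT64, UTF8, BYTES, UTC_DATE_TIME }

    final Type type;     // always present, also for null values
    final Object value;  // null means "a null value of this type"

    private TypedField(Type type, Object value) {
        this.type = Objects.requireNonNull(type, "type");
        this.value = value;
    }

    static TypedField of(Type type, Object value) {
        return new TypedField(type, value);
    }

    static TypedField nullOf(Type type) {
        return new TypedField(type, null);
    }

    boolean isNull() {
        return value == null;
    }

    @Override
    public String toString() {
        return type + "(" + (value == null ? "null" : value) + ")";
    }
}
```

With this model a null integer and a null string are both null, yet a reader can still tell them apart by their type.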
Partial Parsability
Quite often a service may return more data than you actually need in your specific use case. This is especially true if the service is a public service provided by someone other than your own organization.
To be able to boost performance in such situations we wanted RION to be partially parsable. By partial parsability I mean that you don't need to parse all of a RION data structure in order to find the properties you need inside it.
The fact that RION uses a TLV encoding makes it easy to jump over all bytes of a field you don't need. If a field value is 15 bytes long, you can see that already in the first bytes of that field (lead byte + length bytes). If you don't need that field, you can just jump over those 15 value bytes and inspect the next RION field directly. This is not possible with a JSON encoded data structure. In JSON you would have to inspect every single one of the 15 value bytes to see where the field value ends.
This performance advantage is even bigger if the RION field you are skipping over is a composite RION field. Instead of having to inspect each of the fields nested inside that composite RION field, you can skip over all nested fields of the composite field. This is possible because composite fields also specify the full length in bytes of all nested fields in the beginning of the composite field.
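The skipping described above can be sketched in a few lines. The byte layout used here is a simplified assumption made for illustration only, not the official RION encoding: a lead byte whose high 4 bits hold a type code and whose low 4 bits hold the number of length bytes, followed by the length bytes (big-endian) and then the value bytes. The class and method names are hypothetical too, and lengths are assumed to fit in an int.

```java
final class TlvSkipper {

    // Type code of the field starting at 'offset' (assumed: high 4 bits of the lead byte).
    static int typeCode(byte[] data, int offset) {
        return (data[offset] & 0xFF) >>> 4;
    }

    // Offset of the first byte after the field starting at 'offset'.
    // Only the lead byte and the length bytes are read; the value bytes are never touched.
    static int nextFieldOffset(byte[] data, int offset) {
        int lengthByteCount = data[offset] & 0x0F;  // assumed: low 4 bits of the lead byte
        int valueLength = 0;
        for (int i = 0; i < lengthByteCount; i++) {
            valueLength = (valueLength << 8) | (data[offset + 1 + i] & 0xFF);
        }
        return offset + 1 + lengthByteCount + valueLength;
    }

    // Offset of the field with the given index, found by skipping the fields before it.
    static int offsetOfField(byte[] data, int index) {
        int offset = 0;
        for (int i = 0; i < index; i++) {
            offset = nextFieldOffset(data, offset);
        }
        return offset;
    }
}
```

Skipping a composite field works the same way under this assumption, because its length covers all the fields nested inside it.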
Arbitrary Hierarchical Navigation
The partial parsability enables another goal we had for RION - namely arbitrary hierarchical navigation. By arbitrary hierarchical navigation we mean that it should be possible to move quickly and easily in and out of composite data structures, e.g. tree structures. If after inspecting the first nested field of a parent field you realize you don't need that parent field, it should be possible to move quickly and easily out of the parent field's body, and move on to the next field at the same level as that parent composite field.
RION's consistent TLV encoding - also for composite fields - means that this is reasonably easy to achieve.
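Here is a minimal sketch of such navigation. It assumes a simplified layout that is not the official RION encoding: the high 4 bits of the lead byte are a type code, the low 4 bits are the number of length bytes, the length bytes are big-endian, and the body of a composite field is simply a sequence of nested fields. The class TlvNavigator and its method names are hypothetical.

```java
final class TlvNavigator {

    private final byte[] data;
    private int offset = 0;                        // start of the current field
    private final int[] parentEnds = new int[16];  // end offsets of the parents we have entered
    private int depth = 0;

    TlvNavigator(byte[] data) {
        this.data = data;
    }

    int offset() {
        return offset;
    }

    int type() {
        return (data[offset] & 0xFF) >>> 4;
    }

    private int lengthByteCount() {
        return data[offset] & 0x0F;
    }

    private int endOfCurrentField() {
        int count = lengthByteCount();
        int valueLength = 0;
        for (int i = 0; i < count; i++) {
            valueLength = (valueLength << 8) | (data[offset + 1 + i] & 0xFF);
        }
        return offset + 1 + count + valueLength;
    }

    // Move to the next field at the same level, skipping any nested fields.
    void next() {
        offset = endOfCurrentField();
    }

    // Step into the body of the current (composite) field and remember where it ends.
    void moveInto() {
        int end = endOfCurrentField();
        int bodyStart = offset + 1 + lengthByteCount();
        parentEnds[depth++] = end;
        offset = bodyStart;
    }

    // Leave the parent entered last and land on the field that follows it.
    void moveOutOf() {
        offset = parentEnds[--depth];
    }
}
```

After moveInto() and a look at the first nested field, a single moveOutOf() jumps past all remaining nested fields without inspecting them.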
Cyclic Object Graphs
Many popular data formats can only encode acyclic object graphs, meaning the objects in the encoded data can only have references going to objects found later in the encoded data, and not backwards towards objects found earlier in the encoded data.
We want RION to be able to encode cyclic object graphs. We imagine this to be possible by adding a Reference field to RION so that objects found later in an encoded RION block can reference "back" to objects found earlier in the encoded RION data.
References are not yet specified officially, but we have some workable ideas about how to implement a Reference field that is both compact and acceptably easy to work with.
Suitable as Network Protocol Message Format
As mentioned elsewhere, RION was originally designed as a data format for data exchange in distributed systems. Thus, RION is the message format used in our application level network protocol IAP. Therefore RION must naturally be suitable for this task. RION was designed with the needs of message formats, clients, servers, routers etc. in mind.
Independent of the Network Protocol
While RION was designed as part of IAP, RION is a data format that is independent of network protocols. Thus, you can use RION outside of IAP as an alternative data format to JSON, XML, YAML etc. As RION is reasonably compact and fast, using RION over HTTP might be a first step for organizations looking to switch to IAP from e.g. HTTP/JSON, SOAP/XML etc.
You could even use RION as a file format. As you will see later, RION is a pretty good alternative to CSV files. You could also use RION as a log file format. It would be pretty fast to scan through RION records in a file.
Routable
Since RION is to be used in IAP, a message oriented network protocol, it was naturally important that RION messages are easy to route for intermediary nodes. Since RION messages are self describing it is easy to see when an RION message starts and ends. It is also easy to read an RION message partially, or wrap it in another RION message for tunneling.
By "routing" we mean routing at the application level - not at the IP level. Regular IP routing does not require inspection of the data inside the IP frame. However, if you are implementing Relay Servers, P2P network, Onion Routing or similar technologies, your custom routing logic might need to inspect the message to find out where to route it.
A message format that requires a schema to interpret it (like Protobuf or Avro) would not be so easily routable at the application layer. The intermediate routers would have to know the schema in order to be able to decode the message and route it. This is not ideal. Instead, you could wrap such a data format in a self describing data format like RION, but then why not just keep all of the data in RION? Why have the hassle of having to deal with multiple data formats?
Easy to Allocate Memory For
RION should be easy to allocate memory for. By that, I mean it should be easy to see from the beginning of an inbound RION message how much memory the whole message will take up, so that it can be allocated in one go.
RION's consistent TLV encoding - also for composite fields - makes it easy to see from the first few bytes of a RION encoded message how many bytes the whole message takes up. Thus you can allocate the exact needed amount of bytes for it - or find a reusable byte buffer of an appropriate size for it (not too small, not too big).
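A sketch of this idea follows. As before, the header layout is a simplified assumption and not the official RION encoding: the low 4 bits of the lead byte give the number of length bytes, which are big-endian and assumed to fit in an int. MessageAllocator and its methods are hypothetical names. ByteBuffer is the standard java.nio class.

```java
import java.nio.ByteBuffer;

final class MessageAllocator {

    // Total size in bytes of the message whose first bytes are in 'received',
    // or -1 if not enough header bytes have arrived yet to tell.
    static int totalLength(ByteBuffer received) {
        if (received.remaining() < 1) {
            return -1;
        }
        int start = received.position();
        int lengthByteCount = received.get(start) & 0x0F;  // assumed: low 4 bits of the lead byte
        if (received.remaining() < 1 + lengthByteCount) {
            return -1;
        }
        int valueLength = 0;
        for (int i = 0; i < lengthByteCount; i++) {
            valueLength = (valueLength << 8) | (received.get(start + 1 + i) & 0xFF);
        }
        return 1 + lengthByteCount + valueLength;
    }

    // Allocates one contiguous buffer of exactly the size the whole message needs.
    static ByteBuffer allocateFor(ByteBuffer received) {
        int total = totalLength(received);
        if (total < 0) {
            throw new IllegalStateException("header not complete yet");
        }
        return ByteBuffer.allocate(total);
    }
}
```

A server could use the same number to pick a pooled buffer of a suitable size instead of allocating a new one.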
Easy to Handle For Servers
RION should be easy to handle for servers that receive massive amounts of messages. By "handle" we refer to a few different aspects of server design.
First of all it should be easy to know when a message starts and when a full message has been received without having to look at the whole message. This is easily possible with RION messages.
Second it should be easy to allocate the correct amount of memory for an RION message. Furthermore, an RION message should be fully containable in a single contiguous memory area. This makes it faster / easier to allocate and deallocate memory for the message, and faster to process the message too (the whole message might fit into the L1, L2 or L3 caches of the server). I have mentioned that earlier as being easy to allocate memory for. This is one of the reasons why.
Third, it should be possible to read only part of a message without having to read the full message. Reading a message partially makes it easier to implement a multi-step message processing pipeline where each step parses more and more of the message, and passes it on to the correct subsystem in the server. This is also reasonably easy to do with RION messages. It is similar to what I have referred to earlier as partial parsability and arbitrary hierarchical navigation.
Easy to Handle For Small Devices
A network protocol targeting small devices, like Internet of Things (IoT) devices, should have a data and message format that is easy to handle for small devices too, not just for big servers. Small data and message sizes, fast read and write times, and easy memory management are key for small devices.