Polymorph Data Introduction
Jakob Jenkov
Polymorph Data Encoding (PDE) is a binary data format that attempts to eliminate many of the limitations of other popular data formats such as XML, JSON, CSV, MessagePack, CBOR, Protobuf and Avro. With these limitations removed, Polymorph Data Encoding becomes applicable to a wider variety of use cases. That way you do not have to choose a data encoding for every single use case, but can stick with the same data encoding for far more of them.
Polymorph Data Encoding is an advanced data format designed for use in encoding of messages in network protocols, encoding of structured data during data exchange, and as structured data encoding for file storage. Furthermore, Polymorph Data Encoding has been designed to be able to function as a binary record stream encoding for use in record stream storage as well as record stream transport via a network protocol.
Since Polymorph Data Encoding is binary it can be hard to read and edit in a text editor. To alleviate this problem I have created a textual version of PDE which is called Polymorph Data Language (PDL). You can convert a PDE file to PDL and open it in a text editor for inspection. Or, you can edit a PDL file and convert it to PDE for efficient use in your applications.
The term Polymorph Data refers to both Polymorph Data Encoding (PDE) and Polymorph Data Language (PDL) as a single data expression mechanism.
Polymorph Data Primary Design Goals
Polymorph Data Encoding (PDE) is designed according to the following design goals:
- High versatility
- High read and write performance
- Compact encoding - small size
Polymorph Data Language is designed to be able to represent PDE consistently, and somewhat concisely, while also being reasonably easy and fast to parse. This is a trade-off between conciseness and ease of parsing. Neither will probably be achieved 100%, but both will be achieved reasonably well (80-90%) and will be improved over time.
Polymorph Data Features
These design goals led to the Polymorph data format with the following features:
- High versatility
  - Self describing - no schema needed.
  - Many commonly used data types predefined.
  - Able to contain raw binary data.
  - Object data type (key -> value - like properties).
  - Cyclic + acyclic object graph support.
  - Tabular data type for CSV or DB result set style data.
  - Able to contain both single data fields and streams of fields.
  - Highly extensible - with custom fields.
- Compact encoding
  - TLLV encoding with compact encodings for most data types.
  - Compact tabular data and tree structures via nested tables.
  - Copy or reference fields located earlier in a data block (e.g. a file or message) to reduce data redundancy.
- Fast reading
  - Fast and easy to decode.
  - Streamable reading.
  - Partial readability.
  - Arbitrary hierarchical navigation.
- Fast writing
  - The compact encoding means fewer bytes to write, resulting in faster write performance.
  - The simple encoding requires fewer instructions to write than more complex encodings.
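To give a feel for what a TLLV style encoding can look like, here is a minimal Java sketch. It assumes that TLLV means a lead byte carrying a type code and a length-of-length, followed by the length bytes and then the value bytes. The 4-bit split of the lead byte, the type code `TYPE_UTF8` and the class name `TllvSketch` are hypothetical and chosen for illustration only. This is not the actual PDE byte layout.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class TllvSketch {

    // Hypothetical type code for a UTF-8 field; not taken from the PDE specification.
    static final int TYPE_UTF8 = 0x5;

    /** Encodes one field as: lead byte (type << 4 | lengthLength), length bytes, value bytes. */
    static byte[] encode(int type, byte[] value) {
        int length = value.length;
        // Count the bytes needed to hold the length (0 for an empty value).
        int lengthLength = 0;
        for (int l = length; l != 0; l >>>= 8) {
            lengthLength++;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write((type << 4) | lengthLength);
        // Write the length with the most significant byte first.
        for (int i = lengthLength - 1; i >= 0; i--) {
            out.write((length >>> (8 * i)) & 0xFF);
        }
        out.write(value, 0, value.length);
        return out.toByteArray();
    }

    /** Returns the value bytes of the field that starts at the given offset. */
    static byte[] decodeValue(byte[] data, int offset) {
        int lengthLength = data[offset] & 0x0F;
        int length = 0;
        for (int i = 0; i < lengthLength; i++) {
            length = (length << 8) | (data[offset + 1 + i] & 0xFF);
        }
        byte[] value = new byte[length];
        System.arraycopy(data, offset + 1 + lengthLength, value, 0, length);
        return value;
    }

    public static void main(String[] args) {
        byte[] field = encode(TYPE_UTF8, "hello".getBytes(StandardCharsets.UTF_8));
        // 1 lead byte + 1 length byte + 5 value bytes
        System.out.println(field.length);
        System.out.println(new String(decodeValue(field, 0), StandardCharsets.UTF_8));
    }
}
```

In this sketch a short value costs only two bytes of overhead, and a reader can skip a field it is not interested in by reading the length and jumping past the value bytes.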
Polymorph Data Use Cases
The features of Polymorph Data might be easier to understand if you know what use cases they are designed for. Thus, here is a brief list of the use cases Polymorph Data is designed for:
- Structured data encoding for data structures such as objects, trees, graphs, tables and streams.
- Object graph structures - such as in XML and JSON files.
- Unnamed fields like in XML or inside JSON arrays.
- Key-value pairs of fields like in JSON objects or XML attributes.
- Acyclic + cyclic object graphs.
- Efficient tabular structures with a single set of column header names - such as in CSV files.
- Efficient tree structures - via tables nested in tables.
- Data encoding for streams of records that are easily appendable and replayable.
- Change or transaction log files.
- Event log files.
- Time series data.
- Data encoding for data files
- Document style files - like rich text documents, presentations and other rich media documents.
- Record style files - like CSV files.
- Stream style files.
- Application log files.
- Configuration files.
- Data encoding for network messages - containing structured data (object graphs, record sets or record streams).
- Easy replication of record streams across a network.
- Subscribe-notify style network communication - e.g. subscribe to a time series data set or RSS style feed.
- Broadcast and multicast of structured media data.
- P2P network routing
- Smart media
- Edge computing
- IoT
- Subscription to data streams - via subscribe-notify style network communication.
- Each record in a stream has an offset - from which you can resume subscription.
- Incremental replication of data sets - via record streams - both locally on the same machine, and via network.
- Each record in a stream has an offset - from which you can resume replication.
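The idea of resuming a subscription or a replication from a record offset can be sketched in Java as below. The sketch assumes that a record's offset is the byte position at which it starts in an append-only stream. The class `RecordStreamSketch` and its methods are hypothetical, keep everything in memory, and are not part of any Polymorph toolkit API.

```java
import java.util.ArrayList;
import java.util.List;

class RecordStreamSketch {

    /** A record together with the byte offset at which it starts in the stream. */
    static final class Entry {
        final long offset;
        final byte[] payload;

        Entry(long offset, byte[] payload) {
            this.offset = offset;
            this.payload = payload;
        }

        /** Offset of the record that follows this one. */
        long nextOffset() {
            return offset + payload.length;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private long endOffset = 0;

    /** Appends a record and returns the offset it was stored at. */
    long append(byte[] payload) {
        long offset = endOffset;
        entries.add(new Entry(offset, payload));
        endOffset += payload.length;
        return offset;
    }

    /** Returns all records that start at or after the given offset, in stream order. */
    List<Entry> readFrom(long fromOffset) {
        List<Entry> result = new ArrayList<>();
        for (Entry entry : entries) {
            if (entry.offset >= fromOffset) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        RecordStreamSketch stream = new RecordStreamSketch();
        stream.append(new byte[] {1, 2, 3}); // stored at offset 0
        stream.append(new byte[] {4, 5});    // stored at offset 3

        // A subscriber reads everything and remembers where to continue.
        List<Entry> firstBatch = stream.readFrom(0);
        long resumeAt = firstBatch.get(firstBatch.size() - 1).nextOffset();

        stream.append(new byte[] {6});       // stored at offset 5

        // After a reconnect the subscriber only receives what it has not seen yet.
        List<Entry> secondBatch = stream.readFrom(resumeAt);
        System.out.println(secondBatch.size() + " new record(s) from offset " + resumeAt);
    }
}
```

The subscriber only has to persist a single number, the offset to resume at, to continue where it left off. A replica can use the same number to fetch only the records it is missing.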
Polymorph StreamDB
The Java toolkit used to work with PDE and PDL will also contain a tool called Polymorph StreamDB - which is a stream database. StreamDB can be used to query streams of PDE fields efficiently - both via a query language and via a stream processing API. StreamDB is planned to be able to query streams stored locally as well as streams on remote StreamDB instances, or on services with a StreamDB compatible interface.
Exactly how StreamDB lands feature-wise is still somewhat uncertain, but I have lots of ideas for features that can reasonably easily be implemented on top of PDE.
Polymorph Data Types
Polymorph Data is encoded as fields. Each field has a data type. Polymorph Data field types fall into two categories:
- Atomic types
- Composite types
An atomic type is a field that contains a single value, for instance a single number, text or date.
A composite type is a field that contains other fields nested inside it.
Atomic Data Types
Polymorph Data can express the following core atomic data types:
Boolean | A value of true, false or null. |
Integer | Up to 64 bit integers (for now), or null. |
Float | 32 and 64 bit floats (for now), or null. |
Bytes | Up to 2^64 byte long byte sequences, or null. |
UTF-8 | Up to 2^64 byte long UTF-8 sequences, or null. |
UTC | UTC time down to nanosecond precision. |
Copy (*) | Represents a "copy" of a field found up to 2^64 bytes earlier in the data block (* not yet decided if it will be included). |
Reference | References a field found up to 2^64 bytes earlier in the data block. Used to express cyclic object graphs. |
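How a backwards Reference can close a cycle in an object graph is sketched below in Java. The sketch assumes that a reference stores the distance in bytes from the reference field back to the start of the target field, and that the decoder remembers already decoded objects by their start offset. Both assumptions, the offsets used, and the class `ReferenceSketch` are hypothetical and not taken from the PDE specification.

```java
import java.util.HashMap;
import java.util.Map;

class ReferenceSketch {

    /** An already decoded object. */
    static final class Node {
        final String name;
        Node next; // may point back to an earlier node, forming a cycle

        Node(String name) {
            this.name = name;
        }
    }

    // Decoded objects, keyed by the offset at which their field started.
    private final Map<Long, Node> nodesByOffset = new HashMap<>();

    /** Remembers a decoded object so later references can find it. */
    void register(long fieldOffset, Node node) {
        nodesByOffset.put(fieldOffset, node);
    }

    /** Resolves a reference located at referenceOffset that points distance bytes backwards. */
    Node resolve(long referenceOffset, long distance) {
        long targetOffset = referenceOffset - distance;
        Node target = nodesByOffset.get(targetOffset);
        if (target == null) {
            throw new IllegalArgumentException("No field starts at offset " + targetOffset);
        }
        return target;
    }

    public static void main(String[] args) {
        ReferenceSketch decoder = new ReferenceSketch();

        Node parent = new Node("parent");
        decoder.register(0, parent);   // parent object starts at offset 0

        Node child = new Node("child");
        decoder.register(12, child);   // child object starts at offset 12
        parent.next = child;

        // A Reference field at offset 20 pointing 20 bytes back closes the cycle child -> parent.
        child.next = decoder.resolve(20, 20);

        System.out.println(child.next.next.name); // prints "child"
    }
}
```

Because a reference only ever points backwards, the target has already been decoded by the time the reference is read, so the cycle can be resolved in a single pass over the data.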
Composite Data Types
Polymorph Data can express the following core composite data types:
Table | Tabular data set. Column names are included once, followed by all the rows in the table. Can express tree structures efficiently using tables nested in tables recursively. |
Object | Object with properties modeled as key + value pairs. Can be used for maps (dictionaries) too. Can contain nested objects or tables recursively. Keys can be left out for compact object encoding if needed. |
Key | The name of a property (field) inside an object. |
While there are upper limits on the size of each individual field, there is no upper size limit for a file or stream of fields.
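The difference between the Table and Object types described above can be illustrated with a small in-memory model in Java. A table keeps its column names once and stores each row as plain values, while the same data written as one object per row would repeat every key for every row. The class `TableSketch` is hypothetical and only models the idea. It says nothing about how PDE lays out a table in bytes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TableSketch {

    final List<String> columnNames;                  // stored once for the whole table
    final List<Object[]> rows = new ArrayList<>();   // each row holds values only

    TableSketch(List<String> columnNames) {
        this.columnNames = columnNames;
    }

    /** Adds a row; it must have exactly one value per column. */
    void addRow(Object... values) {
        if (values.length != columnNames.size()) {
            throw new IllegalArgumentException("Expected " + columnNames.size() + " values");
        }
        rows.add(values);
    }

    /** Rebuilds the key + value view of one row by pairing it with the shared column names. */
    Map<String, Object> rowAsObject(int rowIndex) {
        Map<String, Object> object = new LinkedHashMap<>();
        Object[] row = rows.get(rowIndex);
        for (int i = 0; i < row.length; i++) {
            object.put(columnNames.get(i), row[i]);
        }
        return object;
    }

    /** Number of key names that would be stored if every row were a separate object. */
    int keysIfStoredAsObjects() {
        return rows.size() * columnNames.size();
    }

    public static void main(String[] args) {
        TableSketch table = new TableSketch(List.of("id", "name"));
        table.addRow(1, "Anna");
        table.addRow(2, "Ben");
        table.addRow(3, "Cleo");

        System.out.println(table.rowAsObject(1)); // prints {id=2, name=Ben}
        System.out.println(table.columnNames.size() + " names in the table vs "
                + table.keysIfStoredAsObjects() + " as objects");
    }
}
```

The saving grows with the number of rows, which is why a tabular type suits CSV or database result set style data.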
Polymorph Data vs. Other Data Formats
Polymorph Data is designed to be competitive with the following data formats:
- MessagePack
- CBOR
- RION
- ION (from Amazon)
- Protobuf
- Avro
MessagePack, CBOR, RION, ION and Polymorph Data all use a similar TLLV encoding. Polymorph uses an encoding that is more similar to MessagePack and CBOR, while having all the advanced data types of RION.
I have not yet benchmarked Polymorph Data, but performance should be similar to that of MessagePack, CBOR, RION and ION. The encoding is in some parts a bit simpler than RION, so read and write speed may be a little higher, but I will have to verify that before making any firm claims.