Polymorph Data Introduction

Jakob Jenkov
Last update: 2022-02-15

Polymorph Data is the current name for the data format included in the Polymorph smart media platform project. It is a fairly advanced data format designed for use both in network protocols and data exchange, and as a data format for file storage. Polymorph Data is a binary data format, as this made it easier to meet the design goals (stated later).

The plan is to have both a purely binary encoding and an editable textual encoding, plus tools to convert between the two. To begin with, the editable encoding will most likely be tailored to express the Polymorph Data encoding efficiently and easily, but perhaps it will be able to express other types of binary encodings efficiently too. I have not yet decided that.

The binary and textual data encodings also have the alternate names Pinary (Polymorph Binary) and Tinary (Textual Binary). I am not sure if those names will stick - but in case you see them, at least you know what they refer to. Pinary files may use the extension .pin and Tinary files the extension .tin - so if you see these extensions in a Polymorph context, you have an idea of what kind of files they signify.

Polymorph Data Design Goals

The Polymorph Data format is designed according to the following design goals:

  • High versatility
  • High read and write performance
  • Compact encoding - small size

Polymorph Data Features

These design goals led to the Polymorph Data format having the following features:

  • High Versatility
    • Self describing - no schema needed.
    • Many core data types predefined.
    • Object data type (key -> value - like properties).
    • Cyclic + acyclic object graph support.
    • Tabular data type for CSV or DB result set style data.
    • Highly extensible - with custom fields.

  • Compact encoding
    • TLLV encoding with compact encodings for most data types.
    • Compact tabular data and tree structures via nested tables.
    • Copy or reference fields located earlier in a data block (e.g. a file or message) to reduce data redundancy.

  • Fast reading
    • Fast and easy to decode.
    • Streamable reading.
    • Partial readability.
    • Arbitrary hierarchical navigation.

  • Fast Writing
    • The compact encoding means fewer bytes to write, resulting in faster write performance.
    • The simple encoding requires fewer instructions to write than more complex encodings.
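
To make the compact encoding and the partial readability a bit more concrete, here is a minimal Python sketch of a TLLV-style field. The byte layout of Polymorph Data is not described here, so everything in the sketch is an assumption made for illustration: TLLV is read as type, length-of-length, length, value; the lead byte packs a 4-bit type code and a 4-bit length-of-length; and the type codes and function names are hypothetical. It is not the actual Polymorph Data encoding.

```python
# Hypothetical TLLV-style layout (NOT the real Polymorph Data encoding):
#   lead byte : high 4 bits = type code, low 4 bits = number of length bytes
#   length    : big-endian unsigned integer, using as few bytes as needed
#   value     : the raw value bytes

TYPE_BYTES = 0x1  # hypothetical type codes, chosen for this sketch only
TYPE_UTF8 = 0x2


def encode_field(type_code: int, value: bytes) -> bytes:
    """Encode one field as lead byte + length bytes + value bytes."""
    length = len(value)
    length_bytes = length.to_bytes((length.bit_length() + 7) // 8, "big")
    lead = (type_code << 4) | len(length_bytes)
    return bytes([lead]) + length_bytes + value


def decode_field(data: bytes, offset: int = 0):
    """Decode the field at offset; return (type_code, value, next_offset)."""
    lead = data[offset]
    type_code, length_length = lead >> 4, lead & 0x0F
    start = offset + 1 + length_length
    length = int.from_bytes(data[offset + 1:start], "big")
    return type_code, data[start:start + length], start + length


def skip_field(data: bytes, offset: int) -> int:
    """Return the offset of the next field without reading the value bytes."""
    length_length = data[offset] & 0x0F
    start = offset + 1 + length_length
    return start + int.from_bytes(data[offset + 1:start], "big")


# Usage: write two fields back to back, then jump straight to the second one.
block = encode_field(TYPE_UTF8, "hello".encode("utf-8")) + encode_field(TYPE_BYTES, b"\x00\x01")
second_type, second_value, end = decode_field(block, skip_field(block, 0))
```

In this sketch a short field costs only two bytes of overhead, and a reader can skip a field it does not care about by reading the length alone, which is the kind of property that makes partial and streamable reading cheap.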

Polymorph Data Types

Polymorph Data is encoded as fields. Each field has a data type. Polymorph Data field types fall into two categories:

  • Atomic types
  • Composite types

An atomic type is a field that contains a single value, for instance a single number, text or date.

A composite type is a field that contains other fields nested inside it.
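
The distinction between the two categories can be sketched as a small in-memory model in Python. The class names, attribute names and the helper function are hypothetical and only illustrate the nesting; they say nothing about how fields are laid out in bytes.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class AtomicField:
    type_name: str   # e.g. "Integer" or "UTF-8"
    value: object    # a single value, or None for null


@dataclass
class CompositeField:
    type_name: str   # e.g. "Object" or "Table"
    children: List["Field"] = field(default_factory=list)  # nested fields


Field = Union[AtomicField, CompositeField]


def count_atomic(f: Field) -> int:
    """Count the atomic fields in a field tree, recursing into composites."""
    if isinstance(f, AtomicField):
        return 1
    return sum(count_atomic(child) for child in f.children)


# Usage: an object holding one atomic field and one nested object.
person = CompositeField("Object", [
    AtomicField("UTF-8", "Jane"),
    CompositeField("Object", [AtomicField("Integer", 42), AtomicField("Boolean", None)]),
])
total = count_atomic(person)
```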

Atomic Data Types

Polymorph Data can express the following core atomic data types:

  • Boolean: A value of true, false or null.
  • Integer: Up to 64 bit integers (for now), or null.
  • Float: 32 and 64 bit floats (for now), or null.
  • Bytes: Up to 2^64 byte long byte sequences, or null.
  • UTF-8: Up to 2^64 byte long UTF-8 sequences, or null.
  • UTC: UTC time down to nanosecond precision.
  • Copy (*): Represents a "copy" of a field found up to 2^64 bytes earlier in the data block (* not yet decided if it will be included).
  • Reference: References a field found up to 2^64 bytes earlier in the data block. Used to express cyclic object graphs.
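
How a reference to an earlier field lets a cyclic object graph be written out can be sketched in Python as follows. This is a simplified illustration, not the Polymorph Data encoding: the sketch works on nested dicts instead of bytes, and the hypothetical `Ref` marker identifies an earlier object by its ordinal in writing order, whereas a Reference field in Polymorph Data points a number of bytes back in the data block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for a Reference field."""
    index: int  # ordinal of an object that was started earlier in the output


def encode(obj, seen=None):
    """Return a tree copy of obj in which repeated dicts become Ref markers."""
    seen = {} if seen is None else seen
    if not isinstance(obj, dict):
        return obj                     # atomic value, copied as-is
    if id(obj) in seen:
        return Ref(seen[id(obj)])      # already written: point back at it
    seen[id(obj)] = len(seen)          # register before visiting children
    return {key: encode(value, seen) for key, value in obj.items()}


def decode(tree, objects=None):
    """Rebuild the original graph, resolving Ref markers to shared dicts."""
    objects = [] if objects is None else objects
    if isinstance(tree, Ref):
        return objects[tree.index]
    if not isinstance(tree, dict):
        return tree
    result = {}
    objects.append(result)             # same ordinal as during encoding
    for key, value in tree.items():
        result[key] = decode(value, objects)
    return result


# Usage: two objects that point at each other.
a = {"name": "a"}
b = {"name": "b", "peer": a}
a["peer"] = b
tree = encode(a)     # finite tree; the cycle is closed by a Ref to object 0
copy = decode(tree)  # copy["peer"]["peer"] is copy
```

Without some form of back-reference, a naive writer would recurse forever on such a graph. With it, each object is written once and later occurrences cost only a small marker.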

Composite Data Types

Polymorph Data can express the following core composite data types:

  • Table: Tabular data set. Column names are included once, followed by all the rows in the table. Can express tree structures efficiently using tables nested in tables recursively.
  • Object: Object with properties modeled as key + value pairs. Can be used for maps (dictionaries) too. Can contain nested objects or tables recursively. Keys can be left out for compact object encoding if needed.
  • Key: The name of a property (field) inside an object.
  • Type (*): A user defined type of e.g. an object or block of data. E.g. a class name or mime type (* not yet decided if it will be included).
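
The idea behind the Table type, column names written once followed by the rows, can be sketched in Python like this. The function names are hypothetical, the sketch assumes every record has the same keys, and it works on in-memory lists rather than on the binary encoding.

```python
def to_table(records):
    """Split a list of same-shaped dicts into (column_names, rows)."""
    columns = list(records[0].keys()) if records else []
    rows = [[record[name] for name in columns] for record in records]
    return columns, rows


def from_table(columns, rows):
    """Rebuild the list of dicts from the column names and the rows."""
    return [dict(zip(columns, row)) for row in rows]


# Usage: the keys "id" and "name" are stored once instead of once per record.
records = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
columns, rows = to_table(records)
restored = from_table(columns, rows)
```

Compared with writing each record as an object with its own keys, the saving grows with the number of rows, since only the values are repeated.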

Note: The concrete encoding of composite fields is being reexamined at the moment - to see if it is possible to come up with a slightly more flexible encoding that allows for more composite field types - without adding too much overhead in terms of extra bytes to represent the extra field types.

While there are upper limits on the size of each individual field, there is no upper size limit for a file or stream of fields.

Polymorph Data vs. Other Data Formats

Polymorph Data is designed to be competitive with the following data formats:

  • MessagePack
  • CBOR
  • RION
  • ION (from Amazon)
  • Protobuf

MessagePack, CBOR, RION, ION and Polymorph Data all use a similar TLLV encoding. Polymorph uses an encoding that is more similar to MessagePack and CBOR, while having all the advanced data types of RION.

I have not yet benchmarked Polymorph Data, but performance should be similar to that of MessagePack, CBOR, RION and ION. The encoding is in some parts a bit simpler than RION, so read and write speed may be a little higher, but I will have to verify that before making any firm claims.
