Evolving a format
Modern distributed systems consist of dozens of different services, and each of them needs to communicate with the others to fulfill its tasks. Without a well-known and well-understood format, things can become messy once data needs to go over the wire.
The first part of this article compares the textual formats CSV and JSON with the binary-based Avro and lays out how each can be evolved on a format level. The second part explains how the same can be achieved with versions, with examples based on REST.
This article uses the Todo object usually found in many posts of this blog - if you haven’t seen it before, you can find an OpenAPI specification here: https://blog.unexist.dev/redoc/#tag/Todo
Picking a format
Picking a format should be fairly easy: Anything that can be parsed on the receiving end is a proper format, so why not just serialize (or encode) our object values separated by commas (CSV)?
First,Bla,true,2022-11-21,0
On the implementation side we don’t want to needlessly roll our own CSV code, so after a quick check we settle on a hopefully mature CSV parser our programming language provides. And once wired up and deployed, our solution works splendidly, until a change is necessary.
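Before any change hits us, the writing side really is trivial - here is a minimal sketch, assuming a plain Java setup and made-up names, that produces exactly the line from above:

import java.util.StringJoiner;

public class TodoCsvWriter {
    // Naive happy-path serialization - quoting and escaping are
    // exactly the parts we delegate to a mature CSV library
    static String toCsv(final String title, final String description,
            final boolean done, final String dueDate, final long id) {
        return new StringJoiner(",")
                .add(title)
                .add(description)
                .add(String.valueOf(done))
                .add(dueDate)
                .add(String.valueOf(id))
                .toString();
    }

    public static void main(final String[] args) {
        // Prints: First,Bla,true,2022-11-21,0
        System.out.println(toCsv("First", "Bla", true, "2022-11-21", 0L));
    }
}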
In a previous post we talked about Thucydides, but other Greeks like Heraclitus were also sages:
Change is the only constant in life.
Dealing with change
It is probably easy to imagine a scenario that requires a change of our format:
- A new requirement for additional fields
- Some data type needs to be changed
- Removal of data due to regulations
...the list is endless!
How do our formats fare with these problems?
Martin Kleppmann also compares various binary formats in his seminal book Designing Data-Intensive Applications.
Good ol' CSV
This is probably a bit unfair - we all know CSV isn’t on par with the other formats - but maybe there is a surprise waiting, and we also want to stay in line with the format of the post, right?
Add or remove fields
Adding fields to or removing fields from our original version is really difficult - readers must be able to match the actual fields, and any change (even of the order) makes their lives miserable.
A straightforward solution here is just to include the names of the fields in a header - this is pretty common and probably (in)famously known from Excel:
title,description,done,dueDate,id
First,Bla,true,2022-11-21,0
We ignore the fact that values themselves can also include commas, for example, and assume our lovely CSV parser handles these cases perfectly well.
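With a header in place, a reader can look fields up by name instead of position. A small sketch with Apache Commons CSV - one mature parser among many, the choice here is just an assumption - might look like this:

import java.io.StringReader;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class TodoCsvReader {
    public static void main(final String[] args) throws Exception {
        final String csv = "title,description,done,dueDate,id\n"
                + "First,Bla,true,2022-11-21,0\n";

        final CSVFormat format = CSVFormat.DEFAULT.builder()
                .setHeader() // Use the first record as header
                .setSkipHeaderRecord(true)
                .build();

        // Lookup by header name - the column order doesn't matter anymore
        for (final CSVRecord record : format.parse(new StringReader(csv))) {
            System.out.println(record.get("title")); // Prints: First
        }
    }
}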
Change of data types
Figuring out the data type of the values is up to the reader, since we omit all information about data types.
This kind of definition can usually be done with a schema, which basically describes the format including data types and also allows some form of verification of values.
Surprisingly, something like this already exists for CSV, so let me introduce you to CSV Schema.
The schema itself is straightforward and comes with lots of keywords, like positiveInteger, regex to provide arbitrary regular expressions, or is to construct enumerations:
version 1.0
@totalColumns 5
title: regex("[-/0-9\w\s,.]+")
description: regex("[-/0-9\w\s,.]+")
done: is("true") or is("false")
dueDate: regex("[0-9]{4}-[0-9]{2}-[0-9]{2}")
id: positiveInteger
The full specification can be found here: http://digital-preservation.github.io/csv-schema/csv-schema-1.1.html
Using a schema to verify input is nice, but the major advantage here is that the format can now be formally specified and put under version control. If kept close to the code and updated whenever something has to be changed, this specification acts as living documentation and eases the life of new implementers.
Another useful benefit is that your schema might be supported by one of the available schema registries like Apicurio. Although it might be difficult to find one that actually supports CSV Schema, there is plenty of support for other types.
Complex data types
There is no support for complex or nested types at all, so at least this cannot become a problem.
Textual with JSON
JSON probably requires no lengthy introduction: Shortly after its introduction as an object notation for JavaScript, it rightfully got lots of attention and is nowadays pretty much the default.
If we look back at our example, a converted version might look like this:
{
  "title": "First",
  "description": "Bla",
  "done": true,
  "dueDate": "2022-11-21",
  "id": 0
}
Add or remove fields
Adding or removing fields is pretty easy, due to the object nature of JSON.
Fields can be accessed by name like title, and there are some decent strategies for dealing with change, like returning null for non-existing fields.
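A small sketch with Jackson - again just one JSON library among many - shows both cases:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class TodoJsonReader {
    public static void main(final String[] args) throws Exception {
        final JsonNode todo = new ObjectMapper().readTree(
                "{\"title\": \"First\", \"description\": \"Bla\", \"done\": true}");

        // Known field: access by name
        System.out.println(todo.get("title").textValue()); // Prints: First

        // Non-existing field: get() just returns null instead of failing
        System.out.println(todo.get("dueDate")); // Prints: null
    }
}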
Change of data types
Data types in JSON are a bit more tricky, and there are problems similar to the CSV version from above. Especially numeric types can be troublesome if we require a specific precision.
So why reinvent the wheel, when we already know a solution? Yes, another schema - namely JSON Schema:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "title": {
      "type": "string"
    },
    "description": {
      "type": "string"
    },
    "done": {
      "type": "boolean"
    },
    "dueDate": {
      "type": "string"
    },
    "id": {
      "type": "integer"
    }
  },
  "required": [
    "title",
    "description"
  ]
}
We are lazy, so the above schema was generated with https://www.liquid-technologies.com/online-json-to-schema-converter
This pretty much solves the same problems, but also provides some means to mark fields as required or entirely optional. This is a double-edged sword and should be considered as such, because removing a previously required field can be troublesome for compatibility in any direction - let me explain:
Consider your application only knows the schema from above: What happens if you feed it an evolved version that is basically the same, but replaces the required field description with a new field summary?
This will ultimately fail every time, because the required field cannot be found.
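To see this failure in action, here is a small sketch based on the everit-org/json-schema library - the library choice and the field summary are just assumptions for illustration:

import org.everit.json.schema.Schema;
import org.everit.json.schema.ValidationException;
import org.everit.json.schema.loader.SchemaLoader;
import org.json.JSONObject;

public class TodoJsonValidator {
    public static void main(final String[] args) {
        // Shortened version of the schema from above
        final Schema schema = SchemaLoader.load(new JSONObject(
                "{\"type\": \"object\", \"required\": [\"title\", \"description\"]}"));

        try {
            // Evolved payload: description has been replaced with summary
            schema.validate(new JSONObject(
                    "{\"title\": \"First\", \"summary\": \"Bla\"}"));
        } catch (final ValidationException e) {
            System.out.println(e.getMessage()); // Complains about the missing required field
        }
    }
}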
And in contrast to a CSV Schema, the JSON Schema is supported by Apicurio and can be stored there and also retrieved from it.
Complex data types
Objects in JSON can nest other objects and also some special forms like lists. This allows some nice trees and doesn’t limit us to flat structures like in CSV:
{
  "title": "First",
  "description": "Bla",
  "done": true,
  "dueDate": {
    "start": "2022-11-21",
    "due": "2022-11-23"
  },
  "id": 0
}
Unfortunately this introduces another case which requires special treatment: Applications might expect a specific type like string and just find an object.
This can be handled fairly easily, because most of the JSON parsers out there allow naming a specific type that should be fetched from an object:
String content = todo.get("dueDate").textValue(); (1)
1 | Be careful, the return value might surprise you: textValue() returns null here, because dueDate is an object now and no longer a textual node.
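Continuing the Jackson sketch from above, a slightly more defensive reader could check the node type first before picking a value:

final JsonNode dueDate = todo.get("dueDate");

// Old format: plain string - new format: nested object with start and due
final String start = dueDate.isObject()
        ? dueDate.get("start").textValue()
        : dueDate.textValue();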
Avro and the binary
Avro is an entirely different beast and, for a change, probably needs a bit of explanation first. Originally designed for the special use cases of Hadoop, it quickly found other fields of application, like usage with Kafka, due to the small footprint of its binary form and its compression codecs.
The base mode of operation is a bundled and encoded form, which includes the schema along with the actual data in binary and looks rather interesting in a hex view:
$ xxd todo.avro
00000000: 4f62 6a01 0416 6176 726f 2e73 6368 656d Obj...avro.schem
00000010: 61a8 037b 2274 7970 6522 3a22 7265 636f a..{"type":"reco (1)
00000020: 7264 222c 226e 616d 6522 3a22 5265 636f rd","name":"Reco
00000030: 7264 222c 2266 6965 6c64 7322 3a5b 7b22 rd","fields":[{"
00000040: 6e61 6d65 223a 2274 6974 6c65 222c 2274 name":"title","t
00000050: 7970 6522 3a22 7374 7269 6e67 227d 2c7b ype":"string"},{
00000060: 226e 616d 6522 3a22 6465 7363 7269 7074 "name":"descript
00000070: 696f 6e22 2c22 7479 7065 223a 2273 7472 ion","type":"str
00000080: 696e 6722 7d2c 7b22 6e61 6d65 223a 2264 ing"},{"name":"d
00000090: 6f6e 6522 2c22 7479 7065 223a 2262 6f6f one","type":"boo
000000a0: 6c65 616e 227d 2c7b 226e 616d 6522 3a22 lean"},{"name":"
000000b0: 6475 6544 6174 6522 2c22 7479 7065 223a dueDate","type":
000000c0: 2273 7472 696e 6722 7d2c 7b22 6e61 6d65 "string"},{"name
000000d0: 223a 2269 6422 2c22 7479 7065 223a 226c ":"id","type":"l
000000e0: 6f6e 6722 7d5d 7d14 6176 726f 2e63 6f64 ong"}]}.avro.cod (2)
000000f0: 6563 086e 756c 6c00 dd2c f589 e9ad 358b ec.null..,....5.
00000100: 7557 a016 a861 8c60 022e 0a46 6972 7374 uW...a.`...First (3)
00000110: 0642 6c61 0114 3230 3232 2d31 312d 3231 .Bla..2022-11-21
00000120: 00dd 2cf5 89e9 ad35 8b75 57a0 16a8 618c ..,....5.uW...a.
00000130: 60
1 | The schema block at the top
2 | Our example is uncompressed, therefore the null codec has been selected
3 | And the data block at the end
If we now step through the output of xxd, we can clearly see that it starts with the schema block in plain JSON, which is then followed by the actual encoded data at the end. The data itself doesn’t include any field names or tag numbers like in Thrift or Protobuf and is separated by control characters - this somehow resembles CSV and could even be displayed as such.
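For the sake of completeness, here is a minimal sketch of how such a container file can be written with the official Avro library for Java - the file name todo.avsc is an assumption and is supposed to hold the schema extracted in the next section:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class TodoAvroWriter {
    public static void main(final String[] args) throws Exception {
        final Schema schema = new Schema.Parser().parse(new File("todo.avsc"));

        final GenericRecord todo = new GenericData.Record(schema);
        todo.put("title", "First");
        todo.put("description", "Bla");
        todo.put("done", true);
        todo.put("dueDate", "2022-11-21");
        todo.put("id", 0L);

        // Writes schema and data into a single container file
        try (DataFileWriter<GenericRecord> writer = new DataFileWriter<>(
                new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("todo.avro"));
            writer.append(todo);
        }
    }
}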
Add or remove fields
The IDL of the schema supports various advanced options which are better explained in its spec, but the extracted and formatted version looks like this:
{
  "type": "record",
  "name": "Record",
  "fields": [
    {
      "name": "title",
      "type": "string"
    },
    {
      "name": "description",
      "type": "string"
    },
    {
      "name": "done",
      "type": "boolean"
    },
    {
      "name": "dueDate",
      "type": "string"
    },
    {
      "name": "id",
      "type": "long"
    }
  ]
}
This means the schema is strictly required by the reader to make sense of the data block. And to make things a bit more complex, the schema can be omitted if the reader already knows it or has other means to fetch it, like from the previously mentioned registry.
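Reading works the other way around - a sketch with the same library, where the reader picks the schema right from the file header:

import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class TodoAvroReader {
    public static void main(final String[] args) throws Exception {
        // No schema passed here - it is read from the container file itself
        try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
                new File("todo.avro"), new GenericDatumReader<GenericRecord>())) {
            for (final GenericRecord todo : reader) {
                System.out.println(todo.get("title")); // Prints: First
            }
        }
    }
}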
Change of data types
With this in place, the same rules apply here that were valid for our CSV version. Changing order or whole fields should be no problem, as long as the schema is known to the reader.
Complex data types
Avro is a bit of a mix of both of our textual formats, and in this regard it behaves like JSON when it comes to complex types.
Let’s have a quick glance at the output of xxd of the evolved version:
$ xxd todo-evolved.avro
00000000: 4f62 6a01 0416 6176 726f 2e73 6368 656d Obj...avro.schem
00000010: 619c 057b 2274 7970 6522 3a22 7265 636f a..{"type":"reco (1)
00000020: 7264 222c 226e 616d 6522 3a22 5265 636f rd","name":"Reco
00000030: 7264 222c 2266 6965 6c64 7322 3a5b 7b22 rd","fields":[{"
00000040: 6e61 6d65 223a 2274 6974 6c65 222c 2274 name":"title","t
00000050: 7970 6522 3a22 7374 7269 6e67 227d 2c7b ype":"string"},{
00000060: 226e 616d 6522 3a22 6465 7363 7269 7074 "name":"descript
00000070: 696f 6e22 2c22 7479 7065 223a 2273 7472 ion","type":"str
00000080: 696e 6722 7d2c 7b22 6e61 6d65 223a 2264 ing"},{"name":"d
00000090: 6f6e 6522 2c22 7479 7065 223a 2262 6f6f one","type":"boo
000000a0: 6c65 616e 227d 2c7b 226e 616d 6522 3a22 lean"},{"name":"
000000b0: 6475 6544 6174 6522 2c22 7479 7065 223a dueDate","type":
000000c0: 7b22 7479 7065 223a 2272 6563 6f72 6422 {"type":"record"
000000d0: 2c22 6e61 6d65 7370 6163 6522 3a22 5265 ,"namespace":"Re
000000e0: 636f 7264 222c 226e 616d 6522 3a22 6475 cord","name":"du
000000f0: 6544 6174 6522 2c22 6669 656c 6473 223a eDate","fields":
00000100: 5b7b 226e 616d 6522 3a22 7374 6172 7422 [{"name":"start"
00000110: 2c22 7479 7065 223a 2273 7472 696e 6722 ,"type":"string"
00000120: 7d2c 7b22 6e61 6d65 223a 2264 7565 222c },{"name":"due",
00000130: 2274 7970 6522 3a22 7374 7269 6e67 227d "type":"string"}
00000140: 5d7d 7d2c 7b22 6e61 6d65 223a 2269 6422 ]}},{"name":"id"
00000150: 2c22 7479 7065 223a 226c 6f6e 6722 7d5d ,"type":"long"}]
00000160: 7d14 6176 726f 2e63 6f64 6563 086e 756c }.avro.codec.nul
00000170: 6c00 d313 7980 7ecf 4645 6249 ddd7 08a1 l...y.~.FEbI....
00000180: 070a 0244 0a46 6972 7374 0642 6c61 0114 ...D.First.Bla.. (2)
00000190: 3230 3232 2d31 312d 3231 1432 3032 322d 2022-11-21.2022-
000001a0: 3131 2d32 3300 d313 7980 7ecf 4645 6249 11-23...y.~.FEbI
000001b0: ddd7 08a1 070a ......
1 | The schema block at the top
2 | And the data block at the end
The interesting part here is that the data section still just contains a list of separated values - now with both dates of the nested record inlined - and can be flattened out just like before.
So far we discussed how the formats can evolve, but is there another way?
Apply versioning
In this chapter we are going to have a look at versioning, which is also a viable way if we cannot directly control our clients or consumers. To keep things simple, we just have a look at the two most commonly used approaches in the wild, with examples based on REST.
Endpoint versioning
Our first option is to create a new version of our endpoint and just keep both of them. We cannot have two resources serve the same URI, so we just add a version number to the endpoint and have a nice way to tell them apart. Another nice side effect is that this allows further tracking and redirection magic of the traffic:
$ curl -X GET http://blog.unexist.dev/api/1/todos (1)
1 | Set the version via path parameter
Pro | Con
Clean separation of the endpoints | Lots of copy/paste, or worse, people thinking about DRY
Usage and therefore deprecation of the endpoint can be tracked, e.g. with PACT | Further evolution might require a new endpoint
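A minimal JAX-RS sketch of this duplication - class and method names are made up for illustration - might look like this:

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/api/1/todos")
public class TodoResourceV1 {
    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public String list() {
        return "[{\"title\": \"First\", \"description\": \"Bla\"}]"; // Old format
    }
}

@Path("/api/2/todos")
class TodoResourceV2 {
    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public String list() {
        return "[{\"title\": \"First\", \"summary\": \"Bla\"}]"; // Evolved format
    }
}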
Content versioning
And the second option is to serve all versions from a single endpoint and resource, by honoring client-provided preferences in the form of an Accept header. This has the additional benefit of offloading the content negotiation part to the client, so it can pick the format it understands.
$ curl -X GET -H "Accept: application/vnd.xm.device+json; version=1" http://blog.unexist.dev/api/todos (1)
1 | Set the version via Accept header
Pro | Con
Single version of the endpoint | Increases the complexity of the endpoint to include version handling
New versions can easily be added and served | Difficult to track the actual usage of specific versions without header analysis
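A matching JAX-RS sketch for this option - note that plain JAX-RS dispatches on the media type itself, so the version is baked into a made-up vendor type here instead of the version parameter from the curl example above:

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;

@Path("/api/todos")
public class TodoResource {
    // Matched for: Accept: application/vnd.todo.v1+json
    @GET
    @Produces("application/vnd.todo.v1+json")
    public String listV1() {
        return "[{\"title\": \"First\", \"description\": \"Bla\"}]";
    }

    // Matched for: Accept: application/vnd.todo.v2+json
    @GET
    @Produces("application/vnd.todo.v2+json")
    public String listV2() {
        return "[{\"title\": \"First\", \"summary\": \"Bla\"}]";
    }
}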
Conclusion
During the course of this article we compared textual formats with a binary one, discovered there are many similarities under the hood, and saw how a schema can miraculously save the day.
Still, a schema is no silver bullet either, and sometimes we have to use other means to be able to evolve a format - especially when it is already in use in legacy systems.
Going the way of our REST examples might be a way to have different versions of the same format in place, without disrupting other (older) services.
All examples can be found here: