Java serialization: what are You serializing Your data for?

I think that after this series of posts about Java serialization we should talk a bit about things which are not present in a standard serialization mechanism.

Once upon a time I wrote a GUI application. Of course, Swing based. JavaFX wasn’t there at that time; in fact it did collapse to “abandon-ware” status for a while. Plus I am not a big fan of it…

But back to serialization.

Of course I did create some very specific JComponent GUI components which were responsible for the job. It was basically a scientific dynamic data charting engine, so those components did show some dynamically changing data. The visual effect depended on two sets of settings: one bound with the data themselves, which of course was serialized through the data serialization mechanism, and a second set which was purely visual. Like, for example, what font size to use in tables and so on. Clear presentation settings having nothing in common with the data.

So I was thinking: why not to serialize those GUI components?

The standard serialization is for…

I tried and failed. Miserably. My first and simplest approach, “take it and serialize it to disk”, was conceptually wrong. Let me tell You why.

After trying and reading the available sources, I realized that the standard serialization was meant for “hot serialization” of living objects. While I wished to do a “cold serialization” on a disk.

“Hot” serialization

Initially the JAVA GUI, as a part of the Sun enterprise, was meant (I believe it was, I never worked for them) to run in a philosophy very much alike their X-Server concept.

In the X-Server the body of a program runs on a “server” machine, while all GUI related commands, like drawing on the screen, handling mouse and keyboard and so on, are passed over a network to the “client” machine. It is easy to imagine how much the network throughput is stressed by this approach.

Thus, year after year, as the power of an average “client” machine grew, the X-Server protocol was trying to move more and more to the client side. This is only natural. Consider how much processing power it takes to draw TrueType fonts in Your LibreOffice Writer and compare it with the power required to manipulate the UTF-8 characters in memory which actually represent the document data. It is clear that GUI-rich applications consume 99% of their power on the GUI, so it is natural to move this consumption as close to the end user as possible.

But how much can You move if You can’t move code? You can only grow the command set, grow caches and alike, but each user action must still be passed to the “server”.

Note: Remember, “server” and “client” are using different CPU architectures. Servers were Sparc or MIPS, clients were x86 or PowerPC. So no, You can’t pass binary code.

With the introduction of the “Java Virtual Machine” passing code became possible. Now it was possible not only to send commands to draw on the screen, but one might pass the bunch of class files responsible for the GUI and run them on the “client”. Of course it should be as transparent as possible. The server side should be able to build the GUI as if it ran locally, wrap it in RPC wrappers and pass it to the remote client. The client should just run it in the context of its own JVM and pass objects back only when necessary.

A part of this process was, I believe, to be handled by standard serialization protocol.

What does it mean for us?

If You inspect Swing serialization You will notice two things:

  • first, the warning in documentation (JDK 11) that: “(…)Serialized objects of this class will not be compatible with future Swing releases. The current serialization support is appropriate for short term storage or RMI between applications running the same version of Swing.(…)”
  • second, that what is serialized includes all listeners, the whole parent-child tree up and down, and practically everything. From my experience an attempt to serialize any JComponent serializes the entire application.

This is because of “(…)The current serialization support is appropriate for (…) RMI between applications(…)“.

In simpler words, for transferring objects which have to be alive at the other end of a stream and actively exchange data with the source end of a stream.

Note: I am now ignoring the part about “(…)support for long term storage of all JavaBeans™ has been added to the java.beans package. Please see XMLEncoder.(…)” which You may find in the JavaDocs. It is intentional, because this mechanism is far, far away from what generic serialization needs.

“Cold” serialization

“Cold” serialization is when You remove the object from its environment, stop it, and move it into a “dead” binary storage. Then, possibly the next year at the other end of the world, You remove it from the storage, put it in another environment (but a compatible one, of course) and revive it.

The reviving process will require binding the object with new environment, but it is not a problem.

Example: Serial port data logger

Now imagine, You wrote a data logger application for a hardware device which is sending data to PC through a serial port connection.

You have a class there which both keeps and tracks data from the serial port. Let’s say it is fixed to be COM1.

How would the “hot” and “cold” serialization look?

“Hot” example

A “hot” serialization of this class will basically need to pass the already stored data to another machine and allow control of the logger device from that machine. It means that it must, during serialization, create a “pipe” over the network from the remote machine to the machine to which the hardware device is attached.

“Cold” example

A “cold” serialization of this class should save the stored data on a disk and do it in such a way that, when it is de-serialized, it will connect itself to the said serial port and continue the data logging. It means that it must save information about which port to connect to and create this connection when de-serialized.
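Just to make it tangible, below is a minimal sketch of how such a “cold” logger could look with the standard API. It is only an illustration of the idea, assuming some serial I/O library hidden behind the openPort(...) placeholder: the port name is what gets stored, the live connection is transient and is re-established when the object is revived.

 import java.io.IOException;
 import java.io.ObjectInputStream;
 import java.io.Serializable;
 import java.util.ArrayList;
 import java.util.List;

 class ColdSerialPortLogger implements Serializable
 {
    private static final long serialVersionUID = 1L;

    private final String portName;               // e.g. "COM1"; this is what gets stored
    private final List<String> samples = new ArrayList<>();
    private transient AutoCloseable connection;   // live resource, never written to the stream

    ColdSerialPortLogger(String portName)
    {
        this.portName = portName;
        this.connection = openPort(portName);
    }

    // Placeholder for whatever serial I/O library is actually used.
    private AutoCloseable openPort(String name)
    {
        return () -> { /* close the hardware connection here */ };
    }

    // Invoked by the standard mechanism after the fields have been read back;
    // this is the "revive and bind to the new environment" step.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException
    {
        in.defaultReadObject();
        this.connection = openPort(portName);     // reconnect and continue logging
    }
 }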

Multi purpose

It is clearly seen that if I tried to serialize this logger using standard serialization I would have to decide on either the “hot” or the “cold” method. I can’t do both.

But I do need both methods!

Standard Java serialization is single-purpose.

Hey, we have writeReplace()/readResolve()!

Yes, we have them. Except we have either writeReplace() or readResolve(). We can’t use both at the same moment, but let me be silent about it for now.

What are those two?

They are a “patch” for multi-purpose serialization. Quite a good one, which will work in the above example case, but not in every case.

We may easily imagine that our “hot” and “cold” serialization can be done by:

  class LocalLogger ...
  class HotLogger ...
  class ColdLogger ...

   HotLogger toHot(LocalLogger ...)
   ColdLogger toCold(LocalLogger ...)

That is, we provide an ability to construct the “hot” and “cold” variants from the “local” variant. Both “cold” and “hot” are different objects, but such that when serialized by the standard mechanism they do what we need. Now if we are writing an application which needs “hot” serialization we use, instead of LocalLogger, a class like this:

 class LocalLogger_Hot extends LocalLogger
 {
    private Object writeReplace(){ return toHot(this); };
 }

The standard serialization mechanism will notice it and invoke the writeReplace method before it starts to serialize any instance of LocalLogger_Hot. Thus the remote side will see a HotLogger in every place where a reference to LocalLogger_Hot was serialized.

We may also mirror the thing, and decide that LocalLogger will serialize the information necessary for both creating a hot link and a local port connection, and that it is up to the reading application to act according to its needs. For that the remote side must use a different source for LocalLogger:

  class LocalLogger
  {
     Object readResolve(){ return toHot(this); }
  }

The de-serialization engine will notice this method and invoke readResolve after the LocalLogger was de-serialized. From that moment on it will use the returned object instead of the original, which is achieved by modifying the “stream_reference_map” (see there).

Note: the underscores are intentional.

When does it not work?

So, having my special GUI components which by default are “hot” serialized, and needing to turn them into “cold” serialized ones, I did add:

class MyGUI extends JComponent
{
    static class MyColdStorage
    {
        private Object readResolve(){ return new MyGUI(this); };
        ....
    }
    ...
    private Object writeReplace(){ return new MyColdStorage(this); };
}

Basically the idea is that when the standard serialization serializes an instance of MyGUI it will transform it into the “cold form” of MyColdStorage. Then, whenever it de-serializes a MyColdStorage, it will transform it back to MyGUI.

Nice, plain and simple, isn’t it?

Except it doesn’t work.

Cyclic data structures

The GUI is a heavily recursive and cyclic data structure. Each JComponent keeps a list of child GUI components (i.e. a panel keeps a list of contained buttons). And each child component keeps a reference to its parent component (i.e. a label must know the enclosing panel to tell it that the size of the label changed, so the panel should recompute the layout of its children).

For simplicity let us define it like:

 class JComponent
 {
     private JComponent parent;
     private JComponent child;
 }

If You consider this post You will notice that such a structure will be serialized like this:

   JComponent Parent =...
   JComponent Child  =..
  serialize(Parent)
   ... →
      write new JComponent (Parent) //(A)
       write Parent.parent = null
        write new JComponent (Child)
             write Child.parent= refid(Parent) //that is stream reference to Parent set in (A);
             write Child.child = null
        write Parent.child = refid(Child)

and during de-serialization:

  x = deserialize(...)
    create new JComponent (Parent)
       set Parent.parent = null
       create new JComponent (Child)
             set Child.parent= Parent;
             set Child.child = null
       set Parent.child = Child
   return Parent

Now arm it with writeReplace and readResolve exactly as defined above.

serialize(Parent)
   ... →
      call writeReplace(Parent)
      write new MyColdStorage (Parent_cold)
       write Parent_cold.parent = null
        call writeReplace(Child)
        write new MyColdStorage (Child_cold)
             write Child_cold.parent= refid(Parent);
             write Child_cold.child = null
         write Parent_cold.child = refid(Child)

and during de-serialization:

  x = deserialize(...)
    create new MyColdStorage (Parent_cold)
       set Parent_cold.parent = null
       create new MyColdStorage (Child_cold)
             set Child_cold.parent= Parent_cold;
             set Child_cold.child = null
       call readResolve(Child_cold) (Child)
             new MyGUI (Child)
               Child.parent = Child_cold.parent (Parent_cold)
               Child.child = Child_cold.child (null)
       set Parent_cold.child = Child
   call readResolve(Parent_cold) (Parent)
       new MyGUI (Parent)
         Parent.parent = Parent_cold.parent (null)
         Parent.child = Parent_cold.child (Child)          
   return Parent

Noticed the problematic lines?

In this cyclic structure the first use of the de-serialized parent reference happens before the place in which its readResolve(Parent_cold) is invoked. It is because the designers of standard Java serialization assumed that to resolve an object You need it to be fully read. And of course, since we have a cyclic structure, the process of reading the “Child” in this example will refer to the “Parent” before it was fully read. Thus it will access an unresolved object.

In my case it would produce a ClassCastException, because MyColdStorage is not a JComponent.

It is even worse: we now have two objects, one being the unresolved MyColdStorage and one the resolved MyGUI, where we originally had a single object.

writeReplace/readResolve doesn’t work in cyclic structures.

Note: This is specified and designed behavior. I can’t tell whether it was intentionally created like that, because a solution is trivial, but nevertheless You will find it in the serialization specs.

How to solve it?

The answer is simple: with standard serialization You can’t. Once it is “hot” it will be “hot” for eternity.

But if You write Your own serialization engine the solution is simple. Instead of one readResolve use two methods:

class MyColdStorage
{
  Object readReplace();
  void fillReplacement(Object into);
}

Now the readReplace is bound to create an “empty” object of correct type:

  Object readReplace(){ return new MyGUI(); };

and the fillReplacement is bound to transfer data from the stream form to the target form:

  void fillReplacement(Object into)
  {
    ((MyGUI)into).parent = this.parent;
    ((MyGUI)into).child = this.child;
  };

The readReplace is invoked right after a new instance is created, and the returned value is put into the “stream_reference_map” (see there) instead of the original.

The fillReplacement is invoked in exactly the same place where the standard readResolve() is invoked, but unlike the original, the “stream_reference_map” is left untouched.

Then make de-serialization to look like:

 x = deserialize(...)
   create new MyColdStorage (Parent_cold)
    call readReplace() → since now each time "Parent" is referenced use returned value (Parent_R)
     set Parent_cold.parent = null
     create new MyColdStorage (Child_cold)
     call readReplace() → (Child_R)
       set Child_cold.parent= Parent_R;
       set Child_cold.child = null
        call fillReplacement(Child_R) (Child_cold)
         set Child_R.parent = Child_cold.parent (Parent_R);
         set Child_R.child = Child_cold.child (null);
     set Parent_cold.child = Child_R
     call fillReplacement(Parent_R) (Parent_cold)
      set Parent_R.parent = Parent_cold.parent (null);
      set Parent_R.child = Parent_cold.child (Child_R);
   return Parent_R

So is it multi-purpose now?

No.

It allows us to change the purpose, but not to serialize the same object once for one purpose and once for another in exactly the same application.

Can we do it?

Of course.

With “object transformers”. There is absolutely no need for the writeReplace/readReplace+fillReplacement trio to be private methods of the serialized class. They can be any methods defined anywhere, providing the serialization mechanism can find them. For example we may define:

public interface ITypeTransformer
{
  public boolean isTransformableByMe(Object x);
  public Object writeReplace(Object x);
  public Object readReplace(Object x);
  public void fillReplacement(Object from, Object into);
}

plug it into our serialization engine and be happy.
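For illustration, a transformer handling the earlier MyGUI example could look roughly like this. It is a sketch against the hypothetical interface above; MyGUI and MyColdStorage are the classes from the previous example, and copyInto(...) is an assumed helper which moves the stored fields over:

 // Sketch only: assumes the custom engine consults registered ITypeTransformer
 // instances instead of looking for private methods inside the serialized class.
 public class MyGuiColdTransformer implements ITypeTransformer
 {
   @Override public boolean isTransformableByMe(Object x)
   {
      return (x instanceof MyGUI) || (x instanceof MyGUI.MyColdStorage);
   }

   // writing side: replace the live component with its "cold" storage form
   @Override public Object writeReplace(Object x){ return new MyGUI.MyColdStorage((MyGUI)x); }

   // reading side, step 1: produce the empty target object up front,
   // so that cyclic references can already point at it
   @Override public Object readReplace(Object x){ return new MyGUI(); }

   // reading side, step 2: move the data once the cold form is fully read
   @Override public void fillReplacement(Object from, Object into)
   {
      ((MyGUI.MyColdStorage)from).copyInto((MyGUI)into);   // copyInto(...) is a hypothetical helper
   }
 }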

Can You do it with a standard serialization?

No. Absolutely not.

Summary

After reading this blog entry You should understand that different applications may need to serialize the same object in different ways. You should be aware of the fact that standard serialization is “cast in stone” in that regard, and that the writeReplace/readResolve mechanism is broken and won’t help You with it.

You should also know that if You decide on Your own serialization engine, then You can do it in a very easy way.

JAVA serialization: if not the pointer then what?

In that post I did talk about “reflections” in JAVA and how this concept relates to serialization.

In that post I would like to say a few words about how to deal with “object references” if we can’t have a pointer.

Why do we need a pointer?

Under the hood: to say what is where in memory. But looking from the outside, just for one thing: to tell two objects apart. If their references (i.e. pointers) differ then those are not the same objects. They may be bit-by-bit equal but they are not the same.

So basically as long as we can do 1:1 mapping:

  Object reference ↔ sequence-of-bits

then we are done.

Identity of objects in JAVA

Gladly, JAVA provides two facilities for that:

class System{
  ...
  public static int identityHashCode(Object x)
  ...
}

which computes a “magic” number providing a non-1:1 mapping:

  Object reference → 32 bit integer

and

  Object X, Y;
    X==Y
    X!=Y

the reference identity operator, which can tell when two “pointers” point to the same object.

Those two are enough to create identity hash-map (like java.util.IdentityHashMap) which can quickly map Object to int:

  stream_reference_map = new IdentityHashMap<Object, Integer>()

Of course we could do the same without the identityHashCode using only == operator and a list of structures like:

class Descriptor
{
  final Object reference;
  final int assigned_number;
}

but it would be a few orders of magnitude slower.

Stream reference

The stream_reference_map shown above maps an Object to an int number. This number is called a “stream reference identifier” or, in a short form, a “refid”.

Note: Remember, the “refid” is not the result of identityHashCode()! The identityHashCode() does not produce a 1:1 mapping! It may return the same number for many objects. It is used just to speed things up by grouping objects into “buckets”, over which we still need to use the == operator.

Producing stream reference identifier

Any method will do. You should however think about a few questions:

  1. Should I allow transfer of unlimited number of objects to stream?
  2. Should I allow garbage collection and re-use of refid?

Usually a simple incrementing counter will be ok.
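A minimal sketch of such a writing-side map, assuming an incrementing counter and no refid re-use yet, could look like this:

 import java.util.IdentityHashMap;
 import java.util.Map;

 // Writing side: assigns a fresh "refid" the first time an object is seen,
 // returns the already assigned one afterwards.
 final class StreamReferenceMap
 {
    private final Map<Object,Integer> stream_reference_map = new IdentityHashMap<>();
    private int refid_generator;    // a simple incrementing counter

    /** @return the refid of x, assigning a new one if x was never written before. */
    int refidOf(Object x)
    {
        Integer refid = stream_reference_map.get(x);
        if (refid == null)
        {
            refid = refid_generator++;
            stream_reference_map.put(x, refid);
        }
        return refid;
    }
 }

A real engine would also have to tell the caller whether the refid is fresh, because only a freshly assigned object has its content written to the stream; later uses write the refid alone.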

Using stream reference

Basically You use it exactly the way You would use a “pointer”. You would like to write a pointer to object X to a stream? Then You look up the “refid” of X and write that “refid” to the stream. Simple.

The question is when You would like to write a pointer, but this is another story.

Reading-side map

The above:

  stream_reference_map = new IdentityHashMap<Object, Integer>()

provides the Object → int map. Unfortunately it is just one part of the story, the one used to write pointers to a stream. The other part of the story is: what do we do with a “refid” we read from a stream?

The reading side needs:

  int → Object 

map. Gladly, if You have chosen an incrementing counter for the “refid” generator and You are fine with 2^31-1 objects in a stream, the simple:

  read_refid_map = new Object[...];

will do the best job.

Note: Unless You are actually planning to get anywhere near the 2^31 region in the number of objects. A more “scattered” structure will better handle growing and shrinking the array during the life of the serialized stream.
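For illustration, the reading-side map under the same assumptions (incrementing counter, no re-use) can indeed be as dumb as a growing array:

 import java.util.Arrays;

 // Reading side: int → Object, backed by a plain growing array.
 final class ReadRefidMap
 {
    private Object[] read_refid_map = new Object[64];

    void register(int refid, Object x)
    {
        if (refid >= read_refid_map.length)
            read_refid_map = Arrays.copyOf(read_refid_map,
                                           Math.max(refid + 1, read_refid_map.length * 2));
        read_refid_map[refid] = x;
    }

    Object resolve(int refid){ return read_refid_map[refid]; }
 }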

Problems

The first problem, which is not dealt with in standard serialization, is a memory leak. Yes, the standard serialization does leak like hell!

Hard-reference+garbage collector==memory leak

The stream_reference_map = new IdentityHashMap<Object, Integer> used at the writing side utilizes a standard, hard reference to an Object as a “key” in the map. This has an unfortunate effect: as long as this map exists the garbage collector will see all contained objects as “reachable” and won’t release them.

Usually it is not a problem, but if You decide, for example, to use serialization for logging in Your application, You will get a nasty surprise.

Imagine You do arm Your application with logging commands in following manner:

void woops(int a,int b)
{
  ....
  if (log_level_enabled) log_object_output_stream.writeObject("calling woops("+a+","+b+")");
  ...
}

Each time this code runs, a new string is formed and written to the stream as an object. This means that it must have a “refid” assigned. And if it must have it assigned, then it must be put into the stream_reference_map. Since the map is using hard references, the string will stay there forever. Or, precisely, until an OutOfMemoryError.

A proper stream_reference_map must hold references to the mapped objects through a WeakReference.

Passing garbage collection event

Of course, even if You deal with the above, You will still hit an OutOfMemoryError at the reading side of the stream.

The simplest:

  read_refid_map = new WeakReference<Object>[...];

will not work. The weak reference works at the writing side because, if the only place for the object to exist is the stream_reference_map, then there is no way to write it again to the stream.

At the reading side it is very different. The reading code may pick a “refid” from the stream (and the objects) and drop them right on the spot. The writing side may however hold on to the object for a very long time and write it to the stream many times. Of course, to avoid many problems which I will discuss somewhere else, it will prefer to write the “refid” of it. If the read_refid_map held only a WeakReference, then there might no longer be any object to map it to.

A good “refid” system passes garbage collection events to the reading side.
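A sketch of how the writing side could do both things at once: hold the objects weakly so they can be collected, and notice, through a ReferenceQueue, that they are gone, so that a “release this refid” command can be sent to the reader. The writeReleaseCommand(...) call is a hypothetical stream command, not an existing API:

 import java.lang.ref.ReferenceQueue;
 import java.lang.ref.WeakReference;
 import java.util.HashMap;
 import java.util.Map;

 // Writing side: keys are held weakly, so the map does not keep logged objects alive.
 final class WeakStreamReferenceMap
 {
    // WeakReference key wrapper using identity semantics.
    private static final class Key extends WeakReference<Object>
    {
        private final int hash;
        final int refid;
        Key(Object referent, int refid, ReferenceQueue<Object> q)
        {
            super(referent, q);
            this.hash  = System.identityHashCode(referent);
            this.refid = refid;
        }
        @Override public int hashCode(){ return hash; }
        @Override public boolean equals(Object o)
        {
            return (o instanceof Key) && (((Key)o).get() == this.get()) && (this.get() != null);
        }
    }

    private final Map<Key,Integer> map = new HashMap<>();
    private final ReferenceQueue<Object> dead = new ReferenceQueue<>();
    private int refid_generator;

    int refidOf(Object x)
    {
        releaseDeadRefids();
        // The look-up needs a temporary key; a production version would avoid this allocation.
        Key probe = new Key(x, -1, null);
        Integer refid = map.get(probe);
        if (refid == null)
        {
            refid = refid_generator++;
            map.put(new Key(x, refid, dead), refid);
        }
        return refid;
    }

    // Called from time to time; also tells the reading side it may drop those refids.
    private void releaseDeadRefids()
    {
        Key k;
        while ((k = (Key)dead.poll()) != null)
        {
            map.remove(k);                 // works, it is the very same key instance
            writeReleaseCommand(k.refid);  // hypothetical "free this refid" stream command
        }
    }

    private void writeReleaseCommand(int refid){ /* back-end specific */ }
 }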

Roll over

Of course int isn’t infinite. Even if You use proper garbage collection of “refids” You will still sooner or later hit:

   assert(refid_generator+1 > refid_generator )

that is, a “signed wrap-around”. You will run out of possible “refids” to use.

This is something which is also not addressed in standard serialization. The really bad part is that the standard serialization does not utilize the entire 2^31-1 pool of numbers, and the roll-over happens earlier, producing some stream commands instead of a “refid”. Fortunately You need a really huge VM to hit this problem, since usually the OutOfMemoryError will appear first.

A good “refid” system re-uses garbage collected refids to avoid roll-overs.

Summary

After reading this chapter You should know what the “stream reference identifier” is and how not to design the system which manages it. This should also make You notice that a standard serialization stream cannot exist permanently or be used for large amounts of data produced on demand.

And now You may move to the following part, in which You will read about how an object is scanned during serialization and what problems it may create.

Introduction to JAVA serialization

In previous chapters I was talking about generic ideas of serialization, dealing with pointers and dealing with versioning.

In this chapter I would like to show You how the concept and, of course, the problem of “pointers” is dealt with in JAVA.

JAVA limitations

No pointers

The most important limitation is: JAVA has no concept of a “pointer”. There is a beast called an “object reference”, but there is no “pointer” which can be turned into an integer number representing an address in machine memory. And since there is no “pointer”, there is absolutely no way to access machine memory the way You can do it in C/C++. Just no way.

Unless, of course, You try to use the unsafe set of classes and methods. Then You can play with bits and bytes, but such play will produce memory dumps which will survive neither years nor a transfer to a different virtual machine running on a different architecture. Notice also that most of the unsafe functions which were present in JDK 8 are removed in JDK 11+. And they will be hidden even more in future versions. Because they are unsafe and allow a hell of a lot of hacking on servers which can run “outsider code”.

Thus: no possibility to do a “memory dump”. Which, in fact, isn’t bad.

But what do we have instead of “pointers”?

We have objects. Or, precisely speaking, “references” to objects. Which are internally “pointers”, but are completely opaque to us and can’t be transformed into any kind of number nor anything else. Not at all. All we can do with “references” is:

  • compare them using == operator;
  • assign one to another using =;
  • use them to access fields, array elements or invoke some methods attached to them;
  • create “objects” we can later reference to through “references”.

No malloc

The second limitation is the lack of a generic way to “allocate some memory”. Any allocation must be either an allocation of an array or of an object. And when You allocate an object, then some of its constructors must be called. This concept prevents us from creating an object with absolutely zero initialization and later filling it with data taken from a serialized form. Some collaboration from the object is required.

Note 1: The standard serialization API does interact with the VM at the native level to create really empty objects without calling any constructor. We can also do it, but in pure JAVA it won’t be possible without referencing some internal, JDK-specific classes. Notice however that re-implementing the “standard” serialization is not our goal. What we aim for is a long-lasting, stable, non-tricky way of supporting flexible serialization.

Note 2: If You inspect the JVM specification, then at first glance it will be difficult to find that what I just said (that the constructor must be called) is required. Indeed the instruction set does allow allocating an object and not calling a constructor at all. Such code won’t, however, pass the validation stage, that is the process during which the JVM checks if the class file is syntactically and logically correct. There are tricks which allow forcing the JVM to disable class file verification, but it is asking for serious problems.

JAVA benefits

To overcome the limitation related to the lack of pointers, the designers of JAVA introduced one very rare and very powerful mechanism: reflections.

Not many programming languages do have it.

What are “reflections”?

In very short words, this is a set of methods and classes which allows You to inspect any object instance You get a reference to. You can ask it to tell You how this object’s class is named. You can check what classes it extends or implements. And You can, what is most interesting to us, ask it to list and manipulate all fields contained in it, with their names, types, annotations and currently assigned values.

You can even access this way fields which are normally hidden from You if You just tried to reference them in code, which was always a bit of a disputable aspect of reflections when looked at from a security point of view.

Including, of course, fields You did not know at compile time. Exactly what we need.

Note: The JDK 9+ module system puts some constraints on it, but You can still do all the things we need. Just not with absolutely every object, as You could have done in JDK 8 and prior releases.

In simpler words: if You have any “reference” to an object then, using reflections, You can ask:

  • ask how the class it is an instance of is named;
  • ask what all the “fields” (that is, object bound variables) contained in it are named and what type they are;
  • ask what methods (functions to call) are declared there and what parameters they take.

And vice versa, knowing answers to above questions You may, using reflections, do:

  • create a new instance of an object knowing name of a class;
  • set to or get values from any of its fields;
  • invoke any of its methods.

Hey, I can do it in C too!

No, You can’t.

I might have been a bit imprecise in what I was saying. Yes, You can do everything like that in C. At the source code level and at what we call “compile time”. That is, You can write:

struct X { int x; };
struct X *x = ...;
x->x = 4;

However in JAVA using reflections You can do:

String class_name = ...; // read it from a file, for example
String field_name = ...;
Object A = Class.forName(class_name).getDeclaredConstructor().newInstance();
Field x = A.getClass().getDeclaredField(field_name);
x.set(A, 4);

Note: As always, the code examples are simplified and won’t compile. They are only exemplifications of some idea.

In other words, “reflections” allow You to compute the class name at run time, also from externally supplied data, and manipulate objects of that class as much as You like.

Including classes which did not yet exist when You wrote the code.

Nice, isn’t it?

Note: Java allows You also to actually generate the binary class file in Your code and tell JVM to load it at runtime. But we won’t be needing it for serialization.

Reflections+references versus pointers

As You might have already noticed, the lack of pointers prevents us from making a “memory dump” of an object and from manipulating its data at the bit-by-bit level. The “reflections” do however allow us to manipulate the actual data stored in an object using names and values, without bothering about how they are kept in memory.

Do we need anything more?

Summary

After reading this chapter You should be aware of how the “object reference” concept differs from the concept of “pointers”, and how “reflections” allow us to overcome the limitations related to the full opacity of the “object reference”.

You should also have guessed that “reflections” will play a critical role in the implementation of the “indirect versioning” concept.

And, of course, You should also have noticed that the fact that an “object reference” is fully opaque means that we simply can’t save it. But this problem is left for the next blog entry.

Serialization: versioning

In this post You could read about serialization in general and how pointers are messing with it.

In this post I would like to touch on another aspect which influences serialization engines a lot.

This subject is:

Versioning

As I said previously, the dump of bytes and bits not only depends on the code, target CPU, compiler and so on, but will also change from version to version of Your program.

Now, for sake of this discussion, assume that You can force Your compiler to produce stable, predictable memory layout across all compiler versions. Assume also that You do not care about different CPU architectures.

Now You can say that bit-by-bit image of Your memory “dump” will be consistent and predictable.

True.

Unless You do something like this:

yesterday:

 struct{
  char first_name[32];
  char surname[32];
  int age;
 }

today:

 struct{
  char first_name[32];
  char surname[32];
  boolean gender;
  int age;
 }

Yes, yes, I know, we live in the era of “non-binary” persons and I was super-duper rude to use a boolean for gender. Glad You noticed that. This is a bug which will need to be fixed later, and it clearly shows that data structure versioning is a must.

What has happened?

We added a field. And we have done it in a very, very bad way, by stuffing it in the middle of the data structure. And kaboom! Now the bit-by-bit images from “today” and “yesterday” are totally incompatible with each other.

Direct versioning

The first obvious solution is to arm Your memory “dump” with a “header”:

    HEADER: int serial_version;
    CONTENT:
    	struct{
    	.....
    	};

Writing such data is super easy:

    	write(...,serial_version);
    	write(...,&data_in_memory,sizeof(struct...))

Reading is a completely another story.

You have to do something like that:

    	int serial_version = read(...)
    	switch(serial_version)
    	{
    		case 0:
    			....
    		case 1:
    			....
    	}

and for each case You need to provide a transformation from “that version” into “current version”.

This is a bit messy for complex structures, and You need to continuously rename Your “yesterday” structures in Your source to avoid name clashes. You can’t do:

yesterday:

 typedef struct{
  char first_name[32];
  char surname[32];
  int age;
 } version_1;

today:

 typedef struct{
  char first_name[32];
  char surname[32];
  boolean gender;
  int age;
 } version_2;

future:

 typedef struct{
  ...
 } version_3;

because with each bump-up of the version You would have to update the entire code which references Your structure.

Instead You would rather do:

yesterday:

 typedef struct{
  ...
 } data;

today:

 /* old definition */
 typedef struct{
  ...
 } v1;

 /* active definition */
 typedef struct{
  ...
 } data;

tomorrow:

 /* old definitions */
 typedef struct{
  ...
 } v1;

 typedef struct{
  ...
 } v2;

 /* active definition */
 typedef struct{
  ...
 } data;

that is, keep the most recent version always under the same name. This is a good idea because, if there was no gender field yesterday, then none of the old code made any use of it. If You add it today, there is a huge chance that 90% of the code still won’t need to use it. Keeping the name unchanged saves You a lot of work.

The obvious downside is that You have to rename the “old” structures but still keep them in Your code base. With plain structs it is easy, but with objects, with their entire inheritance tree, it is a hell of a lot of mess. Doable, but messy.

We need different approach.

But before saying anything about it let’s look at an another problem.

Upwards compatibility? Downwards compatibility?

Now return to:

    	int serial_version = read(...)
    	switch(serial_version)
    	{
    		case 0:
    			....
    		case 1:
    			....
    		default: what to do?
    	}

It does not need a lot of thinking to notice that with direct versioning You can have only downwards compatibility.

“New” code can load “old” data, but “old” code cannot load “new” data.

What is the point of loading “new” data with “old” code, You say?

Well… Your yesterday structure did not have the gender field. Then the “old” program, by its nature, will not need it. Why not let it read “today” data? Is there any logical problem with it?

And, vice versa, loading a structure without gender requires that the transforming code does some guessing. There was no information about gender, but it must have it.

Notice, there is one very serious reason to not allow upwards compatibility. That is: money. If You allow an old version of Your software to load files stored by more modern versions, then Your clients may decide to hold all their licenses back and buy just one seat upgrade to try it out. If they find that there is no value in the upgrade, they won’t buy more seats and they won’t lose anything.

If however You design Your software in such a way that the old version can’t do anything with files written by the new version, and You prevent the new version from saving files in the old way, then it is a completely different story. Now if Your client upgrades just one seat, then the person working at it will, on a daily basis, corrupt files in such a way that the rest of Your client’s employees won’t be able to use them. And, since You prevented saving files the “old way”, after some time Your client will have a choice: either not upgrade and throw away all the work done on that new seat, or upgrade the whole company and keep the work.

Nice, isn’t it? Welcome to the world of Autodesk Inventor!

Indirect versioning

Indirect versioning uses all the “why” I spoke about above to provide both up and downwards compatibility.

Instead of:

    HEADER: int serial_version;
    CONTENT:
    	struct{
    	.....
    	};

it does:

    HEADER: int logic_version;
    CONTENT:
     begin
    	FIELD "Name" =...
    	FIELD "Gender" =...
    	....
     end

and saves not only the content of the structure, but also the information about what fields are in it and where.

Armed with this information You don’t really have to use any transformation. You just read fields from a stream and apply them to fields in Your current structure in memory.
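A rough sketch of that field-by-field “apply” step in Java, using reflections; readNextFieldName() and readFieldValue() stand for whatever the stream back-end provides and are assumptions of this sketch, not an existing API:

 import java.lang.reflect.Field;

 // Sketch: applies (name, value) pairs read from a stream onto a target object.
 // Fields present in the stream but absent in the class are silently ignored;
 // fields absent in the stream simply keep their default values.
 final class IndirectReader
 {
    void readInto(Object target, StreamSource in) throws Exception
    {
        String name;
        while ((name = in.readNextFieldName()) != null)   // null means: "end" of this structure
        {
            Object value = in.readFieldValue();
            try
            {
                Field f = target.getClass().getDeclaredField(name);
                f.setAccessible(true);
                f.set(target, value);
            }
            catch (NoSuchFieldException ex)
            {
                // an additional field we do not know: ignore it
            }
        }
    }

    // Hypothetical abstraction of the stream back-end (XML, JSON, binary...).
    interface StreamSource
    {
        String readNextFieldName() throws Exception;
        Object readFieldValue()    throws Exception;
    }
 }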

As You can see, indirect versioning carries a huge potential: Your program can read both older and newer streams without any effort from Your side.

Great, isn’t it?

Blah, isn’t it good old XML or JSON? Sure it is. I am just curious if You have ever thought about it that way.

logic_version

Notice I have left some “header” still, but instead of serial_version I renamed it to logic_version.

The idea behind it is simple:

In some cases You will have to introduce a “breaking change” in Your data structure. This change goes past adding and removing fields; it changes the logic so much that mapping a field from the stream to a field in memory won’t work anymore. To indicate it You just “bump up” the logic_version.

Of course with that we do move from “indirect” to “direct” versioning.

Note: Java serialization does have the serialVersionUID field just for that. It lacks however any possibility to deal with such a change, except for complaining that nothing can be done.

Missing fields

Of course with the “indirect” approach You will have to deal with a case where You expect some fields (like the said gender) but they are not there.

To deal with it You need two operations:

  • to be able to initialize missing fields with reasonable values;
  • to be able to post-process the entire structure, once it is loaded
    and initialized with defaults, and do some guessing and cleanup;

This has to be done in two separate stages (or at least the last stage is necessary), because a “reasonable” initialization is not always possible without knowing the values of other fields.

For example, You may attempt to guess the gender by looking into a names dictionary, but to do that You need to have the “name” field loaded first.

Additional fields

And a vice versa.

A stream may contain more fields than You need. If You did not introduce a “logic change”, then this usually means that either Your program no longer needs some information, or a new version added some information Your program does not understand.

In both cases ignoring it will be fine.

Pointers?

Hurray! We solved it! If we have already given names to the fields in the stream, why not add a marker to say which of them are “pointers”?

Summary

After reading this chapter You should be aware that version handling is not something which can be left for later, because it will have a significant impact on code maintenance cost and on Your licensing policy.

You should also notice that, if we switch from the “memory dump” to the “named fields” concept, then we can not only solve up- and downwards compatibility issues but also gain a nice mechanism which we can use to identify “pointers”.

Plus, obviously, You should notice why I was so eager to have a structured abstract file format and why people are so much in love with zipped XML file formats nowadays.

All right, so we know about the problem with pointers and something about how best to deal with versioning. Now it is time to move to some JAVA related stuff.

Serialization: introduction

In this, that, that and finally there You could read about abstract file formats.

Note: please notice the reference implementation I proposed there.

Especially in the first post of the series You could read about Java serialization.

In this series of posts, which will explain the background behind my next project, I will try to explain the basics, concepts and pitfalls which one may encounter when trying to build a data serialization engine.

Note: Most of the stuff You will read here will become a part of the said project documentation and will be included in a more expanded version in the final project on GitHub.

What is serialization?

“Serialization” is a method of transforming complex, object based data structures, existing alive in memory, into “dead” files or data streams which can be moved from machine to machine or saved for later use.

In other words – a way to save objects on disk or pass them through the net.

What is de-serialization?

An exactly reverse process: having some “dead” data on disk or received from a network, we “de-serialize” them by creating living objects in memory matching the previously stored content.

“Memory dump” serialization

The most idiotic but often good enough form of serialization is a “memory dump”. Just take a native reference to the memory block containing the data structure and dump it on disk. Like in the pseudo-C code below:

         struct{ int a,b,c; } x;
         writeFile(..., &x, sizeof(x));
    

This type of serialization has some serious flaws:

  • it saves data including all “padding” bytes injected by a compiler;
  • different compilers or even different compilations may result in a different
    structure layout;
  • pointers? Can pointers be stored at all? What do You think about it?
  • absolutely zero robustness against version changes.

Even though this is an idiotic method, it was used to manually implement a “swap file”
in applications which needed far more memory than could be provided by
the operating system.

Why? Because it is extremely fast.

Pointers or references

There are usually no conceptual problems with saving elementary data like bits,
bytes, numbers or texts into any kind of “file format”.

Pointers and references are something else.

Pointer concept

A “pointer” or “reference” is, technically speaking, an integer number which is interpreted by CPU as an address in memory from which it should read something, execute something or write something.

There are very, very few cases in modern operating systems on modern machines when the “address” of a certain piece of data in memory (i.e. the variable which holds this text in Your web browser) will be preserved from one run of a program to another. In almost every case it will be different, even though in Your program it is named the same, is bit-by-bit the same and so on.

If You would save such an “address” on a disk, close the program, then start it again and load that address from the stored file, then there is a 99.999% chance that the loaded address won’t point to where it should.

When pointer can be serialized?

A pointer must be always serialized in a “smart way”.

First, the serialization mechanism must know that it is serializing the “pointer”. This means, that it can’t just dump a block of memory on disk and then load it later. It must know where in this block pointers are.

Second, the saved pointer must point only to a part of memory which is also being serialized. Only then You may somehow change the address X to “this is an offset X1 in N-th serialized block of memory”.

A pointer pointing to something what is not serialized can’t be serialized.

How pointer can be de-serialized?

The serialized pointer is basically a way of saying to which part of serialized data it points.

The easiest method of imagining it will be:

                       This is a memory
                    block to be serialized

         *******************b*******************c*************
         ↑                  ↑
         Aptr               ↑
                           bptr
    

The Aptr is the address of the memory block as the CPU sees it. For example 0x048F_F400.

The b is some variable in that block. The address of this variable, as the CPU sees it, is bptr.

And c is a variable inside that serialized block in which we would like to save the bptr.

Let us say bptr=0x048F_F4A0.

If we would just dump the block on disk the c would contain the 0x048F_F4A0.

Then imagine we are loading that block from disk to memory five days later.

Will it work?

Yes. Providing we do load it into exactly the same memory location. Our new Aptr must be 0x048F_F400.

If however You have done like:

         Aptr = malloc some data // Aptr ==  0x048F_F400
         writefile(Aptr...)
         ....
         kill program, wait five days
         ....
         Aptr = malloc some data  // Aptr == 0x0500_0000
         readFile(Aptr...)
    

Then c=0x048F_F4A0 won’t point to anything.

But if You would have done:

         Aptr = malloc some data // Aptr ==  0x048F_F400
         Aptr->c = Aptr->c - Aptr //    c ==  0x0000 00A0
         writefile(Aptr...) ; ← including c in this written block
         ....
         kill program, wait five days
         ....
         Aptr = malloc some data   // Aptr == 0x0500_0000
         readFile(Aptr...)         //   c  == 0x0000 00A0
         Aptr->c = Aptr->c + Aptr; //   c  == 0x0500 00A0
    

then c is correctly de-serialized.

The key concept You should remember and understand is:

To be able to serialize a pointer You must know when You are serializing a pointer.

Summary

After reading this blog post You should understand what the basic idea behind serialization is, and that a raw “memory dump” serialization is not the best concept for any long term data storage. You should also be aware that the trickiest part of it are pointers. You should also notice that the most important thing one should deal with during serialization is to figure out some way of saying: “hey, this is a pointer what I am serializing now!”.

Now You are ready to take the next step.

Abstract file format: what to use for a “signal”?

In that post I wrote:
(…)
The good abstract API needs to be able to indicate data boundaries and move from boundary to boundary with:

 void writeSignal(signal type)
   signal type readNextSignal()

(…)

I let myself use the enigmatic signal type.

Now it is time to dig into it.

What is “signal” used for?

Again, just to remind You: for telling what a certain bunch of data is used for. To give it a name.

How many different “signals” do we need?

At least two… or to be specific – two kinds of signals.

If a signal must give a name for a certain block of data then it must somehow indicate when this block starts and when it ends.

In fact it must be very much alike the good old C typedef:

typedef struct{
   int a;
   char c;
}Tmy_named_struct;

Since, to be able to efficiently process a stream of data, we need to know what a struct means before we start reading it, the structure in a data stream should rather look like even older Pascal:

 begin Tmy_named_struct
   int: a;
   char: c;
 end

The begin and the name comes first, the end comes last.

The “end” and the “begin”

This means we need an API which will be closer to:

 void writeBegin(signal name)
 void writeEnd()
 ...
 signal type readNextSignal()

Now You can see I used signal name and signal type. We need to define them more closely.

The signal name

Historically speaking, when I started this project, I decided to do:

 void writeBegin(int signal)
 void writeEnd()
 ...
 /*....
  @return positive for a "begin" signal, -1 for an "end" signal. */
 int readNextSignal()

It was a very bad idea.

At the first few uses I started to get pains managing what number means what and how to prevent numbers from different libraries from clashing. This was the same problem, although in mini-scale, as the global function name clashes in C.

So I thought to myself: How did they solve it in C?

With a name-space. You can assign a function name to a name-space and then use a fully qualified name to avoid a name clash. And if the names still clash, You can put a name-space into a name-space and form something like:

  space_a::space_b::function

Please excuse my wrong syntax. I have not used C/C++ for quite a long time and I don’t remember exactly how it looks nowadays.

So I could use a sequence of int numbers….

Dumb I was, wasn’t I?

A hundred times easier and more efficient is to do:

 
 void setNameLengthLimit(int length);
 void writeBegin(String name)throws ENameTooLong;
 void writeEnd();
 ...
 /*....
  @return either a name for "begin" signal or null for "end" signal
*/
 String readNextSignal()throws ENameTooLong;

We use String for names. Strings are variable in length, easy for humans to understand (this is important if the back-end is something like JSON, which ought to be human-readable) and Java has very efficient support for them.

Note: I did let myself introduce setNameLengthLimit(...) and throws ENameTooLong. Remember what I have said about the OutOfMemoryError attack? The ENameTooLong is there to let You keep flexibility and put safety brakes on Your stream.
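Just to make the shape of such a stream tangible, writing the Tmy_named_struct example from above through this kind of API could look more or less like this. IStructWriter is a hypothetical interface gathering the calls discussed in this post, with writeInt/writeChar standing in for the elementary-data methods:

 interface IStructWriter
 {
    void writeBegin(String name) throws java.io.IOException;
    void writeEnd()              throws java.io.IOException;
    void writeInt(int v)         throws java.io.IOException;
    void writeChar(char v)       throws java.io.IOException;
 }

 class Example
 {
    // Produces:  begin "Tmy_named_struct"   int a   char c   end
    static void write(IStructWriter out, int a, char c) throws java.io.IOException
    {
        out.writeBegin("Tmy_named_struct");
        out.writeInt(a);
        out.writeChar(c);
        out.writeEnd();
    }
 }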

But Strings are slooooow

Sure.

With an int as a signal name and a careful selection of constants like below, the code which looks like this:

  int name=...
  switch(name)
  {
    case 0: ... break;
    case 1: ... break;
  }

can be compiled on almost any architecture to a “computed goto”:

  mov name → reg0
  cmp reg0 with 1
  jmp_if_greater _after_switch
  shl reg0, times           ; according to DATA width
  mov jump_table[reg0], PC  ;read target from table and jump there
  jump_table:
     DATA case0
     DATA case1
    ...
case0:
   ...
   jmp _after_switch
case1:
   ...
   jmp _after_switch     

where the entire comparison takes just about 6 machine instructions, regardless of how huge the switch-case block is.

Prior to JDK8, using Strings was a pain in the behind, because the only syntax You could use was:

  if ("name0".equals(name))
  {
  }else
  if ("name1".equals(name))
  ....

which in the worst case scenario ended up in a large number of .equals calls.

At a certain moment, and I admit I missed it, the JAVA specs enforced the exact method of computing String.hashCode(). Prior to JDK8 it had no special meaning and each JVM/JRE/JDK could in fact provide its own implementation of String.hashCode(). There was simply no reason to enforce the use of the standard method.

Since JDK8 such a use did appear.

JAVA now absolutely requires that:

  • regardless of JDK/JRE/JVM String.hashCode() is always computed using the same algorithm. This way compiler may compute hash codes for known const Strings at compile time and be sure that in any environment:
       StringBuilder sb = new StringBuilder();
       sb.append('a');
       sb.append('b');
       sb.append('c');
    
       assert(  1237 == sb.toString().hashCode());
       assert(  1237 == "abc".hashCode());
    

    assertions will not fail.
    Please notice, 1237 is not a correct hash code for that example. I just faked it to show, that compile time constant can be used.

  • second they directly requested that String has something like:
        class String
        {
          boolean hash_valid;
          int hash_cache;
    
          public int hashCode()
          {
              if (hash_valid) return hash_cache;
              hash_cache = computeCache();
              hash_valid = true;
              return hash_cache;
          };
        }

    Notice this is very simplified code which may fail on some machines in a multi-threaded, multi-core environment due to memory bus re-ordering. Some precautions have to be taken to ensure that no other thread sees hash_valid==true before it can see hash_cache set to the computed value. Since String is implemented natively I won’t try to dig into it. It is just worth mentioning that volatile would do the job, but it would be unnecessarily expensive. I suppose the native code could have found a better solution.

    Notice, the race condition on setting up hash_cache is not a problem. Every call to computeCache() will always give the same result as JAVA strings are immutable. At worst we will compute it twice but nothing would break.

  • and third they did require that:
       class String
       {
         public boolean equals(Object o)
         {
            ... null, this, instanceof and etc checked.
            if (o.hashCode()!=this.hashCode()) return false;
            return compareCharByChar(this,(String)o);
         }
       } 
    

    which avoids calling compareCharByChar() unless there is a high probability that it will return true. And, of course, the compareCharByChar will terminate immediately at first not matching character.

In much, much simpler words it means that since JDK8 it is important that String.hashCode() is environment invariant, cached, and used for quick rejection of not matching strings.

Knowing that they let us use:

  String name=...
   switch(name)
   {
     case "name0":...break;
     case "name1":...break;
   };

which is not implemented as:

  String name=...
   if ("name0".equals(name))
   {
   }else if ("name1".equals(name))
   {
   };

but as something a hell of a lot more complex and a hell of a lot faster:

 String name=..
 switch(name.hashCode())
 {
    case 1243:
         if ("name0".equals(name))
         {
         } else if (... other const names with 1243 hash code.
         break;
   case 3345:
        ...
  }

This can be, in large cases, about 10 times faster than a pure if(...)else if(...) solution.

Nevertheless it is still at least 50 times slower than using an int directly, because it can’t use a jump table and must use a look-up table, which has a comparison cost linear with the number of cases. Unless a very large table is used, in which case we can profit from a binary search and reduce the cost to log2(N).

Nevertheless it won’t ever be even close to 6 machine instructions.

Then why String?

Because with int I really could not find a good, easy to maintain method of combining constants from different libraries written by different teams into a non-clashing set. I could figure out how to avoid clashing with static initialization blocks, but then such constants are not true compile-time constants and can’t be used in a switch-case block.

Strings are easy to maintain and clashes are rare due to the usually long, human friendly names.

In other words: use int if speed is everything and the difference between 6 and 600 machine cycles means everything for You, while continuous patching of libraries in a struggle to remove clashing codes is not a problem.

Use String if development costs and code maintenance are Your limit.

And even if You are, like me, a “speed freak”, please remember that we are talking about file or I/O formats. Will 6 versus 600 cycles matter when put against the time of loading data from a hard disk or a network connection?

I don’t think so.

But Strings are so laaaarge…

Yes they are.

And human readable.

Using non-human-readable numeric names for an XML or JSON back-end makes using XML or JSON pointless. The:

 <1>3</1>

is equally readable to human as a binary block.

If however size is Your concern, and in case of a binary back-end it will be, You can always create a “name registry” or a “name cache” and stuff into Your stream something like:

  with String names   |  with name registry
  --------------------+----------------------
  begin “joan”        |  assign “joan” to 0
                      |  begin 0
  end                 |  end
  begin “joan”        |  begin 0
  end                 |  end

and get a nice, tight stream with short numeric names and at the same time nice, maintenance friendly String names.

Note: this table is a bit of a simplification. In reality the capacity of the name cache will be limited and the capacity of numeric names will also be restricted. Most probably You will need assign, begin-with-numeric-name and begin-with-string-name commands… but this is another story.
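A sketch of the writing half of such a name registry; the three stream commands (assign, begin-by-number, begin-by-name) are exactly the hypothetical ones mentioned in the note above:

 import java.io.IOException;
 import java.util.HashMap;
 import java.util.Map;

 // Writing side of a "name cache": the first use of a name emits an "assign" command,
 // every later use emits only the short numeric form.
 final class NameRegistry
 {
    private final Map<String,Integer> known = new HashMap<>();
    private final int capacity;

    NameRegistry(int capacity){ this.capacity = capacity; }

    void writeBegin(String name, StreamCommands out) throws IOException
    {
        Integer index = known.get(name);
        if (index != null)
        {
            out.beginByNumber(index);
        }
        else if (known.size() < capacity)
        {
            index = known.size();
            known.put(name, index);
            out.assignName(name, index);   // "assign 'joan' to 0"
            out.beginByNumber(index);
        }
        else
        {
            out.beginByName(name);         // cache full: fall back to the long form
        }
    }

    // Hypothetical low-level commands of the binary back-end.
    interface StreamCommands
    {
        void assignName(String name, int index) throws IOException;
        void beginByNumber(int index)           throws IOException;
        void beginByName(String name)           throws IOException;
    }
 }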

Summary

After reading this blog entry You should now know that the “signal” can in fact be a “begin” accompanied with a “name”, or just a plain nameless “end”. You should also know that there is no reason to fret about using String for the “begin” name, and how to deal with the performance and size issues related to Strings used in that context.

What next? See there.

“Content driven” versus “code driven” file formats.

In a previous blog entry I did promise to tell You something about two different approaches to data parsing: “content driven” versus “code driven”.

Or “visitor” versus “iterator”.

Content driven

In this approach a parser is defined as a machine which exposes to an external world an API looking like:

public interface IParser
{
   public void parse(InputStream data, Handler content_handler)throws IOException;
}

where Handler is defined as:

public interface Handler
{
  public void onBegin(String signal_name);
  public void onEnd();
  public void onInteger(int x);
  public void on....
 and etc, and etc...
}

Conceptually the parser is responsible for recognizing what bytes in the stream mean what, and for invoking the appropriate method of the handler at the appropriate moment.

Good examples are org.xml.sax.XMLReader and org.xml.sax.ContentHandler from the standard JDK. Plenty of You have most probably used them.

Note: This is very much alike the “visitor” coding pattern in non-I/O related data processing. Just for Your information.

Benefits

At first glance it doesn’t look like we can have much profit from that, right? But the more complicated the file format becomes, the more benefits we have. Imagine a full blown XML with a hell of a lot of DTD schema, xlink-ed sub-files and a plenitude of attributes. Parsing it item by item would be complex, while with a handler we may just react on:

public interface org.xml.sax.ContentHandler...
{
  public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException
  {
   if ("Wheel".equals(qName))
   {
     ....

and easily extract a fragment we are looking for.

This is exactly the reason why XML parsing was defined that way. XML is hellishly tricky for manual processing!

Obviously we do not load the file just for fun. We usually like to have it in memory and do something with the contained data, right? We would like to have something very alike the Document Object Model, that is a data structure which reflects the file content. The “content handler” approach is ideal for that purpose, because we just build elements in some handler methods and append them to the data structure in memory.

Easy.

And the last, but one of the most important concepts: we can arm a parser with “syntax checking”. Like we arm a SAX parser by supplying an XML which carries inside its body the DTD document definition schema. The parser will do all the checking for us (well, almost all) and we can be safe, right?

Well… not right, but I will explain it later.

Why I call it “content driven”?

Because it is not You who tells what code is invoked and when. You just tell the parser what can be invoked, but when and in what sequence Your methods are called is decided by the person who prepared the data file.

Who, by the way, may wish to crack Your system.

Content driven vulnerabilities

XML

The content driven approach was the source of plenty of vulnerabilities in XML parsing. One of the best known was forging an XML with a cyclic, recursive DTD schema. The XML parser loads the DTD schema before it parses anything else from the XML. After that it creates a machine which is responsible for the validation process. If the DTD schema is recursive, the process of building this machine will consume all the memory and the system will barf.

Of course this gate for an attack was opened by some irresponsible idiot who thought that embedding the rules which say how a correct data file looks inside the data file itself is a good idea…

Note: Always supply Your org.xml.sax.XMLReader with an org.xml.sax.EntityResolver which will capture any reference to a DTD and forcefully supply a known good definition from Your internal resources.
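
A minimal sketch of such a resolver, assuming the trusted DTD is bundled as a class-path resource named trusted.dtd (a name invented for this example):

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SafeXmlReaderFactory
{
   // Creates an XMLReader which never fetches an external DTD: whatever the
   // document points at, we always serve our own, trusted copy.
   public static XMLReader create() throws Exception
   {
      XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      reader.setEntityResolver(new EntityResolver()
      {
         @Override public InputSource resolveEntity(String publicId, String systemId)
         {
            return new InputSource(
                  SafeXmlReaderFactory.class.getResourceAsStream("trusted.dtd"));
         }
      });
      return reader;
   }
}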

If You defend Your XML parser with a DTD or a similar schema, and You make sure that nobody can stuff a fake DTD in Your face, then in most cases the “content driven” approach will be fine.

When won’t it be fine?

When Your document syntax allows open, unbound recursion in its definition. Or when the DTD does not put any constraints (which it cannot do) on the length of an attribute. Or in some other pitfalls which I did not fall into, because I don’t use XML on a daily basis.

There is however another, even more hellish piece of API which can be used to crack Your machine… and this is…

Java serialization

Yep.

Or precisely speaking: Java de-serialization.

A serialized stream can, in fact, create practically any object it knows exists in the target system, with practically any content in its private fields. Usually creating an object does not run arbitrary code, but in Java it does. Sometimes a constructor will be called, sometimes the methods responsible for setting up the object after de-serialization will be. All of them will be parametrized with fields You might have crafted to Your liking.

The possible attack scenarios range from a simple OutOfMemoryError to the execution of some peculiar methods with very hard to predict side effects.

All in response to:

   Object x = in.readObject();

Basically this is why the modern JDK states that serialization is a low level, insecure mechanism which should be used only to exchange data between known good sources.

Preventing troubles

Since in the “content driven” approach it is the data that drives Your program, You must defend against incorrect data.

You can’t just code it right – instead You need to accept and parse bad data and only then reject it. For example, You need to accept an opening XML tag with huge attributes, and only once Your content handler is called can You say: “recursion too deep” or “invalid attribute”.

Similarly, in Java de-serialization You must either install:

public final void setObjectInputFilter(ObjectInputFilter filter)

(since JDK 9)
or override

protected ObjectStreamClass readClassDescriptor()

in earlier versions, to be able to restrict what kind of object can be created.
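
For illustration, a minimal JDK 9+ sketch which only ever accepts a hypothetical com.example.Point class (plus java.lang basics) and puts limits on graph depth and array sizes; the class name and the limits are assumptions made for this example:

import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputFilter;
import java.io.ObjectInputStream;

public class FilteredDeserializer
{
   // Rejects everything except the expected class, java.lang basics,
   // overly deep object graphs and overly long arrays.
   public static Object readPoint(InputStream raw) throws IOException, ClassNotFoundException
   {
      ObjectInputStream in = new ObjectInputStream(raw);
      in.setObjectInputFilter(ObjectInputFilter.Config.createFilter(
            "maxdepth=10;maxarray=1000;com.example.Point;java.lang.*;!*"));
      return in.readObject();
   }
}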

Notice that even then some code will be executed regardless of whether You reject the object or not, because the sole existence of a Class<?> object representing a loaded class means that the static initializer for that class has been executed.

The “content driven” approach always uses a load & reject security model.

I hope I don’t have to mention how insanely bug prone it is, do I?

Code driven

In this approach we do things exactly the opposite way: we do not ask the parser to parse and react to whatever is there. Instead, we know what we expect and we ask the parser to provide it. If it is not there, we fail before we load incorrect data.

For example, code driven XML parsing would be very much like using:

public interface javax.xml.stream.XMLEventReader
{
  boolean hasNext()
  XMLEvent nextEvent()
....
  String getElementText()
}

As You can see, You may check what the next element in the XML stream is before reading it.

Note: Unfortunately I have to mark one method of this class, getElementText(), as not being an attack-proof concept either. A String in XML is unbound, and a crafted rogue XML may carry a huge string inside an element body to trigger an OutOfMemoryError when You attempt to call that method.
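
To make the “code driven” flavour concrete, here is a small sketch which walks the event stream and looks for the <Wheel> element used earlier, pulling events only when it decides to:

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import java.io.InputStream;

public class WheelFinder
{
   // Returns true if the document contains at least one <Wheel> start tag.
   public static boolean hasWheel(InputStream xml) throws XMLStreamException
   {
      XMLEventReader reader = XMLInputFactory.newFactory().createXMLEventReader(xml);
      while (reader.hasNext())
      {
         XMLEvent e = reader.nextEvent();
         if (e.isStartElement())
         {
            StartElement s = e.asStartElement();
            if ("Wheel".equals(s.getName().getLocalPart())) return true;
         }
      }
      return false;
   }
}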

In a very similar way, Java de-serialization might be tightened a bit by providing an API like:

 Object readObject(Class of_class ...)

instead of just an unbound:

 Object readObject()

Sadly, the de-serialization API in general is maddeningly unsafe regardless of the approach. Which does not mean You should not use it. It just means You need to pass the data through trusted channels, to be sure the other side is not trying to fool You.
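
Such a readObject(Class) method does not exist in the JDK, but on JDK 9+ one can sketch an approximation on top of the filter mechanism. This is only an illustration: a real filter would also have to allow the classes of any reference fields inside the expected object.

import java.io.IOException;
import java.io.ObjectInputFilter;
import java.io.ObjectInputStream;

public class BoundedReader
{
   // Allows only objects assignable to the expected class to be created;
   // everything else is rejected before it is instantiated.
   public static <T> T readObject(ObjectInputStream in, Class<T> of_class)
                                  throws IOException, ClassNotFoundException
   {
      in.setObjectInputFilter(info ->
            info.serialClass() == null || of_class.isAssignableFrom(info.serialClass())
               ? ObjectInputFilter.Status.ALLOWED
               : ObjectInputFilter.Status.REJECTED);
      return of_class.cast(in.readObject());
   }
}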

Benefits

Using the “code driven” approach we can be as sure as possible that we do not accept incorrect input at all, instead of, as in the “content driven” approach, rejecting it later.

Simply put, what is not clearly specified in the code as expected won’t be processed. It is like wearing a scarf versus curing the flu.

On the other hand, one must write that code by hand, and usually the order of data fields will be forced to be fixed, or it would be too hard to code. One must also deal manually with missing fields, additional fields and all the other issues related to format versioning.

This is why I was so picky about the format being expandable and supporting dumb skipping.

Code driven vulnerabilities

Security? No inherent ones. At least as long as the API is well designed and all operations are bounded.

Usability?

Sure, a lot of trouble. Code driven data processing is very keyboard hungry.

But…

“Code driven” can be used to implement “content driven”

Consider, for example, a plain “get what we expect” code driven API.

It might look like:

public interface IReader
{
  public String readBegin(int max_signal_name) throws MissingElementException;
  public void readEnd() throws MissingElementException;
  public int readInt() throws MissingElementException;
....
};

This is the pure essence of the “code driven” approach. You have to know what You expect and You call the appropriate method. If You call the wrong one, it barfs with a MissingElementException.

Of course it means You must know the file format down to the exact field when You start coding the parser.

If we were however to define this API so that it allows us to “peek what is next”:

public interface IReader
{
  enum TElement{....}
  public TElement peek();
   ....
};

there would be absolutely no problem in writing something like:

public void parse(Handler h)
{
   for(;;)
   {
     switch(in.peek())
     {
        case BEGIN: h.onBegin(in.readBegin()); break;
        case EOF: return;   // assuming TElement has some end-of-data marker
        case .....
     }
   }
}

and we have just transformed our “code driven” parser into a “content driven” one, under the condition that we can “peek what is next”.

The opposite transformation is impossible.

“Iterator” versus “visitor”?

Yes, I did mention it at the beginning.

Those two concepts are very much like “code driven” and “content driven” parsing and, for Your information, both have been present in the Java Iterator contract since JDK 8.

First let us look at the pair of methods below:

public interface java.util.Iterator <T>
{
  boolean hasNext()
  T next()
   ....
};

They formulate a “code driven” contract which allows us to “peek if something is there” and then get it. If we don’t like it, we do not have to get it.

Then look at the method added in JDK 8, together with the introduction of lambdas and “functional streams”:

void forEachRemaining(Consumer<? super T> action)

This turns it into a “visitor” concept where in a:

public interface Consumer...
{
   void accept(T t)
};

the accept(t) method is invoked for every available piece of data, regardless of whether we expect more of it or not.
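
A tiny illustration of both styles on the same data:

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class IterationStyles
{
   public static void main(String[] args)
   {
      List<String> data = Arrays.asList("a", "b", "c");

      // "code driven" / iterator style: we decide when to pull the next element
      Iterator<String> i = data.iterator();
      while (i.hasNext())
      {
         String s = i.next();
         if ("b".equals(s)) break;          // we may stop whenever we like
      }

      // "content driven" / visitor style: our lambda is called for every remaining element
      data.iterator().forEachRemaining(s -> System.out.println(s));
   }
}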

The Reader may easily guess that if one loves the “functional streams” concept, which I don’t, then the “visitor” pattern has great potential.

Note: There is one case in which visitors beat iterators: thread safety. Thread safe iteration requires the user to ensure it, while visiting puts this job on the shoulders of the person who wrote the data structure.

Summary

After reading this blog entry You should notice that “content driven” parsing is very simple to use, but at the price of being inherently unsafe.

On the contrary, “code driven” parsing is usually an order of magnitude safer, but also an order of magnitude stiffer and harder to use.

If not for the fact that code driven parsing with a “peek what is next” API can be used to implement a “content driven” parser, the choice would be a matter of preference. Since this is how it is, my proposal of an abstract file format must, of course, be designed around the code driven approach.

Abstract file format API, basic primitives

All right, in this and that blog entry You might have read about file formats.

Then You might have noticed that I defined a certain absolute minimum for a file format which I called a “signal format” and proposed some API for it:

public interface ISignalWriter
{
   void writeSignal()...
   OutputStream content()...
}

and

public interface ISignalReader
{
  InputStream next();
};

You may also remember, that I have said this is a bad API.

Today I would like to explain why it is bad.

OutputStream/InputStream are bad guys

Note: For those who are not familiar with java.io I must explain a bit. InputStream and OutputStream are classes which provide a sequential, byte oriented API for binary data streams. You can just write some bytes and read them. Nothing fancy.

Now, first things first: we are talking about “abstract” file formats. Abstract, in terms of an API which allows us to write any data without concerning ourselves with how it is sent to the world. Binary? Fine, no problem. XML? Why not. JSON? Sure, You are welcome. And so on, and so on.

The InputStream and OutputStream are binary. They know only bytes and bits, and we have to play with them to encode our data. We can do “binary”, but XML won’t just happen without a lot of care from our side. And this is what I would like to avoid in my abstract file format API.

The API I proposed above takes care of naming data structures and telling us where they start and where they end. It also allows us to move around, basically skipping content which we do not care about. It does not, however, tell us how we store the data.

All right, but what are the data?

What are data?

Honestly? Anything. But to be more precise: anything You can express in Your programming language.

The primitive types.

In Java our data will be then:

boolean, byte, char, short, int, long, float, double

These are the basic building blocks. Absolutely everything that can be expressed in Java can be expressed using those data types.

Obviously other programming languages will have a different set of primitive types. The good thing about Java is that those types are well defined. There is no ambiguity like in C/C++: a byte is always a binary, two's complement (U2) encoded signed number with 8 bits. This is why I love Java.

Primitive data versus Input/Output streams

Obviously I am not a genius and there were smart people before me. The Java guys thought a long time ago: “hey, why play with bits and bytes in Output/Input streams? Can’t we just play with primitive types?”

And they introduced the DataInput and DataOutput interfaces.

The idea was good… except it was totally fucked up. This was still the era when we struggled to tell apart a “contract” from an “implementation” (interfaces and pure virtual classes were something new then) and they defined those interfaces like that:

int readInt() throws IOException
Reads four input bytes and returns an int value. 
Let a-d be the first through fourth bytes read. The value returned is:
 (((a & 0xff) << 24) | ((b & 0xff) << 16) |
  ((c & 0xff) <<  8) | (d & 0xff)) 

I have highlighted (in red, in the original post) what was done wrong: the byte-by-byte recipe. They not only defined that this method reads a 32 bit signed integer from a stream, but they also specified how it should be achieved. They mixed up the contract with an implementation.

But if You just ignore that part and leave only the following fragment:

int readInt() throws IOException
Reads and returns an int value. 

then it is a good abstract API for reading primitive data from a stream. I like it.

What else is wrong in DataInput?

Since I am already pointing out what was done wrong, let me continue and point out another weak and possibly dangerous point in the DataInput API.

The next candidate to yell at is:

String readUTF() throws IOException

which is defined wrongly in three ways.

  1. It specifies how the string is stored in binary form. This mixes up an implementation with a contract, but I have already said that.
  2. The binary format which was chosen limits the size of the encoded binary form of a string to 64kBytes. Notice that this creates two problems:
    • first, it prevents saving longer strings;
    • second, You can’t predict whether Your string will fit within the 64k limit until You try it. The limit applies to the UTF-8 encoded form, and the size of the UTF-8 form depends on the string content. This is silly and makes it unpredictable. Unpredictable code is bug prone and inherently unsafe. You may be fine saving a 65535 letters long US English text, but in Chinese You will hit the limit, I think, at about 21 thousand characters or less (UTF-8 uses three bytes per Chinese character).
  3. And lastly, this API removes any control from the reader over how much data will actually be read and used to build the String. Sure, the 64k encoding limit puts a serious safety constraint on it, but You can’t say how long the returned string will be until You read it.

Why was it done this way?

Because it reflects the constraints the class file format puts on constants and data. DataInput and DataOutput were initially meant to manipulate Java class files. And that is all.

All right, so how should readUTF be declared then?

Maybe this would be ok:

String readUTF() throws IOException
Reads a UTF or similarly encoded string of arbitrary length and returns it.

This API looks good. We have plenty of similar APIs, right?

Except that it would be insanely unsafe.

And this is where we come to an another important factor.

File formats are gateways for an enemy attack

Yes.

The:

String readUTF() throws IOException
Reads a UTF or similarly encoded string of arbitrary length and returns it.

could be a good API if it were used internally, inside a program, to process data it keeps in memory or produces on demand. If however it interfaces with a potentially hostile external world, we have to take more care.

First, we should not limit our usability and should not constrain the length of a String in a dumb way like DataInput did. We may want to store MBytes or GBytes that way. Or we may just store sentences a few characters long. On the implementation side we will have to resort to something functionally equivalent to the good old “null terminated string”. Remember, unnatural limits in an API remove its usability.

But having no size limit means…. that we have no size limit.

Remember, the file comes from outside the program and may be crafted by an attacker to intentionally harm us. For example, an attacker may create a program which just pumps characters into an I/O stream forever, at no cost except the connection load.

What would code implementing the API:

String readUTF() throws IOException

do if it were confronted with such a crafted stream?

It will first allocate some small buffer for the text. Then it will load text from the underlying file or I/O stream, decode it and append it to the buffer. If the buffer gets full before the end of the text is reached, it will re-allocate it and fill it again. And again, and again… till an OutOfMemoryError is thrown.

Even though Java is well defended against this type of error, the OutOfMemoryError is one of the nastiest to recover from, because it can pollute the system all around. Imagine one of the threads touching the memory limit. Sure, it did something wrong and is punished with an error. But what if a well behaving thread is also allocating some memory during the problematic operation performed by the wrong doing thread? It is just a matter of randomness which of them will be punished with the OutOfMemoryError.

We can’t open this gateway to hell!

The correct API would look like:

int readUTF(Appendable buffer, int up_to_chars) throws IOException
Reads up to the specified number of characters and appends them to the given buffer.
@return the number of appended characters.
        The returned value is -1 if nothing was read due to the end of the text.
        If the returned value is less than up_to_chars then the end of the text was reached.
        If the returned value is equal to up_to_chars then either the end of the text was reached,
        or some characters are left unread and can be fetched by subsequent calls of
        this method.

Sure, it doesn’t look very easy to use, but it allows us to stay in control and restrain the out-of-resources attack by simply calling:

 int s = readUTF(my_buffer, 64*1024);
 if (s==64*1024) throw new AttackException("String too long, somebody is trying to attack us!");

and be sure that even if an attacker forges a dangerously huge I/O stream, it won’t harm our application.
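
How could such a method be implemented? Below is a minimal sketch, assuming the characters are delivered through a java.io.Reader which already handles the UTF decoding and returns -1 at the end of the text; the class name is invented for this example.

import java.io.IOException;
import java.io.Reader;

public class BoundedText
{
   // Implements the bounded contract described above.
   public static int readUTF(Reader in, Appendable buffer, int up_to_chars) throws IOException
   {
      int count = 0;
      while (count < up_to_chars)
      {
         int c = in.read();                       // one decoded character, or -1 at the end
         if (c == -1) return count == 0 ? -1 : count;
         buffer.append((char) c);
         count++;
      }
      return count;                               // buffer full; more characters may follow
   }
}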

So how should the API look?

Again I have drifted offshore into unknown lands. So let me swim back and return to the API.

The good abstract API needs:

  • to be able to indicate data boundaries and move from boundary to boundary with:
       void writeSignal(signal type)
       signal type readNextSignal()
    
  • to write and read elementary primitives without taking care of how exactly they are stored:
     
       void writeBoolean(boolean x)
       boolean readBoolean() throws IOException
        ...and so on for each primitive
    

    The DataInput and DataOutput are good starting points if You remove anything related to “bytes” and encoding from them.
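
Putting the two requirements together, the writer side of such an abstract API could be sketched like this (all names here are invented for this post; the reading counterpart would mirror it with readNextSignal(), readBoolean() and so on):

import java.io.IOException;

public interface IStructWriter
{
   void writeBeginSignal(String signal_name) throws IOException;   // named "begin" marker
   void writeEndSignal() throws IOException;                       // nameless "end" marker

   void writeBoolean(boolean x) throws IOException;
   void writeInt(int x) throws IOException;
   void writeDouble(double x) throws IOException;
   // ...and so on for the remaining primitive types
}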

Is that all?

Well… no, it is not. But before I move on to more details we will have to talk about content driven parsing and code driven parsing, because it will impact the API a lot and will again show us some serious safety issues which may be created by a carelessly built API.

Summary

After reading this blog entry You should be aware of how the abstract file format should deal with basic data like numbers and so on. You should also be able to point out potential safety issues in file format related APIs.

What are file formats anyway?

In a previous blog entry I roughly described how Java serialization uses its file format and why it is wrong. I also introduced the idea of a pluggable file format.

In this blog entry I would like to dig into the very definition of a file format.

How is a file format defined?

By hand and on paper.

Really. No joking.

What are You looking for when, let’s say, You are tasked with processing data stored in an STL file?

For a specification. You are looking for a human readable specification.

Format specification

The document called a “format specification” may be either very short and unclear, as in the case of binary STL, or it may be very long, formal… and also unclear, as in the case of the Microsoft *.lnk format.

Regardless of how it is expressed, the ultimate result of reading it is to know which data in the file mean what.

That is all. It may say: “the first four bytes store an IEEE32 float for the X coordinate, the next four bytes likewise for Y, the next for Z, and so on to the end of the file”. This is, roughly speaking, the idea behind the STL format. Or it may say it in a much, much more complex way, as in the case of the Microsoft format mentioned above.

The common denominator of such formats is one: if You don’t have the specs, You can’t do anything with the file.

Exactly as with Java serialization format.

Intermediate file formats

One may say: “All right, but we have XML. It is self describing and solves everything”.

Almost good. Almost…. No, not at all.

XML is self describing. This is true.

You may open a file and force Your machine to interpret it as UTF-8 text. Or as ASCII text. Or as UTF-16LE text. Or as UTF-32BE text. Or…

Sooner or later You will get readable text. Then You may look at it with Your human eye and deduce what means what. Unless it is a *.dwf file, which is a “portable” format consisting of XML with one huge text-encoded binary block. What a nice joke they made!

Then, my dear XML fan, why not JSON? It is also self describing. And as a plus, there is no text encoding lottery, because it is hard-coded to UTF-8.

Bad formats, bad!

The primary problem with both, and in fact with most formats of such kind, is that files are huge.

A 64 bit floating point number needs 8 bytes in a binary format. And about 20, or even more, in JSON or XML. Not to mention that some numbers which have a finite base-2 form have an infinite base-10 form, and vice versa.

Note:
At first I listed here a lot of parsing and security problems, but later I decided this is not the right moment. So let’s stick with “huge” as the main problem.

Good formats, good!

The most important advantages of XML-like intermediate formats are:

  • they are self-describing;
  • they do allow “dumb skipping” of unknown content;
  • they are expandable;

Self describing means…

For the format to be “self-describing” it is necessary that it somehow, in a standardized way, gives names to the elements it carries. Since the way of giving names is standard, You may take a file, parse it using a standard parser, and see which names appear and in what order. With this information You may easily guess what is stored where.

Both XML and JSON are self-describing.

Dumb skipping of unknown content is…

This functionality is tightly coupled with the previous one. The standard way of giving names means that there must be a standard way to find names. This way must be independent of the content carried inside or between named elements. If it depended on it, then You would not be able to find the names.

For example, we may create a text format whose specification says:

The file is divided into tokens separated by “,”. The first token is the name of the first element. After the name there are some “fields” and then there is the name of the next element.

The number and meaning of the fields is the following:

  • for name “point” we have two fields (X,Y);
  • for name “circle” we have three fields (X,Y,R);

This format does not allow dumb skipping. You must know the mapping from names to token counts to find which token is a name and which is a field.

If, for example, this format is modified to:

File is divided into tokens separated by “,”. First token in a line is the name of an element (…)

then this format would allow dumb skipping, because the name is always first in each line.

Dumb skipping is very important because it allows You to extract the data of interest from a file without bothering about the full syntax of the file.

And expandable is…

This is almost like “dumb skipping”, but not exactly the same. “Dumb skipping” allows You to ignore elements You do not understand. For example, if version 1.0 of the above simplified format knew only “point” and “circle”, and version 2.0 added:
(…)

  • for name “rectangle” we have four fields;

then a parser understanding version 1.0 may still parse a 2.0 file. It won’t be able to react correctly to a “rectangle”, but its presence won’t stop it from understanding the rest of the file. And what would it do with a “rectangle” anyway, if the application it is built into does not know rectangles?

If however version 1.1 added:
(…)

  • for the name “circle” we have four fields (X,Y,R,A): the first two being X and Y, the next the radius, and the last the aspect ratio;

then our version 1.0 parser may read “circle”, read three fields and then expect a name. Which is not there. If the file format is expandable, this parser should not be fooled by this, and the request “and now I expect a name” should be correctly fulfilled by skipping the aspect ratio added in the 1.1 version of the file.

In other words, to be expandable the format must allow “dumb skipping” regardless of which token the cursor is at. A sketch of a parser tolerant in both ways is shown below.
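
A minimal sketch of such a tolerant, line based parser for the modified example format ("name,field,field,…" per line), written as a version 1.0 parser which only knows points and circles:

import java.io.BufferedReader;
import java.io.IOException;

public class ShapeParser
{
   public static void parse(BufferedReader in) throws IOException
   {
      String line;
      while ((line = in.readLine()) != null)
      {
         String[] tokens = line.split(",");
         switch (tokens[0])
         {
            case "point":
               onPoint(Float.parseFloat(tokens[1]), Float.parseFloat(tokens[2]));
               break;
            case "circle":
               onCircle(Float.parseFloat(tokens[1]), Float.parseFloat(tokens[2]),
                        Float.parseFloat(tokens[3]));
               break;
            default:
               // unknown element, e.g. "rectangle" from version 2.0: dumb-skip the whole line
               break;
         }
         // extra fields, e.g. the aspect ratio added in version 1.1, are simply ignored
      }
   }

   private static void onPoint(float x, float y)            { /* use the point */ }
   private static void onCircle(float x, float y, float r)  { /* use the circle */ }
}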

So why do I find XML bad?

Because in both cases, XML and JSON, the declaration of the API:

and You can start element with a name, then write content, naming possible sub-elements...

is bundled together, inseparably, with the implementation of the API:

 XML element starts with <name and ...

Smallest common denominator

Now let us ponder: what is the smallest common denominator of XML, JSON and “specification” file formats?

To know where the information about X starts and where it ends. “Specification” formats frequently say it by relying on the position of the cursor in a binary file. XML and JSON use a kind of syntactic marker.

And this is it.

The smallest common denominator is the ability to say, when writing a file:

“Here is a boundary of the element”

This is all we need to be able to parse format element-by-element.

Elementary signal file format

This smallest common denominator API may be defined in Java like:

public interface ISignalWriter
{
   void writeSignal()...
   OutputStream content()...
}

The writeSignal() writes a kind of “syntactic marker” (let us for now ignore how it does it), and the stream returned by content() allows us to write raw bytes into such a format.

The reading counter-part may look like:

public interface ISignalReader
{
  InputStream next();
};

where next() finds the next, nearest “syntactic marker” and returns an InputStream object which allows us to read the content up to the following “syntactic marker”.

The very important functionality is that next() must work regardless of how many bytes were read from the previously returned InputStream. That is, it must support both “dumb skipping” and “expandability”.
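
A usage sketch of that contract (assuming, since the API above does not say it, that next() returns null once there are no more markers; the 0x01 marker byte is also made up for this example):

import java.io.IOException;
import java.io.InputStream;

public class SignalReaderDemo
{
   public static void dump(ISignalReader reader) throws IOException
   {
      for (InputStream content = reader.next(); content != null; content = reader.next())
      {
         int first = content.read();          // look at the first byte of the element only
         if (first == 0x01)
         {
            // this element is interesting: read the rest of its bytes here
         }
         // otherwise we just move on; next() will skip whatever we left unread
      }
   }
}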

Summary

After reading this blog entry You should have some idea of what the requirements for a good file format are, and of how an elementary API may look. I do warn You that this is in fact NOT a good API yet, but it illustrates the concept well.

In the next blog entry I will expand that idea into a bit more sophisticated form.

Towards abstract file format

Today I would like to talk about complex file formats.

Anyone of You who programs has most probably been either reading or writing files. If You got lucky, You were supplied with some format specific API. If not, You had to get the format specification and write the API Yourself.

How many times did You have to do it? Five? Ten? More?

I got a bit pissed off having to do the same brainless work again and again. I have my data, I know their structure, and I would like to just push them to a file. I should not care whether the format is XML, JSON or some binary format, right?

Java Serialization

Java serialization was a brilliant step forward, but it stopped half way.

Note: For those who do not know what serialization is: You take an object, You take a stream and say “write that damn object to the stream”. And serialization writes the object and all the objects it references. Just like that.

Why am I saying this? Well…

Because serialization is a fixed, hard-coded binary format. Even worse, it is implemented in such a way that there is no clear “borderline” which You could override to serialize the object to, say, XML instead.

Sure, You will say, we have other serialization engines for Java which write to XML. Yes, we have. But they also come with a hard-coded format and, what is much worse, with their own object processor.

In fact, what I need is a standard serialization engine with a “pluggable format”. Something like that.

How is it done currently?

The current serialization source code is built like this:

Taken from LTS JDK 8 source
....
  private void writeHandle(int handle) throws IOException {
        bout.writeByte(TC_REFERENCE);
        bout.writeInt(baseWireHandle + handle);
    }

This is a part of ObjectOutputStream.java which is responsible for writing to the stream a reference (pointer) to an object which was already serialized (or at least partially serialized). This is a good API for a serialization format: having an API with writeHandle() would be nice. The implementation is however utterly stupid, at least from an object oriented programming point of view.

This method should be:

   protected abstract void writeHandle(int handle) throws IOException;

and should be declared in class AbstractObjectOutputStream. Then a DefaultObjectOutputStream class should be declared and it should carry the implementation:

  @Override protected void writeHandle(int handle) throws IOException {
        bout.writeByte(TC_REFERENCE);
        bout.writeInt(baseWireHandle + handle);
    }

If it were done this way, we could easily change the binary format to XML, a text dump or whatever we would like to have.

Note: Looking at the serialization source code, one should derive from it the bright idea that You should not let inexperienced coders code new ideas. The idea of serialization and the algorithms behind it were very new at the time. Nobody had ever done anything like it before. Sure, it has numerous conceptual bugs, but the coding… it was a sea of errors which can be made only by very inexperienced coders.

Making it pluggable

The obvious next part would be:

public interface ISerializationFormat
{
    public void writeHandle(int handle) throws IOException;
.....
}

and

public class PluggableObjectOutputStream extends AbstractObjectOutputStream
{
     private final ISerializationFormat fmt;
  ....
    public PluggableObjectOutputStream( ISerializationFormat fmt )....
    ....
@Override protected void writeHandle(int handle) throws IOException {
       fmt.writeHandle(handle);
    }
}

This way we could use the precious “wrapper” technique to debug serialization by, for example:

   AbstractObjectOutputStream o = new PluggableObjectOutputStream ( new LoggingSerializationFormat ( new DefaultSerializationFormat( ....

Try doing it now…
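
Such a LoggingSerializationFormat is trivial to sketch once the format is hidden behind an interface (only the single writeHandle() method from above is wrapped here; a complete class would wrap the remaining ISerializationFormat methods the same way):

import java.io.IOException;

public class LoggingSerializationFormat implements ISerializationFormat
{
   private final ISerializationFormat wrapped;

   public LoggingSerializationFormat(ISerializationFormat wrapped)
   {
      this.wrapped = wrapped;
   }

   @Override public void writeHandle(int handle) throws IOException
   {
      System.out.println("writeHandle(" + handle + ")");
      wrapped.writeHandle(handle);
   }

   // ...and so on for the remaining methods of ISerializationFormat
}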

Benefits of pluggable serialization

One may say: “All right, so You have a problem with that. This is just because You are lazy. If You need a different format, why not write the serialization Yourself? The algorithm and sources are public, right?”

One may be right. May be.

But only if that someone has not tried to do it. I have tried. Three times.

The serialization algorithm is not trivial. But it can be done. In a very ugly, sub-optimal way, but it can be done.

What cannot be done, at least not in a portable way in pure Java at the JDK 8 level, is de-serialization. Specifically, a very, very tiny bit of it, which translates to the Java bytecode:

   new "x.y.class"
   without calling a constructor

This sequence of bytecode can be executed by the JVM, but it is prohibited and rejected by the class file verification mechanism. You can’t have an object without a constructor being called, and serialization specifically does not require the class to have a “do nothing” constructor. This action must be implemented by digging into the JVM guts, and thus the special open source project called Objenesis (as far as I recall) was created. But this project is no magic and does nothing more than “check what JVM you run on and hack it”.

So implementing an exactly compatible de-serialization algorithm is, at the very least, a very time consuming task.

Just to make it, let’s say XML?

If it were pluggable, then there would be no problem at all.

Serialization format API or…?

Up to now I have been talking about the very specific case of the Java serialization API. This API is very focused on dense packing of Java objects. If You just try to use it to save a struct of some fields, You will notice that it pollutes the stream with things called “class descriptors”, reference handles and so on. While You just wanted to have a plain, dumb structure, right?

I think we should now focus on thinking about what exactly the file format is.

But this will be in the next blog entry.

Summary

After reading this short blog entry You should have grasped why something like a “pluggable file format” may be useful. You should also know what inspired me to dig into the problem.
In the next blog entry You will be shown details of how exactly we should define such a format.

Binary file formats: how to screw it up

In this long and boring blog entry I will try to show You most of the mistakes I have encountered in specifications of binary file formats.

But first things first.

Binary data format

A binary data format is a 010101…0101 representation of some abstract data You have in Your program. At first glance it looks exactly like the data structures in memory, but there are subtle yet important differences.

The first, most important difference is that binary data stored in a file exist outside the program. They can be put on some data storage or travel through a wire or the air. They are used to move information across space, time and machines of different types and architectures. They may be specified in a way independent of the media they exist on, or they may be tightly bound to it.

If the binary file format is independent of the media, it leaves some elementary data properties to the media format. In such a case we are most probably speaking about a “file format”, or an “application layer” if the data travel over a wire.

If the binary format depends on the media, we usually speak about “protocols”. Both have their specific quirks, tips and tricks.

Since a “protocol” is both data and media specific, and a “file format” is just data specific, let me first talk about the “file format”.

Note: All the following text assumes that the file media uses “bytes” as its elementary transaction data elements.

A bad example

Ok, so let me show how to do it wrong.

Assume now that I am a C programmer. I live in the C world (something not like A or B class worlds 😉 ), so when I was told to define a simple file format I did something like this:

typedef struct{
  char identifier[32];
  unsigned int number_of_entities;
} Header;
typedef struct
{
  float X;
  float Y;
  float Z;
  int attributes;
  signed short int level;
} Entity;

and I said that a file consists of a Header followed by a number of Entities.

I then said that:

  • identifier is a text identifying the format, which is “Mój żałosny format” (“My pathetic format”). Notice, I intentionally formulated it in a non-English language;
  • number_of_entities is a number of Entity elements following it;
  • X,Y,Z are some coordinates;
  • attributes are some attributes;
  • level is a level of importance assigned to an Entity.

Ignoring the meaning of the data, is this a good specification of how to represent them in binary form?

What do You think?

I think it is very bad.

Characters

In the C world “char” is vague. It may, depending on the machine or compiler, be a signed or unsigned, 8 bit or longer integer number. Notice also that C does not pin down how signed integers are represented in binary; it is only specific about unsigned integers.

Second, there is always the problem of how to represent an actual text and how to encode it into its binary form. Like how to turn “Mój żałosny format” into bits so that it can be read by any machine in the world and understood correctly.

The source of all the problems with character encoding comes from a typical Anglo-Saxon arrogance. Since the very beginning of computers in Poland we have always struggled with it. A “character” mentally equaled ASCII and that was all. And we in Poland needed more than mere, arrogant ASCII. I assure You that those little ąęźżć dots and lines over characters have a critical meaning. Like in the famous: “Ona robi mi łaskę” (she does me a favor), which, if stripped of those little lines, turns into: “Ona robi mi laske” (she gives me head). You may guess what difference it makes whether Your wife receives an SMS with the second sentence instead of the first one. And yes, it still happens. The arrogance of telecoms and Google is so high that Android smartphones by default strip Polish letters from all SMS messages without a warning. They claim they do it to save Your money, because telecoms price a Polish SMS at twice the ASCII SMS. Well… what is $0.01 compared to divorce costs?

But back to business.

Whenever You say “character” or “text” You must specify what character encoding is to be used. If You say ASCII then it is fine. But You may be polite and say “UTF-8”, which is, I think, a good compatibility path. UTF-8 text always looks acceptable when understood as straightforward ASCII (just some “dumb letters” appear) and can be processed by any 8-bit character routines which are unaware of the UTF-8 character encoding.

When You specify the encoding it is wise to avoid the “char” type and use “byte” instead. A byte is always a sequence of 8 bits. Just for clarity.

So the specs should be:

byte [32] identifier ; //An UTF-8 text.

Length of a text

The second element of any text specification is to say how to figure out where the text ends.

In my example I intentionally used 32 bytes and a shorter text. How should the end of the text be detected?

The standard C way is to add a zero at the end. So a 32 bytes long array may carry up to 31 ASCII characters and a hard to predict number of UTF-8 characters. Notice that this approach means a character of value zero is prohibited. If such a character is present in a text, it may result in a false, early “end of string”. And in a disaster, if that end of string was used to say “and after this string the next element follows”.

In Java, for example, binary zero is a fully valid “character”.

The other method of saying what the length of the string is, is to add a “length field” which clearly specifies the length of the following text.

In the example I made:

byte [32] identifier ; //An UTF-8 text.

I did however decide to set the space for the text to a fixed size. I did it so that the header would have a fixed, known and finite size.

For short, bounded texts it is acceptable to do it that way and to say:

“The encoded text ends either with a zero byte or at the 32nd byte. If the encoded text is shorter than 32 bytes, all remaining bytes should be zero.”
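
In Java terms, such a fixed, zero padded, UTF-8 encoded field could be written and read as sketched below (DataOutput and DataInput are used here purely as byte sinks; the class name is invented for this example):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FixedTextField
{
   public static void write(DataOutput out, String text) throws IOException
   {
      byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
      if (encoded.length > 32)
            throw new IOException("identifier longer than 32 bytes once UTF-8 encoded");
      out.write(Arrays.copyOf(encoded, 32));       // copyOf pads the remainder with zeros
   }

   public static String read(DataInput in) throws IOException
   {
      byte[] raw = new byte[32];
      in.readFully(raw);
      int end = 0;
      while (end < 32 && raw[end] != 0) end++;     // text ends at the first zero byte or at byte 32
      return new String(raw, 0, end, StandardCharsets.UTF_8);
   }
}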

Integers

We have:

   unsigned int  number_of_entities

and

   int attributes
   short signed int level

What exactly does it mean?

An integer is a signed number which can be iterated from -N to N in increments of “1”. That is all. In C, that is. So the first thing we have to clarify is: this is a binary integer. If we don’t say it, it may be, for example, a binary-coded decimal.

Second, binary signed integers may be encoded with a sign bit, bias, one’s complement or two’s complement. Please specify it.

The two points above can usually be skipped, because binary two’s complement integers dominate the world. So if You do not specify it, You may be 99% sure that over the next 20 years or so (until quantum computers dominate) coders will read “signed integer” as a “binary two’s complement” number.

But how long are those numbers? How many bits or bytes do they have?

“int”, “short”, “char” and so on have in C only a lower bound on their size. If You are in Java or C# You are very specific when saying “int”. But if You do not say in the specs that “all types are Java types”, then no, You haven’t said anything about the exact length.

unsigned int24 number_of_entities
unsigned int11 attributes
signed int24 level

This looks much better. But it is still incomplete. What is the byte order? Least significant byte first? Most significant? Or some other mash-up? Please say it.

Now, You probably noticed the int11 type. No, it was not a mistake. An 11 bit long type. In 99% of cases You won’t be needing such types, but if You do need them it is wise to know what to do.

Now please consider: how would You interpret the above structure of three numbers? At which bit does “level” start? At 24+11? Or at 24+16?

If Your data are aligned to a certain number of bytes or bits You must always specify that.
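
For example, if the specification also said “least significant byte first, every field aligned to a full byte”, reading the unsigned int24 from above could, in Java, look like this (the class name is invented for this sketch):

import java.io.DataInput;
import java.io.IOException;

public class Int24
{
   public static int readUnsignedInt24LE(DataInput in) throws IOException
   {
      int b0 = in.readUnsignedByte();          // least significant byte first
      int b1 = in.readUnsignedByte();
      int b2 = in.readUnsignedByte();          // most significant byte last
      return (b2 << 16) | (b1 << 8) | b0;      // always fits in a non-negative Java int
   }
}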

Floating points

Basically the screw-ups You can make are the same as with integers. You must specify the format (e.g. IEEE 754, 32 bit), the byte order and the alignment.

read(buffer_ptr,sizeof(Header))

This is the most tempting line to write in C when dealing with a binary file format. A nice, seemingly portable way to read the Header in one haul.

Never do it. Never.

Alright, I was joking. You can do it. I do it. But only when You are going to use such code on a fixed, known CPU architecture and with a fixed, well known C compiler.

Why?

Because C compilers can arrange structure fields in memory the way they like. The only things they need to preserve are the ordering (that much is actually guaranteed by the standard) and the types. They may add gaps, empty spaces and so on. For example, an MSP430 CPU can fetch 16 bit data from an even address with one instruction, but to do it from an odd address it needs four instructions. So most C compilers will put all 16 bit data at a 16 bit boundary and will lay out all structures the same way.

So, depending on the CPU and compiler, even if You use properly sized types for the fields, the size of a structure in memory may differ from the sum of the sizes of all the data in it.

But if You, like me, are coding on micro-controllers of a fixed brand with a fixed compiler, then it pays back in terms of code size and speed to use a hack and define so called “packed structures”, collect them in “packed unions” and lay them over the memory buffers used for data transfer. It is an excellent, fool proof, easy to maintain way of decoding incoming data at near zero cost. Providing You cast Your compilation environment in stone.