Java serialization: if not the pointer then what?

In that post I talked about “reflection” in Java and how this concept relates to serialization.

In this post I would like to say a few words about how to deal with “object references” when we can’t have a pointer.

Why do we need a pointer?

Under the hood, a pointer says what is where in memory. Seen from the outside, however, it serves just one purpose: telling two objects apart. If their references (i.e. pointers) differ, they are not the same object. They may be bit-for-bit equal, but they are still not the same.

So basically, as long as we can build a 1:1 mapping:

  Object reference ↔ sequence-of-bits

then we are done.

Identity of objects in Java

Luckily, Java provides two facilities for that:

class System {
  ...
  public static native int identityHashCode(Object x);
  ...
}

which computes a “magic” number, giving a mapping that is not 1:1:

  Object reference → 32 bit integer

and

  Object X, Y;
    X==Y
    X!=Y

the reference identity operators, which tell whether two “pointers” point to the same object.

Those two are enough to create an identity hash map (like java.util.IdentityHashMap), which can quickly map an Object to an int:

  stream_reference_map = new IdentityHashMap<Object, Integer>()

Of course we could do the same without identityHashCode, using only the == operator and a list of structures like:

class Descriptor
{
  final Object reference;
  final int assigned_number;
}

but every lookup would then be a linear scan over the list, which for a large number of objects is orders of magnitude slower.
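To see why the map has to be an identity map rather than an ordinary one, here is a minimal sketch. The class and method names are made up for illustration. It puts two equal but distinct strings into both kinds of map and counts the keys each map ends up with.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

final class IdentityVersusEquality {
    /** Number of distinct keys the given map sees for two equal but distinct strings. */
    static int distinctKeys(Map<Object, Integer> map) {
        String x = new String("same text");   // new String(...) forces a separate object
        String y = new String("same text");
        map.put(x, 0);
        map.put(y, 1);
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctKeys(new HashMap<>()));          // 1: keys compared with equals()
        System.out.println(distinctKeys(new IdentityHashMap<>()));  // 2: keys compared with ==
    }
}
```

An ordinary HashMap would merge the two strings into one entry, so both would end up with the same number, which is exactly what a “pointer” replacement must not do.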

Stream reference

The stream_reference_map shown above maps an Object to an int.
This number is called the “stream reference identifier” or, in short form, “refid”.

Note: Remember, the “refid” is not the result of identityHashCode()! identityHashCode() does not produce a 1:1 mapping and may return the same number for many objects. It is used only to speed things up by grouping objects into “buckets”, within which we still need the == operator.

Producing stream reference identifier

Any method will do. You should, however, think about a few questions:

  1. Should I allow an unlimited number of objects to be transferred through the stream?
  2. Should I allow garbage collection and re-use of refids?

Usually a simple incrementing counter will be OK.
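The incrementing counter could look roughly like the sketch below. RefidAssigner and refidOf are hypothetical names. The sketch answers both questions above with “no”: the number of objects is limited by int, and a refid is never re-used.

```java
import java.util.IdentityHashMap;
import java.util.Map;

final class RefidAssigner {
    private final Map<Object, Integer> streamReferenceMap = new IdentityHashMap<>();
    private int refidGenerator = 0;   // next refid to hand out

    /** Returns the refid of x, assigning the next counter value on first sight. */
    int refidOf(Object x) {
        Integer known = streamReferenceMap.get(x);
        if (known != null) return known;
        int refid = refidGenerator;
        // incrementExact throws ArithmeticException instead of silently wrapping around
        refidGenerator = Math.incrementExact(refidGenerator);
        streamReferenceMap.put(x, refid);
        return refid;
    }
}
```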

Using stream reference

Basically, you use it exactly the way you would use a “pointer”. Do you want to write a pointer to object X to a stream? Then you look up the “refid” of X and write that “refid” to the stream. Simple.

The real question is when you want to write a pointer, but that is another story.
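Writing a pointer as described above can be sketched as follows. The wire format here is an assumption made only for this example (a refid written as a plain 4-byte int), and PointerWriter is a hypothetical name. The map passed in is expected to be an identity map that already contains the object.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;

final class PointerWriter {
    private final Map<Object, Integer> streamReferenceMap; // expected to be an identity map
    private final DataOutputStream out;

    PointerWriter(Map<Object, Integer> streamReferenceMap, DataOutputStream out) {
        this.streamReferenceMap = streamReferenceMap;
        this.out = out;
    }

    /** Writes a "pointer" to x, that is, only its refid (assumed format: 4 bytes). */
    void writePointer(Object x) {
        Integer refid = streamReferenceMap.get(x);
        if (refid == null) throw new IllegalStateException("object has no refid yet");
        try {
            out.writeInt(refid);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```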

Reading-side map

The above:

  stream_reference_map = new IdentityHashMap<Object, Integer>()

provides an Object → int map. Unfortunately it is just one part of the story, the part used to write pointers to a stream. The other part is what to do with a “refid” we read from a stream.

The reading side needs:

  int → Object 

map. Luckily, if you have chosen an incrementing counter as the “refid” generator and you are fine with at most 2^31-1 objects in a stream, the simple:

  read_refid_map = new Object[...];

will do the job best.

Note: This holds unless you actually plan to get anywhere near 2^31 objects. In that case a more “scattered” structure will better handle growing and shrinking the array during the life of the serialized stream.
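A minimal sketch of such an array-backed reading-side map is shown below. ReadRefidMap, the initial size of 16 and the doubling growth policy are assumptions of this example. It also assumes refids stay far below the 2^31 region, so it does not guard against overflow when computing the new length.

```java
import java.util.Arrays;

final class ReadRefidMap {
    private Object[] readRefidMap = new Object[16];

    /** Remembers the object that arrived under the given refid, growing the array if needed. */
    void put(int refid, Object x) {
        if (refid >= readRefidMap.length) {
            int newLength = Math.max(refid + 1, readRefidMap.length * 2);
            readRefidMap = Arrays.copyOf(readRefidMap, newLength);
        }
        readRefidMap[refid] = x;
    }

    /** Resolves a refid read from the stream; null means the refid is unknown. */
    Object get(int refid) {
        return refid < readRefidMap.length ? readRefidMap[refid] : null;
    }
}
```

With an incrementing counter on the writing side the refids arrive densely packed, so the array wastes almost no space and a lookup is a single index operation.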

Problems

The first problem, which standard serialization does not deal with, is a memory leak. Yes, standard serialization leaks like hell!

Hard-reference+garbage collector==memory leak

The stream_reference_map = new IdentityHashMap<Object, Integer>() used at the writing side uses a standard, plain reference to an Object as a “key” in the map. This has an unfortunate effect: as long as this map exists, the garbage collector will see all contained objects as “reachable” and won’t release them.

Usually this is not a problem, but if you decide, for example, to use serialization for logging in your application, you will get a nasty surprise.

Imagine you instrument your application with logging commands in the following manner:

void woops(int a, int b)
{
  ...
  if (log_level_enabled) log_object_output_stream.writeObject("calling woops(" + a + "," + b + ")");
  ...
}

Each time this code runs, a new string is formed and written to the stream as an object. This means it must have a “refid” assigned, and to have one assigned it must be put into the stream_reference_map. Since the map holds a hard reference, the string will stay there forever. Or, more precisely, until an OutOfMemoryError.

A proper stream_reference_map must hold its references to mapped objects through a WeakReference.
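The JDK has no ready-made weak identity map (java.util.WeakHashMap compares keys with equals()), so one has to be built. The sketch below is one possible way to do it and not the only one: a WeakReference subclass is used as the key of an ordinary HashMap and compares its referents with ==. All class and method names here are hypothetical.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

final class WeakStreamReferenceMap {
    /** Weak key that compares referents by identity, so it can live in an ordinary HashMap. */
    private static final class Key extends WeakReference<Object> {
        private final int hash;   // cached, because the referent may disappear later

        Key(Object referent, ReferenceQueue<Object> queue) {
            super(referent, queue);
            hash = System.identityHashCode(referent);
        }

        @Override public int hashCode() { return hash; }

        @Override public boolean equals(Object other) {
            if (this == other) return true;          // lets a cleared key still find itself
            if (!(other instanceof Key)) return false;
            Object mine = get();
            return mine != null && mine == ((Key) other).get();
        }
    }

    private final Map<Key, Integer> map = new HashMap<>();
    private final ReferenceQueue<Object> collected = new ReferenceQueue<>();

    void put(Object x, int refid) {
        expunge();
        map.put(new Key(x, collected), refid);
    }

    /** Returns the refid of x, or null if x has none. */
    Integer get(Object x) {
        expunge();
        return map.get(new Key(x, null));   // temporary lookup key, not registered with the queue
    }

    /** Drops entries whose objects the garbage collector has already reclaimed. */
    private void expunge() {
        for (Object key; (key = collected.poll()) != null; ) {
            map.remove(key);
        }
    }
}
```

The map itself no longer keeps the logged strings alive: once nothing else references a string, its key is cleared, shows up in the queue, and the entry is removed on the next call.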

Passing garbage collection event

Of course, even if you deal with the above, you will still hit an OutOfMemoryError at the reading side of the stream.

The simplest:

  read_refid_map = new WeakReference<Object>[...];

will not work. A weak reference works at the writing side because, if the only place where an object still exists is the stream_reference_map, then there is no way to write it to the stream again.

At the reading side things are very different. The reading code may pick up a “refid” (and the object) from the stream and drop it right away. The writing side may, however, hold on to the object for a very long time and write it to the stream many times. Of course, to avoid many problems which I will discuss elsewhere, it will prefer to write just the “refid” of that object. If read_refid_map held only WeakReferences, there might no longer be any object to map that “refid” to.

A good “refid” system passes garbage collection events to the reading side.
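One way to picture this is a reading-side map that keeps hard references until the writer explicitly reports that a refid is dead. On the writing side such a report could be produced whenever a cleared weak reference shows up in a java.lang.ref.ReferenceQueue. The “release” message, the class name and the method names below are hypothetical and only illustrate the idea.

```java
import java.util.HashMap;
import java.util.Map;

final class ReleasingReadMap {
    // Hard references on purpose: only the writer knows when a refid can no longer come back.
    private final Map<Integer, Object> readRefidMap = new HashMap<>();

    /** Called when an object arrives together with its freshly assigned refid. */
    void onObject(int refid, Object x) {
        readRefidMap.put(refid, x);
    }

    /** Called when the stream carries a pointer; the object must still be known. */
    Object onPointer(int refid) {
        Object x = readRefidMap.get(refid);
        if (x == null) throw new IllegalStateException("unknown refid " + refid);
        return x;
    }

    /** Called when the writer reports that the object behind refid was garbage collected. */
    void onRelease(int refid) {
        readRefidMap.remove(refid);
    }

    int size() {
        return readRefidMap.size();
    }
}
```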

Roll over

Of course, an int isn’t infinite. Even with proper garbage collection of “refid” values you will still, sooner or later, hit:

   assert(refid_generator+1 > refid_generator )

that is, a “signed wrap-around”. You will run out of possible “refid” values to use.
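The wrap-around itself is easy to reproduce. The small sketch below (RollOver and wouldWrap are illustrative names) evaluates the negation of the assertion above for an ordinary counter value and for the largest int.

```java
final class RollOver {
    /** True when incrementing the counter would wrap around to a negative number. */
    static boolean wouldWrap(int refidGenerator) {
        return refidGenerator + 1 < refidGenerator;
    }

    public static void main(String[] args) {
        System.out.println(wouldWrap(41));                 // false
        System.out.println(wouldWrap(Integer.MAX_VALUE));  // true
        System.out.println(Integer.MAX_VALUE + 1);         // -2147483648
    }
}
```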

This, too, is not addressed in standard serialization. Worse, standard serialization does not use the entire 2^31-1 pool of numbers, so the roll-over happens earlier and produces some commands instead of a “refid”. Fortunately you need a really huge VM to hit this problem, since usually the OutOfMemoryError will appear first.

A good “refid” system re-uses garbage collected refids to avoid roll-overs.
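Re-use can be as simple as a pool of released numbers that is consulted before the counter is touched. The sketch below is a minimal illustration with hypothetical names. It assumes a refid is released only after both the writing and the reading side have stopped using it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class RecyclingRefidGenerator {
    private final Deque<Integer> released = new ArrayDeque<>();  // refids free for re-use
    private int next = 0;                                        // lowest refid never handed out

    /** Prefers a recycled refid; touches the counter only when none is available. */
    int allocate() {
        Integer recycled = released.pollFirst();
        if (recycled != null) return recycled;
        if (next == Integer.MAX_VALUE) throw new IllegalStateException("refid pool exhausted");
        return next++;
    }

    /** Returns a refid to the pool once it is known to be dead. */
    void release(int refid) {
        released.addLast(refid);
    }
}
```

As long as objects are released at roughly the rate they are created, the counter stops growing and the roll-over is never reached.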

Summary

After reading this chapter you should know what the “stream reference identifier” is and how not to design the system which manages it. You should also have noticed that a standard serialization stream cannot exist permanently or be used for large amounts of data produced on demand.

Now you may move on to the following part, in which you will read about how an object is scanned during serialization and what problems that may create.
