Abstract file format: what to use for a “signal”?

In that post I wrote:
(…)
The good abstract API needs to be able to indicate data boundaries and move from boundary to boundary with:

 void writeSignal(signal type)
   signal type readNextSignal()

(…)

I let myself to use the enigmatic signal type.

Now it is a time to dig into it.

What is “signal” used for?

Again, just to remind You: for telling what a certain bunch of data is used for. To give it a name.

How many different “signals” do we need?

At least two… or to be specific – two kinds of signals.

If a signal must give a name for a certain block of data then it must somehow indicate when this block starts and when it ends.

In fact it must be very alike good old C typedef:

typedef struct{
   int a;
   char c;
}Tmy_named_struct

Since to be able to efficiently process a stream of data we need to know what struct means before we start reading it, then the structure in a data stream should rather look like even older Pascal:

 begin Tmy_named_struct
   int: a;
   char: c;
 end

The begin and the name comes first, the end comes last.

The “end” and the “begin”

This means we need an API which will be closer to:

 void writeBegin(signal name)
 void writeEnd()
 ...
 signal type readNextSignal()

Now You can see I used signal name and signal type. We need to define them more closely.

The signal name

Historically speaking, when I started this project, I decided to do:

 void writeBegin(int signal)
 void writeEnd()
 ...
 /*....
  @return positive for a "begin" signal, -1 for an "end" signal. */
 int readNextSignal()

It was very bad idea.

At first few uses I started to get pains to manage what number means what and how to prevent numbers from different libraries to not clash. This was the same problem, although in mini-scale, as global function names clash in C.

So I thought to myself: How did they solve it in C?

With a name-space. You can assign a function name to a name-space and then use a fully qualified name to avoid a name clash. And if then names do clash, You can put a name-space into a name-space and form something like:

  space_a::space_b::function

Please excuse my wrong syntax. I did not use C/C++ for quite a long time and I don’t remember how exactly it looks like nowadays.

So I could use a sequence of int numbers….

Dumb I was, wasn’t I?

The hundred times easier and more efficient is to do:

 
 void setNameLengthLimit(int length);
 void writeBegin(String name)throws ENameTooLong;
 void writeEnd();
 ...
 /*....
  @return either a name for "begin" signal or null for "end" signal
*/
 String readNextSignal()throws ENameTooLong;

We use String for names. Strings are variable in length, easy for humans to understand (this is important if back-end will be something like JSON which ought to be human-readable) and Java has a very efficient support for them.

Note: I did myself let to introduce setNameLenght(...) and throws ENameTooLong. Remember what I have said about OutOfMemoryException attack? The ENameTooLong is there to let You keep flexibility and put a safety breaks on Your stream.

But Strings are slooooow

Sure.

With the int as a signal name and a careful selection of constants like below,
the code which looks like this:

  int name=...
  switch(name)
  {
    case 0: ... break;
    case 1: ... break;
  }

can be compiled on almost any architecture to a “computed goto”:

  mov name → reg0
  cmp reg0 with 1
  jmp_if_greater _after_switch
  shl reg0, times           ; according to DATA width
  mov jump_table[reg0], PC  ;read target from table and jump there
  jump_table:
     DATA case0
     DATA case1
    ...
case0:
   ...
   jmp _after_switch
case1:
   ...
   jmp _after_switch     

where the entire comparison take just about 6 machine instructions regardless of how huge the switch-case block would be.

Prior to JDK8 using Strings was a pain in a behind, because the only syntax You could use was:

  if ("name0".equals(name))
  {
  }else
  if ("name1".equals(name))
  ....

what in worst case scenario ended up in large number of .equals calls.

At a certain moment, and I admit I missed it, JAVA specs enforced the exact method of computing String.hashCode(). Prior to JDK8 it had no special meaning and each JVM/JRE/JDK could in fact provide own implementation of String.hashCode(). There was simply no reason to enforce the use of the standard method.

Since JDK8 such a use did appear.

JAVA now absolutely requires that:

  • regardless of JDK/JRE/JVM String.hashCode() is always computed using the same algorithm. This way compiler may compute hash codes for known const Strings at compile time and be sure that in any environment:
       StringBuilder sb = new StringBuilder();
       sb.append('a');
       sb.append('b');
       sb.append('c');
    
       assert(  1237 == sb.hashCode())
       assert(  1237 == "abc".hashCode())
    

    assertions will not fail.
    Please notice, 1237 is not a correct hash code for that example. I just faked it to show, that compile time constant can be used.

  • second they directly requested that String has something like:
        class String
        {
          boolean hash_valid;
          int hash_cache;
    
          public int hashCode()
          {
              if (hash_valid) return hash_cache;
              hash_cache = computeCache();
              hash_valid = true;
              return hash_cache;
          };
        }

    Notice this is a very simplified code which may fail on some machines in a multi-threaded multi-core environment due to memory buss re-ordering. Some precautions have to be made to ensure that no other thread do see hash_valid==true before it can see hash_cache to be set to computed value. Since String is implemented natively I won’t try to dig into it. It is just worth to mention that volatile would do the job but
    it would be unnecessary expensive. I suppose native code could have found better solution.

    Notice, the race condition on setting up hash_cache is not a problem. Every call to computeCache() will always give the same result as JAVA strings are immutable. At worst we will compute it twice but nothing would break.

  • and third they did require that:
       class String
       {
         public boolean equals(Object o)
         {
            ... null, this, instanceof and etc checked.
            if (o.hashCode()!=this.hashCode()) return false;
            return compareCharByChar(this,(String)o);
         }
       } 
    

    which avoids calling compareCharByChar() unless there is a high probability that it will return true. And, of course, the compareCharByChar will terminate immediately at first not matching character.

In much, much more simple words it means, that since JDK8 it is important that String.hashCode() is environment invariant, cached, and used for quick rejection of not matching strings.

Knowing that they let us use:

  String name=...
   switch(name)
   {
     case "name0":...break;
     case "name1":...break;
   };

which is not implemented as:

  String name=...
   if ("name0".equals(name))
   {
   }else if ("name1".equals(name))
   {
   };

but as a hell more complex and hell more faster:

 String name=..
 switch(name.hashCode())
 {
    case 1243:
         if ("name0".equals(name))
         {
         } else if (... other const names with 1234 hash code.
         break;
   case 3345:
        ...
  }

This can be, in large cases, about 10 times faster than pure if(...)else if(...) solution.

Never less it is still at least 50 times slower that using int directly because it can’t use jump-table and must use look-up table which has a comparison cost linear with the number of cases. Unless very large table is used in which case we can profit from binary search and reduce the cost to log2N cost.

Never less it won’t never ever be even close to 6 machine instructions.

Then why String?

Because with int I really could not find a good, easy to maintain method of combining constants from different libraries wrote by different teams into a non-clashing set. I could figure out how to avoid clashing with static initialization blocks, but then such constant are not true compile-time constants and can’t be used in switch-case block.

Strings are easy to maintain and clashes are rare due to the usually long, human friendly names.

In other words – use int if speed is everything and difference between 6 and 600 machine cycles do mean everything for You while continuous patching of libraries in struggle to remove code clashes is not a problem.

Use String if development costs and code maintenance are Your limit.

And even if You are, like me, the “speed freak” please remember that we are talking about file or I/O formats. Will 6 versus 600 cycles matter when put aside of the time of loading data from a hard-disk or network connection?

I don’t think so.

But Strings are so laaaarge…

Yes they are.

And human readable.

Using non human readable numeric names for XML or JSON back-end makes using XML or JSON pointless. The:

 <1>3</1>

is equally readable to human as a binary block.

If however size is Your concern, and in case of a binary back-end it will be, You can always create a “name registry” or a “name cache” and stuff into Your stream something like:

with String names with name registry
begin “joan” assign “joan” to 0
begin 0
end end
begin “joan” begin 0
end end

and get a nice, tight stream with short numeric names and at the same time nice, maintenance friendly String names.

Note: this table is a bit of simplification. In reality the capacity of name cache will be limited and the capacity of numeric names will also be restricted. Most possibly You will need assign, begin-with-numeric-name, begin-with-string-name commands… but this is an another story.

Summary

After reading this blog entry You should now know that the “signal” can in fact be a “begin” accompanied with a “name” or just a plain nameless “end”. You should also know that there is no reason to fret about not using String for “begin” name and how to deal with performance and size issues related with String used in that context.

What next? See there.

Leave a comment