Playing with AI text generators.

This blog entry will be short. Very short.

Addiction

First thing I have to say: it is addictive. Really. An AI is as close as you can get today to a smart, living NPC (Non-Player-Character). I discovered the https://perchance.org/ai-story-generator and, well, fell a bit head over heels for it.

At first it was amazing. Then, after some time, it started to be a bit disappointing. And at last I got the feeling that AI is dumb.

Really dumb.

AI is a marketing name

In reality it isn’t an “intelligence”, it is a huge word-mashing model of a human language. And when we say “huge” we do mean it. The one which is smart enough to sound reasonable uses 7×10⁹ numbers to store its “model” of all the sentences it has seen. Those which are more capable use from 13×10⁹ up to 120×10⁹ numbers.

This kind of intellect has nothing to do with reasoning. Zero. None, null. The reason behind that is very simple.

It looks like it is thinking, but in reality it is just finding the most probable continuation of the sequences of numbers it was shown. And that is all.

The problem with AI is that when the guys from marketing saw a machine able to answer questions stated in human language, they thought it was “thinking”. Piling one misunderstanding onto another, they decided that “large language model” is not a well-selling name and that “Artificial Intelligence” would sell better.

At first it was. Now… well…

Why are AI text models doomed to be dumb?

Let us start with a digression.

Imagine you are sitting in a small room. There are no windows; nothing can get in from the outside. Now imagine that outside this box there is a civilization of creatures which have a sense of smell ten thousand times better than a dog’s, have amoeba-like bodies and live in the ground, seeping through gaps between grains of sand.

Then, some day, one wall of your box lights up, showing you rows of black and white dots. On the opposite wall a number of buttons appear. What will you do? Well, probably nothing. But if you just sit there, the box will start shocking you with painful electric shocks. So you will start pressing buttons. Some sequences will stop the shocks, some will make the box give you ice cream (or whatever you find pleasurable), some will shock you almost to death.

Why am I talking such nonsense?

Because this is what an AI’s “life” looks like.

No common experience

The “large language model” is presented with “tokens”, and a “token” is just a number assigned to a sequence of characters. This gives the first limitation. The “language model” will have a hell of a lot of problems understanding what six-on-nine ( 69 ) means in terms of erotica. This idiom is hard to consider reasonable until you can actually see the shape of the number 6 and the shape of the number 9. But to really guess the meaning, you need to be able to see what a human looks like. Only then can the circle in 6 and 9 be imagined as a human head.

The “language model” is just shown the numbers, not shapes. It will also have trouble matching words which sound alike. In languages using a phonetic alphabet the look of a word also carries its sound. In languages like Chinese the sound is not directly stored in a symbol, so there we can have words which sound alike but look different, and words which sound different but look alike.

This shows the basic problem: our written languages are bound to our hearing and sight.

So if a “language model” can neither see nor hear, it will have a hard time dealing with obvious things.

Lack of reasoning?

The AIs, at least those I had a chance to experiment with, quickly fall into a “most probable sentence” trap. For example:

Martha has no sense of taste. Her taste buds were non-existing.

Martha was eating ice-cream. Her sister, as a joke, salted the ice-cream.

“How do you find them, Martha?” asked her sister.
AI continuation starts here:
Martha shrugged, her lips pursed around the spoon. “They’re… not bad,” she said, her words carefully enunciated. “They’re not as bad as I thought they’d be.” She took another bite, chewing slowly, and then swallowed. “They’re actually kind of salty,” she admitted, “but I can still taste the ice cream underneath.”

This is because the logic connecting “taste buds” with “salty” is located outside the language, inside our physical experience. Regardless of how strong the intellect is, unless it can actually touch that logic it will never be able to reason correctly. There is, however, a strong probabilistic chain salted→eating→question about how do you find your food→tastes salty in most written texts.

Likewise, the so-called “AI based code generation tools” are a real pain in the behind. At best they are as good as a “code template lookup engine”, which every company having its own code base can create at near-zero cost with Omega-Search, Xapian, Lucene + Tika or just a standard “template catalog”. And they will never be better unless they are able to use compilers, debuggers, GUIs and other tools to actually test the code they write.

Cost

Once I got a bit addicted to perchance I decided to try to make it say really dirty stuff ;). Yes, I am male and males are, well…

Of course on-line services are out of the question in such a case. Privacy is important. Especially since in some countries saying certain things is illegal.

I had to get myself a decent PC able to run AI locally. Gladly my old PC was getting close to end-of-life status (15 years), so I decided to get myself a new one. I chose an AMD Ryzen 9 7900X (12 symmetric cores) and maxed out its memory capacity with a whopping 128GB of DDR5 3600 RAM. No dedicated GPU (graphics card), just the one inside the Ryzen 9.

Trying to run a full 70B (70×10⁹) model of course failed. Out of memory. You need about 70*2=140GB of RAM to even think about starting it, as it uses a 16-bit floating point number per weight. One may down-sample it to a less accurate form called 70BQ8, using 8 bits per number. This form can be run on this machine.
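If you like to see where those numbers come from, here is the back-of-the-envelope arithmetic (my own sketch, using round decimal units, ignoring context buffers and runtime overhead):

```java
// Rough RAM needed just to hold the weights of a model,
// at various bytes-per-weight (fp16 = 2, Q8 = 1, Q4 = 0.5).
public class ModelMemory
{
    public static long bytesForWeights(long parameters, double bytesPerWeight)
    {
        return (long)(parameters * bytesPerWeight);
    }

    public static void main(String[] args)
    {
        long params70B = 70L * 1_000_000_000L;
        // fp16: 140 GB, hopeless on a 128 GB machine
        System.out.println(bytesForWeights(params70B, 2.0) / 1_000_000_000L + " GB");
        // Q8: ~70 GB, fits
        System.out.println(bytesForWeights(params70B, 1.0) / 1_000_000_000L + " GB");
        // Q4: ~35 GB, fits comfortably
        System.out.println(bytesForWeights(params70B, 0.5) / 1_000_000_000L + " GB");
    }
}
```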

No GPU?

You may ask me now why I didn’t get myself a GPU. You might have heard that a GPU is a must for AI.

The answer is simple: money. Those 128GB cost me half the price of an entry-level 4GB GPU. And if you look at the necessary memory sizes, GPUs with 8GB of RAM are useless (128GB of on-board RAM is about 1/4 of the 8GB GPU price). To be able to “talk” with a reasonably smart AI you need at least a 13B model, for which 16GB will be okay. But if you would like to train it or use smarter models, well… For training LLAMA2 70B you need, they say, about 360GB of RAM.

The largest GPU I have found has 48GB of RAM and costs about three times what this PC cost me.

AI on bare metal

So no AI for me? Is a GPU a must?

Well… it is not.

In fact a 4-core CPU is enough to use it.

The machine I got can run 70BQ8/Q4 models and generate about 1 word per second, and all cores stay ice cold.

Note: a 7B model runs as fast as you can read, and a 13B model is okay too. Perchance is running on a GPU, possibly a 13B model (I am not sure, but I don’t think a private person would invest in a GPU powerful enough to run 70B), and it can just spit out tens of words per second.

The CPU cores are assigned to the job, but they can do little. The 1 word per second is exactly the limit of DDR5 RAM throughput. I ran some tests and my system shows about 40…60GB/s on-board transfers. Considering the fact that an AI is just a neural network, and a neural network is a f*ing huge matrix, for each step it must read that matrix at least once. And this is what takes time. It really doesn’t matter if we have the AVX2 / AVX512 instruction set, vectorization or whatever. The CPU will sit and wait patiently until the on-board RAM fills the cache.
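A simple estimate (mine, with round numbers, not a benchmark of any particular setup) shows why the word rate lands where it does: every generated token requires streaming roughly the whole weight matrix through the memory bus at least once.

```java
// Upper bound on token rate of a memory-bandwidth-limited model:
// tokens/s <= bandwidth / bytes read per token (~ the model size).
public class TokenRate
{
    public static double tokensPerSecond(double bandwidthGBps, double modelSizeGB)
    {
        return bandwidthGBps / modelSizeGB;
    }

    public static void main(String[] args)
    {
        // ~50 GB/s of measured DDR5 throughput vs ~70 GB of Q8 weights
        System.out.println(tokensPerSecond(50, 70)); // ~0.7 tokens/s, the observed "1 word/s" ballpark
        // the same bus feeding a ~7 GB (7B Q8) model
        System.out.println(tokensPerSecond(50, 7));  // ~7 tokens/s, reading speed
    }
}
```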

I ran some experiments and noticed that assigning more than 4 threads to the AI doesn’t raise its speed at all. Using all 12 cores or just 4 gives practically the same result. Simply put, 4 cores can consume all the on-board RAM bandwidth without any problem.

The Ryzen 9 has 12 cores but can handle up to 24 threads. The idea of having more threads than cores normally works well because threads make use of different functional units of the cores, so there is a chance that two of them can run in a relatively “parallel” manner. In the case of AI computation it is not so. The size of the llama.cpp executable is about 1.4MB. Yes, megabytes. I can bet that the actual computation core is less than 1kB, so it will fit within the Level 1 cache of a Ryzen 9 core without any problem. So in the case of AI all threads are doing exactly the same thing. I observed that allowing the AI to use all 24 available “virtual cores” actually slows it down.

Python…

Most AI tooling uses Python. Well… it doesn’t work well without a GPU. llama.cpp is written in C++ and runs about two to three times faster on bare metal than the Python version.

Of course, if you off-load everything to the GPU, Python is fine… because it is not Python that runs the computation.

Training on bare metal?

I failed to do it. Not because it is impossible, but because of rather sketchy docs. The only example I was able to run failed to learn anything after 18 hours of work. Possibly because it needs a few thousand steps while each step takes about 15 minutes on bare metal.

Gladly there is a good alternative to training or fine-tuning. Get yourself a model with a large context. You can find 32k-context models with ease and they will run well on bare metal.

Running on built-in GPU?

I tried it. And in fact it is hard to persuade llama.cpp not to use it. If it is compiled with GPU support it will use the GPU for initial processing regardless of whether you tell it to or not.

The good side of the Ryzen 9 built-in GPU is that you may assign to it as much RAM as you like. The bad side is that, when tested with AI, the GPU performance equals 1/2 of a single CPU core. This is why ROCm (the official AMD GPU computation library) doesn’t list the built-in GPU as a supported device, even though it can be persuaded to support it.

Use AI for your company?

No. You can use it as a first-contact chat bot, but you need to be very, very careful. In fact you need to pass the answers of your AI through a secondary AI. The first AI replies to customer requests, while the second-level AI checks whether the answer is “legally safe”, that is, doesn’t promise anything, isn’t rude and isn’t simply untrue.

And even then your clients will be really pissed off.

So maybe use it to learn company standards? A kind of “smart wiki”?

My recommendation is flat: No.

You will need a hell of a lot of GPU power to make an AI learn your standards. But standards aren’t cast in stone, so they will change. The dispersed nature of AI makes it hard to “un-learn” old versions. So if you just fine-tune it each time the standard changes, you will end up with a lying mess of a fake intellect.

If you compare it with Lucene+Tika or OmegaSearch+Xapian… well… My machine could index about ~10GB of text within less than a few minutes and search for a phrase in under a second. A 1GHz/1GB machine can do the same within about an hour for indexing and 1 second for searching. And it can easily be made to forget old standards. By the way, the total size of the Xapian index (a searchable database) for this amount of text data is about 30MB. Yes, 30 megabytes for about 10 gigabytes of input data. This should show you how oversized AI is. And just for your information, this entire database can fit in the Ryzen 9 7900X L3 cache (the Ryzen 9 7900X has 64MB of L3 cache). This is why it can be hellishly fast.

So considering the costs of running, updating, ensuring response validity etc… No. Don’t use AI.

Note: the only worthy use of AI in this case is “assisted search”, where the AI turns a user query, presented in human language, into a document database lookup and provides a short summary of the found fragments. This may be a valid use, since it doesn’t need any training. It shouldn’t, however, be exposed to your clients, since it still can be made to say really dangerous things (like, for example, a promise to sell a new car for 1$).

Summary

Did I say it will be a short post? Well… I lied again.

In short:

  • AI is not “intellect”;
  • it can be run quite well on a CPU only;
  • it is “memory constrained”, so memory size and speed are what limit it;
  • 128GB of RAM is not an unreasonable size; in fact it was too small for many AI-related tasks;
  • if you can, avoid it. A search engine is more practical.

Thanks for reading.

Coding “by contract”

…and I don’t mean “outsourcing”.

I am going to talk about a certain technique of writing programs which I have found very useful over all my years.

Your code

For the purpose of this discussion I assume that You are not writing Your application as one huge, flat block of thousands of lines. Instead I assume that You divided it into blocks which have to do “something”.

I intentionally double-quoted the “something” because at that stage You most probably have only a vague idea of what it should do. For example, let’s say: provide a daily log in the non-volatile memory of Your embedded system.

Getting what You do know

Of course the first thing You have to do is to gather some knowledge about what You are certain of and to what degree. For example, in this case You know that:

  1. You need to log some events.
  2. Events are some data with a time-stamp.
  3. You will have on board some non-volatile memory of limited capacity.

Getting what You do not know

From what I wrote above it should be clear that You are neither sure what kind of non-volatile memory You will have nor what exactly the events are.

Divide and rule

Let us make some graphics and take a look at what You know and what You don’t know:

Coding by contract. Implementation, contract and suppliers.

As You can see, to do Your job You need to ask Yourself three questions:

  1. How will my code be used in an application?
  2. What services will my code need?
  3. How to implement what it offers with services it needs?

Contract

The “contract” is a portion of code which tells how it will be used in the application.

That is the API.

If You are lucky and are using some high level programming language like Java or C++ then object oriented programming will be helping You since it has dedicated tools for API definition. In Java they are interfaces in C++ pure virtual classes. Since I do love Java, I will stick with it in this blog entry.

Of course if You use C or even assembler You are not left alone. You will see why later.

So let us try the API:

public interface ILog
{
  public void addLogElement(IEvent event);
}
public interface IEvent
{
  public Date getDate();
  public byte [] getBinaryImage();
}

Nothing fancy, is it?

Except that we are not done yet. This is a very, very sketchy API. Because it has no comments.

The importance of comments

The API will serve a dual purpose:

  1. It will tell Your colleagues how to use the code.
  2. It will tell You what exactly You have to implement.

Any lack of precision or accuracy will make Your job hard. First, because Your users will have to guess, or discover what exactly is done by trial and error. Or, what is even worse, by looking at the implementing code.

This is bad. Very bad. Not only because it may create misunderstandings, but also because it will cast Your implementation in stone. Since nobody will know what You meant to do, they will assume that You did what was meant to be done. Which, of course, won’t be the case, because You are just a human and Your code will be full of mistakes.

The next problem with imprecise comments is that, since the exact details of what the code does are not clear, they need to be discovered by trying it out. And You can’t try out something that doesn’t exist. In effect Your users will have to wait till You complete Your job. Their work and Yours cannot be done in parallel. Including testing, which You can’t delegate to others.

If this job cannot be done in parallel then, of course, the idea behind Your API is not tested before You polish it. This may produce expensive re-runs of Your entire work. If, however, Your API is clearly specified, then Your users may try to write their code around it and discover holes in it. All before You actually invest a lot of work in it.

In fact those comments are a kind of formal specification.

So let us add a bit of accuracy, just to show You an example.

/** An event to be put into a log.
 Instances of this class are immutable. */
public interface IEvent
{
  /** Date at which it was created.
  @return never null, life time constant */
  public Date getDate();
  /** Transforms an event to a binary form for persistent storage.
  @return never null, may be empty. Each call returns newly allocated
   array. The binary form doesn't contain {@link #getDate}. The maximum
   size of this array may not exceed 64 bytes. */
  public byte [] getBinaryImage();
}

Notice that now many facts became clearer. For example, we now know that IEvent behaves like String: once created it doesn’t change. Thus it is also thread safe, fast, and doesn’t need any synchronization code when passed from thread to thread. We also know that we don’t have to care about who owns it, as nobody may change it.

Note: in C/C++ “immutable” doesn’t mean “I don’t care who the owner is”. If You pass anything by pointer or by reference You absolutely must specify whose job it is to delete the object, whether that object lives on the local variable stack or on the heap, etc. This is why I prefer Java. Drop it and don’t care, the garbage collector will do the job for You. And, by the way, “smart pointers” are not a silver bullet. They leak like hell.

Then we know that the returned byte [] array is not just some array, but a kind of serialized form of an object. We also know that we may modify it if we need to, because it is always provided anew, and that it all must fit in 64 bytes.

We also know that we don’t know how to turn that byte array back into an event. Should we add something to the API? Or should we not? Notice that until we wrote those comments we were not aware of it. But now, when we had to think about turning an object into bits and bytes, it occurred to us that we need to know how to turn it back. That is, of course, if we are ever going to do it.

Let us decide not to extend the API, so as not to make the example too messy.
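Just as an illustration of how such comments translate into code, here is one possible implementation of the commented IEvent (the class name and the choice to store the time as a long are mine, not part of the contract):

```java
import java.util.Date;

// The contract from above, repeated so the example is self-contained.
interface IEvent
{
  public Date getDate();
  public byte [] getBinaryImage();
}

/** An immutable event: state is fixed at construction time and
    defensive copies are made on the way in and on the way out. */
public class CTimestampedEvent implements IEvent
{
  private final long created_at;       // immutable stand-in for the Date
  private final byte [] binary_image;

  public CTimestampedEvent(byte [] binary_image)
  {
    if (binary_image.length > 64)
        throw new IllegalArgumentException("binary image may not exceed 64 bytes");
    this.created_at = System.currentTimeMillis();
    this.binary_image = binary_image.clone(); // the caller can't mutate our copy
  }

  /** Date itself is mutable in Java, so a fresh instance is returned
      each time to keep the "life time constant" promise honest. */
  @Override public Date getDate(){ return new Date(created_at); }

  /** Each call returns a newly allocated array, as the comment demands. */
  @Override public byte [] getBinaryImage(){ return binary_image.clone(); }
}
```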

Supplier

Already having the IEvent, we may sketch in our minds how to implement the ILog.

Hmm…. but we do not know what kind of non-volatile memory do we have!

This is the moment when we need to define the “supplier”.

A supplier is a set of services our implementation will need to provide users with the API.

So let us define a “supplier” as a contract, as follows:

/** A non-volatile memory supplier for logging purposes */
public interface INonVolatileMemory
{
  /** Writes record to non-volatile memory. 
  @param time_stamp non null, should be stored in memory with down to 1 second accuracy.
  @param record non null, up to 64 bytes. The content of this array is valid only during
         the duration of this call and may be changed after this call returns
  @return false if failed to fit record in memory */
  public boolean writeRecord(Date time_stamp, byte [] record);
}

You may now see that with that simple contract we decided how much of the actual job we moved to this contract and what is left in ILog. Notice that we also decided that when the memory is full, then it is full and no longer capable of registering more events. Is it wise? Should it be like that? Thanks to commenting we have a chance to think about it before even a single line of code is written.

Implementation

We can now do the implementation using a pattern like this:

public class CLog implements ILog
{
       ....
   public CLog(INonVolatileMemory nv_memory){....
}

This kind of implementation is a “plugable” one. I call it that because we “plug” the supplier into it and use that supplier to provide the “contract”. The implementation doesn’t really care what the “supplier” actually is.
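Fleshing out that sketch into something compilable (the decision to throw when the memory is full is mine, purely for illustration, and the contracts are repeated to keep the example self-contained):

```java
import java.util.Date;

interface IEvent
{
  public Date getDate();
  public byte [] getBinaryImage();
}
interface ILog
{
  public void addLogElement(IEvent event);
}
interface INonVolatileMemory
{
  public boolean writeRecord(Date time_stamp, byte [] record);
}

/** The "plugable" implementation: CLog knows nothing about the actual
    storage, it only talks to whatever supplier it was given. */
public class CLog implements ILog
{
  private final INonVolatileMemory nv_memory;

  public CLog(INonVolatileMemory nv_memory)
  {
    this.nv_memory = nv_memory;
  }

  @Override public void addLogElement(IEvent event)
  {
    // Per the supplier contract the record array is only valid during
    // the call, which is fine: we hand it over and forget it.
    if (!nv_memory.writeRecord(event.getDate(), event.getBinaryImage()))
        throw new IllegalStateException("log memory is full");
  }
}
```

Note how the supplier is just one method; a tester can hand CLog a fake in-memory supplier and check what was written.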

Interfaces suck!

…and pure virtual classes too.

They are slow. An interface invocation, if compiled as such without any call-site optimization, is an order of magnitude slower than a normal virtual method invocation. Not to mention private static ones, which are the fastest possible in Java. In C++ it won’t be slow only if the pure virtual class is the sole base of the implementing class. Which it won’t be in 99% of cases.

So maybe there is a different way?

Sure there is.

These are the so-called “abstract classes”.

“Abstract” instead of “interface”

The “abstract” class is the actual implementing class from which we cut out the portions responsible for the “supplier”.

Like this:

public abstract class ALog implements ILog
{
       ....
   protected abstract boolean writeRecord(Date time_stamp, byte [] record);
   public void addLogElement(IEvent event){....
}

It is up to our chosen style whether we leave "implements ILog" in or not. There is no cost in leaving it here, since uses like:

 ALog x = new ...
 x.addLogElement(...

won’t be using expensive interface invocations.

I strongly recommend You leave the ILog in place, because without it, it will be hard for Your users to start coding before You finish Your ALog class.

Of course the next step is to actually implement the non-volatile contract, like this:

public class CNvRAMLog extends ALog
{
  public CNvRAMLog(.....
}
public class CFileLog extends ALog
{
  public CFileLog(.....
}

You have to do it the same way for each kind of non-volatile memory service You will use.
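To see the “abstract” variant end to end, here is a toy concrete class backed by a plain in-memory list standing in for real non-volatile hardware (CMemoryLog and its capacity handling are my invention for the example; the contracts are repeated so it compiles on its own):

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

interface IEvent
{
  public Date getDate();
  public byte [] getBinaryImage();
}
interface ILog
{
  public void addLogElement(IEvent event);
}

// The "supplier" part is cut out of ALog as an abstract method.
abstract class ALog implements ILog
{
  protected abstract boolean writeRecord(Date time_stamp, byte [] record);

  @Override public void addLogElement(IEvent event)
  {
    if (!writeRecord(event.getDate(), event.getBinaryImage()))
        throw new IllegalStateException("log memory is full");
  }
}

/** A concrete log: full means full, exactly as the contract decided. */
public class CMemoryLog extends ALog
{
  private final List<byte[]> records = new ArrayList<>();
  private final int capacity;

  public CMemoryLog(int capacity){ this.capacity = capacity; }

  @Override protected boolean writeRecord(Date time_stamp, byte [] record)
  {
    if (records.size() >= capacity) return false;
    records.add(record.clone());
    return true;
  }

  public int size(){ return records.size(); }
}
```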

Downsides of “abstract”

The “abstract” approach is faster and usually simpler to code and understand than the “plugable” one. The downside is that once You decide to make it concrete (i.e. in CNvRAMLog) You can’t easily add some functionality which is independent of the nv-memory implementation but is not common to every implementation.

For example, we may decide to let some implementations do some accounting:

public interface IAccountingLog extends ILog
{
  public int freeRecordsLeft();
}

With “abstract” we must either add this to ALog, which is the smallest common denominator, or add the implementing code to each extending class which needs it:

public class CNvRAMLog extends ALog implements IAccountingLog
{
  public CNvRAMLog(.....
  @Override public int freeRecordsLeft(){...
}

With “plugable” we can:

public class CAccountingLog extends CLog implements IAccountingLog
{
       ....
  public CAccountingLog(INonVolatileMemory nv_memory){....
  @Override public int freeRecordsLeft(){...
}

and have the service added in all the places we like, regardless of any non-volatile memory provider we add in the future.

Note: of course this example is a bit twisted and You may easily point out that it isn’t exactly a good one. And You will be right. If You really need to see the difference between “abstract” and “contract”, take a look at java.io.InputStream, which is abstract, and java.util.Iterator, which is a contract. Then try to make Your existing class be an InputStream and be an Iterator. You can do the second with ease, while the first needs much more head scratching.

Testing

The “supplier” and “contract” allow easy testing. You may imagine that Your users can test their code by supplying a fake (mock-up) implementation of ILog. Likewise, Your tester may write tests against the ILog contract and dry-run them on some fake implementation. Even more, the tester may actually test Your CLog over a fake INonVolatileMemory.

The possibilities are endless and easy.

Note: there is of course an alternative. These are “mocks”. Like Mockito which, by sophisticated manipulation of JVM hooks, may “fake” existing code and turn it into something else on the fly. Likewise there are similar libraries for C/C++. But in my opinion they are hacks. It is much easier and cleaner to use the contract/supplier pattern than to play with “mocking”.

Bridging & intercepting

Of course, once You have an interface for each “contract” at each level, nothing prevents You from, for example:

  • Implementing some contract over another contract. For example, You may implement our ILog directly over Java serialization and completely ignore the non-volatile memory supplier. This is a kind of “bridge” from one realm to another.
  • Providing a compatibility layer. For example, when at a certain moment You decide that ILog wasn’t the best idea, You may create ILogExt, which will be totally different, and implement ILog over it. This way You may not only delay re-writing any code depending on ILog and save money, but also re-use all the testing code for ILog to test, to some extent, the new ILogExt.
  • Creating a “wrapper” which will intercept all calls to ILog and pass them down to some other implementation, doing something in between. For example, it may redirect events not just to one log, but to a primary one and a backup one. All without Your user code knowing about it.
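The last bullet, the intercepting “wrapper”, can be sketched like this (CTeeLog is my name for it; the contracts are repeated so the sketch compiles on its own):

```java
import java.util.Date;

interface IEvent
{
  public Date getDate();
  public byte [] getBinaryImage();
}
interface ILog
{
  public void addLogElement(IEvent event);
}

/** A wrapper duplicating every event into a primary and a backup log.
    User code holds just an ILog and knows nothing about the duplication. */
public class CTeeLog implements ILog
{
  private final ILog primary;
  private final ILog backup;

  public CTeeLog(ILog primary, ILog backup)
  {
    this.primary = primary;
    this.backup = backup;
  }

  @Override public void addLogElement(IEvent event)
  {
    primary.addLogElement(event);
    backup.addLogElement(event);
  }
}
```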

Summary

After reading this blog entry You should know what a “contract” is and how to implement it with “suppliers”. You should know why it is worth doing and what patterns to follow.

And finally, that the API is neither the “interface” nor the “virtual class” nor a bunch of functions, but the comments. Yes, the comments.

Yes, the comments.

Yes, the comments.

Yes, the comments.

Did I make myself clear?

The comments.

Java serialization: what are You serializing Your data for?

I think that after this series of posts about Java serialization we should talk a bit about things which are not present in the standard serialization mechanism.

Once upon a time I wrote a GUI application. Of course, Swing based. JavaFX wasn’t there at that time; in fact it collapsed to “abandon-ware” status for a while. Plus I am not a big fan of it….

But back to serialization.

Of course I created some very specific JComponent GUI components which were responsible for the job. It was basically a scientific dynamic data charting engine, so those components showed some dynamically changing data. The visual effect depended on two sets of settings: one bound to the data themselves, which of course was serialized through the data serialization mechanism, and a second set which was purely visual. Like, for example, what font size to use in a table, etc. Clearly presentation settings having nothing in common with the data.

So I was thinking: why not serialize those GUI components?

The standard serialization is for…

I tried and failed. Miserably. My first and simplest approach, “take it and serialize it to disk”, was conceptually wrong. Let me tell You why.

After trying and reading the available sources, I realized that the standard serialization was meant for “hot serialization” of living objects. While I wished to do a “cold serialization” to disk.

“Hot” serialization

Initially the Java GUI, as a part of the Sun enterprise ecosystem, was meant (I believe it was, I never worked for them) to run in a philosophy very much like their X-Server concept.

In the X-Server model the body of a program runs on a “server” machine, while all GUI-related commands, like drawing on a screen, handling mouse and keyboard, etc., are passed over a network to the “client” machine. It is easy to imagine how much the network throughput is stressed by this approach.

Thus, year after year, as the power of an average “client” machine grew, the X-Server protocol kept trying to move more and more to the client side. This is only natural. Consider how much processing power it takes to draw TrueType fonts in Your LibreOffice Writer and compare it with the power required to manipulate the UTF-8 characters in memory which actually represent the document data. It is clear that GUI-rich applications consume 99% of their power on the GUI, so it is natural to move this consumption as close to the end user as possible.

But how much can You move if You can’t move code? You can only grow the command set, grow caches and the like, but each user action must still be passed to the “server”.

Note: remember, “server” and “client” are using different CPU architectures. Servers were Sparc or MIPS, clients were x86 or PowerPC. So no, You can’t pass binary code.

With the introduction of the “Java Virtual Machine”, passing code became possible. Now it was possible not only to send commands to draw on the screen, but to pass the bunch of class files responsible for the GUI and run them on the “client”. Of course it should be as transparent as possible. The server side should be able to build the GUI as if it ran locally, wrap it in RPC wrappers and pass it to the remote client. A client should just run it in the context of its own JVM and pass objects back only when necessary.

A part of this process was, I believe, to be handled by the standard serialization protocol.

What does it mean for us?

If You inspect Swing serialization You will notice two things:

  • first, the warning in documentation (JDK 11) that: “(…)Serialized objects of this class will not be compatible with future Swing releases. The current serialization support is appropriate for short term storage or RMI between applications running the same version of Swing.(…)”
  • second, that what is serialized includes all listeners, the whole parent-child tree up and down, and practically everything. From my experience an attempt to serialize any JComponent serializes the entire application.

This is because of “(…)The current serialization support is appropriate for (…) RMI between applications(…)“.

In simpler words: for transferring objects which have to be alive at the other end of a stream and actively exchange data with the source end of the stream.

Note: I am ignoring for now the part about “(…)support for long term storage of all JavaBeans™ has been added to the java.beans package. Please see XMLEncoder.(…)” which You may find in the JavaDocs. This is intentional, because that mechanism is far, far away from what generic serialization needs.

“Cold” serialization

“Cold” serialization is when You remove the object from its environment, stop it, and move it into “dead” binary storage. Then, possibly next year at the other end of the world, You remove it from storage, put it in another (but compatible, of course) environment and revive it.

The reviving process will require binding the object to the new environment, but that is not a problem.

Example: Serial port data logger

Now imagine You wrote a data-logger application for a hardware device which is sending data to a PC through a serial port connection.

You have a class there which both keeps and tracks data from the serial port. Let’s say it is fixed to COM1.

What would the “hot” and “cold” serialization look like?

“Hot” example

A “hot” serialization of this class will basically need to pass the already stored data to another machine and allow control of the logger device from that machine. It means that during serialization it must create a “pipe” over the network from the remote machine to the machine to which the hardware device is attached.

“Cold” example

A “cold” serialization of this class should save the stored data on a disk, and do it in such a way that when it is de-serialized it will connect itself to the said serial port and continue the data logging. It means that it must save information about which port to connect to, and create this connection when de-serialized.

Multi purpose

It is clearly seen that if I tried to serialize this logger using standard serialization I would have to decide on either the “hot” or the “cold” method. I can’t do both.

But I do need both methods!

Standard Java serialization is single-purpose.

Hey, we have writeReplace()/readResolve()!

Yes, we have them. Except that we have either writeReplace() or readResolve(). We can’t use both at the same moment, but let me be silent about that for now.

What are those two?

They are a “patch” enabling multi-purpose serialization. Quite a good one, which will work in the above example case, but not in every case.

We may easily imagine that our “hot” and “cold” serialization can be done by:

  class LocalLogger ...
  class HotLogger ...
  class ColdLogger ...

   HotLogger toHot(LocalLogger ...)
   ColdLogger toCold(LocalLogger ...)

That is, we provide an ability to construct “hot” and “cold” variants from the “local” variant. Both “cold” and “hot” are different objects, but such that when serialized by the standard mechanism they do what we need. Now if we are writing an application which needs “hot” serialization then, instead of LocalLogger, we use a class like this:

 class LocalLogger_Hot extends LocalLogger
 {
    private Object writeReplace(){ return toHot(this); };
 }

The standard serialization mechanism will notice it and invoke the writeReplace method before it starts to serialize any instance of LocalLogger_Hot. Thus the remote side will see HotLogger in every place where a reference to LocalLogger_Hot was serialized.
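This substitution is easy to observe with standard serialization directly. A minimal, self-contained sketch; the logger classes below are simplified stand-ins I made up for the article's LocalLogger/HotLogger, not real library classes:

```java
import java.io.*;

// Stand-in for the "hot" stream form; a real one would open a network pipe.
class HotLogger implements Serializable {
    String endpoint;
    HotLogger(String endpoint) { this.endpoint = endpoint; }
}

// Stand-in for LocalLogger_Hot: substitutes itself on serialization.
class LocalLoggerHot implements Serializable {
    String port = "COM1";
    private Object writeReplace() { return new HotLogger("tcp://host/" + port); }
}

public class WriteReplaceDemo {
    // Round-trips a LocalLoggerHot and reports the class seen on the "remote" side.
    static String roundTripClass() {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(new LocalLoggerHot());
            }
            try (ObjectInputStream ois = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray()))) {
                return ois.readObject().getClass().getSimpleName();
            }
        } catch (Exception e) {
            return e.toString();
        }
    }

    public static void main(String[] a) {
        System.out.println(roundTripClass()); // prints "HotLogger"
    }
}
```

The reading side never sees a LocalLoggerHot at all, exactly as described above.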

We may also mirror the thing and decide that LocalLogger will serialize the information necessary for both creating a hot link and a local port connection, and that it is up to the reading application to act according to its needs. For that the remote side must use a different source for LocalLogger:

  class LocalLogger
  {
     Object readResolve(){ return toHot(this); }
  }

The de-serialization engine will notice this method and invoke readResolve after the LocalLogger was de-serialized. From that moment on it will use the returned object instead of the original, which is achieved by modifying the
“stream_reference_map” (see there).

Note: the underscores are intentional.

When does it not work?

So, having my special GUI components which by default are “hot” serialized, and needing to turn them into “cold” serialized ones, I added:

class MyGUI extends JComponent
{
    static class MyColdStorage
    {
        private Object readResolve(){ return new MyGUI(this); };
        ....
    }
    ...
    private Object writeReplace(){ return new MyColdStorage(this); };
}

Basically the idea is that when standard serialization serializes an instance of MyGUI, it will transform it into its “cold form”, MyColdStorage. Then, whenever it de-serializes a MyColdStorage, it will transform it back into MyGUI.

Nice, plain and simple, isn’t it?

Except it doesn’t work.

Cyclic data structures

The GUI is a heavily recursive and cyclic data structure. Each JComponent keeps a list of child GUI components (i.e. a panel keeps a list of contained buttons). And each child component keeps a reference to its parent component (i.e. a label must know its enclosing panel to tell it that the size of the label changed, so that the panel can recompute the layout of its children).

For simplicity let us define it like:

 class JComponent
 {
     private JComponent parent;
     private JComponent child;
 }

If You consider this post You will notice that such a structure will be serialized like this:

   JComponent Parent =...
   JComponent Child  =..
  serialize(Parent)
   ... →
      write new JComponent (Parent) //(A)
       write Parent.parent = null
        write new JComponent (Child)
             write Child.parent= refid(Parent) //that is stream reference to Parent set in (A);
             write Child.child = null
        write Parent.child = refid(Child)

and during de-serialization:

  x = deserialize(...)
    create new JComponent (Parent)
       set Parent.parent = null
       create new JComponent (Child)
             set Child.parent= Parent;
             set Child.child = null
       set Parent.child = Child
   return Parent

Now arm it with writeReplace and readResolve exactly as defined above.

serialize(Parent)
   ... →
      call writeReplace(Parent)
      write new MyColdStorage (Parent_cold)
       write Parent_cold.parent = null
        call writeReplace(Child)
        write new MyColdStorage (Child_cold)
             write Child_cold.parent= refid(Parent_cold);
             write Child_cold.child = null
        write Parent_cold.child = refid(Child_cold)

and during de-serialization:

  x = deserialize(...)
    create new MyColdStorage (Parent_cold)
       set Parent_cold.parent = null
       create new MyColdStorage (Child_cold)
             set Child_cold.parent= Parent_cold;
             set Child_cold.child = null
       call readResolve(Child_cold) (Child)
             new MyGUI (Child)
               Child.parent = Child_cold.parent (Parent_cold)
               Child.child = Child_cold.child (null)
       set Parent_cold.child = Child
   call readResolve(Parent_cold) (Parent)
       new MyGUI (Parent)
         Parent.parent = Parent_cold.parent (null)
         Parent.child = Parent_cold.child (Child)          
   return Parent

Did You notice the problematic lines? Child.parent ends up pointing at Parent_cold instead of Parent.

In this cyclic structure the first use of the de-serialized parent reference happens before the place in which its readResolve(Parent_cold) is invoked. This is because the designers of standard Java serialization assumed that to resolve an object You need it to be fully read. And of course, since we have a cyclic structure, the process of reading the “Child” in this example refers to the “Parent” before it was fully read. Thus it accesses an unresolved object.

In my case it would produce a ClassCastException, because MyColdStorage is not a JComponent.

It is even worse: we now have two objects, an unresolved MyColdStorage and a resolved MyGUI, where we originally had a single object.

writeReplace/readResolve doesn’t work in cyclic structures.

Note: This is specified and designed behavior. I can’t tell whether it was intentionally created like that, because a solution is trivial, but nevertheless You will find it in the serialization specs.
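The failure is easy to reproduce without any Swing at all; a pair of tiny classes wired into a cycle is enough. Gui and Cold below are stand-ins of my own for MyGUI and MyColdStorage; the field layout is invented for the sketch:

```java
import java.io.*;

// Live form; replaced by Cold on serialization.
class Gui implements Serializable {
    Gui parent, child;
    Gui() {}
    Gui(Cold c) { this.parent = c.parent; this.child = c.child; }
    private Object writeReplace() { return new Cold(this); }
}

// "Cold" stream form; resolved back to Gui on de-serialization.
class Cold implements Serializable {
    Gui parent, child;  // typed fields, just like JComponent's fields are
    Cold(Gui g) { this.parent = g.parent; this.child = g.child; }
    private Object readResolve() { return new Gui(this); }
}

public class CyclicReplaceDemo {
    static String roundTrip() {
        try {
            Gui parent = new Gui();
            Gui child = new Gui();
            parent.child = child;
            child.parent = parent;          // the cycle

            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(parent);
            }
            try (ObjectInputStream ois = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray()))) {
                ois.readObject();
                return "resolved cleanly";
            }
        } catch (ClassCastException e) {
            // the back reference handed out the unresolved Cold where a Gui was expected
            return "ClassCastException";
        } catch (Exception e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] a) {
        System.out.println(roundTrip());
    }
}
```

On the JDKs I have tried this pattern on, de-serialization fails with the very ClassCastException described above, thrown while the back reference is assigned into a Gui-typed field.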

How to solve it?

The answer is simple: with standard serialization You can’t. Once it is “hot” it will be “hot” for an eternity.

But if You write Your own serialization engine the solution is simple. Instead of one readResolve use two:

class MyColdStorage
{
  Object readReplace()
  void fillReplacement(Object into)
}

Now the readReplace is bound to create an “empty” object of correct type:

  Object readReplace(){ return new MyGUI(); };

and the fillReplacement is bound to transfer data from the stream form to the target form:

  void fillReplacement(Object into)
  {
    ((MyGUI)into).parent = this.parent;
    ((MyGUI)into).child = this.child;
  };

The readReplace is invoked right after a new instance is created, and the returned value is put into the “stream_reference_map” (see there) instead of the original.

The fillReplacement is invoked in exactly the same place where the standard readResolve() is invoked but, contrary to the original, the “stream_reference_map” is left untouched.

Then make de-serialization to look like:

 x = deserialize(...)
   create new MyColdStorage (Parent_cold)
    call readReplace() → since now each time "Parent" is referenced use returned value (Parent_R)
     set Parent_cold.parent = null
     create new MyColdStorage (Child_cold)
     call readReplace() → (Child_R)
       set Child_cold.parent= Parent_R;
       set Child_cold.child = null
        call fillReplacement(Child_R) (Child_cold)
         set Child_R.parent = Child_cold.parent (Parent_R);
         set Child_R.child = Child_cold.child (null);
     set Parent_cold.child = Child_R
    call fillReplacement(Parent_R) (Parent_cold)
      set Parent_R.parent = Parent_cold.parent (null);
      set Parent_R.child = Parent_cold.child (Child_R);
   return Parent_R

So is it multi-purpose now?

No.

It allows us to change the purpose, but not to serialize the same object once for one purpose and once for another in exactly the same application.

Can we do it?

Of course.

With “object transformers”. There is absolutely no need for the writeReplace/readReplace+fillReplacement trio to be private methods of the serialized class. They can be any methods defined anywhere, providing the serialization mechanism can find them. For example, we may define:

public interface ITypeTransformer
{
  public boolean isTransformableByMe(Object x);
  public Object writeReplace(Object x);
  public Object readReplace(Object x);
  public void fillReplacement(Object from, Object into);
}

plug it into our serialization engine and be happy.
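A minimal, self-contained sketch of such a transformer for the logger example. The engine that would discover and invoke it is omitted, and all class bodies here are my assumptions; only the four-method shape comes from the interface above:

```java
import java.util.ArrayList;
import java.util.List;

// The transformer contract proposed above.
interface ITypeTransformer {
    boolean isTransformableByMe(Object x);
    Object writeReplace(Object x);                   // live form -> stream form
    Object readReplace(Object x);                    // stream form -> empty live object
    void fillReplacement(Object from, Object into);  // copy data into the live object
}

// Illustrative data classes, not from any library.
class LocalLogger { String port; List<String> data = new ArrayList<>(); }
class ColdLogger  { String port; List<String> data; }

class ColdTransformer implements ITypeTransformer {
    public boolean isTransformableByMe(Object x) { return x instanceof LocalLogger; }
    public Object writeReplace(Object x) {
        LocalLogger l = (LocalLogger) x;
        ColdLogger c = new ColdLogger();
        c.port = l.port; c.data = l.data;
        return c;
    }
    public Object readReplace(Object x) {
        return new LocalLogger();  // empty object, registered in the map at once
    }
    public void fillReplacement(Object from, Object into) {
        ColdLogger c = (ColdLogger) from;
        LocalLogger l = (LocalLogger) into;
        l.port = c.port; l.data = c.data;
    }
}

public class TransformerDemo {
    // Simulates the engine's call order: replace on write, replace+fill on read.
    static LocalLogger demo() {
        ITypeTransformer t = new ColdTransformer();
        LocalLogger live = new LocalLogger();
        live.port = "COM1"; live.data.add("sample");
        Object stream = t.writeReplace(live);    // what would go to disk
        Object revived = t.readReplace(stream);  // what the reference map points at
        t.fillReplacement(stream, revived);      // data transfer; map untouched
        return (LocalLogger) revived;
    }

    public static void main(String[] a) {
        System.out.println(demo().port); // prints "COM1"
    }
}
```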

Can You do it with a standard serialization?

No. Absolutely not.

Summary

After reading this blog entry You should understand that different applications may need to serialize the same object in different ways. You should be aware of the fact that standard serialization is “cast in stone” in that manner, and that the writeReplace/readResolve mechanism is broken and won’t help You here.

You should also know that if You decide on Your own serialization engine, then You can do it in a very easy way.

Java serialization: new Object()…?

In previous blog entries I presented the problems and concepts related to recursive data structures, stream references and reflections instead of pointers.

Now it is time to move to the next important and, as You will see, most problematic case when it comes to Java serialization. And the problem is…

How to create a new object?

Whenever a binary stream is de-serialized it will, sooner or later, be told to turn a stream command into a hot, living Java Object. Of course not just any object, but one of an exactly specified class.

How do You create an object?

In source code it is simple:

  class my_class...
  my_class x = new my_class(...)

Of course serialization runs on reflections, so we have to:

Class<?> _class = Class.forName(....);
_class.getDeclaredConstructor().newInstance();

Unfortunately this will not work. Or, to be precise, it will not work in about 75% of cases.
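A quick illustration of the simplest failure mode: a class without a parameter-less constructor. NoDefault is a made-up example class:

```java
// A perfectly ordinary class... which reflection cannot instantiate this way,
// because it has no parameter-less constructor.
class NoDefault {
    final int x;
    NoDefault(int x) { this.x = x; }
}

public class ReflectNewDemo {
    static String tryCreate() {
        try {
            Class<?> c = Class.forName(NoDefault.class.getName());
            c.getDeclaredConstructor().newInstance();  // looks up the no-arg constructor
            return "created";
        } catch (ReflectiveOperationException e) {
            return e.getClass().getSimpleName();       // "NoSuchMethodException"
        }
    }

    public static void main(String[] a) {
        System.out.println(tryCreate());
    }
}
```

And even when a parameter-less constructor does exist, running it may not be what serialization wants, as explained below.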

What is behind a new?

The simple new ... is, at the JVM bytecode level, a two-phase process:

  1. First the JVM is told to create a new, empty body of an object. There is a dedicated bytecode for it and a dedicated JVM native hook for those who implement JDK reflections.
  2. Second, the specified constructor (or the default, parameter-less one) of the actually created class is invoked.

new empty object

Each object in any programming language exists in three “realms”:

  1. Memory allocation realm.
  2. Object management realm.
  3. User data realm.

Memory allocation realm

Each object must use some memory. To be able to do so, the program must somehow assign it to the object. In the case of C++ it may be either global static allocation, allocation on the local variables stack, or dynamic (heap) allocation:

 my_class X;
...
void my_code()
{
   my_class X;
};
...
  my_class *X = new my_class();
  my_class *X = (my_class*)malloc(sizeof(my_class));

The first two cases do not appear in JAVA, so let us keep an eye on the third one. The malloc in the above equation is very much like what happens when the JVM creates an empty body of an object. Not exactly, but alike.

All malloc needs to do is to reserve a space in memory large enough to keep the object management data plus the user data. It also needs, however, to store some additional information which will later allow free(...) to work. Or, in the case of the JVM, allow the object to be garbage collected.

In C the allocated memory block is not considered to be an object. It is just a bunch of bytes which may be interpreted as an object, but I recommend You not to try doing that.

In JVM we don’t have a facility to allocate memory which is not an object.

Object management realm

Once we have a block of bytes we need to turn it into an object. This requires, at least:

  • writing into it a reference to an object descriptor;
  • wiping out object data to default state, if language requires that. In JAVA it is required, so it is done;

Of course different systems may use different approaches, but the absolute minimum is to let the runtime code select a proper implementation of a virtual method. The object descriptor, of which the so-called “virtual methods table” is a part, is one possible way of doing that.

Note: A user might have noticed that if an object has no virtual methods there is no need for a “virtual methods table” and thus no need for an “object descriptor”. This is true. Fortunately in the JVM every object has some virtual methods, so we need not be bothered by this border-line case.

What is important to remember is that the allocation of an empty object is a two-step process:

  1. Allocation of memory.
  2. “Arming” that memory with object infrastructure.

User data realm

These are simply: fields. Named fields listed by source code.

Initialization of those fields to their proper state is done by constructor.

Where is the problem?

The JVM specification was designed with great flexibility in mind. Thus the two-step sequence described above is not strictly defined when it comes to what exactly should happen in each step. Which part of each “realm” is handled by which operation is not specified. It may be that the object is completely ready once allocated, or it may be that the constructor is responsible for creating the object management data.

The specification is however very explicit on one point: it requires that the JVM refuse to execute code which tries to perform just one of those steps, performs them out of order, uses an inappropriate constructor, or allows a reference on which no constructor was called to escape. In other words: a program which can see an empty object (that is, one without a constructor run on it) won’t be executed.

Note: The OpenJDK is implemented (last time I looked at it, that is) in such a way, that memory allocation initializes both “memory” realm and “object management” realm. But it doesn’t always have to be like that.

In simple words: we can’t allocate an object without calling a constructor.

What does it mean for serialization?

Serialization should be transparent

Serialization should be as transparent as possible. We should be able to “move” object from one machine (or moment in time) to another through a binary stream in such a way, that the object won’t notice that.

For it to be possible, we need to do:

  1. Create an empty object, but “armed” in “object management” realm.
  2. Write data into the object's fields, field by field, thus transporting the “user realm” from one place to another.

We should be able to do it without calling any code on an object.

Note: There are plenty of cases when we will have to call specific code on a de-serialized object, but we will talk about that in other blog entries.

And this is the catch.

We can’t do that with JAVA reflections.

We can’t because we can’t prevent constructor from running.

Why is it a problem?

Because a standard constructor doesn’t know whether it is used to “do nothing” during de-serialization or to create a live, ready-to-use object. In plenty of cases default constructors initialize some data structures which are absolutely necessary for the object to work, while during de-serialization those structures will be overwritten with data coming from the binary stream.

In some cases it may work, in some it may mostly work, in others it will be a tragedy.

How is it done in standard JAVA serialization?

Very simple.

By a hack.

There is an internal JVM class which pierces through the JVM defenses and allows splitting the operation. It allows two things:

  • to allocate an “armed” object without calling a constructor;
  • to invoke an inappropriate constructor on an object;

It is there… but we shouldn’t bother with it, because it is/will be hidden in post-OpenJDK 19 releases. Which is good, because it was a hell of a hack which exposed too many internal concepts to users. On the other hand, users who wrote their own serialization implementations were forced to reverse engineer this hack, because serialization was specified in a conceptually wrong manner. Think ten times, do once, that is my motto.

What we should remember is that it is impossible to re-implement standard serialization in pure JAVA. Period. Not possible. Really. Been there, tried it, always ended with a hack. Not to mention that ObjectOutputStream/ObjectInputStream represent one of the worst cases of Object Oriented Programming I have seen (based on JDK 8). If You wish to learn how not to write programs then take a look at them.

What to do, what to do?!

Well… Not much is left, is there?

Use constructor.

If You design a serialization system, define some “serialization context” class, e.g.:

public final class MySerializationContext{};

and request that all serializable classes should have a constructor:

public class MySerializableClass
{
  public MySerializableClass(MySerializationContext x){};
}

The de-serialization code will execute this constructor, and it is up to this constructor to ensure that an “empty” object is in fact “empty”.

Note: Notice, that with this approach Your object may support different serialization engines since it has a dedicated constructor to support each of them.
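A small sketch of this pattern. MySerializationContext and the engine that would call the special constructor are assumptions following the article's proposal, not standard API:

```java
import java.util.ArrayList;
import java.util.List;

// Marker type whose only job is to select the "serialization" constructor.
final class MySerializationContext {
    static final MySerializationContext INSTANCE = new MySerializationContext();
    private MySerializationContext() {}
}

class DataLogger {
    List<String> lines;

    DataLogger() {                 // normal construction: working defaults
        lines = new ArrayList<>();
        lines.add("log opened");
    }

    DataLogger(MySerializationContext ctx) {
        // Deliberately empty: the de-serialization engine will write 'lines'
        // from the stream, so nothing allocated here can be silently
        // overwritten and wasted.
    }
}

public class ContextCtorDemo {
    public static void main(String[] a) {
        DataLogger normal = new DataLogger();
        DataLogger empty = new DataLogger(MySerializationContext.INSTANCE);
        // prints "1 true": normal object is initialized, "empty" one is truly empty
        System.out.println(normal.lines.size() + " " + (empty.lines == null));
    }
}
```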

If constructor is not what You like, You may require a class to have a “factory” method:

public class MySerializableClass
{
  public static MySerializableClass newInstanceOf(MySerializationContext x){....}
}

which is a must if Your class is a “singleton”.

Note: A “singleton” class is a class of which only one instance may exist per application. Naturally, serialization of such a class is a tricky concept, but sometimes it just means: “use the same facility on the target system as others do”. De-serialization of such a class of course cannot create a new instance and should use the instance present in the target system instead.

What not to do?

Don’t follow the java.io.Externalizable concept. It is broken beyond imagination. It calls the generic param-less constructor, so the class doesn’t know whether it is being de-serialized or created anew. It then expects that readExternal actually reads the object, which it can’t do well if the object takes part in a recursive data structure.

Don’t even try:

public class MySerializableClass
{
  public MySerializableClass(MyInputStream de_serialize_from_it){ .... };
}

which is even worse.

Summary

After reading this blog entry You should be aware, that object construction isn’t just one simple operation. You should also be aware, that in JAVA we can’t allocate without calling a constructor. And You should remember that standard JAVA serialization is a hell of a hack.

Java serialization: recursive data structures

In the previous chapter I talked about how to deal with references/pointers and introduced the concept of a “refid” to replace them in a stream.

Now it is time to talk about the fields in an object, especially fields which are themselves references.

Serializing an object

Note: I abstract away from how we get the information about which fields to store. It may be through “reflections” but may also be done directly. For the purpose of this blog entry it doesn’t matter at all.

Imagine we have an object like this:

class Node
{
   private final int payload;
   private final Node next;
   private final Node prev;
}

The reader will quickly recognize that it is a piece of a so-called “bi-directional list”.

The payload is the actual data, which in this case is just a number; next points to the subsequent node in the list or is null, and likewise prev points to the previous node. The beauty of a bi-directional list is that, having a reference to any node, one can traverse the entire list.

Serializing it

Now let us define a very simple algorithm which will serialize any object:

serialize(Object n)
{
   for(f = each field in n)
   {
     if (f is primitive field) //*1
     {
          write its bits to stream
     }else
     {
       assert(f is reference)
       if (stream_reference_map contains f)  //*2
       {
            take refid of f from stream_reference_map;
            write refid of f to stream to indicate
             re-use of previously stored object;
       }else
       {
            assign new refid to f;
            store f and refid in stream_reference_map;
             write to the stream the information necessary
             to create a new instance of f and retain its refid;
            call serialize(f); 
       }
      } //non primitive field.
   }//for each 
}//serialize

Ehmm… and this is in fact the entire algorithm of Java serialization. Seriously. It is. Simple as a hammer. Except it is not, if You dig into the details and ponder how to do it well.

Now where is the catch?

The first catch (*1) is that we need to tell a reference apart from primitive data. Primitive data can be stored as bits and bytes in the stream. References have to be converted to a “refid” using the method I described in the previous blog entry. The stream_reference_map is used to do that.

The second catch (*2) is to detect the re-use of a previously serialized object, or of an object which is currently being serialized.

The “bi-directional list” is an example of a cyclic data structure. If You just walk through the next and prev fields, iterating and calling serialize(...) on each field, You will never finish the job. This is why we first store an object in stream_reference_map and only then invoke serialize(...) on it: this allows us to break the cycle.

Note: Cyclic structures have nasty side effects during de-serialization too, but I will write about that another time.
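The pseudocode above can be turned into a toy, reflection-based serializer to see the cycle-breaking in action. The token format (“new#”, “ref#”) is invented here purely for illustration, and the Node fields are made non-final so the example stays short:

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// The bi-directional list node from the article (fields non-final for brevity).
class Node {
    int payload;
    Node next, prev;
    Node(int payload) { this.payload = payload; }
}

public class ToySerializer {
    private final Map<Object, Integer> streamReferenceMap = new IdentityHashMap<>();
    private final List<String> out = new ArrayList<>();

    List<String> run(Object root) throws IllegalAccessException {
        streamReferenceMap.put(root, 0);           // register the root first
        out.add("new#0");
        serialize(root);
        return out;
    }

    private void serialize(Object n) throws IllegalAccessException {
        for (Field f : n.getClass().getDeclaredFields()) {
            f.setAccessible(true);
            if (f.getType().isPrimitive()) {       // *1: raw bits go straight out
                out.add(String.valueOf(f.get(n)));
            } else {
                Object v = f.get(n);
                if (v == null) { out.add("null"); continue; }
                Integer refid = streamReferenceMap.get(v);
                if (refid != null) {               // *2: seen before -> back reference
                    out.add("ref#" + refid);
                } else {
                    int id = streamReferenceMap.size();
                    streamReferenceMap.put(v, id); // store BEFORE recursing: breaks cycles
                    out.add("new#" + id);
                    serialize(v);
                }
            }
        }
    }

    public static void main(String[] a) throws Exception {
        Node first = new Node(1), second = new Node(2);
        first.next = second;
        second.prev = first;                       // cyclic: first <-> second
        // e.g. [new#0, 1, new#1, 2, null, ref#0, null] -- terminates despite the cycle
        System.out.println(new ToySerializer().run(first));
    }
}
```

Without the “store before recursing” line the two nodes would chase each other forever.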

Recursion → potential of StackOverflowError

If You try this exact example in real life it will work, and then fail. The failure will depend on the size of the list You serialize, on the JVM used and on the computer You run it on, and it will be seen as a StackOverflowError being thrown.

Why will it happen?

Because You will have a following call:

serialize(x)
{
   ...serialize(x.next)
  { 
        ...serialize(x.next.next)
        {
           ....and so on till whole list is serialized
        }
  }
};

This is how an excellent feature of the “bi-directional list” turns into a problem. You can reach the whole list starting from any node by traversing the next/prev fields. Thus, during this plain serialization, You will traverse the whole list in just a single call to serialize(...) on any node.
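This is easy to demonstrate with standard serialization itself: a long enough singly linked chain of serializable nodes overflows the stack. The node count below is my guess at a value that comfortably exceeds typical default stack sizes:

```java
import java.io.*;

// One recursion level per node during standard serialization.
class Chain implements Serializable {
    Chain next;
}

public class DeepChainDemo {
    static String serializeChain(int length) {
        Chain head = new Chain(), cur = head;
        for (int i = 0; i < length; i++) {   // build the chain iteratively
            cur.next = new Chain();
            cur = cur.next;
        }
        try (ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(head);           // recurses once per node...
            return "ok";
        } catch (StackOverflowError e) {
            return "StackOverflowError";     // ...and blows the stack
        } catch (IOException e) {
            return e.toString();
        }
    }

    public static void main(String[] a) {
        System.out.println(serializeChain(1_000_000));
    }
}
```

Shrink the count or enlarge -Xss and the very same code succeeds, which is the whole point: the failure depends on the data and the environment, not on the code.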

Call stack/stack frames

Now we need to get a bit into the details of how a CPU works and how compilers turn Your code into machine code.

Basically, each time You call a method You need to supply the following data to the CPU:

  • what arguments are passed to that method?
  • where is this method in memory, so that CPU can jump there?
  • where to return after the method completes?

In most languages the arguments are indirectly transformed into local variables, so if we do talk about method invocation we may treat them the same.

If You declare the method:

  private int doom(int a)
  {
    int b = a *a;
    return b;
  }

You are in fact declaring two local variables: a and b.

The locality of variables is twofold: first, their names are not visible outside of the method which declared them. Second, and most important, they have to be implemented in such a way that if two threads invoke doom(...) those invocations do not interfere with each other in any way. Each needs to use its own, separate piece of memory to hold “a” and “b”.

This can be implemented in many interesting ways, but most common and most reasonable way it is to use a concept of “stack-frame”.

The “stack frame” is a block of memory where all the information mentioned above is stored. For example like this:

position | name   | meaning
---------+--------+------------------------------------------------------------------
0        | a      | space reserved for the parameter “a”
1        | b      | space reserved for the local variable “b”
2        | retptr | space in which we store the return address to jump to when the method ends

Whenever the method is invoked the new “stack frame” must be allocated for it.

Call stack

There is no obvious problem visible yet. A new “stack frame” must be allocated? Well… what’s the problem?

The problem is: speed.

Calling code is one of the most common and elementary actions a CPU may take, so it must be extremely fast.

To support it, 99% of CPUs have dedicated instructions:

call TARGET
ret

which are implemented in following way (I do assume TARGET is a compile-time constant in this example):

call TARGET:
{
 x = memory[PC];//PC is "program counter register"
                //which tells where "call" is in memory.
                //In this example I assume TARGET is stored
                //right after "call" instruction.
   PC=PC+1;     //so that we can read next instruction data.
   take SP  ;   //SP is "stack pointer register"
   memory[SP]=PC;//save address of code after our "call"
   SP = SP+d;   //d is a size of return address
                //expressed in SP units
   PC=x;        //and jump to TARGET.
 }
ret:
{
  SP=SP-d;      //make SP to point at begin of stored PC
  PC=memory[SP];//jump there
}

The remaining CPUs have:

 brl TARGET_REGISTER,LINK_REGISTER
 br TARGET_REGISTER

where “brl” stands for “branch with link” and “br” is for “branch”. Both are implemented as simple:

brl TARGET_REGISTER,LINK_REGISTER:
{
   PC=PC+1
   LINK_REGISTER=PC
   PC=TARGET_REGISTER
}
br TARGET_REGISTER:
{
   PC=TARGET_REGISTER
}

You may easily see that “brl/br” is just a piece of “call” and “ret” without SP being involved. Of course, since the number of registers is finite, they have to be stored somewhere during a call: usually on the “stack frame”, for whose allocation the SP will be used. So usually the compiler will use pre-increment/post-decrement memory addressing modes and generate something like this:

ld TARGET,R0
brl R0,R1
st R1,@++SP

for calling and alike sequence for return:

ld @SP--,R1
br R1

One way or another, the entire operation is done using memory[++SP]=x and PC=memory[SP--] operations, which in fact use ++SP and SP-- to allocate and release the tiny bit of memory necessary to store the return address.

This memory space is called “call stack”,”control stack” or simply a “stack”.

This simple method of allocation is fast. And since it is fast, 99% of compilers use exactly the same space to allocate “stack frames”.

Our doom may be compiled into:

doom:
  SP+= sizeof(a)+sizeof(b)
  ...play with them
  SP-= sizeof(a)+sizeof(b)
 ret

Again, it is fast and supported directly by a hardware.

Note: There are CPUs which have a separate physical memory for the “control stack”. In such a case it cannot be used for local variables, and a separate “stack” is created by the compiler. It will look very much alike though, with the exception that the return address won’t be a part of the “stack frame”.

Problem with ++SP/SP-- memory allocation

The problem with this simplicity is that it is very simple indeed. It needs continuous memory. This means that the compiler must decide from the start in which address space to put static variables, in which the dynamic variables (those managed by new and the garbage collector), and in which the “call stack”.

Or, in fact, “call stacks“. Because we need one of them for each thread.

Note: A CPU capable of Virtual Memory may actually use the Memory Management Unit to place all call stacks at the same logical address. I haven’t seen it done though. Possibly because it is very CPU+OS specific and makes thread switching more time consuming, as each time not only the registers but also the MMU map would have to be restored from the thread state data. And since the “call stack” is accessed through the SP register anyway, there is no real benefit in doing that.

One way or another, the compiler must decide at the beginning what the size of the stack for each thread will be. It must arrange it and allocate it somehow (it can do it with the new subsystem too), but once that is done, it must be fixed in the address space and must be large enough to accommodate all the “stack frames” which will be needed by the thread.

StackOverflowError

And after this digression we are back to Java.

The fact that the stack is hardware managed, and that it must be large enough to accommodate all “stack frames”, creates a tricky problem.

What happens if You try to store too many stack frames on a stack?

It will overflow. This overflow basically means overwriting the data located after the stack, messing with it, and then returning as if nothing happened. This is one of the nastiest problems to detect, since cause and effect are far, far away from each other in time. Very common in the embedded world.

The C language defines the stack that way, mostly because not all CPUs can detect the problem before it happens. Most modern, big CPUs can do it without any performance penalty. If however it is not supported by the hardware, proper testing code can always be injected:

doom:
  if SP is such that stack frame won't fit
  {
       throw an exception
  }
  SP+= sizeof(a)+sizeof(b)
  ...play with them
  SP-= sizeof(a)+sizeof(b)
  ret

This will generate a constant and significant performance penalty. Java was, however, designed with safety in mind, and when its designers had a choice between speed and safety, safety came first.

If Java detects stack problem it throws the StackOverflowError.

This is a very brutal error and it is rather hard to recover from. Possible, but it usually indicates a huge problem with the code… or does it?

Back to our list

And now we can easily see why using recursion to serialize a bi-directional list is a bad approach. It is bad because the number of recursive calls depends on the size of the list, while the stack space remaining when we call serialize(...) depends on the program state and execution context. The JVM, depending on settings, version and environment, may assign stacks as small as 16kB or as big as 4MB. This means that on some computers said list may serialize without problems, while on others it may throw this nasty exception. And even worse, You can’t check it or control it from the code.
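You can measure how deep recursion actually goes on a given setup; the number differs between JVMs, -Xss settings and even runs, which is exactly the problem:

```java
// Counts how many trivial stack frames fit before StackOverflowError.
// The exact number is environment-dependent by design; run with different
// -Xss values to watch it change.
public class DepthDemo {
    static int depth = 0;

    static void recurse() {
        depth++;
        recurse();                 // no termination: we ride until the stack ends
    }

    static int measure() {
        depth = 0;
        try {
            recurse();
        } catch (StackOverflowError e) {
            // stack exhausted; 'depth' now holds how many frames fitted
        }
        return depth;
    }

    public static void main(String[] a) {
        System.out.println("max depth ~ " + measure());
    }
}
```

A serializer whose frames are bigger than this trivial one will, of course, manage proportionally fewer recursion levels.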

Solving this problem

As You can see, recursive serialization, even though simple in concept, may fail miserably on small but cyclic data structures.

The ideal solution would be to not use recursion at all, but it complicates things a lot. I will discuss it some time later in following posts.

Of course this is not a problem unknown to the world. It is known, and was known from the beginning. The designers of Java serialization decided to solve it in another way: they left the recursion in place but allowed custom serialization code, like this:

serialize(Object n)
{
   if (n contains custom serialization )
   {
      invoke it
   }
   else
   {
    for(f = each field in n)
    {
      if (f is primitive field) //*1)
      {
       ...

The java.util.LinkedList makes use of that functionality. It is a bi-directional list and it serializes well, because it has custom serialization code which does:

custom_writeObject(...)
{
   for(x: this) serialize(x);
}

which, basically speaking, iterates over the list and serializes the payload stored in the list’s Node objects. The nodes of the list itself are not serialized at all.
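The effect is easy to verify: a java.util.LinkedList of a size that would kill the naive recursive approach serializes just fine. The element count below mirrors the chain example and is only a guess at a typical stack-busting size:

```java
import java.io.*;
import java.util.LinkedList;

public class LinkedListDemo {
    static String serializeBigList(int n) {
        LinkedList<Integer> list = new LinkedList<>();
        for (int i = 0; i < n; i++) list.add(i);
        try (ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(list);  // LinkedList's custom writeObject iterates,
            return "ok";            // so recursion depth stays constant per element
        } catch (IOException e) {
            return e.toString();
        }
    }

    public static void main(String[] a) {
        System.out.println(serializeBigList(1_000_000)); // prints "ok"
    }
}
```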

Is it good?

Yes and no. In my opinion it is not a good solution. It relies on the fact that the person who wrote the data structure realized that it may cause stack problems and dealt with it. What is wrong, in my opinion, is that the appearance of the problem depends on the size of the data stored in said structure. Size is a matter of the context in which the structure is used, while recursion and custom code are a matter of the design process.

It also generates a lot of problems with versioning etc., creating “cast in stone” code, but this is not the right place to write about it.

Plus, cyclic structures pop up continuously in many places and are a natural method of implementing so-called “observable data”. Sooner or later You will attempt to serialize them without realizing it.

Summary

After reading this, again too long, blog entry You should have a grip on how an object is scanned during the serialization process and what problems it may produce. You should also be able to recognize possible problems with the call stack when designing Your own algorithms which involve crawling through linked and cyclic data of significant depth. And, of course, You should now realize the real reason behind the writeObject/readObject private methods which You may encounter in many JDK classes.

And now it is time to see how an object is created when de-serialization takes place.

JAVA serialization: if not the pointer then what?

In that post I talked about “reflections” in JAVA and how this concept relates to serialization.

In this post I would like to say a few words about how to deal with “object references” if we can’t have a pointer.

Why do we need a pointer?

Under the hood, to say what is where in memory. But looking from the outside, just for one thing: to tell two objects apart. If their references (i.e. pointers) differ then those are not the same objects. They may be bit-by-bit equal but they are not the same.

So basically as long as we can do 1:1 mapping:

  Object reference ↔ sequence-of-bits

then we are done.

Identity of objects in JAVA

Gladly, JAVA provides two facilities for that:

class System{
  ...
  public static int identityHashCode(Object x)
  ...
}

which computes a “magic” number providing a non 1:1 mapping:

  Object reference → 32 bit integer

and

  Object X, Y;
    X==Y
    X!=Y

the reference identity operator, which can tell when two “pointers” point to the same object.

Those two are enough to create an identity hash-map (like java.util.IdentityHashMap) which can quickly map an Object to an int:

  stream_reference_map = new IdentityHashMap<Object, Integer>()

Of course we could do the same without the identityHashCode using only == operator and a list of structures like:

class Descriptor
{
  final Object reference;
  final int assigned_number;
}

but it would be a few orders of magnitude slower.

Stream reference

The stream_reference_map shown above maps an Object to an int number.
This number is called a “stream reference identifier” or, in short: “refid”.

Note: Remember, the “refid” is not the result of identityHashCode()! The identityHashCode() does not produce a 1:1 mapping! It may return the same number for many objects. It is used just to speed things up by grouping objects into “buckets”, over which we still need to use the == operator.

Producing stream reference identifier

Any method will do. You should however think about a few questions:

  1. Should I allow transfer of unlimited number of objects to stream?
  2. Should I allow garbage collection and re-use of refid?

Usually a simple incrementing counter will be ok.
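A minimal sketch of that counter-based assignment (all names here are mine, not the standard API's):

```java
import java.util.IdentityHashMap;

// Assigns "refids" with a simple incrementing counter; identity, not equals(),
// decides whether an object was already written to the stream.
final class RefidGenerator {
    private final IdentityHashMap<Object, Integer> stream_reference_map =
            new IdentityHashMap<>();
    private int refid_generator = 0;

    /** Returns the refid already assigned to o, or assigns the next free one. */
    int refidOf(Object o) {
        Integer refid = stream_reference_map.get(o);
        if (refid == null) {
            refid = refid_generator++;
            stream_reference_map.put(o, refid);
        }
        return refid;
    }
}
```

Note that two bit-by-bit equal but distinct objects get two distinct refids, exactly because the map is an identity map.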

Using stream reference

Basically You use it exactly the way You would use a “pointer”. You would like to write a pointer to object X to a stream? Then You look up the “refid” of X and write that “refid” to the stream. Simple.

The question is when You would like to write a pointer, but this is another story.

Reading-side map

The above:

  stream_reference_map = new IdentityHashMap<Object, Integer>()

provides the Object → int map. Unfortunately this is just one part of the story, the part used to write pointers to a stream. The other part of the story is: what to do with a “refid” we read from a stream?

The reading side needs:

  int → Object 

map. Gladly, if You have chosen an incrementing counter as the “refid” generator and You are fine with 2^31-1 objects in a stream, the simple:

  read_refid_map = new Object[...];

will do the best job.

Note: Unless You are actually planning to get anywhere near the 2^31 region in number of objects. A more “scattered” structure will better handle growing and shrinking during the life of the serialized stream.

Problems

The first problem, which is not dealt with in standard serialization, is a memory leak. Yes, standard serialization does leak as hell!

Hard-reference+garbage collector==memory leak

The stream_reference_map = new IdentityHashMap<Object, Integer> used at the writing side utilizes the standard, plain reference to an Object as a “key” in the map. This has an unfortunate effect: as long as this map exists the garbage collector will see all contained objects as “reachable” and won’t release them.

Usually it is not a problem, but if You decide to, for example, use serialization for logging in Your application, You will get a nasty surprise.

Imagine You arm Your application with logging commands in the following manner:

void woops(int a,int b)
{
  ....
  if (log_level_enabled) log_object_output_stream.writeObject("calling woops("+a+","+b+")");
  ...
}

Each time this code runs, a new string is formed and written to the stream as an object. This means that it must have a “refid” assigned. And if it must have one assigned, then it must be put into the stream_reference_map. Since the map is using hard references, the string will stay there forever. Or, precisely, until OutOfMemoryError.

A proper stream_reference_map must hold references to mapped objects through a WeakReference.
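A sketch of such a map, with all names invented for illustration: keys are held through WeakReference, buckets are keyed by System.identityHashCode() and collisions are resolved with ==, exactly as described in the note above:

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

/** Minimal sketch of a weak, identity-based Object -> refid map.
    Once the only reference to a key is this map, the GC may reclaim it. */
final class WeakIdentityRefidMap {
    private static final class Entry {
        final WeakReference<Object> key;
        final int refid;
        Entry(Object k, int refid) {
            this.key = new WeakReference<>(k);
            this.refid = refid;
        }
    }
    // bucket by identityHashCode, resolve collisions with ==
    private final HashMap<Integer, List<Entry>> buckets = new HashMap<>();
    private int next_refid = 0;

    /** Returns the existing refid of o, or assigns a fresh one. */
    int refidOf(Object o) {
        List<Entry> bucket = buckets.computeIfAbsent(
                System.identityHashCode(o), k -> new ArrayList<>());
        bucket.removeIf(e -> e.key.get() == null);   // purge collected keys
        for (Entry e : bucket)
            if (e.key.get() == o) return e.refid;    // identity, not equals()
        int refid = next_refid++;
        bucket.add(new Entry(o, refid));
        return refid;
    }
}
```

A production version would also use a ReferenceQueue instead of lazy purging, but that is beside the point here.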

Passing garbage collection event

Of course, even if You deal with the above, You will still hit OutOfMemoryError at the reading side of the stream.

The simplest:

  read_refid_map = new WeakReference<Object>[...];

will not work. The weak reference works at the writing side because, if the only place where the object exists is the stream_reference_map, then there is no way for it to be written to the stream again.

At the reading side it is very different. The reading code may pick a “refid” (and its object) from the stream and drop them right on the spot. The writing side may however hug the object for a very long time and write it to the stream many times. Of course, to avoid many problems which I will discuss somewhere else, it will prefer to write the “refid” of it. If the read_refid_map were made of WeakReferences, there might no longer be any object to map that “refid” to.

A good “refid” system passes garbage collection events to the reading side.

Roll over

Of course int isn’t infinite. Even if You use proper garbage collection of “refid”s You will still, sooner or later, hit:

   assert(refid_generator+1 > refid_generator )

that is, a “signed wrap-around”. You will run out of possible “refid”s to use.

This is something which is also not addressed in standard serialization. The worse problem is that standard serialization does not utilize the entire 2^31-1 pool of numbers, so the roll-over happens earlier, producing some command codes instead of “refid”s. Fortunately You need a really huge VM to hit this problem, since usually OutOfMemoryError will appear first.

A good “refid” system re-uses garbage collected refids to avoid roll-overs.

Summary

After reading this chapter You should know what the “stream reference identifier” is and how not to design the system which manages it. This should also make You notice that a standard serialization stream cannot exist permanently nor be used for a large amount of data produced on demand.

And now You may move to the following part, in which You will read about how an object is scanned during serialization and what problems it may create.

Introduction to JAVA serialization

In previous chapters I talked about the generic ideas of serialization, dealing with pointers and dealing with versioning.

In this chapter I would like to show You how the concept and, of course, the problem of “pointers” is dealt with in JAVA.

JAVA limitations

No pointers

The most important limitation is: JAVA has no concept of a “pointer”. There is a beast called an “object reference” but there is no “pointer” which can be turned into an integer number representing an address in machine memory. And since there is no “pointer” there is absolutely no way to access machine memory the way You can do it in C/C++. Just no way.

Unless, of course, You try to use the unsafe set of classes and methods. Then You can play with bits and bytes, but such play will produce memory dumps which will survive neither years nor a transfer to a different virtual machine running on a different architecture. Notice also that most of the unsafe functions which were present in JDK8 are removed in JDK11+. And they will be hidden even more in future versions. Because they are unsafe and allow a hell of a lot of hacking on servers which run “outsider code”.

Thus: no possibility to do a “memory dump”. Which, in fact, isn’t bad.

But what do we have instead of “pointers”?

We have objects. Or, precisely speaking, “references” to objects. Which are internally “pointers” but are completely opaque to us and can’t be transformed into any kind of number nor anything else. Not at all. All we can do with “references” is:

  • compare them using == operator;
  • assign one to another using =;
  • use them to access fields, array elements or invoke some methods attached to them;
  • create “objects” we can later refer to through “references”.

No malloc

The second limitation is the lack of a generic way to “allocate some memory”. Any allocation must be either an allocation of an array or of an object. And when You allocate an object, one of its constructors must be called. This concept prevents us from creating an object with absolutely zero initialization and later filling it with data from a serialized form. Some collaboration from an object is required.

Note 1: The standard serialization API does interact with the VM at a native level to create really empty objects without calling an appropriate constructor. We can also do it, but in pure JAVA it won’t be possible without referencing some internal, JDK-specific classes. Notice however that re-implementing “standard” serialization is not our goal. What we aim for is a long-lasting, stable, non-tricky way of supporting flexible serialization.

Note 2: If You inspect the JVM specification, at first glance it will be difficult to find that what I just said (that the constructor must be called) is required. Indeed the instruction set does allow allocating an object and not calling a constructor at all. Such code won’t however pass the validation stage, that is the process during which the JVM checks if a class file is syntactically and logically correct. There are tricks which force the JVM to disable class file verification, but that is asking for serious problems.

JAVA benefits

To overcome the limitation related to the lack of pointers the designers of JAVA introduced one very rare and very powerful mechanism: reflections.

Not many programming languages do have it.

What are “reflections”?

In very short words, this is a set of methods and classes which allows You to inspect any object instance You hold a reference to. You can ask how this object’s class is named. You can check what classes it extends or what interfaces it implements. And You can, what is most interesting to us, ask it to list and manipulate all fields contained in it, with their names, types, annotations and currently assigned values.

This way You can even access fields which are normally hidden from You if You just try to reference them in code, which was always a somewhat disputable aspect of reflections when looked at from a security point of view.

Including, of course, fields You did not know at compile time. Exactly what we need.

Note: JDK 9+ module system puts some constraints on it, but You can still do all things we need. Just not with absolutely every object as You could have done in JDK8 and prior releases.

In simpler words: if You have any “reference” to an object then, using reflections, You can:

  • ask how the class it is an instance of is named;
  • ask what all the “fields” (that is – object bound variables) contained in it are named and what types they are;
  • ask what methods (functions to call) are declared there and what parameters they take.

And vice versa: knowing the answers to the above questions You may, using reflections:

  • create a new instance of an object knowing name of a class;
  • set to or get values from any of its fields;
  • invoke any of its methods.

Hey, I can do it in C too!

No, You can’t.

I might have been a bit imprecise in what I was saying. Yes, You can do everything like that in C. But at the source code level and at what we call “compile time”. That is, You can write:

struct X { int x; };
struct X *p = ...;
p->x = 4;

However in JAVA using reflections You can do:

String class_name = .... read it from file for an example
String field_name = ...
Object A = Class.newObject(class_name)
Field x = A.getClass().getField(field_name)
x.set(A,4)

Note: As always, the code examples are simplified and won’t compile. They are only exemplifications of some idea.

In other words, “reflections” allow You to compute the class name at run time, also from externally supplied data, and manipulate objects of that class as much as You like.

Including classes which did not even exist when You wrote the code.

Nice, isn’t it?
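For completeness, here is a version of the idea which actually compiles and runs (the class, method and field names are mine, invented for this demo):

```java
import java.lang.reflect.Field;

public class ReflectionDemo {
    // Stands in for a class whose name we only learn at run time.
    public static class Point {
        public int x;
        public Point() {}
    }

    /** Creates an instance of class_name and sets field_name to value, all by name. */
    public static Object createAndSet(String class_name, String field_name, int value)
            throws Exception {
        Object a = Class.forName(class_name).getDeclaredConstructor().newInstance();
        Field x = a.getClass().getField(field_name);   // public fields only
        x.set(a, value);                               // write the field by name
        return a;
    }

    public static void main(String[] args) throws Exception {
        // both names computed at run time, e.g. they could come from a file
        Object p = createAndSet("ReflectionDemo$Point", "x", 4);
        System.out.println(((Point) p).x);             // prints 4
    }
}
```

getField() sees only public fields; hidden ones need getDeclaredField() plus setAccessible(true), with the module-system caveats mentioned above.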

Note: Java also allows You to actually generate a binary class file in Your code and tell the JVM to load it at runtime. But we won’t be needing that for serialization.

Reflections+references versus pointers

As You might have already noticed, the lack of pointers prevents us from making a “memory dump” of an object and from manipulating its data at the bit-by-bit level. The “reflections” do however allow us to manipulate the actual data stored in an object using names and values, without bothering about how they are kept in memory.

Do we need anything more?

Summary

After reading this chapter You should be aware how the “object reference” concept differs from the concept of “pointers” and how “reflections” allow us to overcome the limitations related to the full opacity of the “object reference”.

You should also have guessed that “reflections” will play a critical role in the implementation of the “indirect versioning” concept.

And, of course, You should also have noticed that the fact that the “object reference” is fully opaque means that we simply can’t save it. But this problem is left for the next blog entry.

Serialization: versioning

In the previous post You could read about serialization in general and how pointers mess with it.

In this post I would like to touch on another aspect which influences serialization engines a lot.

This subject is:

Versioning

As I said previously, the dump of bytes and bits not only depends on the code, target CPU, compiler etc., but will also change from version to version of Your program.

Now, for the sake of this discussion, assume that You can force Your compiler to produce a stable, predictable memory layout across all compiler versions. Assume also that You do not care about different CPU architectures.

Now You can say that the bit-by-bit image of Your memory “dump” will be consistent and predictable.

True.

Unless You do something like this:

yesterday:

struct{
 char first_name[32];
 char surename[32];
 int age;
}

today:

struct{
 char first_name[32];
 char surename[32];
 boolean gender;
 int age;
}

Yes, yes, I know, we live in the era of “non-binary” persons and I was super-duper rude to use a boolean for gender. Glad You noticed that. This is a bug which will need to be fixed later and it clearly shows that data structure versioning is a must.

What has happened?

We added a field. And we have done it in a very, very bad way, by stuffing it in the middle of the data structure. And kaboom! Now the bit-by-bit images from “today” and “yesterday” are totally incompatible with each other.

Direct versioning

The first obvious solution is to arm Your memory “dump” with a “header”:

    HEADER: int serial_version;
    CONTENT:
    	struct{
    	.....
    	};

Writing such data is super easy:

    	write(...,serial_version);
    	write(...,&data_in_memory,sizeof(struct...))

Reading is a completely different story.

You have to do something like this:

    	int serial_version = read(...)
    	switch(serial_version)
    	{
    		case 0:
    			....
    		case 1:
    			....
    	}

and for each case You need to provide a transformation from “that version” into the “current version”.
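A compilable sketch of such a direct-versioning reader in JAVA (DataInputStream stands in for whatever stream You use; the Person structure and the version numbers are mine, mirroring the structs above):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// In-memory "today" structure. Version 1 of the dump added "gender".
class Person { String name = ""; boolean gender; int age; }

class DirectVersioning {
    static final int CURRENT_VERSION = 1;

    static void write(DataOutputStream out, Person p) throws IOException {
        out.writeInt(CURRENT_VERSION);      // the "header"
        out.writeUTF(p.name);
        out.writeBoolean(p.gender);
        out.writeInt(p.age);
    }

    static Person read(DataInputStream in) throws IOException {
        Person p = new Person();
        int serial_version = in.readInt();
        switch (serial_version) {
            case 0:                          // "yesterday": no gender field
                p.name = in.readUTF();
                p.age = in.readInt();
                p.gender = false;            // transformation: guess a default
                break;
            case 1:                          // "today"
                p.name = in.readUTF();
                p.gender = in.readBoolean();
                p.age = in.readInt();
                break;
            default:                         // "tomorrow": downwards only!
                throw new IOException("unsupported version " + serial_version);
        }
        return p;
    }
}
```

Notice how every old version keeps its own case, and how the default branch is exactly the "what to do?" question discussed below.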

This is a bit messy for complex structures and You need to continuously rename Your “yesterday” structures in Your source to avoid name clashes. You can’t do:

yesterday:

typedef struct{
 char first_name[32];
 char surename[32];
 int age;
} version_1;

today:

typedef struct{
 char first_name[32];
 char surename[32];
 boolean gender;
 int age;
} version_2;

future:

typedef struct{
 ...
} version_3;

because with each bump-up of the version You would have to update the entire code which references Your structure.

Instead You would rather do:

yesterday:

typedef struct{
   ...
} data;

today:

/* old definition */
typedef struct{
   ...
} v1;
/* active definition */
typedef struct{
   ...
} data;

tomorrow:

/* old definitions */
typedef struct{
   ...
} v1;
typedef struct{
   ...
} v2;
/* active definition */
typedef struct{
   ...
} data;

that is, keep the most recent version always under the same name. This is a good idea, because if there was no gender field yesterday, then no old code made any use of it. If You add it today, there is a huge chance that 90% of the code still won’t need to use it. Keeping the name unchanged saves You a lot of work.

The obvious downside is that You have to rename the “old” structures but still keep them in Your code base. With plain structs it is easy, but with objects, with their entire inheritance trees, it is a hell of a lot of mess. Doable, but messy.

We need different approach.

But before saying anything about it, let’s look at another problem.

Upwards compatibility? Downwards compatibility?

Now return to:

    	int serial_version = read(...)
    	switch(serial_version)
    	{
    		case 0:
    			....
    		case 1:
    			....
    		default: what to do?
    	}

It does not take a lot of thinking to notice that with direct versioning You can have only downwards compatibility.

“New” code can load “old” data, but “old” code cannot load “new” data.

What is the point in loading “new” data by “old” code, You say?

Well… Your yesterday structure did not have the gender field. Then the “old” program, by its nature, will not need it. Why not let it read “today” data? Is there any logical problem with it?

Again, vice versa: loading a structure without gender requires the transforming code to do some guessing. There was no information about gender, but it must have it.

Notice, there is one very serious reason to not allow upwards compatibility. That is: money. If You allow an old version of Your software to load files stored by more modern versions, then Your clients may decide to hold all their licenses back and buy just a single seat upgrade to try it out. If they find that there is no value in the upgrade, they won’t buy more seats and they won’t lose anything.

If however You design Your software in such a way that the old version can’t do anything with files written by the new version, and You prevent the new version from saving files in the old way, then it is a completely different story. Now if Your client upgrades just one seat, then the person working at it will, on a daily basis, corrupt files in such a way that the rest of Your client’s employees won’t be able to use them. And, since You prevented saving files the “old way”, after some time Your client will have a choice: either do not upgrade and throw away all the work done on that new seat, or upgrade the whole company and keep the work.

Nice, isn’t it? Welcome to the world of Autodesk Inventor!

Indirect versioning

Indirect versioning uses all the “whys” I spoke about above to provide both upwards and downwards compatibility.

Instead of:

    HEADER: int serial_version;
    CONTENT:
    	struct{
    	.....
    	};

it does:

    HEADER: int logic_version;
    CONTENT:
     begin
    	FIELD "Name" =...
    	FIELD "Gender" =...
    	....
     end

and saves not only the content of the structure, but also the information about what fields are in it and where.

Armed with this information You don’t really have to use any transformation. You just read fields from the stream and apply them to the fields in Your current structure in memory.

As You can see, indirect versioning carries huge potential: Your program can read both older and newer streams without any effort on Your side.

Great, isn’t it?

Blah, isn’t it good old XML or JSON? Sure it is. I am just curious if You have ever thought about it that way.
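A minimal sketch of the reading side of indirect versioning, assuming the named fields have already been parsed into a map (all names here are mine): unknown fields are skipped, missing ones keep their defaults:

```java
import java.util.Map;

// Hypothetical "today" structure; the gender field did not exist "yesterday".
class PersonData {
    String first_name = "";       // defaults used when a field is missing
    boolean gender    = false;
    int age           = -1;
}

class IndirectReader {
    /** Applies named fields read from a stream onto the in-memory structure. */
    static PersonData read(Map<String, Object> fields_in_stream) {
        PersonData p = new PersonData();
        for (Map.Entry<String, Object> e : fields_in_stream.entrySet()) {
            switch (e.getKey()) {
                case "first_name": p.first_name = (String)  e.getValue(); break;
                case "gender":     p.gender     = (Boolean) e.getValue(); break;
                case "age":        p.age        = (Integer) e.getValue(); break;
                default: break;   // a field from a newer version: just ignore it
            }
        }
        return p;
    }
}
```

A second, post-processing stage (e.g. guessing gender from first_name, as discussed below) would run after this loop, once all fields are in place.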

logic_version

Notice I have still left some “header”, but instead of serial_version I renamed it to logic_version.

The idea behind it is simple:

In some cases You will have to introduce a “breaking change” in Your data structure. Such a change goes past adding and removing fields; it changes the logic so much that mapping fields from the stream to fields in memory won’t work anymore. To indicate it You just “bump up” the logic_version.

Of course with that we move back from “indirect” to “direct” versioning.

Note: Java serialization does have the serialVersionUID field just for that. It lacks however any possibility to deal with such a change, except complaining that nothing can be done.

Missing fields

Of course with the “indirect” approach You will have to deal with the case where You expect some fields (like the said gender) but they are not there.

To deal with it You need two operations:

  • to be able to initialize missing fields with reasonable values;
  • to be able to post-process the entire structure, once it is loaded
    and initialized with defaults, and do some guessing and cleanup;

This has to be done in two separate stages (or at least the last stage is necessary) because the “reasonable” initialization is not always possible without knowing the values of the other fields.

For example, You may attempt to guess the gender by looking into a dictionary of names, but to do that You need to have the “name” field loaded first.

Additional fields

And vice versa.

A stream may contain more fields than You need. If You did not introduce a “logic change”, then this usually means that either Your program no longer needs some information, or a newer version added some information Your program does not understand.

In both cases ignoring it will be fine.

Pointers?

Hurray! We solved it! If we have already given names to the fields in a stream, why not add a marker to say which of them are “pointers”?

Summary

After reading this chapter You should be aware that version handling is not something which can be left for later, because it has a significant impact on code maintenance costs and on Your licensing policy.

You should also notice that if we switch from the “memory dump” to the “named fields” concept, then we can not only solve the upwards and downwards compatibility issues, but also gain a nice mechanism which we can use to identify “pointers”.

Plus, obviously, You should see why I was so eager to have a structured abstract file format and why people are so much in love with zipped XML file formats nowadays.

All right, so we know about the problem with pointers and something about how best to deal with versioning. Now it is time to move on to some JAVA related stuff.

Serialization: introduction

In this, that, that and finally in there You could read about abstract file formats.

Note: please notice the reference implementation I proposed there.

Especially in the first post of the series You could read about Java serialization.

In this series of posts, which explains the background behind my next project, I will try to explain the basics, concepts and pitfalls which one may encounter when trying to build a data serialization engine.

Note: Most of the stuff You will read here will become a part of said project’s documentation and will be included, in a more expanded version, in the final project on github.

What is serialization?

“Serialization” is a method of transforming complex, object based data structures, existing alive in memory, into “dead” files or data streams which can be moved from machine to machine or saved for later use.

In other words – a way to save objects on disk or pass them through the net.

What is de-serialization?

The exactly reverse process: having some “dead” data on disk or received from a network, we “de-serialize” them by creating living objects in memory matching the previously stored content.

“Memory dump” serialization

The most idiotic, but often good, form of serialization is a “memory dump”. Just take a native reference to the memory block containing the data structure and dump it to disk. Like in the pseudo-C code below:

         struct{ int a,b,c }x;
         writeFile(..., &x, sizeof(x))
    

This type of serialization has some serious flaws:

  • it saves data including all “padding” bytes injected by a compiler;
  • different compilers or even different compilations may result in a different
    structure layout;
  • pointers? Can pointers be stored at all? What do You think about it?
  • absolutely zero robustness against version changes;

Even though this is an idiotic method, it was used to manually implement a “swap file”
in applications which needed far more memory than could be provided by
the operating system.

Why? Because it is extremely fast.

Pointers or references

There are usually no conceptual problems with saving elementary data like bits,
bytes, numbers or texts into any kind of “file format”.

Pointers and references are something else.

Pointer concept

A “pointer” or “reference” is, technically speaking, an integer number which is interpreted by the CPU as an address in memory from which it should read something, execute something, or to which it should write something.

There are very, very few cases in modern operating systems on modern machines when the “address” of a certain piece of data in memory (i.e. the variable which holds this text in Your web browser) will be preserved from one run of a program to another. In almost every case it will be different, even though in Your program it is named the same, is bit-to-bit the same, etc.

If You saved such an “address” to disk, closed the program, then started it again and loaded the address from the stored file, You would have a 99.999% chance that the loaded address won’t point to where it should.

When can a pointer be serialized?

A pointer must always be serialized in a “smart way”.

First, the serialization mechanism must know that it is serializing a “pointer”. This means that it can’t just dump a block of memory to disk and then load it later. It must know where in this block the pointers are.

Second, a saved pointer must point only to a part of memory which is also being serialized. Only then can You somehow change the address X into “this is an offset X1 in the N-th serialized block of memory”.

A pointer pointing to something which is not serialized can’t be serialized.

How can a pointer be de-serialized?

A serialized pointer is basically a way of saying to which part of the serialized data it points.

The easiest way to imagine it is:

                       This is a memory
                    block to be serialized

         *******************b*******************c*************
         ↑                  ↑
         Aptr               ↑
                           bptr
    

The Aptr is the address of the memory block as the CPU sees it. For example 0x048F_F400.

The b is some variable in that block. The address of this variable, as the CPU sees it, is bptr.

And the c is a variable inside that serialized block, in which we would like to save the bptr.

Let us say bptr=0x048F_F4A0.

If we just dumped the block to disk, the c would contain 0x048F_F4A0.

Then imagine we are loading that block from disk to memory five days later.

Will it work?

Yes. Providing we load it into exactly the same memory location. Our new Aptr must be 0x048F_F400.

If however You have done something like:

         Aptr = malloc some data // Aptr ==  0x048F_F400
         writefile(Aptr...)
         ....
         kill program, wait five days
         ....
         Aptr = malloc some data  // Aptr == 0x0500_0000
         readFile(Aptr...)
    

Then c=0x048F_F4A0 won’t point to anything.

But if You had done:

         Aptr = malloc some data // Aptr ==  0x048F_F400
         Aptr->c = Aptr->c - Aptr //    c ==  0x0000 00A0
         writefile(Aptr...) ; ← including c in this written block
         ....
         kill program, wait five days
         ....
         Aptr = malloc some data   // Aptr == 0x0500_0000
         readFile(Aptr...)         //   c  == 0x0000 00A0
         Aptr->c = Aptr->c + Aptr; //   c  == 0x0500 00A0
    

then c is correctly deserialized.

The key concept You should remember and understand is:

To be able to serialize a pointer You must know when You are serializing a pointer.
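The relativization trick from the listing above can be replayed with plain numbers; a tiny JAVA sketch using the same made-up addresses as the text:

```java
public class PointerRelativize {
    public static void main(String[] args) {
        long Aptr = 0x048F_F400L;       // base of the block at write time
        long bptr = 0x048F_F4A0L;       // absolute address we want to store in c
        long c    = bptr - Aptr;        // relativize before writing: offset 0xA0

        // ...five days later the block is loaded at a different base:
        long newAptr  = 0x0500_0000L;
        long restored = c + newAptr;    // absolute again: 0x0500_00A0

        System.out.println(Long.toHexString(restored));
    }
}
```

The offset survives the move precisely because it is relative to the block, not to the machine's address space.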

Summary

After reading this blog post You should understand what the basic idea behind serialization is and that a raw “memory dump” serialization is not the best concept for any long term data storage. You should also be aware that the most tricky part of it are pointers. And You should notice that the most important thing one should deal with during serialization is to figure out some way of saying “hey, this is a pointer which I am serializing now!”.

Now You are ready to take the next step.

Abstract file format: what to use for a “signal”?

In that post I wrote:
(…)
The good abstract API needs to be able to indicate data boundaries and move from boundary to boundary with:

 void writeSignal(signal type)
 signal type readNextSignal()

(…)

I let myself use the enigmatic signal type.

Now it is time to dig into it.

What is “signal” used for?

Again, just to remind You: for telling what a certain bunch of data is used for. To give it a name.

How many different “signals” do we need?

At least two… or to be specific – two kinds of signals.

If a signal must give a name to a certain block of data, then it must somehow indicate where this block starts and where it ends.

In fact it must be very much alike the good old C typedef:

typedef struct{
   int a;
   char c;
} Tmy_named_struct;

Since, to be able to efficiently process a stream of data, we need to know what the struct means before we start reading it, the structure in a data stream should rather look like even older Pascal:

 begin Tmy_named_struct
   int: a;
   char: c;
 end

The begin and the name come first, the end comes last.

The “end” and the “begin”

This means we need an API which will be closer to:

 void writeBegin(signal name)
 void writeEnd()
 ...
 signal type readNextSignal()
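The begin/end pairing above can be sketched as a toy writer which renders signals as the Pascal-like text shown earlier (the class is an illustration of the shape of the API, not the real thing):

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy signal writer: renders begin/end signals as Pascal-like text. */
final class ToySignalWriter {
    private final StringBuilder out = new StringBuilder();
    private final Deque<String> open = new ArrayDeque<>();

    void writeBegin(String name) {
        out.append("begin ").append(name).append('\n');
        open.push(name);          // remember what we opened
    }
    void writeEnd() {
        open.pop();               // throws if there is no matching begin
        out.append("end\n");
    }
    String render() { return out.toString(); }
}
```

The stack is what lets the writer reject an end without a matching begin; a real back-end (binary, XML, JSON) would do the same bookkeeping.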

Now You can see I used signal name and signal type. We need to define them more closely.

The signal name

Historically speaking, when I started this project, I decided to do:

 void writeBegin(int signal)
 void writeEnd()
 ...
 /*....
  @return positive for a "begin" signal, -1 for an "end" signal. */
 int readNextSignal()

It was a very bad idea.

After the first few uses it started to be a pain to manage what number means what, and how to prevent numbers from different libraries from clashing. This was the same problem, although in mini-scale, as the global function name clash in C.

So I thought to myself: How did they solve it in C?

With a name-space. You can assign a function name to a name-space and then use a fully qualified name to avoid a name clash. And if the names still clash, You can put a name-space into a name-space and form something like:

  space_a::space_b::function

Please excuse my wrong syntax. I have not used C/C++ for quite a long time and I don’t remember exactly how it looks nowadays.

So I could have used a sequence of int numbers….

Dumb I was, wasn’t I?

It is a hundred times easier and more efficient to do:

 
 void setNameLengthLimit(int length);
 void writeBegin(String name)throws ENameTooLong;
 void writeEnd();
 ...
 /*....
  @return either a name for "begin" signal or null for "end" signal
*/
 String readNextSignal()throws ENameTooLong;

We use String for names. Strings are variable in length, easy for humans to understand (this is important if the back-end is something like JSON, which ought to be human-readable) and Java has very efficient support for them.

Note: I let myself introduce setNameLengthLimit(...) and throws ENameTooLong. Remember what I said about the OutOfMemoryError attack? The ENameTooLong is there to let You keep the flexibility and still put safety brakes on Your stream.

But Strings are slooooow

Sure.

With an int as a signal name and a careful selection of constants like below,
code which looks like this:

  int name=...
  switch(name)
  {
    case 0: ... break;
    case 1: ... break;
  }

can be compiled on almost any architecture to a “computed goto”:

  mov name → reg0
  cmp reg0 with 1
  jmp_if_greater _after_switch
  shl reg0, times           ; according to DATA width
  mov jump_table[reg0] → PC ;read target from table and jump there
  jump_table:
     DATA case0
     DATA case1
    ...
case0:
   ...
   jmp _after_switch
case1:
   ...
   jmp _after_switch     

where the entire comparison takes just about 6 machine instructions regardless of how huge the switch-case block is.

Prior to JDK 7 using Strings was a pain in the behind, because the only syntax You could use was:

  if ("name0".equals(name))
  {
  }else
  if ("name1".equals(name))
  ....

which in the worst case scenario ended up in a large number of .equals calls.

At a certain moment, and I admit I missed it, the JAVA specs pinned down the use of the exact method of computing String.hashCode(). The algorithm itself has been documented for a long time, but before the String switch-case appeared there was no special reason for a compiler to rely on it.

Since JDK 7 such a use did appear.

JAVA now absolutely requires that:

  • regardless of JDK/JRE/JVM String.hashCode() is always computed using the same algorithm. This way a compiler may compute hash codes for known const Strings at compile time and be sure that in any environment:
       StringBuilder sb = new StringBuilder();
       sb.append('a');
       sb.append('b');
       sb.append('c');
    
       assert(  96354 == sb.toString().hashCode())
       assert(  96354 == "abc".hashCode())
    

    assertions will not fail.
    Notice sb.hashCode() itself would not do: StringBuilder does not override Object.hashCode(), so You must convert it to a String first. The 96354 is the actual hash code of "abc" under the specified algorithm.

  • second, they effectively require that String has something like:
        class String
        {
          boolean hash_valid;
          int hash_cache;
    
          public int hashCode()
          {
              if (hash_valid) return hash_cache;
              hash_cache = computeCache();
              hash_valid = true;
              return hash_cache;
          };
        }

    Notice this is a very simplified code which may fail on some machines in a multi-threaded, multi-core environment due to memory bus re-ordering. Some precautions have to be made to ensure that no other thread sees hash_valid==true before it can see hash_cache set to the computed value. The actual String implementation side-steps the problem by caching the hash in a single int field where zero means “not computed yet” – with one field there is nothing to re-order. A volatile would also do the job but it would be unnecessarily expensive.

    Notice, the race condition on setting up hash_cache is not a problem. Every call to computeCache() will always give the same result, as JAVA strings are immutable. At worst we will compute it twice but nothing will break.

  • and third, the compiled switch relies on String.equals() being cheap, and it is, roughly:
       class String
       {
         public boolean equals(Object o)
         {
            ... null, this, instanceof and etc checked.
            if (((String)o).length()!=this.length()) return false;
            return compareCharByChar(this,(String)o);
         }
       } 
    

    which avoids calling compareCharByChar() unless the lengths match, and the compareCharByChar itself terminates immediately at the first not matching character. Note that equals() does not consult hashCode(): computing a not yet cached hash would require scanning both strings in full, which costs more than the comparison itself.

In much, much simpler words it means that since JDK 7 it is important that String.hashCode() is environment invariant, cached, and usable for a quick rejection of not matching strings.
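Since the algorithm is fixed by the specification (h = 31*h + s[i] over the characters), we can verify it ourselves. The helper below re-implements the documented formula and compares it against the runtime value; 96354 is the actual hash code of "abc":

```java
// The specified String.hashCode() algorithm:
//   s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
// Because it is fixed by the spec, a compiler can pre-compute the hash
// of any constant string at compile time and trust it on every JVM.
class HashDemo {
    static int specHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = 31 * h + s.charAt(i);   // exactly the documented formula
        return h;
    }
}
```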

Knowing that they let us use:

  String name=...
   switch(name)
   {
     case "name0":...break;
     case "name1":...break;
   };

which is not implemented as:

  String name=...
   if ("name0".equals(name))
   {
   }else if ("name1".equals(name))
   {
   };

but as something a hell of a lot more complex and a hell of a lot faster:

 String name=..
 switch(name.hashCode())
 {
    case 1243:
         if ("name0".equals(name))
         {
         } else if (... other const names with 1243 hash code.
         break;
   case 3345:
        ...
  }

This can be, for large switch blocks, about 10 times faster than the pure if(...)else if(...) solution.

Nevertheless it is still at least 50 times slower than using int directly, because it can’t use a jump-table and must use a look-up table which has a comparison cost linear with the number of cases. Unless a very large table is used, in which case the compiler can profit from a binary search and reduce the cost to log2(N).

Nevertheless it won’t ever be even close to 6 machine instructions.

Then why String?

Because with int I really could not find a good, easy to maintain method of combining constants from different libraries written by different teams into a non-clashing set. I could figure out how to avoid clashes with static initialization blocks, but then such constants are not true compile-time constants and can’t be used in a switch-case block.

Strings are easy to maintain and clashes are rare due to the usually long, human friendly names.

In other words – use int if speed is everything and the difference between 6 and 600 machine cycles means everything to You, while continuous patching of libraries in a struggle to remove code clashes is not a problem.

Use String if development costs and code maintenance are Your limit.

And even if You are, like me, a “speed freak”, please remember that we are talking about file or I/O formats. Will 6 versus 600 cycles matter when put aside the time of loading data from a hard-disk or a network connection?

I don’t think so.

But Strings are so laaaarge…

Yes they are.

And human readable.

Using non human readable numeric names for an XML or JSON back-end makes using XML or JSON pointless. The:

 <1>3</1>

is equally readable to human as a binary block.

If however size is Your concern, and in case of a binary back-end it will be, You can always create a “name registry” or a “name cache” and stuff into Your stream something like:

  with String names  | with name registry
  -------------------+-----------------------------
  begin “joan”       | assign “joan” to 0; begin 0
  end                | end
  begin “joan”       | begin 0
  end                | end

and get a nice, tight stream with short numeric names and at the same time nice, maintenance friendly String names.

Note: this table is a bit of a simplification. In reality the capacity of the name cache will be limited and the range of numeric names will also be restricted. Most probably You will need assign, begin-with-numeric-name and begin-with-string-name commands… but this is another story.
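A minimal sketch of such a name registry, assuming a purely textual stand-in for the real binary back-end (the command names, the output format and the capacity handling are all made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// A toy "name registry": the first time a name is written we emit an
// "assign" record plus a numeric id; later uses emit only the short id.
// When the registry is full we fall back to plain String names.
class NameRegistry {
    private final Map<String,Integer> ids = new HashMap<>();
    private final StringBuilder stream = new StringBuilder(); // stand-in stream
    private final int capacity;

    NameRegistry(int capacity) { this.capacity = capacity; }

    void begin(String name) {
        Integer id = ids.get(name);
        if (id == null && ids.size() < capacity) {
            id = ids.size();                       // next free numeric name
            ids.put(name, id);
            stream.append("assign ").append(name).append("=").append(id).append("; ");
            stream.append("begin ").append(id).append("; ");
        } else if (id != null) {
            stream.append("begin ").append(id).append("; ");
        } else {                                   // registry full: keep the String
            stream.append("begin \"").append(name).append("\"; ");
        }
    }

    void end() { stream.append("end; "); }

    String dump() { return stream.toString(); }
}
```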

Summary

After reading this blog entry You should now know that the “signal” can in fact be a “begin” accompanied with a “name” or just a plain nameless “end”. You should also know that there is no reason to fret about using String as the “begin” name, and how to deal with the performance and size issues related to Strings used in that context.

What next? See there.

Java anti-pattern: non-final getters

In this post I would like to present You the first of a series of Java anti-patterns. And it will be:

public class MyClass
{
  private double speed;

  public void setSpeed(double speed){ this.speed = speed; };
  public double getSpeed(){ return this.speed; };
  public double computeTraveledDistance(double time){ return this.speed * time; };
}

I let myself color the problematic pieces in red.

What is wrong with that pattern?

We do live in an Object Oriented world, which means inheritance, overriding and virtual methods.
This means that anyone can legally do:

public class MyUniDirectionalMover extends MyClass
{
@Override public double getSpeed(){ return Math.abs(super.getSpeed()); };
};

This is legal code which makes sure that, regardless of what speed is set, the returned value will always be positive and we should always travel in one direction…

Is that true?

No, because:

....
public double computeTraveledDistance(double time){  return this.speed * time; };

is not referring to getSpeed() but to the this.speed field directly. The user did change the behavior, the class asked about speed will never return a negative number, yet the computed travel distance can still be negative.
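The anti-pattern can be demonstrated in a few lines; this is the same pair of classes as above, made runnable:

```java
// The non-final getter anti-pattern, runnable: the override changes
// getSpeed(), but computeTraveledDistance() bypasses it by reading
// the private field directly.
class MyClass {
    private double speed;
    public void setSpeed(double speed) { this.speed = speed; }
    public double getSpeed() { return this.speed; }
    // Bug under discussion: uses the field, not the (virtual) getter.
    public double computeTraveledDistance(double time) { return this.speed * time; }
}

class MyUniDirectionalMover extends MyClass {
    @Override public double getSpeed() { return Math.abs(super.getSpeed()); }
}
```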

How to cure it?

There are two possible ways: either block the user from overriding what has no effect, or prevent Yourself from using the fields directly.

Final getter

The easiest way is to prevent the user from overriding code in the expectation of an effect which can’t take place:

public class MyClass
{
  private double speed;

  public void setSpeed(double speed){ this.speed = speed; };
  public final double getSpeed(){ return this.speed; };
  public double computeTraveledDistance(double time){ return this.speed * time; };
}

Hidden fields

If this is undesired and side effects are expected, one must prevent oneself from touching the fields directly:

class MyClassBase0
{
  private double speed;

  public void setSpeed(double speed){ this.speed = speed; };
  public double getSpeed(){ return this.speed; };
}
... 
class MyClass extends MyClassBase0
{
  public double computeTraveledDistance(double time){ return this.getSpeed() * time; };
}

Where can I find such anti-patterns?

For example in java.awt.CardLayout, which refers directly to the component.width and component.height
package private fields, while java.awt.Component#getWidth() and getHeight() are non-final.

Summary

What to say, what to say…

Just keep in mind that You can always use Your compiler to prevent You from doing stupid things. Declaring the getter final clearly states to anyone that the only way to limit speed to positive values is to override setSpeed:

public class MyUniDirectionalMover extends MyClass
{
  @Override public void setSpeed(double speed){ super.setSpeed(Math.abs(speed)); };
};

“Content driven” versus “code driven” file formats.

In a previous blog entry I did promise to tell You something about two different approaches to data parsing: “content driven” versus “code driven”.

Or “visitor” versus “iterator”.

Content driven

In this approach a parser is defined as a machine which exposes to an external world an API looking like:

public interface IParser
{
   public void parse(InputStream data, Handler content_handler)throws IOException
};

where Handler is defined as:

public interface Handler
{
  public void onBegin(String signal_name);
  public void onEnd();
  public void onInteger(int x);
  public void on....
 and etc, and etc...
}

Conceptually the parser is responsible for recognizing which bytes in the stream mean what, and for invoking an appropriate method of the handler at an appropriate moment.

Good examples are org.xml.sax.XMLReader and org.xml.sax.ContentHandler from the standard JDK. Plenty of You most probably used them.

Note: This is very alike “visitor” coding pattern in non-I/O related data processing. Just for Your information.

Benefits

At first glance it doesn’t look as if we can profit much from that, right? But the more complicated the file format becomes, the more benefits we have. Imagine a full blown XML with a hell lot of DTD schema, xlink-ed sub files and a plenitude of attributes. Parsing it item by item would be complex, while with a handler we may just react on:

public interface org.xml.sax.ContentHandler...
{
  public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException
  {
   if ("Wheel".equals(qName))
   {
     ....

and easily extract a fragment we are looking for.

This is exactly the reason why XML parsing was defined that way. XML is hell tricky for manual processing!

Obviously we do not load the file just for fun. We usually like to have it in memory and do something with the contained data, right? We would like to have something very alike the Document Object Model, that is a data structure which reflects the file content. The “content handler” approach is ideal for that purpose because we just build elements in some handler methods and append them to the data structure in memory.

Easy.

And at last, but one of the most important concepts: we can arm a parser with “syntax checking”. Like we arm a SAX parser by supplying an XML which carries inside its body the DTD document definition schema. The parser will do all the checking for us (well, almost all) and we can be safe, right?

Well… not right, but I will explain it later.

Why I call it “content driven”?

Because it is not You who tells what code is invoked and when. You just tell the parser what can be invoked, but when and in what sequence Your methods are called is decided by the person who prepared the data file.

Who, by the way, may wish to crack Your system.

Content driven vulnerabilities

XML

The content driven approach was a source of plenty of vulnerabilities in XML parsing. One of the most known was forging an XML with a cyclic, recursive DTD schema. The XML parser loads the DTD schema before it parses anything else from the XML. After that it creates a machine which is responsible for the validation process. If the DTD schema was recursive, the process of building this machine will consume all the memory and the system will barf.

Of course this gate for an attack was opened by some irresponsible idiot who thought that embedding the rules which say how a correct data file looks inside the data file itself is a good idea…

Note: Always supply Your org.xml.sax.XMLReader with an org.xml.sax.EntityResolver which will capture any reference to a DTD and will forcefully supply a known good definition from Your internal resources.
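A sketch of that defensive trick, using the standard SAX API. Here the trusted “known good definition” is simply an empty DTD, and the document with its rogue SYSTEM id is made up for the example:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Whatever DTD the document points at, we substitute our own,
// known-good (here: empty) definition, so nothing external is fetched.
class SafeDtdDemo {
    static String parsedRoot = null;

    static void parse(String xml) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            @Override public InputSource resolveEntity(String publicId, String systemId) {
                // Never fetch the external DTD; supply a trusted stand-in.
                return new InputSource(new StringReader(""));
            }
            @Override public void startElement(String uri, String local,
                                               String qName, Attributes atts) {
                if (parsedRoot == null) parsedRoot = qName; // remember root element
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                        .parse(new InputSource(new StringReader(xml)), handler);
    }
}
```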

If You defend Your XML parser with a DTD or alike schema, and You make sure that nobody can stuff a fake DTD in Your face, then in most cases the “content driven” approach will be fine.

When it won’t be fine?

When Your document syntax allows open, unbounded recursion in definitions. Or when the DTD does not put any constraints (which it can’t do) on the length of an attribute. Or in some other pitfalls which I did not fall into because I don’t use XML on a daily basis.

There is however another, even more hellish piece of API which can be used to crack Your machine…. and this is…

Java serialization

Yep.

Or precisely speaking: Java de-serialization.

A serialized stream can, in fact, create practically any object it knows exists in a target system, with practically any content of its private fields. Usually creating objects does not execute code, but in Java it does. Sometimes a constructor will be called, sometimes methods responsible for setting up the object after de-serialization will be. All parametrized with fields You might have crafted to Your liking.

Possible attack scenarios range from a simple OutOfMemoryError to the execution of some peculiar methods with very hard to predict side effects.

All in response to:

   Object x = in.readObject()

Basically this is why the modern JDK states that serialization is a low level insecure mechanism which should be used only to exchange data between known good sources.

Preventing troubles

Since in a “content driven” approach it is the data that drives Your program, You must defend against incorrect data.

You can’t just code it right – instead You need to accept and parse bad data and then reject it. For example You need to accept an opening XML tag with huge attributes, and only once Your content handler is called You may say: “recursion too deep” or “invalid attribute“.

Alike, in Java de-serialization You must either install:

public final void setObjectInputFilter(ObjectInputFilter filter)

(since JDK 9)
or override

protected ObjectStreamClass readClassDescriptor()

in earlier versions to be able to restrict what kind of object can be created.
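A sketch of the JDK 9+ variant: a filter pattern which whitelists only the classes we expect and rejects everything else. The round-trip helper is just scaffolding for the example:

```java
import java.io.*;

// Serialize a value, then deserialize it through a stream guarded by
// the given ObjectInputFilter (JDK 9+). Rejected classes cause an
// InvalidClassException before the object is built.
class FilterDemo {
    static Object roundTrip(Object value, ObjectInputFilter filter) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(value);
        }
        ObjectInputStream ois =
            new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        ois.setObjectInputFilter(filter);       // the defensive hook
        return ois.readObject();
    }
}
```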

Notice, even then some code will be executed regardless of whether You reject it or not, because the sole existence of a Class<?> object representing a loaded class means that the static initializer for that class was executed.

The “content driven” approach is always using a load & reject security model.

I hope I don’t have to mention how insanely bug prone it is, do I?

Code driven

In this approach we do things exactly the opposite way: we are not asking the parser to parse and react on what is there. Instead we know what we expect and we ask the parser to provide it. If it is not there, we fail before we load incorrect data.

For example, code driven XML parsing would be very alike using:

public interface javax.xml.stream.XMLEventReader
{
  boolean hasNext()
  XMLEvent nextEvent()
....
  String getElementText()
}

As You can see, You may check what the next element in the XML stream is before reading it.

Note: Unfortunately I had to mark one method of this class in red to indicate that it is also not an attack-proof concept. The String in XML is unbounded and a crafted rogue XML may carry a huge string inside a body to trigger an OutOfMemoryError when You attempt to call that method.

Very alike, Java de-serialization might be tightened a bit by providing an API:

 Object readObject(Class of_class ...)

instead of just an unbound:

 Object readObject()

Sadly the de-serialization API in general is maddeningly unsafe regardless of the approach. Which doesn’t mean You should not use it. It just means You need to pass it through trusted channels to be sure the other side is not trying to fool You.

Benefits

Using the “code driven” approach we can be as sure as possible that we do not accept incorrect input, instead of, as in the “content driven” approach, rejecting it later.

Simply, what is not clearly specified in code as expected won’t be processed. It is like wearing a muffler versus curing the flu.

On the other hand, one must write that code by hand, and usually the order of data fields will be forced to be fixed, or it would be too hard to code. One must also deal manually with missing fields, additional fields and all the other issues related to format versioning.

This is why I was so picky about being expandable and supporting dumb skipping.

Code driven vulnerabilities

Security? No inherent vulnerabilities. At least if the API is well designed and all operations are bounded.

Usability?

Sure, a lot of trouble. Code driven data processing is very keyboard hungry.

But…

“Code driven” can be used to implement “content driven”

Consider for example a plain “get what we expect” code driven API.

It might look like:

public interface IReader
{
  public String readBegin(int max_signal_name)throws MissingElementException
  public void readEnd()throws MissingElementException
  public int readInt()throws MissingElementException
....
};

This is the pure essence of the “code driven” approach. You have to know what You expect and You call the appropriate method. You call the wrong one, it barfs with a MissingElementException.

Of course it means You must know the file format down to the exact field when You start coding the parser.

If we however defined this API to allow “peeking what is next”:

public interface IReader
{
  enum TElement{....}
  public TElement peek();
   ....
};

there would be absolutely no problem in writing something like:

public void parse(Handler h)
{
   for(;;)
   {
     switch(in.peek())
     {
        case BEGIN: h.onBegin(in.readBegin()); break;
        case EOF: return;
        case .....
     }
   }
}

and we have just transformed our “code driven” parser into a “content driven” one. Under the condition that we can “peek what is next”.

The opposite transformation is impossible.
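To show the transformation end to end, here is a toy “code driven” reader over an in-memory token list, with the crucial peek(), plus the content driven parse() built on top of it. All the token shapes and names are invented for the example:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Toy reader: a "joan"-like token means begin "joan", an empty token
// means end. peek() classifies without consuming; parse() builds the
// content-driven view on top of the code-driven primitives.
class ToyReader {
    enum TElement { BEGIN, END, EOF }

    interface Handler { void onBegin(String name); void onEnd(); }

    private final Deque<String> tokens;
    ToyReader(List<String> t) { tokens = new ArrayDeque<>(t); }

    TElement peek() {
        if (tokens.isEmpty()) return TElement.EOF;
        return tokens.peek().isEmpty() ? TElement.END : TElement.BEGIN;
    }
    String readBegin() { return tokens.pop(); }
    void readEnd() { tokens.pop(); }

    // The transformation from the article: peek() drives the callbacks.
    void parse(Handler h) {
        for (;;) {
            switch (peek()) {
                case BEGIN: h.onBegin(readBegin()); break;
                case END:   readEnd(); h.onEnd();   break;
                case EOF:   return;
            }
        }
    }
}
```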

“Iterator” versus “visitor”?

Yes, I did mention it at the beginning.

Those two concepts are very alike “code” and “content” driven, and for Your information both
are present since JDK 8 in the Java Iterator contract.

First let us look at the below pair of methods:

public interface java.util.Iterator <T>
{
  boolean hasNext()
  T next()
   ....
};

They formulate a “code driven” contract which allows us to “peek if something is there” and then get it. If we don’t like it, we do not have to get it.

Then look at the method added in JDK 8, together with the introduction of lambdas and “functional streams“:

void forEachRemaining(Consumer<? super E> action)

This turns it into a “visitor” concept where in a:

public interface Consumer...
{
   void accept(T t)
};

the accept(t) method is invoked for every available datum regardless of whether we expect more of it or not.
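Both contracts can be seen side by side on an ordinary List iterator; the first helper stops whenever it likes (code driven), the second gets everything pushed at it (visitor):

```java
import java.util.Iterator;
import java.util.List;

// The same Iterator, used in the two styles from the article.
class IterDemo {
    // Code driven: hasNext()/next() – we decide when to stop.
    static int sumFirstTwo(List<Integer> data) {
        Iterator<Integer> it = data.iterator();
        int sum = 0, taken = 0;
        while (it.hasNext() && taken < 2) { sum += it.next(); taken++; }
        return sum;
    }

    // Visitor: forEachRemaining() – every element is pushed at us.
    static int sumAll(List<Integer> data) {
        int[] sum = {0};
        data.iterator().forEachRemaining(x -> sum[0] += x);
        return sum[0];
    }
}
```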

The reader may easily guess that if one loves the “functional streams” concept, which I don’t, then the “visitor” pattern has a great potential.

Note: There is one case where visitors beat iterators. This is thread safety of complex structures. Thread safe iteration requires the user to ensure it, while visiting puts this job on the shoulders of the person who wrote the data structure.

Summary

After reading this blog entry You should notice that “content driven” parsing is very simple to use, but at the price of being inherently unsafe.

On the contrary, “code driven” is usually an order of magnitude safer, but also an order of magnitude more stiff and harder to use.

If not for the fact that code driven parsing with a “peek what is next” API can be used to implement a “content driven” parser, the choice would be a matter of preference. Since this is how it is, my proposal of an abstract file format must be, of course, designed around the code driven approach.

Abstract file format API, basic primitives

All right, in this and that blog entry You might have read about file formats.

Then You might have noticed that I defined a certain absolute minimum for a file format, which I called a “signal format”, and proposed some API for it:

public interface ISignalWriter
{
   void writeSignal()...
   OutputStream content()...
}

and

public interface ISignalReader
{
  InputStream next();
};

You may also remember, that I have said this is a bad API.

Today I would like to explain why it is bad.

OutputStream/InputStream are bad guys

Note: For those who are not familiar with java.io I must explain a bit. The InputStream and OutputStream are classes which provide a sequential, byte oriented API for binary data streams. You can just write some bytes and read them back. Nothing fancy.

Now first things first: we are talking about “abstract” file formats. Abstract, in terms of an API which allows us to write any data without concerning ourselves with how they are sent to the world. Binary? Fine, no problem. XML? Why not. Json? Sure, You are welcome. And etc, and etc.

The InputStream and OutputStream are binary. They know only bytes and bits, and we have to play with them to encode our data. We can do “binary”, but the XML won’t just happen without a lot of care from our side. And this is what I would like to avoid in my abstract file format API.

The API I proposed above takes care of naming data structures and telling us where they start and where they end. It also allows us to move around, basically skipping content we do not care about. It does not however tell us how we store the data.

All right, but what the data are?

What are data?

Honestly? Anything. But to be more precise: anything You can express in Your programming language.

The primitive types.

In Java our data will be then:

boolean, byte, char, short, int, long, float, double

These are the basic building blocks. Absolutely everything that can be expressed in Java can be told using those data types.

Obviously other programming languages will have a different set of primitive types. The good thing about Java is that those types are well defined. There is no ambiguity like in C/C++: a byte is always a two’s complement (U2) encoded signed number with 8 bits. This is why I love Java.

Primitive data versus Input/Output streams

Obviously I am not a genius and there were smart people before me. The Java guys long time ago thought: “hey, why play with bits and bytes in Output/Input stream? Can’t we just play with primitive types?”

And they did introduce DataInput and DataOutput interfaces.

The idea was good… except it was totally fucked up. This was still the era when we struggled with telling apart the “contract” from the “implementation” (interfaces and pure virtual classes were something new then) and they defined those interfaces like this:

int readInt() throws IOException
Reads four input bytes and returns an int value. 
Let a-d be the first through fourth bytes read. The value returned is:
 (((a & 0xff) << 24) | ((b & 0xff) << 16) |
  ((c & 0xff) <<  8) | (d & 0xff)) 

I let myself highlight in red what was done wrong. They not only defined that this method reads a 32 bit signed integer from a stream, but they also specified how it should be achieved. They messed up the contract with the implementation.

But if You would just ignore it and leave only the following fragment:

int readInt() throws IOException
Reads and returns an int value. 

then it is a good abstract API for reading primitive data from a stream. I like it.

What else is wrong in DataInput?

Since I am already pointing out what was made wrong, let me continue and point out other weak and possibly dangerous points in the DataInput API.

The next candidate to yell at is:

String readUTF() throws IOException

which is defined in a triply wrong way.

  1. It specifies how the string is stored in a binary form. This mixes up an implementation with a contract, but I have already told that.
  2. Then the chosen binary format limits the size of the encoded binary form of a string to up to 64kBytes. Notice it creates two problems:
    • first it prevents saving longer strings, and second;
    • You can’t predict if Your string will fit the 64k limit until You try it. This limit applies to the UTF-8 encoded form and the size of the UTF-8 form depends on the string content. This is silly and makes it unpredictable. Unpredictable code is bug prone and inherently unsafe. You may be fine saving a 65535 letters long US English text, but in Chinese, at three bytes per character, You will hit the limit at about 21845 characters.
  3. And at last, this API removes any control from the reader over how much data will actually be read and used to build the String. Sure, the 64k encoding limit puts a serious safety constraint, but You can’t say how long the returned string will be until You read it.
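The unpredictability is easy to demonstrate with the real DataOutputStream.writeUTF(): the same character count may fit or not, depending on how many bytes each character takes in the encoded form:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

// writeUTF() throws UTFDataFormatException when the encoded form of the
// string exceeds 65535 bytes - a limit on bytes, not on characters.
class Utf64kDemo {
    static boolean fits(char c, int count) throws IOException {
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        String s = String.valueOf(c).repeat(count);
        try {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            return false;          // encoded form exceeded 65535 bytes
        }
    }
}
```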

Why it was done this way?

Because this reflects the constraints put by the class file format on constants and data. The DataInput and DataOutput were initially meant to manipulate Java class files. And this is all.

All right, so how should the readUTF be declared then?

Maybe this would be ok:

String readUTF() throws IOException
Reads an UTF or alike encoded string of an arbitrary length and returns it.

This API looks good. We have plenty of alike APIs, right?

Except that it would be insanely unsafe.

And this is where we come to an another important factor.

File formats are gateways for an enemy attack

Yes.

The:

String readUTF() throws IOException
Reads an UTF or alike encoded string of an arbitrary length and returns it.

could be a good API if it were used internally inside a program to process data it keeps in memory or produces on demand. If however it interfaces with a potentially hostile external world, we have to take more care.

First, we should not limit our usability and should not constrain the length of a String in a dumb way like DataInput did. We may like to store MBytes or GBytes that way. Or we may just store sentences a few characters long. At the implementation side we will have to resort to something functionally like the good old “null terminated string”. Remember, unnatural limits in an API remove its usability.

But having no size limit means…. that we have no size limit.

Remember, the file comes from outside of the program and may be crafted by an attacker to intentionally harm us. For example an attacker may create a program which just pumps characters at infinity into an I/O stream, at no cost except the connection load.

What code implementing the API:

String readUTF() throws IOException

would do when confronted with such a crafted stream?

It will first allocate some small buffer for the text. Then it will load the text from an underlying file or I/O stream, decode it and append it to the buffer. If the buffer gets full before the end of the text is reached, it will re-allocate it and fill it again. And again, and again… till an OutOfMemoryError is thrown.

Even though Java is well defended against this type of error, the OutOfMemoryError is one of the nastiest to recover from because it can pollute the system all around. Imagine one of the threads touching the memory limit. Sure, it did wrong and is punished with an error. But what if a well behaving thread is also allocating some memory during the problematic operation performed by the wrong doing thread? It is just a matter of randomness which of them will be punished with the OutOfMemoryError.

We can’t open this gateway to hell!

The correct API would look like:

int readUTF(Appendable buffer, int up_to_chars) throws IOException
Reads up to specified number of characters and appends it to a given buffer. 
@return returns number of appended characters. 
        Returned value is -1 if nothing was read due to the end of text.
        If returned value is less than up_to_chars then the end of text was reached.
        If returned value is equal to up_to_chars then either end of text was reached,
        or some characters are left unread and can be fetched by subsequent calls of
        this method.

Sure, it doesn’t look very easy to use, but it allows us to keep control and restrain the out-of-resources attack by simply calling:

 int s = readUTF(my_buffer,64*1024)
  if (s==64*1024) throw new AttackException("String too long, somebody is trying to attack us!")

and be sure that even if an attacker forges a dangerously huge I/O stream, it won’t harm our application.
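A sketch of how such a bounded contract could be implemented. To keep it self-contained it reads from a plain Reader instead of decoding UTF-8 from a binary stream; the bounding logic, which is the part that matters here, stays the same:

```java
import java.io.IOException;
import java.io.Reader;

// A sketch of the bounded contract from the article. The real thing
// would decode UTF-8 from a binary stream and detect its own
// end-of-text marker; here end-of-stream plays that role.
class BoundedText {
    // Reads up to up_to_chars characters and appends them to buffer.
    // Returns -1 at end of text, else the number of appended characters.
    static int readUTF(Reader in, Appendable buffer, int up_to_chars) throws IOException {
        int appended = 0;
        while (appended < up_to_chars) {
            int c = in.read();
            if (c < 0) break;              // end of text reached
            buffer.append((char) c);
            appended++;
        }
        return (appended == 0) ? -1 : appended;
    }
}
```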

So how the API should look?

Again I drifted off shore to the lands of unknown. So let me swim back and return to the API.

The good abstract API needs:

  • to be able to indicate data boundaries and move from boundary to boundary with:
       void writeSignal(signal type)
       signal type readNextSignal()
    
  • to write and read elementary primitives without taking care of how exactly they are stored:
     
       void writeBoolean(boolean x)
       boolean readBoolean() throws IOException
        ...and so on for each primitive
    

    The DataInput and DataOutput are good starting points if You remove anything related to “bytes” and encoding from them.

Is that all?

Well… no, it is not. But before I move on to more details we will have to talk about content driven parsing and code driven parsing, because it will impact the API a lot and will again show us some serious safety issues which may be created by a carelessly built API.

Summary

After reading this blog entry You should be aware of how the abstract file format should deal with basic data like numbers and so on. You should also be able to point out potential safety issues with file format related APIs.

What are file formats anyway?

In a previous blog entry I roughly described how Java serialization is using its file format and why it is wrong. I also introduced the idea of a pluggable file format.

In this blog entry I would like to dig into the sole definition of a file format.

How is a file format defined?

By hand and on paper.

Really. No joking.

What are You looking for when, let’s say, You are tasked with processing data stored in an STL file?

For a specification. You are looking for a human readable specification.

Format specification

The document called “format specification” may be either very short and unclear, as in the case of binary STL, or it may be very long, formal… and also unclear, as in the Microsoft *.lnk case.

Regardless of how it is expressed, the ultimate result of reading it is to know which data in the file mean what.

This is all. It may say: “the first four bytes store an IEEE32 float for the X coordinate, the next four bytes alike for Y, next for Z and so on to the end of file”. This is, roughly speaking, the idea behind the STL format. Or it may say it in a much, much more complex way, as in the case of the Microsoft format mentioned above.

The common denominator for such formats is one: if You don’t have the specs, You can’t do anything with the file.

Exactly as with Java serialization format.

Intermediate file formats

One may say: “All right, but we have XML. It is self-describing and solves everything”.

Almost good. Almost…. No, not at all.

XML is self describing. This is true.

You may open a file and force Your machine to interpret it as UTF-8 text. Or as ASCII text. Or as UTF-16LE text. Or as UTF-32BE text. Or…

Sooner or later You will get a readable text. Then You may look at it with Your human eye and deduce what means what. Unless it is a *.dwf file, which is a “portable” format consisting of XML with one huge text-encoded binary block. What a nice joke they made!

Then, my dear XML fan, why not JSON? It is also self-describing. And as a plus, there is no text encoding lottery because it is hard-coded to UTF-8.

Bad formats, bad!

The primary problem with both, and in fact with most formats of this kind, is that the files are huge.

A 64-bit floating point number needs 8 bytes in a binary format, and about 20 or even more in JSON or XML. Not to mention that some numbers which have a finite base-10 form (like 0.1) have an infinite base-2 form, so printing and re-parsing them is not a trivially lossless round trip.
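Both points are easy to demonstrate. The little program below (my illustration, not from the original text) shows that a value which fits in 8 binary bytes needs 18 characters as decimal text, and that the innocent-looking literal 0.1 does not actually exist as an exact base-2 double:

```java
import java.math.BigDecimal;

public class NumberSizeDemo {
    public static void main(String[] args) {
        // 8 bytes in binary, 18 characters as shortest decimal text:
        double third = 1.0 / 3.0;
        System.out.println(Double.toString(third) + " -> "
                + Double.toString(third).length() + " chars");

        // The double closest to 0.1 is not 0.1; BigDecimal(double)
        // reveals the exact base-2 value actually stored:
        System.out.println(new BigDecimal(0.1));
    }
}
```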

Note:
At first I added here a lot of parsing and security problems, but later I decided this is not the right moment. So let’s stick with the huge size as the main problem.

Good formats, good!

The most important advantages of XML alike intermediate formats are:

  • they are self-describing;
  • they do allow “dumb skipping” of unknown content;
  • they are expandable;

Self-describing means…

For a format to be “self-describing” it is necessary that it somehow, in a standardized way, gives names to the elements it carries. Since the way of giving names is standard, You may take a file, parse it using a standard parser, and see what names appear and in what order. With this information You may easily guess what is stored where.

Both XML and JSON are self-describing.

Dumb skipping of unknown content is…

This functionality is tightly coupled with the previous one. A standard way of giving names means that there must be a standard way to find names. This way must be independent of the content carried inside or between named elements. If it depended on it, then You would not be able to find the names.

As an example, we may create a text format whose specification says:

File is divided into tokens separated by “,”. First token is the name of the first element. After the name there are some “fields”, and then there is the name of the next element.

The number and meaning of the fields is as follows:

  • for name “point” we have two fields (X,Y);
  • for name “circle” we have three fields (X,Y,R);

This format does not allow dumb skipping. You must know the mapping from names to token counts to find out which token is a name and which is a field.

If, for example, this format is modified to:

File is divided into tokens separated by “,”. First token in a line is the name of an element (…)

then this format would allow dumb skipping, because the name is always first in each line.

Dumb skipping is very important because it allows You to extract the data of interest from files without bothering about the full syntax of the file.
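A parser for the line-based variant above can be sketched like this (my own toy illustration of the point, not code from the text). It extracts only the elements it knows about and skips whole lines carrying unknown names, without ever having to know how many fields those unknown names carry:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class DumbSkippingParser {
    // Parses the line-based variant: each line is "name,field,field,...".
    // Returns the names of the elements it understood, in order.
    public static List<String> parseKnown(String file, Set<String> knownNames) {
        List<String> understood = new ArrayList<>();
        for (String line : file.split("\n")) {
            String[] tokens = line.split(",");
            String name = tokens[0].trim();
            if (knownNames.contains(name)) {
                understood.add(name); // a real parser would decode tokens[1..] here
            }
            // else: "dumb skipping" - the whole line is ignored without
            // knowing the field count of the unknown element
        }
        return understood;
    }
}
```

Try the same with the comma-only variant: without the name-to-field-count table You cannot even tell where one element ends and the next begins.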

And expandable is…

This is almost like “dumb skipping”, but not exactly alike. “Dumb skipping” allows You to ignore elements You do not understand. For example, if version 1.0 of the above simplified format knew only “point” and “circle” and version 2.0 added:
(…)

  • for name “rectangle” we have four fields;

then a parser understanding version 1.0 may parse a 2.0 file. It won’t be able to react correctly to a “rectangle”, but its presence won’t stop it from understanding the rest of the file. And what would it do with a “rectangle” anyway, if the application it is built into does not know rectangles?

If however version 1.1 added:
(…)

  • for name “circle” we have four fields (X,Y,R,A). First two being X and Y, next radius, and next the aspect ratio;

then our version 1.0 parser may read “circle”, read three fields and then expect a name. Which is not there. If the file format is expandable this parser should not be fooled by this, and the request “and now I expect the name” should be correctly fulfilled by skipping the aspect ratio added in version 1.1 of the file.

In other words, to be expandable the format must allow “dumb skipping” regardless of which token the cursor is at.

So why do I find XML bad?

Because in both cases, XML and JSON, the declaration of the API:

and You can start element with a name, then write content, naming possible sub-elements...

is bundled together, inseparably, with the implementation of the API:

 XML element starts with <name and ...

Smallest common denominator

Now let us ponder: what is the smallest common denominator of XML, JSON and “specification” file formats?

To know where the information about X starts and where it ends. “Specification” formats frequently say it through the position of the cursor in a binary file. XML and JSON use a kind of syntactic marker.

And this is it.

The smallest common denominator is the ability to say, when writing a file:

“Here is a boundary of the element”

This is all we need to be able to parse format element-by-element.

Elementary signal file format

This smallest common denominator API may be defined in Java like:

public interface ISignalWriter
{
   /** Writes a "syntactic marker" separating elements. */
   void writeSignal() throws IOException;
   /** @return a stream used to write the raw bytes of the current element. */
   OutputStream content() throws IOException;
}

The writeSignal() writes a kind of “syntactic marker” (let us ignore for now how it does it), and the stream returned by content() allows us to write raw bytes into such a format.

The reading counter-part may look like:

public interface ISignalReader
{
  /** @return a stream reading the content of the next element. */
  InputStream next() throws IOException;
}

where next() finds the next, nearest “syntactic marker” and returns an InputStream object which allows us to read the content up to the following “syntactic marker”.

A very important requirement is that next() must work regardless of how many bytes were read from the previously returned InputStream. That is, it must support both “dumb skipping” and “expandability”.
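To make the idea concrete, here is one naive way this pair of interfaces could be realized. This is purely my sketch, under the assumption of a length-prefixed layout (each element stored as an int length followed by its raw bytes); the text deliberately does not say how the marker is encoded:

```java
import java.io.*;

public class LengthPrefixedSignalFormat {

    public static class Writer {
        private final DataOutputStream out;
        private ByteArrayOutputStream element = null;
        public Writer(OutputStream raw) { out = new DataOutputStream(raw); }
        /** Closes the previous element (if any) and starts a new one. */
        public void writeSignal() throws IOException {
            flushElement();
            element = new ByteArrayOutputStream();
        }
        public OutputStream content() { return element; }
        public void close() throws IOException { flushElement(); out.close(); }
        private void flushElement() throws IOException {
            if (element != null) {
                byte[] b = element.toByteArray();
                out.writeInt(b.length);   // the "syntactic marker"
                out.write(b);
            }
        }
    }

    public static class Reader {
        private final DataInputStream in;
        private int remaining = 0;        // unread bytes of the current element
        public Reader(InputStream raw) { in = new DataInputStream(raw); }
        /** Skips the rest of the current element and opens the next one. */
        public InputStream next() throws IOException {
            in.skipBytes(remaining);      // "dumb skipping" of unread content
            remaining = in.readInt();
            return new InputStream() {
                @Override public int read() throws IOException {
                    if (remaining == 0) return -1;   // element boundary
                    remaining--;
                    return in.read();
                }
            };
        }
    }
}
```

Because next() always knows how many bytes of the current element are left, it works no matter how much (or how little) the caller read from the previous stream, which is exactly the required property.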

Summary

After reading this blog entry You should have some idea what the requirements for a good file format are and how an elementary, good API may look. I do warn You that in fact this is NOT a good API at the moment, but it illustrates the concept well.

In the next blog entry I will expand this idea into a bit more sophisticated form.

Towards abstract file format

Today I would like to talk about complex file formats.

Any of You who program have most probably been either reading from or writing to files. If You got lucky, You were supplied with some format-specific API. If not, You had to get the format specification and write the API Yourself.

How many times have You had to do it? Five? Ten? More?

I got a bit pissed off having to do the same brainless work again and again. I have my data, I know its structure, and I would like to just push it to a file. I should not care whether the format is XML, JSON or some binary format, right?

Java Serialization

Java serialization was a brilliant step forward, but it was stopped halfway.

Note: For those who do not know what serialization is: You take an object, You take a stream and say “write that damn object to a stream”. And serialization writes the object and all the objects it references. Just like that.

Why am I saying this? Well…

Because serialization is a fixed, hard-coded binary format. Even worse, it is implemented in such a way that there is no clear “borderline” which You may override to serialize the object to, say, XML instead.

Sure, You will say, we have other serialization engines for Java which write to XML. Yes, we have. But they also come with a hard-coded format and, what is much worse, with their own object processor.

In fact, what I need is the standard serialization engine with a “pluggable format”. Something like that.

How is it done currently?

The current serialization source code is built like this:

Taken from LTS JDK 8 source
....
  private void writeHandle(int handle) throws IOException {
        bout.writeByte(TC_REFERENCE);
        bout.writeInt(baseWireHandle + handle);
    }

This is a part of ObjectOutputStream.java which is responsible for writing to a stream a reference (pointer) to an object which was already serialized (at least partially). This is a good API for a serialization format; having an API with writeHandle() would be nice. The implementation is, however, utterly stupid. At least from an object-oriented programming point of view.

This method should be:

   protected abstract void writeHandle(int handle) throws IOException;

and should be declared in a class AbstractObjectOutputStream. Then a DefaultObjectOutputStream class should be declared, carrying the implementation:

  @Override protected void writeHandle(int handle) throws IOException {
        bout.writeByte(TC_REFERENCE);
        bout.writeInt(baseWireHandle + handle);
    }

If it were done this way, we could easily change the binary format to XML, a text dump or whatever we would like to have.
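The proposed split can be sketched as a self-contained example. This is my minimal illustration of the pattern, not JDK code; the constant values mirror those in java.io.ObjectStreamConstants (TC_REFERENCE = 0x71, baseWireHandle = 0x7e0000):

```java
import java.io.*;

// The abstract class defines WHAT gets written...
abstract class AbstractObjectOutputStream {
    protected abstract void writeHandle(int handle) throws IOException;
}

// ...and the default subclass defines HOW it is encoded on the wire.
class DefaultObjectOutputStream extends AbstractObjectOutputStream {
    static final byte TC_REFERENCE = 0x71;
    static final int baseWireHandle = 0x7e0000;
    final DataOutputStream bout;

    DefaultObjectOutputStream(OutputStream out) {
        bout = new DataOutputStream(out);
    }

    @Override protected void writeHandle(int handle) throws IOException {
        bout.writeByte(TC_REFERENCE);
        bout.writeInt(baseWireHandle + handle);
    }
}
```

A hypothetical XmlObjectOutputStream would then override writeHandle() to emit, say, a `<ref>` element, without touching the serialization algorithm at all.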

Note: Looking at the serialization source code one should derive from it the bright idea that You should not let inexperienced coders code new ideas. The idea of serialization and the algorithms behind it were very new at the moment. Nobody had ever done anything like this before. Sure, it has numerous conceptual bugs, but the coding… it was a sea of errors which can be made only by very inexperienced coders.

Making it pluggable

The obvious part to be made later would be:

public interface ISerializationFormat
{
    public void writeHandle(int handle) throws IOException;
.....
}

and

public class PluggableObjectOutputStream extends AbstractObjectOutputStream
{
     private final ISerializationFormat fmt;
  ....
    public PluggableObjectOutputStream( ISerializationFormat fmt )....
    ....
@Override protected void writeHandle(int handle) throws IOException {
       fmt.writeHandle(handle);
    }
}

This way we can use the precious “wrapper” technique to debug serialization by, for example:

   AbstractObjectOutputStream o = new PluggableObjectOutputStream ( new LoggingSerializationFormat ( new DefaultSerializationFormat( ....

Try doing it now…
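The LoggingSerializationFormat used above could be a plain decorator. Here is a minimal sketch of the idea, with ISerializationFormat reduced to the single writeHandle() method for brevity (an assumption of mine; the full interface would of course mirror every format operation):

```java
import java.io.IOException;

interface ISerializationFormat {
    void writeHandle(int handle) throws IOException;
}

// The "wrapper": logs every call, then delegates to the real format.
class LoggingSerializationFormat implements ISerializationFormat {
    private final ISerializationFormat wrapped;

    LoggingSerializationFormat(ISerializationFormat wrapped) {
        this.wrapped = wrapped;
    }

    @Override public void writeHandle(int handle) throws IOException {
        System.out.println("writeHandle(" + handle + ")");
        wrapped.writeHandle(handle);
    }
}
```

Since the wrapper implements the same interface it wraps, it can be slipped between the serialization engine and any format implementation without either of them noticing.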

Benefits of pluggable serialization

One may say: “All right, so You have a problem with that. This is just because You are lazy. If You need a different format, why not write the serialization Yourself? The algorithm and sources are public, right?”

One may be right. May be.

But only if that one has never tried to do it. I have tried. Three times.

The serialization algorithm is not trivial. But it can be done. In a very ugly, sub-optimal way, but it can be done.

What cannot be done, at least not in a portable way in pure Java at the JDK 8 level, is de-serialization. Specifically a very, very tiny bit of it, which translates to Java bytecode:

   new "x.y.class"
   without calling a constructor

This sequence of bytecode can be executed by the JVM but is prohibited and rejected by the class file verification mechanism. You can’t have an object without a constructor being called, and serialization specifically does not require classes to have a “do nothing” constructor. This action must be implemented by digging into JVM guts, and thus the special open source project called Objenesis (as far as I recall) was created. But this project is no magic and does nothing more than “check what JVM you run on and hack it”.

So implementing an exactly compatible de-serialization algorithm is, at the very least, a very time-consuming task.

Just to make it, let’s say XML?

If it were pluggable, there would be no problem at all.

Serialization format API or…?

Up to now I was talking about the very specific case of the Java serialization API. This API is very focused on dense packing of Java objects. If You just try to use it to save a struct of some fields You will notice that it pollutes the stream with things called “class descriptors”, reference counters, etc. While You just wanted to have a plain, dumb structure, right?

I think we should now focus on thinking about what exactly the file format is.

But this will be in the next blog entry.

Summary

After reading this short blog entry You should have grasped the idea of why something like a “pluggable file format” may be useful. You should also know what inspired me to dig into the problem.
In the next blog entry You will be shown details of how exactly we should define such a format.

How not to translate the user interface

i18n – internationalization

Today I would like to throw my stone into someone else’s garden, and this stone will be about translation of the user interface. Or to be more precise: how not to do it.

OK, what is it all about?

No more than a few days ago I inspected some Qt GUI code with a line like this:

someObject->setText(tr("Close file"));

The tr() function takes the input text and translates it into the selected language.

Excellent, isn’t it? Standard, good and readable. And if no translation is found it obviously falls back to English. Good.

Right?

Well…

Not so good.

Each language has its own specifics. In each language some words have different meanings depending on where and how they are used. For example, in English “drive a nail” and “drive a car” use the same word: “drive”. But in Polish it is “wbić gwóźdź” and “prowadzić samochód”. You don’t have to know Polish to notice that there is no common word in those two phrases.

Of course tr() does not translate the text word by word. Instead it is a simple look-up table, a map which maps one text to another. So there is less room for confusion.

Despite that, the confusion is still there. And You will have just one, single translation for all the places where the sentence “Close file” is used in Your program.

It is all about context

It is all about context. Each text which is to be translated must be translated in a specific context. So I would recommend doing something like this:

someObject->setText(tr("SomeWindow::Close file"));

This way the tr() function is provided with additional context: the “Close file” text is used in the “SomeWindow” class. Now if this text is used in another window or another context it may have a different translation.

It is relatively easy to make tr() use the English translation as a fallback before it falls back to the passed key. So it may return “Zamknij plik”, or “Close file”, or finally, if neither a Polish nor an English translation is present, “SomeWindow::Close file”. I agree that the last is ugly, but it can still be understood by a human.
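The fallback chain described above (selected language, then English, then the key itself) is just a pair of map lookups. Here is a language-agnostic sketch of it, written in Java for brevity rather than Qt C++; the class name and table contents are my own invention for illustration:

```java
import java.util.Map;

public class Translator {
    private final Map<String, String> selected;  // e.g. Polish table
    private final Map<String, String> english;   // English fallback table

    public Translator(Map<String, String> selected, Map<String, String> english) {
        this.selected = selected;
        this.english = english;
    }

    public String tr(String key) {
        // key is context-prefixed, e.g. "SomeWindow::Close file"
        if (selected.containsKey(key)) return selected.get(key);
        if (english.containsKey(key))  return english.get(key);
        return key;  // ugly, but a human can still understand it
    }
}
```

The same “Close file” can now get two different translations simply by living under two different context prefixes.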

It is all about ease of work

The second thing is the ease of actually performing the translation. Imagine You are a translator and You are provided with a file which is just a bunch of English sentences.

How can You correctly translate it?

Of course You can’t. You won’t take such a job. You would rather have screenshots of the GUI and write translations over them, because then You would know the context.

Sadly, supplying such screenshots to You is a complex and expensive job. It will also be bug-prone when somebody else has to type the translations into resource files. So You will rather be provided with a text file.

And now imagine this text file has some context information. It will still be bad. It will still be hard, but certainly Your translation may be better. And it will be easier (cheaper) to provide You with screenshots on which there is just an arrow pointing to a certain dialog and saying “this is SomeWindow”.

That’s all for today.

Throw away Your debugger! Unit tests.

Honestly. Really.

The only place I still need a debugger is the embedded world, where I program micro-controllers. I suppose You will soon catch on why.

What is debugger for anyway?

This is not a silly question.

A debugger lets You step through Your program. You may inspect data, inspect variables, see how the program flows. It allows You to look inside Your program.

Yet it totally breaks timings. It fails at solving multi-threaded problems. It completely fails at solving performance issues.

And it is hand-driven. You need to click. Click. Click. Click…

Basically, a debugger allows You to see what is happening. This is a very useful tool for beginners, or in cases where there is no other way to see what is going on, as for example on a micro-controller.

But does an experienced programmer have much use for it?

In my case the answer is clear: NO.

Unit tests.

A much better workflow is to write a test which will check what You would otherwise check with a debugger.

A unit test is a small program which, when run, can check, best without any user intervention at all, whether a certain piece of software does what it should do.

So in my opinion, whenever You need to check whether a certain routine does what it should, You should write a test. In Java I recommend JUnit (junit.org), which can be nicely integrated with ANT build scripts to run all the tests it can find automatically.

In my case I configured it in such a way that it searches a package for all class files matching “Test*.class” or “*$Test.class” and runs them.

But writing tests is a real pain!

Surely, writing a test takes time. From my experience it takes about 1/4 of the time I spent writing the code which is to be tested.

Surely, if You hit a wall and have to re-design the code, You will most probably have to re-write all Your tests. You may see at this point why it is so important to focus on code design and make a great effort to create a reasonable API which may then be put to tests.

So these are disadvantages of unit tests.

There is however one huge advantage.

Whenever You change anything in Your package, all You have to do is run a single script. It is called “test-package.sh” in my case, and it asks ANT to run all the unit tests it can find.

You just run it and watch whether a big FAILED appears on the screen.

You may imagine what a great impact this has on so-called “regression bugs”.

Write bug-exposing tests

The next thing I recommend is to write tests which expose bugs that were reported to You. You got a bug report, You hunted down the piece of code or the certain usage scenario which triggers the bug. This is an excellent moment to write a test which will expose it.

Run it, and it should fail.

Then fix the bug and run the test again. This time it should not fail.
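The whole routine can be shown on a toy example. Everything below is hypothetical (the bug, the class, the names are all mine): imagine a bug report saying that rounding-down division gives wrong results for negative divisors, because the original code used plain `/`, which truncates toward zero:

```java
class MathUtil {
    // Fixed version. The buggy original was "return a / b;", which
    // truncates toward zero: 7 / -2 gave -3 instead of -4.
    static int divideRoundingDown(int a, int b) {
        return Math.floorDiv(a, b);
    }
}

// The bug-exposing test: written BEFORE the fix (when it failed),
// and kept in the suite forever afterwards.
class TestDivideRoundingDown {
    static void run() {
        if (MathUtil.divideRoundingDown(7, -2) != -4)
            throw new AssertionError("FAILED: negative divisor rounds wrong way");
        if (MathUtil.divideRoundingDown(7, 2) != 3)
            throw new AssertionError("FAILED: positive case broken");
    }
}
```

Against the buggy version this test fails; against the fixed version it passes; and from now on it guards that exact scenario on every test run.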

The next important step is to actually add this test to Your test suite, so that every time the package/library/application tests are run this test is run too. I insist on it, because the mere fact that the bug appeared in the first place means that Your test suite had holes in it.

If You follow this routine You will systematically plug all the holes, not only in Your application but also in Your test suite. Thanks to that, the chance for regression errors to appear will get lower and lower over time at almost no cost.

Tests will let less experienced coders make fixes

You might have noticed in some of my previous posts that I was saying that people with knowledge are the most important resource. Then I stressed that one should document the code in such a way that, if the original author goes away, there are no high learning costs for new employees.

You may see tests as a piece of such documentation. Having well-documented tests is the best thing, but in my case I rarely have a strong enough will to document them as well as I document the main code. The KISS (Keep It Simple, Stupid) rule should then be followed. A small, few-lines-long test with a meaningful name can be self-explanatory enough.

OK, so we have the code and the tests. We have a bug report. And we have a highly paid experienced programmer and a newbie at half that price. The experienced guy will fix it quickly, but then he won’t be doing the job only he can do. So maybe we should give it to the newbie?

It will probably cost us even more. From what I can see, a newbie needs about five times more work hours to solve a problem than an experienced programmer. The better the code is documented, the easier it will be for him to solve it. Yet he still won’t know the code very well, and a “fix” may in fact become a “regression bug spawning daemon”.

But remember, we have a good test suite. Armed with it, even if he did not understand what is going on in all details, the tests should catch it if he broke something. The risk of “regression bugs” appearing is low, and we now have another person who is slightly more experienced.

Providing that all tests are run…

Dependency tracking in testing

Remember, Your test suite may have holes.

If You like to be really pedantic, then You should be able to track every library or program Your company is selling which uses the library You just fixed. Your fix might have influenced them in both good and bad ways.

So for best results You should run not only the tests which are inside the library You fixed, but also the tests in libraries which depend on it. Run them before the fix to make sure they were not failing, and run them after You made the fix. If any of them fails, pinpoint the problem and extend Your test suite with a test which will expose it.

Well… I’m usually too lazy to do that extension, and I am satisfied with the fact that the tests in the dependent library failed. My bad, but it is still better than not running those tests at all.

But I would rather use debugger…

You would. Once. At most twice. But imagine that You need to run a regression check on, let’s say, five libraries. It will take days of Your precious life.

You may however, if You like, use a debugger to hunt bugs or to validate Your tests.

I personally prefer logs and unit tests, because I’m too lazy to use a debugger.