“Snaps” will kill Ubuntu?

Sounds stupid. How can the base application distribution system of Ubuntu kill it?

It is simple: because of “jailing” or “sand-boxing”.

I had the displeasure of installing two key applications on my system from “snap”: FreeCAD and Firefox.

And you know what?

I couldn’t use them.

Firefox was unable to see anything except the “~/Downloads” folder, and I use it as my primary HTML renderer. So if it cannot see absolutely every file in the system, then it is useless.

FreeCAD, in turn, could only see the “~/tmp” and “~/” (home) folders. Normally you would say that is OK. I had, however, created a huge LVM volume which I mounted at a point accessible from all user accounts. Each user, I thought, could then create a sym-link to a folder on that LVM volume and place it inside their “~/” home folder. It is easy to use and reasonable, certainly more reasonable than creating a separate LVM volume for each user and mounting it inside their home folder.

Of course sand-boxing destroyed this idea. Sorry, sym-links leading out of the allowed folders are out of the question.
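To be fair, snap does expose a crude version of this through its “interfaces”, which you can only drive from the command line, if you know they exist at all. A minimal sketch (removable-media is a plug the Firefox snap declares; check yours with snap connections):

  # list which plugs the snap declares and where they are connected
  snap connections firefox

  # grant access to extra drives mounted under /media and similar mount points
  sudo snap connect firefox:removable-media

It does not help with sym-links pointing out of the allowed folders, and no “open” dialog will ever tell you that this command exists.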

Why is it so stupid?

To protect us, the users, of course. Because we are dumb idiots who can’t tell a rogue application from a legitimate one.

True.

But snap and flatpak both assume that the person who packaged the application can tell it! We, the users, cannot say: “this snap can access that folder, and f*off dear sandboxing”. No, the snap distributor is the one who decides.

Where is the logic in that? Whoever creates a rogue snap will surely enable access to every folder he needs. So there is zero protection.

True protection comes only when we, the users, are able to decide what to protect and what not to protect.

Why is it not transparent?

Sandboxing, when done well and given a proper GUI for users to deal with it, is a great tool. I would be happy to be able to right-click on any application and say: “this bastard cannot get in there!” I am very curious why it is not done this way, by the way.

A great source of confusion, and proof of idiocy, is the fact that the snap “jail” and the flatpak “sandbox” actually hide some folders and inject others. And this happens without the user ever being told it is done. You click the “open” dialog and… hey, where are my folders?! Likewise, you may find folders there which do not exist on your disk at all.

I could accept sandboxing which pops up an “access denied” error or something like that, but not one which simply makes things vanish as if they had never been there. Injected “virtual” folders and files must also be clearly marked. Of course that last part will be tricky, as most file systems do not support such marking, but I suppose it is up to those who promote aggressive sandboxing to scratch their bald heads over it.

Summary

Having secure and safe applications is great. Sometimes, however, it comes close to making them so secure and safe that it is almost as if we did not have them at all. An HTML browser which can’t see files on disk and a CAD which can’t access folders other than “~/” (home) are both so cumbersome to use that users will simply throw them away.

And replace them with less secure solutions. Is this what you want, guys? To make users reject any security measure whatsoever because you tightened them too much? Why not simply create more reasonable file system access rights than the old, useless “user/group/root” triad? Why not let access rights be set per application?

It is like the “change your password weekly” mania in many companies. Sure, it is good. But it is so problematic for users that sooner or later you will end up with two groups of them:

  • those who write passwords on sticky notes and place them on their desks, or;
  • those who create “algorithmic” passwords, like “the first two letters of the month, followed by the week number, followed by the name of my mum and a silly character”;

Both are bad security practices, but what would you expect?!

Playing with AI text generators.

This blog entry will be short. Very short.

Addiction

The first thing I have to say: it is addictive. Really. AI is as close as you can get today to a smart, living NPC (Non-Player Character). I discovered https://perchance.org/ai-story-generator and, well, fell a bit head over heels for it.

At first it was amazing. Then, after some time, it started to be a bit disappointing. And at last I got the feeling that AI is dumb.

Really dumb.

AI is a marketing name

In reality it isn’t an “intelligence”; it is a huge word-mashing model of a human language. And when we say “huge”, we mean it. The one which is smart enough to sound reasonable uses 7×10⁹ numbers to store its “model” of all the sentences it has seen. The more capable ones use from 13×10⁹ up to 120×10⁹ numbers.

This kind of intellect has nothing to do with reasoning. Zero. None, null. The reason behind that is very simple.

It looks like it is thinking, but in reality it is just finding the most probable continuation of the sequences of numbers it was shown. And that is all.

The problem with AI is that when the marketing guys saw a machine able to answer a question stated in human language, they thought it was “thinking”. Piling one misunderstanding onto another, they decided that “large language model” is not a name that sells well, and that “Artificial Intelligence” will sell better.

At first it was. Now… well…

Why are AI text models doomed to be dumb?

Let us start with a digression.

Imagine you are sitting in a small room. There are no windows; nothing can get in from the outside. Now imagine that outside this box there is a civilization of creatures which have a sense of smell ten thousand times better than a dog’s, have amoeba-like bodies and live in the ground, seeping through the gaps between grains of sand.

Then, one day, one wall of your box lights up, showing you rows of black and white dots. On the opposite wall a number of buttons appear. What will you do? Well, probably nothing. But if you just sit there, the box will start shocking you with painful electric shocks. So you will start pressing buttons. Some sequences will stop the shocks, some will make the box give you ice cream (or whatever you find pleasurable), some will shock you almost to death.

Why am I talking such nonsense?

Because this is what the AI’s “life” is like.

No common experience

The “large language model” is presented with “tokens”, and a “token” is just a number assigned to a sequence of characters. This gives the first limitation. The “language model” will have a hell of a lot of trouble understanding what sixty-nine ( 69 ) means in terms of erotica. This idiom is hard to make sense of until you can actually see the shape of the digit 6 and the shape of the digit 9. But to really guess the meaning, you need to be able to see what a human looks like. Only then can the circle in 6 and 9 be imagined as a human head.

The “language model” is just shown numbers, not shapes. It will also have trouble matching words which sound like other words. In languages using a phonetic alphabet, the look of a word also carries its sound. In languages like Chinese the sound is not directly stored in a symbol, so there we can have words which sound alike but look different, and words which sound different but look alike.

This shows the basic problem: our written languages are bound to our hearing and sight.

So if the “language model” can neither see nor hear, it will have a hard time dealing with obvious things.

Lack of reasoning?

AI, at least the models I had a chance to experiment with, quickly falls into the “most probable sentence” trap. For example:

Martha has no sense of taste. Her taste buds were non-existing.

Martha was eating ice-cream. Her sister, as a joke, salted the ice-cream.

“How do you find them, Martha?” asked her sister.
AI continuation starts here:
Martha shrugged, her lips pursed around the spoon. “They’re… not bad,” she said, her words carefully enunciated. “They’re not as bad as I thought they’d be.” She took another bite, chewing slowly, and then swallowed. “They’re actually kind of salty,” she admitted, “but I can still taste the ice cream underneath.”

This is because the logic connecting “taste buds” with “salty” is located outside the language, inside our physical experience. Regardless of how strong the intellect is, unless it can actually touch that logic it will never be able to reason correctly. There is, however, a strong probabilistic chain salted→eating→“how do you find your food?”→tastes salty in most written texts.

Likewise, the so-called “AI based code generation tools” are a real pain in the behind. At best they are as good as a “code template lookup engine”, which every company with its own code base can create at near-zero cost with Omega-Search, Xapian, Lucene + Tika, or just a standard “template catalog”. And they will never be better unless they become able to use compilers, debuggers, GUIs and other tools to actually test the code they write.

Cost

Once I got a bit addicted to perchance, I decided to try to make it say really dirty stuff ;). Yes, I am male and males are, well…

Of course, on-line services are out of the question in such a case. Privacy is important. Especially since in some countries saying certain things is illegal.

I had to get myself a decent PC able to run AI locally. Luckily, my old PC was getting close to end-of-life status (15 years), so I decided to get a new one. I chose an AMD Ryzen 9 7900X (12 symmetric cores) and maxed out its memory capacity with a whopping 128GB of DDR5 3600 RAM. No dedicated GPU (graphics card), just the one inside the Ryzen 9.

Trying to run a full 70B (70×10⁹ parameters) model of course failed. Out of memory. You need about 70×2 = 140GB of RAM to even think about starting it, as it uses a 16-bit floating point number for each parameter. One may down-sample it to a less accurate form, called 70B Q8, using 8 bits per number. That form can be run on this machine.
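The back-of-the-envelope math for the weights alone is simple (activations and the context cache come on top of that); a sketch in shell arithmetic:

  # rough RAM needed just to hold the weights of a 70B model
  echo "fp16: $(( 70 * 2 )) GB"   # 140 GB, 2 bytes per number
  echo "Q8:   $(( 70 * 1 )) GB"   #  70 GB, 1 byte per number
  echo "Q4:   $(( 70 / 2 )) GB"   #  35 GB, half a byte per number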

No GPU?

You may now ask why I didn’t get myself a GPU. You might have heard that a GPU is a must for AI.

The answer is simple: money. Those 128GB cost me half the price of an entry-level 4GB GPU. And if you look at the necessary memory sizes, GPUs with 8GB of RAM are useless (128GB of on-board RAM is about 1/4 of the price of an 8GB GPU). To be able to “talk” with a reasonably smart AI you need at least a 13B model, for which 16GB will be okay. But if you would like to train it or use smarter models, well… For training LLAMA2 70B you need, they say, about 360GB of RAM.

The largest GPU I have found has 48GB of RAM and costs about three times what this whole PC cost me.

AI on bare metal

So no AI for me? Is a GPU a must?

Well… it is not.

In fact a 4-core CPU is enough to use it.

The machine I got can run 70B Q8/Q4 models, generating about 1 word per second, and all cores stay ice cold.

Note: A 7B model runs as fast as you can read, and a 13B model is okay too. Perchance runs on a GPU, possibly a 13B model (I am not sure, but I don’t think a private person would invest in a GPU powerful enough to run 70B), and it can spit out tens of words per second.

The CPU cores are assigned to the job, but they can do little. That 1 word per second is exactly the limit of DDR5 RAM throughput. I ran some tests, and my system shows about 40…60GB/s for on-board transfers. Considering that an AI is just a neural network, and a neural network is a f*ing huge matrix, each step must read that matrix at least once. And this is what takes time. It really doesn’t matter whether we have the AVX2/AVX512 instruction set, vectorization or whatever. The CPU will sit and wait patiently until the on-board RAM fills the cache.
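You can sanity-check that one-word-per-second figure from the bandwidth alone. A sketch, assuming a 70B model at Q8 (about 70GB of weights) and about 50GB/s of sustained throughput:

  # tokens/s ~ RAM bandwidth / bytes read per token (one full pass over the weights)
  echo "scale=2; 50 / 70" | bc   # ~0.71 tokens per second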

I ran some experiments and noticed that assigning more than 4 threads to the AI doesn’t raise its speed at all. Using all 12 cores or just 4 gives practically the same result. Quite simply, 4 cores can consume the entire on-board RAM bandwidth without any problem.
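If you want to reproduce this, llama.cpp takes the thread count with the -t option; a sketch (the binary and model names depend on your build and download):

  # same prompt, different thread counts; tokens/s come out nearly identical
  ./main -m ./models/70b.Q4_0.gguf -t 4  -p "Once upon a time"
  ./main -m ./models/70b.Q4_0.gguf -t 12 -p "Once upon a time"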

The Ryzen 9 has 12 cores but can handle up to 24 threads. The idea of having more threads than cores normally works well, because threads use different functional units of the cores, so there is a chance that two of them can run in a relatively “parallel” manner. In AI computation this is not the case. The llama.cpp executable is about 1.4MB. Yes, megabytes. I can bet that the actual computation core is less than 1kB, so it fits in the Level 1 cache of a Ryzen 9 core without any problem. So in the case of AI all threads are doing exactly the same thing. I observed that allowing the AI to use all 24 available “virtual cores” actually slows it down.

Python…

Most AI is using Python. Well… it doesn’t work well without a GPU. llama.cpp is written in C++ and on bare metal runs about two to three times faster than the Python version.

Of course, if you off-load everything to the GPU, Python is fine… because then it is not Python that runs the computation.

Training on bare metal?

I failed at it. Not because it is impossible, but because of the rather sketchy docs. The only example I was able to run failed to learn anything after 18 hours of work. Possibly because it needs a few thousand steps, while each step takes about 15 minutes on bare metal.

Luckily there is a good alternative to training or fine-tuning: get yourself a model with a large context. You can find 32k-context models with ease, and they run well on bare metal.

Running on built-in GPU?

I tried it. In fact it is hard to persuade llama.cpp not to use it. If it is compiled with GPU support, it will use the GPU for initial processing regardless of whether you tell it to or not.

The good side of the Ryzen 9 built-in GPU is that you may assign to it as much RAM as you like. The bad side is that, when tested with AI, its performance equals about half of a single CPU core. This is why ROCm (the official AMD GPU computation library) doesn’t list the built-in GPU as a supported device, even though it can be persuaded to support it.
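For the record, the amount of work off-loaded to the GPU in llama.cpp is controlled by the -ngl (--n-gpu-layers) option; a sketch, with a hypothetical model name:

  # -ngl 0 keeps all layers on the CPU; a GPU-enabled build may still
  # grab the GPU for the initial prompt processing anyway
  ./main -m ./models/70b.Q4_0.gguf -ngl 0 -p "Hello"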

Use AI for your company?

No. You can use it as a first-contact chat bot, but you need to be very, very careful. In fact you need to pass the answers of your AI through a secondary AI. The first AI replies to customer requests, while the second-level AI checks whether the answer is “legally safe”, that is, doesn’t promise anything, isn’t rude and isn’t simply untrue.

And even then your clients will be really pissed off.

So maybe have it learn your company standards? A kind of “smart wiki”?

My recommendation is flat: No.

You will need a hell of a lot of GPU power to make the AI learn your standards. But standards aren’t cast in stone, so they will change. The dispersed nature of AI makes it hard to “un-learn” old versions. So if you just fine-tune it each time a standard changes, you will end up with a lying mess of a fake intellect.

Compare it with Lucene+Tika or OmegaSearch+Xapian… well… My machine could index about 10GB of text in less than a few minutes and search for a phrase in sub-second time. A 1GHz/1GB machine can do the same in about an hour for indexing and 1 second for searching. And it can easily be made to forget old standards. By the way, the total size of the Xapian index (a searchable database) for this amount of text is about 30MB. Yes, 30 megabytes for about 10 gigabytes of input data. This should show you how oversized AI is. And just for your information, this entire database fits in the Ryzen 9 7900X L3 cache (the 7900X has 64MB of L3 cache). This is why it can be hellishly fast.
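Setting up such an index is a few commands. A rough sketch using Xapian’s omindex and quest tools (all paths are hypothetical):

  # build a searchable index out of a directory full of documents
  omindex --db /srv/standards.idx --url /standards /srv/standards

  # query it; sub-second even on weak hardware
  quest -d /srv/standards.idx "soldering temperature profile"

  # “forgetting” old standards is trivial: wipe and re-index
  rm -rf /srv/standards.idx
  omindex --db /srv/standards.idx --url /standards /srv/standards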

So considering the cost of running it, updating it, ensuring response validity and so on… No. Don’t use AI.

Note: The only worthy use of AI in this case is “assisted search”, where the AI turns a user query, stated in human language, into a document database lookup and provides a short summary of the found fragments. This may be a valid use, since it doesn’t need any training. It shouldn’t, however, be exposed to your clients, since it can still be made to say really dangerous things (like, for example, a promise to sell a new car for $1).

Summary

Did I say this would be a short post? Well… I lied again.

In short:

  • AI is not “intellect”;
  • it runs quite well on a CPU alone;
  • it is “memory constrained”, so memory size and speed are what limit it;
  • 128GB of RAM is not an unreasonable size; in fact it was too small for many AI-related tasks;
  • if you can, avoid it. A search engine is more practical.

Thanks for reading.

Festival speech synthesis: worst coding ever?

This week I played with the text2wave component of the festival tool set, used on Linux to run speech synthesis. I am not very well versed in speech synthesis tools on Linux, but on my machine the HTS voices and CMU voices sounded better than espeak and mbrola.

So I played a bit with it.

The voice produced by text2wave is a bit flat, so I searched for some way to annotate the text. I need:

  • to be able to adjust the rate (speed) of talking;
  • to be able to adjust the pitch (higher or lower tone);
  • to be able to insert pauses at paragraph breaks;
  • to be able to change the voice (speaker) within the text;
  • to be able to save the spoken text to a file;
  • to be able to script the process so it can automatically run on many files.

Promises

Festival promises much. It declares support for SABLE (XML) markup, for which I couldn’t find any specs except in the festival sources.

The DTD of this XML declares that I will be able to do everything I need.
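For reference, a minimal SABLE document pushed through text2wave looks roughly like this (a sketch; the attribute values are guesswork, because, as the list below shows, their ranges are nowhere documented). The file, say sample.sable:

  <?xml version="1.0"?>
  <SABLE>
  <SPEAKER NAME="male1">
  Plain text, spoken as-is.
  <RATE SPEED="-20%">This should be slower.</RATE>
  <PITCH BASE="+10%">This should be higher.</PITCH>
  <BREAK LEVEL="Large"/>
  And this should come after a pause.
  </SPEAKER>
  </SABLE>

And the invocation:

  text2wave -mode sable -o sample.wav sample.sable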

If one however tries it out, and then looks at the source, one will notice that:

    1. The DIV element is ignored. And this is the only element which could introduce a pause. This problem is documented in the manual, but why is it then in the DTD?
    2. The BREAK element is a sophisticated way of entering a space into the text. All parameters declared by the DTD are ignored.
    3. The AUDIO tag requires a URL as an argument instead of a file. It doesn’t understand a path to a file at all. Of course it can understand a file:// URL, but that means the path to your audio file must be absolute. This simply rules out any redistribution of your text files.
    4. The AUDIO tag neither validates the sampling rate nor resamples the file. As a result you must guess what the output sampling rate will be and match your file to it. This is a pain in the behind, because using AUDIO is the only way of injecting a pause into the text.
    5. The AUDIO tag promises a “background” and an “insertion” mode, but, of course, only “insertion” works. This is documented, but the fact that it doesn’t resample the file is not.
    6. EMPH in fact uses RATE to say the word more slowly. It ignores all parameters; this is documented in the manual, but again, why is it in the DTD?
    7. PITCH and RATE promise a change of speaking speed and tone. It is a promise only. They work only with the low-quality “diphone” voices. The high-quality CMU/HTS voices are not affected by these settings at all. This is not documented. In fact even internal festival commands do not affect them.
    8. The ranges of attribute values, the meanings of the numbers and so on are nowhere documented.
      You must guess: when PITCH takes n% as an argument, is 100% the “base” pitch or twice the base pitch?
    9. VOLUME does work, although you must guess how to specify the value. And, of course, what the base value is and how large the margin is before the sound starts “clicking” due to clipping.
    10. The SPEAKER element, which changes the voice, works… but in fact it doesn’t. The voices I have all use different sample rates, while SPEAKER applies the first speaker’s sampling rate to all voices. In my case the male “diphone” voice is sampled at a low rate, the HTS voice at the highest rate, and the CMU voices are in the middle. As a result you can’t switch speakers unless you limit yourself to a very narrow range of voices.
      If you however use text2wave -F 32000 and are lucky with the selected number, then the problem goes away.
    11. The SABLE XML parser follows the XML specs in places it likes and ignores them in places it doesn’t. It collapses a sequence of white-space characters into one space, as XML requires, but does not even try to understand the non-breaking space entity &#160;, which is the XML way to say “and I have a space here which I would like to keep”. As a result there is no way to add a pause to the text without using the AUDIO element.
    12. The text-to-speech process can take a very long time. Yet the XML parser does not pre-parse the input to validate that all the syntax is correct. Instead it can run for an hour and crash in the middle if you messed up the tags.

Documentation

The worst possible. It looks like notes made by the person who wrote the program. For example:

text2wave [options] textfile
  Convert a textfile to a waveform
  Options
  -mode <string>  Explicit tts mode.
  -o ofile        File to save waveform (default is stdout).
  -otype <string> Output waveform type: alaw, ulaw, snd, aiff, riff, nist etc.
                  (default is riff)
  -F <int>        Output frequency.
  -scale <float>  Volume factor
  -eval <string>  File or lisp s-expression to be evaluated before
                  synthesis.

Now please tell me, what could the <string> in -mode be?

And finally… gosh… “...riff, nist etc.”. Etc.?! Really? What are we playing here, a LOTTO? Guess the sequence of letters and win a prize?!

By the way, -F is unpredictable and produces errors. Especially if you, by accident, set -F to match the default sampling rate of the voices.

The names of voices are cryptic. The content of the archives at festvox.org is not described. Guys, I wasted a few GB of your server bandwidth only to discover that what I downloaded was not what I was looking for.

Maintenance

The festival (manual nil) command depends on Netscape. Yes, you read that correctly. Not on the default OS web browser, but on Netscape. Amazing!

Even though the last big release was in 2017 (2.5), the “latest” link on the server points to revision 2.1.

There are plenty of dependencies on non-existent external documents. For example, the link to the SABLE XML format specification points into the void. You need to reverse-engineer it from the DTD and the implementing code.

Voices

The compatibility of voices is a total misery. The HTS voices, which are the best, can be fetched from the Nitech archive page. They are however not upward compatible and must exactly match the festival version you have. And you know what? Festival will not validate the compatibility. It will just crash with either a cryptic message or a core dump.

State faults

If one (SayText "....") command fails, subsequent calls will also fail. There is no cleanup, nothing. Zero fault tolerance.

Is it total crap then?

Well… both “yes” and “no”. It is good scientific work, but total crap when it comes to coding quality and what we may broadly describe as “end-user support”. This is certainly not a program which should be used in any production environment.

It does, however, speak quite well.

Never ever use a common noun as a program name!

I must say it even though some professors may feel offended. A “festival” is a f*ng event during which people dance and play! And a “rhino” is an animal.

Using common words to name a project, especially words which have nothing in common with it, is a strategy used by the army to hide a project from the enemy.

Are we, the users, really Your enemy?

Summary

Festival and text2wave are just one big disappointment. And one big lost chance. The HTS voices are really good and can compete with today’s cloud TTS services. They run locally, however, so there are no security and privacy concerns. You can make festival say “…enter some phrase which is illegal in Your country here…” and be sure that your Police will not intercept it.

It could be good. It could be fast. It could be easy to use.

But it is not.

What a shame.

Note: If you need a working, up-to-date and much better sounding TTS, find the project called “piper”. It sounds superior to festival and is an order of magnitude more reliable. Unfortunately, the only thing you can change is the tempo at which the speaker talks. But at least it is well documented and doesn’t make empty promises.

Luckily I did manage to craft a small bash script, built around a pipeline roughly like the one sketched after this list, which, although far from efficient, allows me to annotate the text in such a way that I can:

    • give “piper” voices friendly names;
    • make one of them the ‘narrator’, who speaks normal text, and another the ‘actor’, who speaks text enclosed in “double quotes”, and switch them on the fly. This option makes books read by TTS sound very attractive;
    • change the tempo (with a “piper” parameter) on the fly;
    • change pitch and volume on the fly (with the “sox” project as a mid-processing pipeline; in fact all “sox” effects are possible);
    • insert additional pauses automatically when more than three consecutive \n characters, dots or whitespaces are found in the text;
    • pack it all into an *.mp3 file with “lame” as a post-processing pipeline.
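The backbone of that script is essentially this pipeline; a minimal sketch (the voice model is one of the published piper voices, and the sox/lame parameters are only illustrative):

  # piper synthesizes, sox bends the pitch, lame packs it into mp3
  echo "Hello there." | piper --model en_US-lessac-medium.onnx --output_file say.wav
  sox say.wav say_low.wav pitch -200   # shift the pitch down by 200 cents
  lame say_low.wav say.mp3             # post-process into *.mp3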

What I cannot do is:

    • make the voices whisper;
    • make the voices yell or scream;
    • make the voices mumble, cry and so on;
    • control the accent inside a word;
    • extend the vocabulary so that words “piper” doesn’t know are said phonetically. For example, “Hm” is read by piper in ‘letter-by-letter’ mode, as if iterating over the alphabet, and sounds like “Ejch am”.

I think I will soon turn this script into some more powerful tool. I need to figure out how to manipulate piper at the source level and how to twist the “digital larynx” it is using so as to make it whisper. But for that I need to get my new PC running, because “piper” can starve my Q2400 quad-core Pentium machine to death and still fail to speak in real time. It is hard to experiment in such conditions.

What to do with broken parts?

Imagine You run a production company. A real one, not an imaginary one. And this company produces something.

And, of course, since we live in the real world, from time to time the product comes out a piece of crap.

What to do with such crap?

ISO 9000 says…

(…)prevent bad outputs from entering production again(…)

Which is good. If You have detected a broken part or a broken product, it must not be used or sold.

The easiest way to achieve that is to scrap it and throw it away.

Store them for investigation

Sometimes, however, bad parts should not be thrown away. There are plenty of failure modes. Some of them are random, some are statistically significant. For example, You may observe that one of each 100 pieces of Your MP3 players turns on when plugged into a USB port while the other 99 do not. This is not a critical failure, but it may turn critical after some time. So it would be best to know why it happens.

Of course Your Research&Development department is under full load and may not be able to react in a timely manner to every single failure. And even if it could, it also needs some statistics. For example, they must break a piece open to see what is inside. And of course such an operation renders that piece inoperable, so no more work can be done on it.

This means that if You are observing repeating, statistically significant failures of unknown origin, then You must collect samples.

Prepare some warehouse storage and mark each broken sample with a clear description:

  • when was it discovered to be broken?
  • when was it manufactured?
  • add tracking data so You may pinpoint the tracking data of all its components;
  • who tested it and found the problem?
  • what exactly is the problem?

Describe the problem

And no, “it doesn’t work” is not a good description. Describe step by step how to reproduce the error. Describe all the numeric values and measurements You made. Describe what You saw. And, equally important, what You think You should have seen.

Only then will Your Research&Development department be able to do its job.

Be prepared for long-term storage

This depends on the load put on Your Research&Development department and on the failure criticality. I think that for non-critical errors You need to be prepared for 3…9 months of storage. Remember, this is not only due to Research&Development delays, but also due to the fact that You must collect a reasonable number of samples (5…10 pieces at least). This may also take some time.

So prepare Your broken products for long-term storage. Remove batteries, or be prepared to charge them from time to time. Keep them at proper humidity and temperature.

And do not mix them up. Remember, they must be tracked piece by piece, each with its matching description. Piling them up on a shelf is not right.

Store them for training

Now imagine You manufacture not a 15$-per-piece MP3 player, but a 5’000$ in-cost gizmo for some techno-fetishists.

If You think about quality and safety, then You must continuously train Your employees. Train, re-train, and allow them to try out their skills in a safe and inexpensive manner.

For that You need training samples. Teaching a guy on the assembly line how to put together 5’000$ worth of modules using a pneumatic press tool on an almost finished 5’000$ part is… well… stressful at the very least, for both teacher and pupil.

But what if You were to collect some broken parts, spray them with red paint or drill some holes in them? They would be clearly distinct from regular parts, so the chance that someone will re-enter them into production is almost zero, and we are ISO9000 safe. And with such parts You may train Your employees.

Even better, You can let them find out for themselves how much force the part can withstand before it is destroyed. Or that if they stick their fingers in there, they will break something. Training with broken parts can be efficient, highly educational and inexpensive.

“How not to” examples

Different people have different minds. Different operations may require different quality criteria. And there is always some shadow zone between “it is good” and “it is bad”. Having a manual which shows a “good part” and a “bad part” is good.

But also having a physical example of a “bad part” is even better. The employee may then self-check the job by confronting it with the “good” and “no-good” examples.

Note: In the mechanics of the 1980s, “pass”/“no pass” “test rods” for holes were popular. The “pass” side of the rod was slightly smaller than the hole should be, the “no pass” side slightly bigger. All the employee had to do was try to ram both of them in. If both could enter the hole, it was too large. If neither could enter, it was too small. Why not use this concept for electronics and other production?

Summary

After reading this blog entry You should be at least partially aware that it is sometimes good to retain broken products. Of course, after marking them in some impossible-to-miss and impossible-to-remove way.

Reporting crap… or crappy reporting.

I am currently finishing listening to an audio version of the Columbia space shuttle disaster report. This is something every manager who cares at least a bit about quality should read. Especially those who manage Research&Development departments, because that is exactly what this report is about.

This report contains many important observations and hints related to safety, quality and process management. Today, however, I would like to focus on just one of them.

Do You report to Your boss?

One of the first things modern managers expect from mid-level managers, or even from those at the very bottom of the hierarchy, is “reporting”. You need to write emails or fill in tables in some stupid software. You need to spend precious hours doing it, not even knowing if anyone reads it.

Yep. Even though reports should be read, they often are not. At best they are skimmed through. Thus it is wise, of course only if You have a boss with a sense of humor, to sometimes inject something funny but also troublesome. Like some language which shouldn’t leave the company or, as my colleagues did in their high school papers, a page-long quotation from some erotic book. I assure You that if You try it, they will always carefully read Your reports before passing them up.

Or fire You. It depends.

The idea behind reporting is that employees should provide managers with the necessary information about progress and problems.

Why does Your boss need it?

We, the employees, usually think that they need reports to:

  • spank our bottoms because we had to report that we missed the deadline again;
  • stuff something into the mouths of the higher-ups.

If however You get a chance to become a manager, even for a short time, You will notice that all this reporting is needed for decision making.

Yep. Informed decision making.

No manager can make a wise decision without knowing things.

Reports are… fake

In 95% of cases. This is something the Columbia accident report states between the lines. When people have to report, they tend to coat problems with sugar and cover up their own weaknesses with careful wording. And if Your company happens to have a strict hierarchy, they will also try to formulate the report in a way which is proper and in accordance with their position in that hierarchy.

Next, as the commission also pointed out, the more levels a report needs to climb or descend through, the more “fake” it becomes. When going up, it gets stripped of details and over-simplified. When going down, it gets augmented with unnecessary additions, comments and misinterpretations.

Three levels are usually enough to take it so far from the truth that it becomes useless. I have been there, seen it, had to deal with it. Nowadays, when my boss comes back from a managers’ meeting and tells me: “There is such and such a problem with such and such a thing on the production floor”, all I ask him is: “Who told You that?”. Then I go to that person and ask again: “Who told You that?”. And so on, and so on, until I stand in front of a guy who answers: “Yep, it was me. I said something along those lines”. Now I can gather untainted information I can work with. Notice that quite often the problem was far from what my boss told me.

Reports are just signals

I think that what we call “routine reporting”, “daily reporting” or “progress reporting” should be treated by managers only as signals. Like a cough in a gentlemen’s club room.

Active information gathering

The commission observed, and I, being an engineer, fully agree, that a manager must actively search for information. There are many reasons for that.

The first is what I stated above: reports may be worded in a way which coats shit with sugar, may be reluctant to provide information which could be harmful to the reporting person, or may refrain from providing information which might be seen as going against the hierarchy.

Next, as also stated above, a manager should be very, very careful when presented with a “report about a report about a report”. A summary of a summary of some report is usually worth as much as a monkey screaming at the sight of a banana. It is like with JPEG: each conversion is lossy. Each time someone reports on what they read in a report, something is lost and something is added. The important becomes unimportant and the urgent becomes non-urgent.

Then, as the commission also pointed out, the language of a report will usually be technical, while the manager needs to use process and decision-making language. Even reading an untainted report may create a false impression because of language problems and the difference in background knowledge between the report’s author and the manager.

And finally, at least 90% of the important information won’t be present in the report at all, because that information is so obvious to the author that it seems unnecessary to write it down. Remember, authors of technical reports are usually not skilled in literature and won’t be able to write well. They may write tens of pages bragging about the maximum percentage of NO2 per liter of a hydrocarbon mixture and yet not mention with a single word that the car won’t go without gasoline.

You, dear manager, must ask.

All right, so do You have something to say?

The commission was kind enough to point out that this is not the proper way of gathering information. The manager must learn how to ask proper questions.

For example, imagine a guy being asked “Can You do it by September?” answering “Yes, sure!”.

What have You really asked him? And what has he really answered? Think about it using the skills You gained in Your psychology course.

In fact Your question was: “It is important to me to have it done by September. I am not sure, however, whether You are the right person to do it. Are You sure You are good enough?”. That guy’s answer was: “I would like You to be pleased, because I like working here. I think that if I work hard I have a chance to make that timeline, provided nothing surprising comes up, so I will say yes in order not to show that I am unqualified for my position.”

You, dear manager, must ask a control question. For example: “Fine, good. Can You please give me some hints about what exactly You will have to do? A few steps, so that I can get a grip on what resources You need.”

It is very hard to answer that kind of question without showing You that there is in fact a hell of a lot of work to be done, a hell of a lot of uncertainty, and that numerous safety margins will need to be cut to make the proposed deadline.

Summary

I think You already know my opinion. Do not rely on reporting. Do not insist on reporting unless You really, really need it. And if that is the case, make sure that Your people know that You do read their reports.

Actively gather information. Be investigative, and remember the freezing effect hierarchy has on people and the losses that happen when information passes through hierarchy levels.

Only then will You be able to make informed decisions.

Unless, of course, You need to report to Your own boss and You need something to stuff his ugly mug with. In that case, however, You don’t need hints about informed decision making but a thorough course in ass licking. Which, unfortunately, I cannot provide You with.

You are fired!

This blog entry will be about something I don’t like in my organization’s managers… well… honestly, it is hard to find anything I do like in them.

Anyway…

This week one of my colleagues came to me and complained about his boss. He said that the boss told him: “you can’t do programming well enough, so there will be no place for you anymore”. Which is, of course, natural when You hire a coder. Except that this colleague wasn’t hired as a coder. He was hired as an electronics designer who might do some minor coding jobs. Plus, he was never told by this boss to even try to learn coding and acquire the missing skills.

Not to mention that the company never ever attempted to provide any training. None whatsoever, and not only in the coding area, where he really needs it, but in other matters too. The company also refused to profit from his experience in production automation, PCB design using top-level tools and so on.

So, maybe, he will be fired.

In a very similar case some months ago, I mentioned to the company director that a certain guy involved in quality-critical operations on the production floor was not performing as well as he should. He simply has just a year or so left before he can retire, and he is no longer ambitious enough to be the best.

Note: from reading all my blog posts You might have already noticed that when I say “he isn’t doing a good job”, it may mean that he is keeping quite a high standard. My expectations are set, You know, absurdly high.

The director said: “Let’s fire him”. Which I opposed, and I decided never again to say anything to that director about personnel engagement. That old guy is practically the only person who knows how to tune that problematic product and how to fix production problems. Even I, who designed it, am not that capable. And certainly not that quick.

Firing them is a good solution

In an ideal world, where Your company is ideal and ideal workers are waiting in front of the factory to be hired, it is a good idea. You replace a broken tool with a new one, and You are happy.

Tools do keep breaking

I own a small Chinese milling machine. It is made of cast steel and cast alloys and in general looks quite robust. Of course, being Chinese and within my household budget for “toys for big boys”, its quality and design are inferior.

It is however advertised as capable of milling steel, so this is what I use it for.

Note: After buying this machine I will never ever again believe a single word written in a Chinese product specification.

Anyway, I did mill the steel.

Until the milling bit broke on the first pass.

So I got the next one, and a few passes later it also broke.

So I got the next one, reduced the milling parameters from a 1mm deep, 2mm high cut to 0.5 x 2mm… and it broke again.

Note: A small, clunky, communist-made milling machine from about 1970 can mill a 5 x 10mm cut in one pass using a 12mm milling bit without a problem.

I did not get the next one, because I was out of spares.

Now please think a bit about the following question: were my tools bad?

Bad tools, bad!

Those tools surely broke well below their declared milling parameters.

So I went to a colleague on the production floor and asked about higher-class milling bits. He told me that if HSS (High Speed Steel) chipped at those milling parameters, then better, harder, higher-class milling bits would just explode and scatter around. HSS is relatively soft, while better tools are crazily brittle.

Close inspection and testing of the machine showed that “stiff” and “rigid” are words which have never even been in its vicinity. Even though it looks like quite a stiff chunk of steel, it is in fact so badly designed that it flexes and bends at numerous critical points. And with a machine which bends You can’t do any serious job. It will vibrate, bounce off the milled stock and hit it hard when bouncing back. The overall stiffness of this machine is close to what I can manage with my own hand. Yep, I tried to get a grip on how large the milling forces are by holding the stock in my hand during 0.2 x 2mm milling, which was the practical limit of this machine.

Note: Throwing this machine away is outside my budget.

What did I do next? Did I throw away my bits and get better ones?

Fixing the machine

What I did was buy a 30kg slab of concrete, a bag of M50 cement and a 1kg bag of small steel nails with large heads. I bent those nails into a Z shape and mixed them into the concrete to get it “reinforced”. Then I used it to fill absolutely every empty space in the machine body. Almost everywhere I could reach. Then I bolted the machine to the concrete slab through some vibration dampeners, tightening everything through the wet concrete.

The result isn’t the best, but the machine certainly vibrates less and is visibly stiffer. Adding mass to some moving parts (even through dampers) softens resonances and moves them towards lower frequencies. Since with a four-blade milling bit at 5000…15000 rpm any resonance between 300…1000Hz is a killer, moving it towards 100Hz or lower makes things much better. On the downside, adding concrete also made the machine stiffer, which moved the resonant frequency up, but the greater stiffness lowered the amplitude of the oscillations.

In simple terms: I did fix the machine.

And the tools stopped breaking.

So… fire them?

Neither that boss nor the director stopped to think about why those employees do not perform at their best.

The electronics designer needs to be moved into electronics design. The company needs to provide him with better tools (we use PADS, which is total crap compared with Altium). The company needs to let him do his job and needs to realize that his experience is not our experience. The company needs to realize that this guy was hired because he brought experience from other companies, and not because he was a blank piece of paper waiting to be filled with our experience.

Instead, the expectations are that he will guess what he needs to learn, that he will learn it unpaid and using his own resources, and that his unique expertise will go unshared.

With that approach You would do equally well hiring cooks and cleaners as designers and coders.

The production floor employee, on the other hand, was under growing pressure from deadlines and the amount of production he had to turn out. There was zero reward for quality, zero positive stimulus for reporting problems and zero stimulus for self-development. Even worse, the more problems You reported, the worse You looked. If this guy tried to keep up with my expectations of quality and engagement, he would be yelled at.

He could also see that even though he will soon retire, the management did not care to have him pass his unique experience on to anyone. This is an especially emotionally painful message. It is as if they simply slapped him in the face and said: “Your entire life (he spent more than 30 years at this factory) is worth crap and we are happy to throw it away.”

Summary

God (or Satan, in my case) gave You a brain. Use that gift. Think.

I won’t say that there are no cases where employees are lazy, dumb, dangerous idiots.

Notice, I did not say: incompetent. Nor slow. Incompetence means a lack of training. Slowness means a lack of training, experience and tools. Lack of engagement means that You have some de-motivating procedures in action. Lack of creativity means that… and this… and that… and…

Think about it. In my case, my milling tools were good. It was the machine that was broken and needed modifications. Even though it looked good.

What about Your machine?

Listening to: “Report of The House Select Committee about the January 6th Attack on the United States Capitol”

Note: Thanks to librivox.org for providing an audio-book version of it! I really appreciate it.

In this blog entry I would like to give You the view of an outsider who had the pleasure of reading a report about how bad things happened in some far-away, possibly allied, possibly enemy country. You know, You never know. One day we are in the Warsaw Pact, the next day in NATO… who knows what may happen in the future?

Looking at the report as a book

Failure.

This is the second report I have listened to, after Obama’s “Digital Economy”, about which I can say: “Damn! Why did You screw it up so badly?!”. The “Deepwater Horizon”, “Three Mile Island” and even “Hindenburg” reports were really good books.

This one is bad.

I do appreciate the “multi-tier approach”, where in the first “tier” You write a very simplified text so that a reader may quickly decide: “this isn’t for me”. The second tier You write for those who would like to know what You figured out. And then a third tier for those who love to dig into minute details.

Three. Maybe four. But not, damn, five or six! Listening to this report I checked the display of my mp3 player two or three times to see whether I had pushed “rewind” by mistake. (A SanDisk Clip Sport; I do recommend it. Robust as hell. Ten years old and still good for 24 hours of continuous play.) You know, I suspected it because this mp3 player has buttons sticking out, so it is easy to press them without knowing. But no, I didn’t push it. It was just the next of far too many tiers in which only a few sentences were added.

How about the language?

Well… I did at least learn a new swear-word. Otherwise it was relatively easy to understand, even for a foreigner like myself. Except, maybe, those portions where they quote high officials saying: “Yeah, You know, Yeah… maybe, You know”. A kind of redneck talk to me. It makes one think: “Why did the committee do such a thing to those guys?”

Oh, and now I know who the heck “POTUS” is. Your love for confusing acronyms will last forever, won’t it?

Political view

The first thing seen right off the bat is that the committee was a party to a conflict. There is immense pressure to create the impression in the reader that the sitting president of the United States is a brazen liar. Sentences aimed exactly and precisely at that goal are poured onto the reader many, many times before any attempt is even made to supply some actual proof. A clear political goal.

All right, so they said that Donald Trump is a liar. Of course they also proved it. By, mind you, saying that the FBI did not find anything. This is usually a good proof. Except in a case where the lie the president is telling is more or less about the FBI not trying hard enough to find something.

Nevertheless, he lied. Mostly in the same style he lied in during the 2016 election, so it should have been expected.

Petty drama

“An Attack on Capitol!”

Yeah…

In a country full of automatic weapons, where You can even find guys owning tanks… the rioters used tear gas, clubs and tasers.

Great dramatic attack.

From my point of view: a peaceful demonstration which got a bit out of control. Even the extremists did not attack seriously. How hard would it be, in this turmoil, to find a sharpshooter with a .50 BMG rifle? How hard would it be to bring not tear gas canisters but hand grenades?

One thing that is true is the fear for their own asses. It is clearly seen in every word of the committee report.

Of course the Police and security forces might have suspected that some grenades could come into play. They might have been afraid. Yet it was they who shot to kill first. Fear creates violence, violence creates fear.

We are elite, You are mob!

This is the next impression which is clear. The crowd was the mob; the representatives at the Capitol were the elite. The life of the mob may be spared, but the life of the elite must be protected.

The elite are right; the mob is stupid and needs to be investigated and harnessed.

Of course I really do appreciate the way of thinking that says: “You may say whatever You like. What You do is what counts”. This is a great rule. Yet the recommendations of the committee suggest that it may be better if it were not like that any more.

Of course I was born a communist, so I am a bit sensitive to such things. I know how thin the border is between “for the greater good” and a government-sponsored excursion to the vast, cold plains of eastern Russia.

The president is calling…

Yet if You respond to the call, You commit a crime.

This was something I couldn’t understand. Donald Trump was still the president of the United States. Donald Trump publicly, and using official channels, requested support against traitors destroying the country.

What more could he have done to make it an official order?

Maybe I am wrong, but isn’t every citizen of the United States bound to respond to government commands? Especially in a time of obvious, dramatic crisis?

The courts of justice which convicted the rioters did great harm to the country. They undermined trust in the government and clearly communicated that even if You see the Supreme Commander on TV saying: “Kill all Martians!”, You should first get to the local library and dig through some cubic meters of law books and court cases before You act. For Your own safety.

You don’t know how to cheat

The most hilarious thing, however, was how Donald Trump lied.

Honestly?

My local politicians can do it way, way, way (*1000) better.

There are three simple rules when You are doing something illegal:

  1. If You are doing it, You are not talking about it.
  2. If You have to talk about it, You never let anything be written down or recorded.
  3. If You let it be written down… then why are You surprised?

Not only did he try to sabotage the election system. He also talked about illegal actions. Over logged White House phone lines. Through emails. In the Oval Office.

Even the near-zero-IQ politicians in my country don’t behave that stupidly.

What should he have done?

First, if You claim that voting machines were hacked, You should make sure they were hacked. And I am not talking about finding solid proof. I am talking about hiring some guys to hack one of them. One harmless example is all You need.

Likewise, You claim someone smuggled in fake votes? Make sure that it really happened.

Of course, no talking over the office phone, nothing in writing.

Business arrogance

For us, in my country, most US companies are arrogant beyond anything. White-gloved, grinning killers. Strength is everything, possession is 99% of the law, and let’s meet in court if You don’t like it.

And there is no truth, everything we say is marketing.

Sorry if You feel offended, dear reader. This is how I feel.

My impression of Donald Trump is clear: he is a true United States businessman down to the bone. Most of his practices are the practices of a ruling CEO who owns everybody, and if You don’t like it, then… say what: You and what army?

Except, which is also kind of strange, this approach seems not to work if You are a part of the government.

But You have great bureaucrats

This time “great” means great. Not yeah, “great”… Just great.

This is something that would make my country’s politicians fail in the United States. Your bureaucrats act by the law. Period. Nothing more, nothing less. They take pride in doing so and can stand up and say “No!” if something against it has to be done. In my country the law is… well… a suggestion, maybe? A tool? It is used when it is convenient, or against people who are nobodies.

By the way, in my country the vice president would not think about whether it is legal for him to open additional votes or not. Legal or illegal would be a secondary issue. What would be important is the profitable mess he could create. Make a turmoil, put up a smoke screen and do things behind it. Like my politicians did by aborting a parliament session and moving it to a place nobody knew about, except those who knew what to vote for.

The vice president of the United States did not think about whether he could profit from the turmoil. He seriously considered not only whether it was prohibited by the law. Not only whether it was allowed by the law. No, he considered what the law was binding him to do.

Great. I’m jealous.

And You have people who believe in democracy

There is an absolute and stunning lack of reflection in the committee report on why there were no grenades.

Think about it.

Why did people who believed that the election was stolen not spill blood?

Then think about one more thing which left me really stunned.

The fact that those people could be made to attack the Capitol is something I can believe in. It does, once again, convince me that U.S. citizens are, en masse… well… a bit lacking in the brain department, but I can believe that with the proper words and the proper atmosphere one can make people riot.

But how the hell is it possible that they went home just because Donald said so!?

Unbelievable.

Except if You assume that they really believed they were acting within the bounds of more or less democratic rules, and that democracy still works.

Amazing.

And yet the courts sent such great people to prison…

Summary

Download the report and read it. Or, if You like jogging or walking, listen to it. It is worth doing, especially if You are not a United States citizen. It can be a real eye-opener. An insight into how Americans think and why we have such a great deal of trouble understanding each other.

RtOS + cooperative interrupts… Ehmm… say what?!

Throughout this series of posts You might have read what I think about RtOS, especially about its cooperative variant.

Of course it is nothing new. Multi-tasking operating systems have been on the market for ages, and the oldest of them were cooperative ones.

Even though for a lot of embedded programmers it is something new. And even if it is not, they almost always go for something of the “FreeRTOS” caliber and then complain that it is not a feather-light piece of software.

Costs and benefits, costs and benefits. You always need to find a perfect balance.

In my opinion the biggest benefit You get from an RtOS, and especially from a cooperative one, is clean, linear code.

Linear code, where the lexical text flow and the actual execution flow are the same, is very easy to understand and maintain.

The more complex Your system is, the more You may benefit from moving from plain “state based” coding to an RtOS-like approach.

Interrupts

One part of RtOS which wasn’t covered yet is how to employ it to very efficiently simplify the way You code interrupts.

But first some background.

Interrupt driven state machine

I now have to do some cleanup in a certain interrupt-driven state machine. It is written quite well. Not very well, but acceptably. That is, by high school standards.

This machine is a nasty beast. It has about twenty states and is driven by four interrupts:

  1. UART character received interrupt;
  2. UART output shift buffer empty;
  3. DMA transfer to UART transmitter register completed;
  4. Fast timeout timer interrupt;

Can You see how to code it so that it will be easy to write, easy to fix and easy to understand?

Currently it is coded in a quite well thought out manner. There is one state variable, one source file for each interrupt, plus one file which describes the enter-to-state actions. If You could open all five files and put them side by side in columns, You would get a table like this:

entry state        UART Rx          UART Tx          DMA              Tmr
-----------        -------          -------          ---              ---
state: byte        org RX_vector    org TX_vector    org DMA_vector   org TMR_vector
                    add state, PC    add state, PC    add state, PC    add state, PC
                    jmp state_0      jmp state_0      jmp state_0      jmp state_0
                    jmp state_1      jmp state_1      jmp state_1      jmp state_1
                    jmp state_2      jmp state_2      jmp state_2      jmp state_2
                    ...              ...              ...              ...
                    jmp state_20     jmp state_20     jmp state_20     jmp state_20

enter_state_0:     state_0:         state_0:         state_0:         state_0:
 ...some code       ...some code     alike at the     prohibited       etc.
 state=0            possibly         left, but for    interrupt
 reti               jmp enter_...    this interrupt   jmp $
                    reti

enter_state_1:     state_1:         state_1:         state_1:         state_1:
 ...some code       etc.             etc.             etc.             etc.
 state=1
 reti

Of course the cryptic state_x labels were not that cryptic in reality, and real actions were taken. You should notice that each row in this table describes the actions for one single state, and each column contains the actions triggered by one of the interrupts. If an interrupt is not allowed in a state, there is a suicidal jmp $ which will loop until the watchdog flips over the CPU.

Each time You see:

 jmp #enter_state_x

you know that it tells You the row in which You should look for the handlers of the "what happens next" interrupts.

And, of course, the add state, PC is a so-called computed goto.

Note: Your real CPU may need slightly different code. For example, state may have to be 0,2,4,6… instead of 0,1,2…, and so on.
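
If You want to picture the same layout in C, here is a minimal sketch with invented names, showing only three states where the real machine had about twenty. Each interrupt gets its own table of per-state handlers, and the ISR plays the role of the add state, PC computed goto:

#include <stdint.h>

#define STATES 3                     /* the real machine had about 21 */

static volatile uint8_t state;       /* the single shared state variable */

/* Actions of the RX interrupt in each state. */
static void rx_state_0(void) { /* ...some code, possibly state = 1; */ }
static void rx_state_1(void) { /* ...some code... */ }
static void rx_prohibited(void) { for (;;); /* spin until the watchdog bites */ }

/* One column of the table: RX handlers, one entry per state (row). */
static void (* const rx_handlers[STATES])(void) = {
    rx_state_0,
    rx_state_1,
    rx_prohibited                    /* RX not allowed in state 2 */
};

/* Hook this to the RX vector. */
void uart_rx_isr(void)
{
    rx_handlers[state]();            /* the computed goto, C style */
}

The TX, DMA and timer vectors would get their own tables: the tables are the columns of the table above, and the array index is the row.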

I like it quite a lot. Especially because the row-column approach clearly shows which interrupt should be enabled in which state.

Yet it is tempting to screw it up. For example, one may add some sub-state flags or variables, or even try the deadly idea of creating invisible sub-states by enabling the Tx interrupt, then in the Tx interrupt enabling the DMA interrupt, and so on. Or even use call #enter_state_x and do something state specific after the return, adding confusion about whether You should actually enter that state or stay in the current one.

Never ever do any of the above. It is hard enough to read already.

It is still damn complex

This form of representing a state machine is good. And it is fast: an interrupt is practically vectored right to the spot. Yet if You try reading it, it will give You a headache. Scrolling all the sources in parallel, when each state is handled by a different number of lines of code, isn't very easy. And there is a lot of scrolling back and forth to get a grip on what is being done.

Even if Your coder resisted the temptation to screw it up with any of the above mentioned tricks.

And even if he or she was so nice as to draw the algorithm in Inkscape.

It is still a hell of a lot of hard work to follow what is going on.

RtOS approach

One may easily notice that an interrupt driven state machine is, when drawn on paper, a sequence of blocks like that:

  arm some hardware
  enable some interrupts
  return to main thread
  wait for any interrupt
  if some interrupt
  {
     arm some hardware
     enable some interrupts
     return to main thread
     wait for any interrupt
     if some interrupt
     {
        ...
     }
     ...
  }
  if other interrupt
  {
     ...
  }

Are they not very much alike:

   arm hardware
   TaskTable[current_task].event_mask |=(1<<E) ;set bit E in event_mask
   call #yield
   if (TaskTable[current_task].event_flags)
   {
    ....

Note: If You are curious, this is a piece of code from the waitfor(…) implementation of our cooperative RtOS.

And yes, they are very much alike indeed. In fact both represent the same idea: "do something, prepare to wait for something to happen in the background, give up the CPU for other tasks, and let the magic get back to You when something happens."

In the case of a cooperative RtOS task, the "in the background" part is interrupts or other tasks, while the "magic" is the scheduler.

In the case of interrupts, the "in the background" part is the hardware, and the "magic" is the hardware interrupt controller.

Use preemptive RtOS!

Well… yes and no. "Yes", because it could do it, and "no", because a preemptive scheduler is hellishly heavy. We would lose all the benefits of interrupts.

Plus, what about the race condition between the scheduled task and the next enabled interrupt? An interrupt may interrupt the task, You know. What then?

Use cooperative RtOS!

Yes.

But not directly, of course. Use the idea of it.

Do You remember how an RtOS switches tasks?

It takes the task and makes a call to the scheduler routine. This call puts the return address on the task stack. Then the scheduler switches the stack to the stack of the next task and performs a return from subroutine.

Like in the table below:

task A            hardware          scheduler              task B
------            --------          ---------              ------
call #yield
                  push #next_A
                  jmp yield
                                    yield:
                                    TaskTable[A].SP=SP
                                    SP=TaskTable[B].SP
                                    ret
                  pop PC
                                                           next_B:
                                                           ....
next_A: ...       ;A resumes here when something switches back to it

What does the interrupt do?

It takes the stack of the current primary task and injects a call to the interrupt vector. This injected hardware call pushes the return address and the state on the stack and jumps to the interrupt vector:

 disable interrupts
 push PC
 push Status
 jmp #interrupt_vector

Hmm… so we are halfway done. We are inside an interrupt scheduler. This scheduler is in fact not a scheduler at all. As opposed to a regular RtOS scheduler, it doesn't need to choose which task to run. It knows it a priori, because a specific task is bound to that interrupt.

So all that is left to do is to switch the stack from the primary stack to the interrupt task stack and do the return from subroutine.

SP_primary=SP ;a place-holder common for all interrupt tasks.
              ;even if we have many interrupt tasks, we assume
              ;they can't preempt each other, so one variable
              ;for each level of interrupts is fine.
SP=TaskTable[interrupt].SP
ret

Now we are inside the task, but also inside the interrupt. This means that the interrupt enable flag of the CPU is cleared (at least for the same level of interrupts) and we are running interrupt code.

All right, so we have done our job. Now what?

What do we have to do to return the CPU to the primary thread in such a way that the next interrupt will continue from where we yielded?

call #yield_interrupt

where

yield_interrupt:
TaskTable[interrupt].SP=SP
SP=SP_primary
reti  

where reti is a machine operation for:

 pop Status
 pop PC
 enable interrupts

Faster, faster, faster!

We now know that we can make an RtOS task which lives inside an interrupt, and write it like this:

  loop:
   prepare some interrupts
   call #yield_interrupt
   if (interrupt_A)
   {
     ...
     call #yield_interrupt

It is quite clean but, unfortunately, contains the nasty if (interrupt_A). This code is a pure waste of time. We used a specific interrupt to wake the task, and now we are checking it again! Nonsense.

So maybe we could do:

   yield_interrupt(when_A, when_B)
   when_A: 
      ...
   when_B:
      ...

and somehow create the "magic" which makes the interrupt enter the task at when_A or when_B, as appropriate?

Multi-return address subroutine

We will use a trick which is not possible in C or other C-like languages: a subroutine which is called with many conditional return addresses. That is, for example, instead of returning 0 or 1 it will return to place "A" or to place "B".

Assume we have a fixed number of possible interrupts used by our task, in this case two. The yield_interrupt(when_A, when_B) is then:

  push #when_A
  push #when_B
  jmp #yield_interrupt

and each of the interrupts now enters the task in a slightly different way:

interrupt A                        interrupt B
-----------                        -----------
SP_primary=SP                      SP_primary=SP
SP=TaskTable[interrupt].SP         SP=TaskTable[interrupt].SP
SP=SP-2  ;drop two levels          SP=SP-2  ;drop two levels
jmp SP+2 ;jmp to what was          jmp SP+1 ;jmp to what was
         ;first on stack                    ;second on stack

The interrupt now switches the stack, but instead of just doing ret it drops both return addresses from the stack by moving the stack pointer, as if

  pop
  pop

would do. Then, assuming the stack was not touched, it jumps to either what was first on the stack or what was second.

Note: The assumption that the stack is not touched holds only if interrupts are disabled.
If interrupts were not disabled, or higher level interrupts were possible, an intermediate register would have to be used, like this:

 X = SP[0  or  1]
 SP=SP-2
 jmp X

Easy? Sure.

Hey, I would like to have it in primary RtOS thread too!

Well, I wouldn't.

The multiple return address subroutine in the task switch requires that each and every #yield_interrupt takes the same number of return addresses. With interrupts this price is worth paying, since vectoring in is much faster and we usually drive our state machine with just a few interrupts. With primary tasks, with 8 to 16 events, arming each #yield with 16 return addresses would be a pain in the behind. Especially if You compare the code size and stack expense, plus the eventual speed profit, with the fact that primary tasks are switched with huge delays anyway, caused by the cooperative approach.

Kicking it from primary threads

We now know that adding a cooperative interrupt task to our RtOS is a piece of cake. It will cost us just a few lines of code, and we can change a nasty state machine into a nice looking, easy to read, linear piece of code.

Now we have only one thing more to do.

What if we need to wake this interrupt task from the main thread?

Fake interrupt

For that we need to let our code "fake" an interrupt. In fact we need not to fake an existing one, but to create a completely new, "fake" interrupt. We will do it much like we did in our regular RtOS, by utilizing the task table:

 TaskTable[?].event_mask
 TaskTable[?].event_flag

What changes is that we now need a dedicated subroutine to signal the event from the primary thread:

signal(event,interrupt_task)
 TaskTable[interrupt_task].event_flag|=event
 if ( TaskTable[interrupt_task].event_mask & event )
 { 
   disable interrupts
   push #_ret_ptr
   push status
   jmp TaskTable[interrupt_task].fake_interrupt_vector
_ret_ptr:
 }

The fake_interrupt_vector, of course, looks exactly like a normal, real interrupt handling routine. Depending on the context we may even have it hard-coded right in here, for example if we have just one interrupt task.

The first line is exactly the same as if we were signaling a regular event for a regular task. The remaining portion does exactly what would have happened if the hardware had picked up the interrupt.

The placement of disable interrupts depends on how atomic the if can be. If it is a non-interruptible sequence, I don't think we need any additional protection.

This is however not all. We need some mechanism which will pick up the event later, if the event_mask is blocking immediate execution. Exactly the way a hardware interrupt works, and the way the RtOS must work to be usable at all.

All right, but what can actually change the event_mask? For the event_mask to be of any use, the only code which may touch it is the task itself. This means that this mask may be altered only during some interrupt. It may never, ever happen in the primary thread.

Just like in the regular RtOS, we change the event mask and call #yield. The scheduler will then pick other tasks, but sooner or later it will pick us again if the event_flag is set.

With an interrupt task the scheduler is the hardware, and there is no "round-robin" with interrupts. If it is raised, it is handled. We don't need to check other interrupts. We don't need to check primary threads. It is enough to do:

yield_interrupt(when_A,when_B,when_event)
  if (event_flag & event_mask)
     jmp #when_event
  else
  { 
   push #when_A
   push #when_B
   push #when_event
   jmp #yield_interrupt
  }

This code assumes that when_event has the highest priority.

Note: In our code we assume that the task being woken by an interrupt neither clears the request flag nor disables the interrupt. It will be handled again unless explicitly cleared in code. This is symmetric with what I assumed for RtOS events. I like symmetry.

Limitations

One: Sanity.

Second: All interrupts driving the task have to be at the same priority level. They can't interrupt each other, or the linear control flow won't be possible.

Third: Multiple return address subroutines are possible only in assembler. I don't know of any higher level language which allows them.

Summary

Ok, it was looooong.

What should You know now?

That it is fairly possible, and in fact easy, to have a cooperative task switch with interrupts. And that You can write a task which runs on multiple interrupts almost exactly the same way You write a regular RtOS task. What differs are only the names of the RtOS functions and the multiple return address subroutines.

Is it worth changing the five-file table approach into one single linear flow, at the cost of some head scratching when implementing this concept?

For me: yes. Worth every penny.

Coding “by contract”

…and I don’t mean “outsourcing”.

I am going to talk about a certain technique of writing programs which I have found very useful over all my years.

Your code

For the purpose of this discussion I assume that You are not writing Your application as one huge, flat block of thousands of lines. Instead I assume that You have divided it into blocks which have to do "something".

I intentionally put "something" in quotes, because at that stage You most probably have only a vague idea of what it should do. For example, let's say: provide a daily log in the non-volatile memory of Your embedded system.

Getting what You do know

Of course the first thing You have to do is to gather some knowledge about what You are certain of, and to what degree. For example, in this case You know that:

  1. You need to log some events.
  2. Events are some data with a time-stamp.
  3. You will have on board some non-volatile memory of limited capacity.

Getting what You do not know

From what I wrote above it should be clear that You are neither sure what kind of non-volatile memory You will have, nor what exactly the events are.

Rule and divide

Let us make some graphics and take a look at what You know and what You don't know:

Coding by contract. Implementation, contract and suppliers.

As You can see, to do Your job You need to ask Yourself three questions:

  1. How will my code be used in the application?
  2. What services will my code need?
  3. How to implement what it offers with services it needs?

Contract

The "contract" is the portion of code which tells how Your code will be used in the application.

That is the API.

If You are lucky and are using some high level programming language like Java or C++, then object oriented programming will help You, since it has dedicated tools for API definition. In Java they are interfaces, in C++ pure virtual classes. Since I love Java, I will stick with it in this blog entry.

Of course if You use C or even assembler You are not left alone. You will see why later.

So let us try the API:

public interface ILog
{
  public void addLogElement(IEvent event);
}
public interface IEvent
{
  public Date getDate();
  public byte [] getBinaryImage();
}

Nothing fancy, is it?

Except that we are not done yet. This is a very, very sketchy API. Because it doesn't have comments.

The importance of comments

The API will serve a dual purpose:

  1. It will tell Your colleagues how to use the code.
  2. It will tell You what exactly You have to implement.

Any lack of precision or accuracy will make Your job hard. First, because Your users will have to guess, or discover what exactly is done by trial and error. Or, what is even worse, by looking at the implementing code.

This is bad. Very bad. Not only because it may create misunderstandings, but also because it will cast Your implementation in stone. Since nobody will know what You meant to do, they will assume that what You did is what was meant to be done. Which, of course, it won't be, because You are just a human and Your code will be full of mistakes.

The next problem with imprecise comments is that, since the exact details of what the code does are not clear, they need to be discovered by trying it out. And You can't try out something that doesn't exist. In effect, Your users will have to wait until You complete Your job. Their work and Yours cannot be done in parallel. Including testing, which You can't delegate to others.

If the job cannot be done in parallel then, of course, the idea behind Your API is not tested before You polish it. This may produce expensive re-runs of Your entire work. If however Your API is clearly specified, then Your users may try to write their code around it and discover holes in it. All before You actually invest a lot of work in it.

In fact those comments are a kind of formal specification.

So let us add a bit of accuracy, just to show You an example.

/** An event to be put into a log.
 Instances of this class are immutable. */
public interface IEvent
{
  /** Date at which it was created.
  @return never null, life time constant */
  public Date getDate();
  /** Transforms an event to a binary form for persistent storage.
  @return never null, may be empty. Each call returns newly allocated
   array. The binary form doesn't contain {@link #getDate}. The maximum
   size of this array may not exceed 64 bytes. */
  public byte [] getBinaryImage();
}

Notice that many facts have now become clearer. For example, we now know that IEvent behaves like String – once created, it doesn't change. Thus it is also thread safe, fast, and doesn't need any synchronization code when passed from thread to thread. We also know that we don't have to care about who owns it, as nobody may change it.

Note: In C/C++ "immutable" doesn't mean "I don't care who the owner is". If You pass anything by a pointer or by a reference, You absolutely must specify whose job it is to delete the object, whether that object lives on the stack or on the heap, and so on. This is why I prefer Java. Drop it and don't care, the garbage collector will do the job for You. And, by the way, "smart pointers" are not a silver bullet. They leak like hell.

Then we know that the returned byte [] array is not just some array, but a kind of serialized form of an object. We also know that we may modify it if we need to, because it is always provided anew, and that it all must fit in 64 bytes.

We also know that we don't know how to turn that byte array back into an event. Should we add something to the API? Or should we not? Notice that until we wrote those comments, we were not aware of it. But now, when we had to think about turning an object into bits and bytes, it occurred to us that we need to know how to turn it back. That is, of course, if we are ever going to do it.

Let us decide not to extend the API, so as not to make the example too messy.

Supplier

Already having the IEvent, we may sketch in our mind how to implement the ILog.

Hmm… but we do not know what kind of non-volatile memory we have!

This is the moment when we need to define the “supplier”.

A supplier is a set of services which our implementation will need in order to provide users with the API.

So let us define a "supplier" as a contract, as follows:

/** A non-volatile memory supplier for logging purposes */
public interface INonVolatileMemory
{
  /** Writes record to non-volatile memory. 
  @param time_stamp non null, should be stored in memory with down to 1 second accuracy.
  @param record non null, up to 64 bytes. The content of this array is valid only during
         the duration of this call and may be changed after this call returns
  @return false if failed to fit record in memory */
  public boolean writeRecord(Date time_stamp, byte [] record);
}

You may now see that with this simple contract we decided how much of the actual job we moved into this contract and what is left in ILog. Notice that we also decided that when the memory is full, then it is full and no longer capable of registering more events. Is that wise? Should it be like that? Thanks to commenting, we have a chance to think about it before even a single line of code is written.

Implementation

We can now do the implementation using a pattern like this:

public class CLog implements ILog
{
       ....
   public CLog(INonVolatileMemory nv_memory){....
}

This kind of implementation is a "pluggable" one. I call it that because we "plug" the supplier into it and use that supplier to provide the "contract". The implementation doesn't really care what the "supplier" actually is.
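
And, as promised, if You use C You are not left alone. Below is a minimal sketch of the same pluggable pattern in plain C – all the names here are invented to mirror the Java interfaces above, they are not any existing API. The contract and the supplier become structs of function pointers, and the supplier is "plugged in" through an init function:

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>
#include <time.h>

/* The "supplier" contract: some non-volatile memory, whatever it really is. */
typedef struct
{
    void *ctx;                          /* implementation specific state */
    bool (*write_record)(void *ctx, time_t time_stamp,
                         const uint8_t *record, size_t length);
} INonVolatileMemory;

/* The data side of the "contract": an event to be logged. */
typedef struct
{
    time_t date;
    const uint8_t *image;               /* binary form, up to 64 bytes */
    size_t image_length;
} IEvent;

/* The implementation with a plugged-in supplier. */
typedef struct
{
    INonVolatileMemory nv_memory;
} CLog;

void CLog_init(CLog *self, INonVolatileMemory nv_memory)
{
    self->nv_memory = nv_memory;        /* "plug in" the supplier */
}

bool CLog_addLogElement(CLog *self, const IEvent *event)
{
    /* Delegate the actual storage to whatever supplier was plugged in. */
    return self->nv_memory.write_record(self->nv_memory.ctx,
                                        event->date,
                                        event->image,
                                        event->image_length);
}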

Interfaces suck!

…and pure virtual classes too.

They are slow. An interface invocation, if compiled as such without any call-site optimization, is an order of magnitude slower than a normal virtual method invocation. Not to mention private static ones, which are the fastest possible in Java. In C++ it won't be slow only if the pure virtual class is the sole base of the implementing class. Which it won't be in 99% of cases.

So maybe there is a different way?

Sure there is.

They are the so-called "abstract classes".

“Abstract” instead of “interface”

The "abstract" class is an actual implementing class out of which we cut the portions responsible for the "supplier".

Like this:

public abstract class ALog implements ILog
{
       ....
   protected abstract boolean writeRecord(Date time_stamp, byte [] record);
   public void addLogElement(IEvent event){....
}

It is up to our chosen style whether we leave "implements ILog" in or not. There is no cost in leaving it here, since uses like:

 ALog x = new ...
 x.addLogElement(...

won’t be using expensive interface invocations.

I strongly recommend You leave the ILog in place, because without it, it will be hard for Your users to start coding before You finish Your ALog class.

Of course the next step is to actually implement the non-volatile contract, like this:

public class CNvRAMLog extends ALog
{
  public CNvRAMLog(.....
}
public class CFileLog extends ALog
{
  public CFileLog(.....
}

You have to do it the same way for each kind of non-volatile memory service You will use.

Downsides of “abstract”

The "abstract" approach is faster, and usually simpler to code and understand, than the "pluggable" approach. The downside is that once You decide to make it concrete (i.e. in CNvRAMLog) You can't easily add functionality which is independent of the nv-memory implementation but is not common to every implementation.

For example, we may decide to let some implementations do some accounting:

public interface IAccountingLog extends ILog
{
  public int freeRecordsLeft();
}

With "abstract" we must either add this to ALog, which is the lowest common denominator, or add the implementing code to each extending class which needs it:

public class CNvRAMLog extends ALog implements IAccountingLog
{
  public CNvRAMLog(.....
  @Override public int freeRecordsLeft(){...
}

With "pluggable" we can:

public class CAccountingLog extends CLog implements IAccountingLog
{
       ....
  public CAccountingLog(INonVolatileMemory nv_memory){....
  @Override public int freeRecordsLeft(){...
}

and have the service added in all the places we like, regardless of any non-volatile memory provider we add in the future.

Note: Of course this example is a bit twisted, and You may easily point out that it isn't exactly a good example. And You will be right. If You really need to look at the difference between "abstract" and "contract", take a look at java.io.InputStream, which is abstract, and java.util.Iterator, which is a contract. Then try to make Your existing class be an InputStream and be an Iterator. You can do the second with ease, while the first needs much more head scratching.

Testing

The "supplier" and "contract" allow easy testing. You may imagine that Your users may test their code by supplying a fake (mock-up) implementation of ILog. Likewise, Your tester may write tests against the ILog contract and dry-run those tests on some fake implementation. Even more, the tester may actually test Your CLog over a fake INonVolatileMemory.

The possibilities are endless and easy.

Note: There is of course an alternative. These are "mocks". Like Mockito, which, by sophisticated manipulation of JVM hooks, may "fake" existing code and turn it into something else on the fly. There are similar libraries for C/C++ as well. But in my opinion they are hacks. It is much easier and cleaner to use the contract/supplier pattern than to play with "mocking".
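
For illustration, sticking to the hypothetical plain C rendition sketched earlier: a fake supplier for tests can simply record everything into RAM, so the test may inspect what the log implementation tried to write.

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* The same invented supplier contract as in the earlier C sketch. */
typedef struct
{
    void *ctx;
    bool (*write_record)(void *ctx, time_t time_stamp,
                         const uint8_t *record, size_t length);
} INonVolatileMemory;

/* The fake: a RAM buffer capturing the last record for inspection. */
typedef struct
{
    time_t  last_time_stamp;
    uint8_t last_record[64];
    size_t  last_length;
    int     write_count;
} FakeNvMemory;

static bool fake_write_record(void *ctx, time_t time_stamp,
                              const uint8_t *record, size_t length)
{
    FakeNvMemory *fake = (FakeNvMemory *)ctx;
    if (length > sizeof(fake->last_record)) return false;
    fake->last_time_stamp = time_stamp;
    memcpy(fake->last_record, record, length);
    fake->last_length = length;
    fake->write_count++;
    return true;
}

/* Test setup: plug the fake in instead of the real hardware. */
static FakeNvMemory fake;
static const INonVolatileMemory fake_supplier = { &fake, fake_write_record };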

Bridging & intercepting

Of course, once You have an interface for each "contract" at each level, nothing prevents You from, for example:

  • Implementing some contract over another contract. For example, You may implement our ILog directly over Java serialization and completely ignore the non-volatile memory supplier. This is a kind of "bridge" from one realm to another.
  • Providing a compatibility layer. For example, when at a certain moment You decide that ILog wasn't the best idea, You may create ILogExt, which will be totally different, and implement ILog over it. This way You may not only delay re-writing any code depending on ILog and save money, but also re-use all the ILog testing code to test, to some extent, the new ILogExt.
  • Creating a "wrapper" which will intercept all calls to ILog and pass them down to some other implementation, doing something in between. For example, it may redirect events not just to one log, but to a primary one and a backup one. All without Your user code knowing about it.

Summary

After reading this blog entry You should know what a "contract" is and how to implement it with "suppliers". You should know why it is worth doing and what patterns to follow.

And finally, that the API is neither the "interface" nor the "virtual class" nor a bunch of functions, but the comments. Yes, the comments.

Yes, the comments.

Yes, the comments.

Yes, the comments.

Did I make myself clear?

The comments.

Killing Your flash/fram based CPU with unguarded write

I am now in the process of inspecting some youngster's code for an FRAM based MCU, which does some writes to non-volatile code memory to preserve some data. It doesn't look bad… except it also doesn't look superior. Motivated by it, I will try to show You what You should think about when You need to write to permanent non-volatile program memory.

Suicidal CPU

Yep. This is what I intended to write: a CPU which can kill itself. Yes, they are on the market.

What am I talking about?

About any CPU/MCU equipped with non-volatile, self-writable program memory. Be it FLASH, EEPROM, MRAM, FRAM or whatever, it doesn't matter. If a program can write into persistent code memory, then this program can destroy itself permanently.

So called “in-field upgradeable” solutions. Nowadays – almost everything falls into this category.

Killer CPU

What bad things may happen when the CPU program controlling some equipment is permanently altered? Let us iterate over some possibilities:

  1. Some device function may not work as expected.
  2. Device may restart, hang or crash in conditions in which it previously worked fine.
  3. Device may continuously crash, hang or restart.
  4. Device may fail to start.
  5. Device may continuously insist on opening a valve releasing a poisonous gas into air ducts of a public hospital.
  6. Device may continuously insist on nuking the Earth.

Of course the last two items require a lot of other conditions to happen, but who knows? Our trust in engineers is based on the assumption that they know what they are doing, yet, well… somehow I felt obliged to write this post, right?

You may argue that those may also happen due to just a random fluke. True. However, with a random failure the CPU will be restored to normal operation by a reset and will not try to do the killing anymore. If however the code is altered, the device will repeat the action after a reset. Thus all protection systems will be stressed continuously and the risk of a deadly failure will rise high.

When may it be killed?

In this blog entry I will speak only about one way of killing the program: the unplanned write to permanent program memory. All other methods are out of my concern today.

Writing to non-volatile memory

Plenty of simple FLASH based MCUs have only elementary write protection. You need to toggle some bit in some register with a special instruction. Like, for example:

   FLASHCTL = PASSWORD+WREN

The MSP430, which will serve as the example here, is designed in such a way. It has some minor protection – not every CPU operation may toggle this bit – but it is not a true barrier.
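
For reference, on the flash based MSP430 families the real sequence looks roughly like this. This is a sketch from memory – the register names come from msp430.h, but check Your family user's guide before trusting any of it:

#include <msp430.h>

/* Write one word to flash. The segment must have been erased beforehand,
   and the flash timing generator (FCTL2) must already be configured. */
void flash_write_word(unsigned int *address, unsigned int value)
{
    FCTL3 = FWKEY;              /* clear LOCK: writes unlocked           */
    FCTL1 = FWKEY + WRT;        /* enable write mode                     */
    *address = value;           /* the write; the CPU is held while busy */
    FCTL1 = FWKEY;              /* disable write mode                    */
    FCTL3 = FWKEY + LOCK;       /* set LOCK again                        */
}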

Most MCUs will then trigger the non-volatile memory write by a simple write to the memory address which is to be changed. Like, for example, in the following sequence:

  FLASHCTL = PASSWORD+WREN
   [#application_settings] = new settings
  FLASHCTL = PASSWORD

This sequence enables writes to non-volatile permanent memory, performs the write, and disables the ability to write. In this example the application_settings address is a constant hard coded in a machine instruction, but nothing prevents You from using indirect addressing and writing a subroutine like:

flash_update(address, value)
{
  FLASHCTL = PASSWORD+WREN
   [address] = value
  FLASHCTL = PASSWORD
}

When is it safe?

Never. A CPU may always go haywire and execute the killing sequence. The CPU failure modes may be:

  • instruction is modified, so it does something different;
  • data are modified;
  • program counter register is altered and a forced random-like branch is executed.

All three cases may be triggered by an external pulse, be it an electric field, a high energy particle or anything else. The probability, however, that any of them will trigger such an event without frying the CPU totally is low. Even the most point-accurate of them, the high energy particle, will play havoc across multiple sub-micrometer scale transistors.

The modification of data, or a branch to an unpredictable location, may however also be triggered by a human error. You, the coder, may by mistake write where You shouldn't write, or branch where You shouldn't branch.

Of course I assume You were not such an idiot as to place FLASHCTL = PASSWORD+WREN in the boot routine, right? If You were, then… well… someone is getting a hot iron in the behind if I catch them.

So let us assume that You were sane and You have some code like the above somewhere. Then Your CPU isn't safe. There is always a chance that some random thing will make it branch into that routine and boom! Code memory gets written with garbage.

Writes to hard-coded addresses

Now assume that the only non-volatile memory updates You have in Your code are like this:

update_settings(new_settings)
{
  FLASHCTL = PASSWORD+WREN
   [#application_settings] = new_settings
  FLASHCTL = PASSWORD
}

Let us now take a look at what may happen if, by accident, the CPU branches into a specific place in code memory:

Memory layout with effect zones for brute branch-in failure.

Safe zone

The green zone is a "safe zone". If the CPU jumps into this zone, non-volatile memory writes are disabled and any accidental write does no harm.

Note: Most MCUs will trigger a "Memory Violation" non-maskable interrupt or even restart the CPU right on the spot.

Especially safe is the "unused code memory" zone, of course only if it is filled with code which will make the device die gracefully. Usually any looping code will cause the watchdog timer to kick in and restart the system with minimum harm done.

Minor failure zone

The yellow zone is a "minor direct failure" zone. If the CPU branches into such a zone, the write to non-volatile memory will happen, since our write procedure will see this branch as a legitimate call. The effect will however be minor, because even though the data written may be considered random, the location is not. Thanks to the fact that we hard-coded the write address right into the machine instruction, a double failure is needed to destroy the code. Thus this yellow zone may only harm the data. This may create a semi-permanent malfunction of the device, but it may be restored to regular operation by an action from the user. For example, performing a factory reset or re-entering some data may be necessary. The code is not affected.

Note: The device may still be deadly if You allowed that setting to carry a request like "and now kill all humans". Think about it.

Green is not so green

Of course green and yellow zones are not so green and yellow if You consider what happens next. But I am not going to consider the secondary effects now.

Indirect write

But what if we do:

flash_update(address, value)
{
  FLASHCTL = PASSWORD+WREN
   [address] = value
  FLASHCTL = PASSWORD
}

Let us take a closer look again:

Memory layout with effect zones for brute branch-in failure.

Looks familiar?

Exactly. Except now part of what was yellow is “red”.

Green zones

As previously. No change. No write to non-volatile memory will be effective.

Yellow zones

As previously. Notice the yellow zones are those in which the code will correctly compute the address and pass it to the memory write routine.

Red zones

Deadly zones. If the CPU branches into these zones with random data in registers, then both the data and the address will be incorrect. In effect code may be overwritten, since the flash write will see it as a legitimate request.

Hmm… "may be overwritten" is in this case a serious understatement. "Will be for sure" is much closer to reality. The only case I can think of in which "may be" is the right phrase is when You have a 32 bit MCU with only a few kilobytes of code memory. Then yes, the chance that a random address will hit the used code address space is low. If however Your MCU is a 16 bit one and You have 63kB of a 64kB memory space used up by code, then "may" becomes "will be for sure".

Forced branch-out

Of course, if we consider a forced branch-in, then we should also consider the reverse: what happens if the CPU starts to go astray while our routine is being executed?

Everything between:

  FLASHCTL = PASSWORD+WREN
   is "red zone"
  FLASHCTL = PASSWORD

Any write to any random location executed in that zone will be a legitimate write and will be effective.

However, let us think about probability for a moment.

How probable is it, when the "red zone" is only one machine instruction? Near zero. The MCU will run hundreds of millions of instructions each day and will stay in the "red zone" for just a few of them.

That is, if You took care of it and are not writing data to non-volatile memory ten times a second. If the CPU is FLASH based, then the chance that You are doing that is low, because the FLASH wouldn't stand it even during Your development tests. If however the CPU is FRAM, then who knows? Maybe You are doing it that way?

The next thing You should consider is that we are not really talking about lines of code in the "red zone", but about time spent there. What am I talking about? Interrupts. If an interrupt happens inside the red zone, then one instruction may change into hundreds of them. And the probability of failure skyrockets.

Increasing safety

Branch out: Limit the time spent in the red zone

Disable interrupts before setting write enable and restore the interrupt state after You complete the write:

X = GIE
FLASHCTL = PASSWORD+WREN
   is "red zone"
FLASHCTL = PASSWORD
GIE = X

Branch in: Make sure the address is sane

Use the "green zone" to check whether the address is sane. Remember, any forced branch-in which lands in the code after the write enable instruction is harmless – the non-volatile memory protection will kick in. So do it:

X = GIE
FLASHCTL = PASSWORD+WREN
  if (address>=min) && (address<=max) [address] = data
FLASHCTL = PASSWORD
GIE = X

Remember however that what is "green" for a branch-in is "red" for a branch-out. If complex checks are necessary, it may be wise to have a separate routine for each purpose. For example, separate write_settings and write_log routines?

Never ever do something like this:

if not ((address>=min) && (address<=max)) kill-self
X = GIE
FLASHCTL = PASSWORD+WREN
   [address] = data
FLASHCTL = PASSWORD
GIE = X

This kind of code is doubly risky:

  1. A branch-in after the address check is a legitimate write with a wrong address.
  2. A broken or lengthy interrupt may alter the address after You have checked it.

Use hard coded addresses whenever possible

Do You really need to save a few words on each write and use indirect addressing? Using a macro will give You similar ease of code maintenance, and You will get rid of a lot of "red" zones.
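
Here is a minimal sketch of that macro idea, reusing this post's generic register names – FLASHCTL, PASSWORD, WREN and GIE are stand-ins for whatever Your MCU really has. Because the destination is a global variable, every expansion compiles into a store with a hard coded address, so a branch-in can at worst corrupt that one location:

#define FLASH_WRITE(dest_var, value)                          \
    do {                                                      \
        unsigned gie_save = GIE;                              \
        GIE = 0;                    /* shrink the red zone */ \
        FLASHCTL = PASSWORD + WREN;                           \
        (dest_var) = (value);       /* hard coded address  */ \
        FLASHCTL = PASSWORD;                                  \
        GIE = gie_save;                                       \
    } while (0)

/* Usage: each call site bakes its own address into the instruction stream.
   application_settings and new_settings are assumed to be globals. */
FLASH_WRITE(application_settings, new_settings);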

Use the “Memory Protection Unit”

Memory Protection Unit (MPU) – a piece of hardware which checks whether the address You write to is sane.

If Your MCU has any kind of it – use it. But never ever assume that, since You configured the MPU, You may now enable writes in the bootstrap code. Just don't, ok? The MPU only checks where You write, but not when. If You rely only on it, then every place in which You mess up the indirect addressing is a potential failure. Thankfully not a critical one – that is, if You configured the MPU correctly and made it allow writes only to data zones.

Defending against effects

…is another part of the story, and a subject which should be touched on in other blog entries.

Summary

After reading this blog entry You should be aware that any write to non-volatile memory from which the CPU can fetch code may be lethal to Your device. Failures may range from "user action required" to the destruction of the device or the system it controls. It won't be something that goes away by itself at reset.

And finally, You should know what to do to minimize the chance that such a bad thing will happen. Or at least – what You should think about each time You write to a non-volatile permanent program memory.

Page revamping in progress…

A bit of visual page revamping is no longer in progress – it is finished.

What has changed:

  • categories cleaned up, all posts assigned to their category. The number of categories has grown, to better reflect the primary subjects of the posts;
  • tags cleaned up. The number of tags dropped sharply. I removed most of those for which only a single post was present. Instead they are now more like a cross-category;
  • the visual style of the pages changed a bit; a better, in my opinion, tag cloud is now in use. Also, there is a category cloud;
  • added to the menu a link to a dedicated tags listing page and a category listing page;
  • tried to make the pages wider, but my ability to edit style templates is limited. It is amazing how a person who can do almost everything with HTML can do almost nothing with the "easy editing functionality" of this, otherwise great, web service. The alternative is, of course, to pay for a raw server and do the HTML by hand. Which is not where I wish to put my savings. Maybe if the number of readers hits a hundred a day? Currently – no sir, no way.

Stay tuned, more old fart farts are in progress.

Career plan: Wasting Your best people

This blog entry will be short and centered around a sentence I read recently.

“Every employee who works for You long enough is incompetent”.

Or something like that.

This sentence looks stupid, but in reality it may be true if You have developed a very dumb career plan for Your people. The main idea behind this sentence is: a person climbs up the company hierarchy as long as this person excels at their current position. Thus when a person no longer climbs up, it means that this person no longer excels at their position.

In simpler words, one stops climbing when one's competence limit is reached.

Promotion towards incompetence

Your company may have one of two strategies when it comes to promotion. Either You have an exam system which tests candidates before promotion, or You promote those who are doing their current job best.

Exam system

If You do it that way, the said sentence won’t apply to Your company.

Unfortunately this is a rare case, since an exam system requires that You not only know what competences are needed at each position, but are also competent enough to verify them. Not easy to do even if Your company is perfectly organized. If it is a messy hell of entangled positions, full of holes and stuff nobody is doing, then, I dare say, an exam system is not only impossible to employ but even harmful.

“Kick them up” system

In this case You don't do any exams, because You don't know what You should ask about. Instead You move a person up the company hierarchy when You see that this person excels at their job. The problem is that what is needed at a lower position is not necessarily what is a must at a higher position.

For example, a low level manager must have excellent contact and understanding with their own people, but doesn't need to be a manipulative person. The higher one is, the more politician-like skills are necessary.

Losing Your specialists

Poland is a stupid country. 99 percent of our promotion systems move people up the hierarchy from specialist positions towards management positions. Mainly because practically no company has a structure in which another path to climb up exists. We believe that being able to give orders is something better than being listened to.

In other words: to reward the skilled specialist, we have to move that person from a place where one's knowledge is most important to a place where political skills are critical.

Example?

Our hospitals.

The career path for a doctor leads from the patient's bed, through chief doctor of a hospital ward, towards top hospital manager. There is no other path. In effect plenty of our hospitals are managed in a terrifyingly inefficient way and provide low quality services at high cost.

The management inefficiency comes from the fact that a doctor is skilled in medicine, but not in managing people and processes.

The loss of quality comes from the fact that those most experienced in medicine are moved away from patients' beds, because once You become a manager and are not good at it, You will have to spend a lot of time on managing and will have no time for patients.

Mid level management

There is usually no problem with breaking that nasty structure by hiring a proper manager for the high-up positions. Those positions are such that 99% of the necessary skills are political ones. For example, good knowledge of how to control people. Or how to perform financial inspections. Or anything else I don't know about, because I am not a manager.

There is however a level at which significant management skills are needed, but technical knowledge is a must. For example, a chief doctor should be someone who has a vast amount of experience and can always say: "We should treat this patient such and such way, because I have seen it hundreds of times while You guys haven't seen it at all." In other words, this doctor must be able to show the path to the younger ones and watch how they are doing, but the actual planning of details must be left to them.

Unfortunately this position also involves telling people who should do what, yelling at some lazy bastards, giving out some prizes, keeping schedules, procuring stuff, monitoring quality and the like.

This is a really nasty position, because if You can't do the management job, everything breaks. Yet if You can do only the management job, it also breaks.

Put a professional manager in this position and Your patients will start dying.

Put a superior doctor with a bit of a personality problem in this position and Your younger doctors will leave, looking for better employment opportunities.

Managing without knowledge

If You move a bit higher in the hierarchy, You will find a position at which it won't matter whether You are a superior doctor or just studied medicine a bit. This is because now You manage an entire clinic with a few tens of wards of different specialties. You simply cannot be a specialist in all those branches of medicine.

Yet You need to manage the clinic in its medical context, because medicine is what makes a clinic run. If You ignore medical issues, Your clinic will collapse.

But You cannot know it all by Yourself, right?

Advisors: a career path for specialists

And here comes into play what was broken at the beginning: You don't have any career plan for Your specialists which allows them to remain specialists.

Now imagine that You somehow managed to break the belief that commanding people is more important than being listened to by them. Imagine that You can promote Your specialist to an advisor. Junior advisor, senior advisor and so on. An advisor is not commanding. An advisor is telling those who command how they should command, as far as the technical issues are concerned. Advisors are teaching and training less experienced employees. Advisors are finding paths which can solve unsolvable problems.

Advisors are the best specialists in their branch of knowledge, hugely experienced and with an analytical mindset. Have a problem? Ask an advisor.

They are respected because they know what they’re doing, and that respect is the prize.

If You create a career plan which moves Your specialists away from management positions and towards advisors, then You not only keep them doing what they are best at, but also create people whom Your skilled managers may ask for advice.

Summary

Of course, after writing this I realized that it is nothing new. It may be new in Poland, but many U.S. Government institutions do apply sane, separate career paths for specialists.

If You happen to run a company, or work for one which thinks that making someone "a boss" is a good plan for rewarding people, then please read this again. Then think about why the heck Your company is not running as it should. Why almost every one of Your mid-level managers behaves in an irresponsible, childish manner. Why they cannot do any serious planning or reporting. Why they never know anything about managing. Why it looks like they don't know what they are doing.

And why, when You listen to whispers and gossip, You hear about Yourself: "This idiot doesn't know anything and doesn't have any plan."

Ehmmm… both are true.

Because they are not managers, and You don't have anyone You can ask for technical advice.

To whom give that job?

In that post I was babbling about how to use a competence matrix to make the most valuable choice when selecting an employee for a project. This time I would like to ask You to think about it again, but from a bit different point of view.

All right, so what is going to happen now?

Re-organizing Your company

First let us assume that some little piece of Your company needs re-organizing. To be specific, that You have a certain job which is already being done, but something squeaks, squeals and crunches. Something is wrong. The quality of the job is not good enough, there are expensive mistakes, or something like that. You need to either change the way it is done or make somebody else do it.

Now, for the purpose of this discussion, please assume two things. First, that in the most generic terms there is no point in changing how that problematic job is done. And second, that this time Your competence matrix is filled, in all rows and columns, with very similar numbers. That is, from a competence point of view, there is no difference to whom You give the job, since they are all capable of doing it equally well.

Or, in this case, equally bad.

Does this mean there is no point in changing either how the job is done or who is doing it?

Well… not really. As always there is a catch, so let us now dig into it.

Competence, Concentration, Control, Profit and Harm

CCCPH…. well… CHPACC? Hell… I can’t find a good acronym.

Those five words are something You should focus on.

Competence

Competence is obvious. The person You select must be able to do the job. Better or worse, but capably enough. But this time, as I asked You above, it doesn't matter. Anyone at Your company can do it. So let us forget about competence.

Concentration

Even a genius can screw up if one doesn't focus enough on what is being done. And, of course, even a not so bright person can do miracles if one concentrates all available attention, care and abilities.

But how can You be sure that an employee will concentrate on the task? The ability to keep focus is a very personal thing and a very important competence… in which, as I asked You above to assume, all Your employees are practically equal.

So let us invert this question: why is it possible that someone won't focus their attention enough? When will it happen?

It may not be very business-like thinking, but concentration is highest when the job is interesting. When there is passion. When a person does what they love to do. When there is an attraction. When there is the satisfaction coming from a simple "I did it!".

On the other hand, with a job which is boring, hated or seen as not important at all, laziness and carelessness creep in.

Imagine You have a certain strictly technical problem on the production floor. For example, some pieces of equipment return from Your clients with a claim that such and such piece of the enclosure looks as if it were dented. And indeed it is. At first glance it looks like a kind of design problem, so You task one of Your R&D guys to investigate it. After a careful, few days long investigation, this guy finds that it is in fact a production problem, due to the use of a too small and too light hammer. This R&D guy has found that an employee who needs to hammer a small settling pin into a hole uses a hammer which is so lightweight that it must be swung at great speed. Which, obviously, makes it hard to aim, and sometimes the blow is not precise enough. Thus the small dent.

The solution is obvious: use a bigger hammer. A bigger, heavier hammer is not only larger, thus harder to miss with, but also doesn't need to be swung at all. It is enough to drop it from, let's say, one inch height and it is done.

Case closed, problem solved.

Except that in fact it wasn't solved. You had a problem, R&D said they had solved it, but the dents are still there.

This is because, to actually solve the problem, it is not enough to say: "Use a bigger hammer!" Someone must procure that hammer. To do that, one must specify exactly what kind of hammer it has to be. Then someone must remove the old hammer from the equipment list at that workstation and put the new hammer there. And finally someone must remove the "and swing it swiftly" sentence from the production floor guide documents and the training materials.

A hell of a lot of paperwork is required, but until it is done, nothing is done.

The catch is, as You have observed, that it wasn't done. The R&D guy forgot to do it.

Why?

Because it was boring. The investigation was interesting. Finding a part number for a hammer was not. Writing a memo saying "using a heavier hammer will solve the problem" was, from that guy's point of view, the last reasonable action he had to take. Yet You expected this guy to fix the problem, and that involves changing so many documents in the company… Are You sure that this guy even knew they exist?

The concentration of this person lasted only up to the end of the investigation. This person was capable of doing the rest, but the attention level was dropping rapidly, up to the point at which he forgot to do something absolutely necessary.

Control

All right, so the R&D guy forgot to do what he should have done.

Why did it escape unnoticed?

Your company screwed up the quality assurance process. Nobody checked whether that guy did his job…

…which is what a normal, plain and sane person would tell You. But since I am not normal and sane I won’t be telling You that shit.

Control, checking, inspecting, validating… all of this is expensive crap. It doesn't bring a profit. Yes, it allows You to run Your business in a better way, but controlling and checking by itself is not profitable. And if something is not profitable, what should we do with it?

Throw it away!

Yet that guy really forgot, and nobody noticed. Your organization doesn't work the way it should, because even if mistakes do happen, they should not be able to propagate.

So let us now try to think about what You screwed up.

One thing is concentration. You gave a job to someone who couldn't focus on something that was boring for him. This was Your first mistake. But even if You gave the task of finding a part number for a proper hammer to someone who gets a hard-on from browsing hammers in internet shops, there is still a slight chance that even such a pervert would make some mistake.

Your second error was that You weren't looking at profit and harm.

Profit & harm

Now there is an important assumption to make. And Yes, it is very important. And yes, if You think otherwise then You are an idiot.

People do make mistakes. It doesn't matter how well they focus on the job or how high their competences are. They will err.

So the proper question to ask is not how to prevent mistakes, because eliminating them totally is impossible, but how to detect them at the lowest possible expense. Or even better, how to allow them to happen, but not allow them to do any harm?

Let us invert it a bit.

Who will be the first to notice that there is something wrong with the hammer?

Only the person who hits their own finger with it.

Or to be precise, the person best suited to notice a mistake is the person who will be harmed by that mistake.

Consider, for example, pedestrians at a crossing. Is it wise for them to have unconditional priority? If someone makes a mistake, who will be harmed and who will be the one making the mistake? In any case, regardless of whether it is the pedestrian who barges onto the crossing or the car driver who is speeding, the pedestrian will be the one who ends up dead. Even if it is the car driver who makes the mistake. The action and the result do not sit with the same person. The theory that giving more priority to pedestrians will make them safer is based only on the hope that car drivers will try to be more focused, because they will be more afraid of killing someone. It doesn't create any barrier against mistakes. In fact, it removes safety margins.

But this is another story. So let us get back to our problematic business.

Again, who will be most inclined to make sure that a proper hammer is at the workstation? Only the person who profits from that fact.

Imagine that it is this R&D guy who has to talk to the pissed off clients and apologize for all those dents. Will he then not triple check that the right hammer is used? His job will be much easier if it is.

This means that if You carefully place the right task at the right place in Your company, then You may not only get better results but also save a lot of money on checking and controlling. But the right place is a very peculiar one. It must be a place at which the job both can be done and at which there are direct benefits from it being done well. Or direct harm when it is done badly, it doesn't matter which.

If however one person is doing something while another person's job is made easier because of it, then an expensive quality control system is a must.

Rule of thumb

There is a great rule of thumb in safety, and it is one very simple statement You must learn:

"Doing things the safe way must be easier than doing them the unsafe way."

Follow this rule, and Your company will flourish. Break it, and people will be dying.

This rule can be re-phrased to:

"Doing things well must make one's job easier than doing things wrong."

Can You see it?

Updating the production documents did not make the R&D guy's job any easier at all. Forgetting about them did.

Then we may add another one:

“You Reap What You Sow”

One's own mistakes must hurt oneself. If this is true, You don't need a third person to check things, just as You don't need a quality controller to ensure that You turned off the flame under a pot if You are going to grab it with a bare hand. And this is the exact reason why it isn't very wise to grab a pot with a bare hand just because some other person said "I did turn the burner off".

Summary

After reading this blog entry You should be able to notice, that selecting a right person for a job isn’t only a matter of competences. It is also a matter of selecting someone who will be satisfied by the subject, will directly profit in a non-financial way from good work, an will be harmed by own mistakes.

Of course this is not only a matter of finding a person, but a matter of organizing Your company. What workplace is responsible for what? What workplace benefits from actions of others? What workplace will be directly harmed by mistakes of others? What workplace should focus on what actions?

For sure this isn’t easy to arrange everything in a perfect manner, but I think it is worth trying. Sometimes moving job from one place to another may solve a hell lot of quality problems.

And last, but not least, You should now be able to figure out where You are wasting money on process quality assurance which is necessary only because You arranged Your work-flow without proper consideration of the above facts.

That is, if You have any process quality assurance at all.

Asking for troubles: “Can You do it?”

This blog entry won’t be about technical skills but about people. And when it comes to people it will be, as always with humans, about stupidity, lying and falsehood.

Too Machiavellian, You say? Well… I read Casanova too. And plenty of US government reports. And I have seen some of it in action myself. Finally, I am myself, and I can be honest with myself about how bad a person I am.

But back to the subject.

Imagine You are a boss or a leader of a not-very-large group of skilled employees. I say “not very large” because if You are near the top of the company ladder You will probably have some mid-level managers below You. Mid-level managers are a different kind of people and I am not willing to talk about them. I would like to talk about You, the mid- or low-level manager, and Your own people. People like me.

Now imagine You have some task to do and You come to the question: “Can it be done? And if yes, then when?”

Obviously it won’t be You yourself doing it, because You are a manager, right? The actual job will have to be done by Your employees. You have at most a very vague idea about what should be done and almost none about what problems it will involve.

What do You have to do then?

Well… The first thing that comes to mind is to take Your men and ask them: “Can You do it?”

You are an idiot, aren’t You?

This is a logical question, You say. Who should know better whether it is possible to do, and when it can be finished, than those who will do it? They are specialists, aren’t they?

If You honestly think that way, then I recommend You search for another job. You are not fit to be a manager at any level. Go hire Yourself into a shoe-selling department or whatever. Really. You can still be a good father though, so not everything is lost.

Because even though they are professionals in their craft, they are at the same time people whom You pay. Living in capitalism, and as a manager You love capitalism, means that money equals life. In simple words, at least the quality of their lives, if not their ability to live at all, depends on the fact that You are paying them money.

And yet You ask them: “Can You do it?”

Ehm…

Yes, we can!

Now imagine for a moment they say: “Fuck off, John, we can’t do it!”

I will re-phrase it a bit, to show what they think You will hear instead: “We are too dumb to do it!”

Honestly, do You really expect that anyone who is getting paid for being a pro will ever say to the boss: “I am a lousy amateur, not skilled enough to do it”?

Really?

Can You see how they glance at the door, knowing how many candidates for their place are standing in line in front of Your company?

Asking a group

… is even a higher level of stupidity.

When You ask someone You know very well in person, and You have put really great effort into making Your people feel safe at their jobs, then You may expect a carefully phrased answer which will sound like they could do it, but some peculiar problems may hinder the time frame or the final result.

If however You call a meeting and then You ask that “Can You do it?” kind of question…

Did I mention a shoe-selling department yet?

When an individual person is talking to You in a safe place, on their own ground, approached carefully, that person may only be afraid that an incorrect answer, and by “incorrect” I mean “displeasing”, will harm Your affection towards them. If You are very, very careful, You may get an honest and substantive answer.

Any person answering such a question in front of a group must, in addition to what You will think about them, think about:

  • how will other employees see me saying that we, as a group, are too dumb to do it?
  • how will the group react if, because of me saying that we can’t do it, the payment to the entire group suffers?

But they may say: “No!”

And this is another problem. They really are allowed to say it.

But if they are too cautious, and they say “No, John, it cannot be done” with a bit too large a safety margin, then the company will not enter the profitable contract. Their cautiousness may harm the profit, and profit is what makes them needed by the company.

Then, on the other hand, if they cut the safety margin and say “Yes, we can”, then there is still a chance it will work out. A chance is not something You should bet on, but if the contract slips a bit, the management will have to do something, right? If they, the employees, slightly fail while the company has already invested significant money in the project, then a bit more money and a bit more time may be scooped out.

The risks from saying “No” are high and focused on the exact person who said it. “No” is “no”: it was Tom who said we can’t do it, and that is why we didn’t get paid. The risks in saying “Yes”, however, are distributed and weighted by the above-said chance. Something unpredictable may always happen, and even if it doesn’t, they can create such a thing easily. A plausible explanation of why it failed can always be forged in such a way that the harm is distributed and there is no single person to blame.

Except, of course, You, my dear manager.

Can You do it till…?

So You should already know that asking Your professional team whether they are pros or wussies is not a wise way to manage Your company. But surely they will be able to tell You how long the job will take?

Sure.

But not if You ask it in the usual “add the motivator” way: “Ladies and gentlemen, our filthy rich client expressed a wish that it would be excellent if we could supply the product by the end of this summer. Can You do it in that time frame?”

What can they say? “No, we are not skilled enough to work that fast”? They will know, from Your message, that if they say it cannot be done sooner than the end of winter, then the filthy rich client may go away. And with the money go their food and their mortgages, and their daughters will need to start working in a horizontal position to support the family.

The answer will be a resounding “Yes”.

You said it would be done by summer and it is not!

All right, so You are not an idiot. You kept the client’s wish a secret, which was not an easy thing to do, and instead of asking the suggestive question “Can it be done till the end of this summer?” You asked: “How many days do You need to do it?”

It is a bit wiser question. Provided You understand the answer correctly.

This is because there is one more thing when You ask Your specialists about the necessary time frame. First of all, You may expect a relatively trustworthy answer only if they are doing a job which they have done before. I work in R&D and I have never done anything I wasn’t doing for the first time in my life. Asking me about the time frame of such a job is… well… Shoe-selling department?

But assume for a moment that You are asking someone who does a repetitive job. For example, You have a guy who lays ceramic tiles on floors. You show him the room and ask: “Say, Fred, how long will it take?”

The answer is straightforward: “Call it four days, boss”.

Except that Fred doesn’t know that the client’s wife has not decided on the tiles yet, except that they must be from Italy and will take at least a week to deliver. And what You don’t know is that You asked Fred how long it will take to lay the tiles, but not how long it will take to pour the self-levelling screed, then lay the tiles, then wait for the glue to fully harden.

Competences

In all those cases, questions like “Can it be done?” or “Can we do it till the end of summer?” ask about subjects which are outside the competence of Your specialists. They know how difficult the job is. They can iterate over each necessary step. They can quantify the risks involved. They can say what abilities need to be gained for the job. They can say how many work hours may be involved if everything goes smoothly.

These are questions You may ask.

The question “Can it be done?” asks about something else. It asks if we are willing to take a risk. It asks if we are willing to gain abilities. It asks if we can allocate the resources necessary for such an ambitious task, possibly harming other projects.

Then “Can we do it till the end of summer?” is not about the work hours needed in ideal conditions, but about the willingness to allocate certain resources. About planning how to distribute resources across different projects. About assuming some safety margin.

And finally, about the acceptable level of risk that everything will go “kaboom!”

Hell yes, dear manager, answering all of this is Your job! It is You who are competent in planning, not Your coder, not Your application designer, graphics artist, experimenter or document writer, nor even Fred, who knows how to perfectly lay ceramic tiles. Asking all those people about Your own decisions is like asking a five-year-old kid to guard a delicious cake from evil-doers. Kid tummy-sick, cake eaten, crumbs swept under the carpet. Only the cockroaches will be happy.

Loss of trust

Asking those questions has one more long-term harmful effect.

It is a loss of trust.

If You insist on asking such questions, You will hear answers which, if followed, will harm Your business. This is an unavoidable effect.

Since You insisted on asking them, then of course You are an idiot. But since You are an idiot, You will refuse to admit it. Instead You will think that Your employees are lying bastards with just a single brain cell. And even that single brain cell they use only to cheat You.

You will stop trusting them, and since You do not trust them, but instead believe that they are fucking with You, the next time Fred says “Call it four days, boss” You will respond: “Do it in three”. Because Fred is a lying bastard who never says anything close to the truth.

On the other side of the table, Fred will be sure that You are asking him to do all the planning he hasn’t the faintest idea about. In his eyes You are asking him about tiles, but then You are telling him to pour the screed too. Of course, both in three days. How can he trust that You know what You are doing?!

Summary

After reading this blog entry You should know that I have read Machiavelli and that I agree with him in many places. You should be aware that every question You ask gives out a lot of information. That the form of a question is critical, because it tells people what You wish to hear.

And that “truth” and “falsehood” always walk hand in hand on the tightrope of the chance that it will all come out well.

What can I recommend to You?

Know Your people well. Know how they think, what they are ashamed of, what they are ready to admit and what not. Who is ambitious, who is cunning, who cares about others and who does not? Who has a money reserve allowing them to say “Fuck off, boss”, and who is up to the neck in debt?

Know Your people the same way You need to know Your tools. Only then may You hope to get true, trustworthy and really thought-through answers to stupid questions like the above.

And, of course, let Your people know You. Only if they know Your weaknesses will they be able to understand what You really had in mind when asking them, or what You could have forgotten about.

Letting them know Your weaknesses…

Right…

So that they could cheat on You better?

Unfortunately I am not a good manager, so I can’t really help You. I am an R&D guy who is annoyed beyond imagination by “Can You do it till the end of summer?” questions. And I must say that in my entire career I was never asked a proper question about any project.

If however You, dear manager, figure out how to ask proper questions, then You may expect a thorough and trustworthy answer from me.

And if You are an idiot asking such questions, then remember: if You hear “No”, believe me, it means “No, never, impossible, absolutely not, and go hang Yourself You sick fuck, cause if we take it, then I am dead”. Because we will answer “No” only when answering “Yes” would bring us certain doom.

Coherent operations

In that blog entry I was babbling about atomic operations. Now it is time to talk about something much more amusing.

Data coherency

“Data coherency” is a concept which, in the simplest words, expresses the hope that some data always look the same regardless of who is looking at them and when.

Again, like atomic operations, it sounds silly that it could be any different. Yet, again, it happens.

Evils of cache

When, around 1988, the 80386SX hit the market, the concept of cache memory was introduced together with it. The CPU was running at a whopping 25MHz while on-board memory couldn’t get past a mere 12MHz. Prior to that there were cases when the CPU was faster than main memory, but not twice as fast! Such fast CPUs were using read-ahead (80286) or look-back instruction queues (Motorola 68000) to gain as much time to fetch code as possible, but in this case it wasn’t enough.

The idea was, initially, to put a small, fast memory between the CPU instruction queue and main memory. It didn’t create any data coherency problems, since only code was cached. But since this memory was sitting in front of main memory, the temptation arose: why not cache data too?

And so it began.

Whenever the CPU reads data from memory, they are first looked for in the cache. If they are there, the read request doesn’t even touch main memory. If they are not, the CPU is put on hold and the data are loaded from main memory. There are two tricks, however:

  • missing read cycles on memory-mapped hardware;
  • line filling of the cache memory.

Memory mapped hardware

Prior to the cache nobody was even thinking about it. If some hardware needed to be added to a machine, it was mapped as a part of its memory. When a CPU reads memory it always supplies both an address and a read request signal. Hardware could understand them and not only provide data, as memory would, but also change its own state. For example, a plain serial port FIFO has just one register, but each time it is read it supplies the next byte received from the serial port.

The cache changed that. If memory-mapped hardware was subject to read caching, it might stop working. Once the said FIFO register is put in the cache, no read of it will ever touch the serial port hardware at all.

If You remember those times, You might recall some jumpers on the motherboard which could be used to turn off the cache for certain address space areas.

Line filling of cache

I am not sure if it was already the case with the 80386SX, but it certainly became one when DIMM memory hit the market, probably with the first Pentium processors. This type of main memory is no longer RAM (Random Access Memory), regardless of the fact that we keep calling it by that name. It is burst access memory. You have to arm it at a certain address, and then You may stream data from it much faster.

This nature of the memory system makes it unreasonable to initiate a cache-fill transfer to fetch just the data we need. Instead, the cache is divided into quite long lines of many bytes, which are filled at once in one burst.

The side effect is that hardware connected to the memory bus will see transfers which were never requested by the code. For example, if the byte at address 8 is requested, the memory will see a burst transfer of the whole line, say bytes 0…63, taking just a few DDRAM module clock edges (after arming it, of course).

Cache writes

At first the cache was used for reading only. It still allowed twice the speed of main memory, but if the cache is RAM, then why only read from it?

Write-through and write-back

There were two policies: the first assumed that we will benefit from reading the updated data again from the same location, while the other also tried to speed up the write process itself.

One was called write through, the other write back. I hope I didn’t mess it up.

Write-through is simple. The CPU writes data to the cache and to main memory at the same moment. Thus it is slow, but no special state machine is needed. If data need to be read from that place, they are, of course, taken from the cache. Memory-mapped hardware might get confused by it, but nothing worse than that could happen.

Write-back is much more complex. The CPU writes data to the cache only and marks the cache line as dirty. Then, if the cache needs to be filled with new data because of a cache miss, the cache controller takes one of the lines in the cache and, if it is dirty, first writes the whole line back to memory.

You may clearly notice that with this policy some changed data may never get to main memory at all, and if they do get there, the order in which it happens may be surprising.
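
If the state machine is hard to picture, here is a toy, single-line cache model in Java. It is my own sketch, all the names are made up, and no real controller is this simple, but it shows where the two policies differ:

   import java.util.HashMap;
   import java.util.Map;

   /** A toy, single-line cache illustrating write-through vs write-back. */
   final class ToyCache
   {
       private final Map<Integer,Integer> mainMemory = new HashMap<>();
       private Integer lineAddress;       // address currently held in the cache line
       private int     lineValue;         // value of that line
       private boolean dirty;             // write-back only: line modified, memory stale
       private final boolean writeBack;

       ToyCache(boolean writeBack){ this.writeBack = writeBack; }

       void write(int address, int value)
       {
           flushIfReplacing(address);
           lineAddress = address; lineValue = value;
           if (writeBack) dirty = true;                    // memory will see it... someday
           else           mainMemory.put(address, value);  // write-through: memory sees it now
       }

       int read(int address)
       {
           flushIfReplacing(address);
           if (lineAddress == null || lineAddress != address)
           {   lineAddress = address;
               lineValue   = mainMemory.getOrDefault(address, 0);  // cache miss: fill the line
           }
           return lineValue;                               // cache hit: memory is not touched
       }

       private void flushIfReplacing(int address)
       {
           if (dirty && lineAddress != null && lineAddress != address)
           {   mainMemory.put(lineAddress, lineValue);     // dirty line finally reaches memory
               dirty = false;
           }
       }
   }

With writeBack set to true, You can write to ten different addresses and main memory will not see a single one of them until a replacement forces the flush, which is exactly the surprise described above.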

And it doesn’t matter at all…

if You get rid of memory mapped hardware…

…except if we start thinking about multi-processor systems. But let us put that off for later, because there is one more tricky thing which may harm even a small, modern micro-controller.

Multiple execution paths and reordering

The Intel Pentium was the first desktop processor which had multiple execution units. They were not multiple processors, but mere ALUs. The instruction decoder might direct a sequence of instructions part to one ALU, part to another. It did it in such a smart way that instructions which didn’t depend on the results of previous instructions could be directed to whichever ALU was free at the moment. For example:

  mov EAX, my_variable_1
  mov EDX, my_variable_2

could be run in order of appearance, or in parallel, or even in reverse order. It all depended on which ALU and bus resources were free at the moment. Of course the CPU was smart enough to remember what it was told to do and what it was doing, so in:

  mov EAX, my_variable_1
  mov EDX, my_variable_2
  mov my_variable_1, ECX

the third instruction was always executed after the first one. But not necessarily after my_variable_1 was sent to the cache, because it would be more efficient to do:

  mov EAX, my_variable_1
  mov bus_register, ECX
          background: move bus_register to my_variable_1
  mov EDX, my_variable_2
          wait for the bus transfer to complete

The effect was that even the cache could now see some operations in an incorrect order.

Multi-processor

Everything was fine as long as we were confined to a single core. Since absolutely every operation was passed through the re-ordering engine and through the cache, all the changes made to transfers were transparent. Also, the memory-mapped hardware problem was solved by moving from bus-mapped systems (ISA/VESA Local Bus) to streaming, transaction-oriented PCI systems.

However, when You start dealing with multiple CPUs, You must return to those questions again. When processor A changes some block of data, how will processor B see it? In what order will which bytes change? When will they change? Right when written, or maybe later?

Desktop CPUs do all this behind our backs. They have to, because they have to run legacy 8088 code which not only didn’t know about multiple cores, but didn’t even know about caches.

With embedded CPUs it can be very different. Ensuring data consistency between multiple cores all the time consumes a lot of memory bandwidth, energy and time. Why even bother with it, when the program knows very well when it is touching data which might have been touched by another thread? It may tell the CPU to keep the data coherent exactly when it needs it.

This is, however, not my story. I deal with tiny micros, and the only moment I deal with cross-core coherency is when I write for a PC in Java and use java.util.concurrent atomics, or stamp code with the volatile or synchronized keywords. I need to know that data coherency is there but, gladly, I don’t have to deal with it all the time.
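
Just to show what that Java machinery buys me, here is a minimal sketch of mine, with made-up names: without the volatile keyword the JVM is allowed to let the worker thread cache the flag forever and never notice the update made on another core.

   /** A stop flag shared between two threads. */
   final class Worker implements Runnable
   {
       // Remove "volatile" and the loop below may legally spin forever,
       // never observing the write made by the other thread.
       private volatile boolean stopRequested;

       public void requestStop(){ stopRequested = true; }   // called from another thread

       @Override public void run()
       {
           while(!stopRequested)
           {
               // ...do the work...
           }
           System.out.println("Worker observed the stop request.");
       }
   }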

Except when I tried to run the MSP430 with a USB core, which is in fact a dual-core CPU. In that case ensuring data coherency was a must.

Summary

After reading this blog entry You should be aware that if shit happens and You are given a multi-core DSP or some Cortex bastard with caches and the like, then You will have to read the manual carefully and think about the unthinkable: that what one core writes to memory may not be what the second core reads from it.

Atomic operations

In this, as always too long, blog entry I would like to say a few words about so-called “atomic” operations in today’s CPUs and programs.

What is an “atomic” operation?

An atomic operation is a sequence of CPU operations whose result doesn’t change regardless of what happens in the background.

Sounds silly, doesn’t it? A program is a sequence of operations producing predictable results. How could it be predictable at all if the result of an operation depended on what happened behind the program’s back?!

Exactly. It couldn’t be predictable at all. This is why, when we create a program, we always assume that instructions do what they are told to do, that they are executed one after another, and that if we say:

   x = y+5

then x will be y plus five.

The problem is that reality isn’t theory, and there are quite a few peculiar cases when x won’t be y+5.

Hardware atomic operation

The root of atomic operations is the hardware. The hardware must always perform consistently, and a hardware operation is atomic if the result of a CPU operation is always the same regardless of what happens on other input pins of the hardware while the instruction is executed.

Again, this sounds dumb at best. It must be that way!

Well… but it isn’t always so.

Non atomic hardware example

There is a nice CPU architecture in the world which I like very much. It is the MSP430 from Texas Instruments. It is a very, very flexible design which has one rarely seen quirk: it is partially asynchronous.

A synchronous architecture is one in which absolutely every operation is expressed in relation to the CPU main clock. For example, if a synchronous CPU has an “interrupt request” input pin, then the specification will say that the interrupt request signal must be active for at least one full clock cycle. This is because a synchronous architecture looks at the input only at the edge of the clock signal. If at one edge it is low and at the next it is high, the interrupt is triggered.

An asynchronous architecture is a bit different. You may recognize it by looking at the datasheet and seeing that some required timings are expressed in [ns] instead of clock cycles. For example, the MSP430 pin-change interrupt is specified to trigger if there is an at least 50[ns] long positive pulse present at the input pin. It will trigger regardless of whether the CPU is running at a 1Hz clock or a 16MHz clock. It will just happen.

Such an architecture is superior to a synchronous one in many aspects, but it should always be approached with great care. You need to be very, very sharp and experienced to design such a CPU and not make a stupid mistake. Unfortunately, the guys who added this functionality to the MSP430 weren’t the sharpest knives in the drawer and did make a mistake. Of course it is easy to say that now, knowing what I know. They didn’t have such knowledge.

So what did they do wrong?

Let’s take a look at a very, very simplified schematic of the interrupt logic of this CPU:

As You can see, there are pins P1…Pn, inputs on which an interrupt may be requested. In this simplified example it is a so-called level-sensitive interrupt. That is, the interrupt is requested by setting the Pi pin to “1”, and it keeps being requested as long as the pin is held at logic “1”.

How is it made?

Each Pi line is tied to the “set” input of a D-type flip-flop. When the “set” input of a D-type flip-flop is activated, the flip-flop is forced to “1” and remembers it. Each flip-flop serves one interrupt request pin, and they are all OR-ed together to send an interrupt request to the CPU.

Simple, isn’t it?

Indeed. Many CPUs have had such an architecture for ages and it worked fabulously. For example, the 8051 was also using it. Except that in the 8051 the interrupt flags were in a so-called bit-addressable area, while the MSP430 is a read-modify-write architecture.

What is the difference?

Read-modify-write architecture

If You look at the image on the left, You will see that the DATA-IN inputs are connected to the data bus of the CPU and that the CLK (write clock) lines of all flip-flops are tied together. This means that if the CPU writes to an interrupt handling register, it will always write data to all flip-flops. It isn’t bit-addressable. The CPU can only do:

   mov #010101, P1_N

In the 8051, on the contrary, the write clocks were separate and code might do:

   mov #0, P1

Can You see the problem yet? Possibly not.

A bit-addressable architecture is nasty. It is expensive and creates an asymmetry which makes it complicated. And in fact a non-bit-addressable CPU can of course operate on bits.

For example, if we would like to set a single bit in the P1_N register, we can write:

   P1_N = P1_N OR 0b000010

If we would like to clear the same bit, we can do:

   P1_N = P1_N AND NOT 0b000010

Why bother with a bit-addressable architecture then?

No reason.

Except that You must be aware of some non-atomic behavior.

In the case of the MSP430 it appears when the user needs to service the requested interrupt:

   recognize which of P1...Pn interrupts to handle
   if ((P1_N AND 0b000010)!=0)
   {
     process the interrupt
     P1_N = P1_N AND NOT 0b000010;  clear handled interrupt
   }
   return from interrupt.

Written in MSP430 assembly it will look like:

   bit.b #0b00001, P1_N
   jz _no_interrupt
     handle it
     and.b #~0b00001, P1_N
     reti

Can You see the problem?

It is in and.b #~0b00001, P1_N.

If You look at the right part of the image above, You will notice that to actually perform this operation the following sequence of actions must happen:

  drive CPU bus with P1_N outputs      \
  latch this value into ALU argument A  |--- this is the read phase
  latch #~0b00001 into ALU argument B  /
  wait till ALU computes the A AND B operation - this is the modify phase
  put the result on CPU bus              \
  toggle the CLK of P1_N writing result  / --- and this is the write phase

It should now be clear to You why we call it a read-modify-write architecture.

Now let’s do some nastiness. Let us run the CPU so slowly that it becomes quite probable that the following happens:

  drive CPU bus with P1_N outputs      \
  latch this value into ALU argument A  |--- this is the read phase
  latch #~0b00001 into ALU argument B  /
  wait till ALU computes the A AND B operation - this is the modify phase
   put a 50ns pulse on the P2 pin
  put the result on CPU bus              \
  toggle the CLK of P1_N writing result  / --- and this is the write phase

If we were able to observe what happens internally, we would see:

    P1_N  =  0b00001  ←this is initial state
    ALU.A =  0b00001  ←arguments are loaded into ALU.
    ALU.B = ~0b00001
(*) P1_N  =  0b00101 ←50ns pulse sets bit 2 of P1_N register
    ALU.Y =  0b00000
    P1_N  =  0b00000 ←result of ALU computation is stored to P1_N

At (*) the P2 bit was set and the interrupt was, for a short duration, requested to the CPU, but then the ALU.Y result overrode it with zero and the interrupt was lost. Depending on the rest of the architecture it may be lost totally, or it may still be requested while the program won’t know why it was requested.

Note: This MSP430 case is just one of the examples which can be found in the embedded world. The PIC16 architecture is non-atomic against a diode or capacitor on an output pin, which is surely much more fun.

Atomic is always “atomic against something”

In the above example I have shown that the and.b #constant, P1_N operation on the MSP430 architecture, on that specific register, is non-atomic against an interrupt request pulse shorter than the duration of and.b. It is atomic against absolutely anything except that.

If You have even the tiniest bit of asynchronous hardware in a CPU, always take great care to check which operations are atomic on it and under what conditions.

Atomic against interrupt

Non-interruptible instructions

The MSP430 CPU provides us with non-interruptible instructions. This means that absolutely any opcode we write, like add.w variable_X, variable_Y, will execute fully even if an interrupt request is raised during it. The interrupt will be handled either before the instruction or after it, but never within it.

Such a CPU is nice to use.

Interruptible instructions

The 80386 is different in that manner. It can be told to repeat some operations multiple times, where the number of repetitions is stored in one of the CPU registers. Since this number can be up to 2^32, it would be very silly to prevent interrupts from being handled for that long. Thus they decided that such operations can be interrupted even though they are single machine commands.

So in the case of this architecture You must take additional care when You need an opcode to be atomic.

Interruptible sequences

Of course it is quite frequent that one machine operation is not enough to do what You need. For example, You may need to write:

   tst.w my_variable
   jnz _something_needs_to_be_done
         sleep
_something_needs_to_be_done:

with the intent that if nothing is to be done, the CPU may go to sleep and conserve power. Except that, if my_variable is updated during interrupts, something like this may happen:

   tst.w my_variable
   jnz _something_needs_to_be_done
         interrupt
         {
            mov.w #100, my_variable
            reti
         }
         sleep
_something_needs_to_be_done:

and You are boned.

You need to make it atomic against interrupts by, for example, disabling them around the problematic sequence.

Atomic against threads

As You probably noticed, being non-atomic always comes from the possibility that something You can’t see happens between instructions or during them.

Hardware quirks and interrupts are some of those things. Another is multi-threading.

Preemptive multi-threading

In this case an interrupt is used to switch tasks, so to be atomic against another thread barging into Your sequence of instructions You must be atomic against interrupts.
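
The classic Java illustration of it, a sketch of mine with made-up names, is the lost update: counter++ is a read-modify-write sequence, so two preempting threads can both read the same old value. java.util.concurrent offers an atomic replacement:

   import java.util.concurrent.atomic.AtomicInteger;

   final class LostUpdateDemo
   {
       static int           plain;                           // plain++ is read-modify-write
       static AtomicInteger atomic = new AtomicInteger();    // hardware-backed atomic counter

       public static void main(String[] args) throws InterruptedException
       {
           Runnable work = () -> {
               for(int i = 0; i < 1_000_000; i++)
               {   plain++;                    // may lose updates when preempted mid-sequence
                   atomic.incrementAndGet();   // never loses an update
               }
           };
           Thread a = new Thread(work), b = new Thread(work);
           a.start(); b.start(); a.join(); b.join();
           System.out.println("plain  = " + plain);          // usually below 2000000
           System.out.println("atomic = " + atomic.get());   // always 2000000
       }
   }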

Cooperative multi-threading

In this case the threads are, by definition, only barging in when You explicitly allow it:

   tst.w my_variable
   jnz _something_needs_to_be_done
         call YIELD
         sleep
_something_needs_to_be_done:

Obviously, if You do that, You get what You asked for. In this case it is enough not to let another thread barge in where You don’t like it. For example, the following scenario will be saner:

   
   call YIELD
   tst.w my_variable
   jnz _something_needs_to_be_done
         sleep
_something_needs_to_be_done:

Summary

After reading this blog entry You should better understand what an atomic operation is, and why it is very important to know what is or is not atomic and under what conditions. You should also be aware of the fact that any asynchronous piece of hardware is a possible source of both benefits and troubles. And finally, what was not literally stated: that cooperative multitasking will give You the power of task switching without causing a headache.

If You found the subject touched on here amusing, then the next blog entry may be of interest to You.

Competence management: how to kill Your business

This blog post will be slightly different from the others. I won’t be speaking directly about either quality or programming. I will be speaking about management.

But first, a few words of background.

I currently work for a company in which we are all idiots. With me in the first place, of course. The slight difference between me and the others is that I am aware that I am an idiot ;).

We are idiots because we do not manage the competences of our personnel in any way whatsoever. We do not think about what abilities will be needed to finish a project, because we, and I must admit this with pride, inherited a culture which tends to create real renaissance engineers. Thus we have teams made of guys who can do practically anything... And, as a consequence, we seem to follow an extremely simplified management rule: “It will work out somehow”.

Well…

It does.

At the hellish cost of time spent, money lost, stress, frustration, and deadlines never met. And, of course, quality drops rapidly when the pressure rises. Notice, I haven’t said a word about the budget. Because with this level of management we can’t even dream about making budget plans.

In other words, this company is using competence management suitable for a university instead of for the R&D department of a production business.

Where is the money lost?

Even though they are renaissance people, that is, most of them can do electronics, coding on embedded and PC, mechanical design, manual labor or even operating some machines, planning and running experiments, simulations of physical processes… and much, much more, they are not perfect at everything.

I, for example, love assembler-level coding, like Java, take great pleasure in designing 3D mechanical parts and machines, and love the analytical work needed for research and physical experiments. On the other hand, I seriously hate paperwork and absolutely cannot do the same thing exactly the same way more than twice. I have the mind of a machine and great difficulties following vague everyday human expressions, so even a simple form is an ordeal for me.

Oh, and, by the way, I know how to cook.

In simple words: nobody is perfect at everything.

But they are good anyway…

If You look at Your renaissance people, You will probably be able to tell who will attain the best quality, who will do the job quickest, and the like. You may put them in a table: add a row for each person and a column for each competence. Then grade them, let’s say using numbers from 0 == “can’t do it” to 9 == “is as fast and good as a daemon from the deepest depths of Hell“.

Probably You will end up with something like below:

If this table looks like all columns are filled with very similar numbers:

then You may skip the reading – You won’t have any trouble with completing and managing Your design teams, because all Your employees are equal and no competence management is necessary. Probably because You are already managing it very well, that is.

If this table looks like a diagonal array, like this:

that is, You can arrange it in such a way that a 9 appears just once in each row and just once in each column, then You are walking on a tightrope over disaster, but still have no problems with managing projects. It is clear to whom You must assign which job.

But if You have some rows filled with high numbers in almost every column and others with low numbers:

Table representing unique skill

then this blog entry is for You.

Visual aid

I love anime. Of course, only as long as I don’t have to pay for viewing it; my love isn’t that deep ;).

In the recent year or two I came across “Yuusha, Yamemasu”. A simple, classic story about a hero who is… fixing management issues in a daemon army. I do recommend You watch it.

In very short words, this army has a lot of daemons with low numbers in all columns and one or two with all nines. So it is exactly our case. And it is an exact example of how You can destroy Your business… if You happen to have a few very gifted employees, that is.

Deadline

I hate this word. Really. I am a f*ng engineer, You can ask me about anything in all my branches, but I am not a f*ng manager! I know how to do things, but my guess about how long something will take is… well… just a guess. And, I dare to remind You, it is based on the values in my row of the competence table. Unfortunately I am, let’s say, mostly at 7, in plenty of places at 9. So my every estimate is far, far from the level-3 reality.

Now imagine You are a manager and You are tasked by the CEO with a simple: “Do this and that within a month”. And, of course, You have to do it with Your current team under the current load. You need to add this job, and the deadline is far from reasonable.

So, assuming You are not an utter idiot, You will take a look at the said competence table (You do have one, don’t You?), mark the columns You need for the task and search for a team.

So what do You do? You select the row with the highest numbers in all the necessary columns. This person will do the thing fastest, with the highest quality and at the lowest possible cost. Case closed, good job, mister manager.

Except that this is exactly how the daemon army got routed.

Giving the job to the best employee doesn’t work

If You take a closer look at the said table, You will notice that even though the selected person may have a 9 in all columns, there will be one or two columns in which no other person has anything above, let’s say, 2. If You love risk in business, then there is even a possibility that everybody else has a zero in some column.

For example, take a look at Jack the Ripper in the above table.

What does it mean?

It means that this person has unique competences. Not only is there no one else in Your company who can do it well, there is no one who can do it at all!

If You assign this person to any project which doesn’t require those unique competences, then the projects which do require them will have to be put on hold.

For example, in my small team I am the only one who can code. And I am very good at debugging and testing. There are, however, other team members who are really bad at coding (level 1, I would say) and barely passing (level 3) at testing. So the most efficient way will be to make me do the coding and the testing, right?

Right.

Except that there is only one of me, and the day has only 24 hours, of which at least some need to be spent in the lavatory. If the deadline is pushed to the limit, I will die from overwork and nothing will be done.

Paying more for less

If Your table is all equal in numbers, then the solution is simple – You assign more people and struggle, but it is Your job, mister manager, to arrange a project so that at least some level of parallelism is possible.

But what if You are not that lucky?

For example, it is hard to do tests before the program is written. Where can one introduce parallelism here?

OK, so what does the workflow look like?

API concept + algorithms
           ↓
           coding
           ↓
   creating test methods
           ↓
running tests and exposing problems
           ↓
       fixing problems

Your champion with 9 in all columns can do it, let’s say, with the following time schedule:

API concept + algorithms            =10h (*)
           ↓
           coding                   =5h (*)
           ↓
   creating test methods            =2h (*)
           ↓
running tests and exposing problems =0.5h
           ↓
       fixing problems              =1h (*)

I allowed myself to mark with (*) the unique competences in our example team. This is a very badly composed team, You know. It should never look like that, but as I have already said, I am an idiot.

What can be done in parallel? What can we delegate? And what will be the profit?

API concept + algorithms            =10h (*)      -
           ↓
           coding                   =5h (*)      -
           ↓
   creating test methods            =2h (*)      -
           ↓
running tests and exposing problems = -          3h
           ↓
       fixing problems              =1h (*)      -

In the above example I took the task which doesn’t need a unique competence and moved it to someone who is six times less efficient. The job Your champion would do within 0.5 hour now needs three hours. During this time, however, Your champion can fix the problems found by the previous tests (1h) and create new tests for the remaining part of the product (2h). Once this job is done, the test results are ready and the next iteration can run.

How much did we save?

In terms of deadline – 0.5h on each iteration.

In terms of money – we paid for 2.5 extra work-hours (3h of the slower person instead of 0.5h of the champion).

It is not worth the price!

Are You dumb?! It isn’t worth it?!

Think for a while more than a few days ahead.

You had a person with all nines who needs 0.5h to run the tests and 10h to do the concept work. You gave this 0.5-hour job to a loser who needs 3 hours. A pure loss, You say?

This loser is training. Provided the champion wishes to share their knowledge with the loser, but that is another story. What do You think? Do You suppose that this training won’t have any effect? Sooner or later this person will move up from the 3-hour level to 2 hours. Or even better. And Your champion is training more too, not in doing better the things others can do, but in the unique skills. So maybe after some time it will look more like:

API concept + algorithms            =9.5h (*)      -
           ↓
           coding                   =4.5h (*)      -
           ↓
   creating test methods            =2h (*)      -
           ↓
running tests and exposing problems = -          1h
           ↓
       fixing problems              =1h (*)      -

Moving forwards

Of course, if You look at the above example, You will notice that now the champion lags behind. The test run takes only 1 hour now, but the champion needs 3 hours to prepare for the next run. What to do then?

Move a piece of the less unique competence to the tester. He or she can run tests? So maybe they will be able to write some of them too? There are 2 hours at stake, You know.

Medicine

All I wrote is nothing new. Everybody should know it. Especially highly paid doctors. Or, to be more precise, those managers who run medical clinics employing high-quality, highly paid doctors.

Not all of them do, of course, especially here in Poland. We have a long-lasting tradition of wasting highly paid work-hours, which might be saving human lives, on filling in papers and doing introductory reviews. In my entire life I have just once been in a clinic which knew what I paid for and did everything to make the best use of my money. Not that I visit clinics more often than once in a decade, You know.

Right at the entry some girl took my papers, filled in documents and asked questions. Then the next one, a bit more skilled, ran standard tests operating standard equipment. Like, You know, blood pressure and the like. Then I went to see the doctor, who did what only she could do. But, surprisingly, once she needed some test to be made, she called an assistant who took me away and operated some complex machine. The doctor only needed to inspect the results. Absolutely everything which did not need her unique skills was delegated to someone else. Someone, I must say, many times slower and less efficient.

Some years earlier I had met the same doctor at another clinic, at which she did all those things by herself. From my point of view both visits were equal. I got results at a similar price and of similar quality.

The difference was that, with some sharing of the work, she could help 50% more patients within the same time frame.

Summary

If You are running a university, then You will always strive for the best. The best competences, the best people. But if You are running a business, then think twice. Draw such a table. Look for Jack the Ripper and for the diagonal. If You find either of them, then read through this blog entry again and think about the money, time and risks involved.

Then find Your Jacks and make them train their colleagues, so that the colleagues can share their load. It will cost You at first, but then You will get Your business safe and stable.

Peer reviews as quality assurance in the design process: when doesn’t it work?

All right, so You want a product. Let’s call it “Product X”. You, the businessman, see it in Your vast imagination right in front of Your own eyes. It is a great product, and You have great belief that it will be a profitable business.

Except, it doesn’t exist.

You have to hire someone to design it.

Of course You can outsource the entire process, but for the purpose of this discussion I assume You own some production company, so You have some skilled technical people who can do some design for You.

But Your “Product X” must be great, right? And its greatness comes from both financial efficiency and quality.

Quality? Well…

Is quality important?

In a final product, I dare to say no. There are plenty of methods of making fools of Your customers, so final product quality must be just fair. Fair enough to make them buy and not complain, but not too good.

Because good-quality products are expensive to produce and don’t break for ages. And if things don’t break, nobody will buy them a second time, right?

But please remember, the “Product X” doesn’t exist yet. It has to be designed.

Quality in design?

Is it important whether the design is well made? Shouldn’t it be just fair? If fair is fine for a product, then why not for the design?

This time I will say something totally opposite: the design must be superior in quality.

The quality of the design doesn’t only drive the quality of the product. It also drives the production process, marketing, material procurement and customer support costs. If the design quality is poor, Your production process will be messy, slow, expensive, possibly even dangerous. You will struggle to maintain even a minimum quality of the final product.

Quality assurance team in design process

Team. People. Employees. Rings a bell? Money. Expenses. Costs. Pointless paperwork.

A quality assurance team must, by definition, consist of people with superior knowledge. They must keep in check both the material side of a design and the formal one. They must be able to check what each of the designers has done. Validate, verify, compute. Suggest some tests to be run, or even run them themselves.

A quality assurance team is a pure cost generator. They do not design anything; they do not move the work forward.

And they are hellishly expensive.

That is, if all is done by the book. Agreed, You may hire people of inferior knowledge and make a team of them. For example, a “tester” position in a software company is usually seen as a good place for new guys without any experience. They just do tests, right? No need for great skill?

Well… I will keep silent about it for now.

But some companies move one step further.

Why do we need a separate, non-productive team? We have already hired people who can do the design job and move the project forward. This is where the money should go, right? And shouldn’t it be obvious that a skilled designer is absolutely capable of checking whether his or her colleague did the job well? Right?

Right?

Peer review as quality assurance policy

Imagine there are Tom and Hans. Both are skilled mechanical designers and they both design hydraulic stuff. They have superior knowledge in that branch, and they are Your design team.

Now, let’s say, Tom has designed a hydraulic valve. Why shouldn’t he ask Hans to verify whether it is correct?

Next week Hans finishes the design of his pump. Why shouldn’t Tom check it for correctness?

This process is called peer review. We have a team of equal members, and they all both design and validate the designs of the other members of the team. Peer-to-peer.

Positives

This inside-the-design-team quality assurance process looks promising. First, both Tom and Hans are high-quality specialists. It is hard to find anyone better than them, but of course they are just humans, and humans make mistakes. Nobody is perfect. The chance, however, that Tom will miss Hans’ mistake, and vice versa, is minute.

If You went the standard path, You would also have to hire Beatrice to check their work. She would have to be at least as good as they are, so she would be equally expensive. A 1/3 addition to the expenses.

What else is good?

Let’s say You have a larger team. If You make sure that the checking is randomly assigned to different persons, Your chance of missing a problem will be even lower, because more different points of view will look at each project.

If You decide that not only the final project should be checked, but that some reviews should also be made while the work is in progress, it will be even better.

If it is done like that, then sooner or later most of Your team members will have reviewed most of the work of the others. Knowledge will flow from one person to another. There will be no weak points, no key person whose death or serious illness would kill the project.

But what about Beatrice? Wouldn’t she be equally good?

Well… for a time. A rather short one, I believe. If she were only checking others’ work, she wouldn’t learn new tricks. A designer needs to learn on a daily basis; it is their job to invent new things. But it is not Beatrice’s job. In effect, year after year, she would lag more and more behind.

Negatives

None.

Really. This is an excellent method.

That is, if You don’t screw it up.

Potential problems with peer-review system

Matter of power

The first thing You probably already noticed (being a CEO, You are used to noticing such things, of course) is that Tom and Hans are peers. They are equal in strength and power. Neither Tom nor Hans can say: “No way, You can’t do that!”. They both must come to an agreement. And we all know that You can’t steer a ship using democracy, because it will sink at the first reef.

Beatrice, however, can have the power to say: “Stop!”

Lack of standardization

A good design must be at least internally consistent. For programming it may, for example, be a requirement that javac -Xlint does not complain. Or that gcc -Wall stays silent. Or many, many others. You may fill in some examples here from Your own branch.

How can peers set a standard? They have no power to push it through. Even if one person figures out a good method, it must be accepted by all the others.

Of course, depending on personalities, sooner or later they will find some sweet spot. But You should be aware that during the initial period You may run into a hell of a lot of trouble due to the lack of a standard way in which Your design team members express their designs.

Work load

Peer review only looks inexpensive. Each review takes time. If the review result is negative, it takes even more time, since the peers need to meet, discuss the problem and agree on a solution. You do not have a constant, clear cost in the form of Beatrice’s salary, but each design team member will have to put off the design job for a while and do the review.

And here is the catch: when they review, they do not design. The more reviews Tom does, the more he knows, but the slower he designs.

So let’s break it

Let’s do it step by step.

  1. Make all team members real peers… We are not that stupid, right? This is business, not a hippie commune. There must be some responsibility. So let us call one of them the “project leader”. But we have two projects! So let us make Tom the leader of “Project X” and Hans the leader of “Project Y”. Great, case closed.
  2. Trust that they know how to do their job well. They are masters of design, so for sure they know how to document it. They can surely figure out a correct standard.
  3. Make sure they don’t slack. Load them 100%. Promote efficiency and staying within the schedule.

Results

In the first point we didn’t want to break the inventive “peer-to-peer” culture with any kind of “master-slave” relationship. So we created “leaders” from peers. But does it really differ? The “leader” can say “Do it!”, right? So does he differ that much from a “master”?

Sure, because we made certain that the “leader” in one project is a plain “member” in another one.

The catch is that we now have a criss-crossed decision process. Some members will have two masters; some masters will be slaves of their own slaves. There is absolutely no room for quick and firm decision making.

We, the CEO, can of course demand results from the “leaders”. The problem is, they can’t do any workforce allocation, because their competences in different projects collide with each other. And everything will break at the slightest overload, during which an almighty lord-and-master could starve the low-priority projects, but criss-crossed masters who are slaves can’t do that.

Then standards. Sure, designers know how to design and document designs well. From their own point of view. But how about Your production floor? How about Your marketing? What about materials procurement? Leaving it like that is a guarantee that other parts of Your company will have a hell of a lot of work adapting the design and the related documentation to their needs.

And the real devil: workload. Our intention was to make sure that they spend most of their time on design. Because design is profit. Of course, being only the CEO, we do not understand their work. So we will promote efficiency. Let’s say we will pay them some extra money each time a design is finished before the assumed time frame. And not pay them if they do not meet the deadline. Honestly, what more can we do?

Now imagine Tom has a choice: either to move forward with his own design, or to do some review for Hans. What do You think, will he choose to make money or to lose it?

Of course we are not idiots. We will set review goals. For example, if a review is randomly assigned to Tom, he must complete it within the next three days.

And Tom will do it. In the most efficient way. From his point of view, that is. I can assure You, the only job he will do is check for obvious mistakes. Like one of my coworkers, who was great at finding missing dots and misplaced commas in user manuals, but equally great at not noticing missing chapters. A quick check, some red-lining and a signature. The review is done, and Tom can go back to a more profitable job.

The quality is gone with the wind.

Summary

After reading this blog entry You should be aware that peer review is a great method to ensure quality, knowledge propagation and training, but also that it is extremely susceptible to corruption and inefficiency. It is a good approach for open-source projects and for scientific projects focused on uncompromised quality, no matter how expensive it is.

If however You are running a production company which needs a design team, and You are aiming at design process efficiency, then You should think about something else. What exactly is correct, I do not know. This is a problem I still need to solve.

Note: If You think I made it all up, You are wrong. I have been there, seen it, worked there.

Java serialization: what are You serializing Your data for?

I think that after this series of posts about Java serialization we should talk a bit about things which are not present in the standard serialization mechanism.

Once upon a time I wrote some GUI application. Of course, Swing-based. JavaFX wasn’t there at that time; in fact it collapsed to “abandonware” status for a while. Plus, I am not a big fan of it…

But back to serialization.

Of course I created some very specific JComponent GUI components which were responsible for the job. It was basically a scientific, dynamic data charting engine, so those components showed some dynamically changing data. The visual effect depended on two sets of settings: one bound to the data themselves, which of course was serialized through the data serialization mechanism, and a second set which was purely visual. Like, for example, what font size to use in a table and the like. Pure presentation settings having nothing in common with the data.

So I was thinking: why not serialize those GUI components?

The standard serialization is for…

I tried and failed. Miserably. My first and simplest approach, “take it and serialize it to disk”, was conceptually wrong. Let me tell You why.

After trying, and after reading the available sources, I realized that the standard serialization was meant for a “hot serialization” of living objects, while I wished to do a “cold serialization” to disk.

“Hot” serialization

Initially the Java GUI, as a part of the Sun enterprise, was meant (I believe it was; I never worked for them) to run in a philosophy very much like their X-Server concept.

In the X-Server concept the body of a program runs on a “server” machine, while all GUI-related commands, like drawing on a screen, handling the mouse and keyboard and so on, are passed over a network to the “client” machine. It is easy to imagine how much the network throughput is stressed by this approach.

Thus, year after year, as the power of an average “client” machine grew, the X-Server protocol was trying to move more and more to the client side. This is only natural. Consider how much processing power it takes to draw TrueType fonts in Your LibreOffice Writer, and compare it with the power required to manipulate the UTF-8 characters in memory which actually represent the document data. It is clear that GUI-rich applications consume 99% of their power on the GUI, so it is natural to move this consumption as close to the end user as possible.

But how much can You move if You can’t move code? You can only grow the command set, grow caches and the like, but each user action must still be passed to the “server”.

Note: Remember, “server” and “client” are using different CPU architectures. Servers were SPARC or MIPS, clients were x86 or PowerPC. So no, You can’t pass binary code.

With the introduction of the “Java Virtual Machine” passing code became possible. Now it was possible not only to send commands to draw on a screen, but one might pass the bunch of class files responsible for the GUI and run them on the “client”. Of course it should be as transparent as possible. The server side should be able to build the GUI as if run locally, wrap it in RPC wrappers and pass it to the remote client. A client should just run it in the context of its own JVM and pass objects back only when necessary.

A part of this process was, I believe, meant to be handled by the standard serialization protocol.

What does it mean for us?

If You inspect Swing serialization You will notice two things:

  • first, the warning in documentation (JDK 11) that: “(…)Serialized objects of this class will not be compatible with future Swing releases. The current serialization support is appropriate for short term storage or RMI between applications running the same version of Swing.(…)”
  • second, that what is serialized includes all listeners, the whole parent-child tree up and down, and practically everything. From my experience an attempt to serialize any JComponent does serialize the entire application.

This is because of “(…)The current serialization support is appropriate for (…) RMI between applications(…)“.

In simpler words, for transferring objects which have to be alive at the other end of a stream and actively exchange data with the source end of a stream.

Note: I am now ignoring the part about “(…)support for long term storage of all JavaBeans™ has been added to the java.beans package. Please see XMLEncoder.(…)” which You may find in the JavaDocs. This is intentional, because that mechanism is far, far away from what generic serialization needs.

“Cold” serialization

“Cold” serialization is when You remove the object from its environment, stop it, and move it into a “dead” binary storage. Then, possibly next year at the other end of the world, You remove it from the storage, put it in another environment (a compatible one, of course) and revive it.

The reviving process will require binding the object to the new environment, but that is not a problem.

Example: Serial port data logger

Now imagine You wrote a data logger application for a hardware device which sends data to a PC through a serial port connection.

You have a class there which both keeps and tracks data from the serial port. Let’s say it is fixed to be COM1.

What would the “hot” and “cold” serialization look like?

“Hot” example

A “hot” serialization of this class will basically need to pass the already stored data to another machine and allow control of the logger device from that machine. It means that it must, during serialization, create a “pipe” over the network from the remote machine to the machine to which the hardware device is attached.

“Cold” example

A “cold” serialization of this class should save the stored data on a disk, and do it in such a way that when it is de-serialized it will connect itself to the said serial port and continue the data logging. It means that it must save the information about which port to connect to, and create this connection when de-serialized.
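To make the example more tangible, below is a minimal sketch of such a logger class. All the names here are my assumptions for illustration, not code of any real application:

import java.io.Serializable;
import java.util.ArrayList;

class LocalLogger implements Serializable
{
  //The "cold" form needs just this much to re-open the port on revival.
  private final String port = "COM1";
  //The data gathered so far; both forms must carry it.
  private final ArrayList<Float> samples = new ArrayList<>();
  //A "hot" form would instead have to carry a live network "pipe"
  //back to the machine physically connected to the port.
}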

Multi purpose

It is clearly seen that if I try to serialize this logger using the standard serialization I must decide on either the “hot” or the “cold” method. I can’t have both.

But I do need both methods!

Standard Java serialization is single-purpose.

Hey, we have writeReplace()/readResolve()!

Yes, we have them. Except we have either writeReplace() or readResolve(). We can’t use both at the same moment, but let me be silent about it for now.

What are those two?

They are a “patch” to make serialization multi-purpose. Quite a good one, which will work in the above example case, but not in every case.

We may easily imagine that our “hot” and “cold” serialization can be done by:

  class LocalLogger {...}
  class HotLogger   {...}
  class ColdLogger  {...}

  HotLogger  toHot (LocalLogger local){...}
  ColdLogger toCold(LocalLogger local){...}

That is, we provide an ability to construct the “hot” and “cold” variants from the “local” variant. Both “cold” and “hot” are different objects, but such that when serialized by the standard mechanism they do what we need. Now if we are writing an application which needs “hot” serialization, instead of LocalLogger we use a class like this:

 class LocalLogger_Hot extends LocalLogger
 {
    private Object writeReplace(){ return toHot(this); };
 }

The standard serialization mechanism will notice it and invoke the writeReplace method before it starts to serialize any instance of LocalLogger_Hot. Thus the remote side will see a HotLogger in every place where a reference to LocalLogger_Hot was serialized.

We may also mirror the thing and decide that LocalLogger will serialize the information necessary for both creating a hot link and a local port connection, and that it is up to the reading application to act according to its needs. For that the remote side must use a different source for LocalLogger:

  class LocalLogger
  {
     Object readResolve(){ return toHot(this); }
  }

The de-serialization engine will notice this method and invoke readResolve after the LocalLogger was de-serialized. Since that moment it will use the returned object instead of the original, which is achieved by modifying the “stream_reference_map” (see there).

Note: the underscores are intentional.

When it doesn’t work?

So, having my special GUI components which by default are “hot” serialized, and needing to turn them into “cold” serialized ones, I did add:

class MyGUI extends JComponent
{
    static class MyColdStorage implements Serializable
    {
        private Object readResolve(){ return new MyGUI(this); };
        ....
    }
    ...
    private Object writeReplace(){ return new MyColdStorage(this); };
}

Basically the idea is that when the standard serialization serializes an instance of MyGUI it will transform it into the “cold form” of MyColdStorage. Then, whenever it de-serializes a MyColdStorage, it will transform it back to MyGUI.

Nice, plain and simple, isn’t it?

Except it doesn’t work.

Cyclic data structures

The GUI is a heavily recursive and cyclic data structure. Each JComponent keeps a list of child GUI components (e.g. a panel keeps a list of contained buttons). And each child component keeps a reference to its parent component (e.g. a label must know the enclosing panel to tell it that the size of the label changed, so the panel should recompute the layout of children).

For simplicity let us define it like:

 class JComponent
 {
     private JComponent parent;
     private JComponent child;
 }

If You consider this post You will notice that such a structure will be serialized like this:

   JComponent Parent =...
   JComponent Child  =..
  serialize(Parent)
   ... →
      write new JComponent (Parent) //(A)
       write Parent.parent = null
        write new JComponent (Child)
             write Child.parent= refid(Parent) //that is stream reference to Parent set in (A);
             write Child.child = null
        write Parent.child = refid(Child)

and during de-serialization:

  x = deserialize(...)
    create new JComponent (Parent)
       set Parent.parent = null
       create new JComponent (Child)
             set Child.parent= Parent;
             set Child.child = null
       set Parent.child = Child
   return Parent

Now arm it with writeReplace and readResolve exactly as defined above.

serialize(Parent)
   ... →
      call writeReplace(Parent)
      write new MyColdStorage (Parent_cold)
       write Parent_cold.parent = null
        call writeReplace(Child)
        write new MyColdStorage (Child_cold)
             write Child_cold.parent= refid(Parent);
             write Child_cold.child = null
        write Parent_cold.child = refid(Child)

and during de-serialization:

  x = deserialize(...)
    create new MyColdStorage (Parent_cold)
       set Parent_cold.parent = null
       create new MyColdStorage (Child_cold)
             set Child_cold.parent= Parent_cold;
             set Child_cold.child = null
       call readResolve(Child_cold) (Child)
             new MyGUI (Child)
                Child.parent = Child_cold.parent (Parent_cold) //(!)
               Child.child = Child_cold.child (null)
        set Parent_cold.child = Child //(!)
   call readResolve(Parent_cold) (Parent)
       new MyGUI (Parent)
         Parent.parent = Parent_cold.parent (null)
         Parent.child = Parent_cold.child (Child)          
   return Parent

Noticed the lines marked with //(!)?

In this cyclic structure the first use of the de-serialized parent reference happens before the place in which its readResolve(Parent_cold) is invoked. It is because the designers of the standard Java serialization assumed that to resolve an object You need it to be fully read. And of course, since we have a cyclic structure, the process of reading the “Child” in this example will refer to the “Parent” before it was fully read. Thus it will access an unresolved object.

In my case it would produce a ClassCastException, because MyColdStorage is not a JComponent.

It is even worse: we will now have two objects, one being the unresolved MyColdStorage and one the resolved MyGUI, where we originally had a single object.

writeReplace/readResolve doesn’t work in cyclic structures.

Note: This is specified and designed behavior. I can’t tell whether it was intentionally created like that, because a solution is trivial, but nevertheless You will find it in the serialization specs.

How to solve it?

The answer is simple: with standard serialization You can’t. Once it is “hot” it will be “hot” for an eternity.

But if You write Your own serialization engine the solution is simple. Instead of one readResolve use two:

class MyColdStorage
{
  Object readReplace();
  void fillReplacement(Object into);
}

Now readReplace is bound to create an “empty” object of the correct type:

  Object readReplace(){ return new MyGUI(); };

and the fillReplacement is bound to transfer data from the stream form to the target form:

  void fillReplacement(Object into)
  {
    ((MyGUI)into).parent = this.parent;
    ((MyGUI)into).child = this.child;
  };

The readReplace is invoked right after a new instance is created, and the returned value is put into the “stream_reference_map” (see there) instead of the original.

The fillReplacement is invoked in exactly the same place where the standard readResolve() is invoked but, opposite to the original, the “stream_reference_map” is left untouched.

Then make the de-serialization look like this:

 x = deserialize(...)
   create new MyColdStorage (Parent_cold)
    call readReplace() → since now each time "Parent" is referenced use returned value (Parent_R)
     set Parent_cold.parent = null
     create new MyColdStorage (Child_cold)
     call readReplace() → (Child_R)
       set Child_cold.parent= Parent_R;
       set Child_cold.child = null
       call fillReplacement(Child_R) (Child_cold)
         set Child_R.parent = Child_cold.parent (Parent_R);
         set Child_R.child = Child_cold.child (null);
     set Parent_cold.child = Child_R
    call fillReplacement(Parent_R) (Parent_cold)
      set Parent_R.parent = Parent_cold.parent (null);
      set Parent_R.child = Parent_cold.child (Child_R);
   return Parent_R

So is it multi-purpose now?

No.

It allows us to change the purpose, but not to serialize the same object once for one purpose and once for another in exactly the same application.

Can we do it?

Of course.

With “object transformers”. There is absolutely no need for the writeReplace/readReplace+fillReplacement trio to be private methods of the serialized class. They can be any methods defined anywhere, providing the serialization mechanism can find them. For an example we may define:

public interface ITypeTransformer
{
  public boolean isTransformableByMe(Object x);
  public Object writeReplace(Object x);
  public Object readReplace(Object x);
  public void fillReplacement(Object from, Object into);
}

plug it into our serialization engine and be happy.
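For illustration only, a sketch of how an engine could consult such transformers may look like below. The registry class is my assumption, not a part of any existing engine:

final class TransformerRegistry
{
  private final java.util.List<ITypeTransformer> transformers =
                                       new java.util.ArrayList<>();

  void register(ITypeTransformer t){ transformers.add(t); }

  //Returns a transformer for x, or null if a plain field-by-field
  //serialization should be used instead.
  ITypeTransformer findFor(Object x)
  {
    for(ITypeTransformer t : transformers)
      if (t.isTransformableByMe(x)) return t;
    return null;
  }
}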

Can You do it with a standard serialization?

No. Absolutely not.

Summary

After reading this blog entry You should understand that different applications may need to serialize the same object in different ways. You should be aware that the standard serialization is “cast in stone” in that manner, and that the writeReplace/readResolve mechanism is broken in cyclic structures and won’t help You.

You should also know that if You decide on Your own serialization engine, then You can do it in a very easy way.

Java serialization: new Object()…?

In previous blog entries I presented the problems and concepts related to recursive data structures, stream references and reflections used instead of pointers.

Now it is time to move to the next important and, as You will see, most problematic case when it comes to Java serialization. And the problem is….

How to create a new object?

Whenever the binary stream is de-serialized it will, sooner or later, be told to turn a stream command into a hot, live Java object. Of course not just any object, but one of an exactly specified class.

How do You create an object?

In source code it is simple:

  class my_class {...}
  my_class x = new my_class(...);

Of course serialization runs on reflections, so we have to:

Class<?> _class = Class.forName(....);
_class.getDeclaredConstructor().newInstance();

Unfortunately it will not work. Or, to be precise, it will not work in about 75% of cases.
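A tiny demonstration of the most common failure mode, using a hypothetical Point class of my own invention: there may simply be no param-less constructor for the reflection to call:

class Point
{
  private final int x, y;
  Point(int x, int y){ this.x = x; this.y = y; } //the only constructor

  public static void main(String[] args) throws Exception
  {
    //throws java.lang.NoSuchMethodException: Point.<init>()
    Point.class.getDeclaredConstructor().newInstance();
  }
}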

What is behind a new?

The simple new ... is, at the JVM bytecode level, a two-phase process:

  1. First the JVM is told to create a new, empty body of an object. There is a dedicated bytecode for it and a dedicated JVM native hook for those who implement JDK reflections.
  2. Second, the specified constructor (or the default, param-less one) of the actually created class is invoked.

new empty object

Each object in any programming language exists in three “realms”:

  1. Memory allocation realm.
  2. Object management realm.
  3. User data realm.

Memory allocation realm

Each object must use some memory. To be able to do that, the program must somehow assign memory to an object. In case of C++ it may be either a global static allocation, an allocation on the local variables stack, or a dynamic (heap) allocation:

 my_class X;                 //global static allocation
...
void my_code()
{
   my_class X;               //allocation on the local variables stack
}
...
  my_class *X = new my_class();
  my_class *Y = (my_class*)malloc(sizeof(my_class)); //raw memory, not yet an object

The first two cases do not appear in JAVA, so let us keep an eye on the third one. The malloc in the above code is very much alike what happens when the JVM creates an empty body of an object. Not exactly, but alike.

All malloc needs to do is to reserve a space in memory large enough to keep the object management data plus the user data. It also needs, however, to store some additional information which will later allow free(...) to work. Or, in case of the JVM, allow to garbage collect the object.

In C the allocated memory block is not considered to be an object. It is just a bunch of bytes which may be interpreted as an object, but I recommend You not to try doing it.

In JVM we don’t have a facility to allocate memory which is not an object.

Object management realm

Once we have a block of bytes we need to turn it into an object. This requires, at least:

  • writing into it a reference to an object descriptor;
  • wiping the object data to a default state, if the language requires that. In JAVA it is required, so it is done;

Of course different systems may use different approaches, but the absolute minimum is to let the runtime code be able to select a proper implementation of a virtual method. The object descriptor, a part of which is the so-called “virtual methods table”, is one possible way of doing that.

Note: A reader might have noticed that if an object has no virtual methods there is no need for a “virtual methods table” and thus no need for an “object descriptor”. This is true. Fortunately in the JVM every object has some virtual methods, so we don’t need to be bothered by this border-line case.

What is important to remember is that the allocation of an empty object is a two-step process:

  1. Allocation of memory.
  2. “Arming” that memory with object infrastructure.

User data realm

These are simply: fields. Named fields listed by source code.

Initialization of those fields to their proper state is done by constructor.

Where is the problem?

The JVM specification was designed with great flexibility in mind. Thus it is not strictly defined what exactly should happen in each of the two steps described above. Which part of each “realm” is handled by which operation is not told. It may be that an object is completely ready once allocated, or it may be that it is the constructor which is responsible for creating the object management data.

The specification is however very explicit in one point: it requires that the JVM will refuse to execute code which tries to execute just one of those steps, executes them out of order, uses an inappropriate constructor, or allows a reference on which no constructor was called to escape the JVM’s internal stack. In other words: a program which can see the empty object (that is, one without a constructor run on it) won’t be executed.

Note: The OpenJDK is implemented (at least the last time I looked at it) in such a way that memory allocation initializes both the “memory” realm and the “object management” realm. But it doesn’t always have to be like that.

In simple words: we can’t allocate an object without calling a constructor.

What does it mean for serialization?

Serialization should be transparent

Serialization should be as transparent as possible. We should be able to “move” an object from one machine (or moment in time) to another through a binary stream in such a way that the object won’t notice it.

For it to be possible, we need to do:

  1. Create an empty object, but “armed” in the “object management” realm.
  2. Write data into object fields, field by field, thus transporting the “user realm” from one place to another.

We should be able to do it without calling any code on an object.

Note: There are plenty of cases when we will have to call specific code on a de-serialized object, but we will talk about it in other blog entries.

And this is the catch.

We can’t do that with JAVA reflections.

We can’t because we can’t prevent constructor from running.

Why it is a problem?

Because a standard constructor doesn’t know if it is used to “do nothing” during de-serialization or if it is used to create a live, ready-to-use object. In plenty of cases default constructors do initialize some data structures which are absolutely necessary for an object to work, while during de-serialization those structures will be overwritten with data coming from the binary stream.

In some cases it may work, in some it may mostly work, in others it will be a tragedy.

How it is done in standard JAVA serialization?

Very simple.

By hack.

There is an internal JVM class which pierces through the JVM defenses and allows one to split the operation. It allows two things:

  • to allocate an “armed” object without calling a constructor;
  • to invoke an inappropriate constructor on an object;

It is there… but we shouldn’t bother with it, because it is/will be hidden in post OpenJDK 19 releases. Which is good, because it was a hell of a hack which exposed too many internal concepts to users. On the other hand, users who wrote their own serialization implementations were forced to reverse engineer this hack, because serialization was specified in a conceptually wrong manner. Think ten times, do once; this is my motto.

What we should remember is that it is impossible to re-implement the standard serialization in pure JAVA. Full stop. Not possible. Really. Been there, tried it, always ended with a hack. Not to mention that ObjectOutputStream/ObjectInputStream represent one of the worst cases of Object Oriented Programming I have seen (based on JDK 8). If You wish to check how not to write programs, take a look at them.

What to do, what to do?!

Well… Not much is left, is it?

Use constructor.

If You design a serialization system, define some “serialization context” class, i.e.:

public final class MySerializationContext{};

and request that all serializable classes have a constructor:

public class MySerializableClass
{
  public MySerializableClass(MySerializationContext x){};
}

The de-serialization code will execute this constructor, and it is up to this constructor to ensure that an “empty” object is in fact “empty”.
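A minimal sketch of how a custom de-serialization engine might use such a constructor through reflections may look like below. The engine class and its method name are my assumptions:

import java.lang.reflect.Constructor;

final class MyDeserializationEngine
{
  private final MySerializationContext context = new MySerializationContext();

  Object newEmptyInstanceOf(Class<?> clazz) throws Exception
  {
    //Look up the dedicated serialization constructor...
    Constructor<?> c = clazz.getDeclaredConstructor(MySerializationContext.class);
    c.setAccessible(true);
    //...and invoke it to get a truly "empty" object.
    return c.newInstance(context);
  }
}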

Note: Notice that with this approach Your object may support different serialization engines, since it has a dedicated constructor to support each of them.

If a constructor is not what You like, You may require a class to have a “factory” method:

public class MySerializableClass
{
  public static MySerializableClass newInstanceOf(MySerializationContext x){....}
}

which is a must if Your class is a “singleton”.

Note: A “singleton” class is a class of which only one instance may exist per application. Naturally, serialization of such a class is a tricky concept, but sometimes it just means: “use the same facility on the target system as others do”. De-serialization of such a class of course cannot create a new instance and should use the instance present in the target system instead.

What to not do?

Don’t follow the java.io.Externalizable concept. It is broken beyond imagination. It calls the generic param-less constructor, so the class doesn’t know if it is being de-serialized or created anew. It then expects that readExternal actually reads the object. Which it can’t do well if the object takes part in a recursive data structure.
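For reference, this is the whole contract as found in the JDK; the forced param-less constructor call is implied by the mechanism, it is not even visible here:

public interface Externalizable extends java.io.Serializable
{
  void writeExternal(java.io.ObjectOutput out) throws java.io.IOException;
  void readExternal(java.io.ObjectInput in)
                        throws java.io.IOException, ClassNotFoundException;
}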

Don’t even try:

public class MySerializableClass
{
  public MySerializableClass(MyInputStream de_serialize_from_it){ .... };
}

which is even worse.

Summary

After reading this blog entry You should be aware that object construction isn’t just one simple operation. You should also be aware that in JAVA we can’t allocate an object without calling a constructor. And You should remember that the standard JAVA serialization is a hell of a hack.

Java serialization: recursive data structures

In the previous chapter I was talking about how to deal with references/pointers, and I introduced the concept of a “refid” to replace them in a stream.

Now it is time to talk about fields in an object. Especially about fields which are themselves references.

Serializing an object

Note: I do abstract now from how we get the information about fields to store. It may be through “reflections”, but it may also be done directly. For the purpose of this blog entry it doesn’t matter at all.

Imagine we have an object like this:

class Node
{
   private final int payload;
   private final Node next;
   private final Node prev;
}

The reader will quickly recognize that it is a piece of a so-called “bi-directional list”:

The payload is the actual data, which in this case is just a number, and next points to the subsequent node in the list or is null. Alike, prev points to the previous node in the list. The beauty of the bi-directional list is that, having a reference to any node, one can traverse the entire list.

Serializing it

Now let us define a very simple algorithm which will serialize any object:

serialize(Object n)
{
   for(f = each field in n)
   {
     if (f is primitive field) //*1
     {
          write its bits to stream
     }else
     {
       assert(f is reference)
       if (stream_reference_map contains f)  //*2
       {
            take refid of f from stream_reference_map;
            write refid of f to stream to indicate
             re-use of previously stored object;
       }else
       {
            assign new refid to f;
            store f and refid in stream_reference_map;
            write to a stream information necessary to
            create new instance of and retain refid of it; 
            call serialize(f); 
       }
      } //non primitive field.
   }//for each 
}//serialize

Ehmm…. and this is in fact the entire algorithm of Java serialization. Seriously. It is. Simple as a hammer. Except it is not, if You dig into the details and ponder how to do it well.

Now where is the catch?

The first catch (*1) is that we need to tell apart a reference from primitive data. Primitive data can be stored as bits and bytes in a stream. References have to be converted to a “refid” using the method I described in the previous blog entry. The stream_reference_map is used to do that.

The second catch (*2) is to detect the re-use of a previously serialized object, or of an object which is actually being serialized right now.

The “bi-directional list” is an example of a cyclic data structure. If You just walk through the next and prev fields, iterating and calling serialize(...) on each field, You will never finish the job. This is why we do first store an object in the stream_reference_map and only then invoke serialize(...) on it, since this allows us to break the cycle.

Note: Cyclic structures do have nasty side effects during de-serialization, but I will write about it another time.

Recursion → potential of StackOverflowError

If You try this exact example in real life it will work, and then fail. The failure will depend on the size of the list You serialize, on the JVM used and on the computer You run it at, and it will be seen as a StackOverflowError being thrown.
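If You would like to provoke it deliberately, a self-contained sketch like the one below should do. The list length may need tuning for Your JVM settings, and OutputStream.nullOutputStream() requires JDK 11 or newer:

import java.io.*;

class Node implements Serializable
{
  int payload; Node next; Node prev;

  public static void main(String[] args) throws IOException
  {
    Node head = new Node(), cur = head;
    for(int i = 0; i < 1_000_000; i++)
    {
      Node n = new Node(); n.prev = cur; cur.next = n; cur = n;
    }
    try(ObjectOutputStream out =
            new ObjectOutputStream(OutputStream.nullOutputStream()))
    {
      out.writeObject(head); //recursive traversal → StackOverflowError
    }
  }
}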

Why will it happen?

Because You will have a following call:

serialize(x)
{
   ...serialize(x.next)
  { 
        ...serialize(x.next.next)
        {
           ....and so on till whole list is serialized
        }
  }
};

This is how an excellent feature of the “bi-directional list” turns into a problem. You can reach the whole list starting from any node by traversing the next/prev fields. Thus, during this plain serialization You will traverse the whole list in just a single call to serialize(...) on any node.

Call stack/stack frames

Now we need to get a bit into the details of how CPUs work and how compilers turn Your code into machine code.

Basically, each time You call a method You need to supply the following data to the CPU:

  • what arguments are passed to that method?
  • where is this method in memory, so that CPU can jump there?
  • where to return after the method completes?

In most languages the arguments are indirectly transformed into local variables, so when we talk about method invocation we may treat them the same.

If You declare the method:

  private int doom(int a)
  {
    int b = a *a;
    return b;
  }

You are in fact declaring two local variables: a and b.

The locality of variables is twofold: first, their names are not visible to the compiler from the outside of the method which declared them. Second, and most important, they have to be implemented in such a way that if two threads invoke doom(...), those invocations must not interfere with each other in any way. Each needs to use its own, separate piece of memory to hold “a” and “b”.

This can be implemented in many interesting ways, but the most common and most reasonable way is to use the concept of a “stack frame”.

The “stack frame” is a block of memory where all the information mentioned above is stored. For an example like this:

position | name   | meaning
---------+--------+----------------------------------------------------------
0        | a      | space reserved for the parameter “a”
1        | b      | space reserved for the local variable “b”
2        | retptr | space where we store the return address to jump to when the method ends

Whenever the method is invoked, a new “stack frame” must be allocated for it.

Call stack

There is no obvious problem visible yet. A new “stack frame” must be allocated? Well… what’s the problem?

The problem is: speed.

Calling code is one of the most common and elementary actions a CPU may take. So it must be hell fast.

To support it, 99% of CPUs have dedicated instructions:

call TARGET
ret

which are implemented in the following way (I assume TARGET is a compile-time constant in this example):

call TARGET:
{
 x = memory[PC];//PC is the "program counter register".
                //In this example I assume TARGET is stored
                //right after the "call" opcode, and PC
                //already points at it.
   PC=PC+1;     //so that PC points at the next instruction.
   take SP;     //SP is the "stack pointer register"
   memory[SP]=PC;//save the address of code after our "call"
   SP = SP+d;   //d is the size of a return address
                //expressed in SP units
   PC=x;        //and jump to TARGET.
 }
ret:
{
  SP=SP-d;      //make SP point at the begin of the stored PC
  PC=memory[SP];//jump there
}

The remaining CPUs have:

 brl TARGET_REGISTER,LINK_REGISTER
 br TARGET_REGISTER

where “brl” stands for “branch with link” and “br” for “branch”. Both are implemented simply as:

brl TARGET_REGISTER,LINK_REGISTER:
{
   PC=PC+1
   LINK_REGISTER=PC
   PC=TARGET_REGISTER
}
br TARGET_REGISTER:
{
   PC=TARGET_REGISTER
}

You may easily see that “brl/br” is just a piece of “call” and “ret” without SP being involved. Of course, since the number of registers is finite, they have to be stored somewhere during a call. Usually on the “stack frame”, for whose allocation the SP will be used. So usually the compiler will use pre-increment/post-decrement memory addressing modes and generate something like this:

ld TARGET,R0
brl R0,R1
st R1,@++SP

for calling, and an alike sequence for the return:

ld @SP--,R1
br R1

One way or another, the entire operation is made using memory[++SP]=x and PC=memory[SP--] style operations, which in fact use ++SP and SP-- to allocate and release the tiny bit of memory necessary to store the return address.

This memory space is called “call stack”,”control stack” or simply a “stack”.

This simple method of allocation is fast. Since it is fast, 99% of compilers use exactly the same space to allocate “stack frames”.

Our doom may be compiled into:

doom:
  SP+= sizeof(a)+sizeof(b)
  ...play with them
  SP-= sizeof(a)+sizeof(b)
 ret

Again, it is fast and supported directly by a hardware.

Note: There are CPUs which have a separate physical memory for the “control stack”. In such a case it cannot be used for local variables and a separate “stack” is created by the compiler. It will look very alike though, with the exception that the return address won’t be a part of the “stack frame”.

Problem with ++SP/SP-- memory allocation

The problem with this simplicity is that it is very simple indeed. It needs continuous memory. This means that the compiler must decide from the start in what address space to put static variables, where to put dynamic variables (those managed by new and the garbage collector), and in what address space to put the “call stack”.

Or, in fact, “call stacks“. Because we need one of them for each thread.

Note: A CPU capable of Virtual Memory may actually use the Memory Mapping Unit to place all call stacks at the same logical address. I haven’t seen it done though. Possibly because it is very CPU+OS specific and makes thread switching more time consuming, as each time not only the registers would have to be restored from the thread state data, but the MMU map too. And since the “call stack” is accessed through the SP register anyway, there is no real benefit in doing that.

One way or another, the compiler must decide at the beginning what the size of the stack for each thread would be. It must arrange it and allocate it somehow (it can do it with the new subsystem too), but once it is done, it is fixed in the address space and must be large enough to accommodate all the “stack frames” which will be needed by the thread.

StackOverflowError

And after this digression we are back to Java.

The fact that the stack is hardware managed and that it must be large enough to accommodate all “stack frames” creates a tricky problem.

What happens if You try to store too many stack frames on a stack?

It will overflow. This overflow is basically overwriting data which are placed after the stack, messing with them, and then returning as if nothing happened. This is one of the nastiest problems to detect, since the cause and the effect are far, far away from each other in time. Very common in the embedded world.

The C language defines the stack that way, mostly because not all CPUs can detect the problem before it happens. Most modern, big CPUs can do it without any performance penalty. If however it is not supported by hardware, proper testing code can always be injected:

doom:
  if SP is such that stack frame won't fit
  {
       throw an exception
  }
  SP+= sizeof(a)+sizeof(b)
  ...play with them
  SP-= sizeof(a)+sizeof(b)
  ret

This will generate a constant and significant performance penalty. Java was however designed with safety in mind, and when its designers had a choice between speed and safety, safety came first.

If Java detects stack problem it throws the StackOverflowError.

This is a very brutal error and it is rather hard to recover from it. Possible, but it usually indicates a huge problem with the code… or…?

Back to our list

And now we can easily see why using recursion to serialize a bi-directional list is a bad approach. It is bad because the number of recursive calls depends on the size of the list, while the size of the stack remaining when we call serialize(...) depends on the program state and execution context. The JVM, depending on settings, version and environment, may assign stacks as small as 16kB and as big as 4MB. This means that on some computers said list may serialize without problems and on some it may throw this nasty exception. And even worse, You can’t check it or control it from code.
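The stack size is set when the JVM is launched, for an example:

java -Xss256k MyApplication
java -Xss8m MyApplication

The first will make such a serialization fail sooner, the second will postpone the failure. Neither removes the problem.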

Solving this problem

As You can see, the recursive serialization, even though simple in concept, may fail miserably on small but cyclic data structures.

The ideal solution would be to not use recursion at all, but that complicates things a lot. I will discuss it some time later in following posts.

Of course this is not a problem which is unknown to the world. It is known and was known from the beginning. The designers of Java serialization decided to solve it in another way. They left the recursion in place, but allowed custom serialization code. Like this:

serialize(Object n)
{
   if (n contains custom serialization )
   {
      invoke it
   }
   else
   {
    for(f = each field in n)
    {
      if (f is primitive field) //*1)
      {
       ...

The java.util.LinkedList makes use of that functionality. It is a bi-directional list and it serializes well, because it has custom serialization code which does:

custom_writeObject(...)
{
   for(x: this) serialize(x);
}

which basically iterates over the list and serializes the payload stored in the list Node objects; see the sketch below. The nodes of the list itself are not serialized at all.
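A minimal sketch of the same trick, but not the actual java.util.LinkedList code, may look like below: the nodes are transient, the payloads are written iteratively, and the chain is rebuilt on read:

import java.io.*;

class IntList implements Serializable
{
  private static final class Node { int payload; Node next, prev; }

  private transient Node first, last;
  private transient int size;

  void add(int v)
  {
    Node n = new Node(); n.payload = v; n.prev = last;
    if (last != null) last.next = n; else first = n;
    last = n; size++;
  }

  private void writeObject(ObjectOutputStream out) throws IOException
  {
    out.defaultWriteObject();
    out.writeInt(size);
    //a plain loop, no recursion over nodes
    for(Node n = first; n != null; n = n.next) out.writeInt(n.payload);
  }

  private void readObject(ObjectInputStream in)
                          throws IOException, ClassNotFoundException
  {
    in.defaultReadObject();
    int count = in.readInt();
    for(int i = 0; i < count; i++) add(in.readInt());
  }
}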

Is it good?

Yes and no. In my opinion: it is not a good solution. It relies on the fact that the person who wrote a data structure realized that it may cause stack problems and dealt with it. What is wrong, in my opinion, is that the appearance of the problem depends on the size of data stored in said structure. Size is a matter of the context in which it is used, while recursion and custom code are a matter of the design process.

It also generates a lot of problems with versioning etc., creating “cast in stone” code, but this is not the right place to write about it.

Plus, cyclic structures pop out continuously in many places and are a natural method of implementing so-called “observable data”. Sooner or later You will attempt to serialize them without realizing it.

Summary

After reading this, again too long, blog entry You should have a grip on how an object is scanned during the serialization process and what problems it may produce. You should also be able to realize the possible problems with the call stack when designing Your own algorithms which may involve crawling through linked and cyclic data of significant depth. And, of course, You should now realize what the real reason is behind the writeObject/readObject private methods which You may encounter in many JDK classes.

And now it is time to see how an object is created when de-serialization takes place.

JAVA serialization: if not the pointer then what?

In that post I talked about “reflections” in JAVA and how this concept relates to serialization.

In this post I would like to say a few words about how to deal with “object references” if we can’t have a pointer.

Why do we need a pointer?

Under the hood: to say what is where in memory. But looking from the outside, just for one thing: to tell two objects apart. If their references (i.e. pointers) differ, then those are not the same objects. They may be bit-by-bit equal, but they are not the same.

So basically as long as we can do 1:1 mapping:

  Object reference ↔ sequence-of-bits

then we are done.

Identity of objects in JAVA

Gladly JAVA provides two facilities for that:

class System{
  ...
  public static int identityHashCode(Object x)
  ...
}

which computes a “magic” number providing a non-1:1 mapping:

  Object reference → 32 bit integer

and

  Object X, Y;
    X==Y
    X!=Y

the reference identity operator, which can tell when two “pointers” point to the same object.

Those two are enough to create an identity hash-map (like java.util.IdentityHashMap) which can quickly map an Object to an int:

  stream_reference_map = new IdentityHashMap<Object, Integer>()

Of course we could do the same without identityHashCode, using only the == operator and a list of structures like:

class Descriptor
{
  final Object reference;
  final int assigned_number;
}

but it would be a few orders of magnitude slower.

Stream reference

The stream_reference_map shown above maps an Object into an int number. This number is called the “stream reference identifier” or, in short: “refid”.

Note: Remember, the “refid” is not the result of identityHashCode()! The identityHashCode() does not produce a 1:1 mapping! It may return the same number for many objects. It is used just to speed things up by grouping objects in “buckets”, over which we still need to use the == operator.

Producing stream reference identifier

Any method will do. You should however think about a few questions:

  1. Should I allow transfer of unlimited number of objects to stream?
  2. Should I allow garbage collection and re-use of refid?

Usually a simple incrementing counter will be ok.
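A minimal sketch of the writing-side bookkeeping with such a counter may look like below. The names follow the convention of this blog, they are not any real API:

import java.util.IdentityHashMap;

final class RefIdRegistry
{
  private final IdentityHashMap<Object,Integer> stream_reference_map =
                                             new IdentityHashMap<>();
  private int refid_generator;

  //Returns an already assigned refid, or -1 if the object was not seen yet.
  int lookup(Object x)
  {
    Integer refid = stream_reference_map.get(x);
    return refid == null ? -1 : refid;
  }

  //Assigns a fresh refid to a not-yet-seen object.
  int assign(Object x)
  {
    int refid = refid_generator++;
    stream_reference_map.put(x, refid);
    return refid;
  }
}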

Using stream reference

Basically You use it exactly the way You would use a “pointer”. You would like to write a pointer to object X to a stream? Then You look up the “refid” of X and write that “refid” to the stream. Simple.

The question is when You would like to write a pointer, but this is another story.

Reading-side map

The above:

  stream_reference_map = new IdentityHashMap<Object, Integer>()

provides the Object → int map. Unfortunately it is just one part of the story, the part which is used to write pointers to a stream. The other part of the story is: what to do with a “refid” we read from a stream?

The reading side needs:

  int → Object 

map. Gladly, if You have chosen an incrementing counter for the “refid” generator, and You are fine with 2^31-1 objects in a stream, the simple:

  read_refid_map = new Object[...];

will do the best job.

Note: Unless You are actually planning to get anywhere near the 2^31 region in the number of objects. A more “scattered” structure will better handle growing and shrinking of the array during the life of the serialized stream.

Problems

The first problem, which is not dealt with in the standard serialization, is a memory leak. Yes, the standard serialization does leak like hell!

Hard-reference+garbage collector==memory leak

The stream_reference_map = new IdentityHashMap<Object, Integer> used at the writing side utilizes the standard, plain reference to an Object as a “key” in the map. This has an unfortunate effect: as long as this map exists, the garbage collector will see all contained objects as “reachable” and won’t release them.

Usually it is not a problem, but if You decide to, for an example, use serialization for logging in Your application, You will get a nasty surprise.

Imagine You arm Your application with logging commands in the following manner:

void woops(int a,int b)
{
  ....
  if (log_level_enabled) log_object_output_stream.writeObject("calling woops("+a+","+b+")");
  ...
}

Each time this code runs, a new string is formed and written to the stream as an object. This means that it must have a “refid” assigned. And if it must have it assigned, then it must be put into the stream_reference_map. Since the map is using hard references, the string will stay there forever. Or, precisely, until OutOfMemoryError.

A proper stream_reference_map must hold references to the mapped objects through a WeakReference.
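Notice that java.util.WeakHashMap won’t do here, since it compares keys with equals(). A weak identity key has to be hand-made. A sketch, assuming the engine polls a ReferenceQueue to purge dead entries, may look like this:

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

final class WeakIdentityKey extends WeakReference<Object>
{
  private final int hash; //cached, since the referent may disappear

  WeakIdentityKey(Object referent, ReferenceQueue<Object> queue)
  {
    super(referent, queue);
    this.hash = System.identityHashCode(referent);
  }
  @Override public int hashCode(){ return hash; }
  @Override public boolean equals(Object o)
  {
    if (o == this) return true;
    return (o instanceof WeakIdentityKey)
        && this.get() != null                        //a cleared key equals nothing
        && ((WeakIdentityKey)o).get() == this.get(); //identity, not equals()
  }
}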

Passing garbage collection event

Of course, even if You deal with the above, You will still hit the OutOfMemoryError at the reading side of the stream.

The simplest:

  read_refid_map = new WeakReference<Object>[...];

will not work. The weak reference works at the writing side because, if the only place for an object to exist is the stream_reference_map, then there is no way to write it again to the stream.

At the reading side it is very different. The reading code may pick a “refid” from the stream (and objects) and drop them right on the spot. The writing side may however hold on to the object for a very long time and write it to the stream many times. Of course, to avoid many problems which I will discuss somewhere else, it will prefer to write the “refid” of it. If the read_refid_map were made of WeakReferences, there might be no object left to map it to.

A good “refid” system does pass garbage collection events to the reading side.

Roll over

Of course int isn’t infinite. Even if You use proper garbage collection of “refid”s You will still, sooner or later, hit:

   assert(refid_generator+1 > refid_generator )

that is, a “signed wrap around”. You will run out of possible “refid”s to use.

This is something which is also not addressed in the standard serialization. The bad part is that the standard serialization is not utilizing the entire 2^31-1 pool of numbers, and the roll-over happens earlier, producing some commands instead of a “refid”. Fortunately You need a really huge VM to hit this problem, since usually the OutOfMemoryError will appear first.

A good “refid” system does re-use garbage collected refids to avoid roll-overs.

Summary

After reading this chapter You should know what the “stream reference identifier” is and how not to design the system which manages it. This should also make You notice that a standard serialization stream cannot exist permanently or be used for a large amount of data produced on demand.

And now You may move to the following part, in which You will read about how an object is scanned during serialization and what problems it may create.

ISO 9000 and unit test?

In this entry I would like to talk shortly about the ISO 9000 quality assurance system and unit tests. But, and it may surprise You, not in the context of how ISO 9000 may utilize unit tests, but how unit tests may be utilized to keep ISO 9000 in an operational state.

Audit is not a check

Launching ISO 9000

The ISO 9000 is a strict, formal, “paperwork” centered quality assurance system. In very short words: You, the entrepreneur, create some numerous quality assurance procedures and then put them into operation.

During the system startup it is carefully checked whether Your procedures reach the goals required by ISO 9000.

ISO 9000 requirements

The requirements of ISO 9000 are very, very inaccurate. They are just generic statements: Your organization should know this, Your organization should document that. Is it strange? No. ISO 9000 is aimed to help You ensure quality in almost any kind of enterprise, from manufacturing shoes to printing books. They simply couldn’t be more specific.

Imprecise task == imprecise result

The problem with such an approach is that You may craft a fully ISO 9000 compliant system which won’t have anything in common with quality assurance. You can design procedures which do work, do reach the ISO 9000 goals, but the stuff You make is total crap.

Primarily because producing crap was Your goal, and the proof of crap quality is in the stench. ISO 9000 must be able to cover that kind of activity too, so don’t be surprised it allows that.

Auditing ISO 9000 system

The ISO 9000 has built-in fail-safe procedures. These are the “audits”. You are bound to perform the “audits” periodically.

The problem is that even a proper audit may be restricted to taking a procedure and checking if Your organization adheres to it. There is no intention during an audit to actively search for quality problems.

And, sadly, plenty of quality problems may live outside the procedures, or at the boundaries between them.

The world is turning…

… and the world is changing.

Nothing stays stable.

Small example

I have a pleasure (still) to work for a company which has a great belief in adherence to procedures. We do things this way, because we have been doing it that way before and it worked. Recently this company had to order an expensive “vamp-up” of its production management software, because our procedures contained a drawing template which required that the material used should be put in such and such place. The software offered a method for automatically managing the consumed materials, including an automatic computation of the amount, but the result wouldn’t be a “part drawing” but an “assembly drawing”. Visually speaking, the table on the drawing would look different.

Is it a problem to pay a significant amount of money for that “vamp-up”?

No. The owners love old procedures: “It worked yesterday, it will work tomorrow.”.

Hardly false.

But, and there is always a but, a short investigation has shown that the reason behind this whole drawing template was that back in the 1970s You could order pre-printed sheets of tracing paper (I’m not sure about the English word here: a semi-transparent sheet of paper-like material to draw on with ink).

Yes. About 50 years ago it was convenient to have pre-printed sheets, so 50 years later we change an electronic system to follow this traditional procedure.

Method != goal

An ISO 9000 procedure is an algorithm to follow. If You do so, You are certain to reach the assumed goal.

Really?

Always?

Ever tried to bake a cake based on an 1850 recipe?

Let us take the phrase: “…and take a spoon of sugar”. I didn’t realize why my cakes never came out properly until I got myself a silver spoon from about 1900. It is huge!

What does it mean?

Simply, if You specify a method without specifying a goal, Your recipe will decay, rust and become a problem, not a solution.

Plenty of existing ISO 9000 procedures do not specify accurately enough the goals they are bound to reach.

Reverse approach

There is one very annoying order You may issue to Your subordinate: “Do it well. I don’t care how You do it, reach Your goals.”

You may then imagine that You write Your ISO 9000 procedures in terms of goals. For an example: “the kind and amount of material needed to produce a piece must be known from the drawing.”.

Then absolutely any method of describing the material will work. And it is clearly seen whether a drawing is good or bad. Has it a material specified? It is good. Has it not? It is bad.

This kind of quality assurance procedure may survive hundreds of years without a need of alteration.

It is not ISO 9000

It is, however, not an ISO 9000 system then.

ISO 9000 is about procedures. Procedures. Methods, recipes, algorithms. Call it however You like, but it can’t leave You any flexibility. This is because an ISO 9000 procedure should warrant the quality even if a brainless ape uses the procedure. It may not leave space for invention, because inventing equals making mistakes.

Which is good.

Except that if an ISO 9000 procedure decays, the quality is gone and nobody will notice. On the contrary, in a “goal based” quality assurance system the quality will stay, but will fluctuate depending on the person and the method chosen.

Mixing it together

Unfortunately if You need ISO 9000 certificate You need procedures. Specifying goals is not enough.

Source of problem

The reader might have already noticed that the problem with the “procedural” approach is the procedure decay. Time passes, conditions change, and the algorithm at best starts generating expenses instead of quality.

The “audit” won’t detect it unless it is performed in a very investigative way. The problem is that You can’t investigate whether a procedure still reaches its goals if You don’t know them!

So we need an efficient, easy to use system which will allow us to detect quickly, and at low cost, whether an ISO 9000 procedure is still doing what it should be doing.

Unit tests to rescue!

So lets see what can be done.

Specify Your goals

This is a must. Absolutely every ISO 9000 procedure must specify goals. The specification must be:

  • bound to a specific ISO 9000 requirement. This will help us prune the system of useless procedures if ISO 9000 itself changes;
  • easy to understand;
  • pin-point accurate. A sentence like: “The goal of this procedure is to manage changes in documents” is a good title but not a goal. You must be an order of magnitude more accurate. Like, what is a “change”? What is “manage”? Is any formal acceptance procedure required? Do we need an ability to roll back? Is notification needed? Etc., etc.

Specify Your algorithm

The usual way. Nothing to say. Do it as You did before.

Specify tests

And here is the big change. As You probably already noticed, specifying accurate goals is not an easy task. It requires a very analytical mind to do it correctly. Such minds are not easy to find and are usually hell expensive. Plus, hardly anyone with a plain mind can understand what they are babbling about.

In fact, it is so hard, that You should assume that below 1% of Your “white collar” employees can do it right.

This is because 99% of humans think “by example”. And, in our case, it is good.

They can specify examples which will show that the procedure works. And “examples” are what unit tests are about!

For an example, if You have a quality assurance procedure for the above mentioned drawings, You may add a simple test:

1. Take a drawing from the official documents pool.
2. Check if it has the kind of necessary material specified.
3. If it does, the test is passed. If it doesn't, the test fails.

Tests pool

Now imagine that with each procedure a certain number of tests of the above complexity is bound. And You are auditing the procedure. You are doing it with an additional goal in mind: to check if the procedure still works.

Without knowing goals of procedure it is impossible.

Knowing them it is possible but very difficult. A truly investigative job.

But with tests… How much of an effort does the above test require? I could do it drunk and tired. A kid from school could do it. Tests are easy because they are focused on a single example.

Of course You can’t prove a method by a positive example. Even piling examples up, You can only raise the probability that the method is correct.

But if any of the tests fails, the entire method is wrong.

And You have detected it at an extremely low cost.

Summary

I could write more. And even more. But this is not the right place and moment.

What You should understand is that adding unit tests to ISO 9000 procedures can be an extremely cost efficient method of ensuring that Your ISO 9000 keeps Your quality in check.

Are there other benefits?

Sure. Exactly the same as when You program with unit tests. Let me know in the comments if You would like to have this article extended and to know more about tests in ISO 9000 quality assurance systems.

Git LFS… use or avoid?

What is LFS?

Git-LFS is an extension to standard GIT meant to deal with “Large File Storage”.

The standard, raw GIT deals well with any kind of files of practically any size. If however the user does the simple and common operation:

git clone

then the entire history is downloaded and stored locally. The user usually performs this action to use or work on some development of the existing data, which means that he/she is mostly interested in the current state of work. The entire history is usually not necessary, but who will bother with shallow cloning when the standard “clone” is easiest to do?

Now imagine that You decided to use GIT to store some JPG images You work on. One image is about 4MB, and You have a 100-levels deep history. This gives 400MB of repository size, since JPG files are so much compressed and data-scattered that GIT will have a hard time making an efficient diff-compression of them.

And here LFS comes into play. It delays the actual download of those images till Your user runs:

git checkout branch/commit

Thanks to that approach You may save a lot on the bandwidth of Your GIT server.

How it is done?

Commit/checkout

Basically, when You enable the LFS extension, then each time You commit a file matching a pattern You told LFS to take care of, it will replace that file in the repository commit with simple text information: “I, the LFS, took care of it and stored it as XXXX”. Then it will copy this file somewhere inside the .git folder of Your repository.
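What lands in the commit instead of the file is a small text pointer. It looks roughly like this (the hash and the size below are made up for illustration):

version https://git-lfs.github.com/spec/v1
oid sha256:98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
size 4194304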

The checkout does the reverse: it detects the text information and, using it, replaces it with the actual file.

Push/pull

If You do git push, then it pushes the commit the usual way, and then it sends, using a dedicated protocol, those stored files to the server. The server will put them in a structure called the “file storage” and match them with the XXXX from the above mentioned text information.

When You pull, nothing is done, unless You specifically tell LFS to download its files. Instead, when git checkout can’t find the files locally, it downloads them from the server using the LFS protocol.

Benefits

At first I was very happy with it. It was doing its job and looked like a good way of keeping CAD files in GIT.

Until I realized some nasty side effects.

Side effects

File-format standardization

I do maintain a small company GIT server. I took great care to run it on LVM mirrors and to do a daily diff-backup to an external drive. So I am well protected against hardware failure (mirroring) and against sabotage (daily backup).

But what if the server has to go down for a long time? Or what if the server software loses compatibility with the ever changing environment and I won’t be able to keep it running?

I did take care to check how the GIT repositories are stored by this server. And they are kept as plain bare repos. This means that if the server goes down I can simply copy those bare repositories to any other server, or even to a SAMBA file server. It will have restricted functionality, but I will be able to use them without having the server software running.

This is because GIT is defined at the file level. Anyone who writes a GIT server will surely be tempted to use a GIT library (like libgit2 or JGit) to do the hard work and won’t be trying to re-invent the wheel.

Protocol standardization

With the LFS it is different.

Putting aside the vague, loose and very imprecise specification, LFS is specified only at the protocol level. The implementation of the local file storage is not a part of the specification, and there is absolutely no specification of how the server should reflect the “file storage” and its relationship with a repository on the server file system.

This means that if the server dies to the level that I won’t be able to make it handle git clone --mirror, then I won’t be able to transform the server-side format into the local format. And without it I won’t be able to push it to another server.

With bare GIT repositories I can just copy them to another server implementation. With GIT-LFS I can copy them only to another instance of the same implementation.

This scared me a bit.

Centralized instead of distributed

Another bad thing is the fact that once You start using LFS, You turn Your GIT system from a distributed data storage into a centralized one.

The first moment I realized that was when the network in my company died and I could work with all the plain GIT repositories, but couldn’t do much with those which were using LFS.

Then it occurred to me:

With plain GIT, each time a user does git clone, I gain a free-of-charge complete*) backup of the repository data. Even though my server is well set up, it may always get attacked. Deletion of data will be easily detected, but if an attacker changes the repository content it may go undetected. Until, of course, someone who has a clone does a git push. GIT will complain to them, I will have a chance to detect the problem, and I will have an untainted backup on some of the workstations.

Abusive to server

The other fact I wasn’t aware of when I started using LFS for CAD files was that there won’t be any efficient diff compression. LFS just takes the files and sends them to the server. The server I have isn’t very smart in that area and does not do the diff-compression by itself. So the server side repositories quickly ballooned out to surprising sizes.

Abusive to history

Initially, when I was reading about “intercepting files on commit”, I was, being a coder, under the impression that it is done in the right place. That is, after computing the commit hash and before the file content is directed to the diff-compression routines which turn it into GIT blobs.

Unfortunately it is done before computing the commit hash.

Where is the nastiness in it?

In those two commands:

git lfs migrate import
git lfs migrate export

They are clearly the recommended ways of turning Your existing repository into an LFS one and, backwards, an LFS-managed one into a plain GIT-managed one.

Being careless in that manner I thought: “Ok, so I can switch back to plain GIT if I don’t like LFS.”.

This is true only as long as there are no clones.

The fact that LFS acts before the commit hash computation means that a migration to or from it will rewrite the history of the repository, changing the hash of each and every commit. This is horrible, since after such an operation each and every clone out in the world will be able neither to git pull nor to git push.

No files deletion

At a certain moment I thought: “All right, I made a mistake over-using the LFS. I can get control over all clones in the company, tell users to push their changes, migrate off the LFS and tell them to clone again from scratch.”.

So I did it, destroying the consistency of the history between the server and the clones:

git lfs migrate export ...
git push --force

And You guess what? The file-storage size on the server did not change. Nothing. Zero. All files were left as they were before, exactly as if LFS were still in use. Hmm… maybe a bug on the server side?

So I did inspect the LFS protocol. I even asked the LFS guys. They did confirm: there is no way for LFS to delete objects from the file storage.

Ehm…. Say what?!

Gladly, this server tracks which file comes from which repository, and if I delete the repository the files in the file storage are also deleted. The problem is that the same happens with all the tickets, discussions and other data which are bound to the repository. I can hardly call it a work-around.

Summary

I am very sad to say it, but if I had to say one thing about git LFS it would be: Avoid it at all cost!

Be very, very careful and consider why you need it and balance it with all above side effects.
Remember, once You enter the LFS path it will be rather hard to abandon it completely. You can stop using it for new commits, but You will have to keep it around for accessing history.

For me the absolutely only reason to use it is to save network bandwidth. But there are better ways of doing that. Read about shallow clones, single-branch clones and blobless clones; a few examples follow below.
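
For example (real git options; replace the URL with Your own server):

  git clone --depth 1 https://server/repo.git           # shallow clone: only the most recent history
  git clone --single-branch https://server/repo.git     # only one branch, with its full history
  git clone --filter=blob:none https://server/repo.git  # blobless clone: file contents fetched on demand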


*) Well… not a truly complete backup of all server data. Tickets, issues, forums etc. are usually kept outside the repository structure. But the actual work itself is preserved.

Introduction to JAVA serialization

In previous chapters I was talking about the generic ideas of serialization, dealing with pointers and dealing with versioning.

In this chapter I would like to show You how this concept and, of course, the problem of “pointers” are dealt with in JAVA.

JAVA limitations

No pointers

The most important limitation is: JAVA has no concept of a “pointer”. There is a beast called an “object reference”, but there is no “pointer” which can be turned into an integer number representing an address in machine memory. And since there is no “pointer”, there is absolutely no way to access machine memory the way You can do it in C/C++. Just no way.

Unless, of course, You try to use the unsafe set of classes and methods. Then You can play with bits and bytes, but such play will produce memory dumps which will survive neither years nor a transfer to a different virtual machine running on a different architecture. Notice also that most of the unsafe functions which were present in JDK8 are removed in JDK11+, and will be hidden even more in future versions, because they are unsafe and allow a hell of a lot of hacking on servers which run “outsider code”.

Thus: no possibility to do a “memory dump”. Which, in fact, isn’t bad.

But what do we have instead of “pointers”?

We have objects. Or, precisely speaking, “references” to objects. Which are internally “pointers”, but are completely opaque to us and can’t be transformed into any kind of number or anything else. Not at all. All we can do with “references” is:

  • compare them using == operator;
  • assign one to another using =;
  • use them to access fields, array elements or invoke some methods attached to them;
  • create “objects” we can later reference to through “references”.
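
In JAVA code those four permitted operations look, for example, like this:

  Object a = new Object();  // create an object we can later reference
  Object b = a;             // assign one reference to another
  boolean same = (a == b);  // compare two references with ==
  int h = a.hashCode();     // use a reference to invoke a method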

No malloc

The second limitation is the lack of a generic way to “allocate some memory”. Any allocation must be either an allocation of an array or of an object. And when You allocate an object, then one of its constructors must be called. This concept prevents us from creating an object with absolutely zero initialization and later filling it with data taken from a serialized form. Some collaboration from an object is required.

Note 1: The standard serialization API does interact with the VM at the native level to create really empty objects without calling an appropriate constructor. We could also do it, but in pure JAVA it is not possible without referencing some internal, JDK-specific classes. Notice however that re-implementing the “standard” serialization is not our goal. What we aim for is a long-lasting, stable, non-tricky way of supporting flexible serialization.

Note 2: If You inspect the JVM specification, then at first glance it will be difficult to find where what I just said (that the constructor must be called) is required. Indeed, the instruction set does allow to allocate an object and not call a constructor at all. Such code won’t pass, however, the verification stage, that is, the process during which the JVM checks if the class file is syntactically and logically correct. There are tricks which allow forcing the JVM to disable class file verification, but that is asking for serious problems.

JAVA benefits

To overcome the limitation related to the lack of pointers, the designers of JAVA introduced one very rare and very powerful mechanism: reflections.

Not many programming languages do have it.

What are “reflections”?

In very short words, this is a set of methods and classes which allows You to inspect any reference to an object instance You get. You can ask it to tell You how the class of this object is named. You can check what classes it extends or implements. And You can, what is most interesting to us, ask it to list and manipulate all fields contained in it, with their names, types, annotations and currently assigned values.

You can even access this way fields which are normally hidden from You if You just try to reference them in code, which was always a bit of a disputable aspect of reflections when looked at from the security point of view.

Including, of course, fields You did not know at compile time. Exactly what we need.

Note: The JDK 9+ module system puts some constraints on it, but You can still do all the things we need. Just not with absolutely every object, as You could have done in JDK8 and prior releases.

In simpler words: if You have any “reference” to an object then, using reflections, You can:

  • ask how the class it is an instance of is named;
  • ask what all the “fields” (that is, object-bound variables) contained in it are named and what type they are;
  • ask what methods (functions to call) are declared there and what parameters they take.

And vice versa: knowing the answers to the above questions You may, using reflections:

  • create a new instance of an object knowing the name of its class;
  • set to or get values from any of its fields;
  • invoke any of its methods.

Hey, I can do it in C too!

No, You can’t.

I might have been a bit imprecise in what I was saying. Yes, You can do everything like that in C. At the source code level and at what we call “compile time”. That is, You can write:

struct X{ int x; };
struct X *x = ...
x->x = 4

However in JAVA using reflections You can do:

String class_name = ... // read it from a file, for an example
String field_name = ...
Object A = Class.forName(class_name).newInstance()
Field x = A.getClass().getField(field_name)
x.set(A, 4)

Note: As always, the code examples are simplified and won’t compile. They are only exemplifications of some idea.
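
For the curious, a minimal version which really compiles could look like this (the Person class and its age field are made up just for this illustration):

  import java.lang.reflect.Field;

  public class ReflectionDemo
  {
      /** A class we will manipulate purely by name, as if the name was read from a stream. */
      public static class Person
      {
          public int age;
      }

      public static void main(String[] args) throws Exception
      {
          String class_name = "ReflectionDemo$Person"; // could have come from a file
          String field_name = "age";

          Object A = Class.forName(class_name).getDeclaredConstructor().newInstance();
          Field x = A.getClass().getField(field_name);
          x.set(A, 4);

          System.out.println(field_name + " = " + x.get(A)); // prints: age = 4
      }
  }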

In other words, “reflections” allow You to compute the class name at run time, also from externally supplied data, and to manipulate objects of that class as much as You like.

Including classes which did not exist when You wrote the code.

Nice, isn’t it?

Note: Java also allows You to actually generate a binary class file in Your code and tell the JVM to load it at runtime. But we won’t be needing that for serialization.

Reflections+references versus pointers

As You might have already noticed, the lack of pointers prevents us from making a “memory dump” of an object and from manipulating its data at the bit-by-bit level. The “reflections” do allow us, however, to manipulate the actual data stored in an object using names and values, without bothering about how they are kept in memory.

Do we need anything more?

Summary

After reading this chapter You should be aware how the “object reference” concept differs from the concept of “pointers”, and how “reflections” allow to overcome the limitations related to the full opacity of the “object reference”.

You should also have guessed that “reflections” will play a critical role in the implementation of the “indirect versioning” concept.

And, of course, You should also have noticed that the fact that an “object reference” is fully opaque means that we simply can’t save it. But this problem is left for the next blog entry.

Serialization: versioning

In this post You could read about serialization in general and how pointers mess with it.

In this post I would like to touch another aspect which influences serialization engines a lot.

This subject is:

Versioning

As I said previously, the dump of bytes and bits not only depends on the code, target CPU, compiler etc., but will also change from version to version of Your program.

Now, for the sake of this discussion, assume that You can force Your compiler to produce a stable, predictable memory layout across all compiler versions. Assume also that You do not care about different CPU architectures.

Now You can say that the bit-by-bit image of Your memory “dump” will be consistent and predictable.

True.

Unless You do something like this:

yesterday:

struct{
 char first_name[32];
 char surename[32];
 int age;
}

today:

struct{
 char first_name[32];
 char surename[32];
 boolean gender;
 int age;
}

Yes, yes, I know, we do live in the era of “non-binary” persons and I was super-duper rude to use a boolean for gender. Glad You noticed that. This is a bug which will need to be fixed later and clearly shows that data structure versioning is a must.

What has happened?

We added a field. And we have done it in a very, very bad way, by stuffing it in the middle of the data structure. And kaboom! Now the bit-by-bit images from “today” and “yesterday” are totally incompatible with each other.

Direct versioning

The first obvious solution is to arm Your memory “dump” with a “header”:

    HEADER: int serial_version;
    CONTENT:
    	struct{
    	.....
    	};

Writing such data is super easy:

    	write(...,serial_version);
    	write(...,&data_in_memory,sizeof(struct...))

Reading is a completely different story.

You have to do something like this:

    	int serial_version = read(...)
    	switch(serial_version)
    	{
    		case 0:
    			....
    		case 1:
    			....
    	}

and for each case You need to provide a transformation from “that version” into the “current version”.
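
For example, the case 0 branch could look more or less like this (just a sketch; struct_v0 and guess_gender are made-up names):

    	case 0:
    	{
    		/* "yesterday" stream: no gender field yet */
    		struct_v0 tmp;
    		read(..., &tmp, sizeof(tmp));
    		current.first_name = tmp.first_name;  /* conceptually; char arrays need memcpy */
    		current.surename   = tmp.surename;
    		current.age        = tmp.age;
    		current.gender     = guess_gender(tmp.first_name); /* hypothetical guess */
    		break;
    	}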

This is a bit messy for complex structures and You need to continuously rename Your “yesterday” structures in Your source to avoid name clashes. You can’t do:

yesterday:

typedef struct{
 char first_name[32];
 char surename[32];
 int age;
} version_1;

today:

typedef struct{
 char first_name[32];
 char surename[32];
 boolean gender;
 int age;
} version_2;

future:

typedef struct{
 ...
} version_3;

because with each bump-up of the version You would have to update the entire code which references Your structure.

Instead You would rather do:

yesterday:

typedef struct{
   ...
} data;

today:

/* old definition */
typedef struct{
   ...
} v1;
/* active definition */
typedef struct{
   ...
} data;

tomorrow:

/* old definitions */
typedef struct{
   ...
} v1;
typedef struct{
   ...
} v2;
/* active definition */
typedef struct{
   ...
} data;

That is: keep the most recent version always under the same name. This is a good idea, because if there was no gender field yesterday then none of the old code made any use of it. If You add it today, there is a huge chance that 90% of the code still won’t need to use it. Keeping the name unchanged saves You a lot of work.

The obvious downside is that You have to rename the “old” structures but still keep them in Your code base. With plain structs it is easy, but with objects, with their entire inheritance tree, it is a hell of a lot of mess. Doable, but messy.

We need different approach.

But before saying anything about it let’s look at another problem.

Upwards compatibility? Downwards compatibility?

Now return to:

    	int serial_version = read(...)
    	switch(serial_version)
    	{
    		case 0:
    			....
    		case 1:
    			....
    		default: what to do?
    	}

It does not need a lot of thinking to notice that with direct versioning You can have only downwards compatibility.

“New” code can load “old” data, but “old” code cannot load “new” data.

What is the point of loading “new” data with “old” code, You say?

Well… Your yesterday structure did not have the gender field. Then the “old” program, by its nature, will not need it. Why not let it read “today” data? Is there any logical problem with it?

Again, vice versa: loading a structure without gender requires the transforming code to do some guessing. There was no information about gender, but the structure must have it.

Notice, there is one very serious reason to not allow upwards compatibility. That is: money. If You allow an old version of Your software to load files stored by more modern versions, then Your clients may decide to hold all their licenses back and buy just one seat upgrade to try it out. If they find that there is no value in the upgrade, they won’t buy more seats and they won’t lose anything.

If, however, You design Your software in such a way that the old version can’t do anything with files written by the new version, and You prevent the new version from saving files in the old way, then it is a completely different story. Now if Your client upgrades just one seat, then the person working at it will, on a daily basis, corrupt files in such a way that the rest of Your client’s employees won’t be able to use them. And since You prevent saving files the “old way”, after some time Your client will have a choice: either do not upgrade and throw away all the work done on that new seat, or upgrade the whole company and keep the work.

Nice, isn’t it? Welcome to the world of Autodesk Inventor!

Indirect versioning

Indirect versioning uses all the “why” I spoke about above to provide both upwards and downwards compatibility.

Instead of:

    HEADER: int serial_version;
    CONTENT:
    	struct{
    	.....
    	};

it does:

    HEADER: int logic_version;
    CONTENT:
     begin
    	FIELD "Name" =...
    	FIELD "Gender" =...
    	....
     end

and saves not only the content of the structure, but also the information about what fields are in it and where.

Armed with this information You don’t really have to use any transformation. You just read fields from the stream and apply them to the fields of Your current structure in memory.
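
A minimal JAVA sketch of such a reading loop, using the reflections mentioned earlier, could look like this (the FieldStream API is made up just for this illustration):

  import java.lang.reflect.Field;

  /** Made-up minimal stream API, just for this illustration. */
  interface FieldStream
  {
      String nextFieldName();           // null == the "end" of the structure
      Object readValue(Class<?> type);  // reads the value of the current field
      void   skipValue();               // skips a field we don't know
  }

  class IndirectReader
  {
      static void readInto(Object target, FieldStream in) throws Exception
      {
          String name;
          while ((name = in.nextFieldName()) != null)
          {
              try{
                  Field f = target.getClass().getDeclaredField(name);
                  f.setAccessible(true);
                  f.set(target, in.readValue(f.getType()));
              }catch(NoSuchFieldException ex){
                  in.skipValue();   // an "additional field": just ignore it
              }
          }
          // fields missing from the stream keep their defaults;
          // post-processing (the "guessing") comes after this loop
      }
  }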

As You can see, indirect versioning does carry a huge potential: Your program can read both older and newer streams without any effort on Your side.

Great, isn’t it?

Blah, isn’t it good old XML or JSON? Sure it is. I am just curious if You have ever thought about it that way.

logic_version

Notice I have still left some “header”, but instead of serial_version I renamed it to logic_version.

The idea behind it is simple:

In some cases You will have to introduce a “breaking change” in Your data structure. Such a change goes past adding and removing fields; it changes the logic so much that mapping a field from the stream to a field in memory won’t work anymore. To indicate it You just “bump up” the logic_version.

Of course with that we move back from “indirect” to “direct” versioning.

Note: Java serialization does have the serialVersionUID field just for that. It lacks, however, any possibility to deal with such a change, except complaining that nothing can be done.

Missing fields

Of course with the “indirect” approach You will have to deal with the case where You expect some fields (like the said gender) but they are not there.

To deal with it You need two operations:

  • to be able to initialize missing fields with reasonable values;
  • to be able to post-process the entire structure, once it is loaded
    and initialized with defaults, and do some guessing and cleanup;

This has to be done in two separate stages (or at least the last stage is necessary) because a “reasonable” initialization is not always possible without knowing the values of the other fields.

For example, You may attempt to guess the gender by looking into a dictionary of names, but to do that You need to have the “name” field loaded first.
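
In JAVA those two stages could look, for example, like this (NamesDictionary is a made-up helper):

  /** Made-up helper, just for this illustration. */
  class NamesDictionary
  {
      static Boolean guessGender(String name){ return Boolean.TRUE; /* a stub */ }
  }

  class Person
  {
      String  name   = "";     // stage 1: a reasonable default at the field level
      Boolean gender = null;   // null means "was not present in the stream"

      /** Stage 2: called once the entire structure is loaded. */
      void afterDeserialization()
      {
          if (gender == null)
          {
              // a dictionary look-up; it needs the "name" field loaded first
              gender = NamesDictionary.guessGender(name);
          }
      }
  }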

Additional fields

And vice versa.

A stream may contain more fields than You need. If You did not introduce a “logic change”, then this usually means that either Your program no longer needs some information, or a newer version added some information Your program does not understand.

In both cases ignoring it will be fine.

Pointers?

Hurray! We solved it! If we have already given names to the fields in the stream, why not add a marker to say which of them are “pointers”?
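
Sticking to the notation used above, the stream could, for example, say (the PTR_FIELD marker is made up just to show the idea):

    HEADER: int logic_version;
    CONTENT:
     begin
    	FIELD "Name" = ...
    	PTR_FIELD "Owner" = object #3   ; a marker: "this one is a pointer"
    	....
     end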

Summary

After reading this chapter You should be aware that version handling is not something that can be left for later, because it has a significant impact on code maintenance costs and on Your licensing policy.

You should also notice that if we switch from the “memory dump” to the “named fields” concept, then we can not only solve the upwards and downwards compatibility issues, but we also gain a nice mechanism which we can use to identify “pointers”.

Plus, obviously, You should notice why I was so eager to have a structured abstract file format and why people are so much in love with zipped XML file formats nowadays.

All right, so we know about the problem with pointers and something about how best to deal with versioning. Now it is time to move to some JAVA-related stuff.

Serialization: introduction

In this, that, that and finally in there You could read about abstract file formats.

Note: please notice the reference implementation I proposed there.

Especially in the first post of the series You could read about Java serialization.

In this series of posts, which will explain the background behind my next project, I will try to explain the basics, concepts and pitfalls which one may encounter when trying to build a data serialization engine.

Note: Most of the stuff You will read here will become a part of the said project documentation and will be included, in a more expanded version, in the final project on GitHub.

What is serialization?

“Serialization” is a method of transforming complex, object-based data structures, existing alive in memory, into “dead” files or data streams which can be moved from machine to machine or saved for later use.

In other words – a way to save objects on disk or pass them through the net.

What is de-serialization?

The exactly reverse process: having some “dead” data on disk or received from a network, we “de-serialize” them by creating living objects in memory matching the previously stored content.

“Memory dump” serialization

The most idiotic, but often good, form of serialization is a “memory dump”. Just take a native reference to the memory block containing the data structure and dump it on disk. Like in the below pseudo-C code:

         struct{ int a,b,c }x;
         writeFile(..., &x, sizeof(x))
    

This type of serialization has some serious flaws:

  • it saves data including all “padding” bytes injected by a compiler;
  • different compilers or even different compilations may result in a different
    structure layout;
  • pointers? Can pointers be stored at all? What do You think about it?
  • absolutely zero robustness against version changes;

Even though this is an idiotic method, it was used to manually implement a “swap file”
in applications which needed far more memory than could be provided by
an operating system.

Why? Because it is extremely fast.

Pointers or references

There are usually no conceptual problems with saving elementary data like bits,
bytes, numbers or texts into any kind of “file format”.

Pointers and references are something else.

Pointer concept

A “pointer” or “reference” is, technically speaking, an integer number which is interpreted by the CPU as an address in memory from which it should read something, execute something or write something.

There are very, very few cases in modern operating systems on modern machines when the “address” of a certain piece of data in memory (i.e. the variable which holds this text in Your web browser) will be preserved from one run of a program to another. In almost every case it will be different, even though in Your program it is named the same, is bit-to-bit the same, etc.

If You saved such an “address” on disk, closed the program, then started it again and loaded that address from the stored file, You would have a 99.999% chance that the loaded address won’t point to where it should.

When can a pointer be serialized?

A pointer must always be serialized in a “smart way”.

First, the serialization mechanism must know that it is serializing a “pointer”. This means that it can’t just dump a block of memory on disk and then load it later. It must know where in this block the pointers are.

Second, the saved pointer must point only to a part of memory which is also being serialized. Only then may You somehow change the address X into “this is an offset X1 in the N-th serialized block of memory”.

A pointer pointing to something which is not serialized can’t be serialized.

How can a pointer be de-serialized?

A serialized pointer is basically a way of saying to which part of the serialized data it points.

The easiest way of imagining it will be:

                       This is a memory
                    block to be serialized

         *******************b*******************c*************
         ↑                  ↑
         Aptr               ↑
                           bptr
    

The Aptr is the address of the memory block as the CPU sees it. For example 0x048F_F400.

The b is some variable in that block. The address of this variable, as the CPU sees it, is bptr.

And the c is a variable inside that serialized block in which we would like to save the bptr.

Let us say bptr=0x048F_F4A0.

If we just dumped the block on disk, the c would contain 0x048F_F4A0.

Then imagine we are loading that block from disk to memory five days later.

Will it work?

Yes. Providing we load it into exactly the same memory location. Our new Aptr must be 0x048F_F400.

If however You have done something like:

         Aptr = malloc some data // Aptr ==  0x048F_F400
         writefile(Aptr...)
         ....
         kill program, wait five days
         ....
         Aptr = malloc some data  // Aptr == 0x0500_0000
         readFile(Aptr...)
    

Then c=0x048F_F4A0 won’t point to anything.

But if You would have done:

         Aptr = malloc some data // Aptr ==  0x048F_F400
         Aptr->c = Aptr->c - Aptr //    c ==  0x0000 00A0
         writefile(Aptr...) ; ← including c in this written block
         ....
         kill program, wait five days
         ....
         Aptr = malloc some data   // Aptr == 0x0500_0000
         readFile(Aptr...)         //   c  == 0x0000 00A0
         Aptr->c = Aptr->c + Aptr; //   c  == 0x0500 00A0
    

then c is correctly deserialized.

The key concept You should remember and understand is:

To be able to serialize a pointer You must know when You are serializing a pointer.

Summary

After reading this blog post You should understand what the basic idea behind serialization is, and that raw “memory dump” serialization is not the best concept for any long-term data storage. You should also be aware that the trickiest part of it is pointers. And You should notice that the most important thing one should deal with during serialization is to figure out some way of saying: “hey, this is a pointer I am serializing now!”.

Now You are ready to take the next step.

RtOS:Heart-beat timers

Thump thump… Thump thump… Thump thump…

Do You hear the heart beat? Yes You do. Your body does have a heart-beat and most micro-controllers will also have one.

In this post I will try to show You how a heart-beat relates to RtOS timings.

Hardware timer

Let us first inspect what kind of hardware is built into most MCUs.

The image above presents a modular schematic of a typical hardware timer. As You can see it is just a plain counter which is clocked from some clock source, usually a system clock in the MHz range. The CPU can at least write to the counter, but in 99% of cases it can both read and write it.

The timer hardware is usually provided in three versions:

  1. A continuously rolling timer, where interrupt is generated when it rolls over (ie. from 255 to 0 for 8 bit timer).
  2. A periodic timer, where interrupt is generated when timer counts to value stored in “compare register” and then it is reset to zero.
  3. A “compare mode” timer, where timer just counts up rolling over, but an interrupt is generated when it counts to value stored in “compare register”.

The first option is the easiest and the earliest, appearing in the PIC12 and early PIC16 families of micro-controllers. The second version was the most popular, but the third version is the most effective, both from the financial and the energy consumption point of view. Especially since the third version may have more than one “compare register”. For example, some MSP430 devices have up to 7 compare registers per single timer.

In this post I will focus on version 1 and version 2, since they are exactly what we need for the “heart-beat” of our RtOS: both can do the “fixed period timer”. The “periodic timer” version, which reloads from the “compare register”, does it in hardware right out of the box, while the roll-over version requires You to do, in the timer interrupt:

  timer_counter -= period

And that is all You need to turn a continuously running timer into a periodic one. Providing that the interrupt latency (time shift, jitter, skidding) is less than the period.

Heart beat

A “heart beat” interrupt is a very popular concept in almost any operating system. You just take a single hardware timer, set it to some sane period and start it. Then You use this interrupt to do everything: measure time, switch tasks, monitor the keyboard… absolutely everything.

What I will focus on today is how to use it to implement the:

   waitFor(time)

function of our RtOS.

As You may remember from this blog entry, when our task would like to wait for something, it is actually telling the operating system:

   i like to wait for event X to be set
   call yield ; I can sleep, other tasks can do their job

Now imagine that we associate with each task, for example inside the task table:

  typedef struct{
    saved_SP
    event_flags ; just some bits, ie. 8 bits
    event_mask
    timer_counter
  }Ttask_state

a timer_counter variable. I decided to use a single fixed, dedicated timer per task, since this is the absolute minimum and is sufficient in most cases. Plus, it is very simple to implement.

Of course You may select absolutely any strategy You like, starting from per-purpose timers and going up to dynamically allocated timers.

We also define that one of the event_flags bits is reserved for the task timer.

Inside the “heart-beat” interrupt You can then do something like:

 Ttask_state [NUMBER_OF_TASKS] TaskTable;
....
 for i = 0  to NUMBER_OF_TASKS -1 
 {
    TaskTable[i].timer_counter++
    if carry over
    {
        TaskTable[i].event_flags |= TASK_TIMER_FLAG
        if (TaskTable[i].event_mask & TASK_TIMER_FLAG)
        {
            notify RtOS that event flags have changed 
            and a task possibly should be awaken
            see  this blog entry
        }
    }
 } 

This simple routine will increment the timers of all tasks and, for those which rolled over, it will set an event flag. If the event mask is set in such a state that it indicates that a task is awaiting this flag, then it will notify the scheduler of the RtOS that it should re-consider waking tasks up.

Notifying scheduler

If Your RtOS is not doing any energy saving and is not putting the CPU to sleep, then this is a “do nothing” operation. The scheduler loops, as I described there, so when the interrupt returns it will pick up the event flag.

If Your RtOS is putting the CPU to sleep, the absolute minimum is to keep the CPU awake after returning from the interrupt; but usually, due to some race conditions with the power-saving management, something more will be needed. Again, refer to the same blog entry.

Using timer from task

The task which would like to wait for some time from “now” does:

      TaskTable[SELF].event_mask |= TASK_TIMER_FLAG  ;indicate You like to be awoken by timer 
(*1)  TaskTable[SELF].timer_counter = 0 - delay wanted ;tell when since now  
(*2)  TaskTable[SELF].event_flags &= ~TASK_TIMER_FLAG ;clear faulty flag
      call yield ;tell RtOS to take CPU from You

A careful reader will notice that there is a possible race condition between (*1) and (*2). If the heart-beat interrupt happens between those instructions and the delay wanted is 1, then that interrupt will increment the timer counter, overflow it and set the event_flag. Then, after returning from the interrupt, (*2) will clear that flag. The effective delay in such a case will NOT be +1 “heart-beat” period, but instead 2^M periods (where M is the bit width of timer_counter).

Note: There are other possible races, for example if You use timer_counter -= period to implement a periodic task, exactly the way we did with the hardware timer, but I will not be speaking about them now.

You may attempt to disable the “heart-beat” interrupt during timer setup, but I recommend You do not fight this race condition. Why? Check below.

Timer granularity

Using a “heart-beat” to measure time has one important drawback.

If You poll the timer_counter in a loop You will see that it reports:

 N,N,N... x thousands, N+1, N+1, N+1...

This means that if You use timer_counter as a watch, You never know if You are at (*1) or (*2):

 N(*1),N,N... x thousands N, N(*2), N+1(*3), N+1, N+1...

that is, at the beginning of a “heart-beat” period or at its end. Then observe that when You do:

   TaskTable[SELF].timer_counter = -1

You actually tell the scheduler to wait until timer_counter counts up one time and changes to zero.

If You do the:

TaskTable[SELF].timer_counter = -1

at the beginning of the period (*1), then the real time which elapses till it changes to zero will be very close to the “heart-beat” interrupt period.

But if You do the same at the very edge of the end of a period (*2), then almost immediately (*3) it will be incremented, change to zero, and Your task will be awoken.

In a “heart-beat” system the requested delay of T “heart-beat” cycles will be in reality something from the T-1...T range.

Which, obviously, means that using T=1 means: “possibly no delay whatsoever”.

Is that a problem? Well… if You are just waiting some time, then no, not a problem. But if You are asking for something to be done and awaiting the action to complete with a timeout, then yes, it is a problem. Zero just gives no time and will immediately produce an error: “action failed to complete within the given time”.

Restrictions on heart-beat based waits

What to do then?

Select Your “heart-beat” clock to be fast enough that the minimum requested delay will be 2 ticks. This will solve plenty of problems, including the race condition I mentioned above; see the sketch below.
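
In the same pseudocode as above, the whole thing could be wrapped like this (just a sketch):

   void waitFor(ticks)
   {
      if (ticks < 2) ticks = 2                         ;granularity-safe minimum
      TaskTable[SELF].event_mask   |= TASK_TIMER_FLAG  ;awake me on timer
      TaskTable[SELF].timer_counter = 0 - ticks        ;when, counting from now
      TaskTable[SELF].event_flags  &= ~TASK_TIMER_FLAG ;clear stale flag
      call yield                                       ;give the CPU away
   }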

But the clock must be ticking so fast!

Sure. Yes. This is the price for the simplicity of the “heart-beat” approach. If You need to wait for 1 ms, the timer must run no slower than 2 kHz. If You want to wait for a second with +/-1% accuracy, then 100 Hz is the absolute minimum.

Again, simplicity has a price.

If it hurts You, do not worry. You can always design a “tick-less” RtOS timer. And it is not that much harder to do.

Summary

After reading this blog entry You should know how to use a single hardware timer to provide timing services for all tasks in Your RtOS. You should also be aware of the elementary race conditions and of the problems with timer granularity.

How not to: Git & fatal: detected dubious ownership

Today I was hit by this message:

“fatal: detected dubious ownership”

The reason for that was that I was logged in as one user but had cloned the repo as a different user. The new GIT thinks this is not ok, and did the “fatal” at each script I tried.

Of course, as is always done in GIT, it pointed me to the solution: set the configuration variable

safe.directory

to either * if I don’t like this functionality, or to the specific folder in which I do allow more than one user to work.

Fine. Shocking and disturbing but fine.

It had shown me how to solve a problem, right?

Except it is all wrong!

Security issue solved?

The primary idea behind this functionality is to defend against the following mode of attack:

Let’s say a user “hombre” has the following folder structure:

 /--+
    |
    + home 
        +
        +- hombre
           |
           + my_projects +
                         |
                         + project_x
                                +-- .git
                                +--- notes
                                +--- libs
                                        +--- mylibrary_A                                     

The user has the project_x repository and the .git folder inside it. The .git may be quite a tricky beast and contain, for example, a filters configuration which may allow some code to be run at almost every git command.

The git command, when run, looks in the current folder and upwards for .git and loads the configuration from there.

So if one can inject a fake, malicious .git in there:

 /--+
    |
    + home 
        +
        +- hombre
           |
           + my_projects +
                         |
                         + project_x
                                +-- .git 
                                +--- notes
                                +--- libs
                                        +--- .git
                                        +--- mylibrary_A                                     

then if user “hombre” types:

  cd ~/my_projects/project_x/libs/mylibrary_A
  git status

then GIT will happily look into that injected .git and do what it is told there.

Alternatively one may just do:

 /--+
    |
    + home 
        +--- .git
        +
        +- hombre
           |
           + my_projects +
                         |
                         + project_x
                                +-- git     (notice, I just removed the dot from the name)
                                +--- notes
                                +--- libs
                                        +--- mylibrary_A                                     

and the effect will be very much alike.

This is a serious issue, they say, which allows making “hombre” run code decided by somebody else…

Except it is a bullshit

First of all, how the hell may an attacker inject that .git?!

One must have write access to “hombre”’s folders. And if a user got such write access then, dear me, it is either a legit user who is in an appropriate group, or Your system is so much compromised that anyone can do anything. In the first case it is a fully legitimate use; in the second, You are boned, dead and stinking.

Just manage access rights correctly, silly puss!

You can disable it if You don’t like it…

The effect is such that most users will just do:

 Windows version
 git config --global safe.directory *
  or
  git config --system safe.directory *
 Bash version
 git config --global safe.directory "*"

and voilà.

If someone was soooooo lazy as to not set up security and allowed an untrusted user to manipulate trusted users’ data, then that one may keep this option on. Nobody else will ever need it.

No problem, devs do say, You can always disable it.

Yes sure.

Money, money, money….

Figuring out what was going on was not very easy. Especially since GIT displayed cryptic user information in the form of a Windows UID instead of a user name. I needed about half an hour to make sure that there was nothing really wrong with my system, and that I did in fact create the repositories using another account than the one I am working at now. This is a company-owned machine and we switched from one account system to another about a year ago.

I spent 30 minutes on this, which gives 6 Euros I did not earn and close to 10 Euros of employer costs.

And here comes the sad thing. I, myself, needed half an hour. In my company we have about 40 persons who use GIT, and all of them will have similar problems, because all were subject to the account switching last year.

In fact, in 99% of company-owned systems the owner of some shared resources is not the “hombre” user, but a “group owner”. This is how You manage access in a cost-efficient way. And the new, more secure GIT will complain.

I needed 30 minutes. Some of those 40 persons will also need 30 minutes, some of them will catch up faster because they will ask me, and some will have to call the IT department, because we use GIT not only for code but also for document version tracking, and those guys really just click some batch scripts to handle guided commits. The IT action won’t close in less than one work hour, counting IT personnel, intermediates and the user, plus some significant delay and stall, sometimes a whole working day.

The total cost will be around: 20*5E + 15*10E + 5 * 20E = 350 Euros.

Now let us do some more math.

How many users of GIT are there all over the world? This is very popular software. The count of public repos is around a million, so we can be on the safe side and say that we have about 1 million GIT users. Let’s say half of them are corporate and half of that half will have a problem.

We have 250’000 of problems.

Each problem will be solved in, on average, about 30 minutes. I think I am average, so this is a good guess, I suppose.

The cost of this “security feature” is then:

2'500'000 Euro

This is the raw cost. If, however, we talk about corporate users, we should also account for the “lost profit” cost. While I was looking up the solution I wasn’t doing my job. Due to that, my work did not produce the expected profit.

How much is that “lost profit”?

Well… my company is not making financial losses, which means it must earn at least a bit more than those 10 Euros for each half hour of my work. Considering taxes etc., I think 25% profit is an absolute minimum.

So the total amount of money lost all over the world was:

3'125'000 Euro

Three million.

Now a question to ask: how much money was saved by this feature? And where, and by whom? Which system was actually efficiently protected by it? Did this system cost more than 3 million Euros?

Summary

The devs did agree that it was a “disruptive change”. A “disruptive change” is a change which forces an action from the user just because the user updated GIT.

Please, please, please, always carefully consider the total global cost when introducing a “disruptive change”. It has already cost my company 350 Euros. I personally lost half an hour. I will have to lose much, much more, because I will have to:

  • update company guides about it;
  • update training documents;
  • review all scripts on git server if they are used in a way affected by this “security” feature;
  • train my subordinates about it.

I think I will need to invest at least 40 work hours more just to be sure that this function is disabled at all workstations.

Yes, disabled.

Because we do implement proper IT security.

You versus government: The asymmetry of trust

Just a few hours ago I had the pleasure of seeing on Youtube a certain interesting video by a certain guy from the U.S. This guy was a so-called “certified concealed weapon carrier”, or however You, dear U.S. citizens, call it, and in his video he was explaining how an honest person legally carrying a firearm should behave during a routine traffic stop.

It was a reasonable explanation.

You should be polite, inform the policeman immediately that You have a gun, and keep Your hands in visible places. The policeman is routinely under pressure and stress and if You fail to calm him, he may initiate routine actions which won’t be pleasant for You. Be polite, behave in a way absolutely not suggesting any danger from You, and nothing bad will happen.

All right…

In the same video the same person clearly pointed out that the policeman, during the whole intervention, kept his hand right over his gun, ready to pull it out immediately and point it at the driver. Do not make his work harder than it should be, said the guy on the video; for a policeman any routine action carries high risk and he must not trust You, since You may be a criminal.

Back to the past: The year ~2000 in Poland

Now imagine You are moving back in time to my home country. Imagine You are a truck driver moving across Poland towards its eastern border. The weather is nice and sunny, You drive Your old truck full of hell knows what cargo, listening to some of Your favorite music and thinking that just ten more hours and You will be at home in Kiev or wherever You live.

Suddenly You see a police car flashing its lights in the rear mirror, so You pull over and stop. You see two policemen in uniforms moving towards Your car…

…and they pull You out of the cabin, smack You on the head, and while You are crawling on the ground one of them jumps into the cabin of the truck and drives away, while the second one kicks and clubs You until You black out.

This is not an accurate description, but such cases were quite common in those days in Poland. Trucks were not equipped with tracking systems and were not as expensive as nowadays, especially those from the former Soviet Union. The cargo was expensive and worth getting Your hands on.

Back to the future

So what would I, being a Pole moved forward in time, see if I were in the place of that Youtube U.S. citizen?

Some flashing lights, a car looking like a police car, and an armed guy in a suit looking quite like a Police uniform getting out of it.

Next I would see him reaching for his gun, opening the cover protecting it and moving towards me. He is clearly ready to use the firearm against me.

Trust?

I intentionally underlined the word “like“. Looking like a Police car. Looking like a Police uniform.

Only those “lookalikes” are proof that I will, in a moment, be confronted with a true policeman.

He is also reaching for something that looks like a gun… but honestly, how reasonable would it be to expect that either a true policeman or a fake policeman would use a fake gun? Not very. Both would rather have a real gun, especially in the U.S., where access to firearms is far more open than in Poland.

So I have a true gun carried towards me by someone who looks like a policeman.

Identification

The Polish Police use about twelve variants of uniforms, varying depending on rank, function, time of year and… date of production. Each time the Police checked me over the last 20 years, the policeman was wearing a very different uniform.

Likewise, the Polish Police used at least three variants of markings on their cars. I write “used”, because now they use a more consistent painting, but the number of car models in use still varies a lot. Gladly, apart from the Police we have at least four other agencies which can stop You and run a check, all with a similar number of uniform variants and vehicle paintings… Did I really say “gladly”?

The uniform and the paint on the car are the only ways of identifying the Police by eye.

To get any other identification You would have to check his or her badge number (in Poland invisible), make a video call to 911 and ask them to provide You with a mug-shot matching that badge number. You can’t do that safely under the assumption that the one who is closing in on You is a fake. If he is fake, You will get a piece of hot lead before You finish dialing the number.

Fake policeman

How hard is it to get something that from a short distance looks like a proper uniform? It does not have to be 100% identical. It must just conform to the publicly available specifications.

I would be able to sew something like that by myself within a few days, and I am a poor tailor.

How hard is it to put proper stickers on Your own car?

Maybe a day of work, maybe two if You are not experienced in that area.

Considering the value of the vehicles and cargo You can rob with them, it is a worthy investment.

Logic

Logically speaking, the only 100% safe way to proceed is to… escape. Do not stop until You confirm, by calling the emergency line, that the car behind You:

  • has the matching plate number of a true police car, and;
  • a police car with that number is patrolling that area;

That would be logical.

Except that it is unsafe. No court of justice would agree with Your way of thinking, especially since during the entire procedure You would have to actively avoid capture… which is a crime.

Insane, isn’t it?

You escape from someone who could be an armed criminal and thus You commit a crime?!

Request for trust

To turn this insanity into something sane, healthy and logical, one must add one hell of a strong requirement:

“Any citizen must trust the State that every policeman is a true policeman”

Then You no longer see a guy looking like a policeman preparing to fire his gun at You, but a representative of Justice who is serving and protecting You from evil. What is there to be afraid of? He won’t shoot me, right? I am not a bad guy, but I do understand he may not know it, so I should somehow let him know I’m good.

Nothing can go wrong.

Lack of trust

This was what the guy on video said. This was his way of thinking.

Yet on the same video the policeman was clearly, all the time, ready to use his firearm.

Why?

Why was the policeman holding his hand right over his gun all the time while talking with a good guy in a car? Why might keeping Your hands on Your knees, or an attempt to get out of the car, trigger a risky action on his side?

Because anyone can be a criminal. The policeman must not trust that You are a good guy. Blah, the sole fact that he decided to check You means that he suspects some problem with You.

So one more line must be added to our code of conduct:

“Any citizen is a potentially dangerous criminal on the run”

Asymmetry

The entire Police system is designed around two base concepts:

  1. The citizen must trust in every policeman.
  2. The policeman must not trust that every citizen is a good person.

Without the first, nothing would work, because there is absolutely no way to make a positive identification of a policeman at first glance.

Without the second, the mortality rate amongst members of the Police in countries where violent crimes are an everyday business would be skyrocketing.

By the way, from how the policeman behaved on the video I may assume that the U.S. is such a country. Poland is not, so our Police do not touch their weapons during routine operations. Yet they do have them.

For me this asymmetry is, at the least, unfair.

The message

What subliminal message is this asymmetry telling us?

First, that there is no partnership between us and the government justice system. We have to trust, but we are not trusted. If there is no equality, who are we? Citizens? Or subjects?

The second message is more sinister: the government is weak. Very weak, and with zero authority. Only a weak person, with no respect whatsoever, must touch a weapon to feel safe. If the government is strong, only a real desperado would actively oppose the Police, and against such desperation a hand-gun offers little or no protection at all.

Good old British “Bobbie”

The message passed by the first British Police was much clearer: “We wear clubs to smack You and whistles to call so many of us that You stand no chance, even if You have a sabre, a grenade or a crossbow.”

And we do use a club because You can have the damn walking stick too.

In my opinion it was much closer to a partnership, and a much stronger message, than today’s guys with body armor and firearms.

Fear robs of reason

Gosh… Body armor and firearms…

Are You really so much afraid of me?!

You must be, right?

And if You are so much afraid, then You must live in continuous stress. Under a continuous, solid pressure.

I know from experience that stress and fear are not friends with good reasoning. Everyone knows that.

So, what do You think I am thinking when I see You in body armor and carrying guns?

That You have gone far, far away from the land of cold reasoning to the land of “everyone is trying to kill me”. And that absolutely anything may trigger a deadly, violent reaction from Your side.

Or, if I am a blind idiot who trusts everything Uncle Sam says, that there is a violent crime in progress or one will start in just a moment.

Hey, don’t come any closer!

Stop!

I am afraid of You too!

Go away! Go or…

Fear is symmetric

This asymmetry of trust produces fear. And Your fear is clearly visible and communicated. I will be afraid too. Either of You, since a scared armed person is like a grenade without a pin, or of the unknown enemy You are afraid of…

Neither of us has any reason to be calm and cool, and either of us can explode at any moment.

Summary

This is not the first time I have observed that “freedom”, “equality” and “trust” are becoming more and more empty words.

Just think about what I just wrote.

And, to be clear, the good old “Bobbie” with a club and a whistle could be a man I trust. Even if he mistakes me for a felon, even if he has a very bad day, all he can do is whack me a bit with some not-so-hard stuff. I can survive it. I can take that risk. Sorry man, my wrong, still good friends, right?

Sure no problem, Bob.

But not with a 9mm piece of lead in my belly. Sorry Bob, this is a problem.

Abstract file format: what to use for a “signal”?

In that post I wrote:
(…)
The good abstract API needs to be able to indicate data boundaries and move from boundary to boundary with:

 void writeSignal(signal type)
   signal type readNextSignal()

(…)

I let myself use the enigmatic signal type.

Now it is time to dig into it.

What is “signal” used for?

Again, just to remind You: for telling what a certain bunch of data is used for. To give it a name.

How many different “signals” do we need?

At least two… or to be specific – two kinds of signals.

If a signal must give a name to a certain block of data, then it must somehow indicate when this block starts and when it ends.

In fact it must be very alike good old C typedef:

typedef struct{
   int a;
   char c;
}Tmy_named_struct

Since, to be able to efficiently process a stream of data, we need to know what the struct means before we start reading it, the structure in a data stream should rather look like the even older Pascal:

 begin Tmy_named_struct
   int: a;
   char: c;
 end

The begin and the name come first, the end comes last.

The “end” and the “begin”

This means we need an API which will be closer to:

 void writeBegin(signal name)
 void writeEnd()
 ...
 signal type readNextSignal()

Now You can see I used signal name and signal type. We need to define them more closely.

The signal name

Historically speaking, when I started this project, I decided to do:

 void writeBegin(int signal)
 void writeEnd()
 ...
 /*....
  @return positive for a "begin" signal, -1 for an "end" signal. */
 int readNextSignal()

It was a very bad idea.

After the first few uses I started to get pains managing what number means what and how to prevent numbers from different libraries from clashing. This was the same problem, although in mini-scale, as the global function name clash in C.

So I thought to myself: How did they solve it in C?

With a name-space. You can assign a function name to a name-space and then use a fully qualified name to avoid a name clash. And if names still clash, You can put a name-space into a name-space and form something like:

  space_a::space_b::function

Please excuse my wrong syntax. I have not used C/C++ for quite a long time and I don’t remember exactly how it looks nowadays.

So I could use a sequence of int numbers….

Dumb I was, wasn’t I?

A hundred times easier and more efficient is to do:

 
 void setNameLengthLimit(int length);
 void writeBegin(String name)throws ENameTooLong;
 void writeEnd();
 ...
 /*....
  @return either a name for "begin" signal or null for "end" signal
*/
 String readNextSignal()throws ENameTooLong;

We use String for names. Strings are variable in length, easy for humans to understand (this is important if the back-end is something like JSON, which ought to be human-readable) and Java has very efficient support for them.

Note: I let myself introduce setNameLengthLimit(...) and throws ENameTooLong. Remember what I said about the OutOfMemoryException attack? The ENameTooLong is there to let You keep the flexibility yet put safety brakes on Your stream.
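
Just to illustrate the intended use, a hypothetical writer implementing that API could be driven like this (the ISignalWriter name is made up; the method signatures come from the API above):

  ISignalWriter out = ...;
  out.setNameLengthLimit(1024); // a safety brake against hostile streams
  out.writeBegin("Person");     // a named block starts
  //   ... elementary data of the block goes here ...
  out.writeBegin("Address");    // blocks may nest, like structs
  out.writeEnd();               // closes "Address"
  out.writeEnd();               // closes "Person"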

But Strings are slooooow

Sure.

With the int as a signal name and a careful selection of constants like below,
the code which looks like this:

  int name=...
  switch(name)
  {
    case 0: ... break;
    case 1: ... break;
  }

can be compiled on almost any architecture to a “computed goto”:

  mov name → reg0
  cmp reg0 with 1
  jmp_if_greater _after_switch
  shl reg0, times           ; according to DATA width
  mov jump_table[reg0], PC  ;read target from table and jump there
  jump_table:
     DATA case0
     DATA case1
    ...
case0:
   ...
   jmp _after_switch
case1:
   ...
   jmp _after_switch     

where the entire comparison takes just about 6 machine instructions, regardless of how huge the switch-case block is.

Prior to JDK8, using Strings was a pain in the behind, because the only syntax You could use was:

  if ("name0".equals(name))
  {
  }else
  if ("name1".equals(name))
  ....

which in the worst-case scenario ended up in a large number of .equals calls.

At a certain moment, and I admit I missed it, the JAVA specs enforced the exact method of computing String.hashCode(). Prior to JDK8 it had no special meaning and each JVM/JRE/JDK could in fact provide its own implementation of String.hashCode(). There was simply no reason to enforce the use of the standard method.

Since JDK8 such a use has appeared.

JAVA now absolutely requires that:

  • regardless of JDK/JRE/JVM, String.hashCode() is always computed using the same algorithm. This way the compiler may compute hash codes for known constant Strings at compile time and be sure that in any environment:
       StringBuilder sb = new StringBuilder();
       sb.append('a');
       sb.append('b');
       sb.append('c');
    
       assert(  1237 == sb.toString().hashCode());
       assert(  1237 == "abc".hashCode());
    

    assertions will not fail.
    Please notice, 1237 is not the correct hash code for that example. I just faked it to show that a compile-time constant can be used.

  • second, they directly required that String has something like:
        class String
        {
          boolean hash_valid;
          int hash_cache;
    
          public int hashCode()
          {
              if (hash_valid) return hash_cache;
              hash_cache = computeCache();
              hash_valid = true;
              return hash_cache;
          };
        }

    Notice this is very simplified code which may fail on some machines in a multi-threaded, multi-core environment due to memory bus re-ordering. Some precautions have to be taken to ensure that no other thread sees hash_valid==true before it can see hash_cache set to the computed value. Since String is implemented natively I won’t try to dig into it. It is just worth mentioning that volatile would do the job, but it would be unnecessarily expensive. I suppose the native code could have found a better solution.

    Notice, the race condition on setting up hash_cache is not a problem. Every call to computeCache() will always give the same result, as JAVA strings are immutable. At worst we will compute it twice, but nothing will break.

  • and third they did require that:
       class String
       {
         public boolean equals(Object o)
         {
            ... null, this, instanceof and etc checked.
            if (o.hashCode()!=this.hashCode()) return false;
            return compareCharByChar(this,(String)o);
         }
       } 
    

    which avoids calling compareCharByChar() unless there is a high probability that it will return true. And, of course, compareCharByChar() terminates immediately at the first non-matching character.

In much, much simpler words it means that since JDK 7 it is important that String.hashCode() is environment invariant, cached, and used for quick rejection of non-matching strings.

Knowing all that, they let us use:

  String name=...
   switch(name)
   {
     case "name0":...break;
     case "name1":...break;
   };

which is not implemented as:

  String name=...
   if ("name0".equals(name))
   {
   }else if ("name1".equals(name))
   {
   };

but as something a hell of a lot more complex and a hell of a lot faster:

 String name=..
 switch(name.hashCode())
 {
    case 1243:
         if ("name0".equals(name))
         {
         } else if (... other const names with 1243 hash code ...)
         break;
   case 3345:
        ...
  }

For large switches this can be about 10 times faster than the pure if(...)else if(...) solution.

Nevertheless it is still at least 50 times slower than using an int directly, because it can’t use a jump-table and must use a look-up table which has a comparison cost linear with the number of cases. Unless a very large table is used, in which case we can profit from a binary search and reduce the cost to log2(N).

Nevertheless it won’t ever be even close to 6 machine instructions.

Then why String?

Because with int I really could not find a good, easy to maintain method of combining constants from different libraries written by different teams into a non-clashing set. I could figure out how to avoid clashes with static initialization blocks, but then such constants are not true compile-time constants and can’t be used in a switch-case block.

Strings are easy to maintain and clashes are rare due to the usually long, human friendly names.

In other words – use int if speed is everything and the difference between 6 and 600 machine cycles means everything to You, while continuous patching of libraries in a struggle to remove constant clashes is not a problem.

Use String if development costs and code maintenance are Your limit.

And even if You are, like me, a “speed freak”, please remember that we are talking about file or I/O formats. Will 6 versus 600 cycles matter when put beside the time of loading data from a hard-disk or network connection?

I don’t think so.

But Strings are so laaaarge…

Yes they are.

And human readable.

Using non-human-readable numeric names for an XML or JSON back-end makes using XML or JSON pointless. The:

 <1>3</1>

is equally readable to a human as a binary block.

If however size is Your concern, and in the case of a binary back-end it will be, You can always create a “name registry” or a “name cache” and stuff into Your stream something like:

with String names      with name registry
begin “joan”           assign “joan” to 0
                       begin 0
end                    end
begin “joan”           begin 0
end                    end

and get a nice, tight stream with short numeric names and at the same time nice, maintenance friendly String names.

Note: this table is a bit of a simplification. In reality the capacity of the name cache will be limited and the numeric name space will also be restricted. Most probably You will need assign, begin-with-numeric-name and begin-with-string-name commands… but this is another story.
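Just to illustrate the writer side of such a cache, a sketch could look like this, ignoring for brevity the capacity limits mentioned above (writeBeginWithNumericName and writeBeginWithStringName are the hypothetical commands):

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;

  public abstract class ANameCachingWriter
  {
     private final Map<String,Integer> registry = new HashMap<>();

     /* The hypothetical low level commands discussed above. */
     protected abstract void writeBeginWithNumericName(int index) throws IOException;
     protected abstract void writeBeginWithStringName(String name, int assigned_index) throws IOException;

     public void writeBegin(String name) throws IOException
     {
        Integer idx = registry.get(name);
        if (idx != null)
        {
           writeBeginWithNumericName(idx);           // short form: name already in the cache
        } else {
           int assigned = registry.size();           // next free numeric name
           registry.put(name, assigned);
           writeBeginWithStringName(name, assigned); // long form carrying the assignment
        };
     };
  };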

Summary

After reading this blog entry You should now know that the “signal” can in fact be a “begin” accompanied by a “name”, or just a plain nameless “end”. You should also know that there is no reason to fret about using String for the “begin” name, and how to deal with the performance and size issues related to String used in that context.

What next? See there.

Java anti-pattern: non-final getters

In this post I would like to present You the first of a series of Java anti-patterns. And it will be:

public class MyClass
{
        private double speed;

        public void setSpeed(double speed){ this.speed = speed; };
        public double getSpeed(){ return this.speed; };
        public double computeTraveledDistance(double time){ return this.speed * time; };
}

The problematic pieces are the non-final getSpeed() and the direct this.speed access in computeTraveledDistance().

What is wrong with that pattern?

We live in an Object Oriented world, which means inheritance, overriding and virtual methods.
This means that anyone can legally do:

public class MyUniDirectionalMover extends MyClass
{
@Override public double getSpeed(){ return Math.abs(super.getSpeed()); };
};

This is legal code which makes sure that, regardless of what speed is set, the returned value will always be positive and we should always travel in one direction…

Is that true?

No, because:

....
public double computeTraveledDistance(double time){  return this.speed * time; };

is not referring to getSpeed() but to the this.speed field directly. The user changed the behavior, the class asked about speed will never return a negative number, yet the computed traveled distance can still be negative.

How to cure it?

There are two possible ways: either block the user from overriding what has no effect, or prevent Yourself from using the fields directly.

Final getter

The easiest way is to prevent the user from overriding code in the expectation of some effect which can’t take place:

public class MyClass
{
        private double speed;

        public void setSpeed(double speed){ this.speed = speed; };
        public final double getSpeed(){ return this.speed; };
        public double computeTraveledDistance(double time){ return this.speed * time; };
}

Hidden fields

If this is undesired and side effects are expected, one must prevent oneself from touching the fields directly:

class MyClassBase0
{
        private double speed;

        public void setSpeed(double speed){ this.speed = speed; };
        public double getSpeed(){ return this.speed; };
}
...
class MyClass extends MyClassBase0
{
        public double computeTraveledDistance(double time){ return this.getSpeed() * time; };
}

Where can I find such anti-patterns?

For example, look at java.awt.CardLayout, which refers directly to the package-private component.width and component.height fields while java.awt.Component#getWidth() and getHeight() are non-final.

Summary

What to say, what to say…

Just keep in mind that You can always use Your compiler to prevent You from doing stupid things. Declaring a getter final clearly states to anyone that the only way to limit speed to positive values is to override setSpeed:

public class MyUniDirectionalMover extends MyClass
{
@Override public void setSpeed(double speed){ super.setSpeed(Math.abs(speed)); };
};

“Content driven” versus “code driven” file formats.

In a previous blog entry I promised to tell You something about two different approaches to data parsing: “content driven” versus “code driven”.

Or “visitor” versus “iterator”.

Content driven

In this approach a parser is defined as a machine which exposes to an external world an API looking like:

public interface IParser
{
   public void parse(InputStream data, Handler content_handler) throws IOException;
};

where Handler is defined as:

public interface Handler
{
  public void onBegin(String signal_name);
  public void onEnd();
  public void onInteger(int x);
  public void on....
 and etc, and etc...
}

Conceptually the parser is responsible for recognizing which bytes in the stream mean what, and for invoking an appropriate method of the handler at an appropriate moment.

Good examples are org.xml.sax.XMLReader and org.xml.sax.ContentHandler from the standard JDK. Plenty of You have most probably used them.

Note: This is very much like the “visitor” coding pattern in non-I/O related data processing. Just for Your information.

Benefits

At first glance it doesn’t look as if we can profit much from that, right? But the more complicated the file format becomes, the more benefits we have. Imagine a full blown XML with a hell of a lot of DTD schema, xlink-ed sub-files and a plenitude of attributes. Parsing it item by item would be complex, while with a handler we may just react on:

public interface org.xml.sax.ContentHandler...
{
  public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException
  {
   if ("Wheel".equals(qName))
   {
     ....

and easily extract a fragment we are looking for.

This is exactly the reason why XML parsing was defined that way. XML is hellishly tricky to process manually!

Obviously we do not load the file just for fun. We usually like to have it in memory and do something with the contained data, right? We like to have something very much like a Document Object Model, that is a data structure which reflects the file content. The “content handler” approach is ideal for that purpose, because we just build elements in some handler methods and append them to the data structure in memory.

Easy.

And the last, but one of the most important concepts: we can arm a parser with “syntax checking”. Like we arm a SAX parser by supplying an XML which carries inside its body the DTD document definition schema. The parser will do all the checking for us (well, almost all) and we can be safe, right?

Well… not right, but I will explain it later.

Why do I call it “content driven”?

Because it is not You who tells what code is invoked and when. You just tell the parser what can be invoked, but when and in what sequence Your methods are called is decided by the person who prepared the data file.

Who, by the way, may wish to crack Your system.

Content driven vulnerabilities

XML

The content driven approach was a source of plenty of vulnerabilities in XML parsing. One of the best known was forging an XML with a cyclic, recursive DTD schema (a close relative of the infamous “billion laughs” attack). The XML parser loads the DTD schema before it parses anything else from the XML. After that it creates a machine which is responsible for the validation process. If the DTD schema was recursive, the process of building this machine will consume all the memory and the system will barf.

Of course this gate for an attack was opened by some irresponsible idiot who thought that embedding the rules which say how a correct data file looks inside the data file itself is a good idea…

Note: Always supply Your org.xml.sax.XMLReader with an org.xml.sax.EntityResolver which will capture any reference to a DTD and forcefully supply a known good definition from Your internal resources.
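For illustration only, such a resolver might look like the sketch below. The “/trusted.dtd” classpath resource is my made-up stand-in for Your known good schema:

  import javax.xml.parsers.SAXParserFactory;
  import org.xml.sax.EntityResolver;
  import org.xml.sax.InputSource;
  import org.xml.sax.XMLReader;

  public final class SafeXml
  {
     public static XMLReader newSafeReader() throws Exception
     {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(new EntityResolver()
        {
           // Whatever DTD the file claims to use, hand the parser our own trusted copy.
           @Override public InputSource resolveEntity(String publicId, String systemId)
           {
              return new InputSource(SafeXml.class.getResourceAsStream("/trusted.dtd"));
           }
        });
        return reader;
     };
  };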

If You defend Your XML parser with a DTD or a similar schema, and You make sure that nobody can stuff a fake DTD in Your face, then in most cases the “content driven” approach will be fine.

When it won’t be fine?

When Your document syntax allows open, unbound recursion in definitions. Or when the DTD does not put any constraints (which it can’t do) on the length of an attribute. Or in some other pitfalls which I did not fall into, because I don’t use XML on a daily basis.

There is however one other, even more hellish piece of API which can be used to crack Your machine…. and this is…

Java serialization

Yep.

Or precisely speaking: Java de-serialization.

A serialized stream can, in fact, create practically any object it knows exists in the target system, with practically any content in its private fields. Usually creating an object does not call much code, but in Java it does. Sometimes a constructor will be called, sometimes methods responsible for setting up the object after de-serialization (readObject, readResolve and friends) will be. All of them parametrized with fields You might have crafted to Your liking.

Possible attack scenarios range from a simple OutOfMemoryError to the execution of some peculiar methods with very hard to predict side effects.

All in response to:

   Object x = in.readObject()

Basically this is why the modern JDK states that serialization is a low level, insecure mechanism which should be used only to exchange data between known good sources.

Preventing troubles

Since in a “content driven” approach it is the data that drives Your program, You must defend against incorrect data.

You can’t just code it right – instead You need to accept and parse bad data and only then reject them. For example, You need to accept an opening XML tag with huge attributes, and only once Your content handler is called can You say: “recursion too deep” or “invalid attribute“.

Similarly, in Java de-serialization You must either install:

public final void setObjectInputFilter(ObjectInputFilter filter)

(since JDK 9)
or override

protected ObjectStreamClass readClassDescriptor()

in earlier versions, to be able to restrict what kind of object can be created.
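A minimal sketch of such a filter on JDK 9 or later could look like this; the numeric limits and the allowed classes are made up for illustration:

  import java.io.InputStream;
  import java.io.IOException;
  import java.io.ObjectInputFilter;
  import java.io.ObjectInputStream;

  static Object readRestricted(InputStream raw) throws IOException, ClassNotFoundException
  {
     ObjectInputStream in = new ObjectInputStream(raw);
     in.setObjectInputFilter(info ->
     {
        // Reject absurdly deep or large graphs before any object is built.
        if (info.depth() > 10 || info.references() > 1_000 || info.arrayLength() > 10_000)
              return ObjectInputFilter.Status.REJECTED;
        Class<?> c = info.serialClass();
        if (c == null) return ObjectInputFilter.Status.UNDECIDED;
        // Allow only what we actually expect in this stream; everything else is hostile.
        return (c == String.class || Number.class.isAssignableFrom(c))
              ? ObjectInputFilter.Status.ALLOWED
              : ObjectInputFilter.Status.REJECTED;
     });
     return in.readObject();
  }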

Notice, even then some code may still be executed regardless of whether You reject the object or not, because resolving the Class<?> object for a class named in the stream may already trigger class loading and, with it, code like the static initializer.

The “content driven” approach always uses a load & reject security model.

I hope I don’t have to mention how insanely bug prone it is, do I?

Code driven

In this approach we do things exactly the opposite way: we are not asking the parser to parse and react on whatever is there. Instead we know what we expect and we ask the parser to provide it. If it is not there, we fail before we load the incorrect data.

For example, code driven XML parsing would be very much like using:

public interface javax.xml.stream.XMLEventReader
{
  boolean hasNext()
  XMLEvent nextEvent()
....
  String getElementText()
}

As You can see, You may check what the next element in the XML stream is before reading it.

Note: Unfortunately I let myself mark one method of this class (getElementText()) to indicate that it is also not an attack proof concept. A String in XML is unbound, and a crafted rogue XML may carry a huge string inside a body to trigger an OutOfMemoryError when You attempt to call that method.

In a very similar way, Java de-serialization might be tightened a bit by providing an API:

 Object readObject(Class of_class ...)

instead of just an unbound:

 Object readObject()

Sadly, the de-serialization API in general is maddeningly unsafe regardless of the approach. Which doesn’t mean You should not use it. It just means You need to pass it through trusted channels to be sure the other side is not trying to fool You.

Benefits

Using the “code driven” approach we can be as sure as possible that we do not accept incorrect input, instead of, as in the “content driven” approach, rejecting it later.

Simply, what is not clearly specified in code as expected won’t be processed. It is like wearing a muffler versus curing the flu.

On the other hand, one must write that code by hand, and usually the order of data fields will be forced to be fixed, or it would be too hard to code. One must also deal manually with missing fields, additional fields and all other issues related to format versioning.

This is why I was so picky about formats being expandable and supporting dumb skipping.

Code driven vulnerabilities

Security? No inherent problems. At least if the API is well designed and all operations are bounded.

Usability?

Sure, a lot of trouble. Code driven data processing is very keyboard hungry.

But…

“Code driven” can be used to implement “content driven”

Consider for example a plain “get what we expect” code driven API.

It might look like:

public interface IReader
{
  public String readBegin(int max_signal_name) throws MissingElementException;
  public void readEnd() throws MissingElementException;
  public int readInt() throws MissingElementException;
....
};

This is the pure essence of the “code driven” approach. You have to know what You expect and You call the appropriate method. You call the wrong one, it barfs with a MissingElementException.

Of course it means You must know the file format down to the exact field when You start coding the parser.

If we however define this API to allow “peek what is next”:

public interface IReader
{
  enum TElement{....}
  public TElement peek();
   ....
};

there would be absolutely no problem in writing something like:

public void parse(Handler h)
{
   for(;;)
   {
     switch(in.peek())
     {
        case BEGIN: h.onBegin(in.readBegin()); break;
        case .....
     }
   }
}

and we have just transformed our “code driven” parser into a “content driven” one. Under the condition that we can “peek what is next”.

The opposite transformation is impossible.

“Iterator” versus “visitor”?

Yes, I did mention it at the beginning.

Those two concepts are very much like “code” and “content” driven, and for Your information both
are present, since JDK 8, in the Java Iterator contract.

First let us look at the below pair of methods:

public interface java.util.Iterator <T>
{
  boolean hasNext()
  T next()
   ....
};

They formulate a “code driven” contract which allows us to “peek if something is there” and then get it. If we don’t like it, we do not have to get it.

Then look at the method added in JDK 8, together with an introduction of lambdas and “functional streams“:

void forEachRemaining(Consumer<? super T> action)

This turns it into a “visitor” concept where in a:

public interface Consumer...
{
   void accept(T t)
};

the accept(t) method is invoked for every available piece of data, regardless of whether we expect more of it or not. A short side-by-side comparison follows below.
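Side by side the difference looks like this (a toy list, nothing more):

  import java.util.Arrays;
  import java.util.Iterator;
  import java.util.List;

  List<String> list = Arrays.asList("a", "b", "stop", "c");

  // “Code driven”: we pull, and we may stop whenever we like.
  Iterator<String> it = list.iterator();
  while (it.hasNext())
  {
     String s = it.next();
     if ("stop".equals(s)) break;    // our code decides to go no further
  };

  // “Content driven”: once started, the consumer is fed everything that is left.
  list.iterator().forEachRemaining(s -> System.out.println(s));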

The Reader may easily guess that if one loves the “functional streams” concept, which I don’t, then the “visitor” pattern has great potential.

Note: There is one case where visitors beat iterators: complex thread safety. Thread safe iteration requires the user to ensure it, while visiting puts this job on the shoulders of the person who wrote the data structure.

Summary

After reading this blog entry You should notice that “content driven” parsing is very simple to use, but at the price of being inherently unsafe.

On the contrary, “code driven” is usually an order of magnitude safer, but also an order of magnitude stiffer and harder to use.

If not for the fact that code driven parsing with a “peek what is next” API can be used to implement a content driven parser, the choice would be a matter of preference. Since this is how it is, my proposal of an abstract file format must, of course, be designed around the code driven approach.

Abstract file format API, basic primitives

All right, in this and that blog entry You might have read about file formats.

Then You might have noticed that I defined a certain absolute minimum for a file format, which I called a “signal format”, and proposed some API for it:

public interface ISignalWriter
{
   void writeSignal()...
   OutputStream content()...
}

and

public interface ISignalReader
{
  InputStream next();
};

You may also remember, that I have said this is a bad API.

Today I would like to explain why it is bad.

OutputStream/InputStream are bad guys

Note: For those who are not familiar with java.io I must explain a bit. InputStream and OutputStream are classes which provide a sequential, byte oriented API for binary data streams. You can just write some bytes and read them back. Nothing fancy.

Now, first things first: we are talking about “abstract” file formats. Abstract in terms of an API which allows us to write any data without concerning ourselves with how it is sent to the world. Binary? Fine, no problem. XML? Why not. JSON? Sure, You are welcome. And etc, and etc.

The InputStream and OutputStream are binary. They know only bytes and bits, and we have to play with them to encode our data. We can do “binary”, but the XML won’t just happen without a lot of care from our side. And this is what I would like to avoid in my abstract file format API.

The API I proposed above takes care of naming data structures and telling us where they start and where they end. It also allows us to move around, basically skipping content we do not care about. It does not, however, tell us how we store the data.

All right, but what are the data?

What are data?

Honestly? Anything. But to be more precise: anything You can express in Your programming language.

The primitive types.

In Java our data will then be:

boolean, byte, char, short, int, long, float, double

These are the basic building blocks. Absolutely everything that can be expressed in Java can be told using those data types.

Obviously other programming languages will have a different set of primitive types. The good thing about Java is that those types are well defined. There is no ambiguity like in C/C++: a byte is always an 8-bit two’s complement signed number. This is why I love Java.

Primitive data versus Input/Output streams

Obviously I am not a genius and there were smart people before me. The Java guys thought a long time ago: “hey, why play with bits and bytes in Output/Input streams? Can’t we just play with primitive types?”

And they did introduce DataInput and DataOutput interfaces.

The idea was good… except it was totally fucked up. This was still the era when we struggled with telling apart “contract” from “implementation” (interfaces and pure virtual classes were something new then) and they defined those interfaces like that:

int readInt() throws IOException
Reads four input bytes and returns an int value. 
Let a-d be the first through fourth bytes read. The value returned is:
 (((a & 0xff) << 24) | ((b & 0xff) << 16) |
  ((c & 0xff) <<  8) | (d & 0xff)) 

I let myself quote the problematic part: the bit-by-bit formula. They not only defined that this method reads a 32 bit signed integer from a stream, they also specified exactly how it must be achieved. They messed up the contract with an implementation.

But if You just ignore it and leave only the following fragment:

int readInt() throws IOException
Reads and returns an int value. 

then it is a good abstract API for reading primitive data from a stream. I like it.

What else is wrong in DataInput?

Since I am already pointing out what was made wrong, let me continue and point out other weak and possibly dangerous points in the DataInput API.

The next candidate to yell at is:

String readUTF() throws IOException

which is defined in a triply wrong way.

  1. It specifies how the string is stored in its binary form. This messes up an implementation with a contract, but I have already told You that.
  2. Then the binary format which was chosen limits the size of the encoded binary form of a string to up to 64kBytes. Notice it creates two problems:
    • first, it prevents saving longer strings, and second;
    • You can’t predict if Your string will fit the 64k limit until You try it. The limit applies to the UTF-8 encoded form, and the size of the UTF-8 form depends on the string content. This is silly and makes it unpredictable. Unpredictable code is bug prone and inherently unsafe. You may be fine saving a 65535 letters long US English text, but in Chinese You will hit the limit at about 21 thousand characters, since most Chinese characters encode to three bytes each.
  3. And at last, this API removes from the reader any control over how much data will actually be read and used to build the String. Sure, the 64k encoding limit puts a serious safety constraint on it, but You can’t say how long the returned string will be until You read it.

Why was it done this way?

Because this reflects the constraints put on constants and data in the class file format. The DataInput and DataOutput were initially meant to manipulate Java class files. And this is all.

All right, so how should readUTF be declared then?

Maybe this would be ok:

String readUTF() throws IOException
Reads an UTF or alike encoded string of an arbitrary length and returns it.

This API looks good. We have plenty of similar APIs, right?

Except that it would be insanely unsafe.

And this is where we come to another important factor.

File formats are gateways for an enemy attack

Yes.

The:

String readUTF() throws IOException
Reads an UTF or alike encoded string of an arbitrary length and returns it.

could be a good API if it were used internally, inside a program, to process data it keeps in memory or produces on demand. If however it interfaces with a potentially hostile external world, we have to take more care.

First, we should not limit the usability and must not constrain the length of a String in a dumb way like DataInput did. We may like to store MBytes or GBytes that way. Or we may store sentences just a few characters long. At the implementation side we will have to resort to something functionally like the good old “null terminated string”. Remember, unnatural limits in an API remove its usability.

But having no size limit means…. that we have no size limit.

Remember, the file comes from outside of the program and may be crafted by an attacker to intentionally harm us. For example, an attacker may create a program which just pumps characters into an I/O stream indefinitely, at no cost except the connection load.

What would code implementing the API:

String readUTF() throws IOException

do if it were confronted with such a crafted stream?

It would first allocate some small buffer for the text. Then it would load the text from the underlying file or I/O stream, decode it and append it to the buffer. If the buffer got full before the end of text was reached, it would re-allocate it and fill it again. And again, and again… till an OutOfMemoryError is thrown.

Even though Java is well defended against this type of error, the OutOfMemoryError is one of the nastiest to recover from, because it can pollute the system all around. Imagine one of the threads touching the memory limit. Sure, it did wrong and is punished with an error. But what if a well behaving thread is also allocating some memory during the problematic operation performed by the wrong doing thread? It is just a matter of randomness which of them will be punished with the OutOfMemoryError.

We can’t open this gateway to hell!

The correct API would look like:

int readUTF(Appendable buffer, int up_to_chars) throws IOException
Reads up to the specified number of characters and appends them to a given buffer.
@return the number of appended characters.
        Returned value is -1 if nothing was read due to the end of text.
        If the returned value is less than up_to_chars then the end of text was reached.
        If the returned value is equal to up_to_chars then either the end of text was reached,
        or some characters are left unread and can be fetched by subsequent calls of
        this method.

Sure it doesn’t look very easy to use, but it allows us to keep control and restrain the out-of-resources attack by simply calling:

 int s = readUTF(my_buffer, 64*1024);
 if (s==64*1024) throw new AttackException("String too long, somebody is trying to attack us!");

and be sure that even if an attacker forges a dangerously huge I/O stream it won’t harm our application.
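For completeness, the implementation side could be sketched like this; readChar() is my hypothetical primitive which decodes one character from the underlying stream and returns -1 at the end of text:

  import java.io.IOException;

  public abstract class AReadFormat
  {
     /** Hypothetical primitive: decodes one character, -1 on end of text. */
     protected abstract int readChar() throws IOException;

     public int readUTF(Appendable buffer, int up_to_chars) throws IOException
     {
        int count = 0;
        while (count < up_to_chars)
        {
           int c = readChar();
           if (c == -1)                       // end of text reached
                 return count == 0 ? -1 : count;
           buffer.append((char)c);
           count++;
        };
        return count;  // limit hit: maybe end of text, maybe not; the caller decides
     };
  };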

So how should the API look?

Again I drifted offshore to the lands of the unknown. So let me swim back and return to the API.

The good abstract API needs:

  • to be able to indicate data boundaries and move from boundary to boundary with:
       void writeSignal(signal type)
       signal type readNextSignal()
    
  • to write and read elementary primitives without taking care of how exactly they are stored:
     
       void writeBoolean(boolean x)
       boolean readBoolean() throws IOException
        ...and so on for each primitive
    

    The DataInput and DataOutput are good starting points if You remove from them anything related to “bytes” and encoding. Pulled together, it could look like the sketch below.
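A minimal sketch of the writing side of such an abstract API (the names are mine and purely illustrative; the exact signal type is still an open question, so I use a String name here):

  import java.io.IOException;

  public interface IAbstractFormatWriter
  {
     /** Marks the boundary of a named element. */
     void writeBeginSignal(String name) throws IOException;
     void writeEndSignal() throws IOException;

     /* Elementary primitives; how they are stored is up to the implementation. */
     void writeBoolean(boolean x) throws IOException;
     void writeByte(byte x) throws IOException;
     void writeChar(char x) throws IOException;
     void writeShort(short x) throws IOException;
     void writeInt(int x) throws IOException;
     void writeLong(long x) throws IOException;
     void writeFloat(float x) throws IOException;
     void writeDouble(double x) throws IOException;
  };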

Is that all?

Well… no, it is not. But before I move to more details we will have to talk about content driven and code driven parsing, because it will impact the API a lot and will again show us some serious safety issues which may be created by a carelessly built API.

Summary

After reading this blog entry You should be aware of how the abstract file format should deal with basic data like numbers and etc. You should also be able to point out potential safety issues with file format related APIs.

Software market monopoly – nature of things or an abuse?

Hi,

Today I would like to write about the monopoly building up on the software market, and make an attempt to check what the reason behind it is.

Monopoly as business evolution convergence

Everything evolves. Markets evolve, industries evolve, societies do too. The driving factor of evolution is the old saying of Darwin: survival of the fittest.

OK, sounds good. But what does it actually mean? What does it mean to be the fittest?

Well…

All right, I will say it: Darwin was wrong. He specified the primary driving factor of evolution in reverse. It is not the fittest who survive. It is the one who survives that is the fittest. Nature has absolutely no other means of quantifying an organism’s quality than testing if it can survive.

And exactly the same process happens in business. A business needs funds to live and spread. The more funds it gets, the wider it can spread. The wider it spreads, the larger part of the market it occupies. And… and this is all. Nothing else matters.

Business is not about producing or providing something. It is all about money and market share.

In theoretical conditions we have “competition” which allows “better products” to win. This is the classic free market theory. Nice, isn’t it?

Except it is stupid. It does not take evolution into account. And the evolution of business is driven by just one factor: to survive. Survive and spread.

Now try to think about it for a moment. What is the boundary condition of that spread? What is the limit? How much may Your company expand till it won’t be possible to expand any more?

The limit is the market. Once You have a 100% share You can’t get any more. You may expand the market by creating new kinds of products and thus adding some more pieces to the market pie, but once You consume them there is nothing left to consume anymore.

So this is the upper limit. To get 100% market share.

The monopoly.

And this limit is a natural convergence point of any business.

Toxic monopoly?

Obviously what is good for a business is not necessarily good for consumers. Owning a 100% market share allows You to actually stop providing anything good at any reasonable price. There is nobody else customers can buy it from, so why bother?

And this is the reason why there are anti-monopoly laws all over the world.

Monopoly as technical convergence

Putting aside the financial aspect let us now dig into a technical one.

The building up of a monopoly can be stopped by only one thing: the creation of a competing business. Keeping this in mind one may easily notice that the easier such a business may be created, the harder it is for others to grow into a monopoly.

All right, so let us dig into it.

Creating competition in a material world

In the material world we have constant and significant “per piece” costs. Each time You produce something, or provide Your clients with something, You have to get some materials and do some work. The more complex the product is, the greater the amount of work it requires.

Per piece.

What does it mean?

Well… the more market share You like to grab, the more pieces of the product You need to manufacture. And Your resources are always constrained and always expensive. And the costs are, more or less, proportional to the amount of production. Adding those two together: the more You like to grow, the less complex a product You have to provide.

A good example is a CD drive (or DVD/Blu-ray). If You would be so kind to get Your hands on one of the CD drives produced in the 1990s, You would find there at least three motors: one to spin the disc, one to move the disc shelf and one to move the laser head. But if You would get a 2010 CD drive, You would quickly notice that now there are only two motors: the one to spin the disc, and the other which moves both the shelf and the laser head. Then if You would look at the high precision of the 1990s laser positioning mechanism and compare it with the latter one, You would see how much cheaper and crappier it became.

Notice, I am not saying “crappy” means “bad”. It just means that the simplification process was pushed to the limit, and smarter, cheaper electronics learnt how to do the job well with oversimplified mechanics.

The significant, complexity dependent per-piece cost forces an expanding business to focus on product simplification.

All right, so it happens like that. But does it have anything in common with competition?

It has everything.

The development costs of a product are, roughly speaking, proportional to its complexity. At first glance it may not look like that. But in fact the “development” is not only the “design”. It is everything from the idea till getting the product to the market. And it includes all the necessary investments in the production floor, training, machines and etc.

This means that the less complex the product is, the easier it is to create a competing company. The service of selling fruit is surely less complex than the business of manufacturing chairs, which can be easily observed by looking at how quickly new groceries appear compared to new chair factories.

So there is a natural barrier which prevents material products from becoming more and more complex. The longer a product is on the market, and the more market share it takes, the simpler it becomes. Because simpler means more profitable.

Creating competition in software world

And here we are.

The driving rule for software business is exactly the same: to expand the market share.

However the per piece cost is not connected with complexity. Producing the next license costs roughly the same whether You license Notepad or the Autodesk Inventor 3D CAD system. You can manufacture as many licenses as You want at the same cost, regardless of whether it is a simple few-thousand-lines-of-code program which just makes a “beep” or a multi-million line beast. Complexity does not limit Your expansion. Complexity does not limit Your profit. You have absolutely no pressure towards simplification.

The development cost depends on complexity exactly the same way as in the material world. I dare to say it is even more expensive, because in the material world a gross part of the development expenses are investments in off-the-shelf tools, while in software it is the pure design work that makes the cost. And programmer work-hours are hellishly expensive these days.

What does it mean?

That the more complex the software is, the harder it is to create a competing one. It may be relatively easy to create a better Notepad, I suppose I could struggle with it for, let’s say, 6 months, but I can’t imagine writing on my own a decent 3D CAD within less than half a lifetime.

Software monopoly is technical

The above reasoning clearly shows that the software development and production process disables a very important pro-competition mechanism, and instead enables a mechanism which actively prevents competing products from appearing.

A software company will naturally, continuously invest in development and keep adding functions to its product. Thus, naturally, the complexity of the product will grow.

And naturally, the growing complexity will put a barrier on the competition. In many cases this barrier will be unbreakable. If Autodesk invested 30 years of steady development in their products, the competition would have to invest, let’s say, 10 years of an alike sized team? Maybe more, maybe less, depending on how much Autodesk cut their team in recent years. This is still a huge amount of work and a lot of highly paid work-hours which need to be invested.

A real no-go path.

Summary

I think this is enough for today.

After reading this blog entry You should have noticed that, in my opinion, monopoly in software happens due to natural, technical reasons. Even a fair, non-abusive business practice will converge into a monopoly as long as the produced software is of any value. It does not have to be very good. It simply has to be good enough to sustain 10 years of slow, continuous development. After that the complexity will be so high that the chance that somebody would invest something like 0.5 × 10 years × 20 persons of work-hours to create a competing product is minute.

Even though monopoly in software is technical and natural, it is still highly toxic.

The fact that it is toxic means that some legal limitations should apply. On the other hand, the fact that it is natural means that the current regulations are of absolutely no use.

What are file formats anyway?

In a previous blog entry I roughly described how Java serialization is using its file format and why it is wrong. I also introduced the idea of a pluggable file format.

In this blog entry I would like to dig into the sole definition of a file format.

How is a file format defined?

By hand and on paper.

Really. No joking.

What are You looking for when, let’s say, You are tasked with processing data stored in an STL file?

For a specification. You are looking for a human readable specification.

Format specification

The document called a “format specification” may be either very short and unclear, as in the case of binary STL, or it may be very long, formal… and also unclear, as in the case of the Microsoft *.lnk format.

Regardless of how it is expressed, the ultimate result of reading it is to know which data in the file mean what.

This is all. It may say: “first four bytes store an IEEE32 float for the X coordinate, next four bytes alike for Y, next for Z and so on to the end of file”. This is, roughly speaking, the idea behind the STL format. Or it may say it in a much, much more complex way, as in the case of the said Microsoft format.

The common denominator of such formats is one: if You don’t have the specs, You can’t do anything with them.

Exactly as with Java serialization format.

Intermediate file formats

One may say: “All right, but we have XML. It is self describing and solves everything”.

Almost good. Almost…. No, not at all.

XML is self describing. This is true.

You may open a file and force Your machine to interpret it as UTF-8 text. Or as ASCII text. Or as UTF-16LE text. Or as UTF-32BE text. Or…

Sooner or later You will get a readable text. Then You may look at it with Your human eye and deduce what means what. Unless it is a *.dwf file, which is a “portable” format consisting of XML with one huge text-encoded binary block. What a nice joke they made!

Then, my dear XML fan, why not JSON? It is also self describing. And the plus is, there is no text encoding lottery because it is hard-coded to UTF-8.

Bad formats, bad!

The primary problem with both, and in fact with most formats of this kind, is that the files are huge.

A 64 bit floating point number needs 8 bytes in a binary format. And about 20 or even more in JSON or XML. Not to mention that some numbers which have a finite base-10 form (like 0.1) have an infinite base-2 form, while printing the exact decimal value of a base-2 double may take dozens of digits.
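A quick illustration of the size difference, in plain Java:

  double x = 1.0 / 3.0;

  // Binary: always exactly 8 bytes, whatever the value.
  long bits = Double.doubleToLongBits(x);

  // Textual: "0.3333333333333333" is already 18 characters, and markup
  // like <x>0.3333333333333333</x> around it adds even more.
  String text = Double.toString(x);
  System.out.println(text.length() + " characters versus 8 bytes");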

Note:
At first I added here a lot about parsing and security problems, but later I decided this is not the right moment. So let’s stick with “huge” as the main problem.

Good formats, good!

The most important advantages of XML-like intermediate formats are:

  • they are self-describing;
  • they do allow “dumb skipping” of unknown content;
  • they are expandable;

Self describing means…

For a format to be “self-describing” it is necessary that it somehow, in a standardized way, gives names to the elements it carries. Since the way of giving names is standard, You may take a file, parse it using a standard parser, and see what names appear and in what order. With this information You may easily guess what is stored where.

Both XML and JSON are self-describing.

Dumb skipping of unknown content is…

This functionality is tightly coupled with the previous one. The standard way of giving names means that there must be a standard way to find names. This way must be independent of the content carried inside or between named elements. If it depended on it, You would not be able to find the names.

For example we may create a text format whose specification says:

File is divided into tokens separated by “,”. First token is the name of first element. After the name there are some “fields” and then there is the name of a next element.

The number and meaning of fields is following:

  • for name “point” we have two fields (X,Y);
  • for name “circle” we have three fields (X,Y,R);

This format does not allow dumb skipping. You must know the mapping from names to counts of tokens to find which token is a name and which is a field.

If for example this format is modified to:

File is divided into tokens separated by “,”. First token in a line is the name of an element (…)

then this format would allow dumb skipping, because the name is always first in each line.

Dumb skipping is very important because it allows You to extract the data of interest from a file without bothering about the full syntax of the file.

And expandable is…

This is almost like “dumb skipping”, but not exactly alike. “Dumb skipping” allows You to ignore elements You do not understand. For example, if version 1.0 of the above simplified format knew only a “point” and a “circle”, and version 2.0 added:
(…)

  • for name “rectangle” we have four fields;

then a parser understanding version 1.0 may parse a 2.0 file. It won’t be able to react correctly to a “rectangle”, but its presence won’t stop it from understanding the rest of the file. And what would it do with a “rectangle” anyway, if the application it is built into does not know rectangles?

If however version 1.1 would add:
(…)

  • for name “circle” we have four fields (X,Y,R,A). First two being X and Y, next the radius, and next the aspect ratio;

then our version 1.0 parser may read “circle”, read three fields and then expect a name. Which is not there. If the file format is expandable, the parser should not be fooled by this, and the request “and now I expect the name” should be correctly fulfilled by skipping the aspect ratio added in the 1.1 version of the file.

In other words, to be expandable the format must allow “dumb skipping” regardless of at which token the cursor is; a small sketch follows below.
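For the line oriented toy format above, dumb skipping could be coded like this (“shapes.txt” is, of course, just a made-up file name):

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;

  try(BufferedReader in = new BufferedReader(new FileReader("shapes.txt")))
  {
     String line;
     while ((line = in.readLine()) != null)
     {
        String[] tokens = line.split(",");
        switch (tokens[0])
        {
           case "point":  /* use tokens[1..2] */ break;
           case "circle": /* use tokens[1..3]; extra v1.1 fields are simply ignored */ break;
           default:       /* unknown element: dumb-skip the whole line */ break;
        };
     };
  };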

So why do I find XML bad?

Because in both cases, XML and JSON, the declaration of the API:

and You can start element with a name, then write content, naming possible sub-elements...

is bundled together, inseparably, with the implementation of the API:

 XML element starts with <name and ...

Smallest common denominator

Now let us ponder: what is the smallest common denominator of XML, JSON and “specification” file formats?

To know where the information about X starts and where it ends. “Specification” formats say it, frequently, through the known position of a cursor in a binary file. XML and JSON use a kind of syntactic marker.

And this is it.

The smallest common denominator is the ability to say, when writing a file:

“Here is a boundary of the element”

This is all we need to be able to parse format element-by-element.

Elementary signal file format

This smallest common denominator API may be defined in Java like:

public interface ISignalWriter
{
   void writeSignal()...
   OutputStream content()...
}

The writeSignal() writes a kind of “syntactic marker” (let us for now ignore how it does it), and the stream returned by content() allows us to write raw bytes into such a format.

The reading counter-part may look like:

public interface ISignalReader
{
  InputStream next();
};

where next() finds the next, nearest “syntactic marker” and returns an InputStream object which allows us to read the content up to the next “syntactic marker”.

The very important functionality is that next() must work regardless of how many bytes were read from the previously returned InputStream. That is, it must support both “dumb skipping” and “expandability”.

Summary

After reading this blog entry You should have some idea what the requirements for a good file format are, and how an elementary, good API may look. I do warn You that in fact this is NOT a good API yet, but it illustrates the concept well.

In the next blog entry I will expand that idea into a bit more sophisticated form.

Towards abstract file format

Today I would like to talk about complex file formats.

Anyone of You who programs has most probably been either reading or writing files. If You got lucky, You were supplied with some format specific API. If not, You had to get the format specification and write the API Yourself.

How many times have You had to do it? Five? Ten? More?

I got a bit pissed off having to do the same brainless work again and again. I have my data, I know their structure, and I would like to just push them to a file. I should not care if the format is XML, JSON or any other binary format, right?

Java Serialization

Java serialization was a brilliant step forward, but it stopped halfway.

Note: For those who do not know what serialization is: You take an object, You take a stream and say “write that damn object to a stream”. And serialization writes the object and all objects it references. Just like that.

Why am I saying that? Well…

Because the serialization is a fixed, hard-coded binary format. Even worse, it is implemented in such a way that there is no clear “borderline” which You may override to serialize the object to XML instead.

Sure, You will say, we have other serialization engines for Java which write XML. Yes, we have. But they also come with a hard-coded format and, what is much worse, with their own object processor.

In fact, what I need is a standard serialization engine with a “pluggable format”. Something like that

How is it done currently?

The current serialization source code is built like this:

Taken from LTS JDK 8 source
....
  private void writeHandle(int handle) throws IOException {
        bout.writeByte(TC_REFERENCE);
        bout.writeInt(baseWireHandle + handle);
    }

This is a part of ObjectOutputStream.java which is responsible for writing to a stream a reference (pointer) to an object which was already serialized (at least partially serialized). This is a good API for a serialization format. Having an API with writeHandle() would be nice. The implementation is however utterly stupid. At least from an object oriented programming point of view.

This method should be:

   protected abstract void writeHandle(int handle) throws IOException;

and should be declared in class AbstractObjectOutputStream. Then a DefaultObjectOutputStream class should be declared and it should carry the implementation:

  @Override protected void writeHandle(int handle) throws IOException {
        bout.writeByte(TC_REFERENCE);
        bout.writeInt(baseWireHandle + handle);
    }

If it were done this way, we could easily change the binary format to XML, a text dump or whatever we would like to have. See the sketch below.
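For example, a hedged sketch of an XML flavoured implementation could be as simple as this (the element name is mine, nothing official):

  public class XmlObjectOutputStream extends AbstractObjectOutputStream
  {
     private final java.io.Writer out;

     public XmlObjectOutputStream(java.io.Writer out){ this.out = out; };

     @Override protected void writeHandle(int handle) throws java.io.IOException {
        // The same back-reference, spelled as XML instead of TC_REFERENCE bytes.
        out.write("<reference handle=\"" + handle + "\"/>");
     };
  };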

Note: Looking at the serialization source code one should derive the bright idea that You should not let inexperienced coders code new ideas. The idea of serialization and the algorithms behind it were very new at the moment. Nobody had ever done something like that before. Sure, it has numerous conceptual bugs, but the coding… it was a sea of errors which can be made only by very inexperienced coders.

Making it pluggable

The obvious part to be made later would be:

public interface ISerializationFormat
{
    public void writeHandle(int handle) throws IOException;
.....
}

and

public class PluggableObjectOutputStream extends AbstractObjectOutputStream
{
     private final ISerializationFormat fmt;
  ....
    public PluggableObjectOutputStream( ISerializationFormat fmt )....
    ....
@Override protected void writeHandle(int handle) throws IOException {
       fmt.writeHandle(handle);
    }
}

This way we can use the precious “wrapper” technique to debug serialization by, for example:

   AbstractObjectOutputStream o = new PluggableObjectOutputStream ( new LoggingSerializationFormat ( new DefaultSerializationFormat( ....

Try doing it now…

Benefits from pluggable serialization

One may say: “All right, so You have a problem with that. This is just because You are lazy. If You need a different format, why not write the serialization Yourself? The algorithm and sources are public, right?”

One may be right. Maybe.

But only if that one did not try to do it. I have tried it. Three times.

The serialization algorithm is not trivial. But it can be done. In a very ugly, sub-optimal way, but it can be done.

What cannot be done, at least not in a portable way in pure Java at the JDK 8 level, is de-serialization. Specifically a very, very tiny bit of it, which translates to Java bytecode:

   new "x.y.class"
   without calling a constructor

This sequence of bytecode can be executed by the JVM but is prohibited and rejected by the class file verification mechanism. You can’t have an object without a constructor being called, and serialization specifically does not require a “do nothing” constructor to exist. This action must be implemented by digging into the JVM guts, and thus the special open source project called Objenesis was created. But this project is no magic and does nothing more than “check what JVM you run on and hack it”.

So implementing an exactly compatible de-serialization algorithm is at least a very time consuming task.

Just to make it, let’s say, XML?

If it were pluggable, then there would be no problem at all.

Serialization format API or…?

Up to now I was talking about the very specific case of the Java serialization API. This API is very focused on dense packing of Java objects. If You just try to use it to save a struct of some fields, You will notice that it pollutes the stream with things called “class descriptors”, reference counters and etc. While You just wanted to have a plain, dumb structure, right?

I think we should now focus on thinking about what exactly the file format is.

But this will be in the next blog entry.

Summary

After reading this short blog entry You should have grasped the idea why something like a “pluggable file format” may be useful. You should also know what inspired me to dig into the problem.
In the next blog entry You will be shown details of how exactly we should define such a format.

Why not to write a “User Guide”

In a previous post I wrote about a beast called a “User Guide” and explained why it is useful for a user. Or for Your dear current or future customer.

In this post I will take the “devil’s advocate” side and will try to convince You why You should not ever create any kind of such a user helping document.

So let us start with the obvious reasons.

Costs and expenses

Direct costs

In the past era the “User Guide” brought companies a direct per-piece cost. It was a printing cost. It was especially visible for software producing companies, where the cost of printing and shipping a few hundred pages long book quickly dominated the distribution costs and thus directly and negatively influenced the profit.

Recently I had the pleasure to take part in preparing a very low volume, professionally printed publication. This publication was shipped in printed form with an attached DVD+R disk. The cost of the printed book was about 50 times (yes, fifty) that of the DVD+R, and due to its weight it tripled the shipment costs.

Obviously nowadays You may ship a book over the Internet in an electronic form for about 0.1% of the cost of a hard-copy book. And obviously You can charge Your clients who like to order a printed copy. This makes life easier for software companies, because their per-piece distribution costs are pushed down to near zero. Sure, You have to pay for a server, but honestly, if You like to be seen on the net You simply must have one already. Even if server bandwidth does generate a cost for You, please compare it with the per-piece cost of producing a turning lathe which is, surprisingly, often priced at the level of some not-so-decent software.

Preparation costs

A “User Guide” must be written. This is obvious.

Back in 1990 the software market was so elite that it was enough to supply an English “User Guide” to any location all over the world and everyone was happy. Around 1995 the software companies started focusing on providing localized versions of their software. And if You provide localized software, You also have to provide a localized “User Guide”.

This generated significant costs, but since You had to order a professional translation of Your entire user interface and context sensitive help anyway, it added maybe 15% to the cost of localization. I know it from experience. I had the displeasure to provide a translation for one of the applications I wrote, and I must say that translating the “User Guide” was a piece of cake compared to the translation of the user interface and all the help texts.

Nowadays however You have another path to choose when translating a user interface or help, one which does not involve any professional. This is the automatic translation. Honestly, about 50% of the pro-level software I have seen recently is automatically translated. If You like to know my opinion, it is fu*ng users in the… what do You call it… behind. The automated translation is usually so idiotic that Your software will lose about 10…20% of functionality and will generate a tremendous amount of frustration at the user side. But it is inexpensive.

A bad and misleading translation in a user interface is usually, sooner or later, accepted by users. They will swear and curse You, but they will discover what it does despite what it is called and will map it in their minds. Once they do, they will just use it the “monkey way”.

Unfortunately what can be accepted in a user interface can’t be accepted in a book. A book must be readable. There is no way to click and try what the book says. Automatic translation of books always gives at least funny results, and the top level it reaches is the level of a novel written by an elementary school pupil.

And context sensitive help? Well… there is a reason why help is usually so hopelessly unhelpful that nobody will even try to read it. Autodesk is using this method and I assure You that very few users read the original English help and almost none read the localized one. Not after trying it once or twice. There is simply no point in doing it.

Maintenance costs

Each time You change Your software You need to update the documentation. Updating the help is enough of a pain in the behind, but keeping a human translated “User Guide” up to date in all the languages will be a nightmare.

I am not telling You it can’t be done. But it will cost You and will delay the shipment to the market. This is because You can’t update a manual until the software tests are complete. And since manuals need to be translated by humans keen on the subject and fluent in both the source and target language, You will most probably have to outsource it, which means that Your request will have to wait a bit in a queue.

More important issues

What can be more important than costs?

Well… Costs are cut from profits. We don’t earn money on them, but nobody is charging us any money either.

And how about refunds, warranty claims and lawsuits?

Those are worse than production costs, because they take our precious money from us!

To request a refund or file a lawsuit one needs a good reason. All good reasons have one common denominator: the supplied product does not fit the agreement!

The “agreement” is a vague way of saying: “it doesn’t do what it promised to do“. In the material world it would be not doing what it is reasonably expected to do (for example a hammer bending when hitting a nail), or not matching the description on the box, or, finally, being out of the declared specs or functionality.

Currently software companies are doing everything to escape warranty, and can say: “You bought it, You paid for it, You have seen the limited warranty, so what do You want? Sure it doesn’t work. And so what? Just fu* off and go away.”

But this gives You bad publicity. You can live well with bad publicity if You are smart in exploiting the “vendor lock-in” business method, but otherwise it is not wise to piss off Your clients.

“User Guide” as a promise

And this is a problem.

Let’s say You made a “notepad” like application. Then you added a button which should allow making audio notes, but due to a bug it will record only 30 seconds and then stop.

Well, it records a sound, right? It does what user may reasonably expect from an non-described red dot in an note taking program. There is absolutely no way for a user to successfully complain, because they can’t point to any promise You made about recording audio notes. And, by the way, 30 seconds is a reasonable limit for an audio note, right? You can say it was designed this way.

Everything changes if You will write a “User Guide”. You can’t be vague in a book, and You can’t tell about this button just: “it records a voice note“. Most probably You will be more descriptive and will probably show user how to record it, where to find recorded file and how to play it. In 99% of cases You will make either direct or indirect promise, for an example by claiming that “the length of voice note can be limited by the amount of free space on a disk“.

And You are boned. You made a promise and the user may complain.

You must not make any promises. Because if You promise something You won’t be able to tell Your clients to bugger off.

Each promise is binding, and a “User Guide” is full of them. This is bad enough, but if You look closer it is even worse.

Imagine how vulnerable You become if You publish the “User Guide” but at the same time cut the maintenance costs and fail to keep it well translated and up to date. Remember, even a badly translated promise may still be read as a promise. If You, by a translator’s mistake, promise to “give a head”, don’t be surprised when some guys drop their pants in front of Your secretary.

Summary

I hope You now understand how evil the “User Guide” is. It creates promises which make You vulnerable to legal or financial claims, and it generates a hell of a lot of costs. As if that wasn’t bad enough, this creature is so sinister that the more You cut the costs of creating, validating and maintaining it, the more it bites.

From a business point of view the best possible policy is to:

  1. Create a good and efficient “vendor lock-in” environment around Your product.
  2. Absolutely never provide any kind of “User Guide”.

Ehmm….

Right. Go and do it.

See You back in Hell.

User Guides

old_fart_mode = on

They don’t make user guides like they did before… When I was young the manuals were clear, easy to understand and really, really helpful. Not like they are today…

STOP!

old_fart_mode = off

Stop, old fart, You don’t know what You are babbling about!

What f*ng user guides?!

Exactly.

There is no point complaining about the quality of something that doesn’t exist anymore.

What is a user guide?

For anyone who is older than, let’s say, thirty this looks like an utterly stupid question. But for the younger generation I think it is important to explain what the user guide is. Or rather what it was.

I suppose that while reading this You start thinking that I am an utter idiot. Sure, I am. But before You leave just think: have You seen a full blown user manual for any commercial software You used during the last, let’s say, ten years? Fifteen? Twenty? Well… maybe twenty years ago You might have seen one.

So what the hell is a user guide?

First step: a user guide is a book.

It may be on paper or it may be in an electronic form. It doesn’t matter. What matters is that it is a book and it is obliged to behave like a book. This means it must be easy to read, must use clear language and must be easy to understand.

Every requirement of a good novel must be met, except, maybe, the flexibility of the language. The story must flow, the gun must be hung on the wall before it fires. And everything like that.

All right, so now that You know what a user guide was, it is time to ask Yourself: what was it used for?

What is a user guide used for?

For a carpenter a hammer is for hammering a nail. For a nail a hammer is the thing it gets beaten on the head with. This means that the answer to the above question will strongly depend on whose point of view we are looking at the problem from.

Whom is a user guide for?

It is very important to understand that a user guide is for… well… the user. When You write a user guide for Your company’s software remember that this is not a book written for internal use. There is no “everyone knows what <enter any important program function here> is”.

A user guide is for a user. Or, what is even more important, for a future user. And a future user has one very important property: they do not know anything about Your program. It may be obvious to You that RHINO is… hell if I know what it is… but for anyone else it will be a damn big animal that loves bathing in the mud.

A user guide tells the user what the software is useful for.

This may again sound stupid, but many manual writers forget about it. Try, for example, the manual of the mockito Java library. Starting from the first sentence they are telling You how to mock, how to make a mock, how to use a mock, and how much mocking will help You. I lost about three hours, which included looking into its source code, to understand what the hell a “mock” is anyway.

This kind of user manual is a pure form of “customer repellent”. If Your future user can’t understand, after reading a page or two, what this software is for, then how can You expect them to buy it? One working hour costs, let’s say, about 30 Euro. Do You honestly expect Your future client to spend 50 Euros just to find out that this software is not for him? Sure, it may be worth it when selecting 5’000 Euro software, but for anything around a thousand it is naive.

Remember, Your future user must select Your software from tens of available solutions!

OK, so Your user has read the “Introduction” chapter and now understands that this software may be for him. What is the user guide used for then?

A user guide allows You to avoid costly trial-and-error usability validation

The other day I tried to install on a smartphone something that would allow me to record phone calls. This kind of activity is fully legal for a private person in my country. My old Nokia phone has this function. This new smartphone advertises on the Samsung support page that it has it, but the function may be not available in some countries.

Notice the underlined may be and some.

So does this smartphone have it or does it not?

The only way, according to Samsung, to gain this knowledge is to spend 500 Euros to buy the phone, check it, return it and hope for a refund… Nice, isn’t it?

So I tried the Google phone app. Again, You may find on the support page that to get this function You must have Android 9+ and the newest version of the application. Fine, no problem… except that after spending a lot of time on downloading it, and after allowing it to crawl all the data on the phone, it turned out that this function is not there anymore.

If both the smartphone and the phone app had a well made user guide I would have known it from the start and stayed with my Nokia phone.

A good user guide allows Your future user to decide if Your product fits exactly his/her needs. Not just may in some countries but it f*ng does!

User guide explains how to do things…

So what would I be looking for in that above-mentioned non-existing user guide?

I am a very, very precise person, so I would have skipped the “Features” chapter and looked for a chapter called “Recording phone calls”. I would look into it and check how to do it. I would possibly find there some information about where the recording is stored and in what format. If there are any limitations, it is also probable that they would be explained there…

….in every f*ing detail!

And I mean it. With screenshots, colors and so on. That damn button does this when You touch it. That empty space, when You touch and hold it, will do something else. And if You swipe left to right over that mugshot at the top, the phone-book entry will appear.

A user guide saves the user from spending hours on “discovering functions”

“Discoverability” of a user interface is the myth which stands behind the idea of the “flat user interface”. Flat means: active and inactive elements do not differ.

If You read this blog, which uses a ready-made template without me being able to tune it, You will notice at the top the gray “Main“, “Blog“ and “About“. Below You will find the equally gray “software quality and concepts“.

Do they differ visually?

Not much. At least not for me. They are all gray text, are they not?

Except that the first three are active and the last is not.

Gladly the designer of this template was not an utter idiot and made those elements react to a mouse hovering over them. So if You try moving the mouse around over every damn object on the page You will probably find out what is active and what is not.

I allowed myself to underline try moving, because unless You try to do something You won’t even know that there is such a possibility hidden up there.

Sadly not everyone is as smart as this author was.

Nevertheless, did You ever try to hover Your finger over something on a smartphone? It never works for me.

And what if “flat” is taken to an extreme, as on the deviantart.com page where empty space reacts to a mouse click? Obviously with this style of design, to be able to “discover” all functions You will have to end up left clicking, right clicking, dragging, tapping, pinching, swiping and pushing every damn inch of the screen to make sure that You did not miss anything important. And remember, on a PC You also have 100+ keys on the keyboard which, with Your ten fingers, can be pressed in an astronomical number of combinations.

I am not against “flat” design. If a fashionable look is very, very… very^infinity important, then it is a must. I have nothing against the top left empty corner reacting to a mouse click when some text is selected, providing I can read about it without spending hours searching the net.

“Discoverability” is madness, and a good user guide may save countless tens of hours of work and frustration.

And, apparently, save the life of a smartphone which may not survive being thrown out of the window from the fifth floor.

Three kinds of users

There is one more thing to remember about the “user” the manual is for: there are three of them, listed in order of appearance:

  1. The future user, who needs to know if this product will meet their needs.
  2. The first time user, who will learn, step by step, how to use the program in the most efficient and proper way.
  3. The returning user, who used the program some time ago, but just this damn damp day is too hot for him/her to remember how it was done and where this damn button is hidden.

They all need Your manual, but each needs it for a different reason and will use it in a different way.

Summary

In this blog entry You might have read some obvious things. For some of You it was so obvious and boring that You just clicked away. But for some of You it might be an eye-opener into the land of software which is not only nice looking, but also easy to find, easy to learn and easy to use.

In the next blog entry I will try to switch sides and explain why there are no user guides anymore and why You, a software company owner (or in fact: any company owner), should do absolutely everything to prevent Your employees from creating such an evil thing as a “user guide”.

RtOS: Protect Your stacks

In that post I presented to You the possible problems with per-task stacks in an RtOS.

And I promised to show how to fight them.

Stack barriers

Let us do again some graphics:

The static stack layout with barriers.

The red zones are “stack barriers”. Nothing should write there except the RtOS init process which sets up all task tables. Since in microcontrollers You have, as I already said, a fixed number of tasks, the memory layout can also be static. That means that the process of starting up a task:

SP = begin_of_Task_A_stack
push Task_A_START ;address_of_first_instruction_in_task_zero
SP= SP + x        ;fake pushing "called save" registers
TaskTable[A].SP = SP

(see this blog entry)

may look like:

SP = begin_of_Task_A_stack
push Task_A_START ;address_of_first_instruction_in_task_zero
SP= SP + x        ;fake pushing "called save" registers
TaskTable[A].SP = SP
TaskTable[A].SP_BARRIER_ptr = Task_SP_BARRIER ; store stack barrier address for easy access
[Task_SP_BARRIER] = 0x4759 ; fill stack barrier with fixed value

What we basically do is fill the stack barrier with a fixed, known value. In my example I have chosen 0x4759 because the CPU I use:

  • is 16 bit, thus the stack is using 16 bit elements;
  • is using word-aligned code, so a value with bit 0 set is never a valid return address.

Normally the RtOS kernel does something like that in yield():

subroutine yield()
  push "called save" registers
  TaskTable[CTP].SP = SP
  for(;;)
  {
   for(i=NUMBER_OF_TASKS;--i>=0;)
   {
    CTP--
    if (carry over/borrow) CTP = NUMBER_OF_TASKS-1
    if ((TaskTable[CTP].event_flags &
         TaskTable[CTP].event_mask ) !=0)
        {
        SP = TaskTable[CTP].SP
        pop "called save" registers
        return
        }
   }
  sleep(X)
  }

(see this blog entry)

To use stack barriers we need to add something like this:

subroutine yield()
  push "called save" registers
  TaskTable[CTP].SP = SP
  if ([TaskTable[CTP].SP_BARRIER_ptr]!=0x4759)
  {
     die
  }
  .....
  }
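
If You prefer C to my pseudo-assembly, the same idea may look like the minimal sketch below. All names here (task_t, NUM_TASKS, die() and so on) are my own inventions for illustration, not part of any real kernel:

#include <stdint.h>

#define NUM_TASKS            3
#define STACK_BARRIER_MAGIC  0x4759u  /* bit 0 set: never a valid return address */

extern void die(void);                /* whatever Your "die" reaction is */

typedef struct {
    uint16_t *sp;                     /* saved stack pointer of the task       */
    uint16_t *barrier;                /* address of the one-word stack barrier */
} task_t;

static task_t task_table[NUM_TASKS];

/* Called once by the RtOS init process for each task. */
static void init_barrier(task_t *t, uint16_t *barrier_addr)
{
    t->barrier  = barrier_addr;
    *t->barrier = STACK_BARRIER_MAGIC; /* fill the barrier with the known value */
}

/* Called inside yield(), right after saving the task context. */
static void check_barrier(const task_t *t)
{
    if (*t->barrier != STACK_BARRIER_MAGIC)
        die();                         /* barrier overwritten: stack overflow */
}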

Saving RAM

If You are as concerned about RAM use as I am, then You have already noticed that I wasted additional RAM on TaskTable[CTP].SP_BARRIER_ptr. It must be done this way if Your task layout is dynamic, but if it is static it can be done differently. You can hard-code in Your program memory a table which can be indexed with CTP (the Current Task Pointer) and save those bytes of memory:

const TaskBarrierTable[Number_Of_Task]
{
   SP_BARRIER_ptr
} = {{SP_A_BARRIER}, {SP_B_BARRIER}, .....}

and then use it like that:

subroutine yield()
  push "called save" registers
  TaskTable[CTP].SP = SP
  if ([TaskBarrierTable[CTP].SP_BARRIER_ptr]!=0x4759)
  {
     die
  }
  .....
  }
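
In C the same flash-table trick could look, roughly, like the sketch below (again my own names; the barrier addresses would in reality come from Your linker script, and on Harvard machines like the AVR You would additionally need something like PROGMEM to really keep the table out of RAM):

/* Barrier word of each stack; hypothetical symbols from the linker script. */
extern uint16_t stack_a_barrier, stack_b_barrier, stack_c_barrier;

/* "const" lets the toolchain place the whole table in program memory,
   so no RAM is spent per task. */
static uint16_t * const barrier_table[NUM_TASKS] = {
    &stack_a_barrier, &stack_b_barrier, &stack_c_barrier
};

/* Inside yield(), instead of reading TaskTable[CTP].SP_BARRIER_ptr: */
static void check_barrier_rom(unsigned ctp)
{
    if (*barrier_table[ctp] != STACK_BARRIER_MAGIC)
        die();
}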

How does it work?

The whole idea in helping with the detection of “dancing on the stack” is to detect the problem as soon as possible. The stack barriers do that by checking, at the nearest yield(), whether the task which is currently giving up the CPU messed with its stack barrier. This way You may place a break-point at die and catch the culprit as soon as possible. Since it has just finished messing with the stack, You should be able to inspect it relatively easily and figure out what it was doing before.

Dying gracefully

The die… well… what should You do when You detect a stack barrier problem?

For sure You should not continue to run.

In my opinion it is best to store that fact in some cell of the FLASH/EEPROM memory of Your CPU and just force a watchdog reset.

Thanks to that, if a device returns to Your company for scrapping, Your quality assurance team may inspect the fault log and inform You whether such an error ever happened.
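
A sketch of such a graceful death could look like this (eeprom_write_byte, __disable_irq and the log address are placeholders; every chip has its own way of writing persistent memory and of not kicking the watchdog):

#define FAULT_LOG_ADDR   0x0000u  /* hypothetical EEPROM cell       */
#define FAULT_STACK_CODE 0x01u    /* our "stack barrier hit" marker */

void die(void)
{
    __disable_irq();                                      /* no more task switches */
    eeprom_write_byte(FAULT_LOG_ADDR, FAULT_STACK_CODE);  /* persist the fact      */
    for (;;)
    {
        /* deliberately do NOT kick the watchdog; it will reset the CPU */
    }
}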

It is not a silver bullet

Stack barriers are not silver bullets.

First, they can be cheated: a variable may happen to hold the matching value at the place where the barrier is allocated, or, if some local variables are left unused, the barrier may be jumped over without being touched at all.

And second, they detect the problem post factum. You can’t stop the program exactly at the place where it is messing with the stack barrier.

MMU hardware barrier

I must say at once: I have never done it. I have never used a microcontroller with a Memory Management Unit, that is, the big complex beast which can provide You with Virtual Memory.

If however You do have such an MMU, why not use it to implement a stack barrier? Just map Your stacks in such a way that an entire Virtual Memory page acts as the barrier, and set up the MMU so that it raises a “page missing” interrupt when the CPU accesses that page. With this You will have a truly working stack barrier at no software cost.

MMU stack layout when MMU is used as stack protector.

There are two downsides to this approach:

  1. Your stack needs to be exactly a multiple of the page size. It can’t be any smaller, because otherwise the barrier could not sit right behind it.
  2. If Your MMU has fewer hardware page registers than necessary to cover the entire memory You have, then enabling the MMU will slow down Your code even without virtual memory.

    This is because the MMU generates an interrupt when it detects an access to a memory address which is not covered by the hardware page registers. It is then up to the interrupt handler to check the software page registers and update the MMU hardware setup.

    For example, if You have 2MB of RAM+FLASH and enough registers to cover it, then the interrupt will never happen during normal code execution and there will be no negative impact from turning the MMU on. If however You have 2MB of RAM+FLASH but only enough registers to cover, let’s say, 1MB, then You will get interrupts which would not happen with the MMU disabled. Even worse, You will have to adjust the MMU hardware dynamically or nothing will work.

    Not nice to do.

MPU barrier

MPU stands for “Memory Protection Unit”. Plenty of microcontrollers have one, even the Cortex-M0+. Older 8 or 16 bit machines most commonly do not.

The MPU acts as a physical barrier and is intended to provide exactly the kind of protection we need.

The MPU can be used in two conceptual settings:

  1. As a barrier, when You set the MPU up once and for all to barf at the “stack barriers”, or;
  2. As a protector, when You set the MPU up differently each time You switch the task, putting the other tasks’ stacks under protection.

MPU protection zones layout in static and dynamic setup.

The static setup requires no action during task switching, so it can be added at any moment and adds no complexity to the RtOS kernel. The dynamic setup may be expensive or inexpensive, depending on how the MPU is implemented.

On the other hand the static setup can be “jumped over” by unused local variables, while the dynamic setup gives full protection. As usual, everything comes at a price.

Of course You can’t use the MPU if it does not have enough protection zones. Thankfully most have at least 8 zones, which should be enough for an embedded RtOS.
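
Just as an illustration, a static “barrier” setup could look like the sketch below, assuming a CMSIS-style ARMv7-M MPU (the MPU->RNR/RBAR/RASR registers as declared in the CMSIS headers); the 32-byte region size and the alignment handling are simplified:

#include "core_cm4.h"   /* or whichever CMSIS header fits Your part */

static void mpu_set_barrier(uint32_t region, uint32_t barrier_addr)
{
    MPU->RNR  = region;                    /* select the region slot          */
    MPU->RBAR = barrier_addr & ~0x1FUL;    /* 32-byte aligned base address    */
    MPU->RASR = (0UL << MPU_RASR_AP_Pos)   /* AP=000: no access at all        */
              | (4UL << MPU_RASR_SIZE_Pos) /* 2^(4+1) = 32 bytes              */
              | MPU_RASR_ENABLE_Msk;       /* enable this region              */
    MPU->CTRL = MPU_CTRL_PRIVDEFENA_Msk    /* default map for everything else */
              | MPU_CTRL_ENABLE_Msk;
    __DSB(); __ISB();                      /* make the new setup take effect  */
}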

Checksum barrier

If You have neither an MMU nor an MPU, and You are worried that Your stack is still being broken despite the stack barrier, You may think about a stack checksum.

With this concept, each time You yield() You need to compute the checksum of the entire stack of the task and store it inside the task table:

subroutine yield()
  push "called save" registers
  TaskTable[CTP].SP = SP
  TaskTable[CTP].CHECKSUM = compute_checksum(
        TaskBarrierTable[CTP].SP_START,
        TaskBarrierTable[CTP].SP_LAST)
  ....

and when You do awake Your task:

subroutine yield()
  ....
        if (compute_checksum(TaskBarrierTable[CTP].SP_START, TaskBarrierTable[CTP].SP_LAST)
                != TaskTable[CTP].CHECKSUM )
        {
              die
        }
        SP = TaskTable[CTP].SP
        pop "called save" registers
        return

and, of course, update the constants table describing the stacks:

const TaskBarrierTable[Number_Of_Task]
{
   SP_BARRIER_ptr;
   SP_START;
   SP_LAST;
} = {{end_of_task_A_stack+1,begin_of_task_A_stack, end_of_task_A_stack},....}

This method is harder to fool than stack barriers, but it is many, many times more time consuming. It also delays the detection even more, because it checks for corruption not on the possible culprit’s side, but on the victim’s side.
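
For completeness, compute_checksum itself can be as trivial as the sketch below; this is a Fletcher-like sum of my own choosing, and any function sensitive to both the value and the position of the words will do:

#include <stdint.h>

/* Sums every 16-bit word of the stack area [begin, last]. The second
   running sum makes swapped words detectable as well. */
static uint16_t compute_checksum(const uint16_t *begin, const uint16_t *last)
{
    uint16_t a = 0, b = 0;
    for (const uint16_t *p = begin; p <= last; p++)
    {
        a += *p;
        b += a;
    }
    return a ^ b;
}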

Summary

After reading this blog entry You should be aware of how You can help Yourself in detecting stack problems in an RtOS.

If You have a Memory Protection Unit, use it. It is the best possible protection and it costs You almost nothing.

If You have a Memory Management Unit, consider using it. But honestly, I suppose that if Your CPU has an MMU You are already running Linux on it, so there is no place for a hand-crafted cooperative RtOS.

And finally, for good old 8/16 bit CPUs or the Cortex-M0/M1, I recommend simple stack barriers. I would use checksum barriers only when hunting for very nasty bugs, and I would keep them disabled in production code.

Click-bait: Russia is good, Ukraine is bad

Begin Of Click-Bait 
{

Hello dear bots!
Please follow this blog entry.

Yes, it is a click-bait. I am trying to learn how to promote this blog, and one of methods is to include subjects which are controversial and will stir some emotions. So here You have it.

I am not into politics, and especially not into a war, or a “preventive operation” as the Russian government calls this war-like military operation in Ukraine.

Unfortunately I live in Poland and I have a very clouded view of it due to the massive information war we are subjected to from both the Russian and the Polish governments. I am always concerned that someone is trying to make a fool of me, and now, unfortunately, both governments and their ass-licking media are trying to make an idiot of me. My government and the Russian government.

It is a very uncomfortable situation when You can’t trust what Your own government and own media are saying.

So just to make it clear: I don’t like wars. But if somebody else likes them, it is his problem as long as I don’t have to fight. I am very surprised that Ukrainian citizens are willing to be killed to defend their politicians. If it were me, here in Poland, I would surrender as soon as possible.

We had a funny saying back in the 1980s:

How to make Poland rich? Start a war with USA and surrender next day.

I don’t care who rules my place as long as he is doing it well. And currently I can’t see much difference, on the minus side, between Wladimir Putin in Russia and Jaroslaw Kaczynski in Poland. I don’t know how Zelensky looks against this background, but considering the cultural lag in all post-Soviet countries I have some doubts.

One thing is sure: this “operation” is bleeding both Ukraine and Russia to death. Billions of brain cells are dying in it, and human brains are the real priceless resource without which any kind of civilization and progress is impossible. Stop wasting them on political whims.

On the other hand, using cold, brute political thinking, it is good. For anyone except the people of those two countries. This war makes both Ukraine and Russia suffer significant losses. There is no chance that either of them will come out of this struggle stronger. In fact, regardless of the result, both will be weakened and both will be prone to decline on the world stage. And everyone in the West and everyone in the East is happy about it, but nobody will say it out loud. They will just provide Russia with money and Ukraine with weapons, so they can both fight, and fight, and fight…

And just before the end of this idiotic blog entry: should Ukraine be a separate independent country or should it be a part of Russia? Historically speaking, both were, for a short time, a part of Poland. For just a few months, but they were. So I think, if we start to talk about which country should exist and which should not, both should return to their proper master. Bow before J.K., the great ruler of PiS!

End Of Click-Bait
}

RtOS – Stacks are not funny

Up to now I have been writing about how good the RtOS is. Now it is time to mention what kind of problems it creates.

The first kind of problem is “stacks”.

Task stack

As You might have read there, each task requires its own control stack, that is, a place where all local variables and return addresses are stored. A task switch, as also mentioned there, basically consists of saving some CPU state on that stack, switching to the stack of the next task and then loading the CPU state from that new stack.

This means, obviously, that proper stack management is critical.

Memory layout

Let us now take a look at what the memory layout looks like when there is no RtOS.

As You can see the entire available RAM space is divided into four zones:

  1. Static variables, pre-allocated at compile time. In this zone also resides the task table which is used to manage the RtOS;
  2. Dynamic variables heap space, if present. Usually it will not be there in an embedded environment, as it adds a hell of a lot of unpredictability to a system. But if it is, it should grow in the opposite direction to the control stack;
  3. Un-allocated free RAM space, which is not used at the moment;
  4. Currently used control stack space, growing into the free space.

This type of layout does not need much of our concern. Assuming there is no dynamic variables heap (99% of cases) the control stack grows and consumes the free space up to the pre-allocated static variables.

Doom of stack overlap

If the stack grows too much it will start overlapping the static variables zone. This is a deadly condition. If it happens, Your program will start acting in a totally unpredictable way. Either static variables will overwrite local variables or local variables will overwrite static variables. It may even happen that an update of a static variable will overwrite a return address stored on the stack, and Your program will just randomly jump to some silly place:

....
  subroutine X()
      mov 100, static variable
      return
  end subroutine X

And if we just call X in a stack overflow condition we may get:

As You can see, in this case the subroutine has happily overwritten its own return address, and the CPU went haywire, jumping to address 100 instead of returning to where it was called from.

Of course if it happens in a single threaded program it simply means that You do not have enough RAM. Nothing can be done except rewriting the program to consume less memory.

RtOS memory layout

With RtOS the situation gets more complicated. Since we have more than one task it now looks more like:

As You can see we now have three stacks (A, B and C) and three separate “free spaces”. This means that there is now a much greater chance that, if we selected the stack locations incorrectly, one stack may run over another.

Dancing on the stack

In the case of an RtOS a stack overflow can generate results which are much harder to detect. For example, like this:

In this simplified example Task A called a subroutine X which simply used some stack space for local variables and a return address. Unfortunately, at the moment of calling X the stack of task A was already full, and the local variables and return address overwrote the stack of task B.

This is an obvious error, but let us think for a moment: when will we detect it?

When subroutine X returns?

No.

When we switch to task B?

Not right away.

We will detect it when we switch to task B and task B rolls back its stack, that is, when it executes return enough times that the Stack Pointer moves into the problematic zone.

This may be hours and hours after task A corrupted the stack of task B.

Summary

In this very, very short blog entry I wished to show You that an RtOS requires great care from You in determining the proper locations (which in fact means: sizes) of the stacks for all Your tasks. Do it wrong, and You will be doomed.

How to do it correctly?

Well….

You need to ask Your compiler about the so-called “call stack” and tell it to compute the worst case stack consumption for each of Your tasks, interrupts included.

If Your compiler has no support for “call stack” computation then it is a piece of crap and You will have to do it by hand or… just guess. Which is a stupid idea.
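
For example GCC (and thus avr-gcc and friends) has the -fstack-usage switch, which writes a per-function stack usage report into a .su file. A hypothetical session (the file names and numbers are invented, and the exact format may differ between versions):

gcc -c -fstack-usage tasks.c
cat tasks.su
  tasks.c:12:6:task_a    48     static
  tasks.c:40:6:task_b    112    dynamic,bounded

Note that this report is per function; You (or a script) still have to add the numbers up along the worst-case call graph, interrupts included, to get the per-task figure.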

In the next blog entry I will try to show You how to harden Your RtOS against stack overflows.

Does quantum physics prove that the world is just a simulation?

All right, all right… Sounds a bit religious, doesn’t it?

Well…

Did You ever write any simulation software? If You did, You know that the whole idea of simulation revolves around a state vector and differential equations.

For example, if we take a single particle moving in space, we will define its primary state as:

       x
 V = [ y ]
       z

where x,y,z are coordinates in space and V represents the state vector.

This state vector is obviously not complete. It just says where the particle is, but not how fast it goes. So let us extend it a bit:

       x
       y
       z
 V = [ x' ] = [ P  ]
       y'     [ P' ]
       z'

where

        dx
  x' = ----- 
        dt

that is, the x' symbol denotes the first order derivative of the coordinate x over time, which is a very complex way of saying: “speed”.

And by the way, P is the position vector made of the x, y, z components.

Is this state vector complete?

Of course not. We say where the particle is and how fast it goes, but we are not saying whether it is accelerating or not. We need more derivatives.

In theory we could push this ad infinitum, adding second, third and so on derivatives, but in practice we just bind them in the form of additional state variables and differential equations.

For example we say that:

   P''  = F / m
   P''  = dP'/dt = d²P/dt²

which means: “acceleration equals force divided by mass” and “acceleration is the first derivative of speed over time” or “the second derivative of position over time“.

Solving differential equations

The method of solving differential equations is quite simple. The entire idea is to use a schema like the one below:

   
   P''(t) = F(t) / m
   P'(t)  = P'(t-Δt) + P''(t)*Δt
   P(t)   = P(t-Δt)  + P'(t)*Δt

Mathematically speaking, what we do is an integration over time. In this example I used the simplest method, “square integration”, which assumes that all parameters are fixed during the integration time step Δt.

Non-mathematically speaking: we take the force and compute the acceleration. Knowing the acceleration, the previous speed and the integration time step Δt, we calculate the new velocity. And knowing the velocity, the previous position and the integration time step Δt, we calculate the new position.
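
In code (a sketch in C, with names of my own choosing) one such integration step is just:

typedef struct { double x, v; } particle_t;  /* 1D position and speed */

/* One "square integration" (explicit Euler) step of length dt. */
static void euler_step(particle_t *p, double force, double mass, double dt)
{
    double a = force / mass;  /* P''(t) = F(t) / m             */
    p->v += a * dt;           /* P'(t)  = P'(t-Δt) + P''(t)*Δt */
    p->x += p->v * dt;        /* P(t)   = P(t-Δt)  + P'(t)*Δt  */
}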

Integration time Δt

And this is the source of all problems. If the integration time step Δt is small we have accurate calculations, but they require a lot of computing power. If the time step Δt is too large we will observe bizarre effects.

For example, “tunneling”.

Bizarre simulated tunneling

Now imagine the above-mentioned particle moving at a constant speed right into a wall:

As You can see it just hit the wall, as expected. To be exact, it not just hit the wall: at a certain time moment it existed inside the wall. Which is the definition of hitting it, right?

But what would happen if we made the integration time step Δt significantly larger?

It just passed through the wall as if the wall did not exist. To be precise, at a certain time the particle existed in front of the wall and in the next time quantum it existed behind the wall, but never inside of it.

Why? Because the integration time step Δt was too large.

You simply simulated it wrong!

Of course I did.

In the most elementary method of fixing it we would just assume that there is a certain distance quantum Δx, that is, a minimum wall thickness, and we would automatically adjust Δt to be smaller if any of the simulated particles moved fast enough to pass more than Δx during the Δt time.

Notice, I specifically underlined the word any. If any particle moves more than Δx we need to roll back the time Δt, guess a next Δt2 < Δt and redo the simulation step. Messy, but this is what most simulation software must do, and this is where it gets stuck with the “time quanta too small” error message.
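
In a sketch (all the helper names here – save_state, integrate_all, max_displacement, restore_state, fail – are placeholders of mine) this guard could look like this:

#define DX     0.001   /* the distance quantum: minimum wall thickness */
#define DT_MIN 1e-12   /* below this we give up                        */

/* Retries the simulation step with a halved Δt until no particle
   moved more than DX; returns the Δt actually used. */
static double adaptive_step(particle_t *p, int n, double dt)
{
    for (;;)
    {
        save_state(p, n);                  /* so we can roll back the time */
        integrate_all(p, n, dt);           /* move every particle by dt    */
        if (max_displacement(p, n) <= DX)
            return dt;                     /* step accepted                */
        restore_state(p, n);               /* roll back ...                */
        dt *= 0.5;                         /* ... and guess a smaller Δt2  */
        if (dt < DT_MIN)
            fail("time quanta too small"); /* the dreaded message          */
    }
}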

Alternatively we could use travel path collision detection to check if the particle hit the wall:

In this approach the particle “exists” along the entire path it travels during the simulation step. This is a correct solution, but imagine how much complexity it adds to the computations!

In the first approach we just needed to check if the center of the particle was inside the wall, or not farther than half the particle diameter outside it. Now we have to check if a 3D cylinder with round caps, described by the path of the particle, collides with the wall. It is at least two orders of magnitude more complex.

And “tunneling” does exist in the real world

That’s right. That silly simulation effect really does exist. We use it every day in the “tunneling diode” in our electronic equipment and, on a less daily basis, in the “tunneling microscope”.

Of course this is not a simulation effect. This is due to Schrödinger’s wave theory. This theory basically says that there is no exact, precise definition of “existence”. In fact everything that exists is described by certain “waves of probability”. Those waves are sine-like equations which describe how probable it is that a certain reaction will appear at a certain place at a certain time. And since the sole definition of existence is “to be able to interact”, they describe whether the particle exists there or not.

In certain conditions, usually related to large energies and small distances, those equations have close-to-zero values in certain locations and non-zero values in others. Like zero in the wall, non-zero in front of it and non-zero behind it.

Heisenberg uncertainty

The next similar effect is the Heisenberg uncertainty principle.

It basically says that if something moves at an exact velocity it may be anywhere, and if something is in an exact place it may move at any speed.

Of course the “it is” should be taken in huge double quotes.

The “is there” means it is required to be there for the reaction to take place, and “moves at an exact velocity” means that it is required to have an exact energy for the reaction to take place.

For example, if chlorophyll requires a photon to be exactly “green”, then it doesn’t really matter whether the photon hits the chlorophyll precisely or not. The required accuracy of the perfect energy makes the chlorophyll particle virtually bigger. Virtually, because only from the point of view of the incoming photon.

What if the world were a simulation?

…and God had a crappy low-end CPU to run it on.

What then?

What would we do if we needed to run some model and were really, really constrained on computing resources?

We would optimize it. Simplify it. But we would always be bound by the physics of what we simulate.

What if the world is just a “game of life”?

But what if we were writing not a physical simulation but just a “game of life”?

Note: The “game of life” was a simple program, a so-called 2D “cellular automaton”, which appeared to behave like a living colony of bacteria. It was made to illustrate how simple rules may create remarkably complex behaviors.

With a “game of life” type of simulation we are not bound by the physics of the simulated world. We define it to our liking.

So if we had to create such a program, and we had significant constraints on the processing power, we would…

Simplification of physics

….agree to Δt tunneling.

In fact what is wrong with it?

The next problem which costs us a lot in terms of computation power is collision detection. The Δt tunneling allows us to use the simplest possible algorithms, but we still have to detect intersections of complex shapes, and in many cases it would produce unacceptable artifacts.

But if we introduce a kind of Heisenberg principle we may easily escape those effects. A fast particle becomes larger, a slow one smaller. That may be all it takes, and it may solve a hell of a lot of simulation inconsistencies without requiring any additional computing power.

Summary

I think it is enough of this techno-religious mumble.

We could also introduce Schrödinger’s cat into the equation, which could be demonstrated to be just a by-product of “lazy solving on demand”. It is used in simulations of well isolated clusters, or in the lives of Non-Player-Characters in games – neither needs to be computed at all until it is needed. When they are needed, they are computed “on demand” together with their entire history.

Like Schrödinger’s cat, NPCs are both dead and alive until they meet The Player.

We could introduce many of such things.

I honestly think that if our world were just a simulation, then quantum physics would be an effect of optimizing the model to conserve computing power.

What do You think?

Knowledge management system: managed versus chaotic

Today, when You think about “knowledge management”, the first thing that comes to mind is: “Wikipedia”. Good.

So try running it.

Not so good? Well… this is because a Wiki is a case of a “managed knowledge management system”.

So before I dig into knowledge management systems, let us talk a bit about the different aspects of producing knowledge, finding it, using it and updating it.

If it can’t be found it is not there

…even if it is there.

Imagine a huge library, let’s say 100’000 books. And imagine some idiot like me put a book on the wrong shelf. Can You find it? If I had actually burned this book in my fireplace the effect wouldn’t be much different. Nobody can find it, nobody can read it.

So finding a book is an essential functionality of a library. The same is true of finding knowledge in a knowledge management system.

Search strategies

When we look for knowledge we can use the following efficient search strategies:

  • ask a specialist;
  • browse through category tree looking for something similar;
  • run a full text search to find the parts of the system which touch the subject, and then manually select the elements which may carry the information we are looking for.

Asking

Asking a specialist is quickest, but it is expensive and risky. Expensive, because the company must have a specialist. And such a specialist becomes a critical element, one which can be removed from the company by a banana skin on a sidewalk. So this is a no-go path. We can’t rely only on specialists.

Category tree

This is a good way if You just need to learn. But not a very efficient one if You are looking for an answer to an exact question.

First, not every category is correctly assigned. Blah! If Your KMS (Knowledge Management System) manager is a dumb idiot, there may even be no description of what a category is. Because it is “self-explanatory”.

The longer a document is, the harder it is to assign it to the right category. Each person will probably assign a different category to it.

A category assigned to a document is always more or less incorrect.

A document may appear in many categories though, which is good.

Tag system

A tag system is a simplified form of a category tree. If You look at the right side of this blog entry (if I have not played with the layout again, though) You will see how messy a tag system becomes. And surely there is no tag “stupid text”, which many of You would have assigned to this blog entry.

There is always a missing tag for a document.

Full text search

This method overflows You with useless crap. Especially if some community has very peculiar ways of naming things. For example Ant and Rhino. Both are animals and both are programs. Guess how easy it is to tell them apart in a full text search.

It is also enough to phrase the question differently to not be able to find anything at all.

Full text search is always either too narrow or too imprecise.

Capturing knowledge

The production of knowledge can be intentional or not. Yes I know, it sounds stupid.

Your company does have some internal manuals, right?

This is “intentional” production. You paid for making it, either in money or in work-force.

But You will also probably find a full set of notebooks, yellow stickers, drawings crossed over with red lines on the production floor, etc., etc. This knowledge was produced, but not with the intent of producing it. It is in fact a by-product of regular activity.

There is always official knowledge and common sense. Both are important to capture and preserve.

Updating knowledge

Nothing is given forever. Every manual will sooner or later be outdated and will need fixing.

The process of fixing documents requires, what a surprise, actually being able to find the documents which require fixing.

For example, if a certain chemical substance, e.g. “extraction naphtha”, needs to be removed from the production processes, then You have to find all manuals which say “and wash it with naphtha”. If You can’t do that, You can’t reach Your goal.

Next, after updating all the documents, You need to ask Yourself a question: “Do I need to keep the old revision or may I trash it?” Surely it should not be available on the production floor as the “current” revision, but it may be useful to be able to find it in an “archive”.

And, last but not least, who the hell replaced naphtha with acetone in the procedure for cleaning ABS plastic!?

You need to know what to update, who updated it and what exactly was updated.

Issue system

The update process is inherently related to “problems”. You read a document. You find a problem with it or have any doubts about it. How do You report it?

Or, to be more precise – how do You report it without putting too much effort into it? Remember, the document You have problems with probably passed some quality checks, so the higher-ups think it is OK. And You, a blue collar on the production floor, dare to think it is wrong!?

The issue system must break this way of thinking. It must be easy to ask the question: “Either I don’t understand it or this is wrong“. Without involving any paperwork.

Or nobody will be reporting any problems at all.

If there are no reported issues, it only means that the issue reporting system is too hard to use.

Weaknesses of Wiki

If Your company has plenty of manuals, and if You do not struggle with missing manuals, out-of-date manuals and simply wrong manuals, and there are no notebooks, yellow stickers and crossed-over drawings on the production floor, then You can be happy with any Wiki-like knowledge management system.

All because You already have dedicated employees who manage the knowledge in Your company. In fact, You already have a working knowledge management system, just on paper.

You have organized the production of knowledge, the categorization system, the access system and the change control system. All of this can be contained in a Wiki.

It even has a full text search.

So why do I dare to say it has weaknesses?

Because it can’t capture the chaos.

Limited resources

An organized system like a Wiki requires special methods of entering knowledge into the system and dedicated personnel to do it.

Don’t be cheated by Your CEO saying that “this and that department can do it in their spare time”.

No!

Forget it. There is usually no spare time, because if there is spare time then Your CEO is not doing their job correctly. And there is especially no spare time to waste on something which is absolutely useless.

A library in which You can’t find a book is useless. A library in which there are no books is equally useless: nobody will use it. The same way, nobody will use a knowledge management system which is empty.

And it will be empty at the beginning.

So You have an empty knowledge management system and some very, very – did I repeat “very”? – very scarce spare time which You may sacrifice to feed it.

What will be the motivation for filling it?

None. It is useless now. Filling it with anything takes time and effort, but the benefit for the person who feeds the system is nil. Because that person already knows what he/she put in it. From that person’s point of view it is lost time.

Nobody will do anything, and everyone will look for knowledge elsewhere. As a result Your knowledge will keep leaking around, not captured and not preserved against loss.

Except if Your CEO creates a “knowledge management department”. Ding… ding… ding… do I hear some coins falling from the sky?

Chaos to the rescue!

Now imagine You have a CEO who simply won’t give any money to Your knowledge management system. And You still need such a system like hell!

The first thing You need to do is provide a system with the ability to capture knowledge. Since Your CEO is an idiot, You probably have a minimal set of manuals, and plenty of them are outdated. And You have tons and tons of by-product knowledge spread around the company.

You need to capture it… but without anyone actually doing it.

Arghh…. Impossible!

Impossible?

What can capture by-product knowledge?

Have You ever struggled with finding Your own notes?

If yes, You would really appreciate a system which could run a “full text search” over Your workplace.

Now imagine You run at Your company a file server which is available as a Windows share (like a regular disk) and through a web interface, so employees can access it from a smartphone. And, for example, put there a photo of what they have done, a voice note or a video, just to remember how to do something. Guys on the production floor are most probably doing this already, but they keep the files on their own phones.

Imagine You have given them the ability to put anything they like there. No limits on size or content type. Sounds, videos, text. Whatever they like.

Then warn them that anyone in the company can access their files. If they still put photos of their own genitalia there… well… Allowing that kind of abuse is the price You have to pay… for not paying the price of a “knowledge management department”.

And give them the power of a Google-like search.

Initially it won’t be very good. But if Your system is more efficient and easier to use than their own smartphone or computer, then they will probably prefer to put their notes into that system. Simply because that will make the notes easier to find.

Sooner or later, if You look at the system, You will find that some people have put documents, photos and videos in folders with meaningful names. Some have copied Your official manuals so that they are “at hand”.

What have they done?

They have produced knowledge and put it into the system. Even more, simply by copying some files into their own folders they have assigned a category to them.

Searching chaotic knowledge

This is a critical aspect.

If there is no efficient search, nobody will feed the system. The system must help people find their own files and allow them to look at how others have solved similar problems. Only then will they be willing to put data into the system.

What kind of search do we need?

Obviously it is a full text search, but we should also be able to browse through our own and others’ files. The latter looks a bit fishy, but it is absolutely necessary, because without it the knowledge will not propagate through the company.

The full text search must also utilize the “categories” given by users, that is, their folders. So You must be able to include the file name or file path in the text search, like: “find all documents containing xxx in folders containing yyy”.

And obviously, the file formats to search should be absolutely anything. Apache Tika or Xapian come to mind. Those systems can chew through almost everything, and their nature is such that they can be extended to understand even more. Of course, this time money will be involved, and some NDA or licensing may be necessary if proprietary files are to be inspected.

Updating chaotic knowledge

This is the obvious downside. If it is chaotic, we can’t have control over it.

Yes and no.

We can use a file system which can log operations and perform a “copy on write” each time a file is changed, moved or deleted. Surely it will consume a hell of a lot of space, unless we run XFS with background de-duplication.

Git?

If Your file server runs on Linux, You can use a user space file system to implement a file system over GIT. One repository per user. Each file change is a commit. The file system shows just the HEAD of the “master” branch, but through GIT You can see the history.

This is the reverse of what I have seen on the market. Most approaches tried to show GIT inside a file system, but we can also do it in the opposite direction.

GIT almost always comes with nice web interfaces. I personally love GitBlit. I will most probably add to it a Tika-based full-text search and a file-diff sooner or later, and fix some methods of browsing the repository tree to look more like disk browsing.

If such a GIT system could be set up together with a file system reflecting the HEAD of the “master” branch in each repository, then You would get at no cost:

  • making changes by just copying files to folders, editing, saving and so on;
  • full text search through any files;
  • efficient compression;
  • history retention;
  • discussion forums which can be used as an issue handling system, without running a hard-to-use proprietary “request for change” system.

Summary

I suppose that after reading this blog entry You should have grasped the difference between managed and chaotic knowledge management systems. You should see the basic disadvantages and costs of each.

You might also see that I am a great fan of using GIT as a knowledge management system.

Think about it. Try it. And maybe You will be willing to pay me some money for turning GitBlit into a full blown chaotic knowledge management system.

Knowledge management systems: what are they managing anyway?

A KMS, or Knowledge Management System, is a key to success in any organization.

Each and every organization always struggles with the following limitations:

  • lack of vision;
  • lack of money;
  • lack of human resources;
  • lack of education and training.

Vision, money and human resources are obvious: You have to know what You would like to achieve, You have to have money to invest in the journey to that goal, and You need somebody to help You.

However, regardless of how many people You collect, they will be pretty useless if they do not know what to do and how to do it.

Knowledge

The company I work for was recently utterly cheated by a certain software supplier. That supplier was providing the company with a “one tool for everything” total company management system. So we kindly asked that supplier whether his system has some kind of knowledge management.

And he said “Yes, of course”.

Well…. It doesn’t have even a tiny bit of it.

Why?

Because of a definition.

Knowledge is not information

There is an important difference between knowledge and information.

Let us explain it with an example.

Assume we have a certain employee. His name is John Bruderson. He is paid some money monthly.

So we ask the question: “How much do we pay John Bruderson?”. And we get the answer: “1500 Euro”.

This is information. The above-mentioned total company management system can provide us with such an answer. It collects information and provides us with information.

Sadly, we could not ask that system this question.

Why?

Because we lacked the knowledge of how to do it. And this system, which is by the way priced at the level of my lifetime income, does not even have a decent manual. Not to mention that said system is translated into Polish in a way which is confusing at best, and the crappy manual supplied with it covers the English version and is well hidden from Polish users.

But let me get back to the subject.

So we have information and knowledge. Information is quantitative and answers questions like: “what…?”, “how much…?”, “when…?” and the like. Knowledge answers questions like: “how to…?”.

In our example:

Information:
 "How much do we pay John Bruderson?" 
Knowledge:
 "How to check how much do we pay John Bruderson?" 

So what is a Knowledge Management System?

It is a system which helps You collect those “how to do…” questions and answers.

This is what it does.

But why is it necessary?

People are mortals

Trite but true. People die. People leave Your organization. People get brain strokes. Or they get married and move away.

Regardless of what happens: if You did not collect their knowledge and experience, that knowledge and experience is lost forever when they go away.

At many companies there are odd jobs which are too small to hire more than one person for, yet, surprisingly, critical to the company’s existence. A single doctor’s appointment of one person can stop production. Seen it, been there, from both sides.

If You are fine with such a risk, knowledge management is something You can ignore. If You are happy with the fact that a banana skin on a sidewalk can bankrupt You… who am I to judge Your perversions?

All others should invest at least some thought into the problem of knowledge management.

Knowledge management methods

There are basically two methods to keep knowledge from getting lost forever:

  • by passing it from person to person through training and a master-apprentice relationship;
  • by collecting it in a form which can be read, listened to or seen without the master being present.

The first method is the oldest one and, I think, the best. It has its limitations, however. You need to have a master, and that master is a critical element. Each master has a limited capacity to train apprentices, both in number and in quality. Some people just can’t teach.

Training is long and costly, so You invest a lot in an apprentice… who then moves to another company for better pay. And surely he will get better pay, because he is now a trained employee and that other company does not have to recoup any investment in training.

The second method, that is, collecting knowledge in a form which can be accessed without the involvement of a master, saves a lot of money. Any person can read, see or listen to the collected documents, lectures, procedures or tutorials. Even some notes. And only when that person has finished familiarizing themselves with it all, do they ask the master for an explanation of what was unclear. This way the master does not have to be a good teacher, and each student can learn using their own preferred method and speed.

This is what the Knowledge Management System is used for.

Summary

After reading this unusually short blog entry You should be aware that information is not knowledge, and that Your total company management system, which collects every piece of information, has absolutely nothing to do with knowledge management.

You should also understand why maintaining and preserving knowledge is important to the future of Your organization.

In later blog entries I will try to highlight the pitfalls of Knowledge Management Systems and why they sometimes may bring harm instead of benefit.

Provided anyone reads it. Drop some comments maybe? People, not bots?

Cheers!

Simulation in practical engineering

Today I will write about the use of simulation software in the practical process of designing products, processes and algorithms.

Simulation is great!

Yes, simulation is a great tool. Really, really great, and practically one of the very few aspects of computer use which directly brings a profit.

This is because simulation is cheap.

Simulation is bad

This is also true.

Simulation is just digital modeling. It can’t give You an answer to the question: “how to do it?”. It can just answer the question: “if I do it, what will be the outcome?”. Simulation does not give You rules which You can directly follow.

Even worse, simulation is prone to numeric instability.

Anyone who has used SPICE to simulate circuits has seen spikes of current and voltage exceeding what is physically possible by a few orders of magnitude. And everyone has seen the dreaded Time step too small message.

So You can’t fully trust Your simulation.

Math is better

A classic algebraic expression is the best possible way of modeling things.

This is because it can be solved against any variable and condition, and can be used to answer not only the question “what will be the outcome?” but also: “how to do it?”.

This is great. If You can have an algebraic model of Your problem, go for it.

Math is useless!

Again this is also true.

In my practice I have found that You either get in touch with math so simple that a school kid can solve it without sticking its tongue out too much, or so complex that solving it symbolically is almost impossible.

Or the result You get is a problem to compute in itself.

The first time it hit me was when I was trying to lower the frequency of the computations needed for tracking the location of a simple cart based on the readings of its wheel encoders. Normally You do it by reacting to each tick of an encoder and computing the small motion it resulted in. If done properly it does not accumulate error, and You are just incrementing some fixed-point counters by pre-computed values. The price is that You have to do it on every “tick” of an encoder, which may mean tens of thousands of ticks per second.

My idea then was to just accumulate the ticks of both wheels in hardware over some period, assume a linear acceleration model, and solve the motion equation not on every tick, but every 5ms. It did not look that complex, right?

Well…

I failed to solve it on paper, so I put it into a stolen copy of Mathematica (remember, it was the 1990s, so in Poland it was legal) and… I failed to understand the answer. The solved equation contained one of those few “non-algebraic symbols”, like the gamma function, which have to be computed numerically.

So in this case math gave a solution, but the solution was useless.

I have been confronted with similar problems many, many times since then.

Math is a good approximation

So what is the solution to it?

You must do what the physicists who wrote Your physics books have done: simplify the model.

If You look carefully at Your physics book You will find that almost every problem which is solved there is either idealized or simplified. “Assuming that…”, “Having two infinitely long…” and so on, and so on.

Most frequently, for the math to give useful symbolic results, the model must be strongly simplified.

In practice there is not much use for such extremely simplified models, except…

Simplified models do train Your intuition

Even though a simplified model can’t give You an answer to Your current design problem, it still can help You understand the rules which drive it. Seeing those rules, You may move a great step forward.

For example, if You measure the frequency of randomly incoming pulses, You would normally count all of them over a certain period of time. But if You check Your math for this specific kind of problem, You will notice that a certain equation represents the probability density of a pulse being observed during a certain period of observation. If You read it carefully You will notice that not only do You not have to count all the pulses, but You also don’t have to observe the process continuously!

Without understanding this very simplified math model You would not be able to figure it out. With that understanding You may select a path which is an order of magnitude less expensive in hardware and computing power.
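
Just as an illustration of that cheaper path, a sketch in C. For a Poisson-like pulse stream the average count in a short gate is proportional to the rate, so sampling a handful of short gates is enough; every name and constant below is an assumption of mine:

 /* gate_count() is a hypothetical function: it opens one short hardware
    gate of GATE_S seconds at an arbitrary moment and returns the number
    of pulses seen during it */
 #define GATE_S  0.001   /* 1 ms gate, assumed               */
 #define N_GATES 200     /* number of sampled gates, assumed */

 extern unsigned gate_count(void);

 double estimate_rate_hz(void)
 {
     unsigned long total = 0;
     for (int i = 0; i < N_GATES; i++)
         total += gate_count();   /* total observation time: N_GATES * GATE_S,
                                     a tiny fraction of the wall-clock time  */
     return (double)total / (N_GATES * GATE_S);
 }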

Simulation is hellishly accurate

Since a solvable algebraic model is too simple, we need to go for a model which is complex but cannot be solved using symbolic operations.

We need numeric modeling.

Two paths of numeric modeling

There are two paths of numeric modeling.

One just takes devilishly complex math (usually differential equations) and applies numeric methods to solve it.

The second path takes a large number of simplified models and combines them into a complex conglomerate.

Both can give excellent results.

Shit on input, shit on…

And both are prone to the good old saying: "shit on input, shit on output".

One of my colleagues was struggling with designing an optical path for a certain application. So after a brief period of trial and error he gave a try to software dedicated to simulating optics.

And guess what?

He failed.

Not enough data

Why did he fail?

Because the simulation model was too accurate. You know that light can reflect, bend, be attenuated, scatter in different ways, even split into different colors. The light source may have certain angular characteristics, emit certain wavelengths with different intensities, and so on. Plenty of those effects are non-linear, so many coefficients are necessary.

And the only information he had was: "this is the LED you have as a light source" and "this is the polycarbonate You should use to make the lens".

Feeding the program with only such vague data required tons of guesswork and did not produce any usable results.

It is too precise!

And now we come to the point:

Math is oversimplified, while a numeric solution is only as accurate as the data You put into it.

So in practice many engineers decide not to use simulation at all. If it can't be trusted and if it can't give precise results, then what is the use of it? They have to make experiments.

Experiments and money

Experiments are expensive.

A certain problem I had to solve, or to be precise, a thing I had to design to meet required characteristics, had such a nature that a single experimental step required about 20 work-hours and involved people on the production floor, in the lab, and an experimenter. Since each person also had other things to do, this 20-work-hour load resulted in my personnel performing one experimental step every two weeks.

Since I am an experienced engineer, I was predicting that I would need about ten steps to get an acceptable result. That means twenty weeks. Five months. That is, if there are no special problems.

Simulation and money

Simulation on the other hand involves very different resources:

  1. Input data;
  2. Computing power;
  3. A digital experimenter person.

If You take care that the experimenter can adjust the models he/she simulates without involving other people (i.e. You teach that person to use the CAD software), then the only problems left are computing power and input data.

Computing power is easy. In my experience I have never encountered a problem which could not be simulated on a decent desktop PC within ~20 hours.

What does it mean?

Not only does digital simulation require fewer resources, which cuts the "work-station synchronization" delays, but it also has a much faster turnaround. At worst You will be able to run one experiment each work day, which means that the planned ten runs will take two weeks instead of five months…

If not for the "shit on input"…

Compensating Your simulation

So the only problem which stops us is the lack of accurate input data.

We need to deal with it in stages.

First we need to prevent the simulation from requiring input data we don't have.

So we need to disable simulation of all non-primary effects. In the example with the LED lens, You don't need volumetric dispersion and color splitting.

Then put some constraints on Your design. This time they come not from the technical realm (i.e. "You can't produce that") but from the computational one. For example, it is easier to polish the surface of the test lens than to give it a known and controlled scattering coefficient.

The third step is to design both the numeric and the physical experiment in the same way. You must be able to run the physical experiment exactly as the digital experiment was run.

Now You are almost ready to excavate Your input data.

Excavating input data

Run both experiments, the digital and the real one. Compare the results.

Expected result is: “Simulation gave a shit.”

This is a good result. Let us take a look at it:

The green curve represents the observed light intensity (the directional emission characteristics of our LED+lens set) in the real world experiment. The blue line is the same characteristics, in the same experiment, but simulated. The red dotted line is our goal.

The blue and green curves differ so much that if we had tuned the digital experiment to reach the goal, the real result would be faulty. We need to bring the simulation into agreement with the real life experiment.

You can correct it in two ways:

  • by playing with input data;
  • by applying correcting characteristics.

Roll-back experiment

Playing with input data is in fact a "roll-back" experiment. You run both the physical and the digital experiment, and You deduce the input data from the difference.

In the LED + lens case the obvious choice is to play with the emission characteristics of the LED.

If You would try to sketch some math for it, it might look like this:

 
 Brightness_at(alpha) =
     Emission_at(alpha) * simulation_magic_at(alpha)

This is approximately true if there is no lens. If there were a lens, it would be more like:

 
 Brightness_at(alpha) =
     sum(i = -a … +a) {
         Emission_at(alpha + i) * simulation_magic_at(alpha + i)
     }

which is much harder to "roll back", requires a trial-and-error approach, and still may give You input data which are correct just for this setup and not for all setups.

This is why it is very important to understand this and to design the "roll-back" experiment correctly. For the LED experiment, it would be best to not use any lens at all.

When rolling back is impossible

In many cases rolling back is impossible. This may be due to:

  • an inability to create an experimental setup which isolates Your input characteristics well enough;
  • an inability to measure, isolate or control effects which are simulated and do influence the result;
  • missing some significant effects in the simulation.

In such case You can’t roll back to all necessary input data. But don’t worry, it won’t be such a big problem at all.

All You should do is apply the "correction" in the simplest possible model:

 
 Brightness_at(alpha) =
     Emission_at(alpha) * simulation_magic_at(alpha)
 Digital_brightness_at(alpha) =
     Correction_at(alpha) * Brightness_at(alpha)
 Digital_brightness_at(alpha) == Experimental_brightness_at(alpha)

and find Correction_at(alpha).
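
In the simplest case this boils down to a per-angle division. A minimal C sketch, under the assumption that both curves are sampled at the same set of angles (all the names are mine):

 /* finding Correction_at(alpha) point by point */
 #define N_ANGLES 181

 double experimental_brightness[N_ANGLES];  /* measured in the physical experiment */
 double digital_brightness[N_ANGLES];       /* produced by the simulation          */
 double correction[N_ANGLES];

 void find_correction(void)
 {
     for (int i = 0; i < N_ANGLES; i++)
         correction[i] = experimental_brightness[i] / digital_brightness[i];
 }

 /* from now on every simulated curve gets multiplied by correction[]
    before being compared with the goal */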

Using corrected simulation

Regardless of whether You "rolled back" Your input or applied a "correction" to the simulation results, You now have a simulation which agrees with reality in this specific experiment.

Of course, applying the same simulation to another experimental setup is risky.

I dare to say that "correction" is much more risky than "rolling back", because rolled back data are put on the input of the simulation, thus are fully processed, while a correction just blindly moves the output. I would always recommend aiming for the "roll-back". On the other hand, "rolling back" requires a dedicated experiment, so it adds to the cost. It is Your choice.

Since we are not scientists but engineers, this should be enough for us: regardless of whether the results are fully correct or not, the changes in the results of digital experiments should reflect what the change would be in real life.

Ten steps forward, one step back

All right, so You do have Your inaccurate, compensated simulation. What to do with it?

Use it. Play with it. Make Your intuition work.

However, once You reach the stage at which You start thinking: "I am going the right way, I just need to tune it", please do stop.

Remember, Your simulation is not fully correct. The input data You "rolled back" or the correction coefficients may not be applicable in the changed conditions. Do not try to get perfect simulated results! It would be just a waste of time and money.

This is the right place to run the next reference physical experiment.

Run it and compare results.

You will be again at the same spot: the experiment will show that the simulation is incorrect.

Don’t worry, You are closer to Your goal anyway. This just means, that You need to apply next correction. This time no rolling-back to input for sure! A regular correcting function should be enough. Once You find it all You need is to make some more digital experiments making just slight changes.

Repeat till success

And repeat.

Most probably You will run close to a hundred digital experiments and just three or four physical ones.

Summary

After reading this blog entry You should know how to efficiently use digital simulation in the design process, even in cases when You lack input data or precise math models.

You should notice that this approach allows You to run so many low-cost experiments that after running them You should have developed quite a good "intuition" about the problem.

You should also notice that running "reference experiments" lets You understand which effects are simulated and which are not, and how to include them in Your "intuition".

And last but not least, You should have saved tons of money.

Providing that You were not stupid enough to pay $100'000 for stupidly precise simulation software when You had no stupidly precise input data to feed it with.

What to use: Git or Autodesk Vault for Your proprietary format files?

Managing document versions is something which sooner or later will hit every company which has at least a tiny bit of touch with production.

Obviously, when Your company's primary business is software, You are probably already using a kind of Source Version Control System. It may be good old CVS, it may be Subversion, and it may be the nowadays most popular GIT. Or maybe something else? I don't know.

GIT + GitBlit

I do use this combo. I use the GIT command line for my software on my workstation. And I use GitBlit as a web-based server for my works.

GitBlit

GitBlit is really very good server software. It is lightweight, takes less than 15 minutes to get running, can run on Windows or Linux, and You can have as many instances of it running on one machine as You like without any fancy stuff like VMs or Docker. And it doesn't need any kind of external database, which is good for backing it up.

Setting it up and maintenance cost

It can easily handle two instances on a dual-core 1 GHz + 1 GB RAM PC and takes just a few tens of MB on disk. Configuration is done through very well commented text files, so it is not a problem. It is pure Java, so it is rock solid. Thank You, guys at OpenJDK, JGit, Tomcat and the entire open source community!

It is very easy to create a separate "production" server (for work) and a "testing" server (for training) and make them look different. Since GitBlit is self-contained, making "testing" a duplicate of "production" is a simple process of copying all files from the "production" folder to the "testing" folder and adding overrides for some configuration options.

Oh, by the way, the entire process of migrating the server from Windows to Linux was just… copying the folder with the server from one machine to another. All right, all right, I am not entirely honest with You here. I also spent a whole day on learning how to make systemd start, stop and back up those servers automatically when I push the "power" button on my tiny server. Mostly because I was still in the SystemV init era.
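
For the curious, such a unit file is small. Below is only a sketch with assumed paths, user name and start command, not the exact file I use:

 [Unit]
 Description=GitBlit production instance
 After=network.target

 [Service]
 Type=simple
 User=gitblit
 WorkingDirectory=/opt/gitblit-production
 ExecStart=/usr/bin/java -jar gitblit.jar --baseFolder /opt/gitblit-production/data
 Restart=on-failure

 [Install]
 WantedBy=multi-user.target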

GitBlit also (v1.8, I am not sure about 1.9) requires that You stop it from time to time and run git gc on every repository on the server. Since I do an incremental backup daily when the server powers up (I just push the power button and let systemd handle it), it was no big problem to add some scripts to do it. My server is a plain headless PC, so I power it up when I come to work and power it down when the last person leaves for home. This is a temporary solution, because this is a very beaten up PC and I don't trust it to keep running 24/7. And honestly… there is no need for it. We work neither at night nor remotely.

Backing up GitBlit is easy. I just stop the server and copy the server folder to a USB disk with standard Linux incremental backup tools.
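
The whole stop + git gc + backup procedure fits in a few lines of shell. Again only a sketch, with all paths being my assumptions:

 #!/bin/sh
 # stop the server, house-keep every repository, copy everything out
 systemctl stop gitblit
 for repo in /opt/gitblit-production/data/git/*.git ; do
     git -C "$repo" gc        # the periodic clean-up mentioned above
 done
 rsync -a --delete /opt/gitblit-production/ /mnt/usb/gitblit-backup/
 systemctl start gitblit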

It has Git LFS (Git Large File Storage), but without locks. After trying it out I have found that there is not much benefit from LFS except the locks, and that it puts a greater load on the server due to the lack of differential compression. Also, regular GIT is specified at the file format level, while LFS is specified at the protocol level. It means that if my server totally crashes, I can always copy non-LFS repositories to another machine and use them with any GIT implementation. With LFS this is not true, as the file format is server implementation specific.

So now I don’t use LFS at all, even for large binary files. There is not need for it.

Tuning it and fixing bugs

On the downside, the GitBlit source code quality is rather poor, but it doesn't stand out much from other open source projects. I was able to modify it to my needs without great hassle.

The source is however not fully self-contained, and the compilation environment practically can't be set up without an internet connection due to a nightmarish net of dependencies. Because of that I am worried about how long I will be able to maintain it if the community project dies, but after a bit of investigation I have found that nowadays almost every open source project is so interlaced with others that getting Your hands on a complete source code base is hardly possible.

One piece to another: if You are good with Java, You will be able to fix most problems with the code by Yourself. I fixed some problems with showing gigabyte-size commits with tens of thousands of files in one push, and added a better logging facility. I think the next thing I will add will be the ability to see per-user activity (I am a low level manager, so I would like to check who has been doing what and when) and full text search with Tika decoding. I miss this last functionality miserably. I really need to search through at least PDFs.

GitBlit user experience

From the user's point of view, GitBlit is a browser-based viewer for GIT repositories plus a system to manage a kind of web forum for each repository. This can be used to make "to-do" notes, make some requests and file bug reports.

It lacks many features, like the ability to make commits through the web, full text search across non-text files (it works with raw text files though), adding attachments to posts on forums and such like.

The user experience of browsing repositories is very much like browsing folders on Your PC. You just select a branch, commit or tag and click "tree". And here You go, You have Your repository as it was on that day. Like to have a file downloaded? Right-click on "raw". Like to have a full folder or repository snapshot on Your PC? Click "zip" and download it. No need to clone anything with git command line tools.

Since all page addresses are static, You may just bookmark a version You need and use it. You can e-mail it to others, but You may also restrict repository access in such a way that not everybody will be able to see it. You can't however set permissions on a per-branch basis, which makes it a bit tricky to use the server as a "distribution platform". Well… You just need to teach users to use tags to get to their release versions, or have a separate "public release" repository.

The tracking of who changed what in GIT is, in general, very easy to crack. It is possible to forge a commit signed with somebody else's name, because author identification in commits is declarative. You can force GIT to use PGP digital signatures, but sadly GitBlit does not support validation of digital signatures. But honestly… if one of the key designers in Your company decides to destroy You, then You are boned. He/she will do it regardless of anything You can think of. All You will be able to do is sue him/her later.

Oh, by the way, I could not convince GitBlit to reject push --force, which is a straight path to oblivion.
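
For reference, plain git would block forced pushes on the server side with the per-repository setting below; whether this can be made to work under GitBlit's JGit stack, I cannot say:

 # on the server, per bare repository (path is an assumption):
 git -C /srv/git/myrepo.git config receive.denyNonFastForwards true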

Command line + GitAhead

This is getting a bit trickier when You need to stop being just a passive user who can browse, download and post on forums, and must become a user who commits. Learning some client side tools is necessary. I have chosen the command line, but for those members of my team who are not programmers I have chosen GitAhead as the most entry-level and most user friendly solution I could find. This project is stale, but it works and is very, very easy to learn. You can't do everything with it, but at the entry level a GUI which can do everything is a source of problems, not solutions.


And what has Autodesk Vault to do with it?

GIT is for source code, right?

But what if You also have other files?

I personally create plenty of Inkscape files, HTML files, OpenOffice files, CAD files (in many programs), 3D printing files, photos and even sound files.

So I need an ability to keep them somewhere, track their history and share them in an easy way.

Commercial solution

Since the mechanical CAD I use at work is Autodesk Inventor Pro, the company I work for decided to provide all mechanical designers with some version control software.

The off-the-shelf solution is Autodesk Vault. Of course it is a Windows-only, GUI-only solution. No command line, so no scripts to make Your work easier.

When we decided to put our mechanical CAD files into Vault, we also thought: "why not put other files there too?"

Since I also do a programming job, the first question I asked the Autodesk reseller who was providing us with Vault was: "Can I keep my source code there?".

The answer was plain: “You will be sorry if You try it”.

And the reseller was right. Vault is not powerful enough to deal with source code.

What is Vault, technically?

Versioning

Technically it is like one huge CVS repository. It tracks the history of files, each one separately.

The history is linear; there is no such thing as branches.

Each time You would like to edit a file, You must "check it out" from the server, and You can check out only the latest version. Only one user may have a file checked out at a time, so there is per-file conflict protection.

Organizing it

You can organize files in folders, but the file history does not store information about how a file moved in the file tree. If in version 1.0 it was in folder V:\myjob1 and in version 2.0 You moved it to V:\finallydone, then when You ask Vault to provide You with the file in version 1.0 it will put it… in V:\finallydone.

Do I have to tell You what it does with paths in dependencies?

From a practical point of view it means that You can't make any cleanup in Your project, because any process of moving files will alter all versions, current and historical. Effectively it may even result in broken history: if You just exchange two files with the same name but in different folders, while Your third file references them by path, something like below may happen.

Version 1.0

file "A" references file "V:\boo\a.txt", 
    where a.txt contains text "marakuja" and has unique ID XXXX1
file "B" references file "V:\moo\a.txt", 
    where a.txt contains text "borakuja" and has unique ID XXXX2

With version 1.1 You changed texts to “marakuja+” and “borakuja+”.

Then in version 2.0 You realized that, by mistake, You named Your folders incorrectly. So You tell Vault, using the Vault GUI, to move file "boo\a.txt" to "moo\a.txt" and vice versa. Of course You also updated the paths in "A" and "B", so it now looks like:

file "A" references file "V:\moo\a.txt", 
    where a.txt contains text "borakuja+" and has unique ID XXXX2
file "B" references file "V:\boo\a.txt", 
   where a.txt contains text "marakuja+" and has unique ID XXXX1

Vault history tracking does not store the fact that You moved Your files, even if You do this move through the Vault GUI. If You do it inside Your work folder, it won't even try to detect that You moved files, which in that case is the correct behavior. The Vault GUI will just update the database to say that the file with unique ID such-and-such is in such-and-such folder. It won't save it in the history of that file.

So if next time You look at the history for version 1.0, You will see:
Version 1.0

file "A" references file "V:\boo\a.txt", 
    where a.txt contains text "borakuja" and has unique ID XXXX2
file "B" references file "V:\moo\a.txt",
    where a.txt contains text "marakuja" and has unique ID XXXX1

which is simply not right.

The same rule applies to deleting files. If You delete a file with unique ID XXXX2, it is deleted from the database. With all its history. You just can't delete a file which is not needed in Version 3.0, because it is needed in Version 1.0. It will be living in Your files folder forever.

So it is just fucked up?

Mostly. But before sending it to the garbage bin, let us inspect the other functionalities.

More advanced organizing tools

You can assign "categories" to files, but the process of categorization is based not on "MIME magic" but just on the file extension.

Since Vault is just one huge repository, there is no easy way to get a "snapshot" of Your working folder and restore the entire folder to a specific date in bulk (well… there is, but it is broken beyond imagination, so I will skip it).

Autodesk decided to help You with it in two ways: by allowing You to "attach" files to each other and by allowing You to create so-called "items".

And here money comes to play.

The only organizing tool which can clip files together while preserving their versions is the "item". And "items" are available only in the most expensive subscription plan.

"Items" are just a flat list of names which allows You to bunch files together to make a consistent "snapshot". This list grows rapidly and there is no "tree"-like view of it.

"Items" do have their own history and do correctly track which version of a file is attached to which version of an item. So You may use "items" to bunch up Your files.

In fact, "items" are the primary elements of Autodesk Vault which are versioned, controlled, validated and so on.

Attachments do not track anything; at least I could not figure out how to use them in an easy and predictable way.

Browsing it

In practice the only way to get a consistent snapshot of a project is to use "items", so this should be Your starting point. Sadly, items are just a flat list, while files are presented in the form of a folder tree. So guess where every user starts from? Of course from the files tree. And guess what the user gets? An inconsistent snapshot.

And, by the way, You can’t diff Your files. This is Autodesk program dedicated for Autodesk CAD and it can’t… show You what have changed between versions. Yes, no diff. No diff. No diff.

Do I repeat myself?

No diff for Autodesk closed format files in an Autodesk closed source program.

It is even funnier. The Vault GUI itself can't display Autodesk Inventor files, even though it is a dedicated Inventor tool. It must ask Inventor to convert the CAD file from the Inventor format to the DWF format, and when You ask Vault to show You the CAD file it shows You the DWF file instead. If You happen to not have Inventor on Your machine, and there is no instance of Inventor running which advertises itself as a "job processor", then the DWF file may be "stale".

It did not take me long to look at the same file, in the same version, and see three different drawings. When I opened the CAD file with Autodesk Inventor I was seeing something else than what I saw with Autodesk Inventor Viewer. And the DWF file, which in theory should be just the "visualization" of that CAD file, was showing yet another drawing.

So this is the version control You get with Vault.

Gladly, and this is a bit of a paradox, Vault behaves better with non-Autodesk files. Providing there is no cross-file dependency. If You tend, just like me, to create OpenOffice documents which are linked to each other and linked to images, then You are boned.

The Vault server is also equipped with a so-called "thin client", which is a web browser interface. It is not super easy to use, but for read-only access it is far better than the normal GUI. And it does not consume a license, so this is the only front-end for file distribution.

Can I use GIT with CAD files?

Sure, no problem. Git can handle any kind of files.

There are however three things You must know.

First, GIT is conceptually a distributed system which assumes that there is always an easy possibility to "merge" changes made to the same file by two persons. This is true for plain text files, like program source files, but almost never true for JPG, CAD files and alike.

Second, GIT and GIT servers really do try to show users the difference between versions of the same file. Again, this is possible for plain text files and rarely possible for closed format proprietary CAD files. It is always a bit of a pain in the behind to convince GIT web servers to not show diffs for such files. GitBlit is not much better than others in this area.

And third, since GIT assumes it is distributed, it has no system to prevent two users from working on the same file. Because it assumes it is possible to merge those changes at low cost.

None of those three assumptions hold when You use GIT for closed format CAD files.

Is it a big problem?

No.

Basically the problem does not appear if just one person works on a repository at a time. If You can ensure that by regular job management means, You will have zero problems. Except the lack of diff, but sorry, if Autodesk can't diff their own files, how could GIT do it?

GIT will not help You in that manner, but it will detect if You screw up and will prevent conflicting updates. Sure, one of the persons in conflict will have to re-do his job, but GIT will not allow those two works to clash.
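
In practice it looks like the standard non-fast-forward rejection. A sketch of the losing user's session:

 git push origin master    # rejected: the remote has commits we don't have
 git fetch origin
 git merge origin/master   # for a closed format CAD file there is nothing to
                           # merge; one version must be picked and the losing
                           # author re-does his work on top of it
 git push origin master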

Beyond version control

Both GIT and Vault are version control. GIT is infinitely better at it than Vault. And the price difference is outstanding.

But version control is not everything.

You need some formal change control. That is, You need not only to know what has changed, but also who changed it, who verified it and who accepted it.

A formal “request for change”.

Request for change and GIT

Bare GIT has no such support. None. Zero. When it is done, it is usually done by having branches and moving files from branch to branch.

Request for change and Vault

Vault does have it. This is good. The sequence of state transitions is however hard-coded in it, and if Your formal control flow does not fit it, You can't use it at all.

Even though Vault calls it a "change request", it is in fact "change processing".

The difference is simple: making a "change request" is just like making a proposal: "Can You make it such and such?". Such a proposal is connected to a specific project, possibly in a specific version, but the fact that somebody made a proposal does not mean the work must be done. It may be just a dumb proposal. And it may hang around for a long time.

On the contrary, "change processing" is tracking what is actually being done and prevents two changes from conflicting.

In Autodesk Vault You open "change requests" for "items". This is good, because an item tracks versions. But when You file a "change request" in Vault, You are preventing other "change requests" from being opened for the same "item". And this is a blocker, because You can't report more than one issue at a time.

So it is also definitely broken.

Request for change and GitBlit

GitBlit does have a very simple change request system, with a per-repository forum (ticket system).

You just post on the forum and can select a type for Your post ("question", "bug" and many others). Each type of post has some set of possible states. There are not enough states for it to be a true formal system. If Your post requires some changes and change processing is done, You may, although through the command line, ask GitBlit to create a branch dedicated to handling this post. The server will then nicely catch this branch and attach it to the post.

This shows the difference between a "change request" (when You create a post and discuss it) and "change processing" (when You create a branch for a post and work on it).

GitBlit forum posts can also have votes. This simple voting system can be used to track who accepted something and who rejected it, but there is no hard barrier preventing a change from being introduced to the "formally good" branch of history.

The forum is however flexible enough that You can track who had some objections and who gave a green light to proceed. The system won't however protect You against misuse.

Oh, and there is an e-mail notification system which is poorly documented but, surprisingly, works. Since I am more of a "phone person" than an "e-mail person" and I check my e-mail once a week at most, I rarely use it.

Summary

The Git+GitBlit combo, or in fact any other decent web GIT server with support for discussion forums, can handle a complete document management system, providing that You are not especially picky about access control and You have some trust in Your employees.

There is absolutely zero problem with CAD files or other binary files, but some regular job management is necessary to avoid conflicts.

The price point of such system is unbeatable.

The cost of maintenance for Git at each workstation is the same as for LibreOffice, the cost of maintenance of the GitBlit server is, from my experience, about one work-hour per month, and there are no licensing issues.

Git can be scripted, used from the command line, or used with one of the many, many GUIs available on the market.

There are plenty of free books about it in many languages, and You will for sure find some printed ones in Your own language.

The learning curve is at first damn steep, then flat, then steep again. Basic activity requires about 16 hours of well prepared training. If You prepare Your repositories well and arm them with scripts, then the most common actions can become single-clickers.

Vault is expensive, hard to set up, and has hidden licensing costs: Windows + a Microsoft SQL database are the minimum requirements. And it is broken beyond repair. The only usable scenario is using it as file storage for "official versions". Using it for tracking daily work, even with Autodesk Inventor and the alike programs it is dedicated to, is not the best idea – You will sooner or later return to making folders like "old", "test1", "newversion" etc. due to the lack of history branches. And forget about back-fixes, since You can't check out an old version and can't branch the history.

Since in my mechanical designs I usually make three or four approaches, restarting each time from zero with many returns and borrows, Vault is completely useless for my daily work.

With Vault You have, at least in theory, because in practice it doesn't work, professional support.

Which You don’t have with Git+GitBlit, but You can easily get it with Git+GitLab.

I don’t recommend running GitLab on Your own server, because fine tuning it, backing it up and running more than one instance is a real pain when compared with 15-minute GitBlit. We have GitLab too and it is still, after 1 year, not working properly. And nobody knows why. It is just too powerful and too complex. We have about 40 users and GitLab is capable of handling tens of thousands of them. This is not a right scale for us.

If I were You, I would give GIT+GitBlit a try.

But if my payment depended on how valuable the tools I manage are, which is typical at the corporate manager level, then selecting zero cost tools would be stupid.

Reading: Thomas L. Friedman, "Hot, Flat and Crowded"

I just finished reading the book "Hot, Flat and Crowded" by Thomas L. Friedman and I would like to share some impressions about it.

This book was first printed in 2008, so quite a few years ago. But I got my hands on it only recently. I found it in a "scrap box" in a nearby supermarket at a superb price of 2 Euro. This is 5 times cheaper than originally and well within my budget for entertainment, so I bought it. Please don't be sad, mister Friedman; mister Fukuyama was also found in the same box a few years earlier at the same, or even lower, price.

What is this book about?

For those who did not read it: it is about the impact of and interactions between the environment, greenhouse gasses + global warming, human life and politics. I can't tell much about how correct the author is in determining the progress of the greenhouse effect, since it is not my branch of science. I do physics, mechanical design, electronics and programming, so I am not the right person to say something about the climate. Except for the fact that this is now (2022) the third winter in Poland during which we had just a few days of snow. While in 2005 we had -20°C for over a month, as far as I recall.

There are however some additional aspects of this book which made me a bit sad about the capabilities of American thinkers, but those are more related to politics than to the climate.

Naivety

The first thing which hit me was the naivety of thinking. Sure, it is good to believe that people are in general good and do their job well. But when I read that someone seriously proposes that some officials should be bound to check and validate how "green" some technical solutions are, then I find it at least silly.

This is not to say that officials are lazy bastards who just think about how to get through the day with the least possible effort and responsibility. They are no exception in that manner – every one of us has some of that approach inside him/herself.

The problem with this thinking lies in the fact that, sooner or later, relying on officials doing their job perfectly in a changing environment will lead to disaster. An American author should understand it well, because You, dear United States citizens, have experienced it at least two times.

First during the Three Mile Island nuclear crisis, and second during the Deepwater Horizon oil spill. In both cases relying on officials doing a substantially correct job ended up in officials doing the job according to the letter of the law.

Don’t get me wrong, there is nothing bad in officials obeying the law to the letter. The problem lays in the fact, that when the law regulates technical problems it needs technical means to validate the correctness of the law.

In the case of Three Mile Island, the thinking "since we do it by the letter, it is safe" led to disaster. Gladly it was a mild, laughably harmless disaster when compared with Chernobyl. You just got lucky. Or You suck so much that You can't even make a good nuclear disaster.

Just kidding, but it was my first thought when I listened to the Three Mile Island commission report. Thanks, librivox.org, for recording it!

The law, especially in countries ruled like the United States of America, has an excellent mechanism for validating whether it indeed expresses the will of the nation.

The problem is that science is not democratic. It is a pure dictatorship of nature. And, being what it is, democracy gives absolutely zero mechanisms necessary to validate whether any regulation controlling technical aspects is correct and in accordance with the current state of scientific knowledge.

And scientific knowledge changes fast. What was true and recommended yesterday may be an utter stupidity tomorrow.

This is the problem.

From my observation, if officials were given the responsibility to judge which design is "green" and which is "dirty", then sooner or later we would:

  • halt the progress, since the law will clearly define what to do and there will be almost zero mechanisms to validate it. Thus anything going beyond what is defined in the law will be banned. We can see it in many branches of technology – the tighter the definitions are, the less true added value appears on the market. And if it appears, it appears in the shadow zone which is not covered by regulations. Only those companies which have enough power to push some regulations through can make progress… and they set the regulations in such a way that the competition is locked out;
  • make designers focus on maximizing the "green" in legal terms, rather than on actually making eco-friendly products. Exactly as Volkswagen did. Hm… was it Volkswagen? I'm not sure.

The law may be specified strictly, with exact and precise rules. If it were like that, the effects I wrote about above would be amplified. If however the law left more freedom of interpretation, it would create a great pressure for corruption. The less precise the law is, the higher the temptation to persuade officials with some other merits. They may submit to that pressure without any risk, because the vague and imprecise law gives them space for free interpretation.

Mistaking "they want" for "they must"

At a certain moment the Author writes that America should follow Europe, because since Europe taxed large cars, Europeans do not want to buy big cars.

Big mistake. It is not that we do not want them. We cannot afford them. This is quite a big difference.

By the way, when You look at the streets of Poland now and compare the size of the cars riding them with those of about 1990, You will clearly notice that cars today are basically twice as big. We had something we called the "Big Fiat". It was large then. And when I saw it recently… gosh… it is so tiny! We couldn't get anything else then; today we can. Whoever has money can afford it. Only those who count every penny select cars with reason and according to real needs. So let us kick them in the nuts, right?

Creating fake financial pressures looks promising at first glance. But when You look at it closer, it just corrupts everything.

Take, for example, electric cars.

Electric cars are, by definition, simpler than mechanical ones. A direct drive system, which is the natural choice for an electric car, contains about one tenth of the mechanical parts of a combustion engine drive. They have a great potential to be really, really cheap.

Then why are they so expensive? The end price point, regardless of the fake loads on combustion cars and the financial subsidies, is set significantly above the price of a combustion engine car. When I was buying my new combustion engine car, it cost me 40'000 local monetary units. I admit, I made a minimalist choice based on a careful calculation of costs, required maintenance, my needs and the predicted use profile.

The small electric car was priced at about 100'000 local monetary units. Plus the subsidy I paid in taxes. The price is far from fair, yet the legal system is promoting those who produce more expensive products in a less economic way, and is requiring customers to produce more waste, since they need to earn more money working on the factory floor and sell more products, many of which belong to "entertainment" and, from an ecological point of view, are just near-future useless garbage.

If not done very, very wisely, impossibly wisely, subsidies invert the economy and make those who produce in a less efficient way more efficient financially.

Not that I say it is always wrong to give subsidies to some brands. But You should always consider what will happen if some day You remove the subsidy.

Will the market stand it? Won't it crash? It is tricky to keep a proper balance, and it is very easy to create a system in which You can't cancel subsidies because everything will crash. We were very close to that point with solar energy this year. Due to the technical problems with the network accepting such an amount of unstable energy sources, the government decided to remove most of the fake financial benefits from solar energy. Just at the right point: before the inversion could occur, and before we would actually have to pay higher bills because of the abundance of solar energy which, due to the subsidy and after accounting for all the side costs, was bought at a higher price than it was sold back at.

And one more thing.

Placing fake financial loads makes people think about how to avoid them. It is the balance of risks and loads which makes people follow the regulations or dodge them.

I don’t know how Americans think about circumventing the law. Maybe United States citizens never do it. For us in Poland this is traditional that the law is for idiots who can’t figure out how to dodge it.

Setting aside finances and legal issues, You can't just force somebody to love You. In the same way, You can't use pressure to make people not pollute the world. You may gain some temporary effects, but once the pressure is released You will get a tremendous bounce back.

And since we talk about love…

Why they don’t love us?!

In one of the chapters the Author mentions the political effect related to the fact that America depends on oil bought from "unstable" or "rogue" countries. The more fuel the American economy needs, the more money is transferred to places which are neither democratic nor, let's say, "human friendly".

It is a correct observation. A good starting point.

Then the Author observes that it is the hunger for oil which made Islamic terrorism able to do as much harm as it has done.

This is also a true observation.

But the implication that it is the oil which created the terrorism is, in my opinion, a huge misunderstanding. The Author does not say it straight, this is true, but the cure he proposes…

I don’t know how one can be so blind.

The terrorism doesn’t come from money. Surely, money allows You to get necessary resources, but the reason must be somewhere more deep.

Is it religion? Maybe.

Or maybe it is the way America thinks about solving that problem.

The only reasoning I have found in that book about solving the problem of oil-funded terror was "how to change them so they will be more eager to like us".

Change them. Always change somebody else.

There is absolutely no sign of the thought that it may be something in You, dear Americans, which makes other people dislike You. Maybe if You changed a tiny bit, others would be more keen to change their view too?

I do appreciate many aspects of American society. Its dynamics. Its approach to law. Its approach to public property. Its approach to the process of making law. Its approach to freedom. Yet there are plenty of aspects which I honestly despise. Like the tendency to solve everything through conflict. The projection of its own problems onto others. The holy "American way of life" which is the absolute best, and fuck the rest.

I can live with it, no problem.

Providing, that all of it stays out of my own home.

I do understand why Near East terrorists think the way they think. They have their own homes, their own history (five times longer than that of the United States, by the way), their own values and their own traditions. Not everything has the same value for everybody. Something precious for one may be worthless for others, and it is often hard to notice that, because we always look at the world through our own eyes.

Sadly America has, in the old British tradition which never refrained from corruption, slavery and even mass scale drug dealing, the nasty habit of forcing its own values on others. Plenty of those values are good, not a single word against them. But others are disgusting. Which is which depends on the local culture.

Again, there is usually no problem if a weak and wicked person comes to You and tries to force something on You. That person is weak, so You can simply tell him/her to bugger off. But if it is a huge guy with a huge hammer, sword, pistol and pockets so deep and full of gold that wherever he stands gold coins cover the ground? You will obey. You will obey and You will despise Yourself. Despise Yourself because You yielded to that person… and one step after another this self-despise turns into hatred towards that person.

And what solution for that hatred does the huge person propose to You?

To change You. To change You to that person's liking.

Sorry, dear Author, but it won’t ever work.

Projecting own problems on others

As usual for Americans, the Author does not limit himself to proposing how to solve the problem inside the United States. Truly, yes, the United States is only a part of the global warming problem. A huge part, but just a part. Solving it in the States without changing the rest of the world will not help much. I do agree with that.

So the Author recommends that some financial pressure should be used everywhere to make people consume less electrical power.

Well… In 2021-2022 my average annual energy bill was around 80 Euro. Eighty Euro. Not for a month, for an entire year. A kWh in Poland is priced at about 0.1 Euro, and the energy bill contains about +50% of "constant costs", like the fee for sustaining network readiness and so on. I live normally, I don't hide in the cold and dark.

According to that page, an average US household uses about 1570 Euro worth of energy per year at a price point of around 0.1 Euro per kWh.

Ehm… If this is not casting one's own problems on others, then what is it?

Summary

Regardless of all my bragging, it is a good book. It is well written, slightly less scientific than I would like, but worth the money I spent on it. If You can get it at Your library, read it. If You can get it at whatever the "entertainment" price level is for You – get it.

But do not take the solutions proposed in it as a silver bullet. Be cautious.

It is naive, unfair and arrogant, and it proposes ways of solving problems which may produce as many new problems as old ones they would solve.

But it is worth reading.

Watchdogs, how to not use them

At first I was thinking about making it the next part of the "RtOS-implementing" series, but finally I decided that this subject is wider and not restricted to RtOS.

So let us get to today's subject.

One of the most important aspects of embedded programming is reliability.

In the embedded world there are two basic threats to reliability:

  • the mistakes on human side;
  • the glitches in hardware.

Human mistakes: hardware

The nature of humans is to make mistakes. Regardless of how hard You try, You will always fail. Sooner or later. Your mistake may come out at once, or after many years. But it will come out.

Human mistakes may affect both the software, which You, dear reader, are most interested in, and, what is not so obvious, also the hardware. To realize the scale and potential of mistakes it is enough to read the bug reports for software and the "errata" documents for micro-controllers.

By the way, the "errata" is Your best friend in this faulty world. Always start from reading it. Always. It may save You endless hours of work. You will probably notice that those are not small documents.

I can assure You that the bugs found in the "errata" are not the only ones. From my experience, and since I code mostly in assembler I am very close to the hardware, I was usually able to fall into one or two unreported bugs in each micro-controller I used. I must admit however that some of them were not physical bugs but rather products of some misunderstanding between the guys who designed the hardware and those who wrote the specs.

Putting it all aside, hardware may be faulty due to human errors. Even a one hundred percent perfect program may fail if the hardware fails. For example, a certain batch of PIC18 micros could, under a certain voltage and temperature, fail to correctly execute a simple:

goto x

instruction if x was too far away and the next instruction was something touching the data bus.

Human mistakes: software

Nothing more to say. You wrote it, it is buggy. Period. There will be certain conditions under which Your software will fail. The better it is tested, the smaller the chance it will happen, but it still will happen.

Glitches in hardware

Let us imagine for a moment that we are living in a perfect imaginary world where hardware is problem free and we made zero mistakes.

Nothing will fail then, right?

Well… no. It will fail. Because of physics.

The "single event upset" phenomenon

… or how nuclear physics destroys Your program.

Electronics is small. Very, very, very small. So small that if a heavy particle with a significant amount of energy hits it, it may produce enough free charge to flip a MOS transistor into conduction for a moment. And since electronics is not only small but also fast, this moment may be enough to alter program execution or change some data.

Note: This is where digital machines do lose against analog ones. An op-amp can't fail due to a single event upset.

This event is rare. From some calculations I did for a certain project, I can estimate that it may happen about once every 20 years per chip at ground level. This is rare. But if Your company releases 10'000 pieces of programmed chips each month, it is bound to happen within a predictable amount of time.

Of course chip designers are aware of this effect, so plenty of CPU and memory chips do contain some circuits which prevent it. But some, especially those which are not focused on reliability, do not.

Weren’t You ever curious why the GPU which contains hundreds of execution units costs a fragment of 4-core CPU? There are some brilliant notes at Nvidia side (I can’t point You to them now) which explains why it is like that.

Mainly it is because the GPU is allowed to fail to compute the correct result. As long as it doesn't crash completely, it absolutely doesn't matter if some pixels are computed wrong, because this effect will stay visible for just the ~20 ms the image is on screen. This is why the computing path may be cheaper: because it doesn't have to be robust. And this is why NVIDIA GPUs dedicated to computing are not cheap.

Here is the link to some document of interest: this document.

Are we doomed?

Mostly, yes. Death is unavoidable. So is the system crash. So is the hardware failure. Everything will break sooner or later…

Of course the best thing would be to prevent the failure, but since we can't do that, we should be prepared for it.

Preparing for a failure

The most basic element of failure resistance is being able to detect the failure. If a failure is detected, we can do something about it. If it is not… we are doomed to oblivion.

And after this long introduction we come to…

The watchdog

The “watchdog” is an elementary piece of hardware present in 99% of micro-controllers.

It wasn’t always like that. In early 1990 only some micro-controllers had it. External chips had to be used then. I think You may still buy such external watchdogs.

This hardware has just one task to perform: if it is not "kicked" within a certain amount of time, it resets the processor.

Resetting and restarting is the most primitive way of restoring a system to a "safe state". Something failed, sure, but we can restore the system to proper working order with the restart… at least in most cases we can. There are some specific cases in which a restart is not safe, but I will skip this area for now since I don't have any experience with it. What You should be aware of is that You must always check whether restarting from any possible state of Your system is safe.

All right, so we have such a “watchdog”.

How do we use it?

Watchdog in linear program flow

The watchdog is in fact a very primitive tool and it can check just one condition: did the program get stuck or not?

It can’t validate if program flow is correct. I can’t validate if data are not broken. Nothing of that. It just can check if program is passing through some “checkpoint” frequently enough.

If our program is a simple, interrupt-less, classic state machine, the use of a watchdog is simple:

main loop:
 set up watchdog timeout period
 for(;;)
 {
  kick_the_watchdog
  ....many silly actions here;
 };

What is the effect of this program?

In the design phase we estimated the longest possible time necessary for the main loop to run. Let us call this time Tloop. If passing through the loop takes longer, it means that something is wrong. So we set the watchdog to reset the processor after the Tloop time since the last kick, plus some margin.

If everything is fine, the watchdog is kicked frequently enough to not reset the processor. If however anything prevents the loop from completing within the set time, the watchdog will reset the processor.

The downside is that the processor may be running wild for at most the Tloop time, but after that it will restart and most probably return to a safe state. Some data will be lost, some external effects may be present, but the duration of the failure will be limited.

You should always take this "wild period" into consideration. Is it fine for Your device to be mad for, let's say, 10 seconds? Sure, if it is an e-book reader or an MP3 player; but if it is a car brake control system, I would rather stay in the 100 ms range.

Will it cover every possible failure mode?

Of course not.

Will it detect some failure modes?

Yes, for sure.

Watchdog and interrupts

As You probably noticed, the "watchdog" provides just one "checkpoint". Surely You can stuff the "kick it" code in any place in the program, but all this code will relate to just one "checkpoint".

Now imagine we added interrupts to our system. Like that:

main loop:
 set up watchdog timeout period
 for(;;)
 {
  kick_the_watchdog
  ....many silly actions here;
 };

interrupt_service_routine:
   even more stupid actions
   return from interrupt

The watchdog is protecting our main loop exactly the same way it protected it in the previous example. But what about the interrupt? Does anything check if it works?

Surely the main loop may do some checking, but I find it cumbersome and error prone. What we need is another watchdog.

Checkpoints

In fact it is better to think about "watchdogs" as "checkpoints" instead of "watchdog timers". They are timers, physically; their actions are driven by the lapse of real time. But from the program's point of view they are "checkpoints". If You don't pass through a "checkpoint", You are in trouble.

But how many “checkpoints” do You need?

The most elementary reasoning is: one for each thread and one for each interrupt implementing a state machine.

Notice, this is the place where we interact with the RtOS: "one for each thread". One for each RtOS task, that is.

But we have just one, single hardware watchdog!

Don’t do it example

As a silly young lad I did:

main loop:
 set up watchdog timeout period
 for(;;)
 {
  kick_the_watchdog
  ....many silly actions here;
 };

interrupt_service_routine:
    kick_the_watchdog
   even more stupid actions
   return from interrupt

As an old fart I would say: what an idiot I was!

What is wrong with it?

We have two loops, logically speaking. The main loop and the interrupt. But we used just one "checkpoint". If both the main loop and the interrupt stop executing, the processor will be reset. But if just one of them stops, it won't be.

We need to multiply those “checkpoints” somehow.

Virtual watchdogs

First we need to select some place which can provide us with two resources:

  • it must execute in a loop through the whole life of the program;
  • each time it executes, the amount of real time elapsed since the last run should be known (or at least we must be able to make some assumptions about this time).

Since in most applications we will need some kind of "heart beat" periodic interrupt anyway, this is an ideal place to use.

So we can do something like that:

//Declare virtual watchdogs
 boolean WATCHDOG_0_ENABLED;
 integer WATCHDOG_0_COUNTER;
 boolean WATCHDOG_1_ENABLED;
 integer WATCHDOG_1_COUNTER;
   ....
heart_beat_timer_interrupt_routine:
    kick_the_hardware_watchdog
    if (WATCHDOG_0_ENABLED)
    {
      --WATCHDOG_0_COUNTER
      if (carry over)
           for(;;)  ← this just makes the hardware watchdog
                  reset the CPU. You may use better means for that.
    }
   if (WATCHDOG_1_ENABLED)
    ..... and so on, and so on

What is going on in it?

We are using the hardware watchdog to control the heart beat interrupt. This is the only "checkpoint" it is controlling. If this interrupt stops, the watchdog restarts the processor. If it is alive, it does not.

Then, in every place where we would usually plant the kick_the_watchdog code, we use:

  .... 
  WATCHDOG_x_COUNTER=period
  ...

Note: I assumed that WATCHDOG_x_COUNTER=period is atomic against the heart beat interrupt. You must make sure it is. In most CPUs the transfer of a single hardware word to a variable is atomic by design. Increment or decrement is usually not atomic, but it depends on the architecture.

This way we provide ourselves with as many “checkpoints” as we need. And we made it flexible enough so it can be disabled and may have its period adjusted.
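In C the heart beat handler could be sketched like this (a minimal sketch; the names, the counter width and the way of forcing the reset are my choices, not the only possible ones):

#include <stdbool.h>
#include <stdint.h>

#define VIRTUAL_WATCHDOGS 2

/* volatile: shared between tasks and the heart beat interrupt */
static volatile bool     wdg_enabled[VIRTUAL_WATCHDOGS];
static volatile uint16_t wdg_counter[VIRTUAL_WATCHDOGS]; /* in heart beat ticks */

extern void kick_the_hardware_watchdog(void);

/* the heart beat periodic interrupt: the only real "checkpoint" */
void heart_beat_timer_interrupt(void)
{
    kick_the_hardware_watchdog();
    for (int i = VIRTUAL_WATCHDOGS; --i >= 0; )
    {
        if (wdg_enabled[i] && (--wdg_counter[i] == 0))
        {
            for (;;) { /* spin: let the hardware watchdog reset the CPU */ }
        }
    }
}

/* planted wherever kick_the_watchdog used to be; a single word write,
   so it is atomic against the interrupt on most CPUs */
static inline void kick_virtual_watchdog(int i, uint16_t period_ticks)
{
    wdg_counter[i] = period_ticks;
}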

Notes of warning

You should seriously consider whether WATCHDOG_x_ENABLED is even necessary for You. It is always better not to have it than to have it. If a watchdog can be disabled on purpose, it can also be disabled by mistake. This is why on some CPUs (PIC16, PIC17) You can’t disable the watchdog at all, and on others (MSP430) disabling it requires some special actions which are fairly improbable to happen at random.

If You really need the ability to disable it I would rather use something like:

integer WATCHDOG_0_ENABLED;
if (WATCHDOG_0_ENABLED!=DISABLED)
{
   --WATCHDOG_0_COUNTER
   ....
}

and select the DISABLED constant to be something which represents a value Your processor can’t just spew out at random. Surely not zero and not -1. This way a random data transfer has less chance to disable it. But it costs some bytes of memory instead of bits. It is Your choice.
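For example (the exact bit pattern below is my arbitrary pick; anything with an irregular mix of ones and zeros will do):

#include <stdint.h>

/* not 0x0000, not 0xFFFF, not a small number a runaway counter could
   reach; an irregular pattern is unlikely to appear by accident */
#define WATCHDOG_DISABLED ((uint16_t)0x5AC3u)

volatile uint16_t WATCHDOG_0_ENABLED = WATCHDOG_DISABLED;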

Interaction with RtOS

Since each task requires its own “watchdog”/“checkpoint” it is best to include it in the task table structure:

typedef struct{
    saved_SP
    event_flags ; just some bits, ie. 8 bits
    event_mask
    watchdog_enabled
    watchdog_counter
}Ttask_state

But how to use it inside a task?

If a task has the form of a flat loop, use the watchdog as usual. We did not however create an RtOS to struggle with tasks which are flat loops. We most probably have many small inner loops in which we wait for something.

Hmmm…. what does it mean that a task is alive?

An alive task does something and… it yields the processor to other tasks. A task which does not yield monopolizes the CPU and should be killed. And a task which yields informs the RtOS that it consciously decided to give up the CPU for a moment, so most probably it is in a sane state.

The best place to put a task watchdog kicking is the yield() subroutine.

So all we need is to provide a “watchdog enabled” yield() which will do:

subroutine yield_wdt(period)
  TaskTable[CTP].watchdog_counter=period
  TaskTable[CTP].watchdog_enabled=ENABLED
  //just fall through in assembly to yield() or jump to it.
subroutine yield()
   .....

As You probably noticed this subroutine not only sets up the watchdog timeout period but also makes sure the watchdog is enabled. This is a good practice, and since it is inside the yield_wdt(period) subroutine it doesn’t cost us a penny.

Beware that the ordering of the operations manipulating watchdog_counter and watchdog_enabled is important. If You did it in reverse order You might get false watchdog resets due to a race condition with the heart beat interrupt.

And, by the way, the period argument of yield_wdt(period) is intentionally not kept in the Ttask_state structure where, reasonably thinking, it should be. That structure is in RAM and can be corrupted, while the call:

call yield_wdt(100ms)

will be burned into program memory and can’t be corrupted unless something very unlikely happens. Of course You may use a dual task state table, with the variable part in RAM and the fixed part in program memory, if You like.
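In C the same idea could be sketched like this (names follow the pseudocode above; the counter width and the tick units are my assumptions):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uintptr_t saved_sp;
    volatile uint8_t  event_flags;
    uint8_t           event_mask;
    volatile bool     watchdog_enabled;
    volatile uint16_t watchdog_counter;  /* in heart beat ticks */
} Ttask_state;

extern Ttask_state TaskTable[];
extern uint8_t CTP;                      /* current task pointer */
extern void yield(void);

/* yield and arm this task's virtual watchdog; note the ordering:
   load the counter first, enable second, or the heart beat interrupt
   might see enabled==true together with a stale, maybe zero, counter */
static inline void yield_wdt(uint16_t period_ticks)
{
    TaskTable[CTP].watchdog_counter = period_ticks;
    TaskTable[CTP].watchdog_enabled = true;
    yield();
}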

Summary

After reading this blog entry You should know what a “watchdog” is, why it is important and what it is and is not capable of doing. You should also understand how to extend its functionality so that it can protect Your code better.

There are other possibilities to extend it, like for example sequential watchdogs, which not only check if the code is periodically passing through a “checkpoint” but also control whether two or more “checkpoints” are passed in a designed order. I have however never needed them and I think that if You are able to simplify Your code with a cooperative RtOS, You will not need them either.

RtOS – implementing it: conserving power

Hi again!

In a previous blog entry I have shown You how to make Your tasks wait for some events. Now it is time to make this waiting useful.

First let me remind You how the main task switching kernel loop looks like:

subroutine yield()
  push "called save" registers
  TaskTable[CTP].SP = SP
  for(;;)
  {
    CTP--
    if (carry over/borrow) CTP = NUMBER_OF_TASKS-1
    if ((TaskTable[CTP].event_flags &
         TaskTable[CTP].event_mask ) !=0)
        {
        SP = TaskTable[CTP].SP
        pop "called save" registers
        return
        }
  }

It is spinning all the way round, isn’t it? Spinning and spinning forever. But what if we alter it a bit:

subroutine yield()
  push "called save" registers
  TaskTable[CTP].SP = SP
  for(;;)
  {
   for(i=NUMBER_OF_TASKS;--i>=0;)
   {
    CTP--
    if (carry over/borrow) CTP = NUMBER_OF_TASKS-1
    if ((TaskTable[CTP].event_flags &
         TaskTable[CTP].event_mask ) !=0)
        {
        SP = TaskTable[CTP].SP
        pop "called save" registers
        return
        }
   }
   *** this is a sweet spot ***
  }

Not much has changed. We now have an inner loop which checks if any of the existing tasks is ready to be awoken and an outer loop which repeats it forever.

But let us think about it for a moment: what exactly does it mean when the program reaches the “sweet spot“?

That there was no task ready to be awoken.

Hmm….

What could it mean? Since no task is ready to be awoken, including the task which has just called yield(), are there any means for any task to ever be brought up?

Surely no task can set any bit in any of the event_flags variables, because no task will be running.

Does it simply mean that we are stuck?

Well….

Not if there are any interrupts.

And when we are talking about interrupts…

Interrupts and power saving modes

Nowadays most micro-controllers, and even most microprocessors, have some kind of “power saving” functionality. In the 80586, as far as I recall, there was an instruction which just paused the CPU in a low power mode. Nothing fancy, honestly. In modern micro-controllers we have a much, much wider choice: from just putting the CPU execution core to sleep, through shutting down some sub-systems, then shutting down some clocks, down to the deep sleep from which only a reset can pull us out.

Of course the “deep sleep” is not what we are aiming at, but some intermediate sleep modes will be fine.

But what does it have in common with interrupts?

Everything!

It is an interrupt which awakes the CPU: the hardware gets back online, starts executing the interrupt service routine, gets to the end of it and then…

Exactly… what happens then?

The details will depend on the CPU, but if a processor has useful power saving modes it has the ability to perform three atomic operations:

  • to put processor in a sleep mode X and atomically enable interrupts;
  • to atomically return from interrupt, restore CPU to previous sleep mode and enable interrupts;
  • to atomically return from interrupt, restore CPU to active mode and enable interrupts;

I will call the first operation sleep(X), the second one reti (because the active mode is just one of the sleep modes, right? So a regular return from interrupt should be able to restore the sleep mode), and the last one I will call awake_reti.
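On MSP430, for example, these three operations map onto single instructions and well known compiler intrinsics (a sketch using the TI compiler spelling of the interrupt routine; msp430-gcc spells the interrupt attribute differently, so check Your toolchain headers):

#include <msp430.h>

void enter_sleep(void)
{
    /* sleep(X): one BIS #...,SR instruction sets the low power mode
       bits and GIE (interrupt enable) atomically */
    __bis_SR_register(LPM3_bits | GIE);
}

#pragma vector = PORT1_VECTOR
__interrupt void port1_isr(void)
{
    /* ...signal some event here... */
    /* awake_reti: clear the LPM bits in the SR copy saved on the stack,
       so RETI returns to an awake CPU instead of restoring the sleep mode */
    __bic_SR_register_on_exit(LPM3_bits);
}

/* a plain reti needs nothing special: RETI restores the saved SR,
   LPM bits included, so the CPU drops back into the same sleep mode */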

So let us imagine we will do something like that:

subroutine yield()
  push "called save" registers
  TaskTable[CTP].SP = SP
  for(;;)
  {
   for(i=NUMBER_OF_TASKS;--i>=0;)
   {
    CTP--
    if (carry over/borrow) CTP = NUMBER_OF_TASKS-1
    if ((TaskTable[CTP].event_flags &
         TaskTable[CTP].event_mask ) !=0)
        {
        SP = TaskTable[CTP].SP
        pop "called save" registers
        return
        }
   }
  sleep(X)
  }

What have we just done?

Basically we said: “If there is no task ready to run, put the CPU to sleep with interrupts enabled.”

Of course, to make any of the tasks alive again we need some interrupt code. This code must both signal an event and awake the CPU so that the loop continues.

Like, for example, this:

interrupt handling routine()
{
  .... blah blah blah
  TaskTable[1].event_flags |= 0b1000_0000; //signal some event to some task
  awake_reti
}

And we are done!

Well… not really in fact, but let us pretend for a moment we are indeed done. What exactly happens?

The main kernel loop checks all tasks and finds that there is nothing to do. So it puts the CPU to sleep. Then an interrupt happens and decides to awake some task, in this case task number 1. So it sets an event signal and returns from the interrupt in such a manner that the CPU does not return to sleep mode and instead executes code normally. In effect it re-runs the kernel loop. And this time the loop will find a task to run, so it will awake it.

Case closed, hurray?

Race condition

Sadly, no, the case is not closed.

Why?

Because even though the sleep(X) is atomic, the event_flags testing loop is not. And the following sequence of events is possible:

  1. Task 0 is checked, no, it is not to be awoken.
  2. Task 1 is checked, no, it is not to be awoken.
  3. Task 2 is checked, no, it is not to be awoken.
  4. An interrupt happens, and it is doing: TaskTable[1].event_flags|=0b1000_0000;
  5. Task 3 is checked, no, it is not to be awoken.
  6. Nothing to awake, so execute sleep(X)… and we should NOT be doing that, right?

Because the task checking loop is not atomic against interrupts, an interrupt could slip in during the loop and alter the state of a task which was already checked. As a result the CPU will be put to sleep while in fact it should not be. In most cases this kind of race may go unnoticed, because sooner or later another interrupt will happen during the sleep(X), awake the CPU, and then all pending events will be noticed and all pending tasks will be awoken. But a delay will be introduced, and in some rare cases, when that was the only interrupt which could awake anything, we will get stuck.

The obvious solution is to make the entire testing loop atomic against interrupts:

subroutine yield()
  push "called save" registers
  TaskTable[CTP].SP = SP
  for(;;)
  {
   disable interrupts
   for(i=NUMBER_OF_TASKS;--i>=0;)
   {
    CTP--
    if (carry over/borrow) CTP = NUMBER_OF_TASKS-1
    if ((TaskTable[CTP].event_flags &
         TaskTable[CTP].event_mask ) !=0)
        {
        enable interrupts
        SP = TaskTable[CTP].SP
        pop "called save" registers
        return
        }
   }
  sleep(X) //note: Interrupts are enabled as a side effect of entering sleep mode.
  }

but I hate this solution.

Why do I hate it?

Because it adds a lot to the interrupt latency. This is a significant block of code which does not have to be atomic and it loops very frequently. It is really not worth paying that latency cost if there is a better solution.

Which is something like that:

   boolean interrupt_updated_an_event;
 subroutine yield()
  push "called save" registers
  TaskTable[CTP].SP = SP  
  for(;;)
  {
   disable interrupts
    interrupt_updated_an_event=false
   enable interrupts
   for(i=NUMBER_OF_TASKS;--i>=0;)
   {
   .....
   }
  disable interrupts
   if (not interrupt_updated_an_event)
   {
      sleep(X) //note: Interrupts are enabled as a side effect of entering sleep mode.
   }
  }

And inside an interrupt we just add:

interrupt handling routine()
{
  .... blah blah blah
  TaskTable[1].event_flags |= 0b1000_0000; //signal some event to some task
  interrupt_updated_an_event = true;
  awake_reti
}

We just added one global flag which indicates that some interrupt adjusted some event flag during the task scanning loop. Of course the interrupt does not bother to check whether it actually happened during the loop or not. It just sets the flag to true. Our main kernel loop atomically clears it when it is sure that it will scan all tasks again. Once the scan finishes without awaking any task, the loop enters the block which is atomic against interrupts and, before going to sleep, tests that flag. If it is not set, it is safe to go to sleep. If it is set, all tasks need to be checked again.
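The same logic in C could look like this (a sketch; the helper names and the separation of the scan from the context switch are my own, introduced just to keep the example short):

#include <stdbool.h>
#include <stdint.h>

#define NUMBER_OF_TASKS 4

typedef struct {
    uintptr_t saved_sp;
    volatile uint8_t event_flags;   /* written by interrupts and tasks */
    uint8_t          event_mask;
} Ttask_state;

extern Ttask_state TaskTable[NUMBER_OF_TASKS];
extern void disable_interrupts(void);
extern void enable_interrupts(void);
extern void sleep_with_interrupts_enabled(void);  /* the atomic sleep(X) */

static volatile bool interrupt_updated_an_event;  /* set by every event-signalling interrupt */

/* the scan-then-maybe-sleep core of yield(), without the context switch */
static void scan_or_sleep(void)
{
    for (;;)
    {
        disable_interrupts();
        interrupt_updated_an_event = false;       /* atomically rearm the flag */
        enable_interrupts();

        for (int i = NUMBER_OF_TASKS; --i >= 0; )
        {
            if ((TaskTable[i].event_flags & TaskTable[i].event_mask) != 0)
                return;                           /* a task is ready: go switch to it */
        }

        disable_interrupts();
        if (!interrupt_updated_an_event)
            sleep_with_interrupts_enabled();      /* re-enables interrupts as a side effect */
        /* if the flag was set we loop back; the top of the loop
           re-enables interrupts right after rearming the flag */
    }
}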

Selecting X in sleep(X)

As You probably noticed we now have one centralized location in our entire program in which the CPU enters the energy saving mode. No more complex decision trees, no more pondering whether I can put the CPU to sleep or not. If there is no task to run, it goes to sleep. If there is a task to run, it stays awake. Plain and simple.

I like it. I hope You will like it too. But what if we can squeeze even more from it?

Power saving woes

The vast number of power saving modes does not come without a price. Sure, I can turn off the Auxiliary Clock signal. Sure, I can turn off the main oscillator. Or the temperature compensating frequency locked loop. But if I turn them off, they will not work.

For example, my beloved MSP430 will stop background transfers from the USB endpoint incoming shift register to memory if the main system clock is disabled. The entire USB machine works, but the data are not transferred to memory. On the other hand, if USB is not running, I can get much more power saving by disabling the main system clock.

Or, in the same CPU, the FLL which stabilizes the RC system clock against the low power watch quartz crystal must run for at least 10 ms each minute, or the clock will drift too much when the temperature changes.

Or…. You can imagine.

With our centralized system it is enough to slightly modify the atomic piece which calls sleep(X). For example like this:

   
 subroutine yield()
  ...
   if (not interrupt_updated_an_event)
   {
     if USB.is_on
             sleep(main_clock_on)
     else
             sleep(main_clock_off)
   }
  ....

Again we have a single point where all decisions about power saving are made. This is really, really good for product quality.

Summary

After getting through this blog entry You should now be able to squeeze as much power saving from Your CPU as possible with just one plain and simple piece of code. Everything related to power saving is kept in one place and the decision to put the processor to sleep happens transparently, without any special action from You. If there is nothing to do, the processor sleeps. If there is something to do, it stays up.

In the next blog entry I will show You how to introduce an elementary protection against hangs into the RtOS and I will explain why, with an RtOS, the standard methods no longer work.

RtOS – implementing it: WaitFor

In the previous blog entry I have shown You how the primary task switching loop looks and what the primary data structure behind it is. I also told You that it still lacks certain elementary functionality.

But what is it that we are missing?

Back to the drawing board

Do You remember the primary task loop?

subroutine yield()
  push "called save" registers
  TaskTable[CTP].SP = SP
  CTP--
  if (carry over/borrow) CTP = NUMBER_OF_TASKS-1
  SP = TaskTable[CTP].SP
  pop "called save" registers
 return

You do see that it is continuously and unconditionally switching tasks, right?

And You do remember those state algorithms we started from, right?

See that round block with the “AWAIT” text?

This is what we are missing.

Our operating system kernel must allow our tasks to wait for “something” to happen.

Wait for me…

Before we add the waitFor() functionality we need to discuss a bit how and for what we should wait.

In most cases a task waits for some physical action to happen: an input pin toggling, a serial port receiving some data, time elapsing, or something being done by another task. You can easily see that we can have many sources which can awaken each task and that we can divide them into basically two classes:

  • things which are made by hardware;
  • things which are made by other tasks.

We will call those “things” events.

Hardware events

Any physical action which is detected by hardware is actually, in technical terms, “toggling some bits” in the electronics which surround our CPU. This “toggling” may affect the CPU in two primary ways:

  • the toggled bit is memorized and may be read by a program, or;
  • the toggled bit makes the hardware do something.

Polled hardware events

Obviously if a bit is just toggled we need to read it to check if it changed, right? We can’t just wait for it to toggle, because to check it we need to actively read it. We have to loop and “poll” that bit.

This is not the kind of “event” our operating system may directly wait for.

Hardware actions

The opposite category of hardware events are those which trigger some hardware actions. Again those actions fall into two categories:

  • those which make the hardware do something;
  • those which trigger interrupts.

The first category is, for example, triggering a DMA (Direct Memory Access) transfer. Or capturing some timer value in some Capture Register. The essential fact about them is that those actions are performed without any software activity. For example the DMA can pause the program execution and copy some memory block.

The second category just starts an interrupt, thus it directly interacts with the program.

This is what our tasks will be waiting for.

Inter-Process-Communication

Obviously the interrupt is not the only source of “events” our task can wait for. For example a user interface task may wait for a key press or for a remote command sent through a cable. Such a command will usually be intercepted by hardware and directed to the communication protocol handling task. That task will process the incoming data, validate it, acknowledge it and formulate a request for the user interface task. In a basic case it may be, for example, a “faked key-press”.

The absolute minimum for an IPC (Inter-Process-Communication) is to be able to “signal” or “notify” another task.

Just like an interrupt would do.

So what do we wait for?

Basically we wait for an “event”, triggered either by an interrupt or by another task.

In the most rudimentary case an “event” is a single bit in memory which, when set by some program, makes the operating system kernel loop jump to a task which is waiting for it.

Note: You may also create “counting events”, but we will skip them for now, since they can be implemented on top of a rudimentary single bit event.
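For instance, a minimal sketch (the helper names are mine and the signal/wait helpers themselves are sketched later in this post): pair a counter with the event bit; the producer counts and signals, the consumer drains the counter.

#include <stdint.h>

/* assumed helpers, sketched later in this post */
extern void signal_event(uint8_t task, uint8_t event_bit);
extern void wait_for_event(uint8_t event_bit);

static volatile uint8_t pending;     /* the counting part of the event */

/* producer side; may also be called from an interrupt */
void produce(void)
{
    pending++;                       /* count the occurrence...          */
    signal_event(1, 3);              /* ...and set bit 3 of task 1 flags */
}

/* consumer side, running inside task 1 */
void consume(void)
{
    for (;;)
    {
        wait_for_event(3);           /* sleeps until bit 3 is set */
        while (pending != 0)
        {
            /* beware: this read-modify-write must be guarded against
               the producer interrupt unless it is a single instruction */
            pending--;
            /* ... handle one counted occurrence ... */
        }
    }
}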

So we can imagine we could have some waiting API which would look like:

void waitFor(event_identifier)
{
  inform operating system loop, that this task waits for a specified event
  call yield()
}

Plenty of programming environments implement such an API. For example WinAPI has it and Java has it.

Is it enough?

Well….

For me it is not.

First of all, in the real world something may always go wrong, so we must always be prepared for the fact that even if we prepared everything correctly and started to wait for something…. it may never happen. For example we sent a command and are waiting for an answer, but this answer will never come, because some noise corrupted the data transfer.

So the absolute minimum API is:

boolean waitFor(event_identifier, timeout)

which waits up to the specified amount of time and informs us whether the event happened within this time.

…but I don’t like it.

Do not get me wrong. This kind of API is a good, well established coding paradigm. It just does not fit the low level embedded world very well.

Why?

Because we not only need to wait for something to happen within a predefined time, we often need to wait for more than one thing at once. For example the user interface of a volt-meter needs to wait for new measurement data, or for the user turning a knob, or for a battery voltage drop detection, or…

In fact we need to wait for X or Y or Z. Or time.

So what can we do?

This is simple.

Event flags + Event mask

From the previous post You should remember the “task table” made of structures like:

typedef struct{
    saved_SP
}Ttask_state

You should also remember that I have told You that we will add some more stuff to it later. And this is the right moment.

Let me see….

typedef struct{
    saved_SP
    event_flags ; just some bits, ie. 8 bits
    event_mask
}Ttask_state

Each time we need to signal an event number E to task number X we do:

TaskTable[X].event_flags |= (1<<E) ;set bit E in event_flags

And each time task number X is going to wait for that event it does:

TaskTable[X].event_mask |=(1<<E) ;set bit E in event_mask
call yield()
TaskTable[X].event_flags &= ~(1<<E) ;clear bit E in event_flags
TaskTable[X].event_mask &= ~(1<<E) ;also in event_mask

Waiting loop

All that is left is to update the yield() loop:

subroutine yield()
  push "called save" registers
  TaskTable[CTP].SP = SP
  for(;;)
  {
    CTP--
    if (carry over/borrow) CTP = NUMBER_OF_TASKS-1
    if ((TaskTable[CTP].event_flags &
         TaskTable[CTP].event_mask ) !=0)
        {
        SP = TaskTable[CTP].SP
        pop "called save" registers
        return
        }
  }

As You can see, this time the yield() stays and loops until it encounters a task which has at least one bit set in both event_flags and event_mask.

This way only a task for which the wait condition is met will be woken. Tasks for which the waiting condition is not met will be skipped and left asleep.
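If You prefer C to pseudocode, the two operations could be sketched like this (the helper names are mine; CTP and TaskTable are the kernel globals from the previous post):

#include <stdint.h>

#define NUMBER_OF_TASKS 4

typedef struct {
    uintptr_t saved_sp;
    volatile uint8_t event_flags;   /* set by interrupts and other tasks */
    uint8_t          event_mask;    /* touched only by the owning task   */
} Ttask_state;

extern Ttask_state TaskTable[NUMBER_OF_TASKS];
extern uint8_t CTP;                 /* current task pointer (an index) */
extern void yield(void);

/* signal event E to task X; callable from tasks and, if the |= compiles
   to a single bit-set instruction, from interrupts too */
static inline void signal_event(uint8_t x, uint8_t e)
{
    TaskTable[x].event_flags |= (uint8_t)(1u << e);
}

/* block the calling task until event E is signalled to it */
static inline void wait_for_event(uint8_t e)
{
    TaskTable[CTP].event_mask  |= (uint8_t)(1u << e);
    yield();                        /* returns once the event has fired */
    TaskTable[CTP].event_flags &= (uint8_t)~(1u << e);
    TaskTable[CTP].event_mask  &= (uint8_t)~(1u << e);
}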

Summary

In this blog entry I have shown You how to add the “waitFor” functionality to our operating system. Now our tasks can set the conditions they are willing to wait for and give the CPU to other tasks until the waiting condition is fulfilled.

In the next blog entry I will try to show You how to turn this functionality into a centralized, easy to implement and very functional energy saving feature.

RtOS – implementing it: tasks table.

In a recent post I have shown You how the cooperative task switch looks. If You would be so kind as to remember it, You may notice it looked like this:

subroutine yield()
 save "called save" registers on stack
 save SP to somewhere
 load SP from somewhere else
 restore "called save" registers from stack
 return 

In this blog entry I will try to show You what to do with that bold “somewhere” stuff. But before doing that let me ask You some questions.

How many tasks do You need?

Can You count them? I mean, when You design Your software can You count them, or does their count depend on what input information is supplied to Your device at run time?

If You can count them then we are talking about a “static task structure“. If You can’t count them, we are talking about a “dynamic task structure“. And since the dynamic task structure is evil, we will skip it.

You think I am joking, right? What evil is there in a dynamic structure?

I am not joking. If You code for a very, very resource constrained system then anything which uses concepts like the C/C++ new/malloc/delete/free operators is asking to sooner or later run out of memory and crash. Why? This is a bit too big a subject for now. Just believe me for a moment, alright?

So let us focus on the good guy, the “static task structure“.

What does “static” actually mean?

That everything is known at compilation time. You know how many tasks You have and where they start in code memory. You know how large a stack You allocated for each task and where You put it.

In other words You can give all of them symbolic names which can be resolved during compilation to simple fixed addresses and numbers. This always helps a lot.

So what is that “somewhere” from the above paragraphs?

All right, all right, I am going back to the subject.

That “somewhere” is a place where we can store task data. For best results it is wise to think about this place as a “task state” structure. This structure will be used to preserve and store everything we need to know about a task which cannot be saved on its stack.

And what actually is it?

Well… currently just a stack pointer (SP), but it will grow a bit later.

So the structure may look like (again in pseudo-C):

typedef struct{
    saved_SP
}Ttask_state

Now, assuming You know the number of tasks at compile time, You may simply define a “task table” as a static global variable:

Ttask_state TaskTable[NUMBER_OF_TASKS] 

Is it enough?

Well…. not exactly. This is because we are calling the same yield() subroutine from different tasks and we have to somehow pass it the index of the task which is calling it. We could think about passing it directly as an argument, but that method would prevent us from doing something like this:

subroutine do_long_stuff()
  ...
  yield()
  ...
  return
task A:
  ...
  call do_long_stuff
  ...
task B:
  ...
  call do_long_stuff
  ...

In other words passing task number directly would stop us from yielding in shared code.

So we have to extend our set of task switching related global, static variables to:

CTP    ; Current Task Pointer
Ttask_state TaskTable[NUMBER_OF_TASKS]

That is, we add one additional variable which can be used to determine which task is currently running, and thus where that “somewhere” is where the Stack Pointer should be saved.

The CTP (Current Task Pointer) can be a full blown pointer to a specific entry in TaskTable or an index into it. I, personally, would use the index because it is less error prone, loops faster, and saves some bytes of RAM. Since we usually have very few tasks, the index can be just a few bits while a pointer must occupy a whole word.

Note: From my personal experience three to six tasks are usually more than enough.
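In C this boils down to something like (a sketch; the uint8_t index type is my choice, not a requirement):

#include <stdint.h>

#define NUMBER_OF_TASKS 4           /* known at compile time */

typedef struct {
    uintptr_t saved_sp;             /* the only field we need for now */
} Ttask_state;

static Ttask_state TaskTable[NUMBER_OF_TASKS];
static uint8_t     CTP;             /* an index: cheap, safe, loops fast */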

Now we can modify the yield() subroutine to use the TaskTable:

subroutine yield()
  push "called save" registers
  TaskTable[CTP].SP = SP
  SP = TaskTable[....

Hmmph… exactly… What now?

How to determine next task?

Round robin

Since we currently have nothing but the method of continuously switching tasks, without any means to actually put a task to sleep, we need to resort to the so called “round robin” scheduling algorithm.

This sounds serious, but it is straightforward and simple. In the “round robin” scheduling algorithm there are no priorities and all tasks run one after another. You might have read that the most recently run task is given the lowest priority and the one which ran longest ago gets the highest, but in fact it is much, much simpler.

Just like that:

subroutine yield()
  push "called save" registers
  TaskTable[CTP].SP = SP
  CTP--
  if (carry over/borrow) CTP = NUMBER_OF_TASKS-1
  SP = TaskTable[CTP].SP
  pop "called save" registers
 return

Simple? Exactly. This is the entire complexity of “round robin” on a single core CPU.

The only thing which may confuse a C programmer, and is obvious to assembler guys, is the “carry over/borrow” stuff. A “carry over” happens when You do either X = X + n or X = X - n with such an n that the mathematically correct result cannot fit in the number of bits reserved for X. In particular it happens when You do something like:

R15 = 7
R15 = R15 - 8

The result should obviously be -1. If however R15 is interpreted as an unsigned binary integer (which it is by default in all CPUs), then -1 cannot be stored in it, right? So we have a “carry over/borrow”.

Why do I mumble so much about it?

Because most processors produce so called side effects: each addition, subtraction and bit manipulation updates some side effect flags. For example in MSP430 the add #-1,R15 will update the “carry over” flag, which can be tested without any comparison:

add #-1,R15
jnc _skip
  mov #NUMBER_OF_TASKS-1, R15
_skip:

If You are even luckier You may have a CPU with a “decrement, jump if not zero” (DJNZ) instruction which performs the decrement, the test and the jump in a single operation.

Note: Of course with DJNZ You must slightly alter the base addresses used in the instructions so that a valid CTP is in 1…NUMBER_OF_TASKS instead of 0…NUMBER_OF_TASKS-1 like in the above example. The exact method strongly depends on the CPU You have, so I will not bother to get into too many details now. Just play with it Yourself.

By the way, this is why the downwards iterating loops:

for(int i = 7; --i>=0; )
for(int i = 8; --i!=0; )

are more compact and faster than the upwards one:

for(int i=0;i<8;i++)

Firing it up

So the only thing left for us is to actually start the RtOS. The sequence of starting it is simple. Since everything is known at compile time, You just need to:

SP = begin_of_Task_4_stack
push Task_4_START ;address_of_first_instruction_in_task_4
SP= SP + x        ;fake pushing "called save" registers
TaskTable[4].SP = SP
SP = begin_of_Task_3_stack
push Task_3_START
SP = ....
....
SP = begin_of_TASK_0_stack
CTP = 0           ; so that yield() would know it.
PC = Task_0_START ; jump to first task.

Task_0_START:
 ... initialize
 call yield();
 for(;;)
 {
  ...
  call yield();
 }

Task_1_START:
 ... exactly as in task 0.

What have we just done? We simulated the effects of yield() as if it had been called from Task_4 down to Task_1, and then we just entered Task_0 with a plain, regular jump.

Notice there is no need to actually call the task entry code, because the fact that our design is static in terms of task switching implies that the number of tasks is fixed, which implies that no task ever terminates.
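In C-ish form the same start-up could be sketched like this (a sketch only: it assumes the upward-growing stack used in these examples, my own helper names, and that converting a function pointer to an integer works on Your target; a real kernel would also point the hardware SP at task 0’s stack):

#include <stdint.h>

#define NUMBER_OF_TASKS   5
#define CALLED_SAVE_COUNT 8          /* how many registers yield() pops */
#define STACK_WORDS       64

typedef void (*task_entry_t)(void);
typedef struct { uintptr_t saved_sp; } Ttask_state;

extern Ttask_state TaskTable[NUMBER_OF_TASKS];
extern uint8_t CTP;
static uintptr_t stack[NUMBER_OF_TASKS][STACK_WORDS];

/* Build each task stack exactly as yield() would have left it: the entry
   address where the return address would be, plus room for the "called
   save" registers which the first yield() will pop as garbage. */
void rtos_start(const task_entry_t entry[NUMBER_OF_TASKS])
{
    for (int t = NUMBER_OF_TASKS; --t >= 1; )
    {
        uintptr_t *sp = &stack[t][0];
        *sp++ = (uintptr_t)entry[t];  /* fake "push PC"                     */
        sp += CALLED_SAVE_COUNT;      /* fake pushing called-save registers */
        TaskTable[t].saved_sp = (uintptr_t)sp;
    }
    CTP = 0;
    entry[0]();                       /* enter task 0; it never returns */
}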

Summary

And this is it. This is a full cooperative operating system running a priority-less round robin algorithm.

Of course it still lacks some elementary functionality, which I will show You in the next blog entry, but this is the operating system.

Laughably complex, isn’t it?

RtOS – implementing it: be cooperative

In a previous part I have shown You how little a task switch differs from an interrupt. In this part I would like to show You how a preemptive, interrupt-like task switch differs from a cooperative one.

The primary difference is…

… in the fact that a preemptive task switch is basically an interrupt, while a cooperative one is a call to the yield() subroutine. What could that subroutine look like? Well… maybe like this:

subroutine yield()
  push R0  ; store first register
  ....
  push R15 ; store last register
  save SP of current task to some variable
  ....
  load SP for next task from some variable
  pop R15
  ...
  pop R0
  pop PC  ; an equivalent of return from subroutine.

and it is used just like that:

call yield() ; which is in fact:
             ; push PC
             ; PC = yield()

Note: Do You remember what PC and SP are? PC is the Program Counter and SP is the Stack Pointer. Please refer to the previous part for an explanation.

Is there anything wrong with it?

No. It is a fine cooperative task switch.

The only problem with it is that it is unnecessarily expensive.

Calling convention

The “calling convention” is an agreement made between Yourself and Yourself or between Your C-compiler and itself about how parameters are passed to subroutines and how registers and other CPU resources are used.

For example the “interrupt calling convention” says:

  1. No parameters are passed.
  2. No registers may be changed by called subroutine.
  3. No stack content may be changed by called subroutine.

Our yield() subroutine takes no parameters and returns no value, so we don’t care about passing parameters. So what would a calling convention tell us?

For an example it may be like that:

  1. Registers R12…R15 can be freely changed by the called subroutine and the caller may make no assumptions about their content.
  2. Registers R0…R11 may not be changed by the called subroutine; at the return from the subroutine they must have the same value as before the call.
  3. The content of the stack may not be changed.

Caller save, called save…

The registers listed in the first point of that convention are the so called “caller save” registers. If the code which calls a certain subroutine x() would like to preserve their content, it must save them by itself, while the x() subroutine may make any use of them:

 ... some code
 push R12
 push R13
 push R14
 push R15
   call x()
   ; R12... R15 can be virtually anything.
 pop R15
 ....
 pop R12
 ...

subroutine x()
  R15 = 10    ; no need to preserve it.
  R14 = R15+5
  return

On the contrary, the registers listed in point two of the calling convention are named “called save” registers (I used to use the name “callee save” until I figured out that I can’t spell it right in English). The code which calls a subroutine x() may assume that they are not changed, but if x() needs to use them it is up to x() to preserve them.

R10 = 5
call x()
; R10 is still 5

subroutine x()
 push R10
  R10 = ....
 pop R10
 return

Why such a strange calling convention?

Because it is efficient.

If You have a CPU like MSP430 or ARM, which has plenty of registers, You will usually end up with an even weirder convention, like for example:

  1. Registers R0…R4 are reserved for interrupts only. No main code may use them, but interrupts may use them without a need of saving anything on stack.
  2. Registers R5…R11 are “called save”.
  3. Registers R12…R15 are “caller save”.
  4. Stack cannot be changed.

This type of calling convention allows super fast interrupt routines. In my experience four registers are usually enough for what happens within interrupts, and the fact that You do not have to save them on the stack allows You to:

  • save some microseconds on interrupt entry/leave code;
  • put a less load on stack or avoid stack switching.

Remember, if You need to save registers on the stack there must be enough space there. And since an interrupt happens roughly at random, it means that at each and every moment You need to have enough free space on the stack. Or else switch to a stack dedicated to interrupts. I will return to this in subsequent blog posts.

The four caller save registers are usually used to pass arguments and return values. I have also observed that in code at the assembly level You usually need something which can be called scratch-pad space: registers which You use to compute something and then throw away. Four registers is a good guess, and since they are often thrown away there is no point in preserving them.

And if You are a king of an assembler…

… then You may define a dedicated calling convention for yield(). I usually used this one:

  1. All registers may be changed by called subroutine.
  2. The content of stack must be preserved.

In this calling convention no register is preserved, but I found out that it is usually a very good convention. In most cases I called yield() in places where I had finished one part of the work and was preparing to start another one, so there was nothing which needed to be preserved.

Why should I care about calling convention anyway?

Because if You either obey it or force Your compiler to obey it, then yield(), and thus a task switch, can be used as a plain, regular function call. Like, for example, in pseudo-C:

extern void yield();
void task1()
{
  for(;;)
  {
    ...
    yield();
  }
}

How does calling convention impact task switch?

Directly and in a simple way: You need to save just the “called save” registers.

Let us compare the first calling convention and the one which is used by kings of assembler:

subroutine yield()
 push R0
 push R1
 ....
 push R11
 save SP somewhere 
 load SP from somewhere
 pop R11
 ...
 pop R1
 pop R0
 return
subroutine yield()
 save SP somewhere
 load SP from somewhere
 return

The left yield() will need 13 elements on each task stack (12 registers + PC) while the right one needs just a single element per task stack. It does matter on memory constrained devices, because this value must be multiplied by the number of tasks.

And how much would a preemptive switch need?

17 stack elements. 16 registers + PC.

I dare say it is worth considering the calling convention and learning how to inform Your C compiler about the custom calling convention of Your yield() routine.
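One way to get the “kings of assembler” convention without a true custom calling convention is to hide the call behind an inline assembly statement with a full clobber list (a sketch assuming a GCC-like compiler and MSP430 register names; adapt both to Your toolchain):

extern void yield(void);  /* the assembly routine; saves nothing */

static inline void yield_all_clobbered(void)
{
    /* telling the compiler every register may change across the call
       forces it to keep nothing alive in registers, which is exactly
       the "no register is preserved" convention */
    __asm__ volatile ("call #yield"
                      : /* no outputs */
                      : /* no inputs  */
                      : "r4", "r5", "r6", "r7", "r8", "r9", "r10",
                        "r11", "r12", "r13", "r14", "r15", "memory");
}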

Summary

In this blog entry You learned what the so called “calling convention” is and how it impacts the cooperative task switch. You might also have noticed that if You can force a certain calling convention for the yield() routine, then the cooperative task switch may be extremely lightweight.

In the next blog entry I will show You how exactly a complete, but still limited, RtOS kernel looks. And I assure You, You will be surprised how tiny it is.

“Ask not what your country can do for you – ask what you can do for your country”. But why?

The famous saying of JFK, “(…)ask not what your country can do for you – ask what you can do for your country(…)”, is the epitome of modern patriotism. We have similar sayings in Poland. There is plenty of “fight for your fatherland“, “defend Your country”, “fight for the freedom of Your nation”. All this comes under what I would call “battling patriotism”.

But honestly, why?

What is the actual value of patriotism?

Because it is good? Because it should be done? Because it is worth it?

I always feel strange when I hear words which fall into the category of “do it for a greater good“.

Quis custodiet ipsos custodes?

(who is watching the guards?)

If You believe in God, whichever one…. well… maybe except for Satanists, then the definition of what is “a greater good” is obvious: if God said it is good, then it is good. Logically speaking there is no space for doubts. In my opinion, if You have doubts, then You are just pretending to believe in God.

But if, as I myself do, You do not believe in the existence of an all-good and all-knowing source, then You must ask Yourself the question: “why is it good“? What defines good? What can be used to tell good apart from evil?

And, sooner or later, You will ask: “who defines it“? Why should I trust that person or that society who defined it? What was the source of their convictions? Why did they say it is good?

Returning to the subject: after listening to JFK’s saying, You would ask: “why should I think that patriotism is any good“? What is the value in killing and dying for Your country? What actually would I protect that way, and what would I destroy in the process?

If You made a poll around You, I suppose that about 75% of people would call themselves patriots and about 90% would agree with the saying that “being a patriot is good“. Especially in Poland, where we, by tradition, highly value all the sacrifices made by countless people for Poland.

But may I kindly ask: “Why?”

Why is patriotism good?

Value of patriotism

Hmmph…. I can’t see it. But maybe I am just shortsighted. Which in fact I am, medically that is.

I can see value in being good to others, because if we are good to each other then life is easier. In general, that is. Sure, if You rob and pillage You can have much, much more wealth and fun, but if the whole society is like You, You will always have to be on guard. If You don’t like risk, if You can’t keep Your eyes open 24 hours a day, 7 days a week, then being good to each other pays off. The prevailing good works as a kind of automatic and cost effective insurance policy.

I can also see value in uniting and fighting against robbers and pillagers, because if we did not do it, they would kill or harm people who are dear to us.

I can even see value in defending with my life the land and other wealth, as far as they are the sole means of making a living for those dear to me. Because if their means of sustaining themselves were taken away, they would suffer, right?

And I can see that this last observation was the building ground for patriotism.

But I can’t see the reason for it now, as it is. I can’t see the value in killing those who crossed the borders. I can’t see the value in letting myself be killed or harmed just because otherwise the next government would be of a different nationality. Especially because if You trace down Your ancestry You may be surprised how many nations took part in Your breeding. In my case there were some ancestors from the east, some from Germany, some from Prussia, even some from the Netherlands. All within the last five centuries. All Christians, I must admit, but some of the Roman church and some from the reformed churches. Which could be quite a good reason for killing Your uncle, if You look at what has happened during the history of Europe.

So why? Why is being a patriot good? Why is getting killed for Your country good?

I do not understand it.

Do You?

RtOS – implementing it, step 1. Task switch.

Ok, as I promised in that post, it is now time to start explaining how to implement the cooperative RtOS kernel.

This may sound like a bit of a paradox, but it will be easier to start by explaining how to build a preemptive kernel.

So let us get started….

… from interrupts.

What is an interrupt? Basically it is a hardware signal which makes the processor alter the flow of program execution like this:

[figure: execution flow with interrupts]

Basically the main program executes and then, at a certain moment, which should actually be considered fully random, an interrupt happens. In response to the interrupt signal the program execution “jumps” to a place fixed by hardware, known as the “interrupt handler routine”, and after reaching an instruction indicating the “end of interrupt handler routine” it jumps back to the main program.

The beauty of it is in the fact that the interrupt is (or rather should be, if well done) completely transparent to the main program. If the main program is not actively looking for the effects of the interrupt handler routine execution, it will never know that an interrupt has happened.

But why do I talk about interrupts?

Because we can imagine that they would do something slightly different:

[figure: an interrupt which performs a task switch]

This time, at the interrupt, program execution jumps to the interrupt handler routine, but instead of jumping back to the original place it jumps to another place. Now, since interrupts are designed to be transparent, Task A, when it finally starts to run again after the blue line returns to it, won’t have the blindest idea that after the interrupt finished a piece of Task B was executed.

How interrupts are made transparent?

To answer this question we need to define some ideas about a “processor model” (CPU model).

What is the absolute minimum of data resources, except the program and data memory of course, which defines the CPU?

The way the CPU knows which instruction it is actually executing. The so called Program Counter (PC).

The Program Counter

The program counter (PC) is a hardware register which contains the address of the instruction in memory which is now executed by the CPU (or is next to be, depending on the design).

In fact, each time the processor executes any instruction it performs the following operations:

instruction register = load from memory at address from PC 
PC = PC + 1 <-- so that it points to next instruction now
execute instruction stored in instruction register
if result tells me to jump to certain address do
  {
    PC = that address
  }

So the absolute minimum of what must happen during an interrupt is:

store PC somewhere
PC = address of interrupt handler routine

and when the interrupt service routine terminates:

 PC = somewhere

Hmmph…. isn’t it just a subroutine call?

You are right. Exactly. An interrupt is just a hardware injected subroutine call.

Registers

The primary difference between a subroutine and an interrupt is that the subroutine’s effects are not transparent. In fact the whole idea of a subroutine is to produce some visible results. So how can the forced call to the interrupt handler routine be transparent?

For that we need to define in our model the remaining part of the “state of CPU“.

Nowadays most CPUs are register based and most of them have a single, hardware, non-movable set of registers. I will skip the stack based machines, the machines with paged register banks etc. and focus just on plain register based machines, like my beloved MSP430.

Ok, so what except the PC defines the CPU state?

The registers. Registers are like the Program Counter: just a bunch of hardware flip-flops arranged in such a way that they can be manipulated an order of magnitude faster than memory.

Let us for simplicity say that our fictional CPU has 16 registers named from R0 to R15. And, of course, the PC register.

Now we can create a difference between interrupt and a subroutine call:

somewhere_PC = PC
PC = interrupt handler routine
somewhere_R0 = R0
somewhere_R1 = R1
...
somewhere_R15 = R15
.... and when we return from interrupt
R15 = somewhere_R15
...
R0 = somewhere_R0
PC = somewhere_PC

Since PC, R0…R15 fully define the state of the CPU, when we restore them the program executes as if nothing had changed.

But what is that somewhere where we are saving them to?

Stack

Again, as with the concept of register machines, I will now focus on a CPU which has a “software stack“. Again, like my beloved MSP430.

Of course there are other models on the market, like the hardware stack based PIC16/17/18 or the “link register” based ARM7 architecture. Let us not complicate things now and focus on the primary example.

A “stack” is a hardware supported data structure which can be used to “push” some value onto it and to “pop” some value from it. Just as if it were a stack of papers on a desk. The stack always serves two purposes:

  • to handle subroutine calls;
  • to handle temporary data storage beyond the registers pool.

Control stack

Handling subroutine calls is the task which belongs to the “control stack“. If You imagine that Your CPU has a call X instruction which is used to jump to a subroutine and a return instruction which is used to jump back, they will be implemented as the following sequences of operations:

call X:
{
  push on control stack PC+1
  PC = X
}
return:
{
  PC = pop from control stack
}

Data stack

The data stack is used just to save some data. The CPU usually does not use it beyond offering push x and pop x instructions at the assembly level.

In a so called “software stack” CPU the control stack and the data stack are the same stack.

Stack pointer

If a CPU has support for a stack then it will usually have one extra register which not only can be read and written like a normal register but is also the base for the push and pop operations. This register is called the “stack pointer” (SP) and is used like this:

push x:
{
  memory at address SP = x
  SP= SP+1
}
pop x:
{
  SP = SP -1
  x = memory at address SP
}

Note: Of course the direction of stack growth, in this example upwards, can be arbitrary. Different CPUs use different stack growth directions.

Complete interrupt execution sequence

Now let us try it out again. When the interrupt signal is accepted by the CPU, a call instruction is injected and what happens is:

push PC (using SP register)
PC = interrupt handler routine

This is the hardware action. The rest is usually executed by code in the interrupt handler routine:

push R0 (using SP register)
push R1
...
push R15

and when the return from interrupt happens

pop R15
...
pop R1
pop R0
return (equivalent of: pop PC)

Turning interrupt to task switch

Now let us get back to tasks for a moment.

A task usually does many, many operations. Plenty of them will be implemented with subroutines, which means that the call X instruction will be used frequently. And call puts data on the stack. What does it mean? Well… if we just somehow stored the registers of Task A and restored them for Task B, how would the nearest return in Task B behave? Would it return where it should? It would, but in just one case: if Task A had not pushed anything on the stack. If Task A had put anything on the stack, Task B would use it and mess things up.

So this means that each task must have its own stack.

Simply put, the registers PC, R0…R15 define the state of the CPU, but the state of a program is defined by PC, R0…R15 and the state of the stack. And the state of the stack is basically represented by the SP register.

So to actually switch a task with an interrupt we need to:

call interrupt handler routine
....
push R0 (using SP register) 
push R1 
... 
push R15
store SP of current task somewhere

restore SP of next task from somewhere else
pop R15
...
pop R1
pop R0
return

In other words, we need to switch the stack before we start the sequence of operations which constitutes the return from interrupt.

Summary

I hope that after reading this small blog entry You have grasped that the moment You are able to write an interrupt handling routine, You are only one step from creating a preemptive RtOS task switching kernel.

Does it still look very complicated?

In the next blog entry I will try to show You how an interrupt differs from a cooperative task switch and why the latter can be more lightweight.

RtOS – how do I benefit from it?

All right, so in that post I have shown You the conditions which will let You use an RtOS to Your potential benefit. But what exactly is the benefit?

Maintenance costs.

Just that. Nothing more.

Your code won’t be faster, because calling the RtOS takes some time. A tiny amount, but always. Your code won’t be smaller, because You will have to add the RtOS itself. Again, it won’t be much more, because the state machine You move into the RtOS will be simpler than without it, but some additional code will be added.

The only, but very important, benefit is that Your code will be much, much cleaner.

Are You nuts?! Getting an RtOS only to have some cleaner code?!

Well… yes?

And now a bit of a warning: I come from assembly language. I like it. And if You also code in assembly, the actual benefit from an RtOS may not be very large. But if You code in C…. well… it will be tremendous.

So let us start with an example.

Imagine You have two state machines looking approximately like that:

[figure: two state machines, “A” and “B”]

Machine “A” is a simple one. Machine “B” is intentionally a bit trickier.

Now let us see how we could implement state machine “A” in a C-like language.

int state;
void machine_A_init()
{
  ... some initialization code
  state = ZERO
};
void machine_A()
{
  switch(state)
  {
     case ZERO:
          do something
          setup for waiting
          state=WAITING_FOR_ONE
          return;
    case WAITING_FOR_ONE:
          if (waiting condition found)
          {
            state = STATE_ONE
          };
          return;
     ...
  }
}

Let us for a moment skip machine “B” and take a look at the main program loop:

void main()
{
   machine_A_init();
   machine_B_init();
   for(;;)
   {
      machine_A();
      machine_B();
   }; 
}

Doesn’t look so bad, right? We have implemented the state machine as a simple switch/case. It is relatively easy to understand, except maybe for the WAITING_FOR_ONE state. But explaining this must wait until later.

Now how about state machine “B”? Will it also be so simple?

It depends. It may be simple if we flatten the repeat loops and just make them a sequence of identical states:

void machine_B()
{
  switch(state)
  {
     case REPEAT_0_1:
             do_0(); 
             state= REPEAT_0_1_WAIT_FOR; 
             return;
     case REPEAT_0_1_WAIT_FOR:
             if (wait_condition)
             {
                state = REPEAT_0_2;
             };
             return;
     case REPEAT_0_2:    
             do_0(); 
             state= REPEAT_0_2_WAIT_FOR; 
             return;
     .....
  }
}

Now it looks nasty, but we could use a single state variable. Of course we could also do it like that:

int rep_counter;
void machine_B_init()
{
   ...
   rep_counter = 2;
};
void machine_B() 
{   
   switch(state)
   {
     case REPT0:
           do_0();
           state = REPT0_WAIT_FOR;
           return; 
     case REPT0_WAIT_FOR:
          if (wait_condition)
          {
             if (--rep_counter==0)
             {
                 state=REPT1;
                 rep_counter=4;
             }else
             {
                 state=REPT0;
             }   
          };
          return;
     case REPT1:....
   };
};

This time we have to use two state variables. Of course in this approach we will need one more variable to handle the nested loop.

Switch/case is not the only way to go. If Your CPU is efficient in indirect calls and stack handling then the state machine may also be implemented using state handler functions, like:

void (*stateA)(); 
void machine_A_init()
{
  stateA = &ZERO;
}
void ZERO()
{
   ...
   stateA = &WAITING_FOR_ONE;
};
void WAITING_FOR_ONE()
{
  ...
}
.....
void main()
{
   machine_A_init();
   machine_B_init();
   for(;;)
   {
     (*stateA)();
     (*stateB)();
   }
};

The state handler function approach may in many cases result in better performance, since there is no switch comparison each time the state machine is called. This type of handling is natural for assembler and for interrupts, but with C and interrupts care must be taken not to force a full state save/restore due to the indirect call. But we are not in an interrupt now, so it is not a problem.

What is wrong with it?

Waiting in state machines

Now it is time to explain the …WAIT_FOR states.

All right, so take a look at the main loop again. It just calls the first and then the second state machine. At each loop repetition it is, in general, unknown whether both machines are waiting for something or not. Usually they will be waiting, but it will be tricky to guess what exactly they are waiting for. Since we don’t know what kind of event they wait for, we can’t put the main loop on pause. It must loop and loop infinitely and each machine must poll its wait condition.

Of course we could think about a kind of “waiting for” information and implement it like:

void main()
{
  ...
  for(;;)
  {
     machine_A();
     machine_B();
     wait_for( machine_A_wait | machine_B_wait );
  }
}

It is doable.

Is it easy to maintain? The code of each machine is split into three blocks: init, machine loop and wait conditions. In machine “A” it is not so bad, but machine “B” needs two additional states to handle the nested loop.

My experience shows that any code which executes in an order which is not exactly the “lexical flow” of the source code text is very hard to analyze and maintain.

How would it look in a cooperative RtOS?

This time I will start with machine “B” because the effect will be visible at first glance.

void machine_B()
{
   ... initialize
   for(;;)
   {
       for(int i=0;i<2;i++)
       {
          .... do something
          setup wait conditions in RtOS
          yield();
       }
       for(int i=0;i<4;i++)
       {
          .... do something
          setup wait conditions in RtOS
          yield();
          for(int j=0;j<2;j++)
          {
               .... do something
               setup wait conditions in RtOS
               yield();
          }
       }
   }
}

The yield() is a call to a subroutine with which the task tells the RtOS: “I don’t need the CPU anymore, You can take it and let other tasks run. Awake me again when the waiting conditions are met and then return from yield().”

The main loop, in the case of an embedded program where tasks are set up at compile time, just jumps to the RtOS kernel loop. If tasks are dynamically allocated it might however look like:

void main()
{
  rtos_add_task(&machine_A,stack_size_of_machine_A);
  rtos_add_task(&machine_B,stack_size_of_machine_B);
  rtos_run();
}

Can You see the difference?

Now please compare how state machine “B” looks when implemented with switch/case and how it looks when implemented with the help of a cooperative RtOS. If You can’t see the difference then…. well…. You must be nuts.

Just kidding.

Simply imagine that this is not Your code. Somebody else at Your company wrote it. And imagine You have been given the job of fixing something in state machine “B”. Take a look at both implementations again and decide which of them You would like to play with: the switch/case one or the cooperative RtOS one.

But the RtOS is soooo complex and expensive!

Yes, yes, I know. An operating system must be complex and You will have to buy it, right? Adding such complexity to an 8-bit microcontroller is silly and it will simply not fit in its limited resources, right?

Is that what You think?

If it is, You are wrong.

A cooperative RtOS is very, very simple. In assembly the kernel itself is usually around 100 instructions. You can write one Yourself in a day or two. Well… maybe a week if You are doing it for the first time and have a nasty C compiler to deal with.
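If You would like to get a feel for it without any hardware, here is a toy version which runs on a regular PC (a sketch only: it uses the POSIX ucontext API, which is heavyweight and deprecated, but it demonstrates the cooperative round robin idea in a few dozen lines; all names are mine):

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)
#define NUMBER_OF_TASKS 2

static ucontext_t ctx[NUMBER_OF_TASKS];
static int CTP;                      /* current task pointer */

/* cooperative, priority-less round robin yield */
static void yield(void)
{
    int prev = CTP;
    CTP = (CTP + 1) % NUMBER_OF_TASKS;
    swapcontext(&ctx[prev], &ctx[CTP]);
}

static void machine_A(void)
{
    for (int i = 0; ; i++) { printf("A: step %d\n", i); yield(); }
}

static void machine_B(void)
{
    for (int i = 0; ; i++)
    {
        printf("B: step %d\n", i);
        if (i == 5) exit(0);         /* a real task would never terminate */
        yield();
    }
}

static void add_task(int n, void (*entry)(void))
{
    getcontext(&ctx[n]);             /* template for makecontext */
    ctx[n].uc_stack.ss_sp   = malloc(STACK_SIZE);
    ctx[n].uc_stack.ss_size = STACK_SIZE;
    ctx[n].uc_link          = NULL;  /* our tasks never return */
    makecontext(&ctx[n], entry, 0);
}

int main(void)
{
    add_task(1, machine_B);          /* prepare task 1 on its own stack */
    CTP = 0;                         /* task 0 runs on the main stack;  */
    machine_A();                     /* its context is saved by yield() */
}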

Interested in it? Would You like to know how to make one? Let me know, or wait for the next part of the blog.

Summary

After reading this blog entry You should know where the true benefit of the cooperative RtOS is. It is in code clarity. Clear, easy to read and understand code is easier to write and maintain. And code which is easy to write and easy to maintain is usually better in terms of quality and robustness.

It is also worth mentioning that it means less time is spent on the code, and less time spent means less money spent. Which doesn’t matter if the company You work for is not sharing its profits with the employees, but does matter if You own the company.

Worse than irrational…

In “Freedom is not for dummies” You could read about rational and irrational thinking. You could read my opinion about how troublesome it is to think rationally and what, in my opinion, the nature of irrational thinking is.

Now it is time to consider the worst possible way of using our brains. Something far more sinister than irrationality, which, as I hope I have shown, works on a set of well trained automatic rules. It is sinister because it disguises itself as rational thinking and takes all the favors of rationality while in fact having nothing in common with it.

What is this demon in disguise?

The wishful thinking

Exactly.

But what is it, really?

Well….

This is a pseudo-rational method of thinking in which You start from “I need to reach such and such a goal”. Then You select the first method You come up with and try to prove it by finding a positive example.

Let us try it out.

The goal is to stop the pandemic. This is a good goal, worth the effort. The pandemic spreads from person to person. So if we stop people from contacting each other we should stop the spread, right? It is rational and logical, isn’t it?

But we are not fools, so we try to find a proof. Making a proof that isolation stops the spread of the pandemic is extremely easy. We just need to take an ill person (call them “person A”) and a person who caught the illness from them. This is a proof that the pandemic spreads through contact. Then we need to find another person who did not have any contact with “person A” and is not ill. There was no contact, so there was no transmission.

Logical and proven, so we can start implementing it on a wide, national scale.

Why is the above proof incorrect?

First, because the real world is not zero-one logic. There is always a slight chance that something will go wrong. Due to that we can’t have ideal isolation. We can’t lock all people in their homes for more than two to five days and not expect an explosion of deaths due to thirst, starvation, cold or fire. Energy must be supplied to houses, water must be pumped, garbage must be collected and in winter heat must be delivered. Of those, energy, heat and drainage cannot be stored in households and must be continuously delivered by systems which may break if not maintained.

Of course we could think about preparing fuel for electric generators, coal for heat and cesspits in village houses, but it is rather impossible for dense cities with multi-storey, high-density apartment buildings.

This means we can’t get 100% isolation for any time longer than a day… well… I suppose that on a national scale, in a medium country of a mere 40 million citizens, we could have an 8-minute total lock-down before the first death would occur. Eight minutes is, as far as I remember, the time for paramedics to get to a person in cardiac arrest. Anything above this time and the probability of lethal brain damage grows exponentially.

And when talking about exponential growth…

Any biological growth is exponential. You have one bacterium and within a certain time it splits into two. Then, after the same time, each of those two also splits. Within 16 cycles one bacterium becomes 65 thousand. In 32 cycles, about 4×10⁹. Of course it won’t work exactly like that, because this rapid growth is limited by the available resources. If the multiplication is too rapid then there will soon not be enough room to grow in. Not to mention food and energy.
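
If You don’t trust my arithmetic, a few lines of C will confirm it (pure doubling, ignoring the resource limits I just mentioned):

#include <stdio.h>

int main(void)
{
    double population = 1.0;                  /* one bacterium           */
    for (int cycle = 1; cycle <= 32; cycle++)
    {
        population *= 2.0;                    /* every individual splits */
        if (cycle == 16 || cycle == 32)
            printf("cycle %d: %.0f\n", cycle, population);
    }
    return 0;   /* prints 65536 and 4294967296 */
}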

A very similar rule applies to a pandemic spreading through contact. The rule is not that simple, but it can easily be observed that a single sick person who merely contacts two other persons the next day may initiate a cascade which, if not bound by resources, will within 32 days infect the whole world.

Again, it won’t work exactly like that, due to restricted resources and restricted mobility, which result in a growing ratio of contacts between already ill persons. The more persons are infected, the smaller the chance that the next contact will infect someone who is not yet ill.

So it will be exponential-like, but not as steep as the simplest model shows.

However, it does show that any non-100% lock-down will not stop the pandemic. A 100% lock-down would stop it, but it is either impossible or would be as deadly as the pandemic it is supposed to stop. All because a single ill person who slipped through the lock-down may start a cascade which will very soon restart the pandemic. A lock-down may just slow the pandemic down, but the tighter it is, the more people will be harmed or even killed by its side effects. And a proper balance, as we all saw in 2020…2022, is not easy to find.

Homework: try to figure out why the theory that mass vaccination with a single vaccine with 99.0% efficiency at preventing infection in a single contact will kill the pandemic is also “wishful thinking”.

A second homework: try to figure out the real rationale behind the lock-down. Why was the non-100%-efficient lock-down necessary and what did it actually prevent?

What have we just done?

We have just made a step beyond wishful thinking: we have found a “negative example”.

You should be aware of the simple fact that even tens of thousands of positive examples do not prove a theory, but a single negative example invalidates it as a whole. It does not make the theory completely useless, because it works in some cases, as shown by the positive examples, but it disproves its generic application and generic correctness. The theory is no longer a universal rule. Instead it becomes a rule which is applicable only to a rather narrow set of boundary conditions.

Note: in one of the books on economy I have read, the preface said: “This work is written with the assumption that every player on the market has equal access to the market and is equally informed”. This was written in the preface, which usually contains some “thank yous” and other things nobody reads. Then the entire few-hundred-page book unraveled the whole theory… which is not applicable beyond those stated assumptions. Well… the entire idea behind the most lucrative trade is all about non-equal access and non-equal information, so guess how it applies to the real world.

From the wide acceptance of those theories in the literature I assume that in fact nobody reads the preface.

This is what I call a “narrow set of boundary conditions”.

Wishful versus rational

The wishful thinking method focuses on finding positive examples. From those examples it constructs the generalization that the theory is correct for all possible cases. The process of finding those examples and of analyzing them may be very scientific, accurate and in fact rational. Thus it really looks like a true, correct method, and at first glance a theory proven that way looks like a very rational one.

It just looks like it. Nothing more.

The rational method of thinking, after finding a few positive examples, makes two further attempts:

  • first it tries to find a negative example, something which disproves the theory. If one is found, the theory is a goner;
  • second, if a negative example is not found, it tries to apply the theory to extrapolation. If the theory is correct it should not just explain the observed facts, but should also predict what will be observed if something new happens. This is typical for physics and experimental science. A theory which explains experiments already done should also correctly predict the results of experiments not yet done. If it does not, it is not a theory but just data interpolation.

Only a theory which has passed those two tests should be considered proven by rational thinking.

Summary

You should now understand what “wishful thinking” is when it comes to finding solutions to real-life problems or formulating scientific theories. You should also be aware of how close it is to true rationality and yet how dangerous it is.

A decision made with the help of “wishful thinking” may look like a very rational, well-thought-out and really, really good solution while in fact it is dangerous and harmful crap.

Please, whenever You try to design something, to make a law, or whatever, stop for at least a moment and spare a single thought about how to break or crack it. It will save money and sometimes even lives.

Politicians’ tools of trade

I am slowly becoming sick of all those people who complain that our Prime Minister said this and that, or that the Health Ministry announced something but then something totally different was done.

Oh my, politicians are lying!

Oh my, Mrs. President said something so stupid…

Oh my, this bill proposal is so dumb, how anyone can be such an idiot….

Stop complaining, start thinking, because nothing happens without a reason.

The hammer is a blacksmith’s tool

When the blacksmith uses the hammer and the anvil, nobody complains. This is how he does his job. Those are his tools, so he has every right to use them.

The job of a blacksmith is to bend metal.

What is the job of a politician?

To bend people.

That’s right. To bend people. To make them behave in a way the politician likes.

All right, so this is the politician’s job. But what are his tools?

Talking is the politician’s tool

This may be a bit of a surprise because we, at least in post-communist countries, are used to thinking that the proper tools for bending people are the secret police, obedient courts, the prison system, government-controlled media and, as the last resort, northern Siberia.

Those are in fact very crude tools. And what is more important, they have to be driven by some means. What can a simple, plain politician do to control the secret police? Use force? But how, with just two weak hands? Use money? Right… look at the budget of the national police and compare it with his available resources…

No. Nothing like that.

The politician must convince the members of the police, the courts and so on to do what he would like them to do. Convince. Not force. Not because forcing is ethically incorrect, but because forcing people is expensive and must be continuously repeated, while once You convince them, they will control themselves by their own means for a long time.

People really cried when Stalin died. Those were not tears of happiness. They really cried over the death of the person who created and managed the system which killed millions. Bah, let us put killing aside; it is nothing uncommon, people have been dying in civil wars since the dawn of time. He did something even more sinister. He invented the crime of “being the wife of an enemy of the people”. He was the person who made them obey the rule that if such an enemy who had a child was sentenced to death, then this child was automatically also sentenced to death. With the exception that the actual killing waited until the kid came of age.

You can’t force people to be like that. You have to make them honestly believe that this is right. Do it correctly and they will love You and cry after You pass away.

How to convince people?

By communicating with them.

In simpler terms: by talking.

All right, You say, so the politician needs to tell me what he is trying to achieve so that I can actually be convinced to help him, right? He needs to explain what he wants and…

By God, no!

The politician must say something which will make You behave according to his will. Nothing less and nothing more.

Let’s try it with an example

Imagine that the Prime Minister said something like: “Next year we will be giving some money to small business”.

You have heard it. But what did he actually say?

Isn’t it obvious: he said that the government will give some money to small business.

Well… yes. Those were the words.

But he did not say it to communicate to You that the government will actually give that money. No. He said it just because he wanted You to hear it.

And act on what You have heard. Act as if You believed that You will get that money.

This is so simple.

Technical mind versus political mind

Let us try it in tabular form. Imagine that a certain operation is to be done. How would a technical mind do it? How would a political mind do it?

Technical mind | Political mind
Planning technical details: what and how to do it? | Planning managing details: what and with whom to do it?
Thinking about how to explain the plan to people so that they can understand and follow it. | Thinking about what to tell people to create in them the will to follow the leader.
Gathering support through a thorough understanding of the plan by the people. | Gathering support by touching people’s emotions and wishes.
Executing the plan by people. | Executing the plan using people.

The technical mind uses communication to transfer information. The technical details of the communicated plan play a critical part in convincing people, but the form of communication is given little concern. There is a belief that if people understand the plan, they will accept it and follow it.

The political mind uses communication to motivate people to follow the plan. The form of communication plays a critical part in convincing people, but the actual detailed information passed to people is thought to have very little effect. Thus the politician is not concerned with technical details. There is a belief that if people like the plan, they will follow the leader without the need for detailed understanding.

Efficiency

Both the technical and the political mind suck at efficiency.

Technical thinking does create self-regulating systems which can perform the requested plan, but it fails to gather support due to the high cost required from people to actually read and understand the plan. Even excellent plans may be rejected because the effort required to read them is too large.

Execution is smooth because even if top-level management fails, the bottom level knows the plan well enough to act correctly. However, detailed long-term planning has an inherent problem with adapting to changes.

Political thinking can quickly inspire people to act and gather the necessary support, but the vague communication and the lack of a distributed, detailed plan result in crappy execution. People will act as if, in their own understanding, they were following the plan, but due to the lack of efficient communication they will actually execute a slightly different plan.

Execution quickly breaks down if top-level management has no detailed plan and well-working logistics. Tight top-down control is required to ensure coherent execution of the plan. Information about minor execution problems and progress details must flow up to the top, because only the top level knows the technical details of the plan. However, contrary to technical planning, the small top-level management team and the roughly sketched plan are flexible and easily adapt to a changing situation.

This is the same clash as between rational and irrational thinking.

Ethics

The technical method is based on “common goals”. The political method is based on a “greater good”. It is up to You to answer the question of which of them is more ethical and which is not.

Summary

The same way a blacksmith smashes a piece of metal with a hammer, the politician throws words at people. Words are just his tools for bending people to his will. If he says that an apple is green it does not mean that he believes it. It does not even mean that he is trying to convince You that apples are green. No. He is doing it so that You will hear it and act as if apples were green.

So don’t be surprised that his words are far from the truth. Those words are just the tools he is using. Instead of just listening to what he said, think about why he said it. He never opens his mouth without a good reason. Know this, and You won’t be made a fool of any more.

Making fools of fools

In one of my previous posts You might have read that irrational thinking is not as badly stupid as it looks. In the current post I will try to show You how easy it is to make it look even dumber.

Irrational thinking works on automatic rules

Previously I compared rational, conscious thinking to rule-based machine learning. I also placed opposite to it the “geometric” systems like neural networks. I will still hold on to that.

What is the problem with the “geometric”, neural-network-like way of thinking?

I suppose it is the fact that such a methodology creates “clustering rules” from everything that is around it and later uses them to make a choice. Even facts which, using cold machine rationality, have nothing to do with the decision to be made are built into them. This type of learning has no ability to distinguish important observations from things which just happened to happen. For example, if every time You got hit on a buttock by Your parent a neighbor’s dog was barking, this barking would get into Your “clustering rule”. Next time You hear a dog barking You will automatically look behind You to check whether to cover Your ass.

Of course it is not that simple, but if observation A frequently happens right before observation B, then a rule saying that A is the cause of B will be built into Your irrational mind.

Is that bad?

No. It is a very efficient way of learning and a good, reasonable method of thinking. Fast, inexpensive and usually accurate enough.

Provided we are aware of how it works.

Local observation == good local decision

Now comes the tricky part.

The set of “clustering rules” You have in Your own mind is built from Your experience. It may be direct experience or indirect experience from tales, stories, books and so on. Regardless of how You gathered them, they bind together two things:

  • Your entire experience, and;
  • Your entire current set of input observations.

What does it mean?

That even if You are going to say something about pregnancy and abortion, the weather outside the window takes part in the decision-making process.

Silly but true. The irrational, “geometric” process of thinking is, what do You call it… “holistic”? Taking everything into account. This is a good and fast process and it is in fact quite rational. Good weather indicates good chances of having food and a warm shelter for kids, so it is natural to be more inclined to have them than during an ice storm.

There is however one case in which this method of thinking fails miserably. It is politics.

Local observation != good global decision

In past eras, especially around tribal society in Europe, politics was organized in a localized and cascaded form. A local village meeting (a circle? I don’t know the proper English word here) was used to make local decisions and elect delegates for higher-level meetings. This cascaded upwards to something we would now call a “country scale”.

In this process the village members were all using their own “clustering rules”. Those rules differed from person to person, but the current set of stimuli was the same for all. Thus everyone was making a decision based on the same set of conscious information and the same set of subconscious stimuli. There was little room for side effects which could influence the active set of “clustering rules”. We may even dare to say that the entire population of a “village” was just one large, well-trained set of “clustering rules” applied to the same question under the same conditions.

Please remember that most of those “conditions” are ones we are not consciously aware of.

The current political system has moved far away from this cascade. We like “direct democracy” more and more.

And this is a big problem for our irrational, “geometric” intelligence.

Why?

Because in a country-wide referendum or election we all act in an irrational way, all using our own sets of subconscious “clustering rules” but, very differently from village meetings, under totally different sets of external stimuli. Some of us have a sunny day, some a rainy one. Some are cold, some hot. Some are just tired, some freshly out of a healthy nap. If You mash up all the rules of all citizens, You theoretically get a great intellect. But if You stimulate them with a lot of contradicting side-band information, You get garbage.

Summary

I am pretty sure I did not make myself clear in this post. Well… my technical English is not the best for that kind of stuff. What You should take from it is the observation that even if the information reaching our irrational minds has, rationally speaking, nothing to do with the matter, it still influences our irrational decision-making process.

The good thing is that theoretically statistics should help us: if we stimulate many minds holding many rule-sets with information to which we added random noise, that random noise should cancel itself out.

This is both true and false. True, because it does look like that happens, and false, because the noise is not always random and because noise cancellation by averaging works only in “linear systems”. Which “geometric” networks are not.

And in the end, imagine what it would be like if You were able to deliver a desired “noise” to appropriate voters. Could You influence their irrational decisions?

Sure. You could. And in fact we are already doing it.

Just think about social media.

P.S. Don’t worry too much. There is something worse than social media manipulating the stimuli of the irrational mind. It is the nightmare of “wishful thinking”.

Freedom is not for dummies?

Recently I read in the book “Lepiej już było” by Marcin Król (ISBN 978-83-7700-240-7) that there is a theory that freedom is not for the general population, because this population does not make decisions using a rational way of thinking.

So today I would like to talk a bit about it… but from a programmer’s point of view.

Machines are hyper-rational

The above theory was formulated by philosophers. More than five decades ago, I suppose. Most philosophers are not technical people. We, the people who do programming, are. I personally am doubly technical because I am not only programming but also designing machines, both electronic and mechanical, and I plan and perform some physical experiments in the area of nuclear physics. So I am well aware of how brutally rational computers are and how cruel the laws of physics are to those who plan experiments.

Before the computer era the only branches of science which tested how rationally You think were those derived from physics. You might have formulated a brilliant theory, but if the laws of nature were different, then the experiment would clearly show that all those lovely calculations were not worth the price of the paper they were scribbled on.

Now we have computers. And those are a true hell for humanists. If You write a rule, the computer will use that rule exactly as You have written it, with absolutely zero regard for what You really had in mind.

So if You would like to know what it is to think truly rationally and logically, ask us, experienced programmers. We may not think that way ourselves, but we surely communicate with hyper-rational machines every day. And believe me, it soaks through.

OK, so people are dumb because they do not use rational thinking, right? Wise people think, dummies act on emotions, right?

Rational thinking is absurdly… inaccurate

Let us do it by example.

Suppose You have to make a choice between buying a certain product, let us call it X, from either shop A or shop B. Shop A is selling it for 100PLN (PLN being some monetary unit), shop B is selling it for 115PLN.

The rational decision is, of course, to buy it from shop A for 100PLN. Since the product is the same, why pay more?

But what if shop A gives You a 1.5-year warranty, while shop B gives You a 2-year warranty with door-to-door service?

Hmmpph…

Now it stops being so clear. Will the 15PLN difference in price pay off? How does one make a rational decision about the worth of an additional half-year of warranty?

Rational decision making needs data

What do You need to know to decide how much half a year of additional warranty is worth?

First You need to know the probabilities that:

  • the product will break within the extended warranty period;
  • the failure will be covered by the warranty.

Getting even roughly accurate data about this is not easy. First You would need to know how many pieces of the same make of product X have already been sold. Then You would have to know the percentage of warranty claims…

Which is still far from enough! You would have to know a histogram of when each warranty claim was filed, counted from the moment of sale. Only this will allow You to decide what the chance is that the product will not fail in the first 1.5 years but will fail in the additional half-year.

Once You know that, You must also check what percentage of warranty claims were accepted by the producer and what percentage were rejected.

Knowing all of it will allow You to start rationally thinking about this decision!

Just to start.
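
Just to sketch where all those numbers would go (every figure below is invented purely for illustration), the extra half-year is worth roughly P(failure in the extra window) times P(claim accepted) times the product price:

#include <stdio.h>

int main(void)
{
    /* All figures invented, for illustration only. */
    double p_fail_extra = 0.05;   /* fails between year 1.5 and year 2 */
    double p_claim_ok   = 0.60;   /* producer accepts the claim        */
    double product_cost = 100.0;  /* PLN                               */
    double value = p_fail_extra * p_claim_ok * product_cost;
    printf("extra warranty worth ~%.0f PLN vs 15 PLN price difference\n", value);
    return 0;                     /* with these numbers: about 3 PLN   */
}

With those made-up numbers shop B is a bad deal. With different numbers it may be a great one. That is the whole point: You cannot know without the data.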

All right, You have squeezed that information out of the producer. And I assure You it is not an easy task. Bah! Plenty of renowned manufacturers do not even collect such information!

But imagine You have it.

Can You now make the rational decision whether to buy it for 100PLN from shop A with the 1.5-year warranty or for 115PLN from shop B with the 2-year warranty?

You need more data…

Yes, sadly, You need more data.

A warranty covers only failures due to manufacturing errors. It does not cover failures due to intensive use. Even if You knew the ratio of rejected warranty claims, You would still not know whether the use intensity You are planning is in any way close to the average use intensity behind all those claims.

If, for example, You are planning to buy a cheap drill and use it every day for serious work, You may expect it to last… how long? One of my cheap drills lasted about 100 holes in reinforced concrete. Which took me two weeks to make. The other, also cheap, lasted five years… and also no more than 100 holes. I simply used it significantly less intensively.

So You also need to know the declared endurance of the machine…

Can You find such a declaration in the manual or on the box before You buy it?

Rational decision making is expensive

I intentionally used the PLN monetary unit because it is not widely known. How much is 15PLN worth?

For me it is about 15 minutes of work or four beers.

How long would it take me to collect all the data I need? If they were printed on the product box then sure, just a few seconds. But how often do You see data like failure rate or endurance there?

If I had to ask the producer for those data it would take me… an hour? Two? Plus a week of waiting for an answer.

Assuming two hours, at 15PLN per quarter of an hour: 2 × 4 × 15PLN => 120PLN…

Nice, isn’t it?

A rational decision would cost me more than the product is worth!

Irrational is rational

We are very wrong when we think about “emotional choices”, “impulsive choices” and the like as stupidity. This is because we have at least two modes of thinking.

Machine learning

Since computers are hyper-rational, we should check how we have attempted to teach them to think. Surely we tried to teach them to think in such a way that they would make correct, wise decisions, right? We did not try to make them stupid, because they are dumb as hell out of the box.

Over the history of computing we have created two possible strategies for machine learning:

  • rule-based (Prolog and expert systems);
  • “geometric” (neural networks and genetic algorithms).

Rule-based machine learning creates… well… rules. “If something, then something”. The geometric approach just tries to represent observations as points in an N-dimensional space and clump them together to find “alike” clusters.

Rule-based machine learning is, effectively, like our science and what we used to call “rational thinking”. When rules are applied we can trace them back and show a proof: “We made this decision because of this and that…”.

The “geometric” approach just tells us “We have seen something like that before and such and such decision worked well”.

Both have pros and cons, but the most important questions are how they deal with inaccurate, untrusted data and how fast they are.

Rules are very weak with noisy, faulty, uncertain data. The set of rules also always grows, becoming more and more detailed and more and more data-hungry. So the wiser it is, the slower it works. Geometric systems, on the other hand, do “generalize” and cannot provide a pin-point accurate answer. Their ability to generalize directly depends on the assigned computing power, so they would rather answer inaccurately than slow down.

Quick decision is better than none

“Hey, there is a tiger! Should we run or hide?”

Nowadays that kind of question is rare, but it used to be crucial. And honestly, it did not matter much whether it was better to hide or to run, since both were far, far better than staying put pondering what to do.

Irrational decision-making is in fact, in my opinion, a form of “geometric” machine learning. We collect experiences, our own and other people’s, and from them we build “clusters” of events. Those clusters drive our “intuition”, “hunches”, “likes” and such.

Looking at irrational decision-making as a form of geometric machine learning, we may notice that it is fast and usually good enough, but surely inaccurate. Geometric machine learning by its nature has problems with accuracy. So an irrational decision will always be worse than the rational one… but less expensive. And due to its “made of experience” nature it will prefer old, tested solutions over something new.

In our example case the “geometric” system of rules can make a choice based on, let’s say, how nice the saleswoman is, instead of pondering product endurance, failure rates and so on. All just because at some time in the past I might have experienced something good from a nice woman.

Considering the cost of that specific rational decision and the cost of that specific product, the irrational choice will be the most rational one.

Summary

I have shown You that rational thinking may be expensive and that the irrational is not as irrational as philosophers used to think.

And since freedom is a political term, how about applying this to politics and elections?

In my country the parliament has about 640 members and we have a direct voting system in which we select members of parliament country-wide. So even if I were “The Voter” and my vote were the only one to decide that “Mrs. X enters the parliament”, even then I would have about 0.15% of influence on what the parliament decides. Since I am not “The Voter” but only one of about 20 million, my influence on the final parliament vote is… about 1×10⁻⁸%.

To make a rational (in classical terms) political choice You need information. Data. Facts. Facts, not “stories”. Politicians make their living from telling stories, so what they say carries zero factual data. The true information is hidden in “why they are telling it” and “what they are doing”. And believe me, they will do everything to hide it.

Now put on one side of the scale the real influence I can gain as a voter in a direct, country-wide election system. Then put on the other side of the scale the tens of hours I would need to spend digging out true, trustworthy data about each of the few tens of candidates I can vote for in my region. Add to that the method of vote counting, which in my country reduces the weight of my vote even more.

Check the balance.

Does making an election decision based on “Gosh, she has a really nice pair of bumpers up there” still look so dumb to You?

RtOS – when do You need it?

After reading this blog entry You know the difference between preemptive and cooperative multitasking. You might also have noticed that I am not a big fan of the preemptive one.

So now the time has come to tell You how to decide whether You need an RtOS at all.

The first thing You should take into consideration is: “Is Your program a state machine?”.

What is a “finite state machine”?

If Your program can be expressed as a sequence of actions like:

  1. Do something.
  2. Wait for something to happen.
  3. Depending on what has happened, do something and then wait for something else.
  4. …and so on, with some other things to do and some other things to wait for.

then You are using a “finite state machine”.

If this is the case then You may greatly benefit from an RtOS. If it is not, forget about it – no multitasking will help You.

Well… not always. Preemptive multitasking may let You run more jobs in a semi-parallel way, but remember that task switching is always costly. If You just need to do complex computations which are not inherently parallel (like, for example, mp3 compression, which can’t be split across multiple cores), just run them one after another on a single-core CPU, or dedicate one core to each task. In most cases it will be faster than with multitasking.
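
For reference, the classic switch/case shape of such a state machine looks more or less like the sketch below. The states, events and the wait_for_event() stub are of course invented for this illustration:

#include <stdio.h>

enum state { IDLE, WORKING, DONE };
enum event { EV_START, EV_FINISHED };

/* Stub: in a real program this would block until "something happens". */
static enum event wait_for_event(void)
{
    static int n = 0;
    return (n++ % 2 == 0) ? EV_START : EV_FINISHED;
}

int main(void)
{
    enum state s = IDLE;
    for (int i = 0; i < 6; i++)               /* a few demo iterations      */
    {
        enum event ev = wait_for_event();     /* wait for something         */
        switch (s)                            /* act depending on the state */
        {
            case IDLE:    if (ev == EV_START)    s = WORKING; break;
            case WORKING: if (ev == EV_FINISHED) s = DONE;    break;
            case DONE:    s = IDLE;                           break;
        }
        printf("event %d -> state %d\n", ev, s);
    }
    return 0;
}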

All right, my code is a state machine. What shall I do then?

Multiple state machines

The first thing to do is to count them. How many state machines do You have in Your program?

If You look at the sequence of actions I presented above, You may notice that You can show them in the form of a graph:

The blue “bubble” represents a single “state” of Your state machine and the round “Await” blocks are the places where an RtOS may come into play.

If You zoom out You may see it in a wider, simpler form:

And if You finally move away even more:

If You see something like this then You have one, two… four separate state machines. They do not have to be totally isolated: they may still have to wait for each other, but if You draw them as a graph You will see them just like that.

So I have multiple state machines…

Now divide them into three classes:

  • hell fast;
  • fast;
  • and slow;

using as the criterion the expected delay from the moment when “something happens” till the moment when the “Await” block stops waiting and the program springs into action.

This delay is the so-called “wake-up latency”.

I would consider a state machine to be “fast” if the required wake-up latency is below 1’000 instruction cycles.

“Hell fast” would be below 100 instruction cycles.

And “slow” if the wake-up latency may be anything above 1’000 instruction cycles.

“Hell fast” state machines and RtOS

The answer is simple: no. This kind of delay calls for hardware support. Or very well designed assembly programming and a CPU with multi-level hardware interrupts.

I once did something with 1μs latency on an 8MHz (125ns instruction cycle) CPU, but it was possible only because I could wire an entire DMA chain in such a way that it handled the interrupt by setting up some hardware and then re-configured itself to handle the next interrupt.

The usual CPU has an 8…100 instruction cycle effective interrupt latency. And by “effective” I mean not just the hardware latency but the true number of instructions necessary to do the housekeeping before a useful action may take place. And please remember that only a multi-level interrupt CPU can interrupt an interrupt in progress. If You are not lucky enough to have one, You must add an additional latency which, in the worst case, will be the length of the longest interrupt servicing routine.

So maybe “fast”?

The answer is also: “no”. “Fast” machines are good candidates for interrupts. 1000 cycles is plenty for most interrupt-related tasks, but still a bit tight for an RtOS.

You may think about moving “fast” machines to a preemptive RtOS, but the cost of waking up a normal thread is, from my experience, about 5 times the interrupt cost. This is because the interrupt will wake up and signal the RtOS, but then the RtOS kernel will have to decide which task to awake. This is not a high cost, but when we are counting single instructions it is a rather significant one.

In contrast, an interrupt is hard-wired, and when it happens it just starts running code at a specified location without any unnecessary deliberation. In fact, if Your state machine uses just one interrupt source and Your CPU can do a low-cost indirect jump (a jump to an address stored in a variable), then Your interrupt servicing code may be so clean and transparent that You will gain no benefit from an RtOS at all.

I mean, when You code in assembler. If You code in C or a similar higher-level language, half of the work is about forcing the compiler to behave.
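
A rough sketch of that indirect-jump trick in C, with the interrupt simulated by a plain function call (on a real MCU the body of interrupt_handler() would sit in the actual interrupt vector):

#include <stdio.h>

typedef void (*state_fn)(void);

static void state_idle(void);
static void state_busy(void);

/* The whole state of the machine: one pointer the interrupt jumps through. */
static state_fn current_state = state_idle;

static void state_idle(void) { puts("idle: starting work"); current_state = state_busy; }
static void state_busy(void) { puts("busy: work finished"); current_state = state_idle; }

static void interrupt_handler(void)
{
    current_state();               /* the low-cost indirect jump */
}

int main(void)
{
    for (int i = 0; i < 4; i++)
        interrupt_handler();       /* pretend the hardware fired four times */
    return 0;
}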

So only “slow” state machines can benefit from an RtOS?

Exactly.

At the 1000-cycle limit You may struggle to choose between the preemptive and the cooperative one. But if You move this limit towards 10’000 cycles, then the cooperative RtOS is the right choice.

In my experience, in 90% of cases You will have to deal with sub-millisecond wake-up latencies, which call for interrupts, bundled with roughly 20ms…100ms actions related to the user interface and with 1s or slower “maintenance” actions like self-tests and the like.

Those slow ones are ideal for a cooperative RtOS.

Cooperative.

Not preemptive.

Do You remember those round “Await” blobs? Those are the places where You can easily set up hardware to wait for something to happen, tell interrupts to notify the RtOS about it and tell the RtOS that You are waiting for it. Then You call yield() and voilà.

From what I could observe, in my cooperative RtOS only 1% of yield() calls had to be guarded with some barrier. It is a really rare case when You have to update a piece of shared data, then wait, and finish the update just after the wait. Only in such a case do You need to protect that data with a barrier.

If however You opt for a preemptive RtOS, then You will have to use a barrier around each and every access to shared data. Since the simplest barrier on a small CPU requires disabling interrupts for at least a few machine cycles, those barriers do harm the interrupt latency.

Note: of course, if Your CPU can do an atomic get-compare-and-set-like operation then there is no need to disable interrupts. However, as far as I know, only very few small CPUs can do it.
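
For completeness, the simplest small-CPU barrier mentioned above is just this (cli()/sei() are the AVR names from <avr/interrupt.h>; the stubs are here only so the sketch compiles on a PC too):

#ifdef __AVR__
#include <avr/interrupt.h>      /* the real cli()/sei()                 */
#else
#define cli() do {} while (0)   /* stubs so the sketch compiles on a PC */
#define sei() do {} while (0)
#endif

static volatile unsigned int shared_counter;

void increment_shared(void)
{
    cli();              /* disable interrupts: nothing can preempt us now */
    shared_counter++;   /* the read-modify-write of the shared data       */
    sei();              /* re-enable: latency was harmed only for these
                           few cycles                                     */
}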

Summary

I hope I have shown You how to check whether Your embedded project may benefit from an RtOS and what kind of RtOS is the right choice.

Later I will try to show You what exactly the benefits are and, after that, how easy it is to write an RtOS.

RTOS – multitasking?

This is the… fourth part, I suppose. You can read the previous ones there and there.

As You have probably noticed, an RTOS is not about performance. It is all about predictability. And most of all it is about predictable multitasking. The key word is predictable. Not just multitasking; it must be predictable multitasking.

Predictable, not super fast.

There are two kinds of multitasking:

  • the “preemptive” one;
  • the “cooperative” one.

Preemptive multitasking

If You are reading this page, preemptive multitasking is probably doing its work right now. This is how modern operating systems work. Each program, or more exactly each program “thread”, executes as if there were no other threads. The operating system can interrupt it at an unpredictable moment and tell the CPU to switch to executing another thread. Then that thread is interrupted, and so on, round and round.

The best thing about this type of multitasking is that even if one thread loops infinitely, it won’t stop other threads from executing.

Cooperative multitasking

You won’t see it frequently any more. You might have seen it in OS/2, DOS 4 (I’m not sure about that though; I was a young lad then and rarely used it) and up to Microsoft Windows 3.11, the last of the 16-bit Windows line.

In cooperative multitasking things look very much like in the preemptive one, but with one big exception: a thread is not interrupted until it signals that it is ready to be interrupted, usually by calling some subroutine like yield(). Only then does the operating system interrupt the thread and only then will it tell the CPU to move to another task.

Does it mean that nothing can interrupt a thread in cooperative multitasking?

Sure not. Interrupts, those regular low-level hardware ones, are still doing their job.

Why then was cooperative multitasking abandoned? Because one thread which never calls yield() can monopolize the whole machine. This is the only reason. And a good one. Especially when it comes to a PC which runs many different programs, some of very poor quality.

But does this reason still hold in an embedded environment where You, the manufacturer of the product, control the whole software environment? Think about it.

Note: there is another reason not to use cooperative multitasking today: You can’t really benefit from it if You have a multi-core CPU. But there were no such CPUs back then.

Cost of preemptive multitasking

Technically speaking there is no additional cost on the operating system side. If multitasking must be there, then almost everything must be there regardless of whether it is the preemptive or the cooperative kind.

The real cost is on the thread side.

Usually You do not design threads into Your application so that they run in total isolation. They have to exchange some data at some moments. This means that one thread may have to access data while another one is also doing so. You can easily understand that this can lead to many problems. Like if somebody were writing on the piece of paper You are just now reading.

In preemptive multitasking a thread may be interrupted at any moment, so it must actively defend against concurrent data modification. Basically, each time a thread accesses shared data it must put up a so-called barrier. This barrier is assigned to that specific data and does two things:

  • on multi-core CPUs it makes the hardware synchronize caches so that all CPUs see the same memory image;
  • it tells other threads to pause at that barrier and wait till the barrier is released.

This has to be done every time a thread touches shared data, regardless of whether task switching is taking place or not.

As You can see, this is not cheap and… well… it frequently misses the target. For example, one of my programs had been running for about two years without a problem on a single-core CPU and crashed within 15 minutes on a dual-core one.

Why? Because the single-core CPU was interrupting threads at, let’s say, a 1ms pace, while the dual-core one really ran code in parallel. On a single CPU, 1ms allows executing from 1’000 instructions on a 1MIPS low-end 8-bit CPU up to a few million on a PC. And the size of the so-called “critical section” enclosed by the above-mentioned barrier is just about 100 instructions or less.

You may notice that on a single-core CPU the cost of those barriers is almost always a pure loss, because for 99.9% or more of the time they won’t be actively contended by concurrent threads.

Does a washing machine use a dual-core CPU?

Unpredictability of multitasking

The sole fact that a thread can be interrupted at any moment adds to the unpredictability of the whole system.

But on the other hand, the forceful, periodic task switching makes sure that each task gets its share of processing power and that everything works smoothly. And, most importantly, if one thread loops…

We should stop here for a moment. Does it really do us much good if the system runs with one thread “dead”? On a PC it surely does. The user kills the program and starts it again. But does it really do any good for a washing machine?

I don’t think so.

The fact that one thread has gone haywire means that the program, as a whole, encountered some problem it was not tested for and went crazy. And in an embedded environment there is just one way to make it sane again – restart!

Note: on a PC it may look like a frozen thread is not a problem. You kill it and restart it, right? But what if this thread, for example, had shut down Your network connection? Will restarting the thread bring it back up?

Testability of multitasking

The most feared problem with multitasking is the “deadlock”. A deadlock happens when a number of threads try to pass through a number of barriers in such an order that they get stuck waiting for one another:

  • I need a pencil and a piece of paper…
  • I also need a piece of paper and a pencil…
  • …so I take a pencil…
  • …so I take a piece of paper…
  • …and I need to wait till the piece of paper is free…
  • …and I need to wait to get my hands on the pencil…

Forever.
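
The classic two-lock version of that story, sketched with POSIX threads; the usleep() calls are there only to widen the window, so the hang is nearly guaranteed:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t pencil = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t paper  = PTHREAD_MUTEX_INITIALIZER;

static void *writer_one(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&pencil);   /* ...so I take a pencil...           */
    usleep(1000);
    pthread_mutex_lock(&paper);    /* ...and wait for the piece of paper */
    puts("writer one got both");
    pthread_mutex_unlock(&paper);
    pthread_mutex_unlock(&pencil);
    return NULL;
}

static void *writer_two(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&paper);    /* ...so I take a piece of paper...   */
    usleep(1000);
    pthread_mutex_lock(&pencil);   /* ...and wait for the pencil         */
    puts("writer two got both");
    pthread_mutex_unlock(&pencil);
    pthread_mutex_unlock(&paper);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, writer_one, NULL);
    pthread_create(&b, NULL, writer_two, NULL);
    pthread_join(a, NULL);    /* with the sleeps above, these joins will */
    pthread_join(b, NULL);    /* almost certainly never return           */
    return 0;
}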

To be able to test for deadlocks You must, theoretically, make Your threads run in such a manner that each possible interruption point is used to make them interleave. And since in preemptive multitasking an interruption may take place at any moment, it is not practicable to test all the possibilities.

In most cases You will just test some combinations You expect to be problematic.

That is, if Your OS allows You to do it. Most PC operating systems do not allow You any control over task switching, so testing for deadlocks on a PC is more like running stress tests (repeating a semi-random test many times) than like running a well-designed unit test. And stress tests are hardly predictable and hardly repeatable. If they fail, it is a hell of a job to figure out under exactly what conditions the test failed.

Summary

It is a kind of paradox, but when considering predictability, preemptive multitasking does not look very good. Sure, it makes things work smoothly. Sure, it prevents system lock-up. All true.

But it injects a handful of randomness and makes systems harder to test.

The next part of the blog will be about deciding when You need an RtOS.

Is digital signature inherently flawed?

I am a technical guy. And as such I know that everything can and will break sooner or later. The same goes for digital signatures (RSA keys, GPG and the like).

It can’t be hacked!

Wherever I try to read about digital signature security, all I can find is how secure it is. It is secure because it is so hard to figure out the private key from a signed document and a public key. Really. It would take hundreds of years!

So it is secure, right?

Well…

Hand written signatures

Hand-written signatures were also very well protected. That is, until some idiots at banks figured out that they could make us sign documents on electronic tablets.

A hand-written signature carries a lot of hidden information. The pressure of the hand, the speed of writing, the dynamics. All are person-specific and very, very hard to counterfeit. The primary problem is that in-depth validation of a hand-written signature is expensive.

The bank systems which collect our hand-written signatures are breaking them and destroying that trust. First, if neither pressure nor dynamics is registered, then such a signature is not worth anything. No graphologist can say anything certain about it. On the other hand, if dynamics and pressure are stored in the system, then anyone who can access the system (i.e. a bank employee) can create a fake signature. The only safe solution is to trust that the electronic signature device is not passing the signature to the bank system and is just passing some codes which can’t be reversed to fake the next signature. I don’t know whether it is done that way or not. Do some of You guys have any experience with that?

Cheating

Creating fake signatures has a long history. But cheating has an even longer one.

A few years ago one of my banks gave me some papers to sign. “A standard agreement”, they said. It was about twenty pages of documents, each of which was to be signed by me. Gladly, I am used to reading what I sign. And one of those papers inside the stack was something which is called in Poland “an agreement to be a subject of a bank execution order”. This simply meant that by signing it I would agree that if the bank thought I owed it some money, it would send me a court enforcement officer who would simply take whatever he liked. Without any court case or any right to defend myself.

I did not sign it. I did not even have to sign it, because the law allows them to politely ask me to sign it but does not allow them to require it.

I did not, because I had read what I was about to sign.

Read what You sign

This is the oldest trick in the book. “Just sign this copy”, “This is just a formality, you need to sign every page” and so on. The oldest possible trick to make somebody willingly, with their own hand, but unknowingly sign some document. The result is a fake document with a true, original and correct signature.

Chain of distrust

I know I can trust my hand.

What is “my hand” when it comes to a digital signature?

Software. And, if You are very picky about safety, a hardware key (like a Yubikey or similar) which processes the signing procedure.

Whatever You read, You will always find that it is impossible to hack it. The hardware key makes the signature and the private encryption key never leaves it, so it can’t be hacked.

Right. Correct. I agree.

But how exactly is the key making the signature?

What exactly is it signing?

From what I know, the software which requests the signature computes a check-sum (a hash) of the document. This is the first step and the first trap. It is not the document which gets signed but the check-sum. There is a weak but existing chance that two different documents will have identical check-sums.
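
To make the collision point tangible, here is a toy check-sum, a plain byte sum, far weaker than any real cryptographic hash, colliding on two different “documents”:

#include <stdio.h>
#include <string.h>

/* Toy check-sum: the sum of all bytes. Real signatures use cryptographic
   hashes, where collisions are astronomically harder to find, but the
   principle that a collision may exist stays the same. */
static unsigned checksum(const char *doc)
{
    unsigned sum = 0;
    for (size_t i = 0; i < strlen(doc); i++)
        sum += (unsigned char)doc[i];
    return sum;
}

int main(void)
{
    const char *doc1 = "pay 10 to Bob";
    const char *doc2 = "pay 01 to Bob";   /* a different document...      */
    printf("%u vs %u\n", checksum(doc1), checksum(doc2)); /* ...same sum! */
    return 0;
}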

Then something needs to ask You for the password (if any). If done well, this password is used to decrypt the private key on the hardware, but the software which asked You for the password can store it and use it behind Your back.

And finally the hardware key signs the check-sum. If done well, the key requires a physical action from You to initiate or confirm the signing process. It may be the press of a button or even a biometric scan.

Which elements of this chain can I trust?

Only the hardware key and my finger pressing the button.

Which elements can’t I trust?

The software which generates the check-sum and asks for the password.

Note: in fact, the entire computer/tablet/phone. This is what viruses and malware are for: to take over control of the computer. You must always assume that You do not control the machine. And even when there are no viruses, You simply don’t. Windows can update at any moment, and from that very second it is different software. Open source can update with a backdoor put there by some smart guy. Anything which can execute programs from writable memory cannot be trusted.

Don’t sign if You can’t read it.

This is a basic rule of any document security and a proof of elementary sanity. Read before signing. If You can’t see what You are signing, don’t do it.

Can You really see what You sign with a hardware digital signature key?

Well… You can see what the software You don’t trust has shown You. But You can’t see what the key is signing.

What is cheating for exactly?

People talking about digital signature security are always focused on cracking the private key.

But is it really what cheaters need?

Does a person selling fake Covid certificates really need the country-wide private key? Or do they only need to be able to procure a correctly signed certificate?

Exactly.

Cheating simply means: “to get a certain document signed without the signer knowing it”. That is all.

Is it that hard to imagine that a web service asking for a digitally signed document simply displays “server error, signing failed, please sign again” and asks You to re-sign the same document? How would You react to it? It is just an error, right? You signed the document, but something broke over the net, so You need to sign it again, right?

Well… who told You that it failed? Your hardware key did sign something. It was sent somewhere. Did it reach its destination? Well… who knows.

This is exactly as if somebody gave You a bunch of pages to sign and then removed one of them and hid it in their pocket.

Because You signed it without seeing it.

Conclusion

As long as You are not able to see what the first element of the trusted signature-making chain actually signs, any kind of digital signature will be inherently unsafe.

Do we really think that signing what we can’t see is a “secure signature”?

Covid vaccine: is it “free choice”?

This post will be an expression of my personal annoyance about some facts.

Fact number 1: We have Covid and we have somewhat working vaccines.

This is a fact. I do not dispute it. The vaccines are there and they do work. More or less. Recently even less than expected. At least in Poland the 50% vaccination level has almost no impact on the observed development of the pandemic when compared to last year.

But I may agree with the fact that they do work and that it is beneficial to society to use them.

Fact number 2: In Poland the injection of the vaccine requires signing a waiver stating that You do it willingly, of Your own free will and informed about the risks.

This is a bit tricky. The question is: why? Why is there a requirement of willing, voluntary, informed consent? This is something You usually need to sign in Poland when a risky medical procedure is going to take place, like surgery. You do not need to sign anything like that when You are getting Aspirin prescribed, getting a CT scan or even an infusion of some diagnostic tracer.

But anyway, I would not have anything against that if not for fact number 3.

Fact number 3: If You are NOT vaccinated You are NOT allowed some activities.

The Polish government is really lax when it comes to restrictions on non-vaccinated people when You compare it with the USA or France. But there are restrictions and more of them are to come. Some of the proposals even allowed employers to refuse to let non-vaccinated workers come to work and to not pay them any salaries, including social security fees. The last one simply meant: 30 days and You are out of the public health care system. This is a real threat, because in Poland there are almost zero non-public hospitals.

Fact number 4: The Nuremberg Code

See https://en.wikipedia.org/wiki/Nuremberg_Code. This code deals with medical experiments. I have doubts whether vaccination with conditionally approved substances is a mass experiment or not, but the code clearly states a definition of what a “consent” must be to be considered valid:

(…)The voluntary consent of the human subject is absolutely essential. This means that the person involved should have legal capacity to give consent; should be so situated as to be able to exercise free power of choice, without the intervention of any element of force, fraud, deceit, duress, overreaching, or other ulterior form of constraint or coercion; (…)

These are ethical rules set more than 70 years ago in accordance with informal U.S. standards, approximately the same as the 1931 German standards. Should I suppose that nowadays the rules are less restrictive, or may I expect that they are still binding?

Note: check librivox.org for the “Report on human radiation experiments” for a thorough, in-depth insight into the ethical aspects of medical research and treatment.

Doubts

Our governments, at least all those in Europe, still stand firmly on the position that “it is your free choice to vaccinate or not”. Your choice. Free choice. My choice. My body, my decision.

Then the same government runs really emotionally painful campaigns to promote vaccines (“vaccinate – save lives, return to normality”) and against people who did not take the vaccine (“the spread of the virus and the deaths are due to the non-vaccinated community”). It also creates restrictions which remove some rights from people who did not take the vaccine and promotes practices which make it harder for non-vaccinated people to get medical help.

I let myself underscore the “remove rights” part because it is something other than “adding some bonuses for vaccinated people”. I could accept the latter, even though giving bonuses for taking part in medical experiments is ethically fishy.

Do the words “duress” or “coercion” apply here, or is it just me thinking that the fear of being left without an income or access to a doctor is “coercion”?

And when You are finally coerced by that pressure into taking the vaccine, You are forced to sign a waiver with a false statement that You are doing it of Your own free will. In fact, when I was taking the vaccine, I was refused it when I altered the waiver to indicate that my choice was made mostly due to the pressure from the government and the fear of attacks from society. I had a choice: either not take the vaccine and risk not only Covid (which I was ready to risk, having already been ill once), but also the harm from government actions, or commit a crime by signing a waiver with a false statement.

I did get the vaccine. I did commit a crime.

Conclusion

In my opinion governments should either stop all forms of coercion and let people make a really free and informed choice, or they should make vaccination obligatory and take full responsibility for it.

The way it is now, it is unacceptable.

The way it is now, it is not my free choice to vaccinate or not.

RtOS – multimedia==RtOS? No.

This blog entry is the third one in the series about “real time operating systems” in embedded applications. You can find the first part here and the second there.

This time I will be talking about how “multimedia” gets close to “real time” but is not it.

Multimedia

In modern desktop PCs, smartphones, tablets and all the “smart entertainment” crap, support for multimedia is an essential functionality.

When I was young the only multimedia on computers were some 320×200 pixel jerky animations and stereo sound generated by FM-synthesis sound cards or even directly at the printer port with the help of some resistors. Playing a real TV-quality movie was simply far, far beyond the bandwidth of the hard disks of that time.

Nowadays this is not a problem.

However…

“Fast” does not mean “real time”

I do not have much experience with video streams, but I do have some with sound on a PC using the DirectSound technology. It is nice. It is fast. But if You take a closer look at it and at the supporting hardware, You will notice that even though it has a high bandwidth, it is not “quick”.

When playing a sound, You simply prepare some data, put it into a buffer in RAM and tell a sound library to pump it from RAM to the sound card hardware. This operation is captured by the operating system audio sub-system, which performs intermediate operations like re-sampling, digital volume adjustment and mixing it with sounds prepared by other applications. It then moves the mixed data to another buffer in RAM, and just then pumps it to the sound hardware. This “pumping” is in fact done by DMA (Direct Memory Access – hardware used to move data between blocks of address space or I/O resources in between CPU cycles) in a sequence of burst transactions.

This lengthy path results in the fact that the information that all the buffered data was transferred from RAM to the operating system, or from RAM to the sound card, does not mean that the sound was actually played. It might have been, but most probably it was not.

The main difficulty I encountered – and I dare say not only I, since javax.sound is also burdened by it – was the ability to actually stop a playing sound. Or even worse: to detect when it finished playing.

Fast means buffered

Most of You have probably noticed that when You play video content on Youtube or some other streaming service, or from disk with a media player, the sound is sometimes a little off from the video. And, what is easier to observe and more frequent, that when You close the player the sound still plays for a fraction of a second even though there is no application playing it.

This is because the operating system is “fast” but not “quick” to react. It can pump tons of data to the sound card without any problem. But if You would like to know when exactly the last sample of a 44kHz data file was used up to move the speaker membrane… Well…

44kHz is about 22μs per sample. The human ear can easily catch a 500μs glitch in sound as an audible “click”. If the operating system were a “real time” one, with a guaranteed response time below 22μs, then You would be able to just wait for a sample to be played and then push out the next one.

This is exactly how You would do it with a microcontroller and interrupts. You would set a timer interrupt at 44kHz and put the samples, one by one, to the DAC driving the speaker. Without any operating system, that is. You would even do it that way in the DOS era. It was so simple those days…
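
For illustration, here is a minimal bare-metal sketch of that interrupt-driven approach. All the register names (DAC_OUT, TIMER_FLAGS) and their addresses are hypothetical placeholders – every real micro-controller has its own – but the shape of the code is the classic one:

#include <stdint.h>

// Hypothetical hardware registers; substitute Your chip's real ones.
#define DAC_OUT          (*(volatile uint16_t *)0x4000A000u)
#define TIMER_FLAGS      (*(volatile uint16_t *)0x40001004u)
#define TIMER_FLAG_CLEAR 0x0001u

#define SAMPLE_COUNT 1024u
static volatile uint16_t samples[SAMPLE_COUNT]; // filled by the main loop
static volatile uint32_t play_index = 0;

// Called by hardware every 1/44100 s (~22.7 us). Nothing buffers it:
// the sample hits the DAC "now", sample by sample.
void timer_isr(void)
{
    DAC_OUT = samples[play_index];
    play_index = (play_index + 1u) % SAMPLE_COUNT;
    TIMER_FLAGS = TIMER_FLAG_CLEAR; // acknowledge the interrupt
}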

Modern operating systems isolate You from such hard-to-understand things as interrupts, so You must rely on operating system task switching, signals, events and notifications to be woken up. And they simply cannot always do that quickly enough.

Note: I suspect that modern sound card hardware does not even have the ability to notify the operating system that the sound has been fully played. If it could, then why does DirectSound not have such a function? A working one, I mean, because the official specs are saying: “It is there, but it most probably will not work”. And the specs are right.

I may be very wrong here however. I simply do not know it.

What You have to do is to just buffer the sound ahead, set up a timer and add some head-room for the task switching delay. Usually 500ms is enough (at 44.1kHz, stereo, 16-bit samples, that is about 88kB of data queued ahead).

Surprisingly, You will have more problems with stopping the sound than with starting it.

Real time means: no buffers

This is the main difference between “fast” and “real time”. A real time 44kHz sound system is capable of sampling microphone data, processing it and, within 22μs, passing it down directly to the output speaker membrane. If the microphone moves, the speaker moves, with no more than one sample of delay.

Obviously, most modern digital multimedia should not be seen as “real time” media systems. Even a simple USB digital speaker or a Bluetooth set will introduce delays at least in the millisecond range. Ok, I may be exaggerating a bit – You can make a USB speaker with a lower delay if You use USB 3.0. But You should be aware that all such digital devices need to buffer data at the source side, push it through the cable as a pack of data, and then play it to the physical output at the correct sampling rate.

It will never ever be close to sample-by-sample interrupt driven system. And it will never ever be close to a plain, dumb old piece of copper cable.

Summary

After reading this blog entry You should understand that “damn fast” is not “real time”, and that even though multimedia requires high processing power, the fact that You can play it does not necessarily mean it is a “real time” system.

And one other thing. If You are going to set up a cheap PC based digital recording studio with live play-back (the proper English term is, I believe, “monitoring”: the recorded sound is played back live to the rest of the band so they can stay in sync), then You may have a real deal of trouble with digitally introduced delays.

In the next blog entry I will try to show You what multitasking is about and how it relates to an RtOS.

RtOS – what is “real time”?

This blog entry is the second in the series about “real time operating systems” in embedded applications. You can find the first part here.

This time I will be talking about the “real time” portion of the RTOS acronym.

What is “Real Time”?

Before we start talking about “real time” we need to talk about “time” in software.

In general, “time” is some physical space-time property which can be used to order events. Something happens before something else and something happens after something else. Up to now no physical theory, including space-time dilation, has opposed that – at least for causally connected events the sequence is the same regardless of who is viewing it and from where.

With software it is exactly the same – time is just the ordering of events. Software, however, lives in two worlds at the same moment: the “virtual world” of the digital machine and the real world in which this machine is submerged.

And this is it. Standard programs and standard operating systems live only in the “virtual world”. And, as possibly all of You have noticed many times, time in software does not flow at an exactly even pace with real physical time. Software may get stuck for a moment, may run faster or slower, but the sequence of virtual events is still preserved. You can even put a PC to sleep for a night and once it is powered up again it just runs as if nothing had happened. In fact any program which is not intentionally trying to check the time flow in the real world will not notice it.

In other words, standard operating systems preserve the ordering of events but do not preserve the “real world timings”. Real-time operating systems do preserve timings. This is the main difference.

Ok, ok, I know, I am babbling too much. Isn’t it much simpler than that? Haven’t You been told that “real time” is just “fast”?

Predictability

No, it is not just fast. It is “predictable”.

For example, a decent Windows machine can poll a USB device every 1ms, send the IN token and take some data from it. Then report it to the task which requested that IN transaction and accept the request for the next IN transaction. But if You run it for a longer time, let’s say a few days, and put some load on the machine, You may expect “glitches” to appear. That 1ms may, from my experience, become a delay of up to 500ms. So a normal Windows machine is hell fast, but cannot give You a warranty that some action will complete within some specified time.

Another example is the serial port. Yes, that old RS232 one. It is soooo sllooooow… but it has a turn-around time of a single bit, about 8.7μs at 115200bps. The “turn around time” is the minimum delay which needs to be taken into account when generating a response to a query. A serial port at 115200bps needs just 1 bit time to be ready to send an answer.

This is not very fast when You compare it with an 800MHz system bus, but it is still far, far beyond the capabilities of an eight core 3GHz Windows machine. Yes, there are cases when such a machine can process an answer and start responding within such a time slot, but on longer runs You will find that the response time of Your PC will occasionally be as long as 50ms.

This is because Windows is fast but not predictable. It is just not a Real Time Operating System.

What exactly is “predictability”?

This is quite simple: if the RTOS specs say that on a machine of type X the maximum duration from raising a signal in one task to the moment the target task starts running is “t”, then it is so. Always and ever. Period.

Alright, You will say, really ever?

No, not really. Only in the very specific case when the target task is the highest priority task in the system.

Notice that the duration “t” does not have to be very short. The shorter, the better, but even if it were 10ms it would still be a “real time” system. Not a very reactive one, but still predictable.

Technical difficulties

The closer the maximum task switch duration “t” gets to the CPU clock cycle, the harder it is to build such an operating system. I will now discuss some of those troubles.

Interrupts

The first barrier You will hit is the “interrupts”. Interrupts are hardware signals which physically switch processor tasks. An interrupt happens, the processor stores its state in some place and executes a program from a strictly specified location in memory. In some processors this location is hard-wired in their insides, in some it is read from registers, in others it can be read from the system bus. But it is always a very low level operation.

A processor may have many “interrupt sources”. The observation valid for all processors is that an “interrupt from source A” cannot be interrupted by the next interrupt from the same source. It must finish before the next interrupt from the same source may be processed. Some processors do allow it to be interrupted by other sources, some do not.

From the timing point of view, the longest interrupt handler which cannot be interrupted by another one sets the floor below which You cannot get with task switching.

I/O operations

The second barrier is any lengthy atomic operation. These are usually related to I/O. For example, if the hardware is sending an Ethernet packet on the wire, You just can’t break it in half and start sending another packet.

This is another “time grain” which may put a task switch on hold. In fact it is a source of the so-called “priority inversion”. The “priority inversion” comes from the fact that if a low importance task initiates a lengthy atomic operation, a high importance task will have to wait until it finishes, regardless of how important it is.

It is as if president Trump wished to take a leak but all the WC stalls were occupied by some average guys taking a dump. Sorry, but he has to wait…

…or get involved with cancelable I/O

…just open a stall and pull the poor guy out of it with his pants down and the shit half way through.

Right. Exactly.

Some hardware can allow that, and operating systems which are really, really real “real time” operating systems can do it. You can imagine how messy it will be. Shit all over the floor… in indirect terms, of course.

And with that we are getting to priorities.

Priorities

Some of You, especially those on Linux, are familiar with something called a “process priority”. On Linux You can adjust the priority of any process with the “nice” command. Try playing with that.

Tried?

Observed some effects? No?

Exactly. This is because Linux is not a “real time” operating system. A task priority is not strictly enforced. But if You were on QNX, You might be able to lower the priority of a task to such a level that You would see the pixels being drawn on the screen. Of course, to achieve that today the GPU would have to support cancellation, which it does not. But it was visible in times when there was no GPU.

In simpler words, a “real time” operating system applies priorities as absolute rules. As long as the highest priority task does not put itself to sleep or wait for something, no lower priority task will see even a single CPU cycle. Not ever. Because it is less important, right?

This type of prioritization helps “real time” operating systems react quickly to external, physical world events. But it is a real, real pain to deal with.
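
To be fair, stock Linux does have an opt-in fixed-priority scheduling class, SCHED_FIFO, which enforces exactly this “higher priority always wins” rule (kernel latencies are still not guaranteed, so it remains “soft”). A minimal sketch; it needs the right privileges to run:

#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param sp;
    sp.sched_priority = 50; // SCHED_FIFO priorities range from 1 to 99
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler"); // typically requires root
        return 1;
    }
    // From here on, a busy loop in this process would starve every
    // lower-priority task on its CPU - exactly the behaviour described above.
    puts("Running with a fixed real-time priority.");
    return 0;
}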

Let us now imagine that Linux or Windows were an RTOS in this manner. And that You used “nice” or its Windows counterpart (which I don’t know) to set Your web browser to a slightly higher priority than other apps. Then imagine what would happen if it loaded a page with a JavaScript which just loops for ten seconds.

Normally just the browser would freeze. But if it were an RTOS, then the browser would be more important than the other applications, right? The operating system is not the one to judge whether this looping is a serious job or just a fluke. So it will not let other applications run until the browser initiates some I/O operation or just waits for user input or some clock event.

Obviously it would not look exactly like this on multi-processor systems.

And talking about multi-processors….

Cache

Yes, cache is also involved in an RTOS. If code is in the CPU cache, it executes at least an order of magnitude faster than when it has to be loaded into the cache from main memory. So task priorities may also involve cache management. If reaction time is the main concern, a high priority task must force the RTOS to keep its time critical code in the cache all the time.

And modern systems do have something even more evil…

Virtual memory

“Virtual memory” was a really great revolution in computing. Before this concept, programmers always had to struggle with the amount of RAM needed by an application. It got ever more complex when operating systems gained the ability to run several tasks in parallel. If You needed more RAM than could reasonably be expected to be present in the system, You had to think about flushing some data to the hard disk Yourself. With “virtual memory” this is no longer necessary – the only thing You need to care about is to not exhaust the “address space”; it is up to the operating system to manage flushing and loading RAM to and from the hard disk.

But it all comes at a cost. The moment the “virtual memory” is enabled, You may expect that a simple read from memory, or a jump to a certain instruction, may take not 1ns (modern cache), not 10ns (modern SRAM), not 1μs (modern burst DRAM cache fill), but quite a significant number of milliseconds – if a mechanical hard-disk is used to handle the virtual memory, its own cache does not contain the necessary data and the disk head is not in the right position. Blah! It can even be one or two seconds if the hard disk was powered down and needs to spin up.

If an operating system is using virtual memory, then it can hardly be a real time operating system.
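
This is also why programs which chase real-time behaviour on ordinary systems pin their pages in RAM, so that the virtual memory machinery never touches them. A minimal sketch using the POSIX call for it:

#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    // Lock all current and future pages of this process in RAM, so no
    // page fault will ever have to wait for a disk. Needs privileges
    // (or a raised RLIMIT_MEMLOCK limit).
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return 1;
    }
    puts("Pages pinned in RAM.");
    return 0;
}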

Summary

I suppose that You now know what “real time” means in RTOS. It is not about speed. It is not about performance or processing power. It is all about a “predictable response time”.

In the next blog entry I will try to explain what a “soft real time” is, how multimedia affected operating systems, and why a system which is able to decode and display 4K video without a problem is still not a “real time” system.

RtOS – do You need it?

This blog entry is the first of a series about “real time operating systems” in embedded applications. I will update this post with the necessary links as soon as I add more parts.

Let us start from the basics.

Target audience

This series is aimed at those who program embedded micro-controllers. And when I am talking about “micro” I mean really micro. Not those in smartphones, not those in home entertainment systems or so-called “smart something” equipment. I am talking about all those tiny devices embedded in fridges, power supplies, pulse oximeters, non-smart watches, 2D and 3D printers, industrial systems, car engines, doors, brakes and so on. Even some tires have built-in micro-controllers. You will find them everywhere You need to measure something, drive something, compute and decide about something.

In fact I am talking about all those places where You are used to using good old C or assembler, where malloc or new are swear-words which should not be used, and where all the code is burned into the chip memory once and for all.

If You are one of those guys playing on that ground, then this series is for You.

What is an RTOS?

As You have probably noticed, the RTOS acronym stands for the conjunction of “Real Time” and “Operating System”.

Let me first talk a bit about the second part.

What is an “Operating System”?

Windows, Linux, MacOS, Android… right? Well… we are not at that level. If You think at that level, then it is certain You will clearly say that an “Operating System” is nothing good for low level embedded applications.

And You will be right. This kind of system is a no-go.

Layered service model

An operating system can in fact be imagined as a set of layers, each doing some favors to the layers above it.

I let myself make some doodle again. Do You see that grumpy old fart on the left? Yes, You see him. This is “The User”. The user is the top level layer in the whole software ecosystem. Everything below him is servicing him. At least in theory, that is, because in practice it is not always true. But let us get back to business.

One level below the user is the “User Interface”. Basically the display, speakers, keyboard, mouse and whatever user compatible input/output device You have there. This is usually the level at which we, users, see and interact with the “Operating System”.

This is not the “Operating System” I am talking about.

So let us dig a bit lower.

The “User Interface” passes the requests made by users up and down to the “Application”. The “Application” is the program which is in fact doing the job the user asked for.

In the low level embedded world there is usually nothing more than the “application” itself. The application code does everything by itself. You don’t need any fancy operating system to light an LED at an output pin, right?
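
Just to show how little is needed, here is a minimal sketch of such an “application”. The register names and addresses are made up; Your chip’s GPIO block will differ:

#include <stdint.h>

// Hypothetical GPIO registers of some small micro-controller.
#define GPIO_DIR (*(volatile uint8_t *)0x40002000u) // direction register
#define GPIO_OUT (*(volatile uint8_t *)0x40002004u) // output register
#define LED_PIN  (1u << 3)

int main(void)
{
    GPIO_DIR |= LED_PIN;     // make the pin an output
    for (;;) {
        GPIO_OUT ^= LED_PIN; // toggle the LED
        for (volatile uint32_t i = 0; i < 100000u; ++i)
            ;                // crude busy-wait delay, no OS anywhere
    }
}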

Of course, the more time pressure is put on a project, the more programmers try to re-use already available solutions. In effect the “application”, understood as the code written by the programming team, is no longer doing everything by itself. Instead it tries to rely on some “libraries”. For example, You may rely on a “file system library” to store some files on an MMC flash memory card. Or on some “hardware abstraction layer” library for interacting with timers. Or whatever.

Are then “libraries” the “Operating System”?

I dare say they are the biggest, 99.9% part of it, but they are not the part of an “Operating System” I am talking about.

Hmmm… So not the user interface, not the file system libraries, not the “hardware abstraction layer” libraries which are in fact all the fancy drivers and so on… What else is left?

Most probably You have got bored reading this already and tried in the meantime to check some news or email. That’s right. You tried to do something else while the “application” which was showing You my blog was still working.

This is the piece of “Operating System” I am talking about. The ability to run and arrange “applications”.

Applications management

What is the absolute minimum set of “Operating System” abilities when it comes to managing applications?

Starting and stopping them?

Well… it might have been so in CP/M and DOS times, but it is certainly not in the case of low level embedded systems. In embedded systems the application is hard-burned into the CPU memory, and once the power is on, the CPU starts running it. And it runs it as long as the juice is flowing.

There is no “start” and “stop”. In fact the ability to load and start an application on demand is asking for big, big trouble.

Multitasking

The sole elementary ability of an “Operating System” is even less impressive. It is the ability to switch from one application to another. Do some code from program A, then do some code from program C. And maybe from program D, if it wishes so. And so on, and so on.

The only really important thing when it comes to an “Operating System” in the embedded world is the so-called “multitasking”.
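
Its most primitive form fits in a dozen lines. A minimal sketch of a cooperative, round-robin “operating system” – each “task” does a small piece of work and returns:

typedef void (*task_fn)(void);

static void task_A(void) { /* poll a sensor */ }
static void task_B(void) { /* update a display */ }
static void task_C(void) { /* blink an LED */ }

static task_fn const tasks[] = { task_A, task_B, task_C };

int main(void)
{
    for (;;) {
        // "Multitasking": do some code from A, then B, then C, forever.
        for (unsigned i = 0; i < sizeof(tasks) / sizeof(tasks[0]); ++i)
            tasks[i]();
    }
}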

This is what You should start considering when somebody asks You: “Do You need an operating system for that project?”

This is all for now. In the next blog entry I will try to decode the “Real Time” part of the RTOS acronym.

“Bug bounty” as an open source business model

The older I get, the more I am pissed off at crappy commercial software. Unfortunately, I am also becoming more and more annoyed by crappier and crappier open source software.

I would like to be able to say that I am stuck with unprofessional, amateurish open source software due to financial reasons… but it is not true. I have enough savings to allow myself to buy a commercial product (providing it is a perpetual license, of course; I am not wealthy enough to spew out thousands of Euros a year for subscriptions). I am not buying a commercial product because I have a lot of bad experience from my work and I know how crappy very expensive software can be. So if I have a choice between paying for crap and just taking crap for free, I will rather stay with the free stuff.

At this moment most open source people will get really mad at me. What is this idiot saying, open source is superb!

It is, in fact. But mostly as a process and a community, not as a final product.
But let me put that away for a moment.

Prusa 3D printer and business

Let me now step away from software and take a glance at a more material business.

Prusa is a Czech company which produces, sells and supports 3D printers. This company works in a totally open source model, which means, in short, that they supply their competitors, free of cost and royalties, with the full documentation of their product. Anyone can legally manufacture, market and sell it. In fact some Chinese companies are doing just that.

This printer is not a cheap one. Even worse, ordered from Czechia it costs about 150% of a Chinese copy.

On the quality side I have to say that it represents the best of both worlds: open source and commercial. I am continuously surprised how well it is made and how great an effort was put into making it as friendly, serviceable and fool proof as possible. It is certainly worth every penny spent on it.

When I tell anyone in Poland who runs a business about Prusa, their brows just go up, they shrug and say: “They will go bankrupt soon”.

Material and immaterial world

Prusa is a boundary case of open source. The documents are open and free to use, but the company does a lot of physical work producing the printer. They also manage and support the open source community behind it, but from the end customer’s point of view what You are paying for is “The Printer”. Some metal, plastic, electronics, books, and the tests which had to be made so that it could be safely and legally sold in the EU.

But how does it look when You buy software?

You certainly do not buy a “material product”. You are just buying the “right to use it”. Except for the case when it is a “cloud” based solution, the company selling it to You has no expenses related to each sold copy. The production cost of each copy is zero. The production cost of the first copy is, however, tremendous.

The same goes for music, movies, e-books. Everything which can be easily copied has a near zero per-copy cost for the producer…

Hey, wait, You will say, it is not true. We have to run the servers our clients download it from. We have to handle our on-line shops, licensing servers…

Stop! It is not true. You do not have to do it. You have chosen to do it because You do not like people copying Your product. Right, this is it. You could have zero distribution costs if You allowed anyone to copy it and were just paid each time a copy is used. The costs You claim You need to pay are just a consequence of the restricted distribution model You have chosen.

Money paid as a “proof of …”

So when we remove those fake costs caused by “restrictions handling”, You may notice that in fact You, as an end client, are paying two kinds of costs. I will call them:

  • money as a “proof of work”;
  • money as a “proof of ownership”.

When You pay Prusa, You pay money as an appreciation of the work and time they spent on producing that exact piece of 3D printer. This is a “proof of work”. But if You pay Autodesk for their Inventor, which in fact You have to download at Your own cost, set up on a PC at Your own cost, and with zero warranty, You pay mostly to appreciate their rights to that “intellectual property”. This is a “proof of ownership”.

Of course this is not a black-and-white situation. Autodesk had to do a lot of work to make the software. But its cost is continuously spread over all licensed copies. And yes, they do update the software. But guys, I am a programmer. And most of the maintenance of Autodesk Inventor which the company I work for actually needed would be covered by one annual licensing fee. I stress the words “actually needed”, because their importance will show later. We needed some fixes, which were in fact never made, but we did not need the hundreds of other fixes in the costs of which we had to participate.

I think that 99% of software business is about a “proof of ownership”.

Open source of course cannot charge for a “proof of ownership”, because it is, well… open, right?

Yet there are companies which run their main business on open source software. The two I know are RedHat and GitLab. RedHat runs mainly on maintenance and support, GitLab on closed source additions, maintenance, support and a cloud service.

You may notice that all those aspects, except the closed source additions, are in fact based on a “proof of work”. You pay them to sit on their asses and await Your call: “Help me, my server crashed!”. Or even more, You pay them for running Your server on their hardware.

This is a good route, but I think this is not enough for the open source community.

The fact which has to be realized is that people need to eat. Programmers are no exception. If You like a job to be done fast and well, You have to find a good programmer. And a good programmer will most probably already have a decent, well paid job. Such a person may be not very willing to spend hours of free time, for free, on an open source project. If however You could pay…

Corporate and individuals as open source customers

Both RedHat and GitLab aim at corporate clients. Their services are not cheap. Especially for an “average Joe” who needs a word processor, a music editor or a spread-sheet. Especially when the average Joe uses them once or twice a month. Especially when the average Joe is not a well paid programmer but just a janitor.

Those clients are totally outside the “support & maintenance” scope. They are fully in the “licensing as a proof of ownership” scope, but only if the license is in the 100$ region. They could be in the “cloud computing” region, but again, only if “pay-per-minute” accounting were available instead of an annual subscription. And believe me, they would really, really struggle hard not to spend more time in front of the software than needed.

So there is an entire herd of average Joes who are stuck with free open source software…

If I were a business shark I would ask now: “There are millions of $ lying around. How can I get my hands on them?!”

But I am not. I am just an open source user. But I know there are people who get really excited about business opportunities, and I think we can make a good deal.

Bugs generate costs at the user’s end

At the place where I work, about 50% of the personnel use the LibreOffice suite for their daily job. They prefer it over Microsoft products because it is more productive, more stable, changes less (so it generates fewer recurring training costs) and is more user friendly. Sadly, our employer is forcing all of us to use Microsoft products, because he thinks it is a “superior standard”… But I think he does it only to be able to cry at us: “A raise? You want a raise?! Do You even know how much I pay for the tools I have given You?!” Well…

Even though the LibreOffice suite is better than Microsoft’s, it is still annoyingly buggy in many places. I personally lose about ~20 work hours each year at my job due to bugs or usability issues. Twenty hours is not that expensive. But I am not the only person using it. I will let myself say that the company I work for is losing about 300 work hours a year due to this exact software’s inefficiencies.

From an economic point of view it would seem clear that a commercial product has to win over open source because it has “support”. That would be true if it were possible to have a true “support” agreement with Microsoft. We are now paying an annual subscription, but we have absolutely zero chance of having reported bugs fixed in a predictable time. If at all. But the theory says: “if You pay for support You get bug-fixes on request”.

With open source it is very different. We are not paying anyone, so nobody is going to fix anything at our request. We have the theoretical ability to download the source, set up a build environment, learn it, hunt down the bug and fix it. Theoretical, because it would cost us much more than 300 hours. Even worse, once we had fixed it we would, of course, as the old-school businessman does, treat this fix as a “business secret” and not share it with anyone. Thus we would condemn ourselves to “freezing” at a certain LibreOffice version.

All we can in fact do is report a bug and just ask politely if anyone can fix it.

It does not look good, does it?

“Bug bounty” as a business opportunity

300 hours is not much, but if I could get paid for an extra 150 hours for fixing a tiny bug in a well known code base, it would be an extra 10% annual bonus for me. Most of the bugs I have encountered are, in my opinion, fixable within 20 to 60 working hours, assuming You know the code and the program structure very well. If You don’t, I estimate it at about 300 hours.

So the only blocker is: “I don’t know the code well enough”. But what business opportunity could open up in front of me if I knew it well?

Some of You have probably filed a bug report. Some of those bug reports might even have earned a response. Maybe 5% of them have been fixed. Maybe 1% have been fixed soon enough to be of any use to You.

Now imagine that You have a rather annoying bug. A bug due to which You have to spend many working hours on re-doing things, fixing, checking, fixing, re-doing… It pisses You off, and when You are pissed off, Your boss is losing money.

Currently, all I can do to get a bug fixed as soon as possible is to make a “good bug report”. A good bug report presents a step-by-step, repeatable test case, proposes how it should work correctly and explains why. I can’t do anything more as a user. All I can do is to be very, very accurate in my description and my expectations.

Sadly, since we are talking about open source, we have to take into account how the community works. Since there is no governing entity, nobody can give anybody an order to work on something. The choice of work is purely voluntary. If somebody wishes to do something, somebody does it. If nobody takes it, nobody fixes it.

The work assignment in a community is in fact based not on bug importance but on “bug attraction”. If a bug is attractive, it will quickly tempt somebody to fix it. If it is boring, minor, hard-to-solve stuff, nobody will be tempted to do it. Exactly as in public health care: if You are “an interesting case” You will get all the doctors’ attention. If You are a “standard procedure” You will be ignored until Your illness progresses in such a way that it becomes “interesting”.

But the company I work for is losing 300 work hours each year due to that bug!

So what if, with a bug report, I could simply state: “I give 100$ to fix that”? Or when I find a bug report made by somebody else, I could add my own “bounty” to it? Surely it would make the bug more “attractive”.

Open source might then easily monetize the high volume but low financial capability users and turn that money into high quality work hours spent on a project by qualified programmers. And users would have a clear, hand-to-hand “proof of work”. I find a bug, I report it. If it really annoys me, I can put some money on it. I make a pre-payment to the project foundation with a “money return” warranty and have this bug fixed in a predictable time.

Notice that since many bugs are “related to each other”, there is quite a significant chance that with one fix one could in fact capture more than one “bounty”. This is a great opportunity for free-lance programmers to make a good profit.

This kind of business would boost the quality of open source by attracting people with professional experience, and would give users the feeling that they really can drive an open source project in a direction which is good for them.

Abuse of bug bounty

Of course the bug bounty business model may be abused, the same way open source may be sabotaged. Some people may intentionally put bugs into a project so that they can fix them later for money, some may take the money and not do the work, some may request the job to be done and then not pay.

So I would rather think about it the same way the “auction web sites” work. For example, the Polish Allegro offers its customers the ability to act as a “trusted buffer” between the person who is selling an item and the person who is buying it. They take the money from one side, take the item from the other side, and complete the transaction acting as a “trusted partner” for the non-trusting sides of the transaction.

This is a place where the Free Software Foundation may enter. Or in fact any trustworthy intermediary, I think. It may even be the foundation or main sponsor of an open source project itself, or a specialized business entity. I do not know, I am not a businessman.

How I see it working:

  • one or more persons declare a “bounty” for a bug;
  • some programmer declares: “I will do it”. All parties hold a public discussion and agree on the details of the bug fix request;
  • at this moment the money is paid to the intermediary entity, since neither the bug reporters trust the programmer, nor does the programmer trust them to pay for his or her work;
  • when the work is done, the result is presented to the project community by the usual means (pull-request, commit, branch, patch-set, build – call it whatever You like);
  • the community validates its quality by the usual means, exactly as it normally does for each proposed change. This validation covers general quality, coding standards, regression errors and so on. It must include the ability for an “average Joe” to test it, so it must go as far as presenting an executable build of the software on as many platforms as possible. The “compile it Yourself” approach cannot be accepted, because the original bug reporters, who were plain users, won’t be able to test it;
  • in the next step the community is expected, using the bug-reporting system, to validate whether the requests specified in the bug report were fulfilled. In plain words, the people of the community may vote for or against the programmer having done the job;
  • if the community validation says: “Yes, the reported bug was fixed as expected”, the intermediary entity transfers the money to the programmer and closes the bounty. If however the community says: “No. The work was done, something was fixed, but the bug report requested another solution”, or the agreed time frame is exceeded (some flexibility is needed there; for example, if You spent half the time more, You are paid half the money), then the money is returned to the persons who declared the “bounty”.

I think this model would have a great positive impact on the open source community. An open source community must consist of programmers. The quality and quantity of programmers is essential for the quality of an open source project.

In the “bug bounty” model, high quality programmers from outside the community are attracted by money. The community itself does not have to contain a lot of them. The balance moves from the need for experienced programmers to the need for competent integrators and pedantic testers.

In this model the community’s main focus is quality assurance. If a community can foster a well working quality assurance model, then the probability that an abusive bug will slip in is low. There is also very little chance that the community as a whole will be “unfair” or “cheating” and will abuse the bounty system. Some persons may be “unfair” or “cheating”, but as long as the statistical majority plays fair, it will work. As Bitcoin does.

And in this model the people who report bugs will be rather keen on quality validation, because they will be putting their money on it. In this model an open source community would be more about “users who test” than “programmers who create”.

I think it can be a great model for the future.



Decision trees in GUI

In this blog entry I would like to discuss a bit how to present the user with a functional user interface when it comes to showing a decision tree.

What is a decision tree?

Some of You have probably bought a car. Or at least tried their on-line configurators once or twice. In such a configurator the user is usually presented with a sequence of choices, like what color You like, what kind of engine, and so on.

The questions guide the user through the selection process, one question after another.

This is what I call a decision tree. I call it that because You can present it in the form of a graph which, in its most extreme case, will look like a full tree. Let me try to sketch it.

It is not a big tree, it shows just a few questions, so it can’t be big.

In fact it does not look like a tree at all. More like a bean pod or something. This is because the answers to the first three questions are independent of each other. The color of a car does not have anything to do with the type of wheels. You can have low profile aluminum together with either black, red or white. It does not matter, so the tree branches collapse and join at the tree trunk.

However the third question opens a real piece of tree. After asking about the engine type we ask about the battery size. Obviously all cars have a battery, but the choice is very different. An electric car needs a huge 10’000 amp-hour or 50’000 amp-hour one (excuse the silly numbers, I just put them there randomly; I have not the faintest idea how large those batteries are). Gasoline needs smaller batteries, so the choice is between 100Ah and 200Ah. Diesel needs a bit more, so 200Ah and 300Ah.

Then the last question appears only on the diesel branch. Would You like to have a TDI? Yes or no? There is no point in even asking it for an electric car or a gasoline engine.

All right, so this is a decision tree. Now imagine You need to create a GUI for it. How will You do it?

Step-by-step, question after a question

This is the most common choice:

A plain dialog after dialog, question after question.

If You are nice, then on each step You will show the user a text area with information about the previous choices.

For example like that:

This way You may guide Your user through the selection process, providing images, help and explanations.

Of course You may also make it using some other form, like tabs, enabled and disabled lists, and so on. The key is that You are asking the questions in a specific order. Once You reach the battery size question, You simply show the right set. And You do not ask about TDI for anything but a diesel.

When is this a good idea?

Certainly when each question is used to direct the user to do something. For example, if You make digital camera support software, You first ask what the user would like to do, then what camera type he has, then show him how to connect it, and at last what to download from it. There is no benefit in changing the order, because You can’t download from a camera which is not connected.

But with a car selector it is not a good choice.

When is it a bad idea?

Imagine a user who would like to select a car. Imagine that the decision tree is a bit larger. Like, for example, 20 questions or so. Once all the questions are answered, the user is presented with a price and the means to place an order.

All right, but what if the user would like to check how much a black car differs in price from a white one? They can differ, right? But the question about the color was the first one asked!

Damn.

Your user needs to click back and re-do all the answers.

Of course You may remember the answers and suggest them back to the user, but it is still 38 clicks away.

So maybe it will be much better to present it like that:

Tree as a dialog

I have again made a doodle. This is, as You can see, a dialog with some radio buttons. They can be radio buttons, check-boxes, drop-downs, whatever. It doesn’t matter. The idea is that all the choices are now visible. I even marked some selections with red dots.

The usual thinking about it is that the user will make choices from top to bottom. The IFS EPDM team made that assumption. The LibreOffice team did too. I recommend You not to make it. Instead assume that the user can make any choice in any order and change it at any time.

For example, I might have first selected the battery size, then the engine type. This kind of dialog allows me to do that, right? This is the entire idea. I make a choice, the dialog presents me with a price at the bottom. I would like to try to make it black, so I check black and look at the price.

Nice, isn’t it?

Well… this also can be screwed up.

How to screw up that dialog.

Now imagine the user would like to try the electric engine. Like I drew on the left. But hey, an electric engine can’t have a 100Ah battery! This choice will be invalid!

What to do, oh what to do?

Blocking wrong choices

The first thing which comes to mind is to gray out the invalid selections. In this case both diesel and electric will be grayed out due to the battery. 10’000, 50’000 and 300Ah will be grayed out due to gasoline. And TDI, both answers, also.

And now imagine You are the user. And You try to figure out why the hell they are grayed out. In this simple example it is just a plain one-after-another dependency. But the guys at LibreOffice, in the “Image position” dialog, are dealing with a three-level-deep dependency cascade. I needed a bug report, a long talk and half an hour of trial and error to figure out how this dependency works.

So just blocking some choices may be frustrating to the user.

What other choices do You have?

Correcting the user

In this approach You allow the user to make a wrong choice, but to avoid prohibited selections You correct the wrong ones to the best, nearest valid selection. In this example, when the user selects “Electric”, You detect that 100Ah is a wrong selection and switch to the closest equivalent. I have drawn it as a blue dot.

This way the selection is fine and correct.

This is how IFS did it and how LibreOffice did it. And I am mad at them, because I keep losing my time due to that.

The tricky part is that the user may not notice that You made a correction. Especially when it happens far away on the screen. This is not a big problem when the dialog is about selecting the position of an image on a screen. But when, instead of a nice, fashionable red car, Your wife gets a black beast just because red did not go well with some other distant selection, You may get Yourself a not so nice divorce.

There is also another problem with that approach. It is the non-reversible cycle.

When You make a correction, You need to make a “nearest best choice” out of a prohibited option. Now imagine I click “Electric” and You select 10’000Ah, because I had selected the smallest allowed battery size for “gasoline”, so it is logical to also select the smallest battery for “electric”. I am just looking at the price, so I do not even notice it. I then select diesel, so You may choose from 200Ah and 300Ah. Again, logically, You select the smallest one, that is 200Ah. Hmmph… it is still not a price I like. I would like to try gasoline again and check the aluminum wheels. So I click “gasoline” and I frown. Why is the price 100 euros higher than the last time?

Obviously 200Ah is a valid choice for both “diesel” and “gasoline”, so there was no reason to correct it. But I never, ever clicked the 200Ah!

How to do it correctly?

I would be really happy if You would do it this way:

You will accept my selection. You will not change my wrong selections. Instead You will strike out, or otherwise clearly mark, the invalid choices, gray out the “make an order” button, and strike out the price with an explanation of why it is wrong.
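
A minimal sketch of this “validate, don’t correct” idea, with the car example boiled down to two questions (all the option names are of course hypothetical):

#include <stdio.h>

typedef enum { ENGINE_ELECTRIC, ENGINE_GASOLINE, ENGINE_DIESEL } Engine;
typedef enum { BAT_100AH, BAT_200AH, BAT_300AH, BAT_10000AH, BAT_50000AH } Battery;

// Pure predicate: says whether a combination is valid, never touches it.
static int is_valid(Engine e, Battery b)
{
    switch (e) {
    case ENGINE_ELECTRIC: return b == BAT_10000AH || b == BAT_50000AH;
    case ENGINE_GASOLINE: return b == BAT_100AH   || b == BAT_200AH;
    case ENGINE_DIESEL:   return b == BAT_200AH   || b == BAT_300AH;
    }
    return 0;
}

// The GUI keeps the user's selection exactly as made and only re-renders:
// invalid combinations are struck out and the "order" button grayed.
static void render(Engine e, Battery b)
{
    if (is_valid(e, b))
        printf("Price: 30000 EUR   [ Make an order ]\n");
    else
        printf("Price: ------      this engine cannot use this battery\n");
}

int main(void)
{
    render(ENGINE_ELECTRIC, BAT_100AH);   // shown as invalid, NOT auto-corrected
    render(ENGINE_ELECTRIC, BAT_10000AH); // valid
    return 0;
}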

This way I can make a choice, see what is wrong and why, and then select what I like best. Surely this time, if my wife gets a black car instead of a fashionable red one, it is truly my own choice, divorce included.

The Slic3r does something similar and I am happy with that.

That’s all for today.

How not to translate the user interface

i18n – internationalization

Today I would like to throw a stone into someone else’s garden, and this stone will be about the translation of user interfaces. Or, to be more precise, how not to do it.

Ok, what is it all about?

No more than a few days ago I inspected some Qt GUI code with a line like this:

someObject->setText(tr("Close file"));

The tr() is a function which takes the input text and translates it to the selected language.

Excellent, isn’t it? Standard, good and readable. And if no translation is found, it obviously falls back to English. Good.

Right?

Well…

Not so good.

Each language has its own specifics. In each language some words have different meanings depending on where and how they are used. For example, in English “drive a nail” and “drive a car” use the same word: “drive”. But in Polish it is “wbić gwóźdź” and “prowadzić samochód”. You don’t have to know Polish to notice that there is no common word in those two phrases.

Of course tr() does not translate the text word-by-word. Instead it is a simple look-up table, a map, which maps one text to another. So there is less space for confusion.

Despite that, the problem is still there. And You will have just one, single translation for all the places where the sentence “Close file” is used in Your program.

It is all about a context

This is all about context. Each text which is to be translated must be translated in a specific context. So I would recommend doing something like this:

someObject->setText(tr("SomeWindow::Close file"));

This way the tr() function is provided with additional context: the “Close file” text is used in the “SomeWindow” class. Now, if the same text is used in another window or another context, it may have a different translation.

It is relatively easy to make tr() use the English translation as a fallback before it falls back to the passed key. So it may return “Zamknij plik”, or “Close file”, or finally, if neither a Polish nor an English translation is present, it will return “SomeWindow::Close file”. I agree that the last one will be ugly, but it can still be understood by a human.
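
A side note: if I remember the Qt API correctly, it already ships this very idea. tr() records the enclosing class name as the translation context, and accepts an optional second argument to disambiguate strings inside one class:

// The class containing this call becomes the context; the second
// argument separates different meanings of the same English text.
someObject->setText(tr("Close file", "toolbar button: closes the current document"));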

It is all about an ease of work

The second thing is the ease of actually performing the translation. Imagine You are a translator, and imagine You are provided with a file which is just a bunch of English sentences.

How can You correctly translate it?

Of course You can’t. You won’t take such a job. You would rather have screenshots of the GUI and write the translations over them, because then You would know the context.

Sadly, supplying such screenshots to You is a complex and expensive job. It will also be bug prone when somebody else has to type the translations into resource files. So You will rather be provided with a text file.

And now imagine this text file has some context information. It will still be bad. It will still be hard, but Your translation can certainly be better. And it will be easier (cheaper) to provide You with screenshots on which there is just an arrow pointing to a certain dialog and saying “this is SomeWindow”.

That’s all for today.

Binary file formats: how to screw it up

In this long and boring blog entry I will try to show You most of the mistakes I have encountered in specifications of binary file formats.

But first things first.

Binary data format

A binary data format is a 010101…0101 representation of some abstract data You have in Your program. At first glance it looks exactly like data structures in memory, but there are subtle yet important differences.

The first, most important difference is that binary data stored in a file exists outside the program. It can be put on some data storage or travel through a wire or the air. It is used to move information over space, time and machines of different types and architectures. It may be specified in a way independent of the media it exists on, or it may be tightly bound to it.

If the binary file format is independent of the media, it leaves some elementary data properties to the media format. In such a case we are most probably speaking about a “file format”, or an “application layer” if the data travels over a wire.

If the binary format is dependent on the media, we usually speak about “protocols”. Both have their specific quirks, tips and tricks.

Since a “protocol” is both data and media specific, and a “file format” is just data specific, let me first talk about the “file format”.

Note: all the following text assumes that the file media uses “bytes” as its elementary transaction data elements.

A bad example

Ok, so let me show how to do it wrong.

Assume now that I am a C programmer. I live in the C world (not quite like the A or B class worlds 😉 ), so when I was told to define a simple file format I did something like this:

typedef struct{
    char identifier[32];
    unsigned int number_of_entities;
} Header;

typedef struct{
    float X;
    float Y;
    float Z;
    int attributes;
    signed short int level;
} Entity;

and I said that a file consists of a Header and a number of Entities following it.

I then said that:

  • identifier is a text identifying the format, which is “Mój żałosny format” (Polish for “my pathetic format”). Notice that I intentionally formulated it in a non-English language;
  • number_of_entities is the number of Entity elements following the header;
  • X,Y,Z are some coordinates;
  • attributes are some attributes;
  • level is the level of importance assigned to an Entity.

Ignoring the meaning of the data, is this a good specification of how to represent it in a binary format?

What do You think?

I think it is very bad.

Characters

In the C world “char” is vague. Depending on the machine or compiler it may be a signed or unsigned, 8 bit or longer, integer number. Notice that C does not say “integers” are binary. It only says an “unsigned integer” is binary.

Second, there is always the problem of how to represent an actual text and how to encode it to its binary form. Like how to turn “Mój żałosny format” into bits so that it can be read by any machine in the world and understood correctly.

The source of all the problems with character encoding comes from a typical Anglo-Saxon arrogance. Since the very beginning of computers in Poland we have always struggled with it. A “character” was mentally equal to ASCII and that was all. And we in Poland needed more than the mere arrogant ASCII. I assure You that those little dots and lines over the characters ąęźżć have a critical meaning. Like in the famous: “Ona robi mi łaskę” (she is doing me a favor) which, if stripped of those little lines, turns into: “Ona robi mi laske” (she is giving me head). You may guess what difference it makes whether Your wife receives an SMS with the second sentence instead of the first one. And yes, it still does happen. The arrogance of the telecoms and Google is so high that Android smartphones by default strip the Polish letters from all SMS messages without a warning. They claim they do it to save Your money, because the telecoms price a Polish SMS at twice the ASCII SMS. Well… what is 0,01$ compared to the costs of a divorce?

But back to business.

Whenever You say “character” or “text”, You must specify what character encoding is to be used. If You say ASCII, then it is fine. But You may be polite and say “UTF-8”, which is, I think, a good compatibility path. UTF-8 text always looks acceptable when understood as straightforward ASCII (just some “dumb letters” appear) and can be processed by any 8-bit character routines which are unaware of the UTF-8 encoding.

When You specify the encoding, it is wise to avoid the “char” type and use “byte” instead. A byte is always a sequence of 8 bits. Just for clarity.

So the specs should be:

byte [32] identifier ; //An UTF-8 text.

Length of a text

The second element of any text specification is to say how to figure out where the text ends.

In my example I intentionally used 32 bytes and a shorter text. How should the end of the text be detected?

The standard C way is to add a zero at the end. So a 32 byte long array may carry up to 31 ASCII characters and a hard to predict number of UTF-8 characters. Notice that this approach means that a character of value zero is prohibited. If such a character is present in a text, it may result in a false, early “end of string”. And in a disaster, if it was used to say “and after this string the next element follows”.

In Java, for example, binary zero is a fully valid “character”.

The other method of saying what the length of a string is, is to add a “length field” which clearly specifies the length of the following text.
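
For example, in the same pseudo-notation as before:

unsigned int8 text_length ; // length of the text in bytes, not characters
byte [text_length] text ;   // UTF-8 encoded, no terminating zero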

In the example I made:

byte [32] identifier ; //An UTF-8 text.

I did, however, decide to set the space for the text to a fixed size. I did it so that the header would have a fixed, known and finite size.

For short, bounded texts it is acceptable to do it that way and to say:

“The encoded text ends with either a zero byte or at the 32-nd byte. If the encoded text is shorter than 32 bytes, all remaining bytes should be zero.”
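
A minimal C sketch of a writer obeying that spec (the function name is mine, not part of any format):

#include <string.h>
#include <stdint.h>

// Fills the fixed 32-byte field: UTF-8 text, zero-terminated unless it
// occupies all 32 bytes; any remaining bytes are zeroed as the spec says.
static void write_identifier(uint8_t field[32], const char *utf8_text)
{
    size_t n = strlen(utf8_text);
    if (n > 32)
        n = 32;        // a full field carries no terminating zero
    memset(field, 0, 32);
    memcpy(field, utf8_text, n);
}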

Integers

We have:

   unsigned int  number_of_entities

and

   int attributes
   short signed int level

What exactly does it mean?

An integer is a signed number which can be iterated from -N to N with increments of “1”. That’s all. In C, that is. So the first thing we have to clarify is: this is a binary integer. If we don’t say it, it may be, for example, a binary-coded decimal.

Second, binary signed integers may be encoded with a sign bit, a bias, one’s complement or two’s complement. Please specify it.

The two above can usually be skipped, because binary two’s complement integers dominate the world. So if You do not specify it, You may be 99% sure that over the next 20 years or so (until quantum computers dominate) coders will read “signed integer” as a “binary two’s complement” number.

But how long are those numbers? How many bits or bytes do they have?

“int”, “short”, “char” and the like have in C only a lower bound on their size. If You are in Java or C# You are very specific when saying “int”. But if You do not say in the specs that “all types are Java types”, then no, You haven’t said anything about the exact length.

unsigned int24 number_of_entities
unsigned int11 attributes
signed int24 level

This looks much better. But it is still incomplete. What is the order of bytes? Least significant byte first? Most significant? Or some other mash-up? Please say it. For example, “unsigned int24, most significant byte first” tells the reader that the value 0x012345 is stored as the byte sequence 0x01, 0x23, 0x45.

Now You have probably noticed the int11 type. No, it was not a mistake. An 11 bit long type. In 99% of cases You won’t be needing such types, but if You do, it is wise to know what to do.

Now please consider: how would You interpret the above structure of three numbers? At which bit does “level” start? At 24+11? Or at 24+16?

If Your data is aligned to a certain number of bytes or bits, You must always specify that.

Floating points

Basically the screw-ups You can make are the same as with integers. You must specify the format (e.g. IEEE 754 binary32), the byte order and the alignment.

read(file, &header, sizeof(Header))

This is the most tempting line to write in C when dealing with a binary file format. A nice, portable way to read the whole Header in one haul.

Never do it. Never.

Alright, I was joking. You can do it. I do it. But only if You are going to use such code on a fixed, known CPU architecture and with a fixed, well known C compiler.

Why?

Because C compilers can arrange structure fields in memory the way they like. The only thing they must preserve is the ordering and the type. They may add gaps, empty padding and so on. For example, an MSP430 CPU can fetch 16-bit data from an even address with one instruction, but to do it from an address aligned only to a byte boundary it needs four instructions. So most C compilers will put all 16-bit data at a 16-bit boundary and will lay out all structures that way.

So depending on the CPU and the compiler, even if You use properly sized types for the fields, the size of a structure in memory may differ from the sum of the sizes of all the data in it.
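The portable alternative is to read such a header field by field, with the sizes and the byte order spelled out in code. A minimal sketch in Java, assuming the hypothetical header fields from my earlier examples:

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class HeaderReader
{
    String identifier;
    int    numberOfEntities;

    /** Reads the header field by field instead of overlaying
        a structure over a raw buffer. */
    static HeaderReader read(InputStream in) throws IOException
    {
        DataInputStream data = new DataInputStream(in);
        HeaderReader h = new HeaderReader();

        // Fixed 32-byte, zero-padded UTF-8 field.
        byte[] id = new byte[32];
        data.readFully(id);
        int end = 0;
        while (end < id.length && id[end] != 0) end++;
        h.identifier = new String(id, 0, end, StandardCharsets.UTF_8);

        // Unsigned int24, least significant byte first.
        h.numberOfEntities =  data.readUnsignedByte()
                           | (data.readUnsignedByte() << 8)
                           | (data.readUnsignedByte() << 16);
        // ...the remaining fields follow the same pattern.
        return h;
    }
}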

But if You, like me, are coding on micro-controllers of a fixed brand with a fixed compiler, then it pays back in terms of code size and speed to use a hack: define so-called “packed structures”, collect them in “packed unions” and lay them over the memory buffers used for data transfer. It is an excellent, foolproof, easy-to-maintain way of decoding incoming data at near zero cost. Providing You cast in stone Your compilation environment.

“Magic forms”, user friendly data computation “wizards”

Today I will try to show You something useful: a certain coding template. Yet I will not abuse Your intellect and I will NOT show You the full code. It will be just an idea I used many, many years ago.

The task

Now imagine You were given the following drawing:

A sketch of a shaft and a wheel with a "key" ("dado") and some dimensions.

This is a technical drawing of a joint with a “key” (I was not sure of the English word; the term is indeed “key”, and the groove it sits in is a “keyway”), a shaft and a wheel. Basically a piece of metal is put inside a groove in the shaft; it carries the load and makes the wheel spin. There are some dimensions, some material data like the maximum pressure, and some load like torque or force.

You are probably staring at it and not understanding anything about it. Well… that’s life. You have to learn all the time. Anyway…

Imagine now that Your boss (I’m playing this role today) has simply told You: “Make a software wizard which will allow users to calculate the parameters of this mechanical contraption.”

Obviously You will need some math. It will look like (1) on the doodle below:

Some math equations for the shaft connection.

Equation (1) is the plain equation expressing the pressure on the “key” as a function of its dimensions and the load. Obviously there will be far more such equations, even for this simple mechanism. But that should not bother us now.

Then there is an obvious question You will most probably ask:

What should I compute? What are input data and what are output values?

You may ask Your boss. If he is dumb, he will scratch his head and You will probably end up with a starting dialog in Your “wizard” looking like this:

A mock-up of a dialog with a drop-down list for selecting which case to compute.

Many companies did it that way. Autodesk did it and is still doing it, for example in their “spring” or “gear” wizards in the Inventor suite. And You know what? I have to use it sometimes and I curse it each time. There is never a proper choice on that drop-down list!

Obviously even for a single equation like (1) You will need too many options on that drop-down list (three, that is) for it to be readable. For two equations You will get nine options. And so on, and so on.

As far as I remember from tech school, this simple shaft-key-wheel connection needs about six such simple equations. How many choices would You have to put on that drop-down list to cover everything?

Not every possibility is a reasonable possibility

This will probably be Your second thought. Not every combination of known input data and unknown variables to solve is reasonable, so we can strip them off. This should let us cut the drop-down list down to, let’s say…

STOP!

Don’t do that! This way of thinking is totally wrong. You can never guess for what purpose the “wizard” will be run. Sometimes people will be using it to design a new joint. Sometimes they will just be trying to figure out why the heck it broke and what the maximum load was. Sometimes they will just be checking what would happen if they had changed the steel used for the “key”.

You may figure out four or five reasonable combinations, but You won’t be able to figure out all of them. And if You did, the drop-down list would take ages to read.

What to do, oh, what to do?

Let us go back to the basic question:

What should I compute? What are input data and what are output values?

This time I will be Your boss and my answer will be simple: “Everything and anything“.

Blah… I would even show You how it should look:

It should be just a dialog in which the user enters what he/she knows, leaving the unknowns empty, and then presses “SOLVE”. At that moment everything that can be solved should be filled in with computed data. Everything that cannot be solved should be left empty. And if something is in contradiction, it should get struck out.

Gosh…. what a bastard I am.

You may pause at this moment and try to figure out by Yourself how to do it. And please notice: it CAN be done and, even worse, it is EASY to do.

Have You finished? Then please continue reading to see how I did it.

Math knows everything

Let us look again at the equations:

This time look at (2). This is a plain, abstract function:

f(x,y,z)=0

This is how mathematicians like to present an abstract equation.

Obviously not every equation in this simple example will have such a form. Most probably some will have:

g(x,y,z,a)=0

or will depend on even more variables.

Now something important. In most cases (99%, I think) You will be able to factor any equation binding any number of parameters into a set of three-argument equations like:

g(x,y,z,a)=0
is the same as:
g1(x,y,A)=0
where A=g2(x,a,z)

by introducing some “hidden” variables.

We like three-parameter equations, because we can easily solve them and present them in the form:

f(x,y,z)=0
x=fx(y,z)
y=fy(x,z)
z=fz(y,x)

Obviously You can do it for four arguments too, but it gets more and more complicated from there.

What does it mean?

Simply this: You can now compute any of the x, y, z variables knowing the other two. Well… so simple and so dumb. What does it give us?

Stop looking at the equation as at the math and start looking at it as a set of three variables. Each of those variables can be in one of two possible states:

  • “unknown”, valueless, needing to be computed;
  • “known”, having a value assigned to it.

Then look at x=fx(y,z) as at a “rule” bound to three variables. This rule has a simple method:

condition apply()
{
 if (y is "known" and z is "known" and x is "unknown")
 {
    make x "known" and have value: fx( value of y, value of z)
    return "state-updated"
 }else
   return "state-not-updated"
}

Now simply imagine that You put them on two lists:

  • the list of “variables”;
  • the list of “rules”.

The key to success is to put all the rules on the list. That’s why I insisted on the three-variable form, because it is simple to do then.

All Your code then needs to do is:

  • process the dialog input lines into variables: set those the user filled in to “known”, make the others “unknown”;
  • loop through all the rules and call “apply()”;
  • check if any of the rules returned “state-updated”. If yes, loop again. If not, finish the computations;
  • update the dialog with data from the variables.

This simple algorithm is based on the idea that a variable can transition from “unknown” to “known” but not backwards. Thus the loop is bound to stabilize at a certain number of “known” variables and voilà! We have computed everything we could with the data supplied by the user!
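A minimal Java sketch of this fixed-point loop, as I would write it today (the Variable and Rule types are my own, hypothetical names):

import java.util.List;
import java.util.function.DoubleBinaryOperator;

final class Solver
{
    static final class Variable
    {
        boolean known;
        double  value;
    }

    /** One rule "x = fx(y, z)": computes x once y and z are known. */
    static final class Rule
    {
        final Variable x, y, z;
        final DoubleBinaryOperator fx;

        Rule(Variable x, Variable y, Variable z, DoubleBinaryOperator fx)
        {
            this.x = x; this.y = y; this.z = z; this.fx = fx;
        }

        /** Returns true if it moved x from "unknown" to "known". */
        boolean apply()
        {
            if (y.known && z.known && !x.known)
            {
                x.value = fx.applyAsDouble(y.value, z.value);
                x.known = true;
                return true;
            }
            return false;
        }
    }

    /** Loops over all the rules until no rule updates any state. */
    static void solve(List<Rule> rules)
    {
        boolean updated;
        do{
            updated = false;
            for (Rule r : rules) updated |= r.apply();
        }while (updated);
    }
}

For an equation like (1) You would register three such rules, one per solved form: fx, fy and fz.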

Detecting contradictions

You may extend the state of variables to a set:

  • “unknown”
  • “known”
  • “known, but in contradiction”

And throw in the rule:

condition apply()
{
  if (all x,y and z are "known")
  {
      d = f (x,y,z)
      if (d is not in bounds )
      {
         make all x,y,z transit to state "known, but in contradiction"
         return "state-updated"
      }        
  }
  return "state-not-updated"
}

This rule will simply detect the problematic variables, so You can indicate them to the user.

There is something, however, You should have noticed:

if d is not in bounds

In an ideal world, if x, y, z are correct, then f(x,y,z)=0, so d==0. In the real world data are supplied with a finite accuracy and floating point computations are also of finite accuracy. So in code the:

x = fx ( y, z)
assert( f(x,y,z)==0 )

will almost always report a problem. This is just something You should take into account.
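In practice the bounds check becomes a tolerance test. A tiny Java sketch of the idea (EPS is my assumption; You must tune it to the accuracy of Your data):

// Inside the contradiction rule: tolerance instead of exact equality.
static final double EPS = 1e-6;

static boolean inBounds(double d)
{
    return Math.abs(d) <= EPS;
}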

It is up to You to decide whether to continue with the contradicted values or not. I would continue, because they won’t affect the base rule (“known” != “known, but in contradiction”).

What else can You add to the “magic”?

I, as a user, will sometimes fill in just some of the data. The algorithm may fill in the rest, but then I may want to manipulate my own data. After this manipulation something may become contradicted. As a result, I may have to delete all the computed data to make the algorithm run again. This will annoy me.

You may then expand the states to:

  • “unknown, by user”
  • “known, by user”
  • “known, by computation”
  • “known, by retry”
  • “known, but in contradiction”
  • “unknown, by retry”

All data entered by the user get “known, by user”. All data which got to the GUI from a previous algorithm run get “known, by computation”. All empty fields are “unknown, by user”.

Then update the code for computation:

condition apply()
{
 if (y and z are "known, by *" and x is "unknown, by *")
 {
    set x value to be equal to fx( value of y, value of z)
    if x was "unknown, by user" make it "known, by computation".
    if x was "unknown, by retry" make it "known, by retry".
    return "state-updated"
 }else
   return "state-not-updated"
}

and for contradictions:

condition apply()
{
  if (all x,y and z are "known, by *")
  {
      d = f (x,y,z)
      if (d is not in bounds )
      {
         for each variable  x, y, z
            if variable was "known, by user" make it "known, but in contradiction";
            if variable was "known, by retry" make it "known, but in contradiction";
            if variable was "known, by computation" make it "unknown, by retry".
         return "state-updated"
      }        
  }
  return "state-not-updated"
}

With this simple trick the user data will always take preference over computed data.

Obviously You need to show this difference to the user and let him/her toggle the state. For example, italics may be used for computed data and some button may be added to make them “known, by user”.

Notice there was a potential for “infinite looping”, since a contradiction would essentially toggle data back to “unknown” and open a path to compute it again. That’s why I introduced that complex state scheme with “unknown, by retry” and the like, so that the state transitions cannot cycle.

Possible problems?

The same as in all math. Quadratic equations. Sin^-1. Everything which gives more than one solution. Gladly, in 99% of cases You will have a “reasonable” rule to select among them. Widths can’t be negative, angles wrap around each full turn, and so on. This “magic” is not a true “magic” and sometimes You may get stuck. Gladly, You will see it when You do the math, before even a single line of code is written, so it won’t cost You much to step back and try another method of presenting that problem to the user.

Summary

I think You should now know how to design simple to use, powerful and easy to code “computation wizards”. Please use this idea wherever You like. I used it long, long ago, and my users were extremely happy with it. Although it was surprising to them “how the heck it could solve every possible input combination?!”

The trickiest part for users was to figure out why it could not compute some data. In such a case a good help page showing the math, or a tooltip showing which equation is used to compute the data or to detect a contradiction, is a must. Never forget about it.

Something I like about Microsoft’s policy of fixing their APIs

Usually I don’t like much of what they do. I don’t like bugs, I don’t like APIs which are broken at the conceptual level.

But I do like this one thing: they do not fix them.

Well… not exactly. I think they fixed the USB-CDC driver in recent years… and many devices on the market broke. Gladly, not the devices I made. Obviously there are other bugs which should be fixed (e.g. the system freeze when closing a serial port while data are flowing into it at a high rate; it is still there, after 25 years, and this I don’t like), but they rather touch the implementation instead of the contract.

But back to business.

When I was younger I thought that one of the worst things I could blame Microsoft for was the total lack of reaction to bug reports made by coders… if You were able to file them at all. At a certain moment in time msdn.com, the site where the API documentation was presented, allowed comments. I was able to see there texts like: “Hey, it is past ten years now since I wrote that those two literals are exchanged. Will You ever fix it?”

They did not. They just disabled comments.

I was really mad then. Those comments, even if they were throwing rocks at Microsoft, helped me a lot, since I did not have to figure out the bugs by myself. But I was even more mad at them for not fixing those errors.

Today I am older and my view has changed.

If You don’t fix an API bug within six months of release, You must not fix it at all

Six months is an arbitrary term. It may be more, it may be less. But surely if You were saying in Your API:

#define WHITE rgb(0,0,0)
#define BLACK rgb(-1,-1,-1)

for the last ten years, then please, never ever fix it!

Ever wondered why we have four different sound APIs in Windows? Why not just one?

Because none of them was ever well thought through. All had bad design flaws at the conceptual level. So bad that fixing them would change how they work. Yet Microsoft let them be out there in the world long enough that people adapted to them.

Don’t fix, abandon it

Imagine You wrote an API. You did it quickly, released it to the public and moved on to another, more pressing subject. Time passed by, You returned to it and found: “Damn, it is so badly fu*ed up! I should have done it a totally different way in that place!”

The question is: Should You fix it or not?

It depends. But if by any chance Your mistake is at such a level that fixing it would change how the API works, then please seriously consider not fixing it. Especially if it has been on the market for a year or two already.

Instead, drop it. Make it from scratch, call it a new API 3.0 and offer it in parallel to the old one. This way applications using the old API will keep working without a problem, and yet You have the freedom to try to do it right this time.

And this policy is what I like more and more about Microsoft.

But it will cost me money to keep two APIs!

Both “yes” and “no”. It depends on how good Your new API is. And I am not thinking about “It is so good that users will stop using API 1.0 and move to API 3.0 so I can scrap it”. No. Never think that way.

Think about…

Compatibility layer

… compatibility layer.

If Your API 3.0 is so good, shouldn’t that mean it can do everything API 1.0 did? Of course in another way, using other calls, other data, another control flow. But if it is really better than 1.0, then You should be able to implement API 1.0 on top of the new API 3.0.

This is what I call a “compatibility layer”.
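A minimal Java sketch of the idea (all the names here are hypothetical, just to show the shape): the old API survives as a thin adapter delegating to the new one.

/** The old contract, kept alive for existing applications. */
interface AudioApi1
{
    void play(byte[] samples);   // blocking, fire-and-forget
}

/** The new, redesigned contract. */
interface AudioApi3
{
    AudioStream open();
}

interface AudioStream
{
    void write(byte[] samples);
    void close();
}

/** The compatibility layer: API 1.0 implemented on top of API 3.0. */
final class Api1OverApi3 implements AudioApi1
{
    private final AudioApi3 api3;

    Api1OverApi3(AudioApi3 api3){ this.api3 = api3; }

    @Override public void play(byte[] samples)
    {
        // Each old-style call becomes open/write/close on the new API.
        AudioStream stream = api3.open();
        try{
            stream.write(samples);
        }finally{
            stream.close();
        }
    }
}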

This approach adds even more benefits.

First, You have a big job to do. Implementing API 1.0 over API 3.0 will test Your concepts thoroughly. This is good. Possibly at this moment You will find out how much worse Your 1.0 was. This is also good, because it happens at the right moment: You may still fix the 3.0 concepts!

Second, You may just push the compatibility layer through all the tests You made for API 1.0. This is a big bonus: all the tests You wrote for 1.0 are also testing Your new 3.0 API. A big amount of money can be saved.

Third, since they are now built one on top of the other, You just maintain 3.0 and push it through both the 3.0 and 1.0 tests.

Simple, isn’t it?

Forcing upgrades: when stability is more important than security.

Last week my Inkscape got updated. I run Ubuntu and disabled all the updates I could find, yet it still updated. Why?

Because “snaps” do update. Dot. There are time-consuming ways to trick them into delaying an upgrade, but there is no way to say: “I’m opting out of it.”

This made me think: why? Why would somebody choose such a method of distributing software?

The first thing that came to mind was: “Damn perverts! They are getting off on forcing their upgrades up my ass!”

Maybe they are, maybe they are not.

So why?

Keeping Your soft up to date keeps You safe.

Aren’t we told this round and round? Aren’t we? Upgrade, upgrade, upgrade. Keep installing all patches to close security holes!

Well…

Ever pondered a bit why those holes got there in the first place? They did not just spring from the air. The cases where software “rots” into insecurity are extremely rare. In about 99% of cases the security hole is just a bug built in from the beginning.

Why should I believe that the upgrade won’t bring a new security hole?

Security by obscurity

The above sentence is something that has already cost hundreds of thousands of Your favorite currency all around the world. Some systems were built around the concept: “If we don’t tell anyone how to do it, then nobody will ever do it, right?”

The IT security teams are screaming now: “NO!!!!! Not again!”.

Security by obscurity, security obtained by hiding some possibilities from users, is not security at all. You didn’t tell Your physical address to any damn advertisers, yet Your mail-box at the door is full of their crap. They just hacked Your “obscurity”!

Why am I talking about it here, in the context of upgrades?

Because the people who promote the continuous upgrade policy are saying to us between the lines: “Having a new security hole is much, much better than having an old one”.

Each upgrade, especially a functional one, may, aside from closing a known security hole, bring with it a new, unknown one. This new hole will remain unknown for a certain amount of time, so we are told to feel safe.

Really?

An upgrade is not a “security upgrade”

Previously, when I was running a pure Debian system, I had the choice of a “full upgrade” policy, “no upgrades” or “security updates only”.

A security upgrade is a very specific kind of upgrade. Whenever a security hole is detected, it gets closed in all versions which are currently under maintenance. Not just in the “leading edge”: the newest, brightest shining version. In all versions which are out there and maintained.

This upgrade policy is very different from the continuous upgrade we are being sold and often forced into.

The continuous upgrade both changes the functionality of a program and closes security holes. The security upgrade is specifically constrained to not change functionality.

I think You have already noticed: $, $$, $$$$… or whatever currency You like. To be able to provide security upgrades, You have to spend much more money and effort than if You just stuff everything into a continuous upgrade chain. I suppose, depending on how badly Your code is made, You may expect from 100% down to 25% more expenses if You choose to provide security upgrades to all the relatively recent versions You sold to Your clients.

The open source environment is also burdened by that; not in money, obviously, but in human resources.

By the way, maybe readers will answer me: do “snaps” in Ubuntu support selecting a “security upgrades only” policy? I couldn’t find it, so I suppose the answer is “no”.

When upgrade hurts

An upgrade hurts whenever functionality changes. It doesn’t matter whether something “better” was added to the program or not. It hurts anyway.

Things changed, and human habits are broken. Imagine, for example, a guy who used a certain program day by day, hour after hour, for a year. He learned that a certain option is there and there, and he touches it, let’s say, once every minute of his work. That gives about 129’000 uses a year. It is enough to burn paths into the brain, so he can do it fast and without thinking.

And now imagine that this option just slightly moved, to be “better”.

Having problems imagining it? Just change Your keyboard layout and we will see how fast You type now.

The second reason an upgrade hurts is when there is a fix to something that was implemented incorrectly because it was poorly specified. Or just because some lazy sicko made it wrong.

Programs do not live in a vacuum. They cooperate. Sooner or later the incorrect implementation becomes a de facto standard. If Your support team is slow and does not react to user reports, if Your quality tests are poor, or if You have to save money, then non-critical bugs of this kind can linger around for years. People will just adjust to them. They will write tools, scripts and internal company standards based on the fact that You did it this way and not another. Imagine what happens if You change it.

The next kind of painful upgrade is when You have to follow others’ stupidity.

For example, the URL standard in the past said that:

“file://”

and

“file:///”

are correct file URLs but “file:/” was not. Yet now it says that “file:/” and “file:///” are correct, but “file://” is not. You don’t need to be a brainiac to figure out what mess is generated if one part of the system gets upgraded to the new standard and the other does not. In my specific case Java 8 uses the new standard, but GIT-win32 uses the old one. No file URL produced by Java is understood by GIT, and vice versa.

You are probably now checking Your calendar. Yes, it is 2021, and yes, I’m still using Java 8.

You were all probably hurt by many upgrades Yourself. The last one which hit me was the removal of “*.chm” help file support from Windows 10 at a certain stage, because it was “unsafe” and “nobody was using it”. My clients were using it, because their 15-year-old software was using it. At home, on the other hand, I was hurt by the enforcement of PulseAudio in Ubuntu in such a way that ALSA got mis-configured and Audacity was no longer able to record through ALSA channels, even after removing Pulse entirely. And two years of a steady work environment went into the sewer.

I am now scratching my head… should I give more examples? No… I’ll pass. It would be too boring.

But what about security?!

OK, You will say, I did not upgrade. And now I am stuck with a well-known security hole. Isn’t that dangerous?

Yes, it is. It all depends on the price. How much mess will the upgrade make, compared with how much it will cost You to put Your application on a machine fully off the internet? Or You may even think about reverting that machine to a well-known safe state at each boot. Or use a copy-on-write file system. Or put it in a VM with copy-on-write. Or secure Your firewall. Or whatever. You are the master here.

The critical point about a known security hole is… that it is well known. To all parties. You can defend against known security holes, but You have no chance to defend against unknown ones, except by totally isolating the application… which makes security upgrades pointless.

This is that simple.

But it is always Your choice.

Stable, unstable, testing… what should it mean?

Traditionally we had three typical stages of development and three typical versions available to clients:

  • stable
  • unstable
  • testing

I do not care what “unstable” or “testing” should be. They are not important to casual users. All users need to know is that those versions are risky.

But what exactly is “stable”?

Stable == for production use

Exactly. For an end user “stable” means: “working”. This version should be working. This version should be usable for daily use. Users should be able to use this version to do their work without risking that something will break and that they will lose their work, time and money.

The “stable” should be a “work horse”. It should be reliable. It should do its work. Hour by hour, day by day, month by month.

Not just “not crashing”.

A stable version should contain as few bugs as possible. I will strongly stress this. No bugs. And, most important, no regression bugs. A regression bug, meaning something now working wrongly which previously worked correctly, is the worst nightmare for users. They were used to the function, it did its job, and now in the next “stable” it doesn’t work!

What kind of stability would it be?

This was the understanding of “stable” when I entered the programming world. Stable meant “old”, “outdated”, but working. It did not have many features the “unstable” or “testing” had, but it would not fail You at a critical moment. It was tested. There was no function which did not work reasonably well. Of course not everything worked perfectly, but to find a bug You needed to try really hard.

Stable == not crashing

This seems to be today’s definition. Just not crashing.

It is so f*ing wrong!

The nowadays-stable Inkscape is unusable for any serious job. LibreOffice has so many “not working” functions that it is hard to recommend it to anyone. Just this month I and the company I work for lost about a day’s worth of work hours due to some bug in it. So yes, I know what I am talking about. Both programs are my work tools and I’m seriously thinking about falling back to the 2010 versions.

The only thing which stops me from moving to paid software is Microsoft and its “high quality”. Simply put, Microsoft Office is at least equally bad. It took me less than 15 minutes to hit a serious bug. So this approach to “stable” is not something specific to open source.

How should it be?

First and most important: dear developers, please understand that for us, users, “stable” means: “You can earn Your living using it”. It is that simple. We use it to sustain our lives. Understand it. Be responsible, at least a tiny bit.

There is absolutely no shame in releasing a “stable” once every two years. If we have a tool which does its work, most of us serious users will be happy. We do learn how to quickly do something which is not yet built in. We can even write some macros, You know. Please do not feel pressure to release a new “stable” each month.

From my point of view, “stable” should be something which is absolutely well tested. Not just released to the public, staged for some time and, if there are no bug reports, moved to “stable”. It must pass through all the internal tests, with stress on “regression tests”. It doesn’t matter how long it takes or how many working hours are spent on it. One bug in a released “stable” may cost tens of thousands of lost work-hours on the user side, counting world-wide.

The “unstable” and “testing” are a different pair of shoes. Feel free to make them as wobbly as You like. Play there, release them from time to time to users. Some will be glad to take the risk, some will not.

I’m curious: did You realize why just staging the “unstable” for a while is not a good way to ensure it is tested by users?

Did You?

It is simple: because serious, hard-working users will never use “unstable”. They use software to earn their living, so they won’t risk losing hours and hours on some “unstable, possibly badly bugged” version. 90% of the users of “unstable” are just playing with it, so they will most probably not find the bugs which would harm production use.

This is the first reason.

The second reason is that “unstable” is… well… advertised as “unstable”, right? It may contain bugs. This is normal. And if it is normal, why should users report them?

Stable is a rock-hard foundation for the next release

Everyone knows that it is easier to add a new function to a version which contains no bugs than to a version which is buggy.

A well made “stable”, well tested, is an excellent starting point.

Of course, if You have an enthusiastic team, it is hard to stop at a certain point, move all forces to stabilizing the “stable” and stop people from experimenting. Any project will usually fork at a certain moment: there will be a “release candidate” version which stays under very long, boring and stressful tests for a long time, and an “unstable” version which is in continuous development.

How to deal with that?

One of the possible development processes may look like the doodle below:

This is something which I think should NOT be done, except when Your team is very small or You work alone. In that case You will not be doing any jobs in parallel, so simply use a linear flow and base the next “unstable” on the last “stable”.

Why do I think it should not be done that way?

Because the results of the thorough testing done for “stable” never go back to “unstable”. The time spent on cleaning the release candidate is lost. One may of course think about periodically back-porting from “stable” to “unstable”. The tricky part, I think, is that “stable” will be so far back in time that porting fixes to “unstable” will have little or no effect.

I suppose it should rather look like:

OK, I do suck at sketching with GIMP. It is just quicker to do it ugly rather than fight bugs in Inkscape.

In this kind of work-flow the “stable” becomes the base for the next “stable”, and updates from “unstable” are picked and applied over that well-tested base. This is surely a lot of work, but it pays back. The “stable” team is not affected by mistakes made by the “unstable” team, and the “unstable” team is not forced to do very detailed tests.

Obviously if You have just one team, this is a pure loss of time. Just do it linearly.

The big disadvantage of this flow is that the code bases of “unstable” and “stable” start to differ more and more over time.

So I would rather see it like this:

In this method, at the certain point when “unstable” starts to differ too much, we put it on stop. The “stop” is a base of “new functions” from which we pick into a “new unstable base” which is directly based on the most recent “stable” release. We back-port the chosen “new functions” and start working on the next “unstable” release. This way the code gets synchronized and the test results introduced in “stable” are well propagated to the next code version, but at the cost that new functions back-ported from “unstable” may be more “wobbly” than expected. Yet this side-effect instability lands in the “unstable” branch, which is exactly where it should be expected.

Hey, it is complete baloney!

Yes, it is. It was written under the assumption that there are some tests made before “stable” is released. If there are no tests, then the entire work-flow is pointless and “stable” becomes something like “we just waited a bit and nobody had any serious complaints, so it must be good”.

If You are fine with such a definition of a “stable version”, then do humanity some good and go hang Yourself.

Gosh… indeed I’m rude. Gladly, nobody is reading this.

Cheers!

Do not listen to users! Seriously!

This title seems to say exactly the opposite of what I have written previously.

Yes it does.

And it does not.

Users do have a very limited point of view. When I use LibreOffice, I use about 20% of its powers. When I file an enhancement request, I have in mind my own needs. Those may be a true improvement, but in many cases they will create problems for other users.

Every person has different preferences. I like women with narrow waists; my friend is a typical “ass man”. Similarly, I’m a “read the specs” person while he is a “follow the example” one. Whatever manual I write, he can’t follow it, and I can’t grasp what his ideas really look like.

We are all different, we have different views and different needs, but the software is one for all of us.

So should I ignore users?

God forbid, no!

Carefully analyze all ideas. Look at all the bug reports and think about them. Sometimes they are real bugs; sometimes people just do not understand how it works, or think it is obvious it should work differently. Try to figure out why they did such a thing. Maybe the documentation is not good enough? Maybe some tooltips would help? Maybe this icon should be elsewhere? In many cases it is such a simple thing.

Always try to figure out what people are trying to do instead of just following what they are telling You about how they would like it to be done.

Well… I really messed up that sentence above. Let me just give an example.

Your mother wishes You to buy a loaf of bread. So she tells You: My son, You will go to the shop on Lilly Street, say “hello, how are You” to Miss Stranger and take what is on the second shelf to the left from the entrance.

This mom is Your user. And that italic text is the enhancement request. You may either follow it blindly or… just buy a loaf of bread. Surely Your mom may not be aware that good bread is also sold five steps away from Your home. And surely Your user may not be aware that the function he needs is already built in.

Help users do what they are telling You they need to do, but not necessarily in exactly the way they are telling You.

Agile is the best!

Today it is time for “Devil’s advocate mode”.

Agile programming is the best. Hot-fixes are the best. The “one-day-patch” approach is the ideal way of programming.

Deadline

One of my old colleagues used to say that the best way to figure out a deadline for a project is to find the most experienced employee, ask him or her how long it would take, multiply that by two and add one zero.

In about 90% of cases he was right. The remaining 10%, where he was wrong, took twice as long.

Doesn’t it suck? How can You make a business in such conditions?

Agile to rescue!

Agile is “divide and rule”. Make it step by step. Finish each step and move forward. This way at each step You have some product You can show to Your customers. Blahh… You can even decide to stop at a certain moment and say: It is enough, Ladies and Gentlemen, we have The Product!

By dividing the work into small fragments, You can easily measure and manage the progress. The team leader produces tasks, team members complete them. Since each task is short and well defined, team members who fall off can easily be replaced. A banana skin on the sidewalk won’t be able to kill Your project anymore.

Well… You still have the team leader. This guy looks like a critical point. And we all know that critical points are really bad things to have.

React to users’ requests!

There is good news for You. You can get rid of the critical point. It is so simple!

What is the team leader’s job anyway?

He just tells everyone what needs to be done. He needs to have a wide, final vision of the product in mind and split it into tiny steps. But what is this “vision” anyway?

Exactly.

The entire vision is just good knowledge of what users need. Laughable, isn’t it? A team leader knows best what users need? You have to be kidding. Users know best!

Keep a leader just long enough to release the product to the market for the first time. Then ask the users. Let them tell You what they need. You get a response that such and such button should be there? Do it.

Users know best. Follow them blindly.

Agile is a subscription for quality

Following users motivates them to build a community around Your product. The more of their requests You realize, the more they will like You. They will have the feeling that You care about them, listen to them and really do things for their best.

This can maximize Your income.

There is, however, one more thing You must not forget about.

The agile management style means that possibly every day You can show Your clients a new version of the product. Ah… make it a month. Each month a new function. Each month new problems are fixed. Each month…

Agile is the ideal engine behind a subscription sales model. Who would be glad to buy a subscription if a valuable update appeared once every five years? No one would be so stupid.

With agile You can deliver new version each month.

Yet still there will be people who will think: Tiny changes each month do not pay back. I’d better wait two years and upgrade when something really worth my money is introduced.

You and they are businessmen, and You both know that they are right.

Agile, however, gives You one more tool: You listen to users. They make requests. Some will be complex, some simple, some dumb, some ingenious. You are making a choice, but on what grounds? When You had a team leader, You could let him/her make the selection. Without such a leader You are boned…

…or not. Just select randomly. The team members are knowledgeable enough not to jump on those really hard to do, so let them pick what they like from the list You made by throwing a dice.

Honestly, it will be best!

If You make a good selection, users will be happy. If You select a bad idea, the user who proposed it will be proud and the others will be pissed off. Yes, for sure. Just make sure they let You know about it. This was stupid! How could You make it such and such way?! Etc., etc. They will do the leader’s job at zero pay.

In the meantime, You would have already released the “monthly update”. Whoever leaves the subscription now will be left with that idiocy embedded in it, so they will probably wait till they spot a “good and clean release”.

This false hope will surely keep them on the subscription line for a looooong time.

Sell Your shares

This strategy can keep You going for, I think, up to fifteen years. After this time things may start creaking. Your team members never picked anything hard to do. You never had a “clean and good” version on the market. There was always continuous progress, but the end is never in sight.

Your clients at this time will possibly start noticing that even though they pay monthly and get monthly updates, the hard and important things are put off indefinitely. After such a long time with You they will be in such a hard vendor lock-in that they will probably be cursing and swearing, but they will stay with You anyway.

Your team members will possibly also be reporting to You that they have more and more problems with each user request. This is natural; the original concept is so washed away by now that they have no background on what is going on in the code. Also, most probably, since the team members were easy to replace, there is no one left in Your team who was in it from the beginning and who knows what is under the hood. And since “agile” is focused on a fast sequence of tiny steps, thorough documentation is probably never made, and if it is made, it is far too outdated to be of much help.

These are the first signs that You should think about selling Your shares. The business is still worth tons of money, and You still have the reputation of a good manager. The building is starting to crack, but it still stands solid on the ground. Make the next step in Your career, close this stale stage, try something new.

A business is like a bridge: both have to collapse some day. What is important is to not be standing on them when they fall down.

Patching sucks

Many software projects, especially open source ones, assume that You can get the job done through thousands of tiny steps made by thousands of independent persons.

Just like biological evolution does.

People forget, however, that evolution does in fact progress through thousands of tiny steps… but made in millions of instances. You will get what I have in mind if You start thinking about DNA as a program and about a body as a running instance of that program. With one tiny-winy bit of difference: each biological instance runs a slightly different program.

Open source was initially about diversity. Everyone could write a program and later, with the Internet, everyone could share it. Except that writing a full office-like program would take tens of years of a single human life.

As a natural consequence, we are now in an era where we have very few high-complexity open source projects. Even if we can point out some seemingly different projects, if we dig a bit deeper we will find the same core libraries at the bottom. Notice, I was once surprised to find that this is also the case with commercial 3D CAD. I was able to dig out the information that a certain IGES file format element is not implemented in any of the CADs, because a certain back-end library simply does not implement it.

What does it have in common with patching and tiny steps?

That this concept does not work. It can no longer be compared to biological evolution, because there are not enough slightly different programs.

Patching is agile

The “patching” fits ideally with the “agile” concept of work management. You find a tiny problem or an enhancement, You identify it, You fix it. Done. Points earned, progress made. Hurray!

The problem with this approach is that You do it quickly, without even trying to look at it in a wider context.

Take for example the LibreOffice image properties. Insert an image into Writer. Right-click on it. Open the properties. Now browse through the tabs and check in how many places You can find an editable “width” and “height”. Currently (2021) I was able to find them in two places: in the “Type” tab and in “Crop”. Then, just to be a piece of a nasty old fart, find the “Orientation” parameter in the “Image” tab.

At first, in a very, very old version of Star Office… It was still Sun’s then. For those who don’t know it, please notice that LibreOffice is not a community creation. It is a fork of OpenOffice, which was a publicly released Star Office, created by Sun as a certain experiment with a certain programming model: UNO, the Universal Network Objects.

Anyway… The image in Star Office could not be rotated. The width was the width, horizontal on the page; the height was the height. Everything was consistent. The user was restricted from resizing the image outside the page boundaries; horizontal was horizontal, vertical was vertical.

Many years later a tiny enhancement was made: now the image can be rotated. Since it was added as a tiny step, the rotation setting was put into the “Image” tab, in a place totally unrelated to the tab where the size is controlled. Or was it vice versa? I don’t remember. The final result is that we now have orientation, height and width.

Now do some guessing: if I rotate a 100×200 image by 90 degrees, what are the “width” and “height”? Are they taken before the rotation or after it? How are they limited?

The answers, the last time I tested them, were:

  • the “width” and “height” are taken before the rotation is applied;
  • the limits for them are taken… also from before the rotation is applied.

So if You have a 1000×10 image rotated by 90 degrees, the width (1000) is still limited to the current page width, even though it is now physically along the page height. This is an obvious oversight. It was easy to find where to apply the rotation, but it was hard to figure out how it influences the whole user interface concept.

Then take a look at the “width” and “height” in the “Crop” tab. Type something there; do neither “Apply” nor “OK”. Go back to the other place where those two settings can be found. What value do they carry now? Is this value synchronized with what You have just typed?

Currently the answer is, sadly, “No”. They are not. So be a nasty bastard and change them in both places. Then click “OK” and roll a dice. Ladies and gentlemen, let us see who’s the winner!

I was able to find at least two other bugs in this dialog window, and I did not have to look hard. I think I would be able to find even more if I tried.

Let me now ponder: why did it happen?

When You are sewing a patch over a patch, it is time to throw the trousers away.

This “image properties” dialog is just one small window in a huge application. Yet it is stuffed with bugs, inconsistencies and misconceptions. For example, how could You even think about limiting the image width to the page width if the page width can also be changed elsewhere? Or how could You even think about adding a second “width” input field, knowing that the back-end data model prevents You from easily synchronizing settings made in two places?

This is because each tiny patch applied did solve a tiny problem… but, because it was introduced without a thorough, deep and time-consuming check of the entire concept, it created a “trap” for future generations. The person who added the limits did not realize that the page size can be adjusted and that in the future the image might be rotated. The person who added the rotation was unaware that such limiting existed. And the person who added the second, non-synchronized “width” input field really needed it there and, I suppose, believed that it wouldn’t be a problem if those fields were not synchronized.

I have been there in my own programs. Everyone has been there. Everyone has reached the limit after which there is no possibility of applying the next patch anymore, because whatever You try, You break some other patch. Just because You were in a hurry too many times.

What was the color of my trousers anyway?

Patch over patch, over patch. I’m looking now at my worn-out pants and have a problem seeing what color they originally were.

Patches hide the original concept. A patch applied over a patch must in fact deal with two things: the original concept and the concept behind the patch it is laid upon. All the misunderstandings multiply in that process, and at the end You just do not know what it does anymore.

Put it layer after layer, one over another, and after about five to fifteen years You won’t be able to make a step anymore without tripping and falling on Your face.

I was seriously thinking about fixing those bugs I had found. But then I found two more. And I thought to myself: “Dude, You have to work for a living. How many weekends are You willing to spend on it?”

I could just patch something. Right?

Wrong. The more I thought about it, the more certain I was that it wouldn’t give any good result. I would fix one thing, but because I would be laying a patch over two or three other patches, each made around very different ideas, it would break for sure. I would have to rebuild the entire dialog, at least at its core, to get any true value. This is too much work for a spare-time job.

Summary

Do not get me wrong, I’m not against doing quick “hot-fixes”.

I just strongly recommend to You: think. Use Your brain to find what side effects Your decision can have. Do not rush through a minefield with a “hooray, kill all the m*ckers” like marines do.

They die when they do that, You know.

Note: You may also like to read this post about wishful thinking.

Commenting Your code: How to kill Your software project.

In 2020, in my country (Poland, Europe), there were about 120’000 (one hundred twenty thousand) new regulatory acts. This count includes all local bills, EU directives and regulations, and all country-wide legal acts.

Over one hundred thousand new documents.

This is far, far beyond human ability to read, understand and connect all of them.

As a result, there is no single person in Poland who can honestly say: “I know The LAW”.

Hey, shouldn’t You be talking about code?

And am I not doing that? What is law anyway? Can’t we say it is a kind of program which is run on a machine called “the society”?

For a single average Joe the law is exactly like the:

assert(!kill)

Joe can run any program, but when he hits the assertion the

throw new EGoToPrisonException("Joe")

is thrown.

On the other hand, for the government and official bureaucrats, law is the program they have to “run”, according to the fiction: “Citizens must not act against the law, the government must act by the law”.

Obeying the law by “letter” or by “intent”

In March 2020, when Covid spread across Poland, our government produced a regulation which stated: “You can’t enter the forest”.

Yes, You read it right. Entering a large, green, vast space filled with trees was prohibited to prevent the spread of Covid.

The wording was clear. You can’t. Except: what would You have to do if You lived in a village which is inside a forest? Starve to death? Abandon Your job?

This single sentence only looks clear. In fact it is not clear at all. What is a “forest”? A place with trees? Then is a fruit plantation a forest? Or is a “forest” a piece of land which has the status “forest” on an official land map? If so, how could anyone quickly check it?

And “entering”? On foot? By bicycle? By car? Is driving a car over a road passing through forestation “entering the forest” or is it not?

Nobody knew any of that, nobody understood it, and nobody had any idea why this law was created.

Nobody, except a few thousand Warsaw citizens who, when the first lock-down was initiated, figured out that since they were prohibited from entertainment in the city, they could legally and happily go and rest in the forests surrounding it. This rapid motion created crowds of people, and since the law was already written in an unclear and stupid language, they did not feel any need to keep their distance.

That was the reason behind: “You can’t enter the forest”.

Later, around November 2020, the government produced a law which prohibited organizing mass protests. It is important to notice the wording: “organizing”. Not “taking part in protests”.

Due to some political reasons which are beyond the scope of this text, it was a time when protests were spreading rapidly.

And the Police stopped and punished anyone who took part in such protests, even though the law clearly said “organizing”.

In the first case (“You can’t enter the forest”) citizens were punished because they broke the “letter of the law”. The Police and institutions did not care about the idea behind this law (“keep distance, avoid making crowds”), probably because there was no way they could figure it out, and followed the somewhat clearer direct meaning of the law.

But in the second case the direct meaning was ignored, and the “intent”, the idea (the same thing again), was what the Police followed.

Clear communication of “intent”

How does this apply to the code?

Consider the following pseudo-code:

boolean validateAge(Person person)
{
    if (person.age < 18 ) return false; else return true;
}

This code is a kind of “self-commenting” code. It is clear what it does: it checks if the age is below 18 and decides something.

You may say that, compared to the “law”, it is the “letter of the law”.

Does it need any comments?

Observing the vast amount of code which can be found on the web, many programmers will say: “No. A high level programming language is self-commenting.”

And they would be right. I do agree with that.

They would be right in saying that the program exactly specifies what it does.

But what about what it should do?

Now let me extend this piece of code with a declaration comment:

/** A method which validates if a person is at an age which
allows one to consume alcohol in taverns or similar places. */
boolean validateAge(Person person)
{
    if (person.age < 18 ) return false; else return true;
}

The code does exactly the same thing, but now it is clear what it ought to do. The code says what it does, but the comment says what it was meant to do.

In this simple case You will immediately notice that 18 is an incorrect value in many regions of the world!

Would You be able to detect that without the comment? I do not think so.

Note: I do still find this comment to be bad and incomplete, but I think we should ignore it now.

The cast of “wise men”

Now let us again glance at the legal system and those two regulations I mentioned earlier. In both cases the government (Mr. M. Morawiecki) wrote the “code” but never clearly communicated the intent behind those program lines.

Now ask Yourself a question: who on the entire Earth knew what “You can’t enter the forest” was meant to achieve?

Maybe one or two persons in the government who wrote it.

Now take a look at Your software project. That big one. How many lines of code are there? 10’000? No… that’s small. My first big one-person project at high school had 30’000 lines. So I suppose it will be near 10’000 files and about 10’000’000 lines of code.

This is the limit, from my personal experience, at which a single person starts losing sight of it. The “Hey, I already wrote that?! Really? When the hell did I do that?!” is something You may start to hear. Yet still, a person who wrote that code will quite quickly get a grip on what it was expected to do and can fix a broken implementation or even a broken concept.

This way You have Your own hand-made “wise man”.

A “wise man” is a person in a software project who was in it from the very beginning, who knows or at least remembers the ideas behind all the internal workings, and who can be used as a “quick access” database to point out which part of the code is responsible for what and which part of the concept may be responsible for such and such behavior.

The important observation to make is that for such a “wise man” the code is indeed self-commenting. Mainly because of the content of his or her brain.

Kill the “wise man” and see what happens.

Short contract employees

So You have Your big project, and You had Your team of “wise men”. Now they are all gone.

You hire a new person on a short contract. You ask him/her to fix something.

How the hell can that person guess that the 18 in that example is the age for drinking alcohol and not for driving a car? He can’t, but he knows it is there, so he will do something like this:

boolean canDriveCar(Person p){ return validateAge(p); }

because he knows that 18 is the right age for it and validateAge() uses 18, right?

Then You hire somebody else when the law changes; now the law says that 16 is a valid age for driving a car. This person will then “fix”:

boolean validateAge(Person person)
{
    if (person.age < 16 ) return false; else return true;   // was: person.age < 18
}

and will break the alcohol test.

Of course these examples are oversimplified and exaggerated. But in real life similar things do happen. If there is no clear specification of what the intent of a certain piece of code was, then people will start using it according to what they see it does. In this example Your API did not clearly specify that it was about the age which lets people legally drink alcohol, so You will end up with code full of calls to this method in every place where an age of 18 was to be tested.
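A minimal sketch of how the intent could be pinned down in the code itself (the class and constant names are my own, hypothetical):

final class Person
{
    int age;
}

final class LegalAges
{
    // Intent made explicit: one named constant and one named method per
    // legal rule, so changing the driving age cannot break the alcohol test.
    static final int DRINKING_AGE = 18;   // varies by region of the world!
    static final int DRIVING_AGE  = 16;   // hypothetical, after the law change

    /** True if the person may legally consume alcohol in taverns or similar places. */
    static boolean mayDrinkAlcohol(Person person)
    {
        return person.age >= DRINKING_AGE;
    }

    /** True if the person may legally drive a car. */
    static boolean mayDriveCar(Person person)
    {
        return person.age >= DRIVING_AGE;
    }
}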

Open source projects

Open source, community-based projects are full of short-term employees. Those projects are based on them. The entire idea was: “You spot the bug, You can fix it”.

I’m a programmer. I do it for a living. I can pin-point a bug in my own code within 0.5 to 24 working hours. In well-commented code written by somebody else I need double that time.

Without comments I need ten times that, or more.

If You are running an open source project, whom would You like to attract to it most? Inexperienced greenhorns who make thousands of mistakes, or old lions with years of experience?

If You think about quality You should, I think, focus on gathering a small pride of “lions” surrounded by youngsters playing around and trying to catch some butterflies.

Most of the “lions” do, however, work for a living. They code all day long, and the last thing they want to do in their free time is to bite through the next piece of code. They probably won’t make up the mass of Your team mates. Yet sometimes they may get so annoyed by bugs that they will make an attempt to fix them.

In their free time.

What do You think: will they love to spend 100 hours digging through uncommented code, totally without any information about the main ideas and algorithms in the program, or would they love to spend 2 hours doing the fix in a well-described environment?

If You would like to kill Your open source project, follow the route with the big yellow sign: “High level code is self-commenting”.

Throw away Your debugger! Unit tests.

Honestly. Really.

The only place I still need a debugger is the embedded world, when I program a micro-controller. I suppose You will soon catch on as to why.

What is a debugger for, anyway?

This is not a silly question.

A debugger lets You step through Your program. You may inspect data, inspect variables, see how it flows. It allows You to look into Your program.

Yet it totally breaks timings. It totally fails at solving multi-threaded problems. It completely fails at solving performance issues.

And it is hand-driven. You need to click. Click. Click. Click…

Basically, a debugger allows You to see what is happening. This is a very useful tool for beginners, or in cases where there is no other way to see what is going on, like for example on a micro-controller.

But does an experienced programmer have much use for it?

In my case the answer is clear: NO.

Unit tests.

A much better workflow is to write a test which will check what You would otherwise check with a debugger.

A unit test is a small program which, when run, checks, preferably without any user intervention at all, whether a certain piece of software does what it should do.

So in my opinion, whenever You need to check if a certain routine does what it should do, You should write a test. In Java I recommend JUNIT.ORG, which can be nicely integrated with ANT build scripts to run all tests it can find automatically.

In my case I configured it in such a way that it searches a package for all class files matching “Test*.class” or “*$Test.class” and runs them.
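Such a test can be a really small thing. A sketch in JUnit 4 style (the class and method names are mine; I assume the Person class and the canDrinkAlcohol() from the earlier sketch live in a hypothetical example package):

import org.junit.Test;
import static org.junit.Assert.*;
import example.Person;            // hypothetical, from the earlier sketch
import static example.AgeRules.*; // hypothetical home of canDrinkAlcohol()

// Matches the "Test*.class" pattern, so the ANT script picks it up automatically.
public class TestAgeRules
{
    @Test public void an18YearOldMayDrinkAlcohol()
    {
        Person p = new Person();
        p.age = 18;
        assertTrue(canDrinkAlcohol(p));
    }

    @Test public void a17YearOldMayNotDrinkAlcohol()
    {
        Person p = new Person();
        p.age = 17;
        assertFalse(canDrinkAlcohol(p));
    }
}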

But writing tests is a real pain!

Surely writing a test takes time. From my experience it takes about 1/4 of the time I spent writing the code which is to be tested.

Surely, if You hit a wall and have to re-design the code, You will most probably have to re-write all Your tests. You may see at this point why it is so important to focus on code design and make a great effort to create a reasonable API which may then be put to the test.

So these are the disadvantages of unit tests.

There is however one huge advantage.

Whenever You change anything in Your package, all You have to do is run a single script. It is called “test-package.sh” in my case, and it asks ANT to run all the unit tests it can find.

You just run it and watch whether a big FAILED appears on the screen.

You may imagine what a great impact this has on so-called “regression bugs”.

Write bug-exposing tests

The next thing I recommend is to write tests which expose the bugs which were reported to You. You got a bug report, You hunted down a piece of code or a certain use scenario which triggers the bug. This is an excellent moment to write a test which will expose it.

Run it; it should fail.

Then fix the bug and run the test again. This time it should not fail.
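A sketch of such a bug-exposing test (the bug number and the LengthParser.parseLength() routine are invented for illustration):

import org.junit.Test;
import static org.junit.Assert.*;
import example.LengthParser; // hypothetical class under test

public class TestBug1234
{
    /**
     Exposes bug #1234: parseLength() crashed on an empty string
     instead of returning zero. It fails before the fix and must
     pass after it, forever.
    */
    @Test public void emptyStringParsesToZero()
    {
        assertEquals(0, LengthParser.parseLength(""));
    }
}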

The next important step is to actually add this test to Your test suite, so that every time the package/library/application tests are run, this test is run too. I insist on it, because the sole fact that a bug appeared in the first place means that Your test suite had holes in it.

If You follow this routine You will systematically plug holes not only in Your application but also in Your test suite. Thanks to that, the chance of regression errors appearing will get lower and lower over time at almost no cost.

Tests will let less experienced coders make fixes

You might have noticed in some of my previous posts that I was saying that people with knowledge are the most important resource. I also stressed that one should document the code in such a way that if the original author goes away, there are no high learning costs for new employees.

You may see tests as a piece of such documentation. Having well-documented tests is the best thing, but in my case I rarely have a strong enough will to document them as well as I document the main code. The KISS (Keep It Simple, Stupid) rule should then be followed. A small, few-lines-long test with a meaningful name can be self-explanatory enough.

Ok, so we have code and tests. We have a bug report. And we have a highly paid, experienced programmer and a newbie at half that price. The experienced guy will fix it quickly, but then he won’t be doing the job only he can do. So maybe we should give it to the newbie?

It will probably cost us even more. From what I can see, a newbie will need about five times more work hours to solve a problem than an experienced programmer. The better the code is documented, the easier it will be for him to solve it. Yet he still won’t know the code very well, and a “fix” may in fact become a “regression bug spawning daemon”.

But remember, we have a good test suite. Armed with it, even if he did not understand what is going on in every detail, the tests should catch it if he broke something. The risk of “regression bugs” appearing is low, and we now have another person who is slightly more experienced.

Provided that all tests are run…

Dependency tracking in testing

Remember, Your test suite may have holes.

If You would like to be really pedantic, You should be able to track every library or program Your company is selling which uses the library You just fixed. Your fix might have influenced them in both good and bad ways.

So for best results You should run not only the tests which are inside the library You fixed, but also the tests in the libraries which depend on it. Run them before the fix to make sure that they were not failing, and run them after You made the fix. If any of them fails, pin-point the problem and extend Your test suite with a test which will expose it.

Well… I’m usually too lazy to do that extension, and I am satisfied with the fact that the tests in the dependent library failed. My bad, but it is still better than not running those tests at all.

But I would rather use a debugger…

You would. Once. At most twice. But imagine that You need to run a regression check on, let’s say, five libraries. It would take days of Your precious life.

You may however, if You like, use a debugger to hunt bugs or to validate Your tests.

I personally prefer logs and unit tests, because I’m too lazy to use a debugger.

Allow users to tell You about Your software

It is an obvious fact that to fix something You first need to know that there is something wrong with it.

You can gain this knowledge either by doing very detailed tests “in house” or by asking the people who are using Your software for their every-day job.

In the pre-internet era the first policy dominated. You simply could get very little feedback from users, and such feedback would be either delayed, if they had to use paper mail, or not very detailed, if You had to use a phone.

Today most companies have moved “in-house” testing to the users. Even open source does that.

You publish Your software and await what people will say… if You let them do it.

I will now try to pin-point the main stopping points for feedback, as I observe them in the company I work for.

Sin number one: feedback only from customers on paid support plans

You may try to attract Your customers by selling them support plans. You call us, You mail us and we will help You immediately.

This is a good policy, with some value added to a customer.

Provided that You are not making a grave mistake.

I, as a user, have hit that wall a few times in my life. I had paid software and I found a bug in it. I could prepare a detailed report of what was wrong. But I couldn’t send it to them! Only paid support customers could do that. And even in those cases when my company DID have a paid support plan, I, as an end user, did not have direct access to it. I had to go through the IT department, and You guess what? It was too cumbersome for me, so I skipped it.

Getting information about bugs is important. You should let anyone report them, even people who use a pirated copy. You should just clearly state that if they did not pay for support, they may not expect this bug to be fixed on the spot. Maybe in a later version, if they buy it. IAR does it this way, and I suppose they get much more information that way than from paid customers alone.

Sin number two: make it hard to use

We, at the company I work for, do have a paid support plan for some of our work-horse software… yet we do not report bugs. Why?

Because it is not easy.

You need to make an account. You need to wait for confirmation. You need to provide a license number. You need to provide an email. You need to… Only the IT department knows all the information, and only they can do that. And You know how overloaded Your IT department always is.

Too many things to do just to send a report in the vague hope that it might get fixed.

Sin number three: do not ask for valuable information

Any of You who write programs have heard, at some moment, such a report from Your customer support: “Hey, I have a guy who said that it crashed”.

What can You do?

Nothing.

You need to encourage users to give You good bug reports. An excellent bug reporting system was the OPEN-JDK one (previously SUN’s). It guided You step-by-step through all the stages which would help them assign the bug to the correct person and solve the problem.

Sadly, they have recently closed it up so tight that it is really hard to report anything if You are not an Oracle affiliate.

Many Bugzilla systems were good too.

Sin number four: let us become a community!

Many companies (Texas Instruments, for example) at a certain stage decided to cut customer support costs and moved towards a community-based solution. They created a forum at which users might ask questions… and get them answered by other users.

The company personnel just monitored it and barged in when they found it necessary.

It was a total disaster.

Do not get me wrong, having a place where users of Your product may hold discussions is a good thing. Many basic problems may be solved that way. People may learn from each other and not bother You. But it can’t replace a bug report!

Second, it is a public place. When You develop a new product You do not want to expose to the world what You are working on. You simply will not ask.

Community-based support is only one element of the machine. It is good; it lets You observe how some of Your customers are using Your software. You still need a regular bug-reporting system for those who are not asking for help, or who are sure that You are the only person who can answer the question.

For example, I had a doubt about the TI documentation for a certain chip. On one page they said one thing. On another page they said something which contradicted the first. I needed to know what was true. What is the designed, expected behavior, and what is just a misunderstanding or a chip-bug artifact? Do You really think community-based support would be able to help me with that?

Sin number five: be slow

Users will give You feedback only as long as they do not know that You are screwing them.

The worst possible idea ever I have seen on the Autodesk forums. They created the “user ideas” section. This is a great idea in itself. Users propose what they would like to see in an upcoming version. They discuss it, You may ask them questions. Good. Superb.

Provided that the delay from “idea accepted for release” to actually embedding it in Your software is no more than a year. And they have up to eight years of lag, as far as I can remember.

It is a clear message to users: “This was just for show”

How many good bug reports do You expect to receive after saying something like that?

What to do then?

First, make it accessible to everyone. No paid plans, no license checks.

Second, do not force people to go through a painful registration and confirmation process. Most of my coworkers give up at that moment. They have a problem to report, and they may spare 15 minutes on it, but the whole dance with accounts, mails, passwords and user names, which You will forget anyway, is too much for them. And I agree with that. It is too much. Most users will file at most one bug report during their entire life! Sometimes they are not even interested in what will happen with it later. They just want to let You know that there is something wrong.

So let them do it!

But I will be flooded with useless crap!

Exactly. Simply do not commit sin number three. Prepare a good feedback form. Do not accept a bug report if some fields are empty… but let them say “It doesn’t matter in this case”. Since filling in this form is harder than typing “I hate this crap!”, You will redirect that “crap” to the community forum while still receiving good reports from those who really have something valuable to say.
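A minimal sketch of that rule (all the names are mine, just to show the “filled in or explicitly waived” idea):

import java.util.LinkedHashMap;
import java.util.Map;

/** One field of the report form: valid when filled in or explicitly waived. */
class ReportField
{
    String value = "";
    boolean doesNotMatter; // the "It doesn't matter in this case" checkbox

    boolean isValid(){ return doesNotMatter || !value.trim().isEmpty(); }
}

/** The whole form: may be submitted only when every field is valid. */
class BugReportForm
{
    final Map<String,ReportField> fields = new LinkedHashMap<>();

    boolean canSubmit()
    {
        for (ReportField f : fields.values())
            if (!f.isValid()) return false;
        return true;
    }
}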

This way You will always get a good bug report. With a screenshot, with system information, with selections which will help You decide to whom to assign it. With a test case built in. With information on how critical it is to the user. Sometimes even with a “known work-around”.

Then, do not rely on “we will get in touch later and ask for details”. It doesn’t work that way, especially if You have a month-long delay in the processing queue. Having to ask somebody over the ocean means that a simple “question and answer” email exchange will take two working days. Since this is the case You may, after asking a question, have to put the report back into the processing queue, where it will sit and wait for the next few days or more. Imagine what impression it makes on a customer. Do not get me wrong, it is nobody’s fault. The working hours simply do not overlap.

But if You made a good form and collected all the information at the very beginning, it may already be a bug report good enough for Your team to work on.

Open source community

I must say that most of the open source community has very, very good bug reporting systems. But they are focused on developers and returning users.

This is so simple, guys.

Do not make me create two accounts just to file a bug report!

Today I filed a bug report for Inkscape. To do it I had to create a Launchpad account and then a GitLab account. Why? What for? I just wanted to hand You a thorough and detailed bug report. It should be enough to start fixing it. I won’t be answering any questions, because I have already analyzed it up to the limit of my knowledge.

Package dependencies and why I am starting to hate Linux.

I have been using Linux at home for more than 20 years now. I started with RedHat during the pre-internet, Windows 3.11 era. At that time it was nothing more than a toy or a server system, since there were no applications which could be used to do any real work.

Then I moved to Debian and recently to Ubuntu.

Moving from Debian to Ubuntu was a desperate move. I simply thought that maybe moving to one of the most popular distributions would let me use it with the same flexibility I had at work with Windows XP.

I was wrong.

Sometimes ideas do not scale

DLL hell and the Linux approach to the problem

My first contact with Debian coincided with the rise of something called “DLL hell” on Windows. For those who do not know what it is: DLL stands for “Dynamically Linked Library”. This is exactly the same concept as “*.so” libraries in Linux.

DLLs were introduced by Microsoft together with Windows when multitasking came to the PC world. The concept behind them was: “Since many programs are using the same library, why keep two copies of it? And is there any reason to load it twice into precious RAM?”

At that moment nobody thought about “versioning”. I guess they assumed that a new version of a library would always be able to substitute for an older one.

“DLL hell” appeared when programmers discovered that even though they can ask the operating system to load a certain library for them, the system is unable to do it when the library is not there. So each program needed to carry its own set of DLLs. The installation media were bloated, and during the install programs dropped all their DLLs into one shared system folder.

You may already guess that it created a lot of problems with versioning. Should the installer overwrite an existing file? Should it delete the library on uninstall or keep it? Windows did not offer any kind of “use tracking” or “dependency tracking”.

So very quickly application creators decided to ship all the needed libraries except the system ones, and not to dump them into a shared folder. Instead everything is kept in the application folder. This way programmers can be sure that the set of libraries the user is running is exactly the same as the one used for testing. And on uninstall they can just wipe them all out.

At that moment Linux was aware of the problem and decided to solve it. It introduced the “package dependency system”. In RedHat it was RPM; in Debian it is DPKG or APT. This was a good idea.

A package maintainer specifies which packages, in which versions, his program needs, and the system keeps track of it. When a package is to be installed, the system checks what dependencies are needed and installs them too. If a package is uninstalled, the system may inform the user that some packages are no longer used by anything and ask if it should remove them.

It really saved a lot of disk space and RAM.

But it was cracked at its core.

Dependency hell

It did not take long for “dependency hell” to appear.

Any Linux user who does not chase the most up-to-date distribution but just wishes to cherry-pick packages has been hit by it. I was, in fact, three or four times put in such a deadlock that I could not install any package at all. A complete wipe-out was necessary, and that is never easy work.

At the beginning of my play with Debian it had 8 distribution CDs with about 6000 packages on them. Six thousand packages which depend on each other!

People are always making mistakes. A person who writes the dependencies of a package can make a mistake. You may test each package install and uninstall operation within a certain distribution, but You can’t do it when people start cherry-picking applications from other distributions or sources. And they will. Linux is about freedom, right?

If You do so, sooner or later You will end up with a message saying that something cannot be installed because a dependency is broken. Usually it happens when a package maintainer pinned the version of a library his package depends on: instead of linking to “xxxx.so” and leaving the job to the system, he linked to “xxxx.2.4.so”. He said: “My program is incompatible with different versions”. Or, closer to the truth: “I tested it with that version of the library, so it may not work with others”.

You may try to solve it by hand, but I never succeeded. Once I even ended up without a GUI, because the X server was a conflicting package. So sooner or later You will be forced to do a distribution upgrade. Upgrade everything You have. Yes, I know, I could upgrade item by item, manually and carefully tracking the dependencies, but honestly… it is work which may take hours or even days of my time.

And we are again back to resources. People and time.

Just testing whether 6000 packages install or not takes about 3 months of full-time work (assuming You need just 5 minutes for each of them: install, start, uninstall, check if everything was purged). And what about modern distros which count tens of thousands of packages?

There is simply no way to test their dependencies.

Stiffening it up

One thing must be said: You usually won’t have big problems if You stick with a stable distribution and do not install anything which is not inside it.

Then Linux is great!

You just select a program from a nice looking manager, click and it runs.

But is it much greater than Windows?

No.

The main difference is that for Windows You need to search the internet for a program; the next differences are the download time and the disk space used. These may differ by about an order of magnitude or even more, but that is the only difference.

From the user’s point of view it is absolutely nothing. If You have a decent job there is no problem scooping up a bit of money for a second or third hard disk or an additional memory chip. Upgrading the CPU, however, is quite expensive. So the size doesn’t matter, but the speed does.

The price You pay for working with Linux in a “distribution lock-in” model is tremendous.

You can’t install the application You need. You can’t install the version You need. And, what is the sole source of the problem: You can’t install two versions of it at the same time.

So You start cherry-picking from other versions of Your distribution or even from totally outside sources. Remember, the time needed for testing a distribution is huge, so distributions are almost always a year behind the newest stable version of the application You need.

Oh, by the way: in this model You can upgrade to the next distribution. But then You have to upgrade everything. Certainly without being sure that Your fine-tuned tweaks will survive it.

Warranty

Computers are used for work. A user must feel safe that installing some application won’t break others. A user must be able to step back if something goes wrong.

If an application works in moderate isolation, this can be easy. You install it in a non-standard location, try it, and remove it. As long as an application is not messing with some global data (the registry in Windows, /etc/ in Linux), there is no problem.

Second, if an application supplier needs to give You at least a vague warranty that his program works, he needs to test it against a certain set of libraries. Please remember, a bug fix is not always a fix. Sometimes, especially when the documentation for a library was poor, a bug is actually a “feature”. People checked how it worked and assumed it should be that way. Fixing it then breaks everything. Again, it is easiest to just include the set of libraries together with the application.

This workflow could have worked.

But in package based Linux it does not.

First, there is no way to tell where the program should be installed.

Second, the program is in fact scattered all around the system. On my server at work I tried to get two totally separate copies of a program to run, and it was a misery.

And third, You can have neither two different versions in the system nor two different copies of the application.

This could still be acceptable, if not for the dependencies. Installing application X from outside Your base distribution may pull in some libraries and bump up their versions. Your new application needs them and will be happy with them, because it was tested against them. But what about the other applications You have? Were they tested against this new library? Yes, they allow it to be installed, but mostly because otherwise You would not be able to install anything from outside of the base distribution. But they were not tested against it!

Now imagine You have found X not to be Your sweet spot. You need to remove it and revert. So You remove it. What happened to the libraries? Did they also revert?

No.

Build it from source

Yeah, right. You want a house? Build it from blueprints!

Downloading Inkscape and compiling it took me, not counting the struggle with source dependencies, about four hours. Four hours. And to be able to use it with a dedicated set of libraries I would still have to tinker with linker settings and the library search path. I am a programmer. I could learn how to do it. Can other people do it? No, they can’t. They have their lives to live.

Diversity of solutions

Of course You will say “it is already solved”. We have Flatpak. We have Docker. We have LXD containers. We have LXC containers. We have Snap. We have…

Guess what?

I don’t care.

I just want to go to a “something.org” web page, download a file in the version I like, and run it. Answer a few questions or follow a few lines of instructions and have it running. Without the need to download or set up anything else.

Then, when the program is not needed, I’ll just delete the folder in which it was installed and clean up the icons I made. That’s all. This is the usage model which is most friendly to almost anyone.

Sure, having some kind of “application manager” looks good. If it works. And if it allows You to select which version You need. If it allows You to keep two of them side by side. If it allows You to tell it where to install.

Yes, I know containers do allow that. But don’t You think it is like shooting a cannon at a fly? As far as I can see, the idea behind them is to make an isolated, lightweight copy of the system in which an application was tested and let the application see just this copy. You don’t have to compile or tweak the application. You can just take it and contain it in a container.

I do not think this is a good idea. I think this idea is even worse than the package system. It is technically sound, but from the user’s point of view it is just annoying.

Is it reliable? How does it deal with the case when the system in the container is totally different from the host system? Will a container with Ubuntu inside work well on a 5-year-old RedHat? Will the user not be confused when the file browser built into an application sees files which are not in his system? On Windows we, who do not speak English, were already hit hard by this: try to explain to a user that there is no folder named “Moja muzyka” and that, even though he sees it in Explorer, on the command line he has to use “My Music”.

All right, I wrote an application for Linux…

…and I would like to give it to users in a built, running form. What kind of package or container should I use?

I don’t know. Probably all of them? Gosh…. It won’t be easy. I will have to learn a lot. A lot. A lot more. I will waste hours and hours which I could have spent on perfecting my application.

My guess is then: screw them all.

Make sure Your application has at most one or two easy-to-get external dependencies. All the others must be included with Your application. I code in JAVA, so I just tell users to get any JAVA and clearly state with which version I tested it. If I am sending them physical media I include some JRE, even if it is a small breach of the license terms. They may simply be off-line if they asked for physical media. It would be rude of me to leave them on their own.

Then make sure Your application does not need to be installed at all. You may add some fancy installer if You like, but simply copying it anywhere and clicking it should do the job.

You may ask the system for some global data, but You must not alter it. Keep everything, all the settings and so on, in the application folder. Yes, I know it breaks almost all multi-user environment rules. But how many users actually share their computers?
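For illustration only: a JAVA application can locate the folder its own JAR sits in and keep its settings there instead of in the user profile. A sketch (the Settings class is mine; getProtectionDomain() may behave differently under exotic class loaders or security managers):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Properties;

public final class Settings
{
    /** Returns the folder in which the application JAR sits. */
    static File applicationFolder() throws Exception
    {
        File jar = new File(Settings.class.getProtectionDomain()
                            .getCodeSource().getLocation().toURI());
        return jar.getParentFile();
    }

    /** Loads settings kept next to the JAR, not in the user profile. */
    static Properties load() throws Exception
    {
        Properties p = new Properties();
        File file = new File(applicationFolder(), "settings.properties");
        if (file.exists())
            try(InputStream in = new FileInputStream(file)){ p.load(in); }
        return p;
    }
}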

Even if You do it like this and a second user would like to have his own copy of the application, let him do it. He will copy it to his own folder and start it. No problem at all. All user settings are still private.

But it doubles the disk space!

So what? Let the file system do the job. Some file systems can detect duplicate files by themselves and link them together. Some can do it with the help of a companion scanner. If You use such a file system You may expect that after a few days the disk space usage will drop.

Summary

If Linux wishes to survive at the consumer level it must provide a technically simple and uniform way of managing applications, leaving full freedom to users. The user must be able to decide which application, in what version, where, and in what number of copies he would like to install.

Software quality: is assembler so bad?

When people start talking about software quality they always tend to give more credit to higher-level programming languages than to those at a lower level.

I have a great deal of experience with Java and C, somewhat less with C++ and other languages, and a great deal with assembler on different micro-controllers. I even once wrote a Forth interpreter in assembler under Linux.

Design is critical

During all those days I have found that what is critical to software quality is not the language. It is not the ease of coding. It is all about software design. Yes, design. That time which You spend with a piece of paper, or possibly some kind of drawing software (I used Inkscape for that), drawing data structures, modules, state flows and algorithms.

Have any of You done it?

I don’t know. At the beginning I didn’t. I was too focused on coding to even think about the design. The second fact is that I was simply too dumb to know what I could do. I had to try doing it.

Nowadays I look at it differently. Today I’m confident that I can code anything at a good level, at least in Java, C and any assembler. When I’m not sure how something will come out I can always run a small “battle recon” trial: just code the concept and see if it goes well.

Design allows sharing work

I primarily work by myself. I design the hardware. I code the assembly and then I do the PC side. But five years ago my employer let me gain a coder for the team. A young lad, fresh out of university. Gosh… he was so good at coding. Knew all the languages well. Could learn the assembly of micro-controllers without a problem. A good, valuable employee.

But totally inexperienced. Just as a trial I asked him to design a certain low-level I/O protocol within certain required functional boundaries. The result was a total misery. He fell into almost all the pitfalls I did in my young age.

So we decided to do it differently. I decided to leave him some of the low-level design, which I knew would be fun for him. I provided him with examples of how I did it in the past, but let him be creative. Which he was. And now I feel sorry about that: I should have inspected his work in more detail and discussed many of his solutions.

Putting the results aside, I decided to focus on designing the algorithms and the tasks running in parallel. I drew them, described the math, and wrote down what should be done in human-readable terms without getting into too many details. It took me a few months.

But then he could implement it. And since he discussed his understanding of it with me, I could fix many problems at the concept level without losing time coding, debugging, and then getting back to the drawing board. We could simply do a low-cost “dry run” in our brains. While he was coding, I could do the second revision of the hardware and could be sure that the resulting program would be good.

Working without a design

Obviously a design does not have to take the form of a formal drawing. It may even be a piece of code. For example, a set of pure virtual C++ classes with a lot of comments can make a good design.
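In JAVA the same trick can be done with an interface. A sketch of such a “design written in code” (the names are mine, purely illustrative):

import java.io.IOException;

/**
 Design sketch: a link which moves whole frames between two devices.
 The design lives in the comments; the implementation comes later.
*/
public interface FrameLink
{
    /**
     Sends one complete frame. Must block until the frame
     is fully handed over to the hardware.
    */
    void send(byte[] frame) throws IOException;

    /**
     Waits for and returns the next complete frame.
     Must never return a partially received frame.
    */
    byte[] receive() throws IOException;
}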

But what if You work without a design at all?

I used to work that way. I just coded and coded, and coded. With a generic concept in mind I built it piece by piece, testing step by step. This was a good way for a beginner who struggles with coding, but the result was not very good. I had to re-write a lot of code from scratch each time I hit a conceptual wall. I’m still proud of some solutions I made then, but now I can see that if I had made a design, I would have found them earlier.

So where was the problem?

I used C at that time. Compared to assembler it is a very high-level language. Really. You can take an idea and code it just like that!

Discouraging designing

The process of designing software is boring at the beginning. It also gives poor results because You don’t know what can be done and how. You need to fall many times to learn how to walk. At this stage an attempt to design is like learning how to ride a bike by watching how an older brother rides his.

It is then not unexpected that a young programmer is not very fond of designing his software.

This lack of design has one additional flaw: a lack of documentation. Even a brief, possibly outdated design is better than no design at all. Especially if You are going to fix somebody else’s code. I know, I did it once. It was about 10-year-old C code written by a hired external employee. I spent a month just reading it and getting a grip on what was going on. If he had left some docs I could have gotten into it in a few days at most.

But what does this have to do with high-level languages?

Everything.

When coding is inexpensive (that is, not time consuming) You just sit down and start coding. First working function within a quarter of an hour, first working version within a few days. Excellent! And all without a design!

A high-level language gets You coding quickly but puts You off designing.

Encouraging designing

Then I went to the MSP430, which at that time had no GCC. Only a C compiler costing upwards of $2k. All I could get at a low cost for a trial was an assembler with a built-in simulator. So, at least at the beginning, I had no choice.

A first pass through the assembler let me see what code is CPU-friendly and what is not. Can it do an efficient task switch? Can it use indirect calls? How fast are interrupts? Etc., etc. Gladly the MSP430 is a fairly flexible CPU, so coding it in assembly is very easy. Yet it is still time consuming, and re-doing things from scratch is a real pain.

So with that I was just forced to do some design.

And gosh… It went so smooth!

Surprisingly, the well-made and dry-tested design compensated for the time lost on coding and debugging. Even, to my surprise, the debugging was not much harder than in C. It was all a matter of keeping a certain discipline in using a certain calling convention. Once I wrote a tool which caught the typical bugs at the source level, I was at home. And as a side effect I got nice, JAVADOC-like, automatically generated browsable documentation.

Debugging

Surely high-level debugging is easier. Much easier. It is so easy that it encourages You to type, compile and step through without a lot of thinking. This way You miss a lot of boundary cases, rollovers and the like. Remember: no design.

Second, I have observed that in a high-level language most errors You make have slight, rarely occurring and distant effects. Those are hard to hunt down.

In assembler however, when working with a design, I observed that most of the mistakes I made were so bold that everything just crashed right at the first run. Don’t get me wrong, I also spent a day or two hunting “thread-to-thread call stack fandango” bugs, but neither C nor C++ prevents You from making those.

High-level language bugs are more subtle; low-level ones are rougher, so they are easier to find.

Trust

If Your design is a responsible one, You need to trust Your tool chain. No lofty government standard of high-level coding can defend You when Your C/C++ compiler is buggy. And in the micro-controller world, yes, they are. The more efficient they are, and the more You try to squeeze out of them, the more buggy they appear.

The second element of trust are the “position dependent” bugs. Your code may live well with memory overrun bugs, provided they point into unused space. Since using new on micro-controllers is begging for trouble, there is the static memory layout, which is under the control of the linker+compiler pair. A tiny change in the code may affect it, and You won’t have any knowledge of it happening. Been there, seen it, after 15 years.

In assembler You have much more control over it. You are simply forced to allocate memory at least semi-manually. So there is a smaller chance that a change in module A will move something and break how the code in module B works.

Summary

Assembler is certainly a low-level language. But it is the only language without limits. Do You like multi-tasking? Sure, have it. Do You like exceptions? Not a problem. Need virtual methods? No pain at all. They are all just macros.

Assembler is certainly slow to code in. This is true. It may create many strange errors.

But…

Assembler forces You to design, and thus document, Your software.

Open source: nice looking or functional?

It is very sad that we have to ask such a question.

Should my program look nice or should it be functional?

If I were to answer that, I would always say “functional!” Functional! Functional!!!

A red-painted hammer head looks better than a gray rough steel one, but the red paint gets chipped off the first time You smash a nail. Yet a nicely painted red handle may look nice and does not harm the way the hammer works.

Take for example GIMP and its default single-color, gray-on-gray icon scheme. Someone spent a lot of time making it. It may look nice if You like such a style. But its usability… It is really hard to tell the icons apart, unless You are color blind, because in that case You are used to it. In most cases people recognize things first by color, then by shape. Red, roughly round: a tomato. Round… hmmppph… A tomato? An apple?

Nice looking? Ugly looking? Fashionable?

Do not get me wrong. It is good to have a nice-looking, easy-on-the-eyes program.

The key question however is: what does “nice looking” mean?

I am a quirky person in all possible ways. The way I think, the things I like, the things which make me hot. Do You think that what I find “nice” will also be “nice” for You?

As there is no universal beauty, so there is no universally nice-looking program.

The only path You may follow is to make it “fashionable”. Fashion, as we all know, changes rapidly. What is fashionable today won’t be fashionable tomorrow.

In my opinion this is a downward spiral to Your software’s death.

The question is: do You have the spare time to make it look nice?

Yes, this is a key point. It may not always be true for the open source community, because this community may have more people able to make things look attractive than people actually coding, but in a commercial company it is always true.

Yet most commercial companies go down the path of “Let’s change the UI in the next version”. Really. I spent a few hundred hours getting a grip on the Autodesk Inventor 2009 icon system. I became fluent in it. I just knew, without any searching, where to click. And then in 2011 they gave me… a ribbon!

If anyone tries to tell me that the ribbon is more efficient, I’ll kill him! It is not! It is not a linear search structure. It is a zig-zag one with a click in the middle. Not too bad if icons get grayed out when the context does not allow their use, but if they disappear completely You are gone and lost. And WTF is it doing at the top of a wide but low screen?! In Word it almost turned my PC back into a single-line Olivetti typewriter.

All right, next Autodesk Inventor version. 2019. A total rebuild of some context menus. Some new way of guessing what I would like to do. Well…

Not wrong, right? The user has a constant feeling that something is really getting done. That the annual payment is not just a give-away.

Well…

There are bugs and feature requests dating from 2012 which are not implemented yet. They say they are “considering it”. The core functionality has not changed much. Many bugs I had in the 2011 version on Windows XP are still present in the 2019 version my coworkers have.

This is certainly the wrong path. Many hours of developers’ work lost on UI bells and whistles. Will it get better because they get paid for doing upgrades? I don’t think so. I think they won’t do any important fixes. Why? Because we got into a vendor lock-in and are neck-deep in shit. We have 25 years of electronic documentation. And we are forced into a subscription model. Even if we completely opted out of Autodesk, we would still have to pay for at least one subscription just to get access to our old documents.

Each change has a price at the user’s end

So we may say that software esthetics is not about “nice looking”. It is about “fashionable”. As long as You have free resources and You are making short-lived entertainment software, it is a good idea to follow fashion. Especially in a world in which the information supplied to customers is so limited that they can’t make a reasoned decision and have to use their hearts instead of their brains to choose what to buy.

Be fashionable then.

But if You are making work-horse software for people who use it to earn their living, please consider it carefully.

Imagine You have a toolbar. Or even my hated ribbon. You arranged it in one fashion or another, good or bad. It doesn’t matter. You made some icons. Again, it doesn’t matter whether they were good or bad. People just got used to them.

Time flew by, fashion changed. You decide to change those icons. Slightly re-arrange them. Maybe make them more fashionable, maybe make them match the color theme Your company is using these days. This is not hard work. If Your coders worked well, it may be just a job for a hired artist. Change some bitmaps and run the build process.

Hmm… or maybe add something more? Making GUIs is one of the most common programming tasks. Adding some fashionable stuff should not be expensive; You may hire some low-cost coder to do that.

So You do it at a low cost.

A fashionable look may attract new customers, but what about the old client base You built up over the years? Possibly You may offer them an upgrade. Or even force it, by making the old version unsupported.

Regardless of how You do it, Your new users will spend some time learning how to use the new features. They will have to learn what the new icons look like, where the old icons have moved, and the like. Those changes are not big. They won’t impact their work much. You did not spend much money or time on it. Nobody is losing anything; everyone is happy.

Not true. Your old clients will spend much, much more. One hour times a 10,000-user base. That is like about 55 people working an entire month, just due to that change. Or a decent house in the country.

The question is: do You have the resources and time to make it functional?

This is the next question to ask.

The resources we are talking about are:

  • knowledge
  • people
  • time

I once read on a LibreOffice developers’ forum that they have problems fixing some core bugs because there is no longer anyone who really knows what is going on in the rendering system.

Then on an Inkscape forum I read a post saying “I tried it, it seems to work, I’m committing it”, preceded by many questions showing that there was a great problem getting a grip on what was going on under the hood.

Considering my company and how many commercial, paid products change over time, or rather do not change, I think this is quite a common condition in any kind of creative development.

Software is about creativity. I tend to think that designing software is at the same level of complexity as designing a country. Yet a person who was involved in it from the beginning can keep most of it in mind. Not the details, but the key points and key concepts.

We had a situation in our company where the key programmer of a certain project had a stroke. He returned after half a year, but never regained his previous, full and brilliant mental powers. Since the project was kept mostly in his mind and was interlocked with many other projects, we got a solid one-year setback for the entire team.

Software, and especially the “open source” idea, came to life between the first and second generations of programmers. We were all young then. “Old programmer” was a phrase without a meaning. Young brains could remember a lot. We never felt a need to write things down, never felt a need to write down a concept. Code was self-explanatory to us then.

Today we are starting to die out. We retire. We work for a living and have no time or strength left for coding at home.

The most valuable resource is people with knowledge.

Back in about 2010 the European Atomic Energy organization raised an alarm: “Qualified nuclear plant employees are dying out!” Well… not literally. Simply, most of them were getting close to retirement age, and it turned out that there was no personnel to replace them. The governments screwed up, focused on the technical aspects, and forgot that knowledge==people. Poland does not have a nuclear plant, but we do have a reactor running for medical and scientific purposes. The average age of an employee in that facility at that time was close to 65, with the most highly qualified personnel around 75. The retirement age in Poland was at that time 65. So if everyone who could had decided to retire, we would have ended up with a reactor and a janitor to handle it. This is something You may call “aggressive retirement”.

The solution was education: pressure to hire more personnel than needed and get them educated by the older personnel.

“Open source” is not free from this. It is even worse off.

Nuclear energy is a highly regulated branch of industry. They have to have formal protocols for almost everything. They have to have perfect documentation for every technical detail. Logs. Data. Manuals. Instructions.

And they are continuously reviewed to check that those documents are up to date and in agreement with reality.

Does “open source” have such documents?

Summary

I observe a continuous decline in software quality and a continuous increase in it being “fashionable”.

I think it is a wrong path.

If the “open source” community is seriously thinking about surviving, it should focus on quality. Because commercial companies will never focus on that. They are doing business, and there are many, many better paths to profit than making a good product.

Open source is not about profit. Open source is about the freedom of creation, but to exist it must have users. It can offer users an excellent, incomparable price, but it must be usable.

Is there then any point in following fashion? Is there any point in imitating paid software? Will a person who does not have much money choose free, open source software over a paid subscription because the open source looks more “fashionable”?

I do not think so.

Open source must focus on quality!

And in this specific case, I think, it means focusing on documenting the code.

Open source or closed source? Which gives better quality?

When people discuss the difference between closed and open source projects, there is always the aspect of “quality”. In most cases people will say: “open source gets You better quality because more people can take a look at it”.

I would rather say: it doesn’t matter.

Take Inkscape, for example. I was using it very intensively at around version 0.94. It was great; sometimes slow, and it sometimes crashed, but at very well-defined points, so that was easy to avoid.

Then the next version came and I simply could not use it. With caching on it did not refresh what was on the screen, and with caching off it was unusably slow.

Then the next version came, which slowed down to a crawl when You started using arrows on Your lines.

And finally, in 1.0.1, when You rotate a path with an LPE on it, the LPE rotates in the opposite direction and everything breaks.

Ok, so it was one application. What about another open source work horse?

LibreOffice in Ubuntu.

I must say I use it very intensively at work on Windows. It works great. I know this tool and I can get almost anything out of it. Yet on Ubuntu it was a complete disappointment. With the default GTK-3 back-end it could not keep up with my typing speed in a 700-page-long document with about 70 images. The Windows version works without any flaws.

Since the GTK-3 back-end was to blame, I switched to the X back-end… and PNG images with an alpha channel appeared as black boxes.

So there is no warranty that open source is “better in quality”. The only thing which is sure is that it has a better “quality/price” ratio than paid, closed source software.

Ok, now let’s go over to the paid side. And by “paid” I mean really highly paid.

Autodesk Inventor. This is software which on the one hand is great. I use it a lot. I can’t imagine doing mechanical design without this kind of software. Yet it is one of two pieces of software which literally made me cry. Really. It crashes a few times a day. Sometimes such a crash affects files which were not open, but which linked to the files which made it crash. Sometimes it just destroys all Your work. Sometimes things which should be doable cannot be done. It leaks memory like hell, so it needs to be restarted every four hours or so. In general, be prepared to re-do everything from the start a few times just to make it work again.

Then Mentor Graphics PADS. Again, good software. And again I cried. It is inconsistent; it sometimes breaks its internal database beyond repair. It sometimes passes through validation PCB problems which show up only on the production floor and make You trash all the PCBs You paid for.

In both cases it is very, very hard to find a workflow which guarantees that there will be no problems. This explains why guys who use only one piece of software do not find it so bad. Since it is the only software they use, they are able to find a relatively robust way of using it.

I have to use all of them.

Then the king of paid software: Microsoft.

I can’t tell You much about it. I avoid it as much as I can.

Up to 2021 I was using Windows XP at work. I still have a machine running it because of some compatibility issues. I have had no problems with it since it and the anti-virus stopped updating. But I did not maintain it. This is software at my workplace, and we have personnel dedicated to maintaining it.

I was using Word and Excel, though. What can I say? Word? Zero compatibility with itself. It took us a few days to figure out that You can’t co-edit a file with the 2007 and 2013 versions. It just breaks. Then Excel… well… four hours lost during an XML import until we figured out that on a 4k monitor some menus go off the screen, with no scroll bar to reveal them. Again, it was hard to solve because we had different versions of Excel on different machines. We just connected an HD monitor.

Yet most people use it and are happy with it. As far as I know, ninety percent of them use it as a more complex typewriter. They rarely exchange documents and rarely have to edit 15-year-old documents. But still, for most of them an upgrade is a traumatic experience.

Autodesk Vault. Arghh!!!! One of the worst steps we have taken at the place I work.

But I must admit, mostly due to the incorrect translations. The word “wersja” (“version”) in one menu means exactly the opposite of what it means in another. The help is just as inconsistent. Yet You can’t switch the language, because a Polish Vault may work only with a Polish server and a Polish Inventor. We are now seriously considering switching to GIT.

So “paid” is not better.

You can fix it.

This is the best part of “open source”. You can fix it.

May I kindly ask: have You ever tried it?

I did. A few times.

Once in LibreOffice, in a low-level I/O library for scripting. I was able to pin-point the bug to the exact line and post it to the bug reporting system. I was, however, not able to set up a compilation environment, but someone else was able to incorporate the fix.

Then in JAVA, as a programmer, I could pin-point and show bugs which are present in the JDK. That is, in the part which is “open”. I have to remind You that JAVA is not true “open source”. You need to become an affiliate to get to the real, full source code, and I am not one. Yet still, the library part of the JDK is mostly open, so You can inspect what is going on.

Those were stories of success. Now it is time for a failure.

Inkscape.

The first moment a serious bug harmed me, I downloaded the source code. It took me a while to set up a build environment, fight with all the dependencies and so on. Then I started bug hunting.

And failed.

You can’t fix it.

There is a limited amount of time and effort one can spend on finding a bug in an “open source” project. I have work to do to earn my living. So the critical point is: “How fast can You find the bug?”

First, the code You are checking is code You don’t know anything about. You need some kind of “guide” to quickly narrow down the range of code You are looking at.

Then You need to quickly figure out which specific file, class, function or whatever may be responsible for the problem. For that You need to be able to quickly find out what each file, class or function is meant to do. I intentionally stress the word meant. What it does can be read from the code. But what it should do can be very different. Remember, we are hunting for bugs, and the definition of a bug is: “something is doing something the way it should not be done”.

Documentation

In both success cases I was able to pin-point the problem because of excellent documentation. Anyone who has seen the JDK JAVADOC may clearly see how well this code is documented. Sure, it sucks in many places. Yet it at least has some points which let You quickly decide: no, this library is not the part of the code I am looking for.

The LibreOffice libraries are also well documented. Probably because this was how Sun left them when they gave StarOffice to the open source community.

What is even more important, in both cases the documentation can be accessed without creating a build environment. Even though it is built from source code comments, You may read it without setting anything up. This way You may quickly find out whether it is possible to locate the source of a bug at all.

Yet Inkscape at that moment…

Well…

It sucked. Really. In about 80% of the code the only file comment You could find was the GNU license. How fast, do You think, could I have pin-pointed the cache rendering library? How quickly could I have figured out what the data model is?

I couldn’t. After about eight hours I had gotten a grip on some concepts, yet many details were still a mystery.

Since I come from a hardware and I/O intensive world, I have always been dealing with threading issues. So at the beginning I suspected some kind of race condition, because the outside effect looked exactly as if something like that was happening. So I hunted especially intensively for threading code.

At that moment I discovered that Inkscape being multi-threaded was simply a lie. They just decided to start some threads to do pixel-by-pixel rendering for each filtered object and then destroy those threads. If You have done some threaded calculations, You know this is the best way to lose performance rather than gain it.

In the end I was unable to even understand what was going on, except that they did it totally wrong in so many places. This took a month or so of my evenings.

As a result I gave up. I used Inkscape to make art for fun. I can live without that fun.

Why didn’t I downgrade it? I’m on Linux. That should answer all Your questions.

Hello and welcome!

What will this blog be about?

In most generic words: about software. And about what annoys me in it most.

But since I’m a software and hardware engineer I will always try to wrap everything around both user experience and coding.

Since it is a blog, I suppose I should say something about myself.

I belong to the second generation of programmers; that is, I started my adventure with computers on the ZX-Spectrum. If You don’t know what that was, ask the wiki. It was a simple 8-bit machine with 48kB (yes, 48 kilobytes) of RAM in which user code could sit. As standard it had a tape interface which linked via an ordinary audio channel to Your tape recorder. Gladly I got my hands on a ZX-Spectrum clone, the TIMEX-2048, and I was able to get a floppy disk drive for it. This kind of setup allowed me to efficiently start learning how to write programs. At this time I focused mostly on the built-in Basic and the Beta-Basic extension. I was still too dumb to go down to assembler, which was the next step in that era. Notice that C, Pascal and the like were out of the question due to the very limited resources, which had to hold in memory the compiler, the source code and the produced executable all at once.

Then my school bought a network of Elwro 800 Junior machines. Those were Polish clones of the ZX-Spectrum, about ten times the size, but with more RAM and built-in networking capabilities. When networking was on, they could run a clone of the CP/M operating system and connect to a floppy drive attached to a “server”. As CP/M allowed the use of the vast amount of 64kB of memory, they were capable of running compilers. So at this time I started with Pascal.

The political context of those times in Poland was the decline of the communist party. The economy opened up and many individuals focused on making a profit. Yet the large companies were falling apart or staying behind. At this moment the government decided to completely ignore the electronics market and leave it on its own. Most Polish companies which were capable of producing electronic components (transistors, analog and digital chips, computers) went down the sewer. Very soon anyone who tried to play with electronics became completely dependent on imports.

There was, however, one key component of Polish government policy which made the computer business skyrocket: the lack of intellectual property rights for software. Yes, You read that right. The lack of them.

At this time the value of our currency was extremely low. Many people were traveling to West Germany for work. In most cases they were working well below their qualifications, yet they earned about ten times the value of what they could earn here in Poland.

This was the context in which intellectual property rights have to be seen. $100 for software was not much for the western people who produced it. Yet we simply could not afford it. Notice, this problem is still present. The annual licenses for the CAD software the company I work for is using are charged at the level of about three months’ worth of an employee’s pay. If a private person would like to buy it, it would in most cases consume all of their annual savings.

But let go back to history.

So there was no intellectual property protection for software. None. Zero. So we copied everything we could. There were even shops which could do that for a small amount of money. They were keeping a stash of software, even had printed catalogs, and You could go there and buy a copy of whatever they had. There were even weekly fairs on which You could exchange the software with others. And by “exchange” I mean “copy”.

In the era of 8-bit so called “home computers” most of such exchange was focused on gaming. Those small systems were simply not powerful enough to do any useful job for adults so kids were main market. And we were gaming a lot learning in a background about electronics and software.

The next era started with the PC: a machine based on an Intel 8088 or 8086 CPU, usually with 256 kB of RAM (up to 640 kB, which was a hardware limit) and floppy disks.

Those machines were totally useless for kids. The gaming experience was so poor that we rarely used them for it. Yet when a hard disk, a good keyboard and spreadsheet software were added, the PC became a workhorse for medium-sized business accounting. Lotus, Quattro Pro and Framework were the three applications which made it popular and useful. Some even tried to play with CAD software, but at that time it was hard to get your hands on it, and it usually required far too strong hardware.

So the next workhorse for the PC was AutoCAD. That software was as bad then as it is now. It was sloooow. Taking a coffee break while it redrew an image was a popular habit. Click, go get a coffee. Click… click… Yet it was way faster than a piece of paper, and a company which owned a plotter (a kind of XY rail system carrying a set of pens which could draw in ink) could replace about ten employees with one such system.

AutoCAD was popular. The equation CAD == AutoCAD was burned very deep into our minds. However, the main reason for its popularity was that it was not effectively copy-protected, so you could just copy it without any problem.

Then I got my first PC. It was an 80286 at 16 MHz with 1 MB of RAM, a 1024×768 256-color graphics adapter and a tiny 14-inch display. It had a huge 40 MB hard disk. It could run most games (I was still a kid then) and some useful software.

The next step in my computer history was the introduction of MegaCAD on the German market. It came from a small company, was about ten times as fast as AutoCAD on the same machine, and about ten times easier to use. It became the workhorse for my school assignments, as at that time I was attending a technical school for mechanics, where we learned both how to work with machines and how to design them.

This was the first time I was confronted with software quality. But since I didn’t code much yet, all I could tell was that there was brilliant software (MegaCAD) and garbage software (AutoCAD).

This was the era of Pascal in my computer education; to be specific, object-oriented Turbo Pascal. I learned about Turbo Vision, virtual methods, objects and so on. I made my own GUI library and, as my diploma project at that school, wrote a software tool for computing mechanical elements. I was very proud of it, because I designed it to work like a “magic form”. The user clicked “I would like to calculate a gear” and a form appeared with a drawing and some data fields. The user filled in whatever data he knew and clicked “compute”, and the program then filled in the missing fields. Even today it is rare to find such a user interaction model.
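If that sounds abstract, here is a minimal sketch of the idea, written in Java rather than the original Turbo Pascal. The gear relation d = m × z (pitch diameter = module × number of teeth) is a real one, but all the names and the use of null for an empty field are just my illustration:

    // One relation of a "magic form": d = m * z.
    // A null field means "the user left it empty".
    class GearForm {
        Double m;  // module
        Double z;  // number of teeth
        Double d;  // pitch diameter

        // Fill in whichever field is missing, using the two known ones.
        void compute() {
            if (d == null && m != null && z != null) d = m * z;
            else if (m == null && d != null && z != null) m = d / z;
            else if (z == null && d != null && m != null) z = d / m;
        }
    }

The point is that no field is an “input” or an “output” by design; every relation in the form works both ways, so the user never has to think about which variables are “given”.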

By the way, ask yourself which company won: Autodesk or Megatech Software? The better one or the worse one?

Then I went to university: the Poznań University of Technology.

In terms of history it was the 80386 era, but for financial reasons I fell back to an 8088. My dad and my brother needed a computer too, and a small Schneider PC thrown into a garage sale in Germany was all I could get. I kept lagging a step behind in computing power until about ten years ago, when for the first time I bought the best machine on the market. And I am still using it. Right. So I’m still held back.

At that point I started to play with C++.

My first impression was: “what crap!”. The language was messy, but at the time I did not care much about that. The worst impression was the compilation time. With Turbo Pascal I just pressed F9 and within a second or two I had a program running. With Turbo C++ it took about ten times as long.

But hey, it had many great features. It had multiple inheritance, overloaded operators and so, so many more.

Guess what? I abandoned it in my work. I can use it, but working with C++ is for me like driving an F1 car. I drive my own real car about once a month, so you may imagine what my driving an F1 car would look like: wrrrr… zippp… crash.

At that time I also played with genetic algorithms for the first time. And I fell in love with them. The reason for my still-standing love of evolutionary algorithms is not that they are in any way a miraculous tool which solves every problem. It is their robustness. I wrote a program which used a genetic algorithm to solve a certain problem. Due to some memory limitations it needed a bit of assembly code to gain access to the so-called EMS memory (a DOS-era, 80286/80386 way of switching a 64 kB window of RAM and moving it around the whole memory). And I totally screwed it up: it wasn’t switching those memory pages as it should. Yet the algorithm solved the problem anyway.

Amazing, isn’t it?
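For those who have never met one, here is a minimal sketch of a genetic algorithm in Java. It solves the toy “OneMax” problem (maximize the number of 1-bits in a bit string), not my old EMS-crippled program, and every name and parameter in it is just illustrative:

    import java.util.Random;

    // Minimal genetic algorithm for OneMax: evolve a bit string
    // towards all ones. Selection + crossover + mutation, nothing more.
    public class OneMax {
        static final int LEN = 64, POP = 50, GENS = 200;
        static final Random RND = new Random();

        // Fitness = number of 1-bits (what we want to maximize).
        static int fitness(boolean[] g) {
            int f = 0;
            for (boolean b : g) if (b) f++;
            return f;
        }

        // Tournament selection: pick two at random, keep the fitter one.
        static boolean[] select(boolean[][] pop) {
            boolean[] a = pop[RND.nextInt(POP)], b = pop[RND.nextInt(POP)];
            return fitness(a) >= fitness(b) ? a : b;
        }

        public static void main(String[] args) {
            boolean[][] pop = new boolean[POP][LEN];
            for (boolean[] g : pop)
                for (int i = 0; i < LEN; i++) g[i] = RND.nextBoolean();

            for (int gen = 0; gen < GENS; gen++) {
                boolean[][] next = new boolean[POP][];
                for (int k = 0; k < POP; k++) {
                    boolean[] a = select(pop), b = select(pop);
                    boolean[] child = new boolean[LEN];
                    int cut = RND.nextInt(LEN);           // one-point crossover
                    for (int i = 0; i < LEN; i++)
                        child[i] = (i < cut) ? a[i] : b[i];
                    if (RND.nextInt(100) < 5)             // occasional mutation
                        child[RND.nextInt(LEN)] ^= true;
                    next[k] = child;
                }
                pop = next;
            }

            int best = 0;
            for (boolean[] g : pop) best = Math.max(best, fitness(g));
            System.out.println("best fitness: " + best + " of " + LEN);
        }
    }

This is also, I suppose, why my broken EMS paging did not kill my program: selection only needs the fitness comparisons to be right more often than not, so a partially garbled evaluation just looks like extra noise.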

Later in my work I learned how amazingly well “probabilistic algorithms” can behave.

Somewhere in the middle of my studies I went even further backwards: back to the Intel 8051 micro-controller. I played with an Atmel flash-based clone of it. It was a tiny 8-bit CPU with a very strange architecture, but it needed only a power supply and a quartz crystal to run. It could be programmed through the parallel port (the old printer port, present then on every PC) with simple pin-toggling. Such a device could be programmed in C, but to get the most out of it one really had to go down to assembler.

And then my programming path split. On one hand I was getting deeper and deeper into micro-controllers: 8051, AVR, PIC, MSP430. On the other hand I had to create PC applications which cooperated with the devices I made. So I first tried to go back to Pascal, but with the rise of the Windows era it had lost any kind of good support. Then I tried C++, but again it was a massive failure. Yes, the application worked, but the fact that some serious memory-leak bugs survived in it undiscovered for 15 years was a real let-down.

And at last I discovered Java. As a low-level embedded programmer, in those days I started discovering it by reading the virtual machine specification. The micro-controller world had taught me that any programming language can do efficiently only what the CPU can do efficiently. Yes, sure, you can force a PIC16 to do an indirect call, but if you take a look at the resulting assembly… oh gosh…

The Java Virtual Machine really amazed me. It was a clean, simple concept, yet the code verification rules were set up in such a way that it was always possible to compile the bytecode in the most efficient way. The instruction set is well balanced and allows good CPU-level optimization.
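To give a taste of how simple it is, take a trivial method and what (roughly, abbreviated by me) the standard javap -c tool prints as its bytecode:

    // Java source:
    static int add(int a, int b) {
        return a + b;
    }

    // Its bytecode, as shown by javap -c:
    //   iload_0    // push local variable 0 (a) onto the operand stack
    //   iload_1    // push local variable 1 (b)
    //   iadd       // pop two ints, push their sum
    //   ireturn    // return the int on top of the stack

The verifier can prove, before the code ever runs, that the stack never underflows and that every instruction sees the types it expects, which is exactly what makes straightforward, efficient compilation to native code possible.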

So I started with Java and fell in love with it.

And this is where I am now: the micro-controller side in assembly, or possibly in C, and the PC side in Java, usually with some tiny pieces of native C code to get at the system I/O resources.
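For the curious, the glue between those two worlds is JNI. Here is a minimal sketch of the Java half; the library name and the method are made up for illustration, and the matching C function is a few lines compiled against jni.h:

    // Java side: declare the native call and load the C library.
    public class SerialIO {
        static {
            // Loads libserialio.so on Linux, serialio.dll on Windows.
            // "serialio" is an invented name for this sketch.
            System.loadLibrary("serialio");
        }

        // Implemented in native C; the C side would be declared as:
        //   JNIEXPORT jint JNICALL
        //   Java_SerialIO_readByte(JNIEnv *env, jclass cls, jint handle);
        // Reads one byte from the device, or returns -1 on timeout.
        public static native int readByte(int handle);
    }

All the protocol logic stays in Java; only the last few centimeters to the hardware are native.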