“Content driven” versus “code driven” file formats.

In a previous blog entry I promised to tell You about two different approaches to data parsing: “content driven” versus “code driven”.

Or “visitor” versus “iterator”.

Content driven

In this approach a parser is defined as a machine which exposes to the external world an API looking like:

public interface IParser
{
   public void parse(InputStream data, Handler content_handler) throws IOException;
}

where Handler is defined as:

public interface Handler
{
  public void onBegin(String signal_name);
  public void onEnd();
  public void onInteger(int x);
  public void on....
  // ...and so on, and so on...
}

Conceptually the parser is responsible for recognizing which bytes in the stream mean what, and for invoking the appropriate handler method at the appropriate moment.
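For illustration, a minimal sketch of a Handler implementation for the hypothetical interface above (covering just the three methods shown) could look like:

// A trivial Handler which merely logs whatever the parser decides to feed it.
// Note: it is the parser, not us, who decides when and in what order these run.
public class PrintingHandler implements Handler
{
  @Override public void onBegin(String signal_name){ System.out.println("begin: " + signal_name); }
  @Override public void onEnd()                    { System.out.println("end"); }
  @Override public void onInteger(int x)           { System.out.println("int: " + x); }
}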

Good examples are org.xml.sax.XMLReader and org.xml.sax.ContentHandler from the standard JDK. Plenty of You have most probably used them.

Note: This is very much like the “visitor” coding pattern in non-I/O related data processing. Just for Your information.

Benefits

At first glance it doesn’t look as if we can profit much from that, right? But the more complicated the file format becomes, the more benefits we get. Imagine a full blown XML with a hell of a lot of DTD schema, xlink-ed sub-files and a plenitude of attributes. Parsing it item by item would be complex, while with a handler we may just react in:

// inside our implementation of org.xml.sax.ContentHandler:
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException
{
  if ("Wheel".equals(qName))
  {
    ....

and easily extract the fragment we are looking for.

This is exactly the reason why XML parsing was defined that way. XML is hellishly tricky to process manually!

Obviously we do not load the file just for fun. We usually like to have it in memory and do something with the data it contains, right? We like to have something very much like the Document Object Model, that is a data structure which reflects the file content. The “content handler” approach is ideal for that purpose because we can simply build elements in the handler methods and append them to the data structure in memory.

Easy.
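A minimal sketch of that idea for SAX, assuming a hypothetical Node tree class with a Node(String name) constructor and an addChild(Node) method, could look like:

import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Builds an in-memory tree out of the callbacks the SAX parser fires at us.
public class TreeBuilder extends DefaultHandler
{
  private final Deque<Node> stack = new ArrayDeque<>();  // path from root to current element
  private Node root;

  @Override public void startElement(String uri, String localName, String qName, Attributes atts)
  {
    Node node = new Node(qName);           // Node is our hypothetical tree class
    if (stack.isEmpty()) root = node;
    else stack.peek().addChild(node);
    stack.push(node);
  }

  @Override public void endElement(String uri, String localName, String qName)
  {
    stack.pop();
  }

  public Node getRoot(){ return root; }
}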

And the last, but one of the most important concepts: we can arm a parser with “syntax checking”. Like we arm the SAX parser by supplying an XML which carries the DTD document definition schema inside its body. The parser will do all the checking for us (well, almost all) and we can feel safe, right?

Well… not right, but I will explain it later.

Why do I call it “content driven”?

Because it is not You who decides what code is invoked and when. You just tell the parser what can be invoked, but when and in what sequence Your methods are called is decided by the person who prepared the data file.

Who, by the way, may wish to crack Your system.

Content driven vulnerabilities

XML

The content driven approach was a source of plenty of vulnerabilities in XML parsing. One of the best known was forging an XML with a cyclic, recursive DTD schema. The XML parser loads the DTD schema before it parses anything else from the XML. After that it creates a machine which is responsible for the validation process. If the DTD schema was recursive, the process of building this machine would consume all the memory and the system would barf.

Of course this gate for an attack was opened by some irresponsible idiot who thought that embedding the rules which say what a correct data file looks like inside the data file itself is a good idea…

Note: Always supply Your org.xml.sax.XMLReader with an org.xml.sax.EntityResolver which will capture any reference to a DTD and forcefully supply a known good definition from Your internal resources.
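A minimal sketch of such a resolver, assuming the known good DTD is bundled in Your resources under the hypothetical name /safe.dtd, could look like:

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Ignores whatever DTD the document points at and substitutes our own trusted copy.
public class TrustedDTDResolver implements EntityResolver
{
  @Override public InputSource resolveEntity(String publicId, String systemId)
  {
    // Never fetch systemId from the network; "/safe.dtd" is our bundled, known good schema.
    return new InputSource(TrustedDTDResolver.class.getResourceAsStream("/safe.dtd"));
  }
}

A plain reader.setEntityResolver(new TrustedDTDResolver()) then closes that gate.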

If You defend Your XML parser with a DTD or similar schema, and You make sure that nobody can stuff a fake DTD in Your face, then in most cases the “content driven” approach will be fine.

When won’t it be fine?

When Your document syntax allows open, unbounded recursion in its definition. Or when the DTD does not put any constraints (which it can’t do) on the length of an attribute. Or in some other pitfalls which I have not fallen into because I don’t use XML on a daily basis.

There is however another, even more hellish piece of API which can be used to crack Your machine… and this is…

Java serialization

Yep.

Or precisely speaking: Java de-serialization.

A serialized stream can, in fact, create practically any object it knows to exist in the target system, with practically any content in its private fields. Usually creating objects does not invoke much code, but in Java it does. Sometimes a constructor will be called, sometimes the methods responsible for setting up the object after de-serialization will be. All of them will be parametrized with fields You might have crafted to Your liking.

Possible attack scenarios range from a simple OutOfMemoryError up to the execution of some peculiar methods with very hard to predict side effects.

All in response to:

   Object x = in.readObject();

Basically this is why the modern JDK states that serialization is a low level, insecure mechanism which should be used only to exchange data between known good sources.

Preventing troubles

Since in the “content driven” approach it is the data that drives Your program, You must defend against incorrect data.

You can’t just code it right – instead You need to accept and parse the bad data and only then reject it. For example, You need to accept an opening XML tag with huge attributes, and only once Your content handler is called can You say: “recursion too deep” or “invalid attribute”.
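A minimal sketch of such late rejection, defending against deep recursion inside a SAX handler with an assumed limit of 64, could look like:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Note: the offending start tag was already fully parsed before we get a chance to reject it.
public class DepthLimitingHandler extends DefaultHandler
{
  private static final int MAX_DEPTH = 64;  // an assumed, application specific limit
  private int depth;

  @Override public void startElement(String uri, String localName, String qName, Attributes atts)
      throws SAXException
  {
    if (++depth > MAX_DEPTH) throw new SAXException("recursion too deep");
  }

  @Override public void endElement(String uri, String localName, String qName)
  {
    depth--;
  }
}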

Likewise, in Java de-serialization You must either install:

public final void setObjectInputFilter(ObjectInputFilter filter)

(since JDK 9)
or override

protected ObjectStreamClass readClassDescriptor()

in earlier versions, to be able to restrict what kind of object can be created.
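A minimal sketch of the JDK 9+ filter route, assuming the only thing we ever expect to read is our own hypothetical com.example.Point class, could look like:

import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputFilter;
import java.io.ObjectInputStream;

// Whitelists exactly one class; the trailing "!*" rejects everything else,
// making readObject() throw InvalidClassException for any other class in the stream.
static Object readFiltered(InputStream raw) throws IOException, ClassNotFoundException
{
  ObjectInputStream in = new ObjectInputStream(raw);
  in.setObjectInputFilter(
        ObjectInputFilter.Config.createFilter("com.example.Point;!*"));
  return in.readObject();
}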

Notice that even then some code will be executed regardless of whether You reject the object or not, because the sole existence of a Class<?> object representing a loaded class means that the static initializer for that class was executed.

The “content driven” approach always uses a load & reject security model.

I hope I don’t have to mention how insanely bug prone it is, do I?

Code driven

In this approach we do things exactly the opposite way: we are not asking the parser to parse and react to whatever is there. Instead we know what we expect and we ask the parser to provide it. If it is not there, we fail before we load the incorrect data.

For example, code driven XML parsing would be very much like using:

public interface javax.xml.stream.XMLEventReader
{
  boolean hasNext()
  XMLEvent nextEvent()
  ....
  String getElementText()
}

As You can see, You may check what the next element in the XML stream is before reading it.

Note: One method of this class, getElementText(), deserves a warning: it is not an attack proof concept either. A String in XML is unbounded, and a crafted rogue XML may carry a huge string inside its body to trigger an OutOfMemoryError when You attempt to call that method.
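A minimal sketch of this style of use, assuming we care only about a hypothetical Wheel element, could look like:

import java.io.InputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

static String readWheel(InputStream data) throws XMLStreamException
{
  XMLEventReader in = XMLInputFactory.newFactory().createXMLEventReader(data);
  while (in.hasNext())
  {
    XMLEvent e = in.peek();            // look at what is next without consuming it
    if (e.isStartElement()
        && "Wheel".equals(e.asStartElement().getName().getLocalPart()))
    {
      in.nextEvent();                  // consume the <Wheel> start tag
      return in.getElementText();      // beware: unbounded, see the note above
    }
    in.nextEvent();                    // not what we expect: skip it
  }
  return null;                         // no Wheel element found
}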

Very much alike, Java de-serialization might be tightened a bit by providing an API:

 Object readObject(Class of_class ...)

instead of just an unbound:

 Object readObject()

Sadly the de-serialization API in general is maddeningly unsafe regardless of the approach. Which doesn’t mean You should not use it. It just means You need to pass it through trusted channels, to be sure the other side is not trying to fool You.
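A minimal sketch of such a tightened call (a hypothetical helper of mine, not a JDK API) could look like:

import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;

// Reads one object and fails unless it is of the expected class.
static <T> T readObjectAs(ObjectInputStream in, Class<T> of_class)
    throws IOException, ClassNotFoundException
{
  Object x = in.readObject();          // note: the object already exists at this point
  if (!of_class.isInstance(x))
      throw new InvalidObjectException("expected " + of_class.getName());
  return of_class.cast(x);
}

Notice it still rejects only after readObject() has created the object, so a filter like the one shown earlier remains necessary.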

Benefits

Using the “code driven” approach we can be as sure as possible that we never accept incorrect input in the first place, instead of, as in the “content driven” approach, rejecting it later.

Simply put, what is not clearly specified in the code as expected won’t be processed. It is like wearing a muffler versus curing the flu.

On the other hand, one must write that code by hand, and usually the order of data fields will be forced to be fixed, or it would be too hard to code. One must also deal manually with missing fields, additional fields and all the other issues related to format versioning.

This is why I was so picky about formats being expandable and supporting dumb skipping.

Code driven vulnerabilities

Security? None inherent. At least if the API is well designed and all operations are bounded.

Usability?

Sure, a lot of trouble. Code driven data processing is very keyboard hungry.

But…

“Code driven” can be used to implement “content driven”

Consider for example a plain “get what we expect” code driven API.

It might look like:

public interface IReader
{
  public String readBegin(int max_signal_name) throws MissingElementException;
  public void readEnd() throws MissingElementException;
  public int readInt() throws MissingElementException;
  ....
};

This is the pure essence of the “code driven” approach. You have to know what You expect and You call the appropriate method. You call the wrong one, it barfs with a MissingElementException.

Of course it means You must know the file format down to the exact field when You start coding the parser.

If we were however to define this API to allow “peeking at what is next”:

public interface IReader
{
  enum TElement{....}
  public TElement peek();
   ....
};

there would be absolutely no problem in writing something like:

public void parse(Handler h)
{
   for(;;)
   {
     switch(in.peek())
     {
        case EOF: return; // assuming TElement includes an end-of-data marker
        case BEGIN: h.onBegin(in.readBegin(MAX_SIGNAL_NAME)); break;
        case .....
     }
   }
}

and we have just transformed our “code driven” parser into a “content driven” one. Under the condition that we can “peek what is next”.

The opposite transformation is impossible: a “content driven” parser pushes the whole file at Your handler in one go and never hands control back between elements, so there is nothing to “peek” at.

“Iterator” versus “visitor”?

Yes, I mentioned it at the beginning.

Those two concepts are very much like “code” and “content” driven, and for Your information both have been present in the Java Iterator contract since JDK 8.

First let us look at the below pair of methods:

public interface java.util.Iterator<T>
{
  boolean hasNext()
  T next()
   ....
};

They formulate a “code” driven contract which allows us to “peek if something is there” and then get it. If we don’t like it, we do not have to get it.

Then look at the method added in JDK 8, together with the introduction of lambdas and “functional streams”:

void forEachRemaining(Consumer<? super E> action)

This turns it into a “visitor” concept where in a:

public interface Consumer...
{
   void accept(T t)
};

the accept(t) method is invoked for every available piece of data, regardless of whether we expect more of it or not.
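To make the contrast concrete, a minimal sketch of both styles over a plain list could look like:

import java.util.Iterator;
import java.util.List;

public class IteratorVersusVisitor
{
  public static void main(String[] args)
  {
    List<String> data = List.of("a", "b", "c");

    // "code driven": we decide whether to take the next element at all
    Iterator<String> it = data.iterator();
    while (it.hasNext())
    {
      String s = it.next();
      if ("b".equals(s)) break;        // we may stop whenever we like
    }

    // "content driven": the visitor is fed every remaining element, like it or not
    data.iterator().forEachRemaining(System.out::println);
  }
}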

The reader may easily guess that if one loves the “functional streams” concept, which I don’t, then the “visitor” pattern has great potential.

Note: There is one case in which visitors beat iterators: thread safety. Thread safe iteration requires the user to ensure it, while visiting puts this job on the shoulders of the person who wrote the data structure.

Summary

After reading this blog entry You should notice that “content driven” parsing is very simple to use, but at the price of being inherently unsafe.

On the contrary, “code driven” is usually an order of magnitude safer, but also an order of magnitude stiffer and harder to use.

If not for the fact that code driven parsing with a “peek what is next” API can be used to implement a “content driven” parser, the choice would be a matter of preference. Since this is how it is, my proposed abstract file format must be, of course, designed around the code driven approach.
