Would you like a Byte Order Mark to go with that?

It is possible to encode a little bit of metadata at the beginning of your byte streams to let the stream itself carry information on how it has been encoded. This is known as a Byte Order Mark (BOM) and it is, as far as we know, completely optional. Some .NET Framework classes add this BOM to the start of a stream when they are told to use a specific encoding. Here is how to find the BOM and, if necessary, remove it from your stream.

Context

Recently the Neo4j graph database announced a REST interface. A couple of colleagues at Jayway are working on this. It was far too intriguing a temptation to resist. I naturally had to write a small POC .NET-based REST client to be able to talk to that server. (A blog post on this is pending within a few days.) While trying to send REST data in HTTP requests to the server we came across something interesting that caused compatibility issues between the Java and .NET worlds. The culprit was that optional byte order mark, which I had unintentionally embedded in my streams of bytes. The server did not recognize it. In a joint effort from both sides of the world (.NET client and Java server) we were able to find and identify the problem. Not that this is a new problem: it is very well documented on Wikipedia (link above). However, I feel that one more post on this topic, with a good test to show what’s going on, is warranted.

A very good tool was also used to nail down the problem: Fiddler2. This little baby of a program is an HTTP proxy that sits on your machine and listens to your HTTP traffic. Using the tool you can inspect exactly what you are sending over the wire in your HTTP requests and also view the responses coming back. Sweet stuff indeed. This is a MUST tool for you if you develop anything that sends data over HTTP.

Using Fiddler2 we were able to observe that when sending a stream of bytes as the body of an HTTP POST, the body length was 108 bytes when the request came from my code. However, when I used the built-in request builder in Fiddler2 to reproduce and resend the exact same request, the reported length was 105. Why was there a difference of three bytes?

The problem was that these three bytes, as I stated above, caused the server to throw a fit. It could not recognize the body of the message as the expected JSON string in UTF8 encoding. JSON, I’ve learned, is in UTF8 encoding by default.

The Java guys googled and I binged ;~) and we found that these three bytes are an optional “byte order mark” or BOM. It is not wrong to send these three bytes and it is not wrong not to send them. Guess what? The .NET world tends to send these extra bytes and everyone else tends not to. Why not have a bit more incompatibility, right? It’s not like we want to talk to each other anyway!

What’s up with the BOM? What is its purpose?

The BOM is a way to add a kind of metadata inside of a stream of bytes instead of sending some actual metadata along next to your stream. Option one is saying “Here is my byte stream and btw it is in UTF8 encoding”. The other way to do it is saying “Here is my byte stream and if you look at the first three bytes you can read the encoding of it”. Which is better? I can’t say I care all that much other than the fact that it caused us a problem when I tried to send requests from .NET code to the Neo4j server.
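To make those "first bytes" concrete, here is a minimal sketch that prints the preamble .NET reports for a few encodings. The class name `PreambleDump` and its `Hex` helper are placeholders of mine, not anything from the Neo4j client.

```csharp
using System;
using System.Text;

public static class PreambleDump
{
    // Returns the preamble (BOM) of an encoding as dash-separated hex.
    public static string Hex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    public static void Main()
    {
        Console.WriteLine("UTF-8:     " + Hex(Encoding.UTF8));             // EF-BB-BF
        Console.WriteLine("UTF-16 LE: " + Hex(Encoding.Unicode));          // FF-FE
        Console.WriteLine("UTF-16 BE: " + Hex(Encoding.BigEndianUnicode)); // FE-FF
    }
}
```

For UTF8 the mark is three bytes, which matches the 108 versus 105 byte difference seen in Fiddler2.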

How can you then – finally – handle these three bytes?

Well, that depends on what you intend to do. What I can show you is a piece of code that tells you exactly what this is, and then you can copy that behavior into your code and modify it to serve your purpose. The code should be pretty self-explanatory, but just to be sure: I take the string "foo" and encode it into a MemoryStream using a StreamWriter. The issue is that I tell the StreamWriter to use Encoding.UTF8, and this is where the .NET Framework adds the BOM. The resulting stream is not 3 bytes long, as you might expect. Instead it is 6 bytes long. What you have to do if you read this stream ‘raw’, by hand, in some library, is skip over the first three bytes. A better way to do it is to use a reader that handles the BOM. Finally, if you don’t want to add the BOM in the first place, you can write your bytes yourself in a more ‘raw’ fashion, byte by byte, to the stream. The .NET Framework is also good enough to have a way to find the actual BOM for different encodings. The UTF8 encoding does it this way: Encoding.UTF8.GetPreamble() (documented in the MSDN library).
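The steps described above can be sketched roughly as follows. This is a reconstruction under my own assumptions, not the original test: the class name `BomDemo` and its method names are hypothetical, and passing `new UTF8Encoding(false)` to the writer is an alternative I am adding next to the "raw bytes" approach.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class BomDemo
{
    // Writes text through a StreamWriter and returns the raw bytes of the stream.
    public static byte[] Write(string text, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        using (var writer = new StreamWriter(stream, encoding))
        {
            writer.Write(text);
            writer.Flush(); // push buffered characters (and any preamble) into the stream
            return stream.ToArray();
        }
    }

    // Writes the encoded bytes directly, without a StreamWriter, so no preamble is emitted.
    public static byte[] WriteRaw(string text)
    {
        using (var stream = new MemoryStream())
        {
            byte[] bytes = Encoding.UTF8.GetBytes(text); // GetBytes never adds a BOM
            stream.Write(bytes, 0, bytes.Length);
            return stream.ToArray();
        }
    }

    // Reads "by hand": skips the UTF-8 preamble when the bytes start with it.
    public static string DecodeSkippingBom(byte[] bytes)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        bool hasBom = bytes.Length >= bom.Length
                      && bytes.Take(bom.Length).SequenceEqual(bom);
        int offset = hasBom ? bom.Length : 0;
        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }

    // Lets a StreamReader consume the BOM for us.
    public static string DecodeWithReader(byte[] bytes)
    {
        using (var reader = new StreamReader(new MemoryStream(bytes), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        byte[] withBom = Write("foo", Encoding.UTF8);            // EF BB BF 66 6F 6F
        byte[] noBom = Write("foo", new UTF8Encoding(false));    // 66 6F 6F

        Console.WriteLine(withBom.Length);             // 6
        Console.WriteLine(noBom.Length);               // 3
        Console.WriteLine(WriteRaw("foo").Length);     // 3
        Console.WriteLine(DecodeSkippingBom(withBom)); // foo
        Console.WriteLine(DecodeWithReader(withBom));  // foo
    }
}
```

One thing to watch for: calling `Encoding.UTF8.GetString` on all six bytes does not strip the BOM. You get a four-character string that starts with U+FEFF, which is why the sketch passes an offset.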

You can also, if you like, compare the bytes of the preamble one by one rather than converting them to a string and comparing strings.
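A byte-by-byte check could look something like this. It is only a sketch: `PreambleCheck` and `StartsWithPreamble` are names I made up for illustration.

```csharp
using System;
using System.Text;

public static class PreambleCheck
{
    // True when 'bytes' begins with the preamble of 'encoding'.
    // Encodings without a preamble (an empty array) never match.
    public static bool StartsWithPreamble(byte[] bytes, Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        if (preamble.Length == 0 || bytes.Length < preamble.Length)
        {
            return false;
        }
        for (int i = 0; i < preamble.Length; i++)
        {
            if (bytes[i] != preamble[i])
            {
                return false;
            }
        }
        return true;
    }

    public static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x66, 0x6F, 0x6F }; // BOM + "foo"
        byte[] withoutBom = { 0x66, 0x6F, 0x6F };                // "foo"

        Console.WriteLine(StartsWithPreamble(withBom, Encoding.UTF8));    // True
        Console.WriteLine(StartsWithPreamble(withoutBom, Encoding.UTF8)); // False
    }
}
```

Comparing bytes avoids decoding anything at all, so it also works on data that turns out not to be valid text.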

As you can see this BOM can be handled easily if you like. The thing is I did not have a clue that it was there.

Oh, and btw, the Neo4j server now accepts REST JSON bodies both with and without the BOM! Good ‘bug’ to solve or ‘feature’ to have.

Cheers,

M.

1 Comment

  1. Great write Magnus! Worthy reading article. Btw, this [http://msdn.microsoft.com/en-us/library/system.text.encoding.getpreamble.aspx] returns the byte order mark.
