BOM Away

I find Visual Studio the best development tool there is, and I always find myself missing it whenever I have to work in some other tool (damn you, Java!). However, there is one annoying "feature" that has bugged me through all .NET versions.

Whenever Visual Studio decides a file needs Unicode encoding, it converts the file to UTF-8, which I personally find quite a good decision. What I hate is that the file also gets "spoiled" by the addition of a BOM (byte order mark). Now every one of my UTF-8 files has three nonsense bytes as a prefix (0xEF, 0xBB, 0xBF).
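For illustration, here is a minimal Python sketch of spotting and dropping that three-byte prefix. The function name strip_utf8_bom is made up for this example and is not taken from killbom.py.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present.

    Hypothetical helper for illustration only.
    """
    # codecs.BOM_UTF8 is the byte string b"\xef\xbb\xbf".
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Only a prefix at the very start of the data counts as a BOM, so the same three bytes appearing later in a file are left untouched.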

Although the Unicode Standard neither requires nor recommends this marker for UTF-8 and most programs do right by omitting it, Microsoft decided to use it back in the dark ages of Visual Studio .NET (and Unicode Notepad). At that time this might have been a good decision, considering how badly everybody treated Unicode. But in the year 2014 it is a feature that has survived for no good reason (not completely dissimilar to using Ctrl+Z as an end-of-file character).

If you do your work in a mixed OS environment (god forbid you have some Unix/Linux lying around), dealing with those three bytes gets quite annoying. While part of the blame lies with encoding-challenged Linux tools, the truth is that nobody really needs the BOM if we assume all input files are already UTF-8 (which is a fair assumption, I believe).

Source control makes this problem visible in the extreme. On multiple occasions I have accidentally checked in a file with no change other than the pesky BOM. As I am a fan of overkill reactions, I decided to stop this issue once and for all. It was time to build an extension for Mercurial, my source control of choice.

To use this extension, just copy killbom.py to a location of your choice and add it to the extensions section of either the project configuration or the global configuration file:

[extensions]
killbom = C:/path/to/killbom.py

From that moment on, everything you commit will have its BOM stripped and will get converted to UTF-8, be it the annoying UTF-8 with BOM, the merely unusual UTF-16BE, or any other Unicode encoding. Everything will get a UTF-8 face-lift.
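The idea behind that face-lift can be sketched in a few lines of Python. This is not the code from killbom.py, just a minimal illustration that assumes the encoding is identified purely by its BOM. The names BOMS and to_plain_utf8 are hypothetical.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the longer marks must be tested first.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def to_plain_utf8(data: bytes) -> bytes:
    """Re-encode BOM-marked Unicode data as UTF-8 without a BOM.

    Data that does not start with a known BOM is returned unchanged.
    Hypothetical helper for illustration only.
    """
    for bom, encoding in BOMS:
        if data.startswith(bom):
            # Drop the mark, decode the rest, and write plain UTF-8.
            text = data[len(bom):].decode(encoding)
            return text.encode("utf-8")
    return data
```

One caveat of this simplification: UTF-16 LE content whose first character after the BOM is U+0000 would be mistaken for UTF-32 LE. That is rare in text files, but a real tool may want a stricter check.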

Of course, you can check how new files are doing with hg checkbom, modify existing files with hg killbom --all, or just check the state of your whole repository with hg checkbom --all.

Those who just want to know whether files are proper UTF-8 but cringe at the thought of an extension modifying them are covered too. You can force the extension to only report offending files, or to check only for certain offences. For example, to make it verify-only and sensitive only to big-endian encoding, you would add the following to the configuration:

[killbom]
action = verify
extensions = utf-8 utf-16be utf-32be

Full source for the extension is available at GitHub.

PS: There is also a Git version of this hook.

2 thoughts to “BOM Away”

  1. Hi,

    Your solution is something I have been waiting around to use for quite some time. I have personally written a couple of incarnations of scripts to do the job but never got around to integrate them with Mercurial.

    Would you grant me read access to the extension code or a copy offline?

    Regards,
    Johan
