Wednesday, August 27, 2014

Unicode, UTF-8 and character encodings: What every developer should know

UPDATED 8/27/2014 to expand, clarify and expound certain concepts better

If you haven’t already read the excellent article by Joel Spolsky entitled “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”, then you definitely should. However, I think it makes a lot more sense if you understand a few basics first. If you’re like me, you won’t quite “get it” after reading through that article the first time or two - I certainly didn’t.

The Basics

For many years I was always confused by the whole mess. I didn’t understand exactly what Unicode is or what character encodings are all about or what the big deal was with UTF-8. Turns out it’s all pretty simple (sort of). I won’t dive into the history since Joel covers that pretty well in his article and I’m not nearly as qualified. Instead, I’ll just try to explain the basics as simply as I can.

First, it’s important to understand what a character set is. A character set is just a set of symbols (many of which you may recognize as letters and punctuation) and a code (a number) assigned to each of those symbols. You may be familiar with the ISO 8859-1 character set since it’s a pretty common ASCII-based character set. Like ASCII, it maps the letters and other symbols you know to the codes 0-127, and it adds more symbols in the range 128-255.

Unicode itself is just a character set - one that’s backward compatible with the 128 ASCII character codes (0-127), meaning it maps those same 128 symbols to the same numbers as every other ASCII-based character set. Each mapping of a symbol to a number is known as a code point. Unicode is organized into code pages like any character set, and it’s maintained by the Unicode Technical Committee, which meets quarterly since new characters are always being added to the Unicode character set. The code pages say nothing about how these code points will actually be represented in binary. There are a lot of ways of representing them.

A method of translating numeric code points into binary is known as a character encoding. You’ve probably seen things like U+00A3 or U+00E8 - these are Unicode code points. A character encoding is a systematic way of representing those code points in memory. The most widely used character encoding is called UTF-8. As Joel says in his article, it makes no sense to have a string without knowing its character encoding - you can’t really read it. How are you supposed to translate a set of bytes back to the code points they represent if you don’t know the method used to translate them to binary in the first place? You can try to guess the encoding in many cases, but that just makes life hard and it’s far from fool-proof.
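
To make the distinction concrete, here is the same code point (U+00A3, the pound sign) represented by two different encodings - a quick PHP sketch that assumes the snippet itself is saved as UTF-8:

// One code point, two different ways of turning it into bytes
echo bin2hex('£') . "\n";                                           // c2a3 (UTF-8)
echo bin2hex(mb_convert_encoding('£', 'UTF-16BE', 'UTF-8')) . "\n"; // 00a3 (UTF-16, big-endian)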

Understanding all of this comes into play as a developer in two primary ways:

  1. You need to make sure that when you are sending strings around you are also communicating the character encoding. As a web developer this can be done in the HTTP headers when a document is served, but also with a <meta> tag in the <head> of your HTML document (see the snippet just after this list). This allows non-ASCII characters to be displayed correctly by the browser.

  2. If you are working in a language that isn’t aware of multi-byte characters (I’m looking at you, PHP!), then you can run into a whole slew of problems trying to work with strings. Everything from referencing a character in a string (in PHP, $str[3] is the 4th byte in the string, not necessarily the 4th character), to simply determining string length (echo strlen("ÿ"); will output 2, since it takes 2 bytes to represent ÿ, a.k.a. U+00FF, in UTF-8).
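
To illustrate the first point, here’s roughly what declaring the encoding looks like for a PHP-served page (a sketch only - the exact header and tag depend on your setup, but the idea is the same):

<?php
// Tell the browser how the bytes in this response are encoded
header('Content-Type: text/html; charset=utf-8');
?>
<!DOCTYPE html>
<html>
<head>
    <!-- Declare it in the document itself as well -->
    <meta charset="utf-8">
</head>
<body>
    <p>£100 - now the browser knows how to decode these bytes.</p>
</body>
</html>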

The multi-byte issue in particular isn’t something you can afford to just ignore. Whenever you’re dealing with user input in PHP you cannot assume that one character == one byte. Every time you measure or manipulate a string you need to use the mbstring functions or something like the Portable UTF-8 library for PHP; otherwise you run the risk of breaking characters in your strings, which leads to all sorts of weird errors (like when trying to JSON encode, for example) and funny characters.
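
To illustrate the second point, here’s a quick sketch of how PHP’s byte-oriented string functions go wrong (assuming the source file itself is saved as UTF-8):

$str = "ÿes"; // "ÿ" is U+00FF, which takes 2 bytes in UTF-8

echo strlen($str);             // 4 - counts bytes, not characters
echo mb_strlen($str, 'UTF-8'); // 3 - counts actual characters

echo $str[0];                        // just the first byte (0xC3) - a broken character
echo mb_substr($str, 0, 1, 'UTF-8'); // "ÿ" - the first real character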

UTF-8 Encoding

I think it’s important to understand how UTF-8 encoding actually works because it will give you a better idea of what character encoding actually means. UTF-8 is the most widely used, and I’ll even say the best character encoding you could choose to use for your website or software generally (unless you’re dealing with a legacy system that is already built to communicate with a different character encoding). I say this because:

  1. UTF-8 can represent every Unicode code point - every number in the full set of code pages that currently make up the character set.
  2. It is compatible with ASCII (e.g. the ASCII representation and the UTF-8 representation for all code points from 0-127 are identical - ASCII is valid UTF-8).
  3. It’s pretty conservative on space compared to some of its sibling character encodings. There is no fixed character length, and no minimum length greater than one byte. A UTF-8 encoded character can be anywhere from 1-4 bytes, depending on the character. In normal English text most characters will take only 1 byte, with the occasional 2-byte character.
  4. UTF-8 does not require a byte order mark, and generally shouldn’t use one (see http://en.wikipedia.org/wiki/Byte_order_mark and http://en.wikipedia.org/wiki/UTF-8#Byte_order_mark)

To truly understand how UTF-8 encoding works, I have found no better explanation than this table found on Wikipedia and the explanation following it. It does an excellent job. I would summarize the encoding methodology as follows:

  1. For single byte (mostly ASCII) characters, the high-order bit is always 0 for backward compatibility with ASCII.
  2. For multi-byte characters the first byte begins with two to four 1s, indicating the number of bytes the character will use, followed by a 0.
  3. The rest of the bits in the first and following (continuation) bytes are filled with the bits representing the code point, except that each of the continuation bytes begin with 10.

So from the example above (U+00A3) we would first take the decimal value of hex A3, which is 163, and figure out what that is in binary, which happens to be 10100011. You’ll notice that it takes 8 binary digits to represent. Since the first byte of a single-byte UTF-8 character must always begin with 0 (for ASCII compatibility), only 7 bits are available in one byte, so we are going to need two bytes to represent it. Since it’s now a multi-byte character, the first byte will begin with two 1s (to indicate how many bytes the character will use) and a 0, followed by as many of the bits of the code point as we can fit.

However, 5 of the 16 bits in those two bytes will be used by the encoding format itself (the leading 110 in the first byte and the 10 in the second), so that leaves 11 bits to fill with the code point, which means we have to pad it with 0s (00010100011). So the first byte looks like:

11000010

The next byte is a continuation byte which will always begin with 10 followed by as many of the bits of the code point as will fit. So the second byte is:

10100011

If you’re still confused, this example might help. Armed with this knowledge you can now easily take any Unicode code point and write out the UTF-8 binary encoding of the character. Easy as that!
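
If you’d rather see those rules in code, here’s a rough PHP sketch of the 1-4 byte logic described above (purely illustrative - in real code you’d let PHP’s built-in functions do this for you):

function codePointToUtf8($cp) {
    if ($cp < 0x80) {            // 1 byte:  0xxxxxxx
        return chr($cp);
    } elseif ($cp < 0x800) {     // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    } elseif ($cp < 0x10000) {   // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codePointToUtf8(0xA3)); // c2a3 - the two bytes we just worked out by hand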

Common Problems

This is based solely on my own experience, but I just want to cover a couple common problems I’ve run into and how to fix them. In PHP my general recommendation is to use the excellent forceutf8 library written by Sebastián Grignoli, and just pass all your arrays or strings through its toUTF8() function.

Double encoded strings

You can get some pretty weird characters showing up, even when you set your page character encoding correctly and think you’ve encoded all your strings correctly, if you are double encoding your strings. I’ve seen this a few times particularly when developing JSON APIs. If some strings are already UTF-8 encoded when you build your result set and then you end up (re)encoding the whole thing before sending, you could end up with some double encoded strings.
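
Here’s a tiny illustration of what double encoding actually does (the variable names are just for the example):

// "£" is already UTF-8 (bytes 0xC2 0xA3), but utf8_encode() assumes its input
// is ISO-8859-1, so it happily encodes those two bytes all over again.
$pound   = "\xC2\xA3";          // "£" correctly encoded as UTF-8
$doubled = utf8_encode($pound); // now 0xC3 0x82 0xC2 0xA3, which displays as "Â£"

echo bin2hex($doubled); // c382c2a3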

To fix this, the solution is simply to check each string within your potentially nested result data individually and see if it’s already encoded properly, and only encode the strings that aren’t. My first recommendation is to just pass your string or array to the toUTF8() function in that forceutf8 library I mentioned above. Otherwise, I wrote this up as a simple solution before I ever found that library:

// Recursively walk an array (or a single string) and convert any string
// that isn't already UTF-8 (or plain ASCII) into UTF-8.
function utf8EncodeArray($array) {
    if(is_string($array)) {
        $array = array($array);
    }

    foreach($array as $key => $value) {
        // Objects are cast to arrays so their properties get walked too
        if(is_object($value)) {
            $value = (array) $value;
        }

        if(is_array($value)) {
            $array[$key] = utf8EncodeArray($value);
        } elseif(is_string($value)
            && ($encoding = mb_detect_encoding($value)) !== false
            && $encoding != 'UTF-8' && $encoding != 'ASCII') {
            $array[$key] = mb_convert_encoding($value, 'UTF-8', $encoding);
        }
    }

    return $array;
}

Of course, if you’re already dealing with a string that’s been double (or more) encoded then it needs to be fixed. In that case, the forceutf8 library I mentioned above has a handy little fixUTF8() function you can call that will try and repair the string.

Broken encoding

Another problem I’ve run into is where UTF-8 encoded characters get truncated because of PHP’s naive handling of strings with multi-byte characters. This first became a major problem for me when trying to json_encode() a string with a broken UTF-8 character - it just completely fails. In this case you have to find the character and then either remove it or try to fix it. In reality you probably won’t be able to figure out what the character was supposed to be, so you can either replace it with a placeholder character (question mark?) or substitute something valid for the truncated byte(s). Once again though, that magical forceutf8 library will fix it for you if you don’t care too much and just want to make sure your broken strings aren’t breaking things.
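
If you just need json_encode() to stop choking and you aren’t using forceutf8, one rough alternative (a sketch, not the library’s approach; $broken is assumed to hold the offending string) is to let mbstring swap the invalid byte sequences for a placeholder first:

// Re-encoding UTF-8 as UTF-8 makes mbstring replace invalid byte sequences
// with the substitute character instead of leaving them in to break things.
mb_substitute_character(0xFFFD); // U+FFFD, the standard replacement character
$clean = mb_convert_encoding($broken, 'UTF-8', 'UTF-8');

if (json_encode($clean) === false) {
    // Still failing for some other reason - time for forceutf8's fixUTF8()
    // or a closer look at what json_last_error() reports.
}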

Conclusion

I hope this was able to fill in some of the gaps for those who may have been a little lost. In the end, it’s not that hard of a concept to understand. It makes a big difference when you actually understand character encodings and you really shouldn’t write code without a basic understanding. Now go forth and write better code!

Tuesday, October 15, 2013

Traffic flow and merging during congestion

A little while ago there was this article about fluid dynamics applied to traffic. Then a couple weeks ago I saw this article pop up on Hacker News talking about the logical errors with the first article. I feel like we still don’t quite have the complete picture here. Both are excellent articles and both speak the truth to some degree or another. I just have a couple points to make. Please note that I’m assuming you have at least skimmed each of those articles.

If you just want to know what I’m getting at here then just jump to the conclusion.

Traffic Flow

The first article proposes that to prevent traffic jams during times of congestion we need to decrease our rate of travel far enough in advance that we can maintain a constant speed, so that by the time we finally arrive at the tail end of a “traffic wave” it is already beginning to move again. In so doing we can effectively “eat” a traffic wave, avoid stop-and-go traffic and fix traffic jams. This is true. Sort of.

After reading that article I tried it during my regular commute to and from work and I managed to “eat” several traffic waves and maintain a constant speed without having to stop like the cars in front of me.

I can’t speak to what happened to the cars behind me, but the second article makes the argument that by following that proposal you are actually making things worse for the cars behind you. It asserts that, given an average two-second following distance, “there is a limit to the number of cars that can pass by a given point on the highway in a given amount of time, and that limit is one car every 2 seconds, per lane”. Therefore, you can do the math and figure out exactly how many cars can actually fit past the given point. Nothing you can do will change that and therefore you cannot fix a traffic jam. This is also true. Sort of.

In reality, it is very dependent on traffic density. If you are in a bumper-to-bumper traffic jam where all the cars are close together and trying to squish past a certain point, then the second article is correct.

If, however, traffic is less dense, then the first proposal can actually make a difference. Why? Well, you have to account for the time it takes for a car to accelerate. If you just continue riding the bumper of the car in front of you and come to a stop with every traffic wave, then you are maintaining the energy of the wave because you now have to stop and then accelerate which takes extra time. If you back off a bit and maintain a constant speed which allows a great enough distance from the car in front of you then you “eat” the traffic wave and absorb the energy of it. You no longer have to take the extra time to accelerate, and neither do the cars behind you. You also save on gas. The trick, of course, is to do this without causing a traffic wave behind you. Don’t try this in super dense traffic! You will only make things worse.

Traffic Merging

With merging there is yet more to consider. In the comments on the second article there is a comment from a man talking about how, when there is a sign indicating that traffic is merging ahead, he will get over quickly to be courteous or “pro-social”. This is actually a bad idea for a couple reasons.

So, let’s assume that there is construction on a two-lane road two miles ahead and one lane is merging. If most people start merging right away then there is a lot of unused space in the now-empty lane that everyone is merging from. As the second article points out, this only serves to lengthen the distance of the traffic jam. It effectively doubles its length and increases the chance that the jam will “spill out” and affect traffic elsewhere needlessly.

The other problem of course, is that those anti-social, selfish “jerks” (like me) that zoom past in the now-empty lane and merge as late as possible saving themselves from the delay, end up causing even greater delays for those who were courteous. There’s now yet another car that the “pro-social” and “courteous” people have to wait for before they finally get their turn to get through the bottleneck.

The problem is that no matter where you merge, you still take up space and make the jam worse. Whether you merge as late as possible or you merge early, you are lengthening the commute of all the people behind you.

The only thing you accomplish by being “pro-social” is doing yourself and those in the lane you merge into a disservice. YOU are a delay. By merging early you create an uneven distribution of the delay: you simply move more of the delay into the lane you are merging into and take it out of the lane you are merging from. This gives those who stay in the merging lane an advantage, and that advantage comes at the cost of the very people you are trying to be pro-social or courteous to.

If everyone waited to merge until they had to, then everyone would have a similar delay because the delay would be evenly distributed. The way to be courteous and considerate of everyone is to wait until the last minute to merge if you are in the merging lane, or to let one car in front of you at the merging point if you are in the lane being merged into. This creates the most even distribution.

Conclusion

When you have to merge in dense traffic, please, please do everyone a favor and merge as late as possible. It’s the most courteous and pro-social thing you can do.

Monday, July 8, 2013

Stop telling me to go to college!

UPDATE 2/21/2014: Ryan Higa made an excellent parody that more or less makes my point.

UPDATE 2/23/2013: This article was published today that covers Google’s perspective. It summarizes my point very well:

Your degree is not a proxy for your ability to do any job. The world only cares about — and pays off on — what you can do with what you know (and it doesn’t care how you learned it).

The original article:
I don’t have a degree. I started into some college while working full time as a software developer. I was killing myself trying to go to school and work. I didn’t get too far before I started asking myself questions like “What am I doing? Why am I killing myself to get a degree so I can hopefully get a job like what I already have?” It’s not like I would get paid better or have a better position. Despite this, time and time again I’ve had people tell me about how important it is to go to college and get a degree. That said, I think I’ve had more people agree with me overall.

I recently had the college debate again with someone who just graduated in Computer Science. Initially the argument was that you can’t get as good of a job or get paid as well without a degree. This is the typical rhetoric I hear from people who have spent too much money and time being fed this propaganda by the universities and the public school system. It doesn’t add up - at least not in the IT industry and it seems more and more questionable in other fields as well with tuition costs rising and the economy sinking. Programmers are supposed to have a good grasp on logic and reasoning. I’ve debated the necessity of going to college countless times with people but it really bothers me that a Computer Science graduate would so readily ignore reason.

I think the problem we have is that so many people equate college with education. Now, don’t get me wrong. I’m a HUGE fan of education. When I have free time I often spend it learning new things from reading Hacker News. I’ve been guilty more than a few times of getting stuck hour after hour clicking through links on Wikipedia - it’s a nerd trap! Given the choice between reading a novel or a text book, I’d honestly choose the text book. But college isn’t the only way to get an education and not everyone who goes to college actually gets a great education.

Here’s a little background on myself. I started learning HTML and CSS just before I turned 10. I made lots of crappy websites that (thankfully) never saw the light of day, as well as a couple that did. I’ve used one Linux distro or another for my main desktop OS since I was 12. When I was 14 I spent 3 weeks doing a stage 1 install of Gentoo and it was an incredible learning experience. I ran Gentoo for many years on many computers after that. I used to hang out on IRC on #gentoo. I remember once, when I was still 14, helping someone with some JavaScript he was doing for work. I wasn’t even old enough to have a job yet. By the time I was 16 I had gotten my CompTIA Linux+ certification, I had jumped into PHP and C++, and I was going to an applied technology center where I covered C# and Java, design patterns and algorithms, etc… By 18 I had spent a lot of time in Python and I had won several programming competitions.

I’m not saying this to brag or be arrogant or whatever, but just to illustrate my point. Despite whatever time I spent in school, I spent far more time at home developing my skills. I was the stereotypical socially inept, acne-covered, 4-eyed computer nerd living in my parents’ basement.

The guy I was debating with was telling me about how I could only be qualified enough to make good money with a CompSci degree. As we compared notes, so to speak, I discovered that I know a lot more than him, I have a lot more experience than him (personal and professional), and I make a lot more money than him.

I’m glad companies out there recognize the value of knowledge and experience over degrees. When my company hired me they told me that they’ve had a lot of success with people that don’t have degrees - not that they haven’t also had success with people that have degrees, but it’s not required. I recently interviewed with a company that told me that my interview was the best they’ve had in 3 years (which, actually worried me a bit about the type of people I’d have to work with). They offered me six figures, even without a degree (crazy huh? *sarcasm*). It wasn’t quite a good enough offer to leave the awesome company I work for now, but it illustrates the point I’m making.

I haven’t been out looking for another job, but I get contacted quite frequently. In the past year or so I’ve talked to a lot of companies and only one seemed to care much about a degree, despite most of them listing a degree as a requirement in their job listings. Quite frankly though, I probably wouldn’t accept a job with a company that insisted on a degree, even if I had one. That tells me that they’re more concerned about how you look on paper than how you actually perform. They would turn away good talent over a meaningless qualification. I fear that would mean working with less-intelligent people.

Anyway, let’s get back to the story. This guy ultimately resorted to something like, “I just think college is important. It’s just the way I was raised.” No logic, just insistence. I feel like this is a major problem with the mentality of society today. This is how all the universities out there want you to think and this is how they want you to raise your children. They’ve done a pretty good job of convincing most of American society that you get paid more with a degree and that without a degree you can’t be successful. It feeds the beast and the beast wants to be fed. Anyone who dares to question the notion is a threat to their business.

I recognize there are fields where a degree or license is required. I also recognize that there are people who lack ambition or need the structure or whatever else of a university. More power to ya! I don’t care if anybody wants to go to college, but please, PLEASE stop telling me that it’s the only way to be successful, or that people that go are somehow superior, or that I just need to go. Stop telling me to give up a great job and go rack up lots of debt to effectively learn very little more than I already know or can figure out myself, all just to get a job like the one I’ve already got.

If you are passionate and ambitious, then try Google and Wikipedia and Stack Overflow - the free online universities! There really is very little you can’t find online already. And if you can’t, then ask a question somewhere. The internet is filled with people who know things about stuff!

The Point

If you want to go to college that’s fine; but make sure you know why you’re doing it. If the only reason you have is “because that’s how I was raised/taught” then maybe you should rethink your decision.

Friday, June 7, 2013

Why you should reinvent the wheel

This applies to pretty much anything but I’m going to apply it specifically to programming since that’s my field. Have you ever heard someone ask, “Why reinvent the wheel?” as they begin to instruct a new developer on why they shouldn’t undertake a new project that’s been done a thousand times before by other people? That’s always bothered me. When I was learning to program I used to “reinvent the wheel” all the time! Honestly, I hated just using someone else’s code without having taken a crack at it myself.

I always thought I could do it better somehow. I thought I could make it more efficient, add some new feature or somehow simplify it. If nothing else, at least I would know how it worked and could fix bugs myself. I was stubborn for a while after jQuery came about and wouldn’t use it because I always wanted to write my own library to do something like that. I hated using code that wasn’t mine. None of my code from those years ever saw the light of day. I’ve still got some of it sitting around though. It’s interesting to take a look back at it sometimes and see what I was thinking or see how I did certain things. Frankly, I would probably be embarrassed to show the code to anyone now. The libraries that are out there have been done so much better.

If you’re not particularly passionate about programming it’s tempting to just take someone’s black box library and just use their API and assume it runs on magic more powerful than your own! It’s not like you NEED to understand how it works to use it, right? It’s less work you have to do and it gets the job done.

Problem is, if that’s your mentality you won’t go far. I always insisted on seeing inside the black box and figuring out how it works, then re-creating it. Of course, when I would re-create it I would always try to improve it. Heck, back when I was 10 years old trying to learn HTML I was constantly viewing the source on websites across the internet because I had to figure out how they did what they did. Then I would go and try to do it myself. “Reinventing the wheel” was absolutely critical for me in those early years because that’s how I learned to program.

And, that’s how innovation is done every day! You can’t improve something if you don’t even understand how it works. Once you do, you create it again, but better!

For me, it’s paid off. I’m 24 with an awesome job working on a pretty advanced web application with cutting edge technologies and pioneering new solutions in the web application arena. I’m lucky to work for a company that lets me do this. A great example is what’s happening at work right now. We originally tried using SlickGrid in our application and had to pore over the source code and write lots of little hacks to make it do what we needed. Unfortunately, it just can’t handle what we’re throwing at it. Now, we’re reinventing it - but waaay better this time!

We’re actually about to open source the realtime JavaScript framework we’ve been building on top of Backbone. It includes many examples of innovation and lessons learned from other frameworks and libraries we’ve studied or worked with in the process of developing our product. We could have just gone with what was out there at the time we started (which wasn’t much, to be honest [1]), but that would have defined and limited what we could achieve. Instead, we’ve now created a system unlike any out there to use in the newsroom at KSL and Deseret News. I’m excited to launch the framework we built it on publicly. Sorry about the little tangent there :)

We need people that challenge the assumptions made and the patterns used in the things we take for granted in our every day lives. That goes for any industry. How will we ever have a better wheel if no one tries to reinvent it, or worse, if no one even understands how it works? Who says a wheel is even the best tool for the job?

So why reinvent the wheel?

  1. To learn
  2. To make a better wheel
  3. Because the wheel you have doesn’t solve your needs or wants

The End.


  [1] At the time we started our framework, Backbone was all the rage. AngularJS and Ember.js were just starting out and had no community. SproutCore was on its way out. ExtJS had been around for a while, but it was slow, had bugs and didn’t allow us the flexibility we wanted - it’s a monolithic framework. And while our system bears some interesting conceptual similarities to Meteor in a few parts, Meteor didn’t even exist at the time.

Tuesday, April 9, 2013

Exceptions and resource leaks in JavaScript and Node.js

A while ago I read a comment in the Node.js documentation on uncaughtException mentioning that it may be removed (no longer true), that it is a crude mechanism for exception handling and shouldn’t be used and that domains should be used instead. So I made a mental note and figured I’d look into domains when I got the chance. This past week I started looking into it a bit and noticed some more concerning comments in the documentation for domains:

By the very nature of how throw works in JavaScript, there is almost never any way to safely “pick up where you left off”, without leaking references, or creating some other sort of undefined brittle state.

In the comment in the code block a little below that, where it’s demonstrating a bad way to use domains as a global catch-all like uncaughtException it states:

The error won’t crash the process, but what it does is worse! Though we’ve prevented abrupt process restarting, we are leaking resources like crazy if this ever happens.

They recommend that when you catch exceptions in a global way like this you always shut down and restart the process. My first reaction was something like, “Whoa! Hold on there, Skippy!” I’ve written a server that handles a lot of persistent websocket connections, and I can’t just shut down because I didn’t handle some exception thrown while dealing with a misbehaving client. And what the heck is it talking about with this whole “leaking resources like crazy” thing anyway?

So naturally I did some Googling to see if I could figure out what it’s talking about. Unfortunately I couldn’t find anything useful. I posted a question on Stack Overflow which was helpful though. While the answers there do seem kind of obvious in retrospect they weren’t obvious to me at the time because the resources (open files and sockets) mentioned in the answers don’t apply to the server I’ve written (another reason why it’s bad to blindly tell everyone they’re leaking resources).

My purpose in writing this post is mostly to help answer the question I had about what resources might be leaking and how to write code taking this into account. The Node.js documentation is misleading and non-descriptive. The only time it’s definitely bad to continue after an exception is if you literally have no idea where it was thrown and what resources might have been in use (e.g. global exception handlers).

When an exception is thrown while you have an open stream (file, socket, etc…), the code that follows, which normally would finish with the resource and close it, never runs. That means the resource is left open and you are now leaking it. Any time an exception escapes a block of code where resources are open, you risk introducing a leak.

The only way to make sure you aren’t leaking resources is on a case-by-case basis… sort of.

The problem with exceptions can be handled pretty well with domains. However, you run into the exact same problem if you simply forget to close a resource or return too early, while it’s still open.

Some Solutions

Ideally you shouldn’t have unhandled exceptions. We don’t live in an ideal world, but it’s still good to try to handle as many as you can. If you catch the exception close enough to where it occurred that you know what resources are in use and can clean them up properly then it’s alright to continue. If you write code that throws exceptions, keep in mind what resources are open at the time you throw the exception and try to clean them up before throwing.

You can use domains in Node.js to try to have some context and specificity to your error handling. For example, if you were writing an HTTP server, you could use domains to handle the case where an exception is thrown while processing the request (even if it’s doing asynchronous stuff) and then at least shut down the open socket from the client and clean up the request a bit. Otherwise, since an exception was thrown, no response would be sent and the request would be left open until it timed out.

If you are writing a server or process that can fork or spawn workers (such as a request handler for a web server or something else) then the problem can be largely mitigated by simply shutting down the worker (hopefully gracefully) and spawning new workers as needed.

An Ideal Solution

Though it doesn’t exist in JavaScript, an ideal solution would be something like Python’s context managers. They are a powerful tool for managing resources. As mentioned here:

The advantage of using a with statement is that it is guaranteed to close the file no matter how the nested block exits. If an exception occurs before the end of the block, it will close the file before the exception is caught by an outer exception handler. If the nested block were to contain a return statement, or a continue or break statement, the with statement would automatically close the file in those cases, too.

Using the with statement in Python (for those unfamiliar) looks like this:

with open('/path/to/file', 'w') as f:
    f.write('some text')

That’s it! There is no need to bother closing the file. Anything that needs to be done with the file is done within the with block. When it exits the scope the context manager closes the file. And the best part is that you can write your own context manager, just like open() in this example! You could write one for handling sockets, including sockets that are already open.

Though there is nothing like this in JavaScript, I think it’d definitely be a nice addition. Even if it were never to become a part of JavaScript itself, it would be a particularly nice addition to Node.js. Unlike JavaScript in the web browser, JavaScript on the server handles a lot more resources like this (open files and sockets), so it is much more important to be able to handle resources in an elegant and clean way without having to shut down every time an unhandled exception occurs.

That said, I don’t know exactly how they would implement something like context managers with the asynchronicity of JavaScript. I suppose when there are no more references to a resource it’s essentially “out of scope” and the exit on a context manager could be called to close the resource. In that case it would have to run as part of the garbage collection process and you really cannot know when it will run which could produce other unpredictable and/or undesirable behavior.

Conclusion

Keep in mind that this post only really addresses concerns relating to leaking resources, not the brittle state of the application after an unhandled exception that is caught for the first time globally, and thus without context. At that point it is difficult to know the state of the application: whatever function threw the exception (which you don’t know, since you have no context) could have modified shared state and never finished its work, leaving things in an unknown condition. Global exception handlers really should only be used for logging uncaught exceptions so you can fix them, and for shutting down gracefully. That said, in smaller systems it’s not impossible to verify from a global exception handler that the system is still in a sane state.

For an interesting and brief explanation of things to consider with exception handling, check out this explanation from an answer to my SO question. Trust me, you want to click that link!

Friday, February 22, 2013

Make your media keys work with Spotify in Linux

This will be a quicky. I just wanted to document how I made my media keys work for Spotify in Linux. I’m going to give instructions for Ubuntu, but it shouldn’t be hard to adapt to any Linux distro.

First install xbindkeys:

sudo apt-get install xbindkeys

Next put this in a file named .xbindkeysrc in your home directory (e.g. vim ~/.xbindkeysrc)

"dbus-send --print-reply --dest=org.mpris.MediaPlayer2.spotify /org/mpris/MediaPlayer2 org.mpris.MediaPlayer2.Player.PlayPause" XF86AudioPlay
"dbus-send --print-reply --dest=org.mpris.MediaPlayer2.spotify /org/mpris/MediaPlayer2 org.mpris.MediaPlayer2.Player.Stop" XF86AudioStop
"dbus-send --print-reply --dest=org.mpris.MediaPlayer2.spotify /org/mpris/MediaPlayer2 org.mpris.MediaPlayer2.Player.Previous" XF86AudioPrev
"dbus-send --print-reply --dest=org.mpris.MediaPlayer2.spotify /org/mpris/MediaPlayer2 org.mpris.MediaPlayer2.Player.Next" XF86AudioNext

To test open a command line and run

xbindkeys &

Finally make sure xbindkeys is started when you log in by opening your “Startup Applications” settings and adding the xbindkeys command. You can also do this from the command line by creating the file ~/.config/autostart/xbindkeys.desktop with the following contents:

[Desktop Entry]
Type=Application
Exec=xbindkeys
Hidden=false
NoDisplay=false
X-GNOME-Autostart-enabled=true
Name[en_US]=xbindkeys
Name=xbindkeys
Comment[en_US]=
Comment=

If it’s not working for you it’s probably because xbindkeys isn’t picking up the keys. To test this run: xbindkeys -k

Then press the pause/play button. If xbindkeys sits there and doesn’t recognize it then this is your problem. To fix this simply go to your keyboard shortcut preferences (System Settings -> Keyboard -> Shortcuts -> Sound and Media) and change the accelerators for “Play (or play/pause)”, “Stop playback”, “Previous track” and “Next track” to something different. I changed them to Ctrl+[pause/play button], Ctrl+[stop button], Ctrl+[previous] and Ctrl+[next] respectively. After you’ve done this it will allow xbindkeys to capture the shortcut. Keep in mind that if xbindkeys is already running in the background then xbindkeys -k will still not capture the keys but they will work.

Tuesday, February 19, 2013

Fast Mersenne Twister in PHP

I was recently trying to write a class in PHP that I could use to encrypt and decrypt data for communication between our server and remote clients. As I got started I realized I was going to need to generate a lot of pseudo-random numbers. However, I needed to be able to reproduce the same numbers given the same seeds on a different system. The Mersenne Twister algorithm is a well-known, solid random number generator - and apparently faster than PHP’s rand() function. I realized PHP has an implementation of the Mersenne Twister algorithm called mt_rand(), so I thought that would make life easy. It turns out that as of PHP 5.2.1 the same seed no longer produces the same values it once did, as stated in the documentation for mt_srand():

The Mersenne Twister implementation in PHP now uses a new seeding algorithm by Richard Wagner. Identical seeds no longer produce the same sequence of values they did in previous versions. This behavior is not expected to change again, but it is considered unsafe to rely upon it nonetheless.

So I decided I would go ahead and implement the algorithm myself. At first, I was just looking to see if I could find anyone else’s PHP implementation of the algorithm, but I had no luck with Google. After that I went to Wikipedia and worked from the pseudocode, which you should check out if you really want to understand the algorithm. That much is pretty straightforward. However, my purpose in writing this article is two-fold:

  1. I want to make a PHP implementation of the Mersenne Twister publicly available for other seekers
  2. I want to discuss some of my modifications to enhance the speed

Here’s my implementation:

/**
 * Mersenne Twister Random Number Generator
 * Returns a random number. Depending on the application, you likely don't have to reseed every time as
 * you can simply select a different index from the last seed and get a different number - it is already
 * auto-incremented each time mt() is called unless the index is specified manually.
 *
 * Note: This has been tweaked for performance. It still generates the same numbers as the original
 *       algorithm, but it's slightly more involved in order to keep it fast without losing too much elegance.
 *       Though not a crucial difference under normal use, on 1 million iterations, re-seeding each time,
 *       it will save 5 minutes of time over the original algorithm - at least on my system.
 *
 * $seed  : Any number, used to seed the generator. Default is time().
 * $min   : The minimum number to return
 * $max   : The maximum number to return
 *
 * Returns the random number as an INT
 **/
function mtr($seed = null, $min = 0, $max = 1000)
{
  static $mt = array(); // 624 element array used to get random numbers
  static $ps = null; // Previous Seed
  static $idx = 0; // The index to use when selecting a number to randomize


  // Seed if none was given
  if($seed === null && !$ps) {
    $seed = time();
  }


  // Regenerate when reseeding or seeding initially
  if($seed !== null && $seed !== $ps) {
    $mt = array(&$seed, 624 => &$seed);
    $ps = $seed;


    for($i = 1; $i < 624; ++$i) {
      $mt[$i] = (0x6c078965 * ($mt[$i - 1] ^ ($mt[$i - 1] >> 30)) + $i) & 0xffffffff;
    }
  }


  // Every 624 numbers we regenerate
  if($idx == 0) {
    // This has been tweaked for maximum speed and elegance
    // Explanation of possibly confusing variables:
    //   $p = previous index
    //   $sp = split parts of array - the numbers at which to stop and continue on
    //   $n = number to iterate to - we loop up to 227 adding 397 after which we finish looping up to 624
    //        subtracting 227 to continue getting out 397 indices ahead reference
    //   $m = 397 or -227 to add to $i to keep our 397 index difference
    //   $i = the previous element in $sp, our starting index in this iteration
    for($j = 1, $sp = array(0, 227, 397); $j < count($sp); ++$j) {
      for($p = $j - 1, $i = $sp[$p], $m = ((624 - $sp[$j]) * ($p ? -1 : 1)), $n = ($sp[$j] + $sp[$p]); $i < $n; ++$i) {
        $y = ($mt[$i] & 0x80000000) + ($mt[$i + 1] & 0x7fffffff);
        $mt[$i] = $mt[$i + $m] ^ ($y >> 1) ^ (($y & 0x1) * 0x9908b0df);
      }
    }
  }


  // Select a number from the array and randomize it
  $y = $mt[$idx];
  $y ^= $y >> 11;
  $y ^= ($y << 7) & 0x9d2c5680;
  $y ^= ($y << 15) & 0xefc60000;
  $y ^= $y >> 18;


  // Increment the index so we will be able to get a different random number without changing the seed
  $idx++;
  if($idx == 624) $idx = 0;


  return $y % ($max - $min + 1) + $min;
}
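
For the curious, using it looks something like this (a quick sketch - the seed here is arbitrary):

// Seed once; the internal index auto-increments, so repeated calls without
// a new seed keep walking the same reproducible sequence.
$first  = mtr(123456, 0, 100);
$second = mtr(null, 0, 100); // next number from the same seeded state
$third  = mtr(null, 0, 100); // and the next

// Given the same seed (and the same integer size - see the note at the end
// of this post), another machine can regenerate these exact numbers.
echo "$first $second $third\n";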

I’m going to focus on two spots in the code above: the line that initializes the array ($mt = array(&$seed, 624 => &$seed);) and the inner loop that regenerates the numbers. First you’ll notice the array called $mt. Before filling this array, two values are assigned. In the original algorithm, only the first element is assigned (to the last 32 bits of the seed). But this creates a slowdown, because in the second for loop, where you see $mt[$i + 1], the original reads $mt[($i + 1) % 624]. Division is the slowest math operation on a computer, and since modulus returns the remainder of a division operation, it obviously still has to do the division. That becomes a big slowdown when doing a lot of iterations. The original line just looks painful to me because it does the modulus 624 times and it only has an effect on the very last iteration.

So, I set out to find a better solution. As I was lying in bed pondering, the thought struck me that all we really need to do is create a 625 element array, but set the last element to be an actual reference to the first (since the value of the first element changes on the first iteration). Then there’s no longer a need for a modulus: the last element of the array is the first element, and we still stop at 624, so there are no boundary problems.

That eliminates one modulus, but there are still a couple more in the pseudocode. The next line of code, before modification, looks like this:

$mt[$i] = $mt[($i + 397) % 624] ^ ($y >> 1) ^ $op[$y & 0x1];

There are several possible ways to eliminate the second, but after hours of playing around with it, I couldn’t seem to find anything that still looked clean and elegant. It seemed that it would require adding more lines of code, using if statements and too many extra variables. The idea that finally worked came to me, once again, while I was lying in bed pondering it. Somehow my best ideas always come when I’m trying to sleep.

To solve this, we first have to notice that the algorithm is always trying to access the index that is 397 ahead of the current one. So, if we want to eliminate the modulus, then as soon as we hit index 227 we have to somehow jump to element 0 and count up from there. The only reasonably elegant way of doing this without the modulus that I could find was to stick the exact indices at which we need to switch in an array, and loop through those. So there is a 3 element array with the values 0, 227, and 397; however, we only loop through the last 2. The first one is used simply so the inner for loop knows what index to begin counting up from.

Now, since I can’t really be bothered explaining much more today, I’ll simply point out that because of the * ($p ? -1 : 1) trick, as soon as it reaches 227, it goes through the outer for loop again, jumping to 397; however, this time it’s subtracting 227 instead of adding 397. Subtracting 227 preserves the 397 index difference because of the wrap-around: (620 + 397) % 624 = 1017 % 624 = 393, which is the same as 620 - 227 = 393. Modulus #2 eliminated.

Probably the most painful part of the algorithm, if implemented straight from the pseudocode, is the modulus in the if statement. It goes inside the for loop and looks something like this:

// If $y is odd
if($y % 2 == 1) {
  $y ^= 0x9908b0df;
}

Any good developer knows there’s a better way to tell if a number is odd. In binary, all odd numbers end with a 1, which means all you have to do to test whether the number is odd is:

if($y & 1) {
  $y ^= 0x9908b0df;
}

This is a virtually instantaneous bitwise operation, versus the countless CPU cycles a division operation might take. But, we can still do better than this in this particular case. We can eliminate the if statement (thus avoiding CPU branching). Basically all we want to do is XOR all odd numbers with 0x9908b0df. Instead of doing the if, let’s just always do this:

$mt[$i] = $mt[$i + $m] ^ ($y >> 1) ^ (($y & 0x1) * 0x9908b0df);

Because the value of $y & 1 will always be either 1 or 0, we can multiply that by our magic number here to determine whether it will modify the current value. The XOR will always run but it will only have an effect on odd numbers. This is still far better since an XOR takes only one cycle where the division and branching would have been much more intensive.

The last modulus is just replaced with an if statement which should be easily predicted with the CPU’s branch predictor (see this and look at this if you still don’t get it), allowing it to be much faster.

That’s it! We’ve eliminated all 4 modulus operators and the if statement. For the number of iterations (re-seeding each time) that my encryption algorithm requires, it makes a big difference in speed. Hopefully somebody finds this useful. If you find any problems with the code or want to suggest improvements, please let me know.

Please keep in mind that all this math produces numbers far too large for a 32-bit integer, and it will thus produce different values from the same seed on 32-bit versus 64-bit systems. To avoid problems (i.e. to ensure consistent results) this should either be written using a big-math library, such as GMP or BCMath, or all numbers need to be truncated (i.e. $num & 0xFFFFFFFF) after each multiplication or left bit-shift in the algorithm. Another consideration in PHP is that it does not support unsigned integers, which will also throw off the results if you’re not using a big-math library (PHP copies the sign bit when doing a right bit shift). I have written my own math library for PHP which can do all math and bitwise operations on normal PHP integers, treating them as unsigned.
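
As a rough illustration of the truncation approach on a 64-bit build of PHP (a sketch only - the mul32() helper and its name are mine, not part of the generator above):

// Multiply two 32-bit values and keep only the low 32 bits of the result,
// working in 16-bit halves so no intermediate product overflows PHP's
// 64-bit signed integers.
function mul32($a, $b) {
    $aLo = $a & 0xFFFF;
    $aHi = ($a >> 16) & 0xFFFF;
    $bLo = $b & 0xFFFF;
    $bHi = ($b >> 16) & 0xFFFF;

    // Only the terms that land in the low 32 bits matter
    $mid = (($aHi * $bLo) + ($aLo * $bHi)) & 0xFFFF;
    return (($aLo * $bLo) + ($mid << 16)) & 0xFFFFFFFF;
}

// For example, the seeding step from the generator, kept within 32 bits:
// $mt[$i] = (mul32(0x6c078965, $mt[$i - 1] ^ ($mt[$i - 1] >> 30)) + $i) & 0xffffffff;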