Building Disciplined Disciple Compiler (DDC) on Ubuntu

In Haskell on November 23, 2011 by Matt Giuca

A quick note for anybody who finds it: I’ve been trying to build Ben Lippmeier’s Disciplined Disciple Compiler (DDC), a dialect of Haskell, and had little success until I figured out a way of doing it.

I had a problem building it on both Ubuntu 11.04 and 11.10. I followed the instructions here: Building DDC. Not wanting to build Haskell or a significant number of libraries from source or install using Cabal, I used apt-get to install all that I could:

$ sudo apt-get install ghc6 alex libghc6-mtl-* libghc6-parsec3-* libghc6-regex-* libghc6-quickcheck2-* \
        libghc6-haskell-src-* x11proto-xext-dev libx11-dev libxv-dev

But as it says there, you still need to use Cabal to install text and buildbox:

$ sudo apt-get install cabal-install
$ cabal update
$ cabal install text -p

So far, so good.

$ cabal install buildbox -p
Resolving dependencies...
cabal: cannot configure buildbox-1.5.2.2. It requires base ==4.4.*
For the dependency on base ==4.4.* there are these packages: base-4.4.0.0 and
base-4.4.1.0. However none of them are available.
base-4.4.0.0 was excluded because of the top level dependency base -any
base-4.4.1.0 was excluded because of the top level dependency base -any

WTF? So it turns out what this error is trying to tell you is that buildbox-1.5.2.2 requires version 4.4 of the package base, which is the main GHC library. I don’t know why the versions of the base package don’t match the versions of GHC, but they don’t. Ubuntu 11.04 and 11.10 include GHC versions 6.12.3 and 7.0.3, respectively. I don’t know what version of base comes with 11.04, but 11.10 includes base version 4.3.1.0 (try “cabal info base“) — too old for buildbox. And Cabal refuses to install a newer version of base.
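If you want to confirm exactly which version of base your compiler ships with (rather than guessing from the Ubuntu release), either of these standard commands will tell you:

$ cabal info base
$ ghc-pkg list base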

The solution? Just install an older version of buildbox (you can see them with “cabal info buildbox“). For me, this is:

$ cabal install buildbox-1.5.2.1 -p

Tada.

Presumably, Ubuntu 12.04 will include a sufficiently new version of GHC that this won’t be a problem (Edit: at the time of writing, the latest version of GHC is 7.2.1). But of course, it could happen again in the future, because that’s what happens when package managers get into fights.

Lazy dynamic programming with arrays in Haskell

In Haskell on November 4, 2011 by Matt Giuca

Dynamic programming is a weirdly named way to speed up (complexity-wise) recursive algorithms. If you are writing a recursive algorithm that ends up calling itself with the same arguments over and over, dynamic programming can solve the problem by “memoizing” the results in a table. Basically, it just means build an array big enough to hold all the possible inputs, then fill in the array as you go. If you ever need the answer to a sub-problem you have already solved, just look it up in the array. The poster-boy for dynamic programming is the knapsack problem: imagine you are a thief who has broken into a jewellery store. You want to steal as much as you can to maximise the total value of the items you’ve stolen, but there is a maximum weight you can carry in your knapsack. Which items do you steal?

The problem with doing this in Haskell is that you can’t modify arrays (well, not easily anyway). Once you create an array, all of its values are fixed. This means you can’t just build an array full of dummy values and go about inserting the actual values later. But, as with many things in Haskell, laziness can save us.

Naïve solution

Let’s start with a simple solution to the unbounded knapsack problem (you can steal as many copies of each item as you like), taken from the Wikipedia article:

-- knapsack [(v0, w0), ..., (vn, wn)] W
knapsack items wmax = m wmax
 where
    m 0 = 0
    m w = maximum $ 0:[vi + m (w - wi) | (vi, wi) <- items, wi <= w]

You call this by passing a list of (v, w) pairs (the value and weight of each item), and the maximum weight the bag can hold, W. You can use the example given in the image at the top of the Wikipedia page:

knapsack [(4, 12), (2, 2), (2, 1), (1, 1), (10, 4)] 15

Which gives the correct answer, 36. It will recursively compute the m function, which tells us the maximum value we can fit into a bag that can carry weight w. However, it is insanely slow because it recomputes sub-problems many times each. To be more formal, it has a worst-case time of O(n^W), because it tests n possibilities at every level of a recursion that can go W levels deep. Even this simple example with a maximum weight of 15 takes about four seconds to compute on my laptop. I just bought a new computer yesterday and I’m waiting for it to arrive, which should cut the time down a bit, but in the meantime, we should optimise the algorithm 🙂

Dynamic programming with lazy arrays

It seems like it will be impossible to use dynamic programming in Haskell, because arrays cannot be mutated. But if you think about it, dynamic programming doesn’t actually require that we mutate any values — it merely requires that we have some values that are uncomputed which later become computed. That’s exactly what laziness is all about.

So here’s the plan: we build an array whose indices and elements are the inputs and outputs of the function m. Because of laziness, the array won’t actually contain the outputs of the function — it will just contain thunks which are willing to compute the output if we ask them to. When we recurse, we won’t call a function, we’ll instead pull a value out of the array. If that particular array element has never been requested before, this will cause it to be computed (which might trigger more recursion). If that particular array element has already been found, this will simply return the existing known value. Here we go:

knapsack items wmax = arr ! wmax
  where
    arr = array (0, wmax) [(w, m w) | w <- [0..wmax]]
    m 0 = 0
    m w = maximum $ 0:[vi + arr ! (w - wi) | (vi, wi) <- items, wi <= w]

We first build the array arr as an array of thunks. If any element of the array is requested for the first time, it will call the function m. Then, we try to pull out element wmax, and since it hasn’t been requested, this calls m wmax, which in turn starts pulling more values out of arr. You can see that this will only ever call m a maximum of wmax times, and furthermore, it won’t necessarily call m for all possible values of w. It now has a worst-case time of O(nW), since each recursion makes n tests, and it will recurse at most W times. Indeed, it is now very fast for the weight-15 example. If we change the 15 to, say, 100000, it takes just a few seconds.

Note that all we had to change (besides adding the definition of arr) was replacing all calls to m with a lookup on arr. We should be able to generalise this solution.

A general dynamic programming solution

We would like a solution that allows people to write completely naïve code, such as our first attempt, and automatically turn it into a dynamic programming solution. This is possible in Mercury (see Tabled evaluation) as a special language feature which allows you to say “the runtime should memoize all past outputs of this function”. But in a language without such a feature (such as Haskell), we can’t do it completely automatically, because the naïve version recurses directly on itself. We need to change the recursion slightly to allow our generic tabling code to insert itself in between the recursive calls. So let us go back to our basic solution, but modify m to accept a function argument called “self“, and call that when it recurses. (This is a bit like the Y combinator approach to recursion in lambda calculus.)

    m _ 0 = 0
    m self w = maximum $ 0:[vi + self (w - wi) | (vi, wi) <- items, wi <= w]

Note that m calls self on the recursive call instead of m. Now it is possible for whatever calls m in the first place to insert some additional code in self. But for now, let’s just make it work as simply as possible (without any tabling). We write a function called table_null:

table_null :: (Ix a) => (a, a) -> ((a -> b) -> a -> b) -> (a -> b)
table_null _ f = let g x = f g x in g

You can pass m to table_null and it will give you back a simple function that computes knapsack values. The first argument to table_null is supposed to represent the domain of the function, but we will ignore it for now. Note that if you pass m as the second argument to table_null, it returns our original m function — when m calls self, table_null‘s g will just call m again. So, this is equivalent to our original, naïve (very slow) knapsack solution:

knapsack items wmax = table_null (0, wmax) m wmax
  where
    m _ 0 = 0
    m self w = maximum $ 0:[vi + self (w - wi) | (vi, wi) <- items, wi <= w]

So what is the point of this? Now we have abstracted away what it means to do recursion. When we use table_null, it just does plain ordinary recursion. But we could make any other table function we like, which can change the way recursion is handled. For example, we could build a table_trace that shows us just how inefficient this approach is by printing out the input to every recursive call:

table_trace :: (Ix a, Show a) => (a, a) -> ((a -> b) -> a -> b) -> (a -> b)
table_trace _ f = let g x = trace (show x) (f g x) in g

If you have knapsack call table_trace, you will see it printing out the values, because table_trace‘s g prints out its input before calling the supplied function.
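For example, hooking knapsack up to table_trace is just a matter of swapping the table function (trace comes from Debug.Trace):

knapsack items wmax = table_trace (0, wmax) m wmax
  where
    m _ 0 = 0
    m self w = maximum $ 0:[vi + self (w - wi) | (vi, wi) <- items, wi <= w]

Run it on the example from earlier and you will see the same weights printed over and over — exactly the wasted work we are trying to eliminate.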

And now, a general solution to dynamic programming:

table_array :: (Ix a) => (a, a) -> ((a -> b) -> a -> b) -> (a -> b)
table_array dom f = g
  where
    arr = array dom [(x, f g x) | x <- range dom]
    g x = arr ! x

If you have knapsack call table_array, it will suddenly become efficient. That’s because the g in table_array doesn’t always call f. First, table_array builds an array of thunks as we did above. The thunks inside the array call f the first time they are required. The g function merely picks values out of the array. So anything you use table_array on will automatically memoize the inputs to the function. The only caveat is that you have to know in advance the domain of inputs the function is likely to accept, and it has to be contiguous.
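To tie it all together, here is the whole thing as one self-contained program: just the snippets from above, plus the Data.Array import they assume and a concrete type signature for knapsack.

import Data.Array (Ix, array, range, (!))

-- Memoize a function over a contiguous domain by backing it with a lazy array.
table_array :: (Ix a) => (a, a) -> ((a -> b) -> a -> b) -> (a -> b)
table_array dom f = g
  where
    arr = array dom [(x, f g x) | x <- range dom]
    g x = arr ! x

-- Unbounded knapsack, written naïvely but run through table_array.
knapsack :: [(Int, Int)] -> Int -> Int
knapsack items wmax = table_array (0, wmax) m wmax
  where
    m _ 0 = 0
    m self w = maximum $ 0:[vi + self (w - wi) | (vi, wi) <- items, wi <= w]

main :: IO ()
main = print (knapsack [(4, 12), (2, 2), (2, 1), (1, 1), (10, 4)] 15)  -- prints 36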

Why Python’s whitespace rule is right

In Language design, Python on October 18, 2011 by Matt Giuca

Python is famous among programming languages for its fairly unique syntax: rather than being delimited by curly braces or “begin/end” keywords, blocks are delimited by indentation. Indenting a line is like adding an opening curly brace, and de-denting is like a closing curly brace. When people criticise Python, it is usually the first complaint: “why would I want to use a language which requires me to indent code?” Indeed, while programmers are very used to indenting their code, they are very un-used to being forced to do so, and I can understand why they may take it as an insult that a language tells them how to write code. I don’t usually like to get into syntax arguments, because I find them very superficial — it is much more important to discuss the semantics of a language than its syntax. But this is such a common argument among Python detractors, I wanted to address it. Python is right, and it’s just about the only language that is.

I think the rub is that programmers like to think of languages as a tool, and tools should be as flexible as possible. I think in general it is a good principle for programming languages not to enforce conventions. Languages that do tend to annoy people who don’t subscribe to the same conventions. For example, the Go programming language enforces the “One True Brace Style” — every opening curly brace must appear on the same line as the function header or control statement. This irritates me because that’s not my preferred convention. But the indentation convention is so universal that it is considered bad programming practice to not indent in all cases. (There is disagreement over tabs vs spaces, the number of spaces, etc., but we all agree that indentation is good.) There is not a single situation in any country, in any programming language, or at any skill level, in which it is acceptable to not indent your code the way Python requires it. Therefore, it is technically redundant to have a language that is not whitespace-sensitive. Any language that is not whitespace-sensitive requires (by universal convention) that programmers communicate the scoping of the code in two distinct manners for every single line of code: braces (or begin/end) and indentation. You are required to make sure that these two things match up, and if you don’t, then you have a program that doesn’t work the way it looks like it works, and the compiler isn’t going to tell you.

There are two solutions to this problem. 1: Make the compiler tell you. Force the programmer to indent and put in curly braces, and have the compiler check the indentation and give either a warning or an error if they don’t match up. That solves the problem of accidentally getting it wrong, but then what is the point of requiring curly braces at all? The programmer would just be doing extra work to please the compiler. We may as well go with 2: take out the curly braces and just have the compiler determine the blocks based on indentation.

When you really analyse it, Python’s whitespace sensitivity is actually the only logical choice for a programming language, because you only communicate your intent one way, and that intent is read the same way by humans and computers. The only reason to use a whitespace-insensitive language is that that’s the way we’ve always done things, and that’s never a good reason. That is why my programming language, Mars, has the same indentation rule as Python.

* * *

An interesting aside: there is a related syntax rule in Python which doesn’t seem quite so logical: you are required to place a colon at the end of any line preceding an indent. I haven’t fully tested this, but I’m pretty sure there is no technical reason for that (the parser could still work unambiguously without that colon), and it doesn’t seem to add much to the readability either. I slavishly followed this rule in Mars too, because as a Python programmer it “feels right” to me. But perhaps it would be better to drop it.

WikiLeaks password leak FAQ

In Cryptography on September 3, 2011 by Matt Giuca

The news broke yesterday that the entire trove of diplomatic cables held by WikiLeaks is now public and unredacted, due to a bizarre cryptographic mixup between WikiLeaks and the Guardian. WikiLeaks broke the story, after keeping quiet about it for months (I think people outside of the organisation had started to put the pieces together), accusing the Guardian, a UK newspaper, of publishing a book that negligently divulged the password to the encrypted file. The Guardian fired back, claiming that WikiLeaks said the password was temporary and that it’s their fault for making the encrypted data available. I don’t want to go into the politics here — this is a technical blog. I have already seen a lot of confusion online, both in the media and in the more technically literate (such as on Slashdot). I want to put together this FAQ to answer some of the common criticisms. Here I will try to remain politically impartial, and answer these questions from a cryptography standpoint. I’m not a qualified cryptographer (I don’t know where to begin a cryptanalysis of AES), but I do have a detailed understanding of the technology involved here. I would appreciate any corrections you might have in the comments section.

I am not affiliated with WikiLeaks. I am piecing together this information from various sources, but as you shall see, the facts are really not in question here (at least, everything relevant to a cryptographic discussion of the situation) — what is in question is who is at fault, and cryptography can tell us that.

The bottom line is, as I wrote yesterday, that cryptographically, WikiLeaks is in the right and Guardian is in the wrong. From what I can gather, WikiLeaks followed reasonable security expectations, and Guardian broke them. I would encourage anyone who places the blame on WikiLeaks to read this FAQ in detail (or at least skim through for any questions you may have).

The executive summary

It makes no sense to blame WikiLeaks. As I will address below, the security practices could have been improved to prevent this from happening, but I don’t think it is reasonable to expect WikiLeaks to plan for this scenario. By definition (by giving the cables to David Leigh, the Guardian editor responsible for disclosing the passphrase), WikiLeaks chief Julian Assange was trusting Leigh with everything. There are a million things Leigh could have done to break the security. He could have left the cables sitting unencrypted on a laptop connected to the Internet (and in fact, according to WikiLeaks, he did that too). He could have uploaded the cables to a public server if he’d been malicious. Assange had to trust that he wasn’t going to be malicious or stupid with the cables, by the definition of the operation. Divulging the password was just another possible malicious/stupid thing that Assange had to trust Leigh not to do, and he did it. In any other security model they could have used to transfer the cables, there are equally stupid things Leigh could have done to mess up. So I feel that it is unfair to blame WikiLeaks for not predicting Leigh’s stupidity.

Ignoring stupidity on the part of trusted parties, the system as described above was entirely secure. This is my main point: WikiLeaks made an encrypted file public. Guardian made the passphrase public. Either one of these is safe on its own, but if both are public, the system breaks. So clearly, one of the two parties is to blame. The clear cryptographic answer is: Guardian is to blame. When we use encryption, it is entirely expected and normal for the encrypted text to be visible to the public. If the encrypted text isn’t visible to the public, it can’t possibly be transmitted. You can’t run a secure system under the assumption that an encrypted file will not be seen by others. The entire point of cryptography is that we transmit the encrypted text “in the clear” (without further encryption). On the flipside, the entire point of cryptography is that we don’t divulge the encryption key. So WikiLeaks was in the right to make the encrypted file public (however that happened), assuming that the passphrase would be kept private. The Guardian was in the wrong to make the passphrase public, assuming that the encrypted file would be kept private. By definition, the encrypted file was public because it was available from a public server for at least a few hours.

Again, Leigh was a non-technical person, so we can’t expect him to have understood all of these subtleties. But let’s remember what we’re dealing with here: arguably the most important secret documents in history. The man should have gotten a better technical understanding of cryptography before he did this, and failing that, given that he was not an expert, he should not have presumed it was safe to disclose the password. Even assuming no technical knowledge, it is completely idiotic to publish any kind of password, even an expired one. If nothing else, it would have been safe in case Assange did re-use the same password again. And just to dig a bit deeper: Assange took the extra step of creating a salt, the additional word that he told Leigh to remember and insert into the password but not write down, for the express purpose of protecting the data in the event that someone got a hold of the piece of paper. The final, clinching, idiotic move is that Leigh wrote the salt in the book as well as the password — the one thing he was never supposed to write down.

What are the facts?

Firstly, let me state the facts as I see them. Feel free to skip this lengthy section, but I will refer to it later on. I’ll begin by quoting a passage from the book WikiLeaks: Inside Julian Assange’s War on Secrecy. This book, published in February 2011, is available in book stores and can be found online. It was written by two Guardian journalists David Leigh and Luke Harding. Chapter 11, entitled “The Cables,” details the story of how Julian Assange, the editor in chief of WikiLeaks, sent over 251,000 U.S. diplomatic cables to David Leigh, for the purpose of redacting and publishing the cables via the Guardian. The following passage appears verbatim from the book, except that I have replaced the passphrases with XXXs (if you want to find them, they are readily available online as of yesterday). This is largely the same passage that WikiLeaks quoted in their release (but I got it from another copy to be sure):

Eventually, Assange capitulated. Late at night, after a two-hour debate, he started the process on one of his little netbooks that would enable Leigh to download the entire tranche of cables. The Guardian journalist had to set up the PGP encryption system on his laptop at home across the other side of London. Then he could feed in a password.  Assange wrote down on a scrap of paper: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX. “That’s the password,” he said. “But you have to add one extra word when you type it in. You have to put in the word ‘XXXXXXXXXX’ before the word ‘XXXXXXX’. Can you remember that?”

“I can remember that.”

Leigh set off home, and successfully installed the PGP software. He typed in the lengthy password, and was gratified to be able to download a huge file from Assange’s temporary website. Then he realized it was zipped up – compressed using a format called 7z which he had never heard of, and couldn’t understand. He got back in his car and drove through the deserted London streets in the small hours, to Assange’s headquarters in Southwick Mews. Assange smiled a little pityingly, and unzipped it for him.

This is about all the information I need to go on, and it’s written in Leigh’s (or his associate’s) own words, so it is difficult to dispute. WikiLeaks quoted this above passage, adding that the reason for having to add the extra word ‘XXXXXXXXXX’ was “so if the paper were seized, the password would not work without Leigh’s co-operation.”

The Guardian’s full statement from yesterday appears below:

It’s nonsense to suggest the Guardian’s WikiLeaks book has compromised security in any way.

Our book about WikiLeaks was published last February. It contained a password, but no details of the location of the files, and we were told it was a temporary password which would expire and be deleted in a matter of hours.

It was a meaningless piece of information to anyone except the person(s) who created the database.

No concerns were expressed when the book was published and if anyone at WikiLeaks had thought this compromised security they have had seven months to remove the files. That they didn’t do so clearly shows the problem was not caused by the Guardian’s book.

Damning for WikiLeaks, and the media have sure picked up on it. Essentially, WikiLeaks was incompetent in their handling of the situation. If WikiLeaks had done their job properly, it wouldn’t have mattered whether or not the password was published.

Except that because I understand the cryptography, I don’t need the Guardian to tell me what WikiLeaks “should” have done. I can piece together the blame just from reading the story of what went down that night when Assange gave the cables to Leigh. The only interesting claim from Guardian’s statement is this: “we were told it was a temporary password which would expire and be deleted in a matter of hours.” That is the only fact that is in dispute here — not whether or not the password was temporary (it is an indisputable fact that it wasn’t), but whether or not WikiLeaks told the Guardian that it was temporary. If they did, then some of the blame can be put on them. But just some. WikiLeaks vehemently denies that they told Guardian that the password was temporary, tweeting today: “It is strictly false that the Guardian was told the password or file were temporary, hence the elaborate password handover method.” We will never know whether WikiLeaks did or did not tell Guardian that the password was temporary. But we can deduce from the facts that the password was not temporary, and anyone with a basic knowledge of security (certainly anyone entrusted with the most politically explosive documents in history) should have known not to publish a password that a serious-looking white-haired Aussie in a black trench coat gave to you on a piece of paper and then explicitly told you a second part of the password which you were not to write down.

So the password is now public knowledge. But what about the archive? The password is useless without access to the encrypted files. We know that the encrypted archive is circulating around the torrents as we speak. It shouldn’t be too hard to find. The question is, where did it come from? Obviously it was originally created by Julian Assange. The details on how it leaked out are very sketchy, but most articles seem to point to this anonymous PasteBin post which seems to link things together (hyperlinks mine):

No-one took note of the Leigh book since where the encrypted file was located was a mystery! Enter the 2nd bad guy of our story: Daniel Domscheit-Berg. DDB said he didn’t know the password before reading the Leigh book, but apparently *did* know the hidden file name on Bittorrent. Using this these two facts (pw+hidden file location), he then went around ingratiating himself with various players by handing them the entire Cablegate archive under the mutually deniable cover of “warning” them about the Leigh book. Enraged after being expelled from the CCC [Chaos Computer Club] he “gave” the cables in this way to more and more people in exchange for alliances and positive spin culminating with the now infamous Freitag and Information.dk articles and now the thing is fucking everywhere…

So it’s hard to say how that file leaked out, but speculation says that since it was on the WikiLeaks server temporarily, and WikiLeaks was aggressively mirroring their site to avoid being taken down, it was copied within the few hours that it was available online, and spread from there. Spiegel reports (courtesy Jonathan for the link) that the file was left on the WikiLeaks server for an extended time, and that Daniel Domscheit-Berg took a copy of the data when he left, and it spread from there. This seems to be corroborated by a report by Nigel Parry (an excellent write-up of the whole debacle — courtesy Will for that link). In any case, it is largely irrelevant how this file got out, because it was already public in the first place, and that shouldn’t have been a problem.

So it seems, the facts were these:

  1. Assange created an encrypted archive with PGP, using symmetric encryption (passphrase). It is unclear when this took place (whether this was created specifically for Leigh, or whether it was created earlier).
  2. Around July 2010, he met Leigh in person, giving him most of the passphrase on a piece of paper, and verbally giving him the remainder of the password as a “salt”. It is unclear whether Assange told Leigh the password was temporary or would expire.
  3. Assange uploaded the encrypted file to the public WikiLeaks server, where it was visible to the public in its encrypted form. It is unclear (and irrelevant) whether SSL was used.
  4. Leigh downloaded the encrypted file from the WikiLeaks server.
  5. Leigh used PGP and the passphrase given to him by Assange to decrypt the file.
  6. Somehow, the encrypted data file ended up in circulation on BitTorrent. The facts surrounding this are sufficiently shady that I won’t go into details here.
  7. About seven months later, in February 2011, Leigh published a book containing the full passphrase given to him by Assange, including the “salt” that he was told to remember.
  8. WikiLeaks immediately found out about this, but kept quiet so as not to draw attention to it.
  9. In August 2011, rumours begin to circulate on the Internet that both the encrypted file and the password are readily available.
  10. On 1 September, WikiLeaks makes a public announcement calling out the Guardian.

Another thing that is unclear is whether WikiLeaks created a separate archive/passphrase for each journalistic organisation they shared the cables with, or whether there was one passphrase shared between all the journalists. Either scenario is feasible. Many people are denouncing the “stupidity” that Assange would reuse the same passphrase that he gave to Leigh when he later distributed the file, but given that we don’t know where the encrypted file came from, it may in fact be that he only used this passphrase one time, and the archive that spread on BitTorrent is the same one that was only available for Leigh to download for a short time.

Assuming the above facts, I will address most of the questions or points of misinformation that I have seen in the past 48 hours.

Why didn’t Assange transmit the file over a secure connection, such as SSL?

We don’t know whether SSL (TLS) was used, and it doesn’t matter. SSL doesn’t authenticate the client; it authenticates the server. If Leigh had connected to the WikiLeaks server over SSL, then he would have known he was talking to WikiLeaks, but WikiLeaks wouldn’t have known they were talking to Leigh. To demonstrate, visit https://www.google.com/accounts/. This page is sent to your browser via a secure connection, so you know you are talking to Google. But anybody in the world can visit that page. Google doesn’t know who you are (yet).

The file didn’t leak because of a man in the middle capturing the HTTP traffic between WikiLeaks and Leigh. It leaked because it was available on a public server. If I haven’t answered your question, please read the next one.

Okay, why didn’t Assange require that Leigh log in to the server (via SSL) to download the file?

So firstly, let’s assume that we’re using this as an alternative to PGP — that is, the file is made available to Leigh in plain text, but that it required a secure (SSL) log-in with a password that only Leigh knew. This would work much the same way as requiring someone log in to Gmail to retrieve a file — there would be a secure connection between WikiLeaks and Leigh and this time WikiLeaks would know that Leigh was the one downloading the file. Nobody else could log in to that file, and nobody could snoop the HTTP traffic between WikiLeaks and Leigh. Importantly, it would have been a temporary password, as Leigh believed. Once Leigh finished downloading the file, WikiLeaks would permanently delete it, ensuring that the password was now useless. In hindsight, perhaps this would have been a better idea (but once again, Assange would have to have predicted that Leigh would be stupid enough to disclose the passphrase).

It’s a reasonable solution, but has a few technical flaws. Firstly, you have to assume that WikiLeaks is set up with server software that handles authenticated file downloads. Such software is readily available as open source, but takes time to set up and understand its security implications. Also, no software is perfect, and such programs are commonly found to have security flaws which let the wrong person log in. So it would have been more dangerous to trust such software. (PGP, on the other hand, isn’t being used for authentication, but for encryption — there is no software bug that would let someone without the passphrase decrypt the file, since the passphrase is mathematically required to decrypt it.)

Secondly, we would have to assume that the connection to the server and Leigh’s client were secure. SSL has a number of well-known vulnerabilities which make it inappropriate for security of this importance. SSL requires verification from a trusted root certificate authority. WikiLeaks does not seem to have a verified certificate, and given the current political climate, it might be difficult for them to get one. Either way, a man in the middle could be using a dodgy certificate to snoop on the traffic as it is transferred. This also assumes that Leigh’s browser is clean: it doesn’t have any bad certificate authorities installed, it doesn’t have any dodgy browser extensions, he isn’t using Internet Explorer 6, and so on. Any spyware on his machine could grab either his authentication password or the document itself and transfer it to another party. That is why Assange explicitly requested that the cables only be decrypted on a computer not connected to the Internet. If the cables were transferred in the clear over SSL, they would have immediately become stored on a machine connected to the Internet — not to mention the fact that WikiLeaks would have had to store the cables unencrypted on their machine too.

So it would have been much less secure to use authenticated SSL instead of PGP. But that doesn’t mean Assange couldn’t have combined both of these approaches for maximum security. To pull this off, he’d have to have given Leigh two separate passwords: one for the authenticated login, and one for the decryption. He’d have to have set up a web server with SSL and an authentication system, and created a user account for Leigh, with the password given, then encrypted the file with PGP and uploaded it into Leigh’s private directory on the server. Then, he would have had to ensure that Leigh was actually handshaking with his server, and not a man in the middle — WikiLeaks would need a valid SSL certificate. He could have obtained one from a CA, but as I said above, that could have been difficult given the political state. He could alternatively have self-signed it, and then instructed Leigh to install the WikiLeaks root certificate into his browser. Then he would still have had to trust that there was no dodgy software on Leigh’s computer and that his browser was up to date. And after all that, he would still have had to tell Leigh to take the encrypted file off the computer onto a separate offline machine in order to run PGP to decrypt the file. If, after all that, Leigh disclosed the password, it would probably have been okay, because it is unlikely that anybody would have intercepted that encrypted file.

But think about it from Assange’s perspective (without the benefit of hindsight). Why would you go to all the trouble of setting up and training someone to use the above setup (someone who, by his own admission, isn’t even capable of using 7-zip — imagine guiding him through the steps of installing a root certificate in the browser), when a) your basic security is already bulletproof assuming nobody divulges the passphrase, and b) the additional more complex security is riddled with the sorts of holes I mentioned above? I just wouldn’t consider it to be worth the effort. I would consider “just” PGP to be “good enough”. You might say, “but Julian, this is a super duper top secret document, and you really should go to the extra effort.” You may have a point. But I don’t think it’s fair to blame Mr. Assange for “only” using the most robust encryption known to man once, rather than using it twice.

Let’s be clear on how secure the PGP solution is: You are sending encrypted data over the Internet. I still can’t believe that I’m reading technical people berating WikiLeaks for sending an encrypted message “in the clear” (i.e., not taking pains to prevent the encrypted document from going public) — this is a silly argument because by definition, an encrypted message sent “in the clear” is not in the clear — it is encrypted.

Why didn’t WikiLeaks use asymmetric (public/private key) cryptography instead of symmetric (passphrase)?

It wouldn’t have made a significant difference to the security. A lot of people are saying this, going on a general understanding that “asymmetric cryptography is better than symmetric.” Asymmetric cryptography is useful for a lot of things symmetric isn’t, but that doesn’t make it intrinsically “better.”

The asymmetric solution would have been for Leigh to generate a public/private key pair using PGP, and keep his private key safe. Leigh would have had to send his public key to Assange, who wouldn’t be able to trust that it belonged to Leigh because it was sent over the Internet. They would still have to have met in person in order to exchange Leigh’s public key (for Assange to be totally confident that the public key did in fact belong to Leigh) — this sort of in-person meeting is not out of the ordinary for even normal people not engaged in secret international diplomacy. Then Assange would have been able to encrypt the document using Leigh’s public key and send it over the Internet to Leigh, who would then have used PGP to decrypt the document using his private key. We shall assume that, as above, the encrypted document managed to get out into the public sphere.

Note the similarities here between the symmetric and asymmetric version. It would still require an in-person meeting, and it would still require that Leigh kept on his computer a secret which would expose the document if it ever got out. You might say that Leigh wouldn’t have divulged his private key, because he knew how important it was, but then again, you might have thought the same thing about a super top secret passphrase. Fundamentally, both systems are equivalent.

The reason why asymmetric cryptography is useful over symmetric is that it doesn’t require individualised in-person key exchanges. Once you have established the trustworthiness of a public key, you can send encrypted documents to that person forever. Since this was presumably a one-time exchange, and the two would have had to have met in person anyway to be very confident in establishing trust, there is no advantage in this case of using asymmetric cryptography over symmetric.
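To make the comparison concrete, here is roughly what the two workflows look like with GnuPG (a sketch only: GnuPG stands in for whatever PGP implementation was actually used, and the file name and key name here are made up):

$ gpg --symmetric cables.7z                    # symmetric: sender is prompted for a passphrase
$ gpg --decrypt cables.7z.gpg > cables.7z      # recipient decrypts with that same passphrase

$ gpg --gen-key                                # asymmetric: recipient generates a key pair, once
$ gpg --encrypt --recipient leigh cables.7z    # sender encrypts to the recipient's public key
$ gpg --decrypt cables.7z.gpg > cables.7z      # recipient decrypts with the private key

In both cases the secret that must never get out (the passphrase, or the private key) lives with the recipient; only the mechanics of the key exchange differ.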

Why didn’t WikiLeaks choose a meaningless passphrase?

This is a criticism I share. If you have seen the actual passphrase, you will know that it is an ordinary English phrase which is fairly descriptive of the source material. It is long and properly padded with random characters, so it would likely never have been brute forced. But it still would have been better if the password had been a sequence of random characters, not a sentence. In particular, Leigh possibly wrote the passphrase down in the book to show how tantalising it was when he first saw it. In that sense, the passphrase forms part of his narrative, in a way that “6105ffc4#098f#4f99#acab#c2bcf7b71bd4” would not have. A similar argument would apply to the private key option discussed above (by definition, a private key is not a meaningful phrase).

But once again, Assange would have had to reason “if I choose an English passphrase, Leigh might be tempted to write it down in his book, so I had better choose a random string of digits.” That thought probably never crossed his mind, because the idea that Leigh would write a passphrase in a book just because it was an English passphrase is nonsensical.

Why didn’t WikiLeaks remove the file immediately after the transfer?

It’s not clear to me whether they did or didn’t do this. Spiegel claims that Assange simply left the file there for an extended period of time. This probably wasn’t wise, but again, it wasn’t tremendously irresponsible either, given that the file was encrypted, and had already been public for several hours anyway. Assange could have assumed that the file was already public, and so if the password leaked out, the damage would be done either way. Still, if this is true, it doesn’t sit well for Assange. It would have been a good precaution to delete the file anyway.

Why didn’t WikiLeaks remove the files once they discovered the password printed in the book?

It was too late. This was the Guardian’s retort when this whole thing exploded yesterday: “No concerns were expressed when the book was published and if anyone at WikiLeaks had thought this compromised security they have had seven months to remove the files.” This statement betrays a fatal misunderstanding of the issue here. By the time the book was published, WikiLeaks had no control over that file. It wasn’t a password to a file on the WikiLeaks servers. It was a password to a file that was, by that time, mirrored on hundreds of other servers and circulating the Internet at large. There was no way to delete it. On the other hand, there wasn’t really any evidence linking the password to the file, so there was a chance that nobody knew about it. WikiLeaks did the best thing they could do: they said nothing about it to anyone. Seven months later, they finally broke their silence after it became clear that the Internet was piecing things together.

Why didn’t WikiLeaks simply set the encrypted file to expire after a certain period of time?

This is one of the most common criticisms. Many technical people have stated: “PGP lets you set keys to expire at a certain date. Assange could have used that technology.”

It isn’t true. In fact, if you have such technology, Microsoft, Sony and the MPAA would love to hear from you, because that would be the holy grail of DRM technology. You can’t make bits that “expire” or “self-destruct” — they are infinitely copyable and don’t disappear unless every single person who has a copy of them simultaneously and voluntarily deletes them. That tends not to happen.

When people say “PGP lets keys expire,” they’re thinking of signatures, not encryption. PGP has two main uses: signing files (“is this file really from the person who claims to be its author?”) and encrypting files (“nobody can read this file except its intended recipient”). On the opposite side, you can verify a signature or decrypt a file. When verifying a signature, it is in the client (the one verifying the file)’s best interests to fail if there is any doubt, so the client will first check that the signature is valid, and then check that the key’s expiration date is in the future compared to the computer’s clock. You could set your computer’s clock back if you wanted to trick your computer into passing the verification, or hack the source to PGP to disable this check, but then you would only be hurting yourself: you’d be tricking yourself into thinking the signature is valid when in reality it might not be.

With decryption, it’s the other way around. It is in the client (the one decrypting the file)’s best interests to succeed wherever possible. I can imagine a version of PGP that first decrypts the file, then checks whether the key’s expiration date is in the future compared to the computer’s clock, and if it isn’t, it says “could not decrypt.” But that would be pointless, because any hacker who really wanted the contents of that file would either set her clock back, to trick the computer into passing the time check, or hack the source to PGP to disable it. Since there is this obvious possibility, PGP doesn’t include that feature at all. You cannot make bits rot.

Returning to the DRM analogy, it has been pointed out that companies like Ubisoft and Blizzard make games that require an “always-on” Internet connection to work. Why couldn’t Assange put similar DRM on the files, requiring Leigh to be always connected to his server to access it (and therefore, have the ability to revoke access when necessary). The problem is that DRM is just all of the above discussion wrapped in a different coating. Most DRM relies on security through obscurity — the video game software is hard-coded to fail without an Internet connection, but the data is still present locally. That’s why most of these games get hacked (and if people are willing to put the time into hacking Driver, surely it’ll be worth it to get the WikiLeaks cables). The “phone home” DRM would come in one of two forms: either the data is stored locally, but it is encrypted, and the decryption key would be sent from the server, or the server would be streaming the data live upon request. Both of these are vulnerable to the same problems as the SSL scenario I talked about above: either you are sending the key, or the cables, over the wire, and that isn’t good. It is much safer for Assange to meet Leigh and hand him a piece of paper than to be sending keys over the network. It is really not feasible to use DRM to make a document expire after some time.

It is possible that Leigh believed that Assange had used this sort of technology, not being a technical person. But that was his misunderstanding — one that he should have clarified before publishing the passphrase in a book.

Why did Assange tell Leigh that it was a temporary password?

Whether or not he said this is something we’ll never know the answer to, since it’s WikiLeaks’ word against the Guardian’s. It’s not scientific of me to make guesses like this, but I’m going to, because I know Mr. Assange’s reputation. Before he was a WikiLeaks activist, Assange was a cryptography researcher. He created the Rubberhose file system to allow people to safely carry digital secrets without divulging their existence. I cannot say for sure what Assange told Leigh about that passphrase, and I have never met Mr. Assange, but judging by his reputation alone, he knows cryptography inside out. He knows which pieces of information are safe to divulge, and which aren’t. I find it hard to believe that Assange would have accidentally told Leigh that this was a temporary password, when we know just by virtue of the fact that he used PGP that it wasn’t temporary.

If I can make some further speculation, I would imagine that Assange told Leigh something along these lines: “I am going to give you access to a file on my web server that will be temporarily available. After a few hours, the file will not be available any more, so you have to download it soon. Also, here is the password which you can use to decrypt the file.”

It’s possible that a non-technical person may have misunderstood the above sentences as suggesting that the password would be useless after those few hours. That still doesn’t excuse the divulging of a password. If someone says something about a red button which you didn’t fully understand, it is probably not a good idea to push the red button.

Also, I can only imagine that Assange did stress the utmost importance of keeping the password secure, and not writing down the additional “salt” word — after all, why would he tell him to remember the salt in the first place if it was safe to write it down?

Why did WikiLeaks re-use the same password for subsequent distribution of the same file?

It isn’t clear how that encrypted file got out, or whether or not WikiLeaks reused the password. This is an (as far as I can tell) unsubstantiated claim that is going around. But let us assume that this is the case. That is: We shall assume that subsequent (or prior) to their dealings with Leigh, WikiLeaks distributed the same cables to other journalistic organisations (as they are known to have done), and in their haste, used the same passphrase to avoid re-encrypting the data. I have found no evidence to support this, but let’s assume it for the time being.

It has been claimed that this violated the principle of discretionary access control, or similar: much like you wouldn’t let different users run software under the same user account, you shouldn’t use the same password for different organisations. There is some wisdom to this. Let us assume that WikiLeaks gave the encrypted file to N organisations, and that there is a probability p, where 0 ≤ p ≤ 1, that any given encrypted file will leak out, and another probability, q, where 0 ≤ q ≤ 1, that the corresponding passphrase will leak out. If you make a separate encrypted bundle for each organisation, then the document will be exposed if and only if one of the encrypted files leaked out, and the passphrase corresponding to that same file also leaked out. It doesn’t matter if the Times encrypted file is leaked and the Guardian key is leaked, since the key doesn’t unlock that file. The probability of a “game over” scenario (the cables go public) is 1 – (1 – p×q)^N. That’s because p×q is the probability that both the data and key for the same file get loose, (1 – p×q) is the probability that they don’t both get loose, and N is the number of keys you have to worry about (so multiply the safe probability that many times, then invert it to get the game over probability). For example, if p = 0.3, q = 0.1, and N = 10, the game over probability is 1 – (1 – 0.03)^10 = 1 – 0.97^10 = 1 – 0.74 = 0.26. Compare that to the scenario in which you make the same bundle and passphrase for all the journalists. Now the document will be exposed if and only if any of the encrypted files leaks out and any passphrase leaks out. The probability of the “game over” scenario is now (1 – (1 – p)^N)×(1 – (1 – q)^N). That’s because we now measure the individual probability of any data leaking and any key leaking, and multiply them together. This gives us much worse odds: For example, if p = 0.3, q = 0.1, and N = 10, the game over probability is now (1 – (1 – 0.3)^10)×(1 – (1 – 0.1)^10) = (1 – 0.7^10)×(1 – 0.9^10) = (1 – 0.028)×(1 – 0.35) = 0.97 × 0.65 = 0.63. So the underlying maths confirms our intuition that it is better to give each person we deal with a different key.

But now let’s try to make some reasonable estimates for p and q (and continue assuming that N = 10). Given how important the passphrase is, we’d like to think that q is extremely extremely low. Again, Assange was operating under the assumption that q is near 0, but for the sake of argument, let’s keep q at the generously high 0.1. What is a reasonable value for p? As I’ve stated above, we are operating under the assumption that all of the encrypted data is public, at least for a short while. Every time we share this (for each N), the data will be on a public web server for several hours: a web server that is possibly being mirrored by thousands of other computers across the world in real-time. So if I had to guess, I would place p at 1, or very close to it. That’s not a silly assumption, given that again, the whole point of cryptography is that it is safe to assume the encrypted data is being seen by everyone. Basically, although he didn’t realise it, Leigh was betting that p was 0 when he published the passphrase. Now let’s run the numbers again with p = 1 and q = 0.1. If you give everybody their own individual key, the game over probability is 1 – (1 – 1×0.1)^10 = 1 – 0.9^10 = 1 – 0.35 = 0.65. If you share the key between all the journalists, the game over probability is (1 – (1 – 1)^10)×(1 – (1 – 0.1)^10) = (1 – 0^10)×(1 – 0.9^10) = (1 – 0)×(1 – 0.35) = 1 × 0.65 = 0.65. If p (or q) is 1, then it makes no difference. If you are operating under the assumption that all the encrypted files are being captured by the public (and we are), then you cannot afford to let any single key slip out. There is one other advantage to having separate keys for each journalist, and that is that if a key does slip out, you know who to blame. But that isn’t the problem here.
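If you would like to check that arithmetic yourself, here is a tiny sketch of the two models in Haskell (chosen only because it is the language used elsewhere in these articles); p, q and n are the quantities defined above:

-- "Game over" probability when each of the n recipients has their own
-- passphrase: the file and its matching key must both leak.
separateKeys :: Double -> Double -> Int -> Double
separateKeys p q n = 1 - (1 - p * q) ^ n

-- "Game over" probability when one passphrase is shared by all n recipients:
-- any leaked file plus any leaked key is enough.
sharedKey :: Double -> Double -> Int -> Double
sharedKey p q n = (1 - (1 - p) ^ n) * (1 - (1 - q) ^ n)

main :: IO ()
main = do
  print (separateKeys 0.3 0.1 10)  -- roughly 0.26
  print (sharedKey    0.3 0.1 10)  -- roughly 0.63
  print (separateKeys 1.0 0.1 10)  -- roughly 0.65
  print (sharedKey    1.0 0.1 10)  -- roughly 0.65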

To use a real-world analogy, say there is an office with 10 rooms and 10 keys. It is more secure if each key only unlocks one room, because then if someone steals a key, they can only access the contents of one room and not all 10. But that isn’t a good analogy for the cables because all the organisations were given the same file. If any one leaks out, it’s game over. So a more appropriate analogy is if there is an office with one big room with 10 doors and 10 keys. Now it doesn’t matter if each key only unlocks one specific door, or if all the keys unlock all the doors. Either way, if someone gets one key, they can steal everything in the room. Given that the whole point of this operation was to trust the journalists with the entire set of cables, there isn’t much more security you can have. You simply have to trust that none of the journalists will reveal his key. It doesn’t matter whether all the journalists have the same key, or if they are given individual ones.

In any event, p is probably not quite 1, so it probably would have been prudent for WikiLeaks to use separate passphrases for each journalist, and it certainly wouldn’t have hurt. But did they? It isn’t clear that Assange distributed the same file and passphrase to multiple people, and it isn’t clear how the file leaked out at all (more on that in the next section). But the point of this section is to show that, however negligent it was to leave that encrypted file on a hard drive, it pales into insignificance next to the negligence of divulging the passphrase. Again, we are operating under the assumption that the encrypted file is available to the public (there is no other way to run a cryptosystem), and therefore we require that the password is secure. We assume that p is 1 or close to it, and we require that q is as close as possible to 0.

How did the encrypted file get out onto the Internet?

I’ve put this question last, as it’s the least important. Hopefully by now, I’ve made the case that it doesn’t matter how the data file got out, who is to blame, or whether or not it was deliberate, because of the basic tenet of cryptography that your encrypted text is not something you need to keep secure. But it deserves to be addressed. I have read many theories, and have not yet seen any clear evidence about this:

  • First and foremost, there is some chance that it was grabbed from the WikiLeaks server during the hours that it was online for Leigh to download.
  • According to the Spiegel article, the file was only used the one time (to give the data to Leigh), but it leaked out because it wasn’t deleted by Assange, instead remaining on the server for an extended time before it was ultimately mirrored.
  • An insider at WikiLeaks could have posted the encrypted file online (either accidentally or deliberately).
  • The encrypted file could have been leaked from the Guardian (either accidentally or deliberately). WikiLeaks claims that “David Leigh and the Guardian … violated … our requirements that the unpublished cables be kept safe from state intelligence services by keeping them only on computers not connected to the internet.” If this claim is true, it is not unlikely that the files were stolen from the Guardian computers by government agents or others.
  • If WikiLeaks did use the same passphrase and encrypted file to give the cables to the handful of other organisations they worked with, any one of them could have leaked the encrypted file online (either accidentally or deliberately).

Basically, we don’t know how. But it is the least important fact in this debate.

Update: Several people have written to point out that the encrypted data file, z.gpg, was published by WikiLeaks themselves as part of the November 28, 2010 dump of the full WikiLeaks server archive. A post on the front page of wikileaks.org still reads “All released leaks archived (2010-11-28): Due to recent attacks on our infrastructure, we’ve decided to make sure everyone can reach our content. As part of this process we’re releasing archived copy of all files we ever released – that’s almost 20,000 files. The archive linked here contains a torrent generated for each file and each directory.” That link (now removed), I am told, contained the encrypted file unlocked by Leigh’s password. (Thanks to Justin for a full explanation.) I must stress that this file was made available by WikiLeaks in November 2010, three months before Leigh’s book was published.

Conclusion

Clearly, an avalanche of “mistakes” has led to this unfortunate ending. I genuinely believe that Leigh never meant to disclose a live password, but this doesn’t excuse his ignorance of basic security practices (passwords are secret: even if you think they have expired, it’s best not to assume, especially when international diplomacy is at stake). Roughly outlined, the “mistakes” I have seen people claim are as follows:

  • Assange decided to trust Leigh in the first place,
  • Assange decided to send a file encrypted with PGP “in the clear” (without a second layer of encryption), allowing anyone online to get access to it,
  • Assange chose to symmetrically encrypt the file with a passphrase, rather than require Leigh establish a public/private key pair,
  • Assange chose a passphrase that was an English phrase and not random gibberish,
  • Assange possibly told Leigh that it was a temporary passphrase (according to Leigh),
  • Assange possibly left the file sitting on the drive after it had been sent (according to Spiegel),
  • Assange possibly re-used the same file and passphrase to give to multiple journalists (according to some rumours I’ve seen going around),
  • Leigh published the passphrase in a book.

The above dot points roughly outline what appears to be a rather damning case against Assange. But let’s not forget the first rule of cryptography: establish what is a secret and what is not. Once you have done that, you can set about protecting the secrets at all costs, and operate under the assumption that everything else can be seen by anyone, without any concern. It is not “lazy” or “irresponsible” to encrypt a file, and then send that encrypted file over a public network. It is not “lazy” or “irresponsible” to publish an encrypted file on BitTorrent and point at it with a public tweet saying “I have posted an encrypted file, which I would like as many people as possible to download.” (Assange did that too.) Much of the modern world’s infrastructure is built upon these cryptographic principles, from the DRM on your iPhone apps, to the transfer details you give to your bank website. Every single one of the “mistakes” Assange made was only a mistake in the context of the final error — the publishing of the passphrase in the book. Assange shouldn’t be held accountable for his “mistakes” because they were all rational decisions when operating under the assumption that your passphrase is safe. If you aren’t operating under that assumption, then it’s basically game over anyway.

This document has made the case that protecting the data against a collaborator who wilfully publishes the passphrase is a lot harder than the media and the commentators are giving Assange credit for. He would have had to go to extreme lengths to seal up all of the issues I’ve mentioned, and that would have bought him only a modest amount of additional security. It probably wouldn’t have been worth it, and it’s pretty hard to make a case for doing so before the fact. What we’re really talking about in this article is shades of caution. Assange could have been more cautious, but in the end, you are pretty much hosed if your collaborator gives up the encryption passphrase, and if you want to blame someone, it should be them.

As my friend Richard said, “[The Guardian have] done the harm that Wikileaks never did but copped months of trash talk and threat of prosecution for.” WikiLeaks will likely take the fall for this, but as cryptographers, we know where the real blame lies.

Acknowledgements to Will for reading through this post and suggesting improvements.

Articles

Integer conversions in C

In C on August 8, 2011 by Matt Giuca

In light of a recent bug in crypt_blowfish highlighted by Steve Gibson on the Security Now podcast (episode #311), I read up on signed / unsigned conversions in C. Wow, is it complex. Here are some things you might not know about integer conversions in C (relating to converting between different bit widths and signedness). (Note: I, like the C specification, use the word “integer” to refer to any type that represents whole numbers, and the word “int” to refer to the C type of the same name.)

  • The type “int” is equivalent to “signed int”. However, it is implementation-defined whether “char” refers to “signed char” or “unsigned char”. If you care, you have to be specific (this is probably what caused the crypt_blowfish bug, because a variable was declared as “char” which really needed to be an unsigned char).
  • Converting between unsigned integers of different sizes is always well-defined (for example, converting from an unsigned int to unsigned char takes the value % 256, assuming an 8-bit char type).
  • However, a narrowing conversion to a signed integer type is not portable if the value cannot be represented in the new type (for example, converting the value 256 to a signed char, assuming an 8-bit char type): the standard says the result is implementation-defined (or an implementation-defined signal is raised). In almost all compilers you can assume two’s complement format, which makes the behaviour nearly uniform in practice, but the language standard itself does not mandate two’s complement representation for signed integers, and therefore does not pin down a result. Signed arithmetic that overflows is worse still: that is outright undefined behaviour.
  • Similarly, converting from signed to unsigned is always well-defined, but converting from unsigned to signed is implementation-defined if the value is too large to fit. (Converting from unsigned to a signed type with more bits is always well-defined.)
  • Converting a signed value to an unsigned value with more bits (a widening conversion) is equivalent to first making the signed value wider (preserving the value) and then converting it to unsigned (taking the modulo). For example, converting a signed char -38 to an unsigned short results in 65536 – 38 = 65498, assuming a 16-bit short type. If you think about it, this isn’t obvious, because it could have been the other way around (first convert to unsigned, then widen, so in the above example you would get 256 – 38 = 218), but I think the way they chose makes more sense. In two’s complement notation, this is a sign-extension (and the alternative would be zero-extension).
  • For an operation A * B (“*” represents any binary operator, not necessarily multiplication), if A is signed and B is unsigned, A gets converted to an unsigned number, provided B’s type is at least as wide as A’s (strictly, of at least the same conversion rank; if the signed type can represent every value of the unsigned type, the conversion goes the other way). This is quite unintuitive to me. I would have thought that any operation involving a signed number would be performed as a signed operation, but C favours unsigned numbers here (I presume because the results of the conversion are well-defined). This means that if A is a signed char and B is an unsigned int, A is first converted to an unsigned int via sign-extension. The sketch below shows these rules in action.

Source: ISO/IEC 9899:TC3 (the C99 draft standard from 2007), §6.3.1
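
Here is a minimal sketch of those rules in action (not from the standard or the original bug); it assumes the usual 8-bit char, 16-bit short and 32-bit int, and a plain char that is signed:

#include <stdio.h>

int main(void) {
    unsigned int big = 1000;
    unsigned char narrowed = big;       /* well-defined narrowing: 1000 % 256 == 232 */
    printf("%u\n", (unsigned)narrowed);

    signed char sc = -38;
    unsigned short widened = sc;        /* widen first (value preserved), then wrap: 65536 - 38 == 65498 */
    printf("%u\n", (unsigned)widened);

    signed char a = -1;
    unsigned int b = 1;
    printf("%d\n", a * b > 0);          /* prints 1: a is converted to unsigned, so -1 becomes UINT_MAX */

    return 0;
}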

You can see the bug here (in the function BF_set_key). There are two variables involved: unsigned int tmp and char* ptr. Because the code doesn’t specify whether ptr points to an unsigned or signed char, the compiler is allowed to choose (and most modern compilers choose signed char). The killer line is “tmp |= *ptr;” which, according to the above rules and assuming a signed char, converts a signed char to an unsigned int by sign-extension. This is very bad, because whenever the byte has its top bit set, the bitwise OR sets all of the bits above the low eight in tmp to 1 (when the programmer expected them to be left alone), causing a massive weakening of the hash. The bug fix explicitly casts *ptr to an unsigned char first, and funnily enough includes a switch “sign_extension_bug” to re-enable the old, buggy behaviour in case you want your old hashes to match!
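
To make the failure mode concrete, here is a minimal, self-contained sketch of the same pattern (not the real BF_set_key code; it assumes the typical signed plain char, two’s complement setup):

#include <stdio.h>

int main(void) {
    char key[] = { (char)0xA3 };    /* a byte with its top bit set; plain char is signed on most compilers */
    const char *ptr = key;
    unsigned int tmp;

    tmp = 0;
    tmp |= *ptr;                    /* buggy: *ptr is sign-extended to 0xFFFFFFA3 before the OR */
    printf("buggy: %08x\n", tmp);

    tmp = 0;
    tmp |= (unsigned char)*ptr;     /* the fix: zero-extend, giving 0x000000A3 */
    printf("fixed: %08x\n", tmp);

    return 0;
}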

Articles

Simulating classes with prototypes in JavaScript

In JavaScript, Object-oriented programming on June 5, 2011 by Matt Giuca

I’ve been programming in JavaScript for years, and have previously vaguely gotten my head around prototypes (what JavaScript has instead of classes), but never fully. I think I have the hang of them now, though, so I thought I’d share some knowledge with you. If you haven’t tried some real object-oriented programming in JavaScript, basically it’s quite different from everything else, because in everything else, you have classes and objects. As we are all taught in OOP 101, objects are instances of classes. So we build classes and then we instantiate them, creating objects.

But JavaScript is a prototype programming language, which means it has no classes, only objects. In prototype languages, objects are instances of other objects. If Object B is an instance of Object A, then Object A is said to be Object B’s prototype. So basically, this sort of unifies the concepts of instantiation and inheritance. There is no difference between saying “Foo is an instance of Bar” and “Foo is a subclass of Bar” — in both cases, we say “Foo’s prototype is Bar.”

So that’s all well and good, but there’s one problem: I just want to write code. This is a pretty dismissive attitude, but I think, since we web programmers are forced to use this language, most of us are not really interested in learning a new and rather experimental programming paradigm. We just want to do things “the usual way”. (Note: There is a lot of strong opinion on The Wiki, and I mean the original Wiki and not Wikipedia, regarding prototype programming and why sliced bread does not compare. I personally don’t see the point of conflating objects and classes — to me, it is helpful to distinguish between objects and the types that describe them. That isn’t to say I am against it, just that in this case, I just want to know how to apply my hardened OOP techniques in JavaScript.)

In short, I have written the most basic Python program that uses inheritance. What is the most idiomatic JavaScript equivalent of this code?

class Foo(object):
    def __init__(self, x):              # Constructor
        self.x = x
    def getX(self):
        return self.x

class Bar(Foo):                         # Inherit from Foo
    def __init__(self, x, y):
        super(Bar, self).__init__(x)    # Superclass constructor call
        self.y = y
    def getY(self):
        return self.y

foo = Foo(4)
print(foo.getX())

bar = Bar(2, 9)
print(bar.getX())
print(bar.getY())

If you aren’t versed in Python, the above code simply has a class Foo with a variable x and a getter, getX, and a class Bar which is a subclass of Foo, which introduces a new variable y and a getter, getY. Note that getters aren’t particularly good Python style, but in this case they just serve as examples of methods which operate on the data. Note that Bar’s constructor takes both x and y, and defers the initialisation of x to the superclass constructor. Also note that the instance variables x and y are not private (Python doesn’t have a very good notion of “private members”); I don’t wish to cover encapsulation in this post.

So here’s what I came up with. JavaScript gurus might like to suggest improvements in the comments:

function Foo(x) {                   // Constructor
    this.x = x;
}
Foo.prototype.getX = function() {
    return this.x;
};

function Bar(x, y) {
    Foo.call(this, x);              // Superclass constructor call
    this.y = y;
}
Bar.prototype = new Foo();          // Inherit from Foo
Bar.prototype.getY = function() {
    return this.y;
};

foo = new Foo(4);
print(foo.getX());

bar = new Bar(2, 9);
print(bar.getX());
print(bar.getY());

Let’s break this down. First, we’ll just focus on Foo. You’ll note that I am curiously defining a function, not a class. Yes, to define the JavaScript equivalent of what I think of as a class, you just write its constructor as a top-level function. Note that Foo assigns to “this.x”. The key to what makes this a constructor and not an ordinary function is the way it is called — note the calling syntax uses the “new” keyword. If you call a function in JavaScript with “new“, it creates an object and calls the given function, allowing the function to assign to the new object via “this“. So “new Foo(4)” creates an object “{“x”: 4}”. But what about methods?

Well, since JavaScript is a proper functional programming language, we could have the constructor assign the methods to attributes of the this object, like so:

function Foo(x) {                   // Don't do this!
    this.x = x;
    this.getX = function() {
        return this.x;
    };
}

This is actually a common approach, but I don’t like it. It’s extremely inefficient in terms of space and speed. With this approach, every single object contains a separate getX function, wasting a lot of space. For details, see this Stack Overflow discussion, including some speed tests I ran. We need a solution which lets all instances of Foo share the same method table, like they would in a class-based language (where all instances share a common class).

This is where prototypes come in. In JavaScript, every object has a prototype, which is another object that this object inherits from (think of it like inheritance, but for objects rather than classes). More formally, if an object X has prototype P and an attempt to look up an attribute of X fails, JavaScript will search for the attribute in P instead (this can be recursive; a prototype can have a prototype, and so on). Therefore, we need to make sure that all “Foo objects” share a prototype, and that that prototype contains all of the “Foo” methods. In the above code, I do this by setting Foo.prototype.getX to a function object.

What’s confusing about this is that Foo.prototype is not the prototype of Foo (the constructor function). It is just an attribute of the function Foo. Foo.prototype is the prototype of any object constructed by Foo. So when I later write “foo = new Foo(4)”, JavaScript does the following:

  1. Create a new object,
  2. Set the new object’s prototype to Foo.prototype,
  3. Call Foo with the new object as this,
  4. Assign the new object to foo.

Therefore, assigning to attributes of a constructor’s prototype object makes those attributes available to all objects constructed with that constructor, and they all share the one value. Commonly, you assign methods here, but you can also assign other values, which will be shared across instances (like static fields in Java, or class variables in Python).
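
For instance, here is a minimal sketch of a shared, class-variable-like value (the species field is made up for illustration):

Foo.prototype.species = "foo";     // one shared value, stored on the prototype
var a = new Foo(1);
var b = new Foo(2);
print(a.species === b.species);    // true: both lookups fall through to Foo.prototype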

So that’s how Foo works. How about Bar, which requires inheritance? I’ll use the prototype system to simulate class inheritance. The key is the line “Bar.prototype = new Foo()”. This creates a new Foo instance (a dummy instance; we won’t actually be using its value, just its prototype), and stores that Foo instance as the “prototype” attribute of Bar — i.e., the prototype of all objects constructed by Bar. We then extend that prototype with a new method, getY. We could also override methods by assigning them to Bar.prototype (see the sketch below). Note that if we had just said “Bar.prototype = Foo.prototype”, that would be wrong, because adding or overriding methods would also affect any Foo instances. With this approach, Bar.prototype is not equal to Foo.prototype; rather, it is an object whose prototype is Foo.prototype. Therefore, any method lookups which fail on Bar.prototype will fall back to Foo.prototype.
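
For example, a hypothetical override (not part of the example above) could shadow getX for Bar instances:

Bar.prototype.getX = function() {
    return this.x * 2;             // found on Bar.prototype first, so Foo.prototype.getX is never consulted
};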

(It sucks that I need to actually call new Foo() to create the dummy Foo object, since it means the constructor must work on undefined inputs. There should be some way to construct “an object with Foo.prototype as its prototype” without directly calling Foo, but I can’t think of one.)
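
For what it’s worth, ECMAScript 5 defines exactly this operation, Object.create, although at the time of writing it is not yet available in every browser. A minimal sketch of the inheritance line using it instead:

Bar.prototype = Object.create(Foo.prototype);  // an object whose prototype is Foo.prototype, built without calling Foo
Bar.prototype.getY = function() {
    return this.y;
};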

This is easier to visualise with a diagram. Looking at the diagram below, we can see the distinction between constructors with a “prototype” attribute, and prototypes of objects. You can visually trace the lookup of the “getX” method on the “bar” object — “bar” itself has no attribute getX, so we follow the dashed line to its prototype. The prototype (Bar.prototype) has a getY attribute, but no getX attribute, so we follow the dashed line to the prototype’s prototype (Foo.prototype), which has the getX method, so we call that.

Object diagram of the above example

The last thing to note is how the Bar constructor calls its superclass constructor, a very common thing for a subclass constructor to do. I’m using the ‘call’ method, which lets you call a function and explicitly specify the value of “this“. If I just called “Foo(x)” (without new), this inside Foo would be the global object, so it would set a global variable x rather than initialising the object under construction, which isn’t what I want. Performing “Foo.call(this, x)” says “call Foo(x), but use the this object from the Bar constructor as the this object in the Foo constructor.” Hence, when Foo assigns to “this.x”, it is assigning to the same this that Bar is constructing.

In summary:

  • When you want to define a “class,” define its constructor as an ordinary function, and have it write to fields of this.
  • Add methods to the “class” by assigning function objects to constructor.prototype.method.
  • Instantiate “classes” by calling the constructor with new.
  • Inherit from another class by assigning new SuperClass() to SubClass.prototype.
  • Call superclass constructors using SuperClass.call(this, args…).

Edit: Added semicolons at the end of “= function” statements.

Articles

You should XML-escape your URIs

In Web development on June 2, 2011 by Matt Giuca

A quick post about a sneaky little edge case I thought of. I don’t know if this is common practice, but when I encode a URI, I usually consider it “safe for XML”. It isn’t.

In other words, once I have taken a string like “dog’s bed & breakfast” and URI-encoded it to “dog%27s%20bed%20%26%20breakfast”, I would consider it safe to slap into an XML or HTML attribute value, such as <a href="dog%27s%20bed%20%26%20breakfast">…</a>. But it isn’t.

There’s one little problem: URIs allow (and frequently use) the ampersand character (“&”) to separate query string arguments. XML attribute values specifically disallow this character. It isn’t a problem with the above string, because the ampersand was part of the string, and so it was escaped as “%26”, which is a perfectly legal XML attribute value. But any URI with multiple query string parameters that you put into an attribute value is technically invalid XML. For instance, if you had the query parameters {“name”: “henry”, “age”: “27”}, that encodes to the query string “name=henry&age=27”. The XML element <a href="http://example.com/?name=henry&age=27">…</a> is invalid, because it contains a bare ampersand in the attribute value. Browsers, however, don’t seem to mind, and will process the above link properly.

The problem happens on the edge cases. Consider the query parameters {“name”: “henry”, “lt;”: “27”}. They encode to the query string “name=henry&lt;=27”, and if you put that into XML without escaping it, you get <a href="http://example.com/?name=henry&lt;=27">…</a>, which is valid XML but completely different to what you intended (it parses as the single query parameter {“name”: “henry<=27”}). Even if your URI encoder escapes the “;” (which it should, as “;” is a reserved character), you’ll still get <a href="http://example.com/?name=henry&lt%3B=27">…</a>, once again invalid, but both Firefox and Chrome still manage to parse the bare “&lt” as “<”.

So even if you have URI-encoded (which you should do), you still need to XML-encode before putting the result into the attribute value: <a href="http://example.com/?name=henry&amp;lt;=27">…</a>. A minor point is that if you are using single-quoted XML attribute values, you also need to make sure that any single quotes in the URI (which it is technically allowed to contain) are XML-encoded as well.
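
As a concrete sketch (the helper name xmlEscapeAttr is made up for illustration), the rule is: percent-encode each component, assemble the URI, then XML-escape the whole URI before it goes into a double-quoted attribute:

// Escape a string for use inside a double-quoted XML/HTML attribute value.
function xmlEscapeAttr(s) {
    return s.replace(/&/g, "&amp;")
            .replace(/</g, "&lt;")
            .replace(/>/g, "&gt;")
            .replace(/"/g, "&quot;");
}

var query = encodeURIComponent("name") + "=" + encodeURIComponent("henry") + "&" +
            encodeURIComponent("lt;") + "=" + encodeURIComponent("27");
var uri = "http://example.com/?" + query;              // ...?name=henry&lt%3B=27
var html = '<a href="' + xmlEscapeAttr(uri) + '">link</a>';
// The attribute now contains "&amp;lt%3B", which an XML parser reads back as the literal "&lt%3B".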

Also see my full post about URI encoding, URI Encoding Done Right.