Archive for the ‘JavaScript’ Category

Articles

Why you should never mutate a JavaScript Date

In JavaScript on August 2, 2013 by Matt Giuca

Poor JavaScript. Everybody’s friend, but nobody loves it. Perhaps because some of its bits are just made of stupid. Consider this JavaScript snippet involving the Date class:

> date = new Date(2013, 0, 1)
Tue Jan 01 2013 00:00:00 GMT+1100 (EST)
> date.setUTCMonth(10)

First, I create a new date at midnight on January 1, 2013 (yes, months start from 0 and dates start from 1, it’s insane). Then, I set the month to November. What is the resulting value of date? Surely it’s November 01 2013? Nope, it’s:

Sun Dec 02 2012 00:00:00 GMT+1100 (EST)

Wat? The result is one year earlier, one month later, and one day later than I had expected. But it gets worse:

> date.setUTCMonth(10)
Fri Nov 02 2012 00:00:00 GMT+1100 (EST)

That’s right: setUTCMonth is not even idempotent! Calling it a second time gives a different result than the first. What on Earth is going on? Two things, actually.

Firstly, you may have noticed that the dates are in GMT+1100 (my local time zone, Australian Eastern Daylight Time). The constructor and string display both use the computer’s local time, which is a big mistake. For one thing, it means that software behaves differently in different parts of the world. (Remember this bug I found in the SOPA blackout snippet?) Software should only ever deal with UTC except when displaying times to the user. So in eastern Australia, “new Date(2013, 0, 1)” is actually Dec 31 2012 13:00:00 UTC. You can work around this, but JavaScript isn’t helpful: there is a Date.UTC function, but it doesn’t return a Date object. To construct a date from UTC time, call “new Date(Date.UTC(year, month, day, hours, minutes, seconds))”. Okay, so that’s confusing, but not altogether wrong.

Secondly, and more awfully, we are running into the intrinsic problem of Gregorian dates, that day 31 is valid for some months but not others. Every date library has to deal with this issue, but here JavaScript just sucks. When you start with December 31 and set the month to November, you get November 31. Since that date doesn’t exist, JavaScript “helpfully” gives you December 1. (Fancy that: a setter that doesn’t always end up setting the field to the value you gave it!) Dec 01 2012 13:00:00 UTC is Dec 02 2012 00:00:00 in my local time zone. Calling setUTCMonth a second time sets the month to November, keeping the date value at 1. Seriously, setting the month when the date is not valid for that month should be an error; at least then you would notice the problem. Here JavaScript falls into the classic API design trap of giving unhelpful garbage instead of useful errors.

So what should you do to avoid this? I would recommend never using any setters on the Date class. It is fundamentally dangerous to change the year, month or date, because in all three cases you can get silent corruption of other fields. Treat Date objects as immutable, and make new ones when you need to. Use the “new Date(Date.UTC(…))” form I gave above.
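As a minimal sketch of the "treat Dates as immutable" advice, here is one way to change the month without mutating anything. The helper name withUTCMonth is hypothetical (it is not part of JavaScript), and throwing a RangeError on an invalid day is my assumption about what a "useful error" could look like:

```javascript
// Hypothetical helper: returns a NEW Date with the UTC month replaced.
// It never mutates the Date it is given, and it throws instead of silently
// rolling over when the day does not exist in the target month.
// Assumes month is an integer from 0 to 11.
function withUTCMonth(date, month) {
    var result = new Date(Date.UTC(
        date.getUTCFullYear(), month, date.getUTCDate(),
        date.getUTCHours(), date.getUTCMinutes(), date.getUTCSeconds(),
        date.getUTCMilliseconds()));
    // If the month rolled over (e.g. "November 31"), report it.
    if (result.getUTCMonth() !== month) {
        throw new RangeError("Day " + date.getUTCDate() +
                             " does not exist in month " + month);
    }
    return result;
}

var newYear = new Date(Date.UTC(2013, 0, 1));   // Jan 01 2013 00:00 UTC
var november = withUTCMonth(newYear, 10);       // Nov 01 2013 00:00 UTC
```

With this approach, the December 31 case from the example above raises an error instead of quietly landing in December.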

How do you escape a complete URI?

In JavaScript,Python,Unicode,URI,Web development on February 12, 2012 by Matt Giuca

This question comes up quite a lot: “I have a URI. How do I percent-encode it?” In this post, I want to explore this question fully, because it is a hard question, and one that, on the face of it, is nonsense. The short answer to this question is: “You can’t. If you haven’t percent-encoded it yet, you’re too late.” But there are some specific reasons why you might want to do this, so read on.

Let’s get at the meaning of this question. I’ll use the term URI (Uniform Resource Identifier), but I’m mostly talking about URLs (a URI is just a bit more general). A URI is a complete identifier of a resource, such as:

http://example.com/admin/login?name=Helen&gender=f

A URI component is any of the individual atomic pieces of the URI, such as “example.com”, “admin”, “login”, “name”, “Helen”, “gender” or “f”. The components are the parts that the URI syntax has no interest in parsing; they represent plain text strings. Now the problem comes when we encounter a URI such as this:

http://example.com/admin/login?name=Helen Ødegård&gender=f

This isn’t a legal URI because the value of the query argument “name” (“Helen Ødegård”) is not percent-encoded: the space (U+0020, meaning the Unicode character with hexadecimal value 20) and the non-ASCII characters ‘Ø’ (U+00D8) and ‘å’ (U+00E5) are forbidden in any URI. So the answer to “can’t we fix this?” is “yes”: we can retroactively percent-encode the URI so it appears like this:

http://example.com/admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f

A digression: note that the Unicode characters were first encoded to bytes with UTF-8: ‘Ø’ (U+00D8) encodes to the byte sequence hex C3 98, which then percent-encodes to %C3%98. This is not actually part of the standard: none of the URI standards (the most recent being RFC 3986) specify how a non-ASCII character is to be converted into bytes. I could also have encoded them using Latin-1: “Helen%20%D8deg%E5rd,” but then I couldn’t support non-European scripts. This is a mess, but it isn’t the subject of this article, and the world mostly gets along fine by using UTF-8, which I’ll assume we’re using for the rest of this article.

Okay, so that’s solved, but will it work in all cases? How about this URI:

http://example.com/admin/login?redirect=http://example.com/news#funny&name=Helen&gender=f

Clearly, a human looking at this can tell that the value of the “redirect” argument is “http://example.com/news#funny”, which means that the “#” (U+0023) needs to be percent-encoded as “%23”:

http://example.com/admin/login?redirect=http://example.com/news%23funny&name=Helen&gender=f

But how did we know to encode the “#”? What if whoever typed this URI genuinely meant for there to be a query of “redirect=http://example.com/news” and a fragment of “funny&name=Helen&gender=f”? It is wrong for us to meddle with the given URI, assuming that the “#” was intended to be a literal character and not a delimiter. The answer to “can we fix it?” is “no”. Fixing the above URI would only introduce bugs. The answer is “if you wanted that ‘#’ to be interpreted literally, you should have encoded it before you stuck it in the URI.”

The idea that you can:

  1. Take a bunch of URI components (as bare strings),
  2. Concatenate them together using URI delimiters (such as “?”, “&” and “#”),
  3. Percent-encode the URI.

is nonsense, because once you have done step #2, you cannot possibly know (in general) which characters were part of the original URI components, and which are delimiters. Instead, error-free software must:

  1. Take a bunch of URI components (as bare strings),
  2. Percent-encode each individual URI component,
  3. Concatenate them together using URI delimiters (such as “?”, “&” and “#”).
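The correct algorithm can be sketched in a few lines. The helper name buildQuery is hypothetical, and this is only an illustration of encoding first and concatenating second, not a complete URI builder:

```javascript
// Hypothetical helper: builds a query string from a plain object by
// encoding every name and value BEFORE adding the "=" and "&" delimiters.
function buildQuery(params) {
    var pairs = [];
    for (var name in params) {
        if (Object.prototype.hasOwnProperty.call(params, name)) {
            pairs.push(encodeURIComponent(name) + "=" +
                       encodeURIComponent(params[name]));
        }
    }
    return pairs.join("&");
}

// The "#" inside the redirect value is encoded as data, so it can never
// be mistaken for the fragment delimiter.
var uri = "http://example.com/admin/login?" + buildQuery({
    redirect: "http://example.com/news#funny",
    name: "Helen Ødegård",
    gender: "f"
});
```

Because each component is encoded while it is still a separate string, there is never any ambiguity about which characters are delimiters.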

This is why I previously recommended never using JavaScript’s encodeURI function, and instead to use encodeURIComponent. The encodeURI function “assumes that the URI is a complete URI” — it is designed to perform step #3 in the bad algorithm above, which by definition, is meaningless. The encodeURI function will not encode the “#” character, because it might be a delimiter, so it would fail to interpret the above example in its intended meaning.

The encodeURIComponent function, on the other hand, is designed to be called on the individual components of the URI before they are concatenated together — step #2 of the correct algorithm above. Calling that function on just the component “http://example.com/news#funny” would produce:

http%3A%2F%2Fexample.com%2Fnews%23funny

which is a bit of overkill (the “:” and “/” characters do not strictly need to be encoded in a query parameter), but perfectly valid — when the data reaches the other end it will be decoded back into the original string.
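To see the difference between the two functions side by side, here is a small sketch (the variable names are just for illustration):

```javascript
var component = "http://example.com/news#funny";

// encodeURI leaves "#", ":" and "/" alone because they might be delimiters,
// so this particular string comes back unchanged.
var asWholeURI = encodeURI(component);

// encodeURIComponent treats the whole string as literal text and
// encodes every reserved character in it.
var asComponent = encodeURIComponent(component);

// Decoding the component recovers the original string exactly.
var roundTrip = decodeURIComponent(asComponent);
```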

So having said all of that, is there any legitimate need to break the rule and percent-encode a complete URI?

URI cleaning

Well, yes there is. (I have been bullish in the past that there isn’t, such as in my answer to this question on Stack Overflow, so this post is me reconsidering that position a bit.) It happens all the time: in your browser’s address bar. If you type this URL into the address bar:

http://example.com/admin/login?name=Helen Ødegård&gender=f

it is not typically an error. Most browsers will automatically “clean up” the URL and send an HTTP request to the server with the line:

GET /admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f HTTP/1.1

(Unfortunately, IANA will redirect this immediately, but if you inspect the packets or try it on a server you control, then check the logs, you will see this is true.) Most browsers don’t show you that they’re cleaning up the URIs — they attempt to display them as nicely as possible (in fact, if you type in the escaped version, Firefox will automatically convert it so you can see the space and Unicode characters in the address bar).

Does this mean that we can relax and encode our URIs after composing them? No. This is an application of Postel’s Law (“Be liberal in what you accept, and conservative in what you send.”) The browser’s address bar, being a human interface mechanism, is helpfully attempting to take an invalid URI and make it valid, with no guarantee of success. It wouldn’t help on my second example (“redirect=http://example.com/news#funny”). I think this is a great idea, because it lets users type spaces and Unicode characters into the address bar, and it isn’t necessarily a bad idea for other software to do it too, particularly where user interfaces are concerned. As long as the software is not relying on it.

In other words, software should not use this technique to construct URIs internally. It should only ever use this technique to attempt to “clean up” URIs that have been supplied from an external source.

So that is the point of JavaScript’s encodeURI function. I don’t like to call this “encoding” because that implies it is taking something unencoded and converting it into an encoded form. I prefer to call this “URI cleaning”. That name is suggestive of the actual process: taking a complete URI and cleaning it up a bit.

Unfortunately (as pointed out by Tim Cuthbertson), encodeURI is not quite good for this purpose — it encodes the ‘%’ character, meaning it will double-escape any URI that already has percent-escaped content. More on that later.

We can formalise this process by describing a new type of object called a “super URI.” A super URI is a sequence of Unicode characters with the following properties:

  1. Any character that is in any way valid in a URI is interpreted as normal for a URI,
  2. Any other character is interpreted as a URI would interpret the sequence of characters resulting from percent-encoding the UTF-8 encoding of the character (or some other character encoding scheme).

Now it becomes clear what we are doing: URI cleaning is simply the process of transforming a super URI into a normal URI. In this light, rather than saying that the string:

http://example.com/admin/login?name=Helen Ødegård&gender=f

is “some kind of malformed URI,” we can say it is a super URI, which is equivalent to the normal URI:

http://example.com/admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f

Note that super URIs have nothing to do with leaving delimiter characters unencoded: delimiters such as “#” must still be percent-escaped in the super URI. They are only about not percent-encoding invalid characters. We can consider super URIs to be human-readable syntax, while proper URIs are required for data transmission. This means that we can also take a proper URI and convert it into a more human-readable super URI for display purposes (as web browsers do). That is the purpose of JavaScript’s decodeURI function. Note that, again, I don’t consider this to be “decoding,” rather, “pretty printing.” It doesn’t promise not to show you percent-encoded characters. It only decodes characters that are illegal in normal URIs.

It is probably a good idea for most applications that want to “pretty print” a URI to not decode control characters (U+0000 — U+001F and U+007F), to avoid printing garbage and newlines. Note that decodeURI does decode these characters, so it is probably unwise to use it for display purposes without some post-processing.
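One possible form of that post-processing is sketched below. The function name prettyPrintURI is hypothetical, and the sketch inherits decodeURI's handling of ‘%’, so treat it as an illustration of the control-character step only:

```javascript
// Hypothetical helper: pretty-prints a URI for display, then re-escapes
// any control characters (U+0000 to U+001F and U+007F) that decodeURI
// turned into raw characters.
function prettyPrintURI(uri) {
    return decodeURI(uri).replace(/[\u0000-\u001F\u007F]/g, function (c) {
        return encodeURIComponent(c);
    });
}

// The space and Unicode characters are shown bare, but the newline (%0A)
// stays escaped.
var shown = prettyPrintURI(
    "http://example.com/a%0Ab?name=Helen%20%C3%98deg%C3%A5rd");
```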

Update: My “super URI” concept is similar to the formally specified IRI (Internationalized Resource Identifier) — basically, a URI that can have non-ASCII characters. However, my “super URIs” also allow other ASCII characters that are illegal in URLs.

Which characters?

Okay, so exactly which characters should be escaped for this URI cleaning operation? I thought I’d take the opportunity to break down the different sets of characters described by the URI specification. I will address two versions of the specification: RFC 2396 (published in 1998) and RFC 3986 (published in 2005). 2396 is obsoleted by 3986, but since a lot of encoding functions (including JavaScript’s) were invented before 2005, it gives us a good historical explanation for their behaviour.

RFC 2396

This specification defines two sets of characters: reserved and unreserved.

  • The reserved characters are: $&+,/:;=?@
  • The unreserved characters are: ALPHA and NUM and !'()*-._~

Where ALPHA and NUM are the ASCII alphabetic and numeric characters, respectively. (They do not include non-ASCII characters.)

There is a semantic difference between reserved and unreserved characters. Reserved characters may have a syntactic meaning in the URI syntax, and so if one of them is to appear as a literal character in a URI component, it may need to be escaped. (This will depend upon context: a literal ‘?’ in a path component will need to be escaped, whereas a ‘?’ in a query does not need to be escaped.) Unreserved characters do not have a syntactic meaning in the URI syntax, and never need to be escaped. A corollary to this is that escaping or unescaping an unreserved character does not change its meaning (“Z” means the same as “%5A”; “~” means the same as “%7E”), but escaping or unescaping a reserved character might change its meaning (“?” may have a different meaning to “%3F”).

The URI component encoding process should percent-encode all characters that are not unreserved. It is safe to escape unreserved characters as well, but not necessary and generally not preferable.

Together, these two sets comprise the valid URI characters, along with two other characters: ‘%’, used for encoding, and ‘#’, used to delimit the fragment (the ‘#’ and fragment were not considered to be part of the URI). I would suggest that both ‘%’ and ‘#’ be treated as reserved characters. All other characters are illegal. The complete set of illegal characters, under this specification, follows:

  • The ASCII control characters (U+0000 — U+001F and U+007F)
  • The space character
  • The characters: "<>[\]^`{|}
  • Non-ASCII characters (U+0080 — U+10FFFD)

The URI cleaning process should percent-encode precisely this set of characters: no more and no less.

RFC 3986

The updated URI specification from 2005 makes a number of changes, both to the way characters are grouped, and to the sets themselves. The reserved and unreserved sets are now as follows:

  • The reserved characters are: !#$&'()*+,/:;=?@[]
  • The unreserved characters are: ALPHA and NUM and -._~

This version features ‘#’ as a reserved character, because fragments are now considered part of the URI proper. There are two more important additions to the reserved set. Firstly, the characters “!'()*” have been moved from unreserved to reserved, because they are “typically unsafe to decode.” This means that, while these characters are still technically legal in a URI, their encoded form may be interpreted differently to their bare form, so encoding a URI component should encode these characters. Note that this is different than banning them from URIs altogether (for example, a “javascript:” URI is allowed to contain bare parentheses, and that scheme simply chooses not to distinguish between “(” and “%28”). Secondly, the characters ‘[’ and ‘]’ have been moved from illegal to reserved. As of 2005, URIs are allowed to contain square brackets. This unfortunate change was made to allow IPv6 addresses in the host part of a URI. However, note that they are only allowed in the host, and not anywhere else in the URI.

The reserved characters were also split into two sets, gen-delims and sub-delims:

  • The gen-delims are: #/:?@[]
  • The sub-delims are: !$&'()*+,;=

The sub-delims are allowed to appear anywhere in a URI (although, as reserved characters, their meaning may be interpreted differently if they are unescaped). The gen-delims are the important top-level syntactic markers used to delimit the fields of the URI. The gen-delims are assigned meaning by the URI syntax, while the sub-delims are assigned meaning by the scheme. This means that, depending on the scheme, sub-delims may be considered unreserved. For example, a program that encodes a JavaScript program into a “javascript:” URI does not need to encode the sub-delims, because JavaScript will interpret them the same whether they are encoded or not (such a program would need to encode illegal characters such as space, and gen-delims such as ‘?’, but not sub-delims). The gen-delims may also be considered unreserved in certain contexts — for example, in the query part of a URI, the ‘?’ is allowed to appear bare and will generally mean the same thing as “%3F”. However, it is not guaranteed to compare equal: under the Percent-Encoding Normalization rule, encoded and bare versions of unreserved characters must be considered equivalent, but this is not the case for reserved characters.

Taking the square brackets out of the illegal set leaves us with the following illegal characters:

  • The ASCII control characters (U+0000 — U+001F and U+007F)
  • The space character
  • The characters: "<>\^`{|}
  • Non-ASCII characters (U+0080 — U+10FFFD)

A modern URI cleaning function must encode only the above characters. This means that any URI cleaning function written before 2005 (hint: encodeURI) will encode square brackets! That’s bad, because it means that a URI with an IPv6 address:

http://[2001:db8:85a3:8d3:1319:8a2e:370:7348]/admin/login?name=Helen Ødegård&gender=f

would be cleaned as:

http://%5B2001:db8:85a3:8d3:1319:8a2e:370:7348%5D/admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f

which refers to the domain name “[2001:db8:85a3:8d3:1319:8a2e:370:7348]” (not the IPv6 address). Mozilla’s reference on encodeURI contains a work-around that ensures that square brackets are not encoded. (Note that this still double-escapes ‘%’ characters, so it isn’t good for URI cleaning.)

So what exactly should I do?

If you are building a URI programmatically, you must encode each component individually before composing them.

  • Escape the following characters: space and !"#$%&'()*+,/:;<=>?@[\]^`{|} and U+0000 — U+001F and U+007F and greater.
  • Do not escape ASCII alphanumeric characters or -._~ (although it doesn’t matter if you do).
  • If you have specific knowledge about how the component will be used, you can relax the encoding of certain characters (for example, in a query, you may leave ‘?’ bare; in a “javascript:” URI, you may leave all sub-delims bare). Bear in mind that this could impact the equivalence of URIs.
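As a sketch of component encoding under these rules, the following wraps encodeURIComponent and additionally escapes the five characters that RFC 3986 moved into the reserved set. The function name encodeComponentRFC3986 is hypothetical; the approach is in the same spirit as the work-around in Mozilla's reference:

```javascript
// Hypothetical helper: encodeURIComponent already escapes everything in the
// list above except ! ' ( ) and *, so escape those five afterwards.
function encodeComponentRFC3986(str) {
    return encodeURIComponent(str).replace(/[!'()*]/g, function (c) {
        return "%" + c.charCodeAt(0).toString(16).toUpperCase();
    });
}

var encoded = encodeComponentRFC3986("it's (really) great!*");
```

Decoding needs no special treatment: decodeURIComponent turns the result back into the original string.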

If you are parsing a URI component, you should unescape any percent-encoded sequence (this is safe, as ‘%’ characters are not allowed to appear bare in a URI).

If you are “cleaning up” a URI that someone has given you:

  • Escape the following characters: space and "<>\^`{|} and U+0000 — U+001F and U+007F and greater.
  • You may (but shouldn’t) escape ASCII alphanumeric characters or -._~ (if you really want to; it will do no harm).
  • You must not escape the following characters: !#$%&'()*+,/:;=?@[]
  • For an advanced URI cleaning, you may also fix any other syntax errors in an appropriate way (for example, a ‘[’ in the path segment may be encoded, as may a ‘%’ in an invalid percent sequence).
  • An advanced URI cleaner may be able to escape some reserved characters in certain contexts. Bear in mind that this could impact the equivalence of URIs.
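A basic (not "advanced") cleaner following the first three rules can be sketched with a single regular expression. The function name cleanURI is hypothetical, and the sketch assumes UTF-8 and well-formed input strings (encodeURIComponent throws a URIError on unpaired surrogates):

```javascript
// Hypothetical helper: escapes only the characters that are illegal in a
// URI under RFC 3986 and leaves everything else (including "%", "#" and
// square brackets) untouched.
function cleanURI(str) {
    // First alternative: control characters, space, and "<>\^`{|} plus DEL.
    // Second alternative: runs of non-ASCII characters, matched together so
    // that surrogate pairs are never split.
    var illegal = /[\u0000-\u0020"<>\\^`{|}\u007F]|[^\u0000-\u007F]+/g;
    return str.replace(illegal, function (s) {
        return encodeURIComponent(s);
    });
}

var cleaned = cleanURI(
    "http://[2001:db8:85a3:8d3:1319:8a2e:370:7348]/admin/login" +
    "?name=Helen Ødegård&gender=f");
```

Since ‘%’ is never touched, running the cleaner on an already-clean URI leaves it unchanged. It makes no attempt to repair invalid percent sequences or misplaced brackets.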

If you are “pretty printing” a URI and want to display escaped characters as bare, where possible:

  • Unescape the following characters: space and "-.<>\^_`{|}~ and ASCII alphanumeric characters and U+0080 and greater.
  • It is probably not wise to unescape U+0000 — U+001F and U+007F, as they are control characters that could cause display problems (and there may be other Unicode characters with similar problems.)
  • You must not unescape the following characters: !#$%&'()*+,/:;=?@[]
  • An advanced URI printer may be able to unescape some reserved characters in certain contexts. Bear in mind that this could impact the equivalence of URIs.

These four activities roughly correspond to JavaScript’s encodeURIComponent, decodeURIComponent, encodeURI and decodeURI functions, respectively. In the next section, we look at how they differ.

Some implementations

JavaScript

As I stated earlier, never use escape. First, it is not properly specified. In Firefox and Chrome, it encodes all characters other than ASCII alphanumerics and the following: *+-./@_. This makes it unsuitable for URI construction and cleaning. It encodes the unreserved character ‘~’ (which is harmless, but unnecessary), and it leaves the reserved characters ‘*’, ‘+’, ‘/’ and ‘@’ bare, which can be problematic. Worse, it encodes Latin-1 characters with Latin-1 (instead of UTF-8), which is not technically a violation of the spec, but likely to be misinterpreted, and even worse, it encodes characters above U+00FF with the malformed syntax “%uxxxx”. Avoid.
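The problems are easy to see in a few lines. This is only an illustration of the behaviour described above in engines that still provide the legacy escape function; the variable names are made up:

```javascript
// escape is a legacy global, shown here only to illustrate the problem.
var latin1 = escape("\u00D8");              // 'Ø' as a single Latin-1 byte
var malformed = escape("\u4E2D");           // "%uxxxx" form, not valid in a URI
var tilde = escape("~");                    // unreserved, but escaped anyway
var bare = escape("*+/@");                  // reserved, but left bare

// For comparison, encodeURIComponent uses UTF-8.
var utf8 = encodeURIComponent("\u00D8");
```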

JavaScript’s “fixed” URI encoding functions behave according to RFC 2396, and assume Unicode characters are to be encoded with UTF-8. This means that they are lacking the 2005 changes:

  • encodeURIComponent does not escape the previously-unreserved characters ‘!’, ‘'’, ‘(’, ‘)’ and ‘*’. Mozilla’s reference includes a work-around for this.
  • decodeURIComponent still works fine.
  • encodeURI erroneously escapes the previously-illegal characters ‘[’ and ‘]’. Mozilla’s reference includes a work-around for this.
  • decodeURI erroneously unescapes ‘[’ and ‘]’ (although there doesn’t seem to be a practical case where this is a problem).

Edit: Unfortunately, encodeURI and decodeURI have a single, critical flaw: they escape and unescape (respectively) percent signs (‘%’), which means they can’t be used to clean a URI. (Thanks to Tim Cuthbertson for pointing this out.) For example, assume we wanted to clean the URI:

http://example.com/admin/login?redirect=http://example.com/news%23funny&name=Helen Ødegård&gender=f

This URI has the ‘#’ escaped already, because no URI cleaner can turn a ‘#’ into a “%23”, but it doesn’t have the space or Unicode characters escaped. Passing this to encodeURI produces:

http://example.com/admin/login?redirect=http://example.com/news%2523funny&name=Helen%20%C3%98deg%C3%A5rd&gender=f

Note that the “%23” has been double-escaped so it reads “%2523”, which is completely wrong! We can fix this by extending Mozilla’s work-around to also correct against double-escaped percent characters:

function fixedEncodeURI(str) {
    return encodeURI(str).replace(/%25/g, '%').replace(/%5[Bb]/g, '[').replace(/%5[Dd]/g, ']');
}

Note that decodeURI is similarly broken. The fixed version follows:

function fixedDecodeURI(str) {
    return decodeURI(str.replace(/%25/g, '%2525').replace(/%5[Bb]/g, '%255B').replace(/%5[Dd]/g, '%255D'));
}

Edit: Fixed fixedEncodeURI and fixedDecodeURI so they work on lowercase escape codes. (Thanks to Tim Cuthbertson for pointing this out.)

Python

Python 2’s urllib.quote and urllib.unquote functions perform URI component encoding and decoding on byte strings (non-Unicode).

  • urllib.quote works as I specify above, except that it does escape ‘~’, and does not escape ‘/’. This can be overridden by supplying safe='~'.
  • urllib.unquote works as expected, returning a byte string.

Note that these do not work properly at all on Unicode strings — you should first encode the string using UTF-8 before passing it to urllib.quote.

In Python 3, the quote and unquote functions have been moved into the urllib.parse module, and upgraded to work on Unicode strings (by me — yay!). By default, these will encode and decode strings as UTF-8, but this can be changed with the encoding and errors parameters (see urllib.parse.quote and urllib.parse.unquote).

I don’t know of any Python built-in functions for doing URI cleaning, but urllib.quote can easily be used for this purpose by passing safe="!#$%&'()*+,/:;=?@[]~" (the set of reserved characters, as well as ‘%’ and ‘~’; note that alphanumeric characters, and ‘-’, ‘.’ and ‘_’ are always safe in Python).

Mozilla Firefox

Firefox 10’s URL bar performs URL cleaning, allowing the user to type in URLs with illegal characters, and automatically converting them to correct URLs. It escapes the following characters:

  • space, "'<>` and U+0000 — U+001F and U+007F and greater. (Note that this includes the double and single quote.)
  • Note that the control characters for NUL, tab, newline and carriage return don’t actually transmit.

I would say this is erroneous: on a minor note, it should not be escaping the single quote, as that is a reserved character. It also fails to escape the following illegal characters: \^{|}, sending them to the server bare.

Firefox also “prettifies” any URI, decoding most of the percent-escape sequences for the characters that it knows how to encode.

Google Chrome

Chrome 16’s URL bar also performs URL cleaning. It is rather similar to Firefox, but encodes the following characters:

  • space, "<> and U+0000 — U+001F and U+007F and greater. (Note that this includes only the double quote.)

So Chrome also fails to escape the illegal characters \^`{|} (including the backtick, which Firefox escapes correctly), but unlike Firefox, it does not erroneously escape the single quote.

SOPA Strike Blackout JavaScript snippet

In JavaScript on January 18, 2012 by Matt Giuca

Websites around the world are preparing to go dark tomorrow to protest the United States’ proposed SOPA act that will present a legal threat to websites worldwide in an overblown measure to stop piracy. The website sopastrike.com, run by Fight For The Future, is doing a good job of organising the protest (initially proposed by Reddit) by suggesting that webmasters place a JavaScript snippet in their HTML pages that will redirect visitors to http://sopastrike.com/strike/ for a twelve-hour time window between 8AM and 8PM US Eastern Standard Time. Unfortunately, there is a time zone bug in this code which will render the strike wholly or partially ineffective for non-American visitors.

If you are planning to use this code to “strike” your website, please read this post and update your code accordingly. I have contacted Fight For The Future about this bug. TL;DR, here is the correct code you should use:

<script type="text/javascript">var a=new Date,b=a.getUTCHours();if(0==a.getUTCMonth()&&2012==a.getUTCFullYear()&&((18==a.getUTCDate()&&13<=b)||(19==a.getUTCDate()&&0>=b)))window.location="http://sopastrike.com/strike";</script>

The original code (found on http://sopastrike.com) reads as follows (do not use this code):

var a=new Date,b=a.getHours()+a.getTimezoneOffset()/60;if(18==a.getDate()&&0==a.getMonth()&&2012==a.getFullYear()&&13<=b&&24>=b)window.location="http://sopastrike.com/strike";

This will only work properly in North and South America (UTC-1 or further west). It will fail to redirect during some or all of the time window in Europe, Africa, Asia and Oceania (UTC or further east), and won’t work at all in Eastern Australia (where I live) or New Zealand. :(

The stated goal of the code is to activate the “strike” redirect between 8:00AM and 7:59PM EST (which is between 1300, January 18 and 0059, January 19, UTC). But the code does a bad hand-conversion from local time to UTC — it converts only the hour and not the date. The precise logic is:

“If the date in local time is 18 January, 2012 and the time in UTC is between 1300 and 2459 inclusive, then redirect.”

So wherever you are in the world, if it’s past midnight at the end of January 18, the strike doesn’t happen.

Folks in the United Kingdom at GMT (UTC) will miss the final hour of the strike, because at midnight, getDate() will tick over to 19. Anywhere east of the UK will see increasingly fewer hours of the strike. In AEDT (UTC+11), the strike is scheduled to start at precisely midnight, January 19, local time. So it won’t work for us at all, or anybody east of Australia.

As JavaScript provides explicit UTC conversion methods, I fixed the code by using those for all of the date fields. The following code has been tested at various relevant times in US EST (UTC-5) and AEDT (UTC+11) (by adjusting my computer’s clock and time zone), and works correctly in Firefox and Chrome (I have not tested it in Internet Explorer, but W3Schools says that all major browsers support the UTC methods):

<script type="text/javascript">var a=new Date,b=a.getUTCHours();if(0==a.getUTCMonth()&&2012==a.getUTCFullYear()&&((18==a.getUTCDate()&&13<=b)||(19==a.getUTCDate()&&0>=b)))window.location="http://sopastrike.com/strike";</script>

Note that this code is a bit more complex because it has to explicitly handle the final hour which is on January 19 in UTC. The precise logic for my code is this:

“If the month in UTC is January, 2012, and either the date in UTC is 18 and the time in UTC is 1300 or later, or the date in UTC is 19 and the time in UTC is 0059 or earlier, then redirect.”
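For comparison, here is a sketch of a different technique from the one in my snippet: instead of checking each UTC field, compare the absolute timestamp against the start and end of the window. The names STRIKE_START, STRIKE_END and isStrikeTime are hypothetical, and this is not the code I tested above:

```javascript
// Window boundaries as absolute timestamps (milliseconds since the epoch).
var STRIKE_START = Date.UTC(2012, 0, 18, 13, 0, 0); // 8:00AM EST, January 18
var STRIKE_END = Date.UTC(2012, 0, 19, 1, 0, 0);    // 8:00PM EST, January 18

// Hypothetical helper: true if the given Date falls inside the window.
// Timestamps are the same everywhere, so no time zone arithmetic is needed.
function isStrikeTime(date) {
    var t = date.getTime();
    return t >= STRIKE_START && t < STRIKE_END;
}

// A page could then redirect with something like:
// if (isStrikeTime(new Date())) window.location = "http://sopastrike.com/strike";
```

Comparing timestamps avoids the special case for the final hour, because the window is a single interval rather than two date-and-hour conditions.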

I will be blacking out my WordPress blog during this time (not using this JavaScript; WordPress has a built-in SOPA protest option, which is awesome). See you on the other side!

Simulating classes with prototypes in JavaScript

In JavaScript,Object-oriented programming on June 5, 2011 by Matt Giuca

I’ve been programming in JavaScript for years, and have previously vaguely gotten my head around prototypes (what JavaScript has instead of classes), but never fully. I think I have the hang of them now, though, so I thought I’d share some knowledge with you. If you haven’t tried some real object-oriented programming in JavaScript, basically it’s quite different from everything else, because in everything else, you have classes and objects. As we are all taught in OOP 101, objects are instances of classes. So we build classes and then we instantiate them, creating objects.

But JavaScript is a prototype programming language, which means it has no classes, only objects. In prototype languages, objects are instances of other objects. If Object B is an instance of Object A, then Object A is said to be Object B’s prototype. So basically, this sort of unifies the concepts of instantiation and inheritance. There is no difference between saying “Foo is an instance of Bar” and “Foo is a subclass of Bar”: in both cases, we say “Foo’s prototype is Bar.”

So that’s all well and good, but there’s one problem: I just want to write code. This is a pretty dismissive attitude, but I think, since we web programmers are forced to use this language, most of us are not really interested in learning a new and rather experimental programming paradigm. We just want to do things “the usual way”. (Note: There is a lot of strong opinion on The Wiki, and I mean the original Wiki and not Wikipedia, regarding prototype programming and why sliced bread does not compare. I personally don’t see the point of conflating objects and classes — to me, it is helpful to distinguish between objects and the types that describe them. That isn’t to say I am against it, just that in this case, I just want to know how to apply my hardened OOP techniques in JavaScript.)

In short, I have written the most basic Python program that uses inheritance. What is the most idiomatic JavaScript equivalent of this code?

class Foo(object):
    def __init__(self, x):              # Constructor
        self.x = x
    def getX(self):
        return self.x

class Bar(Foo):                         # Inherit from Foo
    def __init__(self, x, y):
        super(Bar, self).__init__(x)    # Superclass constructor call
        self.y = y
    def getY(self):
        return self.y

foo = Foo(4)
print(foo.getX())

bar = Bar(2, 9)
print(bar.getX())
print(bar.getY())

If you aren’t versed in Python, the above code simply has a class Foo with a variable x and a getter, getX, and a class Bar which is a subclass of Foo, which introduces a new variable y and a getter, getY. Note that getters aren’t particularly good Python style, but in this case they just serve as examples of methods which operate on the data. Note that Bar’s constructor takes both x and y, and defers the initialisation of x to the superclass constructor. Also note that the instance variables x and y are not private (Python doesn’t have a very good notion of “private members”); I don’t wish to cover encapsulation in this post.

So here’s what I came up with. JavaScript gurus might like to suggest improvements in the comments:

function Foo(x) {                   // Constructor
    this.x = x;
}
Foo.prototype.getX = function() {
    return this.x;
};

function Bar(x, y) {
    Foo.call(this, x);              // Superclass constructor call
    this.y = y;
}
Bar.prototype = new Foo();          // Inherit from Foo
Bar.prototype.getY = function() {
    return this.y;
};

var foo = new Foo(4);
console.log(foo.getX());            // 4

var bar = new Bar(2, 9);
console.log(bar.getX());            // 2
console.log(bar.getY());            // 9

Let’s break this down. First, we’ll just focus on Foo. You’ll note that I am curiously defining a function, not a class. Yes, to define the JavaScript equivalent of what I think of as a class, you just write its constructor as a top-level function. Note that Foo assigns to “this.x”. The key to what makes this a constructor and not an ordinary function is the way it is called: note that the calling syntax uses the “new” keyword. If you call a function in JavaScript with “new”, it creates an object and calls the given function, allowing the function to assign to the new object via “this”. So “new Foo(4)” creates an object “{“x”: 4}”. But what about methods?

Well, since JavaScript is a proper functional programming language, we could have the constructor assign the methods to attributes of the this object, like so:

function Foo(x) {                   // Don't do this!
    this.x = x;
    this.getX = function() {
        return this.x;
    };
}

This is actually a common approach, but I don’t like it. It’s extremely inefficient in terms of space and speed. With this approach, every single object contains a separate getX function, wasting a lot of space. For details, see this Stack Overflow discussion, including some speed tests I ran. We need a solution which lets all instances of Foo share the same method table, like they would in a class-based language (where all instances share a common class).
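To see the duplication concretely, here is a minimal sketch. The name FooPerInstance is made up for this illustration, and console.log is assumed to be available for output. It shows that two objects built this way each carry their own function object:

```javascript
// Hypothetical constructor that assigns the method inside the constructor.
function FooPerInstance(x) {
    this.x = x;
    this.getX = function() {
        return this.x;
    };
}

var a = new FooPerInstance(1);
var b = new FooPerInstance(2);

// Each instance holds a distinct function object for getX.
console.log(a.getX === b.getX);             // false
// The method is stored directly on the instance.
console.log(a.hasOwnProperty("getX"));      // true
```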

This is where prototypes come in. In JavaScript, every object has a prototype, which is another object that this object inherits from (think of it like inheritance, but for objects rather than classes). More formally, if an object X has prototype P and an attempt to look up an attribute of X fails, JavaScript will search for the attribute in P (this can be recursive; a prototype can have a prototype, etc). Therefore, we need to make sure that all “Foo objects” share a prototype, and that that prototype contains all of the “Foo” methods. In the above code, I do this by setting Foo.prototype.getX to a function object.
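The lookup rule can be checked directly. This is only a sketch: it assumes an ECMAScript 5 environment for Object.getPrototypeOf, and it uses the built-in hasOwnProperty, neither of which appears in the code above.

```javascript
function Foo(x) {
    this.x = x;
}
Foo.prototype.getX = function() {
    return this.x;
};

var foo = new Foo(4);

// x lives on the object itself; getX does not.
console.log(foo.hasOwnProperty("x"));       // true
console.log(foo.hasOwnProperty("getX"));    // false

// The lookup of getX falls through to the prototype.
console.log(Object.getPrototypeOf(foo) === Foo.prototype);  // true
console.log(foo.getX());                                     // 4

// All Foo objects share the one function object.
console.log(new Foo(1).getX === new Foo(2).getX);           // true
```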

What’s confusing about this is that Foo.prototype is not the prototype of Foo (the constructor function). It is just an attribute of the function Foo. Foo.prototype is the prototype of any object constructed by Foo. So when I later write “foo = new Foo(4)”, JavaScript does the following:

  1. Create a new object.
  2. Set the new object’s prototype to Foo.prototype.
  3. Call Foo, with the new object as this.
  4. Assign the new object to foo.
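The steps above can be imitated by hand. The helper below, construct, is a hypothetical name for illustration and assumes an ECMAScript 5 environment for Object.create. It is a simplification: the real new operator has extra behaviour (for example, when the constructor itself returns an object) that this sketch ignores.

```javascript
function Foo(x) {
    this.x = x;
}
Foo.prototype.getX = function() {
    return this.x;
};

// Hypothetical helper imitating "new ctor(arg)" for a one-argument constructor.
function construct(ctor, arg) {
    var obj = Object.create(ctor.prototype);    // Steps 1 and 2
    ctor.call(obj, arg);                        // Step 3
    return obj;                                 // The caller performs step 4
}

var foo = construct(Foo, 4);
console.log(foo.getX());                // 4
console.log(foo instanceof Foo);        // true
```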

Therefore, assigning to attributes of a constructor.prototype makes those attributes available to all objects constructed with that constructor, and they share a value. Commonly, you assign methods here, but you can also assign other values, which will be shared across instances (like static fields in Java, or class variables in Python).
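As a small sketch of a shared non-method value (the attribute name count is invented for this example), note that reading goes through the prototype, while assigning through an instance creates a separate attribute on that instance only:

```javascript
function Foo(x) {
    this.x = x;
}
Foo.prototype.count = 0;    // Hypothetical shared value, for illustration only

var a = new Foo(1);
var b = new Foo(2);

Foo.prototype.count = 5;    // A change on the prototype is seen by every instance
console.log(a.count);       // 5
console.log(b.count);       // 5

a.count = 99;               // Assigning through an instance shadows the shared value
console.log(a.count);       // 99
console.log(b.count);       // 5
console.log(Foo.prototype.count);   // 5
```

So to update the shared value, assign to Foo.prototype.count rather than to an instance.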

So that’s how Foo works. How about Bar, which requires inheritance? I’ll use the prototype system to simulate class inheritance. The key is the line “Bar.prototype = new Foo().” This creates a new Foo instance (a dummy instance; we won’t actually be using its value, just its prototype), and stores that Foo instance as the “prototype” attribute of Bar — i.e., the prototype of all objects constructed by Bar. We then extend that prototype with a new method, getY. We could also override methods by assigning them to Bar.prototype (not shown here). Note that if we had just said “Bar.prototype = Foo.prototype”, that would be wrong, because adding or overriding methods would also affect any Foo instances. With this approach, Bar.prototype is not equal to Foo.prototype; rather, it is an object whose prototype is Foo.prototype. Therefore, any method lookups which fail on Bar.prototype will fall back to Foo.prototype.
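To illustrate why “Bar.prototype = Foo.prototype” is wrong, here is a sketch of that mistake. WrongBar is a hypothetical name for the incorrect variant; it is not part of the example above.

```javascript
function Foo(x) {
    this.x = x;
}
function WrongBar(x, y) {       // Hypothetical name for the incorrect variant
    Foo.call(this, x);
    this.y = y;
}
WrongBar.prototype = Foo.prototype;     // Wrong: both names refer to one object
WrongBar.prototype.getY = function() {
    return this.y;
};

var foo = new Foo(4);
console.log(typeof foo.getY);   // "function": the method leaked onto Foo objects
console.log(foo.getY());        // undefined, since foo has no y
```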

(It sucks that I need to actually call new Foo() to create the dummy Foo object, since it means the constructor must work on undefined inputs. There should be some way to construct “an object with Foo.prototype as its prototype” without directly calling Foo, but I can’t think of one.)
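One candidate, assuming an ECMAScript 5 environment, is Object.create, which builds a new object with a given prototype without running any constructor. The following is only a sketch of how the example might look with it; resetting the constructor attribute is an optional extra that the original code does not do.

```javascript
function Foo(x) {
    this.x = x;
}
Foo.prototype.getX = function() {
    return this.x;
};

function Bar(x, y) {
    Foo.call(this, x);              // Superclass constructor call
    this.y = y;
}
// An object whose prototype is Foo.prototype, created without calling Foo.
Bar.prototype = Object.create(Foo.prototype);
Bar.prototype.constructor = Bar;    // Optional: restore the constructor attribute
Bar.prototype.getY = function() {
    return this.y;
};

var bar = new Bar(2, 9);
console.log(bar.getX());                        // 2
console.log(bar.getY());                        // 9
console.log(Bar.prototype.hasOwnProperty("x")); // false: Foo never ran on it
```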

This is easier to visualise with a diagram. Looking at the diagram below, we can see the distinction between constructors with a “prototype” attribute, and prototypes of objects. You can visually trace the lookup of the “getX” method on the “bar” object — “bar” itself has no attribute getX, so we follow the dashed line to its prototype. The prototype (Bar.prototype) has a getY attribute, but no getX attribute, so we follow the dashed line to the prototype’s prototype (Foo.prototype), which has the getX method, so we call that.

Object diagram of the above example

The last thing to note is how the Bar constructor calls its superclass constructor, a very common thing for a subclass constructor to do. I’m using the ‘call’ method, which lets you call a function and explicitly specify the value of “this”. If I just called “Foo(x)”, no object would be constructed at all: inside Foo, “this” would be the global object (or undefined in strict mode), so the assignment to x would land in the wrong place or fail outright, which isn’t what I want. Performing “Foo.call(this, x)” says “call Foo(x), but use the this object from the Bar constructor as the this object in the Foo constructor.” Hence, when Foo assigns to “this.x”, it is assigning to the same this that Bar is constructing.
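A short sketch of what ‘call’ does here. The object named plain is invented for this illustration; it shows that ‘call’ will run Foo against whatever object you hand it, not only one created by new.

```javascript
function Foo(x) {
    this.x = x;
}

function Bar(x, y) {
    Foo.call(this, x);      // Run Foo with Bar's this
    this.y = y;
}

var bar = new Bar(2, 9);
console.log(bar.x);         // 2
console.log(bar.y);         // 9

// call works with any object as this, not just one made by new.
var plain = {};             // Hypothetical object for illustration
Foo.call(plain, 7);
console.log(plain.x);       // 7
```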

In summary:

  • When you want to define a “class,” define its constructor as an ordinary function, and have it write to fields of this.
  • Add methods to the “class” by assigning function objects to constructor.prototype.method.
  • Instantiate “classes” by calling the constructor with new.
  • Inherit from another class by assigning new SuperClass() to SubClass.prototype.
  • Call superclass constructors using SuperClass.call(this, args…).

Edit: Added semicolons at the end of “= function” statements.

Portable JavaScript: String indexing

In JavaScript,Web development on June 15, 2008 by Matt Giuca

A short community service announcement. I recently got burned on a JavaScript project which, as usual, wasn’t working in Internet Explorer. I decided to track down one of the issues which simply caused an “error on page” message in IE.

It turns out that string indexing (such as str[index]) doesn’t work in IE. You need to do this:

str.charAt(index)

The technical reason for this is that IE doesn’t implement JavaScript, it implements JScript. Both are derivatives of ECMAScript. The str[index] feature is a feature of JavaScript alone, not ECMAScript or JScript.

(I think Microsoft were forced to call it JScript for legal reasons, but it does give them the convenient excuse “hey, this isn’t JavaScript, it’s JScript”).

From Mozilla’s reference:

The second way (treating the string as an array) is not part of the ECMAScript; it’s a JavaScript feature.

So there you go – charAt from now on!
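For illustration, a minimal sketch of the portable form (the string and index values are made up, and console.log is assumed to be available). One thing to keep in mind when converting is that charAt returns an empty string for an out-of-range index:

```javascript
var str = "hello";          // Hypothetical example string
var index = 1;

// Portable character access.
console.log(str.charAt(index));             // "e"

// Out of range, charAt returns an empty string rather than undefined.
console.log(str.charAt(10) === "");         // true
```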

(Another similar tip: Don’t leave a trailing comma at the end of an object literal – it works in JavaScript but not in ECMAScript, or IE).

Now if you’ll excuse me, I’m off to grep for ‘[’. :(

Update: The square-bracket indexing feature is included in ECMAScript 5 so should become a standard feature. From Mozilla’s reference:

Array-like character access (the second way above) is not part of ECMAScript 3. It is a JavaScript and ECMAScript 5 feature.

Also, it appears to be supported in Internet Explorer 8 and above (although I haven’t tested it). Thanks to Jean-Marc Desperrier and Chris Donnelly (in the comments) for pointing these out.
