HTML: No, you don’t need to escape that
Posted by Michał ‘mina86’ Nazarewicz on 7th of March 2021
This website being my personal project allows me to experiment and do things I’d never do in professional settings. Most notably, I’m rather fond of trying everything I can to reduce the size of the page. This goes beyond mere minification and eventually led me to wonder if all those characters I’ve been escaping in HTML code really need such treatment.
Libraries offering HTML support will typically provide a function to indiscriminately replace all ampersands, quote characters, less-than and greater-than signs with their corresponding HTML-safe representations. This allows the result to be used in any context in the document and is a good choice when handling user input. It’s a different matter when it comes to squeezing every last byte. Herein I will explore just which characters need to be escaped in an HTML document and when exactly.
Greater- and less-than signs
U+003E greater-than sign is a deeply misunderstood character. All it really wants is to be left alone to live in peace. Sadly, all those pesky web developers keep escaping it and replacing it with
&gt; even though in the vast majority of cases it’s not necessary. U+003C less-than sign, greater-than sign’s little brother, has a slightly easier life. While people indiscriminately escape it as well, this time it’s somewhat justifiable.
Greater-than sign has no special meaning in HTML data. It’s therefore valid and correct to write HTML code such as
<p>Linux > Windows > macOS. Less-than sign on the other hand has a special meaning (it’s what starts a tag after all) and as such must be escaped when outside of a tag. In other words, HTML agrees,
<p>Windows < macOS is not a valid statement.
What’s also not alright is using either of those characters in an unquoted attribute value. For example
<em title=Rust>Go>this code</em> would be rendered as a single EM element with a ‘Rust’ title and ‘Go>this code’ text. If that’s not the intention, the value has to be escaped. This is hardly a legitimate use-case though. Rather than escaping the characters it’s better to quote the value. The code ends up shorter and easier to read, as evidenced by
<em title="PW>UW">this code</em>.
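The three snippets above can be fed to a parser to see where each greater-than sign ends up. What follows is a rough sketch using Python’s html.parser module, which is lenient and not a full HTML5 parser, so treat it as an illustration rather than a conformance check. The Recorder class and parse function are names made up for this example.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Collects attributes and text seen while parsing (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.attrs = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)

    def handle_data(self, data):
        self.text.append(data)


def parse(markup):
    """Return (attributes, text) found in the markup."""
    recorder = Recorder()
    recorder.feed(markup)
    recorder.close()  # flush any buffered text
    return recorder.attrs, "".join(recorder.text)


print(parse('<p>Linux > Windows > macOS'))
print(parse('<em title=Rust>Go>this code</em>'))
print(parse('<em title="PW>UW">this code</em>'))
```

In the unquoted case the first greater-than sign closes the tag, while in the quoted case it stays part of the attribute value.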
U+0026 ampersand is a whole other story. She leads a tumultuous life full of change and adventure where in the end no one really knows what she wants. Life was simple in the olden days. Ampersand had to be escaped in every context. But HTML5 has complicated things quite a bit.
First of all, an ampersand which is not followed by an alphanumeric character or a hash does not need to be escaped.
Steinway & Sons is therefore perfectly fine just like
curlun && !fsg_lun_is_open(curlun).
It would make sense to conclude that escaping is required if an ampersand is followed by one of those characters, which is probably why this is not the case.
<p>Copy&paste for example is perfectly valid code because there’s no character reference named ‘paste’ (or any which would be a prefix of that string). When processing named character references the standard requires that the longest possible match is made but if none can be found the characters are treated verbatim. Furthermore, the semicolon ending the reference is part of that matching and some references are defined with and some without the semicolon.
As a consequence,
<p>&pi &notin &Qopf ends up rendered as ‘&pi ¬in &Qopf’. To understand why this is the case one needs to realise that there is a named character reference named
not (without a semicolon) while there isn’t one called
notin. When the browser encounters
&notin the longest string it can match is
not and thus it interprets
&not as a U+00AC not sign leaving
in intact. The correct way to write the statement is
<p>&pi; &notin; &Qopf; which yields ‘π ∉ ℚ’ as expected. Adding semicolons fixes the code because
pi;, notin; and Qopf; (with semicolons) character references are all defined.
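The longest-match behaviour is easy to try outside of a browser. Python’s html.unescape is documented to follow the HTML5 rules for data (it knows nothing about attributes), so a quick sketch such as the one below should agree with what a browser renders:

```python
import html

# No reference called "pi" or "Qopf" exists without a semicolon, but "not"
# does, so only the middle word changes.
print(html.unescape("&pi &notin &Qopf"))     # &pi ¬in &Qopf

# With semicolons all three references are defined.
print(html.unescape("&pi; &notin; &Qopf;"))  # π ∉ ℚ

# Nothing matches "paste" or any prefix of it, so the text is left alone.
print(html.unescape("Copy&paste"))           # Copy&paste
```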
That could lead to a conclusion that ampersand does not need escaping unless it’s followed by a hash or a defined named character reference. That’s not quite the case though since there’s also such a thing as an ambiguous ampersand:
An ambiguous ampersand is an U+0026 AMPERSAND character (&) that is followed by one or more ASCII alphanumerics, followed by a U+003B semicolon character (;), where these characters do not match any of the names given in the named character references section.
In other words, while
copy&paste is valid because there’s no named character reference called
paste,
copy&paste; isn’t because there’s no named character reference called
paste;… Confused yet? There’s more.
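A literal reading of that definition can be turned into a small checker. The sketch below assumes Python’s html.entities.html5 table, which maps names such as ‘not’ and ‘notin;’ to characters, stands in for the standard’s list; ambiguous_ampersands is a name invented for this example.

```python
import re
from html.entities import html5  # names, with and without ";", to characters

# An ampersand, one or more ASCII alphanumerics, then a semicolon.
_CANDIDATE = re.compile(r"&([0-9A-Za-z]+;)")


def ambiguous_ampersands(markup: str) -> list[str]:
    """Return candidates whose name (semicolon included) is not defined."""
    return [
        match.group(0)
        for match in _CANDIDATE.finditer(markup)
        if match.group(1) not in html5
    ]


print(ambiguous_ampersands("copy&paste"))   # []
print(ambiguous_ampersands("copy&paste;"))  # ['&paste;']
```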
Another exception introduced in HTML5 is the treatment of ampersands in attributes. While in data state (when reading code outside of tags) the semicolon is not always required, when parsing attribute values a named character reference which lacks its semicolon is taken verbatim if it is directly followed by an equals sign or an alphanumeric character. For example,
<a href="/?book=Ecc&sect=1&para=2">/?book=Ecc&sect=1&para=2</a> defines a link whose hypertext reference is ‘/?book=Ecc&sect=1&para=2’ but whose text is ‘/?book=Ecc§=1¶=2’ (despite the same code being used for both strings).
As inconsistent as all of it sounds, this is what browsers were often doing anyway even before the behaviour was codified. They always tried to be accommodating to errors in the document and would therefore accept entity references with a missing semicolon and try to interpret them the way the author intended rather than strictly adhering to the standard. HTML5 simply got all browsers in line and made it possible to copy URLs into attributes without worry.
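To make the difference between the two contexts concrete, here is a minimal sketch of a decoder which handles named references only (numeric ones are ignored altogether). It assumes the html.entities.html5 table from Python’s standard library matches the standard’s list and that a reference missing its semicolon inside an attribute is kept verbatim when an equals sign or an ASCII alphanumeric follows. The decode_named function is a made-up helper, not anything a browser exposes.

```python
from html.entities import html5  # names, with and without ";", to characters

# Longest name in the table; bounds the search for the longest match.
_MAX_NAME = max(len(name) for name in html5)


def decode_named(text: str, in_attribute: bool) -> str:
    """Decode named character references in data or in an attribute value."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != "&":
            out.append(text[i])
            i += 1
            continue
        # Look for the longest defined name right after the ampersand.
        match = None
        for length in range(min(_MAX_NAME, len(text) - i - 1), 0, -1):
            candidate = text[i + 1 : i + 1 + length]
            if candidate in html5:
                match = candidate
                break
        if match is None:
            out.append("&")  # nothing matched; the ampersand stays as it is
            i += 1
            continue
        end = i + 1 + len(match)
        following = text[end : end + 1]
        keep_verbatim = (
            in_attribute
            and not match.endswith(";")
            and (following == "=" or (following.isascii() and following.isalnum()))
        )
        out.append(text[i:end] if keep_verbatim else html5[match])
        i = end
    return "".join(out)


url = "/?book=Ecc&sect=1&para=2"
print(decode_named(url, in_attribute=True))   # /?book=Ecc&sect=1&para=2
print(decode_named(url, in_attribute=False))  # /?book=Ecc§=1¶=2
```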
For the sake of completeness, the last thing to mention is a few more characters which need to be escaped in attribute values. First of all, in an unquoted attribute white-space characters (tab, line feed, form feed or space), U+0022 quotation mark, U+0027 apostrophe, U+003D equals sign and U+0060 grave accent all need to be escaped in addition to the three characters discussed earlier. The usual way of dealing with that is of course quoting the value at which point, the ampersand aside, only quotation mark (a.k.a. double quote) or apostrophe (a.k.a. single quote) needs escaping depending on which of the two is used to quote the string.
What exactly are the rules for escaping characters in HTML? Below is a quick recap:
- greater-than sign (>)
- Must be escaped in unquoted attribute only.
- less-than sign (<)
- Must be escaped in data and unquoted attribute.
- ampersand (&)
- A simple rule of thumb is that the character must be escaped if it’s followed by a hash or an alphanumeric character. The actual, more complex, rule is that it must be escaped if i) it is followed by a hash, ii) it is followed by a sequence of alphanumeric characters followed by a semicolon or iii) it is followed by a named character reference defined by the standard, unless this happens inside an attribute value where the reference lacks its semicolon and is directly followed by an equals sign or an alphanumeric character. If a decision needs to be made without access to the list of named character references, one helpful observation is that no defined name starts with a digit so unless an ambiguous ampersand would be formed an ampersand followed by a digit does not need escaping.
- quotation mark (")
- apostrophe (')
- Must be escaped in unquoted attribute and in quoted one if it was the character used to quote the value.
- White-space (tab, line feed, form feed or space)
- equals sign (=)
- grave accent (`)
- Must be escaped in unquoted attribute.
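The recap can be condensed into a pair of small functions. The sketch below applies the simple rule of thumb for ampersands rather than the full rule, so it escapes a little more than strictly necessary, and it covers only data and double-quoted attribute values. The names escape_text and escape_double_quoted are made up for this example.

```python
import re

# Rule of thumb: an ampersand followed by a hash or an ASCII alphanumeric.
_RISKY_AMPERSAND = re.compile(r"&(?=[#0-9A-Za-z])")


def escape_text(text: str) -> str:
    """Escape a string for use as data, i.e. outside of tags."""
    text = _RISKY_AMPERSAND.sub("&amp;", text)
    return text.replace("<", "&lt;")


def escape_double_quoted(value: str) -> str:
    """Escape a string for use inside a double-quoted attribute value."""
    value = _RISKY_AMPERSAND.sub("&amp;", value)
    return value.replace('"', "&quot;")


print(escape_text("a < b > c && d"))        # a &lt; b > c && d
print(escape_text("Steinway & Sons"))       # Steinway & Sons
print(escape_double_quoted('PW>UW "sic"'))  # PW>UW &quot;sic&quot;
```

One caveat of this approach is that it looks at each string in isolation: a string ending with an ampersand is left untouched, which is only safe if whatever gets concatenated after it does not start with a hash or an alphanumeric character.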