I’ve previously written about escaping special characters in HTML using character entities. This article discusses another method of achieving a similar result: CDATA sections. It explains what CDATA is, how it’s used in XML, SVG, MathML etc., when it can be included in an HTML5 file and how it impacts XHTML.
What is CDATA?
A CDATA section in an XML document instructs the parser to interpret the enclosed text literally, as pure character data, not as markup. It starts with the <![CDATA[ sequence and ends with the ]]> sequence. Within a CDATA section, characters which normally have special meaning (namely the less-than sign and the ampersand) are processed verbatim without the need to escape them. It provides a convenient way to embed¹ styles and scripts without having to worry about special characters, as seen in Fig. 1:
<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg">
<title>Is it Xmas?</title>
<rect width="64" height="64"/>
<script><![CDATA[
let day = new Date().getDate();
let fill = 18 <= day && day <= 25 &&
new Date().getMonth() == 11 ?
day == 25 ? '283' : 'cb4' : 'e67';
document.querySelector('rect')
.setAttribute('fill', '#' + fill);
]]></script>
</svg>
Fig. 1 An SVG image which uses a CDATA section to embed a script.
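For comparison, here is a minimal sketch of what has to happen when no CDATA section is used: every ampersand and less-than sign in the script must be replaced with a character entity. The function name escapeXmlText is hypothetical and made up for illustration; it isn’t part of any standard API.

```javascript
// Hypothetical helper: escape text for use as XML character data
// when it is NOT wrapped in a CDATA section.
function escapeXmlText(text) {
  return text
    .replace(/&/g, '&amp;')         // ampersands first, so entities added below survive
    .replace(/</g, '&lt;')          // less-than sign would otherwise start markup
    .replace(/\]\]>/g, ']]&gt;');   // "]]>" may not appear literally in character data
}

// The condition from Fig. 1, as it would have to be written without CDATA.
console.log(escapeXmlText('18 <= day && day <= 25'));
// prints: 18 &lt;= day &amp;&amp; day &lt;= 25
```

The escaped form is equivalent to the parser, but much harder to read and edit by hand, which is the convenience CDATA sections offer.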
CDATA in HTML?
Typically, HTML doesn’t recognise CDATA sections. Using them to escape scripts or styles (analogous to the SVG example in Fig. 1) is invalid. On the other hand, script and style HTML elements are raw text elements whose ‘text […] must not contain any occurrences of the string ‘</’ followed by characters that case-insensitively match the tag name of the element followed by one of tab, line feed, form feed, carriage return, space, greater-than sign or solidus [i.e. forward slash].’ In plain English, anything other than a closing tag is valid inside a raw text element. Fig. 2 shows how the rule works in practice:
<!DOCTYPE html>
<title>Raw text element test</title>
<script>alert('</scrip');</script> ✔ valid
<script>alert('</script');</script> ✔ valid
<script>alert('</script ');</script> ✘ invalid
<script>alert('</script>');</script> ✘ invalid
<script>alert('</scriptus');</script> ✔ valid
<script>alert('<\/script>');</script> ✔ valid
Fig. 2 Demonstration of valid and invalid markup for embedding scripts in an HTML document.
Most JavaScript and CSS code will naturally follow this restriction. If it doesn’t, a convenient solution is to escape the forward slash, i.e. write '<\/script>' or '<\/style>'. For simplicity’s sake and peace of mind, it’s prudent to always write ‘</’ as ‘<\/’.
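The rule quoted above can be sketched as a pair of small functions. Both names (breaksRawText and escapeRawText) are hypothetical. The escaping step assumes that ‘</’ only ever occurs inside string literals, regular expressions or comments, where the added backslash is harmless. The check covers only the closing-tag rule discussed here.

```javascript
// Hypothetical helpers; the names are illustrative, not part of any API.

// Returns true if `text` contains "</" + tagName followed by one of the
// characters listed in the quoted rule: tab, line feed, form feed,
// carriage return, space, ">" or "/". The match is case-insensitive.
// Assumes tagName is a plain name such as 'script' or 'style'.
function breaksRawText(text, tagName) {
  const pattern = new RegExp('</' + tagName + '[\\t\\n\\f\\r />]', 'i');
  return pattern.test(text);
}

// Rewrites every "</" as "<\/". Assumes "</" appears only in places
// (e.g. string literals) where "\/" is read as a plain "/".
function escapeRawText(text) {
  return text.replace(/<\//g, '<\\/');
}

const code = "alert('</script>');";
console.log(breaksRawText(code, 'script'));                 // true
console.log(breaksRawText(escapeRawText(code), 'script'));  // false
```

Running the check against the six lines of Fig. 2 gives the same valid and invalid verdicts as shown there.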
Embedding SVG or MathML in HTML
HTML accepts CDATA sections inside foreign elements, i.e. elements from the MathML and the SVG namespaces. For example, the image from Fig. 1 can be included in an HTML file as shown in Fig. 3. Since the script element inside of the embedded SVG is foreign, the HTML rules described above do not apply. Instead, a CDATA section can be used to escape the contents of the script.
<!DOCTYPE html>
<title>Is it Xmas?</title>
<svg>
<rect width=64 height=64 />
<script><![CDATA[
let day = new Date().getDate();
let fill = 18 <= day && day <= 25 &&
new Date().getMonth() == 11 ?
day == 25 ? '283' : 'cb4' : 'e67';
document.querySelector('rect')
.setAttribute('fill', '#' + fill);
]]></script>
</svg>
Fig. 3 An HTML document with an embedded SVG image containing an embedded script.
This creates a curious duality where some XML syntax rules seem to apply but others do not. CDATA sections are recognised and there are no special parsing rules for script and style elements. At the same time, attribute values do not need to be quoted. This is of course because the HTML5 standard ultimately governs how such a file is handled.
Another interesting observation is that embedded SVG images and MathML formulæ in HTML become part of the containing Document Object Model (DOM). As a result, scripts and styles can often be moved outside of the foreign element.² For example, the document from Fig. 3 can be rewritten with the script as a sibling of svg, as shown in Fig. 4. Since the element is no longer inside of svg, it’s parsed according to typical HTML rules.
<!DOCTYPE html>
<title>Is it Xmas?</title>
<svg><rect width=64 height=64 /></svg>
<script>
let day = new Date().getDate();
let fill = 18 <= day && day <= 25 &&
new Date().getMonth() == 11 ?
day == 25 ? '283' : 'cb4' : 'e67';
document.querySelector('rect')
.setAttribute('fill', '#' + fill);
</script>
Fig. 4 An HTML document with an embedded SVG image whose script was moved outside of the SVG foreign element.
What’s the deal with XHTML?
I first looked at XHTML’s handling of the script element nearly two decades ago. XHTML follows XML syntax rules, where a convenient way to escape embedded scripts and styles is to use a CDATA section in exactly the same way as in Fig. 1. However, a problem may arise if a user agent interprets the file as HTML. This can happen if it’s saved with an .html file extension or sent with an incorrect content type. Compare the same document parsed once as XHTML and once as HTML.
The files are identical, but the embedded script executes only in the former. In the latter, the web browser doesn’t recognise the CDATA section and treats <![CDATA[ and ]]> as part of the JavaScript code, resulting in a syntax error. To avoid such issues, a ‘standard-agnostic’ syntax can be used where the CDATA markers are wrapped in comments, as demonstrated in Fig. 5. Block comments are shown because they work in both JavaScript and CSS.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>XHTML scripting</title></head>
<body><script>/* <![CDATA[ */
alert('42 < g₆₄')
/* ]]> */</script></body>
</html>
Fig. 5 An XHTML document with embedded JavaScript code that’s going to work even if the file is parsed as HTML.

| Markup | Content when parsed as XHTML | Content when parsed as HTML |
|---|---|---|
| <script><![CDATA[ alert('42 < g₆₄') ]]></script> | alert('42 < g₆₄') | <![CDATA[ alert('42 < g₆₄') ]]> (JavaScript syntax error) |
| <script> alert('42 < g₆₄') </script> | XML syntax error | alert('42 < g₆₄') |
| <script>/* <![CDATA[ */ alert('42 < g₆₄') /* ]]> */</script> | alert('42 < g₆₄') | /* <![CDATA[ */ alert('42 < g₆₄') /* ]]> */ |

Fig. 6 Results of parsing different script elements depending on whether the file is recognised as XHTML or HTML.
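When scripts are inserted into XHTML templates by a build step, the comment-guarded markers from Fig. 5 can be added programmatically. The following is only a sketch: wrapInGuardedCdata is a hypothetical name, and the function assumes the code uses a language with /* … */ block comments (JavaScript or CSS) and contains no ‘]]>’ sequence of its own.

```javascript
// Hypothetical helper: wrap inline script or style code so that the CDATA
// markers sit inside block comments, as in Fig. 5.
function wrapInGuardedCdata(code) {
  // A literal "]]>" would end the CDATA section early when parsed as XML,
  // so this sketch refuses such input rather than trying to repair it.
  if (code.includes(']]>')) {
    throw new RangeError('code must not contain "]]>"');
  }
  return '/* <![CDATA[ */\n' + code + '\n/* ]]> */';
}

console.log(wrapInGuardedCdata("alert('42 < g₆₄')"));
```

The result is what goes between the opening and closing script tags. An XML parser strips the markers and hands the two comments plus the code to the script engine, while an HTML parser passes everything through and the markers stay hidden inside the comments.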
Word of caution
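N.B., even though CDATA sections eliminate many instances where escaping is necessary, they are not a substitute for validating user input. For example, name = request.GET.get('name', 'world'); print('<text>Hello, <![CDATA[', name, ']]>.</text>') is not correct: an adversary can execute a Cross-Site Scripting (XSS) attack with ‘]]>evil payload here<![CDATA[’. Text inside of a CDATA section requires each instance of ‘]]>’ to be replaced by ‘]]>]]&gt;<![CDATA[’, which closes the section, emits the sequence as escaped character data and opens a new section. (The ‘>’ must be written as ‘&gt;’ there because a literal ‘]]>’ is not allowed in XML character data outside of a CDATA section.)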
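A minimal sketch of that replacement is shown below. The function name toCdataSection is hypothetical, and the sketch handles only the ‘]]>’ sequence. It doesn’t check for characters that are not allowed in XML at all, such as most control characters.

```javascript
// Hypothetical helper: wrap untrusted text in a CDATA section without
// letting it terminate the section early.
function toCdataSection(text) {
  // Each "]]>" closes the current section, is emitted as the escaped
  // character data "]]&gt;", and is followed by a freshly opened section.
  const safe = text.replace(/\]\]>/g, ']]>]]&gt;<![CDATA[');
  return '<![CDATA[' + safe + ']]>';
}

console.log(toCdataSection('world'));
// prints: <![CDATA[world]]>
console.log(toCdataSection('a]]>b'));
// prints: <![CDATA[a]]>]]&gt;<![CDATA[b]]>
```

With this in place, the payload ‘]]>evil payload here<![CDATA[’ ends up as plain character data instead of breaking out of the section.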
Summary
- A CDATA section starts with the <![CDATA[ sequence and ends with ]]>. Any text inside of it is parsed literally and not treated as markup.
- It can be used in an XML document (such as an SVG image or XHTML file), but isn’t normally recognised in an HTML file.
- When embedding an SVG or MathML document in an HTML file, the CDATA sections contained therein are recognised. Styles and scripts included within such embedded content can (and usually need to) be escaped with a CDATA section.
- As a rule of thumb, in HTML documents only the ‘</’ sequence needs escaping in script and style elements; ‘<\/’ can be substituted in its place.
- When embedding scripts and styles in XHTML documents, wrapping CDATA markers in a comment (as in /* <![CDATA[ */ … code goes here … /* ]]> */) avoids syntax errors if the file is mistakenly parsed as HTML.
1 Within this article, embedding means putting the source of one file inside the source of another. For example, copying SVG source and pasting it directly into an HTML file, as opposed to referencing the image with an img element. In the context of SVG images, such an image is often called an inline SVG. However, to avoid confusion with an inline style (a similar but distinct feature from embedding styles), this article doesn’t use that nomenclature. ↩
2 In CSS, one exception is @scope. Without a scope root specified, it’s scoped to the style element’s parent element. In JavaScript, moving the script affects which part of the Document Object Model (DOM) tree is available when the code executes. For completeness, I feel compelled to also mention document.write, but that method should never be used. ↩