Chapter 4. Hypertext Markup Language

A similar behavior can be observed in previous versions of Firefox when dealing with tag names that contain invalid characters (in this case, the equal sign). Instead of doing its best to ignore the entire block, the parser would simply reset and interpret the quoted tag:

A simple form markup may look like this:

Always output consistent, valid, and browser-supported Content-Type and charset information to prevent the document from being interpreted contrary to your original intent.

An HTML form can be thought of as an information-gathering hyperlink: When the "submit" button is clicked, a dynamic request is constructed on the fly from the data collected via any number of input fields. Forms allow user input and files to be uploaded to the server, but in almost every other way, the result of submitting a form is similar to following a normal link.

An optional target parameter may be used to target other windows or document views for navigation. The parameter must specify the name of the target view. If the name cannot be found, or if access is denied, the default behavior is typically to open a new window instead. The conditions in which access may be denied are the topic of Chapter 11.

Applications that fail to account for this possibility when processing any sensitive, state-changing requests are said to be vulnerable to cross-site request forgery (XSRF or CSRF). This vulnerability can be mitigated in a number of ways, the most common of which is to include a secret user- and session-specific value on such requests (as an additional query parameter or a hidden form field). The attacker will not be able to obtain this value, as read access to cross-domain documents is restricted by the same-origin policy (see Chapter 9).
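
A minimal sketch of the secret-token mitigation follows, assuming the token is derived from the session identifier with an HMAC. The names SERVER_KEY, issue_token, and is_valid are hypothetical, and a real application would load the key from its configuration rather than generate it at import time.

```python
import hashlib
import hmac
import secrets

# Hypothetical server-side secret (assumption: generated here only for the sketch).
SERVER_KEY = secrets.token_bytes(32)


def issue_token(session_id: str) -> str:
    """Derive a session-specific value to embed in a hidden form field."""
    return hmac.new(SERVER_KEY, session_id.encode("utf-8"), hashlib.sha256).hexdigest()


def is_valid(session_id: str, submitted: str) -> bool:
    """Check the submitted value in constant time before honoring the request."""
    return hmac.compare_digest(issue_token(session_id), submitted)
```

A state-changing request would be processed only when is_valid returns True for the current session; a third-party page cannot read the token and therefore cannot supply it.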

As an illustration of permissible syntax, consider the following directive that, when appearing in an 8-bit ASCII document, will clarify for the browser that the charset of the document is UTF-8 and not, say, ISO-8859-1:

As with CSS, in the absence of valid Content-Type data, the charset according to which the script is interpreted may be controlled by the including party.

Be mindful that when http-equiv values conflict with each other, or contradict the HTTP headers received from the server earlier on, their behavior is not consistent and should not be relied upon. For example, the first supported charset= value usually prevails (and HTTP headers have precedence over <meta> in this case), but with several conflicting Refresh values, the behavior is highly browser-specific.

Both types of angle brackets are obviously problematic inside a tag, unless properly quoted.

But what if an HTML document is delivered through a non-HTTP protocol or loaded from a local file? Clearly, in this case, there is no simple way to express or preserve this information. We can part with some of it easily, but parameters such as the MIME type or the character set are essential, and losing them forces browsers to improvise later on. (Consider, for example, that charsets such as UTF-7, UTF-16, and UTF-32 are not ASCII-compatible and, therefore, HTML documents can't even be parsed without determining which of these transformations needs to be used.)
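
To make the ASCII-compatibility point concrete, the sketch below uses Python's built-in codecs to show how the same three-character tag looks on the wire in several encodings. The sample markup is made up for the illustration.

```python
markup = "<b>"

# ASCII-compatible encodings keep each angle bracket as a single familiar byte.
utf8 = markup.encode("utf-8")
latin1 = markup.encode("iso-8859-1")

# UTF-16 and UTF-32 pad every character with NUL bytes, so a byte-oriented
# parser that assumes ASCII will not see the tag as written.
utf16 = markup.encode("utf-16-le")
utf32 = markup.encode("utf-32-le")

# In UTF-7, the same tag can be written without any literal angle brackets.
utf7_bytes = b"+ADw-b+AD4-"
decoded = utf7_bytes.decode("utf-7")
```

Here decoded is again the string "<b>", even though utf7_bytes contains no "<" or ">" bytes at all.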

Despite having had a lasting impact on the design of the language, in some ways, the idea of a semantic web may be becoming obsolete: Online content less frequently maps to the concept of a single, viewable document, and HTML is often reduced to providing a convenient drawing surface and graphic primitives for JavaScript applications to build their interfaces with.

Despite the seemingly open-ended syntax of the tag, other request methods and submission formats are not supported by any browser, and this is unlikely to change. For a short while, the HTML5 standard tried to introduce PUT and DELETE methods in forms, but this proposal was quickly shot down.

Developers working with XHTML should be aware of a potential pitfall in that dialect, too. Although HTML entities are not recognized in most of the special parsing modes, XHTML differs from traditional HTML in that tags such as <script> and <style> do not automatically toggle a special parsing mode on their own. Instead, an explicit <![CDATA[...]]> block around any scripts or stylesheets is required to achieve a comparable effect. Therefore, the following snippet with an attacker-controlled string (otherwise scrubbed for angle brackets, quotes, backslashes, and newlines) is perfectly safe in HTML, but not in XHTML:

Encoded name=value pairs are then delimited with ampersands and combined into a single string, such as this:
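
As a rough illustration, Python's standard urllib.parse.urlencode performs the same kind of encoding and joining. The field names and values below are made up for the example and are not taken from the original text.

```python
from urllib.parse import urlencode

# Hypothetical form fields; reserved characters in values get percent-encoded,
# and spaces become plus signs.
fields = [("user", "John Doe"), ("comment", "1 + 1 = 2 & more")]

query = urlencode(fields)
print(query)  # user=John+Doe&comment=1+%2B+1+%3D+2+%26+more
```

Note that the ampersand and equal sign inside the second value are escaped, so the only literal "&" and "=" characters left in the string are the delimiters.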

Figure 4-1. Predefined targets for hyperlinks

For almost all of the type-specific content inclusion methods, Content-Type and Content-Disposition headers provided by the server will typically be ignored (perhaps except for the charset= value), as may be the HTTP response code itself. It is best to assume that whenever the body of any server-provided resource is even vaguely recognizable as one of the data formats enumerated in this section, it may be interpreted as such.

For any HTML document, a top-level <!DOCTYPE> directive may be used to instruct the browser to parse the file in a manner that at least superficially conforms to one of the officially defined standards; to a more limited extent, the same signal can be conveyed by the Content-Type header, too. Of all the available parsing modes, the most striking difference exists between XHTML and traditional HTML. In the traditional mode, parsers will attempt to recover from most types of syntax errors, including unmatched opening and closing tags. In addition, tag and parameter names will be considered case insensitive, parameter values will not always need to be quoted, and certain types of tags, such as <img>, will be closed implicitly. In other words, the following input will be grudgingly tolerated:

Four special target names can be used, too (as shown on the left of Figure 4-1): _blank always opens a brand-new window, _parent navigates a higher-level view that embeds the link-bearing document (if any), and _top always navigates the top-level browser window, no matter how many document embedding levels are in between. Oh, right, the fourth special target, _self, is identical to not specifying a value at all and exists for no reason whatsoever.

Frames are a form of markup that allows the contents of one HTML document to be displayed in a rectangular region of another, embedding page. Several framing tags are supported by modern browsers, but the most common way of achieving this goal is with a hassle-free and flexible inline frame:

Frames are of special interest to web security, as they allow almost unconstrained types of content originating from unrelated websites to be combined onto a single page. We will have a second look at the problems associated with this behavior in Chapter 11.

From Chapter 3, we recall that HTTP headers may give new meaning to the entire response (Location, Transfer-Encoding, and so on), change the way the payload is presented (Content-Type, Content-Disposition), or affect the client-side environment in other, auxiliary ways (Refresh, Set-Cookie, Cache-Control, Expires, etc.).

From a purely theoretical standpoint, HTML relies on a fairly simple syntax: a hierarchical structure of tags, name=value tag parameters, and text nodes (forming the actual document body) in between. For example, a simple document with a title, a heading, and a hyperlink may look like this:

From that point on, all hell broke loose: For the next few years, competing browser vendors kept introducing all sorts of flashy, presentation-oriented features and tweaked the language to their liking. Several attempts to amend the original RFC were undertaken, but ultimately the IETF-managed standardization approach proved to be too inflexible. The newly formed World Wide Web Consortium took over the maintenance of the language and eventually published the HTML 3.2 specification in 1997.[138]

Good Engineering Hygiene for All HTML Documents

Image files can be retrieved and displayed on a page using <img> tags, via stylesheets, and through a legacy background= parameter on markup such as <body> or <table>.

In XHTML documents, additional named entities can be defined using the <!ENTITY> directive and made to resolve to internally defined strings or to the contents of an external file URL. (This last option is obviously unsafe if allowed when processing untrusted content; the resulting attack is sometimes called XML External Entity, or XXE for short.)

In addition to content-agnostic link navigation and document framing, HTML also provides multiple ways for a more lightweight inclusion of several predefined types of external content.

In addition to the named entities, it is also possible to insert an arbitrary ASCII or Unicode character using a decimal &#number; notation. In this case, &#60; maps to a left angle bracket; &#62; substitutes a right one; and &#128569; is, I kid you not, a Unicode 6.0 character named "smiling cat face with tears of joy." Hexadecimal notation can also be used if the number is prefixed with "x". In this variant, the left angle bracket becomes &#x3c;, etc.
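
The numeric notation can be checked quickly with Python's html.unescape, which follows HTML5 decoding rules. This is only a convenience for experimenting; it is not a model of any particular browser.

```python
import html

# Decimal character references for the two angle brackets.
left = html.unescape("&#60;")
right = html.unescape("&#62;")

# The same left angle bracket in hexadecimal notation.
left_hex = html.unescape("&#x3c;")

# A character outside the Basic Multilingual Plane (code point 128569, U+1F639).
cat = html.unescape("&#128569;")
```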

In comparison, the XML mode is more predictable. It generally forbids stray "<" and "&" characters inside the document, but it provides a special syntax, starting with "<![CDATA[" and ending with "]]>", as a way to encapsulate any raw text inside an arbitrary tag. For example:

In fact, Firefox versions prior to version 4 engaged in far-fetched reparsing whenever particular special tags, such as <title>, were not closed before the end of the document:

In the absence of the appropriate charset value in the Content-Type header for the downloaded stylesheet, the encoding according to which this subresource will be interpreted can be specified by the including party through the charset parameter of the <link> tag.

In the following years, the work on HTML 4 and 4.01[139] focused on pruning HTML of all accumulated excess and on better explaining how document elements should be interpreted and rendered. It also defined an alternative, strict XHTML syntax derived from XML, which was much easier to consistently parse but more punishing to write. Despite all this work, however, only a small fraction of all websites on the Internet could genuinely claim compliance with any of these standards, and little or no consistency in parsing modes and error recovery could be seen on the client end. Consequently, some of the work on improving the core language fizzled out, and the W3C turned its attention to stylesheets, the Document Object Model, and other more abstract or forward-looking challenges.

In the late 2000s, some of the low-level work was revived under the banner of HTML5,[140] an ambitious project to normalize almost every aspect of the language syntax and parsing, define all the related APIs, and more closely police browser behavior in general. Time will tell if it will be successful; until then, the language itself, and each of the four leading parsing engines,[25] come with their own set of frustrating quirks.

In traditional HTML documents, this tag puts the parser in one of the special parsing modes, and all text between the opening and the closing tag will simply be ignored in frame-aware browsers. In legacy browsers that do not understand <iframe>, the markup between the opening and closing tags is processed normally, however, offering a decidedly low-budget, conditional rendering directive. This conditional behavior is commonly used to provide insightful advice such as "This page must be viewed in a browser that supports frames."

Let's talk about character encoding again. As noted on the first pages of this chapter, certain reserved characters are generally unsafe inside text nodes and tag parameter values, and they will often lead to outright syntax errors in XHTML. In order to allow such characters to be used safely (and to allow a convenient way to embed high-bit text), a simple ampersand-prefixed, semicolon-terminated encoding scheme, known as entity encoding, is available to developers.

Many other quirks of this type are related to the idiosyncrasies of SGML and XML. For example, due to the comment-handling behavior mentioned earlier in an aside, browsers disagree on how to parse !- and ?-directives (such as <!DOCTYPE> or <?xml>), whether to allow XML-style CDATA blocks in non-XHTML modes, and on what precedence to give to overlapping special parsing mode tags (such as "<style><!-- </style> -->").

Moving on, the location marked is also of note. In this spot, NUL characters are ignored by most parsers, as are many types of whitespaces. Not long ago, WebKit browsers accepted a slash in this location, but recent parser improvements have eliminated this quirk.

Of course, the availability of such an encoding scheme is not a guarantee of its use. The failure to properly filter out or escape reserved characters when displaying user-controlled data is the cause of a range of extremely common and deadly web application security flaws. A particularly well-known example of this is cross-site scripting (XSS), an attack in which malicious, attacker-provided JavaScript code is unintentionally echoed back somewhere in the HTML markup, effectively giving the attacker full control over the appearance and operation of the targeted site.
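
As a minimal sketch of escaping on output, the function below uses Python's html.escape before placing user-controlled text into a text node. The function name render_greeting and the surrounding markup are hypothetical, and escaping for other contexts (scripts, URLs, stylesheets) would need different rules.

```python
import html


def render_greeting(name: str) -> str:
    """Entity-encode reserved characters before echoing user input into markup."""
    return "<p>Hello, " + html.escape(name, quote=True) + "</p>"


print(render_greeting("<script>alert(1)</script>"))
# <p>Hello, &lt;script&gt;alert(1)&lt;/script&gt;</p>
```

With the reserved characters encoded, the injected string is displayed as text instead of being interpreted as a script block.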

On all types of cross-domain navigation, the browser will transparently include any ambient credentials; consequently, to the server, a request legitimately originating from its own client-side code will appear roughly the same as a request originating from a rogue third-party site, and it may be granted the same privileges.

On the flip side, all of the following directives will fail, because at this point it is too late to switch to an incompatible UTF-32 encoding, change the document type to a video format, or execute a redirect instead of parsing the file:

One of the most important and security-relevant features of HTML is, predictably, the ability to link to and embed external content. HTTP-level features such as Location and Refresh aside, this can be accomplished in a couple of straightforward ways.

Parsing a single tag can be a daunting task, but as you might imagine, anomalous arrangements of multiple HTML tags will be even less predictable. Consider the following trivial example:

Quote characters appearing inside a tag can have undesirable effects, depending on their exact location, but are harmless in text nodes.

Quote characters are yet another topic of interest. Website developers know that single and double quotes can be used to put a string containing whitespaces or angle brackets in an HTML parameter, but it usually comes as a surprise that Internet Explorer also honors backticks (`) instead of real quotes in the location marked . Similarly, few people realize that in any browser, an implicit whitespace is inserted after a quoted parameter, and that the explicit whitespace at can therefore be skipped without changing the meaning of the tag.

SECURITY ENGINEERING CHEAT SHEET

Scripts are text-based programs included with <script> tags and are executed in a manner that gives them full control over the host document. The primary scripting language for the Web is JavaScript, although an embedded version of Visual Basic is also supported in Internet Explorer and can be used at will. Chapter 6 takes an in-depth look at client-side scripts and their capabilities.

Several other once-supported content inclusion methods, such as the <bgsound> tag for background music, were commonplace in the past but have fallen out of grace. On the other hand, as a part of HTML5, new tags such as <video> and <audio> are expected to gain popularity soon.

Since an accurate understanding of user-supplied markup is essential to designing many types of security filters, let's have a quick look at some of these behaviors and quirks. To begin, consider the following reference snippet:

Sir Berners-Lee has never given up on this dream, but in this one regard, the actual usage of HTML proved to be very different from what he wished for. Web developers were quick to pragmatically distill the essence of HTML 3.2 into a handful of presentation-altering but semantically neutral tags, such as <font>, <b>, and <pre>, and saw no reason to explain further the structure of their documents to the browser. W3C attempted to combat this trend but with limited success. Although tags such as <font> have been successfully obsoleted and largely abandoned in favor of CSS, this is only because stylesheets offered more powerful and consistent visual controls. With the help of CSS, the developers simply started relying on a soup of semantically agnostic <span> and <div> tags to build everything from headings to user-clickable buttons, all in a manner completely opaque to any automated content extraction tools.

Some browsers will attempt to speculatively extract <meta http-equiv> information before actually parsing the document, which may lead to embarrassing mistakes. For example, a security bug recently fixed in Firefox 4 caused the browser to interpret the following statement as a character set declaration: <meta http-equiv="Refresh" content="10;http://www.example.com/charset=utf-7">.[141]

Stray ampersands (&) should never appear in most sections of an HTML document.

The HTML parser recognizes entity encoding inside text nodes and parameter values and decodes it transparently when building an in-memory representation of the document tree. Therefore, the following two cases are functionally identical:

The Hypertext Markup Language (HTML) is the primary method of authoring online documents. One of the earliest written accounts of this language is a brief summary posted on the Internet by Tim Berners-Lee in 1991.[136] His proposal outlines an SGML-derived syntax that allows text documents to be annotated with inline hyperlinks and several types of layout aids. In the following years, this specification evolved gradually under the direction of Sir Berners-Lee and Dan Connolly, but it wasn't until 1995, at the onset of the First Browser Wars, that a reasonably serious and exhaustive specification of the language (HTML 2.0) made it to RFC 1866.[137]

The XML mode, on the other hand, is strict: All tags need to be balanced carefully, named using the proper case, and closed explicitly. (The XML-specific self-closing tag syntax, such as <img />, is permitted.) In addition, most syntax mistakes, even trivial ones, will result in an error and prevent the document from being displayed at all.

The action parameter works like the href value used for normal links, with one minor difference: If the value is absent, the form will be submitted to the location of the current document, whereas any destination-free <a> links will simply not work at all. An optional target parameter may also be specified and will behave as outlined in the previous section.

The constraints on the src URL for framed content are roughly similar to the rules enforced on regular links. This includes the ability to point frames to javascript: or to load externally handled protocols that leave the frame empty and open the target application in a new process.

The existence of the second POST submission mode, triggered by specifying enctype="text/plain" on the <form> tag, is difficult to justify. In this mode, field names and values will not be percent encoded at all (but, depending on the browser, plus signs may be used to substitute for spaces), and a newline delimiter will be used in place of an ampersand. The resulting format is essentially useless, as it can't be parsed unambiguously: Form-originating newlines and equal signs are indistinguishable from browser inserted ones.
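
The ambiguity is easy to reproduce. The sketch below assumes a CRLF delimiter and no escaping at all; the helper encode_text_plain and the field names are hypothetical and only imitate the format described above.

```python
def encode_text_plain(fields):
    """Serialize (name, value) pairs as unescaped name=value lines (text/plain style)."""
    return "\r\n".join(f"{name}={value}" for name, value in fields)


# One field whose value happens to contain a newline and an equal sign...
one_field = [("note", "hello\r\nadmin=yes")]

# ...and two separate fields.
two_fields = [("note", "hello"), ("admin", "yes")]

body_a = encode_text_plain(one_field)
body_b = encode_text_plain(two_fields)
```

Both inputs produce byte-for-byte identical bodies, so the recipient has no way to tell which form was actually submitted.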

The following markup demonstrates the most familiar and most basic method for referencing external content from within a document:

The following two examples, on the other hand, will not work as expected, as the encoding interferes with the structure of the tag itself:

The frame is a completely separate document view that in many aspects is identical to a new browser window. (It even enjoys its own JavaScript execution context.) Like browser windows, frames can be equipped with a name parameter and then targeted from <a> and <form> tags.

The fundamentals of HTML syntax outlined in the previous sections are usually enough to understand the meaning of well-formed HTML and XHTML documents. When the XHTML dialect is used, there is little more to the story: The minimal fault-tolerance of the parser means that anomalous syntax almost always leads simply to a parsing error. Alas, the picture is very different with traditional, laid-back HTML parsers, which aggressively second-guess the intent of the page developer even in very ambiguous or potentially harmful situations.

The handling of tags that are not closed before the end of the file is equally fascinating. For example, the following snippet will prompt most browsers to interpret the <i> tag or ignore the entire string, but Internet Explorer and Opera use a different backtracking approach and will see <b> instead:

The largely transparent behavior of entity encoding makes it important to correctly resolve it prior to making any security decisions about the contents of a document and, if applicable, to properly restore it in the sanitized output later on. To illustrate, the following syntax must be recognized as an absolute reference to a javascript: pseudo-URL and not to a cryptic fragment ID inside a relative resource named "./javascript&":
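
A small sketch of this rule, using Python's html.unescape and urllib.parse.urlsplit as stand-ins for a browser's decoding and URL parsing. The helper name href_scheme and the sample attribute value are assumptions made for the illustration.

```python
import html
from urllib.parse import urlsplit


def href_scheme(raw_attribute_value: str) -> str:
    """Resolve entity encoding first, then inspect the URL scheme."""
    decoded = html.unescape(raw_attribute_value)
    return urlsplit(decoded.strip()).scheme.lower()


# Hypothetical attribute value with the colon written as a numeric entity.
raw = "javascript&#x3a;alert(1)"

naive_scheme = urlsplit(raw).scheme   # inspects the still-encoded string
real_scheme = href_scheme(raw)        # inspects what the browser will see
```

The naive check finds no scheme and treats the value as a relative reference, while the decoded value clearly carries the javascript: scheme.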

The last important difference worth mentioning here is that traditional HTML parsing strategies feature a selection of special modes, entered into after certain tags are encountered and exited only when a specific terminator string is seen; everything in between is interpreted as non-HTML text. Some examples of such special tags include <style>, <script>, <textarea>, or <xmp>. In practical implementations, these modes are exited only when a literal, case-insensitive match on </style, </script, or a similar matching value, is made; any other markup inside such a block will not be interpreted as HTML. (Interestingly, there is one officially obsolete tag, <plaintext>, that cannot be exited at all; it stays in effect for the remainder of the document.)
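
The effect of a special parsing mode can be observed with Python's html.parser, which treats the contents of script and style blocks as raw text. This parser is not a browser and differs from browsers in many details; the class TagLogger and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record the name of every opening tag the parser recognizes."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def start_tags(markup: str):
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.start_tags


# The same inner text, once inside a script block and once inside a paragraph.
in_script = start_tags("<script>var s = '<b>bold</b>';</script><i>x</i>")
in_paragraph = start_tags("<p>var s = '<b>bold</b>';</p>")
```

Inside the script block the b tag is never reported as markup; inside the paragraph it is.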

The last mode is triggered with enctype="multipart/form-data" and must be used whenever submitting user-selected files through a form (which is possible with a special <input type="file"> tag). The resulting request body consists of a series of short MIME messages corresponding to every submitted field.[28] These messages are delimited with a client-selected random, unique boundary token that should otherwise not appear in the encapsulated data:
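
The general shape of such a body can be sketched as follows. This is a simplified illustration for plain text fields only: the function name encode_multipart and the boundary format are assumptions, file parts are omitted, and field names are not escaped.

```python
import secrets


def encode_multipart(fields):
    """Return (boundary, body) for a list of (name, value) text fields."""
    # Pick a random boundary; retry in the unlikely case it occurs in the data.
    boundary = "sketch-" + secrets.token_hex(16)
    while any(boundary in value for _, value in fields):
        boundary = "sketch-" + secrets.token_hex(16)

    lines = []
    for name, value in fields:
        lines.append("--" + boundary)
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")  # blank line separates part headers from the value
        lines.append(value)
    lines.append("--" + boundary + "--")  # closing delimiter
    lines.append("")
    return boundary, "\r\n".join(lines)


boundary, body = encode_multipart([("user", "alice"), ("note", "line one")])
```

The boundary would also be announced in the request's Content-Type header so that the server knows where each part begins and ends.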

The last two parsing quirks have interesting security consequences in any scenario where the attacker may be able to interrupt page load prematurely. Even if the markup is otherwise fairly well sanitized, the meaning of the document may change in a very unexpected way.

The left angle bracket (<) is a hazard inside a text node.

The list of recognized image types can be wrapped up with odds and ends such as Windows metafiles (WMF and EMF), Windows Media Photo (WDP and HDP), Windows icons (ICO), animated PNG (APNG), TIFF images, and—more recently—WebP. Browser support for these is far from universal, however.

The low-level syntax of the language aside, HTML is also the subject of a fascinating conceptual struggle: a clash between the ideology and the reality of the online world. Tim Berners-Lee always championed the vision of a semantic web, an interconnected system of documents in which every functional block, such as a citation, a snippet of code, a mailing address, or a heading, has its meaning explained by an appropriate machine-readable tag (say, <cite>, <code>, <address>, or <h1> to <h6>).

The most familiar use of this encoding method is the inclusion of certain predefined, named entities. Only a handful of these are specified for XML, but several hundred more are scattered in HTML specifications and supported by all modern browsers. In this approach, &lt; is used to insert a left angle bracket; &gt; substitutes a right angle bracket; &amp; replaces the ampersand itself; while, say, &rarr; is a nice Unicode arrow.

The most popular image type on the Internet is a lossy but very efficient JPEG file, followed by lossless and more featured (but slower) PNG. An increasingly obsolete lossless GIF format is also supported by every browser, and so is the rarely encountered and usually uncompressed Windows bitmap file (BMP). An increasing number of rendering engines support SVG, an XML-based vector graphics and animation format, too, but the inclusion of such images through the <img> tag is subject to additional restrictions.

The new specification tried to reconcile the differences in browser implementations while embracing many of the bells and whistles that appealed to the public, such as customizable text colors and variable typefaces. Ultimately, though, HTML 3.2 proved to be a step back for the clarity of the language and had only limited success in catching up with the facts.

The only reasonable approach to tag sanitization is to employ a realistic parser to translate the input document into a hierarchical in-memory document tree, and then scrub this representation for all unrecognized tags and parameters, as well as any undesirable tag/parameter/value configurations. At that point, the tree can be carefully reserialized into a well-formed, well-escaped HTML that will not flex any of the error correction muscles in the browser itself. Many developers think that a simpler design should be possible, but eventually they discover the reality the hard way.
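
The overall shape of the parse, scrub, and reserialize approach can be sketched with Python's html.parser. This is only an outline under loud assumptions: the standard library parser does not mirror browser error recovery and is therefore not the "realistic parser" a production filter needs, and the allowlist policy (ALLOWED, SAFE_PREFIXES) is hypothetical.

```python
import html
from html.parser import HTMLParser

# Hypothetical policy: permitted tags and, for each, the permitted parameters.
ALLOWED = {"a": {"href"}, "b": set(), "i": set(), "p": set()}
SAFE_PREFIXES = ("http://", "https://")


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # unrecognized tags are dropped entirely
        kept = []
        for name, value in attrs:  # values arrive already entity-decoded
            if name not in ALLOWED[tag] or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_PREFIXES):
                continue
            kept.append(f' {name}="{html.escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close everything up to and including the matching tag.
            while self.open_tags:
                top = self.open_tags.pop()
                self.out.append(f"</{top}>")
                if top == tag:
                    break

    def handle_data(self, data):
        self.out.append(html.escape(data))  # re-escape all text on output


def sanitize(markup: str) -> str:
    parser = Sanitizer()
    parser.feed(markup)
    parser.close()
    while parser.open_tags:  # balance anything left open
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)
```

For instance, sanitize('<p onclick="x()">hi <script>alert(1)</script></p>') yields '<p>hi alert(1)</p>': the event handler and the script tag are gone, and the script body survives only as inert, escaped text. The key design choice is that nothing from the input is copied through verbatim; every tag, parameter, and text node in the output is rebuilt from the parsed representation.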

The other notable special parsing mode available in both XHTML and normal HTML is a comment block. In XML, it quite simply begins with "<!--" and ends with "-->". In the traditional HTML parser in Firefox versions prior to 4, any occurrence of "--", later followed by ">", is also considered good enough.

The resulting value is inserted into the query part of the destination URL (replacing any existing contents of that section) and submitted to the server. The received response is then shown to the user in the targeted viewport.
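
The replacement of the query part can be sketched with Python's urllib.parse. The helper name get_submission_url, the action URL, and the field are made up for the illustration.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def get_submission_url(action: str, fields) -> str:
    """Build the URL a GET form submission would request."""
    parts = urlsplit(action)
    # Any query string already present in the action URL is replaced, not extended.
    return urlunsplit(parts._replace(query=urlencode(fields)))


url = get_submission_url("http://example.com/search?old=1", [("q", "tangled web")])
print(url)  # http://example.com/search?q=tangled+web
```

Note that the old=1 parameter from the action URL does not survive the submission.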

The security consequences of the browser-level heuristics used to detect character sets and document types will be explored in detail in Chapter 13. Meanwhile, the problem of preserving protocol-level information within a document is somewhat awkwardly addressed by a special HTML directive, <meta http-equiv=...>. By the time the browser examines the markup, many content-handling decisions must have already been made, but some tweaks are still on the table; for example, it may be possible to adjust the charset to a generally compatible value or to specify Refresh, Set-Cookie, and caching directives.

The security impact of these patterns is not always easy to appreciate, but consider an HTML filter tasked with scrubbing an <img> tag with an attacker-controlled title parameter. Let's say that in the input markup, this parameter is not quoted if it contains no whitespaces and angle brackets, a design that can be seen on a popular blogging site. This practice may appear safe at first, but in the following two cases, a malicious, injected onerror parameter will materialize inside a tag:

The set of parsing behaviors discussed in the previous sections is by no means exhaustive. In fact, an entire book has been written on this topic: Inquisitive readers are advised to grab Web Application Obfuscation (Syngress, 2011) by Mario Heiderich, Eduardo Alberto Vela Nava, Gareth Heyes, and David Lindsay—and then weep about the fate of humanity. The bottom line is that building HTML filters that try to block known dangerous patterns, and allow the remaining markup as is, is simply not feasible.

The situation is a bit more complicated if the method parameter is set to POST. For that type of HTTP request, three data submission formats are available. In the default mode (referred to as application/x-www-form-urlencoded), the message is constructed the same way as for GET but is transmitted in the request payload instead, leaving the query string and all other parts of the destination URL intact.[27]

The standard permits certain types of browser-supported documents, such as text/html or text/plain, to be loaded through <object> tags, in which case they form a close equivalent of <iframe>. This functionality is not used in practice, and the rationale behind it is difficult to grasp.

There is relatively little consistency in what URL schemes are accepted for type-specific content retrieval. It should be expected that protocols routed to external applications will be rejected, as they do not have a sensible meaning in this context, but beyond this, not many assumptions should be made. As a security precaution, most browsers will also reject scripting-related schemes when loading images and stylesheets, although Internet Explorer 6 and Opera do not follow this practice. As of this writing, javascript: URLs are also permitted on <embed> and <applet> tags in Firefox but not, for example, on <img>.

These text-based files can be loaded with a <link rel=stylesheet href=...> tag (even though <style src=...> would be a more intuitive choice) and may redefine the visual aspects of almost any other HTML tag within their parent document (and in some cases, even include embedded JavaScript). The syntax and function of CSS are the subject of Chapter 5.

This approach, he and other proponents argued, would make it easier for machines to crawl, analyze, and index the content in a meaningful way, and in the near future, it would enable computers to reason using the sum of human knowledge. According to this philosophy, the markup language should provide a way to stylize the appearance of a document, but only as an afterthought.

This category includes various rendering cues that may or may not be honored by the browser; they are most commonly provided through <link> directives. Examples include website icons (known as "favicons"), alternative versions of a page, and chapter navigation links.

This category spans miscellaneous binary files included with <embed> or <object> tags or via an obsolete, Java-specific <applet> tag. Browser plug-in content follows its own security rules, which are explored to some extent in Chapter 8 and Chapter 9. In many cases, it is safe to consider plug-in-supported content as equivalent to or more powerful than JavaScript.

This hyperlink may point to any of the browser-recognized schemes, including pseudo-URLs (data:, javascript:, and so on) and protocols handled by external applications (such as mailto:). Clicking on the text (or any HTML elements) nested inside such an <a href=...> block will typically prompt the browser to navigate away from the linking document and go to the specified location, if meaningfully possible for the protocol used.

This syntax puts some constraints on what may appear inside a parameter value or inside the document body. Five characters—angle brackets, single and double quotes, and an ampersand—are reserved as the building blocks of the HTML markup, and these need to be avoided or escaped in some way when used outside of their intended function. The most important rules are:

word

To allow these characters to appear in problematic locations without causing side effects, an ampersand-based encoding scheme, discussed in Entity Encoding in HTML Parsing Survival Tips, is provided.
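
As a minimal sketch of this encoding scheme, the following Python snippet uses the standard library's html.escape to replace the five reserved characters with ampersand-based entities. The helper name escape_for_html and the sample string are illustrative assumptions, not something taken from the text above.

```python
import html


def escape_for_html(text: str) -> str:
    """Encode the five reserved characters as ampersand-based entities."""
    # quote=True also encodes double and single quotes, which matters
    # when the value ends up inside a quoted tag parameter.
    return html.escape(text, quote=True)


# Hypothetical input containing all five reserved characters.
sample = '<a href="x">Tom & Jerry\'s</a>'
escaped = escape_for_html(sample)
print(escaped)

# Decoding the entities gives back the original string.
round_trip = html.unescape(escaped)
print(round_trip == sample)
```

Note that the ampersand itself has to be encoded as well; otherwise, text that already looks like an entity would be decoded a second time by the browser.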

word

To further complicate the job of HTML parsing, some browsers exhibit behaviors that can be used to conditionally skip some of the markup in a document. For example, in an attempt to help novice users of Microsoft's Active Server Pages development platform, Internet Explorer treats {% ... %} blocks as a completely nonstandard comment, hiding any markup between these two character sequences. Another Internet Explorer-specific feature is explicit conditional expressions interpreted by the parser and smuggled inside standard HTML comment blocks:

word

Unfortunately, even the simple task of recognizing and parsing HTML entities can be tricky. In traditional parsing, for example, entities may often be accepted even if the trailing semicolon is omitted, as long as the next character is not an alphanumeric. (In Firefox, dashes and periods are also accepted in entity names.) Numeric entities are even more problematic, as they may have an overlong notation with an arbitrary number of leading zeros. Moreover, if the numerical value is higher than 2^32, the standard size of an integer on many computer architectures, the corresponding character may be computed incorrectly.
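
Some of these quirks can be observed with Python's html.unescape, which follows the HTML5 decoding rules and is used here only as a stand-in for a browser parser. This is a sketch under that assumption; real browsers, and older ones in particular, may behave differently. The variable names are illustrative.

```python
import html

# A named entity is decoded even though the trailing semicolon is missing.
no_semicolon = html.unescape("1 &lt 2")

# A zero-padded ("overlong") numeric entity decodes to the same character
# as the short form &#60;.
padded = html.unescape("&#0000000060;")

# A value past 2**32: a decoder that keeps only the low 32 bits would end
# up with code point 60, which is the "<" character.
huge = 2**32 + 60
wrapped = huge & 0xFFFFFFFF

# html.unescape does not wrap around; it substitutes U+FFFD instead.
oversized = html.unescape(f"&#{huge};")

print(repr(no_semicolon), repr(padded), wrapped, repr(oversized))
```

A filter that looks only for the literal strings "&lt;" or "&#60;" would miss the first two forms, and a decoder that silently truncates large values could turn the third one into an angle bracket.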

word

Unlike the regular flavor of HTML, XML-based documents may also elegantly incorporate sections using other XML-compliant markup formats, such as MathML, a mathematical formula markup language. This is done by specifying a different xmlns namespace setting for a particular tag, with no need for one-off, language-level hacks.
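
The effect of an xmlns setting can be illustrated with a small sketch that feeds a hypothetical XHTML document with an embedded MathML fragment to Python's xml.etree.ElementTree parser. The document content is made up for the example; the two namespace URIs are the standard ones for XHTML and MathML.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML document that embeds a MathML fragment.
doc = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>Area:</p>
    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <mi>x</mi>
    </math>
  </body>
</html>"""

root = ET.fromstring(doc)

# ElementTree reports every tag as "{namespace}localname", so the switch
# from XHTML to MathML is visible on the <math> element and its children.
tags = [el.tag for el in root.iter()]
for tag in tags:
    namespace, local_name = tag[1:].split("}")
    print(f"{local_name:5} {namespace}")
```

The nested {mi} tag carries no xmlns of its own; it simply inherits the namespace declared on its {math} parent.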

word

Unusually, unlike {a} tags, forms cannot be nested inside each other, and only the top-level {form} tag will remain operational in such a case.

word

Web developers are usually surprised to learn that this syntax can be drastically altered without changing its significance to the browser. For example, Internet Explorer will allow an NUL character (0x00) to be inserted in the location marked at , a change that is likely to throw all naïve HTML filters off the trail. It is also not widely known that the whitespaces at and can be substituted with uncommon vertical tab (0x0B) or form feed (0x0C) characters in all browsers and with a nonbreaking UTF-8 space (0xA0) in Opera.[26] Oh, and here's a really surprising bit: In Firefox, the whitespace at can also be replaced with a single, regular slash—yet the one at can't.

word

When presented with such syntax, most browsers only interpret {i} and treat the "{b" string as an invalid tag parameter. Firefox versions before 4, however, would automatically close the {i} tag upon encountering an angle bracket and, in the end, would interpret both {i} and {b}. In the spirit of fault tolerance, until recently WebKit followed that model, too.

word

When the method value is set to GET or is simply not present at all, all the nested field names and their current values will be escaped using the familiar percent-encoding scheme outlined in Chapter 2, but with two rather arbitrary differences. First, the space character (0x20) will be substituted with the plus sign, rather than encoded as "%20". Second, following from this, any existing plus signs need to be encoded as "%2B", or else they will be misinterpreted as spaces.
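
The two differences can be sketched with Python's urllib.parse module, whose urlencode function applies the same form-style rules, while quote applies plain percent-encoding. The field names and values below are assumptions chosen for illustration.

```python
from urllib.parse import parse_qs, quote, urlencode

# Hypothetical form fields, as a browser would collect them on submit.
fields = {"q": "C++ for beginners", "page": "2"}

# Form-style encoding: spaces become "+", literal plus signs become "%2B".
query = urlencode(fields)
print(query)

# Plain percent-encoding of the same value, for comparison.
plain = quote(fields["q"])
print(plain)

# Decoding the form-style query string restores the original values.
decoded = parse_qs(query)

# If the plus signs had been left unencoded, they would come back as spaces.
misread = parse_qs("q=C++")
print(decoded, misread)
```

The last line shows why the second rule follows from the first: once "+" means a space, a literal plus sign can survive the round trip only in its "%2B" form.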

word

Yet another wonderful quote-related quirk in Internet Explorer makes this job even more complicated. While most browsers recognize quoting only when it is used at the beginning of a parameter value, Internet Explorer simply checks for any occurrence of an equal sign (=) followed by a quote and will parse this syntax in a rather unexpected way:
