Doctype html error

I get the following syntax error in Firebug and I don't get what's it: > syntax error [Break on this error] > <!DOCTYPE html PUBLIC "-//W3C//DTDXHT...org/TR/xhtml1/DTD/xhtml1-strict.dtd"&...

I get the following syntax error in Firebug and I don’t get what’s it:

> syntax error [Break on this error]
> <!DOCTYPE html PUBLIC "-//W3C//DTDXHT...org/TR/xhtml1/DTD/xhtml1-strict.dtd">n

Is it because of the final «n» at the end ?

thanks

ps. I’ve just realized if I remove all scripts, I don’t get that error. For example, if I remove these lines, I don’t get it. If I add another script I get it again, so it doesn’t depend on the script itself.

<script type="text/JavaScript" src="<?php echo $base_url; ?>sites/all/themes/bluemarine/js/main.js"></script>

CODE:

<?php
// $Id: page.tpl.php,v 1.28.2.1 2009/04/30 00:13:31 goba Exp $
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="<?php print $language->language ?>" xml:lang="<?php print $language->language ?>" dir="<?php print $language->dir ?>">
<head>
  <?php print $head ?>
  <title><?php print $head_title ?></title>
  <?php print $styles ?>
  <?php print $scripts ?>
  <script type="text/JavaScript" src="<?php echo $base_url; ?>sites/all/themes/bluemarine/js/main.js"></script>
  <!--<script type="text/JavaScript" src="<?php echo $base_url; ?>sites/all/themes/bluemarine/js/griddy-min.js"></script>
-->

</head>

<body>...

There are many languages in website design. You’ve got HTML, CSS, JavaScript, Java, Python, SQL, PHP — and these are just a few.

Developer creating HTML5 file with HTML doctype declaration

To ensure browsers can parse documents in different languages and display them correctly, it’s important to add a doctype declaration. This basically instructs browsers how to read and render a file.

Let’s take a closer look at doctype HTML below.

Download Now: 25 Free HTML & CSS Hacks

What is doctype HTML?

Doctype HTML is a declaration that tells the browser what version of HTML the document is written in. This declaration appears as the very first line in an HTML file.

Here’s an example of an HTML file with a doctype declaration:

 
<!DOCTYPE html>

<html lang=”en”>

<head>

<title>This is the Title of the Page</title>

</head>

<body>

<h1>This is a Heading</h1>

<p>This is a paragraph.</p>

</body>

</html>

Let’s take a closer look at the HTML doctype declaration below.

HTML Doctype Declaration

The HTML <!DOCTYPE> declaration is not an HTML element or tag. It is an instruction that tells the browser what type of document to expect.

All HTML documents need to start with a <!DOCTYPE> declaration. The declaration varies depending on what version of HTML the document is written in.

Here’s the HTML4 strict declaration, for example:

 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">

Here’s the XHTML 1.0 strict declaration:

 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

In both examples, the information appearing after “PUBLIC” is referring to a Document Type Definition (DTD). A DTD defines the structure of a document and which elements and attributes are valid. In the examples above, the DTD is declared in external files by the World Wide Web Consortium (W3C), which are referenced in quotation marks.

Notice that they both reference strict DTDs, which were used for web pages that excluded attributes and elements that W3C expected to phase out as CSS support grew. There were also transitional and frameset DTDs.

It is no longer necessary to refer to a DTD when writing with the latest version of HTML, HTML5. Let’s take a closer look at the HTML5 doctype declaration below.

HTML5 Doctype

In HTML5, the doctype declaration is <!DOCTYPE html>. This is easy to write and remember, particularly when compared to the complicated doctype declarations of previous versions of HTML.

The <!DOCTYPE> declaration is not case sensitive. Although most commonly written in all uppercase, you can write it as lowercase, sentence case, and so on, and the browser will still recognize it as the HTML doctype declaration.

Here are some examples that would be effective:

 
<!DOCTYPE html>

<!doctype html>

<!Doctype html>

<!DocType html>

This tells the browser that this document type is HTML5 so it knows which elements are valid.

For example, the HTML anchor element, <a>, is valid in the HTML5, HTML4, and XHTML doctypes. The HTML acronym element, <acronym>, is only valid for HTML4 and XHTML — not in HTML5. On the other hand, the HTML dialog element, <dialog>, is only valid for HTML5 — not HTML4 or XHTML.

Doctype HTML Unexpected Token

When creating an HTML file with the doctype declaration, it’s possible that you’ll get an “unexpected token” error message. Here’s how it typically looks:

 
Uncaught SyntaxError: Unexpected token <

It will appear on the same line as the HTML <!DOCTYPE> declaration, but the reason for this error is not the way you wrote the declaration. Instead, it typically has to do with JavaScript.

For example, if you reference a Javascript file — like core.js — in your HTML file and the browser can’t find this file, then it will prompt the «unexpected token» error. This is a common issue for websites running on an Apache server.

To fix this error, you’ll have to replace the file path in your script tag to the correct file path — or you can move the core.js file to the correct location cited in the script tag.

Let’s say your script tag is as follows:

 
<script type="text/javascript" src="/Client/public/core.js"></script>

In that case,  you can either replace “/Client/public/core.js” or move the core.js file to the location /Client/public/.

Now let’s say you click into Client/public and do see a file named “core.js.” In that case, double check the permissions for both the public folder and core.js file. Either of these might be lacking the proper permissions and therefore causing the “unexpected token” error.

Let’s recap everything we discussed in this post, with this handy video.

HubSpot Video

Understanding the Doctype Declaration

The HTML doctype declaration is a fundamental concept of coding. Every HTML document requires a valid doctype declaration so that the browser displays your web page in the way you intended. Now that you understand this concept, you’re already on the path to learning HTML and CSS.

New Call-to-action

Floody

0 / 0 / 0

Регистрация: 27.04.2010

Сообщений: 9

1

30.04.2010, 00:34. Показов 10928. Ответов 3

Метки нет (Все метки)


Вот код (вернее его начало):

HTML5
1
2
3
4
5
6
7
8
9
10
11
12
13
14
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 
  <meta http-equiv="content-type" content="text/html; charset=windows-1251">
  <meta name="generator" content="Bluefish 2.0.0" >
 
  <title>Сайт</title>
  <link rel="stylesheet" href="css/stylekh.css" type="text/css" />
  <!--[if lte IE 7]><link rel="stylesheet" href="css/ie.css" type="text/css" media="screen,projection" /><![endif]-->
  
 
 
  </head>

…..и т.п.

Валидатор на w3c возмущается про ошибку: Validation Output: 1 Error

1. Error Line 1, Column 1: end of document in prolog

<!DOCTYPE html PUBLIC «-//W3C//DTD XHTML 1.0 Transitional//EN»

Что происходит? Еще вчера все это не вызывало у него же нареканий

__________________
Помощь в написании контрольных, курсовых и дипломных работ, диссертаций здесь



0



Programming

Эксперт

94731 / 64177 / 26122

Регистрация: 12.04.2006

Сообщений: 116,782

30.04.2010, 00:34

Ответы с готовыми решениями:

!DOCTYPE
Доброго времени суток!
Подскажите, пожалуйста, есть ли способ узнать какой у браузера !DOCTYPE по…

<!DOCTYPE html>
Использую фреймворк Bootstrap для сайта. Сейчас более-менее стал разбираться в php, уже на создавал…

DOCTYPE html
раньше писали и у казыва в &lt;!DOCTYPE …. строгий там синтаксис или переходной, 1.0 или 4.01,…

Doctype и таблицы
Всем привет, у меня вопрос, почему при добавлении в текст html документа &lt;!DOCTYPE html&gt; в таблица…

3

lesha_firs

143 / 109 / 13

Регистрация: 13.02.2010

Сообщений: 522

30.04.2010, 02:02

2

используй этот

HTML5
1
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

Добавлено через 47 секунд
Ты проста для xml использовал!



0



cooperOk

60 / 60 / 6

Регистрация: 12.11.2009

Сообщений: 169

30.04.2010, 11:17

3

Ну вообще доктайп нормальный… Единственное что у вас может вызвать ошибку это не закрытые теги meta. По стандарту xhtml все теги должны быть закрыты, даже одинарные. В вашем случае должно быть такая запись

HTML5
1
2
3
4
...
<meta http-equiv="content-type" content="text/html; charset=windows-1251" />
<meta name="generator" content="Bluefish 2.0.0" />
...



0



Floody

0 / 0 / 0

Регистрация: 27.04.2010

Сообщений: 9

30.04.2010, 12:27

 [ТС]

4

cooperOk, спасибо, что указали на ошибку — я просто вчера после часа борьбы и напившись от огорчения водкой скопировала все немного не оттуда. На самом деле код у файла, который вчера лежал на сервере и проверялся по линку, вот такой:

HTML5
1
2
3
4
5
6
7
8
9
10
11
12
13
# <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
# "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
# <html xmlns="http://www.w3.org/1999/xhtml">
#
#
# <head>
# <meta http-equiv="content-type" content="text/html; charset=windows-1251" />
# <meta name="generator" content="Bluefish 2.0.0" />
#
# <title>Сайт</title>
# <link rel="stylesheet" href="css/stylekh.css" type="text/css" />
# <!--[if lte IE 7]><link rel="stylesheet" href="css/ie.css" type="text/css" media="screen,projection" /><![endif]-->
# </head>

Он и вчера был такой и сегодня остался… так вот сегодня после проверки я получаю сообщение «This document was successfully checked as XHTML 1.0 Transitional!»

Ничего не поняла, но проблема исчезла, поэтому всем спасибо за ответы

Добавлено через 1 минуту

Цитата
Сообщение от lesha_firs
Посмотреть сообщение

используй этот

HTML5
1
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

Добавлено через 47 секунд
Ты проста для xml использовал!

В след.раз в случае возникновения проблемы попробую и такой поюзать, thx



0



IT_Exp

Эксперт

87844 / 49110 / 22898

Регистрация: 17.06.2006

Сообщений: 92,604

30.04.2010, 12:27

Помогаю со студенческими работами здесь

Выбор doctype
кто может помочь с выборот Doctype? и какая между ними разница,так же замечал что некоторые пишут…

!DOCTYPE html
ребята можно исползоват это
&lt;!DOCTYPE html PUBLIC &quot;-//W3C//DTD XHTML 1.0 Strict//EN&quot;…

DOCTYPE и CSS
Люди у меня DOCTYPE конфликтует с CSS. Долго не мог понять почему body -&gt; Text-Align -&gt; center не…

Mozilla & DOCTYPE
Доброго времени суток! Mozilla Firefox 3.6 как-то не корректно работает с DOCTYPE. У меня на…

Искать еще темы с ответами

Или воспользуйтесь поиском по форуму:

4

Тег DOCTYPE означает «объявление типа документа». Он сообщает веб-браузерам, какую версию HTML использует ваша страница. Прописывать DOCTYPE необходимо для переключения браузера в режим соответствия стандартам данной версии. DOCTYPE всегда должен быть самой первой строкой HTML-кода страницы.

Хотя современные браузеры не принимают во внимание незначительные ошибки в коде HTML, неправильный DOCTYPE является серьезной проблемой, которая может существенно повлиять на доступность вашего веб-контента. Отчет Labrika «Неверный HTML DOCTYPE» позволяет установить наличие таких ошибок.

Содержание отчета

Отчет показывает список URL-адресов страниц, в HTML-коде которых DOCTYPE прописан неверно.

список URL-адресов страниц, в HTML-коде которых DOCTYPE прописан неверно

Использование отчета

Найдите страницы с  некорректным DOCTYPE и пропишите его правильно.

Синтаксис:

<!DOCTYPE [Элемент верхнего уровня] [Публичность] "[Регистрация]//[Организация]//[Тип] [Имя]//[Язык]" "[URL]">

Параметры:

  • Элемент верхнего уровня — указывает элемент верхнего уровня в документе, для HTML это тег <html>.
  • Публичность — указывает, является объект публичным (значение PUBLIC) или системным (значение SYSTEM) ресурсом, например, таким как локальный файл. Для HTML/XHTML выбирается значение PUBLIC.
  • Регистрация — сообщает, зарегистрирован ли разработчик DTD (document type definition, описания типа документа) в международной организации по стандартизации (International Organization for Standardization, ISO). Принимает одно из двух значений: плюс (+) — разработчик зарегистрирован в ISO или минус (-) — разработчик не зарегистрирован. Для W3C указывается значение «-».
  • Организация — уникальное название организации, разработавшей DTD. Официально HTML/XHTML публикует W3C (англ. World Wide Web Consortium — Консорциум Всемирной паутины). Нназвание этой организации и пишется в теге DOCTYPE.
  • Тип описываемого документа. Для HTML/XHTML указывается значение DTD.
  • Имя — уникальное имя документа для описания DTD.
  • Язык — указывается язык, на котором написан текст для описания объекта. Содержит две буквы, пишется в верхнем регистре. Для документа HTML/XHTML нужно прописывать английский язык (EN).
  • URL — адрес документа с DTD.

Существует несколько видов DOCTYPE — они различаются в зависимости от используемой на странице версии языка разметки гипертекста.

Пример для версии HTML 4.01:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

Спецификация HTML5 предлагает следующий вариант написания DOCTYPE для всех документов:

<!DOCTYPE html>.

You’ve done all the right stuff, but your site doesn’t look or work as it should in the latest browsers.

Article Continues Below

You’ve written valid XHTML and CSS. You’ve used the W3C standard Document Object Model (DOM) to manipulate dynamic page elements. Yet, in browsers designed to support these very standards, your site is failing. A faulty DOCTYPE is likely to blame.

This little article will provide you with DOCTYPEs that work, and explain the practical, real–world effect of these seemingly abstract tags.

WHY A DOCTYPE?#section2

Per HTML and XHTML standards, a DOCTYPE (short for “document type declaration”) informs the validator which version of (X)HTML you’re using, and must appear at the very top of every web page. DOCTYPEs are a key component of compliant web pages: your markup and CSS won’t validate without them.

As mentioned in previous ALA articles (and in other interesting places), DOCTYPES are also essential to the proper rendering and functioning of web documents in compliant browsers like Mozilla, IE5/Mac, and IE6/Win.

A recent DOCTYPE that includes a full URI (a complete web address) tells these browsers to render your page in standards–compliant mode, treating your (X)HTML, CSS, and DOM as you expect them to be treated.

Using an incomplete or outdated DOCTYPE—or no DOCTYPE at all—throws these same browsers into “Quirks” mode, where the browser assumes you’ve written old-fashioned, invalid markup and code per the depressing industry norms of the late 1990s.

In this setting, the browser will attempt to parse your page in backward–compatible fashion, rendering your CSS as it might have looked in IE4, and reverting to a proprietary, browser–specific DOM. (IE reverts to the IE DOM; Mozilla and Netscape 6 revert to who knows what.)

Clearly, this is not what you want. But it is often what you’ll get, due to the preponderance of incorrect or incomplete DOCTYPE information this article hopes to correct.

(Note: The Opera browser does not play by these rules; it always attempts to render pages in standards–compliant mode. Go, Opera! On the other hand, Opera does not yet offer solid support for the W3C DOM. But they’re working on it.) {Ed: Since this article was first published, Opera has delivered the DOM-compliant Opera 7 browser.}

WHERE HAVE ALL THE DOCTYPES GONE?#section3

Since DOCTYPES are vital to the proper functioning of web standards in browsers, and since W3C is a leading creator of web standards, you might expect W3C’s website to provide a listing of proper DOCTYPEs, and you might also expect to be able to find this information quickly and easily in a single location. But as of this writing, you can’t. {Ed. Prompted in part by this article, the W3C now lists standard DOCTYPEs on its site. You will find the listing a few screens into the W3C tutorial, “My Web site is standard. And yours?”}

W3.org is not A List Apart, WebReference, or Webmonkey. It’s not intended to help web designers, developers, and content folks get up to speed on the latest technological recommendations and practices. That’s not its job.

W3C does publish a series of tutorials, though most web developers are unaware of it. Mainly, though, W3C’s site houses a collection of proposals, drafts, and Recommendations, written by geeks for geeks. And when I say geeks, I don’t mean ordinary web professionals like you and me. I mean geeks who make the rest of us look like Grandma on the first day She’s Got Mail.™

You can search for DOCTYPEs all day at w3.org without finding one page that lists them all. And when you do hunt down a DOCTYPE (generally in relation to a particular Recommendation or Working Draft), it’s often one that won’t work on your site.

Scattered throughout W3C’s site are DOCTYPEs with missing URIs, and DOCTYPEs with relative URIs that point to documents on W3C’s own site. Once removed from W3C’s site and used on your web pages, these URIs point to non–existent documents, thus fouling up your best efforts and the browser’s.

For instance, many sites sport this DOCTYPE, copied and pasted directly from w3.org:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">

If you look at the last part of the DOCTYPE (“DTD/xhtml1-strict.dtd”), you’ll see that it is a relative link to a document on W3C’s site. Since that document is on W3C’s site but not yours, the URI is useless to the browser.

The DOCTYPE you’d actually want to use is:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

Notice that the latter DOCTYPE includes a complete URI at the end of the tag. Since the tag provides a valid location on the web, the browser knows where to find it, and will render your document in standards–compliant mode.

DOCTYPES THAT WORK#section4

So what DOCTYPEs should we use? Glad you asked. The following complete DOCTYPEs are the ones we need:

HTML 4.01 Strict, Transitional, Frameset#section5

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
"http://www.w3.org/TR/html4/frameset.dtd">

XHTML 1.0 Strict, Transitional, Frameset#section6

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">

XHTML 1.1 DTD#section7

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" 
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

NEXT STEPS#section8

How can you help improve support for standards on the web? Besides bookmarking this page (and copying and pasting these DOCTYPEs for your own use), if your web editor inserts DOCTYPEs, you might want to check them out and compare them to the list above.
Many well–intentioned software makers have cut and pasted incomplete DOCTYPEs from W3C into their software. Result: when you use these programs’ built–in functionality to insert DOCTYPEs in your pages, the browsers go into Quirks mode, undoing all your hard work.

It’s worth contacting the folks who make your favorite authoring package, showing them the proper DOCTYPEs, and politely requesting them to address this issue in an incremental upgrade. (In some cases, you may also be able to modify your editor yourself.)

COMING SOON#section9

We have every reason to believe that W3C’s site will soon sport a handy listing of accurate, usable DOCTYPES and other essential information in an easy–to–find location. In fact, Karl Dubost, Conformance Manager of W3C’s Quality Assurance team,  contributed to the information in this little article.

Likewise, when The Web Standards Project relaunches (real soon), it will also provide this information. {Ed. note: The Web Standards Project relaunched in late 2002.}

But sites are being designed and built every day, and you need this information now—so there it is.

Happy authoring and rendering!

Rocket Validator checks HTML on your sites using the latest version of W3C Validator Nu HTML Checker, which is intended for HTML5 documents.

The page scanned is using an obsolete doctype, instead of the expected <!DOCTYPE html>.

HTML documents are expected to start with a first line containing the Document Type Declaration, that defines the HTML version used. Since HTML5, it’s just <!DOCTYPE html>, which must appear before the start <html> tag.

Here’s an example of a minimal HTML5 document:

<!DOCTYPE html>
<html>
  <head>
    <title>Test</title>
  </head>
  <body>
    <p></p>
  </body>
</html>

All HTML documents must start with a <!DOCTYPE> (Document Type Declaration), that informs browsers about the type and version of HTML used to build the document. In HTML5, this is simply <!DOCTYPE html> and must appear at the start of the document.

Here is an example of a minimal HTML document, including the Document Type Declaration at its start:

<!DOCTYPE html>
<html>
  <head>
    <title>Test</title>
  </head>
  <body>
    <p></p>
  </body>
</html>

6,250 HTML checks per week. Fully automated.

Save time using our automated web checker. Let our crawler check your web pages on the W3C Validator.

WHATWG

Living Standard — Last Updated 6 February 2023

    1. 13.2 Parsing HTML documents
      1. 13.2.1 Overview of the parsing model
      2. 13.2.2 Parse errors
      3. 13.2.3 The input byte stream
        1. 13.2.3.1 Parsing with a known character encoding
        2. 13.2.3.2 Determining the character encoding
        3. 13.2.3.3 Character encodings
        4. 13.2.3.4 Changing the encoding while parsing
        5. 13.2.3.5 Preprocessing the input stream
      4. 13.2.4 Parse state
        1. 13.2.4.1 The insertion mode
        2. 13.2.4.2 The stack of open elements
        3. 13.2.4.3 The list of active formatting elements
        4. 13.2.4.4 The element pointers
        5. 13.2.4.5 Other parsing state flags
      5. 13.2.5 Tokenization
        1. 13.2.5.1 Data state
        2. 13.2.5.2 RCDATA state
        3. 13.2.5.3 RAWTEXT state
        4. 13.2.5.4 Script data state
        5. 13.2.5.5 PLAINTEXT state
        6. 13.2.5.6 Tag open state
        7. 13.2.5.7 End tag open state
        8. 13.2.5.8 Tag name state
        9. 13.2.5.9 RCDATA less-than sign state
        10. 13.2.5.10 RCDATA end tag open state
        11. 13.2.5.11 RCDATA end tag name state
        12. 13.2.5.12 RAWTEXT less-than sign state
        13. 13.2.5.13 RAWTEXT end tag open state
        14. 13.2.5.14 RAWTEXT end tag name state
        15. 13.2.5.15 Script data less-than sign state
        16. 13.2.5.16 Script data end tag open state
        17. 13.2.5.17 Script data end tag name state
        18. 13.2.5.18 Script data escape start state
        19. 13.2.5.19 Script data escape start dash state
        20. 13.2.5.20 Script data escaped state
        21. 13.2.5.21 Script data escaped dash state
        22. 13.2.5.22 Script data escaped dash dash state
        23. 13.2.5.23 Script data escaped less-than sign state
        24. 13.2.5.24 Script data escaped end tag open state
        25. 13.2.5.25 Script data escaped end tag name state
        26. 13.2.5.26 Script data double escape start state
        27. 13.2.5.27 Script data double escaped state
        28. 13.2.5.28 Script data double escaped dash state
        29. 13.2.5.29 Script data double escaped dash dash state
        30. 13.2.5.30 Script data double escaped less-than sign state
        31. 13.2.5.31 Script data double escape end state
        32. 13.2.5.32 Before attribute name state
        33. 13.2.5.33 Attribute name state
        34. 13.2.5.34 After attribute name state
        35. 13.2.5.35 Before attribute value state
        36. 13.2.5.36 Attribute value (double-quoted) state
        37. 13.2.5.37 Attribute value (single-quoted) state
        38. 13.2.5.38 Attribute value (unquoted) state
        39. 13.2.5.39 After attribute value (quoted) state
        40. 13.2.5.40 Self-closing start tag state
        41. 13.2.5.41 Bogus comment state
        42. 13.2.5.42 Markup declaration open state
        43. 13.2.5.43 Comment start state
        44. 13.2.5.44 Comment start dash state
        45. 13.2.5.45 Comment state
        46. 13.2.5.46 Comment less-than sign state
        47. 13.2.5.47 Comment less-than sign bang state
        48. 13.2.5.48 Comment less-than sign bang dash state
        49. 13.2.5.49 Comment less-than sign bang dash dash state
        50. 13.2.5.50 Comment end dash state
        51. 13.2.5.51 Comment end state
        52. 13.2.5.52 Comment end bang state
        53. 13.2.5.53 DOCTYPE state
        54. 13.2.5.54 Before DOCTYPE name state
        55. 13.2.5.55 DOCTYPE name state
        56. 13.2.5.56 After DOCTYPE name state
        57. 13.2.5.57 After DOCTYPE public keyword state
        58. 13.2.5.58 Before DOCTYPE public identifier state
        59. 13.2.5.59 DOCTYPE public identifier (double-quoted) state
        60. 13.2.5.60 DOCTYPE public identifier (single-quoted) state
        61. 13.2.5.61 After DOCTYPE public identifier state
        62. 13.2.5.62 Between DOCTYPE public and system identifiers state
        63. 13.2.5.63 After DOCTYPE system keyword state
        64. 13.2.5.64 Before DOCTYPE system identifier state
        65. 13.2.5.65 DOCTYPE system identifier (double-quoted) state
        66. 13.2.5.66 DOCTYPE system identifier (single-quoted) state
        67. 13.2.5.67 After DOCTYPE system identifier state
        68. 13.2.5.68 Bogus DOCTYPE state
        69. 13.2.5.69 CDATA section state
        70. 13.2.5.70 CDATA section bracket state
        71. 13.2.5.71 CDATA section end state
        72. 13.2.5.72 Character reference state
        73. 13.2.5.73 Named character reference state
        74. 13.2.5.74 Ambiguous ampersand state
        75. 13.2.5.75 Numeric character reference state
        76. 13.2.5.76 Hexadecimal character reference start state
        77. 13.2.5.77 Decimal character reference start state
        78. 13.2.5.78 Hexadecimal character reference state
        79. 13.2.5.79 Decimal character reference state
        80. 13.2.5.80 Numeric character reference end state
      6. 13.2.6 Tree construction
        1. 13.2.6.1 Creating and inserting nodes
        2. 13.2.6.2 Parsing elements that contain only text
        3. 13.2.6.3 Closing elements that have implied end tags
        4. 13.2.6.4 The rules for parsing tokens in HTML content
          1. 13.2.6.4.1 The «initial» insertion mode
          2. 13.2.6.4.2 The «before html» insertion mode
          3. 13.2.6.4.3 The «before head» insertion mode
          4. 13.2.6.4.4 The «in head» insertion mode
          5. 13.2.6.4.5 The «in head noscript» insertion mode
          6. 13.2.6.4.6 The «after head» insertion mode
          7. 13.2.6.4.7 The «in body» insertion mode
          8. 13.2.6.4.8 The «text» insertion mode
          9. 13.2.6.4.9 The «in table» insertion mode
          10. 13.2.6.4.10 The «in table text» insertion mode
          11. 13.2.6.4.11 The «in caption» insertion mode
          12. 13.2.6.4.12 The «in column group» insertion mode
          13. 13.2.6.4.13 The «in table body» insertion mode
          14. 13.2.6.4.14 The «in row» insertion mode
          15. 13.2.6.4.15 The «in cell» insertion mode
          16. 13.2.6.4.16 The «in select» insertion mode
          17. 13.2.6.4.17 The «in select in table» insertion mode
          18. 13.2.6.4.18 The «in template» insertion mode
          19. 13.2.6.4.19 The «after body» insertion mode
          20. 13.2.6.4.20 The «in frameset» insertion mode
          21. 13.2.6.4.21 The «after frameset» insertion mode
          22. 13.2.6.4.22 The «after after body» insertion mode
          23. 13.2.6.4.23 The «after after frameset» insertion mode
        5. 13.2.6.5 The rules for parsing tokens in foreign content
      7. 13.2.7 The end
      8. 13.2.8 Speculative HTML parsing
      9. 13.2.9 Coercing an HTML DOM into an infoset
      10. 13.2.10 An introduction to error handling and strange cases in the parser
        1. 13.2.10.1 Misnested tags: <b><i></b></i>
        2. 13.2.10.2 Misnested tags: <b><p></b></p>
        3. 13.2.10.3 Unexpected markup in tables
        4. 13.2.10.4 Scripts that modify the page as it is being parsed
        5. 13.2.10.5 The execution of scripts that are moving across multiple documents
        6. 13.2.10.6 Unclosed formatting elements
    2. 13.3 Serializing HTML fragments
    3. 13.4 Parsing HTML fragments

13.2 Parsing HTML documents

This section only applies to user agents, data mining tools, and conformance
checkers.

The rules for parsing XML documents into DOM trees are covered by the next
section, entitled «The XML syntax».

User agents must use the parsing rules described in this section to generate the DOM trees from
text/html resources. Together, these rules define what is referred to as the
HTML parser.

While the HTML syntax described in this specification bears a close resemblance to SGML and
XML, it is a separate language with its own parsing rules.

Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used
SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for
HTML documents; the only user agents to strictly handle HTML as an SGML application have
historically been validators. The resulting confusion — with validators claiming documents
to have one representation while widely deployed web browsers interoperably implemented a
different representation — has wasted decades of productivity. This version of HTML thus
returns to a non-SGML basis.

Authors interested in using SGML tools in their authoring pipeline are encouraged to use XML
tools and the XML serialization of HTML.

For the purposes of conformance checkers, if a resource is determined to be in the HTML
syntax, then it is an HTML document.

As stated in the terminology section,
references to element types that do not explicitly specify a
namespace always refer to elements in the HTML namespace. For example, if the spec
talks about «a element», then that is an element with the local name «menu«, the namespace «http://www.w3.org/1999/xhtml«, and
the interface . Where possible, references to such elements are
hyperlinked to their definition.

13.2.1 Overview of the parsing model

The input to the HTML parsing process consists of a stream of code
points, which is passed through a tokenization stage followed by a tree
construction stage. The output is a Document object.

Implementations that do not support scripting do not
have to actually create a DOM Document object, but the DOM tree in such cases is
still used as the model for the rest of the specification.

In the common case, the data handled by the tokenization stage comes from the network, but
it can also come from script running in the user
agent, e.g. using the document.write() API.

There is only one set of states for the tokenizer stage and the tree
construction stage, but the tree construction stage is reentrant, meaning that while the tree
construction stage is handling one token, the tokenizer might be resumed, causing further tokens
to be emitted and processed before the first token’s processing is complete.

In the following example, the tree construction stage will be called upon to handle a «p»
start tag token while handling the «script» end tag token:

...
<script>
 document.write('<p>');
</script>
...

To handle these cases, parsers have a script nesting level, which must be initially
set to zero, and a parser pause flag, which must be initially set to false.

13.2.2 Parse errors

This specification defines the parsing rules for HTML documents, whether they are syntactically
correct or not. Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined (that’s the
processing rules described throughout this specification), but user agents, while parsing an HTML
document, may abort the parser at the first parse
error that they encounter for which they do not wish to apply the rules described in this
specification.

Conformance checkers must report at least one parse error condition to the user if one or more
parse error conditions exist in the document and must not report parse error conditions if none
exist in the document. Conformance checkers may report more than one parse error condition if more
than one parse error condition exists in the document.

Parse errors are only errors with the syntax of HTML. In addition to
checking for parse errors, conformance checkers will also verify that the document obeys all the
other conformance requirements described in this specification.

Some parse errors have dedicated codes outlined in the table below that should be used by
conformance checkers in reports.

Error descriptions in the table below are non-normative.

Code Description
abrupt-closing-of-empty-comment

This error occurs if the parser encounters an empty comment that is abruptly closed by a U+003E (>) code
point (i.e., <!--> or <!--->). The
parser behaves as if the comment is closed correctly.

abrupt-doctype-public-identifier

This error occurs if the parser encounters a U+003E (>) code point in the
DOCTYPE public identifier (e.g., <!DOCTYPE html PUBLIC "foo>). In such a case, if the DOCTYPE is correctly
placed as a document preamble, the parser sets the Document to quirks
mode.

abrupt-doctype-system-identifier

This error occurs if the parser encounters a U+003E (>) code point in the
DOCTYPE system identifier (e.g., <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "foo>). In such a case,
if the DOCTYPE is correctly placed as a document preamble, the parser sets the
Document to quirks mode.

absence-of-digits-in-numeric-character-reference

This error occurs if the parser encounters a numeric character reference that doesn’t contain any digits (e.g., &#qux;). In this case the parser doesn’t resolve the character
reference.

cdata-in-html-content

This error occurs if the parser encounters a CDATA
section outside of foreign content (SVG or MathML). The parser treats such CDATA
sections (including leading «[CDATA[» and trailing «]]» strings) as comments.

character-reference-outside-unicode-range

This error occurs if the parser encounters a numeric character reference that references a code point
that is greater than the valid Unicode range. The parser resolves such a character reference to
a U+FFFD REPLACEMENT CHARACTER.

control-character-in-input-stream

This error occurs if the input stream contains a control code point that is not ASCII
whitespace or U+0000 NULL. Such code points are parsed as-is and usually, where parsing
rules don’t apply any additional restrictions, make their way into the DOM.

control-character-reference

This error occurs if the parser encounters a numeric character reference that references a control code point that is not ASCII
whitespace or is a U+000D CARRIAGE RETURN. The parser resolves such character references
as-is except C1 control references that are replaced according to the numeric character
reference end state.

end-tag-with-attributes

This error occurs if the parser encounters an end
tag with attributes. Attributes in end tags are
ignored and do not make their way into the DOM.

duplicate-attribute

This error occurs if the parser encounters an attribute in a tag that already has an attribute with the
same name. The parser ignores all such duplicate occurrences of the attribute.

end-tag-with-trailing-solidus

This error occurs if the parser encounters an end
tag that has a U+002F (/) code point right before the closing U+003E (>)
code point (e.g., </div/>). Such a tag is treated as a regular end
tag.

eof-before-tag-name

This error occurs if the parser encounters the end of the input stream
where a tag name is expected. In this case the parser treats the beginning of a start tag (i.e., <) or an end tag (i.e., </) as text
content.

eof-in-cdata

This error occurs if the parser encounters the end of the input stream in a
CDATA section. The parser treats such CDATA sections as if
they are closed immediately before the end of the input stream.

eof-in-comment

This error occurs if the parser encounters the end of the input stream in a
comment. The parser treats such comments as if they are
closed immediately before the end of the input stream.

eof-in-doctype

This error occurs if the parser encounters the end of the input stream in a DOCTYPE. In such a case, if the DOCTYPE is correctly placed as a
document preamble, the parser sets the Document to quirks mode.

eof-in-script-html-comment-like-text

This error occurs if the parser encounters the end of the input stream in text
that resembles an HTML comment inside
script element content (e.g., <script><!-- foo).

Syntactic structures that resemble HTML comments in script
elements are parsed as text content. They can be a part of a scripting language-specific
syntactic structure or be treated as an HTML-like comment, if the scripting language supports
them (e.g., parsing rules for HTML-like comments can be found in Annex B of the JavaScript
specification). The common reason for this error is a violation of the restrictions for contents of script elements. [JAVASCRIPT]

eof-in-tag

This error occurs if the parser encounters the end of the input stream in a
start tag or an end
tag (e.g., <div id=). Such a tag is ignored.

incorrectly-closed-comment

This error occurs if the parser encounters a comment that is closed by the «--!>»
code point sequence. The parser treats such comments as if they are correctly
closed by the «-->» code point sequence.

incorrectly-opened-comment

This error occurs if the parser encounters the «<!» code
point sequence that is not immediately followed by two U+002D (-) code points and that
is not the start of a DOCTYPE or a CDATA section. All content that follows the «<!» code point sequence up to a U+003E (>) code point (if present) or to
the end of the input stream is treated as a comment.

One possible cause of this error is using an XML markup declaration (e.g.,
<!ELEMENT br EMPTY>) in HTML.

invalid-character-sequence-after-doctype-name

This error occurs if the parser encounters any code point sequence other
than «PUBLIC» and «SYSTEM» keywords after a DOCTYPE name. In such a case, the parser ignores any following
public or system identifiers, and if the DOCTYPE is correctly placed as a document preamble,
and if the parser cannot change the mode flag is false, sets the Document
to quirks mode.

invalid-first-character-of-tag-name

This error occurs if the parser encounters a code point that is not an
ASCII alpha where first code point of a start
tag name or an end tag name is expected. If a
start tag was expected such code point and a preceding U+003C (<) is treated as text
content, and all content that follows is treated as markup. Whereas, if an end tag was
expected, such code point and all content that follows up to a U+003E (>) code point (if
present) or to the end of the input stream is treated as a comment.

For example, consider the following markup:

<42></42>

This will be parsed into:

  • html
    • head
    • body
      • #text: <42>
      • : 42

While the first code point of a tag name is limited to an ASCII
alpha, a wide range of code points (including ASCII digits) is allowed in
subsequent positions.

missing-attribute-value

This error occurs if the parser encounters a U+003E (>) code point where an
attribute value is expected (e.g., <div id=>). The parser treats the attribute as having an empty value.

missing-doctype-name

This error occurs if the parser encounters a DOCTYPE that is missing a name (e.g., <!DOCTYPE>). In such a case, if the DOCTYPE is correctly placed as a
document preamble, the parser sets the Document to quirks mode.

missing-doctype-public-identifier

This error occurs if the parser encounters a U+003E (>) code point where
start of the DOCTYPE public identifier is expected (e.g.,
<!DOCTYPE html PUBLIC >). In such a case, if the DOCTYPE is correctly
placed as a document preamble, the parser sets the Document to quirks
mode.

missing-doctype-system-identifier

This error occurs if the parser encounters a U+003E (>) code point where
start of the DOCTYPE system identifier is expected (e.g.,
<!DOCTYPE html SYSTEM >). In such a case, if the DOCTYPE is correctly
placed as a document preamble, the parser sets the Document to quirks
mode.

missing-end-tag-name

This error occurs if the parser encounters a U+003E (>) code point where an
end tag name is expected, i.e., </>. The parser ignores the whole «</>» code
point sequence.

missing-quote-before-doctype-public-identifier

This error occurs if the parser encounters the DOCTYPE public identifier that is not preceded by a quote (e.g.,
<!DOCTYPE html PUBLIC -//W3C//DTD HTML 4.01//EN">). In such a case,
the parser ignores the public identifier, and if the DOCTYPE is correctly placed as a document
preamble, sets the Document to quirks mode.

missing-quote-before-doctype-system-identifier

This error occurs if the parser encounters the DOCTYPE system identifier that is not preceded by a quote (e.g.,
<!DOCTYPE html SYSTEM
http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
). In such a case, the parser
ignores the system identifier, and if the DOCTYPE is correctly placed as a document preamble,
sets the Document to quirks mode.

missing-semicolon-after-character-reference

This error occurs if the parser encounters a character
reference that is not terminated by a U+003B (;) code point. Usually the
parser behaves as if character reference is terminated by the U+003B (;) code point; however,
there are some ambiguous cases in which the parser includes subsequent code points in the
character reference.

For example, &not;in will be parsed as «¬in» whereas &notin will be parsed as ««.

missing-whitespace-after-doctype-public-keyword

This error occurs if the parser encounters a DOCTYPE whose «PUBLIC» keyword and public
identifier are not separated by ASCII whitespace. In this case the parser behaves
as if ASCII whitespace is present.

missing-whitespace-after-doctype-system-keyword

This error occurs if the parser encounters a DOCTYPE whose «SYSTEM» keyword and system
identifier are not separated by ASCII whitespace. In this case the parser behaves
as if ASCII whitespace is present.

missing-whitespace-before-doctype-name

This error occurs if the parser encounters a DOCTYPE whose «DOCTYPE» keyword and name
are not separated by ASCII whitespace. In this case the parser behaves as if ASCII
whitespace is present.

missing-whitespace-between-attributes

This error occurs if the parser encounters attributes that are not separated by ASCII
whitespace (e.g., <div id="foo"class="bar">). In this case the
parser behaves as if ASCII whitespace is present.

missing-whitespace-between-doctype-public-and-system-identifiers

This error occurs if the parser encounters a DOCTYPE whose public and system identifiers are not separated by
ASCII whitespace. In this case the parser behaves as if ASCII whitespace is
present.

nested-comment

This error occurs if the parser encounters a nested comment (e.g., <!-- <!-- nested -->
-->
). Such a comment will be closed by the first occurring «-->»
code point sequence and everything that follows will be treated as markup.

noncharacter-character-reference

This error occurs if the parser encounters a numeric character reference that references a noncharacter.
The parser resolves such character references as-is.

noncharacter-in-input-stream

This error occurs if the input stream contains a noncharacter.
Such code points are parsed as-is and usually, where parsing
rules don’t apply any additional restrictions, make their way into the DOM.

non-void-html-element-start-tag-with-trailing-solidus

This error occurs if the parser encounters a start
tag for an element that is not in the list of void elements or is not a
part of foreign content (i.e., not an SVG or MathML element) that has a U+002F (/) code
point right before the closing U+003E (>) code point. The parser behaves as if the
U+002F (/) is not present.

For example, consider the following markup:

<div/><span></span><span></span>

This will be parsed into:

  • html
    • head
    • body
      • div
        • span
        • span

The trailing U+002F (/) in a start tag name can be used only in foreign
content to specify self-closing tags. (Self-closing tags don’t exist in HTML.) It is also
allowed for void elements, but doesn’t have any effect in this case.

null-character-reference

This error occurs if the parser encounters a numeric character reference that references a U+0000 NULL code
point. The parser resolves such character references to a U+FFFD REPLACEMENT
CHARACTER.

surrogate-character-reference

This error occurs if the parser encounters a numeric character reference that references a surrogate.
The parser resolves such character references to a U+FFFD REPLACEMENT CHARACTER.

surrogate-in-input-stream

This error occurs if the input stream contains a surrogate. Such code points are
parsed as-is and usually, where parsing rules don’t apply any additional restrictions, make
their way into the DOM.

Surrogates can only find their way into the input stream via script APIs such
as document.write().

unexpected-character-after-doctype-system-identifier

This error occurs if the parser encounters any code
points other than ASCII whitespace or closing U+003E (>) after the DOCTYPE system identifier. The parser ignores these code
points.

unexpected-character-in-attribute-name

This error occurs if the parser encounters a U+0022 («), U+0027 (‘), or U+003C (<)
code point in an attribute name. The
parser includes such code points in the attribute name.

Code points that trigger this error are usually a part of another syntactic
construct and can be a sign of a typo around the attribute name.

For example, consider the following markup:

<div foo<div>

Due to a forgotten U+003E (>) code point after foo the parser
treats this markup as a single div element with a «foo<div» attribute.

As another example of this error, consider the following markup:

<div id'bar'>

Due to a forgotten U+003D (=) code point between an attribute name and value the parser
treats this markup as a div element with the attribute «id'bar'» that has an empty value.

unexpected-character-in-unquoted-attribute-value

This error occurs if the parser encounters a U+0022 («), U+0027 (‘), U+003C (<), U+003D
(=), or U+0060 (`) code point in an unquoted attribute value. The parser includes such code points
in the attribute value.

Code points that trigger this error are usually a part of another syntactic
construct and can be a sign of a typo around the attribute value.

U+0060 (`) is in the list of code points that trigger this error because
certain legacy user agents treat it as a quote.

For example, consider the following markup:

<div foo=b'ar'>

Due to a misplaced U+0027 (‘) code point the parser sets the value of the «foo» attribute to «b'ar'«.

unexpected-equals-sign-before-attribute-name

This error occurs if the parser encounters a U+003D (=) code point before an
attribute name. In this case the parser treats U+003D (=) as the first code point of the
attribute name.

The common reason for this error is a forgotten attribute name.

For example, consider the following markup:

<div foo="bar" ="baz">

Due to a forgotten attribute name the parser treats this markup as a div
element with two attributes: a «foo» attribute with a «bar» value and a «="baz"» attribute with an empty
value.

unexpected-null-character

This error occurs if the parser encounters a U+0000 NULL code point in the
input stream in certain positions. In general, such code points are either
ignored or, for security reasons, replaced with a U+FFFD REPLACEMENT CHARACTER.

unexpected-question-mark-instead-of-tag-name

This error occurs if the parser encounters a U+003F (?) code point where first
code point of a start tag name is expected. The U+003F
(?) and all content that follows up to a U+003E (>) code point (if present) or to the end of
the input stream is treated as a comment.

For example, consider the following markup:

<?xml-stylesheet type="text/css" href="style.css"?>

This will be parsed into:

  • : ?xml-stylesheet type=»text/css» href=»style.css»?
  • html
    • head
    • body

The common reason for this error is an XML processing instruction (e.g., <?xml-stylesheet type="text/css" href="style.css"?>) or an XML declaration
(e.g., <?xml version="1.0" encoding="UTF-8"?>) being used in
HTML.

unexpected-solidus-in-tag

This error occurs if the parser encounters a U+002F (/) code point that is
not a part of a quoted attribute value and not
immediately followed by a U+003E (>) code point in a tag (e.g., <div /
id="foo">
). In this case the parser behaves as if it encountered ASCII
whitespace.

unknown-named-character-reference

This error occurs if the parser encounters an ambiguous ampersand. In this case the parser doesn’t
resolve the character reference.

13.2.3 The input byte stream

The stream of code points that comprises the input to the tokenization stage will be initially
seen by the user agent as a stream of bytes (typically coming over the network or from the local
file system). The bytes encode the actual characters according to a particular character
encoding
, which the user agent uses to decode the bytes into characters.

For XML documents, the algorithm user agents are required to use to determine the
character encoding is given by XML. This section does not apply to XML
documents. [XML]

Usually, the encoding sniffing algorithm defined below is used to determine the
character encoding.

Given a character encoding, the bytes in the input byte stream must be converted
to characters for the tokenizer’s input stream, by passing the input byte
stream and character encoding to decode.

A leading Byte Order Mark (BOM) causes the character encoding argument to be
ignored and will itself be skipped.

Bytes or sequences of bytes in the original byte stream that did not conform to
the Encoding standard (e.g. invalid UTF-8 byte sequences in a UTF-8 input byte stream) are errors
that conformance checkers are expected to report. [ENCODING]

The decoder algorithms describe how to handle invalid input; for security
reasons, it is imperative that those rules be followed precisely. Differences in how invalid byte
sequences are handled can result in, amongst other problems, script injection vulnerabilities
(«XSS»).

When the HTML parser is decoding an input byte stream, it uses a character encoding and a confidence. The confidence is either tentative,
certain, or irrelevant. The encoding used, and whether the confidence in that
encoding is tentative or certain, is used
during the parsing to determine whether to change the encoding. If no encoding is
necessary, e.g. because the parser is operating on a Unicode stream and doesn’t have to use a
character encoding at all, then the confidence is
irrelevant.

Some algorithms feed the parser by directly adding characters to the input
stream rather than adding bytes to the input byte stream.

13.2.3.1 Parsing with a known character encoding

When the HTML parser is to operate on an input byte stream that has a known
definite encoding
, then the character encoding is that encoding and the confidence is certain.

13.2.3.2 Determining the character encoding

In some cases, it might be impractical to unambiguously determine the encoding before parsing
the document. Because of this, this specification provides for a two-pass mechanism with an
optional pre-scan. Implementations are allowed, as described below, to apply a simplified parsing
algorithm to whatever bytes they have available before beginning to parse the document. Then, the
real parser is started, using a tentative encoding derived from this pre-parse and other
out-of-band metadata. If, while the document is being loaded, the user agent discovers a character
encoding declaration that conflicts with this information, then the parser can get reinvoked to
perform a parse of the document with the real encoding.

User agents must use the following algorithm, called the encoding
sniffing algorithm
, to determine the character encoding to use when decoding a document in
the first pass. This algorithm takes as input any out-of-band metadata available to the user agent
(e.g. the Content-Type metadata of the document) and all the
bytes available so far, and returns a character encoding and a confidence that is either tentative or
certain.

  1. If the result of BOM sniffing is an encoding, return that
    encoding with confidence certain.

    Although the decode algorithm will itself change the encoding to
    use based on the presence of a byte order mark, this algorithm sniffs the BOM as well in order
    to set the correct document’s character encoding and confidence.

  2. If the user has explicitly instructed the user agent to override the document’s character
    encoding with a specific encoding, optionally return that encoding with the confidence certain.

    Typically, user agents remember such user requests across sessions, and in some
    cases apply them to documents in iframes as well.

  3. The user agent may wait for more bytes of the resource to be available, either in this step
    or at any later step in this algorithm. For instance, a user agent might wait 500ms or 1024
    bytes, whichever came first. In general preparsing the source to find the encoding improves
    performance, as it reduces the need to throw away the data structures used when parsing upon
    finding the encoding information. However, if the user agent delays too long to obtain data to
    determine the encoding, then the cost of the delay could outweigh any performance improvements
    from the preparse.

    The authoring conformance requirements for character encoding declarations limit
    them to only appearing in the first 1024 bytes. User agents are
    therefore encouraged to use the prescan algorithm below (as invoked by these steps) on the first
    1024 bytes, but not to stall beyond that.

  4. If the transport layer specifies a character encoding, and it is supported, return that
    encoding with the confidence certain.

  5. Optionally prescan the byte
    stream to determine its encoding, with the end
    condition
    being when the user agent decides that scanning further bytes would not be
    efficient. User agents are encouraged to only prescan the first 1024 bytes. User agents may
    decide that scanning any bytes is not efficient, in which case these substeps are
    entirely skipped.

    The aforementioned algorithm returns either a character encoding or failure. If it returns a
    character encoding, then return the same encoding, with confidence tentative.

  6. If the HTML parser for which this algorithm is being run is associated with a
    Document d whose container
    document is non-null, then:

    1. Let parentDocument be d‘s container document.

    2. If parentDocument‘s origin is
      same origin with d‘s origin and parentDocument‘s character encoding is not
      UTF-16BE/LE, then return parentDocument‘s character encoding, with the confidence tentative.

  7. Otherwise, if the user agent has information on the likely encoding for this page, e.g.
    based on the encoding of the page when it was last visited, then return that encoding, with the
    confidence tentative.

  8. The user agent may attempt to autodetect the character encoding from applying frequency
    analysis or other algorithms to the data stream. Such algorithms may use information about the
    resource other than the resource’s contents, including the address of the resource. If
    autodetection succeeds in determining a character encoding, and that encoding is a supported
    encoding, then return that encoding, with the confidence tentative.
    [UNIVCHARDET]

    User agents are generally discouraged from attempting to autodetect encodings
    for resources obtained over the network, since doing so involves inherently non-interoperable
    heuristics. Attempting to detect encodings based on an HTML document’s preamble is especially
    tricky since HTML markup typically uses only ASCII characters, and HTML documents tend to begin
    with a lot of markup rather than with text content.

    The UTF-8 encoding has a highly detectable bit pattern. Files from the local
    file system that contain bytes with values greater than 0x7F which match the UTF-8 pattern are
    very likely to be UTF-8, while documents with byte sequences that do not match it are very
    likely not. When a user agent can examine the whole file, rather than just the preamble,
    detecting for UTF-8 specifically can be especially effective. [PPUTF8] [UTF8DET]

  9. Otherwise, return an implementation-defined or user-specified default character
    encoding, with the confidence
    tentative.

    In controlled environments or in environments where the encoding of documents can be
    prescribed (for example, for user agents intended for dedicated use in new networks), the
    comprehensive UTF-8 encoding is suggested.

    In other environments, the default encoding is typically dependent on the user’s locale (an
    approximation of the languages, and thus often encodings, of the pages that the user is likely
    to frequent). The following table gives suggested defaults based on the user’s locale, for
    compatibility with legacy content. Locales are identified by BCP 47 language tags. [BCP47] [ENCODING]

    Locale language Suggested default encoding
    ar Arabic windows-1256

    az Azeri windows-1254

    ba Bashkir windows-1251

    be Belarusian windows-1251

    bg Bulgarian windows-1251

    cs Czech windows-1250

    el Greek ISO-8859-7

    et Estonian windows-1257

    fa Persian windows-1256

    he Hebrew windows-1255

    hr Croatian windows-1250

    hu Hungarian ISO-8859-2

    ja Japanese Shift_JIS

    kk Kazakh windows-1251

    ko Korean EUC-KR

    ku Kurdish windows-1254

    ky Kyrgyz windows-1251

    lt Lithuanian windows-1257

    lv Latvian windows-1257

    mk Macedonian windows-1251

    pl Polish ISO-8859-2

    ru Russian windows-1251

    sah Yakut windows-1251

    sk Slovak windows-1250

    sl Slovenian ISO-8859-2

    sr Serbian windows-1251

    tg Tajik windows-1251

    th Thai windows-874

    tr Turkish windows-1254

    tt Tatar windows-1251

    uk Ukrainian windows-1251

    vi Vietnamese windows-1258

    zh-Hans, zh-CN, zh-SG Chinese, Simplified GBK

    zh-Hant, zh-HK, zh-MO, zh-TW Chinese, Traditional Big5

    All other locales windows-1252

    The contents of this table are derived from the intersection of
    Windows, Chrome, and Firefox defaults.

The document’s character encoding must immediately be set to the value returned
from this algorithm, at the same time as the user agent uses the returned value to select the
decoder to use for the input byte stream.


When an algorithm requires a user agent to prescan a byte stream to determine its
encoding
, given some defined end condition, then it must run the following
steps. If at any point during these steps (including during instances of the get an attribute algorithm invoked by this
one) the user agent either runs out of bytes (meaning the position pointer created in
the first step below goes beyond the end of the byte stream obtained so far) or reaches its
end condition, then abort the prescan a byte stream to determine its
encoding algorithm and return the result get an XML encoding applied to the same
bytes that the prescan a byte stream to determine its encoding algorithm was applied
to. Otherwise, these steps will return a character encoding.

  1. Let fallback encoding be null.

  2. Let position be a pointer to a byte in the input byte stream, initially
    pointing at the first byte.

  3. Prescan for UTF-16 XML declarations: If position points to:

    A sequence of bytes starting with: 0x3C, 0x0, 0x3F, 0x0, 0x78, 0x0 (case-sensitive UTF-16
    little-endian ‘<?x’)

    Return UTF-16LE.

    A sequence of bytes starting with: 0x0, 0x3C, 0x0, 0x3F, 0x0, 0x78 (case-sensitive UTF-16
    big-endian ‘<?x’)

    Return UTF-16BE.

    For historical reasons, the prefix is two bytes longer than in Appendix F of XML and the
    encoding name is not checked.

  4. Loop: If position points to:

    A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (`<!--`)

    Advance the position pointer so that it points at the first 0x3E byte
    which is preceded by two 0x2D bytes (i.e. at the end of an ASCII ‘—>’ sequence) and comes
    after the 0x3C byte that was found. (The two 0x2D bytes can be the same as those in the
    ‘<!—‘ sequence.)

    A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII ‘<meta’ followed by a space or slash)
    1. Advance the position pointer so that it points at the next 0x09,
      0x0A, 0x0C, 0x0D, 0x20, or 0x2F byte (the one in sequence of characters matched
      above).

    2. Let attribute list be an empty list of strings.

    3. Let got pragma be false.

    4. Let need pragma be null.

    5. Let charset be the null value (which, for the purposes of this
      algorithm, is distinct from an unrecognized encoding or the empty string).

    6. Attributes: Get an
      attribute and its value. If no attribute was sniffed, then jump to the
      processing step below.

    7. If the attribute’s name is already in attribute list, then return
      to the step labeled attributes.

    8. Add the attribute’s name to attribute list.

    9. Run the appropriate step from the following list, if one applies:

      If the attribute’s name is «http-equiv«

      If the attribute’s value is «content-type«, then set got pragma to true.

      If the attribute’s name is «content«

      Apply the algorithm for extracting a character encoding from a
      meta element, giving the attribute’s value as the string to parse. If a
      character encoding is returned, and if charset is still set to null,
      let charset be the encoding returned, and set need
      pragma
      to true.

      If the attribute’s name is «charset«

      Let charset be the result of getting an encoding
      from the attribute’s value, and set need pragma to false.

    10. Return to the step labeled attributes.

    11. Processing: If need pragma is null, then jump to the step
      below labeled next byte.

    12. If need pragma is true but got pragma is
      false, then jump to the step below labeled next byte.

    13. If charset is failure, then jump to the step below labeled next
      byte
      .

    14. If charset is UTF-16BE/LE, then set charset to
      UTF-8.

    15. If charset is x-user-defined, then set charset to
      windows-1252.

    16. Return charset.

    A sequence of bytes starting with a 0x3C byte (<), optionally a 0x2F byte (/), and
    finally a byte in the range 0x41-0x5A or 0x61-0x7A (A-Z or a-z)
    1. Advance the position pointer so that it points at the next 0x09 (HT),
      0x0A (LF), 0x0C (FF), 0x0D (CR), 0x20 (SP), or 0x3E (>) byte.

    2. Repeatedly get an attribute
      until no further attributes can be found, then jump to the step below labeled next
      byte
      .

    A sequence of bytes starting with: 0x3C 0x21 (`<!`)
    A sequence of bytes starting with: 0x3C 0x2F (`</`)
    A sequence of bytes starting with: 0x3C 0x3F (`<?`)

    Advance the position pointer so that it points at the first 0x3E byte (>) that
    comes after the 0x3C byte that was found.

    Any other byte

    Do nothing with that byte.

  5. Next byte: Move position so it points at the next byte in the
    input byte stream, and return to the step above labeled loop.

When the prescan a byte stream to determine its encoding algorithm says to get an attribute, it means doing this:

  1. If the byte at position is one of 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR),
    0x20 (SP), or 0x2F (/) then advance position to the next byte and redo this
    step.

  2. If the byte at position is 0x3E (>), then abort the get an attribute algorithm. There isn’t
    one.

  3. Otherwise, the byte at position is the start of the attribute name.
    Let attribute name and attribute value be the empty
    string.

  4. Process the byte at position as follows:

    If it is 0x3D (=), and the attribute name is longer than the empty string
    Advance position to the next byte and jump to the step below labeled
    value.
    If it is 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR), or 0x20 (SP)
    Jump to the step below labeled spaces.
    If it is 0x2F (/) or 0x3E (>)
    Abort the get an attribute
    algorithm. The attribute’s name is the value of attribute name, its value
    is the empty string.
    If it is in the range 0x41 (A) to 0x5A (Z)
    Append the code point b+0x20 to attribute name
    (where b is the value of the byte at position). (This converts the input
    to lowercase.)
    Anything else
    Append the code point with the same value as the byte at position to
    attribute name. (It doesn’t actually matter how bytes outside the ASCII range are
    handled here, since only ASCII bytes can contribute to the detection of a character
    encoding.)
  5. Advance position to the next byte and return to the previous
    step.

  6. Spaces: If the byte at position is one of 0x09 (HT), 0x0A (LF), 0x0C
    (FF), 0x0D (CR), or 0x20 (SP) then advance position to the next byte, then, repeat
    this step.

  7. If the byte at position is not 0x3D (=), abort the get an attribute algorithm. The attribute’s
    name is the value of attribute name, its value is the empty string.

  8. Advance position past the 0x3D (=) byte.

  9. Value: If the byte at position is one of 0x09 (HT), 0x0A (LF), 0x0C
    (FF), 0x0D (CR), or 0x20 (SP) then advance position to the next byte, then, repeat
    this step.

  10. Process the byte at position as follows:

    If it is 0x22 («) or 0x27 (‘)
    1. Let b be the value of the byte at position.
    2. Quote loop: Advance position to the next byte.
    3. If the value of the byte at position is the value of b, then advance position to the next byte and abort the
      «get an attribute» algorithm. The attribute’s name is the value of attribute
      name
      , and its value is the value of attribute value.
    4. Otherwise, if the value of the byte at position is in the range 0x41 (A) to
      0x5A (Z), then append a code point to attribute value whose value is 0x20 more
      than the value of the byte at position.
    5. Otherwise, append a code point to attribute value whose value is the same as
      the value of the byte at position.
    6. Return to the step above labeled quote loop.
    If it is 0x3E (>)
    Abort the get an attribute
    algorithm. The attribute’s name is the value of attribute name, its value
    is the empty string.
    If it is in the range 0x41 (A) to 0x5A (Z)
    Append a code point b+0x20 to attribute value
    (where b is the value of the byte at position). Advance
    position to the next byte.
    Anything else
    Append a code point with the same value as the byte at position to
    attribute value. Advance position to the next byte.
  11. Process the byte at position as
    follows:

    If it is 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR), 0x20 (SP), or 0x3E (>)
    Abort the get an attribute
    algorithm. The attribute’s name is the value of attribute name and its
    value is the value of attribute value.
    If it is in the range 0x41 (A) to 0x5A (Z)
    Append a code point b+0x20 to attribute value
    (where b is the value of the byte at position).
    Anything else
    Append a code point with the same value as the byte at position to
    attribute value.
  12. Advance position to the next byte and return to the previous
    step.

When the prescan a byte stream to determine its encoding algorithm is aborted
without returning an encoding, get an XML
encoding
means doing this.

Looking for syntax resembling an XML declaration, even in text/html,
is necessary for compatibility with existing content.

  1. Let encodingPosition be a pointer to the start of the stream.

  2. If encodingPosition does not point to the start of a byte sequence 0x3C, 0x3F,
    0x78, 0x6D, 0x6C (`<?xml`), then return failure.

  3. Let xmlDeclarationEnd be a pointer to the next byte in the input byte stream
    which is 0x3E (>). If there is no such byte, then return failure.

  4. Set encodingPosition to the position of the first occurrence of the subsequence
    of bytes 0x65, 0x6E, 0x63, 0x6F, 0x64, 0x69, 0x6E, 0x67 (`encoding`) at or
    after the current encodingPosition. If there is no such sequence, then return
    failure.

  5. Advance encodingPosition past the 0x67 (g) byte.

  6. While the byte at encodingPosition is less than or equal to 0x20 (i.e., it is
    either an ASCII space or control character), advance encodingPosition to the next
    byte.

  7. If the byte at encodingPosition is not 0x3D (=), then return failure.

  8. Advance encodingPosition to the next byte.

  9. While the byte at encodingPosition is less than or equal to 0x20 (i.e., it is
    either an ASCII space or control character), advance encodingPosition to the next
    byte.

  10. Let quoteMark be the byte at encodingPosition.

  11. If quoteMark is not either 0x22 («) or 0x27 (‘), then return failure.

  12. Advance encodingPosition to the next byte.

  13. Let encodingEndPosition be the position of the next occurrence of
    quoteMark at or after encodingPosition. If quoteMark does not
    occur again, then return failure.

  14. Let potentialEncoding be the sequence of the bytes between
    encodingPosition (inclusive) and encodingEndPosition (exclusive).

  15. If potentialEncoding contains one or more bytes whose byte value is 0x20 or
    below, then return failure.

  16. Let encoding be the result of getting an encoding given
    potentialEncoding isomorphic decoded.

  17. If the encoding is UTF-16BE/LE, then change it to
    UTF-8.

  18. Return encoding.

For the sake of interoperability, user agents should not use a pre-scan algorithm that returns
different results than the one described above. (But, if you do, please at least let us know, so
that we can improve this algorithm and benefit everyone…)

13.2.3.3 Character encodings

User agents must support the encodings defined in Encoding, including, but
not limited to,
UTF-8,
ISO-8859-2,
ISO-8859-7,
ISO-8859-8,
windows-874,
windows-1250,
windows-1251,
windows-1252,
windows-1254,
windows-1255,
windows-1256,
windows-1257,
windows-1258,
GBK,
Big5,
ISO-2022-JP,
Shift_JIS,
EUC-KR,
UTF-16BE,
UTF-16LE,
UTF-16BE/LE, and
x-user-defined.
User agents must not support other encodings.

The above prohibits supporting, for example, CESU-8, UTF-7, BOCU-1, SCSU, EBCDIC,
and UTF-32. This specification does not make any attempt to support prohibited encodings in its
algorithms; support and use of prohibited encodings would thus lead to unexpected behavior. [CESU8] [UTF7] [BOCU1] [SCSU]

13.2.3.4 Changing the encoding while parsing

When the parser requires the user agent to change the encoding, it must run the
following steps. This might happen if the encoding sniffing algorithm described above
failed to find a character encoding, or if it found a character encoding that was not the actual
encoding of the file.

  1. If the encoding that is already being used to interpret the input stream is
    UTF-16BE/LE, then set the confidence to certain and return. The new
    encoding is ignored; if it was anything but the same encoding, then it would be clearly
    incorrect.

  2. If the new encoding is UTF-16BE/LE, then change it to
    UTF-8.

  3. If the new encoding is x-user-defined, then change it to
    windows-1252.

  4. If the new encoding is identical or equivalent to the encoding that is already being used
    to interpret the input stream, then set the confidence to certain and return.
    This happens when the encoding information found in the file matches what the encoding
    sniffing algorithm determined to be the encoding, and in the second pass through the
    parser if the first pass found that the encoding sniffing algorithm described in the earlier
    section failed to find the right encoding.

  5. If all the bytes up to the last byte converted by the current decoder have the same
    Unicode interpretations in both the current encoding and the new encoding, and if the user agent
    supports changing the converter on the fly, then the user agent may change to the new converter
    for the encoding on the fly. Set the document’s character encoding and the encoding
    used to convert the input stream to the new encoding, set the confidence to certain, and return.

  6. Otherwise, restart the navigate algorithm, with historyHandling set to «replace» and
    other inputs kept the same, but this time skip the encoding sniffing algorithm and
    instead just set the encoding to the new encoding and the confidence to certain. Whenever possible,
    this should be done without actually contacting the network layer (the bytes should be re-parsed
    from memory), even if, e.g., the document is marked as not being cacheable. If this is not
    possible and contacting the network layer would involve repeating a request that uses a method
    other than `GET`, then instead set the confidence to certain and ignore the new
    encoding. The resource will be misinterpreted. User agents may notify the user of the situation,
    to aid in application development.

This algorithm is only invoked when a new encoding is found declared on a
meta element.

13.2.3.5 Preprocessing the input stream

The input stream consists of the characters pushed into it as the input byte
stream is decoded or from the various APIs that directly manipulate the input stream.

Any occurrences of surrogates are surrogate-in-input-stream
parse errors. Any occurrences of noncharacters are noncharacter-in-input-stream
parse errors and any occurrences of controls other than ASCII whitespace and U+0000 NULL
characters are control-character-in-input-stream
parse errors.

The handling of U+0000 NULL characters varies based on where the characters are
found and happens at the later stages of the parsing. They are either ignored or, for security
reasons, replaced with a U+FFFD REPLACEMENT CHARACTER. This handling is, by necessity, spread
across both the tokenization stage and the tree construction stage.

Before the tokenization stage, the input stream must be preprocessed by normalizing newlines. Thus, newlines in HTML DOMs are
represented by U+000A LF characters, and there are never any U+000D CR characters in the input to
the tokenization stage.

The next input character is the first character in the input stream
that has not yet been consumed or explicitly ignored by the requirements in
this section. Initially, the next input character is the
first character in the input. The current input character is the last character to have
been consumed.

The insertion point is the position (just before a character or just before the end
of the input stream) where content inserted using document.write() is actually inserted. The insertion point is
relative to the position of the character immediately after it, it is not an absolute offset into
the input stream. Initially, the insertion point is undefined.

The «EOF» character in the tables below is a conceptual character representing the end of the
input stream. If the parser is a script-created parser, then the end of
the input stream is reached when an explicit «EOF» character (inserted by
the document.close() method) is consumed. Otherwise, the
«EOF» character is not a real character in the stream, but rather the lack of any further
characters.

13.2.4 Parse state

13.2.4.1 The insertion mode

The insertion mode is a state variable that controls the primary operation of the
tree construction stage.

Initially, the insertion mode is «initial». It can change to «before
html», «before head», «in head», «in head noscript», «after head»,
«in body», «text», «in table», «in table text», «in caption», «in column
group», «in table body», «in row», «in
cell», «in select», «in select in table», «in template», «after body»,
«in frameset», «after frameset», «after
after body», and «after after
frameset» during the course of the parsing, as described in the tree
construction stage. The insertion mode affects how tokens are processed and whether CDATA
sections are supported.

Several of these modes, namely «in head», «in body», «in
table», and «in select», are special, in
that the other modes defer to them at various times. When the algorithm below says that the user
agent is to do something «using the rules for the m insertion
mode», where m is one of these modes, the user agent must use the rules
described under the m insertion mode’s section, but must leave
the insertion mode unchanged unless the rules in m themselves
switch the insertion mode to a new value.

When the insertion mode is switched to «text» or
«in table text», the original insertion
mode
is also set. This is the insertion mode to which the tree construction stage will
return.

Similarly, to parse nested template elements, a stack of template insertion
modes
is used. It is initially empty. The current template insertion mode is the
insertion mode that was most recently added to the stack of template insertion modes.
The algorithms in the sections below will push insertion modes onto this stack, meaning
that the specified insertion mode is to be added to the stack, and pop insertion modes from
the stack, which means that the most recently added insertion mode must be removed from the
stack.


When the steps below require the UA to reset the insertion mode appropriately, it
means the UA must follow these steps:

  1. Let last be false.

  2. Let node be the last node in the stack of open
    elements.

  3. Loop: If node is the first node in the stack of open elements, then set
    last to true, and, if the parser was created as part of the HTML fragment
    parsing algorithm (fragment case), set node to the context element passed to that algorithm.

  4. If node is a select element, run these substeps:

    1. If last is true, jump to the step below labeled
      done.

    2. Let ancestor be node.

    3. Loop: If ancestor is the first node in the stack of
      open elements, jump to the step below labeled done.

    4. Let ancestor be the node before ancestor in the
      stack of open elements.

    5. If ancestor is a template node, jump to the step below
      labeled done.

    6. If ancestor is a table node, switch the insertion
      mode to «in select in table»
      and return.

    7. Jump back to the step labeled loop.

    8. Done: Switch the insertion mode to «in select» and return.

  5. If node is a td or th element and last is
    false, then switch the insertion mode to «in
    cell» and return.

  6. If node is a tr element, then switch the insertion
    mode to «in row» and return.

  7. If node is a tbody, thead, or
    tfoot element, then switch the insertion mode to «in table body» and return.

  8. If node is a caption element, then switch the
    insertion mode to «in caption» and
    return.

  9. If node is a colgroup element, then switch the
    insertion mode to «in column
    group» and return.

  10. If node is a table element, then switch the
    insertion mode to «in table» and
    return.

  11. If node is a template element, then switch the
    insertion mode to the current template insertion mode and
    return.

  12. If node is a head element and last is
    false, then switch the insertion mode to «in
    head» and return.

  13. If node is a body element, then switch the
    insertion mode to «in body» and
    return.

  14. If node is a frameset element, then switch the
    insertion mode to «in frameset» and
    return. (fragment case)

  15. If node is an html element, run these substeps:

    1. If the head element pointer is null, switch the
      insertion mode to «before head»
      and return. (fragment case)

    2. Otherwise, the head element pointer is not null, switch the
      insertion mode to «after head» and
      return.

  16. If last is true, then switch the insertion mode to «in body» and return. (fragment
    case)

  17. Let node now be the node before node in the
    stack of open elements.

  18. Return to the step labeled loop.

13.2.4.2 The stack of open elements

Initially, the stack of open elements is empty. The stack grows downwards; the
topmost node on the stack is the first one added to the stack, and the bottommost node of the
stack is the most recently added node in the stack (notwithstanding when the stack is manipulated
in a random access fashion as part of the handling for misnested
tags).

The «before html»
insertion mode creates the html document element, which is
then added to the stack.

In the fragment case, the stack of open elements is
initialized to contain an html element that is created as part of that algorithm. (The fragment case skips the
«before html» insertion mode.)

The html node, however it is created, is the topmost node of the stack. It only
gets popped off the stack when the parser finishes.

The current node is the bottommost node in this stack of open
elements.

The adjusted current node is the context
element if the parser was created as part of the HTML fragment parsing algorithm and
the stack of open elements has only one element in it (fragment case);
otherwise, the adjusted current node is the current node.

Elements in the stack of open elements fall into the following categories:

Special

The following elements have varying levels of special parsing rules: HTML’s
address, applet, area, article,
aside, base, basefont, bgsound,
blockquote, body, br, button,
caption, center, col, colgroup,
dd, details, dir, div, dl,
dt, embed, fieldset, figcaption,
figure, , form, frame,
frameset, h1, h2, h3, h4,
h5, h6, head, , hgroup,
hr, html, iframe,
img, input, keygen, li, link,
listing, main, marquee, ,
meta, nav, noembed, noframes,
noscript, object, ol, p, param,
plaintext, pre, script, section,
select, source, style, summary,
table, tbody, td, template,
textarea, tfoot, th, thead,
title, tr, track, ul, wbr,
xmp; MathML mi, MathML mo,
MathML mn, MathML ms, MathML
mtext, and MathML annotation-xml; and SVG
foreignObject, SVG desc, and SVG
title.

An image start tag token is handled by the tree builder,
but it is not in this list because it is not an element; it gets turned into an img
element.

Formatting

The following HTML elements are those that end up in the list of active formatting
elements: a, b, big, code,
em, font, i, nobr, s,
small, strike, strong, tt, and
u.

Ordinary

All other elements found while parsing an HTML document.

Typically, the special elements have the start and end tag tokens
handled specifically, while ordinary elements’ tokens fall into «any other start tag»
and «any other end tag» clauses, and some parts of the tree builder check if a particular element
in the stack of open elements is in the special category. However, some
elements (e.g., the option element) have their start or end tag tokens handled
specifically, but are still not in the special category, so that they get the
ordinary handling elsewhere.

The stack of open elements is said to have an element target node in a specific scope consisting of a
list of element types list when the following algorithm terminates in a match
state:

  1. Initialize node to be the current node (the bottommost
    node of the stack).

  2. If node is the target node, terminate in a match state.

  3. Otherwise, if node is one of the element types in list, terminate in a failure state.

  4. Otherwise, set node to the previous entry in the stack of open
    elements and return to step 2. (This will never fail, since the loop will always terminate
    in the previous step if the top of the stack — an html element — is
    reached.)

The stack of open elements is said to have a
particular element in scope
when it has
that element in the specific scope consisting of the following element types:

  • applet
  • caption
  • html
  • table
  • td
  • th
  • marquee
  • object
  • template
  • MathML mi
  • MathML mo
  • MathML mn
  • MathML ms
  • MathML mtext
  • MathML annotation-xml
  • SVG foreignObject
  • SVG desc
  • SVG title

The stack of open elements is said to have a particular element in list item scope when it has that element in the specific scope consisting of the following
element types:

  • All the element types listed above for the has an element in scope algorithm.
  • ol in the HTML namespace
  • ul in the HTML namespace

The stack of open elements is said to have a particular element in button scope when it has that element in the specific scope consisting of the following element
types:

  • All the element types listed above for the has an element in scope algorithm.
  • button in the HTML namespace

The stack of open elements is said to have a particular element in table scope when it has that element in the specific scope consisting of the following element
types:

  • html in the HTML namespace
  • table in the HTML namespace
  • template in the HTML namespace

The stack of open elements is said to have a particular element in select scope when it has that element in the specific scope consisting of all element types
except the following:

  • optgroup in the HTML namespace
  • option in the HTML namespace

Nothing happens if at any time any of the elements in the stack of open elements
are moved to a new location in, or removed from, the Document tree. In particular,
the stack is not changed in this situation. This can cause, amongst other strange effects, content
to be appended to nodes that are no longer in the DOM.

In some cases (namely, when closing misnested formatting
elements), the stack is manipulated in a random-access fashion.

13.2.4.3 The list of active formatting elements

Initially, the list of active formatting elements is empty. It is used to handle
mis-nested formatting element tags.

The list contains elements in the formatting category, and markers. The markers are inserted when entering applet,
object, marquee, template, td,
th, and caption elements, and are used to prevent formatting from
«leaking» into applet, object, marquee,
template, td, th, and caption elements.

In addition, each element in the list of active formatting elements is associated
with the token for which it was created, so that further elements can be created for that token if
necessary.

When the steps below require the UA to push onto the list of active formatting
elements
an element element, the UA must perform the following
steps:

  1. If there are already three elements in the list of active formatting elements
    after the last marker, if any, or anywhere in the
    list if there are no markers, that have the same tag
    name, namespace, and attributes as element, then remove the earliest such
    element from the list of active formatting elements. For these purposes, the
    attributes must be compared as they were when the elements were created by the parser; two
    elements have the same attributes if all their parsed attributes can be paired such that the two
    attributes in each pair have identical names, namespaces, and values (the order of the attributes
    does not matter).

    This is the Noah’s Ark clause. But with three per family instead of two.

  2. Add element to the list of active formatting
    elements.

When the steps below require the UA to reconstruct the active formatting elements,
the UA must perform the following steps:

  1. If there are no entries in the list of active formatting elements, then there
    is nothing to reconstruct; stop this algorithm.

  2. If the last (most recently added) entry in the list of active formatting
    elements is a marker, or if it is an element
    that is in the stack of open elements, then there is nothing to reconstruct; stop
    this algorithm.

  3. Let entry be the last (most recently added) element in the list
    of active formatting elements.

  4. Rewind: If there are no entries before entry in the list
    of active formatting elements, then jump to the step labeled create.

  5. Let entry be the entry one earlier than entry in
    the list of active formatting elements.

  6. If entry is neither a marker nor an element that is also in the stack of
    open elements, go to the step labeled rewind.

  7. Advance: Let entry be the element one later than entry in the list of active formatting elements.

  8. Create: Insert an HTML element for the token for which the element
    entry was created, to obtain new element.

  9. Replace the entry for entry in the list with an entry for new element.

  10. If the entry for new element in the list of active formatting
    elements is not the last entry in the list, return to the step labeled
    advance.

This has the effect of reopening all the formatting elements that were opened in the current
body, cell, or caption (whichever is youngest) that haven’t been explicitly closed.

The way this specification is written, the list of active formatting
elements always consists of elements in chronological order with the least recently added
element first and the most recently added element last (except for while steps 7 to 10 of the
above algorithm are being executed, of course).

When the steps below require the UA to clear the list of active formatting elements up to
the last marker
, the UA must perform the following steps:

  1. Let entry be the last (most recently added) entry in the list of
    active formatting elements.

  2. Remove entry from the list of active formatting
    elements.

  3. If entry was a marker,
    then stop the algorithm at this point. The list has been cleared up to the last marker.

  4. Go to step 1.

13.2.4.4 The element pointers

Initially, the head element pointer and the form element pointer are both null.

Once a head element has been parsed (whether implicitly or explicitly) the
head element pointer gets set to point to this node.

The form element pointer points to the last
form element that was opened and whose end tag has not yet been seen. It is used to
make form controls associate with forms in the face of dramatically bad markup, for historical
reasons. It is ignored inside template elements.

13.2.4.5 Other parsing state flags

The scripting flag is set to «enabled» if scripting
was enabled for the Document with which the parser is associated when the
parser was created, and «disabled» otherwise.

The scripting flag can be enabled even when the parser was created as
part of the HTML fragment parsing algorithm, even though script elements
don’t execute in that case.

The frameset-ok flag is set to «ok» when the parser is created. It is set to «not
ok» after certain tokens are seen.

13.2.5 Tokenization

Implementations must act as if they used the following state machine to tokenize HTML. The
state machine must start in the data state. Most states consume a single character,
which may have various side-effects, and either switches the state machine to a new state to
reconsume the current input character, or switches it to a new state to
consume the next character, or stays in the same state
to consume the next character. Some states have more complicated behavior and can consume several
characters before switching to another state. In some cases, the tokenizer state is also changed
by the tree construction stage.

When a state says to reconsume a matched character in a specified state, that means
to switch to that state, but when it attempts to consume the next input character,
provide it with the current input character instead.

The exact behavior of certain states depends on the insertion mode and the
stack of open elements. Certain states also use a temporary buffer to track progress, and the character reference
state uses a return state to return to the
state it was invoked from.

The output of the tokenization step is a series of zero or more of the following tokens:
DOCTYPE, start tag, end tag, comment, character, end-of-file. DOCTYPE tokens have a name, a public
identifier, a system identifier, and a force-quirks flag. When a DOCTYPE token
is created, its name, public identifier, and system identifier must be marked as missing (which is
a distinct state from the empty string), and the force-quirks flag must be set to
off (its other state is on). Start and end tag tokens have a tag name, a self-closing flag, and a list of attributes, each of which has a
name and a value. When a start or end tag token is created, its self-closing flag must be unset (its other state is that it be set), and its attributes
list must be empty. Comment and character tokens have data.

When a token is emitted, it must immediately be handled by the tree construction
stage. The tree construction stage can affect the state of the tokenization stage, and can insert
additional characters into the stream. (For example, the script element can result in
scripts executing and using the dynamic markup insertion APIs to insert characters
into the stream being tokenized.)

Creating a token and emitting it are distinct actions. It is possible for a token
to be created but implicitly abandoned (never emitted), e.g. if the file ends unexpectedly while
processing the characters that are being parsed into a start tag token.

When a start tag token is emitted with its self-closing flag set, if the flag is not
acknowledged when it is processed by the tree
construction stage, that is a non-void-html-element-start-tag-with-trailing-solidus
parse error.

When an end tag token is emitted with attributes, that is an end-tag-with-attributes parse
error.

When an end tag token is emitted with its self-closing flag
set, that is an end-tag-with-trailing-solidus
parse error.

An appropriate end tag token is an end tag token whose tag name matches the tag name
of the last start tag to have been emitted from this tokenizer, if any. If no start tag has been
emitted from this tokenizer, then no end tag token is appropriate.

A character reference is said to be consumed as part of an attribute if the return state is either attribute value (double-quoted) state,
attribute value (single-quoted) state or attribute value (unquoted)
state.

When a state says to flush code points consumed as a character reference, it means
that for each code point in the temporary
buffer
(in the order they were added to the buffer) user agent must append the code point
from the buffer to the current attribute’s value if the character reference was consumed as part of an attribute, or emit the code point as a
character token otherwise.

Before each step of the tokenizer, the user agent must first check
the parser pause flag. If it is true, then the tokenizer must abort the processing of
any nested invocations of the tokenizer, yielding control back to the caller.

The tokenizer state machine consists of the states defined in the following subsections.

13.2.5.1 Data state

Consume the next input character:

U+0026 AMPERSAND (&)
Set the return state to the data state.
Switch to the character reference state.
U+003C LESS-THAN SIGN (<)
Switch to the tag open state.
U+0000 NULL
This is an unexpected-null-character parse
error. Emit the current input character as a character token.
EOF
Emit an end-of-file token.
Anything else
Emit the current input character as a character token.
13.2.5.2 RCDATA state

Consume the next input character:

U+0026 AMPERSAND (&)
Set the return state to the RCDATA state.
Switch to the character reference state.
U+003C LESS-THAN SIGN (<)
Switch to the RCDATA less-than sign state.
U+0000 NULL
This is an unexpected-null-character parse
error. Emit a U+FFFD REPLACEMENT CHARACTER character token.
EOF
Emit an end-of-file token.
Anything else
Emit the current input character as a character token.
13.2.5.3 RAWTEXT state

Consume the next input character:

U+003C LESS-THAN SIGN (<)
Switch to the RAWTEXT less-than sign state.
U+0000 NULL
This is an unexpected-null-character parse
error. Emit a U+FFFD REPLACEMENT CHARACTER character token.
EOF
Emit an end-of-file token.
Anything else
Emit the current input character as a character token.
13.2.5.4 Script data state

Consume the next input character:

U+003C LESS-THAN SIGN (<)
Switch to the script data less-than sign state.
U+0000 NULL
This is an unexpected-null-character parse
error. Emit a U+FFFD REPLACEMENT CHARACTER character token.
EOF
Emit an end-of-file token.
Anything else
Emit the current input character as a character token.
13.2.5.5 PLAINTEXT state

Consume the next input character:

U+0000 NULL
This is an unexpected-null-character parse
error. Emit a U+FFFD REPLACEMENT CHARACTER character token.
EOF
Emit an end-of-file token.
Anything else
Emit the current input character as a character token.
13.2.5.6 Tag open state

Consume the next input character:

U+0021 EXCLAMATION MARK (!)
Switch to the markup declaration open state.
U+002F SOLIDUS (/)
Switch to the end tag open state.
ASCII alpha
Create a new start tag token, set its tag name to the empty string. Reconsume in
the tag name state.

U+003F QUESTION MARK (?)
This is an unexpected-question-mark-instead-of-tag-name
parse error. Create a comment token whose data is the empty string.
Reconsume in the bogus comment state.
EOF
This is an eof-before-tag-name
parse error. Emit a U+003C LESS-THAN SIGN character token and an end-of-file
token.
Anything else
This is an invalid-first-character-of-tag-name
parse error. Emit a U+003C LESS-THAN SIGN character token. Reconsume in
the data state.
13.2.5.7 End tag open state

Consume the next input character:

ASCII alpha
Create a new end tag token, set its tag name to the empty string. Reconsume in
the tag name state.

U+003E GREATER-THAN SIGN (>)
This is a missing-end-tag-name
parse error. Switch to the data state.
EOF
This is an eof-before-tag-name
parse error. Emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS
character token and an end-of-file token.

Anything else
This is an invalid-first-character-of-tag-name
parse error. Create a comment token whose data is the empty string.
Reconsume in the bogus comment state.
13.2.5.8 Tag name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Switch to the before attribute name state.
U+002F SOLIDUS (/)
Switch to the self-closing start tag state.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current tag token.
ASCII upper alpha
Append the lowercase version of the current input character (add 0x0020 to the
character’s code point) to the current tag token’s tag name.
U+0000 NULL
This is an unexpected-null-character parse
error. Append a U+FFFD REPLACEMENT CHARACTER character to the current tag token’s tag
name.
EOF
This is an eof-in-tag parse error.
Emit an end-of-file token.
Anything else
Append the current input character to the current tag token’s tag name.
13.2.5.9 RCDATA less-than sign state

Consume the next input character:

U+002F SOLIDUS (/)
Set the temporary buffer to the empty string. Switch to
the RCDATA end tag open state.
Anything else
Emit a U+003C LESS-THAN SIGN character token. Reconsume in the RCDATA
state.
13.2.5.10 RCDATA end tag open state

Consume the next input character:

ASCII alpha
Create a new end tag token, set its tag name to the empty string. Reconsume in
the RCDATA end tag name state.

Anything else
Emit a U+003C LESS-THAN SIGN character token and a U+002F SOLIDUS character token.
Reconsume in the RCDATA state.
13.2.5.11 RCDATA end tag name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
If the current end tag token is an appropriate end tag token, then switch to the
before attribute name state. Otherwise, treat it as per the «anything else» entry
below.
U+002F SOLIDUS (/)
If the current end tag token is an appropriate end tag token, then switch to the
self-closing start tag state. Otherwise, treat it as per the «anything else» entry
below.
U+003E GREATER-THAN SIGN (>)
If the current end tag token is an appropriate end tag token, then switch to the
data state and emit the current tag token. Otherwise, treat it as per the «anything
else» entry below.
ASCII upper alpha
Append the lowercase version of the current input character (add 0x0020 to the
character’s code point) to the current tag token’s tag name. Append the current input
character to the temporary buffer.
ASCII lower alpha
Append the current input character to the current tag token’s tag name. Append
the current input character to the temporary
buffer
.
Anything else
Emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS character token, and a
character token for each of the characters in the temporary
buffer
(in the order they were added to the buffer). Reconsume in the
RCDATA state.
13.2.5.12 RAWTEXT less-than sign state

Consume the next input character:

U+002F SOLIDUS (/)
Set the temporary buffer to the empty string. Switch to
the RAWTEXT end tag open state.
Anything else
Emit a U+003C LESS-THAN SIGN character token. Reconsume in the RAWTEXT
state.
13.2.5.13 RAWTEXT end tag open state

Consume the next input character:

ASCII alpha
Create a new end tag token, set its tag name to the empty string. Reconsume in
the RAWTEXT end tag name state.

Anything else
Emit a U+003C LESS-THAN SIGN character token and a U+002F SOLIDUS character token.
Reconsume in the RAWTEXT state.
13.2.5.14 RAWTEXT end tag name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
If the current end tag token is an appropriate end tag token, then switch to the
before attribute name state. Otherwise, treat it as per the «anything else» entry
below.
U+002F SOLIDUS (/)
If the current end tag token is an appropriate end tag token, then switch to the
self-closing start tag state. Otherwise, treat it as per the «anything else» entry
below.
U+003E GREATER-THAN SIGN (>)
If the current end tag token is an appropriate end tag token, then switch to the
data state and emit the current tag token. Otherwise, treat it as per the «anything
else» entry below.
ASCII upper alpha
Append the lowercase version of the current input character (add 0x0020 to the
character’s code point) to the current tag token’s tag name. Append the current input
character to the temporary buffer.
ASCII lower alpha
Append the current input character to the current tag token’s tag name. Append
the current input character to the temporary
buffer
.
Anything else
Emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS character token, and a
character token for each of the characters in the temporary
buffer
(in the order they were added to the buffer). Reconsume in the
RAWTEXT state.
13.2.5.15 Script data less-than sign state

Consume the next input character:

U+002F SOLIDUS (/)
Set the temporary buffer to the empty string. Switch to
the script data end tag open state.
U+0021 EXCLAMATION MARK (!)
Switch to the script data escape start state. Emit a U+003C LESS-THAN SIGN
character token and a U+0021 EXCLAMATION MARK character token.
Anything else
Emit a U+003C LESS-THAN SIGN character token. Reconsume in the script data
state.
13.2.5.16 Script data end tag open state

Consume the next input character:

ASCII alpha
Create a new end tag token, set its tag name to the empty string. Reconsume in
the script data end tag name state.

Anything else
Emit a U+003C LESS-THAN SIGN character token and a U+002F SOLIDUS character token.
Reconsume in the script data state.
13.2.5.17 Script data end tag name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
If the current end tag token is an appropriate end tag token, then switch to the
before attribute name state. Otherwise, treat it as per the «anything else» entry
below.
U+002F SOLIDUS (/)
If the current end tag token is an appropriate end tag token, then switch to the
self-closing start tag state. Otherwise, treat it as per the «anything else» entry
below.
U+003E GREATER-THAN SIGN (>)
If the current end tag token is an appropriate end tag token, then switch to the
data state and emit the current tag token. Otherwise, treat it as per the «anything
else» entry below.
ASCII upper alpha
Append the lowercase version of the current input character (add 0x0020 to the
character’s code point) to the current tag token’s tag name. Append the current input
character to the temporary buffer.
ASCII lower alpha
Append the current input character to the current tag token’s tag name. Append
the current input character to the temporary
buffer
.
Anything else
Emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS character token, and a
character token for each of the characters in the temporary
buffer
(in the order they were added to the buffer). Reconsume in the
script data state.
13.2.5.18 Script data escape start state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the script data escape start dash state. Emit a U+002D HYPHEN-MINUS
character token.
Anything else
Reconsume in the script data state.
13.2.5.19 Script data escape start dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the script data escaped dash dash state. Emit a U+002D HYPHEN-MINUS
character token.
Anything else
Reconsume in the script data state.
13.2.5.20 Script data escaped state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the script data escaped dash state. Emit a U+002D HYPHEN-MINUS
character token.
U+003C LESS-THAN SIGN (<)
Switch to the script data escaped less-than sign state.
U+0000 NULL
This is an unexpected-null-character parse
error. Emit a U+FFFD REPLACEMENT CHARACTER character token.
EOF
This is an eof-in-script-html-comment-like-text
parse error. Emit an end-of-file token.
Anything else
Emit the current input character as a character token.
13.2.5.21 Script data escaped dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the script data escaped dash dash state. Emit a U+002D HYPHEN-MINUS
character token.
U+003C LESS-THAN SIGN (<)
Switch to the script data escaped less-than sign state.
U+0000 NULL
This is an unexpected-null-character parse
error. Switch to the script data escaped state. Emit a U+FFFD REPLACEMENT
CHARACTER character token.
EOF
This is an eof-in-script-html-comment-like-text
parse error. Emit an end-of-file token.
Anything else
Switch to the script data escaped state. Emit the current input
character as a character token.
13.2.5.22 Script data escaped dash dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Emit a U+002D HYPHEN-MINUS character token.
U+003C LESS-THAN SIGN (<)
Switch to the script data escaped less-than sign state.
U+003E GREATER-THAN SIGN (>)
Switch to the script data state. Emit a U+003E GREATER-THAN SIGN character
token.
U+0000 NULL
This is an unexpected-null-character parse
error. Switch to the script data escaped state. Emit a U+FFFD REPLACEMENT
CHARACTER character token.
EOF
This is an eof-in-script-html-comment-like-text
parse error. Emit an end-of-file token.
Anything else
Switch to the script data escaped state. Emit the current input
character as a character token.
13.2.5.23 Script data escaped less-than sign state

Consume the next input character:

U+002F SOLIDUS (/)
Set the temporary buffer to the empty string. Switch to
the script data escaped end tag open state.
ASCII alpha
Set the temporary buffer to the empty string. Emit a
U+003C LESS-THAN SIGN character token. Reconsume in the script data double
escape start state.
Anything else
Emit a U+003C LESS-THAN SIGN character token. Reconsume in the script data
escaped state.
13.2.5.24 Script data escaped end tag open state

Consume the next input character:

ASCII alpha
Create a new end tag token, set its tag name to the empty string. Reconsume in
the script data escaped end tag name state.
Anything else
Emit a U+003C LESS-THAN SIGN character token and a U+002F SOLIDUS character token.
Reconsume in the script data escaped state.
13.2.5.25 Script data escaped end tag name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
If the current end tag token is an appropriate end tag token, then switch to the
before attribute name state. Otherwise, treat it as per the «anything else» entry
below.
U+002F SOLIDUS (/)
If the current end tag token is an appropriate end tag token, then switch to the
self-closing start tag state. Otherwise, treat it as per the «anything else» entry
below.
U+003E GREATER-THAN SIGN (>)
If the current end tag token is an appropriate end tag token, then switch to the
data state and emit the current tag token. Otherwise, treat it as per the «anything
else» entry below.
ASCII upper alpha
Append the lowercase version of the current input character (add 0x0020 to the
character’s code point) to the current tag token’s tag name. Append the current input
character to the temporary buffer.
ASCII lower alpha
Append the current input character to the current tag token’s tag name. Append
the current input character to the temporary
buffer
.
Anything else
Emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS character token, and a
character token for each of the characters in the temporary buffer
(in the order they were added to the buffer). Reconsume in the script
data escaped state.
13.2.5.26 Script data double escape start state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
U+002F SOLIDUS (/)
U+003E GREATER-THAN SIGN (>)
If the temporary buffer is the string «script«, then switch to the script data double escaped state.
Otherwise, switch to the script data escaped state. Emit the current input
character as a character token.
ASCII upper alpha
Append the lowercase version of the current input character (add 0x0020 to the
character’s code point) to the temporary buffer. Emit the
current input character as a character token.
ASCII lower alpha
Append the current input character to the temporary buffer. Emit the current input character as a character
token.
Anything else
Reconsume in the script data escaped state.
13.2.5.27 Script data double escaped state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the script data double escaped dash state. Emit a U+002D HYPHEN-MINUS
character token.
U+003C LESS-THAN SIGN (<)
Switch to the script data double escaped less-than sign state. Emit a U+003C
LESS-THAN SIGN character token.
U+0000 NULL
This is an unexpected-null-character parse
error. Emit a U+FFFD REPLACEMENT CHARACTER character token.
EOF
This is an eof-in-script-html-comment-like-text
parse error. Emit an end-of-file token.
Anything else
Emit the current input character as a character token.
13.2.5.28 Script data double escaped dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the script data double escaped dash dash state. Emit a U+002D
HYPHEN-MINUS character token.
U+003C LESS-THAN SIGN (<)
Switch to the script data double escaped less-than sign state. Emit a U+003C
LESS-THAN SIGN character token.
U+0000 NULL
This is an unexpected-null-character parse
error. Switch to the script data double escaped state. Emit a U+FFFD
REPLACEMENT CHARACTER character token.
EOF
This is an eof-in-script-html-comment-like-text
parse error. Emit an end-of-file token.
Anything else
Switch to the script data double escaped state. Emit the current input
character as a character token.
13.2.5.29 Script data double escaped dash dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Emit a U+002D HYPHEN-MINUS character token.
U+003C LESS-THAN SIGN (<)
Switch to the script data double escaped less-than sign state. Emit a U+003C
LESS-THAN SIGN character token.
U+003E GREATER-THAN SIGN (>)
Switch to the script data state. Emit a U+003E GREATER-THAN SIGN character
token.
U+0000 NULL
This is an unexpected-null-character parse
error. Switch to the script data double escaped state. Emit a U+FFFD
REPLACEMENT CHARACTER character token.
EOF
This is an eof-in-script-html-comment-like-text
parse error. Emit an end-of-file token.
Anything else
Switch to the script data double escaped state. Emit the current input
character as a character token.
13.2.5.30 Script data double escaped less-than sign state

Consume the next input character:

U+002F SOLIDUS (/)
Set the temporary buffer to the empty string. Switch to
the script data double escape end state. Emit a U+002F SOLIDUS character token.
Anything else
Reconsume in the script data double escaped state.
13.2.5.31 Script data double escape end state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
U+002F SOLIDUS (/)
U+003E GREATER-THAN SIGN (>)
If the temporary buffer is the string «script«, then switch to the script data escaped state. Otherwise,
switch to the script data double escaped state. Emit the current input
character as a character token.
ASCII upper alpha
Append the lowercase version of the current input character (add 0x0020 to the
character’s code point) to the temporary buffer. Emit the
current input character as a character token.
ASCII lower alpha
Append the current input character to the temporary buffer. Emit the current input character as a character
token.
Anything else
Reconsume in the script data double escaped state.
13.2.5.32 Before attribute name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
U+002F SOLIDUS (/)
U+003E GREATER-THAN SIGN (>)
EOF
Reconsume in the after attribute name state.
U+003D EQUALS SIGN (=)
This is an unexpected-equals-sign-before-attribute-name
parse error. Start a new attribute in the current tag token. Set that attribute’s
name to the current input character, and its value to the empty string. Switch to
the attribute name state.
Anything else
Start a new attribute in the current tag token. Set that attribute name and value to the
empty string. Reconsume in the attribute name state.
13.2.5.33 Attribute name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
U+002F SOLIDUS (/)
U+003E GREATER-THAN SIGN (>)
EOF
Reconsume in the after attribute name state.
U+003D EQUALS SIGN (=)
Switch to the before attribute value state.
ASCII upper alpha
Append the lowercase version of the current input character (add 0x0020 to the
character’s code point) to the current attribute’s name.
U+0000 NULL
This is an unexpected-null-character parse
error. Append a U+FFFD REPLACEMENT CHARACTER character to the current attribute’s
name.
U+0022 QUOTATION MARK («)
U+0027 APOSTROPHE (‘)
U+003C LESS-THAN SIGN (<)
This is an unexpected-character-in-attribute-name
parse error. Treat it as per the «anything else» entry below.
Anything else
Append the current input character to the current attribute’s name.

When the user agent leaves the attribute name state (and before emitting the tag token, if
appropriate), the complete attribute’s name must be compared to the other attributes on the same
token; if there is already an attribute on the token with the exact same name, then this is a
duplicate-attribute parse error
and the new attribute must be removed from the token.

If an attribute is so removed from a token, it, and the value that gets associated
with it, if any, are never subsequently used by the parser, and are therefore effectively
discarded. Removing the attribute in this way does not change its status as the «current
attribute» for the purposes of the tokenizer, however.

13.2.5.34 After attribute name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
U+002F SOLIDUS (/)
Switch to the self-closing start tag state.
U+003D EQUALS SIGN (=)
Switch to the before attribute value state.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current tag token.
EOF
This is an eof-in-tag
parse error. Emit an end-of-file token.
Anything else
Start a new attribute in the current tag token. Set that attribute name and value to the
empty string. Reconsume in the attribute name state.
13.2.5.35 Before attribute value state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
U+0022 QUOTATION MARK («)
Switch to the attribute value (double-quoted) state.
U+0027 APOSTROPHE (‘)
Switch to the attribute value (single-quoted) state.
U+003E GREATER-THAN SIGN (>)
This is a missing-attribute-value
parse error. Switch to the data state. Emit the current tag token.
Anything else
Reconsume in the attribute value (unquoted) state.
13.2.5.36 Attribute value (double-quoted) state

Consume the next input character:

U+0022 QUOTATION MARK («)
Switch to the after attribute value (quoted) state.
U+0026 AMPERSAND (&)
Set the return state to the attribute value
(double-quoted) state. Switch to the character reference state.
U+0000 NULL
This is an unexpected-null-character parse
error. Append a U+FFFD REPLACEMENT CHARACTER character to the current attribute’s
value.
EOF
This is an eof-in-tag parse error.
Emit an end-of-file token.
Anything else
Append the current input character to the current attribute’s value.
13.2.5.37 Attribute value (single-quoted) state

Consume the next input character:

U+0027 APOSTROPHE (‘)
Switch to the after attribute value (quoted) state.
U+0026 AMPERSAND (&)
Set the return state to the attribute value
(single-quoted) state. Switch to the character reference state.
U+0000 NULL
This is an unexpected-null-character parse
error. Append a U+FFFD REPLACEMENT CHARACTER character to the current attribute’s
value.
EOF
This is an eof-in-tag parse error.
Emit an end-of-file token.
Anything else
Append the current input character to the current attribute’s value.
13.2.5.38 Attribute value (unquoted) state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Switch to the before attribute name state.
U+0026 AMPERSAND (&)
Set the return state to the attribute value
(unquoted) state. Switch to the character reference state.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current tag token.
U+0000 NULL
This is an unexpected-null-character parse
error. Append a U+FFFD REPLACEMENT CHARACTER character to the current attribute’s
value.
U+0022 QUOTATION MARK («)
U+0027 APOSTROPHE (‘)
U+003C LESS-THAN SIGN (<)
U+003D EQUALS SIGN (=)
U+0060 GRAVE ACCENT (`)
This is an unexpected-character-in-unquoted-attribute-value
parse error. Treat it as per the «anything else» entry below.
EOF
This is an eof-in-tag
parse error. Emit an end-of-file token.
Anything else
Append the current input character to the current attribute’s value.
13.2.5.39 After attribute value (quoted) state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Switch to the before attribute name state.
U+002F SOLIDUS (/)
Switch to the self-closing start tag state.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current tag token.
EOF
This is an eof-in-tag parse error.
Emit an end-of-file token.
Anything else
This is a missing-whitespace-between-attributes
parse error. Reconsume in the before attribute name state.
13.2.5.40 Self-closing start tag state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
Set the self-closing flag of the current tag token. Switch
to the data state. Emit the current tag token.
EOF
This is an eof-in-tag
parse error. Emit an end-of-file token.
Anything else
This is an unexpected-solidus-in-tag parse
error. Reconsume in the before attribute name state.

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current comment token.
EOF
Emit the comment. Emit an end-of-file token.
U+0000 NULL
This is an unexpected-null-character parse
error. Append a U+FFFD REPLACEMENT CHARACTER character to the comment token’s data.
Anything else
Append the current input character to the comment token’s data.
13.2.5.42 Markup declaration open state

If the next few characters are:

Two U+002D HYPHEN-MINUS characters (-)
Consume those two characters, create a comment token whose data is the empty string, and
switch to the comment start state.
ASCII case-insensitive match for the word «DOCTYPE»
Consume those characters and switch to the DOCTYPE state.
The string «[CDATA[» (the five uppercase letters «CDATA» with a U+005B LEFT SQUARE BRACKET
character before and after)
Consume those characters. If there is an adjusted current node and it is not
an element in the HTML namespace, then switch to the CDATA section
state. Otherwise, this is a cdata-in-html-content parse
error. Create a comment token whose data is the «[CDATA[» string. Switch to the
bogus comment state.
Anything else
This is an incorrectly-opened-comment
parse error. Create a comment token whose data is the empty string. Switch to the
bogus comment state (don’t consume anything in the current state).

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the comment start dash state.
U+003E GREATER-THAN SIGN (>)
This is an abrupt-closing-of-empty-comment parse
error. Switch to the data state. Emit the current comment token.
Anything else
Reconsume in the comment state.

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the comment end state.
U+003E GREATER-THAN SIGN (>)
This is an abrupt-closing-of-empty-comment parse
error. Switch to the data state. Emit the current comment token.
EOF
This is an eof-in-comment parse
error. Emit the current comment token. Emit an end-of-file token.
Anything else
Append a U+002D HYPHEN-MINUS character (-) to the comment token’s data.
Reconsume in the comment state.

Consume the next input character:

U+003C LESS-THAN SIGN (<)
Append the current input character to the comment token’s data. Switch to the
comment less-than sign state.
U+002D HYPHEN-MINUS (-)
Switch to the comment end dash state.
U+0000 NULL
This is an unexpected-null-character parse
error. Append a U+FFFD REPLACEMENT CHARACTER character to the comment token’s data.
EOF
This is an eof-in-comment parse
error. Emit the current comment token. Emit an end-of-file token.
Anything else
Append the current input character to the comment token’s data.

Consume the next input character:

U+0021 EXCLAMATION MARK (!)
Append the current input character to the comment token’s data. Switch to the
comment less-than sign bang state.
U+003C LESS-THAN SIGN (<)
Append the current input character to the comment token’s data.
Anything else
Reconsume in the comment state.

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the comment less-than sign bang dash state.
Anything else
Reconsume in the comment state.

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the comment less-than sign bang dash dash state.
Anything else
Reconsume in the comment end dash state.

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
EOF
Reconsume in the comment end state.
Anything else
This is a nested-comment parse
error. Reconsume in the comment end state.

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the comment end state.
EOF
This is an eof-in-comment parse
error. Emit the current comment token. Emit an end-of-file token.
Anything else
Append a U+002D HYPHEN-MINUS character (-) to the comment token’s data.
Reconsume in the comment state.

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current comment token.
U+0021 EXCLAMATION MARK (!)
Switch to the comment end bang state.
U+002D HYPHEN-MINUS (-)
Append a U+002D HYPHEN-MINUS character (-) to the comment token’s data.
EOF
This is an eof-in-comment parse
error. Emit the current comment token. Emit an end-of-file token.
Anything else
Append two U+002D HYPHEN-MINUS characters (-) to the comment token’s data.
Reconsume in the comment state.

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Append two U+002D HYPHEN-MINUS characters (-) and a U+0021 EXCLAMATION MARK character (!) to
the comment token’s data. Switch to the comment end dash state.
U+003E GREATER-THAN SIGN (>)
This is an incorrectly-closed-comment parse
error. Switch to the data state. Emit the current comment token.
EOF
This is an eof-in-comment parse
error. Emit the current comment token. Emit an end-of-file token.
Anything else
Append two U+002D HYPHEN-MINUS characters (-) and a U+0021 EXCLAMATION MARK character (!) to
the comment token’s data. Reconsume in the comment state.
13.2.5.53 DOCTYPE state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Switch to the before DOCTYPE name state.
U+003E GREATER-THAN SIGN (>)
Reconsume in the before DOCTYPE name state.
EOF
This is an eof-in-doctype parse
error. Create a new DOCTYPE token. Set its force-quirks
flag
to on. Emit the current token. Emit an end-of-file token.
Anything else
This is a missing-whitespace-before-doctype-name
parse error. Reconsume in the before DOCTYPE name state.
13.2.5.54 Before DOCTYPE name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
ASCII upper alpha
Create a new DOCTYPE token. Set the token’s name to the lowercase version of the
current input character (add 0x0020 to the character’s code point). Switch to the
DOCTYPE name state.
U+0000 NULL
This is an unexpected-null-character parse
error. Create a new DOCTYPE token. Set the token’s name to a U+FFFD REPLACEMENT CHARACTER
character. Switch to the DOCTYPE name state.
U+003E GREATER-THAN SIGN (>)
This is a missing-doctype-name
parse error. Create a new DOCTYPE token. Set its force-quirks flag to on. Switch to the data
state. Emit the current token.
EOF
This is an eof-in-doctype parse
error. Create a new DOCTYPE token. Set its force-quirks
flag
to on. Emit the current token. Emit an end-of-file token.
Anything else
Create a new DOCTYPE token. Set the token’s name to the current input character.
Switch to the DOCTYPE name state.
13.2.5.55 DOCTYPE name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Switch to the after DOCTYPE name state.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current DOCTYPE token.
ASCII upper alpha
Append the lowercase version of the current input character (add 0x0020 to the
character’s code point) to the current DOCTYPE token’s name.
U+0000 NULL
This is an unexpected-null-character parse
error. Append a U+FFFD REPLACEMENT CHARACTER character to the current DOCTYPE token’s
name.
EOF
This is an eof-in-doctype parse
error. Set the current DOCTYPE token’s force-quirks flag to
on. Emit the current DOCTYPE token. Emit an end-of-file token.
Anything else
Append the current input character to the current DOCTYPE token’s name.
13.2.5.56 After DOCTYPE name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current DOCTYPE token.
EOF
This is an eof-in-doctype parse
error. Set the current DOCTYPE token’s force-quirks flag to
on. Emit the current DOCTYPE token. Emit an end-of-file token.
Anything else

If the six characters starting from the current input character are an
ASCII case-insensitive match for the word «PUBLIC», then consume those characters
and switch to the after DOCTYPE public keyword state.

Otherwise, if the six characters starting from the current input character are
an ASCII case-insensitive match for the word «SYSTEM», then consume those
characters and switch to the after DOCTYPE system keyword state.

Otherwise, this is an invalid-character-sequence-after-doctype-name
parse error. Set the current DOCTYPE token’s force-quirks
flag
to on. Reconsume in the bogus DOCTYPE state.

13.2.5.57 After DOCTYPE public keyword state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Switch to the before DOCTYPE public identifier state.
U+0022 QUOTATION MARK («)
This is a missing-whitespace-after-doctype-public-keyword
parse error. Set the current DOCTYPE token’s public identifier to the empty string (not
missing), then switch to the DOCTYPE public identifier (double-quoted) state.
U+0027 APOSTROPHE (‘)
This is a missing-whitespace-after-doctype-public-keyword
parse error. Set the current DOCTYPE token’s public identifier to the empty string (not
missing), then switch to the DOCTYPE public identifier (single-quoted) state.
U+003E GREATER-THAN SIGN (>)
This is a missing-doctype-public-identifier parse
error. Set the current DOCTYPE token’s force-quirks flag to
on. Switch to the data state. Emit the current DOCTYPE token.
EOF
This is an eof-in-doctype parse
error. Set the current DOCTYPE token’s force-quirks flag to
on. Emit the current DOCTYPE token. Emit an end-of-file token.
Anything else
This is a missing-quote-before-doctype-public-identifier
parse error. Set the current DOCTYPE token’s force-quirks
flag
to on. Reconsume in the bogus DOCTYPE state.
13.2.5.58 Before DOCTYPE public identifier state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
U+0022 QUOTATION MARK («)
Set the current DOCTYPE token’s public identifier to the empty string (not missing), then switch to
the DOCTYPE public identifier (double-quoted) state.
U+0027 APOSTROPHE (‘)
Set the current DOCTYPE token’s public identifier to the empty string (not missing), then switch to
the DOCTYPE public identifier (single-quoted) state.
U+003E GREATER-THAN SIGN (>)
This is a missing-doctype-public-identifier
parse error. Set the current DOCTYPE token’s force-quirks
flag
to on. Switch to the data state. Emit the current DOCTYPE token.
EOF
This is an eof-in-doctype
parse error. Set the current DOCTYPE token’s force-quirks
flag
to on. Emit the current DOCTYPE token. Emit an end-of-file token.
Anything else
This is a missing-quote-before-doctype-public-identifier
parse error. Set the current DOCTYPE token’s force-quirks
flag
to on. Reconsume in the bogus DOCTYPE state.
13.2.5.59 DOCTYPE public identifier (double-quoted) state

Consume the next input character:

U+0022 QUOTATION MARK («)
Switch to the after DOCTYPE public identifier state.
U+0000 NULL
This is an unexpected-null-character parse
error. Append a U+FFFD REPLACEMENT CHARACTER character to the current DOCTYPE token’s
public identifier.
U+003E GREATER-THAN SIGN (>)
This is an abrupt-doctype-public-identifier
parse error. Set the current DOCTYPE token’s force-quirks
flag
to on. Switch to the data state. Emit the current DOCTYPE token.
EOF
This is an eof-in-doctype parse
error. Set the current DOCTYPE token’s force-quirks flag to
on. Emit the current DOCTYPE token. Emit an end-of-file token.
Anything else
Append the current input character to the current DOCTYPE token’s public
identifier.
13.2.5.60 DOCTYPE public identifier (single-quoted) state

Consume the next input character:

U+0027 APOSTROPHE (‘)
Switch to the after DOCTYPE public identifier state.
U+0000 NULL
This is an unexpected-null-character parse
error. Append a U+FFFD REPLACEMENT CHARACTER character to the current DOCTYPE token’s
public identifier.
U+003E GREATER-THAN SIGN (>)
This is an abrupt-doctype-public-identifier
parse error. Set the current DOCTYPE token’s force-quirks
flag
to on. Switch to the data state. Emit the current DOCTYPE token.
EOF
This is an eof-in-doctype parse
error. Set the current DOCTYPE token’s force-quirks flag to
on. Emit the current DOCTYPE token. Emit an end-of-file token.
Anything else
Append the current input character to the current DOCTYPE token’s public
identifier.
13.2.5.61 After DOCTYPE public identifier state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Switch to the between DOCTYPE public and system identifiers state.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current DOCTYPE token.
U+0022 QUOTATION MARK («)
This is a missing-whitespace-between-doctype-public-and-system-identifiers
parse error. Set the current DOCTYPE token’s system identifier to the empty string (not
missing), then switch to the DOCTYPE system identifier (double-quoted) state.
U+0027 APOSTROPHE (‘)
This is a missing-whitespace-between-doctype-public-and-system-identifiers
parse error. Set the current DOCTYPE token’s system identifier to the empty string (not
missing), then switch to the DOCTYPE system identifier (single-quoted) state.
EOF
This is an eof-in-doctype parse
error. Set the current DOCTYPE token’s force-quirks flag to
on. Emit the current DOCTYPE token. Emit an end-of-file token.
Anything else
This is a missing-quote-before-doctype-system-identifier
parse error. Set the current DOCTYPE token’s force-quirks
flag
to on. Reconsume in the bogus DOCTYPE state.
13.2.5.62 Between DOCTYPE public and system identifiers state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current DOCTYPE token.
U+0022 QUOTATION MARK («)
Set the current DOCTYPE token’s system identifier to the empty string (not missing), then switch to
the DOCTYPE system identifier (double-quoted) state.
U+0027 APOSTROPHE (‘)
Set the current DOCTYPE token’s system identifier to the empty string (not missing), then switch to
the DOCTYPE system identifier (single-quoted) state.
EOF
This is an eof-in-doctype parse
error. Set the current DOCTYPE token’s force-quirks flag to
on. Emit the current DOCTYPE token. Emit an end-of-file token.
Anything else
This is a missing-quote-before-doctype-system-identifier
parse error. Set the current DOCTYPE token’s force-quirks
flag
to on. Reconsume in the bogus DOCTYPE state.
13.2.5.63 After DOCTYPE system keyword state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Switch to the before DOCTYPE system identifier state.
U+0022 QUOTATION MARK («)
This is a missing-whitespace-after-doctype-system-keyword
parse error. Set the current DOCTYPE token’s system identifier to the empty string (not
missing), then switch to the DOCTYPE system identifier (double-quoted) state.
U+0027 APOSTROPHE (‘)
This is a missing-whitespace-after-doctype-system-keyword
parse error. Set the current DOCTYPE token’s system identifier to the empty string (not
missing), then switch to the DOCTYPE system identifier (single-quoted) state.
U+003E GREATER-THAN SIGN (>)
This is a missing-doctype-system-identifier
parse error. Set the current DOCTYPE token’s force-quirks
flag
to on. Switch to the data state. Emit the current DOCTYPE token.
EOF
This is an eof-in-doctype parse
error. Set the current DOCTYPE token’s force-quirks flag to
on. Emit the current DOCTYPE token. Emit an end-of-file token.
Anything else
This is a missing-quote-before-doctype-system-identifier
parse error. Set the current DOCTYPE token’s force-quirks
flag
to on. Reconsume in the bogus DOCTYPE state.
13.2.5.64 Before DOCTYPE system identifier state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
U+0022 QUOTATION MARK («)
Set the current DOCTYPE token’s system identifier to the empty string (not missing), then switch to
the DOCTYPE system identifier (double-quoted) state.
U+0027 APOSTROPHE (‘)
Set the current DOCTYPE token’s system identifier to the empty string (not missing), then switch to
the DOCTYPE system identifier (single-quoted) state.
U+003E GREATER-THAN SIGN (>)
This is a missing-doctype-system-identifier parse
error. Set the current DOCTYPE token’s force-quirks flag to
on. Switch to the data state. Emit the current DOCTYPE token.
EOF
This is an eof-in-doctype parse
error. Set the current DOCTYPE token’s force-quirks flag to
on. Emit the current DOCTYPE token. Emit an end-of-file token.
Anything else
This is a missing-quote-before-doctype-system-identifier
parse error. Set the current DOCTYPE token’s force-quirks
flag
to on. Reconsume in the bogus DOCTYPE state.
13.2.5.65 DOCTYPE system identifier (double-quoted) state

Consume the next input character:

U+0022 QUOTATION MARK («)
Switch to the after DOCTYPE system identifier state.
U+0000 NULL
This is an unexpected-null-character parse
error. Append a U+FFFD REPLACEMENT CHARACTER character to the current DOCTYPE token’s
system identifier.
U+003E GREATER-THAN SIGN (>)
This is an abrupt-doctype-system-identifier
parse error. Set the current DOCTYPE token’s force-quirks
flag
to on. Switch to the data state. Emit the current DOCTYPE token.
EOF
This is an eof-in-doctype parse
error. Set the current DOCTYPE token’s force-quirks flag to
on. Emit the current DOCTYPE token. Emit an end-of-file token.
Anything else
Append the current input character to the current DOCTYPE token’s system
identifier.
13.2.5.66 DOCTYPE system identifier (single-quoted) state

Consume the next input character:

U+0027 APOSTROPHE (‘)
Switch to the after DOCTYPE system identifier state.
U+0000 NULL
This is an unexpected-null-character parse
error. Append a U+FFFD REPLACEMENT CHARACTER character to the current DOCTYPE token’s
system identifier.
U+003E GREATER-THAN SIGN (>)
This is an abrupt-doctype-system-identifier
parse error. Set the current DOCTYPE token’s force-quirks
flag
to on. Switch to the data state. Emit the current DOCTYPE token.
EOF
This is an eof-in-doctype parse
error. Set the current DOCTYPE token’s force-quirks flag to
on. Emit the current DOCTYPE token. Emit an end-of-file token.
Anything else
Append the current input character to the current DOCTYPE token’s system
identifier.
13.2.5.67 After DOCTYPE system identifier state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current DOCTYPE token.
EOF
This is an eof-in-doctype parse
error. Set the current DOCTYPE token’s force-quirks flag to
on. Emit the current DOCTYPE token. Emit an end-of-file token.
Anything else
This is an unexpected-character-after-doctype-system-identifier
parse error. Reconsume in the bogus DOCTYPE state. (This
does not set the current DOCTYPE token’s force-quirks flag to
on.)
13.2.5.68 Bogus DOCTYPE state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the DOCTYPE token.
U+0000 NULL
This is an unexpected-null-character parse
error. Ignore the character.
EOF
Emit the DOCTYPE token. Emit an end-of-file token.
Anything else
Ignore the character.
13.2.5.69 CDATA section state

Consume the next input character:

U+005D RIGHT SQUARE BRACKET (])
Switch to the CDATA section bracket state.
EOF
This is an eof-in-cdata parse
error. Emit an end-of-file token.
Anything else
Emit the current input character as a character token.

U+0000 NULL characters are handled in the tree construction stage, as part of the
in foreign content insertion mode, which
is the only place where CDATA sections can appear.

13.2.5.70 CDATA section bracket state

Consume the next input character:

U+005D RIGHT SQUARE BRACKET (])
Switch to the CDATA section end state.
Anything else
Emit a U+005D RIGHT SQUARE BRACKET character token. Reconsume in the
CDATA section state.
13.2.5.71 CDATA section end state

Consume the next input character:

U+005D RIGHT SQUARE BRACKET (])
Emit a U+005D RIGHT SQUARE BRACKET character token.
U+003E GREATER-THAN SIGN character
Switch to the data state.
Anything else
Emit two U+005D RIGHT SQUARE BRACKET character tokens. Reconsume in the
CDATA section state.
13.2.5.72 Character reference state

Set the temporary buffer to the empty string. Append
a U+0026 AMPERSAND (&) character to the temporary
buffer
. Consume the next input character:

ASCII alphanumeric
Reconsume in the named character reference state.
U+0023 NUMBER SIGN (#)
Append the current input character to the
temporary buffer. Switch to the numeric character
reference state.
Anything else
Flush code points consumed as a character reference. Reconsume in
the return state.
13.2.5.73 Named character reference state

Consume the maximum number of characters possible, where the consumed characters are one of the
identifiers in the first column of the named character references table. Append each
character to the temporary buffer when it’s consumed.

If there is a match

If the character reference was consumed as part of an
attribute, and the last character matched is not a U+003B SEMICOLON character (;), and
the next input character is either a U+003D EQUALS SIGN character (=) or an
ASCII alphanumeric, then, for historical reasons, flush code points consumed
as a character reference and switch to the return state.

Otherwise:

  1. If the last character matched is not a U+003B SEMICOLON character (;), then this is a
    missing-semicolon-after-character-reference
    parse error.

  2. Set the temporary buffer to the empty string.
    Append one or two characters corresponding to the character reference name (as given by the
    second column of the named character references table) to the temporary buffer.

  3. Flush code points consumed as a character reference. Switch to the return state.
Otherwise
Flush code points consumed as a character reference. Switch to the
ambiguous ampersand state.

If the markup contains (not in an attribute) the string I'm &notit; I
tell you
, the character reference is parsed as «not», as in, I'm ¬it;
I tell you
(and this is a parse error). But if the markup was I'm
&notin; I tell you
, the character reference would be parsed as «notin;», resulting
in I'm ∉ I tell you (and no parse error).

However, if the markup contains the string I'm &notit; I tell you
in an attribute, no character reference is parsed and string remains intact (and there is no
parse error).

13.2.5.74 Ambiguous ampersand state

Consume the next input character:

ASCII alphanumeric
If the character reference was consumed as part of an
attribute, then append the current input character to the current
attribute’s value. Otherwise, emit the current input character as a character
token.
U+003B SEMICOLON (;)
This is an unknown-named-character-reference
parse error. Reconsume in the return
state
.

Anything else
Reconsume in the return state.
13.2.5.75 Numeric character reference state

Set the character reference code to
zero (0).

Consume the next input character:

U+0078 LATIN SMALL LETTER X
U+0058 LATIN CAPITAL LETTER X
Append the current input character to the temporary buffer. Switch to the hexadecimal character reference start
state.
Anything else
Reconsume in the decimal character reference start state.
13.2.5.76 Hexadecimal character reference start state

Consume the next input character:

ASCII hex digit
Reconsume in the hexadecimal character reference state.
Anything else
This is an absence-of-digits-in-numeric-character-reference
parse error. Flush code points consumed as a character reference.
Reconsume in the return state.
13.2.5.77 Decimal character reference start state

Consume the next input character:

ASCII digit
Reconsume in the decimal character reference state.
Anything else
This is an absence-of-digits-in-numeric-character-reference
parse error. Flush code points consumed as a character reference.
Reconsume in the return state.
13.2.5.78 Hexadecimal character reference state

Consume the next input character:

ASCII digit
Multiply the character reference code by 16.
Add a numeric version of the current input character (subtract 0x0030 from the
character’s code point) to the character reference
code
.
ASCII upper hex digit
Multiply the character reference code by 16.
Add a numeric version of the current input character as a hexadecimal digit
(subtract 0x0037 from the character’s code point) to the character reference code.
ASCII lower hex digit
Multiply the character reference code by 16.
Add a numeric version of the current input character as a hexadecimal digit
(subtract 0x0057 from the character’s code point) to the character reference code.
U+003B SEMICOLON
Switch to the numeric character reference end state.
Anything else
This is a missing-semicolon-after-character-reference
parse error. Reconsume in the numeric character reference end
state.
13.2.5.79 Decimal character reference state

Consume the next input character:

ASCII digit
Multiply the character reference code by 10.
Add a numeric version of the current input character (subtract 0x0030 from the
character’s code point) to the character reference
code
.
U+003B SEMICOLON
Switch to the numeric character reference end state.
Anything else
This is a missing-semicolon-after-character-reference
parse error. Reconsume in the numeric character reference end
state.
13.2.5.80 Numeric character reference end state

Check the character reference code:

  • If the number is 0x00, then this is a null-character-reference parse
    error. Set the character reference code to
    0xFFFD.

  • If the number is greater than 0x10FFFF, then this is a character-reference-outside-unicode-range
    parse error. Set the character reference
    code
    to 0xFFFD.

  • If the number is a surrogate, then this is a surrogate-character-reference
    parse error. Set the character reference
    code
    to 0xFFFD.

  • If the number is a noncharacter, then this is a
    noncharacter-character-reference
    parse error.

  • If the number is 0x0D, or a
    control that’s not ASCII whitespace, then this is a
    control-character-reference
    parse error. If the number is one of the numbers in the first column of the
    following table, then find the row with that number in the first column, and set the character reference code to the number in the second
    column of that row.

    Number Code point
    0x80 0x20AC EURO SIGN (€)

    0x82 0x201A SINGLE LOW-9 QUOTATION MARK (‚)
    0x83 0x0192 LATIN SMALL LETTER F WITH HOOK (ƒ)
    0x84 0x201E DOUBLE LOW-9 QUOTATION MARK („)
    0x85 0x2026 HORIZONTAL ELLIPSIS (…)
    0x86 0x2020 DAGGER (†)
    0x87 0x2021 DOUBLE DAGGER (‡)
    0x88 0x02C6 MODIFIER LETTER CIRCUMFLEX ACCENT (ˆ)
    0x89 0x2030 PER MILLE SIGN (‰)
    0x8A 0x0160 LATIN CAPITAL LETTER S WITH CARON (Š)
    0x8B 0x2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK (‹)
    0x8C 0x0152 LATIN CAPITAL LIGATURE OE (Œ)

    0x8E 0x017D LATIN CAPITAL LETTER Z WITH CARON (Ž)

    0x91 0x2018 LEFT SINGLE QUOTATION MARK (‘)
    0x92 0x2019 RIGHT SINGLE QUOTATION MARK (’)
    0x93 0x201C LEFT DOUBLE QUOTATION MARK (“)
    0x94 0x201D RIGHT DOUBLE QUOTATION MARK (”)
    0x95 0x2022 BULLET (•)
    0x96 0x2013 EN DASH (–)
    0x97 0x2014 EM DASH (—)
    0x98 0x02DC SMALL TILDE (˜)
    0x99 0x2122 TRADE MARK SIGN (™)
    0x9A 0x0161 LATIN SMALL LETTER S WITH CARON (š)
    0x9B 0x203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK (›)
    0x9C 0x0153 LATIN SMALL LIGATURE OE (œ)

    0x9E 0x017E LATIN SMALL LETTER Z WITH CARON (ž)
    0x9F 0x0178 LATIN CAPITAL LETTER Y WITH DIAERESIS (Ÿ)

Set the temporary buffer to the empty string. Append a
code point equal to the character reference code to
the temporary buffer. Flush code points consumed as a
character reference. Switch to the return state.

13.2.6 Tree construction

The input to the tree construction stage is a sequence of tokens from the
tokenization stage. The tree construction stage is associated with a DOM
Document object when a parser is created. The «output» of this stage consists of
dynamically modifying or extending that document’s DOM tree.

This specification does not define when an interactive user agent has to render the
Document so that it is available to the user, or when it has to begin accepting user
input.


As each token is emitted from the tokenizer, the user agent must follow the appropriate steps
from the following list, known as the tree construction dispatcher:

If the stack of open elements is empty
If the adjusted current node is an element in the HTML namespace
If the adjusted current node is a MathML text integration point and the token is a start tag whose tag name is neither «mglyph» nor «malignmark»
If the adjusted current node is a MathML text integration point and the token is a character token
If the adjusted current node is a MathML annotation-xml element and the token is a start tag whose tag name is «svg»
If the adjusted current node is an HTML integration point and the token is a start tag
If the adjusted current node is an HTML integration point and the token is a character token
If the token is an end-of-file token
Process the token according to the rules given in the section corresponding to the current
insertion mode in HTML content.
Otherwise
Process the token according to the rules given in the section for parsing tokens in foreign content.

The next token is the token that is about to be processed by the tree
construction dispatcher (even if the token is subsequently just ignored).

A node is a MathML text integration point if it is one of the following
elements:

  • A MathML mi element
  • A MathML mo element
  • A MathML mn element
  • A MathML ms element
  • A MathML mtext element

A node is an HTML integration point if it is one of the following elements:

  • A MathML annotation-xml element whose start tag token had an
    attribute with the name «encoding» whose value was an ASCII case-insensitive match
    for the string «text/html«
  • A MathML annotation-xml element whose start tag token had an
    attribute with the name «encoding» whose value was an ASCII case-insensitive match
    for the string «application/xhtml+xml«
  • An SVG foreignObject element
  • An SVG desc element
  • An SVG title element

If the node in question is the context element passed to the HTML fragment
parsing algorithm, then the start tag token for that element is the «fake» token created
during by that HTML fragment parsing algorithm.


Not all of the tag names mentioned below are conformant tag names in this
specification; many are included to handle legacy content. They still form part of the algorithm
that implementations are required to implement to claim conformance.

The algorithm described below places no limit on the depth of the DOM tree
generated, or on the length of tag names, attribute names, attribute values, Text
nodes, etc. While implementers are encouraged to avoid arbitrary limits, it is
recognized that practical concerns will likely force user agents to impose nesting depth
constraints.

13.2.6.1 Creating and inserting nodes

While the parser is processing a token, it can enable or disable foster parenting. This affects the following algorithm.

The appropriate place for inserting a node, optionally using a particular
override target, is the position in an element returned by running the following steps:

  1. If there was an override target specified, then let target be the
    override target.

    Otherwise, let target be the current node.

  2. Determine the adjusted insertion location using the first matching steps
    from the following list:

    If foster parenting is enabled and target is a table, tbody, tfoot,
    thead, or tr element

    Foster parenting happens when content is misnested in tables.

    Run these substeps:

    1. Let last template be the last template element in the
      stack of open elements, if any.

    2. Let last table be the last table element in the
      stack of open elements, if any.

    3. If there is a last template and either there is no last table, or there is one, but last template is lower
      (more recently added) than last table in the stack of open
      elements, then: let adjusted insertion location be inside last template‘s template contents, after its last child (if any),
      and abort these steps.

    4. If there is no last table, then let adjusted insertion
      location
      be inside the first element in the stack of open elements (the
      html element), after its last child (if any), and abort these steps.
      (fragment case)

    5. If last table has a parent node, then let adjusted insertion location be inside last table‘s parent
      node, immediately before last table, and abort these steps.

    6. Let previous element be the element immediately above last table in the stack of open elements.

    7. Let adjusted insertion location be inside previous
      element
      , after its last child (if any).

    These steps are involved in part because it’s possible for elements, the
    table element in this case in particular, to have been moved by a script around
    in the DOM, or indeed removed from the DOM entirely, after the element was inserted by the
    parser.

    Otherwise

    Let adjusted insertion location be inside target,
    after its last child (if any).

  3. If the adjusted insertion location is inside a template
    element, let it instead be inside the template element’s template
    contents, after its last child (if any).

  4. Return the adjusted insertion location.


When the steps below require the UA to create an
element for a token
in a particular given namespace and with a
particular intended parent, the UA must run the following steps:

  1. If the active speculative HTML parser is not null, then return the result of
    creating a speculative mock element
    given given namespace, the tag name of the given token, and the attributes of the
    given token.

  2. Otherwise, optionally create a speculative mock element given given
    namespace
    , the tag name of the given token, and the attributes of the given token.

    The result is not used. This step allows for a speculative fetch to
    be initiated from non-speculative parsing. The fetch is still speculative at this point,
    because, for example, by the time the element is inserted, intended parent might
    have been removed from the document.

  3. Let document be intended parent‘s node document.

  4. Let local name be the tag name of the token.

  5. Let is be the value of the «is» attribute in the
    given token, if such an attribute exists, or null otherwise.

  6. Let definition be the result of looking up a custom element definition given document, given
    namespace
    , local name, and is.

  7. If definition is non-null and the parser was not created as part of the
    HTML fragment parsing algorithm, then let will execute script be true.
    Otherwise, let it be false.

  8. If will execute script is true, then:

    1. Increment document‘s throw-on-dynamic-markup-insertion
      counter.

    2. If the JavaScript execution context stack is empty, then perform a
      microtask checkpoint.

    3. Push a new element queue onto document‘s relevant
      agent’s custom element reactions stack.

  9. Let element be the result of creating an
    element given document, localName, given
    namespace
    , null, and is. If will execute script is true, set the
    synchronous custom elements flag; otherwise, leave it unset.

    This will cause custom element
    constructors to run, if will execute script is true. However, since we
    incremented the throw-on-dynamic-markup-insertion counter, this cannot cause new characters to be inserted into the tokenizer, or the document to be blown away.

  10. Append each attribute in the given
    token to element.

    This can enqueue a custom element callback reaction for the
    attributeChangedCallback, which might run immediately (in the next
    step).

    Even though the is attribute governs the creation of a customized built-in element, it is
    not present during the execution of the relevant custom element constructor; it is
    appended in this step, along with all other attributes.

  11. If will execute script is true, then:

    1. Let queue be the result of popping from document‘s relevant
      agent’s custom element reactions stack. (This will be the same
      element queue as was pushed above.)

    2. Invoke custom element reactions in queue.

    3. Decrement document‘s throw-on-dynamic-markup-insertion
      counter.

  12. If element has an xmlns attribute in the XMLNS
    namespace
    whose value is not exactly the same as the element’s namespace, that is a
    parse error. Similarly, if element has an xmlns:xlink attribute in the XMLNS namespace whose value is not the
    XLink Namespace, that is a parse error.

  13. If element is a resettable element, invoke
    its reset algorithm. (This initializes the
    element’s value and checkedness based on the element’s attributes.)

  14. If element is a form-associated element and not a
    form-associated custom element, the
    form element pointer is not null, there is no
    template element on the stack of open elements, element is
    either not listed or doesn’t have a form attribute, and the intended parent is in the same
    tree as the element pointed to by the form element
    pointer, then associate element
    with the form element pointed to by the form element
    pointer and set element‘s parser inserted flag.

  15. Return element.


When the steps below require the user agent to insert a foreign element for a token
in a given namespace, the user agent must run these steps:

  1. Let the adjusted insertion location be the appropriate place for
    inserting a node.

  2. Let element be the result of creating an element for the token in the given namespace, with the intended
    parent being the element in which the adjusted insertion location finds
    itself.

  3. If it is possible to insert element at the adjusted insertion location,
    then:

    1. If the parser was not created as part of the HTML fragment parsing
      algorithm, then push a new element queue onto element‘s
      relevant agent’s custom element reactions stack.

    2. Insert element at the adjusted insertion location.

    3. If the parser was not created as part of the HTML fragment parsing
      algorithm, then pop the element queue from element‘s
      relevant agent’s custom element reactions stack, and invoke
      custom element reactions in that queue.

    If the adjusted insertion location cannot accept more
    elements, e.g. because it’s a Document that already has an element child, then
    element is dropped on the floor.

  4. Push element onto the stack of open elements so that it is the new
    current node.

  5. Return element.

When the steps below require the user agent to insert an HTML element for a token,
the user agent must insert a foreign element for the token, in the HTML
namespace.


When the steps below require the user agent to adjust MathML attributes for a token,
then, if the token has an attribute named definitionurl, change its name to
definitionURL (note the case difference).

When the steps below require the user agent to adjust SVG attributes for a token,
then, for each attribute on the token whose attribute name is one of the ones in the first column
of the following table, change the attribute’s name to the name given in the corresponding cell in
the second column. (This fixes the case of SVG attributes that are not all lowercase.)

Attribute name on token Attribute name on element
attributename attributeName
attributetype attributeType
basefrequency baseFrequency
baseprofile baseProfile
calcmode calcMode
clippathunits clipPathUnits
diffuseconstant diffuseConstant
edgemode edgeMode
filterunits filterUnits
glyphref glyphRef
gradienttransform gradientTransform
gradientunits gradientUnits
kernelmatrix kernelMatrix
kernelunitlength kernelUnitLength
keypoints keyPoints
keysplines keySplines
keytimes keyTimes
lengthadjust lengthAdjust
limitingconeangle limitingConeAngle
markerheight markerHeight
markerunits markerUnits
markerwidth markerWidth
maskcontentunits maskContentUnits
maskunits maskUnits
numoctaves numOctaves
pathlength pathLength
patterncontentunits patternContentUnits
patterntransform patternTransform
patternunits patternUnits
pointsatx pointsAtX
pointsaty pointsAtY
pointsatz pointsAtZ
preservealpha preserveAlpha
preserveaspectratio preserveAspectRatio
primitiveunits primitiveUnits
refx refX
refy refY
repeatcount repeatCount
repeatdur repeatDur
requiredextensions requiredExtensions
requiredfeatures requiredFeatures
specularconstant specularConstant
specularexponent specularExponent
spreadmethod spreadMethod
startoffset startOffset
stddeviation stdDeviation
stitchtiles stitchTiles
surfacescale surfaceScale
systemlanguage systemLanguage
tablevalues tableValues
targetx targetX
targety targetY
textlength textLength
viewbox viewBox
viewtarget viewTarget
xchannelselector xChannelSelector
ychannelselector yChannelSelector
zoomandpan zoomAndPan

When the steps below require the user agent to adjust foreign attributes for a
token, then, if any of the attributes on the token match the strings given in the first column of
the following table, let the attribute be a namespaced attribute, with the prefix being the string
given in the corresponding cell in the second column, the local name being the string given in the
corresponding cell in the third column, and the namespace being the namespace given in the
corresponding cell in the fourth column. (This fixes the use of namespaced attributes, in
particular lang attributes in the XML
namespace
.)

Attribute name Prefix Local name Namespace
xlink:actuate xlink actuate XLink namespace
xlink:arcrole xlink arcrole XLink namespace
xlink:href xlink href XLink namespace
xlink:role xlink role XLink namespace
xlink:show xlink show XLink namespace
xlink:title xlink title XLink namespace
xlink:type xlink type XLink namespace
xml:lang xml lang XML namespace
xml:space xml space XML namespace
xmlns (none) xmlns XMLNS namespace
xmlns:xlink xmlns xlink XMLNS namespace

When the steps below require the user agent to insert a character while processing a
token, the user agent must run the following steps:

  1. Let data be the characters passed to the algorithm, or, if no
    characters were explicitly specified, the character of the character token being
    processed.

  2. Let the adjusted insertion location be the appropriate
    place for inserting a node.

  3. If the adjusted insertion location is in a Document node,
    then return.

    The DOM will not let Document nodes have Text node
    children, so they are dropped on the floor.

  4. If there is a Text node immediately before the adjusted insertion
    location
    , then append data to that Text node’s data.

    Otherwise, create a new Text node whose data is data and
    whose node document is the same as that of the
    element in which the adjusted insertion location finds itself, and insert
    the newly created node at the adjusted insertion location.

Here are some sample inputs to the parser and the corresponding number of Text
nodes that they result in, assuming a user agent that executes scripts.

Input Number of Text nodes
A<script>
var script = document.getElementsByTagName('script')[0];
document.body.removeChild(script);
</script>B
One Text node in the document, containing «AB».
A<script>
var text = document.createTextNode('B');
document.body.appendChild(text);
</script>C
Three Text nodes; «A» before the script, the script’s contents, and «BC» after the script (the parser appends to the Text node created by the script).
A<script>
var text = document.getElementsByTagName('script')[0].firstChild;
text.data = 'B';
document.body.appendChild(text);
</script>C
Two adjacent Text nodes in the document, containing «A» and «BC».
A<table>B<tr>C</tr>D</table>
One Text node before the table, containing «ABCD». (This is caused by foster parenting.)
A<table><tr> B</tr> C</table>
One Text node before the table, containing «A B C» (A-space-B-space-C). (This is caused by foster parenting.)
A<table><tr> B</tr> </em>C</table>
One Text node before the table, containing «A BC» (A-space-B-C), and one Text node inside the table (as a child of a tbody) with a single space character. (Space characters separated from non-space characters by non-character tokens are not affected by foster parenting, even if those other tokens then get ignored.)

When the steps below require the user agent to while processing a
comment token, optionally with an explicitly insertion position position, the
user agent must run the following steps:

  1. Let data be the data given in the comment token being
    processed.

  2. If position was specified, then let the adjusted
    insertion location
    be position. Otherwise, let adjusted
    insertion location
    be the appropriate place for inserting a node.

  3. Create a Comment node whose data attribute is set to
    data and whose node document is
    the same as that of the node in which the adjusted insertion location finds
    itself.

  4. Insert the newly created node at the adjusted insertion
    location
    .


DOM mutation events must not fire for changes caused by the UA
parsing the document. This includes the parsing of any content inserted using document.write() and document.writeln() calls. [UIEVENTS]

However, mutation observers do fire, as required by DOM
.

13.2.6.2 Parsing elements that contain only text

The generic raw text element parsing algorithm and the generic RCDATA element
parsing algorithm
consist of the following steps. These algorithms are always invoked in
response to a start tag token.

  1. Insert an HTML element for the token.

  2. If the algorithm that was invoked is the generic raw text element parsing
    algorithm, switch the tokenizer to the RAWTEXT state; otherwise the algorithm
    invoked was the generic RCDATA element parsing algorithm, switch the tokenizer to
    the RCDATA state.

  3. Let the original insertion mode be the current insertion
    mode.

  4. Then, switch the insertion mode to «text».

13.2.6.3 Closing elements that have implied end tags

When the steps below require the UA to generate implied end tags, then, while the
current node is a dd element, a dt element, an
li element, an optgroup element, an option element, a
p element, an rb element, an rp element, an rt
element, or an rtc element, the UA must pop the current node off the
stack of open elements.

If a step requires the UA to generate implied end tags but lists an element to exclude from the
process, then the UA must perform the above steps as if that element was not in the above
list.

When the steps below require the UA to generate all implied end tags thoroughly,
then, while the current node is a caption element, a
colgroup element, a dd element, a dt element, an
li element, an optgroup element, an option element, a
p element, an rb element, an rp element, an rt
element, an rtc element, a tbody element, a td element, a
tfoot element, a th element, a thead element, or a
tr element, the UA must pop the current node off the
stack of open elements.

13.2.6.4 The rules for parsing tokens in HTML content
13.2.6.4.1 The «initial» insertion mode

A Document object has an associated parser cannot change the mode flag
(a boolean). It is initially false.

When the user agent is to apply the rules for the «initial» insertion mode, the user agent must handle the token as
follows:

A character token that is one of U+0009 CHARACTER
TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF),
U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Ignore the token.

A comment token

Insert a comment as the last child of the Document object.

A DOCTYPE token

If the DOCTYPE token’s name is not «html«, or the token’s public
identifier is not missing, or the token’s system identifier is neither missing nor
«about:legacy-compat«, then there is a parse error.

Append a DocumentType node to the Document node, with its name set to the name given in the DOCTYPE token, or the
empty string if the name was missing; its public
ID set to the public identifier given in the DOCTYPE token, or the empty string if the
public identifier was missing; and its system ID
set to the system identifier given in the DOCTYPE token, or the empty string if the system
identifier was missing.

This also ensures that the DocumentType node is returned as the
value of the doctype attribute of the
Document object.

Then, if the document is not an iframe
srcdoc document, and the parser cannot
change the mode flag is false, and the DOCTYPE token matches one of the conditions in the
following list, then set the Document to quirks mode:

  • The force-quirks flag is set to on.
  • The name is not «html«.
  • The public identifier is set to: «-//W3O//DTD W3 HTML Strict 3.0//EN//»
  • The public identifier is set to: «-/W3C/DTD HTML 4.0 Transitional/EN»
  • The public identifier is set to: «HTML»
  • The system identifier is set to: «http://www.ibm.com/data/dtd/v11/ibmxhtml1-transitional.dtd»
  • The public identifier starts with: «+//Silmaril//dtd html Pro v0r11 19970101//»
  • The public identifier starts with: «-//AS//DTD HTML 3.0 asWedit + extensions//»
  • The public identifier starts with: «-//AdvaSoft Ltd//DTD HTML 3.0 asWedit + extensions//»
  • The public identifier starts with: «-//IETF//DTD HTML 2.0 Level 1//»
  • The public identifier starts with: «-//IETF//DTD HTML 2.0 Level 2//»
  • The public identifier starts with: «-//IETF//DTD HTML 2.0 Strict Level 1//»
  • The public identifier starts with: «-//IETF//DTD HTML 2.0 Strict Level 2//»
  • The public identifier starts with: «-//IETF//DTD HTML 2.0 Strict//»
  • The public identifier starts with: «-//IETF//DTD HTML 2.0//»
  • The public identifier starts with: «-//IETF//DTD HTML 2.1E//»
  • The public identifier starts with: «-//IETF//DTD HTML 3.0//»
  • The public identifier starts with: «-//IETF//DTD HTML 3.2 Final//»
  • The public identifier starts with: «-//IETF//DTD HTML 3.2//»
  • The public identifier starts with: «-//IETF//DTD HTML 3//»
  • The public identifier starts with: «-//IETF//DTD HTML Level 0//»
  • The public identifier starts with: «-//IETF//DTD HTML Level 1//»
  • The public identifier starts with: «-//IETF//DTD HTML Level 2//»
  • The public identifier starts with: «-//IETF//DTD HTML Level 3//»
  • The public identifier starts with: «-//IETF//DTD HTML Strict Level 0//»
  • The public identifier starts with: «-//IETF//DTD HTML Strict Level 1//»
  • The public identifier starts with: «-//IETF//DTD HTML Strict Level 2//»
  • The public identifier starts with: «-//IETF//DTD HTML Strict Level 3//»
  • The public identifier starts with: «-//IETF//DTD HTML Strict//»
  • The public identifier starts with: «-//IETF//DTD HTML//»
  • The public identifier starts with: «-//Metrius//DTD Metrius Presentational//»
  • The public identifier starts with: «-//Microsoft//DTD Internet Explorer 2.0 HTML Strict//»
  • The public identifier starts with: «-//Microsoft//DTD Internet Explorer 2.0 HTML//»
  • The public identifier starts with: «-//Microsoft//DTD Internet Explorer 2.0 Tables//»
  • The public identifier starts with: «-//Microsoft//DTD Internet Explorer 3.0 HTML Strict//»
  • The public identifier starts with: «-//Microsoft//DTD Internet Explorer 3.0 HTML//»
  • The public identifier starts with: «-//Microsoft//DTD Internet Explorer 3.0 Tables//»
  • The public identifier starts with: «-//Netscape Comm. Corp.//DTD HTML//»
  • The public identifier starts with: «-//Netscape Comm. Corp.//DTD Strict HTML//»
  • The public identifier starts with: «-//O'Reilly and Associates//DTD HTML 2.0//»
  • The public identifier starts with: «-//O'Reilly and Associates//DTD HTML Extended 1.0//»
  • The public identifier starts with: «-//O'Reilly and Associates//DTD HTML Extended Relaxed 1.0//»
  • The public identifier starts with: «-//SQ//DTD HTML 2.0 HoTMetaL + extensions//»
  • The public identifier starts with: «-//SoftQuad Software//DTD HoTMetaL PRO 6.0::19990601::extensions to HTML 4.0//»
  • The public identifier starts with: «-//SoftQuad//DTD HoTMetaL PRO 4.0::19971010::extensions to HTML 4.0//»
  • The public identifier starts with: «-//Spyglass//DTD HTML 2.0 Extended//»
  • The public identifier starts with: «-//Sun Microsystems Corp.//DTD HotJava HTML//»
  • The public identifier starts with: «-//Sun Microsystems Corp.//DTD HotJava Strict HTML//»
  • The public identifier starts with: «-//W3C//DTD HTML 3 1995-03-24//»
  • The public identifier starts with: «-//W3C//DTD HTML 3.2 Draft//»
  • The public identifier starts with: «-//W3C//DTD HTML 3.2 Final//»
  • The public identifier starts with: «-//W3C//DTD HTML 3.2//»
  • The public identifier starts with: «-//W3C//DTD HTML 3.2S Draft//»
  • The public identifier starts with: «-//W3C//DTD HTML 4.0 Frameset//»
  • The public identifier starts with: «-//W3C//DTD HTML 4.0 Transitional//»
  • The public identifier starts with: «-//W3C//DTD HTML Experimental 19960712//»
  • The public identifier starts with: «-//W3C//DTD HTML Experimental 970421//»
  • The public identifier starts with: «-//W3C//DTD W3 HTML//»
  • The public identifier starts with: «-//W3O//DTD W3 HTML 3.0//»
  • The public identifier starts with: «-//WebTechs//DTD Mozilla HTML 2.0//»
  • The public identifier starts with: «-//WebTechs//DTD Mozilla HTML//»
  • The system identifier is missing and the public identifier starts with: «-//W3C//DTD HTML 4.01 Frameset//»
  • The system identifier is missing and the public identifier starts with: «-//W3C//DTD HTML 4.01 Transitional//»

Otherwise, if the document is not an iframe srcdoc document, and the parser cannot change
the mode flag is false, and the DOCTYPE token matches one of the conditions in the
following list, then then set the Document to limited-quirks mode:

  • The public identifier starts with: «-//W3C//DTD XHTML 1.0 Frameset//»
  • The public identifier starts with: «-//W3C//DTD XHTML 1.0 Transitional//»
  • The system identifier is not missing and the public identifier starts with: «-//W3C//DTD HTML 4.01 Frameset//»
  • The system identifier is not missing and the public identifier starts with: «-//W3C//DTD HTML 4.01 Transitional//»

The system identifier and public identifier strings must be compared to the values given in
the lists above in an ASCII case-insensitive manner. A system identifier whose
value is the empty string is not considered missing for the purposes of the conditions
above.

Then, switch the insertion mode to «before html».

Anything else

If the document is not an iframe srcdoc document, then this is a parse
error; if the parser cannot change the mode flag is false, set the
Document to quirks mode.

In any case, switch the insertion mode to «before html», then reprocess the token.

13.2.6.4.2 The «before html» insertion mode

When the user agent is to apply the rules for the «before html» insertion mode, the user agent must handle the token as
follows:

A DOCTYPE token

Parse error. Ignore the token.

A comment token

Insert a comment as the last child of the Document object.

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C
FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Ignore the token.

A start tag whose tag name is «html»

Create an element for the token in the HTML namespace, with the
Document as the intended parent. Append it to the Document object. Put
this element in the stack of open elements.

Switch the insertion mode to «before
head».

An end tag whose tag name is one of: «head», «body», «html», «br»

Act as described in the «anything else» entry below.

Any other end tag

Parse error. Ignore the token.

Anything else

Create an html element whose node document is the Document object. Append
it to the Document object. Put this element in the stack of open
elements.

Switch the insertion mode to «before
head», then reprocess the token.

The document element can end up being removed from the Document
object, e.g. by scripts; nothing in particular happens in such cases, content continues being
appended to the nodes as described in the next section.

13.2.6.4.3 The «before head» insertion mode

When the user agent is to apply the rules for the «before head» insertion mode, the user agent must handle the token as
follows:

A character token that is one of U+0009 CHARACTER
TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF),
U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Ignore the token.

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is «html»

Process the token using the rules for the «in body» insertion mode.

A start tag whose tag name is «head»

Insert an HTML element for the token.

Set the head element pointer to the newly created
head element.

Switch the insertion mode to «in
head».

An end tag whose tag name is one of: «head», «body», «html», «br»

Act as described in the «anything else» entry below.

Any other end tag

Parse error. Ignore the token.

Anything else

Insert an HTML element for a «head» start tag token with no attributes.

Set the head element pointer to the newly created
head element.

Switch the insertion mode to «in
head».

Reprocess the current token.

13.2.6.4.4 The «in head» insertion mode

When the user agent is to apply the rules for the «in
head» insertion mode, the user agent must handle the token as follows:

A character token that is one of U+0009 CHARACTER
TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF),
U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Insert the character.

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is «html»

Process the token using the rules for the «in body» insertion mode.

A start tag whose tag name is one of: «base», «basefont»,
«bgsound», «link»

Insert an HTML element for the token. Immediately pop the current
node off the stack of open elements.

Acknowledge the token’s self-closing
flag
, if it is set.

A start tag whose tag name is «meta»

Insert an HTML element for the token. Immediately pop the current
node off the stack of open elements.

Acknowledge the token’s self-closing
flag
, if it is set.

If the active speculative HTML parser is null, then:

  1. If the element has a charset attribute, and getting an encoding from
    its value results in an encoding, and the
    confidence is currently tentative,
    then change the encoding to the resulting encoding.

  2. Otherwise, if the element has an http-equiv
    attribute whose value is an ASCII case-insensitive match for the string «Content-Type«, and the element has a content attribute, and applying the algorithm for
    extracting a character encoding from a meta element to that attribute’s
    value returns an encoding, and the
    confidence is currently tentative,
    then change the encoding to the extracted encoding.

The speculative HTML parser doesn’t speculatively apply character
encoding declarations in order to reduce implementation complexity.

A start tag whose tag name is «title»

Follow the generic RCDATA element parsing algorithm.

A start tag whose tag name is «noscript», if the scripting flag is enabled
A start tag whose tag name is one of: «noframes», «style»

Follow the generic raw text element parsing algorithm.

A start tag whose tag name is «noscript», if the scripting flag is disabled

Insert an HTML element for the token.

Switch the insertion mode to «in
head noscript».

A start tag whose tag name is «script»

Run these steps:

  1. Let the adjusted insertion location be the appropriate place
    for inserting a node.

  2. Create an element for the token in the HTML namespace, with
    the intended parent being the element in which the adjusted insertion
    location
    finds itself.

  3. Set the element’s parser document to the Document, and set the
    element’s force async to false.

    This ensures that, if the script is external, any document.write() calls in the script will execute in-line,
    instead of blowing the document away, as would happen in most other cases. It also prevents
    the script from executing until the end tag is seen.

  4. If the parser was created as part of the HTML fragment parsing algorithm,
    then set the script element’s already started to true.
    (fragment case)

  5. If the parser was invoked via the document.write() or document.writeln() methods, then optionally set the
    script element’s already started to true. (For example, the user
    agent might use this clause to prevent execution of cross-origin
    scripts inserted via document.write() under slow
    network conditions, or when the page has already taken a long time to load.)

  6. Insert the newly created element at the adjusted insertion
    location
    .

  7. Push the element onto the stack of open elements so that it is the new
    current node.

  8. Switch the tokenizer to the script data state.

  9. Let the original insertion mode be the current insertion
    mode.

  10. Switch the insertion mode to «text».

An end tag whose tag name is «head»

Pop the current node (which will be the head element) off the
stack of open elements.

Switch the insertion mode to «after
head».

An end tag whose tag name is one of: «body», «html», «br»

Act as described in the «anything else» entry below.

A start tag whose tag name is «template»

Insert an HTML element for the token.

Insert a marker at the end of the list of
active formatting elements.

Set the frameset-ok flag to «not ok».

Switch the insertion mode to «in
template».

Push «in template» onto the stack of
template insertion modes so that it is the new current template insertion
mode.

An end tag whose tag name is «template»

If there is no template element on the stack of open elements, then
this is a parse error; ignore the token.

Otherwise, run these steps:

  1. Generate all implied end tags thoroughly.

  2. If the current node is not a template element, then this is a
    parse error.

  3. Pop elements from the stack of open elements until a template
    element has been popped from the stack.

  4. Clear the list of active formatting elements up to the last marker.
  5. Pop the current template insertion mode off the stack of template
    insertion modes.

  6. Reset the insertion mode appropriately.

A start tag whose tag name is «head»
Any other end tag

Parse error. Ignore the token.

Anything else

Pop the current node (which will be the head element) off the
stack of open elements.

Switch the insertion mode to «after
head».

Reprocess the token.

13.2.6.4.5 The «in head noscript» insertion mode

When the user agent is to apply the rules for the «in head noscript» insertion mode, the user agent must handle the
token as follows:

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is «html»

Process the token using the rules for the «in body» insertion mode.

An end tag whose tag name is «noscript»

Pop the current node (which will be a noscript element) from the
stack of open elements; the new current node will be a
head element.

Switch the insertion mode to «in
head».

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C
FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE
A comment token
A start tag whose tag name is one of: «basefont», «bgsound», «link», «meta», «noframes»,
«style»

Process the token using the rules for the «in head» insertion mode.

An end tag whose tag name is «br»

Act as described in the «anything else» entry below.

A start tag whose tag name is one of: «head», «noscript»
Any other end tag

Parse error. Ignore the token.

Anything else

Parse error.

Pop the current node (which will be a noscript element) from the
stack of open elements; the new current node will be a
head element.

Switch the insertion mode to «in
head».

Reprocess the token.

13.2.6.4.6 The «after head» insertion mode

When the user agent is to apply the rules for the «after head» insertion mode, the user agent must handle the token as
follows:

A character token that is one of U+0009 CHARACTER
TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF),
U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Insert the character.

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is «html»

Process the token using the rules for the «in body» insertion mode.

A start tag whose tag name is «body»

Insert an HTML element for the token.

Set the frameset-ok flag to «not ok».

Switch the insertion mode to «in
body».

A start tag whose tag name is «frameset»

Insert an HTML element for the token.

Switch the insertion mode to «in
frameset».

A start tag whose tag name is one of: «base», «basefont», «bgsound», «link», «meta»,
«noframes», «script», «style», «template», «title»

Parse error.

Push the node pointed to by the head element pointer onto
the stack of open elements.

Process the token using the rules for the «in head» insertion mode.

Remove the node pointed to by the head element pointer
from the stack of open elements. (It might not be the current node at
this point.)

The head element pointer cannot be null at
this point.

An end tag whose tag name is «template»

Process the token using the rules for the «in head» insertion mode.

An end tag whose tag name is one of: «body», «html», «br»

Act as described in the «anything else» entry below.

A start tag whose tag name is «head»
Any other end tag

Parse error. Ignore the token.

Anything else

Insert an HTML element for a «body» start tag token with no attributes.

Switch the insertion mode to «in
body».

Reprocess the current token.

13.2.6.4.7 The «in body» insertion mode

When the user agent is to apply the rules for the «in
body» insertion mode, the user agent must handle the token as follows:

A character token that is U+0000 NULL

Parse error. Ignore the token.

A character token that is one of U+0009 CHARACTER TABULATION,
U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE
RETURN (CR), or U+0020 SPACE

Reconstruct the active formatting elements, if any.

Insert the token’s character.

Any other character token

Reconstruct the active formatting elements, if any.

Insert the token’s character.

Set the frameset-ok flag to «not ok».

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is «html»

Parse error.

If there is a template element on the stack of open elements, then
ignore the token.

Otherwise, for each attribute on the token, check to see if the attribute is already present
on the top element of the stack of open elements. If it is not, add the attribute
and its corresponding value to that element.

A start tag whose tag name is one of: «base», «basefont», «bgsound», «link», «meta»,
«noframes», «script», «style», «template», «title»
An end tag whose tag name is «template»

Process the token using the rules for the «in head» insertion mode.

A start tag whose tag name is «body»

Parse error.

If the second element on the stack of open elements is not a body
element, if the stack of open elements has only one node on it, or if there is a
template element on the stack of open elements, then ignore the token.
(fragment case)

Otherwise, set the frameset-ok flag to «not ok»; then, for each attribute on the
token, check to see if the attribute is already present on the body element (the
second element) on the stack of open elements, and if it is not, add the attribute
and its corresponding value to that element.

A start tag whose tag name is «frameset»

Parse error.

If the stack of open elements has only one node on it, or if the second element
on the stack of open elements is not a body element, then ignore the
token. (fragment case)

If the frameset-ok flag is set to «not ok», ignore the token.

Otherwise, run the following steps:

  1. Remove the second element on the stack of open elements from its parent
    node, if it has one.

  2. Pop all the nodes from the bottom of the stack of open elements, from the
    current node up to, but not including, the root html element.

  3. Insert an HTML element for the token.

  4. Switch the insertion mode to «in
    frameset».

An end-of-file token

If the stack of template insertion modes is not empty, then process the token
using the rules for the «in
template» insertion mode.

Otherwise, follow these steps:

  1. If there is a node in the stack of open elements that is not either a
    dd element, a dt element, an li element, an
    optgroup element, an option element, a p element, an
    rb element, an rp element, an rt element, an
    rtc element, a tbody element, a td element, a
    tfoot element, a th element, a thead element, a
    tr element, the body element, or the html element, then
    this is a parse error.

  2. Stop parsing.

An end tag whose tag name is «body»

If the stack of open elements does not have a body element in scope, this is a parse error;
ignore the token.

Otherwise, if there is a node in the stack of open elements that is not either a
dd element, a dt element, an li element, an
optgroup element, an option element, a p element, an
rb element, an rp element, an rt element, an
rtc element, a tbody element, a td element, a
tfoot element, a th element, a thead element, a
tr element, the body element, or the html element, then
this is a parse error.

Switch the insertion mode to «after
body».

An end tag whose tag name is «html»

If the stack of open elements does not have a body element in scope, this is a parse error;
ignore the token.

Otherwise, if there is a node in the stack of open elements that is not either a
dd element, a dt element, an li element, an
optgroup element, an option element, a p element, an
rb element, an rp element, an rt element, an
rtc element, a tbody element, a td element, a
tfoot element, a th element, a thead element, a
tr element, the body element, or the html element, then
this is a parse error.

Switch the insertion mode to «after
body».

Reprocess the token.

A start tag whose tag name is one of: «address», «article», «aside», «blockquote», «center»,
«details», «dialog», «dir», «div», «dl», «fieldset», «figcaption», «figure», «footer», «header»,
«hgroup», «main», «menu», «nav», «ol», «p», «section», «summary», «ul»

If the stack of open elements has a
p element in button scope, then close a p
element.

Insert an HTML element for the token.

A start tag whose tag name is one of: «h1», «h2», «h3», «h4»,
«h5», «h6»

If the stack of open elements has a
p element in button scope, then close a p
element.

If the current node is an HTML element whose
tag name is one of «h1», «h2», «h3», «h4», «h5», or «h6», then this is a parse
error; pop the current node off the stack of open elements.

Insert an HTML element for the token.

A start tag whose tag name is one of: «pre», «listing»

If the stack of open elements has
a p element in button scope, then close a p
element.

Insert an HTML element for the token.

If the next token is a U+000A LINE FEED (LF) character token, then ignore that
token and move on to the next one. (Newlines at the start of pre blocks are ignored
as an authoring convenience.)

Set the frameset-ok flag to «not ok».

A start tag whose tag name is «form»

If the form element pointer is not null, and there is
no template element on the stack of open elements, then this is a
parse error; ignore the token.

Otherwise:

If the stack of open elements has
a p element in button scope, then close a p
element.

Insert an HTML element for the token, and, if there is no template
element on the stack of open elements, set the form element pointer to point to the element created.

A start tag whose tag name is «li»

Run these steps:

  1. Set the frameset-ok flag to «not ok».

  2. Initialize node to be the current
    node (the bottommost node of the stack).

  3. Loop: If node is an li element, then run these
    substeps:

    1. Generate implied end tags, except for li elements.

    2. If the current node is not an li element, then this is a
      parse error.

    3. Pop elements from the stack of open elements until an li
      element has been popped from the stack.

    4. Jump to the step labeled done below.

  4. If node is in the special category, but is not an
    address, div, or p element, then jump to the step
    labeled done below.

  5. Otherwise, set node to the previous entry in the stack of open
    elements and return to the step labeled loop.

  6. Done: If the stack of open elements has a p element in button scope, then close a
    p element.

  7. Finally, insert an HTML element for the token.

A start tag whose tag name is one of: «dd», «dt»

Run these steps:

  1. Set the frameset-ok flag to «not ok».

  2. Initialize node to be the current
    node (the bottommost node of the stack).

  3. Loop: If node is a dd element, then run these
    substeps:

    1. Generate implied end tags, except for dd elements.

    2. If the current node is not a dd element, then this is a
      parse error.

    3. Pop elements from the stack of open elements until a dd
      element has been popped from the stack.

    4. Jump to the step labeled done below.

  4. If node is a dt element, then run these substeps:

    1. Generate implied end tags, except for dt elements.

    2. If the current node is not a dt element, then this is a
      parse error.

    3. Pop elements from the stack of open elements until a dt
      element has been popped from the stack.

    4. Jump to the step labeled done below.

  5. If node is in the special category, but is not an
    address, div, or p element, then jump to the step
    labeled done below.

  6. Otherwise, set node to the previous entry in the stack of open
    elements and return to the step labeled loop.

  7. Done: If the stack of open elements has a p element in button scope, then close a
    p element.

  8. Finally, insert an HTML element for the token.

A start tag whose tag name is «plaintext»

If the stack of open elements has a
p element in button scope, then close a p
element.

Insert an HTML element for the token.

Switch the tokenizer to the PLAINTEXT state.

Once a start tag with the tag name «plaintext» has been seen, that will be the
last token ever seen other than character tokens (and the end-of-file token), because there is
no way to switch out of the PLAINTEXT state.

A start tag whose tag name is «button»
  1. If the stack of open elements has a
    button element in scope, then run these substeps:

    1. Parse error.

    2. Generate implied end tags.

    3. Pop elements from the stack of open elements until a button
      element has been popped from the stack.

  2. Reconstruct the active formatting elements, if any.

  3. Insert an HTML element for the token.

  4. Set the frameset-ok flag to «not ok».

An end tag whose tag name is one of: «address», «article», «aside», «blockquote», «button»,
«center», «details», «dialog», «dir», «div», «dl», «fieldset», «figcaption», «figure», «footer»,
«header», «hgroup», «listing», «main», «menu», «nav», «ol», «pre», «section», «summary»,
«ul»

If the stack of open elements does not have an element in scope that is an HTML
element with the same tag name as that of the token, then this is a parse
error; ignore the token.

Otherwise, run these steps:

  1. Generate implied end tags.

  2. If the current node is not an HTML element with
    the same tag name as that of the token, then this is a
    parse error.

  3. Pop elements from the stack of open elements until an HTML element with the same tag name as the token has been popped from the
    stack.

An end tag whose tag name is «form»

If there is no template element on the stack of open elements, then
run these substeps:

  1. Let node be the element that the form
    element pointer is set to, or null if it is not set to an element.

  2. Set the form element pointer to null.

  3. If node is null or if the stack of open elements does
    not have node in scope, then
    this is a parse error; return and ignore the token.

  4. Generate implied end tags.

  5. If the current node is not node, then this is a
    parse error.

  6. Remove node from the stack of open elements.

If there is a template element on the stack of open
elements, then run these substeps instead:

  1. If the stack of open elements does not have a form element in scope, then this is a parse
    error; return and ignore the token.

  2. Generate implied end tags.

  3. If the current node is not a form element, then this is a
    parse error.

  4. Pop elements from the stack of open elements until a form
    element has been popped from the stack.

An end tag whose tag name is «p»

If the stack of open elements does not have a p element in button scope, then this is a parse
error; insert an HTML element for a «p» start tag token with no
attributes.

Close a p element.

An end tag whose tag name is «li»

If the stack of open elements does not have an li element in list item scope, then this is a parse
error; ignore the token.

Otherwise, run these steps:

  1. Generate implied end tags, except for li elements.

  2. If the current node is not an li element, then this is a
    parse error.

  3. Pop elements from the stack of open elements until an li
    element has been popped from the stack.

An end tag whose tag name is one of: «dd», «dt»

If the stack of open elements does not have an element in scope that is an HTML
element with the same tag name as that of the token, then this is a parse
error; ignore the token.

Otherwise, run these steps:

  1. Generate implied end tags, except for HTML elements with the
    same tag name as the token.

  2. If the current node is not an HTML
    element with the same tag name as that of the token, then this is a parse
    error.

  3. Pop elements from the stack of open elements until an HTML element with the same tag name as the token has been popped from the
    stack.

An end tag whose tag name is one of: «h1», «h2», «h3», «h4», «h5», «h6»

If the stack of open elements does not have an element in scope that is an HTML
element and whose tag name is one of «h1», «h2», «h3», «h4», «h5», or «h6», then this is
a parse error; ignore the token.

Otherwise, run these steps:

  1. Generate implied end tags.

  2. If the current node is not an HTML
    element with the same tag name as that of the token, then this is a parse
    error.

  3. Pop elements from the stack of open elements until an HTML element whose tag name is one of «h1», «h2», «h3», «h4», «h5», or «h6»
    has been popped from the stack.

An end tag whose tag name is «sarcasm»

Take a deep breath, then act as described in the «any other end
tag» entry below.

A start tag whose tag name is «a»

If the list of active formatting elements contains an a element
between the end of the list and the last marker on
the list (or the start of the list if there is no marker on the list), then this is a parse
error; run the adoption agency algorithm for the token, then remove that
element from the list of active formatting elements and the stack of open
elements if the adoption agency algorithm didn’t already remove it (it might
not have if the element is not in table
scope).

In the non-conforming stream
<a href="a">a<table><a href="b">b</table>x, the first
a element would be closed upon seeing the second one, and the «x» character would
be inside a link to «b», not to «a». This is despite the fact that the outer a
element is not in table scope (meaning that a regular </a> end tag at the start
of the table wouldn’t close the outer a element). The result is that the two
a elements are indirectly nested inside each other — non-conforming markup
will often result in non-conforming DOMs when parsed.

Reconstruct the active formatting elements, if any.

Insert an HTML element for the token. Push onto the list of active
formatting elements that element.

A start tag whose tag name is one of: «b», «big», «code», «em»,
«font», «i», «s», «small», «strike», «strong», «tt», «u»

Reconstruct the active formatting elements, if any.

Insert an HTML element for the token. Push onto the list of active
formatting elements that element.

A start tag whose tag name is «nobr»

Reconstruct the active formatting elements, if any.

If the stack of open elements has a
nobr element in scope, then this is a parse error; run the
adoption agency algorithm for the token, then once again reconstruct the
active formatting elements, if any.

Insert an HTML element for the token. Push onto the list of active
formatting elements that element.

An end tag whose tag name is one of: «a»,
«b», «big», «code», «em», «font», «i», «nobr», «s», «small»,
«strike», «strong», «tt», «u»

Run the adoption agency algorithm for the token.

A start tag whose tag name is one of: «applet», «marquee», «object»

Reconstruct the active formatting elements, if any.

Insert an HTML element for the token.

Insert a marker at the end of the list of
active formatting elements.

Set the frameset-ok flag to «not ok».

An end tag token whose tag name is one of: «applet», «marquee», «object»

If the stack of open elements does not have an element in scope that is an HTML
element with the same tag name as that of the token, then this is a parse
error; ignore the token.

Otherwise, run these steps:

  1. Generate implied end tags.

  2. If the current node is not an HTML
    element with the same tag name as that of the token, then this is a parse
    error.

  3. Pop elements from the stack of open elements until an HTML element with the same tag name as the token has been popped from the
    stack.

  4. Clear the list of active formatting elements up to the last marker.
A start tag whose tag name is «table»

If the Document is not set to quirks mode, and the
stack of open elements has a
p element in button scope, then close a p
element.

Insert an HTML element for the token.

Set the frameset-ok flag to «not ok».

Switch the insertion mode to «in
table».

An end tag whose tag name is «br»

Parse error. Drop the attributes from the token, and act as described in the
next entry; i.e. act as if this was a «br» start tag token with no attributes, rather than the
end tag token that it actually is.

A start tag whose tag name is one of: «area», «br», «embed»,
«img», «keygen», «wbr»

Reconstruct the active formatting elements, if any.

Insert an HTML element for the token. Immediately pop the current
node off the stack of open elements.

Acknowledge the token’s self-closing
flag
, if it is set.

Set the frameset-ok flag to «not ok».

A start tag whose tag name is «input»

Reconstruct the active formatting elements, if any.

Insert an HTML element for the token. Immediately pop the current
node off the stack of open elements.

Acknowledge the token’s self-closing
flag
, if it is set.

If the token does not have an attribute with the name «type», or if it does, but that
attribute’s value is not an ASCII case-insensitive match for the string «hidden«, then: set the frameset-ok flag to «not ok».

A start tag whose tag name is one of: «param», «source», «track»

Insert an HTML element for the token. Immediately pop the current
node off the stack of open elements.

Acknowledge the token’s self-closing
flag
, if it is set.

A start tag whose tag name is «hr»

If the stack of open elements has a
p element in button scope, then close a p
element.

Insert an HTML element for the token. Immediately pop the current
node off the stack of open elements.

Acknowledge the token’s self-closing
flag
, if it is set.

Set the frameset-ok flag to «not ok».

A start tag whose tag name is «image»

Parse error. Change the token’s tag name to «img» and reprocess it. (Don’t
ask.)

A start tag whose tag name is «textarea»

Run these steps:

  1. Insert an HTML element for the token.

  2. If the next token is a U+000A LINE FEED (LF) character token, then ignore
    that token and move on to the next one. (Newlines at the start of textarea
    elements are ignored as an authoring convenience.)

  3. Switch the tokenizer to the RCDATA state.

  4. Let the original insertion mode be the current insertion
    mode.

  5. Set the frameset-ok flag to «not ok».

  6. Switch the insertion mode to «text».

A start tag whose tag name is «xmp»

If the stack of open elements has a
p element in button scope, then close a p
element.

Reconstruct the active formatting elements, if any.

Set the frameset-ok flag to «not ok».

Follow the generic raw text element parsing algorithm.

A start tag whose tag name is «iframe»

Set the frameset-ok flag to «not ok».

Follow the generic raw text element parsing algorithm.

A start tag whose tag name is «noembed»
A start tag whose tag name is «noscript», if the scripting flag is enabled

Follow the generic raw text element parsing algorithm.

A start tag whose tag name is «select»

Reconstruct the active formatting elements, if any.

Insert an HTML element for the token.

Set the frameset-ok flag to «not ok».

If the insertion mode is one of «in
table», «in caption», «in table body», «in row», or «in cell», then switch the
insertion mode to «in select in
table». Otherwise, switch the insertion mode to «in select».

A start tag whose tag name is one of: «optgroup», «option»

If the current node is an option element, then pop the
current node off the stack of open elements.

Reconstruct the active formatting elements, if any.

Insert an HTML element for the token.

A start tag whose tag name is one of: «rb», «rtc»

If the stack of open elements has a
ruby element in scope, then generate implied end tags. If the
current node is not now a ruby element, this is a
parse error.

Insert an HTML element for the token.

A start tag whose tag name is one of: «rp», «rt»

If the stack of open elements has a
ruby element in scope, then generate implied end tags, except
for rtc elements. If the current node is not now a rtc
element or a ruby element, this is a parse error.

Insert an HTML element for the token.

A start tag whose tag name is «math»

Reconstruct the active formatting elements, if any.

Adjust MathML attributes for the token. (This fixes the case of MathML
attributes that are not all lowercase.)

Adjust foreign attributes for the token. (This fixes the use of namespaced
attributes, in particular XLink.)

Insert a foreign element for the token, in the MathML
namespace.

If the token has its self-closing flag set, pop the current node off the
stack of open elements and acknowledge
the token’s self-closing flag.

A start tag whose tag name is «svg»

Reconstruct the active formatting elements, if any.

Adjust SVG attributes for the token. (This fixes the case of SVG attributes that
are not all lowercase.)

Adjust foreign attributes for the token. (This fixes the use of namespaced
attributes, in particular XLink in SVG.)

Insert a foreign element for the token, in the SVG namespace.

If the token has its self-closing flag set, pop the current node off the
stack of open elements and acknowledge
the token’s self-closing flag.

A start tag whose tag name is one of: «caption», «col», «colgroup», «frame»,
«head», «tbody», «td», «tfoot», «th», «thead», «tr»

Parse error. Ignore the token.

Any other start tag

Reconstruct the active formatting elements, if any.

Insert an HTML element for the token.

This element will be an ordinary
element.

Any other end tag

Run these steps:

  1. Initialize node to be the current node (the bottommost
    node of the stack).

  2. Loop: If node is an HTML
    element with the same tag name as the token, then:

    1. Generate implied end tags, except for HTML elements with the
      same tag name as the token.

    2. If node is not the current node, then this is a
      parse error.

    3. Pop all the nodes from the current node up to node,
      including node, then stop these steps.

  3. Otherwise, if node is in the special category, then
    this is a parse error; ignore the token, and return.

  4. Set node to the previous entry in the stack of open
    elements.

  5. Return to the step labeled loop.

When the steps above say the user agent is to close a p element, it
means that the user agent must run the following steps:

  1. Generate implied end tags, except for p elements.

  2. If the current node is not a p element, then this is a
    parse error.

  3. Pop elements from the stack of open elements until a p element
    has been popped from the stack.

The adoption agency algorithm, which takes as its only argument
a token token for which the algorithm is being run, consists of the following
steps:

  1. Let subject be token‘s tag name.

  2. If the current node is an HTML element
    whose tag name is subject, and the current node is not in the
    list of active formatting elements, then pop the current node off the
    stack of open elements and return.

  3. Let outer loop counter be 0.

  4. While true:

    1. If outer loop counter is greater than or equal to 8, then return.

    2. Increment outer loop counter by 1.

    3. Let formatting element be the last element in the list of active
      formatting elements that:

      • is between the end of the list and the last marker in the list, if any, or the start of the list
        otherwise, and
      • has the tag name subject.

      If there is no such element, then return and instead act as described in the «any other end
      tag» entry above.

    4. If formatting element is not in the stack of open elements, then
      this is a parse error; remove the element from the list, and return.

    5. If formatting element is in the stack of open elements, but the
      element is not in scope, then this is a
      parse error; return.

    6. If formatting element is not the current node, this is a
      parse error. (But do not return.)

    7. Let furthest block be the topmost node in the stack of open
      elements that is lower in the stack than formatting element, and is an
      element in the special category. There might not be one.

    8. If there is no furthest block, then the UA must first pop all the nodes from
      the bottom of the stack of open elements, from the current node up to
      and including formatting element, then remove formatting element from the
      list of active formatting elements, and finally return.

    9. Let common ancestor be the element immediately above formatting
      element
      in the stack of open elements.

    10. Let a bookmark note the position of formatting element in the list of
      active formatting elements relative to the elements on either side of it in the
      list.

    11. Let node and last node be furthest block.

    12. Let inner loop counter be 0.

    13. While true:

      1. Increment inner loop counter by 1.

      2. Let node be the element immediately above node in the
        stack of open elements, or if node is no longer in the stack of
        open elements (e.g. because it got removed by this algorithm), the element that was immediately above node in
        the stack of open elements before node was removed.

      3. If node is formatting element, then break.

      4. If inner loop counter is greater than 3 and node is in the
        list of active formatting elements, then remove node from the
        list of active formatting elements.

      5. If node is not in the list of active
        formatting elements, then remove node from the stack of open
        elements and continue.

      6. Create an element for the token for which the element node was
        created, in the HTML namespace, with common ancestor as the intended
        parent; replace the entry for node in the list of active formatting
        elements with an entry for the new element, replace the entry for node in
        the stack of open elements with an entry for the new element, and let
        node be the new element.

      7. If last node is furthest block, then move the aforementioned
        bookmark to be immediately after the new node in the list of active
        formatting elements.

      8. Append last node to
        node.

      9. Set last node to node.

    14. Insert whatever last node ended up being in the previous step at the
      appropriate place for inserting a node, but using common
      ancestor
      as the override target.

    15. Create an element for the token for which formatting
      element
      was created, in the HTML namespace, with furthest
      block
      as the intended parent.

    16. Take all of the child nodes of furthest block and append them to the
      element created in the last step.

    17. Append that new element to furthest block.

    18. Remove formatting element from the list of active formatting
      elements, and insert the new element into the list of active formatting
      elements at the position of the aforementioned bookmark.

    19. Remove formatting element from the stack of open
      elements, and insert the new element into the stack of open elements
      immediately below the position of furthest block in that stack.

This algorithm’s name, the «adoption agency algorithm», comes from the way it
causes elements to change parents, and is in contrast with other possible algorithms for dealing
with misnested content.

13.2.6.4.8 The «text» insertion mode

When the user agent is to apply the rules for the «text» insertion mode, the user agent must handle the token as
follows:

A character token

Insert the token’s character.

This can never be a U+0000 NULL character; the tokenizer converts those to
U+FFFD REPLACEMENT CHARACTER characters.

An end-of-file token

Parse error.

If the current node is a script element, then set its already
started to true.

Pop the current node off the stack of open elements.

Switch the insertion mode to the original insertion mode and
reprocess the token.

An end tag whose tag name is «script»

If the active speculative HTML parser is null and the JavaScript execution
context stack is empty, then perform a microtask checkpoint.

Let script be the current node (which will be a
script element).

Pop the current node off the stack of open elements.

Switch the insertion mode to the original insertion mode.

Let the old insertion point have the same value as the current
insertion point. Let the insertion point be just before the next
input character.

Increment the parser’s script nesting level by one.

If the active speculative HTML parser is null, then prepare the script
element script. This might cause some script to execute, which might cause
new characters to be inserted into the tokenizer, and
might cause the tokenizer to output more tokens, resulting in a reentrant invocation of the parser.

Decrement the parser’s script nesting level by one. If the parser’s script
nesting level is zero, then set the parser pause flag to false.

Let the insertion point have the value of the old insertion
point
. (In other words, restore the insertion point to its previous value.
This value might be the «undefined» value.)

At this stage, if the pending parsing-blocking
script is not null, then:

If the script nesting level is not zero:

Set the parser pause flag to true, and abort the processing of any nested
invocations of the tokenizer, yielding control back to the caller. (Tokenization will resume
when the caller returns to the «outer» tree construction stage.)

The tree construction stage of this particular parser is being called reentrantly, say from a call to document.write().

Otherwise:

While the pending parsing-blocking script is not null:

  1. Let the script be the pending parsing-blocking
    script.

  2. Set the pending parsing-blocking script to null.

  3. Start the speculative HTML parser for this instance of the HTML
    parser.

  4. Block the tokenizer for this instance of the
    HTML parser, such that the event loop will not run tasks that invoke the tokenizer.

  5. If the parser’s Document has a style sheet that is blocking
    scripts or the script‘s ready to be parser-executed is false:
    spin the event loop until the parser’s Document has no style
    sheet that is blocking scripts and the script‘s ready to be
    parser-executed becomes true.

  6. If this parser has been aborted in the meantime,
    return.

    This could happen if, e.g., while the spin the event loop
    algorithm is running, the Document gets destroyed, or the document.open()
    method gets invoked on the Document.

  7. Stop the speculative HTML parser for this instance of the HTML
    parser.

  8. Unblock the tokenizer for this instance of the
    HTML parser, such that tasks that invoke the
    tokenizer can again be run.

  9. Let the insertion point be just before the next input
    character.

  10. Increment the parser’s script nesting level by one (it should be zero
    before this step, so this sets it to one).

  11. Execute the script element the script.

  12. Decrement the parser’s script nesting level by one. If the parser’s
    script nesting level is zero (which it always should be at this point), then set
    the parser pause flag to false.

  13. Let the insertion point be undefined again.

Any other end tag

Pop the current node off the stack of open elements.

Switch the insertion mode to the original insertion mode.

13.2.6.4.9 The «in table» insertion mode

When the user agent is to apply the rules for the «in
table» insertion mode, the user agent must handle the token as follows:

A character token, if the current node is table, tbody, template, tfoot, thead, or tr element

Let the pending table character
tokens
be an empty list of tokens.

Let the original insertion mode be the current insertion mode.

Switch the insertion mode to «in
table text» and reprocess the token.

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is «caption»

Clear the stack back to a table context. (See below.)

Insert a marker at the end of the list of
active formatting elements.

Insert an HTML element for the token, then switch the insertion
mode to «in caption».

A start tag whose tag name is «colgroup»

Clear the stack back to a table context. (See below.)

Insert an HTML element for the token, then switch the insertion
mode to «in column group».

A start tag whose tag name is «col»

Clear the stack back to a table context. (See below.)

Insert an HTML element for a «colgroup» start tag token with no attributes, then
switch the insertion mode to «in
column group».

Reprocess the current token.

A start tag whose tag name is one of: «tbody», «tfoot», «thead»

Clear the stack back to a table context. (See below.)

Insert an HTML element for the token, then switch the insertion
mode to «in table body».

A start tag whose tag name is one of: «td», «th», «tr»

Clear the stack back to a table context. (See below.)

Insert an HTML element for a «tbody» start tag token with no attributes, then
switch the insertion mode to «in table
body».

Reprocess the current token.

A start tag whose tag name is «table»

Parse error.

If the stack of open elements does not have a table element in table scope, ignore the token.

Otherwise:

Pop elements from this stack until a table element has been popped from the
stack.

Reset the insertion mode appropriately.

Reprocess the token.

An end tag whose tag name is «table»

If the stack of open elements does not have a table element in table scope, this is a parse
error; ignore the token.

Otherwise:

Pop elements from this stack until a table element has been popped from the
stack.

Reset the insertion mode appropriately.

An end tag whose tag name is one of: «body», «caption», «col», «colgroup», «html», «tbody»,
«td», «tfoot», «th», «thead», «tr»

Parse error. Ignore the token.

A start tag whose tag name is one of: «style», «script», «template»
An end tag whose tag name is «template»

Process the token using the rules for the «in head» insertion
mode.

A start tag whose tag name is «input»

If the token does not have an attribute with the name «type», or if it does, but that
attribute’s value is not an ASCII case-insensitive match for the string «hidden«, then: act as described in the «anything else» entry below.

Otherwise:

Parse error.

Insert an HTML element for the token.

Pop that input element off the stack of open elements.

Acknowledge the token’s self-closing
flag
, if it is set.

A start tag whose tag name is «form»

Parse error.

If there is a template element on the stack of open elements, or if
the form element pointer is not null, ignore the
token.

Otherwise:

Insert an HTML element for the token, and set the form element pointer to point to the element created.

Pop that form element off the stack of open elements.

An end-of-file token

Process the token using the rules for the «in body» insertion mode.

Anything else

Parse error. Enable foster parenting, process
the token using the rules for the «in
body» insertion mode, and then disable foster
parenting.

When the steps above require the UA to clear the stack back to a table context, it
means that the UA must, while the current node is not a table,
template, or html element, pop elements from the stack of open
elements.

This is the same list of elements as used in the has an element in table scope steps.

The current node being an html element after this
process is a fragment case.

13.2.6.4.10 The «in table text» insertion mode

When the user agent is to apply the rules for the «in table text» insertion mode, the user agent must handle the token as
follows:

A character token that is U+0000 NULL

Parse error. Ignore the token.

Any other character token

Append the character token to the pending
table character tokens
list.

Anything else

If any of the tokens in the pending table
character tokens
list are character tokens that are not ASCII whitespace,
then this is a parse error: reprocess the character tokens in the pending table character tokens list using the
rules given in the «anything else» entry in the «in
table» insertion mode.

Otherwise, insert the characters given by the pending table character tokens list.

Switch the insertion mode to the original insertion mode and
reprocess the token.

13.2.6.4.11 The «in caption» insertion mode

When the user agent is to apply the rules for the «in
caption» insertion mode, the user agent must handle the token as follows:

An end tag whose tag name is «caption»

If the stack of open elements does not have a caption element in table scope, this is a parse
error; ignore the token. (fragment case)

Otherwise:

Generate implied end tags.

Now, if the current node is not a caption element, then this is a
parse error.

Pop elements from this stack until a caption element has been popped from the
stack.

Clear the list of active formatting elements up to the last marker.

Switch the insertion mode to «in
table».

A start tag whose tag name is one of: «caption», «col», «colgroup», «tbody», «td», «tfoot»,
«th», «thead», «tr»
An end tag whose tag name is «table»

If the stack of open elements does not have a caption element in table scope, this is a parse
error; ignore the token. (fragment case)

Otherwise:

Generate implied end tags.

Now, if the current node is not a caption element, then this is a
parse error.

Pop elements from this stack until a caption element has been popped from the
stack.

Clear the list of active formatting elements up to the last marker.

Switch the insertion mode to «in
table».

Reprocess the token.

An end tag whose tag name is one of: «body», «col», «colgroup», «html», «tbody», «td»,
«tfoot», «th», «thead», «tr»

Parse error. Ignore the token.

Anything else

Process the token using the rules for the «in body» insertion mode.

13.2.6.4.12 The «in column group» insertion mode

When the user agent is to apply the rules for the «in column group» insertion mode, the user agent must handle the token
as follows:

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C
FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Insert the character.

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is «html»

Process the token using the rules for the «in body» insertion mode.

A start tag whose tag name is «col»

Insert an HTML element for the token. Immediately pop the current
node off the stack of open elements.

Acknowledge the token’s self-closing
flag
, if it is set.

An end tag whose tag name is «colgroup»

If the current node is not a colgroup element, then this is a
parse error; ignore the token.

Otherwise, pop the current node from the stack of open elements.
Switch the insertion mode to «in
table».

An end tag whose tag name is «col»

Parse error. Ignore the token.

A start tag whose tag name is «template»
An end tag whose tag name is «template»

Process the token using the rules for the «in head» insertion mode.

An end-of-file token

Process the token using the rules for the «in body» insertion mode.

Anything else

If the current node is not a colgroup element, then this is a
parse error; ignore the token.

Otherwise, pop the current node from the stack of open
elements.

Switch the insertion mode to «in
table».

Reprocess the token.

13.2.6.4.13 The «in table body» insertion mode

When the user agent is to apply the rules for the «in table body» insertion mode, the user agent must handle the token as
follows:

A start tag whose tag name is «tr»

Clear the stack back to a table body context. (See below.)

Insert an HTML element for the token, then switch the insertion
mode to «in row».

A start tag whose tag name is one of: «th», «td»

Parse error.

Clear the stack back to a table body context. (See below.)

Insert an HTML element for a «tr» start tag token with no attributes, then
switch the insertion mode to «in
row».

Reprocess the current token.

An end tag whose tag name is one of: «tbody», «tfoot»,
«thead»

If the stack of open elements does not have an element in table scope that is an HTML
element with the same tag name as the token, this is a parse error;
ignore the token.

Otherwise:

Clear the stack back to a table body context. (See below.)

Pop the current node from the stack of open elements. Switch the
insertion mode to «in table».

A start tag whose tag name is one of: «caption», «col»,
«colgroup», «tbody», «tfoot», «thead»
An end tag whose tag name is «table»

If the stack of open elements does not have a tbody, thead, or tfoot element in table
scope, this is a parse error; ignore the token.

Otherwise:

Clear the stack back to a table body context. (See below.)

Pop the current node from the stack of open elements. Switch the
insertion mode to «in table».

Reprocess the token.

An end tag whose tag name is one of: «body», «caption», «col», «colgroup», «html», «td»,
«th», «tr»

Parse error. Ignore the token.

Anything else

Process the token using the rules for the «in table» insertion mode.

When the steps above require the UA to clear the stack back to a table body context,
it means that the UA must, while the current node is not a tbody,
tfoot, thead, template, or html element, pop
elements from the stack of open elements.

The current node being an html element after this
process is a fragment case.

13.2.6.4.14 The «in row» insertion mode

When the user agent is to apply the rules for the «in
row» insertion mode, the user agent must handle the token as follows:

A start tag whose tag name is one of: «th», «td»

Clear the stack back to a table row context. (See below.)

Insert an HTML element for the token, then switch the insertion
mode to «in cell».

Insert a marker at the end of the list of
active formatting elements.

An end tag whose tag name is «tr»

If the stack of open elements does not have a tr element in table scope, this is a parse error;
ignore the token.

Otherwise:

Clear the stack back to a table row context. (See below.)

Pop the current node (which will be a tr element) from the
stack of open elements. Switch the insertion mode to «in table body».

A start tag whose tag name is one of: «caption», «col», «colgroup», «tbody», «tfoot»,
«thead», «tr»
An end tag whose tag name is «table»

If the stack of open elements does not have a tr element in table scope, this is a parse error;
ignore the token.

Otherwise:

Clear the stack back to a table row context. (See below.)

Pop the current node (which will be a tr element) from the
stack of open elements. Switch the insertion mode to «in table body».

Reprocess the token.

An end tag whose tag name is one of: «tbody», «tfoot», «thead»

If the stack of open elements does not have an element in table scope that is an HTML
element with the same tag name as the token, this is a parse error;
ignore the token.

If the stack of open elements does not have a tr element in table scope, ignore the token.

Otherwise:

Clear the stack back to a table row context. (See below.)

Pop the current node (which will be a tr element) from the
stack of open elements. Switch the insertion mode to «in table body».

Reprocess the token.

An end tag whose tag name is one of: «body», «caption», «col», «colgroup», «html», «td»,
«th»

Parse error. Ignore the token.

Anything else

Process the token using the rules for the «in table» insertion mode.

When the steps above require the UA to clear the stack back to a table row context,
it means that the UA must, while the current node is not a tr,
template, or html element, pop elements from the stack of open
elements.

The current node being an html element after this
process is a fragment case.

13.2.6.4.15 The «in cell» insertion mode

When the user agent is to apply the rules for the «in cell» insertion mode, the user agent must handle the token as follows:

An end tag whose tag name is one of: «td», «th»

If the stack of open elements does not have an element in table scope that is an HTML
element with the same tag name as that of the token, then this is a parse
error; ignore the token.

Otherwise:

Generate implied end tags.

Now, if the current node is not an HTML
element with the same tag name as the token, then this is a parse error.

Pop elements from the stack of open elements stack until an HTML element with the same tag name as the token has been popped from the
stack.

Clear the list of active formatting elements up to the last marker.

Switch the insertion mode to «in
row».

A start tag whose tag name is one of: «caption», «col»,
«colgroup», «tbody», «td», «tfoot», «th», «thead», «tr»

If the stack of open elements does not have a td or th element in table scope, then this
is a parse error; ignore the token. (fragment case)

Otherwise, close the cell (see below) and reprocess the token.

An end tag whose tag name is one of: «body», «caption»,
«col», «colgroup», «html»

Parse error. Ignore the token.

An end tag whose tag name is one of: «table», «tbody»,
«tfoot», «thead», «tr»

If the stack of open elements does not have an element in table scope that is an HTML
element with the same tag name as that of the token, then this is a parse
error; ignore the token.

Otherwise, close the cell (see below) and reprocess the token.

Anything else

Process the token using the rules for the «in body» insertion mode.

Where the steps above say to close the cell, they mean to run the following
algorithm:

  1. Generate implied end tags.

  2. If the current node is not now a td element or a th
    element, then this is a parse error.

  3. Pop elements from the stack of open elements stack until a td
    element or a th element has been popped from the stack.

  4. Clear the list of active formatting elements up to the last marker.

  5. Switch the insertion mode to «in
    row».

The stack of open elements cannot have both a td and a
th element in table scope at the
same time, nor can it have neither when the close the cell algorithm is invoked.

13.2.6.4.16 The «in select» insertion mode

When the user agent is to apply the rules for the «in
select» insertion mode, the user agent must handle the token as follows:

A character token that is U+0000 NULL

Parse error. Ignore the token.

Any other character token

Insert the token’s character.

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is «html»

Process the token using the rules for the «in body» insertion mode.

A start tag whose tag name is «option»

If the current node is an option element, pop that node from the
stack of open elements.

Insert an HTML element for the token.

A start tag whose tag name is «optgroup»

If the current node is an option element, pop that node from the
stack of open elements.

If the current node is an optgroup element, pop that node from the
stack of open elements.

Insert an HTML element for the token.

An end tag whose tag name is «optgroup»

First, if the current node is an option element, and the node
immediately before it in the stack of open elements is an optgroup
element, then pop the current node from the stack of open
elements.

If the current node is an optgroup element, then pop that node from
the stack of open elements. Otherwise, this is a parse error; ignore
the token.

An end tag whose tag name is «option»

If the current node is an option element, then pop that node from
the stack of open elements. Otherwise, this is a parse error; ignore
the token.

An end tag whose tag name is «select»

If the stack of open elements does not have a select element in select scope, this is a parse
error; ignore the token. (fragment case)

Otherwise:

Pop elements from the stack of open elements until a select element
has been popped from the stack.

Reset the insertion mode appropriately.

A start tag whose tag name is «select»

Parse error.

If the stack of open elements does not have a select element in select scope, ignore the token.
(fragment case)

Otherwise:

Pop elements from the stack of open elements until a select element
has been popped from the stack.

Reset the insertion mode appropriately.

It just gets treated like an end tag.

A start tag whose tag name is one of: «input», «keygen», «textarea»

Parse error.

If the stack of open elements does not have a select element in select scope, ignore the token.
(fragment case)

Otherwise:

Pop elements from the stack of open elements until a select element
has been popped from the stack.

Reset the insertion mode appropriately.

Reprocess the token.

A start tag whose tag name is one of: «script», «template»
An end tag whose tag name is «template»

Process the token using the rules for the «in head» insertion mode.

An end-of-file token

Process the token using the rules for the «in body» insertion mode.

Anything else

Parse error. Ignore the token.

13.2.6.4.17 The «in select in table» insertion mode

When the user agent is to apply the rules for the «in select in table» insertion mode, the user agent must handle the
token as follows:

A start tag whose tag name is one of: «caption», «table», «tbody», «tfoot», «thead», «tr»,
«td», «th»

Parse error.

Pop elements from the stack of open elements until a select element
has been popped from the stack.

Reset the insertion mode appropriately.

Reprocess the token.

An end tag whose tag name is one of: «caption», «table», «tbody», «tfoot», «thead», «tr»,
«td», «th»

Parse error.

If the stack of open elements does not have an element in table scope that is an HTML
element with the same tag name as that of the token, then ignore the token.

Otherwise:

Pop elements from the stack of open elements until a select element
has been popped from the stack.

Reset the insertion mode appropriately.

Reprocess the token.

Anything else

Process the token using the rules for the «in select» insertion mode.

13.2.6.4.18 The «in template» insertion mode

When the user agent is to apply the rules for the «in
template» insertion mode, the user agent must handle the token as follows:

A character token
A comment token
A DOCTYPE token

Process the token using the rules for the «in body» insertion mode.

A start tag whose tag name is one of: «base», «basefont», «bgsound», «link», «meta», «noframes», «script», «style», «template», «title»
An end tag whose tag name is «template»

Process the token using the rules for the «in head» insertion mode.

A start tag whose tag name is one of: «caption», «colgroup», «tbody», «tfoot», «thead»

Pop the current template insertion mode off the stack of template
insertion modes.

Push «in table» onto the stack of
template insertion modes so that it is the new current template insertion
mode.

Switch the insertion mode to «in
table», and reprocess the token.

A start tag whose tag name is «col»

Pop the current template insertion mode off the stack of template
insertion modes.

Push «in column group» onto the
stack of template insertion modes so that it is the new current template
insertion mode.

Switch the insertion mode to «in
column group», and reprocess the token.

A start tag whose tag name is «tr»

Pop the current template insertion mode off the stack of template
insertion modes.

Push «in table body» onto the stack
of template insertion modes so that it is the new current template insertion
mode.

Switch the insertion mode to «in
table body», and reprocess the token.

A start tag whose tag name is one of: «td», «th»

Pop the current template insertion mode off the stack of template
insertion modes.

Push «in row» onto the stack of template
insertion modes so that it is the new current template insertion mode.

Switch the insertion mode to «in
row», and reprocess the token.

Any other start tag

Pop the current template insertion mode off the stack of template
insertion modes.

Push «in body» onto the stack of template
insertion modes so that it is the new current template insertion mode.

Switch the insertion mode to «in
body», and reprocess the token.

Any other end tag

Parse error. Ignore the token.

An end-of-file token

If there is no template element on the stack of open elements, then
stop parsing. (fragment case)

Otherwise, this is a parse error.

Pop elements from the stack of open elements until a template
element has been popped from the stack.

Clear the list of active formatting elements up to the last marker.

Pop the current template insertion mode off the stack of template
insertion modes.

Reset the insertion mode appropriately.

Reprocess the token.

13.2.6.4.19 The «after body» insertion mode

When the user agent is to apply the rules for the «after body» insertion mode, the user agent must handle the token as follows:

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C
FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Process the token using the rules for the «in body» insertion mode.

A comment token

Insert a comment as the last child of the first element in the stack of
open elements (the html element).

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is «html»

Process the token using the rules for the «in body» insertion mode.

An end tag whose tag name is «html»

If the parser was created as part of the HTML fragment parsing algorithm, this
is a parse error; ignore the token. (fragment case)

Otherwise, switch the insertion mode to «after after body».

An end-of-file token

Stop parsing.

Anything else

Parse error. Switch the insertion mode to «in body» and reprocess the token.

13.2.6.4.20 The «in frameset» insertion mode

When the user agent is to apply the rules for the «in
frameset» insertion mode, the user agent must handle the token as follows:

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C
FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Insert the character.

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is «html»

Process the token using the rules for the «in body» insertion mode.

A start tag whose tag name is «frameset»

Insert an HTML element for the token.

An end tag whose tag name is «frameset»

If the current node is the root html element, then this is a
parse error; ignore the token. (fragment case)

Otherwise, pop the current node from the stack of open
elements.

If the parser was not created as part of the HTML fragment parsing algorithm
(fragment case), and the current node is no longer a
frameset element, then switch the insertion mode to «after frameset».

A start tag whose tag name is «frame»

Insert an HTML element for the token. Immediately pop the current
node off the stack of open elements.

Acknowledge the token’s self-closing
flag
, if it is set.

A start tag whose tag name is «noframes»

Process the token using the rules for the «in head» insertion mode.

An end-of-file token

If the current node is not the root html element, then this is a
parse error.

The current node can only be the root
html element in the fragment case.

Stop parsing.

Anything else

Parse error. Ignore the token.

13.2.6.4.21 The «after frameset» insertion mode

When the user agent is to apply the rules for the «after frameset» insertion mode, the user agent must handle the token
as follows:

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C
FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Insert the character.

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is «html»

Process the token using the rules for the «in body» insertion mode.

An end tag whose tag name is «html»

Switch the insertion mode to «after after frameset».

A start tag whose tag name is «noframes»

Process the token using the rules for the «in head» insertion mode.

An end-of-file token

Stop parsing.

Anything else

Parse error. Ignore the token.

13.2.6.4.22 The «after after body» insertion mode

When the user agent is to apply the rules for the «after after body» insertion mode, the user agent must handle the token
as follows:

A comment token

Insert a comment as the last child of the Document object.

A DOCTYPE token
A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C
FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE
A start tag whose tag name is «html»

Process the token using the rules for the «in body» insertion mode.

An end-of-file token

Stop parsing.

Anything else

Parse error. Switch the insertion mode to «in body» and reprocess the token.

13.2.6.4.23 The «after after frameset» insertion mode

When the user agent is to apply the rules for the «after after frameset» insertion mode, the user agent must handle the
token as follows:

A comment token

Insert a comment as the last child of the Document object.

A DOCTYPE token
A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C
FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE
A start tag whose tag name is «html»

Process the token using the rules for the «in body» insertion mode.

An end-of-file token

Stop parsing.

A start tag whose tag name is «noframes»

Process the token using the rules for the «in head» insertion mode.

Anything else

Parse error. Ignore the token.

13.2.6.5 The rules for parsing tokens in foreign content

When the user agent is to apply the rules for parsing tokens in foreign content, the user agent
must handle the token as follows:

A character token that is U+0000 NULL

Parse error. Insert a U+FFFD REPLACEMENT
CHARACTER character.

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C
FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Insert the token’s character.

Any other character token

Insert the token’s character.

Set the frameset-ok flag to «not ok».

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is one of: «b», «big», «blockquote», «body», «br», «center», «code», «dd», «div», «dl», «dt», «em», «embed», «h1», «h2», «h3», «h4», «h5», «h6», «head», «hr», «i», «img»,
«li», «listing»,
«menu», «meta», «nobr», «ol», «p», «pre», «ruby», «s», «small», «span», «strong», «strike», «sub»,
«sup», «table», «tt», «u», «ul», «var»
A start tag whose tag name is «font», if the token has any attributes named «color», «face»,
or «size»
An end tag whose tag name is «br», «p»

Parse error.

While the current node is not a MathML text integration point, an
HTML integration point, or an element in the HTML namespace, pop
elements from the stack of open elements.

Reprocess the token according to the rules given in the section corresponding to the current
insertion mode in HTML content.

Any other start tag

If the adjusted current node is an element in the MathML namespace,
adjust MathML attributes for the token. (This fixes the case of MathML attributes
that are not all lowercase.)

If the adjusted current node is an element in the SVG namespace, and the
token’s tag name is one of the ones in the first column of the following table, change the tag
name to the name given in the corresponding cell in the second column. (This fixes the case of
SVG elements that are not all lowercase.)

Tag name Element name
altglyph altGlyph
altglyphdef altGlyphDef
altglyphitem altGlyphItem
animatecolor animateColor
animatemotion animateMotion
animatetransform animateTransform
clippath clipPath
feblend feBlend
fecolormatrix feColorMatrix
fecomponenttransfer feComponentTransfer
fecomposite feComposite
feconvolvematrix feConvolveMatrix
fediffuselighting feDiffuseLighting
fedisplacementmap feDisplacementMap
fedistantlight feDistantLight
fedropshadow feDropShadow
feflood feFlood
fefunca feFuncA
fefuncb feFuncB
fefuncg feFuncG
fefuncr feFuncR
fegaussianblur feGaussianBlur
feimage feImage
femerge feMerge
femergenode feMergeNode
femorphology feMorphology
feoffset feOffset
fepointlight fePointLight
fespecularlighting feSpecularLighting
fespotlight feSpotLight
fetile feTile
feturbulence feTurbulence
foreignobject foreignObject
glyphref glyphRef
lineargradient linearGradient
radialgradient radialGradient

textpath textPath

If the adjusted current node is an element in the SVG namespace,
adjust SVG attributes for the token. (This fixes the case of SVG attributes that
are not all lowercase.)

Adjust foreign attributes for the token. (This fixes the use of namespaced
attributes, in particular XLink in SVG.)

Insert a foreign element for the token, in the same namespace as the
adjusted current node.

If the token has its self-closing flag set, then run the appropriate steps from the
following list:

If the token’s tag name is «script», and the new current node is in the SVG namespace

Acknowledge the token’s self-closing
flag
, and then act as described in the steps for a «script» end tag below.

Otherwise

Pop the current node off the stack of open elements and acknowledge the token’s self-closing
flag
.

An end tag whose tag name is «script», if the current
node is an SVG script element

Pop the current node off the stack of open elements.

Let the old insertion point have the same value as the current
insertion point. Let the insertion point be just before the next
input character.

Increment the parser’s script nesting level by one. Set the parser pause
flag to true.

If the active speculative HTML parser is null and the user agent supports SVG,
then Process the
SVG script element according to the SVG rules. [SVG]

Even if this causes new characters to be
inserted into the tokenizer, the parser will not be executed reentrantly, since the
parser pause flag is true.

Decrement the parser’s script nesting level by one. If the parser’s script
nesting level is zero, then set the parser pause flag to false.

Let the insertion point have the value of the old insertion
point
. (In other words, restore the insertion point to its previous value.
This value might be the «undefined» value.)

Any other end tag

Run these steps:

  1. Initialize node to be the current node (the bottommost
    node of the stack).

  2. If node‘s tag name, converted to ASCII lowercase, is
    not the same as the tag name of the token, then this is a parse error.

  3. Loop: If node is the topmost element in the stack of
    open elements, then return. (fragment case)

  4. If node‘s tag name, converted to ASCII lowercase, is
    the same as the tag name of the token, pop elements from the stack of open
    elements until node has been popped from the stack, and then return.

  5. Set node to the previous entry in the stack of open
    elements.

  6. If node is not an element in the HTML namespace, return
    to the step labeled loop.

  7. Otherwise, process the token according to the rules given in the section corresponding
    to the current insertion mode in HTML content.

13.2.7 The end

Document/DOMContentLoaded_event

Support in all current engines.

Firefox1+Safari3.1+Chrome1+


Opera9+Edge79+


Edge (Legacy)12+Internet Explorer9+


Firefox Android?Safari iOS?Chrome Android?WebView Android?Samsung Internet?Opera Android10.1+

Once the user agent stops parsing the document, the user agent
must run the following steps:

Window/load_event

Support in all current engines.

Firefox1+Safari1.3+Chrome1+


Opera4+Edge79+


Edge (Legacy)12+Internet Explorer4+


Firefox Android?Safari iOS?Chrome Android?WebView Android?Samsung Internet?Opera Android10.1+

  1. If the active speculative HTML parser is not null, then stop the
    speculative HTML parser and return.

  2. Set the insertion point to undefined.

  3. Update the current document readiness to «interactive«.

  4. Pop all the nodes off the stack of open elements.

  5. While the list of scripts that will execute when the document has finished
    parsing is not empty:

    1. Spin the event loop until the first script in the list
      of scripts that will execute when the document has finished parsing has its ready
      to be parser-executed set to true and the parser’s Document
      has no style sheet that is blocking scripts.

    2. Execute the script element given by the first script in
      the list of scripts that will execute when the document has finished
      parsing.

    3. Remove the first script element from the list of scripts that will
      execute when the document has finished parsing (i.e. shift out the first entry in the
      list).

  6. Queue a global task on the DOM manipulation task source given the
    Document‘s relevant global object to run the following substeps:

    1. Set the Document‘s load timing info’s DOM content loaded
      event start time to the current high resolution time given the
      Document‘s relevant global object.

    2. Fire an event named DOMContentLoaded at the Document
      object, with its bubbles attribute initialized to
      true.

    3. Set the Document‘s load timing info’s DOM content loaded
      event end time to the current high resolution time given the
      Document‘s relevant global object.

    4. Enable the client message queue of the
      ServiceWorkerContainer object whose associated service worker client is the
      Document object’s relevant settings object.

    5. Invoke WebDriver BiDi DOM content loaded with the Document‘s
      browsing context, and a new WebDriver BiDi
      navigation status whose id is the
      Document object’s navigation id, status is «pending«, and url is the
      Document object’s URL.

  7. Spin the event loop until the set of scripts that will execute as soon
    as possible and the list of scripts that will execute in order as soon as
    possible are empty.

  8. Spin the event loop until there is nothing that delays the load event in the Document.

  9. Queue a global task on the DOM manipulation task source given the
    Document‘s relevant global object to run the following steps:

    1. Update the current document readiness to «complete«.

    2. If the Document object’s browsing
      context is null, then abort these steps.

    3. Let window be the Document‘s relevant global
      object.

    4. Set the Document‘s load timing info’s load event start
      time to the current high resolution time given window.

    5. Fire an event named load at window, with legacy target override
      flag
      set.

    6. Invoke WebDriver BiDi load complete with the Document‘s
      browsing context, and a new WebDriver BiDi
      navigation status whose id is the
      Document object’s navigation
      id, status is «complete«, and url is the Document object’s URL.

    7. Set the Document object’s
      navigation id to null.

    8. Set the Document‘s load timing info’s load event end
      time to the current high resolution time given window.

    9. Assert: Document‘s page showing is
      false.

    10. Set the Document‘s page showing flag to true.

    11. Fire a page transition event named pageshow at window with false.

    12. Completely finish loading the Document.

    13. Queue the navigation timing entry for the Document.

  10. If the Document‘s print when loaded flag is set, then run the
    printing steps.

  11. The Document is now ready for post-load tasks.

When the user agent is to abort a parser, it must run the following steps:

  1. Throw away any pending content in the input stream, and discard any future
    content that would have been added to it.

  2. Stop the speculative HTML parser for this HTML parser.

  3. Update the current document readiness to «interactive«.

  4. Pop all the nodes off the stack of open elements.

  5. Update the current document readiness to «complete«.

13.2.8 Speculative HTML parsing

User agents may implement an optimization, as described in this section, to speculatively fetch
resources that are declared in the HTML markup while the HTML parser is waiting for a
pending parsing-blocking script to be fetched and executed, or during normal parsing,
at the time an element is created for a token.
While this optimization is not defined in precise detail, there are some rules to consider for
interoperability.

Each HTML parser can have an active speculative HTML parser. It
is initially null.

The speculative HTML parser must act like the normal HTML parser (e.g., the
tree builder rules apply), with some exceptions:

  • The state of the normal HTML parser and the document itself must not be affected.

    For example, the next input character or the stack of open
    elements for the normal HTML parser is not affected by the speculative HTML
    parser.

  • Bytes pushed into the HTML parser’s input byte stream must also be pushed into
    the speculative HTML parser’s input byte stream. Bytes read from the streams must
    be independent.

  • The result of the speculative parsing is primarily a series of speculative fetches. Which kinds of resources to speculatively fetch is
    implementation-defined, but user agents must not speculatively fetch resources that
    would not be fetched with the normal HTML parser, under the assumption that the script that is
    blocking the HTML parser does nothing.

    It is possible that the same markup is seen multiple times from the
    speculative HTML parser and then the normal HTML parser. It is expected that
    duplicated fetches will be prevented by caching rules, which are not yet fully specified.

A speculative fetch for a speculative mock element element
must follow these rules:

Should some of these things be applied to the document «for real», even
though they are found speculatively?

  • If the speculative HTML parser encounters one of the following elements, then
    act as if that element is processed for the purpose of its effect of subsequent speculative
    fetches.

    • A base element.
    • A meta element whose http-equiv
      attribute is in the Content
      security policy state.
    • A meta element whose name attribute is an
      ASCII case-insensitive match for «referrer«.
    • A meta element whose name attribute is an
      ASCII case-insensitive match for «viewport«. (This can
      affect whether a media query list matches the environment.) [CSSDEVICEADAPT]
  • Let url be the URL that element would fetch if it was
    processed normally. If there is no such URL or if it is the empty string, then do
    nothing. Otherwise, if url is already in the list of speculative fetch
    URLs, then do nothing. Otherwise, fetch url as if the element was processed
    normally, and add url to the list of speculative fetch URLs.

Each Document has a list of speculative fetch URLs, which is a
list of URLs, initially empty.

To start the speculative HTML parser for an instance of an HTML parser
parser:

  1. Optionally, return.

    This step allows user agents to opt out of speculative HTML parsing.

  2. If parser‘s active speculative HTML parser is not null, then
    stop the speculative HTML parser for parser.

    This can happen when document.write()
    writes another parser-blocking script. For simplicity, this specification always restarts
    speculative parsing, but user agents can implement a more efficient strategy, so long as the end
    result is equivalent.

  3. Let speculativeParser be a new speculative HTML parser, with the
    same state as parser.

  4. Let speculativeDoc be a new isomorphic representation of parser‘s
    Document, where all elements are instead speculative mock elements. Let speculativeParser parse into
    speculativeDoc.

  5. Set parser‘s active speculative HTML parser to
    speculativeParser.

  6. In parallel, run speculativeParser until it is stopped or until it
    reaches the end of its input stream.

To stop the speculative HTML parser for an instance of an HTML parser
parser:

  1. Let speculativeParser be parser‘s active speculative HTML
    parser.

  2. If speculativeParser is null, then return.

  3. Throw away any pending content in speculativeParser‘s input
    stream, and discard any future content that would have been added to it.

  4. Set parser‘s active speculative HTML parser to null.

The speculative HTML parser will create speculative mock elements instead of normal elements. DOM
operations that the tree builder normally does on elements are expected to work appropriately on
speculative mock elements.

A speculative mock element is a struct with the following items:

  • A string namespace, corresponding
    to an element’s namespace.

  • A string local name,
    corresponding to an element’s local
    name.

  • A list attribute list,
    corresponding to an element’s attribute list.

  • A list children, corresponding to
    an element’s children.

To create a speculative mock element given a namespace,
tagName, and attributes:

  1. Let element be a new speculative mock element.

  2. Set element‘s namespace to
    namespace.

  3. Set element‘s local name to
    tagName.

  4. Set element‘s attribute list
    to attributes.

  5. Set element‘s children to a new
    empty list.

  6. Optionally, perform a speculative fetch for element.

  7. Return element.

When the tree builder says to insert an element into a template element’s
template contents, if that is a speculative mock element, instead do
nothing. URLs found speculatively inside template elements might themselves be
templates, and must not be speculatively fetched.

13.2.9 Coercing an HTML DOM into an infoset

When an application uses an HTML parser in conjunction with an XML pipeline, it is
possible that the constructed DOM is not compatible with the XML tool chain in certain subtle
ways. For example, an XML toolchain might not be able to represent attributes with the name xmlns, since they conflict with the Namespaces in XML syntax. There is also some
data that the HTML parser generates that isn’t included in the DOM itself. This
section specifies some rules for handling these issues.

If the XML API being used doesn’t support DOCTYPEs, the tool may drop DOCTYPEs altogether.

If the XML API doesn’t support attributes in no namespace that are named «xmlns«, attributes whose names start with «xmlns:«, or
attributes in the XMLNS namespace, then the tool may drop such attributes.

The tool may annotate the output with any namespace declarations required for proper
operation.

If the XML API being used restricts the allowable characters in the local names of elements and
attributes, then the tool may map all element and attribute local names that the API wouldn’t
support to a set of names that are allowed, by replacing any character that isn’t
supported with the uppercase letter U and the six digits of the character’s code point when
expressed in hexadecimal, using digits 0-9 and capital letters A-F as the symbols, in increasing
numeric order.

For example, the element name foo<bar, which can be
output by the HTML parser, though it is neither a legal HTML element name nor a
well-formed XML element name, would be converted into fooU00003Cbar, which
is a well-formed XML element name (though it’s still not legal in HTML by any means).

As another example, consider the attribute xlink:href.
Used on a MathML element, it becomes, after being adjusted, an attribute with a prefix «xlink» and a local
name «href«. However, used on an HTML element, it becomes an attribute with
no prefix and the local name «xlink:href«, which is not a valid NCName, and
thus might not be accepted by an XML API. It could thus get converted, becoming «xlinkU00003Ahref«.

The resulting names from this conversion conveniently can’t clash with any
attribute generated by the HTML parser, since those are all either lowercase or those
listed in the adjust foreign attributes algorithm’s table.

If the XML API restricts comments from having two consecutive U+002D HYPHEN-MINUS characters
(—), the tool may insert a single U+0020 SPACE character between any such offending
characters.

If the XML API restricts comments from ending in a U+002D HYPHEN-MINUS character (-), the tool
may insert a single U+0020 SPACE character at the end of such comments.

If the XML API restricts allowed characters in character data, attribute values, or comments,
the tool may replace any U+000C FORM FEED (FF) character with a U+0020 SPACE character, and any
other literal non-XML character with a U+FFFD REPLACEMENT CHARACTER.

If the tool has no way to convey out-of-band information, then the tool may drop the following
information:

  • Whether the document is set to no-quirks mode, limited-quirks mode, or
    quirks mode
  • The association between form controls and forms that aren’t their nearest form
    element ancestor (use of the form element pointer in the parser)
  • The template contents of any template elements.

The mutations allowed by this section apply after the HTML
parser’s rules have been applied. For example, a <a::> start tag
will be closed by a </a::> end tag, and never by a </aU00003AU00003A> end tag, even if the user agent is using the rules above to
then generate an actual element in the DOM with the name aU00003AU00003A for
that start tag.

13.2.10 An introduction to error handling and strange cases in the parser

This section is non-normative.

This section examines some erroneous markup and discusses how the HTML parser
handles these cases.

13.2.10.1 Misnested tags: <b><i></b></i>

This section is non-normative.

The most-often discussed example of erroneous markup is as follows:

<p>1<b>2<i>3</b>4</i>5</p>

The parsing of this markup is straightforward up to the «3». At this point, the DOM looks like
this:

  • html
    • head
    • body
      • p
        • #text: 1
        • b
          • #text: 2
          • i
            • #text: 3

Here, the stack of open elements has five elements on it: html,
body, p, b, and i. The list of active
formatting elements just has two: b and i. The insertion
mode is «in body».

Upon receiving the end tag token with the tag name «b», the «adoption
agency algorithm» is invoked. This is a simple case, in that the formatting
element
is the b element, and there is no furthest block.
Thus, the stack of open elements ends up with just three elements: html,
body, and p, while the list of active formatting elements
has just one: i. The DOM tree is unmodified at this point.

The next token is a character («4»), triggers the reconstruction of the active formatting elements, in this case just
the i element. A new i element is thus created for the «4»
Text node. After the end tag token for the «i» is also received, and the «5»
Text node is inserted, the DOM looks as follows:

  • html
    • head
    • body
      • p
        • #text: 1
        • b
          • #text: 2
          • i
            • #text: 3
        • i
          • #text: 4
        • #text: 5
13.2.10.2 Misnested tags: <b><p></b></p>

This section is non-normative.

A case similar to the previous one is the following:

<b>1<p>2</b>3</p>

Up to the «2» the parsing here is straightforward:

  • html
    • head
    • body
      • b
        • #text: 1
        • p
          • #text: 2

The interesting part is when the end tag token with the tag name «b» is parsed.

Before that token is seen, the stack of open elements has four elements on it:
html, body, b, and p. The list of active
formatting elements just has the one: b. The insertion mode is
«in body».

Upon receiving the end tag token with the tag name «b», the «adoption
agency algorithm» is invoked, as in the previous example. However, in this case, there
is a furthest block, namely the p element. Thus, this
time the adoption agency algorithm isn’t skipped over.

The common ancestor is the body element. A conceptual
«bookmark» marks the position of the b in the list of active formatting
elements, but since that list has only one element in it, the bookmark won’t have much
effect.

As the algorithm progresses, node ends up set to the formatting element
(b), and last node ends up set to the furthest
block
(p).

The last node gets appended (moved) to the common
ancestor
, so that the DOM looks like:

  • html
    • head
    • body
      • b
        • #text: 1
      • p
        • #text: 2

A new b element is created, and the children of the p element are
moved to it:

  • html
    • head
    • body
      • b
        • #text: 1
      • p
  • b
    • #text: 2

Finally, the new b element is appended to the p element, so that the
DOM looks like:

  • html
    • head
    • body
      • b
        • #text: 1
      • p
        • b
          • #text: 2

The b element is removed from the list of active formatting elements
and the stack of open elements, so that when the «3» is parsed, it is appended to the
p element:

  • html
    • head
    • body
      • b
        • #text: 1
      • p
        • b
          • #text: 2
        • #text: 3
13.2.10.3 Unexpected markup in tables

This section is non-normative.

Error handling in tables is, for historical reasons, especially strange. For example, consider
the following markup:

<table><b><tr><td>aaa</td></tr>bbb</table>ccc

The highlighted b element start tag is not allowed directly inside a table like
that, and the parser handles this case by placing the element before the table. (This is
called foster parenting.) This can be seen by examining the DOM tree
as it stands just after the table element’s start tag has been seen:

  • html
    • head
    • body
      • table

…and then immediately after the b element start tag has been seen:

  • html
    • head
    • body
      • b
      • table

At this point, the stack of open elements has on it the elements
html, body, table, and b (in that order,
despite the resulting DOM tree); the list of active formatting elements just has the
b element in it; and the insertion mode is «in table».

The tr start tag causes the b element to be popped off the stack and
a tbody start tag to be implied; the tbody and tr elements
are then handled in a rather straight-forward manner, taking the parser through the «in table body» and «in row» insertion modes, after which the DOM looks as follows:

  • html
    • head
    • body
      • b
      • table
        • tbody
          • tr

Here, the stack of open elements has on it the elements html,
body, table, tbody, and tr; the list of
active formatting elements still has the b element in it; and the
insertion mode is «in row».

The td element start tag token, after putting a td element on the
tree, puts a marker on the list of active
formatting elements (it also switches to the «in
cell» insertion mode).

  • html
    • head
    • body
      • b
      • table
        • tbody
          • tr
            • td

The marker means that when the «aaa» character
tokens are seen, no b element is created to hold the resulting Text
node:

  • html
    • head
    • body
      • b
      • table
        • tbody
          • tr
            • td
              • #text: aaa

The end tags are handled in a straight-forward manner; after handling them, the stack of
open elements has on it the elements html, body,
table, and tbody; the list of active formatting elements
still has the b element in it (the marker
having been removed by the «td» end tag token); and the insertion mode is «in table body».

Thus it is that the «bbb» character tokens are found. These trigger the «in table text» insertion mode to be used (with the original
insertion mode set to «in table body»).
The character tokens are collected, and when the next token (the table element end
tag) is seen, they are processed as a group. Since they are not all spaces, they are handled as
per the «anything else» rules in the «in table»
insertion mode, which defer to the «in body»
insertion mode but with foster parenting.

When the active formatting elements
are reconstructed, a b element is created and foster parented, and then the «bbb» Text node is appended to it:

  • html
    • head
    • body
      • b
      • b
        • #text: bbb
      • table
        • tbody
          • tr
            • td
              • #text: aaa

The stack of open elements has on it the elements html,
body, table, tbody, and the new b (again, note
that this doesn’t match the resulting tree!); the list of active formatting elements
has the new b element in it; and the insertion mode is still «in table body».

Had the character tokens been only ASCII whitespace instead of «bbb», then that
ASCII whitespace would just be appended to the tbody element.

Finally, the table is closed by a «table» end tag. This pops all the nodes from
the stack of open elements up to and including the table element, but it
doesn’t affect the list of active formatting elements, so the «ccc» character tokens
after the table result in yet another b element being created, this time after the
table:

  • html
    • head
    • body
      • b
      • b
        • #text: bbb
      • table
        • tbody
          • tr
            • td
              • #text: aaa
      • b
        • #text: ccc
13.2.10.4 Scripts that modify the page as it is being parsed

This section is non-normative.

Consider the following markup, which for this example we will assume is the document with
URL https://example.com/inner, being rendered as the content of
an iframe in another document with the URL https://example.com/outer:

<div id=a>
 <script>
  var div = document.getElementById('a');
  parent.document.body.appendChild(div);
 </script>
 <script>
  alert(document.URL);
 </script>
</div>
<script>
 alert(document.URL);
</script>

Up to the first «script» end tag, before the script is parsed, the result is relatively
straightforward:

  • html
    • head
    • body
      • div ida«
        • #text:
        • script
          • #text: var div = document.getElementById(‘a’); ⏎ parent.document.body.appendChild(div);

After the script is parsed, though, the div element and its child
script element are gone:

  • html
    • head
    • body

They are, at this point, in the Document of the aforementioned outer
browsing context. However, the stack of open elements still contains
the div element
.

Thus, when the second script element is parsed, it is inserted into the outer
Document object
.

Those parsed into different Documents than the one the parser was created for do
not execute, so the first alert does not show.

Once the div element’s end tag is parsed, the div element is popped
off the stack, and so the next script element is in the inner
Document:

  • html
    • head
    • body
      • script
        • #text: alert(document.URL);

This script does execute, resulting in an alert that says «https://example.com/inner».

13.2.10.5 The execution of scripts that are moving across multiple documents

This section is non-normative.

Elaborating on the example in the previous section, consider the case where the second
script element is an external script (i.e. one with a src attribute). Since the element was not in the parser’s
Document when it was created, that external script is not even downloaded.

In a case where a script element with a src
attribute is parsed normally into its parser’s Document, but while the external
script is being downloaded, the element is moved to another document, the script continues to
download, but does not execute.

In general, moving script elements between Documents is
considered a bad practice.

13.2.10.6 Unclosed formatting elements

This section is non-normative.

The following markup shows how nested formatting elements (such as b) get
collected and continue to be applied even as the elements they are contained in are closed, but
that excessive duplicates are thrown away.

<!DOCTYPE html>
<p><b class=x><b class=x><b><b class=x><b class=x><b>X
<p>X
<p><b><b class=x><b>X
<p></b></b></b></b></b></b>X

The resulting DOM tree is as follows:

  • DOCTYPE: html
  • html
    • head
    • body
      • p
        • b classx«
          • b classx«
            • b
              • b classx«
                • b classx«
                  • b
                    • #text: X⏎
      • p
        • b classx«
          • b
            • b classx«
              • b classx«
                • b
                  • #text: X⏎
      • p
        • b classx«
          • b
            • b classx«
              • b classx«
                • b
                  • b
                    • b classx«
                      • b
                        • #text: X⏎
      • p
        • #text: X⏎

Note how the second p element in the markup has no explicit b
elements, but in the resulting DOM, up to three of each kind of formatting element (in this case
three b elements with the class attribute, and two unadorned b elements)
get reconstructed before the element’s «X».

Also note how this means that in the final paragraph only six b end tags are
needed to completely clear the list of active formatting elements, even though nine
b start tags have been seen up to this point.

13.3 Serializing HTML fragments

For the purposes of the following algorithm, an element serializes as void if its
element type is one of the void elements, or is basefont,
bgsound, frame, or keygen.

The following steps form the HTML fragment serialization algorithm. The algorithm takes as input a DOM
Element, Document, or DocumentFragment referred to as
the node, and returns a string.

This algorithm serializes the children of the node being serialized, not
the node itself.

  1. If the node serializes as void, then return the empty
    string.

  2. Let s be a string, and initialize it to the empty string.

  3. If the node is a template element, then let the node instead be the template element’s template
    contents (a DocumentFragment node).

  4. For each child node of the node, in tree order, run the
    following steps:

    1. Let current node be the child node being processed.

    2. Append the appropriate string from the following list to s:

      If current node is an Element

      If current node is an element in the HTML namespace, the
      MathML namespace, or the SVG namespace, then let tagname be current node‘s local name. Otherwise, let tagname be current node‘s qualified name.

      Append a U+003C LESS-THAN SIGN character (<), followed by tagname.

      For HTML elements created by the HTML parser or
      createElement(), tagname will be
      lowercase.

      If current node‘s is value is not null, and the element does not have an is attribute in its attribute list, then append the string « is="«, followed by current node‘s is value escaped as described below in attribute mode,
      followed by a U+0022 QUOTATION MARK character («).

      For each attribute that the element has, append a U+0020 SPACE character, the attribute’s serialized name as described below, a
      U+003D EQUALS SIGN character (=), a U+0022 QUOTATION MARK character («), the
      attribute’s value, escaped as described below in
      attribute mode, and a second U+0022 QUOTATION MARK character («).

      An attribute’s serialized name
      for the purposes of the previous paragraph must be determined as follows:

      If the attribute has no namespace

      The attribute’s serialized name is the attribute’s local name.

      For attributes on HTML elements set by the HTML
      parser or by setAttribute(), the
      local name will be lowercase.

      If the attribute is in the XML namespace

      The attribute’s serialized name is the string «xml:» followed
      by the attribute’s local name.

      If the attribute is in the XMLNS namespace and the attribute’s local name
      is xmlns

      The attribute’s serialized name is the string «xmlns«.

      If the attribute is in the XMLNS namespace and the attribute’s local name
      is not xmlns

      The attribute’s serialized name is the string «xmlns:»
      followed by the attribute’s local name.

      If the attribute is in the XLink namespace

      The attribute’s serialized name is the string «xlink:»
      followed by the attribute’s local name.

      If the attribute is in some other namespace

      The attribute’s serialized name is the attribute’s qualified name.

      While the exact order of attributes is implementation-defined, and may
      depend on factors such as the order that the attributes were given in the original markup,
      the sort order must be stable, such that consecutive invocations of this algorithm serialize
      an element’s attributes in the same order.

      Append a U+003E GREATER-THAN SIGN character (>).

      If current node serializes as void, then continue on to the next
      child node at this point.

      Append the value of running the HTML fragment serialization algorithm on the
      current node element (thus recursing into this algorithm for that
      element), followed by a U+003C LESS-THAN SIGN character (<), a U+002F SOLIDUS character
      (/), tagname again, and finally a U+003E GREATER-THAN SIGN character
      (>).

      If current node is a Text node

      If the parent of current node is a style,
      script, xmp, iframe, noembed,
      noframes, or plaintext element, or if the parent of current node is a noscript element and scripting is enabled for the node, then append the value of
      current node‘s data literally.

      Otherwise, append the value of current node‘s data, escaped as described
      below.

      If current node is a

      Append the literal string «<!--» (U+003C LESS-THAN SIGN, U+0021
      EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS), followed by the value of
      current node‘s data, followed by the
      literal string «-->» (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS,
      U+003E GREATER-THAN SIGN).

      If current node is a ProcessingInstruction

      Append the literal string «<?» (U+003C LESS-THAN SIGN, U+003F
      QUESTION MARK), followed by the value of current node‘s target IDL attribute, followed by a single U+0020 SPACE character, followed
      by the value of current node‘s data,
      followed by a single U+003E GREATER-THAN SIGN character (>).

      If current node is a DocumentType

      Append the literal string «<!DOCTYPE» (U+003C LESS-THAN SIGN, U+0021
      EXCLAMATION MARK, U+0044 LATIN CAPITAL LETTER D, U+004F LATIN CAPITAL LETTER O, U+0043 LATIN
      CAPITAL LETTER C, U+0054 LATIN CAPITAL LETTER T, U+0059 LATIN CAPITAL LETTER Y, U+0050 LATIN
      CAPITAL LETTER P, U+0045 LATIN CAPITAL LETTER E), followed by a space (U+0020 SPACE),
      followed by the value of current node‘s name, followed by the literal string «>» (U+003E GREATER-THAN SIGN).

  5. Return s.

It is possible that the output of this algorithm, if parsed with an HTML
parser, will not return the original tree structure. Tree structures that do not roundtrip
a serialize and reparse step can also be produced by the HTML parser itself, although
such cases are typically non-conforming.

For instance, if a textarea element to which a Comment
node has been appended is serialized and the output is then reparsed, the comment will end up
being displayed in the text control. Similarly, if, as a result of DOM manipulation, an element
contains a comment that contains the literal string «-->«, then when
the result of serializing the element is parsed, the comment will be truncated at that point and
the rest of the comment will be interpreted as markup. More examples would be making a
script element contain a Text node with the text string «</script>«, or having a p element that contains a
ul element (as the ul element’s start
tag would imply the end tag for the p).

This can enable cross-site scripting attacks. An example of this would be a page that lets the
user enter some font family names that are then inserted into a CSS style block via
the DOM and which then uses the innerHTML IDL attribute to get
the HTML serialization of that style element: if the user enters
«</style><script>attack</script>» as a font family name, innerHTML will return markup that, if parsed in a different context,
would contain a script node, even though no script node existed in the
original DOM.

For example, consider the following markup:

<form id="outer"><div></form><form id="inner"><input>

This will be parsed into:

  • html
    • head
    • body
      • form idouter«
        • div
          • form idinner«
            • input

The input element will be associated with the inner form element.
Now, if this tree structure is serialized and reparsed, the <form
id="inner">
start tag will be ignored, and so the input element will be
associated with the outer form element instead.

<html><head></head><body><form id="outer"><div><form id="inner"><input></form></div></form></body></html>
  • html
    • head
    • body
      • form idouter«
        • div
          • input

As another example, consider the following markup:

<a><table><a>

This will be parsed into:

  • html
    • head
    • body
      • a
        • a
        • table

That is, the a elements are nested, because the second a element is
foster parented. After a serialize-reparse roundtrip, the
a elements and the table element would all be siblings, because the
second <a> start tag implicitly closes the first a
element.

<html><head></head><body><a><a></a><table></table></a></body></html>
  • html
    • head
    • body
      • a
      • a
      • table

For historical reasons, this algorithm does not round-trip an initial U+000A LINE FEED (LF)
character in pre, textarea, or listing elements, even
though (in the first two cases) the markup being round-tripped can be conforming. The HTML
parser will drop such a character during parsing, but this algorithm does not
serialize an extra U+000A LINE FEED (LF) character.

For example, consider the following markup:

<pre>

Hello.</pre>

When this document is first parsed, the pre element’s child text
content starts with a single newline character. After a serialize-reparse roundtrip, the
pre element’s child text content is simply «Hello.«.

Because of the special role of the is attribute in signaling the creation of customized built-in elements, in that it provides a mechanism for parsed
HTML to set the element’s is
value, we special-case its handling during serialization. This ensures that an element’s
is value is preserved
through serialize-parse roundtrips.

When creating a customized built-in element via the parser, a developer uses the is attribute directly; in such cases serialize-parse roundtrips work fine.

<script>
window.SuperP = class extends HTMLParagraphElement {};
customElements.define("super-p", SuperP, { extends: "p" });
</script>

<div id="container"><p is="super-p">Superb!</p></div>

<script>
console.log(container.innerHTML); // <p is="super-p">
container.innerHTML = container.innerHTML;
console.log(container.innerHTML); // <p is="super-p">
console.assert(container.firstChild instanceof SuperP);
</script>

But when creating a customized built-in element via its constructor or via createElement(), the is
attribute is not added. Instead, the is value (which is what the custom elements machinery uses) is set
without intermediating through an attribute.

<script>
container.innerHTML = "";
const p = document.createElement("p", { is: "super-p" });
container.appendChild(p);

// The is attribute is not present in the DOM:
console.assert(!p.hasAttribute("is"));

// But the element is still a super-p:
console.assert(p instanceof SuperP);
</script>

To ensure that serialize-parse roundtrips still work, the serialization process explicitly
writes out the element’s is
value as an is attribute:

<script>
console.log(container.innerHTML); // <p is="super-p">
container.innerHTML = container.innerHTML;
console.log(container.innerHTML); // <p is="super-p">
console.assert(container.firstChild instanceof SuperP);
</script>

Escaping a string (for the purposes of the algorithm above)
consists of running the following steps:

  1. Replace any occurrence of the «&» character by the string «&amp;«.

  2. Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string «&nbsp;«.

  3. If the algorithm was invoked in the attribute mode, replace any occurrences of the
    «"» character by the string «&quot;«.

  4. If the algorithm was not invoked in the attribute mode, replace any
    occurrences of the «<» character by the string «&lt;«, and any occurrences of the «>» character by
    the string «&gt;«.

13.4 Parsing HTML fragments

The following steps form the HTML fragment parsing algorithm. The algorithm
takes as input an Element node, referred to as the context element, which gives the context for
the parser, as well as input, a string to parse, and returns a list of zero or
more nodes.

Parts marked fragment case in algorithms in the parser section are
parts that only occur if the parser was created for the purposes of this algorithm. The algorithms have been annotated
with such markings for informational purposes only; such markings have no normative weight. If it
is possible for a condition described as a fragment case to occur even when the
parser wasn’t created for the purposes of handling this algorithm, then that is an error in the
specification.

  1. Create a new Document node, and mark it as being an HTML document.

  2. If the
    node document of the context element is in
    quirks mode, then let the Document be in quirks mode.
    Otherwise, the
    node document of the context element is in
    limited-quirks mode, then let the Document be in limited-quirks
    mode. Otherwise, leave the Document in no-quirks mode.

  3. Create a new HTML parser, and associate it with the just created
    Document node.

  4. Set the state of the HTML parser’s tokenization stage as
    follows, switching on the context element:

    title
    textarea
    Switch the tokenizer to the RCDATA state.
    style
    xmp
    iframe
    noembed
    noframes
    Switch the tokenizer to the RAWTEXT state.
    script
    Switch the tokenizer to the script data state.
    noscript
    If the scripting flag is enabled, switch the tokenizer to the RAWTEXT
    state. Otherwise, leave the tokenizer in the data state.
    plaintext
    Switch the tokenizer to the PLAINTEXT state.
    Any other element
    Leave the tokenizer in the data state.

    For performance reasons, an implementation that does not report errors and
    that uses the actual state machine described in this specification directly could use the
    PLAINTEXT state instead of the RAWTEXT and script data states where those are mentioned in the
    list above. Except for rules regarding parse errors, they are equivalent, since there is no
    appropriate end tag token in the fragment case, yet they involve far fewer state
    transitions.

  5. Let root be a new html element with no attributes.

  6. Append the element root to the Document node created
    above.

  7. Set up the parser’s stack of open elements so that it contains just the single
    element root.

  8. If the context element is a
    template element, push «in
    template» onto the stack of template insertion modes so that it is the new
    current template insertion mode.

  9. Create a start tag token whose name is the local name of context and whose attributes are the attributes of
    context.

    Let this start tag token be the start tag token of the context node, e.g. for the purposes of determining
    if it is an HTML integration point.

  10. Reset the parser’s insertion mode
    appropriately.

    The parser will reference the context element as part of that algorithm.

  11. Set the parser’s form element pointer to the nearest node to the
    context element that is a form
    element (going straight up the ancestor chain, and including the element itself, if it is a
    form element), if any. (If there is no such form element, the
    form element pointer keeps its initial value, null.)

  12. Place the input into the input stream for the HTML
    parser just created. The encoding confidence is irrelevant.

  13. Start the parser and let it run until it has consumed all the characters just inserted into
    the input stream.

  14. Return the child
    nodes of root, in tree order.

Большинство ошибок, возникающих при валидации кода можно свести к набору типовых вариантов, зная которые легко понять, на что «намекает» валидатор. В качестве образца возьмем расширение HTML Validator для браузера Firefox, предназначенное для проверки кода и рассмотрим список ошибок и замечаний по коду.

Посмотреть все возможные сообщения валидатора можно по адресу http://www.htmlpedia.org/wiki/HTML_Tidy, далее приведены основные ошибки с их описанием и решением. Зеленым цветом выделен корректный вариант, другой цвет используется для обозначения ошибки.

Notice: entity «…» doesn’t end in «;»

Это замечание возникает при использовании спецсимволов вроде &lt; при отсутствии на конце точки с запятой.

&nbsp;

&nbsp

Решение

Добавьте в конце спецсимвола точку с запятой.

Notice: numeric character reference «…» doesn’t end in ‘;’

Возникает при использовании числовых спецсимволов вроде когда в конце забыли добавить точку с запятой.

&#8482

Решение

Добавьте в конце спецсимвола точку с запятой.

unescaped & or unknown entity «&…»

Символ амперсанда (&) часто применяется в адресах ссылок (атрибут href тега <a>), поскольку он разделяет несколько параметров. Однако амперсанд зарезервирован для спецсимволов вроде &nbsp; поэтому в ссылках необходимо указывать &amp; вместо &.

<a href=»http://www.htmlbook.ru/content/?id=30&amp;text=1″>Ссылка</a>

<a href=»http://www.htmlbook.ru/content/?id=30&text=1″>Ссылка</a>

Решение

Замените & на &amp;.

missing </…>

Отсутствует обязательный закрывающий тег.

<head><title>Заголовок</title></head>

<head><title>Заголовок</head>

Решение

Добавьте закрывающий тег.

missing </aaa> before <bbb>

Ошибка возникает при нарушении порядка тегов, когда блочный тег располагается внутри встроенного. В данном случае блочный тег <bbb> находится внутри встроенного тега <aaa>.

<p><span>Текст</span></p>

<span><p>Текст</p></span>

Решение

Поменяйте расположение тегов — перенесите встроенный тег внутрь блочного.

discarding unexpected <…>

Обнаружен открывающий или закрывающий тег, у которого нет пары. Подобная ошибка возникает в двух случаях: есть открывающий тег, но нет закрывающего; имеется закрывающий тег, которому не соответствует открывающий.

<div><div>Текст</div></div>

<div>Текст</div></div>

<div><div>Текст</div>

Решение

В зависимости от ситуации добавьте или удалите открывающий или закрывающий тег.

Notice: nested emphasis …

Контейнер содержит аналогичный тег физического форматирования, который не должен повторяться.

<p><b>Текст</b></p>

<p><b><b>Текст</b></b></p>

Решение

Удалите один из тегов.

replacing unexpected … by </…>

Закрывающий тег не соответствует открывающему тегу.

<p><b>Текст</b></p>

<p><b>Текст</span></p>

Решение

Замените открывающий или закрывающий тег на парный.

… isn’t allowed in <…> elements

Обнаружены теги, которые запрещено размещать внутри указанных элементов.

<head><title>Заголовок</title></head>

<head><body>Текст</body></head>

Решение

Переместите HTML-элемент в правильный раздел.

missing <…>

Нет обязательного тега в структуре элементов. Ошибка, к примеру, может возникнуть при формировании таблицы, когда пропущен тег <tr> и сразу же после <table> следует <td>.

<ol><li>Список</li></ol>

<ol>Список</ol>

Решение

Проверить правильность вложения тегов в текущем элементе и наличие обязательных элементов.

Notice: inserting implicit <…>

Сообщение возникает из-за предыдущей ошибки на странице.

Решение

Исправьте предыдущие ошибки.

Insert missing <title> element

В коде не вставлен тег <title>.

<head><title>Заголовок</title></head>

<head></head>

Решение

Добавьте контейнер <title>.

Multiple <frameset> elements

Тег <frameset> используется в документе более одного раза без вложения. Допускается вставлять несколько элементов <frameset>, но вложенных один в другой.

<frameset …><frame …>
<frameset …><frame …></frameset>
</frameset>

<frameset …><frame …></frameset>
<frameset …><frame …></frameset>

Решение

Используйте вложенные теги <frameset>.

<…> is not approved by W3C

Указанный тег не входит в спецификацию HTML.

<span style=»white-space: nowrap;»>текст без переносов</span>

<nobr>текст без переносов</nobr>

Решение

Удалите тег или замените его подходящим эквивалентом.

Error: <…> is not recognized!

Тег не распознан и не входит в спецификацию HTML.

Правильно: <p>Текст</p>

Неверно: <p><adres>Текст</adres></p>

Решение

Удалите неизвестный тег.

Trimming Empty Tag

Контейнер пустой или содержит только пробел.

<p>Текст</p>

<p>&nbsp;</p>

<p></p>

Решение

Удалите тег или добавьте внутрь контейнера текст.

<a> is probably intended as </a>

В закрывающем теге <a> отсутствует слэш.

<a href=»http://htmlbook.ru»>Ссылка на сайт</a>

<a href=»http://htmlbook.ru»>Ссылка на сайт<a>

Решение

Добавьте слэш к закрывающему тегу.

… shouldn’t be nested

Некоторые теги вроде <form> не могут содержать сами себя. Это сообщение также возникает из-за предыдущей ошибки.

<form action=»gb.php» name=»guestbook»></form>
<form action=»gb2.php» name=»guestbook2″></form>

<form action=»gb.php» name=»guestbook»>
<form action=»gb2.php» name=»guestbook2″></form>
</form>

Решение

Удалите вложенные теги или исправьте предыдущую ошибку.

Text found after closing </body>-tag

Теги или текст добавляется после закрывающего тега </body>.

<html>
 <head><title>Заголовок</title></head>
 <body><p>Основной текст</p></body>
</html>

<html>
 <head><title>Заголовок</title></head>
 <body><p>Основной текст</p></body>
 <b>Привет!</b>
</html>

Решение

Удалите текст после тега </body> или перенесите этот тег в конец текста.

Adjacent hyphens within comment

Комментарии в коде HTML определяются конструкцией вида <!— комментарий —>. Если в тексте комментария подряд идет два и более дефиса, возникает ошибка.

<!— Комментарий — заголовок —>

<!— комментарий —>

<!— Комментарий — тело документа —>

Решение

Удалите лишние дефисы.

SYSTEM, PUBLIC, W3C, DTD, EN must be upper case

Элемент <!DOCTYPE> указан неверно, в частности следующие атрибуты необходимо писать в верхнем регистре: SYSTEM, PUBLIC, W3C, DTD, EN.

<!DOCTYPE HTML PUBLIC «-//W3C//DTD HTML 4.01 Transitional//EN» «http://www.w3.org/TR/html4/loose.dtd»>

<!doctype html public «-//w3c//dtd html 4.01 Transitional//en» «http://www.w3.org/TR/html4/loose.dtd»>

Решение

Пишите <!DOCTYPE> корректно.

Warning: missing <!DOCTYPE> declaration

Не указан элемент <!DOCTYPE>.

<!DOCTYPE HTML PUBLIC «-//W3C//DTD HTML 4.01//EN» «http://www.w3.org/TR/html4/strict.dtd»>
<html>
<head>
<title>Заголовок</title>
</head>
<body>
<p>Основной текст</p>
</body>
</html>

<html>
<head>
<title>Untitled Document</title>
</head>
<body>
</body>
</html>

Решение

Поместите элемент <!DOCTYPE> в самую первую строку кода документа.

Too much <…>-elements

Повторяется тег, который в коде должен быть только один. К таким тегам относится <html>, <head>, <title> и <body>.

<head>
<title>Заголовок</title>
</head>

<head>
<title>Заголовок</title>
<title>Название статьи</title>
</head>

Решение

Удалите повторяющийся тег.

<…> inserting «…» attribute

Не указан обязательный атрибут для данного тега.

<style type=»text/css»>

<style>

Решение

Проверьте тег и добавьте недостающие атрибуты.

… attribute … lacks value

Атрибут тега не содержит обязательное значение или оно написано с синтаксической ошибкой.

<a href=»link.html»>Ссылка</a>

<a href>Ссылка</a>

Решение

Проверьте атрибуты тега и добавьте недостающие значения.

… attribute «…» has invalid value «…»

Атрибут содержит некорректное значение. Ошибка проявляется в тех случаях, когда в значении вместо текста пишется число и наоборот. Так, атрибуты id и name должны начинаться с символа ([A-Za-z]) и могут содержать цифры ([0-9]), дефис (-), подчеркивание (_), двоеточие (:) и точку (.). Значение ширины и высоты в атрибутах тегов не должно содержать ничего, кроме цифр ([0-9]) и процентов (%).

<div id=»layer1″>Слой 1</div>

<img src=»images/pic.gif» width=»200″ height=»120″>

<div id=»2layer»>Слой 2</div>

<img src=»images/pic.gif» width=»200px» height=»120px»>

Решение

Проверьте атрибут тега и измените его значение.

<…> missing > for end of tag

Ошибка может возникать в двух случаях: некорректно написан тег, что происходит, когда забыли добавить закрывающую скобку и применение > вместо использования спецсимвола.

<p>Пример текста</p>

<p>Для случая 0&lt;p рассмотрим следующий пример.</p>

<p Пример текста</p>

<p>Для случая 0<p рассмотрим следующий пример.</p>

Решение

Вставьте отсутствующую закрывающую скобку.
Замените < на &lt;.

<…> proprietary attribute «…»

Тег содержит атрибут, специфичный только для браузера Internet Explorer или другого и не входящий в спецификацию. Примером является атрибут height тега <table>.

Список всех атрибутов, входящих в спецификацию HTML приведен по адресу http://www.w3.org/TR/html4/index/attributes.html

<table style=»height: 100%»>

<table height=»100%»>

Решение

Список наиболее характерных атрибутов тегов приведен в табл. 14.1.

Табл. 14.1. Замена нестандартных атрибутов тегов

Тег Устаревший атрибут Стандартный атрибут
<body> marginwidth=0, marginheight=0, leftmargin=0, topmargin=0 style=»margin: 0″
<table> height=100% style=»height: 100%»
<table> nowrap style=»white-space: nowrap» или
<td nowrap>
<td> background=»abc.gif» style=»background-image:url(abc.gif)»

… proprietary attribute value «…»

Значение атрибута не входит в спецификацию HTML и является специфичным для браузера Internet Explorer или другого. Например, значение align=»absmiddle» тега <img> недопустимо.

<p><img src=»hello.gif» alt=»Привет» align=»middle»></p>

<p><img src=»hello.gif» alt=»Привет» style=»vertical-align: middle»></p>

<p><img src=»hello.gif» alt=»Привет» align=»absmiddle»></p>

Решение

Используйте стандартные значения атрибутов тегов или используйте стилевой эквивалент.

… dropping value «…» for repeated attribute «…»

Атрибут применяется в теге больше одного раза.

<img src=»image.jpg»>

<img src=»image.jpg» src=»image.jpg»>

Решение

Удалите повторяющийся атрибут.

… unexpected or duplicate quote mark

Отсутствует открывающая или закрывающая кавычка в атрибуте тега.

<img src=»image.jpg»>

<img src=image.jpg»>

Решение

Добавьте парную кавычку к значению атрибута.

… attribute with missing trailing quote mark

Тег содержит атрибут, в котором задано неверное количество кавычек.

<p id=»my_id»>

<p id=»my_id»»>

Решение

Добавьте или удалите одну из кавычек.

… id and name attribute value mismatch

Ошибка возникает, когда значения атрибутов id и name не совпадают между собой, что приводит к конфликту при обращении к свойствам элемента через скрипты.

<a name=»elm» id=»elm»>

<a id=»elm»>

<a name=»abcdef» id=»db1″>

Решение

Удалите один из атрибутов или сделайте значения атрибутов name и id одинаковыми.

Notice: replacing <…> by <…>

Ошибка возникает в следующих случаях:

  • неверный порядок тегов;
  • добавлен лишний закрывающий тег;
  • имеется открывающий тег без наличия обязательного закрывающего.

<p>Текст</p><br>

<p>Текст</p></p>

<p>abc<br><table>…</table></p>

Решение

Измените порядок тегов или удалите один из открывающих или закрывающих тегов.

… anchor «…» already defined

Значения атрибута name у различных тегов совпадает между собой. Значение name должно быть уникальным.

<form name=»my_form1″ action=»test1.php»></form>
<form name=»my_form2″ action=»test2.php»></form>

<form name=»my_form» action=»test1.php»></form>
<form name=»my_form» action=»test2.php»></form>

Решение

Выберите другое имя или измените предыдущие имена таким образом, чтобы они не совпадали.

<…> is probably intended as </…>

Тег повторяется дважды в коде HTML, тогда как подобный тег не должен содержать сам себя.

<em>Привет, мир!</em>

<em>Привет<em>, мир!</em></em>

Решение

Удалите один из тегов.

<…> lacks «…» attribute

Требуется обязательный атрибут тега, который, тем не менее, отсутствует.

<form action=»my_action.php»>

<form>

Решение

Добавьте недостающий атрибут к тегу.

Понравилась статья? Поделить с друзьями:
  • Docker wordpress error establishing a database connection
  • Docker ssl error
  • Docker service job for docker service failed because the control process exited with error code
  • Docker run npm install error
  • Docker php fpm error log