"c" not " c". * This would increase the size of the changes for some operations but leave more * natural-looking output HTML. * * @package WordPress * @subpackage HTML-API * @since 6.2.0 */ /** * Core class used to modify attributes in an HTML document for tags matching a query. * * ## Usage * * Use of this class requires three steps: * * 1. Create a new class instance with your input HTML document. * 2. Find the tag(s) you are looking for. * 3. Request changes to the attributes in those tag(s). * * Example: * * $tags = new WP_HTML_Tag_Processor( $html ); * if ( $tags->next_tag( 'option' ) ) { * $tags->set_attribute( 'selected', true ); * } * * ### Finding tags * * The `next_tag()` function moves the internal cursor through * your input HTML document until it finds a tag meeting any of * the supplied restrictions in the optional query argument. If * no argument is provided then it will find the next HTML tag, * regardless of what kind it is. * * If you want to _find whatever the next tag is_: * * $tags->next_tag(); * * | Goal | Query | * |-----------------------------------------------------------|---------------------------------------------------------------------------------| * | Find any tag. | `$tags->next_tag();` | * | Find next image tag. | `$tags->next_tag( array( 'tag_name' => 'img' ) );` | * | Find next image tag (without passing the array). | `$tags->next_tag( 'img' );` | * | Find next tag containing the `fullwidth` CSS class. | `$tags->next_tag( array( 'class_name' => 'fullwidth' ) );` | * | Find next image tag containing the `fullwidth` CSS class. | `$tags->next_tag( array( 'tag_name' => 'img', 'class_name' => 'fullwidth' ) );` | * * If a tag was found meeting your criteria then `next_tag()` * will return `true` and you can proceed to modify it. If it * returns `false`, however, it failed to find the tag and * moved the cursor to the end of the file. 
* * Once the cursor reaches the end of the file the processor * is done and if you want to reach an earlier tag you will * need to recreate the processor and start over, as it's * unable to back up or move in reverse. * * See the section on bookmarks for an exception to this * no-backing-up rule. * * #### Custom queries * * Sometimes it's necessary to further inspect an HTML tag than * the query syntax here permits. In these cases one may further * inspect the search results using the read-only functions * provided by the processor or external state or variables. * * Example: * * // Paint up to the first five DIV or SPAN tags marked with the "jazzy" style. * $remaining_count = 5; * while ( $remaining_count > 0 && $tags->next_tag() ) { * if ( * ( 'DIV' === $tags->get_tag() || 'SPAN' === $tags->get_tag() ) && * 'jazzy' === $tags->get_attribute( 'data-style' ) * ) { * $tags->add_class( 'theme-style-everest-jazz' ); * $remaining_count--; * } * } * * `get_attribute()` will return `null` if the attribute wasn't present * on the tag when it was called. It may return `""` (the empty string) * in cases where the attribute was present but its value was empty. * For boolean attributes, those whose name is present but no value is * given, it will return `true` (the only way to set `false` for an * attribute is to remove it). * * #### When matching fails * * When `next_tag()` returns `false` it could mean different things: * * - The requested tag wasn't found in the input document. * - The input document ended in the middle of an HTML syntax element. * * When a document ends in the middle of a syntax element it will pause * the processor. This is to make it possible in the future to extend the * input document and proceed - an important requirement for chunked * streaming parsing of a document. * * Example: * * $processor = new WP_HTML_Tag_Processor( 'This
` inside an HTML comment. * - STYLE content is raw text. * - TITLE content is plain text but character references are decoded. * - TEXTAREA content is plain text but character references are decoded. * - XMP (deprecated) content is raw text. * * ### Modifying HTML attributes for a found tag * * Once you've found the start of an opening tag you can modify * any number of the attributes on that tag. You can set a new * value for an attribute, remove the entire attribute, or do * nothing and move on to the next opening tag. * * Example: * * if ( $tags->next_tag( array( 'class_name' => 'wp-group-block' ) ) ) { * $tags->set_attribute( 'title', 'This groups the contained content.' ); * $tags->remove_attribute( 'data-test-id' ); * } * * If `set_attribute()` is called for an existing attribute it will * overwrite the existing value. Similarly, calling `remove_attribute()` * for a non-existing attribute has no effect on the document. Both * of these methods are safe to call without knowing if a given attribute * exists beforehand. * * ### Modifying CSS classes for a found tag * * The tag processor treats the `class` attribute as a special case. * Because it's a common operation to add or remove CSS classes, this * interface adds helper methods to make that easier. * * As with attribute values, adding or removing CSS classes is a safe * operation that doesn't require checking if the attribute or class * exists before making changes. If removing the only class then the * entire `class` attribute will be removed. 
* * Example: * * // from `<span>Yippee!</span>` * // to `<span class="is-active">Yippee!</span>` * $tags->add_class( 'is-active' ); * * // from `<span class="excited">Yippee!</span>` * // to `<span class="excited is-active">Yippee!</span>` * $tags->add_class( 'is-active' ); * * // from `<span class="is-active heavy-accent">Yippee!</span>` * // to `<span class="is-active heavy-accent">Yippee!</span>` * $tags->add_class( 'is-active' ); * * // from `<input type="text" class="is-active rugby not-disabled" length="24">` * // to `<input type="text" class="is-active not-disabled" length="24">` * $tags->remove_class( 'rugby' ); * * // from `<input type="text" class="rugby" length="24">` * // to `<input type="text" length="24">` * $tags->remove_class( 'rugby' ); * * // from `<input type="text" length="24">` * // to `<input type="text" length="24">` * $tags->remove_class( 'rugby' ); * * When class changes are enqueued but a direct change to `class` is made via * `set_attribute` then the changes made through `set_attribute` (or `remove_attribute`) * will take precedence over those made through `add_class` and `remove_class`. * * ### Bookmarks * * While scanning through the input HTML document it's possible to set * a named bookmark when a particular tag is found. Later on, after * continuing to scan other tags, it's possible to `seek` to one of * the set bookmarks and then proceed again from that point forward. * * Because bookmarks create processing overhead one should avoid * creating too many of them. As a rule, create only bookmarks * of known string literal names; avoid creating "mark_{$index}" * and so on. It's fine from a performance standpoint to create a * bookmark and update it frequently, such as within a loop. * * $total_todos = 0; * while ( $p->next_tag( array( 'tag_name' => 'UL', 'class_name' => 'todo' ) ) ) { * $p->set_bookmark( 'list-start' ); * while ( $p->next_tag( array( 'tag_closers' => 'visit' ) ) ) { * if ( 'UL' === $p->get_tag() && $p->is_tag_closer() ) { * $p->set_bookmark( 'list-end' ); * $p->seek( 'list-start' ); * $p->set_attribute( 'data-contained-todos', (string) $total_todos ); * $total_todos = 0; * $p->seek( 'list-end' ); * break; * } * * if ( 'LI' === $p->get_tag() && ! $p->is_tag_closer() ) { * $total_todos++; * } * } * } * * ## Tokens and finer-grained processing. * * It's possible to scan through every lexical token in the * HTML document using the `next_token()` function. 
This * alternative form takes no argument and provides no built-in * query syntax. * * Example: * * $title = '(untitled)'; * $text = ''; * while ( $processor->next_token() ) { * switch ( $processor->get_token_name() ) { * case '#text': * $text .= $processor->get_modifiable_text(); * break; * * case 'BR': * $text .= "\n"; * break; * * case 'TITLE': * $title = $processor->get_modifiable_text(); * break; * } * } * return trim( "# {$title}\n\n{$text}" ); * * ### Tokens and _modifiable text_. * * #### Special "atomic" HTML elements. * * Not all HTML elements are able to contain other elements inside of them. * For instance, the contents inside a TITLE element are plaintext (except * that character references like `&amp;` will be decoded). This means that * if the string `<img>` appears inside a TITLE element, then it's not an * image tag, but rather it's text describing an image tag. Likewise, the * contents of a SCRIPT or STYLE element are handled in a browser entirely * differently from the contents of other elements because they represent a * different language than HTML. * * For these elements the Tag Processor treats the entire sequence as one, * from the opening tag, including its contents, through its closing tag. * This means that it's not possible to match the closing tag for a * SCRIPT element unless it's unexpected; the Tag Processor already matched * it when it found the opening tag. * * The inner contents of these elements are that element's _modifiable text_. * * The special elements are: * - `SCRIPT` whose contents are treated as raw plaintext but which supports a legacy * style of including JavaScript inside of HTML comments to avoid accidentally * closing the SCRIPT from inside a JavaScript string. E.g. `console.log( '' )`. * - `TITLE` and `TEXTAREA` whose contents are treated as plaintext and then any * character references are decoded. E.g. `1 &lt; 2 &lt; 3` becomes `1 < 2 < 3`. 
* - `IFRAME`, `NOEMBED`, `NOFRAMES`, `STYLE`, `XMP` whose contents are treated as * raw plaintext and left as-is. E.g. `1 &lt; 2 &lt; 3` remains `1 &lt; 2 &lt; 3`. * * #### Other tokens with modifiable text. * * There are also non-elements which are void/self-closing in nature and contain * modifiable text that is part of that individual syntax token itself. * * - `#text` nodes, whose entire token _is_ the modifiable text. * - HTML comments and tokens that become comments due to some syntax error. The * text for these tokens is the portion of the comment inside of the syntax. * E.g. for `<!-- comment -->` the text is `" comment "` (note the spaces are included). * - `CDATA` sections, whose text is the content inside of the section itself. E.g. for * `<![CDATA[some content]]>` the text is `"some content"` (with restrictions [1]). * - "Funky comments," which are a special case of closing tags whose name is * invalid. The text for these nodes is the text that a browser would transform into * an HTML comment when parsing. E.g. for `</%post_author>` the text is `%post_author`. * - `DOCTYPE` declarations like `<!DOCTYPE html>` which have no closing tag. * - XML Processing instruction nodes like `` (with restrictions [2]). * - The empty end tag `</>` which is ignored in the browser and DOM. * * [1]: There are no CDATA sections in HTML. When encountering `<![CDATA[`, everything * until the next `>` becomes a bogus HTML comment, meaning there can be no CDATA * section in an HTML document containing `>`. The Tag Processor will first find * all valid and bogus HTML comments, and then if the comment _would_ have been a * CDATA section _were they to exist_, it will indicate this as the type of comment. * * [2]: XML allows a broader range of characters in a processing instruction's target name * and disallows "xml" as a name, since it's special. The Tag Processor only recognizes * target names with an ASCII-representable subset of characters. 
It also exhibits the * same constraint as with CDATA sections, in that `>` cannot exist within the token * since Processing Instructions do not exist within HTML and their syntax transforms * into a bogus comment in the DOM. * * ## Design and limitations * * The Tag Processor is designed to linearly scan HTML documents and tokenize * HTML tags and their attributes. It's designed to do this as efficiently as * possible without compromising parsing integrity. Therefore it will be * slower than some methods of modifying HTML, such as those incorporating * over-simplified PCRE patterns, but will not introduce the defects and * failures that those methods bring in, which lead to broken page renders * and often to security vulnerabilities. On the other hand, it will be faster * than full-blown HTML parsers such as DOMDocument and use considerably * less memory. It requires a negligible memory overhead, enough to consider * it a zero-overhead system. * * The performance characteristics are maintained by avoiding tree construction * and semantic cleanups which are specified in HTML5. Because of this, for * example, it's not possible for the Tag Processor to associate any given * opening tag with its corresponding closing tag, or to return the inner markup * inside an element. Systems may be built on top of the Tag Processor to do * this, but the Tag Processor is and should be constrained so it can remain an * efficient, low-level, and reliable HTML scanner. * * The Tag Processor's design incorporates a "garbage-in-garbage-out" philosophy. * HTML5 specifies that certain invalid content be transformed into different forms * for display, such as removing null bytes from an input document and replacing * invalid characters with the Unicode replacement character `U+FFFD` (visually "�"). * Where errors or transformations exist within the HTML5 specification, the Tag Processor * leaves those invalid inputs untouched, passing them through to the final browser * to handle. 
While this implies that certain operations will be non-spec-compliant, * such as reading the value of an attribute with invalid content, it also preserves a * simplicity and efficiency for handling those error cases. * * Most operations within the Tag Processor are designed to minimize the difference * between an input and output document for any given change. For example, the * `add_class` and `remove_class` methods preserve whitespace and the class ordering * within the `class` attribute; and when encountering tags with duplicated attributes, * the Tag Processor will leave those invalid duplicate attributes where they are but * update the proper attribute which the browser will read for parsing its value. An * exception to this rule is that all attribute updates store their values as * double-quoted strings, meaning that attributes on input with single-quoted or * unquoted values will appear in the output with double-quotes. * * ### Scripting Flag * * The Tag Processor parses HTML with the "scripting flag" disabled. This means * that it doesn't run any scripts while parsing the page. In a browser with * JavaScript enabled, for example, the script can change the parse of the * document as it loads. On the server, however, evaluating JavaScript is not * only impractical, but also unwanted. * * Practically this means that the Tag Processor will descend into NOSCRIPT * elements and process their child tags. Were the scripting flag enabled, such * as in a typical browser, the contents of NOSCRIPT would be skipped entirely. * * This allows the HTML API to process the content that will be presented in * a browser when scripting is disabled, but it offers a different view of a * page than most browser sessions will experience. E.g. in most browser * sessions the tags inside the NOSCRIPT disappear. 
* * ### Text Encoding * * The Tag Processor assumes that the input HTML document is encoded with a * text encoding compatible with 7-bit ASCII's '<', '>', '&', ';', '/', '=', * "'", '"', 'a' - 'z', 'A' - 'Z', and the whitespace characters ' ', tab, * carriage-return, newline, and form-feed. * * In practice, this includes almost every single-byte encoding as well as * UTF-8. Notably, however, it does not include UTF-16. If providing input * that's incompatible, then convert the encoding beforehand. * * @since 6.2.0 * @since 6.2.1 Fix: Support for various invalid comments; attribute updates are case-insensitive. * @since 6.3.2 Fix: Skip HTML-like content inside rawtext elements such as STYLE. * @since 6.5.0 Pauses processor when input ends in an incomplete syntax token. * Introduces "special" elements which act like void elements, e.g. TITLE, STYLE. * Allows scanning through all tokens and processing modifiable text, where applicable. */ class WP_HTML_Tag_Processor { /** * The maximum number of bookmarks allowed to exist at * any given time. * * @since 6.2.0 * @var int * * @see WP_HTML_Tag_Processor::set_bookmark() */ const MAX_BOOKMARKS = 10; /** * Maximum number of times seek() can be called. * Prevents accidental infinite loops. * * @since 6.2.0 * @var int * * @see WP_HTML_Tag_Processor::seek() */ const MAX_SEEK_OPS = 1000; /** * The HTML document to parse. * * @since 6.2.0 * @var string */ protected $html; /** * The last query passed to next_tag(). * * @since 6.2.0 * @var array|null */ private $last_query; /** * The tag name this processor currently scans for. * * @since 6.2.0 * @var string|null */ private $sought_tag_name; /** * The CSS class name this processor currently scans for. * * @since 6.2.0 * @var string|null */ private $sought_class_name; /** * The match offset this processor currently scans for. * * @since 6.2.0 * @var int|null */ private $sought_match_offset; /** * Whether to visit tag closers, e.g.
, when walking an input document. * * @since 6.2.0 * @var bool */ private $stop_on_tag_closers; /** * Specifies mode of operation of the parser at any given time. * * | State | Meaning | * | ----------------|----------------------------------------------------------------------| * | *Ready* | The parser is ready to run. | * | *Complete* | There is nothing left to parse. | * | *Incomplete* | The HTML ended in the middle of a token; nothing more can be parsed. | * | *Matched tag* | Found an HTML tag; it's possible to modify its attributes. | * | *Text node* | Found a #text node; this is plaintext and modifiable. | * | *CDATA node* | Found a CDATA section; this is modifiable. | * | *Comment* | Found a comment or bogus comment; this is modifiable. | * | *Presumptuous* | Found an empty tag closer: `</>`. | * | *Funky comment* | Found a tag closer with an invalid tag name; this is modifiable. | * * @since 6.5.0 * * @see WP_HTML_Tag_Processor::STATE_READY * @see WP_HTML_Tag_Processor::STATE_COMPLETE * @see WP_HTML_Tag_Processor::STATE_INCOMPLETE_INPUT * @see WP_HTML_Tag_Processor::STATE_MATCHED_TAG * @see WP_HTML_Tag_Processor::STATE_TEXT_NODE * @see WP_HTML_Tag_Processor::STATE_CDATA_NODE * @see WP_HTML_Tag_Processor::STATE_COMMENT * @see WP_HTML_Tag_Processor::STATE_DOCTYPE * @see WP_HTML_Tag_Processor::STATE_PRESUMPTUOUS_TAG * @see WP_HTML_Tag_Processor::STATE_FUNKY_COMMENT * * @var string */ protected $parser_state = self::STATE_READY; /** * Indicates if the document is in quirks mode or no-quirks mode. * * Impact on HTML parsing: * * - In `NO_QUIRKS_MODE` (also known as "standard mode"): * - CSS class and ID selectors match byte-for-byte (case-sensitively). * - A TABLE start tag `<table>` implicitly closes any open `P` element. * * - In `QUIRKS_MODE`: * - CSS class and ID selectors match in an ASCII case-insensitive manner. * - A TABLE start tag `<table>
` opens a `TABLE` element as a child of a `P` * element if one is open. * * Quirks and no-quirks mode are thus mostly about styling, but have an impact when * tables are found inside paragraph elements. * * @see self::QUIRKS_MODE * @see self::NO_QUIRKS_MODE * * @since 6.7.0 * * @var string */ protected $compat_mode = self::NO_QUIRKS_MODE; /** * Indicates whether the parser is inside foreign content, * e.g. inside an SVG or MathML element. * * One of 'html', 'svg', or 'math'. * * Several parsing rules change based on whether the parser * is inside foreign content, including whether CDATA sections * are allowed and whether a self-closing flag indicates that * an element has no content. * * @since 6.7.0 * * @var string */ private $parsing_namespace = 'html'; /** * What kind of syntax token became an HTML comment. * * Since there are many ways in which HTML syntax can create an HTML comment, * this indicates which of those caused it. This allows the Tag Processor to * represent more from the original input document than would appear in the DOM. * * @since 6.5.0 * * @var string|null */ protected $comment_type = null; /** * What kind of text the matched text node represents, if it was subdivided. * * @see self::TEXT_IS_NULL_SEQUENCE * @see self::TEXT_IS_WHITESPACE * @see self::TEXT_IS_GENERIC * @see self::subdivide_text_appropriately * * @since 6.7.0 * * @var string */ protected $text_node_classification = self::TEXT_IS_GENERIC; /** * How many bytes from the original HTML document have been read and parsed. * * This value points to the latest byte offset in the input document which * has been already parsed. It is the internal cursor for the Tag Processor * and updates while scanning through the HTML tokens. * * @since 6.2.0 * @var int */ private $bytes_already_parsed = 0; /** * Byte offset in input document where current token starts. * * Example: * *
... * 01234 * - token starts at 0 * * @since 6.5.0 * * @var int|null */ private $token_starts_at; /** * Byte length of current token. * * Example: * *
... * 012345678901234 * - token length is 14 - 0 = 14 * * a is a token. * 0123456789 123456789 123456789 * - token length is 17 - 2 = 15 * * @since 6.5.0 * * @var int|null */ private $token_length; /** * Byte offset in input document where current tag name starts. * * Example: * *
... * 01234 * - tag name starts at 1 * * @since 6.2.0 * * @var int|null */ private $tag_name_starts_at; /** * Byte length of current tag name. * * Example: * *
... * 01234 * --- tag name length is 3 * * @since 6.2.0 * * @var int|null */ private $tag_name_length; /** * Byte offset into input document where current modifiable text starts. * * @since 6.5.0 * * @var int */ private $text_starts_at; /** * Byte length of modifiable text. * * @since 6.5.0 * * @var int */ private $text_length; /** * Whether the current tag is an opening tag, e.g.
, or a closing tag, e.g.
. * * @var bool */ private $is_closing_tag; /** * Lazily-built index of attributes found within an HTML tag, keyed by the attribute name. * * Example: * * // Supposing the parser is working through this content * // and stops after recognizing the `id` attribute. * //
* // ^ parsing will continue from this point. * $this->attributes = array( * 'id' => new WP_HTML_Attribute_Token( 'id', 9, 6, 5, 11, false ) * ); * * // When picking up parsing again, or when asking to find the * // `class` attribute we will continue and add to this array. * $this->attributes = array( * 'id' => new WP_HTML_Attribute_Token( 'id', 9, 6, 5, 11, false ), * 'class' => new WP_HTML_Attribute_Token( 'class', 23, 7, 17, 13, false ) * ); * * // Note that only the `class` attribute value is stored in the index. * // That's because it is the only value used by this class at the moment. * * @since 6.2.0 * @var WP_HTML_Attribute_Token[] */ private $attributes = array(); /** * Tracks spans of duplicate attributes on a given tag, used for removing * all copies of an attribute when calling `remove_attribute()`. * * @since 6.3.2 * * @var (WP_HTML_Span[])[]|null */ private $duplicate_attributes = null; /** * Which class names to add or remove from a tag. * * These are tracked separately from attribute updates because they are * semantically distinct, whereas this interface exists for the common * case of adding and removing class names while other attributes are * generally modified as with DOM `setAttribute` calls. * * When modifying an HTML document these will eventually be collapsed * into a single `set_attribute( 'class', $changes )` call. * * Example: * * // Add the `wp-block-group` class, remove the `wp-group` class. * $classname_updates = array( * // Indexed by a comparable class name. * 'wp-block-group' => WP_HTML_Tag_Processor::ADD_CLASS, * 'wp-group' => WP_HTML_Tag_Processor::REMOVE_CLASS * ); * * @since 6.2.0 * @var bool[] */ private $classname_updates = array(); /** * Tracks a semantic location in the original HTML which * shifts with updates as they are applied to the document. 
* * @since 6.2.0 * @var WP_HTML_Span[] */ protected $bookmarks = array(); const ADD_CLASS = true; const REMOVE_CLASS = false; const SKIP_CLASS = null; /** * Lexical replacements to apply to input HTML document. * * "Lexical" in this class refers to the part of this class which * operates on pure text _as text_ and not as HTML. There's a line * between the public interface, with HTML-semantic methods like * `set_attribute` and `add_class`, and an internal state that tracks * text offsets in the input document. * * When higher-level HTML methods are called, those have to transform their * operations (such as setting an attribute's value) into text diffing * operations (such as replacing the sub-string from indices A to B with * some given new string). These text-diffing operations are the lexical * updates. * * As new higher-level methods are added they need to collapse their * operations into these lower-level lexical updates since that's the * Tag Processor's internal language of change. Any code which creates * these lexical updates must ensure that they do not cross HTML syntax * boundaries, however, so these should never be exposed outside of this * class or any classes which intentionally expand its functionality. * * These are enqueued while editing the document instead of being immediately * applied to avoid processing overhead, string allocations, and string * copies when applying many updates to a single document. * * Example: * * // Replace an attribute stored with a new value, indices * // sourced from the lazily-parsed HTML recognizer. * $start = $attributes['src']->start; * $length = $attributes['src']->length; * $modifications[] = new WP_HTML_Text_Replacement( $start, $length, $new_value ); * * // Correspondingly, something like this will appear in this array. 
* $lexical_updates = array( * WP_HTML_Text_Replacement( 14, 28, 'https://my-site.my-domain/wp-content/uploads/2014/08/kittens.jpg' ) * ); * * @since 6.2.0 * @var WP_HTML_Text_Replacement[] */ protected $lexical_updates = array(); /** * Tracks and limits `seek()` calls to prevent accidental infinite loops. * * @since 6.2.0 * @var int * * @see WP_HTML_Tag_Processor::seek() */ protected $seek_count = 0; /** * Whether the parser should skip over an immediately-following linefeed * character, as is the case with LISTING, PRE, and TEXTAREA. * * > If the next token is a U+000A LINE FEED (LF) character token, then * > ignore that token and move on to the next one. (Newlines at the start * > of [these] elements are ignored as an authoring convenience.) * * @since 6.7.0 * * @var int|null */ private $skip_newline_at = null; /** * Constructor. * * @since 6.2.0 * * @param string $html HTML to process. */ public function __construct( $html ) { if ( ! is_string( $html ) ) { _doing_it_wrong( __METHOD__, __( 'The HTML parameter must be a string.' ), '6.9.0' ); $html = ''; } $this->html = $html; } /** * Switches parsing mode into a new namespace, such as when * encountering an SVG tag and entering foreign content. * * @since 6.7.0 * * @param string $new_namespace One of 'html', 'svg', or 'math' indicating into what * namespace the next tokens will be processed. * @return bool Whether the namespace was valid and changed. */ public function change_parsing_namespace( string $new_namespace ): bool { if ( ! in_array( $new_namespace, array( 'html', 'math', 'svg' ), true ) ) { return false; } $this->parsing_namespace = $new_namespace; return true; } /** * Finds the next tag matching the $query. * * @since 6.2.0 * @since 6.5.0 No longer processes incomplete tokens at end of document; pauses the processor at start of token. * * @param array|string|null $query { * Optional. Which tag name to find, having which class, etc. Default is to find any tag. 
* * @type string|null $tag_name Which tag to find, or `null` for "any tag." * @type int|null $match_offset Find the Nth tag matching all search criteria. * 1 for "first" tag, 3 for "third," etc. * Defaults to first tag. * @type string|null $class_name Tag must contain this whole class name to match. * @type string|null $tag_closers "visit" or "skip": whether to stop on tag closers, e.g.
. * } * @return bool Whether a tag was matched. */ public function next_tag( $query = null ): bool { $this->parse_query( $query ); $already_found = 0; do { if ( false === $this->next_token() ) { return false; } if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { continue; } if ( $this->matches() ) { ++$already_found; } } while ( $already_found < $this->sought_match_offset ); return true; } /** * Finds the next token in the HTML document. * * An HTML document can be viewed as a stream of tokens, * where tokens are things like HTML tags, HTML comments, * text nodes, etc. This method finds the next token in * the HTML document and returns whether it found one. * * If it starts parsing a token and reaches the end of the * document then it will seek to the start of the last * token and pause, returning `false` to indicate that it * failed to find a complete token. * * Possible token types, based on the HTML specification: * * - an HTML tag, whether opening, closing, or void. * - a text node - the plaintext inside tags. * - an HTML comment. * - a DOCTYPE declaration. * - a processing instruction, e.g. ``. * * The Tag Processor currently only supports the tag token. * * @since 6.5.0 * @since 6.7.0 Recognizes CDATA sections within foreign content. * * @return bool Whether a token was parsed. */ public function next_token(): bool { return $this->base_class_next_token(); } /** * Internal method which finds the next token in the HTML document. * * This method is a protected internal function which implements the logic for * finding the next token in a document. It exists so that the parser can update * its state without affecting the location of the cursor in the document and * without triggering subclass methods for things like `next_token()`, e.g. when * applying patches before searching for the next token. * * @since 6.5.0 * * @access private * * @return bool Whether a token was parsed. 
*/ private function base_class_next_token(): bool { $was_at = $this->bytes_already_parsed; $this->after_tag(); // Don't proceed if there's nothing more to scan. if ( self::STATE_COMPLETE === $this->parser_state || self::STATE_INCOMPLETE_INPUT === $this->parser_state ) { return false; } /* * The next step in the parsing loop determines the parsing state; * clear it so that state doesn't linger from the previous step. */ $this->parser_state = self::STATE_READY; if ( $this->bytes_already_parsed >= strlen( $this->html ) ) { $this->parser_state = self::STATE_COMPLETE; return false; } // Find the next tag if it exists. if ( false === $this->parse_next_tag() ) { if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) { $this->bytes_already_parsed = $was_at; } return false; } /* * For legacy reasons the rest of this function handles tags and their * attributes. If the processor has reached the end of the document * or if it matched any other token then it should return here to avoid * attempting to process tag-specific syntax. */ if ( self::STATE_INCOMPLETE_INPUT !== $this->parser_state && self::STATE_COMPLETE !== $this->parser_state && self::STATE_MATCHED_TAG !== $this->parser_state ) { return true; } // Parse all of its attributes. while ( $this->parse_next_attribute() ) { continue; } // Ensure that the tag closes before the end of the document. if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state || $this->bytes_already_parsed >= strlen( $this->html ) ) { // Does this appropriately clear state (parsed attributes)? 
$this->parser_state = self::STATE_INCOMPLETE_INPUT; $this->bytes_already_parsed = $was_at; return false; } $tag_ends_at = strpos( $this->html, '>', $this->bytes_already_parsed ); if ( false === $tag_ends_at ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; $this->bytes_already_parsed = $was_at; return false; } $this->parser_state = self::STATE_MATCHED_TAG; $this->bytes_already_parsed = $tag_ends_at + 1; $this->token_length = $this->bytes_already_parsed - $this->token_starts_at; /* * Certain tags require additional processing. The first-letter pre-check * avoids unnecessary string allocation when comparing the tag names. * * - IFRAME * - LISTING (deprecated) * - NOEMBED (deprecated) * - NOFRAMES (deprecated) * - PRE * - SCRIPT * - STYLE * - TEXTAREA * - TITLE * - XMP (deprecated) */ if ( $this->is_closing_tag || 'html' !== $this->parsing_namespace || 1 !== strspn( $this->html, 'iIlLnNpPsStTxX', $this->tag_name_starts_at, 1 ) ) { return true; } $tag_name = $this->get_tag(); /* * For LISTING, PRE, and TEXTAREA, the first linefeed of an immediately-following * text node is ignored as an authoring convenience. * * @see static::skip_newline_at */ if ( 'LISTING' === $tag_name || 'PRE' === $tag_name ) { $this->skip_newline_at = $this->bytes_already_parsed; return true; } /* * There are certain elements whose children are not DATA but are instead * RCDATA or RAWTEXT. These cannot contain other elements, and the contents * are parsed as plaintext, with character references decoded in RCDATA but * not in RAWTEXT. * * These elements are described here as "self-contained" or special atomic * elements whose end tag is consumed with the opening tag, and they will * contain modifiable text inside of them. * * Preserve the opening tag pointers, as these will be overwritten * when finding the closing tag. They will be reset after finding * the closing tag to point to the opening of the special atomic * tag sequence. 
*/ $tag_name_starts_at = $this->tag_name_starts_at; $tag_name_length = $this->tag_name_length; $tag_ends_at = $this->token_starts_at + $this->token_length; $attributes = $this->attributes; $duplicate_attributes = $this->duplicate_attributes; // Find the closing tag if necessary. switch ( $tag_name ) { case 'SCRIPT': $found_closer = $this->skip_script_data(); break; case 'TEXTAREA': case 'TITLE': $found_closer = $this->skip_rcdata( $tag_name ); break; /* * In the browser this list would include the NOSCRIPT element, * but the Tag Processor is an environment with the scripting * flag disabled, meaning that it needs to descend into the * NOSCRIPT element to be able to properly process what will be * sent to a browser. * * Note that this rule makes HTML5 syntax incompatible with XML, * because the parsing of this token depends on client application. * The NOSCRIPT element cannot be represented in the XHTML syntax. */ case 'IFRAME': case 'NOEMBED': case 'NOFRAMES': case 'STYLE': case 'XMP': $found_closer = $this->skip_rawtext( $tag_name ); break; // No other tags should be treated in their entirety here. default: return true; } if ( ! $found_closer ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; $this->bytes_already_parsed = $was_at; return false; } /* * The values here look like they reference the opening tag but they reference * the closing tag instead. This is why the opening tag values were stored * above in a variable. It reads confusingly here, but that's because the * functions that skip the contents have moved all the internal cursors past * the inner content of the tag. 
*/ $this->token_starts_at = $was_at; $this->token_length = $this->bytes_already_parsed - $this->token_starts_at; $this->text_starts_at = $tag_ends_at; $this->text_length = $this->tag_name_starts_at - $this->text_starts_at; $this->tag_name_starts_at = $tag_name_starts_at; $this->tag_name_length = $tag_name_length; $this->attributes = $attributes; $this->duplicate_attributes = $duplicate_attributes; return true; } /** * Whether the processor paused because the input HTML document ended * in the middle of a syntax element, such as in the middle of a tag. * * Example: * * $processor = new WP_HTML_Tag_Processor( '" ); * $p->next_tag(); * foreach ( $p->class_list() as $class_name ) { * echo "{$class_name} "; * } * // Outputs: "free lang-en " * * @since 6.4.0 */ public function class_list() { if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { return; } /** @var string $class contains the string value of the class attribute, with character references decoded. */ $class = $this->get_attribute( 'class' ); if ( ! is_string( $class ) ) { return; } $seen = array(); $is_quirks = self::QUIRKS_MODE === $this->compat_mode; $at = 0; while ( $at < strlen( $class ) ) { // Skip past any initial boundary characters. $at += strspn( $class, " \t\f\r\n", $at ); if ( $at >= strlen( $class ) ) { return; } // Find the byte length until the next boundary. $length = strcspn( $class, " \t\f\r\n", $at ); if ( 0 === $length ) { return; } $name = str_replace( "\x00", "\u{FFFD}", substr( $class, $at, $length ) ); if ( $is_quirks ) { $name = strtolower( $name ); } $at += $length; /* * It's expected that the number of class names for a given tag is relatively small. * Given this, it is probably faster overall to scan an array for a value rather * than to use the class name as a key and check if it's a key of $seen. */ if ( in_array( $name, $seen, true ) ) { continue; } $seen[] = $name; yield $name; } } /** * Returns if a matched tag contains the given ASCII case-insensitive class name. 
* * @since 6.4.0 * * @param string $wanted_class Look for this CSS class name, ASCII case-insensitive. * @return bool|null Whether the matched tag contains the given class name, or null if not matched. */ public function has_class( $wanted_class ): ?bool { if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { return null; } $case_insensitive = self::QUIRKS_MODE === $this->compat_mode; $wanted_length = strlen( $wanted_class ); foreach ( $this->class_list() as $class_name ) { if ( strlen( $class_name ) === $wanted_length && 0 === substr_compare( $class_name, $wanted_class, 0, strlen( $wanted_class ), $case_insensitive ) ) { return true; } } return false; } /** * Sets a bookmark in the HTML document. * * Bookmarks represent specific places or tokens in the HTML * document, such as a tag opener or closer. When applying * edits to a document, such as setting an attribute, the * text offsets of that token may shift; the bookmark is * kept updated with those shifts and remains stable unless * the entire span of text in which the token sits is removed. * * Release bookmarks when they are no longer needed. * * Example: * *

<main><h2>Surprising fact you may not know!</h2></main>

* ^ ^ * \-|-- this `H2` opener bookmark tracks the token * *

<main class="clickbait"><h2>Surprising fact you may no… * ^ ^ * \-|-- it shifts with edits * * Bookmarks provide the ability to seek to a previously-scanned * place in the HTML document. This avoids the need to re-scan * the entire document. * * Example: * *
<ul><li>One</li><li>Two</li><li>Three</li></ul>
* ^^^^ * want to note this last item * * $p = new WP_HTML_Tag_Processor( $html ); * $in_list = false; * while ( $p->next_tag( array( 'tag_closers' => $in_list ? 'visit' : 'skip' ) ) ) { * if ( 'UL' === $p->get_tag() ) { * if ( $p->is_tag_closer() ) { * $in_list = false; * $p->set_bookmark( 'resume' ); * if ( $p->seek( 'last-li' ) ) { * $p->add_class( 'last-li' ); * } * $p->seek( 'resume' ); * $p->release_bookmark( 'last-li' ); * $p->release_bookmark( 'resume' ); * } else { * $in_list = true; * } * } * * if ( 'LI' === $p->get_tag() ) { * $p->set_bookmark( 'last-li' ); * } * } * * Bookmarks intentionally hide the internal string offsets * to which they refer. They are maintained internally as * updates are applied to the HTML document and therefore * retain their "position" - the location to which they * originally pointed. The inability to use bookmarks with * functions like `substr` is therefore intentional to guard * against accidentally breaking the HTML. * * Because bookmarks allocate memory and require processing * for every applied update, they are limited and require * a name. They should not be created with programmatically-made * names, such as "li_{$index}" with some loop. As a general * rule they should only be created with string-literal names * like "start-of-section" or "last-paragraph". * * Bookmarks are a powerful tool to enable complicated behavior. * Consider double-checking that you need this tool if you are * reaching for it, as inappropriate use could lead to broken * HTML structure or unwanted processing overhead. * * @since 6.2.0 * * @param string $name Identifies this particular bookmark. * @return bool Whether the bookmark was successfully created. */ public function set_bookmark( $name ): bool { // It only makes sense to set a bookmark if the parser has paused on a concrete token. if ( self::STATE_COMPLETE === $this->parser_state || self::STATE_INCOMPLETE_INPUT === $this->parser_state ) { return false; } if ( ! 
array_key_exists( $name, $this->bookmarks ) && count( $this->bookmarks ) >= static::MAX_BOOKMARKS ) { _doing_it_wrong( __METHOD__, __( 'Too many bookmarks: cannot create any more.' ), '6.2.0' ); return false; } $this->bookmarks[ $name ] = new WP_HTML_Span( $this->token_starts_at, $this->token_length ); return true; } /** * Removes a bookmark that is no longer needed. * * Releasing a bookmark frees up the small * performance overhead it requires. * * @param string $name Name of the bookmark to remove. * @return bool Whether the bookmark already existed before removal. */ public function release_bookmark( $name ): bool { if ( ! array_key_exists( $name, $this->bookmarks ) ) { return false; } unset( $this->bookmarks[ $name ] ); return true; } /** * Skips contents of generic rawtext elements. * * @since 6.3.2 * * @see https://html.spec.whatwg.org/#generic-raw-text-element-parsing-algorithm * * @param string $tag_name The uppercase tag name which will close the RAWTEXT region. * @return bool Whether an end to the RAWTEXT region was found before the end of the document. */ private function skip_rawtext( string $tag_name ): bool { /* * These two functions distinguish themselves on whether character references are * decoded, and since functionality to read the inner markup isn't supported, it's * not necessary to implement these two functions separately. */ return $this->skip_rcdata( $tag_name ); } /** * Skips contents of RCDATA elements, namely title and textarea tags. * * @since 6.2.0 * * @see https://html.spec.whatwg.org/multipage/parsing.html#rcdata-state * * @param string $tag_name The uppercase tag name which will close the RCDATA region. * @return bool Whether an end to the RCDATA region was found before the end of the document. 
*/ private function skip_rcdata( string $tag_name ): bool { $html = $this->html; $doc_length = strlen( $html ); $tag_length = strlen( $tag_name ); $at = $this->bytes_already_parsed; while ( false !== $at && $at < $doc_length ) { $at = strpos( $this->html, '</', $at ); $this->tag_name_starts_at = $at; // Fail if there is no possible tag closer. if ( false === $at || ( $at + $tag_length ) >= $doc_length ) { return false; } $at += 2; /* * Find a case-insensitive match to the tag name. * * Because tag names are limited to US-ASCII there is no * need to perform any kind of Unicode normalization when * comparing; any character which could be impacted by such * normalization could not be part of a tag name. */ for ( $i = 0; $i < $tag_length; $i++ ) { $tag_char = $tag_name[ $i ]; $html_char = $html[ $at + $i ]; if ( $html_char !== $tag_char && strtoupper( $html_char ) !== $tag_char ) { $at += $i; continue 2; } } $at += $tag_length; $this->bytes_already_parsed = $at; if ( $at >= strlen( $html ) ) { return false; } /* * Ensure that the tag name terminates to avoid matching on * substrings of a longer tag name. For example, the sequence * "</textarearug" should not match for "</textarea" even * though "textarea" is found within the text. */ $c = $html[ $at ]; if ( ' ' !== $c && "\t" !== $c && "\r" !== $c && "\n" !== $c && '/' !== $c && '>' !== $c ) { continue; } while ( $this->parse_next_attribute() ) { continue; } $at = $this->bytes_already_parsed; if ( $at >= strlen( $this->html ) ) { return false; } if ( '>' === $html[ $at ] ) { $this->bytes_already_parsed = $at + 1; return true; } if ( $at + 1 >= strlen( $this->html ) ) { return false; } if ( '/' === $html[ $at ] && '>' === $html[ $at + 1 ] ) { $this->bytes_already_parsed = $at + 2; return true; } } return false; } /** * Skips contents of script tags. * * @since 6.2.0 * * @return bool Whether the script tag was closed before the end of the document.
*/ private function skip_script_data(): bool { $state = 'unescaped'; $html = $this->html; $doc_length = strlen( $html ); $at = $this->bytes_already_parsed; while ( false !== $at && $at < $doc_length ) { $at += strcspn( $html, '-<', $at ); /* * Optimization: Terminating a complete script element requires at least eight * additional bytes in the document. Some checks below may cause local escaped * state transitions when processing shorter strings, but those transitions are * irrelevant if the script tag is incomplete and the function must return false. * * This may need updating if those transitions become significant or exported from * this function in some way, such as when building safe methods to embed JavaScript * or data inside a SCRIPT element. * * $at may be here. * ↓ * ... * ╰──┬───╯ * $at + 8 additional bytes are required for a non-false return value. * * This single check eliminates the need to check lengths for the shorter spans: * * $at may be here. * ↓ * * ├╯ * $at + 2 additional characters does not require a length check. * * The transition from "escaped" to "unescaped" is not relevant if the document ends: * * $at may be here. * ↓ * `. A SCRIPT element could be prevented from * closing by contents like ` * * * @since 6.5.0 */ const COMMENT_AS_ABRUPTLY_CLOSED_COMMENT = 'COMMENT_AS_ABRUPTLY_CLOSED_COMMENT'; /** * Indicates that a comment would be parsed as a CDATA node, * were HTML to allow CDATA nodes outside of foreign content. * * Example: * * * * This is an HTML comment, but it looks like a CDATA node. * * @since 6.5.0 */ const COMMENT_AS_CDATA_LOOKALIKE = 'COMMENT_AS_CDATA_LOOKALIKE'; /** * Indicates that a comment was created when encountering * normative HTML comment syntax. * * Example: * * * * @since 6.5.0 */ const COMMENT_AS_HTML_COMMENT = 'COMMENT_AS_HTML_COMMENT'; /** * Indicates that a comment would be parsed as a Processing * Instruction node, were they to exist within HTML. 
* * Example: * * * * This is an HTML comment, but it looks like a Processing Instruction node. * * @since 6.5.0 */ const COMMENT_AS_PI_NODE_LOOKALIKE = 'COMMENT_AS_PI_NODE_LOOKALIKE'; /** * Indicates that a comment was created when encountering invalid * HTML input, a so-called "bogus comment." * * Example: * * * * * @since 6.5.0 */ const COMMENT_AS_INVALID_HTML = 'COMMENT_AS_INVALID_HTML'; /** * No-quirks mode document compatibility mode. * * > In no-quirks mode, the behavior is (hopefully) the desired behavior * > described by the modern HTML and CSS specifications. * * @see self::$compat_mode * @see https://developer.mozilla.org/en-US/docs/Web/HTML/Quirks_Mode_and_Standards_Mode * * @since 6.7.0 * * @var string */ const NO_QUIRKS_MODE = 'no-quirks-mode'; /** * Quirks mode document compatibility mode. * * > In quirks mode, layout emulates behavior in Navigator 4 and Internet * > Explorer 5. This is essential in order to support websites that were * > built before the widespread adoption of web standards. * * @see self::$compat_mode * @see https://developer.mozilla.org/en-US/docs/Web/HTML/Quirks_Mode_and_Standards_Mode * * @since 6.7.0 * * @var string */ const QUIRKS_MODE = 'quirks-mode'; /** * Indicates that a span of text may contain any combination of significant * kinds of characters: NULL bytes, whitespace, and others. * * @see self::$text_node_classification * @see self::subdivide_text_appropriately * * @since 6.7.0 */ const TEXT_IS_GENERIC = 'TEXT_IS_GENERIC'; /** * Indicates that a span of text comprises a sequence only of NULL bytes. * * @see self::$text_node_classification * @see self::subdivide_text_appropriately * * @since 6.7.0 */ const TEXT_IS_NULL_SEQUENCE = 'TEXT_IS_NULL_SEQUENCE'; /** * Indicates that a span of decoded text comprises only whitespace. * * @see self::$text_node_classification * @see self::subdivide_text_appropriately * * @since 6.7.0 */ const TEXT_IS_WHITESPACE = 'TEXT_IS_WHITESPACE'; /** * Wakeup magic method.
* * @since 6.9.2 */ public function __wakeup() { throw new \LogicException( __CLASS__ . ' should never be unserialized' ); } }