gh-135661: Fix parsing start and end tags in HTMLParser #135930

serhiy-storchaka · 2025-06-25T11:46:03Z

* Whitespace is no longer accepted between `</` and the tag name. E.g. `</ script>` does not end the script section.
* Vertical tabulation (`\v`) and non-ASCII whitespace are no longer recognized as whitespace. The only whitespace characters are `\t`, `\n`, `\r`, `\f` and space.
* The null character (U+0000) no longer ends the tag name.
* An end tag can have attributes and slashes after the tag name. It no longer ends after the first `>` in a quoted attribute value. E.g. `</script/foo=">"/>`.
* Multiple slashes and whitespace characters between the last attribute and the closing `>` are now accepted in both start and end tags. E.g. `<a foo=bar/ //>`.
* Multiple `=` between the attribute name and value are no longer collapsed. E.g. `<a foo==bar>` produces attribute "foo" with value "=bar".
* Whitespace between the `=` separator and the attribute name or value is no longer ignored. E.g. `<a foo =bar>` produces two attributes, "foo" and "=bar", both with value None; `<a foo= bar>` produces two attributes: "foo" with value "" and "bar" with value None.
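
For readers who want to check the cases listed above on their own interpreter, here is a minimal sketch. `EventRecorder` and `record` are hypothetical helper names, not part of this PR; they only record the callbacks that `html.parser.HTMLParser` emits. The printed results depend on whether the running Python includes this fix, so no expected output is shown.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: store parser callbacks as plain tuples."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def record(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


# Inputs taken from the PR description; results vary by Python version.
for markup in ['<a foo==bar>', '<a foo =bar>', '<a foo= bar>', '<a foo=bar/ //>']:
    print(repr(markup), '->', record(markup))
```

Comparing the printed attribute lists with the description shows whether a given interpreter already has the new behavior.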

Issue: HTMLParser differences from the HTML5 specification #135661

serhiy-storchaka · 2025-06-25T12:44:22Z

I tried to minimize the changes and to split this PR into several PRs, but they would not be independent, and all of these changes are needed to fix the possible XSS.

I am planning further refactoring, but this is only for the main branch.

ezio-melotti · 2025-07-02T14:36:08Z

Lib/html/parser.py

@@ -36,29 +36,33 @@
 #     explode, so don't do it.


I don't know if you saw and heeded the warning or if you just got lucky, but it looks like you were able to change these regexes!
Since you renamed locatestarttagend, the comment at line 34 should also be updated.

In addition, make sure that existing comments are still relevant. In particular I would appreciate this for comments linking to specific sections of the HTML5 standard.

ezio-melotti · 2025-07-02T14:52:17Z

Lib/html/parser.py

+     )?
+    [\t\n\r\f /]*                   # possibly followed by a space
+   )*
+   >?


These changes make sense to me.

I also noticed that you removed the start from locatestarttagend_tolerant, presumably because you are now using it to find the end of end tags too (which can contain attributes, even if they are invalid).

This variable is not documented; however, I can see two options:

1. we consider it private and just rename it;
2. we create an alias to the old name for backward compatibility, in case someone was using it.

Note that there used to be a set of *_strict variables that got removed, so the _tolerant suffix is no longer needed and was only kept for backward compatibility. Since you are refactoring/renaming (some of) these variables, you might want to consider dropping the _tolerant suffix altogether (and possibly adding aliases to preserve backward compatibility), either in this or in a separate PR.

ezio-melotti · 2025-07-02T14:54:18Z

Lib/html/parser.py

@@ -141,7 +145,8 @@ def get_starttag_text(self):

    def set_cdata_mode(self, elem):
        self.cdata_elem = elem.lower()
-        self.interesting = re.compile(r'</\s*%s\s*>' % self.cdata_elem, re.I)
+        self.interesting = re.compile(r'</%s(?=[\t\n\r\f />])' % self.cdata_elem,
+                                      re.IGNORECASE|re.ASCII)


Any reason for adding re.ASCII here?

ezio-melotti · 2025-07-02T15:01:43Z

Lib/html/parser.py


    # Internal -- parse endtag, return end or -1 if incomplete
    def parse_endtag(self, i):
        rawdata = self.rawdata
        assert rawdata[i:i+2] == "</", "unexpected call to parse_endtag"
-        match = endendtag.search(rawdata, i+1) # >
-        if not match:
+        if rawdata.find('>', i+2) < 0:


Suggested change

-        if rawdata.find('>', i+2) < 0:
+        if rawdata.rfind('>', i+2) < 0:

Probably inconsequential performance-wise, but using rfind seems more logical here (and possibly elsewhere).

serhiy-storchaka

Thank you for the review, @ezio-melotti.

serhiy-storchaka · 2025-07-02T17:43:12Z

Lib/html/parser.py

@@ -36,29 +36,33 @@
 #     explode, so don't do it.


There are links below; they still work, although they now redirect to other addresses. I updated them.

On the other hand, the section numbers have changed. I updated them in the places that I touched.

serhiy-storchaka · 2025-07-02T17:45:55Z

Lib/html/parser.py

+     )?
+    [\t\n\r\f /]*                   # possibly followed by a space
+   )*
+   >?


I restored the removed variables. I will remove them in the main branch in a follow-up PR.

serhiy-storchaka · 2025-07-02T17:51:01Z

Lib/html/parser.py

@@ -141,7 +145,8 @@ def get_starttag_text(self):

    def set_cdata_mode(self, elem):
        self.cdata_elem = elem.lower()
-        self.interesting = re.compile(r'</\s*%s\s*>' % self.cdata_elem, re.I)
+        self.interesting = re.compile(r'</%s(?=[\t\n\r\f />])' % self.cdata_elem,
+                                      re.IGNORECASE|re.ASCII)


Yes, it affects case-insensitive mode. Without it, 'ſ' matches 's' and 'ı' matches 'i'. There may be more cases after adding support for title and textarea.

This is not actually a problem in the current code, but future changes could make this important.
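
To make the effect of the flags and of the explicit whitespace set concrete, here is a small sketch that compiles the removed and the added pattern from the diff above for `script` and runs both against a few inputs. The sample strings and the names `old`, `new`, `samples` and `results` are illustrative assumptions; only the two patterns come from the diff.

```python
import re

cdata_elem = "script"

# Pattern before this PR (the removed diff line).
old = re.compile(r'</\s*%s\s*>' % cdata_elem, re.I)
# Pattern after this PR (the added diff lines).
new = re.compile(r'</%s(?=[\t\n\r\f />])' % cdata_elem,
                 re.IGNORECASE | re.ASCII)

# Illustrative inputs (assumptions, not taken from the test suite).
samples = {
    "plain": "</SCRIPT>",
    "space after </": "</ script>",
    "vertical tab": "</script\v>",
    "long s (U+017F)": "</\u017fcript>",
    "attribute": "</script foo>",
}

# For each sample: (matched by old pattern, matched by new pattern).
results = {name: (bool(old.search(text)), bool(new.search(text)))
           for name, text in samples.items()}

for name, (was, now) in results.items():
    print(f"{name:18} old={was!s:5} new={now}")
```

Without `re.ASCII`, case-insensitive matching treats 'ſ' as equivalent to 's', which is why the old pattern accepts the long-s spelling and the new one does not.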

serhiy-storchaka · 2025-07-02T17:59:14Z

Lib/html/parser.py


    # Internal -- parse endtag, return end or -1 if incomplete
    def parse_endtag(self, i):
        rawdata = self.rawdata
        assert rawdata[i:i+2] == "</", "unexpected call to parse_endtag"
-        match = endendtag.search(rawdata, i+1) # >
-        if not match:
+        if rawdata.find('>', i+2) < 0:


This check is not actually needed. It is simply an optimization for the case of a truncated end tag, because it is faster than endtagopen.match() + locatetagend.match(). I do not know whether it really helps, but I left it as insurance against unexpected performance degradation.

find may be faster than rfind in general, and in the case of an end tag there is a good chance of finding ">" within the first few characters.
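
As a rough illustration of the quick check discussed here, the sketch below wraps it in a hypothetical helper, `end_tag_may_be_complete`, which is not a function in `Lib/html/parser.py`. It also shows that `find` and `rfind` always agree on whether a `>` exists at all, so the choice between them only affects how much of the buffer is scanned.

```python
def end_tag_may_be_complete(rawdata, i):
    """Hypothetical helper: cheap pre-check before running the tag regexes.

    Returns False when no '>' follows the '</' at position i, i.e. the
    end tag is certainly truncated and more input is needed.
    """
    assert rawdata[i:i+2] == "</", "expected an end tag opener"
    return rawdata.find('>', i + 2) >= 0


for rawdata in ["</scri", "</script>", "</script foo='>' bar>"]:
    found_left = rawdata.find('>', 2) >= 0    # scans from the left
    found_right = rawdata.rfind('>', 2) >= 0  # scans from the right
    print(repr(rawdata), end_tag_may_be_complete(rawdata, 0),
          found_left == found_right)
```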

ezio-melotti · 2025-07-02T21:39:16Z

Lib/html/parser.py

@@ -36,29 +36,33 @@
 #     explode, so don't do it.


miss-islington-app · 2025-07-03T20:33:05Z

Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.9, 3.10, 3.11, 3.12, 3.13, 3.14.
🐍🍒⛏🤖

…ng to the HTML5 standard (pythonGH-135930)

* Whitespaces no longer accepted between `</` and the tag name. E.g. `</ script>` does not end the script section.
* Vertical tabulation (`\v`) and non-ASCII whitespaces no longer recognized as whitespaces. The only whitespaces are `\t\n\r\f `.
* Null character (U+0000) no longer ends the tag name.
* Attributes and slashes after the tag name in end tags are now ignored, instead of terminating after the first `>` in quoted attribute value. E.g. `</script/foo=">"/>`.
* Multiple slashes and whitespaces between the last attribute and closing `>` are now ignored in both start and end tags. E.g. `<a foo=bar/ //>`.
* Multiple `=` between attribute name and value are no longer collapsed. E.g. `<a foo==bar>` produces attribute "foo" with value "=bar".
* Whitespaces between the `=` separator and attribute name or value are no longer ignored. E.g. `<a foo =bar>` produces two attributes "foo" and "=bar", both with value None; `<a foo= bar>` produces two attributes: "foo" with value "" and "bar" with value None.
* Fix Sphinx errors.
* Apply suggestions from code review

Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>

* Address review comments.
* Move to Security.

---------

(cherry picked from commit 0243f97)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>

bedevere-app · 2025-07-03T20:33:15Z

GH-136255 is a backport of this pull request to the 3.14 branch.

miss-islington-app · 2025-07-03T20:33:16Z

Sorry, @serhiy-storchaka, I could not cleanly backport this to 3.12 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 0243f97cbadec8d985e63b1daec5d1cbc850cae3 3.12

bedevere-app · 2025-07-03T20:33:20Z

GH-136256 is a backport of this pull request to the 3.13 branch.

miss-islington-app · 2025-07-03T20:33:20Z

Sorry, @serhiy-storchaka, I could not cleanly backport this to 3.11 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 0243f97cbadec8d985e63b1daec5d1cbc850cae3 3.11

miss-islington-app · 2025-07-03T20:33:23Z

Sorry, @serhiy-storchaka, I could not cleanly backport this to 3.10 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 0243f97cbadec8d985e63b1daec5d1cbc850cae3 3.10

miss-islington-app · 2025-07-03T20:33:27Z

Sorry, @serhiy-storchaka, I could not cleanly backport this to 3.9 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 0243f97cbadec8d985e63b1daec5d1cbc850cae3 3.9

…ing to the HTML5 standard (GH-135930) (GH-136255)

* Whitespaces no longer accepted between `</` and the tag name. E.g. `</ script>` does not end the script section.
* Vertical tabulation (`\v`) and non-ASCII whitespaces no longer recognized as whitespaces. The only whitespaces are `\t\n\r\f `.
* Null character (U+0000) no longer ends the tag name.
* Attributes and slashes after the tag name in end tags are now ignored, instead of terminating after the first `>` in quoted attribute value. E.g. `</script/foo=">"/>`.
* Multiple slashes and whitespaces between the last attribute and closing `>` are now ignored in both start and end tags. E.g. `<a foo=bar/ //>`.
* Multiple `=` between attribute name and value are no longer collapsed. E.g. `<a foo==bar>` produces attribute "foo" with value "=bar".
* Whitespaces between the `=` separator and attribute name or value are no longer ignored. E.g. `<a foo =bar>` produces two attributes "foo" and "=bar", both with value None; `<a foo= bar>` produces two attributes: "foo" with value "" and "bar" with value None.

---------

(cherry picked from commit 0243f97)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>

…ing to the HTML5 standard (GH-135930) (GH-136256)

* Whitespaces no longer accepted between `</` and the tag name. E.g. `</ script>` does not end the script section.
* Vertical tabulation (`\v`) and non-ASCII whitespaces no longer recognized as whitespaces. The only whitespaces are `\t\n\r\f `.
* Null character (U+0000) no longer ends the tag name.
* Attributes and slashes after the tag name in end tags are now ignored, instead of terminating after the first `>` in quoted attribute value. E.g. `</script/foo=">"/>`.
* Multiple slashes and whitespaces between the last attribute and closing `>` are now ignored in both start and end tags. E.g. `<a foo=bar/ //>`.
* Multiple `=` between attribute name and value are no longer collapsed. E.g. `<a foo==bar>` produces attribute "foo" with value "=bar".
* Whitespaces between the `=` separator and attribute name or value are no longer ignored. E.g. `<a foo =bar>` produces two attributes "foo" and "=bar", both with value None; `<a foo= bar>` produces two attributes: "foo" with value "" and "bar" with value None.

---------

(cherry picked from commit 0243f97)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>

…according to the HTML5 standard (pythonGH-135930)

* Whitespaces no longer accepted between `</` and the tag name. E.g. `</ script>` does not end the script section.
* Vertical tabulation (`\v`) and non-ASCII whitespaces no longer recognized as whitespaces. The only whitespaces are `\t\n\r\f `.
* Null character (U+0000) no longer ends the tag name.
* Attributes and slashes after the tag name in end tags are now ignored, instead of terminating after the first `>` in quoted attribute value. E.g. `</script/foo=">"/>`.
* Multiple slashes and whitespaces between the last attribute and closing `>` are now ignored in both start and end tags. E.g. `<a foo=bar/ //>`.
* Multiple `=` between attribute name and value are no longer collapsed. E.g. `<a foo==bar>` produces attribute "foo" with value "=bar".
* Whitespaces between the `=` separator and attribute name or value are no longer ignored. E.g. `<a foo =bar>` produces two attributes "foo" and "=bar", both with value None; `<a foo= bar>` produces two attributes: "foo" with value "" and "bar" with value None.
* Fix data loss after unclosed script or style tag (pythongh-86155). Also backport test.support.subTests() (pythongh-135120).

---------

(cherry picked from commit 0243f97)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
Co-authored-by: Waylan Limberg <waylan.limberg@icloud.com>

bedevere-app · 2025-07-04T05:28:19Z

GH-136268 is a backport of this pull request to the 3.12 branch.

serhiy-storchaka requested a review from ezio-melotti as a code owner June 25, 2025 11:46

serhiy-storchaka added the "needs backport to 3.13" (bugs and security fixes) and "needs backport to 3.14" (bugs and security fixes) labels Jun 25, 2025

bedevere-app bot added the awaiting core review label Jun 25, 2025

bedevere-app bot mentioned this pull request Jun 25, 2025

HTMLParser differences from the HTML5 specification #135661

Open

Fix Sphinx errors.

182b16f

ezio-melotti reviewed Jul 2, 2025

serhiy-storchaka and others added 4 commits July 2, 2025 20:17

Merge branch 'main' into htmlparser-tag

436a8a9

Apply suggestions from code review

ebf8ce3

Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>

Merge remote-tracking branch 'refs/remotes/origin/htmlparser-tag' into htmlparser-tag

d05303b

Address review comments.

955db4e

serhiy-storchaka commented Jul 2, 2025

ezio-melotti approved these changes Jul 2, 2025

Thanks!

bedevere-app bot added awaiting merge and removed awaiting core review labels Jul 2, 2025

serhiy-storchaka added 2 commits July 3, 2025 18:22

Merge branch 'main' into htmlparser-tag

66ec1a0

Move to Security.

f38ad41

serhiy-storchaka added the "needs backport to 3.9", "needs backport to 3.10", "needs backport to 3.11" and "needs backport to 3.12" (only security fixes) labels Jul 3, 2025

serhiy-storchaka merged commit 0243f97 into python:main Jul 3, 2025
48 checks passed

bedevere-app bot removed the awaiting merge label Jul 3, 2025

bedevere-app bot removed the "needs backport to 3.14" (bugs and security fixes) label Jul 3, 2025

miss-islington-app bot assigned serhiy-storchaka Jul 3, 2025

bedevere-app bot removed the "needs backport to 3.13" (bugs and security fixes) label Jul 3, 2025

bedevere-app bot removed the "needs backport to 3.12" (only security fixes) label Jul 4, 2025

serhiy-storchaka removed the "needs backport to 3.9", "needs backport to 3.10" and "needs backport to 3.11" (only security fixes) labels Jul 4, 2025

serhiy-storchaka removed their assignment Jul 4, 2025
