gh-102555: Fix comment parsing in HTMLParser #135664
* "--!>" now ends the comment.
* "-- >" no longer ends the comment.
* Support abnormally ended empty comments "<-->" and "<--->".
Co-author: Kerim Kabirov <the.privat33r+gh@pm.me>
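For readers skimming the thread, the three rules in the description can be sketched roughly as below. This is a standalone illustration, not the code from the PR: `split_comment`, `COMMENT_CLOSE` and `ABRUPT_CLOSE` are made-up names, and the sketch assumes the "abnormally ended empty comments" are the inputs `<!-->` and `<!--->`, which is what the HTML standard calls abruptly closed empty comments.

```python
import re

# Illustrative patterns only; the names are made up for this sketch.
COMMENT_CLOSE = re.compile(r'--!?>')   # "-->" or "--!>"
ABRUPT_CLOSE = re.compile(r'-?>')      # ">" or "->" directly after "<!--"


def split_comment(data, i=0):
    """Return (comment_text, end_index), or None if the comment never ends."""
    if not data.startswith('<!--', i):
        raise ValueError('expected "<!--" at index %d' % i)
    start = i + 4
    # Abruptly closed empty comment: "<!-->" or "<!--->".
    match = ABRUPT_CLOSE.match(data, start)
    if match:
        return '', match.end()
    # Normal case: the first "-->" or "--!>" ends the comment.
    match = COMMENT_CLOSE.search(data, start)
    if match is None:
        return None
    return data[start:match.start()], match.end()


for sample in ('<!-- a --!> b', '<!-- a -- > b -->', '<!-->x', '<!-- open'):
    print(repr(sample), '->', split_comment(sample))
```

Note that `-- >` is treated as ordinary comment text here, so the second sample only ends at the final `-->`.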
Solid implementation overall :)
@@ -309,6 +310,21 @@ def parse_html_declaration(self, i):
         else:
             return self.parse_bogus_comment(i)

     # Internal -- parse comment, return length or -1 if not terminated
     # see https://html.spec.whatwg.org/multipage/parsing.html#comment-start-state
     def parse_comment(self, i, report=True):
I wonder if the change should be made in _markupbase instead.
Lines 165 to 175 in c2f2fd4
    def parse_comment(self, i, report=1):
        rawdata = self.rawdata
        if rawdata[i:i+4] != '<!--':
            raise AssertionError('unexpected call to parse_comment()')
        match = _commentclose.search(rawdata, i+4)
        if not match:
            return -1
        if report:
            j = match.start(0)
            self.handle_comment(rawdata[i+4: j])
        return match.end(0)
If the method is overridden here, then there are no other use cases, and the original method becomes dead code.
https://github.com/search?q=repo%3Apython%2Fcpython%20parse_comment&type=code
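To make the behaviour of the quoted base method concrete, here is a small standalone mirror of its logic. It is only a sketch: `legacy_parse_comment` and `legacy_commentclose` are hypothetical names, and the pattern `--\s*>` is an assumption about what `_commentclose` matches (whitespace allowed between `--` and `>`), since the pattern itself is not shown in the quoted lines.

```python
import re

# Assumption: the legacy terminator allows whitespace between "--" and ">".
legacy_commentclose = re.compile(r'--\s*>')


def legacy_parse_comment(rawdata, i=0):
    """Mirror the quoted logic; return (text, end) or -1 if not terminated."""
    if rawdata[i:i+4] != '<!--':
        raise AssertionError('unexpected call to legacy_parse_comment()')
    match = legacy_commentclose.search(rawdata, i + 4)
    if not match:
        return -1
    # The real method reports the text via handle_comment(); the sketch
    # returns it so the result is easy to inspect.
    return rawdata[i+4:match.start(0)], match.end(0)


for sample in ('<!-- a -- > b -->', '<!-- a --!> b -->', '<!-->'):
    print(repr(sample), '->', legacy_parse_comment(sample))
```

Under that assumption, `-- >` ends the comment early, `--!>` is skipped over, and `<!-->` is reported as unterminated, which are the three cases the PR description lists.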
I think it's fine to override the method here. Since _markupbase has been made internal/private in Python 3 and it's only used by html.parser, it makes sense to me to add new code directly to html.parser (and possibly even merge _markupbase into html.parser eventually).
Regarding the (now) dead code, we could let it be, add a comment noting that the method is unused/overridden, or delete it. The first two options are less destructive, but since the module is private there shouldn't be much concern about breaking backward compatibility (and if anyone is relying on the original implementation, they are probably using it through HTMLParser anyway).
_markupbase can be used in third-party code (if it is also the base for the SGML parser or other parsers), so it is better not to touch it in maintained versions. Changing it could break such code if SGML has different rules for comments. But in the development branch we can remove it, after finishing all other bug fixes.
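The trade-off discussed in this thread can be illustrated with a toy class hierarchy. All names below are hypothetical stand-ins rather than the real classes: overriding the method in the HTML-specific subclass changes only that subclass, while any other parser built on the shared base keeps the original rules.

```python
class MarkupBase:
    """Hypothetical stand-in for a shared base parser."""

    def parse_comment(self, text):
        return 'base rules'

    def feed(self, text):
        # Dispatches to whichever parse_comment the concrete class provides.
        return self.parse_comment(text)


class HtmlLikeParser(MarkupBase):
    """Overrides the method, so the base version is unused for HTML."""

    def parse_comment(self, text):
        return 'HTML5 rules'


class SgmlLikeParser(MarkupBase):
    """Hypothetical third-party subclass that still relies on the base method."""


print(HtmlLikeParser().feed('<!-- x -->'))
print(SgmlLikeParser().feed('<!-- x -->'))
```

Editing the base method instead would change the result for both subclasses, which is the compatibility concern raised for maintained versions.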
Misc/NEWS.d/next/Security/2025-06-18-13-28-08.gh-issue-102555.nADrzJ.rst
…nADrzJ.rst Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
Thanks @serhiy-storchaka for the PR 🌮🎉. I'm working now to backport this PR to: 3.9, 3.10, 3.11, 3.12, 3.13, 3.14.
…TML5 standard (pythonGH-135664)
* "--!>" now ends the comment.
* "-- >" no longer ends the comment.
* Support abnormally ended empty comments "<-->" and "<--->".
---------
(cherry picked from commit 8ac7613)
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-author: Kerim Kabirov <the.privat33r+gh@pm.me>
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
GH-136271 is a backport of this pull request to the 3.14 branch.
GH-136272 is a backport of this pull request to the 3.13 branch.
GH-136273 is a backport of this pull request to the 3.12 branch.
GH-136274 is a backport of this pull request to the 3.11 branch.
GH-136275 is a backport of this pull request to the 3.10 branch.
GH-136276 is a backport of this pull request to the 3.9 branch.
…HTML5 standard (GH-135664) (GH-136272)
* "--!>" now ends the comment.
* "-- >" no longer ends the comment.
* Support abnormally ended empty comments "<-->" and "<--->".
---------
(cherry picked from commit 8ac7613)
Co-author: Kerim Kabirov <the.privat33r+gh@pm.me>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
…HTML5 standard (GH-135664) (GH-136271)
* "--!>" now ends the comment.
* "-- >" no longer ends the comment.
* Support abnormally ended empty comments "<-->" and "<--->".
---------
(cherry picked from commit 8ac7613)
Co-author: Kerim Kabirov <the.privat33r+gh@pm.me>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>