A tale of two pull requests: Addendum
- Posted by Michał ‘mina86’ Nazarewicz on 15th of June 2025
In the previous post, I criticised Rust’s contribution process, where a simple patch languished due to communication hurdles. Rust isn’t unique in struggling with its process. This time, the story is about Python.
Parsing HTML in Python
As its name implies, the html.parser module provides interfaces for parsing HTML documents. It offers an HTMLParser base class users can extend to implement their own handling of HTML markup. Of interest to us is the unknown_decl method, which ‘is called when an unrecognised declaration is read by the parser.’ It’s called with an argument containing ‘the entire contents of the declaration inside the <![...]> markup.’ For example:
```python
from html.parser import HTMLParser


class MyParser(HTMLParser):
    def unknown_decl(self, data: str) -> None:
        print(data)


parser = MyParser()

parser.feed('<![if test]>')
# Prints out: if test
# (unless Python 3.13.4+, see below)

parser.feed('<![CDATA[test]]>')
# Prints out: CDATA[test
```
Notice the problem? When used with a CDATA declaration, the behaviour doesn’t quite match the documentation: the argument passed to unknown_decl is missing a closing square bracket. This behaviour makes a simple task unexpectedly difficult. An HTML filter, say one which sanitises user input, would risk corrupting the data by adding the wrong number of closing brackets.
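To make the risk concrete, here is a minimal sketch of how such a filter might put a declaration back together from the argument it receives. Both helper names are hypothetical, and the inputs are simply the two strings printed in the example above, so this illustrates the problem rather than offering a robust fix.

```python
def rebuild_naive(data: str) -> str:
    """Wrap the argument as the documentation implies: '<![' + data + ']>'."""
    return f'<![{data}]>'


def rebuild_cdata_aware(data: str) -> str:
    """Special-case CDATA, whose argument arrives without its own ']'."""
    closing = ']]>' if data.startswith('CDATA[') else ']>'
    return f'<![{data}{closing}'


# The argument unknown_decl received for '<![CDATA[test]]>' in the example above.
data = 'CDATA[test'

print(rebuild_naive(data))        # <![CDATA[test]>   (one bracket short)
print(rebuild_cdata_aware(data))  # <![CDATA[test]]>  (matches the input)
```

The special case only works because the filter knows about the quirk. Any whitespace between the closing brackets of the original markup is still lost.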
In May 2021, I developed and submitted a fix for the issue. However, contributing to Python requires signing a Python Software Foundation contributor license agreement (CLA), which required an account on the bugs.python.org website. The problem is: I never received the activation email.
Eventually, a few days after the submission, a bot tagged the pull request with the ‘CLA signed’ label. That should imply that everything was in order, and the patch was ready to be reviewed and merged. Yet, a year later, the label was manually removed, leaving the PR in limbo with no explanation. Was the CLA signed or not? The system itself seemed to have no consistent answer.
Python 3.13.4
Python 3.13.4 came out last week and changed this particular corner of the code base. CDATA handling is unchanged, but other declarations are now passed to the parse_bogus_comment method, which uses a different matching mechanism.
Ironically, while that solved a different issue users had, the documentation remains incorrect and the CDATA handling is still bizarre, not to say outright broken: unknown_decl is called with unmatched square brackets.
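Since the behaviour depends on the interpreter version, it may be easier to check what a given Python does than to guess. The sketch below records which callback fires for each construct. ProbeParser and probe are hypothetical names, and recording handle_comment rests on my assumption that parse_bogus_comment reports its result through that callback. The output is deliberately not shown because it differs between releases.

```python
from html.parser import HTMLParser


class ProbeParser(HTMLParser):
    """Record which callback receives each declaration-like construct."""

    def __init__(self) -> None:
        super().__init__()
        self.events: list[tuple[str, str]] = []

    def unknown_decl(self, data: str) -> None:
        self.events.append(('unknown_decl', data))

    def handle_comment(self, data: str) -> None:
        # Assumption: bogus comments are reported through this callback.
        self.events.append(('comment', data))


def probe(markup: str) -> list[tuple[str, str]]:
    """Feed markup to a fresh parser and return the recorded events."""
    parser = ProbeParser()
    parser.feed(markup)
    parser.close()
    return parser.events


for markup in ('<![if test]>', '<![CDATA[test]]>'):
    print(markup, '->', probe(markup))
```

Running this on each Python version a filter has to support shows which callback needs handling and what argument it receives.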
Discussion
I’m not fond of CLAs at the best of times, but if a project requires them, the least it could do is make sure that the system for getting them signed works correctly. It is surprising that getting physical paperwork done for my Emacs contributions[1] was easier than getting things done electronically for Python.
There were two differences: the barrier to entry and having someone to follow up on the signing process. To start contributing to Emacs, an email account is sufficient; sending a patch is enough to get the process rolling. In Python, there is the upfront barrier of creating a bugs.python.org account and signing the CLA.
Secondly, the Emacs process had people involved who were ready to follow up. Any confusion I had was addressed, and although the process was slow because it involved the post, it went smoothly. This was not the case in Python, where there was no obvious way to contact someone about problems.
Ultimately, a thriving free software project needs not only quality code but also a healthy community of contributors. Both Python and Rust are phenomenal technical achievements, but these stories show how even giants can stumble on human-scale issues.

[1] It is my understanding that GNU projects which require copyright assignment offer an electronic process now.