Encode named HTML entities with Python
If you’re using Python to parse text that’s going to end up on the web, odds are you need to worry about character entities for Unicode characters.
Obtaining numeric HTML/XML entities is easy enough. A quick Google search turned up several ways to do it, and this is the easiest (text is just an example variable):
text = text.encode('ascii', 'xmlcharrefreplace')
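To make the effect concrete, here is a minimal sketch of what that call produces. It assumes Python 3, where encode returns bytes rather than a str, and the sample string is made up for illustration:

```python
# Sample text containing two non-ASCII characters:
# U+00E9 (e with acute accent) and U+2603 (snowman).
text = 'caf\u00e9 \u2603'

# Every character that ASCII cannot represent becomes a decimal
# numeric character reference of the form &#NNN;
encoded = text.encode('ascii', 'xmlcharrefreplace')

print(encoded)  # b'caf&#233; &#9731;'
```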
However, it is significantly more difficult to find a way to encode text as named character entities, which are vastly preferable if you’re ever going to need to read the markup later. After a lot of digging, I discovered some basic logic for named HTML entities in the Python Cookbook. I improved on it so that every non-ASCII character gets replaced (characters without a named equivalent fall back to numeric entities), and the result may be useful to others in turn. Place this code in a file called named_entities.py somewhere your script can import it (or paste the code at the top of your file, if you only need it in one place). Usage info is in the docstring at the top of the code.
Please note that this code is for Python 2.x. Python 3 moved codepoint2name from htmlentitydefs into the html.entities module, so you’d need to modify the import.
'''
Registers a special handler for named HTML entities
Usage:
import named_entities
text = u'Some string with Unicode characters'
text = text.encode('ascii', 'named_entities')
'''
import codecs
from htmlentitydefs import codepoint2name
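def named_entities(text):
    # text is the exception raised by the codec; it carries the unencodable slice
    if isinstance(text, (UnicodeEncodeError, UnicodeTranslateError)):
        s = []
        for c in text.object[text.start:text.end]:
            if ord(c) in codepoint2name:
                # Named entity, e.g. &eacute;
                s.append(u'&%s;' % codepoint2name[ord(c)])
            else:
                # No named equivalent: fall back to a numeric entity, e.g. &#9731;
                s.append(u'&#%s;' % ord(c))
        return u''.join(s), text.end
    else:
        raise TypeError("Can't handle %s" % type(text).__name__)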
codecs.register_error('named_entities', named_entities)
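For anyone on Python 3, here is a rough sketch of how the same handler might look. This is my own adaptation rather than anything from the Cookbook, and it assumes only the import change described above plus the fact that encode returns bytes in Python 3 (hence the decode at the end). The sample string is hypothetical:

```python
import codecs
from html.entities import codepoint2name  # Python 3 location


def named_entities(error):
    """Error handler: replace unencodable characters with HTML entities."""
    if not isinstance(error, (UnicodeEncodeError, UnicodeTranslateError)):
        raise TypeError("Can't handle %s" % type(error).__name__)
    parts = []
    for ch in error.object[error.start:error.end]:
        cp = ord(ch)
        if cp in codepoint2name:
            # Named entity, e.g. &eacute;
            parts.append('&%s;' % codepoint2name[cp])
        else:
            # No named equivalent: fall back to a numeric entity
            parts.append('&#%d;' % cp)
    # Return the replacement text and the position to resume encoding from
    return ''.join(parts), error.end


codecs.register_error('named_entities', named_entities)

# U+00E9 and U+00A9 have names; U+2603 (snowman) does not.
sample = 'caf\u00e9 \u2603 \u00a9'
encoded = sample.encode('ascii', 'named_entities').decode('ascii')
print(encoded)  # caf&eacute; &#9731; &copy;
```

The decode step is only there to get a str back; skip it if you are writing bytes straight to a file or socket.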
One last thing: whether you use numeric or named entities, you’ll probably want to encode ampersands afterward. Here’s the regex that I’ve been using to do so, and it’s safe to run on the output of either the named or numeric entity creation code (remember to import re before trying this at home):
text = re.sub(r'&(?!([a-zA-Z0-9]+|#[0-9]+|#x[0-9a-fA-F]+);)', '&amp;', text)
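Here is a small sketch of that pattern in action, assuming the intended replacement is &amp;. The function name escape_bare_ampersands and the sample string are made up for illustration. The negative lookahead skips any ampersand that already starts a named, decimal, or hexadecimal entity, so only bare ampersands get escaped:

```python
import re

# Match an ampersand that is NOT the start of an existing entity
# (named like &eacute;, decimal like &#9731;, or hex like &#x2603;).
BARE_AMPERSAND = re.compile(r'&(?!([a-zA-Z0-9]+|#[0-9]+|#x[0-9a-fA-F]+);)')


def escape_bare_ampersands(text):
    """Escape stray ampersands without double-encoding existing entities."""
    return BARE_AMPERSAND.sub('&amp;', text)


sample = 'Tom & Jerry &eacute; &#9731; &#x2603; AT&T'
result = escape_bare_ampersands(sample)
print(result)  # Tom &amp; Jerry &eacute; &#9731; &#x2603; AT&amp;T
```

Because existing entities are left alone, running the function a second time on its own output changes nothing.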
Enjoy!




