Encode named HTML entities with Python

If you’re using Python to parse text that’s going to end up on the web, odds are you need to worry about character entities for Unicode characters.

Obtaining numeric HTML/XML entities is easy enough; I found several different ways to do so in a quick Google search, this being the easiest (text is just an example variable):

text = text.encode('ascii', 'xmlcharrefreplace')
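
As a quick illustration of what that handler produces, here is a minimal sketch. It assumes Python 3, where encode() returns bytes, so the result is decoded back to a string; the sample string is made up for the example.

```python
# Sample text containing e-acute, the copyright sign, and the euro sign.
text = u'caf\xe9 \xa9 \u20ac'

# 'xmlcharrefreplace' swaps every non-ASCII character for &#NNN;
encoded = text.encode('ascii', 'xmlcharrefreplace')

# Under Python 3 the encoded value is bytes; decode to get a str again.
result = encoded.decode('ascii')
print(result)  # caf&#233; &#169; &#8364;
```

The output is valid markup, but the numbers say nothing about which characters they stand for, which is the motivation for the named version.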

However, it is significantly harder to find a way to encode text as named character entities, which are vastly preferable if you’re ever going to need to read the markup later. After a lot of digging, I discovered some basic logic for named HTML entities in the Python Cookbook. I improved on it so that every non-ASCII character gets replaced (characters with no named equivalent fall back to numeric entities), and the result is something that others may find useful in turn. Simply place this code into a file called named_entities.py and put it somewhere that your script can import it from (or just paste the code at the top of your file, if you only need it in one place). Usage info is in the docstring at the top of the code.

Please note that this code is for Python 2.x. In Python 3, codepoint2name lives in the html.entities module, so you’d need to modify the import.

'''
Registers a special handler for named HTML entities

Usage:
import named_entities
text = u'Some string with Unicode characters'
text = text.encode('ascii', 'named_entities')
'''

import codecs
from htmlentitydefs import codepoint2name

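def named_entities(error):
    # Codec error handler: it receives the encoding error, not the text itself.
    if isinstance(error, (UnicodeEncodeError, UnicodeTranslateError)):
        s = []
        for c in error.object[error.start:error.end]:
            if ord(c) in codepoint2name:
                # Use the named entity when one exists, e.g. &eacute;
                s.append(u'&%s;' % codepoint2name[ord(c)])
            else:
                # Otherwise fall back to a numeric entity, e.g. &#1046;
                s.append(u'&#%s;' % ord(c))
        return u''.join(s), error.end
    else:
        # Instances have no __name__ attribute; report the class name instead.
        raise TypeError("Can't handle %s" % type(error).__name__)

codecs.register_error('named_entities', named_entities)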
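
For Python 3, the note above about the changed import applies. The following is only a sketch of how the same handler might look there, not a tested port of the module: it assumes Python 3, takes codepoint2name from html.entities, and decodes the result because encode() returns bytes. The sample string is invented for the example.

```python
import codecs
from html.entities import codepoint2name  # Python 3 home of the lookup table


def named_entities(error):
    """Codec error handler: replace unencodable characters with HTML entities."""
    if not isinstance(error, (UnicodeEncodeError, UnicodeTranslateError)):
        raise TypeError("Can't handle %s" % type(error).__name__)
    pieces = []
    for char in error.object[error.start:error.end]:
        code = ord(char)
        if code in codepoint2name:
            # A named entity exists for this code point.
            pieces.append('&%s;' % codepoint2name[code])
        else:
            # No name available: fall back to a numeric entity.
            pieces.append('&#%d;' % code)
    # Return the replacement and the position where encoding should resume.
    return ''.join(pieces), error.end


codecs.register_error('named_entities', named_entities)

# e-acute and the euro sign have names; Cyrillic Zhe (U+0416) does not.
text = 'caf\xe9 \u20ac \u0416'
encoded = text.encode('ascii', 'named_entities').decode('ascii')
print(encoded)  # caf&eacute; &euro; &#1046;
```

The third character shows the fallback at work: anything outside the named-entity table still comes out as a numeric reference, so the output is always pure ASCII.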

One last thing: whether you use numeric or named entities, you’ll probably want to encode ampersands afterward. Here’s the regex that I’ve been using to do so, and it’s safe to run on the output of either the named or numeric entity creation code (remember to import re before trying this at home):

text = re.sub(r'&(?!([a-zA-Z0-9]+|#[0-9]+|#x[0-9a-fA-F]+);)', '&amp;', text)
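
To see what the negative lookahead does, here is a small sketch, assuming the replacement string is '&amp;'. The names BARE_AMPERSAND and escape_bare_ampersands are hypothetical, introduced only for this example, and the sample string is made up.

```python
import re

# Match an ampersand that does NOT start a named, decimal, or hex entity.
BARE_AMPERSAND = re.compile(r'&(?!([a-zA-Z0-9]+|#[0-9]+|#x[0-9a-fA-F]+);)')


def escape_bare_ampersands(text):
    """Replace stray ampersands with &amp;, leaving existing entities alone."""
    return BARE_AMPERSAND.sub('&amp;', text)


sample = 'Fish & chips &eacute; &#233; &#xE9; AT&T'
escaped = escape_bare_ampersands(sample)
print(escaped)  # Fish &amp; chips &eacute; &#233; &#xE9; AT&amp;T
```

Because existing entities are skipped, running the function a second time changes nothing. The flip side is that literal text which happens to look like an entity (an ampersand, some letters, a semicolon) is also left untouched.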

Enjoy!

5 responses to “Encode named HTML entities with Python”

  1. Martin says:

    Thank you!

  2. This code seems generally useful. I found myself cutting-and-pasting it into other modules, so I packaged it and put it on PyPI. It’s just a ‘pip install namedentities’ away.

    Package: http://pypi.python.org/pypi/namedentities/
    Repository: https://bitbucket.org/jeunice/namedentities

  3. Luis says:

    Beautiful!!! Thank you, very much for sharing this!!!!

  4. Guilherme David da Costa says:

    That code is extremely wonderful. Thank you so much for sharing this. I bumped my head a lot trying to find something useful like this. Glad I asked on Stack Overflow and another person answered with this page for me. I pasted the code there in case you go offline.

  5. Bruno says:

    You saved me a lot of time.
    Thank you very much!
