Encode named HTML entities with Python

Edit [June 29, 2021]: This post evidently has quite a high Google ranking for certain keywords, despite the code being ancient and written for a version of Python that is no longer supported. You can review my original post below, but here’s some Python 3 that will likely treat you better:

#!/usr/bin/env python3
from html.entities import codepoint2name

def encode_named_entities(text, convert_less_than=False, convert_greater_than=False):
    """Converts non-ASCII characters into HTML entities

    By default, all non-ASCII characters and ampersand will be converted to named
    entities, falling back on numeric entities.

    Less than and greater than characters can be optionally included with the
    keyword arguments.

    Returns the modified string.

    USAGE:
        text = "I love <b>jalapeños & fun</b> ☜!"
        entity_text = encode_named_entities(text)
        # "I love <b>jalape&ntilde;os &amp; fun</b> &#9756;!"
        entity_text = encode_named_entities(text, convert_less_than=True)
        # "I love &lt;b>jalape&ntilde;os &amp; fun&lt;/b> &#9756;!"
    """
    new_text_list = []
    for character in text:
        code_point = ord(character)
        # ASCII characters are 0-127
        if code_point < 128:
            # Always escape ampersands; escape angle brackets only if requested
            if character == "&":
                new_text_list.append("&amp;")
            elif character == "<" and convert_less_than:
                new_text_list.append("&lt;")
            elif character == ">" and convert_greater_than:
                new_text_list.append("&gt;")
            else:
                new_text_list.append(character)
        else:
            # For all other characters, try to convert to named entity
            try:
                new_text_list.append(f"&{codepoint2name[code_point]};")
            except KeyError:
                # And fall back to a numeric entity
                new_text_list.append(f"&#{code_point};")
    return "".join(new_text_list)

Original post

If you’re using Python to parse text that’s going to end up on the web, odds are you need to worry about character entities for Unicode characters.

Obtaining numeric HTML/XML entities is easy enough; I found several different ways to do so in a quick Google search, this being the easiest (text is just an example variable):

text = text.encode('ascii', 'xmlcharrefreplace')
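
Note that in Python 3, encode() returns bytes rather than a string, so you will likely want to decode the result again. A minimal sketch (the sample string and the variable names are just for illustration):

```python
text = "jalapeños ☜"

# Every non-ASCII character becomes a numeric character reference;
# the result is a bytes object containing only ASCII
encoded = text.encode("ascii", "xmlcharrefreplace")

# Decode back to a regular string for further processing
numeric_text = encoded.decode("ascii")
print(numeric_text)  # jalape&#241;os &#9756;
```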

However, it is significantly more difficult to find a way to encode text as named character entities, which are vastly preferable if you will ever need to read the markup later. After a lot of digging, I discovered some basic logic for named HTML entities in the Python Cookbook. I improved on it so that every non-ASCII character gets replaced (even those without a named equivalent, which fall back to numeric entities), and ended up with something that others may find useful in turn. Simply place this code into a file called named_entities.py and put it somewhere that your script can find it (or paste the code at the top of your file, if you only need it in one place). Usage info is in the docstring at the top of the code.

Please note that this code is for Python 2.x. Python 3 moved codepoint2name into html.entities.codepoint2name, so you’d need to modify the import.

'''
Registers a special handler for named HTML entities

Usage:
import named_entities
text = u'Some string with Unicode characters'
text = text.encode('ascii', 'named_entities')
'''

import codecs
from htmlentitydefs import codepoint2name

def named_entities(text):
    if isinstance(text, (UnicodeEncodeError, UnicodeTranslateError)):
        s = []
        for c in text.object[text.start:text.end]:
            if ord(c) in codepoint2name:
                s.append(u'&%s;' % codepoint2name[ord(c)])
            else:
                s.append(u'&#%s;' % ord(c))
        return ''.join(s), text.end
    else:
        # Instances have no __name__ attribute, so look it up on the type
        raise TypeError("Can't handle %s" % type(text).__name__)
codecs.register_error('named_entities', named_entities)
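
If you would rather keep the error-handler approach on Python 3, a port might look roughly like the sketch below. This is my own adaptation rather than tested production code: it swaps the import for html.entities, uses f-strings, and decodes the bytes that encode() returns. The sample string is only an example.

```python
import codecs
from html.entities import codepoint2name


def named_entities(error):
    """Codec error handler: swap unencodable characters for HTML entities."""
    if not isinstance(error, (UnicodeEncodeError, UnicodeTranslateError)):
        raise TypeError(f"Can't handle {type(error).__name__}")
    pieces = []
    # error.object is the original string; start/end mark the failing slice
    for char in error.object[error.start:error.end]:
        code_point = ord(char)
        if code_point in codepoint2name:
            pieces.append(f"&{codepoint2name[code_point]};")
        else:
            # No named equivalent, so fall back to a numeric entity
            pieces.append(f"&#{code_point};")
    # Return the replacement text and the position to resume encoding from
    return "".join(pieces), error.end


codecs.register_error("named_entities", named_entities)

text = "I love jalapeños ☜!"
# encode() returns bytes in Python 3, so decode to get a string back
entity_text = text.encode("ascii", "named_entities").decode("ascii")
print(entity_text)  # I love jalape&ntilde;os &#9756;!
```

As with the Python 2 version, ASCII characters such as ampersands never trigger the handler, so they pass through untouched.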

One last thing: whether you use numeric or named entities, you’ll probably want to encode ampersands afterward. Here’s the regex that I’ve been using to do so, and it’s safe to run on the output of either the named or numeric entity creation code (remember to import re before trying this at home):

text = re.sub('&(?!([a-zA-Z0-9]+|#[0-9]+|#x[0-9a-fA-F]+);)', '&amp;', text)
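
To illustrate what that regex does and does not touch, here is a small self-contained sketch. The sample string and the BARE_AMPERSAND name are made up for the example:

```python
import re

# An ampersand that is NOT already the start of a named, decimal,
# or hexadecimal entity
BARE_AMPERSAND = re.compile(r"&(?!([a-zA-Z0-9]+|#[0-9]+|#x[0-9a-fA-F]+);)")

text = "fun & games &ntilde; &#9756; &#x261C; AT&T"
escaped = BARE_AMPERSAND.sub("&amp;", text)
print(escaped)  # fun &amp; games &ntilde; &#9756; &#x261C; AT&amp;T
```

One caveat: anything that merely looks like an entity (for example "R&D;") is left alone, because the regex checks only the shape of the text and not whether the entity name actually exists.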

Enjoy!

7 responses to “Encode named HTML entities with Python”


  1. Martin says:

    Thank you!

  2. This code seems generally useful. I found myself cutting-and-pasting it into other modules, so I packaged it and put it on PyPI. It’s just a ‘pip install namedentities’ away.

    Package: http://pypi.python.org/pypi/namedentities/
    Repository: https://bitbucket.org/jeunice/namedentities

  3. Luis says:

    Beautiful!!! Thank you, very much for sharing this!!!!

  4. Guilherme David da Costa says:

    That code is extremely wonderful. Thank you so much for sharing this. I banged my head a lot trying to find something useful like this. Glad I asked on Stack Overflow and another person answered with this page for me. I pasted the code there in case you go offline.

  5. Bruno says:

    You saved me a lot of time.
    Thank you very much!

  6. Ismael says:

    Well, it’s great, but it seems a waste to ignore ampersand, less than, and greater than (“&”, “<” and “>”). I certainly understand why, but wouldn’t it be better to have an option to smash those also? You have “escape=True”, so why not “unsafe=True”, or something along those lines? Kudos either way. And thanks.

    • Ian Beck says:

      Sure, you certainly could do that, though not with the built-in text encoding error handling that this approach relies on. Ampersand, greater than, and less than are plain ASCII, so encoding them never raises an error and the handler never sees them. You’d need a standalone utility function that accepts a string and does the conversion internally. Perhaps something like:

      
      #!/usr/bin/env python3
      from html.entities import codepoint2name
      
      def encode_named_entities(text, convert_less_than=False, convert_greater_than=False):
          """Converts non-ASCII characters into HTML entities
      
          By default, all non-ASCII characters and ampersand will be converted to named
          entities, falling back on numeric entities.
      
          Less than and greater than characters can be optionally included with the
          keyword arguments.
      
          Returns the modified string.
      
          USAGE:
              text = "I love <b>jalapeños & fun</b> ☜!"
              entity_text = encode_named_entities(text)
              # "I love <b>jalape&ntilde;os &amp; fun</b> &#9756;!"
              entity_text = encode_named_entities(text, convert_less_than=True)
              # "I love &lt;b>jalape&ntilde;os &amp; fun&lt;/b> &#9756;!"
          """
          new_text_list = []
          for character in text:
              code_point = ord(character)
              # ASCII characters are 0-127
              if code_point < 128:
                  # Always escape ampersands; escape angle brackets only if requested
                  if character == "&":
                      new_text_list.append("&amp;")
                  elif character == "<" and convert_less_than:
                      new_text_list.append("&lt;")
                  elif character == ">" and convert_greater_than:
                      new_text_list.append("&gt;")
                  else:
                      new_text_list.append(character)
              else:
                  # For all other characters, try to convert to named entity
                  try:
                      new_text_list.append(f"&{codepoint2name[code_point]};")
                  except KeyError:
                      # And fall back to a numeric entity
                      new_text_list.append(f"&#{code_point};")
          return "".join(new_text_list)
      
