TextSoap 6 and my XHTML Suite of custom cleaners

In case you hadn’t heard, TextSoap was updated to version 6.0 a few weeks ago. I’ve waited on posting about it because I wanted to share some of my custom cleaners (you can jump straight to the download if you’re so inclined), and now I’ve finally found the time.

For those who know about TextSoap, version 6.0’s main benefit (at least from my point of view) is a vastly redesigned custom cleaner editor. Cleaners can now run text through sub-routines, there’s a quick regex reference right in the window, and you can attach notes to cleaner actions to remind yourself what the heck that complicated regex pattern is supposed to be doing. There are other improvements, as well, but the custom cleaners interface is where it’s at for me. If you’re curious, check out the release notes for the full scoop.

For those not in the know, TextSoap is a fantastic piece of software that lets you make changes to both plain and rich text, using either the built-in cleaners or custom cleaners that you define yourself by combining regular expressions and any of the built-in cleaners with familiar Automator-style rules.

Yeah, I know, it doesn’t sound too impressive, does it? But that’s only because you’re used to wasting a lot of your time on mindless repetitive tasks involving text. TextSoap not only provides an easy way to save sets of common text-based find and replace actions, but it allows you access to them from pretty much anywhere on your computer by integrating with popular programs via plugins, offering a system-wide contextual menu, or hanging out in the Services menu.

When I first bought TextSoap, I regretted it because I barely ever used it (this was back in version 4.0, I think). Then one day I was doing something incredibly repetitive with text (I don’t even remember what), and I got fed up, launched TextSoap, and took a look at the custom cleaners. I’ve never looked back. Although the most powerful custom cleaners require knowledge of regular expressions, there are still hundreds of things you can do without ever worrying about regex simply by combining TextSoap’s provided cleaners with the building blocks available in custom cleaners. TextSoap provides an approach to text manipulation that has saved me hundreds of hours of drudgery.

Over time, I’ve found that the custom cleaners I create tend to fall into two categories:

  1. Cleaners that address specific problems that either recur or only happen once but require the same actions repeated a bunch of times in that sitting. For instance, for one client I have to convert a Word document into a newsletter every two weeks, observing their byzantine rules for HTML formatting. The first time I did it, it took a mind-numbing four hours. The second time, I created a custom cleaner while I worked and it took me two. The third time all I had to do was use the custom cleaner, and it took me one. With practice, I’m now down to about forty minutes.
  2. Cleaners that address generic recurring actions. These are cleaners that I’ve slowly tweaked over time, and now use primarily as building blocks for my task-specific cleaners.

It’s this last type of cleaner that I would like to share with you.

My XHTML suite of custom cleaners

My main use for TextSoap is manipulating HTML, and because I know a lot of other people out there have to do this on a regular basis I’ve decided to share the basic cleaners that serve as the foundation for my workflow. TextSoap has revolutionized how I perform certain tasks (particularly converting styled text to HTML and converting really hideous HTML into tasty XHTML), and I strongly recommend it to any web junkie who has cursed out a previous developer for their table-filled monstrosity of a website. Before I get into the nitty-gritty details of what’s included, here’s the download:

Download TextSoap XHTML Suite

Included are eight custom cleaners (if you’re only interested in one or two, see the ReadMe for details on which cleaners require one of their brethren):

  • Encode Ampersands. This encodes every ampersand that isn’t already part of an HTML entity.
  • Escape Single Quotes. Primarily useful for JavaScript, PHP, etc., this escapes all single quotes with a backslash.
  • HTML Curly Quotes. For those clients who must have curly quotes, this is your solution. It converts every quote outside of HTML tags into a curly. (Please note: only works for English curly quotes.)
  • HTML Paragraphs. This converts text blocks separated by double line breaks into paragraphs, and converts single line breaks to <br /> tags.
  • Style to HTML. One of my workhorses, this cleaner takes richly formatted text and turns it into simple, paragraph-delineated HTML with appropriately placed strong and em tags.
  • URLs to HTML Links. This cleaner finds all of the easily recognizable URLs in a document (starting with http, https, or www) and converts them into HTML anchor links.
  • WebHappy. This happy little cleaner simply converts richly styled bold and italics into strong and em tags, straightens all quotes, and converts any problematic characters into HTML entities.
  • XHTML Cleaner. This is a pretty hefty cleaner, and I run it by default on any HTML that needs serious love to turn into XHTML. The cleaner performs a laundry list of common tasks (properly closing self-closing tags, b to strong, lowercased tag and attribute names, etc.) and also attempts to add and remove line breaks so that you can easily indent the code in your favorite editor (like TextMate). I rarely use XHTML Cleaner directly, but it offers a great starting point for any custom cleaner that needs to deal with poorly written HTML.
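
If you don’t have TextSoap handy and want a feel for what a few of the simpler cleaners do, here’s a rough Python sketch of Encode Ampersands, Escape Single Quotes, HTML Paragraphs, and URLs to HTML Links. To be clear, this is not how the TextSoap cleaners in the download are built: the function names and the regex patterns are assumptions made purely for illustration, and the real cleaners may handle edge cases differently.

```python
import re

# Hypothetical pattern: an "&" that does NOT already begin a named (&amp;),
# decimal (&#38;), or hexadecimal (&#x26;) entity.
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

# Hypothetical pattern: anything starting with http://, https://, or www.
# and running until whitespace, a quote, or an angle bracket.
URL = re.compile(r"\b(?:https?://|www\.)[^\s<>\"']+")


def encode_ampersands(text: str) -> str:
    """Encode every ampersand that isn't already part of an entity."""
    return BARE_AMPERSAND.sub("&amp;", text)


def escape_single_quotes(text: str) -> str:
    """Put a backslash in front of every single quote."""
    return text.replace("'", "\\'")


def html_paragraphs(text: str) -> str:
    """Blank-line-separated blocks become <p>; single breaks become <br />."""
    text = text.replace("\r\n", "\n").replace("\r", "\n").strip()
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        block = block.strip()
        if block:
            paragraphs.append("<p>" + block.replace("\n", "<br />\n") + "</p>")
    return "\n\n".join(paragraphs)


def urls_to_links(text: str) -> str:
    """Wrap easily recognizable URLs in anchor tags."""

    def make_link(match: re.Match) -> str:
        url = match.group(0)
        trailing = ""
        # Keep sentence punctuation outside of the link.
        while url and url[-1] in ".,;:!?)":
            trailing = url[-1] + trailing
            url = url[:-1]
        href = url if url.startswith("http") else "http://" + url
        return f'<a href="{href}">{url}</a>{trailing}'

    return URL.sub(make_link, text)


if __name__ == "__main__":
    sample = "Tom & Jerry &amp; friends\nvisit www.example.com.\n\nSecond block"
    print(html_paragraphs(urls_to_links(encode_ampersands(sample))))
```

Chaining the functions the way the last line does is roughly the “building block” idea: each step stays small, and a task-specific cleaner is just a particular sequence of them. Note that this sketch of `urls_to_links` would also re-link URLs that already sit inside an href attribute, so it’s only safe on text that hasn’t been linked yet.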

Although some of these cleaners are great on their own (I have a special place in my workflow for WebHappy, for instance, even if I never did think of a good descriptive name for it), a lot of them work best as the starting place for your own task-specific custom cleaners. I’ve tried to add notes to all of the regex rules, as well, so they may help you figure out how best to perform your own tasks (keep in mind that some of these cleaners were developed while I was still figuring regex out, so some of those regular expressions are narsty). If you improve or otherwise modify any of the core suite of cleaners, drop me a line because I’d love to see what you’ve done.


3 responses to “TextSoap 6 and my XHTML Suite of custom cleaners”


  1. Sandy ONeil says:

    I love your comments on TextSoap. Mark is my brother and I am very proud of him. Have a great Thanksgiving!

  2. Ian,

    You were kind enough to send me your custom cleaner HTML Curly Quotes after I posted to the TextSoap forum (unless it was the MarsEdit forum…) about needing something that would smarten all quotes and apostrophes, except those in HTML tags.

    It has worked so well for me. I use it every day.


  3. Steven Black says:

    This is awesome, thank you.

    By the way, all these still work with TextSoap 7.3.7 too. Not bad for code written nearly 5 years ago!
