Locating Strange Characters in Word Documents with Regular Expressions

I recently received a Word document containing some strange characters: instead of rendering as one accented character, each rendered as an unaccented character with a box after it. It was some unusual Unicode encoding, and it was a bit of a problem because I needed to make sure there were no more such characters anywhere in this 300-page document.

It seemed that if I saved the document as RTF and then opened it again in Word, it automatically converted the problematic characters. But since this is a book in the late stages, meaning it will not be proofread again, “seemed” really wasn’t good enough. I wanted to know. But I couldn’t search for the characters, because I didn’t know what they were. I needed to search for “any characters that are not the ones I allow in this document.”

Word has fairly good “wildcard” searching, but it starts to have trouble with more complex search expressions, especially complex “not” expressions. I live and die by PowerGREP, which I use for all kinds of things. It is an amazing, but somewhat expensive, tool for text processing that goes miles beyond plain GREP. Plain GREP, as anyone from the Unix world knows, allows you to create complex search expressions, called “regular expressions,” that let you effectively do thousands of search combinations in just one search. PowerGREP is GREP integrated into an application that makes all found occurrences of your search string clickable (i.e. click on one and it takes you to that line in an editor), lets you select files with great precision and ease, and lets you chain many regular expressions together and run them in one go. I’ve been using it since at least 2005, as you can see from my user testimonial (that link is an affiliate link, by the way: if you buy through that link, I get a commission).

In my case the document is in French, with a few special formatting characters, so this is the “whitelist” of characters I want to allow:
[^A-Z}{*@\&\~+#a-z0-9ÉÀàùéèôûêîçÇë()\[\].\/,;:»«'”\?!–—\-_\s=]

As regular expressions go, this is actually one of the most trivial and straightforward that you can use. There are only a few tricky things. The outer [] define the character set, the ^ negates it (meaning, find any character except the ones listed), the \ “escapes” a character that would otherwise have a special meaning, and some sequences have special meanings of their own, like “\s”, which means “any whitespace” (space, non-breaking space, tab, return, etc.).
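For anyone without PowerGREP, the same whitelist idea can be sketched in a few lines of Python. This is only an illustration, assuming the document has been exported to plain text first. The names NOT_ALLOWED and find_disallowed are my own, and the brackets and hyphen inside the character class are escaped so that Python reads them literally.

```python
import re

# Negated character class: matches any single character NOT on the whitelist.
# Inside [...] the square brackets and the hyphen are escaped so they are
# taken literally; \s covers whitespace (in Python 3 this includes the
# non-breaking space).
NOT_ALLOWED = re.compile(
    r"[^A-Za-z0-9ÉÀàùéèôûêîçÇë{}*@&~+#()\[\]./,;:»«'”?!–—\-_\s=]"
)

def find_disallowed(text):
    """Return (line number, column, character, code point) for each hit."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in NOT_ALLOWED.finditer(line):
            ch = match.group()
            hits.append((line_no, match.start() + 1, ch, f"U+{ord(ch):04X}"))
    return hits

# Hypothetical sample: "a" followed by a combining circumflex (U+0302)
# instead of the single precomposed character.
sample = "Les cha\u0302teaux de la Loire\nRien \u00e0 signaler ici."
for hit in find_disallowed(sample):
    print(hit)
```

Reporting the code point alongside the position is deliberate: the offending character often looks like a box or nothing at all, so its number is the only reliable way to identify it.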

Pretty quickly this finds most things that don’t belong, but it won’t find the odd Unicode characters rendering as two characters, because they display as two standard characters. So once most of the cleanup is done, you want to find characters that are stored not as a single standard Unicode character but as a composite (a base letter plus a separate accent). For example, with the standard encoding, if you look for the word Châteaux in the RTF file using a simple text editor (so no conversion and rendering of the RTF as formatted text, just the underlying code), it shows in the source code simply as Châteaux, just as expected. In the file I received from a colleague, however, it showed instead as Cha\u770\'5e\loch\f0 teaux with all that extra stuff (770 is the decimal value of U+0302, the combining circumflex accent). The \loch and \f0 codes will show up thousands of times, so we ignore those. But anything that shows up as \u followed by some digits is one of these Unicode characters that is not rendered as a single character in RTF. So we can catch all those with a dead simple regex:

\\u\d+ (a backslash, a u, and one or more digits)

If that gives false positives, you can refine it to

[^\\]\\u\d{2,4}

This skips anything preceded by an escaped backslash, and since the Unicode values I was chasing have two to four digits, it does not return single-digit cases (note that a five-digit number would still match on its first four digits). I exclude the escaped backslash only because if I have a literal string like \u770 in my text, it would show in the RTF file as \\u770. So if my initial example above were typed as literal text, it would go into the code as Cha\\u770\\'5e\\loch\\f0 teaux.
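The same search can be sketched in Python against the raw RTF source. This is a rough illustration rather than a full RTF parser: the names RTF_UNICODE and list_rtf_unicode are made up for the example, and I have swapped the [^\\] prefix for a lookbehind and dropped the digit limit, so the pattern is a variation on the ones above, not a copy.

```python
import re
import unicodedata

# A literal backslash must itself be escaped in the pattern, hence \\u.
# (?<!\\) skips "\\u", which is how RTF stores a backslash typed in the text.
# -? allows the negative values RTF uses for code points above 32767.
RTF_UNICODE = re.compile(r"(?<!\\)\\u(-?\d+)")

def list_rtf_unicode(rtf_source):
    """Return (offset, code point, Unicode name) for each \\uN control word."""
    found = []
    for match in RTF_UNICODE.finditer(rtf_source):
        value = int(match.group(1))
        if value < 0:  # RTF writes these values as signed 16-bit numbers
            value += 65536
        name = unicodedata.name(chr(value), "unnamed")
        found.append((match.start(), f"U+{value:04X}", name))
    return found

# Hypothetical sample: one real \u770 and one literal, escaped \\u770.
sample = r"Cha\u770\'5e\loch\f0 teaux and a literal \\u770 in the text"
for item in list_rtf_unicode(sample):
    print(item)
```

Looking up the Unicode name turns a bare number like 770 into something readable, which makes it easier to decide whether a hit is one of the composite accents or something harmless.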

If you do a lot of complex search and replace (or just searches) in text, including computer programs and the like, PowerGREP is absolutely worth the money. You can download a free trial that runs for 90 days. I suspect at the end of 90 days most people will decide that they have to have it in their toolkit or that they have no use for it whatsoever, depending on the kind of work they do.

Alternatively, you can try one of these free tools. I have not used them and only know what you see on their pages:
BareGrep, which is ad-supported, though for $25 you can turn off the ads. BareGrepPro ($35) adds some of the features of PowerGREP, though it looks like it’s still much simpler in its capabilities.

Grep for Windows is free, and you can have the registered version for a mere $5, but its author really needs a better website with some screenshots or something, because it’s hard to see what it actually does and how much of the regular expression syntax it supports. If you’re on a budget, though, it might be worth a look.

You can also get the standard Unix grep utility running on Windows, but it’s quite basic and, how shall I say this, if you don’t already use it and know where to get it, it’s probably not what you’re after.
