Locating Strange Characters in Word Documents with Regular Expressions

19 Feb 2011 (updated 14 Nov, 2019)

I recently got a Word document from someone that had some strange characters that, instead of rendering as one accented character, rendered as an unaccented character with a box after it. It was some strange Unicode encoding and it was a bit of a problem, because I needed to make sure there were no more such characters in this 300-page document.

It seemed that if I saved the document as RTF and then opened it again in Word, it automatically converted the problematic characters. But since this is a book in the late stages, meaning it will not be proofread again, “seemed” really wasn’t good enough. I wanted to know. But I couldn’t search for the characters, because I didn’t know what they were. I needed to search for “any characters that are not the ones I allow in this document.”

Word has fairly good “wildcard” searching, but it starts to have trouble with more complex search expressions, especially complex “not” expressions. I live and die by PowerGREP, which I use for all kinds of things. It is an amazing, but somewhat expensive, tool for text processing that goes miles beyond plain GREP. Plain GREP, as anyone from the Unix world knows, allows you to create complex search expressions, called “regular expressions,” that let you effectively do thousands of search combinations in just one search. PowerGREP is GREP integrated into an application that makes all found occurrences of your search string clickable (i.e., click on one and it takes you to that line in an editor), lets you select files with great precision and ease, and lets you chain many regular expressions together and run them in one go. It is an amazing tool and I’ve been using it since at least 2005, as you can see from my user testimonial (that link is an affiliate link, by the way: if you buy through that link, I get a commission).

In my case the document is in French, with a few special formatting characters, so I put the “whitelist” of characters I want to allow inside a negated character set, which matches any character that is not on the list:
[^A-Z}{*@\&\~+#a-z0-9ÉÀàùéèôûêîçÇë()\[\].\/,;:»«'”\?!–—\-_\s=]

As regular expressions go, this is actually one of the most trivial and straightforward that you can use. There are only a few tricky things. The [] define the character set, the ^ negates it (meaning: find any character except the ones listed), the \ “escapes” a character that would otherwise have a special meaning, and some sequences have special meanings of their own, like “\s”, which means “any whitespace” (space, non-breaking space, tab, return, etc.).
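For anyone without PowerGREP, here is a rough sketch of the same negated-whitelist idea in Python. It is only an illustration, not the search I actually ran: the names ALLOWED, NOT_ALLOWED and find_disallowed are made up for the example, the whitelist is a lightly adapted copy of mine (brackets and hyphen escaped for Python's re module), and it assumes you are working on plain text that has already been extracted from the document.

```python
import re
import unicodedata

# Hypothetical whitelist modeled on the one in this post; adjust it to your document.
ALLOWED = r"A-Za-z0-9ÉÀàùéèôûêîçÇë}{*@&~+#()\[\]./,;:»«'”?!–—\-_\s="

# Negated character set: matches any single character NOT on the whitelist.
NOT_ALLOWED = re.compile(f"[^{ALLOWED}]")


def find_disallowed(text):
    """Return (line number, column, character, Unicode name) for each hit."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for m in NOT_ALLOWED.finditer(line):
            ch = m.group()
            name = unicodedata.name(ch, "UNKNOWN")
            hits.append((line_no, m.start() + 1, ch, name))
    return hits


# The first line hides a combining circumflex (U+0302) after the "a".
sample = "Un cha\u0302teau\nTout va bien ici."
for hit in find_disallowed(sample):
    print(hit)
```

Printing the Unicode name of each hit is the useful part: it tells you what the mystery character actually is, which is exactly what you can't see on screen.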

Pretty quickly this finds most things that don’t belong, but it won’t find the odd Unicode characters that render as two characters, because each of the two pieces looks like a standard character on its own. So once most of the cleanup is done, you want to find characters that are stored not as a single accented character but as a composite: a plain letter followed by a separate combining accent. For example, with standard Unicode encoding, if you look for the word Châteaux in the RTF file using a simple text editor (so no conversion and rendering of the RTF as formatted text, but showing the underlying code), it will show in the source code simply as Châteaux, just as expected. In the file I received from a colleague, however, it showed instead as Cha\u770\'5e\loch\f0 teaux with all that extra stuff. The \u770 is the accent on its own (770 is the decimal number of the Unicode combining circumflex, U+0302). The \loch and \f0 codes will show up thousands of times, so we ignore those. But anything that shows up as \u and then some digits is one of these Unicode characters that is not rendered as a single character in RTF. So we can catch all those with a dead simple regex:

\\u\d+ (a backslash, then u, then any number of digits; the backslash is doubled because a single one is the regex escape character)

If that gives false positives, you can refine it to

[^\\]\\u\d{2,4}

This skips anything preceded by an escaped backslash, and since the codes for the characters in this document all have 2-4 digits, it also skips one-digit cases. (A longer number would still match on its first four digits.) We exclude the escaped backslash only because if I have a literal string like \u770 in my text, it would show in the RTF file as \\u770. In my initial example above, by contrast, the accent goes into the code as Cha\u770\'5e\loch\f0 teaux.
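The same search can be sketched in Python, with the bonus that you can decode each number and see which character it is. Treat this as a minimal sketch under a few assumptions: RTF_UNICODE and find_unicode_escapes are names invented for the example, the RTF source is already loaded as a string, and the "not escaped" check is the simple one (a lookbehind for a single preceding backslash), so unusual sequences such as an escaped backslash directly followed by a real \u code would be missed. RTF writes the number as a signed 16-bit decimal, so the sketch also accepts negative values.

```python
import re
import unicodedata

# A backslash that is not itself preceded by a backslash, then "u" and a
# signed decimal number of up to five digits.
RTF_UNICODE = re.compile(r"(?<!\\)\\u(-?\d{1,5})")


def find_unicode_escapes(rtf_source):
    """Return (offset, escape text, character, Unicode name) for each escape."""
    results = []
    for m in RTF_UNICODE.finditer(rtf_source):
        # Map the signed 16-bit value into the range 0..65535.
        code = int(m.group(1)) % 65536
        ch = chr(code)
        name = unicodedata.name(ch, "UNKNOWN")
        results.append((m.start(), m.group(), ch, name))
    return results


# One real escape, plus a literal "\u770" that RTF stores with a doubled backslash.
sample = r"Cha\u770\'5e\loch\f0 teaux and a literal \\u770 in the text"
for hit in find_unicode_escapes(sample):
    print(hit)

# The decoded accent merges with the letter before it under NFC normalization.
composed = unicodedata.normalize("NFC", "Cha\u0302teaux")
print(composed == "Ch\u00e2teaux")
```

The last two lines show why the text looks almost right on screen: the letter and the accent are two separate characters until something normalizes them into one.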

If you do a lot of complex search and replace (or just searches) in text, including computer programs and the like, PowerGREP is absolutely worth the money. You can download a free trial that runs for 90 days. I suspect at the end of 90 days most people will decide that they have to have it in their toolkit or that they have no use for it whatsoever, depending on the kind of work they do.

Alternatively, you can try one of these that are free. I have not used them and only know what you see on their pages:
BareGrep, which is ad-supported, though for $25 you can turn off the ads. BareGrepPro ($35) adds some of the features of PowerGREP, though it looks like it’s still much simpler in its capabilities.

Grep for Windows is free and you can have the registered version for a mere $5, but its author really needs a better website with some screenshots or something, because it’s hard to see what it actually does and how much of the regular expression syntax it supports. If you’re on a budget though, it might be worth a look.

You can also get the standard Unix grep utility running on Windows, but it’s quite basic and, how shall I say this, if you don’t already use it and know where to get it, it’s probably not what you’re after.

Tom Lambert

