Archive for January, 2007

How to hire Guillaume Portes

Sunday, January 7th, 2007

How to hire Guillaume Portes, or what you had better avoid if you really intend to write an open standard. This is about OOXML, by the way [via p.g.o].

PHP, Perl regular expressions, umlauts, unicode and you

Sunday, January 7th, 2007

Never trust your users. Verify all input they provide before you rely on it. You probably know this rule from web development. Yesterday I needed to validate some German text input. In PHP, you can easily use Perl-compatible regular expressions with the preg_match function. So let’s try this one:

  /^[\w\s,.;?!:]{10,}$/

There is a problem with this kind of regular expression, though: German text contains umlauts and the sharp s (ü, ä, ö, ß), and these won’t match. First I tried switching to a different locale using the setlocale function. This produced some happiness, as lowercase umlauts matched. However, uppercase umlauts didn’t, and I don’t know why. Passing the i flag (to force case-insensitivity) to the regular expression engine did not help either. Let’s use a more explicit approach then and list the characters ourselves:

  /^[\w\säöüÄÖÜß,.;?!:]{10,}$/

Cool, isn’t it? But it doesn’t work. At this point, I should mention that the character encoding we were using was Unicode (UTF-8). I just wasn’t aware that this could be an issue. An hour of research later, I finally found what I was looking for: the u flag, a modifier specific to PCRE, the Perl-compatible regular expression library that PHP uses. It tells the engine to treat both the pattern and the subject string as UTF-8. And that worked. At last.

  preg_match('/^[\w\säöüÄÖÜß,.;?!:]{10,}$/u', $value);

Voilà.
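To try the final expression in isolation, here is a minimal sketch that wraps it in a small helper. The function name is_valid_german_text is hypothetical and not part of the original code, and the sketch assumes PHP 7 or later, a source file saved as UTF-8, and input that is supposed to be UTF-8 as well. One detail worth handling: with the u flag, preg_match returns false rather than 0 when the input is not valid UTF-8, so the helper accepts only an explicit 1.

```php
<?php
// Hypothetical helper for illustration; the name is not from the original post.
// Assumes this file is saved as UTF-8, so the umlauts in the pattern are UTF-8 too.
function is_valid_german_text(string $value): bool
{
    // The u flag makes PCRE treat both the pattern and $value as UTF-8.
    $result = preg_match('/^[\w\säöüÄÖÜß,.;?!:]{10,}$/u', $value);

    // preg_match returns 1 on a match, 0 on no match, and false on error
    // (for example when $value is not valid UTF-8). Only 1 counts as valid.
    return $result === 1;
}

var_dump(is_valid_german_text('Grüße aus München!'));            // bool(true)
var_dump(is_valid_german_text('Zu kurz'));                       // bool(false): fewer than 10 characters
var_dump(is_valid_german_text("Gr\xFC\xDFe aus M\xFCnchen!"));   // bool(false): Latin-1 bytes, not valid UTF-8
```

If a check fails unexpectedly, preg_last_error() can tell an ordinary non-match apart from an encoding problem: it returns PREG_BAD_UTF8_ERROR when the input was rejected as malformed UTF-8.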