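A note on the validity of UTF-8 strings when using the /u pattern modifier. A few things to be aware of:
1. If the pattern itself contains an invalid UTF-8 character, you get an error (as mentioned in the docs above: "UTF-8 validity of the pattern is checked since PHP 4.3.5").
2. When the subject string contains invalid UTF-8 sequences or code points, the preg_* functions basically die quietly: nothing is matched, and the output gives no indication that the string is invalid UTF-8.
3. PCRE regards five and six octet UTF-8 character sequences as valid (both in patterns and in the subject string), but these are not supported in Unicode (see section 5.9 "Character Encoding" of the "Secure Programming for Linux and Unix HOWTO", which can be found at http://www.tldp.org/ and other places).
4. For an example algorithm in PHP that tests the validity of a UTF-8 string (and discards five and six octet sequences), head to: http://hsivonen.iki.fi/php-utf8/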
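As a rough illustration of point 4, the sketch below checks a byte string against the well-formed ranges of RFC 3629 (one to four octets per character) using a byte-wise regular expression without the /u modifier. The function name is_strict_utf8() is made up for this note; it is neither a PHP built-in nor the algorithm from the page linked above, only one possible way to reject five and six octet sequences before handing a string to preg_*.

```php
<?php
// Hypothetical helper (not part of PHP or PCRE): returns true when $str is
// well-formed UTF-8 as defined by RFC 3629, i.e. 1 to 4 octets per character.
function is_strict_utf8($str)
{
    // No /u modifier: the pattern is applied to raw bytes.
    // The x modifier allows whitespace and # comments inside the pattern.
    $pattern = '%\A(?:
          [\x00-\x7F]                        # 1 octet (ASCII)
        | [\xC2-\xDF][\x80-\xBF]             # 2 octets, overlong forms excluded
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 octets, overlong forms excluded
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 octets
        | \xED[\x80-\x9F][\x80-\xBF]         # 3 octets, surrogates excluded
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 octets, overlong forms excluded
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4 octets
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 octets, up to U+10FFFF
    )*+\z%x';

    // preg_match() returns 1 only when the whole string fits the ranges above.
    return preg_match($pattern, $str) === 1;
}

// Usage: a valid 2-octet sequence, a broken one, and a 5-octet sequence.
foreach (array("\xc3\xb1", "\xc3\x28", "\xf8\xa1\xa1\xa1\xa1") as $bytes) {
    echo bin2hex($bytes), ': ', is_strict_utf8($bytes) ? 'valid' : 'invalid', "\n";
}
```

Because the check works on bytes, it does not depend on how the installed PCRE version treats five and six octet sequences.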
The following script should give you an idea of what works and what doesn't:
<?php
// Test inputs. 0x28 is ASCII '(' and therefore never a valid continuation octet.
$examples = array(
    'Valid ASCII' => "a",
    'Valid 2 Octet Sequence' => "\xc3\xb1",
    'Invalid 2 Octet Sequence' => "\xc3\x28",
    'Invalid Sequence Identifier' => "\xa0\xa1",
    'Valid 3 Octet Sequence' => "\xe2\x82\xa1",
    'Invalid 3 Octet Sequence (in 2nd Octet)' => "\xe2\x28\xa1",
    'Invalid 3 Octet Sequence (in 3rd Octet)' => "\xe2\x82\x28",
    'Valid 4 Octet Sequence' => "\xf0\x90\x8c\xbc",
    'Invalid 4 Octet Sequence (in 2nd Octet)' => "\xf0\x28\x8c\xbc",
    'Invalid 4 Octet Sequence (in 3rd Octet)' => "\xf0\x90\x28\xbc",
    'Invalid 4 Octet Sequence (in 4th Octet)' => "\xf0\x90\x8c\x28",
    'Valid 5 Octet Sequence (but not Unicode!)' => "\xf8\xa1\xa1\xa1\xa1",
    'Valid 6 Octet Sequence (but not Unicode!)' => "\xfc\xa1\xa1\xa1\xa1\xa1",
);
echo "++ Invalid UTF-8 in pattern\n";
foreach ( $examples as $name => $str ) {
    echo "$name\n";
    // Use each example as the pattern; invalid ones make PHP report an error here.
    preg_match("/".$str."/u", 'Testing');
}
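echo "++ preg_match() examples\n";
foreach ( $examples as $name => $str ) {
    // Reset $ar so a failed call cannot leave a stale or unset result behind.
    $ar = array();
    // The pattern is the 5-octet sequence, so at most that example can match.
    preg_match("/\xf8\xa1\xa1\xa1\xa1/u", $str, $ar);
    echo "$name: ";
    if ( count($ar) == 0 ) {
        echo "Matched nothing!\n";
    } else {
        echo "Matched {$ar[0]}\n";
    }
}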
echo "++ preg_match_all() examples\n";
foreach ( $examples as $name => $str ) {
    $ar = array();
    preg_match_all('/./u', $str, $ar);
    echo "$name: ";
    // Count the characters matched by '.'; guard against a missing $ar[0].
    $num_utf8_chars = isset($ar[0]) ? count($ar[0]) : 0;
    if ( $num_utf8_chars == 0 ) {
        echo "Matched nothing!\n";
    } else {
        echo "Matched $num_utf8_chars character(s)\n";
    }
}
?>
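On PHP 5.2.0 and later, the quiet failure on an invalid subject can be detected after the call: preg_last_error() reports PREG_BAD_UTF8_ERROR when a subject matched under /u is not valid UTF-8. The sketch below is only an illustration of that check; match_or_report() is a hypothetical helper name, not part of PHP.

```php
<?php
// Hypothetical helper: runs preg_match() and says why nothing was matched.
function match_or_report($pattern, $subject)
{
    $matches = array();
    $result = preg_match($pattern, $subject, $matches);

    // preg_last_error() refers to the most recent preg_* call.
    if (preg_last_error() === PREG_BAD_UTF8_ERROR) {
        return 'invalid UTF-8 in subject';
    }

    return $result === 1 ? "matched {$matches[0]}" : 'no match';
}

echo match_or_report('/./u', "a"), "\n";        // well-formed subject
echo match_or_report('/b/u', "a"), "\n";        // well-formed, pattern absent
echo match_or_report('/./u', "\xc3\x28"), "\n"; // broken 2-octet sequence
```

Checking preg_last_error() rather than only the return value keeps a genuine "no match" apart from a subject that was rejected for its encoding.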