X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=6606ecdc86249c28b34ac229a736cec9b0d1b531;hb=888aee597441568824c1835285c8012bab253529;hp=e56f3ff9dacf650494b8918856e4db1151f2e0dd;hpb=d1be9408a3c14848d30728674452e191ba5fffaa;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index e56f3ff..6606ecd 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -275,6 +275,8 @@ have their directionality defined:
     BidiWS  Whitespace
     BidiON  Other Neutrals
 
+=back
+
 =head2 Scripts
 
 The scripts available for C<\p{In...}> and C<\P{In...}>, for example
@@ -481,6 +483,8 @@ below list that have the C appended).
     Yi Radicals
     Yi Syllables
 
+=over 4
+
 =item *
 
 The special pattern C<\X> match matches any extended Unicode sequence
@@ -563,7 +567,7 @@ than one Unicode character
 
 =back
 
-What doesn't yet work are the followng cases:
+What doesn't yet work are the following cases:
 
 =over 8
 
@@ -628,14 +632,23 @@ Level 1 - Basic Unicode Support
         [ 1] \x{...}
         [ 2] \N{...}
         [ 3] . \p{Is...} \P{Is...}
-        [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
+        [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
         [ 5] have negation
-        [ 6] can use look-ahead to emulate subtracion
+        [ 6] can use look-ahead to emulate subtraction (*)
         [ 7] include Letters in word characters
         [ 8] see UTR#21 Case Mappings: Perl implements 1:1 mappings
         [ 9] see UTR#13 Unicode Newline Guidelines
         [10] should do ^ and $ also on \x{2028} and \x{2029}
 
+(*) Instead of [\u0370-\u03FF-[{UNASSIGNED}]] as suggested by the TR
+18 you can use negated lookahead: to match currently assigned modern
+Greek characters use for example
+
+        /(?!\p{Cn})[\x{0370}-\x{03ff}]/
+
+In other words: the matched character must not be a non-assigned
+character, but it must be in the block of modern Greek characters.
+
 =item *
 
 Level 2 - Extended Unicode Support
@@ -681,6 +694,31 @@ length (1 to 6 bytes, current character allocations require 4 bytes),
 byteorder independent encoding.  For ASCII, UTF-8 is transparent
 (and we really do mean 7-bit ASCII, not any 8-bit encoding).
 
+The following table is from Unicode 3.1.
+
+ Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte
+
+   U+0000..U+007F       00..7F
+   U+0080..U+07FF       C2..DF    80..BF
+   U+0800..U+0FFF       E0        A0..BF    80..BF
+   U+1000..U+FFFF       E1..EF    80..BF    80..BF
+  U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
+  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
+ U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
+
+Or, another way to look at it, as bits:
+
+ Code Points                    1st Byte  2nd Byte  3rd Byte  4th Byte
+
+                    0aaaaaaa     0aaaaaaa
+            00000bbbbbaaaaaa     110bbbbb  10aaaaaa
+            ccccbbbbbbaaaaaa     1110cccc  10bbbbbb  10aaaaaa
+  00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa
+
+As you can see, the continuation bytes all begin with C<10>, and the
+leading bits of the start byte tells how many bytes the are in the
+encoded character.
+
 =item UTF-16, UTF-16BE, UTF16-LE, Surrogates, and BOMs (Byte Order Marks)
 
 UTF-16 is a 2 or 4 byte encoding.  The Unicode code points
@@ -769,7 +807,7 @@ for more discussion of the issues.
 
 =head1 SEE ALSO
 
-L, L, L, L, L, L,
-L
+L, L, L, L, L, L,
+L, L
 
 =cut
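
Note on the UTF-8 hunk: the bit-layout table this patch adds can be checked mechanically. What follows is a minimal sketch in Python, not part of the patch and not a description of how Perl implements UTF-8 internally. The function names utf8_encode and sequence_length are hypothetical, chosen only for illustration. The sketch builds each byte from the bit patterns in the table and reads the sequence length back from the leading bits of the start byte. It assumes the 1 to 4 byte forms only and does not reject surrogate code points.

```python
def utf8_encode(cp):
    """Encode one code point using the bit patterns from the table (1 to 4 bytes).

    Hypothetical helper for illustration; surrogates are not rejected here.
    """
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:
        # 0aaaaaaa
        return bytes([cp])
    if cp <= 0x7FF:
        # 110bbbbb 10aaaaaa
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110cccc 10bbbbbb 10aaaaaa
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110ddd 10cccccc 10bbbbbb 10aaaaaa
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def sequence_length(first_byte):
    """Return the sequence length implied by the leading bits of a start byte."""
    if first_byte & 0x80 == 0x00:
        return 1
    if first_byte & 0xE0 == 0xC0:
        return 2
    if first_byte & 0xF0 == 0xE0:
        return 3
    if first_byte & 0xF8 == 0xF0:
        return 4
    # Continuation bytes (10xxxxxx) and F8..FF are not valid start bytes here.
    raise ValueError("not a start byte")


if __name__ == "__main__":
    # Compare the sketch against Python's built-in encoder for a few samples.
    for cp in (0x41, 0x3B1, 0x20AC, 0x10000, 0x10FFFF):
        encoded = utf8_encode(cp)
        assert encoded == chr(cp).encode("utf-8")
        assert sequence_length(encoded[0]) == len(encoded)
```

Every byte after the first satisfies (byte & 0xC0) == 0x80, which is the "continuation bytes all begin with 10" property the added paragraph describes.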