X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=64116bcae10856ff0b2a952719213290a2a01e72;hb=a801c63c4c283fdf8af1d9fbd7d3d89096ee73f6;hp=2c9b0780297533ef912da7b1f6b416a715be3477;hpb=90a59240269f4a0b2fc176328a009d30cf595988;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 2c9b078..64116bc 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -20,7 +20,7 @@ Other encodings can be converted to perl's encoding on input, or from
 perl's encoding on output by use of the ":encoding(...)" layer.
 See L.
 
-To mark the Perl source itself as being in an particular encoding,
+To mark the Perl source itself as being in a particular encoding,
 see L.
 
 =item Regular Expressions
@@ -275,6 +275,8 @@ have their directionality defined:
 BidiWS Whitespace
 BidiON Other Neutrals
 
+=back
+
 =head2 Scripts
 
 The scripts available for C<\p{In...}> and C<\P{In...}>, for example
@@ -481,6 +483,8 @@ below list that have the C appended).
 Yi Radicals
 Yi Syllables
 
+=over 4
+
 =item *
 
 The special pattern C<\X> match matches any extended Unicode sequence
@@ -563,7 +567,7 @@ than one Unicode character
 
 =back
 
-What doesn't yet work are the followng cases:
+What doesn't yet work are the following cases:
 
 =over 8
 
@@ -628,14 +632,23 @@ Level 1 - Basic Unicode Support
 [ 1] \x{...}
 [ 2] \N{...}
 [ 3] . \p{Is...} \P{Is...}
- [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
+ [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
 [ 5] have negation
- [ 6] can use look-ahead to emulate subtracion
+ [ 6] can use look-ahead to emulate subtraction (*)
 [ 7] include Letters in word characters
 [ 8] see UTR#21 Case Mappings: Perl implements 1:1 mappings
 [ 9] see UTR#13 Unicode Newline Guidelines
 [10] should do ^ and $ also on \x{2028} and \x{2029}
 
+(*) Instead of [\u0370-\u03FF-[{UNASSIGNED}]] as suggested by the TR
+18 you can use negated lookahead: to match currently assigned modern
+Greek characters use for example
+
+ /(?!\p{Cn})[\x{0370}-\x{03ff}]/
+
+In other words: the matched character must not be a non-assigned
+character, but it must be in the block of modern Greek characters.
+
 =item *
 
 Level 2 - Extended Unicode Support
@@ -742,6 +755,18 @@ is not extensible beyond 0xFFFF, because it does not use surrogates.
 A seven-bit safe (non-eight-bit) encoding, useful if the
 transport/storage is not eight-bit safe. Defined by RFC 2152.
 
+=head2 Security Implications of Malformed UTF-8
+
+Unfortunately, the specification of UTF-8 leaves some room for
+interpretation of how many bytes of encoded output one should generate
+from one input Unicode character. Strictly speaking, one is supposed
+to always generate the shortest possible sequence of UTF-8 bytes,
+because otherwise there is potential for input buffer overflow at the
+receiving end of a UTF-8 connection. Perl always generates the shortest
+length UTF-8, and with warnings on (C<-w> or C) Perl will
+warn about non-shortest length UTF-8 (and other malformations, too,
+such as the surrogates, which are not real character code points.)
+
 =head2 Unicode in Perl on EBCDIC
 
 The way Unicode is handled on EBCDIC platforms is still rather
@@ -757,7 +782,7 @@ for more discussion of the issues.
 
 =head1 SEE ALSO
 
-L, L, L, L, L, L,
-L
+L, L, L, L, L, L,
+L, L
 
 =cut
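
Note on the [ 6] footnote added by this patch: the negated-lookahead idiom can be tried with a short script. The following is only a sketch. The pattern is the one quoted in the patch; the helper name is_assigned_greek and the sample code points are assumptions chosen for illustration, and whether a given code point counts as unassigned depends on the Unicode tables shipped with the perl in use.

```perl
use strict;
use warnings;

# Pattern quoted from the patch: one character in U+0370..U+03FF
# that is not in general category Cn (unassigned).
my $assigned_greek = qr/(?!\p{Cn})[\x{0370}-\x{03ff}]/;

# Hypothetical helper (not part of perlunicode.pod): true if the
# code point lies in the block and is assigned in this perl's tables.
sub is_assigned_greek {
    my ($code_point) = @_;
    return chr($code_point) =~ $assigned_greek ? 1 : 0;
}

# U+03B1 is GREEK SMALL LETTER ALPHA; U+0041 is outside the block;
# the result for U+0378 depends on the Unicode version of the tables.
for my $code_point (0x03B1, 0x0378, 0x0041) {
    printf "U+%04X %s\n", $code_point,
        is_assigned_greek($code_point) ? "matches" : "does not match";
}
```

The lookahead consumes nothing, so the character class still has to match the same character; that is what makes the construct behave like set subtraction.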