perl's encoding on output by use of the ":encoding(...)" layer.
See L<open>.
-To mark the Perl source itself as being in an particular encoding,
+To mark the Perl source itself as being in a particular encoding,
see L<encoding>.
=item Regular Expressions
BidiWS Whitespace
BidiON Other Neutrals
+=back
+
=head2 Scripts
The scripts available for C<\p{In...}> and C<\P{In...}>, for example
Yi Radicals
Yi Syllables
+=over 4
+
=item *
The special pattern C<\X> match matches any extended Unicode sequence
=back
-What doesn't yet work are the followng cases:
+What doesn't yet work are the following cases:
=over 8
2.2 Categories - done [3][4]
2.3 Subtraction - MISSING [5][6]
2.4 Simple Word Boundaries - done [7]
- 2.5 Simple Loose Matches - MISSING [8]
+ 2.5 Simple Loose Matches - done [8]
2.6 End of Line - MISSING [9][10]
[ 1] \x{...}
[ 2] \N{...}
[ 3] . \p{Is...} \P{Is...}
- [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
+ [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
[ 5] have negation
- [ 6] can use look-ahead to emulate subtracion
+ [ 6] can use look-ahead to emulate subtraction (*)
[ 7] include Letters in word characters
- [ 8] see UTR#21 Case Mappings
+ [ 8] see UTR#21 Case Mappings: Perl implements 1:1 mappings
[ 9] see UTR#13 Unicode Newline Guidelines
[10] should do ^ and $ also on \x{2028} and \x{2029}
+(*) Instead of [\u0370-\u03FF-[{UNASSIGNED}]], as suggested by TR 18,
+you can use a negative lookahead. For example, to match currently
+assigned modern Greek characters, use
+
+ /(?!\p{Cn})[\x{0370}-\x{03ff}]/
+
+In other words: the matched character must not be a non-assigned
+character, but it must be in the block of modern Greek characters.
+
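The lookahead idiom above can be tried out with a short script. This is only a sketch: the helper name C<is_assigned_greek> and the sample characters are illustrative assumptions, and whether a given code point counts as unassigned depends on the Unicode tables shipped with your Perl.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Perl): true if $char is a single
# character in U+0370..U+03FF that is not in category Cn (unassigned).
sub is_assigned_greek {
    my ($char) = @_;
    return $char =~ /\A(?!\p{Cn})[\x{0370}-\x{03ff}]\z/ ? 1 : 0;
}

# U+03B1 is GREEK SMALL LETTER ALPHA; "A" lies outside the block.
# U+0378 is assumed to be unassigned in the Unicode data in use.
for my $char ("\x{03B1}", "\x{0378}", "A") {
    printf "U+%04X: %s\n", ord($char),
        is_assigned_greek($char) ? "assigned Greek" : "no match";
}
```

The lookahead consumes nothing, so the character class still does the actual matching; the lookahead only vetoes code points that fall in C<\p{Cn}>.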
=item *
Level 2 - Extended Unicode Support
byteorder independent encoding. For ASCII, UTF-8 is transparent
(and we really do mean 7-bit ASCII, not any 8-bit encoding).
+The following table is from Unicode 3.1.
+
+ Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte
+
+ U+0000..U+007F         00..7F
+ U+0080..U+07FF         C2..DF    80..BF
+ U+0800..U+0FFF         E0        A0..BF    80..BF
+ U+1000..U+FFFF         E1..EF    80..BF    80..BF
+ U+10000..U+3FFFF       F0        90..BF    80..BF    80..BF
+ U+40000..U+FFFFF       F1..F3    80..BF    80..BF    80..BF
+ U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
+
+Or, another way to look at it, as bits:
+
+ Code Points                  1st Byte  2nd Byte  3rd Byte  4th Byte
+
+                    0aaaaaaa  0aaaaaaa
+            00000bbbbbaaaaaa  110bbbbb  10aaaaaa
+            ccccbbbbbbaaaaaa  1110cccc  10bbbbbb  10aaaaaa
+  00000dddccccccbbbbbbaaaaaa  11110ddd  10cccccc  10bbbbbb  10aaaaaa
+
+As you can see, the continuation bytes all begin with C<10>, and the
+leading bits of the start byte tell how many bytes there are in the
+encoded character.
+
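The bit layout above can be followed by hand with shifts and masks. The sketch below is for illustration only: C<utf8_bytes> is a hypothetical helper, it does no checking for surrogates or code points above U+10FFFF, and real programs should rely on Perl's own encoding support instead.

```perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-8 byte values for one code point,
# following the bit patterns in the table (no range or surrogate checks).
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                       # 0aaaaaaa
    return (0xC0 | ($cp >> 6),                        # 110bbbbb
            0x80 | ($cp & 0x3F)) if $cp < 0x800;      # 10aaaaaa
    return (0xE0 | ($cp >> 12),                       # 1110cccc
            0x80 | (($cp >> 6) & 0x3F),               # 10bbbbbb
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;    # 10aaaaaa
    return (0xF0 | ($cp >> 18),                       # 11110ddd
            0x80 | (($cp >> 12) & 0x3F),              # 10cccccc
            0x80 | (($cp >> 6) & 0x3F),               # 10bbbbbb
            0x80 | ($cp & 0x3F));                     # 10aaaaaa
}

for my $cp (0x41, 0xE9, 0x20AC, 0x10348) {
    printf "U+%04X => %s\n", $cp,
        join " ", map { sprintf "%02X", $_ } utf8_bytes($cp);
}
# Prints:
# U+0041 => 41
# U+00E9 => C3 A9
# U+20AC => E2 82 AC
# U+10348 => F0 90 8D 88
```

Each result falls inside the byte ranges listed for its row in the Unicode 3.1 table.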
=item UTF-16, UTF-16BE, UTF16-LE, Surrogates, and BOMs (Byte Order Marks)
UTF-16 is a 2 or 4 byte encoding. The Unicode code points
A seven-bit safe (non-eight-bit) encoding, useful if the
transport/storage is not eight-bit safe. Defined by RFC 2152.
+=head2 Security Implications of Malformed UTF-8
+
+Unfortunately, the specification of UTF-8 leaves some room for
+interpretation of how many bytes of encoded output one should generate
+from one input Unicode character. Strictly speaking, one is supposed
+to always generate the shortest possible sequence of UTF-8 bytes,
+because otherwise there is potential for input buffer overflow at the
+receiving end of a UTF-8 connection. Perl always generates the shortest
+length UTF-8, and with warnings on (C<-w> or C<use warnings;>) Perl will
+warn about non-shortest length UTF-8 (and other malformations, too,
+such as the surrogates, which are not real character code points).
+
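To make "non-shortest" concrete: judging from the byte ranges in the Unicode 3.1 table, a start byte of C0 or C1, E0 followed by 80..9F, or F0 followed by 80..8F would encode a code point that fits in a shorter sequence. The following is a minimal sketch of that check, not how Perl itself detects malformations; C<starts_overlong> is a hypothetical helper that inspects only the first two bytes of a byte string.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the byte string begins with a
# non-shortest ("overlong") UTF-8 sequence. Only the first two bytes
# are examined; other malformations are not detected.
sub starts_overlong {
    my ($bytes) = @_;
    my ($b1, $b2) = unpack "C2", $bytes;
    return 0 unless defined $b1;
    # C0 and C1 would spend two bytes on U+0000..U+007F.
    return 1 if $b1 == 0xC0 || $b1 == 0xC1;
    return 0 unless defined $b2;
    # E0 80..9F would spend three bytes on a code point below U+0800.
    return 1 if $b1 == 0xE0 && $b2 >= 0x80 && $b2 < 0xA0;
    # F0 80..8F would spend four bytes on a code point below U+10000.
    return 1 if $b1 == 0xF0 && $b2 >= 0x80 && $b2 < 0x90;
    return 0;
}

# "\xC0\xAF" is a two-byte spelling of "/" (U+002F), which should be
# the single byte 2F; "\xC3\xA9" is the shortest form of U+00E9.
for my $bytes ("\xC0\xAF", "\xC3\xA9", "\xE0\x80\xAF", "\xE0\xA0\x80") {
    printf "%s: %s\n", join(" ", map { sprintf "%02X", $_ } unpack "C*", $bytes),
        starts_overlong($bytes) ? "non-shortest" : "ok";
}
```

Sequences like C0 AF matter for security because a filter that looks only for the byte 2F would miss a "/" spelled this way if a later decoder accepted it.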
=head2 Unicode in Perl on EBCDIC
The way Unicode is handled on EBCDIC platforms is still rather
=head1 SEE ALSO
-L<encoding>, L<Encode>, L<open>, L<bytes>, L<utf8>, L<perlretut>,
-L<perlvar/"${^WIDE_SYSTEM_CALLS}">
+L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
+L<perlretut>, L<perlvar/"${^WIDE_SYSTEM_CALLS}">
=cut