=over 4
-=item Input and Output Disciplines
+=item Input and Output Layers
Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
-is needed.>
+is needed.> See L<utf8>.
You can also use the C<encoding> pragma to change the default encoding
of the data in your script; see L<encoding>.
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently. If
input data comes from a Unicode source--for example, if a character
-encoding discipline is added to a filehandle or a literal Unicode
+encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The C<bytes> pragma should
be used to force byte semantics on Unicode data.
=back
-The following cases do not yet work:
-
-=over 8
-
-=item *
-
-the "final sigma" (Greek), and
-
-=item *
-
-anything to with locales (Lithuanian, Turkish, Azeri).
+Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work,
+since Perl does not understand the concept of Unicode locales.
=back
which will match assigned characters known to be part of the Greek script.
-[b] See L</User-defined Character Properties>.
+[b] See L</"User-Defined Character Properties">.
=item *
Level 2 - Extended Unicode Support
- 3.1 Surrogates - MISSING
- 3.2 Canonical Equivalents - MISSING [11][12]
- 3.3 Locale-Independent Graphemes - MISSING [13]
- 3.4 Locale-Independent Words - MISSING [14]
- 3.5 Locale-Independent Loose Matches - MISSING [15]
-
- [11] see UTR#15 Unicode Normalization
- [12] have Unicode::Normalize but not integrated to regexes
- [13] have \X but at this level . should equal that
- [14] need three classes, not just \w and \W
- [15] see UTR#21 Case Mappings
+ 3.1 Surrogates - MISSING [11]
+ 3.2 Canonical Equivalents - MISSING [12][13]
+ 3.3 Locale-Independent Graphemes - MISSING [14]
+ 3.4 Locale-Independent Words - MISSING [15]
+ 3.5 Locale-Independent Loose Matches - MISSING [16]
+
+ [11] Surrogates are solely a UTF-16 concept, and Perl's internal
+ representation is UTF-8. The Encode module handles UTF-16, though.
+ [12] see UTR#15 Unicode Normalization
+ [13] have Unicode::Normalize but not integrated to regexes
+ [14] have \X but at this level . should equal that
+ [15] need three classes, not just \w and \W
+ [16] see UTR#21 Case Mappings
=item *