X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=23bee6eacf37edfa69107ca6f962d4df3881f4ad;hb=7b667b5fb1c41f31aff1e46b9f74e36eb063010e;hp=d47e7dff6296cc893eefd125482aadefe02bb41d;hpb=1e8e823624ada1d9231e47a66cb2b9e3ab42701a;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index d47e7df..23bee6e 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -42,6 +42,14 @@ is needed.>  See L<utf8>.
 You can also use the C<encoding> pragma to change the default encoding
 of the data in your script; see L<encoding>.
 
+=item BOM-marked scripts and UTF-16 scripts autodetected
+
+If a Perl script begins with a Unicode BOM (UTF-16LE, UTF-16BE,
+or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
+endianness, Perl will correctly read in the script as Unicode.
+(BOMless UTF-8 cannot be effectively recognized or differentiated from
+ISO 8859-1 or other eight-bit encodings.)
+
 =item C<use encoding> needed to upgrade non-Latin-1 byte strings
 
 By default, there is a fundamental asymmetry in Perl's unicode model:
@@ -166,6 +174,10 @@ bytes and match against the character properties specified in the
 Unicode properties database.  C<\w> can be used to match a Japanese
 ideograph, for instance.
 
+(However, and as a limitation of the current implementation, using
+C<\w> or C<\W> I<inside> a C<[...]> character class will still match
+with byte semantics.)
+
 =item *
 
 Named Unicode properties, scripts, and block ranges may be used like
@@ -203,6 +215,7 @@ for instance, are identical.
     Short       Long
 
     L           Letter
+    LC          CasedLetter
     Lu          UppercaseLetter
     Ll          LowercaseLetter
     Lt          TitlecaseLetter
@@ -250,7 +263,8 @@ for instance, are identical.
 
 Single-letter properties match all characters in any of the
 two-letter sub-properties starting with the same letter.
-C<L&> is a special case, which is an alias for C<Ll>, C<Lu>, and C<Lt>.
+C<LC> and C<L&> are special cases, which are aliases for the set of
+C<Ll>, C<Lu>, and C<Lt>.
 
 Because Perl hides the need for the user to understand the internal
 representation of Unicode characters, there is no need to implement
@@ -258,31 +272,32 @@ the somewhat messy concept of surrogates. C<Cs> is therefore not
 supported.
 
 Because scripts differ in their directionality--Hebrew is
-written right to left, for example--Unicode supplies these properties:
+written right to left, for example--Unicode supplies these properties in
+the BidiClass class:
 
     Property    Meaning
 
-    BidiL       Left-to-Right
-    BidiLRE     Left-to-Right Embedding
-    BidiLRO     Left-to-Right Override
-    BidiR       Right-to-Left
-    BidiAL      Right-to-Left Arabic
-    BidiRLE     Right-to-Left Embedding
-    BidiRLO     Right-to-Left Override
-    BidiPDF     Pop Directional Format
-    BidiEN      European Number
-    BidiES      European Number Separator
-    BidiET      European Number Terminator
-    BidiAN      Arabic Number
-    BidiCS      Common Number Separator
-    BidiNSM     Non-Spacing Mark
-    BidiBN      Boundary Neutral
-    BidiB       Paragraph Separator
-    BidiS       Segment Separator
-    BidiWS      Whitespace
-    BidiON      Other Neutrals
-
-For example, C<\p{BidiR}> matches characters that are normally
+    L           Left-to-Right
+    LRE         Left-to-Right Embedding
+    LRO         Left-to-Right Override
+    R           Right-to-Left
+    AL          Right-to-Left Arabic
+    RLE         Right-to-Left Embedding
+    RLO         Right-to-Left Override
+    PDF         Pop Directional Format
+    EN          European Number
+    ES          European Number Separator
+    ET          European Number Terminator
+    AN          Arabic Number
+    CS          Common Number Separator
+    NSM         Non-Spacing Mark
+    BN          Boundary Neutral
+    B           Paragraph Separator
+    S           Segment Separator
+    WS          Whitespace
+    ON          Other Neutrals
+
+For example, C<\p{BidiClass:R}> matches characters that are normally
 written right to left.
 
 =back
@@ -555,10 +570,10 @@ that make the distinction.
 
 Most operators that deal with positions or lengths in a string will
 automatically switch to using character positions, including
-C<chop()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
+C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
 C<sprintf()>, C<write()>, and C<length()>.  Operators that
 specifically do not switch include C<vec()>, C<pack()>, and
-C<unpack()>.  Operators that really don't care include C<chomp()>,
+C<unpack()>.  Operators that really don't care include
 operators that treats strings as a bucket of bits such as C<sort()>,
 and operators dealing with filenames.
 
@@ -628,10 +643,21 @@ And finally, C<scalar reverse()> reverses by character rather than by byte.
 =head2 User-Defined Character Properties
 
 You can define your own character properties by defining subroutines
-whose names begin with "In" or "Is".  The subroutines must be defined
-in the C<main> package.  The user-defined properties can be used in the
-regular expression C<\p> and C<\P> constructs.  Note that the effect
-is compile-time and immutable once defined.
+whose names begin with "In" or "Is".  The subroutines can be defined in
+any package.  The user-defined properties can be used in the regular
+expression C<\p> and C<\P> constructs; if you are using a user-defined
+property from a package other than the one you are in, you must specify
+its package in the C<\p> or C<\P> construct.
+
+    # assuming property IsForeign defined in Lang::
+    package main;  # property package name required
+    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
+
+    package Lang;  # property package name not required
+    if ($txt =~ /\p{IsForeign}+/) { ... }
+
+
+Note that the effect is compile-time and immutable once defined.
 
 The subroutines must return a specially-formatted string, with one
 or more newline-separated lines.  Each line must be one of the following:
@@ -646,23 +672,30 @@ tabular characters) denoting a range of Unicode code points to include.
 =item *
 
 Something to include, prefixed by "+": a built-in character
-property (prefixed by "utf8::"), to represent all the characters in that
-property; two hexadecimal code points for a range; or a single
-hexadecimal code point.
+property (prefixed by "utf8::") or a user-defined character property,
+to represent all the characters in that property; two hexadecimal code
+points for a range; or a single hexadecimal code point.
 
 =item *
 
 Something to exclude, prefixed by "-": an existing character
-property (prefixed by "utf8::"), for all the characters in that
-property; two hexadecimal code points for a range; or a single
-hexadecimal code point.
+property (prefixed by "utf8::") or a user-defined character property,
+to represent all the characters in that property; two hexadecimal code
+points for a range; or a single hexadecimal code point.
 
 =item *
 
 Something to negate, prefixed "!": an existing character
-property (prefixed by "utf8::") for all the characters except the
-characters in the property; two hexadecimal code points for a range;
-or a single hexadecimal code point.
+property (prefixed by "utf8::") or a user-defined character property,
+to represent all the characters except those in that property; two
+hexadecimal code points for a range; or a single hexadecimal code point.
+
+=item *
+
+Something to intersect with, prefixed by "&": an existing character
+property (prefixed by "utf8::") or a user-defined character property,
+for all the characters that are also in the property; two
+hexadecimal code points for a range; or a single hexadecimal code point.
 
 =back
 
@@ -710,6 +743,19 @@ The negation is useful for defining (surprise!) negated classes.
     END
     }
 
+Intersection is useful for getting the common characters matched by
+two (or more) classes.
+
+    sub InFooAndBar {
+        return <<'END';
+    +main::InFoo
+    &main::InBar
+    END
+    }
+
+It's important to remember not to use "&" for the first set -- that
+would be intersecting with nothing (resulting in an empty set).
+
 You can also define your own mappings to be used in the lc(),
 lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
 The principle is the same: define subroutines in the C<main> package
@@ -789,7 +835,9 @@ Level 1 - Basic Unicode Support
         [ 1] \x{...}
         [ 2] \N{...}
         [ 3] . \p{...} \P{...}
-        [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
+        [ 4] support for scripts (see UTR#24 Script Names), blocks,
+             binary properties, enumerated non-binary properties, and
+             numeric properties (as listed in UTR#18 Other Properties)
         [ 5] have negation
         [ 6] can use regular expression look-ahead [a]
              or user-defined character properties [b] to emulate subtraction
@@ -1087,8 +1135,9 @@ as Unicode (UTF-8), there still are many places where Unicode (in some
 encoding or another) could be given as arguments or received as
 results, or both, but it is not.
 
-The following are such interfaces.  For all of these Perl currently
-(as of 5.8.1) simply assumes byte strings both as arguments and results.
+The following are such interfaces.  For all of these interfaces Perl
+currently (as of 5.8.3) simply assumes byte strings both as arguments
+and results, or UTF-8 strings if the C<encoding> pragma has been used.
 
 One reason why Perl does not attempt to resolve the role of Unicode in
 this cases is that the answers are highly dependent on the operating