X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=21c5bb3ab7c858d844d001ec4c9d1a6f2145aee1;hb=ef36c6a75a14ecb6dcb87636f28264674a031e0b;hp=d47e7dff6296cc893eefd125482aadefe02bb41d;hpb=1e8e823624ada1d9231e47a66cb2b9e3ab42701a;p=p5sagit%2Fp5-mst-13.2.git diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod index d47e7df..21c5bb3 100644 --- a/pod/perlunicode.pod +++ b/pod/perlunicode.pod @@ -42,6 +42,14 @@ is needed.> See L. You can also use the C pragma to change the default encoding of the data in your script; see L. +=item BOM-marked scripts and UTF-16 scripts autodetected + +If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF16-BE, +or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either +endianness, Perl will correctly read in the script as Unicode. +(BOMless UTF-8 cannot be effectively recognized or differentiated from +ISO 8859-1 or other eight-bit encodings.) + =item C needed to upgrade non-Latin-1 byte strings By default, there is a fundamental asymmetry in Perl's unicode model: @@ -145,7 +153,6 @@ Additionally, if you you can use the C<\N{...}> notation and put the official Unicode character name within the braces, such as C<\N{WHITE SMILING FACE}>. - =item * If an appropriate L is specified, identifiers within the @@ -166,11 +173,128 @@ bytes and match against the character properties specified in the Unicode properties database. C<\w> can be used to match a Japanese ideograph, for instance. +(However, and as a limitation of the current implementation, using +C<\w> or C<\W> I a C<[...]> character class will still match +with byte semantics.) + =item * Named Unicode properties, scripts, and block ranges may be used like character classes via the C<\p{}> "matches property" construct and -the C<\P{}> negation, "doesn't match property". +the C<\P{}> negation, "doesn't match property". + +See L for more details. 
+ +You can define your own character properties and use them +in the regular expression with the C<\p{}> or C<\P{}> construct. + +See L for more details. + +=item * + +The special pattern C<\X> matches any extended Unicode +sequence--"a combining character sequence" in Standardese--where the +first character is a base character and subsequent characters are mark +characters that apply to the base character. C<\X> is equivalent to +C<(?:\PM\pM*)>. + +=item * + +The C operator translates characters instead of bytes. Note +that the C functionality has been removed. For similar +functionality see pack('U0', ...) and pack('C0', ...). + +=item * + +Case translation operators use the Unicode case translation tables +when character input is provided. Note that C, or C<\U> in +interpolated strings, translates to uppercase, while C, +or C<\u> in interpolated strings, translates to titlecase in languages +that make the distinction. + +=item * + +Most operators that deal with positions or lengths in a string will +automatically switch to using character positions, including +C, C, C, C, C, C, +C, C, and C. An operator that +specifically does not switch is C. Operators that really don't +care include operators that treat strings as a bucket of bits such as +C, and operators dealing with filenames. + +=item * + +The C/C letter C does I change, since it is often +used for byte-oriented formats. Again, think C in the C language. + +There is a new C specifier that converts between Unicode characters +and code points. There is also a C specifier that is the equivalent of +C/C and properly handles character values even if they are above 255. + +=item * + +The C and C functions work on characters, similar to +C and C, I C and +C. C and C are methods for +emulating byte-oriented C and C on Unicode strings. +While these methods reveal the internal encoding of Unicode strings, +that is not something one normally needs to care about at all. 
+ +=item * + +The bit string operators, C<& | ^ ~>, can operate on character data. +However, for backward compatibility, such as when using bit string +operations when characters are all less than 256 in ordinal value, one +should not use C<~> (the bit complement) with characters of both +values less than 256 and values greater than 256. Most importantly, +DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) +will not hold. The reason for this mathematical I is that +the complement cannot return B the 8-bit (byte-wide) bit +complement B the full character-wide bit complement. + +=item * + +lc(), uc(), lcfirst(), and ucfirst() work for the following cases: + +=over 8 + +=item * + +the case mapping is from a single Unicode character to another +single Unicode character, or + +=item * + +the case mapping is from a single Unicode character to more +than one Unicode character. + +=back + +Things to do with locales (Lithuanian, Turkish, Azeri) do B work +since Perl does not understand the concept of Unicode locales. + +See the Unicode Technical Report #21, Case Mappings, for more details. + +But you can also define your own mappings to be used in the lc(), +lcfirst(), uc(), and ucfirst() (or their string-inlined versions). + +See L for more details. + +=back + +=over 4 + +=item * + +And finally, C reverses by character rather than by byte. + +=back + +=head2 Unicode Character Properties + +Named Unicode properties, scripts, and block ranges may be used like +character classes via the C<\p{}> "matches property" construct and +the C<\P{}> negation, "doesn't match property". For instance, C<\p{Lu}> matches any character with the Unicode "Lu" (Letter, uppercase) property, while C<\p{M}> matches any character @@ -196,6 +320,10 @@ B +=over 4 + +=item General Category + Here are the basic Unicode General Category properties, followed by their long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>, for instance, are identical. 
@@ -203,6 +331,7 @@ for instance, are identical. Short Long L Letter + LC CasedLetter Lu UppercaseLetter Ll LowercaseLetter Lt TitlecaseLetter @@ -250,44 +379,46 @@ for instance, are identical. Single-letter properties match all characters in any of the two-letter sub-properties starting with the same letter. -C is a special case, which is an alias for C, C, and C. +C and C are special cases, which are aliases for the set of +C, C, and C. Because Perl hides the need for the user to understand the internal representation of Unicode characters, there is no need to implement the somewhat messy concept of surrogates. C is therefore not supported. +=item Bidirectional Character Types + Because scripts differ in their directionality--Hebrew is -written right to left, for example--Unicode supplies these properties: +written right to left, for example--Unicode supplies these properties in +the BidiClass class: Property Meaning - BidiL Left-to-Right - BidiLRE Left-to-Right Embedding - BidiLRO Left-to-Right Override - BidiR Right-to-Left - BidiAL Right-to-Left Arabic - BidiRLE Right-to-Left Embedding - BidiRLO Right-to-Left Override - BidiPDF Pop Directional Format - BidiEN European Number - BidiES European Number Separator - BidiET European Number Terminator - BidiAN Arabic Number - BidiCS Common Number Separator - BidiNSM Non-Spacing Mark - BidiBN Boundary Neutral - BidiB Paragraph Separator - BidiS Segment Separator - BidiWS Whitespace - BidiON Other Neutrals - -For example, C<\p{BidiR}> matches characters that are normally + L Left-to-Right + LRE Left-to-Right Embedding + LRO Left-to-Right Override + R Right-to-Left + AL Right-to-Left Arabic + RLE Right-to-Left Embedding + RLO Right-to-Left Override + PDF Pop Directional Format + EN European Number + ES European Number Separator + ET European Number Terminator + AN Arabic Number + CS Common Number Separator + NSM Non-Spacing Mark + BN Boundary Neutral + B Paragraph Separator + S Segment Separator + WS Whitespace + ON 
Other Neutrals + +For example, C<\p{BidiClass:R}> matches characters that are normally written right to left. -=back - -=head2 Scripts +=item Scripts The script names which can be used by C<\p{...}> and C<\P{...}>, such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows: @@ -337,6 +468,8 @@ such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows: Tibetan Yi +=item Extended property classes + Extended property classes can supplement the basic properties, defined by the F Unicode database: @@ -384,11 +517,13 @@ and there are further derived properties: Common Any character (or unassigned code point) not explicitly assigned to a script +=item Use of "Is" Prefix + For backward compatibility (with Perl 5.6), all properties mentioned so far may have C prepended to their name, so C<\P{IsLu}>, for example, is equal to C<\P{Lu}>. -=head2 Blocks +=item Blocks In addition to B, Unicode also defines B of characters. The difference between scripts and blocks is that the @@ -527,111 +662,26 @@ These block names are supported: InYiRadicals InYiSyllables -=over 4 - -=item * - -The special pattern C<\X> matches any extended Unicode -sequence--"a combining character sequence" in Standardese--where the -first character is a base character and subsequent characters are mark -characters that apply to the base character. C<\X> is equivalent to -C<(?:\PM\pM*)>. - -=item * - -The C operator translates characters instead of bytes. Note -that the C functionality has been removed. For similar -functionality see pack('U0', ...) and pack('C0', ...). - -=item * - -Case translation operators use the Unicode case translation tables -when character input is provided. Note that C, or C<\U> in -interpolated strings, translates to uppercase, while C, -or C<\u> in interpolated strings, translates to titlecase in languages -that make the distinction. 
- -=item * - -Most operators that deal with positions or lengths in a string will -automatically switch to using character positions, including -C, C, C, C, C, -C, C, and C. Operators that -specifically do not switch include C, C, and -C. Operators that really don't care include C, -operators that treats strings as a bucket of bits such as C, -and operators dealing with filenames. - -=item * - -The C/C letters C and C do I change, -since they are often used for byte-oriented formats. Again, think -C in the C language. - -There is a new C specifier that converts between Unicode characters -and code points. - -=item * - -The C and C functions work on characters, similar to -C and C, I C and -C. C and C are methods for -emulating byte-oriented C and C on Unicode strings. -While these methods reveal the internal encoding of Unicode strings, -that is not something one normally needs to care about at all. - -=item * - -The bit string operators, C<& | ^ ~>, can operate on character data. -However, for backward compatibility, such as when using bit string -operations when characters are all less than 256 in ordinal value, one -should not use C<~> (the bit complement) with characters of both -values less than 256 and values greater than 256. Most importantly, -DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) -will not hold. The reason for this mathematical I is that -the complement cannot return B the 8-bit (byte-wide) bit -complement B the full character-wide bit complement. - -=item * - -lc(), uc(), lcfirst(), and ucfirst() work for the following cases: - -=over 8 - -=item * - -the case mapping is from a single Unicode character to another -single Unicode character, or - -=item * - -the case mapping is from a single Unicode character to more -than one Unicode character. - =back -Things to do with locales (Lithuanian, Turkish, Azeri) do B work -since Perl does not understand the concept of Unicode locales. 
- -See the Unicode Technical Report #21, Case Mappings, for more details. - -=back +=head2 User-Defined Character Properties -=over 4 +You can define your own character properties by defining subroutines +whose names begin with "In" or "Is". The subroutines can be defined in +any package. The user-defined properties can be used in the regular +expression C<\p> and C<\P> constructs; if you are using a user-defined +property from a package other than the one you are in, you must specify +its package in the C<\p> or C<\P> construct. -=item * + # assuming property IsForeign defined in Lang:: + package main; # property package name required + if ($txt =~ /\p{Lang::IsForeign}+/) { ... } -And finally, C reverses by character rather than by byte. + package Lang; # property package name not required + if ($txt =~ /\p{IsForeign}+/) { ... } -=back -=head2 User-Defined Character Properties - -You can define your own character properties by defining subroutines -whose names begin with "In" or "Is". The subroutines must be defined -in the C
package. The user-defined properties can be used in the -regular expression C<\p> and C<\P> constructs. Note that the effect -is compile-time and immutable once defined. +Note that the effect is compile-time and immutable once defined. The subroutines must return a specially-formatted string, with one or more newline-separated lines. Each line must be one of the following: @@ -646,23 +696,30 @@ tabular characters) denoting a range of Unicode code points to include. =item * Something to include, prefixed by "+": a built-in character -property (prefixed by "utf8::"), to represent all the characters in that -property; two hexadecimal code points for a range; or a single -hexadecimal code point. +property (prefixed by "utf8::") or a user-defined character property, +to represent all the characters in that property; two hexadecimal code +points for a range; or a single hexadecimal code point. =item * Something to exclude, prefixed by "-": an existing character -property (prefixed by "utf8::"), for all the characters in that -property; two hexadecimal code points for a range; or a single -hexadecimal code point. +property (prefixed by "utf8::") or a user-defined character property, +to represent all the characters in that property; two hexadecimal code +points for a range; or a single hexadecimal code point. =item * Something to negate, prefixed "!": an existing character -property (prefixed by "utf8::") for all the characters except the -characters in the property; two hexadecimal code points for a range; -or a single hexadecimal code point. +property (prefixed by "utf8::") or a user-defined character property, +to represent all the characters in that property; two hexadecimal code +points for a range; or a single hexadecimal code point. 
+ +=item * + +Something to intersect with, prefixed by "&": an existing character +property (prefixed by "utf8::") or a user-defined character property, +for all the characters except the characters in the property; two +hexadecimal code points for a range; or a single hexadecimal code point. =back @@ -710,9 +767,29 @@ The negation is useful for defining (surprise!) negated classes. END } +Intersection is useful for getting the common characters matched by +two (or more) classes. + + sub InFooAndBar { + return <<'END'; + +main::Foo + &main::Bar + END + } + +It's important to remember not to use "&" for the first set -- that +would be intersecting with nothing (resulting in an empty set). + +A final note on the user-defined property tests: they will be used +only if the scalar has been marked as having Unicode characters. +Old byte-style strings will not be affected. + +=head2 User-Defined Case Mappings + You can also define your own mappings to be used in the lc(), lcfirst(), uc(), and ucfirst() (or their string-inlined versions). -The principle is the same: define subroutines in the C
package +The principle is similar to that of user-defined character +properties: to define subroutines in the C
package with names like C (for lc() and lcfirst()), C (for the first character in ucfirst()), and C (for uc(), and the rest of the characters in ucfirst()). @@ -756,9 +833,9 @@ are not directly user-accessible, one can use either the C module, or just match case-insensitively (that's when the C mapping is used). -A final note on the user-defined property tests and mappings: they -will be used only if the scalar has been marked as having Unicode -characters. Old byte-style strings will not be affected. +A final note on the user-defined case mappings: they will be used +only if the scalar has been marked as having Unicode characters. +Old byte-style strings will not be affected. =head2 Character Encodings for Input and Output @@ -789,7 +866,9 @@ Level 1 - Basic Unicode Support [ 1] \x{...} [ 2] \N{...} [ 3] . \p{...} \P{...} - [ 4] now scripts (see UTR#24 Script Names) in addition to blocks + [ 4] support for scripts (see UTR#24 Script Names), blocks, + binary properties, enumerated non-binary properties, and + numeric properties (as listed in UTR#18 Other Properties) [ 5] have negation [ 6] can use regular expression look-ahead [a] or user-defined character properties [b] to emulate subtraction @@ -1087,8 +1166,9 @@ as Unicode (UTF-8), there still are many places where Unicode (in some encoding or another) could be given as arguments or received as results, or both, but it is not. -The following are such interfaces. For all of these Perl currently -(as of 5.8.1) simply assumes byte strings both as arguments and results. +The following are such interfaces. For all of these interfaces Perl +currently (as of 5.8.3) simply assumes byte strings both as arguments +and results, or UTF-8 strings if the C pragma has been used. One reason why Perl does not attempt to resolve the role of Unicode in this cases is that the answers are highly dependent on the operating @@ -1101,7 +1181,7 @@ portable concept. 
Similarly for the qx and system: how well will the
 
 =item *
 
-chmod, chmod, chown, chroot, exec, link, lstat, mkdir,
+chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
 rename, rmdir, stat, symlink, truncate, unlink, utime, -X
 
 =item *