From: SADAHIRO Tomoyuki
Date: Sun, 4 Jun 2006 15:52:54 +0000 (+0900)
Subject: [DOCPATCH perlunicode.pod] paragraphing nit
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=822502e5e1ee67853c76322faa5c660c9f9a49da;p=p5sagit%2Fp5-mst-13.2.git

[DOCPATCH perlunicode.pod] paragraphing nit

Message-Id: <20060604155149.0913.BQW10602@nifty.com>
p4raw-id: //depot/perl@28352
---

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 61f18b3..21c5bb3 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -153,7 +153,6 @@ Additionally, if you
 you can use the C<\N{...}> notation and put the official Unicode character name within the braces, such as C<\N{WHITE SMILING FACE}>.
-
 =item *
 If an appropriate L is specified, identifiers within the
@@ -182,7 +181,120 @@ with byte semantics.)
 Named Unicode properties, scripts, and block ranges may be used like character classes via the C<\p{}> "matches property" construct and
-the C<\P{}> negation, "doesn't match property".
+the C<\P{}> negation, "doesn't match property".
+
+See L for more details.
+
+You can define your own character properties and use them
+in the regular expression with the C<\p{}> or C<\P{}> construct.
+
+See L for more details.
+
+=item *
+
+The special pattern C<\X> matches any extended Unicode
+sequence--"a combining character sequence" in Standardese--where the
+first character is a base character and subsequent characters are mark
+characters that apply to the base character. C<\X> is equivalent to
+C<(?:\PM\pM*)>.
+
+=item *
+
+The C operator translates characters instead of bytes. Note
+that the C functionality has been removed. For similar
+functionality see pack('U0', ...) and pack('C0', ...).
+
+=item *
+
+Case translation operators use the Unicode case translation tables
+when character input is provided. Note that C, or C<\U> in
+interpolated strings, translates to uppercase, while C,
+or C<\u> in interpolated strings, translates to titlecase in languages
+that make the distinction.
+
+=item *
+
+Most operators that deal with positions or lengths in a string will
+automatically switch to using character positions, including
+C, C, C, C, C, C,
+C, C, and C. An operator that
+specifically does not switch is C. Operators that really don't
+care include operators that treat strings as a bucket of bits such as
+C, and operators dealing with filenames.
+
+=item *
+
+The C/C letter C does I change, since it is often
+used for byte-oriented formats. Again, think C in the C language.
+
+There is a new C specifier that converts between Unicode characters
+and code points. There is also a C specifier that is the equivalent of
+C/C and properly handles character values even if they are above 255.
+
+=item *
+
+The C and C functions work on characters, similar to
+C and C, I C and
+C. C and C are methods for
+emulating byte-oriented C and C on Unicode strings.
+While these methods reveal the internal encoding of Unicode strings,
+that is not something one normally needs to care about at all.
+
+=item *
+
+The bit string operators, C<& | ^ ~>, can operate on character data.
+However, for backward compatibility, such as when using bit string
+operations when characters are all less than 256 in ordinal value, one
+should not use C<~> (the bit complement) with characters of both
+values less than 256 and values greater than 256. Most importantly,
+DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
+will not hold. The reason for this mathematical I is that
+the complement cannot return B the 8-bit (byte-wide) bit
+complement B the full character-wide bit complement.
+
+=item *
+
+lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
+
+=over 8
+
+=item *
+
+the case mapping is from a single Unicode character to another
+single Unicode character, or
+
+=item *
+
+the case mapping is from a single Unicode character to more
+than one Unicode character.
+
+=back
+
+Things to do with locales (Lithuanian, Turkish, Azeri) do B work
+since Perl does not understand the concept of Unicode locales.
+
+See the Unicode Technical Report #21, Case Mappings, for more details.
+
+But you can also define your own mappings to be used in the lc(),
+lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
+
+See L for more details.
+
+=back
+
+=over 4
+
+=item *
+
+And finally, C reverses by character rather than by byte.
+
+=back
+
+=head2 Unicode Character Properties
+
+Named Unicode properties, scripts, and block ranges may be used like
+character classes via the C<\p{}> "matches property" construct and
+the C<\P{}> negation, "doesn't match property".
 For instance, C<\p{Lu}> matches any character with the Unicode "Lu" (Letter, uppercase) property, while C<\p{M}> matches any character
@@ -208,6 +320,10 @@ B
+=over 4
+
+=item General Category
+
 Here are the basic Unicode General Category properties, followed by their long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>, for instance, are identical.
@@ -271,6 +387,8 @@ representation of Unicode characters, there is no need to implement
 the somewhat messy concept of surrogates. C is therefore not supported.
+=item Bidirectional Character Types
+
 Because scripts differ in their directionality--Hebrew is written right to left, for example--Unicode supplies these properties in the BidiClass class:
@@ -300,9 +418,7 @@ the BidiClass class:
 For example, C<\p{BidiClass:R}> matches characters that are normally written right to left.
-=back
-
-=head2 Scripts
+=item Scripts
 The script names which can be used by C<\p{...}> and C<\P{...}>, such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:
@@ -352,6 +468,8 @@ such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:
 Tibetan Yi
+=item Extended property classes
+
 Extended property classes can supplement the basic properties, defined by the F Unicode database:
@@ -399,11 +517,13 @@ and there are further derived properties:
 Common Any character (or unassigned code point) not explicitly assigned to a script
+=item Use of "Is" Prefix
+
 For backward compatibility (with Perl 5.6), all properties mentioned so far may have C prepended to their name, so C<\P{IsLu}>, for example, is equal to C<\P{Lu}>.
-=head2 Blocks
+=item Blocks
 In addition to B, Unicode also defines B of characters. The difference between scripts and blocks is that the
@@ -542,101 +662,6 @@ These block names are supported:
 InYiRadicals InYiSyllables
-=over 4
-
-=item *
-
-The special pattern C<\X> matches any extended Unicode
-sequence--"a combining character sequence" in Standardese--where the
-first character is a base character and subsequent characters are mark
-characters that apply to the base character. C<\X> is equivalent to
-C<(?:\PM\pM*)>.
-
-=item *
-
-The C operator translates characters instead of bytes. Note
-that the C functionality has been removed. For similar
-functionality see pack('U0', ...) and pack('C0', ...).
-
-=item *
-
-Case translation operators use the Unicode case translation tables
-when character input is provided. Note that C, or C<\U> in
-interpolated strings, translates to uppercase, while C,
-or C<\u> in interpolated strings, translates to titlecase in languages
-that make the distinction.
-
-=item *
-
-Most operators that deal with positions or lengths in a string will
-automatically switch to using character positions, including
-C, C, C, C, C, C,
-C, C, and C. An operator that
-specifically does not switch is C. Operators that really don't
-care include operators that treat strings as a bucket of bits such as
-C, and operators dealing with filenames.
-
-=item *
-
-The C/C letter C does I change, since it is often
-used for byte-oriented formats. Again, think C in the C language.
-
-There is a new C specifier that converts between Unicode characters
-and code points. There is also a C specifier that is the equivalent of
-C/C and properly handles character values even if they are above 255.
-
-=item *
-
-The C and C functions work on characters, similar to
-C and C, I C and
-C. C and C are methods for
-emulating byte-oriented C and C on Unicode strings.
-While these methods reveal the internal encoding of Unicode strings,
-that is not something one normally needs to care about at all.
-
-=item *
-
-The bit string operators, C<& | ^ ~>, can operate on character data.
-However, for backward compatibility, such as when using bit string
-operations when characters are all less than 256 in ordinal value, one
-should not use C<~> (the bit complement) with characters of both
-values less than 256 and values greater than 256. Most importantly,
-DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
-will not hold. The reason for this mathematical I is that
-the complement cannot return B the 8-bit (byte-wide) bit
-complement B the full character-wide bit complement.
-
-=item *
-
-lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
-
-=over 8
-
-=item *
-
-the case mapping is from a single Unicode character to another
-single Unicode character, or
-
-=item *
-
-the case mapping is from a single Unicode character to more
-than one Unicode character.
-
-=back
-
-Things to do with locales (Lithuanian, Turkish, Azeri) do B work
-since Perl does not understand the concept of Unicode locales.
-
-See the Unicode Technical Report #21, Case Mappings, for more details.
-
-=back
-
-=over 4
-
-=item *
-
-And finally, C reverses by character rather than by byte.
-
 =back
 =head2 User-Defined Character Properties
@@ -755,9 +780,16 @@ two (or more) classes.
 It's important to remember not to use "&" for the first set -- that would be intersecting with nothing (resulting in an empty set).
+A final note on the user-defined property tests: they will be used
+only if the scalar has been marked as having Unicode characters.
+Old byte-style strings will not be affected.
+
+=head2 User-Defined Case Mappings
+
 You can also define your own mappings to be used in the lc(), lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
-The principle is the same: define subroutines in the C package
+The principle is similar to that of user-defined character
+properties: to define subroutines in the C package
 with names like C (for lc() and lcfirst()), C (for the first character in ucfirst()), and C (for uc(), and the rest of the characters in ucfirst()).
@@ -801,9 +833,9 @@ are not directly user-accessible, one can use either the
 C module, or just match case-insensitively (that's when the C mapping is used).
-A final note on the user-defined property tests and mappings: they
-will be used only if the scalar has been marked as having Unicode
-characters. Old byte-style strings will not be affected.
+A final note on the user-defined case mappings: they will be used
+only if the scalar has been marked as having Unicode characters.
+Old byte-style strings will not be affected.
 =head2 Character Encodings for Input and Output