From: karl williamson
Date: Mon, 28 Dec 2009 16:16:27 +0000 (-0700)
Subject: PATCH: document all Perl Unicode \p{} extensions
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=9f815e241cf04d04fc645970753438216a0ed024;p=p5sagit%2Fp5-mst-13.2.git

PATCH: document all Perl Unicode \p{} extensions

This also changes some C<> constructs.

From d01b049b3aa9bc3a394adb30d6db735f5dd52321 Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Mon, 28 Dec 2009 09:14:48 -0700
Subject: [PATCH] Document all perl Unicode \p extensions

Signed-off-by: H.Merijn Brand
---
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 09b5215..1737b52 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -192,8 +192,8 @@ See L for more details.
 =item *
-The special pattern C<\X> matches a logical character, an C in Standardese. In Unicode what appears to the user to be a single
+The special pattern C<\X> matches a logical character, an "extended grapheme
+cluster" in Standardese. In Unicode what appears to the user to be a single
 character, for example an accented C, may in fact be composed of a sequence of characters, in this case a C followed by an accent character. C<\X> will match the entire sequence.
@@ -290,8 +290,8 @@ take on more values than just True and False. For example, the Bidi_Class (see L below), can take on a number of different values, such as Left, Right, Whitespace, and others. To match these, one needs to specify the property name (Bidi_Class), and the value being matched against
-(Left, Right, etc.). This is done, as in the examples above, by having the two
-components separated by an equal sign (or interchangeably, a colon), like
+(Left, Right, I<etc.>). This is done, as in the examples above, by having the
+two components separated by an equal sign (or interchangeably, a colon), like
 C<\p{Bidi_Class: Left}>.
 All Unicode-defined character properties may be written in these compound forms
@@ -339,7 +339,7 @@ Every Unicode character is assigned a general category, which is the "most usual categorization of a character" (from L).
-The compound way of writing these is like C<{\p{General_Category=Number}>
+The compound way of writing these is like C<\p{General_Category=Number}>
 (short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up through the equal or colon separator is omitted. So you can instead just write C<\pN>.
@@ -452,13 +452,6 @@ C<\P{Cyrillic}>.
 A complete list of scripts and their shortcuts is in L.
-=head3 B
-
-There are many more property classes than the basic ones described here,
-including some Perl extensions.
-A complete list is in L.
-The extensions are more fully described in L
-
 =head3 B
 For backward compatibility (with Perl 5.6), all properties mentioned
@@ -472,11 +465,11 @@ In addition to B, Unicode also defines B of characters. The difference between scripts and blocks is that the concept of scripts is closer to natural languages, while the concept of blocks is more of an artificial grouping based on groups of Unicode
-characters with consecutive ordinal values. For example, the C
+characters with consecutive ordinal values. For example, the "Basic Latin"
 block is all characters whose ordinals are between 0 and 127, inclusive, in
-other words, the ASCII characters. The C script contains some letters
-from this block as well as several more, like C,
-C, I<etc.>, but it does not contain all the characters from
+other words, the ASCII characters. The "Latin" script contains some letters
+from this block as well as several more, like "Latin-1 Supplement",
+"Latin Extended-A", I<etc.>, but it does not contain all the characters from
 those blocks. It does not, for example, contain digits, because digits are shared across many scripts. Digits and similar groups, like punctuation, are in the script called C.
 There is also a script called C for
@@ -504,14 +497,14 @@ reasons:
 =item 1
 It is confusing. There are many naming conflicts, and you may forget some.
-For example, \p{Hebrew} means the I