From: Karl Williamson
Date: Mon, 21 Dec 2009 18:44:35 +0000 (-0700)
Subject: Fix up pods for \X
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=0111a78fcc993bdfaa4b46112924c3a9751ecfa5;p=p5sagit%2Fp5-mst-13.2.git

Fix up pods for \X
---

diff --git a/pod/perlre.pod b/pod/perlre.pod
index 794a512..7127de0 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -257,8 +257,7 @@ X X X X
     \D          Match a non-digit character
     \pP         Match P, named property. Use \p{Prop} for longer names.
     \PP         Match non-P
-    \X          Match eXtended Unicode "combining character sequence",
-                equivalent to (?>\PM\pM*)
+    \X          Match Unicode "eXtended grapheme cluster"
     \C          Match a single C char (octet) even under Unicode.
                 NOTE: breaks up characters into their UTF-8 bytes,
                 so you may end up with malformed pieces of UTF-8.
@@ -517,7 +516,6 @@ left parentheses have opened before it. Likewise \11 is a
 backreference only if at least 11 left parentheses have opened
 before it. And so on. \1 through \9 are always interpreted as
 backreferences.
-
 If the bracketing group did not match, the associated backreference won't
 match either. (This can happen if the bracketing group is optional, or
 in a different branch of an alternation.)
diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index 40f73fc..e8ffcf1 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -100,7 +100,7 @@ quoted constructs>.
  \w                Character class for word characters.
  \W                Character class for non-word characters.
  \x{}, \x00        Hexadecimal escape sequence.
- \X                Extended Unicode "combining character sequence".
+ \X                Unicode "extended grapheme cluster".
  \z                End of string.
  \Z                End of string.
 
@@ -507,18 +507,14 @@ metacharacter, and suggests C<\R> as the notation.
 
 =item \X
 
-This matches an extended Unicode I<combining character sequence>, and
-is equivalent to C<< (?>\PM\pM*) >>. C<\PM> matches any character that is
-not considered a Unicode mark character, while C<\pM> matches any character
-that is considered a Unicode mark character; so C<\X> matches any non
-mark character followed by zero or more mark characters. Mark characters
-include (but are not restricted to) I and
-I.
+This matches a Unicode I<extended grapheme cluster>.
 
 C<\X> matches quite well what normal (non-Unicode-programmer) usage
-would consider a single character: for example a base character
-(the C<\PM> above), for example a letter, followed by zero or more
-diacritics, which are I (the C<\pM*> above).
+would consider a single character. As an example, consider a G with some sort
+of accent mark over it (a diacritic). There is no such single character in
+Unicode, but something like one can be constructed by using a G followed by a
+Unicode combining accent, and would be displayed by Unicode-aware software as
+if it were a single character.
 
 Mnemonic: eI<X>tended Unicode character.
 
diff --git a/pod/perlreref.pod b/pod/perlreref.pod
index f384fe7..5213420 100644
--- a/pod/perlreref.pod
+++ b/pod/perlreref.pod
@@ -135,7 +135,7 @@ and L for details.
    \p{...}  Match Unicode property with long name
    \PP      Match non-P
    \P{...}  Match lack of Unicode property with long name
-   \X       Match extended Unicode combining character sequence
+   \X       Match Unicode extended grapheme cluster
 
 POSIX character classes and their Unicode and Perl equivalents:
 
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 36f729c..8144303 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -45,25 +45,29 @@ these properties are independent of the names of the characters.
 Furthermore, various operations on the characters like uppercasing,
 lowercasing, and collating (sorting) are defined.
 
-A Unicode character consists either of a single code point, or a
-I<base character> (like C), followed by one or
-more I<modifiers> (like C). This sequence of
+A Unicode I<logical> "character" can actually consist of more than one internal
+I<actual> "character" or code point. For Western languages, this is adequately
+represented by a I<base character> (like C), followed
+by one or more I<modifiers> (like C). This sequence of
 base character and modifiers is called a I<combining character
-sequence>.
-
-Whether to call these combining character sequences "characters"
-depends on your point of view. If you are a programmer, you probably
-would tend towards seeing each element in the sequences as one unit,
-or "character". The whole sequence could be seen as one "character",
-however, from the user's point of view, since that's probably what it
-looks like in the context of the user's language.
+sequence>. Some non-western languages require more complicated
+representations, so Unicode invented a I<grapheme cluster> and then an
+I<extended grapheme cluster>. For example, A Korean Hangul syllable is
+considered a single logical character, but most often consists of three actual
+characters: a leading consonant followed by an interior vowel followed by a
+trailing consonant.
+
+Whether to call these extended grapheme clusters "characters" depends on your
+point of view. If you are a programmer, you probably would tend towards seeing
+each element in the sequences as one unit, or "character". The whole sequence
+could be seen as one "character", however, from the user's point of view, since
+that's probably what it looks like in the context of the user's language.
 
 With this "whole sequence" view of characters, the total number of
 characters is open-ended. But in the programmer's "one unit is one
 character" point of view, the concept of "characters" is more
 deterministic. In this document, we take that second point of view:
-one "character" is one Unicode code point, be it a base character or
-a combining character.
+one "character" is one Unicode code point.
 
 For some combinations, there are I characters.
 C, for example, is defined as
@@ -261,14 +265,14 @@ strings as usual. Functions like C, C, and
 C will work on the Unicode characters; regular expressions
 will work on the Unicode characters (see L and L).
 
-Note that Perl considers combining character sequences to be
-separate characters, so for example
+Note that Perl considers grapheme clusters to be separate characters, so for
+example
 
     use charnames ':full';
     print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";
 
 will print 2, not 1. The only exception is that regular expressions
-have C<\X> for matching a combining character sequence.
+have C<\X> for matching an extended grapheme cluster.
 
 Life is not quite so transparent, however, when working with legacy
 encodings, I/O, and certain special cases: