X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperluniintro.pod;h=814430326045e78b27a4cba13fc30827a913f216;hb=0111a78fcc993bdfaa4b46112924c3a9751ecfa5;hp=36f729c67b451a2b138a20153c15ebe7f932e5e7;hpb=c3c0aa283b73660f84ae7e190dcbbd607facb512;p=p5sagit%2Fp5-mst-13.2.git
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 36f729c..8144303 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -45,25 +45,29 @@ these properties are independent of the names of the characters.
Furthermore, various operations on the characters like uppercasing,
lowercasing, and collating (sorting) are defined.
-A Unicode character consists either of a single code point, or a
-I<base character> (like C<LATIN CAPITAL LETTER A>), followed by one or
-more I<modifiers> (like C<COMBINING ACUTE ACCENT>). This sequence of
+A Unicode I<logical> "character" can actually consist of more than one internal
+I<actual> "character" or code point. For Western languages, this is adequately
+represented by a I<base character> (like C<LATIN CAPITAL LETTER A>), followed
+by one or more I<modifiers> (like C<COMBINING ACUTE ACCENT>). This sequence of
base character and modifiers is called a I<combining character
-sequence>.
-
-Whether to call these combining character sequences "characters"
-depends on your point of view. If you are a programmer, you probably
-would tend towards seeing each element in the sequences as one unit,
-or "character". The whole sequence could be seen as one "character",
-however, from the user's point of view, since that's probably what it
-looks like in the context of the user's language.
+sequence>. Some non-Western languages require more complicated
+representations, so Unicode invented a I<grapheme cluster> and then an
+I<extended grapheme cluster>. For example, a Korean Hangul syllable is
+considered a single logical character, but most often consists of three actual
+characters: a leading consonant followed by an interior vowel followed by a
+trailing consonant.
+
+Whether to call these extended grapheme clusters "characters" depends on your
+point of view. If you are a programmer, you probably would tend towards seeing
+each element in the sequences as one unit, or "character". The whole sequence
+could be seen as one "character", however, from the user's point of view, since
+that's probably what it looks like in the context of the user's language.
With this "whole sequence" view of characters, the total number of
characters is open-ended. But in the programmer's "one unit is one
character" point of view, the concept of "characters" is more
deterministic. In this document, we take that second point of view:
-one "character" is one Unicode code point, be it a base character or
-a combining character.
+one "character" is one Unicode code point.
For some combinations, there are I<precomposed> characters.
C<LATIN CAPITAL LETTER A WITH ACUTE>, for example, is defined as
@@ -261,14 +265,14 @@ strings as usual. Functions like C<index()>, C<length()>, and
C<substr()> will work on the Unicode characters; regular expressions
will work on the Unicode characters (see L<perlunicode> and L<perlretut>).
-Note that Perl considers combining character sequences to be
-separate characters, so for example
+Note that Perl treats each code point in a grapheme cluster as a separate
+character, so for example
    use charnames ':full';
    print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";
will print 2, not 1. The only exception is that regular expressions
-have C<\X> for matching a combining character sequence.
+have C<\X> for matching an extended grapheme cluster.
Life is not quite so transparent, however, when working with legacy
encodings, I/O, and certain special cases:
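The distinction drawn in the hunks above, between code points (what C<length> counts) and extended grapheme clusters (what C<\X> matches), can be sketched as follows. This is a minimal illustration and not part of the patch: the helper name C<count_units> is hypothetical, and the Hangul result assumes a Perl (5.12 or later) in which C<\X> matches an extended grapheme cluster.

```perl
use strict;
use warnings;
use charnames ':full';

# Hypothetical helper: returns (number of code points, number of
# extended grapheme clusters) for a string.
sub count_units {
    my ($str) = @_;
    my $code_points = length $str;          # Perl "characters" are code points
    my $clusters    = () = $str =~ /\X/g;   # count successive \X matches
    return ($code_points, $clusters);
}

# Base character followed by one combining mark.
my $a_acute = "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}";

# A Hangul syllable spelled as three conjoining jamo:
# leading consonant U+1112, vowel U+1161, trailing consonant U+11AB.
my $hangul = "\x{1112}\x{1161}\x{11AB}";

printf "%d code points, %d cluster(s)\n", count_units($a_acute);  # 2 code points, 1 cluster(s)
printf "%d code points, %d cluster(s)\n", count_units($hangul);   # 3 code points, 1 cluster(s)
```

Only the counts are printed, so no output encoding layer is needed. On Perls older than 5.12, where C<\X> matched a combining character sequence rather than an extended grapheme cluster, the Hangul cluster count may differ.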