\x{263a} wide hex char (example: SMILEY)
\c[ control char (example: ESC)
\N{name} named Unicode character
+ \N{U+263D} Unicode character (example: FIRST QUARTER MOON)
The character following C<\c> is mapped to some other character by
converting letters to upper case and then (on ASCII systems) by inverting
'@', the letters, '[', '\', ']', '^', '_' and '?' will work, resulting
in 0x00 through 0x1F and 0x7F.
-B<NOTE>: Unlike C and other languages, Perl has no \v escape sequence for
-the vertical tab (VT - ASCII 11), but you may use C<\ck> or C<\x0b>.
+C<\N{U+I<wide hex char>}> means the Unicode character whose Unicode ordinal
+number is I<wide hex char>.
+For documentation of C<\N{name}>, see L<charnames>.
-The following escape sequences are available in constructs that interpolate
+B<NOTE>: Unlike C and other languages, Perl has no C<\v> escape sequence for
+the vertical tab (VT - ASCII 11), but you may use C<\ck> or C<\x0b>. (C<\v>
+does have meaning in regular expression patterns in Perl; see L<perlre>.)
+
+The following escape sequences are available in constructs that interpolate,
but not in transliterations.
X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q>
C<\u> and C<\U> is taken from the current locale. See L<perllocale>.
If Unicode (for example, C<\N{}> or wide hex characters of 0x100 or
beyond) is being used, the case map used by C<\l>, C<\L>, C<\u> and
-C<\U> is as defined by Unicode. For documentation of C<\N{name}>,
-see L<charnames>.
+C<\U> is as defined by Unicode.
All systems use the virtual C<"\n"> to represent a line terminator,
called a "newline". There is no such thing as an unvarying, physical
\x{263a} long hex char (example: Unicode SMILEY)
\cK control char (example: VT)
\N{name} named Unicode character
+ \N{U+263D} Unicode character (example: FIRST QUARTER MOON)
\l lowercase next char (think vi)
\u uppercase next char (think vi)
\L lowercase till \E (think vi)
X<\R>
Note that C<\N> has two meanings. When of the form C<\N{NAME}>, it matches the
-character whose name is C<NAME>. Otherwise it matches any character but C<\n>.
+character whose name is C<NAME>; and similarly when of the form
+C<\N{U+I<wide hex char>}>, it matches the character whose Unicode ordinal is
+I<wide hex char>. Otherwise it matches any character but C<\n>.
The POSIX character class syntax
X<character class>
\L Lowercase till \E. Not in [].
\n (Logical) newline character.
\N Any character but newline. Not in [].
- \N{} Named (Unicode) character.
+ \N{} Named or numbered (Unicode) character.
\p{}, \pP Character with the given Unicode property.
\P{}, \PP Character without the given Unicode property.
\Q Quotemeta till \E. Not in [].
$str =~ /\cK/; # Matches if $str contains a vertical tab (control-K).
-=head3 Named characters
+=head3 Named or numbered characters
-All Unicode characters have a Unicode name. It is even possible to give your
-own names to characters, even to short sequences of characters. You can use a
-character by name by using the C<\N{}> construct; the name of the character
-goes between the curly braces. You do have to C<use charnames> to load the
-Unicode names of the characters, otherwise Perl will complain. (If you instead
-have your own names, a C<use> statement will be required for your translator.)
-For more details, see L<charnames>.
+All Unicode characters have a Unicode name and numeric ordinal value. Use the
+C<\N{}> construct to specify a character by either of these values.
+
+To specify by name, the name of the character goes between the curly braces.
+In this case, you have to C<use charnames> to load the Unicode names of the
+characters; otherwise Perl will complain.
+
+To specify by Unicode ordinal number, use the form
+C<\N{U+I<wide hex character>}>, where I<wide hex character> is a number in
+hexadecimal that gives the ordinal number that Unicode has assigned to the
+desired character. It is customary (but not required) to use leading zeros to
+pad the number to 4 digits. Thus C<\N{U+0041}> means
+C<Latin Capital Letter A>, and you will rarely see it written without the two
+leading zeros. C<\N{U+0041}> means C<A> even on EBCDIC machines (where the
+ordinal value of C<A> is not 0x41).
+
+It is even possible to give your own names to characters, and even to short
+sequences of characters. For details, see L<charnames>.
Mnemonic: I<N>amed character.
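
As an illustration of the two forms described in this hunk, here is a minimal sketch. It assumes perl 5.12 or later on an ASCII platform; the variable names are made up for the example and do not come from the patch.

```perl
use 5.012;
use strict;
use warnings;
use charnames ':full';    # loads the Unicode names used by \N{name}

# The same character written by ordinal and by name.
my $by_ordinal = "\N{U+263D}";
my $by_name    = "\N{FIRST QUARTER MOON}";

# Both forms are also accepted in patterns, including bracketed classes.
my $matches_a = "A" =~ /\N{U+0041}/;
my $in_class  = "A" =~ /[\N{U+0041}\N{U+263D}]/;

# Print the ordinal rather than the character itself, to avoid a
# "Wide character" warning on an unconfigured filehandle.
printf "U+%04X\n", ord($by_ordinal);    # prints U+263D
```

Writing the ordinal with four hex digits follows the padding custom mentioned above; C<\N{U+41}> would name the same character.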
metasymbol, except under the C</s> flag, which changes the meaning of C<.>, but
not C<\N>.
-Note that C<\N{...}> can mean a L<named character|/Named characters>.
+Note that C<\N{...}> can mean a
+L<named or numbered character|/Named or numbered characters>.
Mnemonic: Complement of I<\n>.
is not a quantifier, perl will assume that a character name is coming. For
example, C<\N{3}> means to match 3 non-newlines; C<\N{5,}> means to match 5 or
more non-newlines, but C<\N{4F}> is not a legal quantifier, and will cause
-perl to look for a character named C<4F> (and won't find it unless custom names
-have been defined.)
+perl to look for a character named C<4F> (and won't find one unless custom names
+have been defined that include it).
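
The quantifier reading of C<\N{...}> can be checked with a short sketch like the one below. It assumes perl 5.12 or later, where C<\N> outside a bracketed class means "any character but newline"; the sample string and variable names are invented for the illustration.

```perl
use 5.012;
use strict;
use warnings;

my $text = "abc\ndef";

# \N{3} is \N with a {3} quantifier: exactly three non-newline characters.
my ($three) = $text =~ /(\N{3})/;

# \N never matches "\n", even under /s, whereas "." does match it under /s.
my $n_with_s   = ($text =~ /c\Nd/s) ? 1 : 0;
my $dot_with_s = ($text =~ /c.d/s)  ? 1 : 0;

print "three=$three n_with_s=$n_with_s dot_with_s=$dot_with_s\n";
```

A brace body such as C<4F> is not a valid quantifier, so it is not tried here; as the text says, perl would treat it as a character name instead.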
C<\v> will match any character that is considered vertical white space;
this includes the carriage return and line feed characters (newline).
C<\e>,
C<\f>,
C<\n>,
-C<\N{NAME}>,
+C<\N{I<NAME>}>,
+C<\N{U+I<wide hex char>}>,
C<\r>,
C<\t>,
and
character class. For instance, C<[a-f\d]> will match any digit, or any of the
lowercase letters between 'a' and 'f' inclusive.
-C<\N> within a bracketed character class must be of the form C<\N{NAME}> for
-the same reason that a dot C<.> inside a bracketed character class loses its
-special meaning: it matches nearly anything, which generally isn't what you
-want to happen.
+C<\N> within a bracketed character class must be of the form C<\N{I<name>}> or
+C<\N{U+I<wide hex char>}> for the same reason that a dot C<.> inside a
+bracketed character class loses its special meaning: it matches nearly
+anything, which generally isn't what you want to happen.
Examples:
\x{263a} A wide hexadecimal value
\cx Control-x
\N{name} A named character
+ \N{U+263D} A Unicode character by hex ordinal
\l Lowercase next character
\u Titlecase next character
Figuring out the hexadecimal sequence of a Unicode character you want
or deciphering someone else's hexadecimal Unicode regexp is about as
much fun as programming in machine code. So another way to specify
-Unicode characters is to use the I<named character>> escape
-sequence C<\N{name}>. C<name> is a name for the Unicode character, as
+Unicode characters is to use the I<named character> escape
+sequence C<\N{I<name>}>. I<name> is a name for the Unicode character, as
specified in the Unicode standard. For instance, if we wanted to
represent or match the astrological sign for the planet Mercury, we
could use