pod/perlrecharclass.pod

   1 =head1 NAME
   2
   3 perlrecharclass - Perl Regular Expression Character Classes
   4
   5 =head1 DESCRIPTION
   6
   7 The top level documentation about Perl regular expressions
   8 is found in L<perlre>.
   9
  10 This manual page discusses the syntax and use of character
  11 classes in Perl Regular Expressions.
  12
  13 A character class is a way of denoting a set of characters,
  14 in such a way that one character of the set is matched.
  15 It's important to remember that matching a character class
  16 consumes exactly one character in the source string. (The source
  17 string is the string the regular expression is matched against.)
  18
  19 There are three types of character classes in Perl regular
  20 expressions: the dot, backslashed sequences, and the bracketed form.
  21
  22 =head2 The dot
  23
  24 The dot (or period), C<.>, is probably the most used, and certainly
  25 the best-known, character class. By default, a dot matches any
  26 character except the newline. This default can be changed so that the
  27 dot also matches the newline by using the I<single line> modifier: either
  28 for the entire regular expression with the C</s> modifier, or
  29 locally with C<(?s)>.
  30
  31 Here are some examples:
  32
  33  "a"  =~  /./       # Match
  34  "."  =~  /./       # Match
  35  ""   =~  /./       # No match (dot has to match a character)
  36  "\n" =~  /./       # No match (dot does not match a newline)
  37  "\n" =~  /./s      # Match (global 'single line' modifier)
  38  "\n" =~  /(?s:.)/  # Match (local 'single line' modifier)
  39  "ab" =~  /^.$/     # No match (dot matches one character)
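
The effect of the I<single line> modifier is easiest to see with a string that contains a newline. The following is a minimal sketch; the variable names and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample text: two lines separated by a newline.
my $text = "one\ntwo";

# Without /s the dot stops at the newline, so .+ covers only "one".
my ($default) = $text =~ /^(.+)/;

# With /s the dot also matches "\n", so .+ runs to the end of the string.
my ($single_line) = $text =~ /^(.+)/s;

# (?s:...) enables the behaviour for just one part of the pattern.
my $local = $text =~ /^one(?s:.)two$/ ? 1 : 0;

printf "default: %d chars, /s: %d chars, local: %d\n",
    length $default, length $single_line, $local;
```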
  40
  41 =head2 Backslashed sequences
  42
  43 Perl regular expressions contain many backslashed sequences that
  44 constitute a character class. That is, they will match a single
  45 character, if that character belongs to a specific set of characters
  46 (defined by the sequence). A backslashed sequence is a sequence of
  47 characters starting with a backslash. Not all backslashed sequences
  48 are character classes; for a full list, see L<perlrebackslash>.
  49
  50 Here's a list of the backslashed sequences that are character classes.
  51 They are discussed in more detail below.
  52
  53  \d             Match a digit character.
  54  \D             Match a non-digit character.
  55  \w             Match a "word" character.
  56  \W             Match a non-"word" character.
  57  \s             Match a white space character.
  58  \S             Match a non-white space character.
  59  \h             Match a horizontal white space character.
  60  \H             Match a character that isn't horizontal white space.
  61  \N             Match a character that isn't newline.
  62  \v             Match a vertical white space character.
  63  \V             Match a character that isn't vertical white space.
  64  \pP, \p{Prop}  Match a character matching a Unicode property.
  65  \PP, \P{Prop}  Match a character that doesn't match a Unicode property.
  66
  67 =head3 Digits
  68
  69 C<\d> matches a single character that is considered to be a I<digit>.
  70 What is considered a digit depends on the internal encoding of
  71 the source string. If the source string is in UTF-8 format, C<\d>
  72 not only matches the digits '0' - '9', but also Arabic, Devanagari and
  73 digits from other languages. Otherwise, if there is a locale in effect,
  74 it will match whatever characters the locale considers digits. Without
  75 a locale, C<\d> matches the digits '0' to '9'.
  76 See L</Locale, Unicode and UTF-8>.
  77
  78 Any character that isn't matched by C<\d> will be matched by C<\D>.
  79
  80 =head3 Word characters
  81
  82 C<\w> matches a single I<word> character: an alphanumeric character
  83 (that is, an alphabetic character, or a digit), or the underscore (C<_>).
  84 What is considered a word character depends on the internal encoding
  85 of the string. If it's in UTF-8 format, C<\w> matches those characters
  86 that are considered word characters in the Unicode database. That is, it
  87 not only matches ASCII letters, but also Thai letters, Greek letters, etc.
  88 If the source string isn't in UTF-8 format, C<\w> matches those characters
  89 that are considered word characters by the current locale. Without
  90 a locale in effect, C<\w> matches the ASCII letters, digits and the
  91 underscore.
  92
  93 Any character that isn't matched by C<\w> will be matched by C<\W>.
  94
  95 =head3 White space
  96
  97 C<\s> matches any single character that is considered white space. In the
  98 ASCII range, C<\s> matches the horizontal tab (C<\t>), the new line
  99 (C<\n>), the form feed (C<\f>), the carriage return (C<\r>), and the
 100 space (the vertical tab, C<\cK>, is not matched by C<\s>).  The exact set
 101 of characters matched by C<\s> depends on whether the source string is
 102 in UTF-8 format. If it is, C<\s> matches what is considered white space
 103 in the Unicode database. Otherwise, if there is a locale in effect, C<\s>
 104 matches whatever is considered white space by the current locale. Without
 105 a locale, C<\s> matches the five characters mentioned in the beginning
 106 of this paragraph.  Perhaps the most notable difference is that C<\s>
 107 matches a non-breaking space only if the non-breaking space is in a
 108 UTF-8 encoded string.
 109
 110 Any character that isn't matched by C<\s> will be matched by C<\S>.
 111
 112 C<\h> will match any character that is considered horizontal white space;
 113 this includes the space and the tab characters. C<\H> will match any character
 114 that is not considered horizontal white space.
 115
 116 C<\N>, like the dot, will match any character that is not a newline. The
 117 difference is that C<\N> will not be influenced by the single line C</s>
 118 regular expression modifier. (Note that, since C<\N{}> is also used for
 119 named characters, if C<\N> is followed by an opening brace and something that
 120 is not a quantifier, perl will assume that a character name is coming.  For
 121 example, C<\N{3}> means to match 3 non-newlines; C<\N{5,}> means to match 5 or
 122 more non-newlines, but C<\N{4F}> is not a legal quantifier, and will cause
 123 perl to look for a character named C<4F> (and won't find one unless custom names
 124 have been defined that include it).)
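
The difference between C<\N> and the dot, and the two meanings of C<\N{...}>, can be sketched as follows. The sample strings are illustrative only, and the C<\N{3}> quantifier form assumes a perl recent enough to support C<\N> as a character class.

```perl
use strict;
use warnings;

my $two_lines = "a\nb";   # hypothetical sample string

# Under /s the dot matches the newline, but \N still refuses to.
my $dot_matches = $two_lines =~ /^a.b$/s  ? 1 : 0;   # 1
my $N_matches   = $two_lines =~ /^a\Nb$/s ? 1 : 0;   # 0

# \N followed by a quantifier in braces: exactly three non-newlines.
my $three = "abc" =~ /^\N{3}$/ ? 1 : 0;              # 1

# \N followed by braces that are not a quantifier names a character;
# here U+41 is LATIN CAPITAL LETTER A.
my $named = "A" =~ /^\N{U+41}$/ ? 1 : 0;             # 1

print "dot=$dot_matches N=$N_matches three=$three named=$named\n";
```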
 125
 126 C<\v> will match any character that is considered vertical white space;
 127 this includes the carriage return and line feed characters (newline).
 128 C<\V> will match any character that is not considered vertical white space.
 129
 130 C<\R> matches anything that can be considered a newline under Unicode
 131 rules. It's not a character class, as it can match a multi-character
 132 sequence. Therefore, it cannot be used inside a bracketed character
 133 class. Details are discussed in L<perlrebackslash>.
 134
 135 C<\h>, C<\H>, C<\v>, C<\V>, and C<\R> are new in perl 5.10.0.
 136
 137 Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match
 138 the same characters, regardless of whether the source string is in UTF-8
 139 format or not. The set of characters they match is also not influenced
 140 by locale.
 141
 142 One might think that C<\s> is equivalent to C<[\h\v]>. This is not true.
 143 The vertical tab (C<"\x0b">) is not matched by C<\s>; it is, however,
 144 considered vertical white space. Furthermore, if the source string is
 145 not in UTF-8 format, the next line (C<"\x85">) and the no-break space
 146 (C<"\xA0">) are not matched by C<\s>, but are by C<\v> and C<\h> respectively.
 147 If the source string is in UTF-8 format, both the next line and the
 148 no-break space are matched by C<\s>.
 149
 150 The following table is a complete listing of characters matched by
 151 C<\s>, C<\h> and C<\v>.
 152
 153 The first column gives the code point of the character (in hex format),
 154 the second column gives the (Unicode) name. The third column indicates
 155 by which class(es) the character is matched.
 156
 157  0x00009        CHARACTER TABULATION   h s
 158  0x0000a              LINE FEED (LF)    vs
 159  0x0000b             LINE TABULATION    v
 160  0x0000c              FORM FEED (FF)    vs
 161  0x0000d        CARRIAGE RETURN (CR)    vs
 162  0x00020                       SPACE   h s
 163  0x00085             NEXT LINE (NEL)    vs  [1]
 164  0x000a0              NO-BREAK SPACE   h s  [1]
 165  0x01680            OGHAM SPACE MARK   h s
 166  0x0180e   MONGOLIAN VOWEL SEPARATOR   h s
 167  0x02000                     EN QUAD   h s
 168  0x02001                     EM QUAD   h s
 169  0x02002                    EN SPACE   h s
 170  0x02003                    EM SPACE   h s
 171  0x02004          THREE-PER-EM SPACE   h s
 172  0x02005           FOUR-PER-EM SPACE   h s
 173  0x02006            SIX-PER-EM SPACE   h s
 174  0x02007                FIGURE SPACE   h s
 175  0x02008           PUNCTUATION SPACE   h s
 176  0x02009                  THIN SPACE   h s
 177  0x0200a                  HAIR SPACE   h s
 178  0x02028              LINE SEPARATOR    vs
 179  0x02029         PARAGRAPH SEPARATOR    vs
 180  0x0202f       NARROW NO-BREAK SPACE   h s
 181  0x0205f   MEDIUM MATHEMATICAL SPACE   h s
 182  0x03000           IDEOGRAPHIC SPACE   h s
 183
 184 =over 4
 185
 186 =item [1]
 187
 188 NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in
 189 UTF-8 format.
 190
 191 =back
 192
 193 It is worth noting that C<\d>, C<\w>, etc., match single characters, not
 194 complete numbers or words. To match a number (that consists of digits),
 195 use C<\d+>; to match a word, use C<\w+>.
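
A short sketch of the difference between matching one character and matching a whole number or word. The input line is a made-up example and contains only ASCII characters.

```perl
use strict;
use warnings;

my $line = "route 66 has 2448 miles";   # hypothetical input

# \d matches one digit at a time ...
my @digits  = $line =~ /(\d)/g;    # 6, 6, 2, 4, 4, 8

# ... while \d+ and \w+ pick up complete runs.
my @numbers = $line =~ /(\d+)/g;   # 66, 2448
my @words   = $line =~ /(\w+)/g;   # route, 66, has, 2448, miles

print scalar(@digits), " digits, ",
      scalar(@numbers), " numbers, ",
      scalar(@words), " words\n";
```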
 196
 197
 198 =head3 Unicode Properties
 199
 200 C<\pP> and C<\p{Prop}> are character classes to match characters that
 201 fit given Unicode classes. One-letter classes can be used in the C<\pP>
 202 form, with the class name following the C<\p>; otherwise, braces are required.
 203 There is a single form, which is just the property name enclosed in the braces,
 204 and a compound form which looks like C<\p{name=value}>, which means to match
 205 if the property C<name> for the character has the particular C<value>.
 206 For instance, a match for a number can be written as C</\pN/> or as
 207 C</\p{Number}/>, or as C</\p{Number=True}/>.
 208 Lowercase letters are matched by the property I<Lowercase_Letter> which
 209 has as short form I<Ll>. They need the braces, so are written as C</\p{Ll}/> or
 210 C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/>
 211 (the underscores are optional).
 212 C</\pLl/> is valid, but means something different.
 213 It matches a two-character string: a letter (Unicode property C<\pL>),
 214 followed by a lowercase C<l>.
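
The pitfall with C<\pLl> can be checked directly. This is a small sketch using ASCII sample strings; the variable names are illustrative.

```perl
use strict;
use warnings;

# \pLl is \pL (any letter) followed by a literal "l".
my $letter_then_l = "Al" =~ /^\pLl$/ ? 1 : 0;   # 1
my $single_a      = "a"  =~ /^\pLl$/ ? 1 : 0;   # 0, two characters are needed

# With braces, Ll names the Lowercase_Letter property.
my $lower = "a" =~ /^\p{Ll}$/ ? 1 : 0;          # 1
my $upper = "A" =~ /^\p{Ll}$/ ? 1 : 0;          # 0

# The compound form spells out both property name and value.
my $compound =
    "a" =~ /^\p{General_Category=Lowercase_Letter}$/ ? 1 : 0;   # 1

print "$letter_then_l $single_a $lower $upper $compound\n";
```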
 215
 216 For more details, see L<perlunicode/Unicode Character Properties>; for a
 217 complete list of possible properties, see
 218 L<perluniprops/Properties accessible through \p{} and \P{}>.
 219 It is also possible to define your own properties. This is discussed in
 220 L<perlunicode/User-Defined Character Properties>.
 221
 222
 223 =head4 Examples
 224
 225  "a"  =~  /\w/      # Match, "a" is a 'word' character.
 226  "7"  =~  /\w/      # Match, "7" is a 'word' character as well.
 227  "a"  =~  /\d/      # No match, "a" isn't a digit.
 228  "7"  =~  /\d/      # Match, "7" is a digit.
 229  " "  =~  /\s/      # Match, a space is white space.
 230  "a"  =~  /\D/      # Match, "a" is a non-digit.
 231  "7"  =~  /\D/      # No match, "7" is not a non-digit.
 232  " "  =~  /\S/      # No match, a space is not non-white space.
 233
 234  " "  =~  /\h/      # Match, space is horizontal white space.
 235  " "  =~  /\v/      # No match, space is not vertical white space.
 236  "\r" =~  /\v/      # Match, a return is vertical white space.
 237
 238  "a"  =~  /\pL/     # Match, "a" is a letter.
 239  "a"  =~  /\p{Lu}/  # No match, /\p{Lu}/ matches upper case letters.
 240
 241  "\x{0e0b}" =~ /\p{Thai}/  # Match, \x{0e0b} is the character
 242                            # 'THAI CHARACTER SO SO', and that's in
 243                            # the Thai Unicode class.
 244  "a"  =~  /\P{Lao}/ # Match, as "a" is not a Laotian character.
 245
 246
 247 =head2 Bracketed Character Classes
 248
 249 The third form of character class you can use in Perl regular expressions
 250 is the bracketed form. In its simplest form, it lists the characters
 251 that may be matched inside square brackets, like this: C<[aeiou]>.
 252 This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. As with the other
 253 character classes, exactly one character will be matched. To match
 254 a longer string consisting of characters mentioned in the character
 255 class, follow the character class with a quantifier. For instance,
 256 C<[aeiou]+> matches a string of one or more lowercase ASCII vowels.
 257
 258 Repeating a character in a character class has no
 259 effect; it's considered to be in the set only once.
 260
 261 Examples:
 262
 263  "e"  =~  /[aeiou]/        # Match, as "e" is listed in the class.
 264  "p"  =~  /[aeiou]/        # No match, "p" is not listed in the class.
 265  "ae" =~  /^[aeiou]$/      # No match, a character class only matches
 266                            # a single character.
 267  "ae" =~  /^[aeiou]+$/     # Match, due to the quantifier.
 268
 269 =head3 Special Characters Inside a Bracketed Character Class
 270
 271 Most characters that are meta characters in regular expressions (that
 272 is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose
 273 their special meaning and can be used inside a character class without
 274 the need to escape them. For instance, C<[()]> matches either an opening
 275 parenthesis, or a closing parenthesis, and the parens inside the character
 276 class don't group or capture.
 277
 278 Characters that may carry a special meaning inside a character class are:
 279 C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be
 280 escaped with a backslash, although this is sometimes not needed, in which
 281 case the backslash may be omitted.
 282
 283 The sequence C<\b> is special inside a bracketed character class. While
 284 outside the character class C<\b> is an assertion indicating a point
 285 that does not have either two word characters or two non-word characters
 286 on either side, inside a bracketed character class, C<\b> matches a
 287 backspace character.
 288
 289 The sequences
 290 C<\a>,
 291 C<\c>,
 292 C<\e>,
 293 C<\f>,
 294 C<\n>,
 295 C<\N{I<NAME>}>,
 296 C<\N{U+I<wide hex char>}>,
 297 C<\r>,
 298 C<\t>,
 299 and
 300 C<\x>
 301 are also special and have the same meanings as they do outside a bracketed character
 302 class.
 303
 304 Also, a backslash followed by digits is considered an octal number.
 305
 306 A C<[> is not special inside a character class, unless it's the start
 307 of a POSIX character class (see below). It normally does not need escaping.
 308
 309 A C<]> is either the end of a POSIX character class (see below), or it
 310 signals the end of the bracketed character class. Normally it needs
 311 escaping if you want to include a C<]> in the set of characters.
 312 However, if the C<]> is the I<first> (or the second if the first
 313 character is a caret) character of a bracketed character class, it
 314 does not denote the end of the class (as you cannot have an empty class)
 315 and is considered part of the set of characters that can be matched without
 316 escaping.
 317
 318 Examples:
 319
 320  "+"   =~ /[+?*]/     #  Match, "+" in a character class is not special.
 321  "\cH" =~ /[\b]/      #  Match, \b inside a character class
 322                       #  is equivalent to a backspace.
 323  "]"   =~ /[][]/      #  Match, as the character class contains
 324                       #  both [ and ].
 325  "[]"  =~ /[[]]/      #  Match, the pattern contains a character class
 326                       #  containing just [, and the character class is
 327                       #  followed by a ].
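
The placement rules for C<]> also apply after a leading caret. The following sketch, with illustrative variable names, anchors each pattern so that the whole one-character string has to match.

```perl
use strict;
use warnings;

# A ] listed first is a literal member of the class.
my $first   = "]" =~ /^[]a]$/  ? 1 : 0;   # 1

# After a leading caret the same rule holds: [^]a] excludes ] and a.
my $negated = "]" =~ /^[^]a]$/ ? 1 : 0;   # 0
my $other   = "x" =~ /^[^]a]$/ ? 1 : 0;   # 1

# Escaping works in any position.
my $escaped = "]" =~ /^[a\]]$/ ? 1 : 0;   # 1

# Inside brackets \b is a backspace ("\x08"), not a word boundary.
my $backspace = "\x08" =~ /^[\b]$/ ? 1 : 0;   # 1

print "$first $negated $other $escaped $backspace\n";
```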
 328
 329 =head3 Character Ranges
 330
 331 It is not uncommon to want to match a range of characters. Luckily, instead
 332 of listing all the characters in the range, one may use the hyphen (C<->).
 333 If inside a bracketed character class you have two characters separated
 334 by a hyphen, it's treated as if all the characters between the two are in
 335 the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]>
 336 matches any lowercase letter from the first half of the ASCII alphabet.
 337
 338 Note that the two characters on either side of the hyphen need not
 339 both be letters or both be digits. Any character is possible,
 340 although not advisable.  C<['-?]> contains a range of characters, but
 341 most people will not know which characters that will be. Furthermore,
 342 such ranges may lead to portability problems if the code has to run on
 343 a platform that uses a different character set, such as EBCDIC.
 344
 345 If a hyphen in a character class cannot be part of a range, for instance
 346 because it is the first or the last character of the character class,
 347 or if it immediately follows a range, the hyphen isn't special, and will be
 348 considered a character that may be matched. You have to escape the hyphen
 349 with a backslash if you want to have a hyphen in your set of characters to
 350 be matched, and its position in the class is such that it can be considered
 351 part of a range.
 352
 353 Examples:
 354
 355  [a-z]       #  Matches a character that is a lower case ASCII letter.
 356  [a-fz]      #  Matches any letter between 'a' and 'f' (inclusive) or the
 357              #  letter 'z'.
 358  [-z]        #  Matches either a hyphen ('-') or the letter 'z'.
 359  [a-f-m]     #  Matches any letter between 'a' and 'f' (inclusive), the
 360              #  hyphen ('-'), or the letter 'm'.
 361  ['-?]       #  Matches any of the characters  '()*+,-./0123456789:;<=>?
 362              #  (But not on an EBCDIC platform).
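
When it is unclear what a range contains, one way to find out is to test every character against it. The helper C<ascii_members> below is hypothetical, written only for this sketch, and the results shown in the comments assume an ASCII platform.

```perl
use strict;
use warnings;

# Hypothetical helper: return the ASCII characters (code points 0..127)
# matched by the given character class, in code point order.
sub ascii_members {
    my ($class) = @_;
    return join '', grep { /$class/ } map { chr($_) } 0 .. 127;
}

my $odd   = ascii_members(qr/['-?]/);    # '()*+,-./0123456789:;<=>?
my $mixed = ascii_members(qr/[a-f-m]/);  # -abcdefm
my $lead  = ascii_members(qr/[-z]/);     # -z

print "$odd\n$mixed\n$lead\n";
```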
 363
 364
 365 =head3 Negation
 366
 367 It is also possible to instead list the characters you do not want to
 368 match. You can do so by using a caret (C<^>) as the first character in the
 369 character class. For instance, C<[^a-z]> matches a character that is not a
 370 lowercase ASCII letter.
 371
 372 This syntax makes the caret a special character inside a bracketed character
 373 class, but only if it is the first character of the class. So if you want
 374 to have the caret as one of the characters you want to match, you either
 375 have to escape the caret, or not list it first.
 376
 377 Examples:
 378
 379  "e"  =~  /[^aeiou]/   #  No match, the 'e' is listed.
 380  "x"  =~  /[^aeiou]/   #  Match, as 'x' isn't a lowercase vowel.
 381  "^"  =~  /[^^]/       #  No match, matches anything that isn't a caret.
 382  "^"  =~  /[x^]/       #  Match, caret is not special here.
 383
 384 =head3 Backslash Sequences
 385
 386 You can put any backslash sequence character class (with one exception listed
 387 in the next paragraph) inside a bracketed character class, and it will act just
 388 as if you put all the characters matched by the backslash sequence inside the
 389 character class. For instance, C<[a-f\d]> will match any digit, or any of the
 390 lowercase letters between 'a' and 'f' inclusive.
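
As a rough check of this equivalence, the sketch below compares C<[a-f\d]> with the explicit class C<[a-f0-9]>. It looks only at the ASCII range, where C<\d> matches just '0' to '9'; the variable names are illustrative.

```perl
use strict;
use warnings;

my @ascii = map { chr($_) } 0 .. 127;

# Characters matched when \d is placed inside the brackets ...
my $with_backslash = join '', grep { /[a-f\d]/ } @ascii;

# ... and when the ASCII digits are listed as a range instead.
my $spelled_out    = join '', grep { /[a-f0-9]/ } @ascii;

print "$with_backslash\n";   # 0123456789abcdef
print $with_backslash eq $spelled_out ? "same\n" : "different\n";
```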
 391
 392 C<\N> within a bracketed character class must be of the forms C<\N{I<name>}>  or
 393 C<\N{U+I<wide hex char>}> for the same reason that a dot C<.> inside a
 394 bracketed character class loses its special meaning: it matches nearly
 395 anything, which generally isn't what you want to happen.
 396
 397 Examples:
 398
 399  /[\p{Thai}\d]/     # Matches a character that is either a Thai
 400                     # character, or a digit.
 401  /[^\p{Arabic}()]/  # Matches a character that is neither an Arabic
 402                     # character, nor a parenthesis.
 403
 404 Backslash sequence character classes cannot form one of the endpoints
 405 of a range.
 406
 407 =head3 Posix Character Classes
 408
 409 POSIX character classes have the form C<[:class:]>, where I<class> is
 410 the name, and C<[:> and C<:]> are the delimiters. POSIX character classes appear
 411 I<inside> bracketed character classes, and are a convenient and descriptive
 412 way of listing a group of characters. Be careful about the syntax:
 413
 414  # Correct:
 415  $string =~ /[[:alpha:]]/
 416
 417  # Incorrect (will warn):
 418  $string =~ /[:alpha:]/
 419
 420 The latter pattern would be a character class consisting of a colon,
 421 and the letters C<a>, C<l>, C<p> and C<h>.
 422
 423 Perl recognizes the following POSIX character classes:
 424
 425  alpha  Any alphabetical character.
 426  alnum  Any alphanumerical character.
 427  ascii  Any ASCII character.
 428  blank  A GNU extension, equal to a space or a horizontal tab ("\t").
 429  cntrl  Any control character.
 430  digit  Any digit, equivalent to "\d".
 431  graph  Any printable character, excluding a space.
 432  lower  Any lowercase character.
 433  print  Any printable character, including a space.
 434  punct  Any punctuation character.
 435  space  Any white space character. "\s" plus the vertical tab ("\cK").
 436  upper  Any uppercase character.
 437  word   Any "word" character, equivalent to "\w".
 438  xdigit Any hexadecimal digit, '0' - '9', 'a' - 'f', 'A' - 'F'.
 439
 440 The exact set of characters matched depends on whether the source string
 441 is internally in UTF-8 format or not. See L</Locale, Unicode and UTF-8>.
 442
 443 Most POSIX character classes have C<\p> counterparts. The difference
 444 is that the C<\p> classes will always match according to the Unicode
 445 properties, regardless of whether the string is in UTF-8 format or not.
 446
 447 The following table shows the relation between POSIX character classes
 448 and the Unicode properties:
 449
 450  [[:...:]]   \p{...}      backslash
 451
 452  alpha       IsAlpha
 453  alnum       IsAlnum
 454  ascii       IsASCII
 455  blank
 456  cntrl       IsCntrl
 457  digit       IsDigit      \d
 458  graph       IsGraph
 459  lower       IsLower
 460  print       IsPrint
 461  punct       IsPunct
 462  space       IsSpace
 463              IsSpacePerl  \s
 464  upper       IsUpper
 465  word        IsWord
 466  xdigit      IsXDigit
 467
 468 Some of these names may not be obvious:
 469
 470 =over 4
 471
 472 =item cntrl
 473
 474 Any control character. Usually, control characters don't produce output
 475 as such, but instead control the terminal somehow: for example newline
 476 and backspace are control characters. All characters with C<ord()> less
 477 than 32 are usually classified as control characters (in ASCII, the ISO
 478 Latin character sets, and Unicode), as is the character with an C<ord()>
 479 value of 127 (C<DEL>).
 480
 481 =item graph
 482
 483 Any character that is I<graphical>, that is, visible. This class consists
 484 of all the alphanumerical characters and all punctuation characters.
 485
 486 =item print
 487
 488 All printable characters, which is the set of all the graphical characters
 489 plus the space.
 490
 491 =item punct
 492
 493 Any punctuation (special) character.
 494
 495 =back
 496
 497 =head4 Negation
 498
 499 A Perl extension to the POSIX character class is the ability to
 500 negate it. This is done by prefixing the class name with a caret (C<^>).
 501 Some examples:
 502
 503  POSIX         Unicode       Backslash
 504  [[:^digit:]]  \P{IsDigit}   \D
 505  [[:^space:]]  \P{IsSpace}   \S
 506  [[:^word:]]   \P{IsWord}    \W
 507
 508 =head4 [= =] and [. .]
 509
 510 Perl will recognize the POSIX character classes C<[=class=]> and
 511 C<[.class.]>, but does not (yet?) support them. Use of
 512 such a construct will lead to an error.
 513
 514
 515 =head4 Examples
 516
 517  /[[:digit:]]/            # Matches a character that is a digit.
 518  /[01[:lower:]]/          # Matches a character that is either a
 519                           # lowercase letter, or '0' or '1'.
 520  /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything,
 521                           # but the letters 'a' to 'f' in either case.
 522                           # This is because the character class contains
 523                           # all digits, and anything that isn't a
 524                           # hex digit, resulting in a class containing
 525                           # all characters, but the letters 'a' to 'f'
 526                           # and 'A' to 'F'.
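
The combined class from the last example, and the equivalence of C<[[:^digit:]]> and C<\D>, can be tried out as in this sketch. It is restricted to ASCII input, and the names C<%matched> and C<$agree> are illustrative.

```perl
use strict;
use warnings;

# All digits, plus everything that is not a hex digit.
my $re = qr/^[[:digit:][:^xdigit:]]$/;

my %matched;
for my $char (qw(5 g Z a F)) {
    $matched{$char} = $char =~ $re ? 1 : 0;
}
# 5, g and Z match; a and F are hex digits that are not digits.

# Over the ASCII range, [[:^digit:]] and \D accept the same characters.
my $agree = 1;
for my $char (map { chr($_) } 0 .. 127) {
    my $posix     = $char =~ /[[:^digit:]]/ ? 1 : 0;
    my $backslash = $char =~ /\D/           ? 1 : 0;
    $agree = 0 if $posix != $backslash;
}

print join(' ', map { "$_=$matched{$_}" } sort keys %matched), "\n";
print "agree=$agree\n";
```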
 527
 528
 529 =head2 Locale, Unicode and UTF-8
 530
 531 Some of the character classes have a somewhat different behaviour depending
 532 on the internal encoding of the source string, and the locale that is
 533 in effect.
 534
 535 C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations,
 536 including C<\W>, C<\D>, C<\S>) suffer from this behaviour.
 537
 538 The rule is that if the source string is in UTF-8 format, the character
 539 classes match according to the Unicode properties. If the source string
 540 isn't, then the character classes match according to whatever locale is
 541 in effect. If there is no locale, they match the ASCII defaults
 542 (52 letters, 10 digits and underscore for C<\w>, 0 to 9 for C<\d>, etc.).
 543
 544 This usually means that if you are matching against characters whose C<ord()>
 545 values are between 128 and 255 inclusive, your character class may match
 546 or not depending on the current locale, and whether the source string is
 547 in UTF-8 format. The string will be in UTF-8 format if it contains
 548 characters whose C<ord()> value exceeds 255. But a string may be in UTF-8
 549 format without it having such characters.
 550
 551 For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s>
 552 or the POSIX character classes, and use the Unicode properties instead.
 553
 554 =head4 Examples
 555
 556  $str =  "\xDF";      # $str is not in UTF-8 format.
 557  $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format.
 558  $str .= "\x{0e0b}";  # Now $str is in UTF-8 format.
 559  $str =~ /^\w/;       # Match! $str is now in UTF-8 format.
 560  chop $str;
 561  $str =~ /^\w/;       # Still a match! $str remains in UTF-8 format.
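
Whether a string is currently in UTF-8 format can be inspected with the built-in C<utf8::is_utf8> function. The sketch below follows the same steps as the example above, but reports only the internal format and makes no claim about what C<\w> matches. It also uses C<utf8::upgrade> to show that a string can be in UTF-8 format without containing characters above 255.

```perl
use strict;
use warnings;

my $str = "\xDF";
my $before = utf8::is_utf8($str) ? 1 : 0;    # 0: not in UTF-8 format

$str .= "\x{0e0b}";                          # a character above 255
my $after = utf8::is_utf8($str) ? 1 : 0;     # 1: now in UTF-8 format

chop $str;                                   # remove that character again
my $chopped = utf8::is_utf8($str) ? 1 : 0;   # 1: the format is kept

# A plain ASCII string can be switched to UTF-8 format explicitly;
# its contents stay the same.
my $ascii = "abc";
utf8::upgrade($ascii);
my $upgraded = utf8::is_utf8($ascii) ? 1 : 0;   # 1

print "$before $after $chopped $upgraded\n";
```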
 562
 563 =cut