X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=ae13a33b1455516a37b8fa5ea73e67e13f01cc9a;hb=7b059540b116737402869fbccad6d5c540c7f62e;hp=1d3f84626f86cb232903f5ad15e9dcb5260f3fda;hpb=fde18df140d5f64815bdd632a127ecd5ce3d97fa;p=p5sagit%2Fp5-mst-13.2.git
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 1d3f846..ae13a33 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -10,6 +10,10 @@ Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.
+People who want to learn to use Unicode in Perl should probably read
+L, before reading
+this reference document.
+
=over 4
=item Input and Output Layers
@@ -20,15 +24,15 @@ the ":utf8" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L.
-To indicate that Perl source itself is using a particular encoding,
-see L.
+To indicate that Perl source itself is in UTF-8, use C<use utf8> below.
+
The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
@@ -83,15 +116,13 @@ input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The C<bytes> pragma should
-be used to force byte semantics on Unicode data.
+be used to force byte semantics on Unicode data, and the C<use feature 'unicode_strings'> pragma to force Unicode semantics on byte data (though in
+5.12 it isn't fully implemented).
If strings operating under byte semantics and strings with Unicode
-character data are concatenated, the new string will be upgraded to
-I, even if the old Unicode string used EBCDIC.
-This translation is done without regard to the system's native 8-bit
-encoding, so to change this for systems with non-Latin-1 and
-non-EBCDIC native encodings use the C pragma. See
-L.
+character data are concatenated, the new string will have
+character semantics. This can cause surprises: See L, below.
Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
@@ -111,17 +142,20 @@ Character semantics have the following effects:
Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.
-If you use a Unicode editor to edit your program, Unicode characters
-may occur directly within the literal strings in one of the various
-Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
-as such and converted to Perl's internal representation only if the
-appropriate L is specified.
+If you use a Unicode editor to edit your program, Unicode characters may
+occur directly within the literal strings in UTF-8 encoding, or UTF-16.
+(The former requires a BOM or C<use utf8>, the latter requires a BOM.)
-Unicode characters can also be added to a string by using the
-C<\x{...}> notation. The Unicode code for the desired character, in
-hexadecimal, should be placed in the braces. For instance, a smiley
-face is C<\x{263A}>. This encoding scheme only works for characters
-with a code of 0x100 or above.
+Unicode characters can also be added to a string by using the C<\N{U+...}>
+notation. The Unicode code for the desired character, in hexadecimal,
+should be placed in the braces, after the C<U+>. For instance, a smiley face is
+C<\N{U+263A}>.
+
+Alternatively, you can use the C<\x{...}> notation for characters 0x100 and
+above. For characters below 0x100 you may get byte semantics instead of
+character semantics; see L. On EBCDIC machines there is
+the additional problem that the value for such characters gives the EBCDIC
+character rather than the Unicode one.
Additionally, if you
@@ -129,7 +163,7 @@ Additionally, if you
you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.
-
+See L.
=item *
@@ -141,8 +175,7 @@ names.
=item *
Regular expressions match characters instead of bytes. "." matches
-a character instead of a byte. The C<\C> pattern is provided to force
-a match a single byte--a C in C, hence C<\C>.
+a character instead of a byte.
=item *
@@ -155,464 +188,581 @@ ideograph, for instance.
Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
-the C<\P{}> negation, "doesn't match property".
-
-For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
-(Letter, uppercase) property, while C<\p{M}> matches any character
-with an "M" (mark--accents and such) property. Brackets are not
-required for single letter properties, so C<\p{M}> is equivalent to
-C<\pM>. Many predefined properties are available, such as
-C<\p{Mirrored}> and C<\p{Tibetan}>.
-
-The official Unicode script and block names have spaces and dashes as
-separators, but for convenience you can use dashes, spaces, or
-underbars, and case is unimportant. It is recommended, however, that
-for consistency you use the following naming: the official Unicode
-script, property, or block name (see below for the additional rules
-that apply to block names) with whitespace and dashes removed, and the
-words "uppercase-first-lowercase-rest". C thus
-becomes C.
+the C<\P{}> negation, "doesn't match property".
+See L</"Unicode Character Properties"> for more details.
+
+You can define your own character properties and use them
+in the regular expression with the C<\p{}> or C<\P{}> construct.
+See L</"User-Defined Character Properties"> for more details.
+
+=item *
+
+The special pattern C<\X> matches a logical character, an "extended grapheme
+cluster" in Standardese. In Unicode what appears to the user to be a single
+character, for example an accented C, may in fact be composed of a sequence
+of characters, in this case a C followed by an accent character. C<\X>
+will match the entire sequence.
+
+=item *
+
+The C<tr///> operator translates characters instead of bytes. Note
+that the C<tr///CU> functionality has been removed. For similar
+functionality see pack('U0', ...) and pack('C0', ...).
+
+=item *
+
+Case translation operators use the Unicode case translation tables
+when character input is provided. Note that C<uc()>, or C<\U> in
+interpolated strings, translates to uppercase, while C<ucfirst()>,
+or C<\u> in interpolated strings, translates to titlecase in languages
+that make the distinction (which is equivalent to uppercase in languages
+without the distinction).
+
+=item *
+
+Most operators that deal with positions or lengths in a string will
+automatically switch to using character positions, including
+C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
+C<sprintf()>, C<write()>, and C<length()>. An operator that
+specifically does not switch is C<vec()>. Operators that really don't
+care include operators that treat strings as a bucket of bits such as
+C<sort()>, and operators dealing with filenames.
+
+=item *
+
+The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
+used for byte-oriented formats. Again, think C<char> in the C language.
+
+There is a new C<U> specifier that converts between Unicode characters
+and code points. There is also a C<W> specifier that is the equivalent of
+C<chr()>/C<ord()> and properly handles character values even if they are above 255.
+
+=item *
+
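+The C<chr()> and C<ord()> functions work on characters, similar to
+C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
+C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
+emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.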
+While these methods reveal the internal encoding of Unicode strings,
+that is not something one normally needs to care about at all.
+
+=item *
+
+The bit string operators, C<& | ^ ~>, can operate on character data.
+However, for backward compatibility, such as when using bit string
+operations when characters are all less than 256 in ordinal value, one
+should not use C<~> (the bit complement) with characters of both
+values less than 256 and values greater than 256. Most importantly,
+DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
+will not hold. The reason for this mathematical I<faux pas> is that
+the complement cannot return B<both> the 8-bit (byte-wide) bit
+complement B<and> the full character-wide bit complement.
+
+=item *
+
+You can define your own mappings to be used in lc(),
+lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
+See L</"User-Defined Case Mappings"> for more details.
+
+=back
+
+=over 4
+
+=item *
+
+And finally, C<reverse()> reverses by character rather than by byte.
+
+=back
+
+=head2 Unicode Character Properties
+
+Most Unicode character properties are accessible by using regular expressions.
+They are used like character classes via the C<\p{}> "matches property"
+construct and the C<\P{}> negation, "doesn't match property".
+
+For instance, C<\p{Uppercase}> matches any character with the Unicode
+"Uppercase" property, while C<\p{L}> matches any character with a
+General_Category property value of "L" (Letter). Braces are not
+required for single letter properties, so C<\p{L}> is equivalent to C<\pL>.
+
+More formally, C<\p{Uppercase}> matches any character whose Unicode Uppercase
+property value is True, and C<\P{Uppercase}> matches any character whose
+Uppercase property value is False, and they could have been written as
+C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.
+
+This formality is needed when properties are not binary, that is if they can
+take on more values than just True and False. For example, the Bidi_Class (see
+L</"Bidirectional Character Types"> below), can take on a number of different
+values, such as Left, Right, Whitespace, and others. To match these, one needs
+to specify the property name (Bidi_Class), and the value being matched against
+(Left, Right, I<etc.>). This is done, as in the examples above, by having the
+two components separated by an equal sign (or interchangeably, a colon), like
+C<\p{Bidi_Class: Left}>.
+
+All Unicode-defined character properties may be written in these compound forms
+of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some
+additional properties that are written only in the single form, as well as
+single-form short-cuts for all binary properties and certain others described
+below, in which you may omit the property name and the equals or colon
+separator.
+
+Most Unicode character properties have at least two synonyms (or aliases if you
+prefer): a short one that is easier to type, and a longer one that is more
+descriptive and hence easier to understand. Thus the "L"
+and "Letter" above are equivalent and can be used interchangeably. Likewise,
+"Upper" is a synonym for "Uppercase", and we could have written
+C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically
+various synonyms for the values the property can be. For binary properties,
+"True" has 3 synonyms: "T", "Yes", and "Y"; and "False" has correspondingly "F",
+"No", and "N". But be careful. A short form of a value for one property may
+not mean the same thing as the same short form for another. Thus, for the
+General_Category property, "L" means "Letter", but for the Bidi_Class property,
+"L" means "Left". A complete list of properties and synonyms is in
+L.
+
+Upper/lower case differences in the property names and values are irrelevant,
+thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
+Similarly, you can add or subtract underscores anywhere in the middle of a
+word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
+is irrelevant adjacent to non-word characters, such as the braces and the equals
+or colon separators so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
+equivalent to these as well. In fact, in most cases, white space and even
+hyphens can be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
+equivalent. All this is called "loose-matching" by Unicode. The few places
+where stricter matching is employed are in the middle of numbers, and the Perl
+extension properties that begin or end with an underscore. Stricter matching
+cares about white space (except adjacent to the non-word characters) and
+hyphens, and non-interior underscores.
You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.
-Here are the basic Unicode General Category properties, followed by their
-long form. You can use either; C<\p{Lu}> and C<\p{LowercaseLetter}>,
-for instance, are identical.
+=head3 B<General_Category>
+
+Every Unicode character is assigned a general category, which is the "most
+usual categorization of a character" (from
+L).
+
+The compound way of writing these is like C<\p{General_Category=Number}>
+(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
+through the equal or colon separator is omitted. So you can instead just write
+C<\pN>.
+
+Here are the short and long forms of the General Category properties:
Short Long
L Letter
- Lu UppercaseLetter
- Ll LowercaseLetter
- Lt TitlecaseLetter
- Lm ModifierLetter
- Lo OtherLetter
+ LC, L& Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
+ Lu Uppercase_Letter
+ Ll Lowercase_Letter
+ Lt Titlecase_Letter
+ Lm Modifier_Letter
+ Lo Other_Letter
M Mark
- Mn NonspacingMark
- Mc SpacingMark
- Me EnclosingMark
+ Mn Nonspacing_Mark
+ Mc Spacing_Mark
+ Me Enclosing_Mark
N Number
- Nd DecimalNumber
- Nl LetterNumber
- No OtherNumber
-
- P Punctuation
- Pc ConnectorPunctuation
- Pd DashPunctuation
- Ps OpenPunctuation
- Pe ClosePunctuation
- Pi InitialPunctuation
+ Nd Decimal_Number (also Digit)
+ Nl Letter_Number
+ No Other_Number
+
+ P Punctuation (also Punct)
+ Pc Connector_Punctuation
+ Pd Dash_Punctuation
+ Ps Open_Punctuation
+ Pe Close_Punctuation
+ Pi Initial_Punctuation
(may behave like Ps or Pe depending on usage)
- Pf FinalPunctuation
+ Pf Final_Punctuation
(may behave like Ps or Pe depending on usage)
- Po OtherPunctuation
+ Po Other_Punctuation
S Symbol
- Sm MathSymbol
- Sc CurrencySymbol
- Sk ModifierSymbol
- So OtherSymbol
+ Sm Math_Symbol
+ Sc Currency_Symbol
+ Sk Modifier_Symbol
+ So Other_Symbol
Z Separator
- Zs SpaceSeparator
- Zl LineSeparator
- Zp ParagraphSeparator
+ Zs Space_Separator
+ Zl Line_Separator
+ Zp Paragraph_Separator
C Other
- Cc Control
+ Cc Control (also Cntrl)
Cf Format
Cs Surrogate (not usable)
- Co PrivateUse
+ Co Private_Use
Cn Unassigned
Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
-C is a special case, which is an alias for C, C, and C.
+C<LC> and C<L&> are special cases, which are aliases for the set of
+C<Ll>, C<Lu>, and C<Lt>.
Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates. C<Cs> is therefore not
supported.
+=head3 B<Bidirectional Character Types>
+
Because scripts differ in their directionality--Hebrew is
-written right to left, for example--Unicode supplies these properties:
+written right to left, for example--Unicode supplies these properties in
+the Bidi_Class class:
Property Meaning
- BidiL Left-to-Right
- BidiLRE Left-to-Right Embedding
- BidiLRO Left-to-Right Override
- BidiR Right-to-Left
- BidiAL Right-to-Left Arabic
- BidiRLE Right-to-Left Embedding
- BidiRLO Right-to-Left Override
- BidiPDF Pop Directional Format
- BidiEN European Number
- BidiES European Number Separator
- BidiET European Number Terminator
- BidiAN Arabic Number
- BidiCS Common Number Separator
- BidiNSM Non-Spacing Mark
- BidiBN Boundary Neutral
- BidiB Paragraph Separator
- BidiS Segment Separator
- BidiWS Whitespace
- BidiON Other Neutrals
-
-For example, C<\p{BidiR}> matches characters that are normally
+ L Left-to-Right
+ LRE Left-to-Right Embedding
+ LRO Left-to-Right Override
+ R Right-to-Left
+ AL Arabic Letter
+ RLE Right-to-Left Embedding
+ RLO Right-to-Left Override
+ PDF Pop Directional Format
+ EN European Number
+ ES European Separator
+ ET European Terminator
+ AN Arabic Number
+ CS Common Separator
+ NSM Non-Spacing Mark
+ BN Boundary Neutral
+ B Paragraph Separator
+ S Segment Separator
+ WS Whitespace
+ ON Other Neutrals
+
+This property is always written in the compound form.
+For example, C<\p{Bidi_Class:R}> matches characters that are normally
written right to left.
-=back
+=head3 B<Scripts>
+
+The world's languages are written in a number of scripts. This sentence
+(unless you're reading it in translation) is written in Latin, while Russian is
+written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
+Hiragana or Katakana. There are many more.
+
+The Unicode Script property gives what script a given character is in,
+and can be matched with the compound form like C<\p{Script=Hebrew}> (short:
+C<\p{sc=hebr}>). Perl furnishes shortcuts for all script names. You can omit
+everything up through the equals (or colon), and simply write C<\p{Latin}> or
+C<\P{Cyrillic}>.
-=head2 Scripts
-
-The script names which can be used by C<\p{...}> and C<\P{...}>,
-such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:
-
- Arabic
- Armenian
- Bengali
- Bopomofo
- Buhid
- CanadianAboriginal
- Cherokee
- Cyrillic
- Deseret
- Devanagari
- Ethiopic
- Georgian
- Gothic
- Greek
- Gujarati
- Gurmukhi
- Han
- Hangul
- Hanunoo
- Hebrew
- Hiragana
- Inherited
- Kannada
- Katakana
- Khmer
- Lao
- Latin
- Malayalam
- Mongolian
- Myanmar
- Ogham
- OldItalic
- Oriya
- Runic
- Sinhala
- Syriac
- Tagalog
- Tagbanwa
- Tamil
- Telugu
- Thaana
- Thai
- Tibetan
- Yi
-
-Extended property classes can supplement the basic
-properties, defined by the F Unicode database:
-
- ASCIIHexDigit
- BidiControl
- Dash
- Deprecated
- Diacritic
- Extender
- GraphemeLink
- HexDigit
- Hyphen
- Ideographic
- IDSBinaryOperator
- IDSTrinaryOperator
- JoinControl
- LogicalOrderException
- NoncharacterCodePoint
- OtherAlphabetic
- OtherDefaultIgnorableCodePoint
- OtherGraphemeExtend
- OtherLowercase
- OtherMath
- OtherUppercase
- QuotationMark
- Radical
- SoftDotted
- TerminalPunctuation
- UnifiedIdeograph
- WhiteSpace
-
-and there are further derived properties:
-
- Alphabetic Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
- Lowercase Ll + OtherLowercase
- Uppercase Lu + OtherUppercase
- Math Sm + OtherMath
-
- ID_Start Lu + Ll + Lt + Lm + Lo + Nl
- ID_Continue ID_Start + Mn + Mc + Nd + Pc
-
- Any Any character
- Assigned Any non-Cn character (i.e. synonym for \P{Cn})
- Unassigned Synonym for \p{Cn}
- Common Any character (or unassigned code point)
- not explicitly assigned to a script
+A complete list of scripts and their shortcuts is in L.
+
+=head3 B<Use of "Is" Prefix>
For backward compatibility (with Perl 5.6), all properties mentioned
-so far may have C prepended to their name, so C<\P{IsLu}>, for
-example, is equal to C<\P{Lu}>.
+so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
+example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
+C<\p{Arabic}>.
-=head2 Blocks
+=head3 B<Blocks>
In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
-of blocks is more of an artificial grouping based on groups of 256
-Unicode characters. For example, the C script contains letters
-from many blocks but does not contain all the characters from those
-blocks. It does not, for example, contain digits, because digits are
-shared across many scripts. Digits and similar groups, like
-punctuation, are in a category called C.
-
-For more about scripts, see the UTR #24:
-
- http://www.unicode.org/unicode/reports/tr24/
-
-For more about blocks, see:
-
- http://www.unicode.org/Public/UNIDATA/Blocks.txt
-
-Block names are given with the C prefix. For example, the
-Katakana block is referenced via C<\p{InKatakana}>. The C
-prefix may be omitted if there is no naming conflict with a script
-or any other property, but it is recommended that C always be used
-for block tests to avoid confusion.
-
-These block names are supported:
-
- InAlphabeticPresentationForms
- InArabic
- InArabicPresentationFormsA
- InArabicPresentationFormsB
- InArmenian
- InArrows
- InBasicLatin
- InBengali
- InBlockElements
- InBopomofo
- InBopomofoExtended
- InBoxDrawing
- InBraillePatterns
- InBuhid
- InByzantineMusicalSymbols
- InCJKCompatibility
- InCJKCompatibilityForms
- InCJKCompatibilityIdeographs
- InCJKCompatibilityIdeographsSupplement
- InCJKRadicalsSupplement
- InCJKSymbolsAndPunctuation
- InCJKUnifiedIdeographs
- InCJKUnifiedIdeographsExtensionA
- InCJKUnifiedIdeographsExtensionB
- InCherokee
- InCombiningDiacriticalMarks
- InCombiningDiacriticalMarksforSymbols
- InCombiningHalfMarks
- InControlPictures
- InCurrencySymbols
- InCyrillic
- InCyrillicSupplementary
- InDeseret
- InDevanagari
- InDingbats
- InEnclosedAlphanumerics
- InEnclosedCJKLettersAndMonths
- InEthiopic
- InGeneralPunctuation
- InGeometricShapes
- InGeorgian
- InGothic
- InGreekExtended
- InGreekAndCoptic
- InGujarati
- InGurmukhi
- InHalfwidthAndFullwidthForms
- InHangulCompatibilityJamo
- InHangulJamo
- InHangulSyllables
- InHanunoo
- InHebrew
- InHighPrivateUseSurrogates
- InHighSurrogates
- InHiragana
- InIPAExtensions
- InIdeographicDescriptionCharacters
- InKanbun
- InKangxiRadicals
- InKannada
- InKatakana
- InKatakanaPhoneticExtensions
- InKhmer
- InLao
- InLatin1Supplement
- InLatinExtendedA
- InLatinExtendedAdditional
- InLatinExtendedB
- InLetterlikeSymbols
- InLowSurrogates
- InMalayalam
- InMathematicalAlphanumericSymbols
- InMathematicalOperators
- InMiscellaneousMathematicalSymbolsA
- InMiscellaneousMathematicalSymbolsB
- InMiscellaneousSymbols
- InMiscellaneousTechnical
- InMongolian
- InMusicalSymbols
- InMyanmar
- InNumberForms
- InOgham
- InOldItalic
- InOpticalCharacterRecognition
- InOriya
- InPrivateUseArea
- InRunic
- InSinhala
- InSmallFormVariants
- InSpacingModifierLetters
- InSpecials
- InSuperscriptsAndSubscripts
- InSupplementalArrowsA
- InSupplementalArrowsB
- InSupplementalMathematicalOperators
- InSupplementaryPrivateUseAreaA
- InSupplementaryPrivateUseAreaB
- InSyriac
- InTagalog
- InTagbanwa
- InTags
- InTamil
- InTelugu
- InThaana
- InThai
- InTibetan
- InUnifiedCanadianAboriginalSyllabics
- InVariationSelectors
- InYiRadicals
- InYiSyllables
+of blocks is more of an artificial grouping based on groups of Unicode
+characters with consecutive ordinal values. For example, the "Basic Latin"
+block is all characters whose ordinals are between 0 and 127, inclusive, in
+other words, the ASCII characters. The "Latin" script contains some letters
+from this block as well as several more, like "Latin-1 Supplement",
+"Latin Extended-A", I<etc.>, but it does not contain all the characters from
+those blocks. It does not, for example, contain digits, because digits are
+shared across many scripts. Digits and similar groups, like punctuation, are in
+the script called C<Common>. There is also a script called C<Inherited> for
+characters that modify other characters, and inherit the script value of the
+controlling character.
+
+For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
+L
+
+The Script property is likely to be the one you want to use when processing
+natural language; the Block property may be useful in working with the nuts and
+bolts of Unicode.
+
+Block names are matched in the compound form, like C<\p{Block: Arrows}> or
+C<\p{Blk=Hebrew}>. Unlike most other properties only a few block names have a
+Unicode-defined short name. But Perl does provide a (slight) shortcut: You
+can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
+compatibility, the C<In> prefix may be omitted if there is no naming conflict
+with a script or any other property, and you can even use an C<Is> prefix
+instead in those cases. But it is not a good idea to do this, for a couple of
+reasons:
=over 4
-=item *
+=item 1
-The special pattern C<\X> matches any extended Unicode
-sequence--"a combining character sequence" in Standardese--where the
-first character is a base character and subsequent characters are mark
-characters that apply to the base character. C<\X> is equivalent to
-C<(?:\PM\pM*)>.
+It is confusing. There are many naming conflicts, and you may forget some.
+For example, C<\p{Hebrew}> means the I