X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=ae13a33b1455516a37b8fa5ea73e67e13f01cc9a;hb=7b059540b116737402869fbccad6d5c540c7f62e;hp=1d3f84626f86cb232903f5ad15e9dcb5260f3fda;hpb=fde18df140d5f64815bdd632a127ecd5ce3d97fa;p=p5sagit%2Fp5-mst-13.2.git diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod index 1d3f846..ae13a33 100644 --- a/pod/perlunicode.pod +++ b/pod/perlunicode.pod @@ -10,6 +10,10 @@ Unicode support is an extensive requirement. While Perl does not implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features. +People who want to learn to use Unicode in Perl, should probably read +L, before reading +this reference document. + =over 4 =item Input and Output Layers @@ -20,15 +24,15 @@ the ":utf8" layer. Other encodings can be converted to Perl's encoding on input or from Perl's encoding on output by use of the ":encoding(...)" layer. See L. -To indicate that Perl source itself is using a particular encoding, -see L. +To indicate that Perl source itself is in UTF-8, use C. =item Regular Expressions The regular expression compiler produces polymorphic opcodes. That is, the pattern adapts to the data and automatically switches to the Unicode -character scheme when presented with Unicode data--or instead uses -a traditional byte scheme when presented with byte data. +character scheme when presented with data that is internally encoded in +UTF-8, or instead uses a traditional byte scheme when presented with +byte data. =item C still needed to enable UTF-8/UTF-EBCDIC in scripts @@ -39,8 +43,23 @@ ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based machines. B is needed.> See L. -You can also use the C pragma to change the default encoding -of the data in your script; see L. 
+=item BOM-marked scripts and UTF-16 scripts autodetected + +If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE, +or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either +endianness, Perl will correctly read in the script as Unicode. +(BOMless UTF-8 cannot be effectively recognized or differentiated from +ISO 8859-1 or other eight-bit encodings.) + +=item C needed to upgrade non-Latin-1 byte strings + +By default, there is a fundamental asymmetry in Perl's Unicode model: +implicit upgrading from byte strings to Unicode strings assumes that +they were encoded in I, but Unicode strings are +downgraded with UTF-8 encoding. This happens because the first 256 +codepoints in Unicode happen to agree with Latin-1. + +See L for more details. =back @@ -60,9 +79,18 @@ character semantics. For operations where this determination cannot be made without additional information from the user, Perl decides in favor of compatibility and chooses to use byte semantics. +Under byte semantics, when C is in effect, Perl uses the +semantics associated with the current locale. Absent a C, and +absent a C pragma, Perl currently uses US-ASCII +(or Basic Latin in Unicode terminology) byte semantics, meaning that characters +whose ordinal numbers are in the range 128 - 255 are undefined except for their +ordinal numbers. This means that none have case (upper and lower), nor are any +a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong +to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.) + This behavior preserves compatibility with earlier versions of Perl, which allowed byte semantics in Perl operations only if -none of the program's inputs were marked as being as source of Unicode +none of the program's inputs were marked as being a source of Unicode character data. 
Such data may come from filehandles, from calls to external programs, from information provided by the system (such as %ENV), or from literals and constants in the source text. @@ -70,6 +98,11 @@ or from literals and constants in the source text. The C pragma will always, regardless of platform, force byte semantics in a particular lexical scope. See L. +The C pragma is intended to always, regardless +of platform, force Unicode semantics in a particular lexical scope. In +release 5.12, it is partially implemented, applying only to case changes. +See L below. + The C pragma is primarily a compatibility device that enables recognition of UTF-(8|EBCDIC) in literals encountered by the parser. Note that this pragma is only required while Perl defaults to byte @@ -83,15 +116,13 @@ input data comes from a Unicode source--for example, if a character encoding layer is added to a filehandle or a literal Unicode string constant appears in a program--character semantics apply. Otherwise, byte semantics are in effect. The C pragma should -be used to force byte semantics on Unicode data. +be used to force byte semantics on Unicode data, and the C pragma to force Unicode semantics on byte data (though in +5.12 it isn't fully implemented). If strings operating under byte semantics and strings with Unicode -character data are concatenated, the new string will be upgraded to -I, even if the old Unicode string used EBCDIC. -This translation is done without regard to the system's native 8-bit -encoding, so to change this for systems with non-Latin-1 and -non-EBCDIC native encodings use the C pragma. See -L. +character data are concatenated, the new string will have +character semantics. This can cause surprises: See L, below Under character semantics, many operations that formerly operated on bytes now operate on characters. 
A character in Perl is @@ -111,17 +142,20 @@ Character semantics have the following effects: Strings--including hash keys--and regular expression patterns may contain characters that have an ordinal value larger than 255. -If you use a Unicode editor to edit your program, Unicode characters -may occur directly within the literal strings in one of the various -Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized -as such and converted to Perl's internal representation only if the -appropriate L is specified. +If you use a Unicode editor to edit your program, Unicode characters may +occur directly within the literal strings in UTF-8 encoding, or UTF-16. +(The former requires a BOM or C, the latter requires a BOM.) -Unicode characters can also be added to a string by using the -C<\x{...}> notation. The Unicode code for the desired character, in -hexadecimal, should be placed in the braces. For instance, a smiley -face is C<\x{263A}>. This encoding scheme only works for characters -with a code of 0x100 or above. +Unicode characters can also be added to a string by using the C<\N{U+...}> +notation. The Unicode code for the desired character, in hexadecimal, +should be placed in the braces, after the C. For instance, a smiley face is +C<\N{U+263A}>. + +Alternatively, you can use the C<\x{...}> notation for characters 0x100 and +above. For characters below 0x100 you may get byte semantics instead of +character semantics; see L. On EBCDIC machines there is +the additional problem that the value for such characters gives the EBCDIC +character rather than the Unicode one. Additionally, if you @@ -129,7 +163,7 @@ Additionally, if you you can use the C<\N{...}> notation and put the official Unicode character name within the braces, such as C<\N{WHITE SMILING FACE}>. - +See L. =item * @@ -141,8 +175,7 @@ names. =item * Regular expressions match characters instead of bytes. "." matches -a character instead of a byte. 
The C<\C> pattern is provided to force -a match a single byte--a C in C, hence C<\C>. +a character instead of a byte. =item * @@ -155,464 +188,581 @@ ideograph, for instance. Named Unicode properties, scripts, and block ranges may be used like character classes via the C<\p{}> "matches property" construct and -the C<\P{}> negation, "doesn't match property". - -For instance, C<\p{Lu}> matches any character with the Unicode "Lu" -(Letter, uppercase) property, while C<\p{M}> matches any character -with an "M" (mark--accents and such) property. Brackets are not -required for single letter properties, so C<\p{M}> is equivalent to -C<\pM>. Many predefined properties are available, such as -C<\p{Mirrored}> and C<\p{Tibetan}>. - -The official Unicode script and block names have spaces and dashes as -separators, but for convenience you can use dashes, spaces, or -underbars, and case is unimportant. It is recommended, however, that -for consistency you use the following naming: the official Unicode -script, property, or block name (see below for the additional rules -that apply to block names) with whitespace and dashes removed, and the -words "uppercase-first-lowercase-rest". C thus -becomes C. +the C<\P{}> negation, "doesn't match property". +See L for more details. + +You can define your own character properties and use them +in the regular expression with the C<\p{}> or C<\P{}> construct. +See L for more details. + +=item * + +The special pattern C<\X> matches a logical character, an "extended grapheme +cluster" in Standardese. In Unicode what appears to the user to be a single +character, for example an accented C, may in fact be composed of a sequence +of characters, in this case a C followed by an accent character. C<\X> +will match the entire sequence. + +=item * + +The C operator translates characters instead of bytes. Note +that the C functionality has been removed. For similar +functionality see pack('U0', ...) and pack('C0', ...). 
+ +=item * + +Case translation operators use the Unicode case translation tables +when character input is provided. Note that C, or C<\U> in +interpolated strings, translates to uppercase, while C, +or C<\u> in interpolated strings, translates to titlecase in languages +that make the distinction (which is equivalent to uppercase in languages +without the distinction). + +=item * + +Most operators that deal with positions or lengths in a string will +automatically switch to using character positions, including +C, C, C, C, C, C, +C, C, and C. An operator that +specifically does not switch is C. Operators that really don't +care include operators that treat strings as a bucket of bits such as +C, and operators dealing with filenames. + +=item * + +The C/C letter C does I change, since it is often +used for byte-oriented formats. Again, think C in the C language. + +There is a new C specifier that converts between Unicode characters +and code points. There is also a C specifier that is the equivalent of +C/C and properly handles character values even if they are above 255. + +=item * + +The C and C functions work on characters, similar to +C and C, I C and +C. C and C are methods for +emulating byte-oriented C and C on Unicode strings. +While these methods reveal the internal encoding of Unicode strings, +that is not something one normally needs to care about at all. + +=item * + +The bit string operators, C<& | ^ ~>, can operate on character data. +However, for backward compatibility, such as when using bit string +operations when characters are all less than 256 in ordinal value, one +should not use C<~> (the bit complement) with characters of both +values less than 256 and values greater than 256. Most importantly, +DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) +will not hold. The reason for this mathematical I is that +the complement cannot return B the 8-bit (byte-wide) bit +complement B the full character-wide bit complement. 
+=item * + +You can define your own mappings to be used in lc(), +lcfirst(), uc(), and ucfirst() (or their string-inlined versions). +See L for more details. + +=back + +=over 4 + +=item * + +And finally, C reverses by character rather than by byte. + +=back + +=head2 Unicode Character Properties + +Most Unicode character properties are accessible by using regular expressions. +They are used like character classes via the C<\p{}> "matches property" +construct and the C<\P{}> negation, "doesn't match property". + +For instance, C<\p{Uppercase}> matches any character with the Unicode +"Uppercase" property, while C<\p{L}> matches any character with a +General_Category of "L" (letter) property. Brackets are not +required for single letter properties, so C<\p{L}> is equivalent to C<\pL>. + +More formally, C<\p{Uppercase}> matches any character whose Unicode Uppercase +property value is True, and C<\P{Uppercase}> matches any character whose +Uppercase property value is False, and they could have been written as +C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively. + +This formality is needed when properties are not binary, that is, if they can +take on more values than just True and False. For example, the Bidi_Class property (see +L below) can take on a number of different +values, such as Left, Right, Whitespace, and others. To match these, one needs +to specify the property name (Bidi_Class), and the value being matched against +(Left, Right, I). This is done, as in the examples above, by having the +two components separated by an equal sign (or interchangeably, a colon), like +C<\p{Bidi_Class: Left}>. 
+All Unicode-defined character properties may be written in these compound forms +of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some +additional properties that are written only in the single form, as well as +single-form short-cuts for all binary properties and certain others described +below, in which you may omit the property name and the equals or colon +separator. + +Most Unicode character properties have at least two synonyms (or aliases if you +prefer), a short one that is easier to type, and a longer one which is more +descriptive and hence easier to understand. Thus the "L" +and "Letter" above are equivalent and can be used interchangeably. Likewise, +"Upper" is a synonym for "Uppercase", and we could have written +C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically +various synonyms for the values the property can be. For binary properties, +"True" has 3 synonyms: "T", "Yes", and "Y"; and "False" has correspondingly "F", +"No", and "N". But be careful. A short form of a value for one property may +not mean the same thing as the same short form for another. Thus, for the +General_Category property, "L" means "Letter", but for the Bidi_Class property, +"L" means "Left". A complete list of properties and synonyms is in +L. + +Upper/lower case differences in the property names and values are irrelevant, +thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>. +Similarly, you can add or subtract underscores anywhere in the middle of a +word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space +is irrelevant adjacent to non-word characters, such as the braces and the equals +or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are +equivalent to these as well. In fact, in most cases, white space and even +hyphens can be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is +equivalent. All this is called "loose-matching" by Unicode. 
The few places +where stricter matching is employed are in the middle of numbers, and in the Perl +extension properties that begin or end with an underscore. Stricter matching +cares about white space (except adjacent to the non-word characters) and +hyphens, and non-interior underscores. You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret (^) between the first brace and the property name: C<\p{^Tamil}> is equal to C<\P{Tamil}>. -Here are the basic Unicode General Category properties, followed by their -long form. You can use either; C<\p{Lu}> and C<\p{LowercaseLetter}>, -for instance, are identical. +=head3 B + +Every Unicode character is assigned a general category, which is the "most +usual categorization of a character" (from +L). + +The compound way of writing these is like C<\p{General_Category=Number}> +(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up +through the equal or colon separator is omitted. So you can instead just write +C<\pN>. 
+ +Here are the short and long forms of the General Category properties: Short Long L Letter - Lu UppercaseLetter - Ll LowercaseLetter - Lt TitlecaseLetter - Lm ModifierLetter - Lo OtherLetter + LC, L& Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}]) + Lu Uppercase_Letter + Ll Lowercase_Letter + Lt Titlecase_Letter + Lm Modifier_Letter + Lo Other_Letter M Mark - Mn NonspacingMark - Mc SpacingMark - Me EnclosingMark + Mn Nonspacing_Mark + Mc Spacing_Mark + Me Enclosing_Mark N Number - Nd DecimalNumber - Nl LetterNumber - No OtherNumber - - P Punctuation - Pc ConnectorPunctuation - Pd DashPunctuation - Ps OpenPunctuation - Pe ClosePunctuation - Pi InitialPunctuation + Nd Decimal_Number (also Digit) + Nl Letter_Number + No Other_Number + + P Punctuation (also Punct) + Pc Connector_Punctuation + Pd Dash_Punctuation + Ps Open_Punctuation + Pe Close_Punctuation + Pi Initial_Punctuation (may behave like Ps or Pe depending on usage) - Pf FinalPunctuation + Pf Final_Punctuation (may behave like Ps or Pe depending on usage) - Po OtherPunctuation + Po Other_Punctuation S Symbol - Sm MathSymbol - Sc CurrencySymbol - Sk ModifierSymbol - So OtherSymbol + Sm Math_Symbol + Sc Currency_Symbol + Sk Modifier_Symbol + So Other_Symbol Z Separator - Zs SpaceSeparator - Zl LineSeparator - Zp ParagraphSeparator + Zs Space_Separator + Zl Line_Separator + Zp Paragraph_Separator C Other - Cc Control + Cc Control (also Cntrl) Cf Format Cs Surrogate (not usable) - Co PrivateUse + Co Private_Use Cn Unassigned Single-letter properties match all characters in any of the two-letter sub-properties starting with the same letter. -C is a special case, which is an alias for C, C, and C. +C and C are special cases, which are aliases for the set of +C, C, and C. Because Perl hides the need for the user to understand the internal representation of Unicode characters, there is no need to implement the somewhat messy concept of surrogates. C is therefore not supported. 
+=head3 B + Because scripts differ in their directionality--Hebrew is -written right to left, for example--Unicode supplies these properties: +written right to left, for example--Unicode supplies these properties in +the Bidi_Class class: Property Meaning - BidiL Left-to-Right - BidiLRE Left-to-Right Embedding - BidiLRO Left-to-Right Override - BidiR Right-to-Left - BidiAL Right-to-Left Arabic - BidiRLE Right-to-Left Embedding - BidiRLO Right-to-Left Override - BidiPDF Pop Directional Format - BidiEN European Number - BidiES European Number Separator - BidiET European Number Terminator - BidiAN Arabic Number - BidiCS Common Number Separator - BidiNSM Non-Spacing Mark - BidiBN Boundary Neutral - BidiB Paragraph Separator - BidiS Segment Separator - BidiWS Whitespace - BidiON Other Neutrals - -For example, C<\p{BidiR}> matches characters that are normally + L Left-to-Right + LRE Left-to-Right Embedding + LRO Left-to-Right Override + R Right-to-Left + AL Arabic Letter + RLE Right-to-Left Embedding + RLO Right-to-Left Override + PDF Pop Directional Format + EN European Number + ES European Separator + ET European Terminator + AN Arabic Number + CS Common Separator + NSM Non-Spacing Mark + BN Boundary Neutral + B Paragraph Separator + S Segment Separator + WS Whitespace + ON Other Neutrals + +This property is always written in the compound form. +For example, C<\p{Bidi_Class:R}> matches characters that are normally written right to left. -=back +=head3 B + +The world's languages are written in a number of scripts. This sentence +(unless you're reading it in translation) is written in Latin, while Russian is +written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in +Hiragana or Katakana. There are many more. + +The Unicode Script property gives what script a given character is in, +and can be matched with the compound form like C<\p{Script=Hebrew}> (short: +C<\p{sc=hebr}>). Perl furnishes shortcuts for all script names. 
You can omit +everything up through the equals (or colon), and simply write C<\p{Latin}> or +C<\P{Cyrillic}>. -=head2 Scripts - -The script names which can be used by C<\p{...}> and C<\P{...}>, -such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows: - - Arabic - Armenian - Bengali - Bopomofo - Buhid - CanadianAboriginal - Cherokee - Cyrillic - Deseret - Devanagari - Ethiopic - Georgian - Gothic - Greek - Gujarati - Gurmukhi - Han - Hangul - Hanunoo - Hebrew - Hiragana - Inherited - Kannada - Katakana - Khmer - Lao - Latin - Malayalam - Mongolian - Myanmar - Ogham - OldItalic - Oriya - Runic - Sinhala - Syriac - Tagalog - Tagbanwa - Tamil - Telugu - Thaana - Thai - Tibetan - Yi - -Extended property classes can supplement the basic -properties, defined by the F Unicode database: - - ASCIIHexDigit - BidiControl - Dash - Deprecated - Diacritic - Extender - GraphemeLink - HexDigit - Hyphen - Ideographic - IDSBinaryOperator - IDSTrinaryOperator - JoinControl - LogicalOrderException - NoncharacterCodePoint - OtherAlphabetic - OtherDefaultIgnorableCodePoint - OtherGraphemeExtend - OtherLowercase - OtherMath - OtherUppercase - QuotationMark - Radical - SoftDotted - TerminalPunctuation - UnifiedIdeograph - WhiteSpace - -and there are further derived properties: - - Alphabetic Lu + Ll + Lt + Lm + Lo + OtherAlphabetic - Lowercase Ll + OtherLowercase - Uppercase Lu + OtherUppercase - Math Sm + OtherMath - - ID_Start Lu + Ll + Lt + Lm + Lo + Nl - ID_Continue ID_Start + Mn + Mc + Nd + Pc - - Any Any character - Assigned Any non-Cn character (i.e. synonym for \P{Cn}) - Unassigned Synonym for \p{Cn} - Common Any character (or unassigned code point) - not explicitly assigned to a script +A complete list of scripts and their shortcuts is in L. + +=head3 B For backward compatibility (with Perl 5.6), all properties mentioned -so far may have C prepended to their name, so C<\P{IsLu}>, for -example, is equal to C<\P{Lu}>. 
+so far may have C or C prepended to their name, so C<\P{Is_Lu}>, for +example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to +C<\p{Arabic}>. -=head2 Blocks +=head3 B In addition to B, Unicode also defines B of characters. The difference between scripts and blocks is that the concept of scripts is closer to natural languages, while the concept -of blocks is more of an artificial grouping based on groups of 256 -Unicode characters. For example, the C script contains letters -from many blocks but does not contain all the characters from those -blocks. It does not, for example, contain digits, because digits are -shared across many scripts. Digits and similar groups, like -punctuation, are in a category called C. - -For more about scripts, see the UTR #24: - - http://www.unicode.org/unicode/reports/tr24/ - -For more about blocks, see: - - http://www.unicode.org/Public/UNIDATA/Blocks.txt - -Block names are given with the C prefix. For example, the -Katakana block is referenced via C<\p{InKatakana}>. The C -prefix may be omitted if there is no naming conflict with a script -or any other property, but it is recommended that C always be used -for block tests to avoid confusion. 
- -These block names are supported: - - InAlphabeticPresentationForms - InArabic - InArabicPresentationFormsA - InArabicPresentationFormsB - InArmenian - InArrows - InBasicLatin - InBengali - InBlockElements - InBopomofo - InBopomofoExtended - InBoxDrawing - InBraillePatterns - InBuhid - InByzantineMusicalSymbols - InCJKCompatibility - InCJKCompatibilityForms - InCJKCompatibilityIdeographs - InCJKCompatibilityIdeographsSupplement - InCJKRadicalsSupplement - InCJKSymbolsAndPunctuation - InCJKUnifiedIdeographs - InCJKUnifiedIdeographsExtensionA - InCJKUnifiedIdeographsExtensionB - InCherokee - InCombiningDiacriticalMarks - InCombiningDiacriticalMarksforSymbols - InCombiningHalfMarks - InControlPictures - InCurrencySymbols - InCyrillic - InCyrillicSupplementary - InDeseret - InDevanagari - InDingbats - InEnclosedAlphanumerics - InEnclosedCJKLettersAndMonths - InEthiopic - InGeneralPunctuation - InGeometricShapes - InGeorgian - InGothic - InGreekExtended - InGreekAndCoptic - InGujarati - InGurmukhi - InHalfwidthAndFullwidthForms - InHangulCompatibilityJamo - InHangulJamo - InHangulSyllables - InHanunoo - InHebrew - InHighPrivateUseSurrogates - InHighSurrogates - InHiragana - InIPAExtensions - InIdeographicDescriptionCharacters - InKanbun - InKangxiRadicals - InKannada - InKatakana - InKatakanaPhoneticExtensions - InKhmer - InLao - InLatin1Supplement - InLatinExtendedA - InLatinExtendedAdditional - InLatinExtendedB - InLetterlikeSymbols - InLowSurrogates - InMalayalam - InMathematicalAlphanumericSymbols - InMathematicalOperators - InMiscellaneousMathematicalSymbolsA - InMiscellaneousMathematicalSymbolsB - InMiscellaneousSymbols - InMiscellaneousTechnical - InMongolian - InMusicalSymbols - InMyanmar - InNumberForms - InOgham - InOldItalic - InOpticalCharacterRecognition - InOriya - InPrivateUseArea - InRunic - InSinhala - InSmallFormVariants - InSpacingModifierLetters - InSpecials - InSuperscriptsAndSubscripts - InSupplementalArrowsA - InSupplementalArrowsB - 
InSupplementalMathematicalOperators - InSupplementaryPrivateUseAreaA - InSupplementaryPrivateUseAreaB - InSyriac - InTagalog - InTagbanwa - InTags - InTamil - InTelugu - InThaana - InThai - InTibetan - InUnifiedCanadianAboriginalSyllabics - InVariationSelectors - InYiRadicals - InYiSyllables +of blocks is more of an artificial grouping based on groups of Unicode +characters with consecutive ordinal values. For example, the "Basic Latin" +block is all characters whose ordinals are between 0 and 127, inclusive, in +other words, the ASCII characters. The "Latin" script contains some letters +from this block as well as several more, like "Latin-1 Supplement", +"Latin Extended-A", I, but it does not contain all the characters from +those blocks. It does not, for example, contain digits, because digits are +shared across many scripts. Digits and similar groups, like punctuation, are in +the script called C. There is also a script called C for +characters that modify other characters, and inherit the script value of the +controlling character. + +For more about scripts versus blocks, see UAX#24 "Unicode Script Property": +L + +The Script property is likely to be the one you want to use when processing +natural language; the Block property may be useful in working with the nuts and +bolts of Unicode. + +Block names are matched in the compound form, like C<\p{Block: Arrows}> or +C<\p{Blk=Hebrew}>. Unlike most other properties only a few block names have a +Unicode-defined short name. But Perl does provide a (slight) shortcut: You +can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards +compatibility, the C prefix may be omitted if there is no naming conflict +with a script or any other property, and you can even use an C prefix +instead in those cases. 
But it is not a good idea to do this, for a couple of +reasons: =over 4 -=item * +=item 1 -The special pattern C<\X> matches any extended Unicode -sequence--"a combining character sequence" in Standardese--where the -first character is a base character and subsequent characters are mark -characters that apply to the base character. C<\X> is equivalent to -C<(?:\PM\pM*)>. +It is confusing. There are many naming conflicts, and you may forget some. +For example, C<\p{Hebrew}> means the I