From: Karl Williamson Date: Sun, 20 Dec 2009 22:33:56 +0000 (+0100) Subject: Unicode documentation updates X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=51f494ccbae191b4b6fd4232895ccf47ceb713e5;p=p5sagit%2Fp5-mst-13.2.git Unicode documentation updates --- diff --git a/lib/unicore/mktables b/lib/unicore/mktables index c61a3f4..12f6659 100644 --- a/lib/unicore/mktables +++ b/lib/unicore/mktables @@ -12206,7 +12206,7 @@ To change this file, edit $0 instead. =head1 NAME -$pod_file - Complete index of Unicode Version $string_version properties +$pod_file - Index of Unicode Version $string_version properties in Perl =head1 DESCRIPTION diff --git a/pod/perl5113delta.pod b/pod/perl5113delta.pod index c97540d..5904ec9 100644 --- a/pod/perl5113delta.pod +++ b/pod/perl5113delta.pod @@ -28,7 +28,7 @@ to bless them into C. =head2 Unicode version -Perl is shipped with the latest Unicode version, 5.2, October 2009. See +Perl is shipped with the latest Unicode version, 5.2, dated October 2009. See L for details about this release of Unicode. See L for instructions on installing and using older versions of Unicode. @@ -55,23 +55,45 @@ now accepted. C, which matches a Unicode logical character, has been expanded to work better with various Asian languages. It now is defined as an C. (See L). One change -due to this is that C<\X> will match the whole sequence C>. Another -change is that C<\X> will match an isolated mark. Marks generally come after a -base character, but it is possible in Unicode to have them in isolation, and -C<\X> will now handle that case. Otherwise, this change should be transparent -for non-affected languages. +grapheme cluster>. (See L). +Anything matched by C<\X> previously will continue to be matched. But in addition: + +=over + +=item * + +C<\X> will now not break apart a C> sequence. + +=item * + +C<\X> will now match a sequence including the C and C characters. 
+ +=item * + +C<\X> will now always match at least one character, including an initial mark. +Marks generally come after a base character, but it is possible in Unicode to +have them in isolation, and C<\X> will now handle that case, for example at the +beginning of a line or after a C. + +=item * + +C<\X> will now match a (Korean) Hangul syllable sequence, and the Thai and Lao +exception cases. + +=back + +Otherwise, this change should be transparent for the non-affected languages. C<\p{...}> matches using the Canonical_Combining_Class property were completely broken in previous Perls. This is now fixed. -In previous Perls, the Unicode Decomposition_Type=Compat property and a +In previous Perls, the Unicode C property and a Perl extension had the same name, which led to neither matching all the correct values (with more than 100 mistakes in one, and several thousand in the other). The Perl extension has now been renamed to be -Decomposition_Type=Noncanonical (short: dt=noncanon). It has the same +C (short: C). It has the same meaning as was previously intended, namely the union of all the -non-canonical Decomposition types, with Unicode Compat being just one of +non-canonical Decomposition types, with Unicode C being just one of those. C<\p{Uppercase}> and C<\p{Lowercase}> have been brought into line with the @@ -88,25 +110,25 @@ similar, plus Bi-directional controls. C<\p{Alpha}> now matches the same characters as C<\p{Alphabetic}>. The Perl definition included a number of things that aren't really alpha (all -marks), while omitting many that were. The Unicode definition is -clearly better, so we are switching to it. As a direct consequence, the +marks), while omitting many that were. As a direct consequence, the definitions of C<\p{Alnum}> and C<\p{Word}> which depend on Alpha also change. C<\p{Word}> also now doesn't match certain characters it wasn't supposed to, such as fractions. 
-C<\p{Print}> no longer matches the line control characters: tab, lf, cr, -ff, vt, and nel. This brings it in line with the documentation. +C<\p{Print}> no longer matches the line control characters: Tab, LF, CR, +FF, VT, and NEL. This brings it in line with the documentation. -C<\p{Decomposition_Type=Canonical}> now includes the Hangul syllables +C<\p{Decomposition_Type=Canonical}> now includes the Hangul syllables. The Numeric type property has been extended to include the Unihan characters. -There is a new Perl extension, the 'Present_In', or simply 'In' +There is a new Perl extension, the 'Present_In', or simply 'In', property. This is an extension of the Unicode Age property, but -C<\p{In=5.0}> matches any code point whose usage has been determined as of -Unicode version 5.0. The C<\p{Age=5.0}> only matches code points added in 5.0. +C<\p{In=5.0}> matches any code point whose usage has been determined +I Unicode version 5.0. The C<\p{Age=5.0}> only matches code points +added in I version 5.0. A number of properties did not have the correct values for unassigned code points. This is now fixed. The affected properties are @@ -114,15 +136,14 @@ Bidi_Class, East_Asian_Width, Joining_Type, Decomposition_Type, Hangul_Syllable_Type, Numeric_Type, and Line_Break. The Default_Ignorable_Code_Point, ID_Continue, and ID_Start properties -have been updated to their current definitions. +have been updated to their current Unicode definitions. Certain properties that are supposed to be Unicode internal-only were erroneously exposed by previous Perls. Use of these in regular -expressions will now generate a deprecated warning message, if those -warnings are enabled. The properties are: Other_Alphabetic, -Other_Default_Ignorable_Code_Point, Other_Grapheme_Extend, -Other_ID_Continue, Other_ID_Start, Other_Lowercase, Other_Math, and -Other_Uppercase. +expressions will now generate, if enabled, a deprecated warning message. 
+The properties are: Other_Alphabetic, Other_Default_Ignorable_Code_Point, +Other_Grapheme_Extend, Other_ID_Continue, Other_ID_Start, Other_Lowercase, +Other_Math, and Other_Uppercase. An installation can now fairly easily change which Unicode properties Perl understands. As mentioned above, certain properties are by default @@ -132,12 +153,12 @@ Unicode internal-only property that Perl has never exposed. XXX what does "files in the To directory" mean? -- dagolden, 2009-12-20 -The files in the To directory are now more clearly marked as being -stable, directly usable by applications. New hash entries in them give -the format of the normal entries which allows for easier machine -parsing. Perl can generate files in this directory for any property, -though most are suppressed. An installation can choose to change which -get written. Instructions are in L. +The files in the C directory are now more clearly marked as +being stable, directly usable by applications. New hash entries in them give +the format of the normal entries, which allows for easier machine parsing. +Perl can generate files in this directory for any property, though most are +suppressed. An installation can choose to change which get written. +Instructions are in L. =head1 Modules and Pragmata diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod index 5dbd3cd..26a7af0 100644 --- a/pod/perlunicode.pod +++ b/pod/perlunicode.pod @@ -57,7 +57,7 @@ By default, there is a fundamental asymmetry in Perl's Unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in I, but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 -codepoints in Unicode happens to agree with Latin-1. +codepoints in Unicode happens to agree with Latin-1. See L for more details. @@ -79,7 +79,7 @@ character semantics. 
For operations where this determination cannot be made without additional information from the user, Perl decides in favor of compatibility and chooses to use byte semantics. -Under byte semantics, when C is in effect, Perl uses the +Under byte semantics, when C is in effect, Perl uses the semantics associated with the current locale. Absent a C, Perl currently uses US-ASCII (or Basic Latin in Unicode terminology) byte semantics, meaning that characters whose ordinal numbers are in the range 128 - 255 are @@ -115,7 +115,7 @@ Otherwise, byte semantics are in effect. The C pragma should be used to force byte semantics on Unicode data. If strings operating under byte semantics and strings with Unicode -character data are concatenated, the new string will have +character data are concatenated, the new string will have character semantics. This can cause surprises: See L, below Under character semantics, many operations that formerly operated on @@ -188,11 +188,11 @@ See L for more details. =item * -The special pattern C<\X> matches any extended Unicode -sequence--"a combining character sequence" in Standardese--where the -first character is a base character and subsequent characters are mark -characters that apply to the base character. C<\X> is equivalent to -C<< (?>\PM\pM*) >>. +The special pattern C<\X> matches a logical character, an C in Standardese. In Unicode what appears to the user to be a single +character, for example an accented C, may in fact be composed of a sequence +of characters, in this case a C followed by an accent character. C<\X> +will match the entire sequence. =item * @@ -214,13 +214,13 @@ Most operators that deal with positions or lengths in a string will automatically switch to using character positions, including C, C, C, C, C, C, C, C, and C. An operator that -specifically does not switch is C. Operators that really don't -care include operators that treat strings as a bucket of bits such as +specifically does not switch is C. 
Operators that really don't +care include operators that treat strings as a bucket of bits such as C, and operators dealing with filenames. =item * -The C/C letter C does I change, since it is often +The C/C letter C does I change, since it is often used for byte-oriented formats. Again, think C in the C language. There is a new C specifier that converts between Unicode characters @@ -288,88 +288,128 @@ And finally, C reverses by character rather than by byte. =head2 Unicode Character Properties -Named Unicode properties, scripts, and block ranges may be used like -character classes via the C<\p{}> "matches property" construct and -the C<\P{}> negation, "doesn't match property". - -For instance, C<\p{Lu}> matches any character with the Unicode "Lu" -(Letter, uppercase) property, while C<\p{M}> matches any character -with an "M" (mark--accents and such) property. Brackets are not -required for single letter properties, so C<\p{M}> is equivalent to -C<\pM>. Many predefined properties are available, such as -C<\p{Mirrored}> and C<\p{Tibetan}>. - -The official Unicode script and block names have spaces and dashes as -separators, but for convenience you can use dashes, spaces, or -underbars, and case is unimportant. It is recommended, however, that -for consistency you use the following naming: the official Unicode -script, property, or block name (see below for the additional rules -that apply to block names) with whitespace and dashes removed, and the -words "uppercase-first-lowercase-rest". C thus -becomes C. +Most Unicode character properties are accessible by using regular expressions. +They are used like character classes via the C<\p{}> "matches property" +construct and the C<\P{}> negation, "doesn't match property". + +For instance, C<\p{Uppercase}> matches any character with the Unicode +"Uppercase" property, while C<\p{L}> matches any character with a +General_Category of "L" (letter) property. 
Brackets are not +required for single letter properties, so C<\p{L}> is equivalent to C<\pL>. + +More formally, C<\p{Uppercase}> matches any character whose Uppercase property +value is True, and C<\P{Uppercase}> matches any character whose Uppercase +property value is False, and they could have been written as +C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively. + +This formality is needed when properties are not binary, that is, if they can +take on more values than just True and False. For example, the Bidi_Class (see +L below), can take on a number of different +values, such as Left, Right, Whitespace, and others. To match these, one needs +to specify the property name (Bidi_Class), and the value being matched with +(Left, Right, etc.). This is done, as in the examples above, by having the two +components separated by an equal sign (or interchangeably, a colon), like +C<\p{Bidi_Class: Left}>. + +All Unicode-defined character properties may be written in these compound forms +of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some +additional properties that are written only in the single form, as well as +single-form short-cuts for all binary properties and certain others described +below, in which you may omit the property name and the equals or colon +separator. + +Most Unicode character properties have at least two synonyms (or aliases if you +prefer), a short one that is easier to type, and a longer one which is more +descriptive and hence easier to understand. Thus the "L" +and "Letter" above are equivalent and can be used interchangeably. Likewise, +"Upper" is a synonym for "Uppercase", and we could have written +C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically +various synonyms for the values the property can be. For binary properties, +"True" has 3 synonyms: "T", "Yes", and "Y"; and "False" has correspondingly "F", +"No", and "N". But be careful. 
A short form of a value for one property may +not mean the same thing as the same name for another. Thus, for the +General_Category property, "L" means "Letter", but for the Bidi_Class property, +"L" means "Left". A complete list of properties and synonyms is in +L. + +Upper/lower case differences in the property names and values are irrelevant, +thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>. +Similarly, you can add or subtract underscores anywhere in the middle of a +word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space +is irrelevant adjacent to non-word characters, such as the braces and the equals +or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are +equivalent to these as well. In fact, in most cases, white space and even +hyphens can be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is +equivalent. All this is called "loose-matching" by Unicode. The few places +where stricter matching is employed are in the middle of numbers, and in the Perl +extension properties that begin or end with an underscore. Stricter matching +cares about white space (except adjacent to the non-word characters) and +hyphens, and non-interior underscores. You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret (^) between the first brace and the property name: C<\p{^Tamil}> is equal to C<\P{Tamil}>. -B +=head3 B -=over 4 +Every Unicode character is assigned a general category, which is the "most +usual categorization of a character" (from +L). -=item General Category +The compound way of writing these is like C<\p{General_Category=Number}> +(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up +through the equal or colon separator is omitted. So you can instead just write +C<\pN>. -Here are the basic Unicode General Category properties, followed by their -long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>, -for instance, are identical. 
+Here are the short and long forms of the General Category properties: Short Long L Letter - LC CasedLetter - Lu UppercaseLetter - Ll LowercaseLetter - Lt TitlecaseLetter - Lm ModifierLetter - Lo OtherLetter + LC, L& Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}]) + Lu Uppercase_Letter + Ll Lowercase_Letter + Lt Titlecase_Letter + Lm Modifier_Letter + Lo Other_Letter M Mark - Mn NonspacingMark - Mc SpacingMark - Me EnclosingMark + Mn Nonspacing_Mark + Mc Spacing_Mark + Me Enclosing_Mark N Number - Nd DecimalNumber - Nl LetterNumber - No OtherNumber - - P Punctuation - Pc ConnectorPunctuation - Pd DashPunctuation - Ps OpenPunctuation - Pe ClosePunctuation - Pi InitialPunctuation + Nd Decimal_Number (also Digit) + Nl Letter_Number + No Other_Number + + P Punctuation (also Punct) + Pc Connector_Punctuation + Pd Dash_Punctuation + Ps Open_Punctuation + Pe Close_Punctuation + Pi Initial_Punctuation (may behave like Ps or Pe depending on usage) - Pf FinalPunctuation + Pf Final_Punctuation (may behave like Ps or Pe depending on usage) - Po OtherPunctuation + Po Other_Punctuation S Symbol - Sm MathSymbol - Sc CurrencySymbol - Sk ModifierSymbol - So OtherSymbol + Sm Math_Symbol + Sc Currency_Symbol + Sk Modifier_Symbol + So Other_Symbol Z Separator - Zs SpaceSeparator - Zl LineSeparator - Zp ParagraphSeparator + Zs Space_Separator + Zl Line_Separator + Zp Paragraph_Separator C Other - Cc Control + Cc Control (also Cntrl) Cf Format Cs Surrogate (not usable) - Co PrivateUse + Co Private_Use Cn Unassigned Single-letter properties match all characters in any of the @@ -382,11 +422,11 @@ representation of Unicode characters, there is no need to implement the somewhat messy concept of surrogates. C is therefore not supported. 
-=item Bidirectional Character Types +=head3 B Because scripts differ in their directionality--Hebrew is written right to left, for example--Unicode supplies these properties in -the BidiClass class: +the Bidi_Class class: Property Meaning @@ -394,15 +434,15 @@ the BidiClass class: LRE Left-to-Right Embedding LRO Left-to-Right Override R Right-to-Left - AL Right-to-Left Arabic + AL Arabic Letter RLE Right-to-Left Embedding RLO Right-to-Left Override PDF Pop Directional Format EN European Number - ES European Number Separator - ET European Number Terminator + ES European Separator + ET European Terminator AN Arabic Number - CS Common Number Separator + CS Common Separator NSM Non-Spacing Mark BN Boundary Neutral B Paragraph Separator @@ -410,342 +450,105 @@ the BidiClass class: WS Whitespace ON Other Neutrals -For example, C<\p{BidiClass:R}> matches characters that are normally +This property is always written in the compound form. +For example, C<\p{Bidi_Class:R}> matches characters that are normally written right to left. 
-=item Scripts - -The script names which can be used by C<\p{...}> and C<\P{...}>, -such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows: - - Arabic - Armenian - Balinese - Bengali - Bopomofo - Braille - Buginese - Buhid - CanadianAboriginal - Cherokee - Coptic - Cuneiform - Cypriot - Cyrillic - Deseret - Devanagari - Ethiopic - Georgian - Glagolitic - Gothic - Greek - Gujarati - Gurmukhi - Han - Hangul - Hanunoo - Hebrew - Hiragana - Inherited - Kannada - Katakana - Kharoshthi - Khmer - Lao - Latin - Limbu - LinearB - Malayalam - Mongolian - Myanmar - NewTaiLue - Nko - Ogham - OldItalic - OldPersian - Oriya - Osmanya - PhagsPa - Phoenician - Runic - Shavian - Sinhala - SylotiNagri - Syriac - Tagalog - Tagbanwa - TaiLe - Tamil - Telugu - Thaana - Thai - Tibetan - Tifinagh - Ugaritic - Yi - -=item Extended property classes - -Extended property classes can supplement the basic -properties, defined by the F Unicode database: - - ASCIIHexDigit - BidiControl - Dash - Deprecated - Diacritic - Extender - HexDigit - Hyphen - Ideographic - IDSBinaryOperator - IDSTrinaryOperator - JoinControl - LogicalOrderException - NoncharacterCodePoint - OtherAlphabetic - OtherDefaultIgnorableCodePoint - OtherGraphemeExtend - OtherIDStart - OtherIDContinue - OtherLowercase - OtherMath - OtherUppercase - PatternSyntax - PatternWhiteSpace - QuotationMark - Radical - SoftDotted - STerm - TerminalPunctuation - UnifiedIdeograph - VariationSelector - WhiteSpace - -and there are further derived properties: - - Alphabetic = Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic - Lowercase = Ll + OtherLowercase - Uppercase = Lu + OtherUppercase - Math = Sm + OtherMath - - IDStart = Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart - IDContinue = IDStart + Mn + Mc + Nd + Pc + OtherIDContinue - - DefaultIgnorableCodePoint - = OtherDefaultIgnorableCodePoint - + Cf + Cc + Cs + Noncharacters + VariationSelector - - WhiteSpace - FFF9..FFFB (Annotation Characters) - - Any = Any code points (i.e. 
U+0000 to U+10FFFF) - Assigned = Any non-Cn code points (i.e. synonym for \P{Cn}) - Unassigned = Synonym for \p{Cn} - ASCII = ASCII (i.e. U+0000 to U+007F) - - Common = Any character (or unassigned code point) - not explicitly assigned to a script - -=item Use of "Is" Prefix +=head3 B + +The world's languages are written in a number of scripts. This sentence is +written in Latin, while Russian is written in Cyrillic, and Greek is written in, +well, Greek; Japanese mainly in Hiragana or Katakana. There are many more. + +The Unicode Script property gives what script a given character is in, +and can be matched with the compound form like C<\p{Script=Hebrew}> (short: +C<\p{sc=hebr}>). Perl furnishes shortcuts for all script names. You can omit +everything up through the equals (or colon), and simply write C<\p{Latin}> or +C<\P{Cyrillic}>. + +A complete list of scripts and their shortcuts is in L. + +=head3 B + +There are many more property classes than the basic ones described here, +including some Perl extensions. +A complete list is in L. +The extensions are more fully described in L. + +=head3 B For backward compatibility (with Perl 5.6), all properties mentioned -so far may have C prepended to their name, so C<\P{IsLu}>, for -example, is equal to C<\P{Lu}>. +so far may have C or C prepended to their name, so C<\P{Is_Lu}>, for +example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to +C<\p{Arabic}>. -=item Blocks +=head3 B In addition to B, Unicode also defines B of characters. The difference between scripts and blocks is that the concept of scripts is closer to natural languages, while the concept -of blocks is more of an artificial grouping based on groups of 256 -Unicode characters. For example, the C script contains letters -from many blocks but does not contain all the characters from those -blocks. It does not, for example, contain digits, because digits are -shared across many scripts. 
Digits and similar groups, like -punctuation, are in a category called C. - -For more about scripts, see the UAX#24 "Script Names": - - http://www.unicode.org/reports/tr24/ - -For more about blocks, see: - - http://www.unicode.org/Public/UNIDATA/Blocks.txt - -Block names are given with the C prefix. For example, the -Katakana block is referenced via C<\p{InKatakana}>. The C -prefix may be omitted if there is no naming conflict with a script -or any other property, but it is recommended that C always be used -for block tests to avoid confusion. - -These block names are supported: - - InAegeanNumbers - InAlphabeticPresentationForms - InAncientGreekMusicalNotation - InAncientGreekNumbers - InArabic - InArabicPresentationFormsA - InArabicPresentationFormsB - InArabicSupplement - InArmenian - InArrows - InBalinese - InBasicLatin - InBengali - InBlockElements - InBopomofo - InBopomofoExtended - InBoxDrawing - InBraillePatterns - InBuginese - InBuhid - InByzantineMusicalSymbols - InCJKCompatibility - InCJKCompatibilityForms - InCJKCompatibilityIdeographs - InCJKCompatibilityIdeographsSupplement - InCJKRadicalsSupplement - InCJKStrokes - InCJKSymbolsAndPunctuation - InCJKUnifiedIdeographs - InCJKUnifiedIdeographsExtensionA - InCJKUnifiedIdeographsExtensionB - InCherokee - InCombiningDiacriticalMarks - InCombiningDiacriticalMarksSupplement - InCombiningDiacriticalMarksforSymbols - InCombiningHalfMarks - InControlPictures - InCoptic - InCountingRodNumerals - InCuneiform - InCuneiformNumbersAndPunctuation - InCurrencySymbols - InCypriotSyllabary - InCyrillic - InCyrillicSupplement - InDeseret - InDevanagari - InDingbats - InEnclosedAlphanumerics - InEnclosedCJKLettersAndMonths - InEthiopic - InEthiopicExtended - InEthiopicSupplement - InGeneralPunctuation - InGeometricShapes - InGeorgian - InGeorgianSupplement - InGlagolitic - InGothic - InGreekExtended - InGreekAndCoptic - InGujarati - InGurmukhi - InHalfwidthAndFullwidthForms - InHangulCompatibilityJamo - InHangulJamo - 
InHangulSyllables - InHanunoo - InHebrew - InHighPrivateUseSurrogates - InHighSurrogates - InHiragana - InIPAExtensions - InIdeographicDescriptionCharacters - InKanbun - InKangxiRadicals - InKannada - InKatakana - InKatakanaPhoneticExtensions - InKharoshthi - InKhmer - InKhmerSymbols - InLao - InLatin1Supplement - InLatinExtendedA - InLatinExtendedAdditional - InLatinExtendedB - InLatinExtendedC - InLatinExtendedD - InLetterlikeSymbols - InLimbu - InLinearBIdeograms - InLinearBSyllabary - InLowSurrogates - InMalayalam - InMathematicalAlphanumericSymbols - InMathematicalOperators - InMiscellaneousMathematicalSymbolsA - InMiscellaneousMathematicalSymbolsB - InMiscellaneousSymbols - InMiscellaneousSymbolsAndArrows - InMiscellaneousTechnical - InModifierToneLetters - InMongolian - InMusicalSymbols - InMyanmar - InNKo - InNewTaiLue - InNumberForms - InOgham - InOldItalic - InOldPersian - InOpticalCharacterRecognition - InOriya - InOsmanya - InPhagspa - InPhoenician - InPhoneticExtensions - InPhoneticExtensionsSupplement - InPrivateUseArea - InRunic - InShavian - InSinhala - InSmallFormVariants - InSpacingModifierLetters - InSpecials - InSuperscriptsAndSubscripts - InSupplementalArrowsA - InSupplementalArrowsB - InSupplementalMathematicalOperators - InSupplementalPunctuation - InSupplementaryPrivateUseAreaA - InSupplementaryPrivateUseAreaB - InSylotiNagri - InSyriac - InTagalog - InTagbanwa - InTags - InTaiLe - InTaiXuanJingSymbols - InTamil - InTelugu - InThaana - InThai - InTibetan - InTifinagh - InUgaritic - InUnifiedCanadianAboriginalSyllabics - InVariationSelectors - InVariationSelectorsSupplement - InVerticalForms - InYiRadicals - InYiSyllables - InYijingHexagramSymbols +of blocks is more of an artificial grouping based on groups of Unicode +characters with consecutive ordinal values. For example, the C +block is all characters whose ordinals are between 0 and 127, inclusive, in +other words, the ASCII characters. 
The C script contains some letters +from this block as well as several more, like C, +C, I, but it does not contain all the characters from +those blocks. It does not, for example, contain digits, because digits are +shared across many scripts. Digits and similar groups, like punctuation, are in +the script called C. There is also a script called C for +characters that modify other characters, and inherit the script value of the +controlling character. + +For more about scripts versus blocks, see UAX#24 "Unicode Script Property": +L + +The Script property is likely to be the one you want to use when processing +natural language; the Block property may be useful in working with the nuts and +bolts of Unicode. + +Block names are matched in the compound form, like C<\p{Block: Arrows}> or +C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a +Unicode-defined short name. But Perl does provide a (slight) shortcut: You +can say, for example, C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards +compatibility, the C prefix may be omitted if there is no naming conflict +with a script or any other property, and you can even use an C prefix +instead in those cases. But it is not a good idea to do this, for a couple of +reasons: + +=over 4 + +=item 1 + +It is confusing. There are many naming conflicts, and you may forget some. +For example, \p{Hebrew} means the I