From: Jarkko Hietaniemi
Date: Sun, 21 Oct 2001 13:36:40 +0000 (+0000)
Subject: Prettyprinting.
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=e9ad1727960211698dc6e5554115c0dbf4254536;p=p5sagit%2Fp5-mst-13.2.git

Prettyprinting.

p4raw-id: //depot/perl@12543
---

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 6bd0423..4e7c936 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -254,7 +254,8 @@ The following reserved ranges have C tests:
     Plane 16 Private Use
 
 For example C<"\x{AC00}" =~ \p{HangulSyllable}> will test true.
-(Handling of surrogates is not implemented yet.)
+(Handling of surrogates is not implemented yet, because Perl
+uses UTF-8 and not UTF-16 internally to represent Unicode.)
 
 Additionally, because scripts differ in their directionality
 (for example Hebrew is written right to left), all characters
@@ -285,66 +286,66 @@ have their directionality defined:
 The scripts available for C<\p{In...}> and C<\P{In...}>, for example
 \p{InCyrillic>, are as follows, for example C<\p{InLatin}> or C<\P{InHan}>:
 
-    Latin
-    Greek
-    Cyrillic
-    Armenian
-    Hebrew
     Arabic
-    Syriac
-    Thaana
-    Devanagari
+    Armenian
     Bengali
-    Gurmukhi
+    Bopomofo
+    Canadian-Aboriginal
+    Cherokee
+    Cyrillic
+    Deseret
+    Devanagari
+    Ethiopic
+    Georgian
+    Gothic
+    Greek
     Gujarati
-    Oriya
-    Tamil
-    Telugu
+    Gurmukhi
+    Han
+    Hangul
+    Hebrew
+    Hiragana
+    Inherited
     Kannada
-    Malayalam
-    Sinhala
-    Thai
+    Katakana
+    Khmer
     Lao
-    Tibetan
+    Latin
+    Malayalam
+    Mongolian
     Myanmar
-    Georgian
-    Hangul
-    Ethiopic
-    Cherokee
-    Canadian Aboriginal
     Ogham
+    Old-Italic
+    Oriya
     Runic
-    Khmer
-    Mongolian
-    Hiragana
-    Katakana
-    Bopomofo
-    Han
+    Sinhala
+    Syriac
+    Tamil
+    Telugu
+    Thaana
+    Thai
+    Tibetan
     Yi
-    Old Italic
-    Gothic
-    Deseret
-    Inherited
 
 There are also extended property classes that supplement the basic
 properties, defined by the F Unicode database:
 
-    White_space
+    ASCII_Hex_Digit
     Bidi_Control
-    Join_Control
     Dash
-    Hyphen
-    Quotation_Mark
-    Other_Math
-    Hex_Digit
-    ASCII_Hex_Digit
-    Other_Alphabetic
-    Ideographic
     Diacritic
     Extender
+    Hex_Digit
+    Hyphen
+    Ideographic
+    Join_Control
+    Noncharacter_Code_Point
+    Other_Alphabetic
     Other_Lowercase
+    Other_Math
     Other_Uppercase
-    Noncharacter_Code_Point
+    Quotation_Mark
+    White_space
 
 and further derived properties:
 
@@ -365,11 +366,14 @@ and further derived properties:
 
 In addition to B, Unicode also defines B of
 characters. The difference between scripts and blocks is that the
-former concept is closer to natural languages, while the latter
+scripts concept is closer to natural languages, while the blocks
 concept is more an artificial grouping based on groups of 256 Unicode
 characters. For example, the C script contains letters from
-many blocks, but it does not contain all the characters from those
-blocks, it does not for example contain digits.
+many blocks. On the other hand, the C script does not contain
+all the characters from those blocks, it does not for example contain
+digits because digits are shared across many scripts. Digits and
+other similar groups, like punctuation, are in a category called
+C.
 
 For more about scripts see the UTR #24:
 http://www.unicode.org/unicode/reports/tr24/
@@ -386,102 +390,102 @@ preferential Unicode character class definition; this meant that
 the definitions of some character classes changed (the ones in the
 below list that have the C appended).
 
+    Alphabetic Presentation Forms
+    Arabic Block
+    Arabic Presentation Forms-A
+    Arabic Presentation Forms-B
+    Armenian Block
+    Arrows
     Basic Latin
-    Latin 1 Supplement
-    Latin Extended-A
-    Latin Extended-B
-    IPA Extensions
-    Spacing Modifier Letters
+    Bengali Block
+    Block Elements
+    Bopomofo Block
+    Bopomofo Extended
+    Box Drawing
+    Braille Patterns
+    Byzantine Musical Symbols
+    CJK Compatibility
+    CJK Compatibility Forms
+    CJK Compatibility Ideographs
+    CJK Compatibility Ideographs Supplement
+    CJK Radicals Supplement
+    CJK Symbols and Punctuation
+    CJK Unified Ideographs
+    CJK Unified Ideographs Extension A
+    CJK Unified Ideographs Extension B
+    Cherokee Block
     Combining Diacritical Marks
-    Greek Block
+    Combining Half Marks
+    Combining Marks for Symbols
+    Control Pictures
+    Currency Symbols
     Cyrillic Block
-    Armenian Block
-    Hebrew Block
-    Arabic Block
-    Syriac Block
-    Thaana Block
+    Deseret Block
     Devanagari Block
-    Bengali Block
-    Gurmukhi Block
-    Gujarati Block
-    Oriya Block
-    Tamil Block
-    Telugu Block
-    Kannada Block
-    Malayalam Block
-    Sinhala Block
-    Thai Block
-    Lao Block
-    Tibetan Block
-    Myanmar Block
+    Dingbats
+    Enclosed Alphanumerics
+    Enclosed CJK Letters and Months
+    Ethiopic Block
+    General Punctuation
+    Geometric Shapes
     Georgian Block
+    Gothic Block
+    Greek Block
+    Greek Extended
+    Gujarati Block
+    Gurmukhi Block
+    Halfwidth and Fullwidth Forms
+    Hangul Compatibility Jamo
     Hangul Jamo
-    Ethiopic Block
-    Cherokee Block
-    Unified Canadian Aboriginal Syllabics
-    Ogham Block
-    Runic Block
+    Hangul Syllables
+    Hebrew Block
+    High Private Use Surrogates
+    High Surrogates
+    Hiragana Block
+    IPA Extensions
+    Ideographic Description Characters
+    Kanbun
+    Kangxi Radicals
+    Kannada Block
+    Katakana Block
     Khmer Block
-    Mongolian Block
+    Lao Block
+    Latin 1 Supplement
     Latin Extended Additional
-    Greek Extended
-    General Punctuation
-    Superscripts and Subscripts
-    Currency Symbols
-    Combining Marks for Symbols
+    Latin Extended-A
+    Latin Extended-B
     Letterlike Symbols
-    Number Forms
-    Arrows
+    Low Surrogates
+    Malayalam Block
+    Mathematical Alphanumeric Symbols
     Mathematical Operators
+    Miscellaneous Symbols
     Miscellaneous Technical
-    Control Pictures
+    Mongolian Block
+    Musical Symbols
+    Myanmar Block
+    Number Forms
+    Ogham Block
+    Old Italic Block
     Optical Character Recognition
-    Enclosed Alphanumerics
-    Box Drawing
-    Block Elements
-    Geometric Shapes
-    Miscellaneous Symbols
-    Dingbats
-    Braille Patterns
-    CJK Radicals Supplement
-    Kangxi Radicals
-    Ideographic Description Characters
-    CJK Symbols and Punctuation
-    Hiragana Block
-    Katakana Block
-    Bopomofo Block
-    Hangul Compatibility Jamo
-    Kanbun
-    Bopomofo Extended
-    Enclosed CJK Letters and Months
-    CJK Compatibility
-    CJK Unified Ideographs Extension A
-    CJK Unified Ideographs
-    Yi Syllables
-    Yi Radicals
-    Hangul Syllables
-    High Surrogates
-    High Private Use Surrogates
-    Low Surrogates
+    Oriya Block
     Private Use
-    CJK Compatibility Ideographs
-    Alphabetic Presentation Forms
-    Arabic Presentation Forms-A
-    Combining Half Marks
-    CJK Compatibility Forms
+    Runic Block
+    Sinhala Block
     Small Form Variants
-    Arabic Presentation Forms-B
+    Spacing Modifier Letters
     Specials
-    Halfwidth and Fullwidth Forms
-    Old Italic Block
-    Gothic Block
-    Deseret Block
-    Byzantine Musical Symbols
-    Musical Symbols
-    Mathematical Alphanumeric Symbols
-    CJK Unified Ideographs Extension B
-    CJK Compatibility Ideographs Supplement
+    Superscripts and Subscripts
+    Syriac Block
     Tags
+    Tamil Block
+    Telugu Block
+    Thaana Block
+    Thai Block
+    Tibetan Block
+    Unified Canadian Aboriginal Syllabics
+    Yi Radicals
+    Yi Syllables
 
 =item *