From: Jarkko Hietaniemi Date: Mon, 11 Jun 2001 17:55:47 +0000 (+0000) Subject: Move the full \p\P lists to perlunicode. X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=3229381570cd559d546c04f62dc3e86718ceccd8;p=p5sagit%2Fp5-mst-13.2.git Move the full \p\P lists to perlunicode. p4raw-id: //depot/perl@10520 --- diff --git a/pod/perlretut.pod b/pod/perlretut.pod index 2960950..77e9942 100644 --- a/pod/perlretut.pod +++ b/pod/perlretut.pod @@ -1746,72 +1746,11 @@ C<\P>, like C<\p{L}> for Unicode 'letters', or C<\p{Lu}> for uppercase letters, or C<\P{Nd}> for non-digits. If a C is just one letter, the braces can be dropped. For instance, C<\pM> is the character class of Unicode 'marks', for example accent marks. -Here is the list as of Unicode 3.1.0 (the two-letter classes) and -Perl 5.8.0 (the one-letter classes): - - L Letter - Lu Letter, Uppercase - Ll Letter, Lowercase - Lt Letter, Titlecase - Lm Letter, Modifier - Lo Letter, Other - M Mark - Mn Mark, Non-Spacing - Mc Mark, Spacing Combining - Me Mark, Enclosing - N Number - Nd Number, Decimal Digit - Nl Number, Letter - No Number, Other - P Punctuation - Pc Punctuation, Connector - Pd Punctuation, Dash - Ps Punctuation, Open - Pe Punctuation, Close - Pi Punctuation, Initial quote - (may behave like Ps or Pe depending on usage) - Pf Punctuation, Final quote - (may behave like Ps or Pe depending on usage) - Po Punctuation, Other - S Symbol - Sm Symbol, Math - Sc Symbol, Currency - Sk Symbol, Modifier - So Symbol, Other - Z Separator - Zs Separator, Space - Zl Separator, Line - Zp Separator, Paragraph - C Other - Cc Other, Control - Cf Other, Format - Cs Other, Surrogate - Co Other, Private Use - Cn Other, Not Assigned (Unicode defines no Cn characters) - -Additionally, because scripts differ in their directionality -(for example Hebrew is written right to left), all characters -have their directionality defined: - - BidiL Left-to-Right - BidiLRE Left-to-Right Embedding - BidiLRO Left-to-Right Override - BidiR Right-to-Left - BidiAL Right-to-Left Arabic - BidiRLE Right-to-Left Embedding - BidiRLO Right-to-Left Override - BidiPDF Pop Directional Format - BidiEN European Number - BidiES European Number Separator - BidiET European Number Terminator - BidiAN Arabic Number - BidiCS Common Number Separator - BidiNSM Non-Spacing Mark - BidiBN Boundary Neutral - BidiB Paragraph Separator - BidiS Segment Separator - BidiWS Whitespace - BidiON Other Neutrals +For the full list see L. + +The Unicode has also been separated into blocks of charaters which you +can test with C<\p{InBlock}> and C<\P{InBlock}>, for example C<\p{InGreek}> +and C<\P{InKatakana}. For the full list see L. For the the full and latest information see the latest Unicode standard. diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod index 12bee5c..d629cab 100644 --- a/pod/perlunicode.pod +++ b/pod/perlunicode.pod @@ -165,6 +165,173 @@ names of the C classes are the official Unicode block names but with all non-alphanumeric characters removed, for example the block name C<"Latin-1 Supplement"> becomes C<\p{InLatin1Supplement}>. +Here is the list as of Unicode 3.1.0 (the two-letter classes) and +Perl 5.8.0 (the one-letter classes): + + L Letter + Lu Letter, Uppercase + Ll Letter, Lowercase + Lt Letter, Titlecase + Lm Letter, Modifier + Lo Letter, Other + M Mark + Mn Mark, Non-Spacing + Mc Mark, Spacing Combining + Me Mark, Enclosing + N Number + Nd Number, Decimal Digit + Nl Number, Letter + No Number, Other + P Punctuation + Pc Punctuation, Connector + Pd Punctuation, Dash + Ps Punctuation, Open + Pe Punctuation, Close + Pi Punctuation, Initial quote + (may behave like Ps or Pe depending on usage) + Pf Punctuation, Final quote + (may behave like Ps or Pe depending on usage) + Po Punctuation, Other + S Symbol + Sm Symbol, Math + Sc Symbol, Currency + Sk Symbol, Modifier + So Symbol, Other + Z Separator + Zs Separator, Space + Zl Separator, Line + Zp Separator, Paragraph + C Other + Cc Other, Control + Cf Other, Format + Cs Other, Surrogate + Co Other, Private Use + Cn Other, Not Assigned (Unicode defines no Cn characters) + +Additionally, because scripts differ in their directionality +(for example Hebrew is written right to left), all characters +have their directionality defined: + + BidiL Left-to-Right + BidiLRE Left-to-Right Embedding + BidiLRO Left-to-Right Override + BidiR Right-to-Left + BidiAL Right-to-Left Arabic + BidiRLE Right-to-Left Embedding + BidiRLO Right-to-Left Override + BidiPDF Pop Directional Format + BidiEN European Number + BidiES European Number Separator + BidiET European Number Terminator + BidiAN Arabic Number + BidiCS Common Number Separator + BidiNSM Non-Spacing Mark + BidiBN Boundary Neutral + BidiB Paragraph Separator + BidiS Segment Separator + BidiWS Whitespace + BidiON Other Neutrals + +The blocks available for C<\p{InBlock}> and C<\P{InBlock}>, for +example \p{InCyrillic>, are as follows: + + BasicLatin + Latin1Supplement + LatinExtendedA + LatinExtendedB + IPAExtensions + SpacingModifierLetters + CombiningDiacriticalMarks + Greek + Cyrillic + Armenian + Hebrew + Arabic + Syriac + Thaana + Devanagari + Bengali + Gurmukhi + Gujarati + Oriya + Tamil + Telugu + Kannada + Malayalam + Sinhala + Thai + Lao + Tibetan + Myanmar + Georgian + HangulJamo + Ethiopic + Cherokee + UnifiedCanadianAboriginalSyllabics + Ogham + Runic + Khmer + Mongolian + LatinExtendedAdditional + GreekExtended + GeneralPunctuation + SuperscriptsandSubscripts + CurrencySymbols + CombiningMarksforSymbols + LetterlikeSymbols + NumberForms + Arrows + MathematicalOperators + MiscellaneousTechnical + ControlPictures + OpticalCharacterRecognition + EnclosedAlphanumerics + BoxDrawing + BlockElements + GeometricShapes + MiscellaneousSymbols + Dingbats + BraillePatterns + CJKRadicalsSupplement + KangxiRadicals + IdeographicDescriptionCharacters + CJKSymbolsandPunctuation + Hiragana + Katakana + Bopomofo + HangulCompatibilityJamo + Kanbun + BopomofoExtended + EnclosedCJKLettersandMonths + CJKCompatibility + CJKUnifiedIdeographsExtensionA + CJKUnifiedIdeographs + YiSyllables + YiRadicals + HangulSyllables + HighSurrogates + HighPrivateUseSurrogates + LowSurrogates + PrivateUse + CJKCompatibilityIdeographs + AlphabeticPresentationForms + ArabicPresentationFormsA + CombiningHalfMarks + CJKCompatibilityForms + SmallFormVariants + ArabicPresentationFormsB + Specials + HalfwidthandFullwidthForms + OldItalic + Gothic + Deseret + ByzantineMusicalSymbols + MusicalSymbols + MathematicalAlphanumericSymbols + CJKUnifiedIdeographsExtensionB + CJKCompatibilityIdeographsSupplement + Tags + =item * The special pattern C<\X> match matches any extended Unicode sequence @@ -253,6 +420,6 @@ tend to run slower. Avoidance of locales is strongly encouraged. =head1 SEE ALSO -L, L, L +L, L, L, L =cut