From: Jarkko Hietaniemi
Date: Fri, 4 May 2001 03:22:59 +0000 (+0000)
Subject: Document the \pX and \p{Yz} (and \p{BidiXYZ}) classes a bit more.
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=98f22ffc44bf9f52e28c7b1406e69c58dcd67835;p=p5sagit%2Fp5-mst-13.2.git

Document the \pX and \p{Yz} (and \p{BidiXYZ}) classes a bit more.

p4raw-id: //depot/perl@9983
---

diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index fa6479c..2960950 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -1745,7 +1745,75 @@ You can also use the official Unicode class names with the C<\p> and
 C<\P>, like C<\p{L}> for Unicode 'letters', or C<\p{Lu}> for uppercase
 letters, or C<\P{Nd}> for non-digits. If a C<name> is just one
 letter, the braces can be dropped. For instance, C<\pM> is the
-character class of Unicode 'marks'.
+character class of Unicode 'marks', for example accent marks.
+Here is the list as of Unicode 3.1.0 (the two-letter classes) and
+Perl 5.8.0 (the one-letter classes):
+
+    L   Letter
+    Lu  Letter, Uppercase
+    Ll  Letter, Lowercase
+    Lt  Letter, Titlecase
+    Lm  Letter, Modifier
+    Lo  Letter, Other
+    M   Mark
+    Mn  Mark, Non-Spacing
+    Mc  Mark, Spacing Combining
+    Me  Mark, Enclosing
+    N   Number
+    Nd  Number, Decimal Digit
+    Nl  Number, Letter
+    No  Number, Other
+    P   Punctuation
+    Pc  Punctuation, Connector
+    Pd  Punctuation, Dash
+    Ps  Punctuation, Open
+    Pe  Punctuation, Close
+    Pi  Punctuation, Initial quote
+        (may behave like Ps or Pe depending on usage)
+    Pf  Punctuation, Final quote
+        (may behave like Ps or Pe depending on usage)
+    Po  Punctuation, Other
+    S   Symbol
+    Sm  Symbol, Math
+    Sc  Symbol, Currency
+    Sk  Symbol, Modifier
+    So  Symbol, Other
+    Z   Separator
+    Zs  Separator, Space
+    Zl  Separator, Line
+    Zp  Separator, Paragraph
+    C   Other
+    Cc  Other, Control
+    Cf  Other, Format
+    Cs  Other, Surrogate
+    Co  Other, Private Use
+    Cn  Other, Not Assigned (Unicode defines no Cn characters)
+
+Additionally, because scripts differ in their directionality
+(for example Hebrew is written right to left), all characters
+have their directionality defined:
+
+    BidiL    Left-to-Right
+    BidiLRE  Left-to-Right Embedding
+    BidiLRO  Left-to-Right Override
+    BidiR    Right-to-Left
+    BidiAL   Right-to-Left Arabic
+    BidiRLE  Right-to-Left Embedding
+    BidiRLO  Right-to-Left Override
+    BidiPDF  Pop Directional Format
+    BidiEN   European Number
+    BidiES   European Number Separator
+    BidiET   European Number Terminator
+    BidiAN   Arabic Number
+    BidiCS   Common Number Separator
+    BidiNSM  Non-Spacing Mark
+    BidiBN   Boundary Neutral
+    BidiB    Paragraph Separator
+    BidiS    Segment Separator
+    BidiWS   Whitespace
+    BidiON   Other Neutrals
+
+For the full and latest information see the latest Unicode standard.
 
 C<\X> is an abbreviation for a character class sequence that includes
 the Unicode 'combining character sequences'. A 'combining character