title titlecase equivalent mapping
block block the character belongs to (used in \p{In...})
- script script the character belongs to
+ script script the character belongs to
If no match is found, a reference to an empty hash is returned.
See also L</Blocks versus Scripts>.
-If supplied with an argument that can't be a code point, charblock()
-tries to do the opposite and interpret the argument as a character
-block. The return value is a I<range>: an anonymous list that
-contains anonymous lists, which in turn contain I<start-of-range>,
-I<end-of-range> code point pairs. You can test whether a code point
-is in a range using the L</charinrange> function. If the argument is
-not a known charater block, C<undef> is returned.
+If supplied with an argument that can't be a code point, charblock() tries
+to do the opposite and interpret the argument as a character block. The
+return value is a I<range>: an anonymous list of lists that contain
+I<start-of-range>, I<end-of-range> code point pairs. You can test whether a
+code point is in a range using the L</charinrange> function. If the
+argument is not a known character block, C<undef> is returned.
=cut
See also L</Blocks versus Scripts>.
-If supplied with an argument that can't be a code point, charscript()
-tries to do the opposite and interpret the argument as a character
-script. The return value is a I<range>: an anonymous list that
-contains anonymous lists, which in turn contain I<start-of-range>,
-I<end-of-range> code point pairs. You can test whether a code point
-is in a range using the L</charinrange> function. If the argument is
-not a known charater script, C<undef> is returned.
+If supplied with an argument that can't be a code point, charscript() tries
+to do the opposite and interpret the argument as a character script. The
+return value is a I<range>: an anonymous list of lists that contain
+I<start-of-range>, I<end-of-range> code point pairs. You can test whether a
+code point is in a range using the L</charinrange> function. If the
+argument is not a known character script, C<undef> is returned.
=cut
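As a minimal sketch of the two lookup directions described above, the following assumes a Perl with the bundled Unicode::UCD module. The names C<Basic Latin> and C<Latin> are used only as illustrative arguments, and the exact shape of each inner list in the returned range may vary between versions.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblock charscript charinrange);

# Forward lookup: a code point argument yields a name.
my $block_name  = charblock(0x41);     # block of LATIN CAPITAL LETTER A
my $script_name = charscript(0x41);    # script of the same character

# Reverse lookup: a name argument yields a range, that is, a reference
# to a list of lists holding start-of-range, end-of-range pairs.
my $block_range  = charblock('Basic Latin');
my $script_range = charscript('Latin');

# charinrange() tests a code point against such a range.
my $digit_in_block  = charinrange($block_range,  ord '1') ? 1 : 0;
my $digit_in_script = charinrange($script_range, ord '1') ? 1 : 0;

print "block=$block_name script=$script_name\n";
print "digit 1 in block: $digit_in_block, in script: $digit_in_script\n";
```

The digit falls inside the C<Basic Latin> block but outside the C<Latin> script, which is the distinction the "Blocks versus Scripts" section draws.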
The difference between a block and a script is that scripts are closer
to the linguistic notion of a set of characters required to present
languages, while block is more of an artifact of the Unicode character
-numbering and separation into blocks of 256 characters.
+numbering and separation into blocks of (mostly) 256 characters.
For example the Latin B<script> is spread over several B<blocks>, such
as C<Basic Latin>, C<Latin 1 Supplement>, C<Latin Extended-A>, and
C<Latin Extended-B>. On the other hand, the Latin script does not
contain all the characters of the C<Basic Latin> block (also known as
-the ASCII): it includes only the letters, not for example the digits
+the ASCII): it includes only the letters, and not, for example, the digits
or the punctuation.
For blocks see http://www.unicode.org/Public/UNIDATA/Blocks.txt
=head2 Matching Scripts and Blocks
-Both scripts and blocks can be matched using the regular expression
-construct C<\p{In...}> and its negation C<\P{In...}>.
-
-The name of the script or the block comes after the C<In>, for example
-C<\p{InCyrillic}>, C<\P{InBasicLatin}>. Spaces and dashes ('-') are
-removed from the names for the C<\p{In...}>, for example
-C<LatinExtendedA> instead of C<Latin Extended-A>.
-
-There are a few cases where there is both a script and a block by the
-same name, in these cases the block version has C<Block> appended to
-its name: C<\p{InKatakana}> is the script, C<\p{InKatakanaBlock}> is
-the block.
+Scripts are matched with the regular-expression construct
+C<\p{...}> (e.g. C<\p{Tibetan}> matches characters of the Tibetan script),
+while C<\p{In...}> is used for blocks (e.g. C<\p{InTibetan}> matches
+any of the 256 code points in the Tibetan block).
=head2 Code Point Arguments
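The script and block matching forms described under "Matching Scripts and Blocks" can be tried with a short sketch like the one below. The sample characters are chosen only for illustration, and the variable names are hypothetical.

```perl
use strict;
use warnings;

my $ka    = "\x{0F40}";    # TIBETAN LETTER KA
my $digit = "1";           # DIGIT ONE, in the Basic Latin block

my %result = (
    # A Tibetan letter belongs to both the script and the block.
    ka_script   => ($ka    =~ /\p{Tibetan}/      ? 1 : 0),
    ka_block    => ($ka    =~ /\p{InTibetan}/    ? 1 : 0),
    # A digit is in the Basic Latin block but not in the Latin script.
    digit_latin => ($digit =~ /\p{Latin}/        ? 1 : 0),
    digit_block => ($digit =~ /\p{InBasicLatin}/ ? 1 : 0),
);

printf "%-12s %d\n", $_, $result{$_} for sort keys %result;
```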
with external libraries or existing data. G_FLOAT is still available as
a configuration option. The default on VAX (D_FLOAT) has not changed.
-=head2 Different Definition of the Unicode Character Classes \p{In...}
-
-As suggested by the Unicode consortium, the Unicode character classes
-now prefer I<scripts> as opposed to I<blocks> (as defined by Unicode);
-in Perl, when the C<\p{In....}> and the C<\p{In....}> regular expression
-constructs are used. This has changed the definition of some of those
-character classes.
-
-The difference between scripts and blocks is that scripts are the
-glyphs used by a language or a group of languages, while the blocks
-are more artificial groupings of 256 characters based on the Unicode
-numbering.
-
-In general this change results in more inclusive Unicode character
-classes, but changes to the other direction also do take place:
-for example while the script C<Latin> includes all the Latin
-characters and their various diacritic-adorned versions, it
-does not include the various punctuation or digits (since they
-are not solely C<Latin>).
-
-Changes in the character class semantics may have happened if a script
-and a block happen to have the same name, for example C<Hebrew>.
-In such cases the script wins and C<\p{InHebrew}> now means the script
-definition of Hebrew. The block definition in still available,
-though, by appending C<Block> to the name: C<\p{InHebrewBlock}> means
-what C<\p{InHebrew}> meant in perl 5.6.0. For the full list
-of affected character classes, see L<perlunicode/Blocks>.
+=head2 New Unicode Properties
+
+Unicode I<scripts> are now supported. Scripts are similar to (and superior
+to) Unicode I<blocks>. The difference between scripts and blocks is that
+scripts are the glyphs used by a language or a group of languages, while
+the blocks are more artificial groupings of (mostly) 256 characters based
+on the Unicode numbering.
+
+In general, scripts are more inclusive, but not universally so. For
+example, while the script C<Latin> includes all the Latin characters and
+their various diacritic-adorned versions, it does not include the various
+punctuation or digits (since they are not solely C<Latin>).
+
+A number of other properties are now supported, including C<\p{L&}>,
+C<\p{Any}>, C<\p{Assigned}>, C<\p{Unassigned}>, C<\p{Blank}> and
+C<\p{SpacePerl}> (along with their C<\P{...}> versions, of course).
+See L<perlunicode> for details, and more additions.
+
+The C<In> or C<Is> prefix to names used with C<\p{...}> and C<\P{...}>
+is now almost always optional. The only exception is that an C<In> prefix
+is required to signify a Unicode block when a block name conflicts with a
+script name. For example, C<\p{Tibetan}> refers to the script, while
+C<\p{InTibetan}> refers to the block. When there is no name conflict, you
+can omit the C<In> from the block name (e.g. C<\p{BraillePatterns}>), but
+to be safe, it's probably best to always use the C<In>.
=head2 Perl Parser Stress Tested
=item *
-The Unicode character classes \p{Blank} and \p{SpacePerl} have been
-added. "Blank" is like C isblank(), that is, it contains only
-"horizontal whitespace" (the space character is, the newline isn't),
-and the "SpacePerl" is the Unicode equivalent of C<\s> (\p{Space}
-isn't, since that includes the vertical tabulator character, whereas
-C<\s> doesn't.)
+The properties \p{Blank} and \p{SpacePerl} have been added. "Blank" is like
+C isblank(), that is, it contains only "horizontal whitespace" (the space
+character is, the newline isn't), and the "SpacePerl" is the Unicode
+equivalent of C<\s> (\p{Space} isn't, since that includes the vertical
+tabulator character, whereas C<\s> doesn't.)
+
+See "New Unicode Properties" earlier in this document for additional
+information on changes with Unicode properties.
=back
=item *
-Named Unicode properties and block ranges may be used as character
-classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't
-match property) constructs. For instance, C<\p{Lu}> matches any
+Named Unicode properties, scripts, and block ranges may be used like
+character classes via the new C<\p{}> (matches property) and C<\P{}>
+(doesn't match property) constructs. For instance, C<\p{Lu}> matches any
character with the Unicode "Lu" (Letter, uppercase) property, while
C<\p{M}> matches any character with a "M" (mark -- accents and such)
-property. Single letter properties may omit the brackets, so that can
-be written C<\pM> also. Many predefined character classes are
-available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.
-
-The C<\p{Is...}> test for "general properties" such as "letter",
-"digit", while the C<\p{In...}> test for Unicode scripts and blocks.
+property. Single letter properties may omit the brackets, so that can be
+written C<\pM> also. Many predefined properties are available, such
+as C<\p{Mirrored}> and C<\p{Tibetan}>.
The official Unicode script and block names have spaces and dashes as
-separators, but for convenience you can have dashes, spaces, and
-underbars at every word division, and you need not care about correct
-casing. It is recommended, however, that for consistency you use the
-following naming: the official Unicode script, block, or property name
-(see below for the additional rules that apply to block names), with
-whitespace and dashes replaced with underbar, and the words
-"uppercase-first-lowercase-rest". That is, "Latin-1 Supplement"
-becomes "Latin_1_Supplement".
+separators, but for convenience you can have dashes, spaces, and underbars
+at every word division, and you need not care about correct casing. It is
+recommended, however, that for consistency you use the following naming:
+the official Unicode script, block, or property name (see below for the
+additional rules that apply to block names), with whitespace and dashes
+removed, and the words "uppercase-first-lowercase-rest". That is, "Latin-1
+Supplement" becomes "Latin1Supplement".
You can also negate both C<\p{}> and C<\P{}> by introducing a caret
-(^) between the first curly and the property name: C<\p{^In_Tamil}> is
-equal to C<\P{In_Tamil}>.
+(^) between the first curly and the property name: C<\p{^Tamil}> is
+equal to C<\P{Tamil}>.
-The C<In> and C<Is> can be left out: C<\p{Greek}> is equal to
-C<\p{In_Greek}>, C<\P{Pd}> is equal to C<\P{Pd}>.
+Here are the basic Unicode General Category properties, followed by their
+long form (you can use either, e.g. C<\p{Lu}> and C<\p{UppercaseLetter}>
+are identical).
Short Long
L Letter
- Lu Uppercase_Letter
- Ll Lowercase_Letter
- Lt Titlecase_Letter
- Lm Modifier_Letter
- Lo Other_Letter
+ Lu UppercaseLetter
+ Ll LowercaseLetter
+ Lt TitlecaseLetter
+ Lm ModifierLetter
+ Lo OtherLetter
M Mark
- Mn Nonspacing_Mark
- Mc Spacing_Mark
- Me Enclosing_Mark
+ Mn NonspacingMark
+ Mc SpacingMark
+ Me EnclosingMark
N Number
- Nd Decimal_Number
- Nl Letter_Number
- No Other_Number
+ Nd DecimalNumber
+ Nl LetterNumber
+ No OtherNumber
P Punctuation
- Pc Connector_Punctuation
- Pd Dash_Punctuation
- Ps Open_Punctuation
- Pe Close_Punctuation
- Pi Initial_Punctuation
+ Pc ConnectorPunctuation
+ Pd DashPunctuation
+ Ps OpenPunctuation
+ Pe ClosePunctuation
+ Pi InitialPunctuation
(may behave like Ps or Pe depending on usage)
- Pf Final_Punctuation
+ Pf FinalPunctuation
(may behave like Ps or Pe depending on usage)
- Po Other_Punctuation
+ Po OtherPunctuation
S Symbol
- Sm Math_Symbol
- Sc Currency_Symbol
- Sk Modifier_Symbol
- So Other_Symbol
+ Sm MathSymbol
+ Sc CurrencySymbol
+ Sk ModifierSymbol
+ So OtherSymbol
Z Separator
- Zs Space_Separator
- Zl Line_Separator
- Zp Paragraph_Separator
+ Zs SpaceSeparator
+ Zl LineSeparator
+ Zp ParagraphSeparator
C Other
Cc Control
Cf Format
- Cs Surrogate
- Co Private_Use
+ Cs Surrogate (not usable)
+ Co PrivateUse
Cn Unassigned
The single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
There's also C<L&> which is an alias for C<Ll>, C<Lu>, and C<Lt>.
-The following reserved ranges have C<In> tests:
-
- CJK_Ideograph_Extension_A
- CJK_Ideograph
- Hangul_Syllable
- Non_Private_Use_High_Surrogate
- Private_Use_High_Surrogate
- Low_Surrogate
- Private_Surrogate
- CJK_Ideograph_Extension_B
- Plane_15_Private_Use
- Plane_16_Private_Use
-
-For example C<"\x{AC00}" =~ \p{HangulSyllable}> will test true.
-(Handling of surrogates is not implemented yet, because Perl
-uses UTF-8 and not UTF-16 internally to represent Unicode.
-So you really can't use the "Cs" category.)
+Because Perl hides the need for the user to understand the internal
+representation of Unicode characters, it has no need to support the
+somewhat messy concept of surrogates. Therefore, the C<Cs> property is not
+supported.
-Additionally, because scripts differ in their directionality
-(for example Hebrew is written right to left), all characters
-have their directionality defined:
+Because scripts differ in their directionality (for example Hebrew is
+written right to left), Unicode supplies these properties:
+ Property Meaning
+
BidiL Left-to-Right
BidiLRE Left-to-Right Embedding
BidiLRO Left-to-Right Override
BidiWS Whitespace
BidiON Other Neutrals
+For example, C<\p{BidiR}> matches all characters that are normally
+written right to left.
+
=back
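The short and long General Category names, the brace-less single-letter form, and caret negation described in this list can be checked with a sketch such as the following. The sample characters and variable names are illustrative only.

```perl
use strict;
use warnings;

my $tamil_a = "\x{0B85}";    # TAMIL LETTER A
my $acute   = "\x{0301}";    # COMBINING ACUTE ACCENT, a mark

# Short and long General Category names select the same characters.
my $short_form = ('A' =~ /^\p{Lu}$/)              ? 1 : 0;
my $long_form  = ('A' =~ /^\p{UppercaseLetter}$/) ? 1 : 0;

# A single-letter property may drop the braces.
my $no_braces  = ($acute =~ /^\pM$/)              ? 1 : 0;

# L& covers Ll, Lu, and Lt.
my $cased      = ('a' =~ /^\p{L&}$/)              ? 1 : 0;

# A caret inside \p{} negates, just as \P{} does.
my $caret      = ('a' =~ /^\p{^Tamil}$/)          ? 1 : 0;
my $capital_p  = ('a' =~ /^\P{Tamil}$/)           ? 1 : 0;
my $tamil      = ($tamil_a =~ /^\p{Tamil}$/)      ? 1 : 0;

print join(' ', $short_form, $long_form, $no_braces,
                $cased, $caret, $capital_p, $tamil), "\n";
```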
=head2 Scripts
-The scripts available for C<\p{In...}> and C<\P{In...}>, for example
-C<\p{InLatin}> or \p{InCyrillic>, are as follows:
+The scripts available via C<\p{...}> and C<\P{...}>, for example
+C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:
Arabic
Armenian
Bengali
Bopomofo
- Canadian-Aboriginal
+ CanadianAboriginal
Cherokee
Cyrillic
Deseret
Mongolian
Myanmar
Ogham
- Old-Italic
+ OldItalic
Oriya
Runic
Sinhala
properties, defined by the F<PropList> Unicode database:
ASCII_Hex_Digit
- Bidi_Control
+ BidiControl
Dash
Diacritic
Extender
- Hex_Digit
+ HexDigit
Hyphen
Ideographic
- Join_Control
- Noncharacter_Code_Point
- Other_Alphabetic
- Other_Lowercase
- Other_Math
- Other_Uppercase
- Quotation_Mark
- White_Space
+ JoinControl
+ NoncharacterCodePoint
+ OtherAlphabetic
+ OtherLowercase
+ OtherMath
+ OtherUppercase
+ QuotationMark
+ WhiteSpace
and further derived properties:
- Alphabetic Lu + Ll + Lt + Lm + Lo + Other_Alphabetic
- Lowercase Ll + Other_Lowercase
- Uppercase Lu + Other_Uppercase
- Math Sm + Other_Math
+ Alphabetic Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
+ Lowercase Ll + OtherLowercase
+ Uppercase Lu + OtherUppercase
+ Math Sm + OtherMath
ID_Start Lu + Ll + Lt + Lm + Lo + Nl
ID_Continue ID_Start + Mn + Mc + Nd + Pc
Any Any character
- Assigned Any non-Cn character
+ Assigned Any non-Cn character (i.e. synonym for C<\P{Cn}>)
+ Unassigned Synonym for C<\p{Cn}>
Common Any character (or unassigned code point)
not explicitly assigned to a script
+For backward compatibility, all properties mentioned so far may have C<Is>
+prepended to their name (e.g. C<\P{IsLu}> is equal to C<\P{Lu}>).
+
=head2 Blocks
-In addition to B<scripts>, Unicode also defines B<blocks> of
-characters. The difference between scripts and blocks is that the
-scripts concept is closer to natural languages, while the blocks
-concept is more an artificial grouping based on groups of 256 Unicode
-characters. For example, the C<Latin> script contains letters from
-many blocks. On the other hand, the C<Latin> script does not contain
-all the characters from those blocks. It does not, for example,
-contain digits because digits are shared across many scripts. Digits
-and other similar groups, like punctuation, are in a category called
-C<Common>.
+In addition to B<scripts>, Unicode also defines B<blocks> of characters.
+The difference between scripts and blocks is that the scripts concept is
+closer to natural languages, while the blocks concept is more an artificial
+grouping based on groups of mostly 256 Unicode characters. For example, the
+C<Latin> script contains letters from many blocks. On the other hand, the
+C<Latin> script does not contain all the characters from those blocks. It
+does not, for example, contain digits because digits are shared across many
+scripts. Digits and other similar groups, like punctuation, are in a
+category called C<Common>.
For more about scripts, see the UTR #24:
http://www.unicode.org/Public/UNIDATA/Blocks.txt
-Because there are overlaps in naming (there are, for example, both
-a script called C<Katakana> and a block called C<Katakana>, the block
-version has C<Block> appended to its name, C<\p{InKatakanaBlock}>.
-
-Notice that this definition was introduced in Perl 5.8.0: in Perl
-5.6 only the blocks were used; in Perl 5.8.0 scripts became the
-preferential Unicode character class definition (prompted by
-recommendations from the Unicode consortium); this meant that
-the definitions of some character classes changed (the ones in
-the below list that have the C<Block> appended).
-
- Alphabetic Presentation Forms
- Arabic Block
- Arabic Presentation Forms-A
- Arabic Presentation Forms-B
- Armenian Block
- Arrows
- Basic Latin
- Bengali Block
- Block Elements
- Bopomofo Block
- Bopomofo Extended
- Box Drawing
- Braille Patterns
- Byzantine Musical Symbols
- CJK Compatibility
- CJK Compatibility Forms
- CJK Compatibility Ideographs
- CJK Compatibility Ideographs Supplement
- CJK Radicals Supplement
- CJK Symbols and Punctuation
- CJK Unified Ideographs
- CJK Unified Ideographs Extension A
- CJK Unified Ideographs Extension B
- Cherokee Block
- Combining Diacritical Marks
- Combining Half Marks
- Combining Marks for Symbols
- Control Pictures
- Currency Symbols
- Cyrillic Block
- Deseret Block
- Devanagari Block
- Dingbats
- Enclosed Alphanumerics
- Enclosed CJK Letters and Months
- Ethiopic Block
- General Punctuation
- Geometric Shapes
- Georgian Block
- Gothic Block
- Greek Block
- Greek Extended
- Gujarati Block
- Gurmukhi Block
- Halfwidth and Fullwidth Forms
- Hangul Compatibility Jamo
- Hangul Jamo
- Hangul Syllables
- Hebrew Block
- High Private Use Surrogates
- High Surrogates
- Hiragana Block
- IPA Extensions
- Ideographic Description Characters
- Kanbun
- Kangxi Radicals
- Kannada Block
- Katakana Block
- Khmer Block
- Lao Block
- Latin 1 Supplement
- Latin Extended Additional
- Latin Extended-A
- Latin Extended-B
- Letterlike Symbols
- Low Surrogates
- Malayalam Block
- Mathematical Alphanumeric Symbols
- Mathematical Operators
- Miscellaneous Symbols
- Miscellaneous Technical
- Mongolian Block
- Musical Symbols
- Myanmar Block
- Number Forms
- Ogham Block
- Old Italic Block
- Optical Character Recognition
- Oriya Block
- Private Use
- Runic Block
- Sinhala Block
- Small Form Variants
- Spacing Modifier Letters
- Specials
- Superscripts and Subscripts
- Syriac Block
- Tags
- Tamil Block
- Telugu Block
- Thaana Block
- Thai Block
- Tibetan Block
- Unified Canadian Aboriginal Syllabics
- Yi Radicals
- Yi Syllables
+Block names are given with the C<In> prefix. For example, the
+Katakana block is referenced via C<\p{InKatakana}>. The C<In>
+prefix may be omitted if there is no naming conflict with a script
+or any other property, but it is recommended that C<In> always be used
+to avoid confusion.
+
+These block names are supported:
+
+ InAlphabeticPresentationForms
+ InArabicBlock
+ InArabicPresentationFormsA
+ InArabicPresentationFormsB
+ InArmenianBlock
+ InArrows
+ InBasicLatin
+ InBengaliBlock
+ InBlockElements
+ InBopomofoBlock
+ InBopomofoExtended
+ InBoxDrawing
+ InBraillePatterns
+ InByzantineMusicalSymbols
+ InCJKCompatibility
+ InCJKCompatibilityForms
+ InCJKCompatibilityIdeographs
+ InCJKCompatibilityIdeographsSupplement
+ InCJKRadicalsSupplement
+ InCJKSymbolsAndPunctuation
+ InCJKUnifiedIdeographs
+ InCJKUnifiedIdeographsExtensionA
+ InCJKUnifiedIdeographsExtensionB
+ InCherokeeBlock
+ InCombiningDiacriticalMarks
+ InCombiningHalfMarks
+ InCombiningMarksForSymbols
+ InControlPictures
+ InCurrencySymbols
+ InCyrillicBlock
+ InDeseretBlock
+ InDevanagariBlock
+ InDingbats
+ InEnclosedAlphanumerics
+ InEnclosedCJKLettersAndMonths
+ InEthiopicBlock
+ InGeneralPunctuation
+ InGeometricShapes
+ InGeorgianBlock
+ InGothicBlock
+ InGreekBlock
+ InGreekExtended
+ InGujaratiBlock
+ InGurmukhiBlock
+ InHalfwidthAndFullwidthForms
+ InHangulCompatibilityJamo
+ InHangulJamo
+ InHangulSyllables
+ InHebrewBlock
+ InHighPrivateUseSurrogates
+ InHighSurrogates
+ InHiraganaBlock
+ InIPAExtensions
+ InIdeographicDescriptionCharacters
+ InKanbun
+ InKangxiRadicals
+ InKannadaBlock
+ InKatakanaBlock
+ InKhmerBlock
+ InLaoBlock
+ InLatin1Supplement
+ InLatinExtendedAdditional
+ InLatinExtendedA
+ InLatinExtendedB
+ InLetterlikeSymbols
+ InLowSurrogates
+ InMalayalamBlock
+ InMathematicalAlphanumericSymbols
+ InMathematicalOperators
+ InMiscellaneousSymbols
+ InMiscellaneousTechnical
+ InMongolianBlock
+ InMusicalSymbols
+ InMyanmarBlock
+ InNumberForms
+ InOghamBlock
+ InOldItalicBlock
+ InOpticalCharacterRecognition
+ InOriyaBlock
+ InPrivateUse
+ InRunicBlock
+ InSinhalaBlock
+ InSmallFormVariants
+ InSpacingModifierLetters
+ InSpecials
+ InSuperscriptsAndSubscripts
+ InSyriacBlock
+ InTags
+ InTamilBlock
+ InTeluguBlock
+ InThaanaBlock
+ InThaiBlock
+ InTibetanBlock
+ InUnifiedCanadianAboriginalSyllabics
+ InYiRadicals
+ InYiSyllables
=over 4
[ 1] \x{...}
[ 2] \N{...}
- [ 3] . \p{Is...} \P{Is...}
+ [ 3] . \p{...} \P{...}
[ 4] now scripts (see UTR#24 Script Names) in addition to blocks
[ 5] have negation
[ 6] can use look-ahead to emulate subtraction (*)
in Perl can be written as:
- (?!\p{UNASSIGNED})\p{GreekBlock}
- (?=\p{ASSIGNED})\p{GreekBlock}
+ (?!\p{Unassigned})\p{InGreek}
+ (?=\p{Assigned})\p{InGreek}
But in this particular example, you probably really want