=head2 Unicode version
-Perl is shipped with the latest Unicode version, 5.2, October 2009. See
+Perl is shipped with the latest Unicode version, 5.2, dated October 2009. See
L<http://www.unicode.org/versions/Unicode5.2.0> for details about this release
of Unicode. See L<perlunicode> for instructions on installing and using
older versions of Unicode.
C<qr/\X/>, which matches a Unicode logical character, has been expanded to work
better with various Asian languages. It now is defined as an C<extended
-grapheme cluster>. (See L<http://www.unicode.org/reports/tr29/>). One change
-due to this is that C<\X> will match the whole sequence C<S<CR LF>>. Another
-change is that C<\X> will match an isolated mark. Marks generally come after a
-base character, but it is possible in Unicode to have them in isolation, and
-C<\X> will now handle that case. Otherwise, this change should be transparent
-for non-affected languages.
+grapheme cluster>. (See L<http://www.unicode.org/reports/tr29/>).
+Anything matched previously will continue to be matched. But in addition:
+
+=over
+
+=item *
+
+C<\X> will now not break apart a C<S<CR LF>> sequence.
+
+=item *
+
+C<\X> will now match a sequence including the C<ZWJ> and C<ZWNJ> characters.
+
+=item *
+
+C<\X> will now always match at least one character, including an initial mark.
+Marks generally come after a base character, but it is possible in Unicode to
+have them in isolation, and C<\X> will now handle that case, for example at the
+beginning of a line or after a C<ZWSP>.
+
+=item *
+
+C<\X> will now match a (Korean) Hangul syllable sequence, and the Thai and Lao
+exception cases.
+
+=back
+
+Otherwise, this change should be transparent for the non-affected languages.
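
The cases listed above can be checked with a short script. This is a minimal sketch, assuming a Perl new enough to use the extended grapheme cluster definition of C<\X>; the helper name C<clusters> is made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into the pieces matched by \X.
sub clusters {
    my ($str) = @_;
    my @found = $str =~ /(\X)/g;
    return @found;
}

# CR LF is kept together as a single cluster.
my @crlf = clusters("a\r\nb");
printf "CR LF string:   %d clusters\n", scalar @crlf;

# An isolated mark (U+0301 COMBINING ACUTE ACCENT) at the start of a
# string is matched as a cluster of its own.
my @mark = clusters("\x{0301}a");
printf "leading mark:   %d clusters\n", scalar @mark;

# A Hangul syllable sequence (leading consonant, vowel, trailing
# consonant jamo) is matched as one cluster.
my @hangul = clusters("\x{1100}\x{1161}\x{11A8}");
printf "Hangul L+V+T:   %d clusters\n", scalar @hangul;
```

Only cluster counts are printed, so the script does not depend on the output encoding of the terminal.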
C<\p{...}> matches using the Canonical_Combining_Class property were
completely broken in previous Perls. This is now fixed.
-In previous Perls, the Unicode Decomposition_Type=Compat property and a
+In previous Perls, the Unicode C<Decomposition_Type=Compat> property and a
Perl extension had the same name, which led to neither matching all the
correct values (with more than 100 mistakes in one, and several thousand
in the other). The Perl extension has now been renamed to be
-Decomposition_Type=Noncanonical (short: dt=noncanon). It has the same
+C<Decomposition_Type=Noncanonical> (short: C<dt=noncanon>). It has the same
meaning as was previously intended, namely the union of all the
-non-canonical Decomposition types, with Unicode Compat being just one of
+non-canonical Decomposition types, with Unicode C<Compat> being just one of
those.
C<\p{Uppercase}> and C<\p{Lowercase}> have been brought into line with the
C<\p{Alpha}> now matches the same characters as C<\p{Alphabetic}>. The Perl
definition included a number of things that aren't really alpha (all
-marks), while omitting many that were. The Unicode definition is
-clearly better, so we are switching to it. As a direct consequence, the
+marks), while omitting many that were. As a direct consequence, the
definitions of C<\p{Alnum}> and C<\p{Word}> which depend on Alpha also change.
C<\p{Word}> also now doesn't match certain characters it wasn't supposed
to, such as fractions.
-C<\p{Print}> no longer matches the line control characters: tab, lf, cr,
-ff, vt, and nel. This brings it in line with the documentation.
+C<\p{Print}> no longer matches the line control characters: Tab, LF, CR,
+FF, VT, and NEL. This brings it in line with the documentation.
-C<\p{Decomposition_Type=Canonical}> now includes the Hangul syllables
+C<\p{Decomposition_Type=Canonical}> now includes the Hangul syllables.
The Numeric type property has been extended to include the Unihan
characters.
-There is a new Perl extension, the 'Present_In', or simply 'In'
+There is a new Perl extension, the 'Present_In', or simply 'In',
property. This is an extension of the Unicode Age property, but
-C<\p{In=5.0}> matches any code point whose usage has been determined as of
-Unicode version 5.0. The C<\p{Age=5.0}> only matches code points added in 5.0.
+C<\p{In=5.0}> matches any code point whose usage has been determined
+I<as of> Unicode version 5.0. The C<\p{Age=5.0}> only matches code points
+added in I<precisely> version 5.0.
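
The difference between the two properties can be seen by testing one long-established character and one that first appeared in version 5.0. This is a sketch only; the two sample code points are choices made for the illustration, on the assumption that U+07C0 NKO DIGIT ZERO was first assigned in Unicode 5.0.

```perl
use strict;
use warnings;

my $old = "A";          # U+0041, assigned long before Unicode 5.0
my $new = "\x{07C0}";   # U+07C0 NKO DIGIT ZERO, assumed new in 5.0

for my $ch ($old, $new) {
    my $in  = ($ch =~ /\p{In=5.0}/)  ? "yes" : "no";
    my $age = ($ch =~ /\p{Age=5.0}/) ? "yes" : "no";
    printf "U+%04X  In=5.0: %-3s  Age=5.0: %s\n", ord($ch), $in, $age;
}
```

Both characters are present in 5.0, so both should match C<\p{In=5.0}>, while only the second should match C<\p{Age=5.0}>.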
A number of properties did not have the correct values for unassigned
code points. This is now fixed. The affected properties are
Hangul_Syllable_Type, Numeric_Type, and Line_Break.
The Default_Ignorable_Code_Point, ID_Continue, and ID_Start properties
-have been updated to their current definitions.
+have been updated to their current Unicode definitions.
Certain properties that are supposed to be Unicode internal-only were
erroneously exposed by previous Perls. Use of these in regular
-expressions will now generate a deprecated warning message, if those
-warnings are enabled. The properties are: Other_Alphabetic,
-Other_Default_Ignorable_Code_Point, Other_Grapheme_Extend,
-Other_ID_Continue, Other_ID_Start, Other_Lowercase, Other_Math, and
-Other_Uppercase.
+expressions will now generate a deprecation warning message, if those
+warnings are enabled. The properties are: Other_Alphabetic,
+Other_Default_Ignorable_Code_Point, Other_Grapheme_Extend,
+Other_ID_Continue, Other_ID_Start, Other_Lowercase, Other_Math, and
+Other_Uppercase.
An installation can now fairly easily change which Unicode properties
Perl understands. As mentioned above, certain properties are by default
XXX what does "files in the To directory" mean? -- dagolden, 2009-12-20
-The files in the To directory are now more clearly marked as being
-stable, directly usable by applications. New hash entries in them give
-the format of the normal entries which allows for easier machine
-parsing. Perl can generate files in this directory for any property,
-though most are suppressed. An installation can choose to change which
-get written. Instructions are in L<perluniprops>.
+The files in the C<lib/unicore/To> directory are now more clearly marked as
+being stable, directly usable by applications. New hash entries in them give
+the format of the normal entries, which allows for easier machine parsing.
+Perl can generate files in this directory for any property, though most are
+suppressed. An installation can choose to change which get written.
+Instructions are in L<perluniprops>.
=head1 Modules and Pragmata
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
-codepoints in Unicode happens to agree with Latin-1.
+codepoints in Unicode happen to agree with Latin-1.
See L</"Byte and Character Semantics"> for more details.
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.
-Under byte semantics, when C<use locale> is in effect, Perl uses the
+Under byte semantics, when C<use locale> is in effect, Perl uses the
semantics associated with the current locale. Absent a C<use locale>, Perl
currently uses US-ASCII (or Basic Latin in Unicode terminology) byte semantics,
meaning that characters whose ordinal numbers are in the range 128 - 255 are
be used to force byte semantics on Unicode data.
If strings operating under byte semantics and strings with Unicode
-character data are concatenated, the new string will have
+character data are concatenated, the new string will have
character semantics. This can cause surprises: See L</BUGS>, below
Under character semantics, many operations that formerly operated on
=item *
-The special pattern C<\X> matches any extended Unicode
-sequence--"a combining character sequence" in Standardese--where the
-first character is a base character and subsequent characters are mark
-characters that apply to the base character. C<\X> is equivalent to
-C<< (?>\PM\pM*) >>.
+The special pattern C<\X> matches a logical character, an C<extended grapheme
+cluster> in Standardese. In Unicode what appears to the user to be a single
+character, for example an accented C<G>, may in fact be composed of a sequence
+of characters, in this case a C<G> followed by an accent character. C<\X>
+will match the entire sequence.
=item *
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. An operator that
-specifically does not switch is C<vec()>. Operators that really don't
-care include operators that treat strings as a bucket of bits such as
+specifically does not switch is C<vec()>. Operators that really don't
+care include operators that treat strings as a bucket of bits such as
C<sort()>, and operators dealing with filenames.
=item *
-The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
+The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats. Again, think C<char> in the C language.
There is a new C<U> specifier that converts between Unicode characters
=head2 Unicode Character Properties
-Named Unicode properties, scripts, and block ranges may be used like
-character classes via the C<\p{}> "matches property" construct and
-the C<\P{}> negation, "doesn't match property".
-
-For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
-(Letter, uppercase) property, while C<\p{M}> matches any character
-with an "M" (mark--accents and such) property. Brackets are not
-required for single letter properties, so C<\p{M}> is equivalent to
-C<\pM>. Many predefined properties are available, such as
-C<\p{Mirrored}> and C<\p{Tibetan}>.
-
-The official Unicode script and block names have spaces and dashes as
-separators, but for convenience you can use dashes, spaces, or
-underbars, and case is unimportant. It is recommended, however, that
-for consistency you use the following naming: the official Unicode
-script, property, or block name (see below for the additional rules
-that apply to block names) with whitespace and dashes removed, and the
-words "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus
-becomes C<Latin1Supplement>.
+Most Unicode character properties are accessible by using regular expressions.
+They are used like character classes via the C<\p{}> "matches property"
+construct and the C<\P{}> negation, "doesn't match property".
+
+For instance, C<\p{Uppercase}> matches any character with the Unicode
+"Uppercase" property, while C<\p{L}> matches any character with a
+General_Category of "L" (letter). Braces are not
+required for single letter properties, so C<\p{L}> is equivalent to C<\pL>.
+
+More formally, C<\p{Uppercase}> matches any character whose Uppercase property
+value is True, and C<\P{Uppercase}> matches any character whose Uppercase
+property value is False, and they could have been written as
+C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.
+
+This formality is needed when properties are not binary, that is if they can
+take on more values than just True and False. For example, the Bidi_Class (see
+L</"Bidirectional Character Types"> below), can take on a number of different
+values, such as Left, Right, Whitespace, and others. To match these, one needs
+to specify the property name (Bidi_Class), and the value being matched with
+(Left, Right, etc.). This is done, as in the examples above, by having the two
+components separated by an equal sign (or interchangeably, a colon), like
+C<\p{Bidi_Class: Left}>.
+
+All Unicode-defined character properties may be written in these compound forms
+of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some
+additional properties that are written only in the single form, as well as
+single-form short-cuts for all binary properties and certain others described
+below, in which you may omit the property name and the equals or colon
+separator.
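
As a rough illustration of the single and compound forms, the following sketch compares them on a few characters. The Bidi_Class values are written with their short names C<L> and C<R>; the sample characters are arbitrary choices for the example.

```perl
use strict;
use warnings;

my $alef = "\x{05D0}";    # U+05D0 HEBREW LETTER ALEF

# Single form and the two equivalent compound forms of a binary property.
my $upper_single   = qr/^\p{Uppercase}$/;
my $upper_compound = qr/^\p{Uppercase=True}$/;
my $upper_false    = qr/^\p{Uppercase=False}$/;

# A non-binary property must name both the property and the value.
my $bidi_r = qr/^\p{Bidi_Class: R}$/;
my $bidi_l = qr/^\p{Bidi_Class=L}$/;

for my $ch ("A", "a", $alef) {
    printf "U+%04X  Upper:%d  Upper=True:%d  Upper=False:%d  bc=R:%d  bc=L:%d\n",
        ord($ch),
        ($ch =~ $upper_single   ? 1 : 0),
        ($ch =~ $upper_compound ? 1 : 0),
        ($ch =~ $upper_false    ? 1 : 0),
        ($ch =~ $bidi_r         ? 1 : 0),
        ($ch =~ $bidi_l         ? 1 : 0);
}
```

The ternaries force each match into a plain 0 or 1, which avoids the list-context behaviour of the match operator inside the C<printf> argument list.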
+
+Most Unicode character properties have at least two synonyms (or aliases if you
+prefer): a short one that is easier to type, and a longer one that is more
+descriptive and hence easier to understand. Thus the "L"
+and "Letter" above are equivalent and can be used interchangeably. Likewise,
+"Upper" is a synonym for "Uppercase", and we could have written
+C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically
+various synonyms for the values the property can be. For binary properties,
+"True" has 3 synonyms: "T", "Yes", and "Y"; and "False" has correspondingly "F",
+"No", and "N". But be careful. A short form of a value for one property may
+not mean the same thing as the same name for another. Thus, for the
+General_Category property, "L" means "Letter", but for the Bidi_Class property,
+"L" means "Left". A complete list of properties and synonyms is in
+L<perluniprops>.
+
+Upper/lower case differences in the property names and values are irrelevant,
+thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
+Similarly, you can add or subtract underscores anywhere in the middle of a
+word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
+is irrelevant adjacent to non-word characters, such as the braces and the equals
+or colon separators so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
+equivalent to these as well. In fact, in most cases, white space and even
+hyphens can be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
+equivalent. All this is called "loose-matching" by Unicode. The few places
+where stricter matching is employed are in the middle of numbers, and in the Perl
+extension properties that begin or end with an underscore. Stricter matching
+cares about white space (except adjacent to the non-word characters) and
+hyphens, and non-interior underscores.
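
The loose-matching rules can be tried out directly. The following sketch compiles several of the spellings mentioned above and counts how many of them match an uppercase and a lowercase letter; it assumes that the installed Perl accepts every one of these spellings, as the text describes.

```perl
use strict;
use warnings;

# Different spellings of the same binary property.
my @forms = (
    qr/\p{Upper}/,
    qr/\p{upper}/,
    qr/\p{UpPeR}/,
    qr/\p{U_p_p_e_r}/,
    qr/\p{ Upper }/,
    qr/\p{ Upper_case : Y }/,
    qr/\p{Uppercase=Yes}/,
);

# grep in scalar context returns the number of patterns that matched.
my $hits_upper = grep { "Q" =~ $_ } @forms;
my $hits_lower = grep { "q" =~ $_ } @forms;

printf "%d of %d forms match 'Q'; %d match 'q'\n",
    $hits_upper, scalar @forms, $hits_lower;
```

If any spelling were not recognized, the script would fail to compile the corresponding pattern, which makes it a quick way to check a particular Perl version.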
You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.
-B<NOTE: the properties, scripts, and blocks listed here are as of
-Unicode 5.0.0 in July 2006.>
+=head3 B<General_Category>
-=over 4
+Every Unicode character is assigned a general category, which is the "most
+usual categorization of a character" (from
+L<http://www.unicode.org/reports/tr44>).
-=item General Category
+The compound way of writing these is like C<\p{General_Category=Number}>
+(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
+through the equal or colon separator is omitted. So you can instead just write
+C<\pN>.
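
A small sketch of the equivalent spellings, using three arbitrary sample characters (a decimal digit, a Roman numeral, and a letter):

```perl
use strict;
use warnings;

# Compound form, short compound form, and the two shortcut forms.
my @forms = (
    qr/\p{General_Category=Number}/,
    qr/\p{gc:n}/,
    qr/\p{N}/,
    qr/\pN/,
);

my %hits;
for my $ch ("7", "\x{2167}", "x") {    # U+2167 is ROMAN NUMERAL EIGHT (Nl)
    # Count how many of the spellings match this character.
    $hits{$ch} = grep { $ch =~ $_ } @forms;
    printf "U+%04X matched by %d of %d forms\n",
        ord($ch), $hits{$ch}, scalar @forms;
}
```

Each character should be matched either by all four forms or by none of them, since the forms are meant to be interchangeable.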
-Here are the basic Unicode General Category properties, followed by their
-long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
-for instance, are identical.
+Here are the short and long forms of the General Category properties:
Short Long
L Letter
- LC CasedLetter
- Lu UppercaseLetter
- Ll LowercaseLetter
- Lt TitlecaseLetter
- Lm ModifierLetter
- Lo OtherLetter
+ LC, L& Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
+ Lu Uppercase_Letter
+ Ll Lowercase_Letter
+ Lt Titlecase_Letter
+ Lm Modifier_Letter
+ Lo Other_Letter
M Mark
- Mn NonspacingMark
- Mc SpacingMark
- Me EnclosingMark
+ Mn Nonspacing_Mark
+ Mc Spacing_Mark
+ Me Enclosing_Mark
N Number
- Nd DecimalNumber
- Nl LetterNumber
- No OtherNumber
-
- P Punctuation
- Pc ConnectorPunctuation
- Pd DashPunctuation
- Ps OpenPunctuation
- Pe ClosePunctuation
- Pi InitialPunctuation
+ Nd Decimal_Number (also Digit)
+ Nl Letter_Number
+ No Other_Number
+
+ P Punctuation (also Punct)
+ Pc Connector_Punctuation
+ Pd Dash_Punctuation
+ Ps Open_Punctuation
+ Pe Close_Punctuation
+ Pi Initial_Punctuation
(may behave like Ps or Pe depending on usage)
- Pf FinalPunctuation
+ Pf Final_Punctuation
(may behave like Ps or Pe depending on usage)
- Po OtherPunctuation
+ Po Other_Punctuation
S Symbol
- Sm MathSymbol
- Sc CurrencySymbol
- Sk ModifierSymbol
- So OtherSymbol
+ Sm Math_Symbol
+ Sc Currency_Symbol
+ Sk Modifier_Symbol
+ So Other_Symbol
Z Separator
- Zs SpaceSeparator
- Zl LineSeparator
- Zp ParagraphSeparator
+ Zs Space_Separator
+ Zl Line_Separator
+ Zp Paragraph_Separator
C Other
- Cc Control
+ Cc Control (also Cntrl)
Cf Format
Cs Surrogate (not usable)
- Co PrivateUse
+ Co Private_Use
Cn Unassigned
Single-letter properties match all characters in any of the
the somewhat messy concept of surrogates. C<Cs> is therefore not
supported.
-=item Bidirectional Character Types
+=head3 B<Bidirectional Character Types>
Because scripts differ in their directionality--Hebrew is
written right to left, for example--Unicode supplies these properties in
-the BidiClass class:
+the Bidi_Class class:
Property Meaning
LRE Left-to-Right Embedding
LRO Left-to-Right Override
R Right-to-Left
- AL Right-to-Left Arabic
+ AL Arabic Letter
RLE Right-to-Left Embedding
RLO Right-to-Left Override
PDF Pop Directional Format
EN European Number
- ES European Number Separator
- ET European Number Terminator
+ ES European Separator
+ ET European Terminator
AN Arabic Number
- CS Common Number Separator
+ CS Common Separator
NSM Non-Spacing Mark
BN Boundary Neutral
B Paragraph Separator
WS Whitespace
ON Other Neutrals
-For example, C<\p{BidiClass:R}> matches characters that are normally
+This property is always written in the compound form.
+For example, C<\p{Bidi_Class:R}> matches characters that are normally
written right to left.
-=item Scripts
-
-The script names which can be used by C<\p{...}> and C<\P{...}>,
-such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:
-
- Arabic
- Armenian
- Balinese
- Bengali
- Bopomofo
- Braille
- Buginese
- Buhid
- CanadianAboriginal
- Cherokee
- Coptic
- Cuneiform
- Cypriot
- Cyrillic
- Deseret
- Devanagari
- Ethiopic
- Georgian
- Glagolitic
- Gothic
- Greek
- Gujarati
- Gurmukhi
- Han
- Hangul
- Hanunoo
- Hebrew
- Hiragana
- Inherited
- Kannada
- Katakana
- Kharoshthi
- Khmer
- Lao
- Latin
- Limbu
- LinearB
- Malayalam
- Mongolian
- Myanmar
- NewTaiLue
- Nko
- Ogham
- OldItalic
- OldPersian
- Oriya
- Osmanya
- PhagsPa
- Phoenician
- Runic
- Shavian
- Sinhala
- SylotiNagri
- Syriac
- Tagalog
- Tagbanwa
- TaiLe
- Tamil
- Telugu
- Thaana
- Thai
- Tibetan
- Tifinagh
- Ugaritic
- Yi
-
-=item Extended property classes
-
-Extended property classes can supplement the basic
-properties, defined by the F<PropList> Unicode database:
-
- ASCIIHexDigit
- BidiControl
- Dash
- Deprecated
- Diacritic
- Extender
- HexDigit
- Hyphen
- Ideographic
- IDSBinaryOperator
- IDSTrinaryOperator
- JoinControl
- LogicalOrderException
- NoncharacterCodePoint
- OtherAlphabetic
- OtherDefaultIgnorableCodePoint
- OtherGraphemeExtend
- OtherIDStart
- OtherIDContinue
- OtherLowercase
- OtherMath
- OtherUppercase
- PatternSyntax
- PatternWhiteSpace
- QuotationMark
- Radical
- SoftDotted
- STerm
- TerminalPunctuation
- UnifiedIdeograph
- VariationSelector
- WhiteSpace
-
-and there are further derived properties:
-
- Alphabetic = Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic
- Lowercase = Ll + OtherLowercase
- Uppercase = Lu + OtherUppercase
- Math = Sm + OtherMath
-
- IDStart = Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart
- IDContinue = IDStart + Mn + Mc + Nd + Pc + OtherIDContinue
-
- DefaultIgnorableCodePoint
- = OtherDefaultIgnorableCodePoint
- + Cf + Cc + Cs + Noncharacters + VariationSelector
- - WhiteSpace - FFF9..FFFB (Annotation Characters)
-
- Any = Any code points (i.e. U+0000 to U+10FFFF)
- Assigned = Any non-Cn code points (i.e. synonym for \P{Cn})
- Unassigned = Synonym for \p{Cn}
- ASCII = ASCII (i.e. U+0000 to U+007F)
-
- Common = Any character (or unassigned code point)
- not explicitly assigned to a script
-
-=item Use of "Is" Prefix
+=head3 B<Scripts>
+
+The world's languages are written in a number of scripts. This sentence is
+written in Latin, while Russian is written in Cyrillic, and Greek is written in,
+well, Greek; Japanese mainly in Hiragana or Katakana. There are many more.
+
+The Unicode Script property gives what script a given character is in,
+and can be matched with the compound form like C<\p{Script=Hebrew}> (short:
+C<\p{sc=hebr}>). Perl furnishes shortcuts for all script names. You can omit
+everything up through the equals (or colon), and simply write C<\p{Latin}> or
+C<\P{Cyrillic}>.
+
+A complete list of scripts and their shortcuts is in L<perluniprops>.
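
The script forms above can be compared with a short script. This is only a sketch; the two sample letters are arbitrary.

```perl
use strict;
use warnings;

my $alef = "\x{05D0}";    # U+05D0 HEBREW LETTER ALEF
my $zhe  = "\x{0416}";    # U+0416 CYRILLIC CAPITAL LETTER ZHE

# Compound form, short compound form, and the Perl shortcut.
my @hebrew_forms = (
    qr/\p{Script=Hebrew}/,
    qr/\p{sc=hebr}/,
    qr/\p{Hebrew}/,
);

my $alef_hits = grep { $alef =~ $_ } @hebrew_forms;
my $zhe_hits  = grep { $zhe  =~ $_ } @hebrew_forms;
printf "alef matched by %d forms, zhe by %d forms\n", $alef_hits, $zhe_hits;

# \P{...} is the negation: alef is not Cyrillic, zhe is.
my $alef_not_cyrillic = ($alef =~ /\P{Cyrillic}/) ? 1 : 0;
my $zhe_not_cyrillic  = ($zhe  =~ /\P{Cyrillic}/) ? 1 : 0;
printf "not Cyrillic: alef=%d zhe=%d\n", $alef_not_cyrillic, $zhe_not_cyrillic;
```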
+
+=head3 B<Extended property classes>
+
+There are many more property classes than the basic ones described here,
+including some Perl extensions.
+A complete list is in L<perluniprops>.
+The extensions are more fully described in L<perlrecharclass>.
+
+=head3 B<Use of "Is" Prefix>
For backward compatibility (with Perl 5.6), all properties mentioned
-so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
-example, is equal to C<\P{Lu}>.
+so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
+example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
+C<\p{Arabic}>.
-=item Blocks
+=head3 B<Blocks>
In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
-of blocks is more of an artificial grouping based on groups of 256
-Unicode characters. For example, the C<Latin> script contains letters
-from many blocks but does not contain all the characters from those
-blocks. It does not, for example, contain digits, because digits are
-shared across many scripts. Digits and similar groups, like
-punctuation, are in a category called C<Common>.
-
-For more about scripts, see the UAX#24 "Script Names":
-
- http://www.unicode.org/reports/tr24/
-
-For more about blocks, see:
-
- http://www.unicode.org/Public/UNIDATA/Blocks.txt
-
-Block names are given with the C<In> prefix. For example, the
-Katakana block is referenced via C<\p{InKatakana}>. The C<In>
-prefix may be omitted if there is no naming conflict with a script
-or any other property, but it is recommended that C<In> always be used
-for block tests to avoid confusion.
-
-These block names are supported:
-
- InAegeanNumbers
- InAlphabeticPresentationForms
- InAncientGreekMusicalNotation
- InAncientGreekNumbers
- InArabic
- InArabicPresentationFormsA
- InArabicPresentationFormsB
- InArabicSupplement
- InArmenian
- InArrows
- InBalinese
- InBasicLatin
- InBengali
- InBlockElements
- InBopomofo
- InBopomofoExtended
- InBoxDrawing
- InBraillePatterns
- InBuginese
- InBuhid
- InByzantineMusicalSymbols
- InCJKCompatibility
- InCJKCompatibilityForms
- InCJKCompatibilityIdeographs
- InCJKCompatibilityIdeographsSupplement
- InCJKRadicalsSupplement
- InCJKStrokes
- InCJKSymbolsAndPunctuation
- InCJKUnifiedIdeographs
- InCJKUnifiedIdeographsExtensionA
- InCJKUnifiedIdeographsExtensionB
- InCherokee
- InCombiningDiacriticalMarks
- InCombiningDiacriticalMarksSupplement
- InCombiningDiacriticalMarksforSymbols
- InCombiningHalfMarks
- InControlPictures
- InCoptic
- InCountingRodNumerals
- InCuneiform
- InCuneiformNumbersAndPunctuation
- InCurrencySymbols
- InCypriotSyllabary
- InCyrillic
- InCyrillicSupplement
- InDeseret
- InDevanagari
- InDingbats
- InEnclosedAlphanumerics
- InEnclosedCJKLettersAndMonths
- InEthiopic
- InEthiopicExtended
- InEthiopicSupplement
- InGeneralPunctuation
- InGeometricShapes
- InGeorgian
- InGeorgianSupplement
- InGlagolitic
- InGothic
- InGreekExtended
- InGreekAndCoptic
- InGujarati
- InGurmukhi
- InHalfwidthAndFullwidthForms
- InHangulCompatibilityJamo
- InHangulJamo
- InHangulSyllables
- InHanunoo
- InHebrew
- InHighPrivateUseSurrogates
- InHighSurrogates
- InHiragana
- InIPAExtensions
- InIdeographicDescriptionCharacters
- InKanbun
- InKangxiRadicals
- InKannada
- InKatakana
- InKatakanaPhoneticExtensions
- InKharoshthi
- InKhmer
- InKhmerSymbols
- InLao
- InLatin1Supplement
- InLatinExtendedA
- InLatinExtendedAdditional
- InLatinExtendedB
- InLatinExtendedC
- InLatinExtendedD
- InLetterlikeSymbols
- InLimbu
- InLinearBIdeograms
- InLinearBSyllabary
- InLowSurrogates
- InMalayalam
- InMathematicalAlphanumericSymbols
- InMathematicalOperators
- InMiscellaneousMathematicalSymbolsA
- InMiscellaneousMathematicalSymbolsB
- InMiscellaneousSymbols
- InMiscellaneousSymbolsAndArrows
- InMiscellaneousTechnical
- InModifierToneLetters
- InMongolian
- InMusicalSymbols
- InMyanmar
- InNKo
- InNewTaiLue
- InNumberForms
- InOgham
- InOldItalic
- InOldPersian
- InOpticalCharacterRecognition
- InOriya
- InOsmanya
- InPhagspa
- InPhoenician
- InPhoneticExtensions
- InPhoneticExtensionsSupplement
- InPrivateUseArea
- InRunic
- InShavian
- InSinhala
- InSmallFormVariants
- InSpacingModifierLetters
- InSpecials
- InSuperscriptsAndSubscripts
- InSupplementalArrowsA
- InSupplementalArrowsB
- InSupplementalMathematicalOperators
- InSupplementalPunctuation
- InSupplementaryPrivateUseAreaA
- InSupplementaryPrivateUseAreaB
- InSylotiNagri
- InSyriac
- InTagalog
- InTagbanwa
- InTags
- InTaiLe
- InTaiXuanJingSymbols
- InTamil
- InTelugu
- InThaana
- InThai
- InTibetan
- InTifinagh
- InUgaritic
- InUnifiedCanadianAboriginalSyllabics
- InVariationSelectors
- InVariationSelectorsSupplement
- InVerticalForms
- InYiRadicals
- InYiSyllables
- InYijingHexagramSymbols
+of blocks is more of an artificial grouping based on groups of Unicode
+characters with consecutive ordinal values. For example, the C<Basic Latin>
+block is all characters whose ordinals are between 0 and 127, inclusive, in
+other words, the ASCII characters. The C<Latin> script contains some letters
+from this block as well as several more, like C<Latin-1 Supplement>,
+C<Latin Extended-A>, I<etc.>, but it does not contain all the characters from
+those blocks. It does not, for example, contain digits, because digits are
+shared across many scripts. Digits and similar groups, like punctuation, are in
+the script called C<Common>. There is also a script called C<Inherited> for
+characters that modify other characters, and inherit the script value of the
+controlling character.
+
+For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
+L<http://www.unicode.org/reports/tr24>
+
+The Script property is likely to be the one you want to use when processing
+natural language; the Block property may be useful in working with the nuts and
+bolts of Unicode.
+
+Block names are matched in the compound form, like C<\p{Block: Arrows}> or
+C<\p{Blk=Hebrew}>. Unlike most other properties only a few block names have a
+Unicode-defined short name. But Perl does provide a (slight) shortcut: You
+can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
+compatibility, the C<In> prefix may be omitted if there is no naming conflict
+with a script or any other property, and you can even use an C<Is> prefix
+instead in those cases. But it is not a good idea to do this, for a couple
+of reasons:
+
+=over 4
+
+=item 1
+
+It is confusing. There are many naming conflicts, and you may forget some.
+For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
+Hebrew. But would you remember that 6 months from now?
+
+=item 2
+
+It is unstable. A new version of Unicode may pre-empt the current meaning by
+creating a property with the same name. There was a time in very early Unicode
+releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
+doesn't.
=back
+Some people just prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
+instead of the shortcuts, for clarity, and because they can't remember the
+difference between 'In' and 'Is' anyway (or aren't confident that those who
+eventually will read their code will know).
+
+A complete list of blocks and their shortcuts is in L<perluniprops>.
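
The distinction between blocks and scripts can be illustrated with a few characters. This is a sketch under the assumption that the block names are written with underscores in place of spaces; the sample characters are arbitrary.

```perl
use strict;
use warnings;

my $digit = "1";          # in the Basic Latin block, but script Common
my $e_acu = "\x{E9}";     # U+00E9, Latin script, Latin-1 Supplement block
my $arrow = "\x{2190}";   # U+2190 LEFTWARDS ARROW

my %result = (
    digit_in_basic_latin_block => ($digit =~ /\p{Block=Basic_Latin}/ ? 1 : 0),
    digit_in_latin_script      => ($digit =~ /\p{Script=Latin}/      ? 1 : 0),
    digit_in_common_script     => ($digit =~ /\p{Script=Common}/     ? 1 : 0),
    e_acute_in_latin_script    => ($e_acu =~ /\p{Script=Latin}/      ? 1 : 0),
    e_acute_in_basic_latin     => ($e_acu =~ /\p{Block=Basic_Latin}/ ? 1 : 0),
    arrow_compound_form        => ($arrow =~ /\p{Block: Arrows}/     ? 1 : 0),
    arrow_in_shortcut          => ($arrow =~ /\p{In_Arrows}/         ? 1 : 0),
);

printf "%-28s %d\n", $_, $result{$_} for sort keys %result;
```

The digit shows the point made above: it sits in the C<Basic Latin> block, yet its script is C<Common>, not C<Latin>.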
+
=head2 User-Defined Character Properties
-You can define your own character properties by defining subroutines
-whose names begin with "In" or "Is". The subroutines can be defined in
-any package. The user-defined properties can be used in the regular
-expression C<\p> and C<\P> constructs; if you are using a user-defined
-property from a package other than the one you are in, you must specify
-its package in the C<\p> or C<\P> construct.
+You can define your own binary character properties by defining subroutines
+whose names begin with "In" or "Is". The subroutines can be defined in any
+package. The user-defined properties can be used in the regular expression
+C<\p> and C<\P> constructs; if you are using a user-defined property from a
+package other than the one you are in, you must specify its package in the
+C<\p> or C<\P> construct.
- # assuming property IsForeign defined in Lang::
+ # assuming property IsForeign defined in Lang::
package main; # property package name required
if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
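
A complete, runnable version of this idea might look like the following sketch. The definition of C<IsForeign> is purely hypothetical: here it treats the code points U+0400 through U+04FF (the Cyrillic block) as "foreign". The subroutine returns a string with one range per line, given as two hexadecimal numbers separated by a tab.

```perl
use strict;
use warnings;

package Lang;

# Hypothetical user-defined property: code points U+0400..U+04FF.
# Each line of the returned string is "start<TAB>end" in hexadecimal.
sub IsForeign {
    return "0400\t04FF\n";
}

package main;    # the property's package name is required from here

my $txt = "abc \x{0416}\x{0434} xyz";
my $run_length = 0;
if ($txt =~ /(\p{Lang::IsForeign}+)/) {
    # $1 holds the first run of characters having the property.
    $run_length = length $1;
}
printf "first foreign run has %d characters\n", $run_length;
```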
You can also define your own mappings to be used in the lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
The principle is similar to that of user-defined character
-properties: to define subroutines in the C<main> package
+properties: to define subroutines
with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
the first character in ucfirst()), and C<ToUpper> (for uc(), and the
rest of the characters in ucfirst()).
-The string returned by the subroutines needs now to be three
-hexadecimal numbers separated by tabulators: start of the source
-range, end of the source range, and start of the destination range.
-For example:
+The string returned by the subroutines needs to be two hexadecimal numbers
+separated by two tabulators: the source code point and the destination code
+point. For example:
sub ToUpper {
return <<END;
- 0061\t0063\t0041
+ 0061\t\t0041
END
}
-defines an uc() mapping that causes only the characters "a", "b", and
-"c" to be mapped to "A", "B", "C", all other characters will remain
-unchanged.
-
-If there is no source range to speak of, that is, the mapping is from
-a single character to another single character, leave the end of the
-source range empty, but the two tabulator characters are still needed.
-For example:
-
- sub ToLower {
- return <<END;
- 0041\t\t0061
- END
- }
+defines a uc() mapping that causes only the character "a"
+to be mapped to "A"; all other characters will remain unchanged.
-defines a lc() mapping that causes only "A" to be mapped to "a", all
-other characters will remain unchanged.
+(For serious hackers only) The above means you have to furnish a complete
+mapping; you can't just override a couple of characters and leave the rest
+unchanged. You can find all the mappings in the directory
+C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as the
+here-document, and the C<utf8::ToSpecFoo> are special exception mappings
+derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>. The C<Digit> and
+C<Fold> mappings that one can see in the directory are not directly
+user-accessible; one can use either the C<Unicode::UCD> module, or just match
+case-insensitively (that's when the C<Fold> mapping is used).
-(For serious hackers only) If you want to introspect the default
-mappings, you can find the data in the directory
-C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as
-the here-document, and the C<utf8::ToSpecFoo> are special exception
-mappings derived from <$Config{privlib}>/F<unicore/SpecialCasing.txt>.
-The C<Digit> and C<Fold> mappings that one can see in the directory
-are not directly user-accessible, one can use either the
-C<Unicode::UCD> module, or just match case-insensitively (that's when
-the C<Fold> mapping is used).
+The mappings will only take effect on scalars that have been marked as having
+Unicode characters, for example by using C<utf8::upgrade()>.
+Old byte-style strings are not affected.
-A final note on the user-defined case mappings: they will be used
-only if the scalar has been marked as having Unicode characters.
-Old byte-style strings will not be affected.
+The mappings are in effect for the package they are defined in.
=head2 Character Encodings for Input and Output
=item *
-chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
+chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
rename, rmdir, stat, symlink, truncate, unlink, utime, -X
=item *
=back
+=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)
+
+Perl by default comes with the latest supported Unicode version built in, but
+you can change to use any earlier one.
+
+Download the files in the version of Unicode that you want from the Unicode web
+site (L<http://www.unicode.org>). These should replace the existing files in
+C<\$Config{privlib}>/F<unicore>. (C<\%Config> is available from the Config
+module.) Follow the instructions in F<README.perl> in that directory to change
+some of their names, and then run F<make>.
+
+It is even possible to download them to a different directory, and then change
+F<utf8_heavy.pl> in the directory C<\$Config{privlib}> to point to the new
+directory, or maybe make a copy of that directory before making the change, and
+using C<@INC> or the C<-I> run-time flag to switch between versions at will,
+but all this is beyond the scope of these instructions.
+
=head1 SEE ALSO
-L<perlunitut>, L<perluniintro>, L<Encode>, L<open>, L<utf8>, L<bytes>,
+L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">
+L<http://www.unicode.org/reports/tr44>
=cut