perl's encoding on output by use of the ":encoding(...)" layer.
See L<open>.
+In some filesystems (for example Microsoft NTFS and Apple HFS+) the
+filenames are in UTF-8.  By using opendir() and File::Glob you can
+make readdir() and glob() return the filenames as Unicode; see
+L<perlfunc/opendir> and L<File::Glob> for details.
+
To mark the Perl source itself as being in a particular encoding,
see L<encoding>.
=item *
-Strings and patterns may contain characters that have an ordinal value
-larger than 255.
+Strings (including hash keys) and regular expression patterns may
+contain characters that have an ordinal value larger than 255.
If you use a Unicode editor to edit your program, Unicode characters
may occur directly within the literal strings in one of the various
This works only for characters with a code 0x100 and above.
Additionally, if you
+
use charnames ':full';
+
you can use the C<\N{...}> notation, putting the official Unicode character
name within the curlies. For example, C<\N{WHITE SMILING FACE}>.
This works for all characters that have names.
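
A minimal sketch of the C<\N{...}> notation described above, assuming a Perl recent enough to ship the charnames pragma. The variable names are illustrative only and do not come from the original text.

```perl
use strict;
use warnings;
use charnames ':full';

# \N{...} interpolates the character that has the given official Unicode name.
my $smiley = "\N{WHITE SMILING FACE}";

# A character above 255 can also be used as a hash key.
my %seen = ($smiley => 1);

printf "U+%04X, length %d\n", ord($smiley), length($smiley);
```

The string holds a single character, so length() counts characters here rather than bytes.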
=item *
-If an appropriate L<encoding> is specified,
-identifiers within the Perl script may contain Unicode alphanumeric
-characters, including ideographs. (You are currently on your own when
-it comes to using the canonical forms of characters--Perl doesn't
-(yet) attempt to canonicalize variable names for you.)
+If an appropriate L<encoding> is specified, identifiers within the
+Perl script may contain Unicode alphanumeric characters, including
+ideographs. (You are currently on your own when it comes to using the
+canonical forms of characters--Perl doesn't (yet) attempt to
+canonicalize variable names for you.)
=item *
Armenian
Bengali
Bopomofo
+ Buhid
CanadianAboriginal
Cherokee
Cyrillic
Gurmukhi
Han
Hangul
+ Hanunoo
Hebrew
Hiragana
Inherited
Runic
Sinhala
Syriac
+ Tagalog
+ Tagbanwa
Tamil
Telugu
Thaana
There are also extended property classes that supplement the basic
properties, defined by the F<PropList> Unicode database:
- ASCII_Hex_Digit
+ ASCIIHexDigit
BidiControl
Dash
+ Deprecated
Diacritic
Extender
+ GraphemeLink
HexDigit
Hyphen
Ideographic
+ IDSBinaryOperator
+ IDSTrinaryOperator
JoinControl
+ LogicalOrderException
NoncharacterCodePoint
OtherAlphabetic
+ OtherDefaultIgnorableCodePoint
+ OtherGraphemeExtend
OtherLowercase
OtherMath
OtherUppercase
QuotationMark
+ Radical
+ SoftDotted
+ TerminalPunctuation
+ UnifiedIdeograph
WhiteSpace
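
As a rough illustration of how the extended property classes listed above can be used in a pattern, the following sketch tests a few characters against C<\p{...}>. The sample characters and the C<@checks> structure are assumptions made for this example, and the exact set of characters matching each property depends on the Unicode version bundled with your Perl.

```perl
use strict;
use warnings;

# Each entry pairs a character with a property it is expected to have.
my @checks = (
    [ "A",        qr/\p{HexDigit}/    ],   # LATIN CAPITAL LETTER A
    [ "-",        qr/\p{Dash}/        ],   # HYPHEN-MINUS
    [ " ",        qr/\p{WhiteSpace}/  ],   # SPACE
    [ "\x{4E00}", qr/\p{Ideographic}/ ],   # a CJK unified ideograph
);

# 1 if the character matches its property, 0 otherwise.
my @results = map { $_->[0] =~ $_->[1] ? 1 : 0 } @checks;

print "@results\n";
```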
and further derived properties:
These block names are supported:
- InAlphabeticPresentationForms
- InArabicBlock
- InArabicPresentationFormsA
- InArabicPresentationFormsB
- InArmenianBlock
- InArrows
- InBasicLatin
- InBengaliBlock
- InBlockElements
- InBopomofoBlock
- InBopomofoExtended
- InBoxDrawing
- InBraillePatterns
- InByzantineMusicalSymbols
- InCJKCompatibility
- InCJKCompatibilityForms
- InCJKCompatibilityIdeographs
- InCJKCompatibilityIdeographsSupplement
- InCJKRadicalsSupplement
- InCJKSymbolsAndPunctuation
- InCJKUnifiedIdeographs
- InCJKUnifiedIdeographsExtensionA
- InCJKUnifiedIdeographsExtensionB
- InCherokeeBlock
- InCombiningDiacriticalMarks
- InCombiningHalfMarks
- InCombiningMarksForSymbols
- InControlPictures
- InCurrencySymbols
- InCyrillicBlock
- InDeseretBlock
- InDevanagariBlock
- InDingbats
- InEnclosedAlphanumerics
- InEnclosedCJKLettersAndMonths
- InEthiopicBlock
- InGeneralPunctuation
- InGeometricShapes
- InGeorgianBlock
- InGothicBlock
- InGreekBlock
- InGreekExtended
- InGujaratiBlock
- InGurmukhiBlock
- InHalfwidthAndFullwidthForms
- InHangulCompatibilityJamo
- InHangulJamo
- InHangulSyllables
- InHebrewBlock
- InHighPrivateUseSurrogates
- InHighSurrogates
- InHiraganaBlock
- InIPAExtensions
- InIdeographicDescriptionCharacters
- InKanbun
- InKangxiRadicals
- InKannadaBlock
- InKatakanaBlock
- InKhmerBlock
- InLaoBlock
- InLatin1Supplement
- InLatinExtendedAdditional
- InLatinExtended-A
- InLatinExtended-B
- InLetterlikeSymbols
- InLowSurrogates
- InMalayalamBlock
- InMathematicalAlphanumericSymbols
- InMathematicalOperators
- InMiscellaneousSymbols
- InMiscellaneousTechnical
- InMongolianBlock
- InMusicalSymbols
- InMyanmarBlock
- InNumberForms
- InOghamBlock
- InOldItalicBlock
- InOpticalCharacterRecognition
- InOriyaBlock
- InPrivateUse
- InRunicBlock
- InSinhalaBlock
- InSmallFormVariants
- InSpacingModifierLetters
- InSpecials
- InSuperscriptsAndSubscripts
- InSyriacBlock
- InTags
- InTamilBlock
- InTeluguBlock
- InThaanaBlock
- InThaiBlock
- InTibetanBlock
- InUnifiedCanadianAboriginalSyllabics
- InYiRadicals
- InYiSyllables
+ InAlphabeticPresentationForms
+ InArabic
+ InArabicPresentationFormsA
+ InArabicPresentationFormsB
+ InArmenian
+ InArrows
+ InBasicLatin
+ InBengali
+ InBlockElements
+ InBopomofo
+ InBopomofoExtended
+ InBoxDrawing
+ InBraillePatterns
+ InBuhid
+ InByzantineMusicalSymbols
+ InCJKCompatibility
+ InCJKCompatibilityForms
+ InCJKCompatibilityIdeographs
+ InCJKCompatibilityIdeographsSupplement
+ InCJKRadicalsSupplement
+ InCJKSymbolsAndPunctuation
+ InCJKUnifiedIdeographs
+ InCJKUnifiedIdeographsExtensionA
+ InCJKUnifiedIdeographsExtensionB
+ InCherokee
+ InCombiningDiacriticalMarks
+ InCombiningDiacriticalMarksforSymbols
+ InCombiningHalfMarks
+ InControlPictures
+ InCurrencySymbols
+ InCyrillic
+ InCyrillicSupplementary
+ InDeseret
+ InDevanagari
+ InDingbats
+ InEnclosedAlphanumerics
+ InEnclosedCJKLettersAndMonths
+ InEthiopic
+ InGeneralPunctuation
+ InGeometricShapes
+ InGeorgian
+ InGothic
+ InGreekExtended
+ InGreekAndCoptic
+ InGujarati
+ InGurmukhi
+ InHalfwidthAndFullwidthForms
+ InHangulCompatibilityJamo
+ InHangulJamo
+ InHangulSyllables
+ InHanunoo
+ InHebrew
+ InHighPrivateUseSurrogates
+ InHighSurrogates
+ InHiragana
+ InIPAExtensions
+ InIdeographicDescriptionCharacters
+ InKanbun
+ InKangxiRadicals
+ InKannada
+ InKatakana
+ InKatakanaPhoneticExtensions
+ InKhmer
+ InLao
+ InLatin1Supplement
+ InLatinExtendedA
+ InLatinExtendedAdditional
+ InLatinExtendedB
+ InLetterlikeSymbols
+ InLowSurrogates
+ InMalayalam
+ InMathematicalAlphanumericSymbols
+ InMathematicalOperators
+ InMiscellaneousMathematicalSymbolsA
+ InMiscellaneousMathematicalSymbolsB
+ InMiscellaneousSymbols
+ InMiscellaneousTechnical
+ InMongolian
+ InMusicalSymbols
+ InMyanmar
+ InNumberForms
+ InOgham
+ InOldItalic
+ InOpticalCharacterRecognition
+ InOriya
+ InPrivateUseArea
+ InRunic
+ InSinhala
+ InSmallFormVariants
+ InSpacingModifierLetters
+ InSpecials
+ InSuperscriptsAndSubscripts
+ InSupplementalArrowsA
+ InSupplementalArrowsB
+ InSupplementalMathematicalOperators
+ InSupplementaryPrivateUseAreaA
+ InSupplementaryPrivateUseAreaB
+ InSyriac
+ InTagalog
+ InTagbanwa
+ InTags
+ InTamil
+ InTelugu
+ InThaana
+ InThai
+ InTibetan
+ InUnifiedCanadianAboriginalSyllabics
+ InVariationSelectors
+ InYiRadicals
+ InYiSyllables
=over 4
in Perl can be written as:
- (?!\p{Unassigned})\p{InGreek}
- (?=\p{Assigned})\p{InGreek}
+ (?!\p{Unassigned})\p{InGreekAndCoptic}
+ (?=\p{Assigned})\p{InGreekAndCoptic}
But in this particular example, you probably really want
Note the A0..BF in U+0800..U+0FFF, the 80..9F in U+D000..U+D7FF,
the 90..BF in U+10000..U+3FFFF, and the 80..8F in U+100000..U+10FFFF.
+The "gaps" are caused by legal UTF-8 avoiding non-shortest encodings:
+it is technically possible to UTF-8-encode a single code point in several
+different ways, but that is explicitly forbidden, and the shortest possible
+encoding must always be used (and that is what Perl does).
+
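The shortest-form rule can be observed with a small sketch. It assumes the Encode module is available, which this section does not itself mention, and it uses the strict C<UTF-8> encoding name together with C<FB_CROAK> so that malformed input raises an exception.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00E9 needs exactly two bytes in UTF-8: C3 A9.
my $bytes = encode('UTF-8', "\x{E9}");
my $hex   = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;

# C0 80 would be a non-shortest ("overlong") encoding of U+0000.
# A strict decoder refuses it; with FB_CROAK, decode() dies.
my $overlong = "\xC0\x80";
my $accepted = eval { decode('UTF-8', $overlong, Encode::FB_CROAK); 1 };

print "U+00E9 encodes as: $hex\n";
print "overlong form ", ($accepted ? "accepted" : "rejected"), "\n";
```
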
Or, another way to look at it, as bits:
Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
and the decoding is
- $uni = 0x10000 + ($hi - 0xD8000) * 0x400 + ($lo - 0xDC00);
+ $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
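
The decoding formula above can be checked with a round trip. This is only a sketch: the helper names C<to_surrogates> and C<from_surrogates> are hypothetical, and the encoding half is the standard UTF-16 split, written out here as an assumption because it is not shown in this excerpt.

```perl
use strict;
use warnings;

# Split a code point above U+FFFF into a UTF-16 surrogate pair.
sub to_surrogates {
    my ($uni) = @_;
    my $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    my $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
    return ($hi, $lo);
}

# Recombine a surrogate pair, using the decoding formula given above.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the 16-bit range.
my ($hi, $lo) = to_surrogates(0x1D11E);
printf "hi=%04X lo=%04X back=%X\n", $hi, $lo, from_surrogates($hi, $lo);
```

With the earlier constant C<0xD8000> the round trip would not return the original code point, which is why the constant matters.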
If you try to generate surrogates (for example by using chr()), you
will get a warning if warnings are turned on (C<-w> or C<use
Perl tries really hard to work both with Unicode and the old byte
oriented world: most often this is nice, but sometimes this causes
-problems. See L</BUGS> for example how sometimes using locales
-with Unicode can help with these problems.
+problems.
=back
there is some attempt to apply 8-bit locale info to characters in the
range 0..255, but this is demonstrably incorrect for locales that use
characters above that range when mapped into Unicode. It will also
-tend to run slower. Avoidance of locales is strongly encouraged,
-with one known expection, see the next paragraph.
-
-If the keys of a hash are "mixed", that is, some keys are Unicode,
-while some keys are "byte", the keys may behave differently in regular
-expressions since the definition of character classes like C</\w/>
-is different for byte strings and character strings. This problem can
-sometimes be helped by using an appropriate locale (see L<perllocale>).
-Another way is to force all the strings to be character encoded by
-using utf8::upgrade() (see L<utf8>).
+tend to run slower. Use of locales with Unicode is discouraged.
Some functions are slower when working on UTF-8 encoded strings than
-on byte encoded strings. All functions that need to hop over
+on byte encoded strings. All functions that need to hop over
characters such as length(), substr() or index() can work B<much>
faster when the underlying data are byte-encoded. Witness the
following benchmark: