perl's encoding on output by use of the ":encoding(...)" layer.
See L<open>.
+On some filesystems (for example Microsoft NTFS and Apple HFS+) the
+filenames are in UTF-8.  By using opendir() and File::Glob you can
+make readdir() and glob() return the filenames as Unicode; see
+L<perlfunc/opendir> and L<File::Glob> for details.
+
To mark the Perl source itself as being in a particular encoding,
see L<encoding>.
=item *
-Strings and patterns may contain characters that have an ordinal value
-larger than 255.
+Strings (including hash keys) and regular expression patterns may
+contain characters that have an ordinal value larger than 255.
If you use a Unicode editor to edit your program, Unicode characters
may occur directly within the literal strings in one of the various
This works only for characters with a code 0x100 and above.
Additionally, if you
+
use charnames ':full';
+
you can use the C<\N{...}> notation, putting the official Unicode character
name within the curlies. For example, C<\N{WHITE SMILING FACE}>.
This works for all characters that have names.
=item *
-If an appropriate L<encoding> is specified,
-identifiers within the Perl script may contain Unicode alphanumeric
-characters, including ideographs. (You are currently on your own when
-it comes to using the canonical forms of characters--Perl doesn't
-(yet) attempt to canonicalize variable names for you.)
+If an appropriate L<encoding> is specified, identifiers within the
+Perl script may contain Unicode alphanumeric characters, including
+ideographs. (You are currently on your own when it comes to using the
+canonical forms of characters--Perl doesn't (yet) attempt to
+canonicalize variable names for you.)
=item *
Armenian
Bengali
Bopomofo
+ Buhid
CanadianAboriginal
Cherokee
Cyrillic
Gurmukhi
Han
Hangul
+ Hanunoo
Hebrew
Hiragana
Inherited
Runic
Sinhala
Syriac
+ Tagalog
+ Tagbanwa
Tamil
Telugu
Thaana
There are also extended property classes that supplement the basic
properties, defined by the F<PropList> Unicode database:
- ASCII_Hex_Digit
+ ASCIIHexDigit
BidiControl
Dash
+ Deprecated
Diacritic
Extender
+ GraphemeLink
HexDigit
Hyphen
Ideographic
+ IDSBinaryOperator
+ IDSTrinaryOperator
JoinControl
+ LogicalOrderException
NoncharacterCodePoint
OtherAlphabetic
+ OtherDefaultIgnorableCodePoint
+ OtherGraphemeExtend
OtherLowercase
OtherMath
OtherUppercase
QuotationMark
+ Radical
+ SoftDotted
+ TerminalPunctuation
+ UnifiedIdeograph
WhiteSpace
and further derived properties:
These block names are supported:
- InAlphabeticPresentationForms
- InArabicBlock
- InArabicPresentationFormsA
- InArabicPresentationFormsB
- InArmenianBlock
- InArrows
- InBasicLatin
- InBengaliBlock
- InBlockElements
- InBopomofoBlock
- InBopomofoExtended
- InBoxDrawing
- InBraillePatterns
- InByzantineMusicalSymbols
- InCJKCompatibility
- InCJKCompatibilityForms
- InCJKCompatibilityIdeographs
- InCJKCompatibilityIdeographsSupplement
- InCJKRadicalsSupplement
- InCJKSymbolsAndPunctuation
- InCJKUnifiedIdeographs
- InCJKUnifiedIdeographsExtensionA
- InCJKUnifiedIdeographsExtensionB
- InCherokeeBlock
- InCombiningDiacriticalMarks
- InCombiningHalfMarks
- InCombiningMarksForSymbols
- InControlPictures
- InCurrencySymbols
- InCyrillicBlock
- InDeseretBlock
- InDevanagariBlock
- InDingbats
- InEnclosedAlphanumerics
- InEnclosedCJKLettersAndMonths
- InEthiopicBlock
- InGeneralPunctuation
- InGeometricShapes
- InGeorgianBlock
- InGothicBlock
- InGreekBlock
- InGreekExtended
- InGujaratiBlock
- InGurmukhiBlock
- InHalfwidthAndFullwidthForms
- InHangulCompatibilityJamo
- InHangulJamo
- InHangulSyllables
- InHebrewBlock
- InHighPrivateUseSurrogates
- InHighSurrogates
- InHiraganaBlock
- InIPAExtensions
- InIdeographicDescriptionCharacters
- InKanbun
- InKangxiRadicals
- InKannadaBlock
- InKatakanaBlock
- InKhmerBlock
- InLaoBlock
- InLatin1Supplement
- InLatinExtendedAdditional
- InLatinExtended-A
- InLatinExtended-B
- InLetterlikeSymbols
- InLowSurrogates
- InMalayalamBlock
- InMathematicalAlphanumericSymbols
- InMathematicalOperators
- InMiscellaneousSymbols
- InMiscellaneousTechnical
- InMongolianBlock
- InMusicalSymbols
- InMyanmarBlock
- InNumberForms
- InOghamBlock
- InOldItalicBlock
- InOpticalCharacterRecognition
- InOriyaBlock
- InPrivateUse
- InRunicBlock
- InSinhalaBlock
- InSmallFormVariants
- InSpacingModifierLetters
- InSpecials
- InSuperscriptsAndSubscripts
- InSyriacBlock
- InTags
- InTamilBlock
- InTeluguBlock
- InThaanaBlock
- InThaiBlock
- InTibetanBlock
- InUnifiedCanadianAboriginalSyllabics
- InYiRadicals
- InYiSyllables
+ InAlphabeticPresentationForms
+ InArabic
+ InArabicPresentationFormsA
+ InArabicPresentationFormsB
+ InArmenian
+ InArrows
+ InBasicLatin
+ InBengali
+ InBlockElements
+ InBopomofo
+ InBopomofoExtended
+ InBoxDrawing
+ InBraillePatterns
+ InBuhid
+ InByzantineMusicalSymbols
+ InCJKCompatibility
+ InCJKCompatibilityForms
+ InCJKCompatibilityIdeographs
+ InCJKCompatibilityIdeographsSupplement
+ InCJKRadicalsSupplement
+ InCJKSymbolsAndPunctuation
+ InCJKUnifiedIdeographs
+ InCJKUnifiedIdeographsExtensionA
+ InCJKUnifiedIdeographsExtensionB
+ InCherokee
+ InCombiningDiacriticalMarks
+ InCombiningDiacriticalMarksforSymbols
+ InCombiningHalfMarks
+ InControlPictures
+ InCurrencySymbols
+ InCyrillic
+ InCyrillicSupplementary
+ InDeseret
+ InDevanagari
+ InDingbats
+ InEnclosedAlphanumerics
+ InEnclosedCJKLettersAndMonths
+ InEthiopic
+ InGeneralPunctuation
+ InGeometricShapes
+ InGeorgian
+ InGothic
+ InGreekExtended
+ InGreekAndCoptic
+ InGujarati
+ InGurmukhi
+ InHalfwidthAndFullwidthForms
+ InHangulCompatibilityJamo
+ InHangulJamo
+ InHangulSyllables
+ InHanunoo
+ InHebrew
+ InHighPrivateUseSurrogates
+ InHighSurrogates
+ InHiragana
+ InIPAExtensions
+ InIdeographicDescriptionCharacters
+ InKanbun
+ InKangxiRadicals
+ InKannada
+ InKatakana
+ InKatakanaPhoneticExtensions
+ InKhmer
+ InLao
+ InLatin1Supplement
+ InLatinExtendedA
+ InLatinExtendedAdditional
+ InLatinExtendedB
+ InLetterlikeSymbols
+ InLowSurrogates
+ InMalayalam
+ InMathematicalAlphanumericSymbols
+ InMathematicalOperators
+ InMiscellaneousMathematicalSymbolsA
+ InMiscellaneousMathematicalSymbolsB
+ InMiscellaneousSymbols
+ InMiscellaneousTechnical
+ InMongolian
+ InMusicalSymbols
+ InMyanmar
+ InNumberForms
+ InOgham
+ InOldItalic
+ InOpticalCharacterRecognition
+ InOriya
+ InPrivateUseArea
+ InRunic
+ InSinhala
+ InSmallFormVariants
+ InSpacingModifierLetters
+ InSpecials
+ InSuperscriptsAndSubscripts
+ InSupplementalArrowsA
+ InSupplementalArrowsB
+ InSupplementalMathematicalOperators
+ InSupplementaryPrivateUseAreaA
+ InSupplementaryPrivateUseAreaB
+ InSyriac
+ InTagalog
+ InTagbanwa
+ InTags
+ InTamil
+ InTelugu
+ InThaana
+ InThai
+ InTibetan
+ InUnifiedCanadianAboriginalSyllabics
+ InVariationSelectors
+ InYiRadicals
+ InYiSyllables
=over 4
in Perl can be written as:
- (?!\p{Unassigned})\p{InGreek}
- (?=\p{Assigned})\p{InGreek}
+ (?!\p{Unassigned})\p{InGreekAndCoptic}
+ (?=\p{Assigned})\p{InGreekAndCoptic}
But in this particular example, you probably really want
Note the A0..BF in U+0800..U+0FFF, the 80..9F in U+D000...U+D7FF,
the 90..BF in U+10000..U+3FFFF, and the 80...8F in U+100000..U+10FFFF.
+The "gaps" are caused by legal UTF-8 avoiding non-shortest encodings:
+it is technically possible to UTF-8-encode a single code point in different
+ways, but that is explicitly forbidden, and the shortest possible encoding
+must always be used (and that is what Perl does).
+
Or, another way to look at it, as bits:
Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
and the decoding is
- $uni = 0x10000 + ($hi - 0xD8000) * 0x400 + ($lo - 0xDC00);
+ $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
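As a rough illustration of the corrected decoding formula, the following sketch converts a code point to a surrogate pair and back. The helper names C<to_surrogates> and C<from_surrogates> are hypothetical, chosen for this example only, and the encoding half assumes the usual UTF-16 arithmetic.

```perl
use strict;
use warnings;

# Hypothetical helper: split a code point above U+FFFF into a
# high and a low UTF-16 surrogate.
sub to_surrogates {
    my ($uni) = @_;
    my $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    my $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
    return ($hi, $lo);
}

# Hypothetical helper: the decoding formula from the text above.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

# U+1D11E is MUSICAL SYMBOL G CLEF.
my ($hi, $lo) = to_surrogates(0x1D11E);
printf "U+%X -> 0x%X 0x%X -> U+%X\n",
    0x1D11E, $hi, $lo, from_surrogates($hi, $lo);
# prints: U+1D11E -> 0xD834 0xDD1E -> U+1D11E
```

With the old constant 0xD8000 the round trip would not return the original code point, which is why the one-digit fix matters.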
If you try to generate surrogates (for example by using chr()), you
will get a warning if warnings are turned on (C<-w> or C<use
the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
for more discussion of the issues.
+=head2 Locales
+
+Usually locale settings and Unicode do not affect each other, but
+there are a couple of exceptions:
+
+=over 4
+
+=item *
+
+If your locale environment variables (LANGUAGE, LC_ALL, LC_CTYPE, LANG)
+contain the string 'UTF-8' or 'UTF8' (case-insensitive matching),
+the default encoding of your STDIN, STDOUT, and STDERR, and of
+B<any subsequent file open>, is UTF-8.
+
+=item *
+
+Perl tries really hard to work both with Unicode and the old
+byte-oriented world: most often this is nice, but sometimes this causes
+problems.
+
+=back
+
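The environment check described in the first item can be approximated as below. This is only a sketch: C<locale_mentions_utf8> is a hypothetical helper written for this example, and it inspects the variables named above without claiming to reproduce the test perl itself performs at startup.

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 if any of the locale variables named in
# the text contains 'UTF-8' or 'UTF8' (case-insensitively), else 0.
# Takes the environment as a list of name/value pairs so it can be
# exercised without touching the real %ENV.
sub locale_mentions_utf8 {
    my (%env) = @_;
    for my $name (qw(LANGUAGE LC_ALL LC_CTYPE LANG)) {
        my $value = $env{$name};
        next unless defined $value;
        return 1 if $value =~ /utf-?8/i;
    }
    return 0;
}

print locale_mentions_utf8(%ENV)
    ? "locale variables mention UTF-8\n"
    : "locale variables do not mention UTF-8\n";
```

A script that must not depend on the caller's locale can run a check like this early and set its I/O layers explicitly instead of relying on the default.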
=head2 Using Unicode in XS
If you want to handle Perl Unicode in XS extensions, you may find
there is some attempt to apply 8-bit locale info to characters in the
range 0..255, but this is demonstrably incorrect for locales that use
characters above that range when mapped into Unicode. It will also
-tend to run slower. Avoidance of locales is strongly encouraged.
+tend to run slower. Use of locales with Unicode is discouraged.
Some functions are slower when working on UTF-8 encoded strings than
-on byte encoded strings. All functions that need to hop over
+on byte encoded strings. All functions that need to hop over
characters such as length(), substr() or index() can work B<much>
faster when the underlying data are byte-encoded. Witness the
following benchmark:
-
+
% perl -e '
use Benchmark;
use strict;
LENGTH_U: 2 wallclock secs ( 2.11 usr + 0.00 sys = 2.11 CPU) @ 12155.45/s (n=25648)
SUBSTR_B: 3 wallclock secs ( 2.16 usr + 0.00 sys = 2.16 CPU) @ 374480.09/s (n=808877)
SUBSTR_U: 2 wallclock secs ( 2.11 usr + 0.00 sys = 2.11 CPU) @ 6791.00/s (n=14329)
-
+
The numbers show an incredible slowness on long UTF-8 strings and you
should carefully avoid to use these functions within tight loops. For
example if you want to iterate over characters, it is infinitely
You see, the algorithm based on substr() was faster with byte encoded
data but it is pathologically slow with UTF-8 data.
-
+
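One way to avoid repeated substr() calls on UTF-8 data is to split the string into characters once and then walk the resulting list. The sketch below assumes a test string made up for this example; actual timings depend on the perl version and the data, so measure before relying on it.

```perl
use strict;
use warnings;

# Example data (an assumption for this sketch): 1000 copies of
# U+263A, which forces perl to store the string as UTF-8 internally.
my $string = "\x{263A}" x 1000;

# Split once into a list of characters, then iterate over the list,
# rather than calling substr($string, $i, 1) for every position.
my $count = 0;
for my $char (split //, $string) {
    $count++ if ord($char) == 0x263A;
}

print "$count\n";   # prints 1000
```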
=head1 SEE ALSO
L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,