diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 9e3ca75..890bd8c 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -6,19 +6,9 @@ perlunicode - Unicode support in Perl

 =head2 Important Caveats

-WARNING: While the implementation of Unicode support in Perl is now
-fairly complete it is still evolving to some extent.
-
-In particular the way Unicode is handled on EBCDIC platforms is still
-rather experimental. On such a platform references to UTF-8 encoding
-in this document and elsewhere should be read as meaning UTF-EBCDIC as
-specified in Unicode Technical Report 16 unless ASCII vs EBCDIC issues
-are specifically discussed. There is no C pragma or
-":utfebcdic" layer, rather "utf8" and ":utf8" are re-used to mean
-platform's "natural" 8-bit encoding of Unicode. See L for
-more discussion of the issues.
-
-The following areas are still under development.
+Unicode support is an extensive requirement. While Perl does not
+implement the Unicode standard or the accompanying technical reports
+from cover to cover, it does support many Unicode features.

 =over 4

@@ -27,30 +17,33 @@ The following areas are still under development.

 =item Input and Output Disciplines

 A filehandle can be marked as containing perl's internal Unicode
 encoding (UTF-8 or UTF-EBCDIC) by opening it with the ":utf8" layer.
 Other encodings can be converted to perl's encoding on input, or from
-perl's encoding on output by use of the ":encoding()" layer. There is
-not yet a clean way to mark the Perl source itself as being in an
-particular encoding.
+perl's encoding on output by use of the ":encoding(...)" layer.
+See L.
+
+To mark the Perl source itself as being in a particular encoding,
+see L.
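The layer mechanics described above can be sketched like this (a minimal illustration; the file name and the ISO-8859-7 source encoding are arbitrary assumptions, and the ":encoding(...)" layer relies on the Encode module bundled with Perl 5.8):

```perl
use strict;
use warnings;

my $file = 'greek.txt';   # hypothetical file name

# Convert from perl's internal encoding to ISO-8859-7 on output ...
open my $out, '>:encoding(iso-8859-7)', $file or die "open: $!";
print $out "\x{3B1}\n";   # GREEK SMALL LETTER ALPHA, U+03B1
close $out;

# ... and back to perl's internal encoding on input.
open my $in, '<:encoding(iso-8859-7)', $file or die "open: $!";
my $alpha = <$in>;
close $in;
chomp $alpha;

printf "U+%04X\n", ord $alpha;   # the round trip preserves U+03B1
unlink $file;
```

Data read through the ":encoding(iso-8859-7)" layer arrives as perl's internal characters, so ord() reports the Unicode code point U+03B1 rather than the on-disk byte 0xE1.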
=item Regular Expressions

-The regular expression compiler does now attempt to produce
-polymorphic opcodes. That is the pattern should now adapt to the data
-and automatically switch to the Unicode character scheme when
-presented with Unicode data, or a traditional byte scheme when
-presented with byte data. The implementation is still new and
-(particularly on EBCDIC platforms) may need further work.
+The regular expression compiler produces polymorphic opcodes. That is,
+the pattern adapts to the data and automatically switches to the
+Unicode character scheme when presented with Unicode data, or a
+traditional byte scheme when presented with byte data.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

The C<utf8> pragma implements the tables used for Unicode support.
-These tables are automatically loaded on demand, so the C pragma
-need not normally be used.
+However, these tables are automatically loaded on demand, so the
+C<utf8> pragma should not normally be used.
+
+As a compatibility measure, this pragma must be explicitly used to
+enable recognition of UTF-8 in the Perl scripts themselves on ASCII
+based machines or recognize UTF-EBCDIC on EBCDIC based machines.
+B<NOTE: this should be the only place where an explicit C<use utf8>
+is needed>.
-However, as a compatibility measure, this pragma must be explicitly
-used to enable recognition of UTF-8 in the Perl scripts themselves on
-ASCII based machines or recognize UTF-EBCDIC on EBCDIC based machines.
-B is
-needed>.
+You can also use the C<encoding> pragma to change the default encoding
+of the data in your script; see L.

=back

@@ -78,11 +71,11 @@ character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as
%ENV), or from literals and constants in the source text.

-If the C<-C> command line switch is used, (or the
+On Windows platforms, if the C<-C> command line switch is used (or the
${^WIDE_SYSTEM_CALLS} global flag is set to C<1>), all system calls
will use the corresponding wide character APIs.
Note that this is
-currently only implemented on Windows since other platforms API
-standard on this area.
+currently only implemented on Windows since other platforms lack an
+API standard in this area.

Regardless of the above, the C<bytes> pragma can always be used to force
byte semantics in a particular lexical scope. See L.

@@ -102,10 +95,11 @@ literal UTF-8 string constant in the program), character semantics
apply; otherwise, byte semantics are in effect. To force byte
semantics on Unicode data, the C<bytes> pragma should be used.

-Notice that if you have a string with byte semantics and you then
-add character data into it, the bytes will be upgraded I (or if in EBCDIC, after a translation
-to ISO 8859-1).
+Notice that if you concatenate strings with byte semantics and strings
+with Unicode character data, the bytes will by default be upgraded
+I<as if they were ISO 8859-1 (Latin-1)> (or if in EBCDIC, after a
+translation to ISO 8859-1). To change this, use the C<encoding>
+pragma, see L.

Under character semantics, many operations that formerly operated on
bytes change to operating on characters. For ASCII data this makes no
@@ -162,7 +156,7 @@ ideograph, for instance.

=item *

-Named Unicode properties and block ranges make be used as character
+Named Unicode properties and block ranges may be used as character
classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't
match property) constructs. For instance, C<\p{Lu}> matches any
character with the Unicode uppercase property, while C<\p{M}> matches
@@ -173,88 +167,90 @@ are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.

The C<\p{Is...}> tests for "general properties" such as "letter",
"digit", while the C<\p{In...}> tests for Unicode scripts and blocks.

-The official Unicode script and block names have spaces and
-dashes and separators, but for convenience you can have
-dashes, spaces, and underbars at every word division, and
-you need not care about correct casing.
It is recommended,
-however, that for consistency you use the following naming:
-the official Unicode script or block name (see below for
-the additional rules that apply to block names), with the whitespace
-and dashes removed, and the words "uppercase-first-lowercase-otherwise".
-That is, "Latin-1 Supplement" becomes "Latin1Supplement".
+The official Unicode script and block names have spaces and dashes as
+separators, but for convenience you can have dashes, spaces, and
+underbars at every word division, and you need not care about correct
+casing. It is recommended, however, that for consistency you use the
+following naming: the official Unicode script, block, or property name
+(see below for the additional rules that apply to block names),
+with whitespace and dashes replaced with underbar, and the words
+"uppercase-first-lowercase-rest". That is, "Latin-1 Supplement"
+becomes "Latin_1_Supplement".

You can also negate both C<\p{}> and C<\P{}> by introducing a caret
-(^) between the first curly and the property name: C<\p{^InTamil}> is
-equal to C<\P{InTamil}>.
+(^) between the first curly and the property name: C<\p{^In_Tamil}> is
+equal to C<\P{In_Tamil}>.

The C<Is> and C<In> can be left out: C<\p{Greek}> is equal to
-C<\p{InGreek}>, C<\P{Pd}> is equal to C<\P{Pd}>.
+C<\p{In_Greek}>, C<\P{Pd}> is equal to C<\P{Is_Pd}>.
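A short sketch of these constructs in action (the sample string is an arbitrary choice):

```perl
use strict;
use warnings;

my $str = "Perl\x{0416}";   # ends with CYRILLIC CAPITAL LETTER ZHE

print "has an uppercase letter\n" if $str =~ /\p{Lu}/;  # 'P' and U+0416
print "has no marks\n"            if $str !~ /\p{M}/;   # no combining marks
print "caret negation works\n"    if "x" =~ /\p{^Lu}/;  # same as \P{Lu}
```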
Short Long L Letter - Lu Uppercase Letter - Ll Lowercase Letter - Lt Titlecase Letter - Lm Modifier Letter - Lo Other Letter + Lu Uppercase_Letter + Ll Lowercase_Letter + Lt Titlecase_Letter + Lm Modifier_Letter + Lo Other_Letter M Mark - Mn Non-Spacing Mark - Mc Spacing Combining Mark - Me Enclosing Mark + Mn Nonspacing_Mark + Mc Spacing_Mark + Me Enclosing_Mark N Number - Nd Decimal Digit Number - Nl Letter Number - No Other Number + Nd Decimal_Number + Nl Letter_Number + No Other_Number P Punctuation - Pc Connector Punctuation - Pd Dash Punctuation - Ps Open Punctuation - Pe Close Punctuation - Pi Initial Punctuation + Pc Connector_Punctuation + Pd Dash_Punctuation + Ps Open_Punctuation + Pe Close_Punctuation + Pi Initial_Punctuation (may behave like Ps or Pe depending on usage) - Pf Final Punctuation + Pf Final_Punctuation (may behave like Ps or Pe depending on usage) - Po Other Punctuation + Po Other_Punctuation S Symbol - Sm Math Symbol - Sc Currency Symbol - Sk Modifier Symbol - So Other Symbol + Sm Math_Symbol + Sc Currency_Symbol + Sk Modifier_Symbol + So Other_Symbol Z Separator - Zs Space Separator - Zl Line Separator - Zp Paragraph Separator + Zs Space_Separator + Zl Line_Separator + Zp Paragraph_Separator C Other - Cc (Other) Control - Cf (Other) Format - Cs (Other) Surrogate - Co (Other) Private Use - Cn (Other) Not Assigned + Cc Control + Cf Format + Cs Surrogate + Co Private_Use + Cn Unassigned There's also C which is an alias for C, C, and C. 
The following reserved ranges have C tests: - CJK Ideograph Extension A - CJK Ideograph - Hangul Syllable - Non Private Use High Surrogate - Private Use High Surrogate - Low Surrogate - Private Surrogate - CJK Ideograph Extension B - Plane 15 Private Use - Plane 16 Private Use + CJK_Ideograph_Extension_A + CJK_Ideograph + Hangul_Syllable + Non_Private_Use_High_Surrogate + Private_Use_High_Surrogate + Low_Surrogate + Private_Surrogate + CJK_Ideograph_Extension_B + Plane_15_Private_Use + Plane_16_Private_Use For example C<"\x{AC00}" =~ \p{HangulSyllable}> will test true. -(Handling of surrogates is not implemented yet.) +(Handling of surrogates is not implemented yet, because Perl +uses UTF-8 and not UTF-16 internally to represent Unicode. +So you really can't use the "Cs" category.) Additionally, because scripts differ in their directionality (for example Hebrew is written right to left), all characters @@ -280,71 +276,73 @@ have their directionality defined: BidiWS Whitespace BidiON Other Neutrals +=back + =head2 Scripts The scripts available for C<\p{In...}> and C<\P{In...}>, for example -\p{InCyrillic>, are as follows, for example C<\p{InLatin}> or C<\P{InHan}>: +C<\p{InLatin}> or \p{InCyrillic>, are as follows: - Latin - Greek - Cyrillic - Armenian - Hebrew Arabic - Syriac - Thaana - Devanagari + Armenian Bengali - Gurmukhi + Bopomofo + Canadian-Aboriginal + Cherokee + Cyrillic + Deseret + Devanagari + Ethiopic + Georgian + Gothic + Greek Gujarati - Oriya - Tamil - Telugu + Gurmukhi + Han + Hangul + Hebrew + Hiragana + Inherited Kannada - Malayalam - Sinhala - Thai + Katakana + Khmer Lao - Tibetan + Latin + Malayalam + Mongolian Myanmar - Georgian - Hangul - Ethiopic - Cherokee - CanadianAboriginal Ogham + Old-Italic + Oriya Runic - Khmer - Mongolian - Hiragana - Katakana - Bopomofo - Han + Sinhala + Syriac + Tamil + Telugu + Thaana + Thai + Tibetan Yi - OldItalic - Gothic - Deseret - Inherited There are also extended property classes that supplement the basic 
properties, defined by the F<PropList.txt> Unicode database:

-   White_space
+   ASCII_Hex_Digit
    Bidi_Control
-   Join_Control
    Dash
-   Hyphen
-   Quotation_Mark
-   Other_Math
-   Hex_Digit
-   ASCII_Hex_Digit
-   Other_Alphabetic
-   Ideographic
    Diacritic
    Extender
+   Hex_Digit
+   Hyphen
+   Ideographic
+   Join_Control
+   Noncharacter_Code_Point
+   Other_Alphabetic
    Other_Lowercase
+   Other_Math
    Other_Uppercase
-   Noncharacter_Code_Point
+   Quotation_Mark
+   White_Space

and further derived properties:

@@ -359,129 +357,137 @@ and further derived properties:
   Any       Any character
   Assigned  Any non-Cn character
   Common    Any character (or unassigned code point)
-            not explicitly assigned to a script.
+            not explicitly assigned to a script

=head2 Blocks

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
-former concept is closer to natural languages, while the latter
+scripts concept is closer to natural languages, while the blocks
concept is more an artificial grouping based on groups of 256 Unicode
characters. For example, the C<Latin> script contains letters from
-many blocks, but it does not contain all the characters from those
-blocks, it does not for example contain digits.
+many blocks. On the other hand, the C<Latin> script does not contain
+all the characters from those blocks. It does not, for example, contain
+digits because digits are shared across many scripts. Digits and
+other similar groups, like punctuation, are in a category called
+C<Common>.
+
+For more about scripts, see the UTR #24:

-For more about scripts see the UTR #24:
-http://www.unicode.org/unicode/reports/tr24/
-For more about blocks see
-http://www.unicode.org/Public/UNIDATA/Blocks.txt
+  http://www.unicode.org/unicode/reports/tr24/
+
+For more about blocks, see:
+
+  http://www.unicode.org/Public/UNIDATA/Blocks.txt

Because there are overlaps in naming (there are, for example, both a
script called C<Katakana> and a block called C<Katakana>), the block
version has C<Block> appended to its name, C<\p{InKatakanaBlock}>.
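One concrete illustration of the script/block difference: the Katakana-Hiragana prolonged sound mark lives in the Katakana block, but its script is Common, since Hiragana text uses it too. The explicit C<Script=> and C<Block=> spellings below are the modern equivalents of the names used in this document:

```perl
use strict;
use warnings;

my $mark = "\x{30FC}";   # KATAKANA-HIRAGANA PROLONGED SOUND MARK

# In the Katakana *block* (U+30A0..U+30FF) ...
print "in the Katakana block\n"    if $mark =~ /\p{Block=Katakana}/;

# ... but its *script* is Common, not Katakana.
print "script is not Katakana\n"   if $mark !~ /\p{Script=Katakana}/;
print "script is Common\n"         if $mark =~ /\p{Script=Common}/;
```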
Notice that this definition was introduced in Perl 5.8.0: in Perl -5.6.0 only the blocks were used; in Perl 5.8.0 scripts became the +5.6 only the blocks were used; in Perl 5.8.0 scripts became the preferential Unicode character class definition; this meant that the definitions of some character classes changed (the ones in the below list that have the C appended). - BasicLatin - Latin1Supplement - LatinExtendedA - LatinExtendedB - IPAExtensions - SpacingModifierLetters - CombiningDiacriticalMarks - GreekBlock - CyrillicBlock - ArmenianBlock - HebrewBlock - ArabicBlock - SyriacBlock - ThaanaBlock - DevanagariBlock - BengaliBlock - GurmukhiBlock - GujaratiBlock - OriyaBlock - TamilBlock - TeluguBlock - KannadaBlock - MalayalamBlock - SinhalaBlock - ThaiBlock - LaoBlock - TibetanBlock - MyanmarBlock - GeorgianBlock - HangulJamo - EthiopicBlock - CherokeeBlock - UnifiedCanadianAboriginalSyllabics - OghamBlock - RunicBlock - KhmerBlock - MongolianBlock - LatinExtendedAdditional - GreekExtended - GeneralPunctuation - SuperscriptsandSubscripts - CurrencySymbols - CombiningMarksforSymbols - LetterlikeSymbols - NumberForms + Alphabetic Presentation Forms + Arabic Block + Arabic Presentation Forms-A + Arabic Presentation Forms-B + Armenian Block Arrows - MathematicalOperators - MiscellaneousTechnical - ControlPictures - OpticalCharacterRecognition - EnclosedAlphanumerics - BoxDrawing - BlockElements - GeometricShapes - MiscellaneousSymbols + Basic Latin + Bengali Block + Block Elements + Bopomofo Block + Bopomofo Extended + Box Drawing + Braille Patterns + Byzantine Musical Symbols + CJK Compatibility + CJK Compatibility Forms + CJK Compatibility Ideographs + CJK Compatibility Ideographs Supplement + CJK Radicals Supplement + CJK Symbols and Punctuation + CJK Unified Ideographs + CJK Unified Ideographs Extension A + CJK Unified Ideographs Extension B + Cherokee Block + Combining Diacritical Marks + Combining Half Marks + Combining Marks for Symbols + Control Pictures + 
Currency Symbols + Cyrillic Block + Deseret Block + Devanagari Block Dingbats - BraillePatterns - CJKRadicalsSupplement - KangxiRadicals - IdeographicDescriptionCharacters - CJKSymbolsandPunctuation - HiraganaBlock - KatakanaBlock - BopomofoBlock - HangulCompatibilityJamo + Enclosed Alphanumerics + Enclosed CJK Letters and Months + Ethiopic Block + General Punctuation + Geometric Shapes + Georgian Block + Gothic Block + Greek Block + Greek Extended + Gujarati Block + Gurmukhi Block + Halfwidth and Fullwidth Forms + Hangul Compatibility Jamo + Hangul Jamo + Hangul Syllables + Hebrew Block + High Private Use Surrogates + High Surrogates + Hiragana Block + IPA Extensions + Ideographic Description Characters Kanbun - BopomofoExtended - EnclosedCJKLettersandMonths - CJKCompatibility - CJKUnifiedIdeographsExtensionA - CJKUnifiedIdeographs - YiSyllables - YiRadicals - HangulSyllables - HighSurrogates - HighPrivateUseSurrogates - LowSurrogates - PrivateUse - CJKCompatibilityIdeographs - AlphabeticPresentationForms - ArabicPresentationFormsA - CombiningHalfMarks - CJKCompatibilityForms - SmallFormVariants - ArabicPresentationFormsB + Kangxi Radicals + Kannada Block + Katakana Block + Khmer Block + Lao Block + Latin 1 Supplement + Latin Extended Additional + Latin Extended-A + Latin Extended-B + Letterlike Symbols + Low Surrogates + Malayalam Block + Mathematical Alphanumeric Symbols + Mathematical Operators + Miscellaneous Symbols + Miscellaneous Technical + Mongolian Block + Musical Symbols + Myanmar Block + Number Forms + Ogham Block + Old Italic Block + Optical Character Recognition + Oriya Block + Private Use + Runic Block + Sinhala Block + Small Form Variants + Spacing Modifier Letters Specials - HalfwidthandFullwidthForms - OldItalicBlock - GothicBlock - DeseretBlock - ByzantineMusicalSymbols - MusicalSymbols - MathematicalAlphanumericSymbols - CJKUnifiedIdeographsExtensionB - CJKCompatibilityIdeographsSupplement + Superscripts and Subscripts + Syriac Block Tags + 
Tamil Block + Telugu Block + Thaana Block + Thai Block + Tibetan Block + Unified Canadian Aboriginal Syllabics + Yi Radicals + Yi Syllables + +=over 4 =item * @@ -501,10 +507,11 @@ pack('C0', ...). =item * Case translation operators use the Unicode case translation tables -when provided character input. Note that C translates to -uppercase, while C translates to titlecase (for languages -that make the distinction). Naturally the corresponding backslash -sequences have the same semantics. +when provided character input. Note that C (also known as C<\U> +in doublequoted strings) translates to uppercase, while C +(also known as C<\u> in doublequoted strings) translates to titlecase +(for languages that make the distinction). Naturally the +corresponding backslash sequences have the same semantics. =item * @@ -548,15 +555,37 @@ wide bit complement. =item * -lc(), uc(), lcfirst(), and ucfirst() work only for some of the -simplest cases, where the mapping goes from a single Unicode character -to another single Unicode character, and where the mapping does not -depend on surrounding characters, or on locales. More complex cases, -where for example one character maps into several, are not yet -implemented. See the Unicode Technical Report #21, Case Mappings, -for more details. The Unicode::UCD module (part of Perl since 5.8.0) -casespec() and casefold() interfaces supply information about the more -complex cases. +lc(), uc(), lcfirst(), and ucfirst() work for the following cases: + +=over 8 + +=item * + +the case mapping is from a single Unicode character to another +single Unicode character + +=item * + +the case mapping is from a single Unicode character to more +than one Unicode character + +=back + +What doesn't yet work are the following cases: + +=over 8 + +=item * + +the "final sigma" (Greek) + +=item * + +anything to with locales (Lithuanian, Turkish, Azeri) + +=back + +See the Unicode Technical Report #21, Case Mappings, for more details. 
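Both kinds of working mapping can be sketched with the Latin dz digraph and the sharp s (C<use feature 'unicode_strings'> is a later addition, assumed here so that C<uc> applies Unicode rules to the 8-bit sharp s):

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # Unicode semantics for 8-bit strings

my $dz = "\x{01F3}";                          # LATIN SMALL LETTER DZ
printf "uc:      U+%04X\n", ord uc $dz;       # U+01F1  (DZ, uppercase)
printf "ucfirst: U+%04X\n", ord ucfirst $dz;  # U+01F2  (Dz, titlecase)

# A one-to-many mapping: one character uppercases to two.
my $up = uc "\x{DF}";                         # LATIN SMALL LETTER SHARP S
print "sharp s uppercases to $up\n";          # SS
```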
=item * @@ -601,19 +630,39 @@ Level 1 - Basic Unicode Support 2.2 Categories - done [3][4] 2.3 Subtraction - MISSING [5][6] 2.4 Simple Word Boundaries - done [7] - 2.5 Simple Loose Matches - MISSING [8] + 2.5 Simple Loose Matches - done [8] 2.6 End of Line - MISSING [9][10] [ 1] \x{...} [ 2] \N{...} [ 3] . \p{Is...} \P{Is...} - [ 4] now scripts (see UTR#24 Script Names) in addition to blocks + [ 4] now scripts (see UTR#24 Script Names) in addition to blocks [ 5] have negation - [ 6] can use look-ahead to emulate subtracion + [ 6] can use look-ahead to emulate subtraction (*) [ 7] include Letters in word characters - [ 8] see UTR#21 Case Mappings + [ 8] see UTR#21 Case Mappings: Perl implements 1:1 mappings [ 9] see UTR#13 Unicode Newline Guidelines - [10] should do ^ and $ also on \x{2028} and \x{2029} + [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}) + (should also affect <>, $., and script line numbers) + +(*) You can mimic class subtraction using lookahead. +For example, what TR18 might write as + + [{Greek}-[{UNASSIGNED}]] + +in Perl can be written as: + + (?!\p{UNASSIGNED})\p{GreekBlock} + (?=\p{ASSIGNED})\p{GreekBlock} + +But in this particular example, you probably really want + + \p{Greek} + +which will match assigned characters known to be part of the Greek script. + +In other words: the matched character must not be a non-assigned +character, but it must be in the block of modern Greek characters. =item * @@ -646,8 +695,235 @@ Level 3 - Locale-Sensitive Support =back +=head2 Unicode Encodings + +Unicode characters are assigned to I which are abstract +numbers. To use these numbers various encodings are needed. + +=over 4 + +=item UTF-8 + +UTF-8 is the encoding used internally by Perl. UTF-8 is a variable +length (1 to 6 bytes, current character allocations require 4 bytes), +byteorder independent encoding. For ASCII, UTF-8 is transparent +(and we really do mean 7-bit ASCII, not any 8-bit encoding). + +The following table is from Unicode 3.1. 
+
+   Code Points             1st Byte  2nd Byte  3rd Byte  4th Byte
+
+   U+0000..U+007F          00..7F
+   U+0080..U+07FF          C2..DF    80..BF
+   U+0800..U+0FFF          E0        A0..BF    80..BF
+   U+1000..U+FFFF          E1..EF    80..BF    80..BF
+   U+10000..U+3FFFF        F0        90..BF    80..BF    80..BF
+   U+40000..U+FFFFF        F1..F3    80..BF    80..BF    80..BF
+   U+100000..U+10FFFF      F4        80..8F    80..BF    80..BF
+
+Or, another way to look at it, as bits:
+
+   Code Points                    1st Byte   2nd Byte  3rd Byte  4th Byte
+
+                       0aaaaaaa   0aaaaaaa
+               00000bbbbbaaaaaa   110bbbbb   10aaaaaa
+               ccccbbbbbbaaaaaa   1110cccc   10bbbbbb  10aaaaaa
+     00000dddccccccbbbbbbaaaaaa   11110ddd   10cccccc  10bbbbbb  10aaaaaa
+
+As you can see, the continuation bytes all begin with C<10>, and the
+leading bits of the start byte tell how many bytes there are in the
+encoded character.
+
+=item UTF-EBCDIC
+
+Like UTF-8, but EBCDIC-safe, as UTF-8 is ASCII-safe.
+
+=item UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
+
+(The following items are mostly for reference, Perl doesn't
+use them internally.)
+
+UTF-16 is a 2 or 4 byte encoding. The Unicode code points
+0x0000..0xFFFF are stored in a single 16-bit unit, and the code points
+0x010000..0x10FFFF in two 16-bit units. The latter case is
+using I<surrogates>, the first 16-bit unit being the I<high
+surrogate>, and the second being the I<low surrogate>.
+
+Surrogates are code points set aside to encode the 0x010000..0x10FFFF
+range of Unicode code points in pairs of 16-bit units. The I<high
+surrogates> are the range 0xD800..0xDBFF, and the I<low surrogates>
+are the range 0xDC00..0xDFFF. The surrogate encoding is
+
+	$hi = ($uni - 0x10000) / 0x400 + 0xD800;
+	$lo = ($uni - 0x10000) % 0x400 + 0xDC00;
+
+and the decoding is
+
+	$uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
+
+If you try to generate surrogates (for example by using chr()), you
+will get an error because firstly a surrogate on its own is meaningless,
+and secondly because Perl encodes its Unicode characters in UTF-8
+(not 16-bit numbers), which makes the encoded character doubly illegal.
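The encoding and decoding formulas above, applied to a sample code point (any code point above 0xFFFF would do):

```perl
use strict;
use warnings;

my $uni = 0x10400;   # DESERET CAPITAL LETTER LONG I

# Encode: split one code point into a high/low surrogate pair.
my $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
my $lo =     ($uni - 0x10000) % 0x400  + 0xDC00;
printf "hi=U+%04X lo=U+%04X\n", $hi, $lo;   # hi=U+D801 lo=U+DC00

# Decode: recombine the pair into the original code point.
my $back = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
printf "back=U+%X\n", $back;                # back=U+10400
```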
+
+Because of the 16-bitness, UTF-16 is byteorder dependent. UTF-16
+itself can be used for in-memory computations, but if storage or
+transfer is required, either UTF-16BE (Big Endian) or UTF-16LE
+(Little Endian) must be chosen.
+
+This introduces another problem: what if you just know that your data
+is UTF-16, but you don't know which endianness? Byte Order Marks
+(BOMs) are a solution to this. A special character has been reserved
+in Unicode to function as a byte order marker: the character with the
+code point 0xFEFF is the BOM.
+
+The trick is that if you read a BOM, you will know the byte order,
+since if it was written on a big endian platform, you will read the
+bytes 0xFE 0xFF, but if it was written on a little endian platform,
+you will read the bytes 0xFF 0xFE. (And if the originating platform
+was writing in UTF-8, you will read the bytes 0xEF 0xBB 0xBF.)
+
+The way this trick works is that the character with the code point
+0xFFFE is guaranteed not to be a valid Unicode character, so the
+sequence of bytes 0xFF 0xFE is unambiguously "BOM, represented in
+little-endian format" and cannot be "0xFFFE, represented in big-endian
+format".
+
+=item UTF-32, UTF-32BE, UTF-32LE
+
+The UTF-32 family is pretty much like the UTF-16 family, except that
+the units are 32-bit, and therefore the surrogate scheme is not
+needed. The BOM signatures will be 0x00 0x00 0xFE 0xFF for BE and
+0xFF 0xFE 0x00 0x00 for LE.
+
+=item UCS-2, UCS-4
+
+Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
+encoding, UCS-4 is a 32-bit encoding. Unlike UTF-16, UCS-2
+is not extensible beyond 0xFFFF, because it does not use surrogates.
+
+=item UTF-7
+
+A seven-bit safe (non-eight-bit) encoding, useful if the
+transport/storage is not eight-bit safe. Defined by RFC 2152.
+
+=back
+
+=head2 Security Implications of Malformed UTF-8
+
+Unfortunately, the specification of UTF-8 leaves some room for
+interpretation of how many bytes of encoded output one should generate
+from one input Unicode character. Strictly speaking, one is supposed
+to always generate the shortest possible sequence of UTF-8 bytes,
+because otherwise there is potential for input buffer overflow at the
+receiving end of a UTF-8 connection. Perl always generates the shortest
+length UTF-8, and with warnings on (C<-w> or C<use warnings>) Perl will
+warn about non-shortest length UTF-8 (and other malformations, too,
+such as the surrogates, which are not real character code points).
+
+=head2 Unicode in Perl on EBCDIC
+
+The way Unicode is handled on EBCDIC platforms is still rather
+experimental. On such a platform, references to UTF-8 encoding in this
+document and elsewhere should be read as meaning UTF-EBCDIC as
+specified in Unicode Technical Report 16 unless ASCII vs EBCDIC issues
+are specifically discussed. There is no C<utfebcdic> pragma or
+":utfebcdic" layer; rather, "utf8" and ":utf8" are re-used to mean
+the platform's "natural" 8-bit encoding of Unicode. See L
+for more discussion of the issues.
+
+=head2 Using Unicode in XS
+
+If you want to handle Perl Unicode in XS extensions, you may find
+the following C APIs useful:
+
+=over 4
+
+=item *
+
+DO_UTF8(sv) returns true if the UTF8 flag is on and the bytes
+pragma is not in effect. SvUTF8(sv) returns true if the UTF8
+flag is on; the bytes pragma is ignored. Remember that the UTF8
+flag being on does not mean that there would be any characters
+of code points greater than 255 or 127 in the scalar, or that
+there even are any characters in the scalar. The UTF8 flag
+means that any characters added to the string will be encoded
+in UTF8 if the code points of the characters are greater than
+255. Not "if greater than 127", since Perl's Unicode model
+is not to use UTF-8 until it's really necessary.
+ +=item * + +uvuni_to_utf8(buf, chr) writes a Unicode character code point into a +buffer encoding the code point as UTF-8, and returns a pointer +pointing after the UTF-8 bytes. + +=item * + +utf8_to_uvuni(buf, lenp) reads UTF-8 encoded bytes from a buffer and +returns the Unicode character code point (and optionally the length of +the UTF-8 byte sequence). + +=item * + +utf8_length(s, len) returns the length of the UTF-8 encoded buffer in +characters. sv_len_utf8(sv) returns the length of the UTF-8 encoded +scalar. + +=item * + +sv_utf8_upgrade(sv) converts the string of the scalar to its UTF-8 +encoded form. sv_utf8_downgrade(sv) does the opposite (if possible). +sv_utf8_encode(sv) is like sv_utf8_upgrade but the UTF8 flag does not +get turned on. sv_utf8_decode() does the opposite of sv_utf8_encode(). + +=item * + +is_utf8_char(buf) returns true if the buffer points to valid UTF-8. + +=item * + +is_utf8_string(buf, len) returns true if the len bytes of the buffer +are valid UTF-8. + +=item * + +UTF8SKIP(buf) will return the number of bytes in the UTF-8 encoded +character in the buffer. UNISKIP(chr) will return the number of bytes +required to UTF-8-encode the Unicode character code point. + +=item * + +utf8_distance(a, b) will tell the distance in characters between the +two pointers pointing to the same UTF-8 encoded buffer. + +=item * + +utf8_hop(s, off) will return a pointer to an UTF-8 encoded buffer that +is C (positive or negative) Unicode characters displaced from the +UTF-8 buffer C. + +=item * + +pv_uni_display(dsv, spv, len, pvlim, flags) and sv_uni_display(dsv, +ssv, pvlim, flags) are useful for debug output of Unicode strings and +scalars (only for debug: they display B characters as hexadecimal +code points). + +=item * + +ibcmp_utf8(s1, u1, len1, s2, u2, len2) can be used to compare two +strings case-insensitively in Unicode. (For case-sensitive +comparisons you can just use memEQ() and memNE() as usual.) 
+
+=back
+
+For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
+in the Perl source code distribution.
+
 =head1 SEE ALSO

-L, L, L, L
+L, L, L, L, L, L,
+L, L

=cut