ext/Encode/lib/Encode/Supported.pod

   1 =head1 NAME
   2
   3 Encode::Supported -- Supported encodings by Encode
   4
   5 =head1 DESCRIPTION
   6
   7 =head2 Encoding Names
   8
   9 Encoding names are case insensitive. White space in names
  10 is ignored.  In addition an encoding may have aliases.
  11 Each encoding has one "canonical" name.  The "canonical"
  12 name is chosen from the names of the encoding by picking
  13 the first in the following sequence (with a few exceptions).
  14
  15 =over
  16
  17 =item *
  18
  19 The name used by the Perl community.  That includes 'utf8' and 'ascii'.
  20 Unlike aliases, canonical names reach the method directly, so
  21 frequently used names like 'utf8' need no alias lookup.
  22
  23 =item *
  24
  25 The MIME name as defined in IETF RFCs.  This includes all "iso-"'s.
  26
  27 =item *
  28
  29 The name in the IANA registry.
  30
  31 =item *
  32
  33 The name used by the organization that defined it.
  34
  35 =back
  36
  37 Where a I<de jure> canonical name differs from the one used by the
  38 Encode module, it is always aliased if the encoding is implemented.
  39 So you can safely tell whether a given encoding is implemented just by
  40 passing the canonical name.
  41
  42 Because of all the alias issues, and because in the general case
  43 encodings have state, "Encode" uses the encoding object internally
  44 once an operation is in progress.
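
The lookup rules above can be tried out with a minimal sketch like the
following.  It assumes only C<find_encoding> from Encode; the name
C<no-such-encoding> is made up for illustration, and the exact canonical
names printed may differ between Encode versions.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look up a few spellings of the same encoding.  find_encoding() returns
# an encoding object, or undef when the name is not recognized.
for my $name ('latin1', 'Latin1', 'ISO-8859-1', 'no-such-encoding') {
    my $enc = find_encoding($name);
    printf "%-18s => %s\n", $name,
        defined $enc ? $enc->name : '(not implemented)';
}
```

Calling C<name> on the returned object gives the canonical name, whatever
alias was used to find it.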
  45
  46 =head1 Supported Encodings
  47
  48 As of Perl 5.8.0, at least the following encodings are recognized.
  49 Note that unless otherwise specified, they are all case insensitive
  50 (via alias) and all occurrences of spaces are replaced with '-'.  In
  51 other words, "ISO 8859 1" and "iso-8859-1" are identical.
  52
  53 Encodings are categorized and implemented in several different modules,
  54 but in most cases you don't have to C<use Encode::XX> to make them
  55 available.  Encode.pm will automatically load those modules on demand.
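
As a rough illustration of on-demand loading, the sketch below uses
C<euc-jp> without ever saying C<use Encode::JP>.  The sample character is
an arbitrary choice, and the number of encodings reported depends on what
is installed.

```perl
use strict;
use warnings;
use Encode qw(encode decode);    # note: no "use Encode::JP" here

# HIRAGANA LETTER A (U+3042).  euc-jp lives in Encode::JP, which Encode
# loads the first time the name is used.
my $bytes = encode('euc-jp', "\x{3042}");
printf "euc-jp bytes: %s\n",
    join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;

my $text = decode('euc-jp', $bytes);
printf "round trip:   U+%04X\n", ord $text;

# List every encoding name Encode knows about, loaded or not.
my @all = Encode->encodings(':all');
printf "%d encodings available\n", scalar @all;
```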
  56
  57 =head2 Built-in Encodings
  58
  59 The following encodings are always available.
  60
  61   Canonical     Aliases                      Comments & References
  62   ----------------------------------------------------------------
  63   ascii         US-ascii                                    [ECMA]
  64   iso-8859-1    latin1                                       [ISO]
  65   utf8          UTF-8                                    [RFC2279]
  66   UCS-2         ucs2, iso-10646-1, UTF-16LE             [IANA, UC]
  67   UTF-16LE      UCS-2LE                                       [UC]
  68   ----------------------------------------------------------------
  69
  70 =head2 Encode::Byte -- Extended ASCII
  71
  72 Encode::Byte implements most single-byte encodings except for
  73 Symbols and EBCDIC.  The following encodings are single-byte
  74 encodings implemented as extended ASCII.  In most cases they use
  75 \x80-\xff (the upper half) to map non-ASCII characters.
  76
  77 =over 2
  78
  79 =item ISO-8859 and corresponding vendor mappings
  80
  81 Since there are so many, they are presented in table format with
  82 languages and the corresponding encoding names by vendor.  Note that
  83 the table is sorted in ISO-8859 order and that the vendor mappings
  84 are slightly different from those of ISO.  See
  85 L<http://czyborra.com/charsets/iso8859.html> for details.
  86
  87   Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
  88   ----------------------------------------------------------------
  89   N. America    (ASCII)         cp437        AdobeStandardEncoding
  90                                 cp863 (DOSCanadaF)
  91   W.  Europe    (iso-8859-1)    cp850   cp1252  MacRoman  nextstep
  92                                                          hp-roman8
  93                                 cp860 (DOSPortuguese)
  94   CE. Europe    iso-8859-2      cp852   cp1250  MacCentralEurRoman
  95                                                 MacCroatian
  96                                                 MacRomanian
  97                                                 MacRumanian
  98   Latin3(*3)    iso-8859-3
  99   Latin4(*4)    iso-8859-4
 100   Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
 101     (Also see next section)     cp866           MacUkrainian
 102   Arabic        iso-8859-6      cp864   cp1256  MacArabic
 103                                 cp1006          MacFarsi
 104   Greek         iso-8859-7      cp737   cp1253  MacGreek
 105                                 cp869 (DOSGreek2)
 106   Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
 107   Turkish       iso-8859-9      cp857   cp1254  MacTurkish
 108   Nordics       iso-8859-10     cp865
 109                                 cp861           MacIcelandic
 110                                                 MacSami
 111   Thai          iso-8859-11     cp874           MacThai
 112   (iso-8859-12 is nonexistent. Reserved for Indics?)
 113   Baltics       iso-8859-13     cp775   cp1257
 114   Celtics       iso-8859-14
 115   Latin9(*15)   iso-8859-15
 116   Latin10       iso-8859-16
 117   Vietnamese    viscii                  cp1258  MacVietnamese
 118   ----------------------------------------------------------------
 119
 120   (*3) Esperanto, Maltese, and Turkish. Turkish is now on 8859-9
 121   (*4) Baltics.  Now on 8859-10
 122   (*15) Nicknamed Latin0; the Euro sign as well as French and Finnish
 123         letters that are missing from 8859-1 are added.
 124
 125 All cp* are also available as ibm-*, ms-*, and windows-* .  See also
 126 L<http://czyborra.com/charsets/codepages.html>.
 127
 128 Macintosh encodings don't seem to be registered with such bodies as
 129 IANA.  "Canonical" names in Encode are based upon Apple's Tech Note
 130 1150.  See L<http://developer.apple.com/technotes/tn/tn1150.html>
 131 for details.
 132
 133 =item KOI8 - De Facto Standard for Cyrillic world
 134
 135 Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
 136 popular on the Net.  L<Encode> comes with the following KOI
 137 charsets.  For gory details, see
 138 L<http://czyborra.com/charsets/cyrillic.html>.
 139
 140   ----------------------------------------------------------------
 141   koi8-f
 142   koi8-r cp878                                           [RFC1489]
 143   koi8-u                                                 [RFC2319]
 144
 145 =item gsm0338 - Hentai Latin 1
 146
 147 GSM0338 is for GSM handsets.  Though it shares alphanumerics with ASCII,
 148 control character ranges and other parts are mapped very differently,
 149 presumably to store Cyrillics.  This one is also covered in
 150 Encode::Byte even though it does not conform to extended ASCII.
 151
 152 =back
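
To see how the upper half differs between these single-byte encodings, a
small sketch along these lines may help.  The bytes 0xC1 and 0x80 are
arbitrary sample values chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same upper-half byte means different things in different encodings.
my $byte = "\xC1";
for my $name (qw(iso-8859-1 iso-8859-5 koi8-r cp1251)) {
    my $char = decode($name, $byte);
    printf "%-10s 0xC1 => U+%04X\n", $name, ord $char;
}

# ISO vs. vendor mapping: 0x80 is a C1 control in iso-8859-1
# but the euro sign in cp1252.
printf "iso-8859-1 0x80 => U+%04X\n", ord decode('iso-8859-1', "\x80");
printf "cp1252     0x80 => U+%04X\n", ord decode('cp1252',     "\x80");
```

The ASCII range (\x00-\x7f) decodes identically in all of these, which is
what "extended ASCII" means here.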
 153
 154 =head2 The CJK: Chinese, Japanese, Korean (Multibyte)
 155
 156 Note that Vietnamese is listed above.  Also read "Encoding vs. Charset"
 157 below.  Also note that these are implemented in a distinct module per
 158 language, due to size concerns.  Please also refer to their
 159 respective documentation pages.
 160
 161 =over 4
 162
 163 =item Encode::CN -- Continental China
 164
 165   Standard      DOS/Win Macintosh       Comment
 166   ----------------------------------------------------------------
 167   euc-cn                MacChineseSimp  GB2312 is aliased to this
 168   (gbk)         cp936                   GBK is aliased to this
 169   gb12345-raw                           GB12345 as is
 170   gb2312-raw                            GB2312 as is
 171   hz
 172   iso-ir-165
 173   ----------------------------------------------------------------
 174
 175 =item Encode::JP -- Japan
 176
 177   Standard      DOS/Win Macintosh       Comment/Reference
 178   ----------------------------------------------------------------
 179   euc-jp
 180   shiftjis      cp932   macJapanese
 181   7bit-jis        jis
 182   euc-jp          ujis
 183   iso-2022-jp                           [RFC1468]
 184   iso-2022-jp-1                         [RFC2237]
 185   ----------------------------------------------------------------
 186
 187 =item Encode::KR -- Korea
 188
 189   ----------------------------------------------------------------
 190   euc-kr                MacKorean                        [RFC1557]
 191                 cp949                   ks_c_5601-1987 is an alias
 192                                         thereof.
 193   iso-2022-kr                                            [RFC1557]
 194   johab                                  [KS X 1001:1998, Annex 3]
 195   ksc5601-raw                           KSC5601 as is
 196   ----------------------------------------------------------------
 197
 198 =item Encode::TW -- Taiwan
 199
 200   ----------------------------------------------------------------
 201   big5          cp950   MacChineseTrad
 202   big5-hkscs
 203   ----------------------------------------------------------------
 204
 205 =item Encode::HanExtra -- More Chinese via CPAN
 206
 207 Due to size concerns, additional Chinese encodings below are
 208 distributed separately on CPAN, under the name Encode::HanExtra.
 209
 210   ----------------------------------------------------------------
 211   gb18030
 212   euc-tw
 213   big5plus
 214   ----------------------------------------------------------------
 215
 216 =back
 217
 218 =head2 Miscellaneous encodings
 219
 220 =over 4
 221
 222 =item Encode::EBCDIC
 223
 224 See L<perlebcdic> for details.
 225
 226   ----------------------------------------------------------------
 227   cp37
 228   cp500
 229   cp875
 230   cp1026
 231   cp1047
 232   posix-bc
 233   ----------------------------------------------------------------
 234
 235 =item Encode::Symbols
 236
 237 For symbols and dingbats.
 238
 239   ----------------------------------------------------------------
 240   symbol
 241   dingbats
 242   MacDingbats
 243   AdobeZdingbat
 244   AdobeSymbol
 245   ----------------------------------------------------------------
 246
 247 =back
 248
 249 =head1 Unsupported encodings
 250
 251 The following are not supported as yet, some because they are rarely
 252 used, some because of technical difficulty.  They may be supported by
 253 external modules via CPAN in the future, however.
 254
 255 =over 4
 256
 257 =item   ISO-2022-JP-2 [RFC1554]
 258
 259 Not very popular yet.  Needs the Unicode Database or equivalent to
 260 implement encode() (because it includes JIS X 0208/0212, KSC5601, and
 261 GB2312 simultaneously, whose code points in Unicode overlap.  So you
 262 need to look up the database to determine which character set a given
 263 Unicode character should belong to).
 264
 265 =item   ISO-2022-CN [RFC1922]
 266
 267 Not very popular.  Needs CNS 11643-1 and 2, which are not available in
 268 this module.  CNS 11643 is supported (via euc-tw) in
 269 Encode::HanExtra.  Autrijus may add support for this encoding in his
 270 module in the future.
 271
 272 =item various HP-UX encodings
 273
 274 The following are unsupported due to the lack of mapping data.
 275
 276   '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
 277   '15' - japanese15, korean15, and  roi15
 278
 279 =item Cyrillic encoding ISO-IR-111
 280
 281 Anton doubts its usefulness.
 282
 283 =item ISO-8859-8-1 [Hebrew]
 284
 285 None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and
 286 MacHebrew are supported only because there were mappings
 287 available at L<http://www.unicode.org/>).  Contributions welcome.
 288
 289 =item Thai encoding TCVN
 290
 291 Ditto.
 292
 293 =item Vietnamese encodings VPS
 294
 295 Though Jungshik has reported that Mozilla supports this encoding, it was too late for us to add it.  It may be supported in the future via a separate module.  See
 296 L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf> and
 297 L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
 298 if you are interested in helping us.
 299
 300 =item various Mac encodings
 301
 302 The following are unsupported due to the lack of mapping data.
 303
 304   MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
 305   MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
 306   MacLaotian,   MacMalayalam, MacMongolian, MacOriya
 307   MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
 308   MacVietnamese
 309
 310 The rest, which are already available, are based upon the vendor mappings at
 311 L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
 312
 313 =item (Mac) Indic encodings
 314
 315 The maps for the following are available at L<http://www.unicode.org/>
 316 but remain unsupported because those encodings need an algorithmic
 317 approach, which is not supported by F<enc2xs>.
 318
 319   MacDevanagari
 320   MacGurmukhi
 321   MacGujarati
 322
 323 For details, please see C<Unicode mapping issues and notes:> at
 324 L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
 325
 326 I believe this issue is prevalent not only in Mac Indic encodings but
 327 also in other Indic encodings, but those mentioned were the only Indic
 328 encoding maps that I could find at L<http://www.unicode.org/> .
 329
 330 =back
 331
 332 =head1 Encoding vs. Charset -- terminology
 333
 334 We are used to using the terms (character) I<encoding> and I<character set>
 335 interchangeably.  But just as confusing the terms byte and character is
 336 dangerous and the two should be differentiated when needed, we need to
 337 differentiate I<encoding> and I<character set>.
 338
 339 To understand that, let's follow how we make computers grok our characters.
 340
 341 =over 4
 342
 343 =item *
 344
 345 First we start with which characters to include.  We call this
 346 collection of characters I<character repertoire>.
 347
 348 =item *
 349
 350 Then we have to give each character a unique ID so your computer can
 351 tell the difference between 'a' and 'A'.  This itemized character
 352 repertoire is now a I<character set>.
 353
 354 =item *
 355
 356 If your computer can grok the character set without further
 357 processing, you can go ahead and use it.  This is called a I<coded
 358 character set> (CCS) or I<raw character encoding>.  ASCII is used this
 359 way in most cases.
 360
 361 =item *
 362
 363 But in many cases, especially with multi-byte CJK encodings, you have to
 364 tweak a little more.  Your network connection may not accept any data
 365 with the Most Significant Bit set, and your computer may not be able to
 366 tell if a given byte is a whole character or just half of it.  So you
 367 have to I<encode> the character set to use it.
 368
 369 A I<character encoding scheme> (CES) determines how to encode a given
 370 character set, or a set of multiple character sets.  7bit ISO-2022 is
 371 an example of a CES.  You switch between character sets via I<escape
 372 sequences>.
 373
 374 =back
 375
 376 Technically, or mathematically speaking, a character set encoded in
 377 such a CES that maps character by character may form a CCS.  EUC is such
 378 an example.  The CES of EUC is as follows:
 379
 380 =over 4
 381
 382 =item *
 383
 384 Map ASCII unchanged.
 385
 386 =item *
 387
 388 Map a character set that consists of 94 or 96 to the power of N
 389 members by adding 0x80 to each byte.
 390
 391 =item *
 392
 393 You can also use 0x8e and 0x8f to indicate that the following sequence of
 394 characters belongs to yet another character set.  0x80 is added to each
 395 following byte.
 396
 397 =back
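
The "add 0x80 to each byte" rule can be checked by hand with a sketch like
this one.  It assumes the C<gb2312-raw> and C<euc-cn> encodings listed
under Encode::CN; the helper C<hex_bytes> and the sample character U+4E2D
are choices made only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', $_ } unpack 'C*', shift }

my $char = "\x{4E2D}";    # a common Chinese ideograph

# Raw GB2312 code (the CCS): two 7-bit bytes.
my $raw = encode('gb2312-raw', $char);

# Apply the EUC rule by hand: add 0x80 to each byte.
my $by_hand = pack 'C*', map { $_ + 0x80 } unpack 'C*', $raw;

# Let Encode apply the same CES through the euc-cn encoding.
my $euc = encode('euc-cn', $char);

printf "raw:     %s\n", hex_bytes($raw);
printf "by hand: %s\n", hex_bytes($by_hand);
printf "euc-cn:  %s\n", hex_bytes($euc);
```

If the rule holds, the "by hand" and "euc-cn" lines show the same bytes.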
 398
 399 By carefully looking at the encoded byte sequence, you may find that the
 400 byte sequence forms a unique number.  In that sense EUC is a CCS
 401 generated by the CES above from up to four CCSs (complicated?).  UTF-8
 402 falls into this category.  See L<perlunicode/"UTF-8"> to find how
 403 UTF-8 maps Unicode to a byte sequence.
 404
 405 You may also see by now why 7bit ISO-2022 cannot form a CCS.  If
 406 you look at a byte sequence \x21\x21, you can't tell if it is two !'s
 407 or IDEOGRAPHIC SPACE.  EUC maps the latter to \xA1\xA1 so you have no
 408 trouble telling "!!" from an IDEOGRAPHIC SPACE.
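
The ambiguity can be made visible with a short sketch.  It assumes the
C<iso-2022-jp> and C<euc-jp> encodings from Encode::JP; the helper
C<show_bytes> is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show printable bytes as-is, the rest as \xNN.
sub show_bytes {
    join '', map { $_ >= 0x20 && $_ < 0x7F ? chr : sprintf '\\x%02X', $_ }
        unpack 'C*', shift;
}

my %sample = (
    'two bangs'         => '!!',
    'IDEOGRAPHIC SPACE' => "\x{3000}",
);

for my $name (qw(iso-2022-jp euc-jp)) {
    for my $label (sort keys %sample) {
        my $text  = $sample{$label};    # work on a copy
        my $bytes = encode($name, $text);
        printf "%-12s %-18s %s\n", $name, $label, show_bytes($bytes);
    }
}
```

In the 7-bit output the bytes \x21\x21 appear for both inputs, and only the
surrounding escape sequences say which character set is active.  In EUC the
two inputs produce different bytes.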
 409
 410 =head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
 411
 412 This section tries to classify the supported encodings by their
 413 applicability for information exchange over the Internet and to
 414 choose the most suitable aliases to name them in the context of
 415 such communication.
 416
 417 =over 2
 418
 419 =item *
 420
 421 To (en|de)code encodings marked as C<(*)>, you need
 422 C<Encode::HanExtra>, available from CPAN.
 423
 424 =back
 425
 426 Encoding names
 427
 428   US-ASCII    UTF-8     ISO-8859-*  KOI8-R
 429   Shift_JIS   EUC-JP  ISO-2022-JP ISO-2022-JP-1
 430   EUC-KR      Big5      GB2312
 431
 432 are registered with IANA as preferred MIME names and may
 433 be used over the Internet.
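
Since these names are the ones that show up in MIME headers, a charset
parameter can usually be handed to Encode as is.  The sketch below is only
an illustration: the header line and body bytes are made up, and the header
parsing is deliberately simplistic.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# A hypothetical header and body, as they might arrive in a mail message.
my $header = 'Content-Type: text/plain; charset="KOI8-R"';
my $body   = "\xC1\xC2";    # two KOI8-R bytes

# Pull the charset parameter out of the header (simplified parsing).
my ($charset) = $header =~ /charset="?([^";\s]+)"?/i
    or die "no charset parameter found\n";

my $enc = find_encoding($charset)
    or die "encoding '$charset' is not supported\n";

my $text = $enc->decode($body);
printf "%s (canonical: %s) => %s\n", $charset, $enc->name,
    join ' ', map { sprintf 'U+%04X', ord } split //, $text;
```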
 434
 435 C<Shift_JIS> has been made official by JIS X 0208-1997.
 436 L<Microsoft-related naming mess> gives details.
 437
 438 C<GB2312> is the IANA name for C<EUC-CN>.
 439 See L<Microsoft-related naming mess> for details.
 440
 441 C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
 442 with Encode. See L<Encode::CN -- Continental China> for details.
 443
 444   EUC-CN
 445   KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)
 446
 447 have not been registered with IANA (as of March 2002) but
 448 seem to be supported by major web browsers.
 449 IANA name for C<EUC-CN> is C<GB2312>.
 450
 451   KS_C_5601-1987
 452
 453 is heavily misused.
 454 See L<Microsoft-related naming mess> for details.
 455
 456 C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
 457 with Encode. See L<Encode::KR -- Korea> for details.
 458
 459   UTF-16
 460
 461 =for comment
 462 waiting for comments from Jungshik Shin to soften this - Anton
 463
 464 is an IANA-registered preferred MIME name
 465 but probably should be avoided as encoding for web pages due to
 466 the lack of browser support.
 467
 468   ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
 469   GBK
 470   VISCII
 471   GB 12345
 472   GB 18030 (*)  (see links below)
 473   EUC-TW   (*)
 474
 475 are totally valid encodings but not registered at IANA.
 476 The names under which they are listed here are probably the
 477 most widely-known names for these encodings and are recommended
 478 names.
 479
 480   BIG5PLUS (*)
 481
 482 is a somewhat proprietary name.
 483
 484 =head2 Microsoft-related naming mess
 485
 486 Microsoft products misuse the following names:
 487
 488 =over 2
 489
 490 =item KS_C_5601-1987
 491
 492 Microsoft extension to C<EUC-KR>.
 493
 494 Proper name: C<CP949>.
 495
 496 See
 497 L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
 498 for details.
 499
 500 Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect
 501 this common misuse.
 502 I<Raw> C<KS_C_5601-1987> encoding is available as C<ksc5601-raw>.
 503
 504 See L<Encode::KR -- Korea> for details.
 505
 506 =item GB2312
 507
 508 Microsoft extension to C<EUC-CN>.
 509
 510 Proper names: C<CP936>, C<GBK>.
 511
 512 C<GB2312> has been registered in the C<EUC-CN> meaning at
 513 IANA. This has partially repaired the situation: Microsoft's
 514 C<GB2312> has become a superset of the official C<GB2312>.
 515
 516 Encode aliases C<GB2312> to C<euc-cn> in full agreement with
 517 IANA registration. C<cp936> is supported separately.
 518 I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
 519
 520 See L<Encode::CN -- Continental China> for details.
 521
 522 =item Big5
 523
 524 Microsoft extension to C<Big5>.
 525
 526 Proper name: C<CP950>.
 527
 528 Encode separately supports C<Big5> and C<cp950>.
 529
 530 =item Shift_JIS
 531
 532 Microsoft's understanding of C<Shift_JIS>.
 533
 534 JIS has not endorsed the full Microsoft standard however.
 535 The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
 536 subsets, while Microsoft has always been meaning C<Shift_JIS> to
 537 encode a wider character repertoire.
 538
 539 As the historical predecessor, Microsoft's variant
 540 probably has more right to the name, though it may be objected
 541 that Microsoft shouldn't have used JIS as part of the name
 542 in the first place.
 543
 544 Unambiguous name: C<CP932>.
 545
 546 Encode separately supports C<Shift_JIS> and C<cp932>.
 547
 548 =back
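
How Encode resolves these contested names can be inspected with a sketch
like the following.  The canonical names printed depend on the Encode
version installed, and the byte pair 0x81 0x60 is just one sample where the
two Japanese tables are known to disagree.

```perl
use strict;
use warnings;
use Encode qw(find_encoding decode);

# Ask Encode which encoding each contested name resolves to.
for my $name (qw(KS_C_5601-1987 GB2312 GBK Big5 cp950 Shift_JIS cp932)) {
    my $enc = find_encoding($name);
    printf "%-15s => %s\n", $name, defined $enc ? $enc->name : '(not found)';
}

# Shift_JIS and cp932 are distinct tables: the same two bytes can
# decode to different Unicode characters.
my $bytes = "\x81\x60";
printf "shiftjis 81 60 => U+%04X\n", ord decode('shiftjis', $bytes);
printf "cp932    81 60 => U+%04X\n", ord decode('cp932',    $bytes);
```

When exchanging data with Microsoft products, naming C<cp932>, C<cp936>,
C<cp949> or C<cp950> explicitly avoids relying on these aliases.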
 549
 550 =head1 Glossary
 551
 552 =over 2
 553
 554 =item character repertoire
 555
 556 A collection of unique characters.  A I<character set> in the
 557 strictest sense.  At this stage characters are not numbered.
 558
 559 =item coded character set (CCS)
 560
 561 A character set that is mapped in a way computers can use directly.
 562 Many character encodings including EUC fall in this category.
 563
 564 =item character encoding scheme (CES)
 565
 566 An algorithm to map a character set to a byte sequence.  You don't
 567 have to be able to tell which character set a given byte sequence
 568 belongs to.  7-bit ISO-2022 is a CES but it cannot be a CCS.  EUC is an
 569 example of being both a CCS and a CES.
 570
 571 =item EUC
 572
 573 Extended Unix Code.  See ISO-2022.
 574
 575 =item ISO-2022
 576
 577 A CES that was carefully designed to coexist with ASCII.  There are a
 578 7-bit version and an 8-bit version.  The 8-bit version can form a CCS.
 579 EUC and ISO-8859 are two examples thereof.
 580
 581 =item UCS
 582
 583 Short for I<Universal Character Set>.  When you say just UCS, it means
 584 I<Unicode>.
 585
 586 =item UCS-2
 587
 588 ISO/IEC 10646 encoding form: Universal Character Set coded in two
 589 octets.
 590
 591 =item Unicode
 592
 593 A character set that aims to include all character
 594 repertoires of the world.  Many character sets in various national as
 595 well as industrial standards are therefore subsets thereof.
 596
 597 =item UTF
 598
 599 Short for I<Unicode Transformation Format>.  Determines how to map a
 600 Unicode character into a byte sequence.
 601
 602 =item UTF-16
 603
 604 A UTF in 16-bit encoding.  Can be either big endian or little
 605 endian.  The big endian version is called UTF-16BE and the little endian
 606 version is UTF-16LE.
 607
 608 =back
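
The byte order difference between the UTF-16 variants in the glossary can
be seen with a small sketch.  The helper C<hex_bytes> is made up for this
example, and the letter 'A' is an arbitrary sample.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', $_ } unpack 'C*', shift }

my $char = 'A';    # U+0041

# UTF-16BE and UTF-16LE fix the byte order; plain UTF-16 announces
# the byte order with a leading byte order mark.
for my $name (qw(UTF-16BE UTF-16LE UTF-16)) {
    printf "%-8s %s\n", $name, hex_bytes(encode($name, $char));
}
```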
 609
 610 =head1 See Also
 611
 612 L<Encode>,
 613 L<Encode::Byte>,
 614 L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
 615 L<Encode::EBCDIC>, L<Encode::Symbol>
 616
 617 =head1 References
 618
 619 =over 2
 620
 621 =item ECMA
 622
 623 European Computer Manufacturers Association
 624 L<http://www.ecma.ch>
 625
 626 =over 2
 627
 628 =item ECMA-035 (eq C<ISO-2022>)
 629
 630 L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
 631
 632 The very specification of ISO-2022 is available from the link above.
 633
 634 =back
 635
 636 =item IANA
 637
 638 Internet Assigned Numbers Authority
 639 L<http://www.iana.org/>
 640
 641 =over 2
 642
 643 =item Assigned Charset Names by IANA
 644
 645 L<http://www.iana.org/assignments/character-sets>
 646
 647 Most of the C<canonical names> in Encode derive from this list,
 648 so you can directly apply the string you have extracted from the MIME
 649 headers of mail and web pages.
 650
 651 =back
 652
 653 =item ISO
 654
 655 International Organization for Standardization
 656 L<http://www.iso.ch/>
 657
 658 =item RFC
 659
 660 Request For Comments -- need I say more?
 661 L<http://www.rfc.net/>
 662
 663 =item UC
 664
 665 Unicode Consortium
 666 L<http://www.unicode.org/>
 667
 668 =over 2
 669
 670 =item Unicode Glossary
 671
 672 L<http://www.unicode.org/glossary/>
 673
 674 The glossary of this document is based upon this site.
 675
 676 =back
 677
 678 =back
 679
 680 =head2 Other Notable Sites
 681
 682 =over 2
 683
 684 =item czyborra.com
 685
 686 L<http://czyborra.com/>
 687
 688 Contains a lot of useful information, especially gory details of ISO
 689 vs. vendor mappings.
 690
 691 =item CJK.inf
 692
 693 L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
 694
 695 Somewhat obsolete (last update in 1996), but still useful.  Also try
 696
 697 L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
 698
 699 You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>.
 700
 701 =back
 702
 703 =cut
 704
 705 I could not find this page because the hostname doesn't resolve!
 706
 707  Brief description for most of the mentioned CJK encodings
 708 L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>