=head1 SYNOPSIS
+ use encoding "greek"; # Perl like Greek to you?
use encoding "euc-jp"; # Jperl!
- # or you can even do this if your shell supports euc-jp
+ # or you can even do this if your shell supports your native encoding
- > perl -Mencoding=euc-jp -e '...'
+ perl -Mencoding=latin2 -e '...' # Feeling centrally European?
+ perl -Mencoding=euc-ko -e '...'
# or from the shebang line
- #!/your/path/to/perl -Mencoding=euc-jp
+ #!/your/path/to/perl -Mencoding="8859-6" # Arabian Nights
+ #!/your/path/to/perl -Mencoding=euc-tw
# more control
- # A simple euc-jp => utf-8 converter
- use encoding "euc-jp", STDOUT => "utf8"; while(<>){print};
+ # A simple euc-cn => utf-8 converter
+ use encoding "euc-cn", STDOUT => "utf8"; while(<>){print};
# "no encoding;" supported (but not scoped!)
no encoding;
=head1 ABSTRACT
-Perl 5.6.0 has introduced Unicode support. You could apply
-C<substr()> and regexes even to complex CJK characters -- so long as
-the script was written in UTF-8. But back then text editors that
-support UTF-8 was still rare and many users rather chose to writer
-scripts in legacy encodings, given up whole new feature of Perl 5.6.
+Let's start with a bit of history: Perl 5.6.0 introduced Unicode
+support. You could apply C<substr()> and regexes even to complex CJK
+characters -- so long as the script was written in UTF-8. But back
+then text editors that supported UTF-8 were still rare and many users
+chose instead to write scripts in legacy encodings, giving up the
+whole new feature of Perl 5.6.
-With B<encoding> pragma, you can write your script in any encoding you like
-(so long as the C<Encode> module supports it) and still enjoy Unicode
-support. You can write a code in EUC-JP as follows;
+Rewind to the future: starting from Perl 5.8.0, with the B<encoding>
+pragma you can write your script in any encoding you like (so long
+as the C<Encode> module supports it) and still enjoy Unicode support.
+You can write code in EUC-JP as follows:
my $Rakuda = "\xF1\xD1\xF1\xCC"; # Camel in Kanji
#<-char-><-char-> # 4 octets
s/\bCamel\b/$Rakuda/;
And with C<use encoding "euc-jp"> in effect, it is the same thing as
-the code in UTF-8 as follow.
+the code in UTF-8:
my $Rakuda = "\x{99F1}\x{99DD}"; # two Unicode Characters
s/\bCamel\b/$Rakuda/;
-The B<encoding> pragma also modifies the file handle disciplines of
+The B<encoding> pragma also modifies the filehandle disciplines of
STDIN and STDOUT to the specified encoding. Therefore,
use encoding "euc-jp";
$message =~ s/\bCamel\b/$Rakuda/;
print $message;
-Will print "\xF1\xD1\xF1\xCC is the symbol of perl.\n", not
-"\x{99F1}\x{99DD} is the symbol of perl.\n".
+Will print "\xF1\xD1\xF1\xCC is the symbol of perl.\n",
+not "\x{99F1}\x{99DD} is the symbol of perl.\n".
-You can override this by giving extra arguments. See below.
+You can override this by giving extra arguments; see below.
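
What the pragma does to string literals and to STDOUT can be approximated by hand with the C<Encode> module. The following is only a minimal sketch of the idea, not the pragma's actual implementation; the variable names C<$octets>, C<$chars> and C<$out> are illustrative assumptions.

```perl
use strict;
use warnings;
use Encode ();

# The four EUC-JP octets from the example above.
my $octets = "\xF1\xD1\xF1\xCC";

# Decoding turns the octets into Perl characters, roughly what
# 'use encoding "euc-jp"' arranges for string literals.
my $chars = Encode::decode("euc-jp", $octets);
printf "%d octets, %d characters\n", length($octets), length($chars);

# Encoding on the way out corresponds to the ":encoding(euc-jp)"
# discipline on STDOUT: the original octets come back.
my $out = Encode::encode("euc-jp", $chars);
print(($out eq $octets) ? "round trip ok\n" : "round trip failed\n");
```

Doing the conversion explicitly like this keeps the encoding visible at each boundary, at the cost of more typing than the pragma.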
=head1 USAGE
=item use encoding [I<ENCNAME>] ;
-Sets the script encoding to I<ENCNAME> and file handle disciplines of
-STDIN, STDOUT are set to ":encoding(I<ENCNAME>)". Note STDERR will not
-be changed.
+Sets the script encoding to I<ENCNAME> and sets the filehandle
+disciplines of STDIN and STDOUT to ":encoding(I<ENCNAME>)". Note that
+STDERR will not be changed.
If no encoding is specified, the environment variable L<PERL_ENCODING>
-is consulted. If no encoding can be found, C<Unknown encoding 'I<ENCNAME>'>
-error will be thrown.
+is consulted. If no encoding can be found, the error C<Unknown encoding
+'I<ENCNAME>'> will be thrown.
Note that non-STD file handles remain unaffected. Use C<use open> or
C<binmode> to change disciplines of those.
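
As a hedged illustration of setting a discipline on a non-STD filehandle, the sketch below uses C<binmode> on an in-memory filehandle. The in-memory "file" and the choice of ISO 8859-7 are assumptions made only to keep the example self-contained.

```perl
use strict;
use warnings;

# One ISO 8859-7 octet (0xDF), held in a scalar that serves as an
# in-memory "file" so the sketch needs no external data.
my $octets = "\xDF";

# Open a non-STD handle and set its discipline explicitly.
open(my $fh, "<", \$octets) or die "cannot open in-memory file: $!";
binmode($fh, ":encoding(iso-8859-7)");

# Reading through the handle now yields characters, not octets.
my $char = <$fh>;
close($fh);

printf "U+%04X\n", ord($char);
```

Since 0xDF in ISO 8859-7 is GREEK SMALL LETTER IOTA WITH TONOS, this should print C<U+03AF>.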
You can also individually set encodings of STDIN and STDOUT via
STDI<FH> =E<gt> I<ENCNAME_FH> form. In this case, you cannot omit the
-first I<ENCNAME>. C<STDI<FH> =E<gt> undef> turns IO transcoding
+first I<ENCNAME>. C<STDI<FH> =E<gt> undef> turns the IO transcoding
completely off.
=item no encoding;
Unsets the script encoding and the disciplines of STDIN, STDOUT are
-reset to ":raw".
+reset to ":raw" (the default unprocessed raw stream of bytes).
=back
The pragma is a per script, not a per block lexical. Only the last
C<use encoding> or C<no encoding> matters, and it affects B<the whole script>.
-Though <no encoding> pragma is supported and C<use encoding> can
-appear as many times as you want in a given script, the multiple use
+However, the C<no encoding> pragma is supported and C<use encoding> can
+appear as many times as you want in a given script. The multiple use
of this pragma is discouraged.
=head2 DO NOT MIX MULTIPLE ENCODINGS
"\xDF\x{100}" =~ /\x{3af}\x{100}/
-since the C<\xDF> on the left will B<not> be upgraded to C<\x{3af}>
-because of the C<\x{100}> on the left. You should not be mixing your
-legacy data and Unicode in the same string.
+since the C<\xDF> (ISO 8859-7 GREEK SMALL LETTER IOTA WITH TONOS) on
+the left will B<not> be upgraded to C<\x{3af}> (Unicode GREEK SMALL
+LETTER IOTA WITH TONOS) because of the C<\x{100}> on the left. You
+should not be mixing your legacy data and Unicode in the same string.
This pragma also affects encoding of the 0x80..0xFF code point range:
normally characters in that range are left as eight-bit bytes (unless
gets UTF-8 encoded.
After all, the best thing about this pragma is that you don't have to
-resort to \x... just to spell your name in native encoding. So feel
+resort to \x... just to spell your name in a native encoding. So feel
free to put your strings in your encoding in quotes and regexes.
-=head1 NON-ASCII Identifiers and Filter option
+=head1 Non-ASCII Identifiers and Filter option
-The magic of C<use encoding> is not applied to the names of identifiers.
-In order to make C<${"4eba"}++> ($man++, where man is a single ideograph)
-work, you still need to write your script in UTF-8 or use a source filter.
+The magic of C<use encoding> is not applied to the names of
+identifiers. In order to make C<${"4eba"}++> ($human++, where human
+is a single Han ideograph) work, you still need to write your script
+in UTF-8 or use a source filter.
In other words, the same restriction as Jperl applies.
-If you dare experiment, however, you can try Fitlter option.
+If you dare to experiment, however, you can try the Filter option.
=over 4
=back
-What does this mean? Your source code behaves as if it is written
-in UTF-8. So even if your editor only supports Shift_JIS, for
-example. You can still try examples in Chapter 15 of
-C<Programming Perl, 3rd Ed.> For instance, you can use UTF-8
-identifiers.
+What does this mean? Your source code behaves as if it is written in
+UTF-8. So even if your editor only supports Shift_JIS, for example,
+you can still try the examples in Chapter 15 of C<Programming Perl, 3rd
+Ed.>. For instance, you can use UTF-8 identifiers.
This option is significantly slower and (as of this writing) non-ASCII
identifiers are not very stable WITHOUT this option and with the
source code written in UTF-8.
-To make your script in legacy encoding work with minimum effort, do
-not use Filter=E<gt>1
-
+To make your script in legacy encoding work with minimum effort,
+do not use Filter=E<gt>1.
=head1 EXAMPLE - Greekperl
=item *
-The name used by the perl community. That includes 'utf8' and 'ascii'.
-Unlike aliases, canonical names directly reaches the method so such
-frequently used words like 'utf8' should do without alias lookups.
+The name used by the Perl community. That includes 'utf8' and 'ascii'.
+Unlike aliases, canonical names directly reach the method so such
+frequently used words like 'utf8' don't need to do alias lookups.
=item *
=item *
The name in the IANA registry.
-
+
=item *
The name used by the organization that defined it.
the canonical name.
Because of all the alias issues, and because in the general case
-encodings have state, "Encode" uses the encoding object internally
+encodings have state, "Encode" uses an encoding object internally
once an operation is in progress.
=head1 Supported Encodings
As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
-(via alias) and all occurrance of spaces are replaced with '-'. In
-other words, "ISO 8859 1" and "iso-8859-1" are identical.
+(via alias) and all occurrences of spaces are replaced with '-'.
+In other words, "ISO 8859 1" and "iso-8859-1" are identical.
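
Alias resolution can be inspected from a script. The sketch below is a minimal illustration using C<Encode::find_encoding> and C<Encode::resolve_alias>; the particular spellings tried are arbitrary examples, and the exact set of aliases accepted may differ between Encode versions.

```perl
use strict;
use warnings;
use Encode ();

# Look up a few spellings and report the canonical name each maps to.
for my $name ("latin1", "ISO-8859-1", "iso-8859-1") {
    my $enc = Encode::find_encoding($name);
    my $canonical = defined($enc) ? $enc->name : "(not found)";
    printf "%-12s => %s\n", $name, $canonical;
}

# resolve_alias() returns the canonical name directly, or a false
# value if the name is unknown.
my $resolved = Encode::resolve_alias("latin1");
print "latin1 resolves to $resolved\n" if $resolved;
```

Comparing canonical names rather than user-supplied spellings is the safer way to test whether two names refer to the same encoding.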
Encodings are categorized and implemented in several different modules
but you don't have to C<use Encode::XX> to make them available for
-most cases. Encode.pm will automatically load those modules in need.
+most cases. Encode.pm will automatically load those modules on demand.
=head2 Built-in Encodings
The following encodings are always available.
- Canonical Aliases Comments & References
+  Canonical   Aliases     Comments & References
----------------------------------------------------------------
- ascii US-ascii [ECMA]
- iso-8859-1 latin1 [ISO]
- utf8 UTF-8 [RFC2279]
+  ascii       US-ascii    [ECMA]
+  iso-8859-1  latin1      [ISO]
+  utf8        UTF-8       [RFC2279]
----------------------------------------------------------------
-
=head2 Encode::Unicode -- other Unicode encodings
Unicode coding schemes other than native utf8 are supported by
=item ISO-8859 and corresponding vendor mappings
-Since there are so many, They are presented in table format with
-Languages and corresponding encoding names by vendors. Note the table
+Since there are so many, they are presented in table format with
+languages and corresponding encoding names by vendors. Note the table
is sorted in order of ISO-8859 and the corresponding vendor mappings
are slightly different from that of ISO. See
L<http://czyborra.com/charsets/iso8859.html> for details.
- Lang/Regions ISO/Other Std. DOS Windows Macintosh Others
+ Lang/Regions ISO/Other Std. DOS Windows Macintosh Others
----------------------------------------------------------------
- N. America (ASCII) cp437 AdobeStandardEncoding
- cp863 (DOSCanadaF)
- W. Europe (iso-8859-1) cp850 cp1252 MacRoman nextstep
- hp-roman8
- cp860 (DOSPortuguese)
- CE. Europe iso-8859-2 cp852 cp1250 MacCentralEurRoman
- MacCroatian
- MacRomanian
- MacRumanian
- Latin3(*3) iso-8859-3
- Latin4(*4) iso-8859-4
- Cyrillics iso-8859-5 cp855 cp1251 MacCyrillic
- (Also see next section) cp866 MacUkrainian
- Arabic iso-8859-6 cp864 cp1256 MacArabic
- cp1006 MacFarsi
- Greek iso-8859-7 cp737 cp1253 MacGreek
- cp869 (DOSGreek2)
- Hebrew iso-8859-8 cp862 cp1255 MacHebrew
- Turkish iso-8859-9 cp857 cp1254 MacTurkish
- Nordics iso-8859-10 cp865
- cp861 MacIcelandic
- MacSami
- Thai iso-8859-11 cp874 MacThai
+ N. America (ASCII) cp437 AdobeStandardEncoding
+ cp863 (DOSCanadaF)
+ W. Europe iso-8859-1 cp850 cp1252 MacRoman nextstep
+ hp-roman8
+ cp860 (DOSPortuguese)
+ Cntrl. Europe iso-8859-2 cp852 cp1250 MacCentralEurRoman
+ MacCroatian
+ MacRomanian
+ MacRumanian
+ Latin3 [1] iso-8859-3
+ Latin4 [2] iso-8859-4
+ Cyrillics iso-8859-5 cp855 cp1251 MacCyrillic
+ (Also see next section) cp866 MacUkrainian
+ Arabic iso-8859-6 cp864 cp1256 MacArabic
+ cp1006 MacFarsi
+ Greek iso-8859-7 cp737 cp1253 MacGreek
+ cp869 (DOSGreek2)
+ Hebrew iso-8859-8 cp862 cp1255 MacHebrew
+ Turkish iso-8859-9 cp857 cp1254 MacTurkish
+ Nordics iso-8859-10 cp865
+ cp861 MacIcelandic
+ MacSami
+ Thai iso-8859-11 [3] cp874 MacThai
(iso-8859-12 is nonexistent. Reserved for Indics?)
- Baltics iso-8859-13 cp775 cp1257
+ Baltics iso-8859-13 cp775 cp1257
Celtics iso-8859-14
- Latin9(*15) iso-8859-15
+ Latin9 [4] iso-8859-15
Latin10 iso-8859-16
- Vietnamese viscii cp1258 MacVietnamese
+ Vietnamese viscii cp1258 MacVietnamese
----------------------------------------------------------------
- (*3) Esperanto, Maltese, and Turkish. Turkish is now on 8859-5
- (*4) Baltics. Now on 8859-10
- (*9) Nicknamed Latin0; Euro sign as well as French and Finnish
- letters that are missing from 8859-1 are added.
+ [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
+ [2] Baltics. Now on 8859-10.
+ [3] Also known as TIS 620.
+ [4] Nicknamed Latin0; Euro sign as well as French and Finnish
+ letters that are missing from 8859-1 are added.
All cp* are also available as ibm-*, ms-*, and windows-* . See also
L<http://czyborra.com/charsets/codepages.html>.
=item KOI8 - De Facto Standard for Cyrillic world
Though ISO-8859 does have ISO-8859-5, KOI8 series is far more popular
-in the Net. L<Encode> comes with the following KOI charsets. for
-gory details, See <http://czyborra.com/charsets/cyrillic.html> for
-details.
+in the Net. L<Encode> comes with the following KOI charsets.
+For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.
----------------------------------------------------------------
- koi8-f
- koi8-r cp878 [RFC1489]
- koi8-u [RFC2319]
-
+  koi8-f
+  koi8-r      cp878       [RFC1489]
+  koi8-u                  [RFC2319]
+
=item gsm0338 - Hentai Latin 1
-GSM0338 is for GSM handsets. Though it shares alpanumerals with ASCII,
-control character ranges and other parts are mapped very differently,
-presumablly to store Greek and Cyrillic alphabets. This one is also
-covered in Encode::Byte even thought this one does not comply extended
-ASCII.
+GSM0338 is for GSM handsets. Though it shares alphanumerics with
+ASCII, control character ranges and other parts are mapped very
+differently, presumably to store Greek and Cyrillic alphabets.
+This is also covered in Encode::Byte even though it does not
+comply with extended ASCII.
=back
=head2 The CJK: Chinese, Japanese, Korean (Multibyte)
-Note Vietnamese is listed above. Also read "Encoding vs Charset"
+Note that Vietnamese is listed above. Also read "Encoding vs Charset"
below. Also note these are implemented in distinct modules by
-languages, due the the size concerns. Please also refer to their
+language, due to size concerns. Please refer to their
respective document pages.
=over 4
=item Encode::CN -- Continental China
- Standard DOS/Win Macintosh Comment/Reference
+ Standard DOS/Win Macintosh Comment/Reference
----------------------------------------------------------------
- euc-cn(*1) MacChineseSimp
- (gbk) cp936 (*2)
- gb12345-raw { GB12345 without CES }
- gb2312-raw { GB2312 without CES }
+ euc-cn [1] MacChineseSimp
+ (gbk) cp936 [2]
+ gb12345-raw { GB12345 without CES }
+ gb2312-raw { GB2312 without CES }
hz
iso-ir-165
----------------------------------------------------------------
- (*1) GB2312 is aliased to this. see L<Microsoft-related naming mess>
- (*2) gbk is aliased to this. see L<Microsoft-related naming mess>
+ [1] GB2312 is aliased to this. See L<Microsoft-related naming mess>
+ [2] gbk is aliased to this. See L<Microsoft-related naming mess>
=item Encode::JP -- Japan
- Standard DOS/Win Macintosh Comment/Reference
+ Standard DOS/Win Macintosh Comment/Reference
----------------------------------------------------------------
euc-jp
- shiftjis cp932 macJapanese
+ shiftjis cp932 macJapanese
7bit-jis
euc-jp
- iso-2022-jp [RFC1468]
- iso-2022-jp-1 [RFC2237]
+ iso-2022-jp [RFC1468]
+ iso-2022-jp-1 [RFC2237]
jis0201-raw { JIS X 0201 (roman + halfwidth kana) without CES }
jis0208-raw { JIS X 0208 (Kanji + fullwidth kana) without CES }
jis0212-raw { JIS X 0212 (Extended Kanji) without CES }
=item Encode::KR -- Korea
- Standard DOS/Win Macintosh Comment/Reference
+ Standard DOS/Win Macintosh Comment/Reference
----------------------------------------------------------------
- euc-kr MacKorean [RFC1557]
- cp949 (*)
- iso-2022-kr [RFC1557]
+ euc-kr MacKorean [RFC1557]
+ cp949 [1]
+ iso-2022-kr [RFC1557]
johab [KS X 1001:1998, Annex 3]
ksc5601-raw { KSC5601 without CES }
----------------------------------------------------------------
- (*) ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to
- this. See below.
-
-
+ [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
+ See below.
+
=item Encode::TW -- Taiwan
- Standard DOS/Win Macintosh Comment/Reference
+ Standard DOS/Win Macintosh Comment/Reference
----------------------------------------------------------------
- big5 cp950 MacChineseTrad
+ big5 cp950 MacChineseTrad
big5-hkscs
----------------------------------------------------------------
Due to size concerns, additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.
- Standard DOS/Win Macintosh Comment/Reference
+ Standard DOS/Win Macintosh Comment/Reference
----------------------------------------------------------------
gb18030
euc-tw
=head1 Unsupported encodings
The following are not supported as yet. Some because they are rarely
-usede, some because of technical difficulty. They may be supported by
+used, some because of technical difficulties. They may be supported by
external modules via CPAN in future, however.
=over 4
Not very popular yet. Needs Unicode Database or equivalent to
implement encode() (Because it includes JIS X 0208/0212, KSC5601, and
-GB2312 sumulteniously, which code points in unicode overlap. So you
+GB2312 simultaneously, whose code points in Unicode overlap. So you
need to look up the database to determine what character set a given
Unicode character should belong).
-=item ISO-2022-CN [RFC1922]
+=item ISO-2022-CN [RFC1922]
Not very popular. Needs CNS 11643-1 and 2 which are not available in
-this module. CNS 11643 is supported (via euc-tw) in
-Encode::HanExtra. Autrijus may add support for this encoding in his
-module in future
+this module. CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
+Autrijus may add support for this encoding in his module in future.
=item various HP-UX encodings
-The following are unsoported due to the lack of mapping data.
-
+The following are unsupported due to the lack of mapping data.
+
'8' - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
- '15' - japanese15, korean15, and roi15
+ '15' - japanese15, korean15, and roi15
=item Cyrillic encoding ISO-IR-111
None of the Encode team knows Hebrew enough (ISO-8859-8, cp1255 and
MacHebrew are supported because and just because there were mappings
-available at L<http://www.unicode.org/>). Contribution welcome.
+available at L<http://www.unicode.org/>). Contributions welcome.
+
+=item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]
+
+Ditto.
=item Thai encoding TCVN
=item Vietnamese encodings VPS
-Though Jungshik has reported that mozilla supports this encoding, It was too late for us to add one. In future via a separate module. See
-L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf> and
+Though Jungshik has reported that Mozilla supports this encoding, it
+was too late before 5.8.0 for us to add it. It may be added in future
+via a separate module. See
+L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
+and
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
if you are interested in helping us.
-=item various Mac encodings
+=item Various Mac encodings
-The following are unsoported due to the lack of mapping data.
+The following are unsupported due to the lack of mapping data.
MacArmenian, MacBengali, MacBurmese, MacEthiopic
MacExtArabic, MacGeorgian, MacKannada, MacKhmer
MacSinhalese, MacTamil, MacTelugu, MacTibetan
MacVietnamese
-The rest of which already available are based upon the vendor mappings at
-L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
+The rest, which are already available, are based upon the vendor mappings
+at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
=item (Mac) Indic encodings
The maps for the following are available at L<http://www.unicode.org/>
-but remains unsupport because those encordigs need algorithmical
-approach, unsupported by F<enc2xs>
+but remain unsupported because those encodings need an algorithmic
+approach, currently unsupported by F<enc2xs>:
MacDevanagari
MacGurmukhi
L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
I believe this issue is prevalent not only for Mac Indics but also in
-other Indic encodings but those mentions were the only Indic encodings
+other Indic encodings, but the above were the only Indic encoding
maps that I could find at L<http://www.unicode.org/> .
=back
We are used to using the term (character) I<encoding> and I<character set>
interchangeably. But just as using the term byte and character is
-dangerous and should be differenciated when needed, we need to
-differenciate I<encoding> and I<character set>.
+dangerous and should be differentiated when needed, we need to
+differentiate I<encoding> and I<character set>.
To understand that, let's follow how we make computers grok our characters.
=item *
Then we have to give each character a unique ID so your computer can
-tell the differnce from 'a' to 'A'. This itemized character
-repartoire is now a I<character set>.
+tell the difference between 'a' and 'A'. This itemized character
+repertoire is now a I<character set>.
=item *
If your computer can grow the character set without further
-proccessing, you can go ahead use it. This is called a I<coded
+processing, you can go ahead and use it. This is called a I<coded
character set> (CCS) or I<raw character encoding>. ASCII is used this
way for most cases.
By carefully looking at the encoded byte sequence, you may find the
byte sequence conforms a unique number. In that sense EUC is a CCS
generated by a CES above from up to four CCS (complicated?). UTF-8
-falls into this category. See L<perlunicode/"UTF-8"> to find how
+falls into this category. See L<perlunicode/"UTF-8"> to find how
UTF-8 maps Unicode to a byte sequence.
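
To make the mapping from a Unicode code point to a UTF-8 byte sequence concrete, here is a small sketch using C<Encode::encode>. The choice of U+03AF as the sample character is an arbitrary assumption for illustration.

```perl
use strict;
use warnings;
use Encode ();

# One Unicode character: U+03AF, GREEK SMALL LETTER IOTA WITH TONOS.
my $char = "\x{3af}";

# Apply the UTF-8 encoding scheme to obtain the octets.
my $octets = Encode::encode("UTF-8", $char);

# Show the code point next to the octets it is stored as.
printf "U+%04X => %s\n", ord($char),
    join(" ", map { sprintf "%02X", ord } split //, $octets);
```

This prints C<U+03AF =E<gt> CE AF>: a single code point becomes a sequence of two octets.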
You may also find by now why 7bit ISO-2022 cannot conform a CCS. If
encode non-C<ASCII> form data. To get a general impression visit
L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
While encoding of form data has stabilized for C<UTF-8> coded pages
-(at least IE 5/6, NS 6, Opera 6 behave consitently), be sure to
+(at least IE 5/6, NS 6, Opera 6 behave consistently), be sure to
expect fun (and cross-browser discrepancies) with C<UTF-16> coded
pages!
=item character repertoire
A collection of unique characters. A I<character> set in the most
-strict sense. At this stage characters are not numberd.
+strict sense. At this stage characters are not numbered.
=item coded character set (CCS)
iso-2022-jp, the de facto standard CES for e-mails.
8 bit version can conform a CCS. EUC and ISO-8859 are two examples
-thereof. pre-5.6 perl could use them as string literals.
+thereof. Pre-5.6 perl could use them as string literals.
=item UCS
=item Unicode
A Character Set that aims to include all character repertoire of the
-world. Many character sets in various national as well as industorial
+world. Many character sets in various national as well as industrial
standards have become, in a way, just subsets of Unicode.
=item UTF
Short for I<Unicode Transformation Format>. Determines how to map a
-unicode character into byte sequnece.
+Unicode character into a byte sequence.
=item UTF-16
=item RFC
-Request For Comment -- need I say more?
+Request For Comments -- need I say more?
L<http://www.rfc.net/>, L<http://www.faqs.org/rfcs/>
=item UC
L<http://www.unicode.org/glossary/>
-The glossary of this document is based opon this site.
+The glossary of this document is based upon this site.
=back
L<http://jshin.net/faq>
-And especially it's subject 8
+And especially its subject 8.
L<http://jshin.net/faq/qa8.html>
-a comprehensive overview of the Korean (C<KS *>) standards.
+A comprehensive overview of the Korean (C<KS *>) standards.
=back
I could not find this page because the hostname doesn't resolve!
- Brief description for most of the mentioned CJK encodings
+Brief description for most of the mentioned CJK encodings
L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>