Portions that are still incomplete are marked with XXX.
+Perl used to work on EBCDIC machines, but there are now areas of the code where
+it doesn't. If you want to use Perl on an EBCDIC machine, please let us know
+by sending mail to perlbug@perl.org.
+
=head1 COMMON CHARACTER CODE SETS
=head2 ASCII
=head2 EBCDIC
The Extended Binary Coded Decimal Interchange Code refers to a
-large collection of slightly different single and multi byte
-coded character sets that are different from ASCII or ISO 8859-1
-and typically run on host computers. The EBCDIC encodings derive
-from 8 bit byte extensions of Hollerith punched card encodings.
-The layout on the cards was such that high bits were set for the
-upper and lower case alphabet characters [a-z] and [A-Z], but there
-were gaps within each latin alphabet range.
+large collection of single and multi byte coded character sets that are
+different from ASCII or ISO 8859-1 and are all slightly different from each
+other; they typically run on host computers. The EBCDIC encodings derive from
+8 bit byte extensions of Hollerith punched card encodings. The layout on the
+cards was such that high bits were set for the upper and lower case alphabet
+characters [a-z] and [A-Z], but there were gaps within each Latin alphabet
+range.
Some IBM EBCDIC character sets may be known by character code set
identification numbers (CCSID numbers) or code page numbers. Leading
with larger numbers requiring more bytes.
UTF-EBCDIC is like UTF-8, but based on EBCDIC.
-In UTF-8, the code points corresponding to the lowest 128
-ordinal numbers (0 - 127) are the same (or C<invariant>)
-in UTF-8 or not. They occupy one byte each. All other Unicode code points
-require more than one byte to be represented in UTF-8.
-With UTF-EBCDIC, the term C<invariant> has a somewhat different meaning.
-(First, note that this is very different from the L</13 variant characters>
+You may see the term C<invariant> character or code point.
+This simply means that the character has the same numeric
+value when encoded as when not.
+(Note that this is a very different concept from L</The 13 variant characters>
mentioned above.)
-In UTF-EBCDIC, an C<invariant> character or code point
-is one which takes up exactly one byte encoded, regardless
-of whether or not the encoding changes its value
-(which it most likely will).
+For example, the ordinal value of 'A' is 193 in most EBCDIC code pages,
+and also is 193 when encoded in UTF-EBCDIC.
+All variant code points occupy at least two bytes when encoded.
+In UTF-8, the code points corresponding to the lowest 128
+ordinal numbers (0 - 127: the ASCII characters) are invariant.
+In UTF-EBCDIC, there are 160 invariant characters.
(If you care, the EBCDIC invariants are those characters
-which correspond to the the ASCII characters, plus those that correspond to
+which have ASCII equivalents, plus those that correspond to
the C1 controls (80..9f on ASCII platforms).)
+
A string encoded in UTF-EBCDIC may be longer (but never shorter) than
one encoded in UTF-8.
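
The invariant idea above can be checked from an ASCII platform with the
core C<Encode> module. This is only a sketch under stated assumptions: it
assumes the script runs on an ASCII (not EBCDIC) build of Perl, it uses the
C<cp37> encoding name that C<Encode> provides for EBCDIC code page 037, and
it looks at UTF-8 only, since C<Encode> does not offer UTF-EBCDIC on ASCII
platforms. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Collect the code points in 0..255 whose UTF-8 encoding is a single
# byte holding the same numeric value: the invariant code points.
# (Assumes an ASCII platform, where chr() takes Unicode code points.)
my @invariant = grep {
    my $bytes = encode('UTF-8', chr($_));
    length($bytes) == 1 && ord($bytes) == $_;
} 0 .. 255;
print scalar(@invariant), " invariant code points in UTF-8\n";

# A variant code point needs at least two bytes when encoded.
my $e_acute = encode('UTF-8', chr(0xE9));
print length($e_acute), " bytes for U+00E9 in UTF-8\n";

# 'A' converted to a single byte of EBCDIC code page 037.
my $a_cp37 = encode('cp37', 'A');
print "ordinal of 'A' in cp37: ", ord($a_cp37), "\n";
```

On an ASCII platform this should report 128 invariant code points, two
bytes for U+00E9, and 193 for 'A' in code page 037. On an EBCDIC platform
C<chr> and C<ord> work with native values, so the results would differ.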
my $ebcdic_string = $ascii_string;
eval '$ebcdic_string =~ tr/' . $cp_037 . '/\000-\377/';
-To convert from EBCDIC 037 to ASCII just reverse the order of the tr///
+To convert from EBCDIC 037 to ASCII just reverse the order of the tr///
arguments like so:
my $ascii_string = $ebcdic_string;
sub Is_latin_1 {
my $char = substr(shift,0,1);
- $char =~ /[ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/;
+ $char =~ /[ ¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/;
}
Although that form may run into trouble in network transit (due to the
apply tr/[A-Z]/[a-z]/ before sorting. If the data are primarily UPPERCASE
and include Latin-1 characters then apply:
- tr/[a-z]/[A-Z]/;
- tr/[àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ]/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]/;
- s/ß/SS/g;
+ tr/[a-z]/[A-Z]/;
+ tr/[àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ]/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]/;
+ s/ß/SS/g;
then sort(). Do note however that such Latin-1 manipulation does not
address the E<yuml> C<y WITH DIAERESIS> character that will remain at
L<http://www.unicode.org/unicode/reports/tr16/>
-L<http://www.wps.com/texts/codes/>
+L<http://www.wps.com/projects/codes/>
B<ASCII: American Standard Code for Information Infiltration> Tom Jennings,
September 1999.