Portions that are still incomplete are marked with XXX.
+Perl used to work on EBCDIC machines, but there are now areas of the code where
+it doesn't. If you want to use Perl on an EBCDIC machine, please let us know
+by sending mail to perlbug@perl.org.
+
=head1 COMMON CHARACTER CODE SETS
=head2 ASCII
=head2 EBCDIC
The Extended Binary Coded Decimal Interchange Code refers to a
-large collection of slightly different single and multi byte
-coded character sets that are different from ASCII or ISO 8859-1
-and typically run on host computers. The EBCDIC encodings derive
-from 8 bit byte extensions of Hollerith punched card encodings.
-The layout on the cards was such that high bits were set for the
-upper and lower case alphabet characters [a-z] and [A-Z], but there
-were gaps within each latin alphabet range.
+large collection of single and multi byte coded character sets that are
+different from ASCII or ISO 8859-1 and are all slightly different from each
+other; they typically run on host computers. The EBCDIC encodings derive from
+8 bit byte extensions of Hollerith punched card encodings. The layout on the
+cards was such that high bits were set for the upper and lower case alphabet
+characters [a-z] and [A-Z], but there were gaps within each Latin alphabet
+range.
Some IBM EBCDIC character sets may be known by character code set
identification numbers (CCSID numbers) or code page numbers. Leading
mentioned above.)
For example, the ordinal value of 'A' is 193 in most EBCDIC code pages,
and also is 193 when encoded in UTF-EBCDIC.
-All other code points occupy at least two bytes when encoded.
+All variant code points occupy at least two bytes when encoded.
In UTF-8, the code points corresponding to the lowest 128
ordinal numbers (0 - 127: the ASCII characters) are invariant.
In UTF-EBCDIC, there are 160 invariant characters.
my $ebcdic_string = $ascii_string;
eval '$ebcdic_string =~ tr/' . $cp_037 . '/\000-\377/';
-To convert from EBCDIC 037 to ASCII just reverse the order of the tr///
+To convert from EBCDIC 037 to ASCII just reverse the order of the tr///
arguments like so:
my $ascii_string = $ebcdic_string;
sub Is_latin_1 {
my $char = substr(shift,0,1);
- $char =~ /[ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/;
+ $char =~ /[ ¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/;
}
Although that form may run into trouble in network transit (due to the
apply tr/[A-Z]/[a-z]/ before sorting. If the data are primarily UPPERCASE
and include Latin-1 characters then apply:
- tr/[a-z]/[A-Z]/;
- tr/[àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ]/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]/;
- s/ß/SS/g;
+ tr/[a-z]/[A-Z]/;
+ tr/[àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ]/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]/;
+ s/ß/SS/g;
then sort(). Do note however that such Latin-1 manipulation does not
address the E<yuml> C<y WITH DIAERESIS> character that will remain at
L<http://www.unicode.org/unicode/reports/tr16/>
-L<http://www.wps.com/texts/codes/>
+L<http://www.wps.com/projects/codes/>
B<ASCII: American Standard Code for Information Infiltration> Tom Jennings,
September 1999.