Portions that are still incomplete are marked with XXX.
+Perl used to work on EBCDIC machines, but there are now areas of the code where
+it doesn't. If you want to use Perl on an EBCDIC machine, please let us know
+by sending mail to perlbug@perl.org.
+
=head1 COMMON CHARACTER CODE SETS
=head2 ASCII
=head2 EBCDIC
The Extended Binary Coded Decimal Interchange Code refers to a
-large collection of slightly different single and multi byte
-coded character sets that are different from ASCII or ISO 8859-1
-and typically run on host computers. The EBCDIC encodings derive
-from 8 bit byte extensions of Hollerith punched card encodings.
-The layout on the cards was such that high bits were set for the
-upper and lower case alphabet characters [a-z] and [A-Z], but there
-were gaps within each latin alphabet range.
+large collection of single and multi byte coded character sets that are
+different from ASCII or ISO 8859-1 and are all slightly different from each
+other; they typically run on host computers. The EBCDIC encodings derive from
+8 bit byte extensions of Hollerith punched card encodings. The layout on the
+cards was such that high bits were set for the upper and lower case alphabet
+characters [a-z] and [A-Z], but there were gaps within each Latin alphabet
+range.
Some IBM EBCDIC character sets may be known by character code set
identification numbers (CCSID numbers) or code page numbers. Leading
Perl can be compiled on platforms that run any of three commonly used EBCDIC
character sets, listed below.
-=head2 13 variant characters
+=head2 The 13 variant characters
Among IBM EBCDIC character code sets there are 13 characters that
are often mapped to different integer values. Those characters
with larger numbers requiring more bytes.
UTF-EBCDIC is like UTF-8, but based on EBCDIC.
-In UTF-8, the code points corresponding to the lowest 128
-ordinal numbers (0 - 127) are the same (or C<invariant>)
-in UTF-8 or not. They occupy one byte each. All other Unicode code points
-require more than one byte to be represented in UTF-8.
-With UTF-EBCDIC, the term C<invariant> has a somewhat different meaning.
-(First, note that this is very different from the L</13 variant characters>
+You may see the term C<invariant> applied to a character or code point.
+This simply means that the character has the same numeric
+value whether or not it is encoded.
+(Note that this is a very different concept from L</The 13 variant characters>
mentioned above.)
-In UTF-EBCDIC, an C<invariant> character or code point
-is one which takes up exactly one byte encoded, regardless
-of whether or not the encoding changes its value
-(which it most likely will).
+For example, the ordinal value of 'A' is 193 in most EBCDIC code pages,
+and is also 193 when encoded in UTF-EBCDIC.
+All variant code points occupy at least two bytes when encoded.
+In UTF-8, the code points corresponding to the lowest 128
+ordinal numbers (0 - 127: the ASCII characters) are invariant.
+In UTF-EBCDIC, there are 160 invariant characters.
(If you care, the EBCDIC invariants are those characters
-which correspond to the the ASCII characters, plus those that correspond to
+which have ASCII equivalents, plus those that correspond to
the C1 controls (80..9f on ASCII platforms).)
+
A string encoded in UTF-EBCDIC may be longer (but never shorter) than
one encoded in UTF-8.
my $ebcdic_string = $ascii_string;
eval '$ebcdic_string =~ tr/' . $cp_037 . '/\000-\377/';
-To convert from EBCDIC 037 to ASCII just reverse the order of the tr///
+To convert from EBCDIC 037 to ASCII just reverse the order of the tr///
arguments like so:
my $ascii_string = $ebcdic_string;
sub Is_latin_1 {
my $char = substr(shift,0,1);
- $char =~ /[ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/;
+ $char =~ /[ ¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/;
}
Although that form may run into trouble in network transit (due to the
apply tr/[A-Z]/[a-z]/ before sorting. If the data are primarily UPPERCASE
and include Latin-1 characters then apply:
- tr/[a-z]/[A-Z]/;
- tr/[àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ]/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]/;
- s/ß/SS/g;
+ tr/[a-z]/[A-Z]/;
+ tr/[àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ]/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]/;
+ s/ß/SS/g;
then sort(). Do note however that such Latin-1 manipulation does not
address the E<yuml> C<y WITH DIAERESIS> character that will remain at
L<http://www.unicode.org/unicode/reports/tr16/>
-L<http://www.wps.com/texts/codes/>
+L<http://www.wps.com/projects/codes/>
B<ASCII: American Standard Code for Information Infiltration> Tom Jennings,
September 1999.