The EBCDIC code page in use on Siemens' BS2000 system is distinct from
1047 and 0037. It is identified below as the POSIX-BC set.
+=head2 Unicode code points versus EBCDIC code points
+
+In Unicode terminology a I<code point> is the number assigned to a
+character: for example, in EBCDIC the character "A" is usually assigned
+the number 193. In Unicode the character "A" is assigned the number 65.
+This causes a problem with the semantics of the pack/unpack "U" format,
+which is supposed to pack Unicode code points to characters and unpack
+characters back to numbers.
+The problem is: which code points to use for code points less than 256?
+(For 256 and over there is no problem: Unicode code points are used.)
+On EBCDIC platforms, the EBCDIC code points are used for the low 256.
+This means that the equivalences
+
+ pack("U", ord($character)) eq $character
+ unpack("U", $character) == ord $character
+
+will hold. (If Unicode code points were applied consistently to
+all possible code points, then in EBCDIC pack("U", ord("A")) would
+equal I<A with acute>, that is chr(101), and unpack("U", "A") would
+equal 65, the EBCDIC code point of I<non-breaking space>, rather
+than 193, that is ord("A").)
+
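The two equivalences above can be checked directly. The following is a minimal sketch, not part of the original patch: the helper name C<u_roundtrips> is hypothetical, and the sample characters are restricted to ones in the ASCII repertoire so that the check is meaningful on both ASCII and EBCDIC builds. The ordinals printed will differ between platforms (for example 65 versus 193 for "A").

```perl
use strict;
use warnings;

# Hypothetical helper: true if both equivalences hold for one character.
sub u_roundtrips {
    my ($character) = @_;
    return pack("U", ord($character)) eq $character
        && unpack("U", $character) == ord($character);
}

# Report the native ordinal and whether the round trip holds.
for my $character ('A', 'z', '0', '^') {
    printf "%s: ord=%d %s\n", $character, ord($character),
        u_roundtrips($character) ? "ok" : "MISMATCH";
}
```
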
+=head2 Remaining Perl Unicode problems in EBCDIC
+
+=over 4
+
+=item *
+
+Many of the remaining problems seem to be related to case-insensitive
+matching: for example, C<< /[\x{131}]/ >> (LATIN SMALL LETTER DOTLESS I)
+does not match "I" case-insensitively, as it should under Unicode.
+(The match succeeds on ASCII-derived platforms.)
+
+=item *
+
+The extensions Unicode::Collate and Unicode::Normalize are not
+supported under EBCDIC; neither is the encoding pragma.
+
+=back
+
=head2 Unicode and UTF
UTF is a Unicode Transformation Format. UTF-8 is a Unicode conforming
UTF-EBCDIC is an attempt to represent Unicode characters in an EBCDIC
transparent manner.
+=head2 Using Encode
+
+Starting from Perl 5.8 you can use the new standard module Encode
+to translate from EBCDIC to Latin-1 code points:
+
+ use Encode 'from_to';
+
+ my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' );
+
+ # $a is in EBCDIC code points
+ from_to($a, $ebcdic{ord '^'}, 'latin1');
+ # $a is ISO 8859-1 code points
+
+and from Latin-1 code points to EBCDIC code points:
+
+ use Encode 'from_to';
+
+ my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' );
+
+ # $a is ISO 8859-1 code points
+ from_to($a, 'latin1', $ebcdic{ord '^'});
+ # $a is in EBCDIC code points
+
+For I/O, it is suggested that you use the autotranslating features
+of PerlIO; see L<perluniintro>.
+
+Since version 5.8, Perl has used the new PerlIO I/O library. This enables
+you to use a different encoding for each I/O channel. For example, you
+may use
+
+ use Encode;
+ open($f, ">:encoding(ascii)", "test.ascii");
+ print $f "Hello World!\n";
+ open($f, ">:encoding(cp37)", "test.ebcdic");
+ print $f "Hello World!\n";
+ open($f, ">:encoding(latin1)", "test.latin1");
+ print $f "Hello World!\n";
+ open($f, ">:encoding(utf8)", "test.utf8");
+ print $f "Hello World!\n";
+
+to get four files containing "Hello World!\n" in ASCII, CP 37 EBCDIC,
+ISO 8859-1 (Latin-1) (in this example identical to ASCII), and
+UTF-EBCDIC (in this example identical to normal EBCDIC), respectively.
+See the documentation of Encode::PerlIO for details.
+
+As the PerlIO layer uses raw I/O (bytes) internally, all of this
+ignores things like the type of your filesystem (ASCII or EBCDIC).
+
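To see what the encoding layers described above actually emit, one can write through a layer into an in-memory file and inspect the bytes. This is a minimal sketch, not part of the original patch: the helper name C<encoded_hex> is hypothetical, and the sketch assumes an ASCII-based Perl with the standard Encode distribution (which supplies cp37) installed.

```perl
use strict;
use warnings;

# Hypothetical helper: write $text through an :encoding() layer into an
# in-memory file and return the resulting bytes as a hex string.
sub encoded_hex {
    my ($encoding, $text) = @_;
    my $buffer = '';
    open(my $fh, ">:encoding($encoding)", \$buffer)
        or die "Cannot open in-memory file: $!";
    print $fh $text;
    close($fh) or die "Cannot close in-memory file: $!";
    return unpack("H*", $buffer);
}

# Same text, three layers: the cp37 bytes differ from the other two.
for my $encoding (qw(ascii cp37 latin1)) {
    printf "%-6s %s\n", $encoding, encoded_hex($encoding, "Hello");
}
```

Using an in-memory file keeps the sketch independent of the filesystem, in line with the point that the layer only deals in raw bytes.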
=head1 SINGLE OCTET TABLES
The following tables list the ASCII and Latin 1 ordered sets including
'\060\061\062\063\064\065\066\067\070\071\263\333\334\331\332\237' ;
my $ebcdic_string = $ascii_string;
- eval '$ebcdic_string =~ tr/\000-\377/' . $cp_037 . '/';
+ eval '$ebcdic_string =~ tr/' . $cp_037 . '/\000-\377/';
To convert from EBCDIC 037 to ASCII just reverse the order of the tr///
arguments like so:
my $ascii_string = $ebcdic_string;
- eval '$ascii_string = tr/' . $cp_037 . '/\000-\377/';
+ eval '$ascii_string =~ tr/\000-\377/' . $cp_037 . '/';
Similarly one could take the output of the third column from recipe 0 to
obtain a C<$cp_1047> table. The fourth column of the output from recipe
[A-Z] and [a-z] have been especially coded to not pick up gap
characters. For example, characters such as E<ocirc> C<o WITH CIRCUMFLEX>
that lie between I and J would not be matched by the
-regular expression range C</[H-K]/>.
+regular expression range C</[H-K]/>. This works in
+the other direction, too, if either of the range end points is
+explicitly numeric: C<[\x89-\x91]> will match C<\x8e>, even
+though C<\x89> is C<i> and C<\x91> is C<j>, and C<\x8e>
+is a gap character from the alphabetic viewpoint.
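The difference between an alphabetic range and an explicitly numeric range can be probed with a short script. This is a minimal sketch, not part of the original patch; the variable names are illustrative. On an EBCDIC platform it exercises the gap-character rule described above; on an ASCII platform C<\x8e> simply lies outside C<[i-j]>, so the alphabetic test fails there for an unrelated reason.

```perl
use strict;
use warnings;

# \x8e lies numerically between \x89 and \x91.
my $gap_character = chr(0x8e);

# Numeric end points: the range is taken literally, so \x8e matches.
my $in_numeric_range = $gap_character =~ /[\x89-\x91]/ ? 1 : 0;

# Alphabetic end points: gap characters are not picked up.
my $in_alpha_range = $gap_character =~ /[i-j]/ ? 1 : 0;

print "numeric range [\\x89-\\x91]: $in_numeric_range\n";
print "alphabetic range [i-j]:    $in_alpha_range\n";
```
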
If you do want to match the alphabet gap characters in a single octet
regular expression try matching the hex or octal code such
There may be a few system dependent issues
of concern to EBCDIC Perl programmers.
-=head2 OS/400
-
-The PASE environment.
+=head2 OS/400
=over 8
+=item PASE
+
+The PASE environment is a runtime environment for OS/400 that can run
+executables built for PowerPC AIX in OS/400; see L<perlos400>. PASE
+is ASCII-based, unlike the ILE, which is EBCDIC-based.
+
=item IFS access
XXX.