From: Jarkko Hietaniemi
Date: Tue, 19 Mar 2002 04:58:22 +0000 (+0000)
Subject: Update the Unicode vs EBCDIC situation.
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=64c66fb6d001b6ad9c6dcec93084b647d4c6eb13;p=p5sagit%2Fp5-mst-13.2.git

Update the Unicode vs EBCDIC situation.

p4raw-id: //depot/perl@15313
---

diff --git a/pod/perlebcdic.pod b/pod/perlebcdic.pod
index 6339bb4..0053d91 100644
--- a/pod/perlebcdic.pod
+++ b/pod/perlebcdic.pod
@@ -97,6 +97,26 @@ for VM/ESA. CCSID 1047 differs from CCSID 0037 in eight places.
 The EBCDIC code page in use on Siemens' BS2000 system is distinct from
 1047 and 0037. It is identified below as the POSIX-BC set.
 
+=head2 Unicode code points versus EBCDIC code points
+
+In Unicode terminology a I<code point> is the number assigned to a
+character: for example, in EBCDIC the character "A" is usually assigned
+the number 193. In Unicode the character "A" is assigned the number 65.
+This causes a problem with the semantics of the pack/unpack "U", which
+are supposed to pack Unicode code points to characters and back to numbers.
+The problem is: which code points to use for code points less than 256?
+(for 256 and over there's no problem: Unicode code points are used)
+In EBCDIC, for the low 256 the EBCDIC code points are used. This
+means that the equivalences
+
+    pack("U", ord($character)) eq $character
+    unpack("U", $character) == ord $character
+
+will hold. (If Unicode code points were applied consistently over
+all the possible code points, pack("U",ord("A")) would in EBCDIC
+equal I<A with acute> or chr(101), and unpack("U", "A") would equal
+65, or I<non-breaking space>, not 193, or ord "A".)
+
 =head2 Unicode and UTF
 
 UTF is a Unicode Transformation Format. UTF-8 is a Unicode conforming
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 8a7a055..e36bb07 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -169,15 +169,23 @@ To output UTF-8 always, use the ":utf8" output discipline.
 to this sample program ensures the output is completely UTF-8,
 and of course, removes the warning.
 
-Perl 5.8.0 also supports Unicode on EBCDIC platforms. There, the
-support is somewhat harder to implement since additional conversions
-are needed at every step. Because of these difficulties, the Unicode
-support isn't quite as full as in other, mainly ASCII-based, platforms
-(the Unicode support is better than in the 5.6 series, which didn't
-work much at all for EBCDIC platform). On EBCDIC platforms, the
-internal Unicode encoding form is UTF-EBCDIC instead of UTF-8 (the
-difference is that as UTF-8 is "ASCII-safe" in that ASCII characters
-encode to UTF-8 as-is, UTF-EBCDIC is "EBCDIC-safe").
+=head2 Unicode and EBCDIC
+
+Perl 5.8.0 also supports Unicode on EBCDIC platforms. There,
+the Unicode support is somewhat more complex to implement since
+additional conversions are needed at every step. Some problems
+remain, but they all seem to be related to the combination of
+the extra mapping just described and case-insensitive matching:
+for example, "\x{131}" (LATIN SMALL LETTER DOTLESS I) does not
+match "I" case-insensitively, as it should under Unicode.
+(The match succeeds in ASCII-derived platforms.)
+
+In any case, the Unicode support on EBCDIC platforms is better than
+in the 5.6 series, which didn't work much at all for EBCDIC platform.
+On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
+instead of UTF-8 (the difference is that as UTF-8 is "ASCII-safe" in
+that ASCII characters encode to UTF-8 as-is, UTF-EBCDIC is
+"EBCDIC-safe").
 
 =head2 Creating Unicode
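
Note on the perlebcdic.pod hunk: the pack/unpack "U" equivalences it documents can be checked on an ASCII host with a small model. The sketch below is not Perl's implementation. It is written in Python so that it runs without an EBCDIC machine, and it assumes Python's `cp037` codec (CCSID 0037) as a stand-in code page, since the patch does not name one. All function names (`ebcdic_ord`, `ebcdic_chr`, `pack_u_native`, `pack_u_unicode`) are hypothetical helpers introduced only for illustration.

```python
# Sketch only: models the two candidate meanings of pack("U", n) for n < 256
# on an EBCDIC host. Nothing here is taken from the Perl source.
EBCDIC = "cp037"  # assumption: CCSID 0037 stands in for the host code page


def ebcdic_ord(ch: str) -> int:
    """Native EBCDIC code point of ch (what ord() would return on such a host)."""
    return ch.encode(EBCDIC)[0]


def ebcdic_chr(n: int) -> str:
    """Character whose native EBCDIC code point is n (what chr() would return)."""
    return bytes([n]).decode(EBCDIC)


def pack_u_native(n: int) -> str:
    """Documented behaviour: below 256, n is read as an EBCDIC code point."""
    return ebcdic_chr(n) if n < 256 else chr(n)


def pack_u_unicode(n: int) -> str:
    """Hypothetical alternative: n is always read as a Unicode code point."""
    return chr(n)


if __name__ == "__main__":
    a = ebcdic_ord("A")
    print("ord('A') on the modelled host:", a)
    # The equivalence from the patch: pack("U", ord($c)) eq $c
    print("native round trip holds:", pack_u_native(a) == "A")
    # The rejected alternative yields a different character, whose native
    # code point matches the chr(...) value quoted in the patch text.
    other = pack_u_unicode(a)
    print("alternative gives:", repr(other), "native code point", ebcdic_ord(other))
    # Reading Unicode 65 as a native code point gives the character the
    # patch says unpack("U", "A") would otherwise correspond to.
    print("native chr(65):", repr(ebcdic_chr(65)))
```

Under these assumptions, the native round trip returns "A", while the consistently-Unicode reading returns a different Latin-1 letter. That mismatch is what the parenthetical remark in the hunk describes.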