X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlebcdic.pod;h=39ed61279cb830a9c14b829cf892cf8c9d361553;hb=226de479579f4a84dd17654b44e5aef323b0a403;hp=ca4ef84408c1295bdb10e047a74954ca02909aea;hpb=f4084e3915fd9d0f0ed59d0dddeb6888f64af93e;p=p5sagit%2Fp5-mst-13.2.git
diff --git a/pod/perlebcdic.pod b/pod/perlebcdic.pod
index ca4ef84..39ed612 100644
--- a/pod/perlebcdic.pod
+++ b/pod/perlebcdic.pod
@@ -63,7 +63,7 @@ and typically run on host computers. The EBCDIC encodings derive
 from 8 bit byte extensions of Hollerith punched card encodings.
 The layout on the cards was such that high bits were set for the
 upper and lower case alphabet characters [a-z] and [A-Z], but there
-were gaps within each latin alphabet range.
+were gaps within each Latin alphabet range.
 
 Some IBM EBCDIC character sets may be known by character code set
 identification numbers (CCSID numbers) or code page numbers. Leading
@@ -153,20 +153,21 @@ depends on the ordinal number of that code point,
 with larger numbers requiring more bytes.
 UTF-EBCDIC is like UTF-8, but based on EBCDIC.
 
-In UTF-8, the code points corresponding to the lowest 128
-ordinal numbers (0 - 127) are the same (or C<invariant>)
-in UTF-8 or not. They occupy one byte each. All other Unicode code points
-require more than one byte to be represented in UTF-8.
-With UTF-EBCDIC, the term C<invariant> has a somewhat different meaning.
-(First, note that this is very different from the L
+You may see the term C<invariant> character or code point.
+This simply means that the character has the same numeric
+value when encoded as when not.
+(Note that this is a very different concept from L
 mentioned above.)
-In UTF-EBCDIC, an C<invariant> character or code point
-is one which takes up exactly one byte encoded, regardless
-of whether or not the encoding changes its value
-(which it most likely will).
+For example, the ordinal value of 'A' is 193 in most EBCDIC code pages,
+and also is 193 when encoded in UTF-EBCDIC.
+All other code points occupy at least two bytes when encoded.
+In UTF-8, the code points corresponding to the lowest 128
+ordinal numbers (0 - 127: the ASCII characters) are invariant.
+In UTF-EBCDIC, there are 160 invariant characters.
 (If you care, the EBCDIC invariants are those characters
-which correspond to the the ASCII characters, plus those that correspond to
+which have ASCII equivalents, plus those that correspond to
 the C1 controls (80..9f on ASCII platforms).)
+
 A string encoded in UTF-EBCDIC may be longer (but never shorter) than
 one encoded in UTF-8.
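
Reviewer note: the new wording defines an invariant as a code point whose numeric value is the same encoded as unencoded. The sketch below checks the UTF-8 half of that claim, and the 'A' is 193 example, in Python rather than Perl. It is an illustration under stated assumptions, not part of the patch: the helper name `utf8_invariants` is hypothetical, `cp037` is picked as one representative EBCDIC code page, and Python ships no UTF-EBCDIC codec, so the figure of 160 invariants is not verified here.

```python
def utf8_invariants(limit=256):
    """Return the code points below `limit` whose UTF-8 encoding is a
    single byte with the same numeric value as the code point itself."""
    return [cp for cp in range(limit)
            if chr(cp).encode("utf-8") == bytes([cp])]


if __name__ == "__main__":
    invariants = utf8_invariants()
    # The invariants are exactly the ASCII range 0..127.
    print(len(invariants), invariants[0], invariants[-1])

    # Any code point above 127 needs at least two bytes in UTF-8.
    print(len(chr(0xE9).encode("utf-8")))

    # Assumption: cp037 stands in for "most EBCDIC code pages".
    # The single byte that encodes 'A' there has ordinal 193 (0xC1).
    print("A".encode("cp037")[0])
```

The check compares each code point with its encoded bytes directly, which mirrors the "same numeric value when encoded as when not" definition introduced by the patch.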