Commit | Line | Data |
d396a558 |
1 | =head1 NAME |
2 | |
3 | perlebcdic - Considerations for running Perl on EBCDIC platforms |
4 | |
5 | =head1 DESCRIPTION |
6 | |
7 | An exploration of some of the issues facing Perl programmers |
8 | on EBCDIC based computers. We do not cover localization, |
395f5a0c |
9 | internationalization, or multi byte character set issues other |
10 | than some discussion of UTF-8 and UTF-EBCDIC. |
d396a558 |
11 | |
12 | Portions that are still incomplete are marked with XXX. |
13 | |
14 | =head1 COMMON CHARACTER CODE SETS |
15 | |
16 | =head2 ASCII |
17 | |
18 | The American Standard Code for Information Interchange is a set of |
19 | integers running from 0 to 127 (decimal) that displays and other |
20 | computer systems interpret as characters. |
51b5cecb |
21 | The range 0..127 can be covered by setting the bits in a 7-bit binary |
d396a558 |
22 | number, hence the set is sometimes referred to as "7-bit ASCII". |
51b5cecb |
23 | ASCII was described by the American National Standards Institute |
d396a558 |
24 | document ANSI X3.4-1986. It was also described by ISO 646:1991 |
25 | (with localization for currency symbols). The full ASCII set is |
26 | given in the table below as the first 128 elements. Languages that |
27 | can be written adequately with the characters in ASCII include |
28 | English, Hawaiian, Indonesian, Swahili and some Native American |
29 | languages. |
30 | |
51b5cecb |
31 | There are many character sets that extend the range of integers |
32 | from 0..2**7-1 up to 2**8-1, or 8 bit bytes (octets if you prefer). |
33 | One common one is the ISO 8859-1 character set. |
34 | |
d396a558 |
35 | =head2 ISO 8859 |
36 | |
37 | The ISO 8859-$n are a collection of character code sets from the |
38 | International Organization for Standardization (ISO) each of which |
39 | adds characters to the ASCII set that are typically found in European |
40 | languages many of which are based on the Roman, or Latin, alphabet. |
41 | |
42 | =head2 Latin 1 (ISO 8859-1) |
43 | |
44 | A particular 8-bit extension to ASCII that includes grave and acute |
45 | accented Latin characters. Languages that can employ ISO 8859-1 |
46 | include all the languages covered by ASCII as well as Afrikaans, |
47 | Albanian, Basque, Catalan, Danish, Faroese, Finnish, Norwegian, |
3958b146 |
48 | Portuguese, Spanish, and Swedish. Dutch is covered albeit without |
d396a558 |
49 | the ij ligature. French is covered too but without the oe ligature. |
50 | German can use ISO 8859-1 but must do so without German-style |
51 | quotation marks. This set is based on Western European extensions |
52 | to ASCII and is commonly encountered in world wide web work. |
53 | In IBM character code set identification terminology ISO 8859-1 is |
51b5cecb |
54 | also known as CCSID 819 (or sometimes 0819 or even 00819). |
d396a558 |
55 | |
56 | =head2 EBCDIC |
57 | |
395f5a0c |
58 | The Extended Binary Coded Decimal Interchange Code refers to a |
51b5cecb |
59 | large collection of slightly different single and multi byte |
60 | coded character sets that are different from ASCII or ISO 8859-1 |
61 | and typically run on host computers. The EBCDIC encodings derive |
62 | from 8 bit byte extensions of Hollerith punched card encodings. |
d396a558 |
63 | The layout on the cards was such that high bits were set for the |
64 | upper and lower case alphabet characters [a-z] and [A-Z], but there |
65 | were gaps within each Latin alphabet range. |
66 | |
51b5cecb |
67 | Some IBM EBCDIC character sets may be known by character code set |
68 | identification numbers (CCSID numbers) or code page numbers. Leading |
69 | zero digits in CCSID numbers within this document are insignificant. |
70 | E.g. CCSID 0037 may be referred to as 37 in places. |
71 | |
1e054b24 |
72 | =head2 13 variant characters |
73 | |
51b5cecb |
74 | Among IBM EBCDIC character code sets there are 13 characters that |
75 | are often mapped to different integer values. Those characters |
76 | are known as the 13 "variant" characters and are: |
d396a558 |
77 | |
51b5cecb |
78 | \ [ ] { } ^ ~ ! # | $ @ ` |
d396a558 |
79 | |
80 | =head2 0037 |
81 | |
82 | Character code set ID 0037 is a mapping of the ASCII plus Latin-1 |
83 | characters (i.e. ISO 8859-1) to an EBCDIC set. 0037 is used |
51b5cecb |
84 | in North American English locales on the OS/400 operating system |
85 | that runs on AS/400 computers. CCSID 37 differs from ISO 8859-1 |
86 | in 236 places; in other words they agree on only 20 code point values. |
d396a558 |
87 | |
88 | =head2 1047 |
89 | |
90 | Character code set ID 1047 is also a mapping of the ASCII plus |
91 | Latin-1 characters (i.e. ISO 8859-1) to an EBCDIC set. 1047 is |
395f5a0c |
92 | used under Unix System Services for OS/390 or z/OS, and OpenEdition |
93 | for VM/ESA. CCSID 1047 differs from CCSID 0037 in eight places. |
d396a558 |
94 | |
95 | =head2 POSIX-BC |
96 | |
97 | The EBCDIC code page in use on Siemens' BS2000 system is distinct from |
98 | 1047 and 0037. It is identified below as the POSIX-BC set. |
99 | |
395f5a0c |
100 | =head2 Unicode and UTF |
101 | |
102 | UTF stands for Unicode Transformation Format. UTF-8 is an encoding of |
103 | Unicode whose first 128 code points are encoded exactly as in ASCII. |
104 | UTF-EBCDIC is an attempt to represent Unicode characters in an EBCDIC |
105 | transparent manner. |
106 | |
d396a558 |
107 | =head1 SINGLE OCTET TABLES |
108 | |
109 | The following tables list the ASCII and Latin 1 ordered sets including |
110 | the subsets: C0 controls (0x00..0x1f), ASCII graphics (0x20..0x7e), delete |
111 | (0x7f), C1 controls (0x80..0x9f), and Latin-1 (a.k.a. ISO 8859-1) (0xa0..0xff). In the |
112 | table non-printing control character names as well as the Latin 1 |
113 | extensions to ASCII have been labelled with character names roughly |
395f5a0c |
114 | corresponding to I<The Unicode Standard, Version 3.0> albeit with |
d396a558 |
115 | substitutions such as s/LATIN// and s/VULGAR// in all cases, |
116 | s/CAPITAL LETTER// in some cases, and s/SMALL LETTER ([A-Z])/\l$1/ |
1e054b24 |
117 | in some other cases (the C<charnames> pragma names unfortunately do |
118 | not list explicit names for the C0 or C1 control characters). The |
119 | "names" of the C1 control set (128..159 in ISO 8859-1) listed here are |
120 | somewhat arbitrary. The differences between the 0037 and 1047 sets are |
121 | flagged with ***. The differences between the 1047 and POSIX-BC sets |
122 | are flagged with ###. All ord() numbers listed are decimal. If you |
123 | would rather see this table listing octal values then run the table |
124 | (that is, the pod version of this document since this recipe may not |
125 | work with a pod2_other_format translation) through: |
d396a558 |
126 | |
127 | =over 4 |
128 | |
129 | =item recipe 0 |
130 | |
131 | =back |
132 | |
769c2898 |
133 | perldoc -m perlebcdic | \ |
134 | perl -ne 'if(/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \ |
135 | -e '{printf("%s%-9o%-9o%-9o%o\n",$1,$2,$3,$4,$5)}' |
136 | |
137 | Or, as a script, called like C<perldoc -m perlebcdic | extract.pl>: |
138 | |
139 | my $regex = qr/ |
140 | (.{33}) # any 33 characters |
141 | |
142 | (\d+)\s+ # capture some digits, discard spaces |
143 | (\d+)\s+ # ".." |
144 | (\d+)\s+ # ".." |
145 | (\d+) # capture some digits |
146 | /x; |
147 | |
148 | while ( <> ) { |
149 | if ( $_ =~ $regex ) { |
150 | printf( |
151 | "%s%-9o%-9o%-9o%o\n", |
152 | $1, $2, $3, $4, $5, |
153 | ); |
154 | } |
155 | } |
395f5a0c |
156 | |
157 | If you want to retain the UTF-x code points then in script form you |
158 | might want to write: |
159 | |
160 | =over 4 |
161 | |
162 | =item recipe 1 |
163 | |
164 | =back |
165 | |
769c2898 |
166 | my $regex = qr/ |
167 | (.{33}) # $1: any 33 characters |
168 | |
169 | (\d+)\s+ # $2, $3, $4, $5: |
170 | (\d+)\s+ # capture some digits, discard spaces |
171 | (\d+)\s+ # 4 times |
172 | (\d+)\s+ |
173 | |
174 | (\d+) # $6: capture some digits, |
175 | \.? # there may be a period, |
176 | (\d*) # $7: capture some digits if they're there, |
177 | \s+ # discard spaces |
178 | |
179 | (\d+) # $8: capture some digits |
180 | \.? # there may be a period, |
181 | (\d*) # $9: capture some digits if they're there, |
182 | /x; |
183 | |
184 | open( FH, 'perldoc -m perlebcdic |' ) || |
185 | die "Could not open perlebcdic.pod: $!"; |
186 | while ( <FH> ) { |
187 | if ( $_ =~ $regex ) { |
188 | if ( $7 ne '' && $9 ne '' ) { |
189 | printf( |
190 | "%s%-9o%-9o%-9o%-9o%-3o.%-5o%-3o.%o\n", |
191 | $1, $2, $3, $4, $5, $6, $7, $8, $9 |
192 | ); |
193 | } elsif ( $7 ne '' ) { |
194 | printf( |
195 | "%s%-9o%-9o%-9o%-9o%-3o.%-5o%o\n", |
196 | $1, $2, $3, $4, $5, $6, $7, $8 |
197 | ); |
198 | } else { |
199 | printf( |
200 | "%s%-9o%-9o%-9o%-9o%-9o%o\n", |
201 | $1, $2, $3, $4, $5, $6, $8 |
202 | ); |
395f5a0c |
203 | } |
204 | } |
205 | } |
769c2898 |
206 | close FH; |
d396a558 |
207 | |
208 | If you would rather see this table listing hexadecimal values then |
209 | run the table through: |
210 | |
211 | =over 4 |
212 | |
395f5a0c |
213 | =item recipe 2 |
d396a558 |
214 | |
215 | =back |
216 | |
769c2898 |
217 | perldoc -m perlebcdic | \ |
218 | perl -ne 'if(/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \ |
219 | -e '{printf("%s%-9X%-9X%-9X%X\n",$1,$2,$3,$4,$5)}' |
395f5a0c |
220 | |
221 | Or, in order to retain the UTF-x code points in hexadecimal: |
222 | |
223 | =over 4 |
224 | |
225 | =item recipe 3 |
226 | |
227 | =back |
228 | |
769c2898 |
229 | my $regex = qr/ |
230 | (.{33}) # $1: any 33 characters |
231 | |
232 | (\d+)\s+ # $2, $3, $4, $5: |
233 | (\d+)\s+ # capture some digits, discard spaces |
234 | (\d+)\s+ # 4 times |
235 | (\d+)\s+ |
236 | |
237 | (\d+) # $6: capture some digits, |
238 | \.? # there may be a period, |
239 | (\d*) # $7: capture some digits if they're there, |
240 | \s+ # discard spaces |
241 | |
242 | (\d+) # $8: capture some digits |
243 | \.? # there may be a period, |
244 | (\d*) # $9: capture some digits if they're there, |
245 | /x; |
246 | |
247 | open( FH, 'perldoc -m perlebcdic |' ) || |
248 | die "Could not open perlebcdic.pod: $!"; |
395f5a0c |
249 | while (<FH>) { |
769c2898 |
250 | if ( $_ =~ $regex ) { |
251 | if ( $7 ne '' && $9 ne '' ) { |
252 | printf( |
253 | "%s%-9X%-9X%-9X%-9X%-2X.%-6X%-2X.%X\n", |
254 | $1, $2, $3, $4, $5, $6, $7, $8, $9 |
255 | ); |
395f5a0c |
256 | } |
769c2898 |
257 | elsif ( $7 ne '' ) { |
258 | printf( |
259 | "%s%-9X%-9X%-9X%-9X%-2X.%-6X%X\n", |
260 | $1, $2, $3, $4, $5, $6, $7, $8 |
261 | ); |
395f5a0c |
262 | } |
263 | else { |
769c2898 |
264 | printf( |
265 | "%s%-9X%-9X%-9X%-9X%-9X%X\n", |
266 | $1, $2, $3, $4, $5, $6, $8 |
267 | ); |
395f5a0c |
268 | } |
269 | } |
270 | } |
271 | |
769c2898 |
272 | =head2 THE SINGLE OCTET TABLE |
395f5a0c |
273 | |
274 | incomp- incomp- |
275 | 8859-1 lete lete |
276 | chr 0819 0037 1047 POSIX-BC UTF-8 UTF-EBCDIC |
277 | ------------------------------------------------------------------------------------ |
278 | <NULL> 0 0 0 0 0 0 |
279 | <START OF HEADING> 1 1 1 1 1 1 |
280 | <START OF TEXT> 2 2 2 2 2 2 |
281 | <END OF TEXT> 3 3 3 3 3 3 |
282 | <END OF TRANSMISSION> 4 55 55 55 4 55 |
283 | <ENQUIRY> 5 45 45 45 5 45 |
284 | <ACKNOWLEDGE> 6 46 46 46 6 46 |
285 | <BELL> 7 47 47 47 7 47 |
286 | <BACKSPACE> 8 22 22 22 8 22 |
287 | <HORIZONTAL TABULATION> 9 5 5 5 9 5 |
288 | <LINE FEED> 10 37 21 21 10 21 *** |
289 | <VERTICAL TABULATION> 11 11 11 11 11 11 |
290 | <FORM FEED> 12 12 12 12 12 12 |
291 | <CARRIAGE RETURN> 13 13 13 13 13 13 |
292 | <SHIFT OUT> 14 14 14 14 14 14 |
293 | <SHIFT IN> 15 15 15 15 15 15 |
294 | <DATA LINK ESCAPE> 16 16 16 16 16 16 |
295 | <DEVICE CONTROL ONE> 17 17 17 17 17 17 |
296 | <DEVICE CONTROL TWO> 18 18 18 18 18 18 |
297 | <DEVICE CONTROL THREE> 19 19 19 19 19 19 |
298 | <DEVICE CONTROL FOUR> 20 60 60 60 20 60 |
299 | <NEGATIVE ACKNOWLEDGE> 21 61 61 61 21 61 |
300 | <SYNCHRONOUS IDLE> 22 50 50 50 22 50 |
301 | <END OF TRANSMISSION BLOCK> 23 38 38 38 23 38 |
302 | <CANCEL> 24 24 24 24 24 24 |
303 | <END OF MEDIUM> 25 25 25 25 25 25 |
304 | <SUBSTITUTE> 26 63 63 63 26 63 |
305 | <ESCAPE> 27 39 39 39 27 39 |
306 | <FILE SEPARATOR> 28 28 28 28 28 28 |
307 | <GROUP SEPARATOR> 29 29 29 29 29 29 |
308 | <RECORD SEPARATOR> 30 30 30 30 30 30 |
309 | <UNIT SEPARATOR> 31 31 31 31 31 31 |
310 | <SPACE> 32 64 64 64 32 64 |
311 | ! 33 90 90 90 33 90 |
312 | " 34 127 127 127 34 127 |
313 | # 35 123 123 123 35 123 |
314 | $ 36 91 91 91 36 91 |
315 | % 37 108 108 108 37 108 |
316 | & 38 80 80 80 38 80 |
317 | ' 39 125 125 125 39 125 |
318 | ( 40 77 77 77 40 77 |
319 | ) 41 93 93 93 41 93 |
320 | * 42 92 92 92 42 92 |
321 | + 43 78 78 78 43 78 |
322 | , 44 107 107 107 44 107 |
323 | - 45 96 96 96 45 96 |
324 | . 46 75 75 75 46 75 |
325 | / 47 97 97 97 47 97 |
326 | 0 48 240 240 240 48 240 |
327 | 1 49 241 241 241 49 241 |
328 | 2 50 242 242 242 50 242 |
329 | 3 51 243 243 243 51 243 |
330 | 4 52 244 244 244 52 244 |
331 | 5 53 245 245 245 53 245 |
332 | 6 54 246 246 246 54 246 |
333 | 7 55 247 247 247 55 247 |
334 | 8 56 248 248 248 56 248 |
335 | 9 57 249 249 249 57 249 |
336 | : 58 122 122 122 58 122 |
337 | ; 59 94 94 94 59 94 |
338 | < 60 76 76 76 60 76 |
339 | = 61 126 126 126 61 126 |
340 | > 62 110 110 110 62 110 |
341 | ? 63 111 111 111 63 111 |
342 | @ 64 124 124 124 64 124 |
343 | A 65 193 193 193 65 193 |
344 | B 66 194 194 194 66 194 |
345 | C 67 195 195 195 67 195 |
346 | D 68 196 196 196 68 196 |
347 | E 69 197 197 197 69 197 |
348 | F 70 198 198 198 70 198 |
349 | G 71 199 199 199 71 199 |
350 | H 72 200 200 200 72 200 |
351 | I 73 201 201 201 73 201 |
352 | J 74 209 209 209 74 209 |
353 | K 75 210 210 210 75 210 |
354 | L 76 211 211 211 76 211 |
355 | M 77 212 212 212 77 212 |
356 | N 78 213 213 213 78 213 |
357 | O 79 214 214 214 79 214 |
358 | P 80 215 215 215 80 215 |
359 | Q 81 216 216 216 81 216 |
360 | R 82 217 217 217 82 217 |
361 | S 83 226 226 226 83 226 |
362 | T 84 227 227 227 84 227 |
363 | U 85 228 228 228 85 228 |
364 | V 86 229 229 229 86 229 |
365 | W 87 230 230 230 87 230 |
366 | X 88 231 231 231 88 231 |
367 | Y 89 232 232 232 89 232 |
368 | Z 90 233 233 233 90 233 |
369 | [ 91 186 173 187 91 173 *** ### |
370 | \ 92 224 224 188 92 224 ### |
371 | ] 93 187 189 189 93 189 *** |
372 | ^ 94 176 95 106 94 95 *** ### |
373 | _ 95 109 109 109 95 109 |
374 | ` 96 121 121 74 96 121 ### |
375 | a 97 129 129 129 97 129 |
376 | b 98 130 130 130 98 130 |
377 | c 99 131 131 131 99 131 |
378 | d 100 132 132 132 100 132 |
379 | e 101 133 133 133 101 133 |
380 | f 102 134 134 134 102 134 |
381 | g 103 135 135 135 103 135 |
382 | h 104 136 136 136 104 136 |
383 | i 105 137 137 137 105 137 |
384 | j 106 145 145 145 106 145 |
385 | k 107 146 146 146 107 146 |
386 | l 108 147 147 147 108 147 |
387 | m 109 148 148 148 109 148 |
388 | n 110 149 149 149 110 149 |
389 | o 111 150 150 150 111 150 |
390 | p 112 151 151 151 112 151 |
391 | q 113 152 152 152 113 152 |
392 | r 114 153 153 153 114 153 |
393 | s 115 162 162 162 115 162 |
394 | t 116 163 163 163 116 163 |
395 | u 117 164 164 164 117 164 |
396 | v 118 165 165 165 118 165 |
397 | w 119 166 166 166 119 166 |
398 | x 120 167 167 167 120 167 |
399 | y 121 168 168 168 121 168 |
400 | z 122 169 169 169 122 169 |
401 | { 123 192 192 251 123 192 ### |
402 | | 124 79 79 79 124 79 |
403 | } 125 208 208 253 125 208 ### |
404 | ~ 126 161 161 255 126 161 ### |
405 | <DELETE> 127 7 7 7 127 7 |
406 | <C1 0> 128 32 32 32 194.128 32 |
407 | <C1 1> 129 33 33 33 194.129 33 |
408 | <C1 2> 130 34 34 34 194.130 34 |
409 | <C1 3> 131 35 35 35 194.131 35 |
410 | <C1 4> 132 36 36 36 194.132 36 |
411 | <C1 5> 133 21 37 37 194.133 37 *** |
412 | <C1 6> 134 6 6 6 194.134 6 |
413 | <C1 7> 135 23 23 23 194.135 23 |
414 | <C1 8> 136 40 40 40 194.136 40 |
415 | <C1 9> 137 41 41 41 194.137 41 |
416 | <C1 10> 138 42 42 42 194.138 42 |
417 | <C1 11> 139 43 43 43 194.139 43 |
418 | <C1 12> 140 44 44 44 194.140 44 |
419 | <C1 13> 141 9 9 9 194.141 9 |
420 | <C1 14> 142 10 10 10 194.142 10 |
421 | <C1 15> 143 27 27 27 194.143 27 |
422 | <C1 16> 144 48 48 48 194.144 48 |
423 | <C1 17> 145 49 49 49 194.145 49 |
424 | <C1 18> 146 26 26 26 194.146 26 |
425 | <C1 19> 147 51 51 51 194.147 51 |
426 | <C1 20> 148 52 52 52 194.148 52 |
427 | <C1 21> 149 53 53 53 194.149 53 |
428 | <C1 22> 150 54 54 54 194.150 54 |
429 | <C1 23> 151 8 8 8 194.151 8 |
430 | <C1 24> 152 56 56 56 194.152 56 |
431 | <C1 25> 153 57 57 57 194.153 57 |
432 | <C1 26> 154 58 58 58 194.154 58 |
433 | <C1 27> 155 59 59 59 194.155 59 |
434 | <C1 28> 156 4 4 4 194.156 4 |
435 | <C1 29> 157 20 20 20 194.157 20 |
436 | <C1 30> 158 62 62 62 194.158 62 |
437 | <C1 31> 159 255 255 95 194.159 255 ### |
438 | <NON-BREAKING SPACE> 160 65 65 65 194.160 128.65 |
439 | <INVERTED EXCLAMATION MARK> 161 170 170 170 194.161 128.66 |
440 | <CENT SIGN> 162 74 74 176 194.162 128.67 ### |
441 | <POUND SIGN> 163 177 177 177 194.163 128.68 |
442 | <CURRENCY SIGN> 164 159 159 159 194.164 128.69 |
443 | <YEN SIGN> 165 178 178 178 194.165 128.70 |
444 | <BROKEN BAR> 166 106 106 208 194.166 128.71 ### |
445 | <SECTION SIGN> 167 181 181 181 194.167 128.72 |
446 | <DIAERESIS> 168 189 187 121 194.168 128.73 *** ### |
447 | <COPYRIGHT SIGN> 169 180 180 180 194.169 128.74 |
448 | <FEMININE ORDINAL INDICATOR> 170 154 154 154 194.170 128.81 |
449 | <LEFT POINTING GUILLEMET> 171 138 138 138 194.171 128.82 |
450 | <NOT SIGN> 172 95 176 186 194.172 128.83 *** ### |
451 | <SOFT HYPHEN> 173 202 202 202 194.173 128.84 |
452 | <REGISTERED TRADE MARK SIGN> 174 175 175 175 194.174 128.85 |
453 | <MACRON> 175 188 188 161 194.175 128.86 ### |
454 | <DEGREE SIGN> 176 144 144 144 194.176 128.87 |
455 | <PLUS-OR-MINUS SIGN> 177 143 143 143 194.177 128.88 |
456 | <SUPERSCRIPT TWO> 178 234 234 234 194.178 128.89 |
457 | <SUPERSCRIPT THREE> 179 250 250 250 194.179 128.98 |
458 | <ACUTE ACCENT> 180 190 190 190 194.180 128.99 |
459 | <MICRO SIGN> 181 160 160 160 194.181 128.100 |
460 | <PARAGRAPH SIGN> 182 182 182 182 194.182 128.101 |
461 | <MIDDLE DOT> 183 179 179 179 194.183 128.102 |
462 | <CEDILLA> 184 157 157 157 194.184 128.103 |
463 | <SUPERSCRIPT ONE> 185 218 218 218 194.185 128.104 |
464 | <MASC. ORDINAL INDICATOR> 186 155 155 155 194.186 128.105 |
465 | <RIGHT POINTING GUILLEMET> 187 139 139 139 194.187 128.106 |
466 | <FRACTION ONE QUARTER> 188 183 183 183 194.188 128.112 |
467 | <FRACTION ONE HALF> 189 184 184 184 194.189 128.113 |
468 | <FRACTION THREE QUARTERS> 190 185 185 185 194.190 128.114 |
469 | <INVERTED QUESTION MARK> 191 171 171 171 194.191 128.115 |
470 | <A WITH GRAVE> 192 100 100 100 195.128 138.65 |
471 | <A WITH ACUTE> 193 101 101 101 195.129 138.66 |
472 | <A WITH CIRCUMFLEX> 194 98 98 98 195.130 138.67 |
473 | <A WITH TILDE> 195 102 102 102 195.131 138.68 |
474 | <A WITH DIAERESIS> 196 99 99 99 195.132 138.69 |
475 | <A WITH RING ABOVE> 197 103 103 103 195.133 138.70 |
476 | <CAPITAL LIGATURE AE> 198 158 158 158 195.134 138.71 |
477 | <C WITH CEDILLA> 199 104 104 104 195.135 138.72 |
478 | <E WITH GRAVE> 200 116 116 116 195.136 138.73 |
479 | <E WITH ACUTE> 201 113 113 113 195.137 138.74 |
480 | <E WITH CIRCUMFLEX> 202 114 114 114 195.138 138.81 |
481 | <E WITH DIAERESIS> 203 115 115 115 195.139 138.82 |
482 | <I WITH GRAVE> 204 120 120 120 195.140 138.83 |
483 | <I WITH ACUTE> 205 117 117 117 195.141 138.84 |
484 | <I WITH CIRCUMFLEX> 206 118 118 118 195.142 138.85 |
485 | <I WITH DIAERESIS> 207 119 119 119 195.143 138.86 |
486 | <CAPITAL LETTER ETH> 208 172 172 172 195.144 138.87 |
487 | <N WITH TILDE> 209 105 105 105 195.145 138.88 |
488 | <O WITH GRAVE> 210 237 237 237 195.146 138.89 |
489 | <O WITH ACUTE> 211 238 238 238 195.147 138.98 |
490 | <O WITH CIRCUMFLEX> 212 235 235 235 195.148 138.99 |
491 | <O WITH TILDE> 213 239 239 239 195.149 138.100 |
492 | <O WITH DIAERESIS> 214 236 236 236 195.150 138.101 |
493 | <MULTIPLICATION SIGN> 215 191 191 191 195.151 138.102 |
494 | <O WITH STROKE> 216 128 128 128 195.152 138.103 |
495 | <U WITH GRAVE> 217 253 253 224 195.153 138.104 ### |
496 | <U WITH ACUTE> 218 254 254 254 195.154 138.105 |
497 | <U WITH CIRCUMFLEX> 219 251 251 221 195.155 138.106 ### |
498 | <U WITH DIAERESIS> 220 252 252 252 195.156 138.112 |
499 | <Y WITH ACUTE> 221 173 186 173 195.157 138.113 *** ### |
500 | <CAPITAL LETTER THORN> 222 174 174 174 195.158 138.114 |
501 | <SMALL LETTER SHARP S> 223 89 89 89 195.159 138.115 |
502 | <a WITH GRAVE> 224 68 68 68 195.160 139.65 |
503 | <a WITH ACUTE> 225 69 69 69 195.161 139.66 |
504 | <a WITH CIRCUMFLEX> 226 66 66 66 195.162 139.67 |
505 | <a WITH TILDE> 227 70 70 70 195.163 139.68 |
506 | <a WITH DIAERESIS> 228 67 67 67 195.164 139.69 |
507 | <a WITH RING ABOVE> 229 71 71 71 195.165 139.70 |
508 | <SMALL LIGATURE ae> 230 156 156 156 195.166 139.71 |
509 | <c WITH CEDILLA> 231 72 72 72 195.167 139.72 |
510 | <e WITH GRAVE> 232 84 84 84 195.168 139.73 |
511 | <e WITH ACUTE> 233 81 81 81 195.169 139.74 |
512 | <e WITH CIRCUMFLEX> 234 82 82 82 195.170 139.81 |
513 | <e WITH DIAERESIS> 235 83 83 83 195.171 139.82 |
514 | <i WITH GRAVE> 236 88 88 88 195.172 139.83 |
515 | <i WITH ACUTE> 237 85 85 85 195.173 139.84 |
516 | <i WITH CIRCUMFLEX> 238 86 86 86 195.174 139.85 |
517 | <i WITH DIAERESIS> 239 87 87 87 195.175 139.86 |
518 | <SMALL LETTER eth> 240 140 140 140 195.176 139.87 |
519 | <n WITH TILDE> 241 73 73 73 195.177 139.88 |
520 | <o WITH GRAVE> 242 205 205 205 195.178 139.89 |
521 | <o WITH ACUTE> 243 206 206 206 195.179 139.98 |
522 | <o WITH CIRCUMFLEX> 244 203 203 203 195.180 139.99 |
523 | <o WITH TILDE> 245 207 207 207 195.181 139.100 |
524 | <o WITH DIAERESIS> 246 204 204 204 195.182 139.101 |
525 | <DIVISION SIGN> 247 225 225 225 195.183 139.102 |
526 | <o WITH STROKE> 248 112 112 112 195.184 139.103 |
527 | <u WITH GRAVE> 249 221 221 192 195.185 139.104 ### |
528 | <u WITH ACUTE> 250 222 222 222 195.186 139.105 |
529 | <u WITH CIRCUMFLEX> 251 219 219 219 195.187 139.106 |
530 | <u WITH DIAERESIS> 252 220 220 220 195.188 139.112 |
531 | <y WITH ACUTE> 253 141 141 141 195.189 139.113 |
532 | <SMALL LETTER thorn> 254 142 142 142 195.190 139.114 |
533 | <y WITH DIAERESIS> 255 223 223 223 195.191 139.115 |
d396a558 |
534 | |
769c2898 |
535 | |
d396a558 |
536 | To see the above table in CCSID 0037 order instead of |
537 | ASCII + Latin-1 order, run the table through: |
538 | |
539 | =over 4 |
540 | |
395f5a0c |
541 | =item recipe 4 |
d396a558 |
542 | |
543 | =back |
544 | |
769c2898 |
545 | perldoc -m perlebcdic | \ |
546 | perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)' \ |
547 | -e '{push(@l,$_)}' \ |
548 | -e 'END{print map{$_->[0]}' \ |
549 | -e 'sort{$a->[1] <=> $b->[1]}' \ |
550 | -e 'map{[$_,substr($_,42,3)]}@l;}' |
d396a558 |
551 | |
552 | If you would rather see it in CCSID 1047 order then change the number |
553 | 42 in the last line to 51, like this: |
554 | |
555 | =over 4 |
556 | |
395f5a0c |
557 | =item recipe 5 |
d396a558 |
558 | |
559 | =back |
560 | |
769c2898 |
561 | perldoc -m perlebcdic | \ |
562 | perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)' \ |
563 | -e '{push(@l,$_)}' \ |
564 | -e 'END{print map{$_->[0]}' \ |
565 | -e 'sort{$a->[1] <=> $b->[1]}' \ |
566 | -e 'map{[$_,substr($_,51,3)]}@l;}' |
d396a558 |
567 | |
568 | If you would rather see it in POSIX-BC order then change the number |
569 | 51 in the last line to 60, like this: |
570 | |
571 | =over 4 |
572 | |
395f5a0c |
573 | =item recipe 6 |
d396a558 |
574 | |
575 | =back |
576 | |
769c2898 |
577 | perldoc -m perlebcdic | \ |
578 | perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)' \ |
579 | -e '{push(@l,$_)}' \ |
580 | -e 'END{print map{$_->[0]}' \ |
581 | -e 'sort{$a->[1] <=> $b->[1]}' \ |
582 | -e 'map{[$_,substr($_,60,3)]}@l;}' |
d396a558 |
583 | |
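Recipes 4 through 6 rely on fixed character offsets (42, 51, 60), so they only work on the pod source with its original column alignment. As a rough sketch, the same sort can be written as a script that finds the columns with a regular expression instead. The helper name C<sort_rows_by_column> and the sample rows are illustrative assumptions, not part of perlebcdic itself:

```perl
use strict;
use warnings;

# Hypothetical helper: sort table rows by one of the four single octet
# columns (0 = 8859-1, 1 = 0037, 2 = 1047, 3 = POSIX-BC).
# Lines that do not look like table rows are silently dropped.
sub sort_rows_by_column {
    my ($column, @rows) = @_;
    my $row_re = qr/^\s*(.+?)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+[\d.]+\s+[\d.]+/;
    return map  { $_->[0] }
           sort { $a->[1] <=> $b->[1] }
           map  { my @f = $_ =~ $row_re; @f ? [ $_, $f[ $column + 1 ] ] : () }
           @rows;
}

# A few rows copied from the single octet table above.
my @sample = (
    "<HORIZONTAL TABULATION> 9 5 5 5 9 5\n",
    "<LINE FEED> 10 37 21 21 10 21 ***\n",
    "A 65 193 193 193 65 193\n",
    "<DELETE> 127 7 7 7 127 7\n",
);

# Sort into CCSID 0037 order: HT (5), DELETE (7), LINE FEED (37), A (193).
print sort_rows_by_column( 1, @sample );
```

To process the whole table one could feed it the lines read from C<perldoc -m perlebcdic> instead of C<@sample>.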
584 | |
585 | =head1 IDENTIFYING CHARACTER CODE SETS |
586 | |
587 | To determine the character set you are running under from perl one |
588 | could use the return value of ord() or chr() to test one or more |
589 | character values. For example: |
590 | |
769c2898 |
591 | my $is_ascii = "A" eq chr(65); |
592 | my $is_ebcdic = "A" eq chr(193); |
d396a558 |
593 | |
51b5cecb |
594 | Also, "\t" is a C<HORIZONTAL TABULATION> character so that: |
d396a558 |
595 | |
769c2898 |
596 | my $is_ascii = ord("\t") == 9; |
597 | my $is_ebcdic = ord("\t") == 5; |
d396a558 |
598 | |
599 | To distinguish EBCDIC code pages try looking at one or more of |
600 | the characters that differ between them. For example: |
601 | |
769c2898 |
602 | my $is_ebcdic_37 = "\n" eq chr(37); |
603 | my $is_ebcdic_1047 = "\n" eq chr(21); |
d396a558 |
604 | |
605 | Or better still choose a character that is uniquely encoded in each |
606 | of the code sets, e.g.: |
607 | |
769c2898 |
608 | my $is_ascii = ord('[') == 91; |
609 | my $is_ebcdic_37 = ord('[') == 186; |
610 | my $is_ebcdic_1047 = ord('[') == 173; |
611 | my $is_ebcdic_POSIX_BC = ord('[') == 187; |
d396a558 |
612 | |
613 | However, it would be unwise to write tests such as: |
614 | |
769c2898 |
615 | my $is_ascii = "\r" ne chr(13); # WRONG |
616 | my $is_ascii = "\n" ne chr(10); # ILL ADVISED |
d396a558 |
617 | |
618 | Obviously the first of these will fail to distinguish most ASCII machines |
769c2898 |
619 | from either a CCSID 0037, a 1047, or a POSIX-BC EBCDIC machine since "\r" eq |
620 | chr(13) under all of those coded character sets. But note too that |
621 | because "\n" is chr(13) and "\r" is chr(10) on the Macintosh (which is an |
d396a558 |
622 | ASCII machine) the second C<$is_ascii> test will lead to trouble there. |
623 | |
769c2898 |
624 | To determine whether or not perl was built under an EBCDIC |
d396a558 |
625 | code page you can use the Config module like so: |
626 | |
627 | use Config; |
769c2898 |
628 | my $is_ebcdic = $Config{'ebcdic'} eq 'define'; |
d396a558 |
629 | |
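The individual tests above can be folded into one routine. The following is only a sketch: the name C<guess_code_set> is made up for illustration, and the ordinals are the ones listed for C<[> in the single octet table.

```perl
use strict;
use warnings;

# Hypothetical helper: classify the native code set by the ordinal of '[',
# which differs in each of the four code sets discussed in this document.
sub guess_code_set {
    my %name_for = (
        91  => 'ASCII or ISO 8859-1',
        186 => 'EBCDIC CCSID 0037',
        173 => 'EBCDIC CCSID 1047',
        187 => 'EBCDIC POSIX-BC',
    );
    my $ord = ord('[');
    return exists $name_for{$ord}
        ? $name_for{$ord}
        : "unknown (ord('[') == $ord)";
}

print guess_code_set(), "\n";
```

A code set not covered by the table falls through to the "unknown" branch rather than being misreported.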
630 | =head1 CONVERSIONS |
631 | |
1e054b24 |
632 | =head2 tr/// |
633 | |
d396a558 |
634 | In order to convert a string of characters from one character set to |
635 | another a simple list of numbers, such as in the right columns in the |
636 | above table, along with perl's tr/// operator is all that is needed. |
637 | The data in the table are in ASCII order hence the EBCDIC columns |
638 | provide easy to use ASCII to EBCDIC operations that are also easily |
639 | reversed. |
640 | |
769c2898 |
641 | For example, to convert ASCII to code page 037, list the ISO 8859-1 values of |
642 | the table in CCSID 0037 order (see recipe 4) as octal escapes and use |
d5d9880c |
643 | the list in tr/// like so: |
d396a558 |
644 | |
769c2898 |
645 | my $cp_037 = join '', |
646 | q[\000\001\002\003\234\011\206\177\227\215\216\013\014\015\016\017], |
647 | q[\020\021\022\023\235\205\010\207\030\031\222\217\034\035\036\037], |
648 | q[\200\201\202\203\204\012\027\033\210\211\212\213\214\005\006\007], |
649 | q[\220\221\026\223\224\225\226\004\230\231\232\233\024\025\236\032], |
650 | q[\040\240\342\344\340\341\343\345\347\361\242\056\074\050\053\174], |
651 | q[\046\351\352\353\350\355\356\357\354\337\041\044\052\051\073\254], |
652 | q[\055\057\302\304\300\301\303\305\307\321\246\054\045\137\076\077], |
653 | q[\370\311\312\313\310\315\316\317\314\140\072\043\100\047\075\042], |
654 | q[\330\141\142\143\144\145\146\147\150\151\253\273\360\375\376\261], |
655 | q[\260\152\153\154\155\156\157\160\161\162\252\272\346\270\306\244], |
656 | q[\265\176\163\164\165\166\167\170\171\172\241\277\320\335\336\256], |
657 | q[\136\243\245\267\251\247\266\274\275\276\133\135\257\250\264\327], |
658 | q[\173\101\102\103\104\105\106\107\110\111\255\364\366\362\363\365], |
659 | q[\175\112\113\114\115\116\117\120\121\122\271\373\374\371\372\377], |
660 | q[\134\367\123\124\125\126\127\130\131\132\262\324\326\322\323\325], |
661 | q[\060\061\062\063\064\065\066\067\070\071\263\333\334\331\332\237]; |
d396a558 |
662 | |
663 | my $ebcdic_string = $ascii_string; |
769c2898 |
664 | |
1e054b24 |
665 | eval '$ebcdic_string =~ tr/' . $cp_037 . '/\000-\377/'; |
d396a558 |
666 | |
d5d9880c |
667 | To convert from EBCDIC 037 to ASCII just reverse the order of the tr/// |
d396a558 |
668 | arguments like so: |
669 | |
670 | my $ascii_string = $ebcdic_string; |
d5d9880c |
671 | eval '$ascii_string =~ tr/\000-\377/' . $cp_037 . '/'; |
672 | |
673 | Similarly one could list the ISO 8859-1 values in CCSID 1047 order (recipe 5) to |
674 | obtain a C<$cp_1047> table. The same list in POSIX-BC order (recipe |
675 | 6) could provide a C<$cp_posix_bc> table suitable for transcoding as well. |
1e054b24 |
676 | |
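For a smaller worked example of the same technique, tr/// can map just the letters, digits, and space to their CCSID 0037 values. The sketch below assumes the script itself runs on an ASCII machine, and the sub name C<ascii_to_cp037_alnum> is invented for illustration. The ranges show the gaps in the EBCDIC alphabet: A-I, J-R, and S-Z each map to a separate run of code points.

```perl
use strict;
use warnings;

# Hypothetical helper, assuming an ASCII host: map ASCII letters, digits
# and space to their CCSID 0037 byte values (taken from the table above).
# Any other character is passed through unchanged.
sub ascii_to_cp037_alnum {
    my ($text) = @_;
    ( my $out = $text ) =~
        tr/A-IJ-RS-Za-ij-rs-z0-9 /\xC1-\xC9\xD1-\xD9\xE2-\xE9\x81-\x89\x91-\x99\xA2-\xA9\xF0-\xF9\x40/;
    return $out;
}

my $ebcdic = ascii_to_cp037_alnum('Perl 5');

# Show the resulting bytes in hexadecimal: D7 85 99 93 40 F5
print join( ' ', map { sprintf( '%02X', ord($_) ) } split //, $ebcdic ), "\n";
```

Because the ranges are written out literally, no eval is needed here; the price is that punctuation and the Latin-1 characters are left untranslated.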
677 | =head2 iconv |
d396a558 |
678 | |
d5d9880c |
679 | XPG operability often implies the presence of an I<iconv> utility |
d396a558 |
680 | available from the shell or from the C library. Consult your system's |
681 | documentation for information on iconv. |
682 | |
3958b146 |
683 | On OS/390 or z/OS see the iconv(1) manpage. One way to invoke the iconv |
d396a558 |
684 | shell utility from within perl would be: |
685 | |
395f5a0c |
686 | # OS/390 or z/OS example |
769c2898 |
687 | my $ascii_data = `echo '$ebcdic_data'| iconv -f IBM-1047 -t ISO8859-1`; |
d396a558 |
688 | |
689 | or the inverse map: |
690 | |
395f5a0c |
691 | # OS/390 or z/OS example |
769c2898 |
692 | my $ebcdic_data = `echo '$ascii_data'| iconv -f ISO8859-1 -t IBM-1047`; |
d396a558 |
693 | |
d396a558 |
694 | For other perl based conversion options see the Convert::* modules on CPAN. |
695 | |
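One such option, not otherwise covered in this document, is the Encode module. The sketch below assumes a perl new enough to ship Encode (5.8 or later) and uses its C<cp37> encoding name for CCSID 0037. It avoids passing data through the shell.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumption: Encode's "cp37" encoding corresponds to CCSID 0037.
my $text = 'Perl';
my $cp37 = encode( 'cp37', $text );    # characters -> EBCDIC bytes
my $back = decode( 'cp37', $cp37 );    # EBCDIC bytes -> characters

# Print the EBCDIC bytes in hexadecimal for comparison with the table.
print join( ' ', map { sprintf( '%02X', ord($_) ) } split //, $cp37 ), "\n";
print "round trip ok\n" if $back eq $text;
```

The printed bytes can be checked against the 0037 column of the single octet table.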
1e054b24 |
696 | =head2 C RTL |
697 | |
395f5a0c |
698 | The OS/390 and z/OS C run time libraries provide _atoe() and _etoa() functions. |
1e054b24 |
699 | |
d396a558 |
700 | =head1 OPERATOR DIFFERENCES |
701 | |
702 | The C<..> range operator treats certain character ranges with |
703 | care on EBCDIC machines. For example the following array |
704 | will have twenty six elements on either an EBCDIC machine |
705 | or an ASCII machine: |
706 | |
769c2898 |
707 | my @alphabet = ( 'A'..'Z' ); # $#alphabet == 25 |
d396a558 |
708 | |
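By contrast, a range built from code numbers does not get this special treatment. In the sketch below (the variable names are illustrative), both arrays have 26 elements on an ASCII machine, but on an EBCDIC machine the second would also pick up the code points that sit in the gaps between I and J and between R and S:

```perl
use strict;
use warnings;

# Character range: handled specially, always the 26 letters.
my @by_range = ( 'A' .. 'Z' );

# Numeric range over the code points: on EBCDIC, ord('A') is 193 and
# ord('Z') is 233, so this would yield 41 elements there.
my @by_code = map { chr($_) } ord('A') .. ord('Z');

printf "range: %d, by code: %d\n", scalar @by_range, scalar @by_code;
```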
709 | The bitwise operators such as & ^ | may return different results |
710 | when operating on string or character data in a perl program running |
711 | on an EBCDIC machine than when run on an ASCII machine. Here is |
712 | an example adapted from the one in L<perlop>: |
713 | |
714 | # EBCDIC-based examples |
769c2898 |
715 | print "j p \n" ^ " a h"; # prints "JAPH\n" |
716 | print "JA" | " ph\n"; # prints "japh\n" |
717 | print "JAPH\nJunk" & "\277\277\277\277\277"; # prints "japh\n" |
718 | print 'p N$' ^ " E<H\n"; # prints "Perl\n" |
d396a558 |
719 | |
720 | An interesting property of the 32 C0 control characters |
721 | in the ASCII table is that they can "literally" be constructed |
51b5cecb |
722 | as control characters in perl, e.g. C<(chr(0) eq "\c@")>, |
723 | C<(chr(1) eq "\cA")>, and so on. Perl on EBCDIC machines has been |
724 | ported to take "\c@" to chr(0) and "\cA" to chr(1) as well, but the |
d396a558 |
725 | thirty three characters that result depend on which code page you are |
726 | using. The table below uses the character names from the previous table |
51b5cecb |
727 | but with substitutions such as s/START OF/S.O./; s/END OF /E.O./; |
d396a558 |
728 | s/TRANSMISSION/TRANS./; s/TABULATION/TAB./; s/VERTICAL/VERT./; |
729 | s/HORIZONTAL/HORIZ./; s/DEVICE CONTROL/D.C./; s/SEPARATOR/SEP./; |
730 | s/NEGATIVE ACKNOWLEDGE/NEG. ACK./;. The POSIX-BC and 1047 sets are |
731 | identical throughout this range and differ from the 0037 set at only |
51b5cecb |
732 | one spot (21 decimal). Note that the C<LINE FEED> character |
733 | may be generated by "\cJ" on ASCII machines but by "\cU" on 1047 or POSIX-BC |
734 | machines and cannot be generated as a C<"\c.letter."> control character on |
735 | 0037 machines. Note also that "\c\\" maps to two characters |
d396a558 |
736 | not one. |
737 | |
    chr    ord  8859-1               0037                 1047 && POSIX-BC
    ------------------------------------------------------------------------------
    "\c?"  127  <DELETE>             "                    "                    ***><
    "\c@"    0  <NULL>               <NULL>               <NULL>               ***><
    "\cA"    1  <S.O. HEADING>       <S.O. HEADING>       <S.O. HEADING>
    "\cB"    2  <S.O. TEXT>          <S.O. TEXT>          <S.O. TEXT>
    "\cC"    3  <E.O. TEXT>          <E.O. TEXT>          <E.O. TEXT>
    "\cD"    4  <E.O. TRANS.>        <C1 28>              <C1 28>
    "\cE"    5  <ENQUIRY>            <HORIZ. TAB.>        <HORIZ. TAB.>
    "\cF"    6  <ACKNOWLEDGE>        <C1 6>               <C1 6>
    "\cG"    7  <BELL>               <DELETE>             <DELETE>
    "\cH"    8  <BACKSPACE>          <C1 23>              <C1 23>
    "\cI"    9  <HORIZ. TAB.>        <C1 13>              <C1 13>
    "\cJ"   10  <LINE FEED>          <C1 14>              <C1 14>
    "\cK"   11  <VERT. TAB.>         <VERT. TAB.>         <VERT. TAB.>
    "\cL"   12  <FORM FEED>          <FORM FEED>          <FORM FEED>
    "\cM"   13  <CARRIAGE RETURN>    <CARRIAGE RETURN>    <CARRIAGE RETURN>
    "\cN"   14  <SHIFT OUT>          <SHIFT OUT>          <SHIFT OUT>
    "\cO"   15  <SHIFT IN>           <SHIFT IN>           <SHIFT IN>
    "\cP"   16  <DATA LINK ESCAPE>   <DATA LINK ESCAPE>   <DATA LINK ESCAPE>
    "\cQ"   17  <D.C. ONE>           <D.C. ONE>           <D.C. ONE>
    "\cR"   18  <D.C. TWO>           <D.C. TWO>           <D.C. TWO>
    "\cS"   19  <D.C. THREE>         <D.C. THREE>         <D.C. THREE>
    "\cT"   20  <D.C. FOUR>          <C1 29>              <C1 29>
    "\cU"   21  <NEG. ACK.>          <C1 5>               <LINE FEED>          ***
    "\cV"   22  <SYNCHRONOUS IDLE>   <BACKSPACE>          <BACKSPACE>
    "\cW"   23  <E.O. TRANS. BLOCK>  <C1 7>               <C1 7>
    "\cX"   24  <CANCEL>             <CANCEL>             <CANCEL>
    "\cY"   25  <E.O. MEDIUM>        <E.O. MEDIUM>        <E.O. MEDIUM>
    "\cZ"   26  <SUBSTITUTE>         <C1 18>              <C1 18>
    "\c["   27  <ESCAPE>             <C1 15>              <C1 15>
    "\c\\"  28  <FILE SEP.>\         <FILE SEP.>\         <FILE SEP.>\
    "\c]"   29  <GROUP SEP.>         <GROUP SEP.>         <GROUP SEP.>
    "\c^"   30  <RECORD SEP.>        <RECORD SEP.>        <RECORD SEP.>        ***><
    "\c_"   31  <UNIT SEP.>          <UNIT SEP.>          <UNIT SEP.>          ***><
773 | |
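To see what the control escapes produce on the platform at hand, the ordinals can be listed at run time. This is only a sketch; the sub name C<control_ordinals> is hypothetical, and it covers just the letter escapes "\cA" through "\cZ".

```perl
use strict;
use warnings;

# Build the native ordinal for each "\cX" letter escape at run time.
sub control_ordinals {
    my %ord;
    for my $letter ('A' .. 'Z') {
        my $char = eval qq{"\\c$letter"};    # evaluates e.g. "\cJ"
        $ord{$letter} = ord $char;
    }
    return \%ord;
}

my $ord = control_ordinals();
printf "\\c%s => %d\n", $_, $ord->{$_} for sort keys %$ord;
```

The ordinals printed are native code points, so which control function each one performs still has to be read from the table for the code page in use.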
774 | |
775 | =head1 FUNCTION DIFFERENCES |
776 | |
777 | =over 8 |
778 | |
779 | =item chr() |
780 | |
781 | chr() must be given an EBCDIC code number argument to yield a desired |
782 | character return value on an EBCDIC machine. For example: |
783 | |
769c2898 |
784 | my $CAPITAL_LETTER_A = chr(193); |
d396a558 |
785 | |
786 | =item ord() |
787 | |
788 | ord() will return EBCDIC code number values on an EBCDIC machine. |
789 | For example: |
790 | |
769c2898 |
791 | my $the_number_193 = ord("A"); |
d396a558 |
792 | |
793 | =item pack() |
794 | |
795 | The c and C templates for pack() are dependent upon character set |
796 | encoding. Examples of usage on EBCDIC include: |
797 | |
769c2898 |
798 | my $foo; |
d396a558 |
799 | $foo = pack("CCCC",193,194,195,196); |
800 | # $foo eq "ABCD" |
769c2898 |
801 | $foo = pack("C4", 193,194,195,196); |
d396a558 |
802 | # same thing |
803 | |
804 | $foo = pack("ccxxcc",193,194,195,196); |
805 | # $foo eq "AB\0\0CD" |
806 | |
807 | =item print() |
808 | |
809 | One must be careful with scalars and strings that are passed to |
810 | print that contain ASCII encodings. One common place |
811 | for this to occur is in the output of the MIME type header for |
812 | CGI script writing. For example, many perl programming guides |
813 | recommend something similar to: |
814 | |
815 | print "Content-type:\ttext/html\015\012\015\012"; |
816 | # this may be wrong on EBCDIC |
817 | |
395f5a0c |
818 | Under the IBM OS/390 USS Web Server or WebSphere on z/OS for example |
819 | you should instead write that as: |
d396a558 |
820 | |
821 | print "Content-type:\ttext/html\r\n\r\n"; # OK for DGW et alia |
822 | |
823 | That is because the translation from EBCDIC to ASCII is done |
824 | by the web server in this case (such code will not be appropriate for |
825 | the Macintosh however). Consult your web server's documentation for |
826 | further details. |
827 | |
828 | =item printf() |
829 | |
830 | The formats that can convert characters to numbers and vice versa |
831 | will be different from their ASCII counterparts when executed |
832 | on an EBCDIC machine. Examples include: |
833 | |
834 | printf("%c%c%c",193,194,195); # prints ABC |
835 | |
836 | =item sort() |
837 | |
838 | EBCDIC sort results may differ from ASCII sort results especially for |
839 | mixed case strings. This is discussed in more detail below. |
840 | |
841 | =item sprintf() |
842 | |
843 | See the discussion of printf() above. An example of the use |
844 | of sprintf would be: |
845 | |
769c2898 |
846 | my $CAPITAL_LETTER_A = sprintf("%c",193); |
d396a558 |
847 | |
848 | =item unpack() |
849 | |
850 | See the discussion of pack() above. |
851 | |
852 | =back |
853 | |
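One way to avoid hard-coding 65 or 193 in these functions is to ask the running perl for the code point. The sketch below assumes nothing beyond the functions described above; the helper name C<is_ebcdic> is made up for illustration, and its test only distinguishes the EBCDIC code pages covered in this document from ASCII.

```perl
use strict;
use warnings;

# True on the EBCDIC code pages discussed here, where "A" is 193.
sub is_ebcdic { return ord('A') == 193 }

# Ask perl for the native code point instead of hard-coding it.
my $code_of_A = ord('A');

my $from_chr     = chr($code_of_A);
my $from_pack    = pack('C', $code_of_A);
my $from_sprintf = sprintf('%c', $code_of_A);

print is_ebcdic() ? "EBCDIC\n" : "not EBCDIC\n";
print "round trip ok\n"
    if $from_chr eq 'A' && $from_pack eq 'A' && $from_sprintf eq 'A';
```

Code written this way gives the same character on either kind of machine, because the number never leaves the platform that produced it.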
854 | =head1 REGULAR EXPRESSION DIFFERENCES |
855 | |
As of perl 5.005_03, letter range regular expressions such as
[A-Z] and [a-z] have been specially coded to not pick up gap
characters.  For example, characters such as E<ocirc> C<o WITH CIRCUMFLEX>
that lie between I and J would not be matched by the
regular expression range C</[H-K]/>.

If you do want to match the alphabet gap characters in a single octet
regular expression, try matching the hex or octal code, such
as C</\313/> on EBCDIC or C</\364/> on ASCII machines, to
have your regular expression match C<o WITH CIRCUMFLEX>.
866 | |
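The gap behaviour described above can be checked on any given perl by counting what a letter range matches among the 256 single-octet characters. This is a small sketch rather than a test from the perl distribution.

```perl
use strict;
use warnings;

# Collect every single-octet character that /[H-K]/ accepts.
my @matched = grep { /[H-K]/ } map { chr($_) } 0 .. 255;

print scalar(@matched), " characters matched: @matched\n";
```

If the range has been coded as described, only the four letters H, I, J, and K should be reported, with no gap characters among them.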
51b5cecb |
867 | Another construct to be wary of is the inappropriate use of hex or |
d396a558 |
868 | octal constants in regular expressions. Consider the following |
869 | set of subs: |
870 | |
871 | sub is_c0 { |
872 | my $char = substr(shift,0,1); |
873 | $char =~ /[\000-\037]/; |
874 | } |
875 | |
876 | sub is_print_ascii { |
877 | my $char = substr(shift,0,1); |
878 | $char =~ /[\040-\176]/; |
879 | } |
880 | |
881 | sub is_delete { |
882 | my $char = substr(shift,0,1); |
883 | $char eq "\177"; |
884 | } |
885 | |
886 | sub is_c1 { |
887 | my $char = substr(shift,0,1); |
888 | $char =~ /[\200-\237]/; |
889 | } |
890 | |
891 | sub is_latin_1 { |
892 | my $char = substr(shift,0,1); |
893 | $char =~ /[\240-\377]/; |
894 | } |
895 | |
51b5cecb |
896 | The above would be adequate if the concern was only with numeric code points. |
897 | However, the concern may be with characters rather than code points |
898 | and on an EBCDIC machine it may be desirable for constructs such as |
d396a558 |
899 | C<if (is_print_ascii("A")) {print "A is a printable character\n";}> to print |
900 | out the expected message. One way to represent the above collection |
901 | of character classification subs that is capable of working across the |
902 | four coded character sets discussed in this document is as follows: |
903 | |
904 | sub Is_c0 { |
905 | my $char = substr(shift,0,1); |
769c2898 |
906 | if ( ord('^') == 94 ) { # ascii |
d396a558 |
907 | return $char =~ /[\000-\037]/; |
769c2898 |
908 | } |
909 | if ( ord('^') == 176 ) { # 37 |
d396a558 |
910 | return $char =~ /[\000-\003\067\055-\057\026\005\045\013-\023\074\075\062\046\030\031\077\047\034-\037]/; |
911 | } |
769c2898 |
912 | if ( ord('^') == 95 || ord('^') == 106 ) { # 1047 || posix-bc |
d396a558 |
913 | return $char =~ /[\000-\003\067\055-\057\026\005\025\013-\023\074\075\062\046\030\031\077\047\034-\037]/; |
914 | } |
915 | } |
916 | |
917 | sub Is_print_ascii { |
918 | my $char = substr(shift,0,1); |
919 | $char =~ /[ !"\#\$%&'()*+,\-.\/0-9:;<=>?\@A-Z[\\\]^_`a-z{|}~]/; |
920 | } |
921 | |
922 | sub Is_delete { |
923 | my $char = substr(shift,0,1); |
769c2898 |
924 | if ( ord('^') == 94 ) { # ascii |
d396a558 |
925 | return $char eq "\177"; |
769c2898 |
926 | } else { # ebcdic |
d396a558 |
927 | return $char eq "\007"; |
928 | } |
929 | } |
930 | |
    sub Is_c1 {
        my $char = substr(shift,0,1);
        # Note that \004 is <C1 28>; see "\cD" in the control character table.
        if ( ord('^') == 94 ) {  # ascii
            return $char =~ /[\200-\237]/;
        }
        if ( ord('^') == 176 ) {  # 37
            return $char =~ /[\040-\044\025\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\377]/;
        }
        if ( ord('^') == 95 ) {  # 1047
            return $char =~ /[\040-\045\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\377]/;
        }
        if ( ord('^') == 106 ) {  # posix-bc
            return $char =~
              /[\040-\045\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\137]/;
        }
    }
947 | |
948 | sub Is_latin_1 { |
949 | my $char = substr(shift,0,1); |
769c2898 |
950 | if ( ord('^') == 94 ) { # ascii |
d396a558 |
951 | return $char =~ /[\240-\377]/; |
952 | } |
769c2898 |
953 | if ( ord('^') == 176 ) { # 37 |
954 | return $char =~ |
d396a558 |
955 | /[\101\252\112\261\237\262\152\265\275\264\232\212\137\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/; |
956 | } |
769c2898 |
957 | if ( ord('^') == 95 ) { # 1047 |
d396a558 |
958 | return $char =~ |
959 | /[\101\252\112\261\237\262\152\265\273\264\232\212\260\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\272\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/; |
960 | } |
769c2898 |
961 | if ( ord('^') == 106 ) { # posix-bc |
962 | return $char =~ |
d396a558 |
963 | /[\101\252\260\261\237\262\320\265\171\264\232\212\272\312\257\241\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\340\376\335\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\300\336\333\334\215\216\337]/; |
964 | } |
965 | } |
966 | |
Note however that only the C<Is_print_ascii()> sub is really independent
of coded character set.  Another way to write C<Is_latin_1()> would be
to use the characters in the range explicitly:

    sub Is_latin_1 {
        my $char = substr(shift,0,1);
        $char =~ /[ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/;
    }

That form, however, may run into trouble in network transit (due to the
presence of 8-bit characters) or on non-ISO-Latin character sets.
978 | |
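A third approach is to translate the native code point to its ISO 8859-1 equivalent first and then classify by number. The sketch below assumes a hypothetical C<@to_latin1> array; on an ASCII platform it is the identity map, and on an EBCDIC platform it would have to be a complete EBCDIC to ASCII table of the kind used in the URL encoding example in this document (C<@e2a_1047> for code page 1047).

```perl
use strict;
use warnings;

# Hypothetical native-to-Latin-1 map. The identity map shown here is
# only right on ASCII platforms; EBCDIC needs a real translation table.
my @to_latin1 = (0 .. 255);

# Latin-1 code point of the first character of the argument.
sub latin1_code { return $to_latin1[ ord substr(shift, 0, 1) ] }

sub Is_c0      { my $c = latin1_code(shift); return $c <= 31 }
sub Is_delete  { return latin1_code(shift) == 127 }
sub Is_c1      { my $c = latin1_code(shift); return $c >= 128 && $c <= 159 }
sub Is_latin_1 { return latin1_code(shift) >= 160 }

print "NUL is a C0 control\n" if Is_c0("\0");
```

The trade-off is one table lookup per call in exchange for not maintaining a separate character class for every code page.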
979 | =head1 SOCKETS |
980 | |
981 | Most socket programming assumes ASCII character encodings in network |
982 | byte order. Exceptions can include CGI script writing under a |
983 | host web server where the server may take care of translation for you. |
984 | Most host web servers convert EBCDIC data to ISO-8859-1 or Unicode on |
985 | output. |
986 | |
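When no server does the translation, the program has to produce ASCII octets itself before writing to the socket. The following is a sketch under stated assumptions: C<@e2a> is a hypothetical native to ASCII map (the identity on ASCII platforms, a table such as C<@e2a_1047> from the URL encoding example on code page 1047), and C<to_wire> is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical native-to-ASCII map; identity on ASCII platforms, a full
# EBCDIC-to-ASCII table on EBCDIC ones.
my @e2a = (0 .. 255);

# Return the octets that a peer expecting ASCII should receive.
sub to_wire {
    my ($text) = @_;
    return pack 'C*', map { $e2a[ ord($_) ] } split //, $text;
}

my $request = to_wire("GET / HTTP/1.0\r\n\r\n");
printf "%d octets, first is 0x%02X\n", length($request), ord($request);
```

Data read back from such a peer would need the reverse mapping before it is treated as native text.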
987 | =head1 SORTING |
988 | |
One big difference between ASCII based character sets and EBCDIC ones
lies in the relative positions of upper and lower case letters and the
letters compared to the digits.  If sorted on an ASCII based machine, the
two letter abbreviation for a physician comes before the two letter
abbreviation for drive, that is:

    my @sorted = sort(qw(Dr. dr.));  # @sorted holds ('Dr.','dr.') on ASCII,
                                     # but ('dr.','Dr.') on EBCDIC

The property of lower case before uppercase letters in EBCDIC is
even carried to the Latin 1 EBCDIC pages such as 0037 and 1047.
An example would be that E<Euml> C<E WITH DIAERESIS> (203) comes
before E<euml> C<e WITH DIAERESIS> (235) on an ASCII machine, but
the latter (83) comes before the former (115) on an EBCDIC machine.
(Astute readers will note that the upper case version of E<szlig>
C<SMALL LETTER SHARP S> is simply "SS" and that the upper case version of
E<yuml> C<y WITH DIAERESIS> is not in the 0..255 range but it is
at U+0178 in Unicode, or C<"\x{178}"> in a Unicode enabled Perl).
1007 | |
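The EBCDIC ordering can be previewed on an ASCII machine by sorting on the octets each string would have under an EBCDIC code page. The sketch below is meant to be run on an ASCII platform and assumes the Encode module with its C<cp1047> encoding is installed; C<ebcdic_order> is a made-up helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sort by the cp1047 octets of each string to mimic EBCDIC ordering.
sub ebcdic_order {
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { my $copy = $_; [ encode('cp1047', $copy), $_ ] } @_;
}

my @native = sort qw(Dr. dr.);            # native code point order
my @ebcdic = ebcdic_order(qw(Dr. dr.));   # simulated code page 1047 order

print "native: @native\n";
print "cp1047: @ebcdic\n";
```

This only simulates the comparison of code points; it says nothing about locale aware collation on either platform.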
1008 | The sort order will cause differences between results obtained on |
1009 | ASCII machines versus EBCDIC machines. What follows are some suggestions |
1010 | on how to deal with these differences. |
1011 | |
51b5cecb |
1012 | =head2 Ignore ASCII vs. EBCDIC sort differences. |
d396a558 |
1013 | |
1014 | This is the least computationally expensive strategy. It may require |
1015 | some user education. |
1016 | |
51b5cecb |
1017 | =head2 MONO CASE then sort data. |
d396a558 |
1018 | |
51b5cecb |
In order to minimize the expense of mono casing mixed text, try to
C<tr///> towards the character set case most employed within the data.
If the data are primarily UPPERCASE non Latin 1 then apply tr/[a-z]/[A-Z]/
then sort().  If the data are primarily lowercase non Latin 1 then
apply tr/[A-Z]/[a-z]/ before sorting.  If the data are primarily UPPERCASE
and include Latin-1 characters then apply:

    tr/[a-z]/[A-Z]/;
    tr/[àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ]/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]/;
    s/ß/SS/g;

then sort().  Do note however that such Latin-1 manipulation does not
address the E<yuml> C<y WITH DIAERESIS> character that will remain at
code point 255 on ASCII machines, but 223 on most EBCDIC machines
where it will sort to a place less than the EBCDIC numerals.  With a
Unicode enabled Perl you might try:

    tr/ÿ/\x{178}/;
1037 | |
1038 | The strategy of mono casing data before sorting does not preserve the case |
1039 | of the data and may not be acceptable for that reason. |
1040 | |
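Where losing the original case is the obstacle, one variation is to mono case only the comparison key and leave the data alone. This is a minimal sketch that relies on C<lc()> and so, without locale or Unicode semantics, covers just the letters a through z.

```perl
use strict;
use warnings;

my @data = qw(cherry Apple banana);

# Compare lower-cased copies; the original strings are left untouched.
my @sorted = sort { lc($a) cmp lc($b) } @data;

print "@sorted\n";
```

Strings that differ only in case compare as equal under this key, so their relative order is not determined by it.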
=head2 Convert, sort data, then reconvert.
d396a558 |
1042 | |
1043 | This is the most expensive proposition that does not employ a network |
1044 | connection. |
1045 | |
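A variation on this strategy converts only a sort key, so nothing has to be converted back afterwards. The sketch assumes a hypothetical C<@e2a> native to ASCII map (the identity on ASCII platforms, a table such as C<@e2a_1047> from the URL encoding example on code page 1047) and a made-up helper C<ascii_key>.

```perl
use strict;
use warnings;

# Hypothetical native-to-ASCII map; identity here, a full
# EBCDIC-to-ASCII table on an EBCDIC platform.
my @e2a = (0 .. 255);

# Build a key whose code points are the ASCII values of the characters.
sub ascii_key {
    return join '', map { chr( $e2a[ ord($_) ] ) } split //, shift;
}

# Pair each string with its key, sort on the key, keep the string.
my @sorted = map  { $_->[1] }
             sort { $a->[0] cmp $b->[0] }
             map  { [ ascii_key($_), $_ ] } qw(dr. Dr.);

print "@sorted\n";
```

The cost is one table lookup per character of every string, which is what makes this the expensive option among the local strategies.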
1046 | =head2 Perform sorting on one type of machine only. |
1047 | |
1048 | This strategy can employ a network connection. As such |
1049 | it would be computationally expensive. |
1050 | |
395f5a0c |
1051 | =head1 TRANSFORMATION FORMATS |
1e054b24 |
1052 | |
1053 | There are a variety of ways of transforming data with an intra character set |
1054 | mapping that serve a variety of purposes. Sorting was discussed in the |
1055 | previous section and a few of the other more popular mapping techniques are |
1056 | discussed next. |
1057 | |
1058 | =head2 URL decoding and encoding |
d396a558 |
1059 | |
51b5cecb |
1060 | Note that some URLs have hexadecimal ASCII code points in them in an |
1e054b24 |
1061 | attempt to overcome character or protocol limitation issues. For example |
1062 | the tilde character is not on every keyboard hence a URL of the form: |
d396a558 |
1063 | |
1064 | http://www.pvhp.com/~pvhp/ |
1065 | |
1066 | may also be expressed as either of: |
1067 | |
1068 | http://www.pvhp.com/%7Epvhp/ |
1069 | |
1070 | http://www.pvhp.com/%7epvhp/ |
1071 | |
51b5cecb |
1072 | where 7E is the hexadecimal ASCII code point for '~'. Here is an example |
d396a558 |
1073 | of decoding such a URL under CCSID 1047: |
1074 | |
769c2898 |
1075 | my $url = 'http://www.pvhp.com/%7Epvhp/'; |
d396a558 |
1076 | # this array assumes code page 1047 |
1077 | my @a2e_1047 = ( |
1078 | 0, 1, 2, 3, 55, 45, 46, 47, 22, 5, 21, 11, 12, 13, 14, 15, |
1079 | 16, 17, 18, 19, 60, 61, 50, 38, 24, 25, 63, 39, 28, 29, 30, 31, |
1080 | 64, 90,127,123, 91,108, 80,125, 77, 93, 92, 78,107, 96, 75, 97, |
1081 | 240,241,242,243,244,245,246,247,248,249,122, 94, 76,126,110,111, |
1082 | 124,193,194,195,196,197,198,199,200,201,209,210,211,212,213,214, |
1083 | 215,216,217,226,227,228,229,230,231,232,233,173,224,189, 95,109, |
1084 | 121,129,130,131,132,133,134,135,136,137,145,146,147,148,149,150, |
1085 | 151,152,153,162,163,164,165,166,167,168,169,192, 79,208,161, 7, |
1086 | 32, 33, 34, 35, 36, 37, 6, 23, 40, 41, 42, 43, 44, 9, 10, 27, |
1087 | 48, 49, 26, 51, 52, 53, 54, 8, 56, 57, 58, 59, 4, 20, 62,255, |
1088 | 65,170, 74,177,159,178,106,181,187,180,154,138,176,202,175,188, |
1089 | 144,143,234,250,190,160,182,179,157,218,155,139,183,184,185,171, |
1090 | 100,101, 98,102, 99,103,158,104,116,113,114,115,120,117,118,119, |
1091 | 172,105,237,238,235,239,236,191,128,253,254,251,252,186,174, 89, |
1092 | 68, 69, 66, 70, 67, 71,156, 72, 84, 81, 82, 83, 88, 85, 86, 87, |
1093 | 140, 73,205,206,203,207,204,225,112,221,222,219,220,141,142,223 |
1094 | ); |
    $url =~ s/%([0-9a-fA-F]{2})/pack("C",$a2e_1047[hex($1)])/ge;
1096 | |
1e054b24 |
1097 | Conversely, here is a partial solution for the task of encoding such |
1098 | a URL under the 1047 code page: |
1099 | |
769c2898 |
1100 | my $url = 'http://www.pvhp.com/~pvhp/'; |
1e054b24 |
1101 | # this array assumes code page 1047 |
1102 | my @e2a_1047 = ( |
1103 | 0, 1, 2, 3,156, 9,134,127,151,141,142, 11, 12, 13, 14, 15, |
1104 | 16, 17, 18, 19,157, 10, 8,135, 24, 25,146,143, 28, 29, 30, 31, |
1105 | 128,129,130,131,132,133, 23, 27,136,137,138,139,140, 5, 6, 7, |
1106 | 144,145, 22,147,148,149,150, 4,152,153,154,155, 20, 21,158, 26, |
1107 | 32,160,226,228,224,225,227,229,231,241,162, 46, 60, 40, 43,124, |
1108 | 38,233,234,235,232,237,238,239,236,223, 33, 36, 42, 41, 59, 94, |
1109 | 45, 47,194,196,192,193,195,197,199,209,166, 44, 37, 95, 62, 63, |
1110 | 248,201,202,203,200,205,206,207,204, 96, 58, 35, 64, 39, 61, 34, |
1111 | 216, 97, 98, 99,100,101,102,103,104,105,171,187,240,253,254,177, |
1112 | 176,106,107,108,109,110,111,112,113,114,170,186,230,184,198,164, |
1113 | 181,126,115,116,117,118,119,120,121,122,161,191,208, 91,222,174, |
1114 | 172,163,165,183,169,167,182,188,189,190,221,168,175, 93,180,215, |
1115 | 123, 65, 66, 67, 68, 69, 70, 71, 72, 73,173,244,246,242,243,245, |
1116 | 125, 74, 75, 76, 77, 78, 79, 80, 81, 82,185,251,252,249,250,255, |
1117 | 92,247, 83, 84, 85, 86, 87, 88, 89, 90,178,212,214,210,211,213, |
1118 | 48, 49, 50, 51, 52, 53, 54, 55, 56, 57,179,219,220,217,218,159 |
1119 | ); |
769c2898 |
1120 | # The following regular expression does not address the |
1e054b24 |
1121 | # mappings for: ('.' => '%2E', '/' => '%2F', ':' => '%3A') |
1122 | $url =~ s/([\t "#%&\(\),;<=>\?\@\[\\\]^`{|}~])/sprintf("%%%02X",$e2a_1047[ord($1)])/ge; |
1123 | |
1124 | where a more complete solution would split the URL into components |
1125 | and apply a full s/// substitution only to the appropriate parts. |
1126 | |
1127 | In the remaining examples a @e2a or @a2e array may be employed |
1128 | but the assignment will not be shown explicitly. For code page 1047 |
1129 | you could use the @a2e_1047 or @e2a_1047 arrays just shown. |
1130 | |
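If only one direction of a table is at hand, the other can be derived from it, since each map is a permutation of 0..255. The sketch below uses a stand-in table (a rotation by one) purely to stay short; in practice C<@a2e> would be a full table such as C<@a2e_1047>.

```perl
use strict;
use warnings;

# Stand-in for a full 256-entry ASCII-to-EBCDIC table; a rotation by
# one is used here only to keep the example short.
my @a2e = (1 .. 255, 0);

# Invert the permutation: if $a2e[$i] == $j then $e2a[$j] == $i.
my @e2a;
@e2a[@a2e] = (0 .. 255);

# Every code point should survive a round trip.
my @bad = grep { $e2a[ $a2e[$_] ] != $_ } 0 .. 255;
print @bad ? "table is not a permutation\n" : "round trip ok\n";
```

The round trip check is also a cheap way to catch a typo in a hand-entered table.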
1131 | =head2 uu encoding and decoding |
1132 | |
1133 | The C<u> template to pack() or unpack() will render EBCDIC data in EBCDIC |
1134 | characters equivalent to their ASCII counterparts. For example, the |
1135 | following will print "Yes indeed\n" on either an ASCII or EBCDIC computer: |
1136 | |
769c2898 |
1137 | my $all_byte_chrs = ''; |
1138 | |
1139 | $all_byte_chrs .= chr($_) foreach 0 .. 255; |
1140 | |
1141 | my $uuencode_byte_chrs = pack('u', $all_byte_chrs); |
1142 | |
1143 | (my $uu = <<' ENDOFHEREDOC') =~ s/^\s*//gm; |
1e054b24 |
1144 | M``$"`P0%!@<("0H+#`T.#Q`1$A,4%187&!D:&QP='A\@(2(C)"4F)R@I*BLL |
1145 | M+2XO,#$R,S0U-C<X.3H[/#T^/T!!0D-$149'2$E*2TQ-3D]045)35%565UA9 |
1146 | M6EM<75Y?8&%B8V1E9F=H:6IK;&UN;W!Q<G-T=79W>'EZ>WQ]?G^`@8*#A(6& |
1147 | MAXB)BHN,C8Z/D)&2DY25EI>8F9J;G)V>GZ"AHJ.DI::GJ*FJJZRMKJ^PL;*S |
1148 | MM+6VM[BYNKN\O;Z_P,'"P\3%QL?(R<K+S,W.S]#1TM/4U=;7V-G:V]S=WM_@ |
1149 | ?X>+CY.7FY^CIZNOL[>[O\/'R\_3U]O?X^?K[_/W^_P`` |
1150 | ENDOFHEREDOC |
769c2898 |
1151 | if ( $uuencode_byte_chrs eq $uu ) { |
1e054b24 |
1152 | print "Yes "; |
1153 | } |
    my $uudecode_byte_chrs = unpack('u', $uuencode_byte_chrs);
769c2898 |
1155 | if ( $uudecode_byte_chrs eq $all_byte_chrs ) { |
1e054b24 |
1156 | print "indeed\n"; |
1157 | } |
1158 | |
1159 | Here is a very spartan uudecoder that will work on EBCDIC provided |
1160 | that the @e2a array is filled in appropriately: |
1161 | |
769c2898 |
1162 | #!/usr/bin/perl |
1163 | my @e2a = ( |
1164 | # this must be filled in |
1165 | ); |
1166 | $_ = <> until my($mode,$file) = /^begin\s*(\d*)\s*(\S*)/; |
1e054b24 |
1167 | open(OUT, "> $file") if $file ne ""; |
1168 | while(<>) { |
1169 | last if /^end/; |
1170 | next if /[a-z]/; |
1171 | next unless int(((($e2a[ord()] - 32 ) & 077) + 2) / 3) == |
1172 | int(length() / 4); |
1173 | print OUT unpack("u", $_); |
1174 | } |
1175 | close(OUT); |
1176 | chmod oct($mode), $file; |
1177 | |
1178 | |
1179 | =head2 Quoted-Printable encoding and decoding |
1180 | |
On ASCII encoded machines it is possible to encode characters outside of
the printable set using:

    # This QP encoder works on ASCII only
    $qp_string =~ s/([=\x00-\x1F\x80-\xFF])/sprintf("=%02X",ord($1))/ge;
1186 | |
1187 | Whereas a QP encoder that works on both ASCII and EBCDIC machines |
1188 | would look somewhat like the following (where the EBCDIC branch @e2a |
1189 | array is omitted for brevity): |
1190 | |
    my ($delete, @e2a);
    if (ord('A') == 65) {    # ASCII
        $delete = "\x7F";    # ASCII
        @e2a = (0 .. 255);   # ASCII to ASCII identity map
    } else {                 # EBCDIC
        $delete = "\x07";    # EBCDIC
        @e2a = (
            # EBCDIC to ASCII map (as shown above)
        );
    }
    $qp_string =~
      s/([^ !"\#\$%&'()*+,\-.\/0-9:;<>?\@A-Z[\\\]^_`a-z{|}~$delete])/sprintf("=%02X",$e2a[ord($1)])/ge;
1203 | |
1204 | (although in production code the substitutions might be done |
1205 | in the EBCDIC branch with the @e2a array and separately in the |
1206 | ASCII branch without the expense of the identity map). |
1207 | |
1208 | Such QP strings can be decoded with: |
1209 | |
1210 | # This QP decoder is limited to ASCII only |
1211 | $string =~ s/=([0-9A-Fa-f][0-9A-Fa-f])/chr hex $1/ge; |
1212 | $string =~ s/=[\n\r]+$//; |
1213 | |
1214 | Whereas a QP decoder that works on both ASCII and EBCDIC machines |
1215 | would look somewhat like the following (where the @a2e array is |
1216 | omitted for brevity): |
1217 | |
1218 | $string =~ s/=([0-9A-Fa-f][0-9A-Fa-f])/chr $a2e[hex $1]/ge; |
1219 | $string =~ s/=[\n\r]+$//; |
1220 | |
395f5a0c |
1221 | =head2 Caesarian ciphers |
1e054b24 |
1222 | |
1223 | The practice of shifting an alphabet one or more characters for encipherment |
1224 | dates back thousands of years and was explicitly detailed by Gaius Julius |
1225 | Caesar in his B<Gallic Wars> text. A single alphabet shift is sometimes |
1226 | referred to as a rotation and the shift amount is given as a number $n after |
1227 | the string 'rot' or "rot$n". Rot0 and rot26 would designate identity maps |
1228 | on the 26 letter English version of the Latin alphabet. Rot13 has the |
1229 | interesting property that alternate subsequent invocations are identity maps |
1230 | (thus rot13 is its own non-trivial inverse in the group of 26 alphabet |
1231 | rotations). Hence the following is a rot13 encoder and decoder that will |
1232 | work on ASCII and EBCDIC machines: |
1233 | |
1234 | #!/usr/local/bin/perl |
1235 | |
769c2898 |
1236 | while ( <> ) { |
1e054b24 |
1237 | tr/n-za-mN-ZA-M/a-zA-Z/; |
1238 | print; |
1239 | } |
1240 | |
1241 | In one-liner form: |
1242 | |
769c2898 |
1243 | perl -pe 'tr/n-za-mN-ZA-M/a-zA-Z/' |
1e054b24 |
1244 | |
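For shifts other than 13, the same idea can be written with explicit alphabets so that no arithmetic is done on code points. This is a sketch; the sub name C<rot> is made up for illustration, and it assumes that the C<'a' .. 'z'> and C<'A' .. 'Z'> ranges yield the 26 letters on the platform in use.

```perl
use strict;
use warnings;

# Rotate letters by $n places using explicit alphabets, so that no
# assumption of contiguous letter code points is needed.
sub rot {
    my ($n, $text) = @_;
    my @lower = ('a' .. 'z');
    my @upper = ('A' .. 'Z');
    my %map;
    for my $i (0 .. 25) {
        $map{ $lower[$i] } = $lower[ ($i + $n) % 26 ];
        $map{ $upper[$i] } = $upper[ ($i + $n) % 26 ];
    }
    $text =~ s/([A-Za-z])/$map{$1}/g;
    return $text;
}

print rot(13, "Hello, World!"), "\n";
```

Calling C<rot(26 - $n, ...)> undoes C<rot($n, ...)>, which for $n of 13 is the self-inverse property noted above.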
1245 | |
=head1 HASHING ORDER AND CHECKSUMS
1247 | |
395f5a0c |
1248 | To the extent that it is possible to write code that depends on |
1249 | hashing order there may be differences between hashes as stored |
1250 | on an ASCII based machine and hashes stored on an EBCDIC based machine. |
1e054b24 |
1251 | XXX |
1252 | |
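Code that needs a repeatable traversal can avoid depending on hashing order by sorting the keys explicitly. The following is a minimal sketch with made-up data; note that the sorted order is itself subject to the ASCII versus EBCDIC differences described under SORTING when keys mix case, digits, or punctuation.

```perl
use strict;
use warnings;

my %count = (pear => 3, apple => 1, fig => 2);

# The order returned by keys() is unspecified; sorting imposes a
# repeatable one.
for my $name (sort keys %count) {
    print "$name=$count{$name}\n";
}
```

With all lower case keys, as here, the sorted order is the same on the code sets discussed in this document.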
d396a558 |
1253 | =head1 I18N AND L10N |
1254 | |
Internationalization (I18N) and localization (L10N) are supported at least
in principle even on EBCDIC machines.  The details are system dependent
and discussed under the L<perlebcdic/OS ISSUES> section below.
1258 | |
1259 | =head1 MULTI OCTET CHARACTER SETS |
1260 | |
395f5a0c |
1261 | Perl may work with an internal UTF-EBCDIC encoding form for wide characters |
1262 | on EBCDIC platforms in a manner analogous to the way that it works with |
1263 | the UTF-8 internal encoding form on ASCII based platforms. |
1264 | |
1265 | Legacy multi byte EBCDIC code pages XXX. |
d396a558 |
1266 | |
1267 | =head1 OS ISSUES |
1268 | |
1269 | There may be a few system dependent issues |
1270 | of concern to EBCDIC Perl programmers. |
1271 | |
1272 | =head2 OS/400 |
1273 | |
51b5cecb |
1274 | The PASE environment. |
1275 | |
d396a558 |
1276 | =over 8 |
1277 | |
1278 | =item IFS access |
1279 | |
1280 | XXX. |
1281 | |
1282 | =back |
1283 | |
395f5a0c |
1284 | =head2 OS/390, z/OS |
d396a558 |
1285 | |
51b5cecb |
Perl runs under UNIX System Services (USS).
1287 | |
d396a558 |
1288 | =over 8 |
1289 | |
51b5cecb |
1290 | =item chcp |
1291 | |
1e054b24 |
1292 | B<chcp> is supported as a shell utility for displaying and changing |
1293 | one's code page. See also L<chcp>. |
51b5cecb |
1294 | |
d396a558 |
1295 | =item dataset access |
1296 | |
1297 | For sequential data set access try: |
1298 | |
1299 | my @ds_records = `cat //DSNAME`; |
1300 | |
1301 | or: |
1302 | |
1303 | my @ds_records = `cat //'HLQ.DSNAME'`; |
1304 | |
1305 | See also the OS390::Stdio module on CPAN. |
1306 | |
395f5a0c |
1307 | =item OS/390, z/OS iconv |
51b5cecb |
1308 | |
1e054b24 |
1309 | B<iconv> is supported as both a shell utility and a C RTL routine. |
1310 | See also the iconv(1) and iconv(3) manual pages. |
51b5cecb |
1311 | |
d396a558 |
1312 | =item locales |
1313 | |
395f5a0c |
1314 | On OS/390 or z/OS see L<locale> for information on locales. The L10N files |
1315 | are in F</usr/nls/locale>. $Config{d_setlocale} is 'define' on OS/390 |
1316 | or z/OS. |
d396a558 |
1317 | |
1318 | =back |
1319 | |
1320 | =head2 VM/ESA? |
1321 | |
1322 | XXX. |
1323 | |
1324 | =head2 POSIX-BC? |
1325 | |
1326 | XXX. |
1327 | |
51b5cecb |
1328 | =head1 BUGS |
1329 | |
1330 | This pod document contains literal Latin 1 characters and may encounter |
b1866b2d |
1331 | translation difficulties. In particular one popular nroff implementation |
51b5cecb |
1332 | was known to strip accented characters to their unaccented counterparts |
1333 | while attempting to view this document through the B<pod2man> program |
1334 | (for example, you may see a plain C<y> rather than one with a diaeresis |
3958b146 |
1335 | as in E<yuml>). Another nroff truncated the resultant manpage at |
395f5a0c |
1336 | the first occurrence of 8 bit characters. |
51b5cecb |
1337 | |
1338 | Not all shells will allow multiple C<-e> string arguments to perl to |
395f5a0c |
1339 | be concatenated together properly as recipes 0, 2, 4, 5, and 6 might |
1340 | seem to imply. |
51b5cecb |
1341 | |
b3b6085d |
1342 | =head1 SEE ALSO |
1343 | |
395f5a0c |
1344 | L<perllocale>, L<perlfunc>, L<perlunicode>, L<utf8>. |
b3b6085d |
1345 | |
d396a558 |
1346 | =head1 REFERENCES |
1347 | |
1348 | http://anubis.dkuug.dk/i18n/charmaps |
1349 | |
d396a558 |
1350 | http://www.unicode.org/ |
1351 | |
1352 | http://www.unicode.org/unicode/reports/tr16/ |
1353 | |
51b5cecb |
1354 | http://www.wps.com/texts/codes/ |
1355 | B<ASCII: American Standard Code for Information Infiltration> Tom Jennings, |
1356 | September 1999. |
1357 | |
395f5a0c |
1358 | B<The Unicode Standard, Version 3.0> The Unicode Consortium, Lisa Moore ed., |
51b5cecb |
1359 | ISBN 0-201-61633-5, Addison Wesley Developers Press, February 2000. |
1360 | |
d396a558 |
1361 | B<CDRA: IBM - Character Data Representation Architecture - |
1362 | Reference and Registry>, IBM SC09-2190-00, December 1996. |
1363 | |
1364 | "Demystifying Character Sets", Andrea Vine, Multilingual Computing |
1365 | & Technology, B<#26 Vol. 10 Issue 4>, August/September 1999; |
1366 | ISSN 1523-0309; Multilingual Computing Inc. Sandpoint ID, USA. |
1367 | |
1e054b24 |
1368 | B<Codes, Ciphers, and Other Cryptic and Clandestine Communication> |
1369 | Fred B. Wrixon, ISBN 1-57912-040-7, Black Dog & Leventhal Publishers, |
1370 | 1998. |
1371 | |
395f5a0c |
1372 | http://www.bobbemer.com/P-BIT.HTM |
1373 | B<IBM - EBCDIC and the P-bit; The biggest Computer Goof Ever> Robert Bemer. |
1374 | |
1375 | =head1 HISTORY |
1376 | |
1377 | 15 April 2001: added UTF-8 and UTF-EBCDIC to main table, pvhp. |
1378 | |
d396a558 |
1379 | =head1 AUTHOR |
1380 | |
b3b6085d |
1381 | Peter Prymmer pvhp@best.com wrote this in 1999 and 2000 |
d396a558 |
1382 | with CCSID 0819 and 0037 help from Chris Leach and |
b3b6085d |
1383 | AndrE<eacute> Pirard A.Pirard@ulg.ac.be as well as POSIX-BC |
1384 | help from Thomas Dorner Thomas.Dorner@start.de. |
1e054b24 |
1385 | Thanks also to Vickie Cooper, Philip Newton, William Raffloer, and |
1386 | Joe Smith. Trademarks, registered trademarks, service marks and |
1387 | registered service marks used in this document are the property of |
1388 | their respective owners. |