prefix. Some macros are provided for compatibility with the older,
unadorned names, but this support may be disabled in a future release.
-The listing is alphabetical, case insensitive.
+Perl was originally written to handle US-ASCII only (that is, characters
+whose ordinal numbers are in the range 0 - 127).
+And documentation and comments may still use the term ASCII, when
+sometimes in fact the entire range from 0 - 255 is meant.
+
+Note that Perl can be compiled and run under EBCDIC (See L<perlebcdic>)
+or ASCII. Most of the documentation (and even comments in the code)
+ignore the EBCDIC possibility.
+For almost all purposes the differences are transparent.
+As an example, under EBCDIC,
+instead of UTF-8, UTF-EBCDIC is used to encode Unicode strings, and so
+whenever this documentation refers to C<utf8>
+(and variants of that name, including in function names),
+it also (essentially transparently) means C<UTF-EBCDIC>.
+But the ordinals of characters differ between ASCII, EBCDIC, and
+the UTF- encodings, and a string encoded in UTF-EBCDIC may occupy more bytes
+than in UTF-8.
+
+Also, on some EBCDIC machines, functions that are documented as operating on
+US-ASCII (or Basic Latin in Unicode terminology) may in fact operate on all
+256 characters in the EBCDIC range, not just the subset corresponding to
+US-ASCII.
+
+The listing below is alphabetical, case insensitive.
_EOB_
AmdbR |char* |sv_pvutf8 |NN SV *sv
AmdbR |char* |sv_pvbyte |NN SV *sv
Amdb |STRLEN |sv_utf8_upgrade|NN SV *sv
+Amdb |STRLEN |sv_utf8_upgrade_nomg|NN SV *sv
ApdM |bool |sv_utf8_downgrade|NN SV *const sv|const bool fail_ok
Apd |void |sv_utf8_encode |NN SV *const sv
ApdM |bool |sv_utf8_decode |NN SV *const sv
Perl_sv_pvutf8
Perl_sv_pvbyte
Perl_sv_utf8_upgrade
+Perl_sv_utf8_upgrade_nomg
Perl_sv_utf8_downgrade
Perl_sv_utf8_encode
Perl_sv_utf8_decode
=head1 Character classes
=for apidoc Am|bool|isALNUM|char ch
-Returns a boolean indicating whether the C C<char> is an ASCII alphanumeric
-character (including underscore) or digit.
+Returns a boolean indicating whether the C C<char> is a US-ASCII (Basic Latin)
+alphanumeric character (including underscore) or digit.
=for apidoc Am|bool|isALPHA|char ch
-Returns a boolean indicating whether the C C<char> is an ASCII alphabetic
-character.
+Returns a boolean indicating whether the C C<char> is a US-ASCII (Basic Latin)
+alphabetic character.
=for apidoc Am|bool|isSPACE|char ch
-Returns a boolean indicating whether the C C<char> is whitespace.
+Returns a boolean indicating whether the C C<char> is a US-ASCII (Basic Latin)
+whitespace.
=for apidoc Am|bool|isDIGIT|char ch
-Returns a boolean indicating whether the C C<char> is an ASCII
+Returns a boolean indicating whether the C C<char> is a US-ASCII (Basic Latin)
digit.
=for apidoc Am|bool|isUPPER|char ch
-Returns a boolean indicating whether the C C<char> is an uppercase
-character.
+Returns a boolean indicating whether the C C<char> is a US-ASCII (Basic Latin)
+uppercase character.
=for apidoc Am|bool|isLOWER|char ch
-Returns a boolean indicating whether the C C<char> is a lowercase
-character.
+Returns a boolean indicating whether the C C<char> is a US-ASCII (Basic Latin)
+lowercase character.
=for apidoc Am|char|toUPPER|char ch
-Converts the specified character to uppercase.
+Converts the specified character to uppercase. Characters outside the
+US-ASCII (Basic Latin) range are viewed as not having any case.
=for apidoc Am|char|toLOWER|char ch
-Converts the specified character to lowercase.
+Converts the specified character to lowercase. Characters outside the
+US-ASCII (Basic Latin) range are viewed as not having any case.
=cut
*/
prefix. Some macros are provided for compatibility with the older,
unadorned names, but this support may be disabled in a future release.
-The listing is alphabetical, case insensitive.
+Perl was originally written to handle US-ASCII only (that is, characters
+whose ordinal numbers are in the range 0 - 127).
+And documentation and comments may still use the term ASCII, when
+sometimes in fact the entire range from 0 - 255 is meant.
+
+Note that Perl can be compiled and run under EBCDIC (See L<perlebcdic>)
+or ASCII. Most of the documentation (and even comments in the code)
+ignore the EBCDIC possibility.
+For almost all purposes the differences are transparent.
+As an example, under EBCDIC,
+instead of UTF-8, UTF-EBCDIC is used to encode Unicode strings, and so
+whenever this documentation refers to C<utf8>
+(and variants of that name, including in function names),
+it also (essentially transparently) means C<UTF-EBCDIC>.
+But the ordinals of characters differ between ASCII, EBCDIC, and
+the UTF- encodings, and a string encoded in UTF-EBCDIC may occupy more bytes
+than in UTF-8.
+
+Also, on some EBCDIC machines, functions that are documented as operating on
+US-ASCII (or Basic Latin in Unicode terminology) may in fact operate on all
+256 characters in the EBCDIC range, not just the subset corresponding to
+US-ASCII.
+
+The listing below is alphabetical, case insensitive.
=head1 "Gimme" Values
=item isALNUM
X<isALNUM>
-Returns a boolean indicating whether the C C<char> is an ASCII alphanumeric
-character (including underscore) or digit.
+Returns a boolean indicating whether the C C<char> is a US-ASCII (Basic Latin)
+alphanumeric character (including underscore) or digit.
bool isALNUM(char ch)
=item isALPHA
X<isALPHA>
-Returns a boolean indicating whether the C C<char> is an ASCII alphabetic
-character.
+Returns a boolean indicating whether the C C<char> is a US-ASCII (Basic Latin)
+alphabetic character.
bool isALPHA(char ch)
=item isDIGIT
X<isDIGIT>
-Returns a boolean indicating whether the C C<char> is an ASCII
+Returns a boolean indicating whether the C C<char> is a US-ASCII (Basic Latin)
digit.
bool isDIGIT(char ch)
=item isLOWER
X<isLOWER>
-Returns a boolean indicating whether the C C<char> is a lowercase
-character.
+Returns a boolean indicating whether the C C<char> is a US-ASCII (Basic Latin)
+lowercase character.
bool isLOWER(char ch)
=item isSPACE
X<isSPACE>
-Returns a boolean indicating whether the C C<char> is whitespace.
+Returns a boolean indicating whether the C C<char> is a US-ASCII (Basic Latin)
+whitespace.
bool isSPACE(char ch)
=item isUPPER
X<isUPPER>
-Returns a boolean indicating whether the C C<char> is an uppercase
-character.
+Returns a boolean indicating whether the C C<char> is a US-ASCII (Basic Latin)
+uppercase character.
bool isUPPER(char ch)
=item toLOWER
X<toLOWER>
-Converts the specified character to lowercase.
+Converts the specified character to lowercase. Characters outside the
+US-ASCII (Basic Latin) range are viewed as not having any case.
char toLOWER(char ch)
=item toUPPER
X<toUPPER>
-Converts the specified character to uppercase.
+Converts the specified character to uppercase. Characters outside the
+US-ASCII (Basic Latin) range are viewed as not having any case.
char toUPPER(char ch)
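As a rough illustration of the US-ASCII-only behaviour described above for
the C<is*> and C<to*> macros, here is a standalone C sketch. The helper names
(C<sketch_is_lower>, C<sketch_is_upper>, C<sketch_to_upper>) are hypothetical
and are not part of the Perl API, and the code assumes an ASCII platform, where
the letters form contiguous ranges.

```c
#include <stdbool.h>

/* Hypothetical helpers, not Perl's macros. They assume an ASCII
 * platform, where 'a'..'z' and 'A'..'Z' are contiguous ranges. */

/* True only for the 26 US-ASCII lowercase letters. */
bool sketch_is_lower(unsigned char ch)
{
    return ch >= 'a' && ch <= 'z';
}

/* True only for the 26 US-ASCII uppercase letters. */
bool sketch_is_upper(unsigned char ch)
{
    return ch >= 'A' && ch <= 'Z';
}

/* Map to uppercase. Anything that is not a US-ASCII lowercase letter,
 * including every value in 128..255, is treated as caseless and is
 * returned unchanged. */
unsigned char sketch_to_upper(unsigned char ch)
{
    return sketch_is_lower(ch) ? (unsigned char)(ch - 'a' + 'A') : ch;
}
```

Under these assumptions, C<sketch_to_upper('a')> yields C<'A'>, while a
Latin-1 byte such as 0xE9 passes through unchanged because it lies outside
the US-ASCII range.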
X<sv_utf8_downgrade>
Attempts to convert the PV of an SV from characters to bytes.
-If the PV contains a character beyond byte, this conversion will fail;
+If the PV contains a character that cannot fit
+in a byte, this conversion will fail;
in this case, either returns false or, if C<fail_ok> is not
true, croaks.
Converts the PV of an SV to its UTF-8-encoded form.
Forces the SV to string form if it is not already.
+Will C<mg_get> on C<sv> if appropriate.
Always sets the SvUTF8 flag to avoid future validity checks even
-if all the bytes have hibit clear.
+if the whole string is the same in UTF-8 as not.
+Returns the number of bytes in the converted string.
This is not as a general purpose byte encoding to Unicode interface:
use the Encode extension for that.
Converts the PV of an SV to its UTF-8-encoded form.
Forces the SV to string form if it is not already.
Always sets the SvUTF8 flag to avoid future validity checks even
-if all the bytes have hibit clear. If C<flags> has C<SV_GMAGIC> bit set,
-will C<mg_get> on C<sv> if appropriate, else not. C<sv_utf8_upgrade> and
+if all the bytes are invariant in UTF-8. If C<flags> has C<SV_GMAGIC> bit set,
+will C<mg_get> on C<sv> if appropriate, else not.
+Returns the number of bytes in the converted string.
+C<sv_utf8_upgrade> and
C<sv_utf8_upgrade_nomg> are implemented in terms of this function.
This is not as a general purpose byte encoding to Unicode interface:
=for hackers
Found in file sv.c
+=item sv_utf8_upgrade_nomg
+X<sv_utf8_upgrade_nomg>
+
+Like sv_utf8_upgrade, but doesn't do magic on C<sv>.
+
+ STRLEN sv_utf8_upgrade_nomg(SV *sv)
+
+=for hackers
+Found in file sv.c
+
=item sv_vcatpvf
X<sv_vcatpvf>
=item bytes_from_utf8
X<bytes_from_utf8>
-Converts a string C<s> of length C<len> from UTF-8 into byte encoding.
+Converts a string C<s> of length C<len> from UTF-8 into native byte encoding.
Unlike C<utf8_to_bytes> but like C<bytes_to_utf8>, returns a pointer to
the newly-created string, and updates C<len> to contain the new
length. Returns the original string if no conversion occurs, C<len>
is unchanged. Do nothing if C<is_utf8> points to 0. Sets C<is_utf8> to
-0 if C<s> is converted or contains all 7bit characters.
+0 if C<s> is converted or consists entirely of characters that are invariant
+in utf8 (i.e., US-ASCII on non-EBCDIC machines).
NOTE: this function is experimental and may change or be
removed without notice.
=item bytes_to_utf8
X<bytes_to_utf8>
-Converts a string C<s> of length C<len> from ASCII into UTF-8 encoding.
+Converts a string C<s> of length C<len> from the native encoding into UTF-8.
Returns a pointer to the newly-created string, and sets C<len> to
reflect the new length.
-If you want to convert to UTF-8 from other encodings than ASCII,
+A NUL character will be written after the end of the string.
+
+If you want to convert to UTF-8 from encodings other than
+the native (Latin1 or EBCDIC),
see sv_recode_to_utf8().
NOTE: this function is experimental and may change or be
X<is_utf8_char>
Tests if some arbitrary number of bytes begins in a valid UTF-8
-character. Note that an INVARIANT (i.e. ASCII) character is a valid
-UTF-8 character. The actual number of bytes in the UTF-8 character
-will be returned if it is valid, otherwise 0.
+character. Note that an INVARIANT (i.e. ASCII on non-EBCDIC machines)
+character is a valid UTF-8 character. The actual number of bytes in the UTF-8
+character will be returned if it is valid, otherwise 0.
STRLEN is_utf8_char(const U8 *s)
=item utf8_to_bytes
X<utf8_to_bytes>
-Converts a string C<s> of length C<len> from UTF-8 into byte encoding.
+Converts a string C<s> of length C<len> from UTF-8 into native byte encoding.
Unlike C<bytes_to_utf8>, this over-writes the original string, and
updates len to contain the new length.
Returns zero on failure, setting C<len> to -1.
which is assumed to be in UTF-8 encoding; C<retlen> will be set to the
length, in bytes, of that character.
-This function should only be used when returned UV is considered
+This function should only be used when the returned UV is considered
an index into the Unicode semantic tables (e.g. swashes).
If C<s> does not point to a well-formed UTF-8 character, zero is
=head2 ASCII
-The American Standard Code for Information Interchange is a set of
+The American Standard Code for Information Interchange (ASCII or US-ASCII) is a
+set of
integers running from 0 to 127 (decimal) that imply character
-interpretation by the display and other system(s) of computers.
+interpretation by the display and other systems of computers.
The range 0..127 can be covered by setting the bits in a 7-bit binary
digit, hence the set is sometimes referred to as a "7-bit ASCII".
ASCII was described by the American National Standards Institute
zero digits in CCSID numbers within this document are insignificant.
E.g. CCSID 0037 may be referred to as 37 in places.
+Perl can be compiled on platforms that run any of three commonly used EBCDIC
+character sets, listed below.
+
=head2 13 variant characters
Among IBM EBCDIC character code sets there are 13 characters that
\ [ ] { } ^ ~ ! # | $ @ `
+When Perl is compiled for a platform, it looks at some of these characters to
+guess which EBCDIC character set the platform uses, and adapts itself
+accordingly to that platform. If the platform uses a character set that is not
+one of the three Perl knows about, Perl will either fail to compile, or
+mistakenly and silently choose one of the three.
+They are:
+
=head2 0037
Character code set ID 0037 is a mapping of the ASCII plus Latin-1
=item *
-Many of the remaining seem to be related to case-insensitive matching:
-for example, C<< /[\x{131}]/ >> (LATIN SMALL LETTER DOTLESS I) does
-not match "I" case-insensitively, as it should under Unicode.
-(The match succeeds in ASCII-derived platforms.)
+Many of the remaining problems seem to be related to case-insensitive matching.
=item *
=head2 Unicode and UTF
-UTF is a Unicode Transformation Format. UTF-8 is a Unicode conforming
-representation of the Unicode standard that looks very much like ASCII.
-UTF-EBCDIC is an attempt to represent Unicode characters in an EBCDIC
-transparent manner.
+UTF stands for C<Unicode Transformation Format>.
+UTF-8 is an encoding of Unicode into a sequence of 8-bit byte chunks, based on
+ASCII and Latin-1.
+The length of a sequence required to represent a Unicode code point
+depends on the ordinal number of that code point,
+with larger numbers requiring more bytes.
+UTF-EBCDIC is like UTF-8, but based on EBCDIC.
+
+In UTF-8, the code points corresponding to the lowest 128
+ordinal numbers (0 - 127) are represented by the same byte (they are
+C<invariant>) whether or not the string is encoded in UTF-8.
+They occupy one byte each. All other Unicode code points
+require more than one byte to be represented in UTF-8.
+With UTF-EBCDIC, the term C<invariant> has a somewhat different meaning.
+(First, note that this is very different from the L</13 variant characters>
+mentioned above.)
+In UTF-EBCDIC, an C<invariant> character or code point
+is one which takes up exactly one byte encoded, regardless
+of whether or not the encoding changes its value
+(which it most likely will).
+(If you care, the EBCDIC invariants are those characters
+which correspond to the ASCII characters, plus those that correspond to
+the C1 controls (80..9f on ASCII platforms).)
+A string encoded in UTF-EBCDIC may be longer (but never shorter) than
+one encoded in UTF-8.
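The relationship between a code point's ordinal number and its encoded
length in UTF-8 can be sketched in standalone C. This is only an
illustration: the function name C<sketch_utf8_length> is hypothetical, the
thresholds are those of standard UTF-8 (not Perl's extended form for very
large code points, and not UTF-EBCDIC, whose boundaries differ), and no
check is made for surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Perl API: the number of bytes
 * that standard UTF-8 needs to encode code point cp, or 0 if cp is
 * above the Unicode maximum of 0x10FFFF. */
size_t sketch_utf8_length(uint32_t cp)
{
    if (cp < 0x80)
        return 1;   /* 0 - 127: invariant, one byte each */
    if (cp < 0x800)
        return 2;   /* includes 128 - 255 (the upper half of Latin-1) */
    if (cp < 0x10000)
        return 3;
    if (cp <= 0x10FFFF)
        return 4;
    return 0;       /* outside the range this sketch handles */
}
```

For instance, under these assumptions C<'A'> (0x41) needs one byte, 0xE9
needs two, and the smiley face 0x263A needs three.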
=head2 Using Encode
Starting from Perl 5.8 you can use the standard new module Encode
-to translate from EBCDIC to Latin-1 code points
+to translate from EBCDIC to Latin-1 code points.
+Encode knows about more EBCDIC character sets than Perl can currently
+be compiled to run on.
use Encode 'from_to';
open($f, ">:encoding(utf8)", "test.utf8");
print $f "Hello World!\n";
-to get two files containing "Hello World!\n" in ASCII, CP 37 EBCDIC,
-ISO 8859-1 (Latin-1) (in this example identical to ASCII) respective
-UTF-EBCDIC (in this example identical to normal EBCDIC). See the
+to get four files containing "Hello World!\n" in ASCII, CP 37 EBCDIC,
+ISO 8859-1 (Latin-1) (in this example identical to ASCII since only ASCII
+characters were printed), and
+UTF-EBCDIC (in this example identical to normal EBCDIC since only characters
+that don't differ between EBCDIC and UTF-EBCDIC were printed). See the
documentation of Encode::PerlIO for details.
As the PerlIO layer uses raw IO (bytes) internally, all this totally
$is_ascii = "\r" ne chr(13); # WRONG
$is_ascii = "\n" ne chr(10); # ILL ADVISED
-Obviously the first of these will fail to distinguish most ASCII machines
-from either a CCSID 0037, a 1047, or a POSIX-BC EBCDIC machine since "\r" eq
+Obviously the first of these will fail to distinguish most ASCII platforms
+from either a CCSID 0037, a 1047, or a POSIX-BC EBCDIC platform since "\r" eq
chr(13) under all of those coded character sets. But note too that
because "\n" is chr(13) and "\r" is chr(10) on the MacIntosh (which is an
-ASCII machine) the second C<$is_ascii> test will lead to trouble there.
+ASCII platform) the second C<$is_ascii> test will lead to trouble there.
To determine whether or not perl was built under an EBCDIC
code page you can use the Config module like so:
=head1 OPERATOR DIFFERENCES
The C<..> range operator treats certain character ranges with
-care on EBCDIC machines. For example the following array
-will have twenty six elements on either an EBCDIC machine
-or an ASCII machine:
+care on EBCDIC platforms. For example the following array
+will have twenty six elements on either an EBCDIC platform
+or an ASCII platform:
@alphabet = ('A'..'Z'); # $#alphabet == 25
The bitwise operators such as & ^ | may return different results
when operating on string or character data in a perl program running
-on an EBCDIC machine than when run on an ASCII machine. Here is
+on an EBCDIC platform than when run on an ASCII platform. Here is
an example adapted from the one in L<perlop>:
# EBCDIC-based examples
An interesting property of the 32 C0 control characters
in the ASCII table is that they can "literally" be constructed
as control characters in perl, e.g. C<(chr(0) eq "\c@")>
-C<(chr(1) eq "\cA")>, and so on. Perl on EBCDIC machines has been
+C<(chr(1) eq "\cA")>, and so on. Perl on EBCDIC platforms has been
ported to take "\c@" to chr(0) and "\cA" to chr(1) as well, but the
thirty three characters that result depend on which code page you are
using. The table below uses the character names from the previous table
s/NEGATIVE ACKNOWLEDGE/NEG. ACK./;. The POSIX-BC and 1047 sets are
identical throughout this range and differ from the 0037 set at only
one spot (21 decimal). Note that the C<LINE FEED> character
-may be generated by "\cJ" on ASCII machines but by "\cU" on 1047 or POSIX-BC
-machines and cannot be generated as a C<"\c.letter."> control character on
-0037 machines. Note also that "\c\\" maps to two characters
+may be generated by "\cJ" on ASCII platforms but by "\cU" on 1047 or POSIX-BC
+platforms and cannot be generated as a C<"\c.letter."> control character on
+0037 platforms. Note also that "\c\\" maps to two characters
not one.
chr ord 8859-1 0037 1047 && POSIX-BC
=item chr()
chr() must be given an EBCDIC code number argument to yield a desired
-character return value on an EBCDIC machine. For example:
+character return value on an EBCDIC platform. For example:
$CAPITAL_LETTER_A = chr(193);
=item ord()
-ord() will return EBCDIC code number values on an EBCDIC machine.
+ord() will return EBCDIC code number values on an EBCDIC platform.
For example:
$the_number_193 = ord("A");
The formats that can convert characters to numbers and vice versa
will be different from their ASCII counterparts when executed
-on an EBCDIC machine. Examples include:
+on an EBCDIC platform. Examples include:
printf("%c%c%c",193,194,195); # prints ABC
If you do want to match the alphabet gap characters in a single octet
regular expression try matching the hex or octal code such
-as C</\313/> on EBCDIC or C</\364/> on ASCII machines to
+as C</\313/> on EBCDIC or C</\364/> on ASCII platforms to
have your regular expression match C<o WITH CIRCUMFLEX>.
Another construct to be wary of is the inappropriate use of hex or
The above would be adequate if the concern was only with numeric code points.
However, the concern may be with characters rather than code points
-and on an EBCDIC machine it may be desirable for constructs such as
+and on an EBCDIC platform it may be desirable for constructs such as
C<if (is_print_ascii("A")) {print "A is a printable character\n";}> to print
out the expected message. One way to represent the above collection
of character classification subs that is capable of working across the
One big difference between ASCII based character sets and EBCDIC ones
are the relative positions of upper and lower case letters and the
-letters compared to the digits. If sorted on an ASCII based machine the
+letters compared to the digits. If sorted on an ASCII based platform the
two letter abbreviation for a physician comes before the two letter
for drive, that is:
The property of lower case before uppercase letters in EBCDIC is
even carried to the Latin 1 EBCDIC pages such as 0037 and 1047.
An example would be that E<Euml> C<E WITH DIAERESIS> (203) comes
-before E<euml> C<e WITH DIAERESIS> (235) on an ASCII machine, but
-the latter (83) comes before the former (115) on an EBCDIC machine.
+before E<euml> C<e WITH DIAERESIS> (235) on an ASCII platform, but
+the latter (83) comes before the former (115) on an EBCDIC platform.
(Astute readers will note that the upper case version of E<szlig>
C<SMALL LETTER SHARP S> is simply "SS" and that the upper case version of
E<yuml> C<y WITH DIAERESIS> is not in the 0..255 range but it is
at U+x0178 in Unicode, or C<"\x{178}"> in a Unicode enabled Perl).
The sort order will cause differences between results obtained on
-ASCII machines versus EBCDIC machines. What follows are some suggestions
+ASCII platforms versus EBCDIC platforms. What follows are some suggestions
on how to deal with these differences.
=head2 Ignore ASCII vs. EBCDIC sort differences.
then sort(). Do note however that such Latin-1 manipulation does not
address the E<yuml> C<y WITH DIAERESIS> character that will remain at
-code point 255 on ASCII machines, but 223 on most EBCDIC machines
+code point 255 on ASCII platforms, but 223 on most EBCDIC platforms
where it will sort to a place less than the EBCDIC numerals. With a
Unicode enabled Perl you might try:
This is the most expensive proposition that does not employ a network
connection.
-=head2 Perform sorting on one type of machine only.
+=head2 Perform sorting on one type of platform only.
This strategy can employ a network connection. As such
it would be computationally expensive.
=head2 Quoted-Printable encoding and decoding
-On ASCII encoded machines it is possible to strip characters outside of
+On ASCII encoded platforms it is possible to strip characters outside of
the printable set using:
# This QP encoder works on ASCII only
$qp_string =~ s/([=\x00-\x1F\x80-\xFF])/sprintf("=%02X",ord($1))/ge;
-Whereas a QP encoder that works on both ASCII and EBCDIC machines
+Whereas a QP encoder that works on both ASCII and EBCDIC platforms
would look somewhat like the following (where the EBCDIC branch @e2a
array is omitted for brevity):
$string =~ s/=([0-9A-Fa-f][0-9A-Fa-f])/chr hex $1/ge;
$string =~ s/=[\n\r]+$//;
-Whereas a QP decoder that works on both ASCII and EBCDIC machines
+Whereas a QP decoder that works on both ASCII and EBCDIC platforms
would look somewhat like the following (where the @a2e array is
omitted for brevity):
interesting property that alternate subsequent invocations are identity maps
(thus rot13 is its own non-trivial inverse in the group of 26 alphabet
rotations). Hence the following is a rot13 encoder and decoder that will
-work on ASCII and EBCDIC machines:
+work on ASCII and EBCDIC platforms:
#!/usr/local/bin/perl
To the extent that it is possible to write code that depends on
hashing order there may be differences between hashes as stored
-on an ASCII based machine and hashes stored on an EBCDIC based machine.
+on an ASCII based platform and hashes stored on an EBCDIC based platform.
XXX
=head1 I18N AND L10N
Internationalization(I18N) and localization(L10N) are supported at least
-in principle even on EBCDIC machines. The details are system dependent
+in principle even on EBCDIC platforms. The details are system dependent
and discussed under the L<perlebcdic/OS ISSUES> section below.
=head1 MULTI OCTET CHARACTER SETS
=head1 REFERENCES
-http://anubis.dkuug.dk/i18n/charmaps
+L<http://anubis.dkuug.dk/i18n/charmaps>
-http://www.unicode.org/
+L<http://www.unicode.org/>
-http://www.unicode.org/unicode/reports/tr16/
+L<http://www.unicode.org/unicode/reports/tr16/>
-http://www.wps.com/texts/codes/
+L<http://www.wps.com/texts/codes/>
B<ASCII: American Standard Code for Information Infiltration> Tom Jennings,
September 1999.
Fred B. Wrixon, ISBN 1-57912-040-7, Black Dog & Leventhal Publishers,
1998.
-http://www.bobbemer.com/P-BIT.HTM
+L<http://www.bobbemer.com/P-BIT.HTM>
B<IBM - EBCDIC and the P-bit; The biggest Computer Goof Ever> Robert Bemer.
=head1 HISTORY
=item *
Mixing UTF-8 and non-UTF-8 strings is tricky. Use C<bytes_to_utf8> to get
-a new string which is UTF-8 encoded. There are tricks you can use to
-delay deciding whether you need to use a UTF-8 string until you get to a
-high character - C<HALF_UPGRADE> is one of those.
+a new string which is UTF-8 encoded, and then combine them.
=back
(Then creating the symlinks...)
The specifics may vary based on your operating system, of course.
-After you see this, you can abort the F<Configure> script, and you
+After it's all done, you
will see that the directory you are in has a tree of symlinks to the
F<perl-rsync> directories and files.
=item *
+Assuming the character set is ASCIIish
+
+Perl can compile and run under EBCDIC platforms. See L<perlebcdic>.
+This is transparent for the most part, but because the character sets
+differ, you shouldn't use numeric (decimal, octal, or hex) constants
+to refer to characters. You can safely say 'A', but not 0x41.
+You can safely say '\n', but not \012.
+If a character doesn't have a trivial input form, you can
+create a #define for it in both C<utfebcdic.h> and C<utf8.h>, so that
+it resolves to different values depending on the character set being used.
+(There are three different EBCDIC character sets defined in C<utfebcdic.h>,
+so it might be best to insert the #define three times in that file.)
+
+Also, the range 'A' - 'Z' in ASCII is an unbroken sequence of 26 upper case
+alphabetic characters. That is not true in EBCDIC. Nor for 'a' to 'z'.
+But '0' - '9' is an unbroken range in both systems. Don't assume anything
+about other ranges.
+
+Many of the comments in the existing code ignore the possibility of EBCDIC,
+and may be wrong therefore, even if the code works.
+This is actually a tribute to the successful transparent insertion of being
+able to handle EBCDIC without having to change pre-existing code.
+
+UTF-8 and UTF-EBCDIC are two different encodings used to represent Unicode
+code points as sequences of bytes. Macros
+with the same names (but different definitions)
+in C<utf8.h> and C<utfebcdic.h>
+are used to allow the calling code to think that there is only one such encoding.
+This is almost always referred to as C<utf8>, but it means the EBCDIC
+version as well. Comments in the code may well be wrong even if the code
+itself is right.
+For example, the concept of C<invariant characters> differs between ASCII and
+EBCDIC.
+On ASCII platforms, only characters that do not have the high-order
+bit set (i.e. whose ordinals are strict ASCII, 0 - 127)
+are invariant, and the documentation and comments in the code
+may assume that,
+often referring to something like, say, C<hibit>.
+The situation differs and is not so simple on EBCDIC machines, but as long as
+the code itself uses the C<NATIVE_IS_INVARIANT()> macro appropriately, it
+works, even if the comments are wrong.
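The advice in this item about character constants and ranges might be
sketched as follows. The function names are hypothetical and are not taken
from the Perl source; the point is only that the code spells characters as
character literals and does not rely on C<'A'> - C<'Z'> being contiguous.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: tests membership against an explicit list of
 * letters rather than a numeric range, so it does not depend on
 * 'A'..'Z' being an unbroken sequence (which is false in EBCDIC). */
bool sketch_is_upper_portable(char ch)
{
    static const char uppers[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    /* strchr() would also match the terminating NUL, so exclude it. */
    return ch != '\0' && strchr(uppers, ch) != NULL;
}

/* Hypothetical helper: '0'..'9' is an unbroken range in both ASCII
 * and EBCDIC, so a range test and subtraction are safe for digits.
 * Returns -1 for anything that is not a decimal digit. */
int sketch_digit_value(char ch)
{
    return (ch >= '0' && ch <= '9') ? ch - '0' : -1;
}
```

Writing C<'A'> rather than 0x41 lets the compiler supply the right value for
whichever character set the platform uses.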
+
+=item *
+
+Assuming the character set is just ASCII
+
+ASCII is a 7 bit encoding, but bytes have 8 bits in them. The 128 extra
+characters have different meanings depending on the locale. Absent a locale,
+currently these extra characters are generally considered to be unassigned,
+and this has presented some problems.
+This is scheduled to be changed in 5.12 so that these characters will
+be considered to be Latin-1 (ISO-8859-1).
+
+=item *
+
Mixing #define and #ifdef
#define BURGLE(x) ... \
=item *
-Adding stuff after #endif or #else
+Adding non-comment stuff after #endif or #else
#ifdef SNOSH
...
=item *
-Binding together several statements
+Binding together several statements in a macro
Use the macros STMT_START and STMT_END.
out from under you the next time the cache is
invalidated).
- AV* mro_get_linear_isa_c3(HV* stash, I32 level)
+ AV* mro_get_linear_isa_c3(HV* stash, U32 level)
=for hackers
Found in file mro.c
out from under you the next time the cache is
invalidated).
- AV* mro_get_linear_isa_dfs(HV* stash, I32 level)
+ AV* mro_get_linear_isa_dfs(HV* stash, U32 level)
=for hackers
Found in file mro.c
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.
+Under byte semantics, when C<use locale> is in effect, Perl uses the
+semantics associated with the current locale. Absent a C<use locale>, Perl
+currently uses US-ASCII (or Basic Latin in Unicode terminology) byte semantics,
+meaning that characters whose ordinal numbers are in the range 128 - 255 are
+undefined except for their ordinal numbers. This means that none have case
+(upper and lower), nor are any a member of character classes, like C<[:alpha:]>
+or C<\w>.
+(But all do belong to the C<\W> class or the Perl regular expression extension
+C<[:^alpha:]>.)
+
This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being as source of Unicode
be used to force byte semantics on Unicode data.
If strings operating under byte semantics and strings with Unicode
-character data are concatenated, the new string will be created by
-decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
-old Unicode string used EBCDIC. This translation is done without
-regard to the system's native 8-bit encoding.
+character data are concatenated, the new string will have
+character semantics.
Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
Unicode characters can also be added to a string by using the C<\x{...}>
notation. The Unicode code for the desired character, in hexadecimal,
should be placed in the braces. For instance, a smiley face is
-C<\x{263A}>. This encoding scheme only works for all characters, but
+C<\x{263A}>. This encoding scheme works for all characters, but
for characters under 0x100, note that Perl may use an 8 bit encoding
internally, for optimization and/or backward compatibility.
user-defined character properties [b] to emulate set operations
[6] \b \B
[7] note that Perl does Full case-folding in matching, not Simple:
- for example U+1F88 is equivalent with U+1F00 U+03B9,
- not with 1F80. This difference matters for certain Greek
+ for example U+1F88 is equivalent to U+1F00 U+03B9,
+ not to U+1F80. This difference matters mainly for certain Greek
capital letters with certain modifiers: the Full case-folding
decomposes the letter, while the Simple case-folding would map
it to a single character.
=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
Sometimes (see L</"When Unicode Does Not Happen">) there are
-situations where you simply need to force Perl to believe that a byte
-string is UTF-8, or vice versa. The low-level calls
-utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are
+situations where you simply need to force a byte
+string into UTF-8, or vice versa. The low-level calls
+utf8::upgrade($bytestring) and utf8::downgrade($utf8string[, FAIL_OK]) are
the answers.
-Do not use them without careful thought, though: Perl may easily get
-very confused, angry, or even crash, if you suddenly change the 'nature'
-of scalar like that. Especially careful you have to be if you use the
-utf8::upgrade(): any random byte string is not valid UTF-8.
+Note that utf8::downgrade() can fail if the string contains characters
+that don't fit into a byte.
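Why a downgrade can fail may be easier to see in a standalone C sketch. This
is not Perl's implementation: the function name C<sketch_utf8_downgrade> and
its signature are hypothetical, and it assumes an ASCII platform with
standard UTF-8, where only the lead bytes 0xC2 and 0xC3 introduce characters
in the range 128 - 255.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: converts the UTF-8 in src (len bytes) to one
 * byte per character in dst, which must have room for len bytes.
 * Returns false, leaving *out_len untouched, if any character does
 * not fit in a byte or the input is malformed; dst may then hold a
 * partial result. */
bool sketch_utf8_downgrade(const unsigned char *src, size_t len,
                           unsigned char *dst, size_t *out_len)
{
    size_t i = 0;   /* read position in src */
    size_t n = 0;   /* write position in dst */

    while (i < len) {
        unsigned char b = src[i];

        if (b < 0x80) {
            /* Invariant character: one byte, same value. */
            dst[n++] = b;
            i += 1;
        } else if ((b == 0xC2 || b == 0xC3) && i + 1 < len
                   && (src[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence encoding a character in 0x80..0xFF. */
            dst[n++] = (unsigned char)(((b & 0x03) << 6)
                                       | (src[i + 1] & 0x3F));
            i += 2;
        } else {
            /* Character above 0xFF, or malformed input. */
            return false;
        }
    }

    *out_len = n;
    return true;
}
```

Under these assumptions, the bytes 0x41 0xC3 0xA9 downgrade to the two bytes
0x41 0xE9, while the three-byte encoding of the smiley face 0x263A is
rejected because that character cannot fit in a byte.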
=head2 Using Unicode in XS
=item *
C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
-pragma is not in effect. C<SvUTF8(sv)> returns true is the C<UTF8>
+pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the bytes pragma is ignored. The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
=item *
-C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
+C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into
a buffer encoding the code point as UTF-8, and returns a pointer
-pointing after the UTF-8 bytes.
+pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines.
=item *
-C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
+C<utf8_to_uvchr(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
returns the Unicode character code point and, optionally, the length of
-the UTF-8 byte sequence.
+the UTF-8 byte sequence. It works appropriately on EBCDIC machines.
=item *
=item *
-C<utf8_hop(s, off)> will return a pointer to an UTF-8 encoded buffer
+C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
=item *
-C<ibcmp_utf8(s1, pe1, u1, l1, u1, s2, pe2, l2, u2)> can be used to
+C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode. For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual.
Unicode support will also tend to run slower. Use of locales with
Unicode is discouraged.
+=head2 Problems with characters whose ordinal numbers are in the range 128 - 255 with no Locale specified
+
+Without a locale specified, unlike all other characters or code points,
+these characters have very different semantics in byte semantics versus
+character semantics.
+In character semantics they are interpreted as Unicode code points, which means
+they are viewed as Latin-1 (ISO-8859-1).
+In byte semantics, they are considered to be unassigned characters,
+meaning that the only semantics they have is their
+ordinal numbers, and that they are not members of various character classes.
+None are considered to match C<\w> for example, but all match C<\W>.
+Besides these class matches,
+the known operations that this affects are those that change the case,
+regular expression matching while ignoring case,
+and B<quotemeta()>.
+This can lead to unexpected results in which a string's semantics suddenly
+change if a code point above 255 is appended to or removed from it,
+which changes the string's semantics from byte to character or vice versa.
+This behavior is scheduled to change in version 5.12, but in the meantime,
+a workaround is to always call utf8::upgrade($string).
+
=head2 Interaction with Extensions
When Perl exchanges data with an extension, the extension should be
Affected are C<uc>, C<lc>, C<ucfirst>, C<lcfirst>, C<\U>, C<\L>, C<\u>, C<\l>,
C<\d>, C<\s>, C<\w>, C<\D>, C<\S>, C<\W>, C</.../i>, C<(?i:...)>,
-C</[[:posix:]]/>.
+C</[[:posix:]]/>, and C<quotemeta> (though this last should not cause any real
+problems).
To force Unicode semantics, you can upgrade the internal representation to
-by doing C<utf8::upgrade($string)>. This does not change strings that were
-already upgraded.
+UTF-8 by doing C<utf8::upgrade($string)>. This can be used
+safely on any string, as it checks for, and does not change, strings that have
+already been upgraded.
For a more detailed discussion, see L<Unicode::Semantics> on CPAN.
A Unicode I<character> is an abstract entity. It is not bound to any
particular integer width, especially not to the C language C<char>.
Unicode is language-neutral and display-neutral: it does not encode the
-language of the text and it does not define fonts or other graphical
+language of the text and it does not generally define fonts or other graphical
layout details. Unicode operates on characters and on text built from
those characters.
problems of the initial Unicode implementation, but for example
regular expressions still do not work with Unicode in 5.6.1.
-B<Starting from Perl 5.8.0, the use of C<use utf8> is no longer
-necessary.> In earlier releases the C<utf8> pragma was used to declare
+B<Starting from Perl 5.8.0, the use of C<use utf8> is needed only in much more restricted circumstances.> In earlier releases the C<utf8> pragma was used to declare
that operations in the current block or file would be Unicode-aware.
This model was found to be wrong, or at least clumsy: the "Unicodeness"
is now carried with the data, instead of being attached to the
The long answer is that you need to consider character normalization
and casing issues: see L<Unicode::Normalize>, Unicode Technical
Reports #15 and #21, I<Unicode Normalization Forms> and I<Case
-Mappings>, http://www.unicode.org/unicode/reports/tr15/ and
-http://www.unicode.org/unicode/reports/tr21/
+Mappings>, L<http://www.unicode.org/unicode/reports/tr15/> and
+L<http://www.unicode.org/unicode/reports/tr21/>
As of Perl 5.8.0, the "Full" case-folding of I<Case
Mappings/SpecialCasing> is implemented.
The long answer is that "it depends", and a good answer cannot be
given without knowing (at the very least) the language context.
See L<Unicode::Collate>, and I<Unicode Collation Algorithm>
-http://www.unicode.org/unicode/reports/tr10/
+L<http://www.unicode.org/unicode/reports/tr10/>
=back
Character ranges in regular expression character classes (C</[a-z]/>)
and in the C<tr///> (also known as C<y///>) operator are not magically
-Unicode-aware. What this means that C<[A-Za-z]> will not magically start
+Unicode-aware. What this means is that C<[A-Za-z]> will not magically start
to mean "all alphabetic letters"; not that it does mean that even for
8-bit characters, you should be using C</[[:alpha:]]/> in that case.
How Do I Know Whether My String Is In Unicode?
-You shouldn't care. No, you really shouldn't. No, really. If you
-have to care--beyond the cases described above--it means that we
-didn't get the transparency of Unicode quite right.
+You shouldn't have to care. But you may, because currently the semantics of the
+characters whose ordinals are in the range 128 to 255 is different depending on
+whether the string they are contained within is in Unicode or not.
+(See L<perlunicode>.)
-Okay, if you insist:
+To determine if a string is in Unicode, use:
print utf8::is_utf8($string) ? 1 : 0, "\n";
Sometimes you might really need to know the byte length of a string
instead of the character length. For that use either the
-C<Encode::encode_utf8()> function or the C<bytes> pragma and its only
-defined function C<length()>:
+C<Encode::encode_utf8()> function or the C<bytes> pragma and
+the C<length()> function:
my $unicode = chr(0x100);
print length($unicode), "\n"; # will print 1
For example,
use Encode 'decode_utf8';
-
+
if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
# $string is valid utf8
} else {
$Unicode = pack("U0a*", $bytes);
-You can convert well-formed UTF-8 to a sequence of bytes, but if
-you just want to convert random binary data into UTF-8, you can't.
-B<Any random collection of bytes isn't well-formed UTF-8>. You can
-use C<unpack("C*", $string)> for the former, and you can create
-well-formed Unicode data by C<pack("U*", 0xff, ...)>.
+You can find the bytes that make up a UTF-8 sequence with
+
+ @bytes = unpack("C*", $Unicode_string)
+
+and you can create well-formed Unicode with
+
+ $Unicode_string = pack("U*", 0xff, ...)
=item *
How Do I Display Unicode? How Do I Input Unicode?
-See http://www.alanwood.net/unicode/ and
-http://www.cl.cam.ac.uk/~mgk25/unicode.html
+See L<http://www.alanwood.net/unicode/> and
+L<http://www.cl.cam.ac.uk/~mgk25/unicode.html>
=item *
Unicode Consortium
-http://www.unicode.org/
+L<http://www.unicode.org/>
=item *
Unicode FAQ
-http://www.unicode.org/unicode/faq/
+L<http://www.unicode.org/unicode/faq/>
=item *
Unicode Glossary
-http://www.unicode.org/glossary/
+L<http://www.unicode.org/glossary/>
=item *
Unicode Useful Resources
-http://www.unicode.org/unicode/onlinedat/resources.html
+L<http://www.unicode.org/unicode/onlinedat/resources.html>
=item *
Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications
-http://www.alanwood.net/unicode/
+L<http://www.alanwood.net/unicode/>
=item *
UTF-8 and Unicode FAQ for Unix/Linux
-http://www.cl.cam.ac.uk/~mgk25/unicode.html
+L<http://www.cl.cam.ac.uk/~mgk25/unicode.html>
=item *
Legacy Character Sets
-http://www.czyborra.com/
-http://www.eki.ee/letter/
+L<http://www.czyborra.com/>
+L<http://www.eki.ee/letter/>
=item *
#define PERL_ARGS_ASSERT_SV_UTF8_UPGRADE \
assert(sv)
+/* PERL_CALLCONV STRLEN Perl_sv_utf8_upgrade_nomg(pTHX_ SV *sv)
+ __attribute__nonnull__(pTHX_1); */
+#define PERL_ARGS_ASSERT_SV_UTF8_UPGRADE_NOMG \
+ assert(sv)
+
PERL_CALLCONV bool Perl_sv_utf8_downgrade(pTHX_ SV *const sv, const bool fail_ok)
__attribute__nonnull__(pTHX_1);
#define PERL_ARGS_ASSERT_SV_UTF8_DOWNGRADE \
Converts the PV of an SV to its UTF-8-encoded form.
Forces the SV to string form if it is not already.
+Will C<mg_get> on C<sv> if appropriate.
Always sets the SvUTF8 flag to avoid future validity checks even
-if all the bytes have hibit clear.
+if the whole string is the same in UTF-8 as not.
+Returns the number of bytes in the converted string.
This is not as a general purpose byte encoding to Unicode interface:
use the Encode extension for that.
+=for apidoc sv_utf8_upgrade_nomg
+
+Like C<sv_utf8_upgrade>, but doesn't do magic on C<sv>.
+
=for apidoc sv_utf8_upgrade_flags
Converts the PV of an SV to its UTF-8-encoded form.
Forces the SV to string form if it is not already.
Always sets the SvUTF8 flag to avoid future validity checks even
-if all the bytes have hibit clear. If C<flags> has C<SV_GMAGIC> bit set,
-will C<mg_get> on C<sv> if appropriate, else not. C<sv_utf8_upgrade> and
+if all the bytes are invariant in UTF-8. If C<flags> has C<SV_GMAGIC> bit set,
+will C<mg_get> on C<sv> if appropriate, else not.
+Returns the number of bytes in the converted string.
+C<sv_utf8_upgrade> and
C<sv_utf8_upgrade_nomg> are implemented in terms of this function.
This is not as a general purpose byte encoding to Unicode interface:
sv_recode_to_utf8(sv, PL_encoding);
else { /* Assume Latin-1/EBCDIC */
/* This function could be much more efficient if we
- * had a FLAG in SVs to signal if there are any hibit
+ * had a FLAG in SVs to signal if there are any variant
* chars in the PV. Given that there isn't such a flag
* make the loop as fast as possible. */
const U8 * const s = (U8 *) SvPVX_const(sv);
while (t < e) {
const U8 ch = *t++;
- /* Check for hi bit */
+ /* Check for variant */
if (!NATIVE_IS_INVARIANT(ch)) {
STRLEN len = SvCUR(sv);
/* *Currently* bytes_to_utf8() adds a '\0' after every string
break;
}
}
- /* Mark as UTF-8 even if no hibit - saves scanning loop */
+ /* Mark as UTF-8 even if no variant - saves scanning loop */
SvUTF8_on(sv);
}
return SvCUR(sv);
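
The scan for variant bytes and the returned byte count can be illustrated with a small standalone sketch. It is not the code in sv.c: C<upgraded_length> is a hypothetical helper, and it assumes an ASCII platform, where a byte is invariant exactly when it is below 0x80 (on EBCDIC the set of invariants differs, which is why the real loop uses C<NATIVE_IS_INVARIANT>).

```c
#include <stddef.h>

/* Hypothetical helper, not part of the Perl API.
 * Returns the number of bytes a native (Latin-1) string would occupy
 * after conversion to UTF-8.
 * ASSUMPTION: ASCII platform, so "invariant" means byte < 0x80. */
size_t upgraded_length(const unsigned char *s, size_t len)
{
    size_t out = len;
    size_t i;

    for (i = 0; i < len; i++) {
        if (s[i] >= 0x80)   /* variant byte: becomes two bytes in UTF-8 */
            out++;
    }
    return out;
}
```

A string with no variant bytes keeps its length, which is the case where the upgrade only needs to turn the flag on.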
=for apidoc sv_utf8_downgrade
Attempts to convert the PV of an SV from characters to bytes.
-If the PV contains a character beyond byte, this conversion will fail;
+If the PV contains a character that cannot fit
+in a byte, this conversion will fail;
in this case, either returns false or, if C<fail_ok> is not
true, croaks.
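
As a rough illustration of why a downgrade can fail, the following standalone sketch converts a UTF-8 buffer to Latin-1 in place. C<downgrade_in_place> is a hypothetical name, not a Perl API function, and the sketch assumes an ASCII platform. It validates first so that a failed conversion leaves the buffer untouched.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the Perl API.
 * Converts the UTF-8 in s[0 .. *len) to Latin-1 in place.
 * Returns 1 and updates *len on success; returns 0 and leaves the
 * buffer and *len unchanged if a character does not fit in a byte
 * or the input is malformed.
 * ASSUMPTION: ASCII platform (plain UTF-8, not UTF-EBCDIC). */
int downgrade_in_place(unsigned char *s, size_t *len)
{
    size_t i, out = 0;

    /* Pass 1: every character must be ASCII or a two-byte sequence
     * whose lead byte is 0xC2 or 0xC3 (code points 0x80 to 0xFF). */
    for (i = 0; i < *len; i++) {
        if (s[i] < 0x80)
            continue;
        if ((s[i] == 0xC2 || s[i] == 0xC3) && i + 1 < *len
            && (s[i + 1] & 0xC0) == 0x80) {
            i++;            /* skip the continuation byte */
            continue;
        }
        return 0;           /* too wide for a byte, or malformed */
    }

    /* Pass 2: rewrite; the output index never overtakes the input. */
    for (i = 0; i < *len; i++) {
        if (s[i] < 0x80) {
            s[out++] = s[i];
        } else {
            s[out++] = (unsigned char)(((s[i] & 0x03) << 6)
                                       | (s[i + 1] & 0x3F));
            i++;
        }
    }
    *len = out;
    return 1;
}
```

A caller that treats a zero return as fatal corresponds to C<fail_ok> being false; one that merely checks the result corresponds to C<fail_ok> being true.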
my @CF;
while (<CF>) {
- # Skip S since we are going for 'F'ull case folding
+ # Skip S since we are going for 'F'ull case folding. I is obsolete starting
+ # with Unicode 3.2, but leaving it in does no harm, and allows backward
+ # compatibility
if (/^([0-9A-F]+); ([CFI]); ((?:[0-9A-F]+)(?: [0-9A-F]+)*); \# (.+)/) {
next if EBCDIC && hex $1 < 0x100;
push @CF, [$1, $2, $3, $4];
=for apidoc is_utf8_char
Tests if some arbitrary number of bytes begins in a valid UTF-8
-character. Note that an INVARIANT (i.e. ASCII) character is a valid
-UTF-8 character. The actual number of bytes in the UTF-8 character
-will be returned if it is valid, otherwise 0.
+character. Note that an INVARIANT (i.e. ASCII on non-EBCDIC machines)
+character is a valid UTF-8 character. The actual number of bytes in the UTF-8
+character will be returned if it is valid, otherwise 0.
=cut */
STRLEN
which is assumed to be in UTF-8 encoding; C<retlen> will be set to the
length, in bytes, of that character.
-This function should only be used when returned UV is considered
+This function should only be used when the returned UV is considered
an index into the Unicode semantic tables (e.g. swashes).
If C<s> does not point to a well-formed UTF-8 character, zero is
/*
=for apidoc utf8_to_bytes
-Converts a string C<s> of length C<len> from UTF-8 into byte encoding.
+Converts a string C<s> of length C<len> from UTF-8 into native byte encoding.
Unlike C<bytes_to_utf8>, this over-writes the original string, and
updates len to contain the new length.
Returns zero on failure, setting C<len> to -1.
/*
=for apidoc bytes_from_utf8
-Converts a string C<s> of length C<len> from UTF-8 into byte encoding.
+Converts a string C<s> of length C<len> from UTF-8 into native byte encoding.
Unlike C<utf8_to_bytes> but like C<bytes_to_utf8>, returns a pointer to
the newly-created string, and updates C<len> to contain the new
length. Returns the original string if no conversion occurs, C<len>
is unchanged. Do nothing if C<is_utf8> points to 0. Sets C<is_utf8> to
-0 if C<s> is converted or contains all 7bit characters.
+0 if C<s> is converted or consists entirely of characters that are invariant
+in utf8 (i.e., US-ASCII on non-EBCDIC machines).
=cut
*/
/*
=for apidoc bytes_to_utf8
-Converts a string C<s> of length C<len> from ASCII into UTF-8 encoding.
+Converts a string C<s> of length C<len> from the native encoding into UTF-8.
Returns a pointer to the newly-created string, and sets C<len> to
reflect the new length.
-If you want to convert to UTF-8 from other encodings than ASCII,
+A NUL character will be written after the end of the string.
+
+If you want to convert to UTF-8 from encodings other than
+the native (Latin1 or EBCDIC),
see sv_recode_to_utf8().
=cut
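
The conversion described for C<bytes_to_utf8> can be sketched in standalone C as follows. This is an illustration only, not the implementation in utf8.c: C<latin1_to_utf8> is a hypothetical name, plain C<malloc> stands in for Perl's allocator, and the native encoding is assumed to be Latin-1 (an ASCII platform).

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Perl API.
 * Returns a newly allocated UTF-8 copy of the Latin-1 string
 * s[0 .. *len), with a trailing NUL, and sets *len to the new length
 * (not counting the NUL). Returns NULL if allocation fails.
 * ASSUMPTION: native encoding is Latin-1, not EBCDIC. */
unsigned char *latin1_to_utf8(const unsigned char *s, size_t *len)
{
    /* Worst case: every byte is variant and doubles in size. */
    unsigned char *dst = malloc(*len * 2 + 1);
    unsigned char *d = dst;
    size_t i;

    if (dst == NULL)
        return NULL;

    for (i = 0; i < *len; i++) {
        if (s[i] < 0x80) {
            *d++ = s[i];                                   /* invariant */
        } else {
            *d++ = (unsigned char)(0xC0 | (s[i] >> 6));    /* lead byte */
            *d++ = (unsigned char)(0x80 | (s[i] & 0x3F));  /* continuation */
        }
    }
    *d = '\0';
    *len = (size_t)(d - dst);
    return dst;
}
```

The caller owns the returned buffer and releases it with C<free> in this sketch.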