package utf8;


$utf8::hint_bits = 0x00800000;

our $VERSION = '1.00';

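sub import {
    # Turn on the UTF-8 hint bit for the lexical scope being compiled.
    $^H |= $utf8::hint_bits;
    # Remember the optional argument to "use utf8", keyed by caller package.
    $enc{caller()} = $_[1] if $_[1];
}

sub unimport {
    # Clear the hint bit again for the rest of the enclosing lexical scope.
    $^H &= ~$utf8::hint_bits;
}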

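sub AUTOLOAD {
    # Load the rarely used helper routines on demand.
    require "utf8_heavy.pl";
    goto &$AUTOLOAD if defined &$AUTOLOAD;
    # Carp is not loaded by this file, so pull it in before croaking.
    require Carp;
    Carp::croak("Undefined subroutine $AUTOLOAD called");
}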

1;
__END__

=head1 NAME

utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code

=head1 SYNOPSIS

    use utf8;
    no utf8;

=head1 DESCRIPTION

WARNING: The implementation of Unicode support in Perl is incomplete.
See L<perlunicode> for the exact details.

The C<use utf8> pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope (or UTF-EBCDIC on EBCDIC-based
platforms).  The C<no utf8> pragma tells Perl to switch back to treating
the source text as literal bytes in the current lexical scope.

This pragma is primarily a compatibility device.  Perl versions
earlier than 5.6 allowed arbitrary bytes in source code, whereas
in the future we would like to standardize on the UTF-8 encoding for
source text.  Until UTF-8 becomes the default format for source
text, use this pragma so that Perl recognizes UTF-8 in the source.
When UTF-8 becomes the standard source format, this pragma will
effectively become a no-op.  For convenience, the term UTF-X is used
below to refer to UTF-8 on ASCII and ISO Latin based platforms and
UTF-EBCDIC on EBCDIC-based platforms.

Enabling the C<utf8> pragma has the following effects:

=over 4

=item *

Bytes in the source text that have their high bit set will be treated
as being part of a literal UTF-8 character.  This applies to most literals,
such as identifiers, string constants, constant regular expression patterns
and package names.  On EBCDIC platforms, characters in the C1 control group
and the Latin 1 character set are treated as being part of a literal
UTF-EBCDIC character.

=item *

In the absence of inputs marked as UTF-X, regular expressions within the
scope of this pragma will default to using character semantics instead
of byte semantics.

    @bytes_or_chars = split //, $data;  # may split to bytes if
                                        # $data isn't UTF-X
    {
        use utf8;                       # force char semantics
        @chars = split //, $data;       # splits characters
    }

=back
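
The first effect above can be observed without saving a file in any
particular encoding.  The sketch below is illustrative only: it assumes an
ASCII platform and, on newer perls, a string C<eval> that runs without the
C<unicode_eval> feature enabled.  The variable names are made up for the
example.  It compiles the same two source bytes twice, once under
C<no utf8> and once under C<use utf8>.

    # Source text for a string literal holding the two bytes 0xC3 0xA9,
    # which form the UTF-8 encoding of U+00E9.
    my $literal = qq{"\xC3\xA9"};

    # Without the pragma each high-bit byte is its own character.
    my $len_bytes = eval "no utf8; length($literal)";

    # With the pragma the two bytes are read as one character.
    my $len_chars = eval "use utf8; length($literal)";

    print "no utf8: $len_bytes, use utf8: $len_chars\n";

Under those assumptions this prints C<no utf8: 2, use utf8: 1>.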

=head2 Utility functions

The following functions are defined in the C<utf8::> package by the perl core.

=over 4

=item * $num_octets = utf8::upgrade($string)

Converts the internal representation of the string to perl's internal
UTF-X form.  Returns the number of octets necessary to represent the
string as UTF-X.

=item * utf8::downgrade($string[, CHECK])

Converts the internal representation of the string to un-encoded bytes.

=item * utf8::encode($string)

Converts I<$string> in place from logical characters to the octet sequence
that represents it in perl's UTF-X encoding.

=item * $flag = utf8::decode($string)

Attempts to convert I<$string> in place from perl's UTF-X encoding into
logical characters.

=back
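
A short round trip may make the difference between these functions
clearer.  This is a sketch under stated assumptions, not part of the
module: it assumes an ASCII platform, where UTF-X means UTF-8, and
C<$string> and the other variables are example names only.

    my $string = "caf\xE9";                    # 4 characters, last is U+00E9

    # upgrade changes only the internal storage; the value is unchanged.
    my $num_octets = utf8::upgrade($string);   # 5 octets in UTF-8 form
    my $same = ($string eq "caf\xE9");         # still true, length still 4

    # encode replaces the characters with their UTF-8 octets, in place.
    utf8::encode($string);
    my $encoded_len = length($string);         # 5: "caf\xC3\xA9"

    # decode reverses that, turning the octets back into characters.
    my $flag = utf8::decode($string);
    my $decoded_len = length($string);         # 4 again

    # downgrade returns the internal storage to one byte per character.
    utf8::downgrade($string);

C<upgrade> and C<downgrade> never change what the string compares equal
to, whereas C<encode> and C<decode> change its length and contents.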

=head1 SEE ALSO

L<perlunicode>, L<bytes>

=cut