package Encode::Encoding;
# Base class for classes which implement encodings
use strict;
our $VERSION = do { my @r = (q$Revision: 1.27 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r };

sub Define
{
    my $obj = shift;
    my $canonical = shift;
    $obj = bless { Name => $canonical },$obj unless ref $obj;
    # warn "$canonical => $obj\n";
    Encode::define_encoding($obj, $canonical, @_);
}

sub name { shift->{'Name'} }

# Temporary legacy methods
sub toUnicode    { shift->decode(@_) }
sub fromUnicode  { shift->encode(@_) }

sub new_sequence { return $_[0] }

sub perlio_ok { 0 }

sub needs_lines  { 0 }

sub DESTROY {}

1;
__END__

=head1 NAME

Encode::Encoding - Encode Implementation Base Class

=head1 SYNOPSIS

  package Encode::MyEncoding;
  use base qw(Encode::Encoding);

  __PACKAGE__->Define(qw(myCanonical myAlias));

=head1 DESCRIPTION

As mentioned in L<Encode>, encodings are (in the current
implementation at least) defined by objects. Encoding names are
mapped to objects via the C<%encodings> hash.

The values of the hash can currently be either strings or objects.
The string form may go away in the future. The string form occurs
when C<encodings()> has scanned C<@INC> for loadable encodings but has
not actually loaded the encoding in question. Loading is deferred
because the current "loading" process is all Perl and a bit slow.

Once an encoding is loaded, the value of the hash is the object which
implements the encoding. The object should provide the following
interface:

=over 4

=item -E<gt>name

MUST return the string representing the canonical name of the encoding.

=item -E<gt>new_sequence

This is a placeholder for encodings with state. It should return an
object which implements this interface.  All current implementations
return the original object.

=item -E<gt>encode($string,$check)

MUST return the octet sequence representing I<$string>. 

=over 2

=item *

If I<$check> is true, it SHOULD modify I<$string> in place to remove
the converted part (i.e.  the whole string unless there is an error).
If perlio_ok() is true, SHOULD becomes MUST.

=item *

If an error occurs, it SHOULD return the octet sequence for the
fragment of string that has been converted and modify I<$string> in
place to remove the converted part, leaving it starting with the
problem fragment.  If perlio_ok() is true, SHOULD becomes MUST.

=item *

If I<$check> is false then C<encode> MUST make a "best effort" to
convert the string - for example, by using a replacement character.

=back

=item -E<gt>decode($octets,$check)

MUST return the string that I<$octets> represents. 

=over 2

=item *

If I<$check> is true, it SHOULD modify I<$octets> in place to remove
the converted part (i.e.  the whole sequence unless there is an
error).  If perlio_ok() is true, SHOULD becomes MUST.

=item *

If an error occurs, it SHOULD return the fragment of string that has
been converted and modify I<$octets> in place to remove the converted
part, leaving it starting with the problem fragment.  If perlio_ok()
is true, SHOULD becomes MUST.

=item *

If I<$check> is false then C<decode> should make a "best effort" to
convert the string - for example by using Unicode's "\x{FFFD}" as a
replacement character.

=back

=item -E<gt>perlio_ok()

If you want your encoding to work with PerlIO, you MUST define this
method so that it returns 1 when PerlIO is enabled.  Here is an
example:

 sub perlio_ok { exists $INC{"PerlIO/encoding.pm"} }

By default, this method is defined as follows:

 sub perlio_ok { 0 }

=item -E<gt>needs_lines()

If your encoding can work with PerlIO but needs line buffering, you
MUST define this method so it returns true.  7-bit ISO-2022 encodings
are one example that needs this.  When this method is missing, false
is assumed.

=back
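
The interface above can be illustrated with a small pure-Perl class.
This is only a sketch under stated assumptions, not how any bundled
encoding is implemented: the class name C<Encode::XAsciiDemo>, the
encoding name C<x-ascii-demo> and the choice of C<?> as the replacement
character are all invented for illustration.  It treats anything
outside 7-bit ASCII as an error.

```perl

  use strict;
  use warnings;
  use Encode ();

  # Hypothetical 7-bit ASCII encoding; class and encoding names are invented.
  package Encode::XAsciiDemo;
  use base qw(Encode::Encoding);

  __PACKAGE__->Define('x-ascii-demo');

  sub encode {
      my ($self, $string, $check) = @_;
      if ($check) {
          # Convert only the leading run of ASCII characters.
          my ($good) = $string =~ /\A([\x00-\x7F]*)/;
          # Leave the caller's variable starting at the problem fragment.
          $_[1] = substr($string, length $good);
          utf8::downgrade($good);
          return $good;
      }
      # Unchecked: best effort, substituting '?' for unmappable characters.
      (my $octets = $string) =~ s/[^\x00-\x7F]/?/g;
      utf8::downgrade($octets);
      return $octets;
  }

  sub decode {
      my ($self, $octets, $check) = @_;
      if ($check) {
          my ($good) = $octets =~ /\A([\x00-\x7F]*)/;
          $_[1] = substr($octets, length $good);
          return $good;
      }
      # Unchecked: best effort, substituting U+FFFD for bad octets.
      (my $string = $octets) =~ s/[^\x00-\x7F]/\x{FFFD}/g;
      return $string;
  }

  package main;

  my $enc = Encode::find_encoding('x-ascii-demo');

  my $src = "abc\x{263A}def";
  my $out = $enc->encode($src, 1);                 # checked: stops at U+263A
  my $lossy = $enc->encode("abc\x{263A}def", 0);   # unchecked: substitutes

  my $in  = "hi\xFFyo";
  my $str = $enc->decode($in, 1);                  # checked: stops at \xFF

```

Because the checked branches assign to C<$_[1]>, the caller must pass a
variable rather than a literal when I<$check> is true.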

Note that the I<$check> behaviour here differs from that of the
outer public API. The logic is that the "unchecked" case is useful
when the encoding is part of a stream which may be reporting errors
(e.g. STDERR).  In such cases, it is desirable to get everything
through somehow without causing additional errors which obscure the
original one. Also, the encoding is best placed to know what the
correct replacement character is, so if substitution is the desired
behaviour, letting the low-level code do it is the most efficient
approach.

By contrast, if I<$check> is true, the scheme above allows the
encoding to do as much as it can and tell the layer above how much
that was. What is lacking at present is a mechanism to report what
went wrong. The most likely interface will be an additional method
call to the object, or perhaps (to avoid forcing per-stream objects
on otherwise stateless encodings) an additional parameter.
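
The "do as much as it can and tell the layer above how much that was"
contract can be observed from the calling side.  The following is a
minimal sketch, assuming a reasonably recent Encode in which the
built-in UTF-8 encoding object accepts C<Encode::FB_QUIET> as a true
I<$check> value; the sample octets are invented.

```perl

  use strict;
  use warnings;
  use Encode ();

  my $enc    = Encode::find_encoding('UTF-8');
  my $octets = "caf\xC3\xA9\xFFrest";   # \xFF is never valid in UTF-8
  my $before = length $octets;

  # A true $check: convert what is convertible, then stop at the bad octet.
  my $text = $enc->decode($octets, Encode::FB_QUIET());

  # The shortened source tells the caller how much was consumed.
  my $consumed = $before - length $octets;
  printf "%d characters decoded, %d octets consumed, %d octets left\n",
      length($text), $consumed, length($octets);

```

The caller learns where conversion stopped, but not why, which is the
gap the paragraph above describes.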

It is also highly desirable that encoding classes inherit from
C<Encode::Encoding> as a base class. This allows that class to define
additional behaviour for all encoding objects. For example, built-in
Unicode, UCS-2, and UTF-8 classes use

  package Encode::MyEncoding;
  use base qw(Encode::Encoding);

  __PACKAGE__->Define(qw(myCanonical myAlias));

to create an object with C<< bless {Name => ...}, $class >> and to
call C<Encode::define_encoding>.  They inherit their C<name> method
from C<Encode::Encoding>.
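
Using the placeholder names from the synopsis, the effect of C<Define>
can be checked with C<Encode::find_encoding>.  This is a sketch that
assumes aliases passed to C<Define> resolve to the same registered
object; the class defines no C<encode> or C<decode>, so it only
demonstrates registration and the inherited C<name> method.

```perl

  use strict;
  use warnings;
  use Encode ();

  package Encode::MyEncoding;    # placeholder class from the synopsis
  use base qw(Encode::Encoding);

  # Registers one blessed { Name => 'myCanonical' } object under both names.
  __PACKAGE__->Define(qw(myCanonical myAlias));

  package main;

  my $by_name  = Encode::find_encoding('myCanonical');
  my $by_alias = Encode::find_encoding('myAlias');

  print ref($by_name), "\n";      # class of the registered object
  print $by_alias->name, "\n";    # canonical name, via the inherited name()

```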

=head2 Compiled Encodings

For the sake of speed and efficiency, most of the encodings are now
supported via a I<compiled form>: XS modules generated from UCM
files.   Encode provides the enc2xs tool to achieve that.  Please see
L<enc2xs> for more details.

=head1 SEE ALSO

L<perlmod>, L<enc2xs>

=begin future

=over 4

=item Scheme 1

The fixup routine gets passed the remaining fragment of string being
processed.  It modifies it in place to remove bytes/characters it can
understand and returns a string used to represent them.  For example:

 sub fixup {
   my $ch = substr($_[0],0,1,'');   # remove and keep the first character
   return sprintf('\x{%02X}',ord($ch));
 }

This scheme is close to how the underlying C code for Encode works,
but gives the fixup routine very little context.

=item Scheme 2

The fixup routine gets passed the original string, an index into
it of the problem area, and the output string so far.  It appends
what it wants to the output string and returns a new index into the
original string.  For example:

 sub fixup {
   # my ($s,$i,$d) = @_;
   my $ch = substr($_[0],$_[1],1);
   $_[2] .= sprintf('\x{%02X}',ord($ch));
   return $_[1]+1;
 }

This scheme gives maximal control to the fixup routine but is more
complicated to code, and may require that the internals of Encode be tweaked to
keep the original string intact.

=item Other Schemes

Hybrids of the above.

Multiple return values rather than in-place modifications.

Index into the string could be C<pos($str)> allowing C<s/\G...//>.

=back

=end future

=cut