package Encode::Encoding;
# Base class for classes which implement encodings
use strict;
our $VERSION = do { my @r = (q$Revision: 2.0 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r };

require Encode;

sub Define
{
    my $obj = shift;
    my $canonical = shift;
    $obj = bless { Name => $canonical },$obj unless ref $obj;
    # warn "$canonical => $obj\n";
    Encode::define_encoding($obj, $canonical, @_);
}

sub name  { return shift->{'Name'} }

sub renew { return $_[0] }
*new_sequence = \&renew;

sub needs_lines { 0 };

sub perlio_ok { 
    eval{ require PerlIO::encoding };
    return $@ ? 0 : 1;
}

# (Temporary|legacy) methods

sub toUnicode    { shift->decode(@_) }
sub fromUnicode  { shift->encode(@_) }

#
# Needs to be overloaded or just croak
#

sub encode {
    require Carp;
    my $obj = shift;
    my $class = ref($obj) ? ref($obj) : $obj;
    Carp::croak $class, "->encode() not defined!";
}

sub decode{
    require Carp;
    my $obj = shift;
    my $class = ref($obj) ? ref($obj) : $obj;
    Carp::croak $class, "->decode() not defined!";
}

sub DESTROY {}

1;
__END__

=head1 NAME

Encode::Encoding - Encode Implementation Base Class

=head1 SYNOPSIS

  package Encode::MyEncoding;
  use base qw(Encode::Encoding);

  __PACKAGE__->Define(qw(myCanonical myAlias));

=head1 DESCRIPTION

As mentioned in L<Encode>, encodings are (in the current
implementation at least) defined as objects. The mapping of encoding
name to object is via the C<%Encode::Encoding> hash.  Though you can
directly manipulate this hash, it is strongly encouraged to use this
base class module and add encode() and decode() methods.
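
The registration step can be sketched as follows.  This is a minimal
illustration, not part of the module: C<Encode::MyEncoding>,
C<myCanonical> and C<myAlias> are placeholder names taken from the
SYNOPSIS, and the lookup uses C<Encode::find_encoding>.

```perl
use strict;
use warnings;
use Encode ();

{
    # Placeholder encoding class, defined only for this illustration.
    package Encode::MyEncoding;
    use base qw(Encode::Encoding);

    # Register the object under a canonical name plus one alias.
    __PACKAGE__->Define(qw(myCanonical myAlias));
}

# Both names resolve to the registered object.
my $by_name  = Encode::find_encoding('myCanonical');
my $by_alias = Encode::find_encoding('myAlias');

print ref($by_name), "\n";      # class of the registered object
print $by_alias->name, "\n";    # canonical name, even when looked up by alias
```

Looking an encoding up by alias still reports the canonical name,
because C<name> reads the C<Name> slot that C<Define> stored in the
object.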

=head2 Methods you should implement

You are strongly encouraged to implement the methods below, or at
least one of encode() and decode().

=over 4

=item -E<gt>encode($string [,$check])

MUST return the octet sequence representing I<$string>. 

=over 2

=item *

If I<$check> is true, it SHOULD modify I<$string> in place to remove
the converted part (i.e.  the whole string unless there is an error).
If perlio_ok() is true, SHOULD becomes MUST.

=item *

If an error occurs, it SHOULD return the octet sequence for the
fragment of string that has been converted and modify $string in-place
to remove the converted part leaving it starting with the problem
fragment.  If perlio_ok() is true, SHOULD becomes MUST.

=item *

If I<$check> is false then C<encode> MUST make a "best effort" to
convert the string - for example, by using a replacement character.

=back

=item -E<gt>decode($octets [,$check])

MUST return the string that I<$octets> represents. 

=over 2

=item *

If I<$check> is true, it SHOULD modify I<$octets> in place to remove
the converted part (i.e.  the whole sequence unless there is an
error).  If perlio_ok() is true, SHOULD becomes MUST.

=item *

If an error occurs, it SHOULD return the fragment of string that has
been converted and modify $octets in-place to remove the converted
part leaving it starting with the problem fragment.  If perlio_ok() is
true, SHOULD becomes MUST.

=item *

If I<$check> is false then C<decode> should make a "best effort" to
convert the string - for example by using Unicode's "\x{FFFD}" as a
replacement character.

=back

=back
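
The I<$check> rules above can be illustrated with a small sketch.  It
assumes a hypothetical ASCII-only encoding named C<my-ascii> (class
C<Encode::MyAscii>), treats I<$check> as a plain boolean, and uses C<?>
as an arbitrary replacement character; none of these names come from
Encode itself.

```perl
use strict;
use warnings;
use Encode ();

{
    # Hypothetical encoding, defined only to illustrate the $check contract.
    package Encode::MyAscii;
    use base qw(Encode::Encoding);

    __PACKAGE__->Define('my-ascii');

    sub encode {
        my ($obj, $str, $chk) = @_;
        $str = '' unless defined $str;

        if ($chk) {
            # Convert only the leading run of ASCII characters.
            my ($ok) = $str =~ /\A([\x00-\x7F]*)/;

            # Edit the caller's variable in place: remove the converted
            # part so it starts at the first problem character, if any.
            $_[1] = substr($str, length $ok);

            utf8::downgrade($ok);    # return octets, not a character string
            return $ok;
        }

        # Unchecked: best effort, replacing anything that cannot be mapped.
        $str =~ s/[^\x00-\x7F]/?/g;
        utf8::downgrade($str);
        return $str;
    }
}

my $enc = Encode::find_encoding('my-ascii');

my $text    = "abc\x{263A}def";
my $lenient = $enc->encode($text);          # $text is left untouched

my $source  = "abc\x{263A}def";
my $partial = $enc->encode($source, 1);     # $source now starts at U+263A

print "$lenient\n";                         # abc?def
print "$partial\n";                         # abc
print length($source), "\n";                # 4
```

The in-place edit works because C<@_> aliases the caller's arguments, so
assigning to C<$_[1]> changes the caller's variable.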

If you want your encoding to work with the L<encoding> pragma, you
should also implement the method below.

=over 4

=item -E<gt>cat_decode($destination, $octets, $offset, $terminator [,$check])

MUST decode I<$octets> starting at I<$offset> and concatenate the result
to I<$destination>.  Decoding terminates when I<$terminator> (a string)
appears in the output.  I<$offset> is updated to the position in
I<$octets> where decoding stopped.  Returns true if I<$terminator>
appeared in the output, false otherwise.

=back
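
One plausible reading of the cat_decode() contract is sketched below.
It assumes a hypothetical single-byte encoding, C<my-bytes> (class
C<Encode::MyBytes>), in which every octet maps to the code point of the
same value, so the terminator can be searched for in the input directly.
It also assumes that the terminator is included in the appended output
and that I<$offset> ends up just past it.  Multi-byte or stateful
encodings need more care.

```perl
use strict;
use warnings;
use Encode ();

{
    # Hypothetical one-to-one single-byte encoding, for illustration only.
    package Encode::MyBytes;
    use base qw(Encode::Encoding);

    __PACKAGE__->Define('my-bytes');

    sub decode {
        my ($obj, $octets, $chk) = @_;
        $_[1] = '' if $chk;     # everything was consumed
        return $octets;         # identity mapping: octet N becomes code point N
    }

    sub cat_decode {
        my ($obj, undef, $octets, $offset, $terminator, $chk) = @_;

        # Find the terminator at or after $offset; stop just past it,
        # or at the end of the input when it does not occur.
        my $found = index($octets, $terminator, $offset);
        my $stop  = $found >= 0
                  ? $found + length($terminator)
                  : length($octets);

        # Append to the caller's destination and update the caller's
        # offset; both are reached through the aliases in @_.
        $_[1] .= $obj->decode(substr($octets, $offset, $stop - $offset));
        $_[3]  = $stop;

        return $found >= 0 ? 1 : 0;
    }
}

my $enc = Encode::find_encoding('my-bytes');
my ($dst, $src, $off) = ('', "line one\nline two\n", 0);

my $hit = $enc->cat_decode($dst, $src, $off, "\n");
print "hit=$hit off=$off\n";    # hit=1 off=9
```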

=head2 Other methods defined in Encode::Encoding

You need not override the methods shown below unless your encoding
requires it.

=over 4

=item -E<gt>name

Predefined As:

  sub name  { return shift->{'Name'} }

MUST return the string representing the canonical name of the encoding.

=item -E<gt>renew

Predefined As:

  sub renew { return $_[0] }

This method reconstructs the encoding object if necessary.  If you need
to store the state during encoding, this is where you clone your object.
Here is an example:

  sub renew { 
      my $self = shift;
      my $clone = bless { %$self } => ref($self);
      $clone->{clone} = 1; # so the caller can see it
      return $clone;
  }

Since most encodings are stateless, the default behavior is simply to
return the object itself, as shown above.

PerlIO ALWAYS calls this method to make sure it has its own private
encoding object.

=item -E<gt>perlio_ok()

Predefined As:

  sub perlio_ok { 
      eval{ require PerlIO::encoding };
      return $@ ? 0 : 1;
  }

If your encoding does not support PerlIO for some reason, just define:

 sub perlio_ok { 0 }

=item -E<gt>needs_lines()

Predefined As:

  sub needs_lines { 0 };

If your encoding can work with PerlIO but needs line buffering, you
MUST define this method so it returns true.  7bit ISO-2022 encodings
are one example that needs this.  When this method is missing, false
is assumed.

=back
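
As a rough sketch of how these overrides fit together, the following
defines a hypothetical stateful encoding, C<my-counter> (class
C<Encode::MyCounter>), whose renew() hands out a private copy with its
own call counter.  The conversion itself is a stand-in that uses
C<utf8::encode>; the class and its C<calls> slot are assumptions made
for the example.

```perl
use strict;
use warnings;
use Encode ();

{
    # Hypothetical stateful encoding, for illustration only.
    package Encode::MyCounter;
    use base qw(Encode::Encoding);

    __PACKAGE__->Define('my-counter');

    # Give each caller a private copy with its own counter.
    sub renew {
        my $self  = shift;
        my $clone = bless { %$self } => ref($self);
        $clone->{calls} = 0;
        return $clone;
    }

    # Hypothetical choice: this encoding opts out of PerlIO.
    sub perlio_ok { 0 }

    sub encode {
        my ($obj, $str, $chk) = @_;
        $obj->{calls}++;        # per-object state
        utf8::encode($str);     # stand-in conversion: characters to UTF-8 octets
        $_[1] = '' if $chk;
        return $str;
    }
}

my $shared  = Encode::find_encoding('my-counter');
my $private = $shared->renew;

$private->encode('abc') for 1 .. 3;

print $private->{calls}, "\n";          # 3
print $shared->{calls} // 0, "\n";      # 0: the registered object was not touched
print $private->name, "\n";             # my-counter
```

needs_lines() is not overridden here, so the inherited default (false)
applies.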

=head2 Example: Encode::ROT13

  package Encode::ROT13;
  use strict;
  use base qw(Encode::Encoding);

  __PACKAGE__->Define('rot13');

  sub encode($$;$){
      my ($obj, $str, $chk) = @_;
      $str =~ tr/A-Za-z/N-ZA-Mn-za-m/;
      $_[1] = '' if $chk; # this is what in-place edit means
      return $str;
  }

  # Jr pna or ynml yvxr guvf;
  *decode = \&encode;

  1;
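
Once the class is defined, the encoding is reachable through the
ordinary Encode entry points.  The sketch below repeats the class inline
so that it runs as a single script; apart from dropping the prototype
and silencing a "used only once" warning, it is the example above.

```perl
use strict;
use warnings;
use Encode ();

{
    package Encode::ROT13;
    use base qw(Encode::Encoding);
    no warnings 'once';

    __PACKAGE__->Define('rot13');

    sub encode {
        my ($obj, $str, $chk) = @_;
        $str =~ tr/A-Za-z/N-ZA-Mn-za-m/;
        $_[1] = '' if $chk;    # in-place edit: everything was consumed
        return $str;
    }

    # ROT13 is its own inverse, so decode can share the implementation.
    *decode = \&encode;
}

my $scrambled = Encode::encode('rot13', 'Hello, world');
my $restored  = Encode::decode('rot13', $scrambled);

print "$scrambled\n";    # Uryyb, jbeyq
print "$restored\n";     # Hello, world
```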

=head1 Why the heck is the Encode API different?

It should be noted that the I<$check> behaviour is different from the
outer public API. The logic is that the "unchecked" case is useful
when the encoding is part of a stream which may be reporting errors
(e.g. STDERR).  In such cases, it is desirable to get everything
through somehow without causing additional errors which obscure the
original one. Also, the encoding is best placed to know what the
correct replacement character is, so if that is the desired behaviour
then letting low level code do it is the most efficient.

By contrast, if I<$check> is true, the scheme above allows the
encoding to do as much as it can and tell the layer above how much
that was. What is lacking at present is a mechanism to report what
went wrong. The most likely interface will be an additional method
call to the object, or perhaps (to avoid forcing per-stream objects
on otherwise stateless encodings) an additional parameter.

It is also highly desirable that encoding classes inherit from
C<Encode::Encoding> as a base class. This allows that class to define
additional behaviour for all encoding objects.  A subclass typically
registers itself like this:

  package Encode::MyEncoding;
  use base qw(Encode::Encoding);

  __PACKAGE__->Define(qw(myCanonical myAlias));

C<Define> creates an object with C<< bless {Name => ...}, $class >> and
passes it to C<Encode::define_encoding>.  Such classes inherit their
C<name> method from C<Encode::Encoding>.

=head2 Compiled Encodings

For the sake of speed and efficiency, most of the encodings are now
supported via a I<compiled form>: XS modules generated from UCM
files.   Encode provides the enc2xs tool to achieve that.  Please see
L<enc2xs> for more details.

=head1 SEE ALSO

L<perlmod>, L<enc2xs>

=begin future

=over 4

=item Scheme 1

The fixup routine gets passed the remaining fragment of string being
processed.  It modifies it in place to remove bytes/characters it can
understand and returns a string used to represent them.  For example:

 sub fixup {
   my $ch = substr($_[0],0,1,'');
   return sprintf('\x{%02X}', ord($ch));
 }

This scheme is close to how the underlying C code for Encode works,
but gives the fixup routine very little context.

=item Scheme 2

The fixup routine gets passed the original string, an index into
it of the problem area, and the output string so far.  It appends
what it wants to the output string and returns a new index into the
original string.  For example:

 sub fixup {
   # my ($s,$i,$d) = @_;
   my $ch = substr($_[0],$_[1],1);
   $_[2] .= sprintf('\x{%02X}', ord($ch));
   return $_[1]+1;
 }

This scheme gives maximal control to the fixup routine but is more
complicated to code, and may require that the internals of Encode be tweaked to
keep the original string intact.

=item Other Schemes

Hybrids of the above.

Multiple return values rather than in-place modifications.

Index into the string could be C<pos($str)> allowing C<s/\G...//>.

=back

=end future

=cut