package Encode::Encoding;
# Base class for classes which implement encodings
use strict;
-our $VERSION = do { my @r = (q$Revision: 0.96 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r };
+our $VERSION = do { my @r = (q$Revision: 1.25 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r };
sub Define
{
my $obj = shift;
my $canonical = shift;
$obj = bless { Name => $canonical },$obj unless ref $obj;
# warn "$canonical => $obj\n";
- Encode::define_encoding($obj, $canonical, @_);
+ Encode::define_encoding($obj, $canonical, @_);
}
sub name { shift->{'Name'} }
=head2 Compiled Encodings
-F<Encode.xs> provides a class C<Encode::XS> which provides the
-interface described above. It calls a generic octet-sequence to
-octet-sequence "engine" that is driven by tables (defined in
-F<encengine.c>). The same engine is used for both encode and
-decode. C<Encode:XS>'s C<encode> forces Perl's characters to their
-UTF-8 form and then treats them as just another multibyte
-encoding. C<Encode:XS>'s C<decode> transforms the sequence and then
-turns the UTF-8-ness flag as that is the form that the tables are
-defined to produce. For details of the engine see the comments in
-F<encengine.c>.
-
-The tables are produced by the Perl script F<compile> (the name needs
-to change so we can eventually install it somewhere). F<compile> can
-currently read two formats:
+For the sake of speed and efficiency, most of the encodings are now
+supported via a I<Compiled Form>, that is, XS modules generated from UCM
+files. Encode provides the enc2xs tool to generate them. See
+L<enc2xs> for more details.
-=over 4
-
-=item *.enc
-
-This is a coined format used by Tcl. It is documented in
-Encode/EncodeFormat.pod.
+=head1 SEE ALSO
-=item *.ucm
+L<perlmod>, L<enc2xs>
-This is the semi-standard format used by IBM's ICU package.
+=for future
-=back
-
-F<compile> can write the following forms:
=over 4
-=item *.ucm
+=item Scheme 1
-See above - the F<Encode/*.ucm> files provided with the distribution have
-been created from the original Tcl .enc files using this approach.
+The fixup routine is passed the remaining fragment of the string being
+processed. It modifies that fragment in place, removing the bytes/characters
+it can understand, and returns a string used to represent them.
+For example:
-=item *.c
+ sub fixup {
+ my $ch = substr($_[0],0,1,'');
+ return sprintf('\x{%02X}',ord($ch));
+ }
-Produces tables as C data structures - this is used to build in encodings
-into F<Encode.so>/F<Encode.dll>.
+This scheme is close to how the underlying C code for Encode works, but
+gives the fixup routine very little context.
-=item *.xs
+=item Scheme 2
-In theory this allows encodings to be stand-alone loadable Perl
-extensions. The process has not yet been tested. The plan is to use
-this approach for large East Asian encodings.
+The fixup routine is passed the original string, an index into it marking
+the problem area, and the output string so far. It appends what it will to
+the output string and returns a new index into the original string. For example:
-=back
+ sub fixup {
+ # my ($s,$i,$d) = @_;
+ my $ch = substr($_[0],$_[1],1);
+ $_[2] .= sprintf('\x{%02X}',ord($ch));
+ return $_[1]+1;
+ }
-The set of encodings built-in to F<Encode.so>/F<Encode.dll> is
-determined by F<Makefile.PL>. The current set is as follows:
+This scheme gives maximal control to the fixup routine but is more
+complicated to code, and may require the internals of Encode to be
+tweaked to keep the original string intact.
-=over 4
-
-=item ascii and iso-8859-*
-
-That is all the common 8-bit "western" encodings.
+=item Other Schemes
-=item IBM-1047 and two other variants of EBCDIC.
+Hybrids of the above.
-These are the same variants that are supported by EBCDIC Perl as
-"native" encodings. They are included to prove "reversibility" of
-some constructs in EBCDIC Perl.
+Multiple return values rather than in-place modifications.
-=item symbol and dingbats as used by Tk on X11.
-
-(The reason Encode got started was to support Perl/Tk.)
+Index into the string could be C<pos($str)> allowing C<s/\G...//>.
=back
-That set is rather ad hoc and has been driven by the needs of the
-tests rather than the needs of typical applications. It is likely
-to be rationalized.
-
=cut