Integrate mainline

diff --git a/ext/Encode/lib/Encode/Encoding.pm b/ext/Encode/lib/Encode/Encoding.pm

index d2cb803..88594d1 100644
--- a/ext/Encode/lib/Encode/Encoding.pm
+++ b/ext/Encode/lib/Encode/Encoding.pm
@@ -1,7 +1,7 @@
 package Encode::Encoding;
 # Base class for classes which implement encodings
 use strict;
-our $VERSION = do { my @r = (q$Revision: 0.96 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r };
+our $VERSION = do { my @r = (q$Revision: 1.25 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r };
 
 sub Define
 {
@@ -9,7 +9,7 @@ sub Define
     my $canonical = shift;
     $obj = bless { Name => $canonical },$obj unless ref $obj;
     # warn "$canonical => $obj\n";
-  Encode::define_encoding($obj, $canonical, @_);
+    Encode::define_encoding($obj, $canonical, @_);
 }
 
 sub name { shift->{'Name'} }
@@ -123,79 +123,60 @@ C<Encode::Encoding>.
 
 =head2 Compiled Encodings
 
-F<Encode.xs> provides a class C<Encode::XS> which provides the
-interface described above. It calls a generic octet-sequence to
-octet-sequence "engine" that is driven by tables (defined in
-F<encengine.c>). The same engine is used for both encode and
-decode. C<Encode:XS>'s C<encode> forces Perl's characters to their
-UTF-8 form and then treats them as just another multibyte
-encoding. C<Encode:XS>'s C<decode> transforms the sequence and then
-turns the UTF-8-ness flag as that is the form that the tables are
-defined to produce. For details of the engine see the comments in
-F<encengine.c>.
-
-The tables are produced by the Perl script F<compile> (the name needs
-to change so we can eventually install it somewhere). F<compile> can
-currently read two formats:
+For the sake of speed and efficiency, most of the encodings are now
+supported via I<Compiled Form>: XS modules generated from UCM
+files.  Encode provides the enc2xs tool to achieve that.  Please see
+L<enc2xs> for more details.
 
-=over 4
-
-=item *.enc
-
-This is a coined format used by Tcl. It is documented in
-Encode/EncodeFormat.pod.
+=head1 SEE ALSO
 
-=item *.ucm
+L<perlmod>, L<enc2xs>
 
-This is the semi-standard format used by IBM's ICU package.
+=for future
 
-=back
-
-F<compile> can write the following forms:
 
 =over 4
 
-=item *.ucm
+=item Scheme 1
 
-See above - the F<Encode/*.ucm> files provided with the distribution have
-been created from the original Tcl .enc files using this approach.
+The routine is passed the remaining fragment of the string being processed.
+It modifies the fragment in place to remove the bytes/characters it can
+understand and returns a string used to represent them.
+For example:
 
-=item *.c
+ sub fixup {
+   my $ch = substr($_[0],0,1,'');
+   return sprintf('\x{%02X}',ord($ch));
+ }
 
-Produces tables as C data structures - this is used to build in encodings
-into F<Encode.so>/F<Encode.dll>.
+This scheme is close to how the underlying C code for Encode works, but gives
+the fixup routine very little context.
 
-=item *.xs
+=item Scheme 2
 
-In theory this allows encodings to be stand-alone loadable Perl
-extensions.  The process has not yet been tested. The plan is to use
-this approach for large East Asian encodings.
+The routine is passed the original string, an index into it marking the
+problem area, and the output string so far.  It appends what it will to the
+output string and returns the new index into the original string.  For example:
 
-=back
+ sub fixup {
+   # my ($s,$i,$d) = @_;
+   my $ch = substr($_[0],$_[1],1);
+   $_[2] .= sprintf('\x{%02X}',ord($ch));
+   return $_[1]+1;
+ }
 
-The set of encodings built-in to F<Encode.so>/F<Encode.dll> is
-determined by F<Makefile.PL>.  The current set is as follows:
+This scheme gives maximal control to the fixup routine but is more
+complicated to code, and may need the internals of Encode to be tweaked to
+keep the original string intact.
 
-=over 4
-
-=item ascii and iso-8859-*
-
-That is all the common 8-bit "western" encodings.
+=item Other Schemes
 
-=item IBM-1047 and two other variants of EBCDIC.
+Hybrids of the above.
 
-These are the same variants that are supported by EBCDIC Perl as
-"native" encodings.  They are included to prove "reversibility" of
-some constructs in EBCDIC Perl.
+Multiple return values rather than in-place modifications.
 
-=item symbol and dingbats as used by Tk on X11.
-
-(The reason Encode got started was to support Perl/Tk.)
+The index into the string could be C<pos($str)>, allowing C<s/\G...//>.
 
 =back
 
-That set is rather ad hoc and has been driven by the needs of the
-tests rather than the needs of typical applications. It is likely
-to be rationalized.
-
 =cut
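
The two fixup schemes sketched in the new `=for future` block can be tried outside Encode with a small stand-alone script. This is only an illustration under stated assumptions: the names `fixup_scheme1`, `fixup_scheme2`, `escape_with_scheme1` and `escape_with_scheme2` are hypothetical, and the driver loops (which treat every non-ASCII character as the "problem area") are stand-ins, not how Encode's internals actually call a fixup routine.

```perl
use strict;
use warnings;

# Scheme 1 (hypothetical name): $_[0] is the remaining fragment.
# Four-argument substr removes the first character from the caller's
# string in place; the return value is the text that represents it.
sub fixup_scheme1 {
    my $ch = substr($_[0], 0, 1, '');
    return sprintf('\x{%02X}', ord($ch));
}

# Scheme 2 (hypothetical name): $_[0] is the original string, $_[1] the
# index of the problem character, $_[2] the output built so far.
# Appends to the caller's output via the @_ alias and returns the new index.
sub fixup_scheme2 {
    my $ch = substr($_[0], $_[1], 1);
    $_[2] .= sprintf('\x{%02X}', ord($ch));
    return $_[1] + 1;
}

# Assumed driver for Scheme 1: copy runs of ASCII, hand anything else
# to the fixup routine, which shortens $rest itself.
sub escape_with_scheme1 {
    my ($rest) = @_;
    my $out = '';
    while (length $rest) {
        if ($rest =~ s/\A([\x00-\x7F]+)//) {
            $out .= $1;
        }
        else {
            $out .= fixup_scheme1($rest);
        }
    }
    return $out;
}

# Assumed driver for Scheme 2: walk the original string by index and
# leave it untouched; the fixup routine advances the index.
sub escape_with_scheme2 {
    my ($src) = @_;
    my ($i, $out) = (0, '');
    while ($i < length $src) {
        my $ch = substr($src, $i, 1);
        if (ord($ch) < 0x80) {
            $out .= $ch;
            $i++;
        }
        else {
            $i = fixup_scheme2($src, $i, $out);
        }
    }
    return $out;
}

my $input = "caf\x{E9} \x{263A}";
print escape_with_scheme1($input), "\n";    # caf\x{E9} \x{263A}
print escape_with_scheme2($input), "\n";    # caf\x{E9} \x{263A}
```

Both drivers produce the same text here. The difference is in what the fixup routine sees: Scheme 1 only ever gets the unprocessed tail, while Scheme 2 gets the whole original string plus the position, so it could inspect earlier context if it needed to.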