From: Nick Ing-Simmons
Date: Sun, 25 Feb 2001 19:36:28 +0000 (+0000)
Subject: Encode implementations docs.
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=4edaa979293b3e482a715be9ab66acf7eb848c46;p=p5sagit%2Fp5-mst-13.2.git

Encode implementations docs.

p4raw-id: //depot/perlio@8938
---

diff --git a/ext/Encode/Encode.pm b/ext/Encode/Encode.pm
index f17cc1a..2d49865 100644
--- a/ext/Encode/Encode.pm
+++ b/ext/Encode/Encode.pm
@@ -158,9 +158,12 @@ sub decode_utf8
 
 package Encode::Encoding;
 # Base class for classes which implement encodings
+
 # Temporary legacy methods
-sub toUnicode { shift->decode(@_) }
-sub fromUnicode { shift->encode(@_) }
+sub toUnicode { shift->decode(@_) }
+sub fromUnicode { shift->encode(@_) }
+
+sub new_sequence { return $_[0] }
 
 package Encode::XS;
 use base 'Encode::Encoding';
@@ -809,6 +812,149 @@ not a string.
 
 =back
 
+=head1 IMPLEMENTATION CLASSES
+
+As mentioned above, encodings are (in the current implementation at least)
+defined by objects. The mapping of encoding name to object is via the
+C<%Encode::encodings> hash. (It is a package hash to allow XS code to get
+at it.)
+
+The values of the hash can currently be either strings or objects.
+The string form may go away in the future. The string form occurs
+when C has scanned C<@INC> for loadable encodings but has
+not actually loaded the encoding in question. This is because the
+current "loading" process is all perl and a bit slow.
+
+Once an encoding is loaded, the value of the hash is the object which implements
+the encoding. The object should provide the following interface:
+
+=over 4
+
+=item -E<gt>name
+
+Should return the string representing the canonical name of the encoding.
+
+=item -E<gt>new_sequence
+
+This is a placeholder for encodings with state. It should return an object
+which implements this interface; all current implementations return the
+original object.
+
+=item -E<gt>encode($string,$check)
+
+Should return the octet sequence representing I<$string>.
+If I<$check> is true
+it should modify I<$string> in place to remove the converted part (i.e.
+the whole string unless there is an error).
+If an error occurs it should return the octet sequence for the
+fragment of string that has been converted, and modify I<$string> in place
+to remove the converted part, leaving it starting with the problem fragment.
+
+If I<$check> is false then C should make a "best effort" to convert
+the string - for example by using a replacement character.
+
+=item -E<gt>decode($octets,$check)
+
+Should return the string that I<$octets> represents. If I<$check> is true
+it should modify I<$octets> in place to remove the converted part (i.e.
+the whole sequence unless there is an error).
+If an error occurs it should return the fragment of string
+that has been converted, and modify I<$octets> in place to remove the converted part,
+leaving it starting with the problem fragment.
+
+If I<$check> is false then C should make a "best effort" to convert
+the string - for example by using Unicode's "\x{FFFD}" as a replacement character.
+
+=back
+
+It should be noted that the check behaviour is different from the outer
+public API. The logic is that the "unchecked" case is useful when
+encoding is part of a stream which may be reporting errors (e.g. STDERR).
+In such cases it is desirable to get everything through somehow without
+causing additional errors which obscure the original one. Also the encoding
+is best placed to know what the correct replacement character is, so if that
+is the desired behaviour then letting low-level code do it is the most efficient.
+
+In contrast, if check is true, the scheme above allows the encoding to do as
+much as it can and tell the layer above how much that was. What is lacking
+at present is a mechanism to report what went wrong. The most likely interface
+will be an additional method call to the object, or perhaps
+(to avoid forcing per-stream objects on otherwise stateless encodings)
+an additional parameter.
+
+It is also highly desirable that encoding classes inherit from C
+as a base class. This allows that class to define additional behaviour for
+all encoding objects.
+
+=head2 Compiled Encodings
+
+F provides a class C which provides the interface described
+above. It calls a generic octet-sequence to octet-sequence "engine" that is
+driven by tables (defined in F). The same engine is used for both
+encode and decode. C's C forces perl's characters to their UTF-8 form
+and then treats them as just another multibyte encoding. C's C transforms
+the sequence and then turns on the UTF-8-ness flag, as that is the form that the tables
+are defined to produce. For details of the engine see the comments in F.
+
+The tables are produced by the perl script F (the name needs to change so
+we can eventually install it somewhere). F can currently read two formats:
+
+=over 4
+
+=item *.enc
+
+This is a coined format used by Tcl. It is documented in Encode/EncodeFormat.pod.
+
+=item *.ucm
+
+This is the semi-standard format used by IBM's ICU package.
+
+=back
+
+F can write the following forms:
+
+=over 4
+
+=item *.ucm
+
+See above - the F files provided with the distribution have
+been created from the original Tcl .enc files using this approach.
+
+=item *.c
+
+Produces tables as C data structures - this is used to build encodings
+into F/F.
+
+=item *.xs
+
+In theory this allows encodings to be stand-alone loadable perl extensions.
+The process has not yet been tested. The plan is to use this approach
+for large East Asian encodings.
+
+=back
+
+The set of encodings built in to F/F is determined by
+F. The current set is as follows:
+
+=over 4
+
+=item ascii and iso-8859-*
+
+That is all the common 8-bit "western" encodings.
+
+=item IBM-1047 and two other variants of EBCDIC.
+
+These are the same variants that are supported by EBCDIC perl as "native" encodings.
+They are included to prove "reversibility" of some constructs in EBCDIC perl.
+
+=item symbol and dingbats as used by Tk on X11.
+
+(The reason Encode got started was to support perl/Tk.)
+
+=back
+
+That set is rather ad hoc and has been driven by the needs of the tests rather
+than the needs of typical applications. It is likely to be rationalized.
+
 =head1 SEE ALSO
 
 L, L, L
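
The interface described in the IMPLEMENTATION CLASSES section of this patch can be illustrated with a small stand-alone sketch. Everything named here is hypothetical and not part of Encode: the package My::Encoding::Ascii7, the name 'my-ascii7', and the use of '?' as the encode-side replacement character are assumptions made for illustration. The class deliberately does not inherit from any Encode base class, so that the example runs on its own; it only mimics the four methods and the checked/unchecked behaviour described in the patch.

```perl
use strict;
use warnings;

# Hypothetical toy encoding (7-bit ASCII only); not part of Encode.
package My::Encoding::Ascii7;

sub new          { my ($class) = @_; return bless { name => 'my-ascii7' }, $class }
sub name         { return $_[0]{name} }
sub new_sequence { return $_[0] }    # stateless, so hand back the same object

# Shared helper for the checked case: return the longest leading run of
# 7-bit characters and remove it from the caller's variable (aliased via @_).
sub _take_valid_prefix {
    my ($done) = $_[0] =~ /\A([\x00-\x7F]*)/;
    substr($_[0], 0, length $done, '');
    return $done;
}

sub encode {
    my ($self, $string, $check) = @_;
    return _take_valid_prefix($_[1]) if $check;
    # Unchecked: best effort, substituting '?' (an assumed replacement).
    (my $octets = $string) =~ s/[^\x00-\x7F]/?/g;
    return $octets;
}

sub decode {
    my ($self, $octets, $check) = @_;
    return _take_valid_prefix($_[1]) if $check;
    # Unchecked: best effort, substituting U+FFFD as the patch suggests.
    (my $string = $octets) =~ s/[^\x00-\x7F]/\x{FFFD}/g;
    return $string;
}

package main;

my $enc  = My::Encoding::Ascii7->new;
my $text = "abc\x{263A}def";

# Checked: converts "abc" and leaves $text starting at the problem character.
my $converted = $enc->encode($text, 1);

# Unchecked: the caller's data is untouched and a full result comes back.
my $lossy = $enc->encode("abc\x{263A}def", 0);
```

In the checked case the caller can tell how much was converted by comparing what remains in its own variable, which is the "tell the layer above how much that was" behaviour the patch describes; nothing in this sketch reports why conversion stopped, matching the gap the patch itself notes.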