X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlguts.pod;h=5255519a8851790ecc5695fea5a782450ea3d3dd;hb=f3b76584ef7773843ba39a11b8bd91238af59f12;hp=5f1dd21a14091dfac3056b6e366068a237c49476;hpb=92d29cee5ff815b05b81b877528e4c77e73881c9;p=p5sagit%2Fp5-mst-13.2.git diff --git a/pod/perlguts.pod b/pod/perlguts.pod index 5f1dd21..5255519 100644 --- a/pod/perlguts.pod +++ b/pod/perlguts.pod @@ -832,6 +832,8 @@ The current kinds of Magic Virtual Tables are: a vtbl_amagicelem %OVERLOAD hash element c (none) Holds overload table (AMT) on stash B vtbl_bm Boyer-Moore (fast string search) + D vtbl_regdata Regex match position data (@+ and @- vars) + d vtbl_regdatum Regex match position data element E vtbl_env %ENV hash e vtbl_envelem %ENV hash element f vtbl_fm Formline ('compiled' format) @@ -1554,17 +1556,11 @@ First problem: deciding which functions will be public API functions and which will be private. All functions whose names begin C are private (think "S" for "secret" or "static"). All other functions begin with "Perl_", but just because a function begins with "Perl_" does not mean it is -part of the API. The easiest way to be B a function is part of the API -is to find its entry in L. If it exists in L, it's part -of the API. If it doesn't, and you think it should be (i.e., you need it for -your extension), send mail via L explaining why you think it -should be. - -(L itself is generated by embed.pl, a Perl script that generates -significant portions of the Perl source code. It has a list of almost -all the functions defined by the Perl interpreter along with their calling -characteristics and some flags. Functions that are part of the public API -are marked with an 'A' in its flags.) +part of the API. (See L.) The easiest way to be B a +function is part of the API is to find its entry in L. +If it exists in L, it's part of the API. 
If it doesn't, and you +think it should be (i.e., you need it for your extension), send mail via +L explaining why you think it should be. Second problem: there must be a syntax so that the same subroutine declarations and calls can pass a structure as their first argument, @@ -1779,6 +1775,320 @@ The Perl engine/interpreter and the host are orthogonal entities. There could be one or more interpreters in a process, and one or more "hosts", with free association between them. +=head1 Internal Functions + +All of Perl's internal functions which will be exposed to the outside +world are prefixed by C so that they will not conflict with XS +functions or functions used in a program in which Perl is embedded. +Similarly, all global variables begin with C. (By convention, +static functions start with C.) + +Inside the Perl core, you can get at the functions either with or +without the C prefix, thanks to a bunch of defines that live in +F. This header file is generated automatically from +F. F also creates the prototyping header files for +the internal functions, generates the documentation and a lot of other +bits and pieces. It's important that when you add a new function to the +core or change an existing one, you change the data in the table at the +end of F as well. Here's a sample entry from that table: + + Apd |SV** |av_fetch |AV* ar|I32 key|I32 lval + +The second column is the return type, the third column the name. Columns +after that are the arguments. The first column is a set of flags: + +=over 3 + +=item A + +This function is a part of the public API. + +=item p + +This function has a C prefix; i.e., it is defined as C + +=item d + +This function has documentation using the C feature which we'll +look at in a second. + +=back + +Other available flags are: + +=over 3 + +=item s + +This is a static function and is defined as C. + +=item n + +This does not use C and C to pass interpreter context. (See +L.) 
+ +=item r + +This function never returns; C, C and friends. + +=item f + +This function takes a variable number of arguments, C style. +The argument list should end with C<...>, like this: + + Afprd |void |croak |const char* pat|... + +=item m + +This function is part of the experimental development API, and may change +or disappear without notice. + +=item o + +This function should not have a compatibility macro to define, say, +C to C. It must be called as C. + +=item j + +This function is not a member of C. If you don't know +what this means, don't use it. + +=item x + +This function isn't exported out of the Perl core. + +=back + +If you edit F, you will need to run C to +force a rebuild of F and other auto-generated files. + +=head2 Source Documentation + +There's an effort going on to document the internal functions and +automatically produce reference manuals from them - L is one +such manual which details all the functions which are available to XS +writers. L is the autogenerated manual for the functions +which are not part of the API and are supposedly for internal use only. + +Source documentation is created by putting POD comments into the C +source, like this: + + /* + =for apidoc sv_setiv + + Copies an integer into the given SV. Does not handle 'set' magic. See + C. + + =cut + */ + +Please try and supply some documentation if you add functions to the +Perl core. + +=head1 Unicode Support + +Perl 5.6.0 introduced Unicode support. It's important for porters and XS +writers to understand this support and make sure that the code they +write does not corrupt Unicode data. + +=head2 What B Unicode, anyway? + +In the olden, less enlightened times, we all used to use ASCII. Most of +us did, anyway. The big problem with ASCII is that it's American. Well, +no, that's not actually the problem; the problem is that it's not +particularly useful for people who don't use the Roman alphabet. 
What +used to happen was that particular languages would stick their own +alphabet in the upper range of the sequence, between 128 and 255. Of +course, we then ended up with plenty of variants that weren't quite +ASCII, and the whole point of it being a standard was lost. + +Worse still, if you've got a language like Chinese or +Japanese that has hundreds or thousands of characters, then you really +can't fit them into a mere 256, so they had to forget about ASCII +altogether, and build their own systems using pairs of numbers to refer +to one character. + +To fix this, some people formed Unicode, Inc. and +produced a new character set containing all the characters you can +possibly think of and more. There are several ways of representing these +characters, and the one Perl uses is called UTF8. UTF8 uses +a variable number of bytes to represent a character, instead of just +one. You can learn more about Unicode at +L + +=head2 How can I recognise a UTF8 string? + +You can't. This is because UTF8 data is stored in bytes just like +non-UTF8 data. The Unicode character 200 (C<0xC8> for you hex types), +capital E with a grave accent, is represented by the two bytes +C. Unfortunately, the non-Unicode string C +has that byte sequence as well. So you can't tell just by looking - this +is what makes Unicode input an interesting problem. + +The API function C can help; it'll tell you if a string +contains only valid UTF8 characters. However, it can't do the work for +you. On a character-by-character basis, C will tell you +whether the current character in a string is valid UTF8. + +=head2 How does UTF8 represent Unicode characters? + +As mentioned above, UTF8 uses a variable number of bytes to store a +character. Characters with values 0...127 are stored in one byte, just +like good ol' ASCII. Character 129 is stored as C; this +continues up to character 191, which is C. Now we've run out of +bits (191 is binary C<10111111>) so we move on; 192 is C. 
And +so it goes on, moving to three bytes at character 2048. + +Assuming you know you're dealing with a UTF8 string, you can find out +how long the first character in it is with the C<UTF8SKIP> macro: + + char *utf = "\305\233\340\240\201"; + I32 len; + + len = UTF8SKIP(utf); /* len is 2 here */ + utf += len; + len = UTF8SKIP(utf); /* len is 3 here */ + +Another way to skip over characters in a UTF8 string is to use +C, which takes a string and a number of characters to skip +over. You're on your own about bounds checking, though, so don't use it +lightly. + +All bytes in a multi-byte UTF8 character will have the high bit set, so +you can test if you need to do something special with this character +like this: + + UV uv; + + if (*utf & 0x80) + /* Must treat this as UTF8 */ + uv = utf8_to_uv(utf); + else + /* OK to treat this character as a byte */ + uv = *utf; + +You can also see in that example that we use C<utf8_to_uv> to get the +value of the character; the inverse function C<uv_to_utf8> is available +for putting a UV into UTF8: + + if (uv >= 0x80) + /* Must treat this as UTF8 */ + utf8 = uv_to_utf8(utf8, uv); + else + /* OK to treat this character as a byte */ + *utf8++ = uv; + +You B convert characters to UVs using the above functions if +you're ever in a situation where you have to match UTF8 and non-UTF8 +characters. You may not skip over UTF8 characters in this case. If you +do this, you'll lose the ability to match hi-bit non-UTF8 characters; +for instance, if your UTF8 string contains C, and you skip +that character, you can never match a C in a non-UTF8 string. +So don't do that! + +=head2 How does Perl store UTF8 strings? + +Currently, Perl deals with Unicode strings and non-Unicode strings +slightly differently. If a string has been identified as being UTF-8 +encoded, Perl will set a flag in the SV, C. 
You can check and +manipulate this flag with the following macros: + + SvUTF8(sv) + SvUTF8_on(sv) + SvUTF8_off(sv) + +This flag has an important effect on Perl's treatment of the string: if +Unicode data is not properly distinguished, regular expressions, +C, C and other string handling operations will have +undesirable results. + +The problem comes when you have, for instance, a string that isn't +flagged as UTF8, and contains a byte sequence that could be UTF8 - +especially when combining non-UTF8 and UTF8 strings. + +Never forget that the C flag is separate from the PV value; you +need to be sure you don't accidentally knock it off while you're +manipulating SVs. More specifically, you cannot expect to do this: + + SV *sv; + SV *nsv; + STRLEN len; + char *p; + + p = SvPV(sv, len); + frobnicate(p); + nsv = newSVpvn(p, len); + +The C string does not tell you the whole story, and you can't +copy or reconstruct an SV just by copying the string value. Check if the +old SV has the UTF8 flag set, and act accordingly: + + p = SvPV(sv, len); + frobnicate(p); + nsv = newSVpvn(p, len); + if (SvUTF8(sv)) + SvUTF8_on(nsv); + +In fact, your C function should be made aware of whether or +not it's dealing with UTF8 data, so that it can handle the string +appropriately. + +=head2 How do I convert a string to UTF8? + +If you're mixing UTF8 and non-UTF8 strings, you might find it necessary +to upgrade one of the strings to UTF8. If you've got an SV, the easiest +way to do this is: + + sv_utf8_upgrade(sv); + +However, you must not do this, for example: + + if (!SvUTF8(left)) + sv_utf8_upgrade(left); + +If you do this in a binary operator, you will actually change one of the +strings that came into the operator, and, while it shouldn't be noticeable +by the end user, it can cause problems. + +Instead, C will give you a UTF8-encoded B of its +string argument. This is useful for having the data available for +comparisons and so on, without harming the original SV. 
There's also +C to go the other way, but naturally, this will fail if +the string contains any characters above 255 that can't be represented +in a single byte. + +=head2 Is there anything else I need to know? + +Not really. Just remember these things: + +=over 3 + +=item * + +There's no way to tell if a string is UTF8 or not. You can tell if an SV +is UTF8 by looking at its C flag. Don't forget to set the flag if +something should be UTF8. Treat the flag as part of the PV, even though +it's not - if you pass on the PV to somewhere, pass on the flag too. + +=item * + +If a string is UTF8, B use C to get at the value, +unless C in which case you can use C<*s>. + +=item * + +When writing to a UTF8 string, B use C, unless +C in which case you can use C<*s = uv>. + +=item * + +Mixing UTF8 and non-UTF8 strings is tricky. Use C to get +a new string which is UTF8 encoded. There are tricks you can use to +delay deciding whether you need to use a UTF8 string until you get to a +high character - C is one of those. + +=back + =head1 AUTHORS Until May 1997, this document was maintained by Jeff Okamoto