Converts a string C<s> of length C<len> from UTF8 into byte encoding.
Unlike <utf8_to_bytes> but like C<bytes_to_utf8>, returns a pointer to
-the newly-created string, and updates C<len> to contain the new length.
-Returns the original string if no conversion occurs, C<len> and
-C<is_utf8> are unchanged. Do nothing if C<is_utf8> points to 0. Sets
-C<is_utf8> to 0 if C<s> is converted or malformed .
+the newly-created string, and updates C<len> to contain the new
+length. Returns the original string if no conversion occurs, in which
+case C<len> is unchanged. Does nothing if C<is_utf8> points to 0. Sets
+C<is_utf8> to 0 if C<s> is converted or consists entirely of 7-bit
+characters.
NOTE: this function is experimental and may change or be
removed without notice.
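
The conversion contract described above can be sketched in standalone C. This is an illustration only, not Perl's implementation: the name `sketch_downgrade_utf8` is hypothetical, the buffer is allocated with plain `malloc`, and the sketch assumes that any character above 255 or any malformed sequence means "no conversion".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not part of the Perl API.
 * Converts UTF-8 in s[0..*len) to one byte per character when every
 * character is below 256. On success returns a newly malloc'ed,
 * NUL-terminated buffer, updates *len and clears *is_utf8.
 * Otherwise returns s itself and leaves *len and *is_utf8 alone. */
static uint8_t *sketch_downgrade_utf8(uint8_t *s, size_t *len, int *is_utf8)
{
    assert(s != NULL && len != NULL && is_utf8 != NULL);
    if (!*is_utf8)
        return s;                       /* nothing to do */

    /* Pass 1: check that every character fits in a byte, count them. */
    size_t out = 0;
    for (size_t i = 0; i < *len; out++) {
        if (s[i] < 0x80) {
            i += 1;                     /* 7-bit character */
        } else if ((s[i] == 0xC2 || s[i] == 0xC3) && i + 1 < *len
                   && (s[i + 1] & 0xC0) == 0x80) {
            i += 2;                     /* U+0080..U+00FF */
        } else {
            return s;                   /* wide or malformed: no conversion */
        }
    }

    uint8_t *d = (uint8_t *)malloc(out + 1);
    if (d == NULL)
        return s;

    /* Pass 2: copy, folding each two-byte sequence into one byte. */
    size_t j = 0;
    for (size_t i = 0; i < *len; ) {
        if (s[i] < 0x80) {
            d[j++] = s[i++];
        } else {
            d[j++] = (uint8_t)(((s[i] & 0x03) << 6) | (s[i + 1] & 0x3F));
            i += 2;
        }
    }
    d[j] = '\0';
    *len = j;
    *is_utf8 = 0;
    return d;
}
```

A caller can tell whether conversion happened by comparing the returned pointer with the one passed in; only a differing pointer needs to be freed.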
=item is_utf8_char
-Tests if some arbitrary number of bytes begins in a valid UTF-8 character.
-The actual number of bytes in the UTF-8 character will be returned if it
-is valid, otherwise 0.
-
+Tests if some arbitrary number of bytes begins with a valid UTF-8
+character. Note that an INVARIANT (i.e. ASCII) character is a valid
+UTF-8 character. The actual number of bytes in the UTF-8 character
+will be returned if it is valid, otherwise 0.
+
STRLEN is_utf8_char(U8 *p)
=for hackers
=item is_utf8_string
-Returns true if first C<len> bytes of the given string form valid a UTF8
-string, false otherwise.
+Returns true if the first C<len> bytes of the given string form a valid UTF8
+string, false otherwise. Note that 'a valid UTF8 string' does not mean
+'a string that contains UTF8' because a valid ASCII string is a valid
+UTF8 string.
bool is_utf8_string(U8 *s, STRLEN len)
=for hackers
Found in file scope.h
+=item load_module
+
+Loads the module whose name is pointed to by the string part of C<name>.
+Note that the actual module name, not its filename, should be given,
+e.g. "Foo::Bar" instead of "Foo/Bar.pm". C<flags> can be any of
+PERL_LOADMOD_DENY, PERL_LOADMOD_NOIMPORT, or PERL_LOADMOD_IMPORT_OPS
+(or 0 for no flags). C<ver>, if specified, provides version semantics
+similar to C<use Foo::Bar VERSION>. The optional trailing SV*
+arguments can be used to specify arguments to the module's import()
+method, similar to C<use Foo::Bar VERSION LIST>.
+
+ void load_module(U32 flags, SV* name, SV* ver, ...)
+
+=for hackers
+Found in file op.c
+
=item looks_like_number
Test if an the content of an SV looks like a number (or is a
=item POPp
-Pops a string off the stack.
+Pops a string off the stack. Deprecated. New code should provide
+a STRLEN n_a and use POPpx.
char* POPp
=for hackers
Found in file pp.h
+=item POPpbytex
+
+Pops a string off the stack which must consist of bytes, i.e. characters < 256.
+Requires a variable STRLEN n_a in scope.
+
+ char* POPpbytex
+
+=for hackers
+Found in file pp.h
+
+=item POPpx
+
+Pops a string off the stack.
+Requires a variable STRLEN n_a in scope.
+
+ char* POPpx
+
+=for hackers
+Found in file pp.h
+
=item POPs
Pops an SV off the stack.
=item require_pv
-Tells Perl to C<require> a module.
+Tells Perl to C<require> the file named by the string argument. It is
+analogous to the Perl code C<eval "require '$file'">. It's even
+implemented that way; consider using Perl_load_module instead.
NOTE: the perl_ form of this function is deprecated.
NUL character). Calls C<sv_grow> to perform the expansion if necessary.
Returns a pointer to the character buffer.
- void SvGROW(SV* sv, STRLEN len)
+ char * SvGROW(SV* sv, STRLEN len)
=for hackers
Found in file sv.h
=item SvPOK_only
Tells an SV that it is a string and disables all other OK bits.
+Will also turn off the UTF8 status.
void SvPOK_only(SV* sv)
=item SvPOK_only_UTF8
-Tells an SV that it is a UTF8 string (do not use frivolously)
-and disables all other OK bits.
+Tells an SV that it is a string and disables all other OK bits,
+and leaves the UTF8 status as it was.
void SvPOK_only_UTF8(SV* sv)
=item SvUTF8_on
-Tells an SV that it is a string and encoded in UTF8. Do not use frivolously.
+Turn on the UTF8 status of an SV (the data is not changed, just the flag).
+Do not use frivolously.
void SvUTF8_on(SV *sv)
=item sv_catpv
Concatenates the string onto the end of the string which is in the SV.
-Handles 'get' magic, but not 'set' magic. See C<sv_catpv_mg>.
+If the SV has the UTF8 status set, then the bytes appended should be
+valid UTF8. Handles 'get' magic, but not 'set' magic. See C<sv_catpv_mg>.
void sv_catpv(SV* sv, const char* ptr)
=item sv_catpvf
-Processes its arguments like C<sprintf> and appends the formatted output
-to an SV. Handles 'get' magic, but not 'set' magic. C<SvSETMAGIC()> must
-typically be called after calling this function to handle 'set' magic.
+Processes its arguments like C<sprintf> and appends the formatted
+output to an SV. If the appended data contains "wide" characters
+(including, but not limited to, SVs with a UTF-8 PV formatted with %s,
+and characters >255 formatted with %c), the original SV might get
+upgraded to UTF-8. Handles 'get' magic, but not 'set' magic.
+C<SvSETMAGIC()> must typically be called after calling this function
+to handle 'set' magic.
void sv_catpvf(SV* sv, const char* pat, ...)
=item sv_catpvn
Concatenates the string onto the end of the string which is in the SV. The
-C<len> indicates number of bytes to copy. Handles 'get' magic, but not
-'set' magic. See C<sv_catpvn_mg>.
+C<len> indicates number of bytes to copy. If the SV has the UTF8
+status set, then the bytes appended should be valid UTF8.
+Handles 'get' magic, but not 'set' magic. See C<sv_catpvn_mg>.
void sv_catpvn(SV* sv, const char* ptr, STRLEN len)
=for hackers
Found in file sv.c
+=item sv_setref_uv
+
+Copies an unsigned integer into a new SV, optionally blessing the SV. The C<rv>
+argument will be upgraded to an RV. That RV will be modified to point to
+the new SV. The C<classname> argument indicates the package for the
+blessing. Set C<classname> to C<Nullch> to avoid the blessing. The new SV
+will be returned and will have a reference count of 1.
+
+ SV* sv_setref_uv(SV* rv, const char* classname, UV uv)
+
+=for hackers
+Found in file sv.c
+
=item sv_setsv
Copies the contents of the source SV C<ssv> into the destination SV C<dsv>.
=for hackers
Found in file sv.c
+=item sv_utf8_decode
+
+Convert the octets in the PV from UTF-8 to chars. Scan for validity and then
+turn on SvUTF8 if needed so that we see characters. Used as a building block
+for decode_utf8 in Encode.xs.
+
+NOTE: this function is experimental and may change or be
+removed without notice.
+
+ bool sv_utf8_decode(SV *sv)
+
+=for hackers
+Found in file sv.c
+
=item sv_utf8_downgrade
Attempt to convert the PV of an SV from UTF8-encoded to byte encoding.
=item sv_utf8_encode
Convert the PV of an SV to UTF8-encoded, but then turn off the C<SvUTF8>
-flag so that it looks like bytes again. Nothing calls this.
-
-NOTE: this function is experimental and may change or be
-removed without notice.
+flag so that it looks like octets again. Used as a building block
+for encode_utf8 in Encode.xs.
void sv_utf8_encode(SV *sv)
=item sv_utf8_upgrade
Convert the PV of an SV to its UTF8-encoded form.
+Forces the SV to string form if it is not already.
+Always sets the SvUTF8 flag to avoid future validity checks even
+if all the bytes have the high bit clear.
- void sv_utf8_upgrade(SV *sv)
+ STRLEN sv_utf8_upgrade(SV *sv)
=for hackers
Found in file sv.c
=for hackers
Found in file handy.h
+=item utf8n_to_uvchr
+
+Returns the native character value of the first character in the string C<s>
+which is assumed to be in UTF8 encoding; C<retlen> will be set to the
+length, in bytes, of that character.
+
+Allows length and flags to be passed to the low-level routine.
+
+ UV utf8n_to_uvchr(U8 *s, STRLEN curlen, STRLEN* retlen, U32 flags)
+
+=for hackers
+Found in file utf8.c
+
+=item utf8n_to_uvuni
+
+Bottom-level UTF-8 decode routine.
+Returns the Unicode code point value of the first character in the string C<s>
+which is assumed to be in UTF8 encoding and no longer than C<curlen>;
+C<retlen> will be set to the length, in bytes, of that character.
+
+If C<s> does not point to a well-formed UTF8 character, the behaviour
+is dependent on the value of C<flags>: if it contains UTF8_CHECK_ONLY,
+it is assumed that the caller will raise a warning, and this function
+will silently just set C<retlen> to C<-1> and return zero. If the
+C<flags> does not contain UTF8_CHECK_ONLY, warnings about
+malformations will be given, C<retlen> will be set to the expected
+length of the UTF-8 character in bytes, and zero will be returned.
+
+The C<flags> can also contain various flags to allow deviations from
+the strict UTF-8 encoding (see F<utf8.h>).
+
+Most code should use utf8_to_uvchr() rather than call this directly.
+
+ UV utf8n_to_uvuni(U8 *s, STRLEN curlen, STRLEN* retlen, U32 flags)
+
+=for hackers
+Found in file utf8.c
+
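
The decode contract above (return the code point, report the byte length through C<retlen>, return zero and set C<retlen> to -1 on malformed input) can be illustrated with a standalone C sketch. This is not the code in F<utf8.c>: the name `sketch_utf8n_to_uv` is hypothetical, there is no flags argument, no warnings are raised, only the check-only style of failure is modelled, and the sketch assumes strict UTF-8 limited to U+10FFFF without rejecting surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Perl API.
 * Decodes the first UTF-8 character of s, reading at most curlen bytes.
 * On success stores the character's byte length in *retlen and returns
 * the code point. On malformed input stores (size_t)-1 and returns 0. */
static uint32_t sketch_utf8n_to_uv(const uint8_t *s, size_t curlen,
                                   size_t *retlen)
{
    assert(s != NULL && retlen != NULL);
    *retlen = (size_t)-1;               /* pessimistic default */
    if (curlen == 0)
        return 0;

    uint32_t uv, min;
    size_t need;
    if (s[0] < 0x80) {                  /* invariant (ASCII) character */
        *retlen = 1;
        return s[0];
    } else if ((s[0] & 0xE0) == 0xC0) {
        need = 2; uv = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        need = 3; uv = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        need = 4; uv = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                       /* stray continuation or bad start */
    }

    if (need > curlen)
        return 0;                       /* character runs past the buffer */
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                   /* not a continuation byte */
        uv = (uv << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (uv < min || uv > 0x10FFFF)
        return 0;                       /* overlong or out of range */

    *retlen = need;
    return uv;
}
```

Because zero is also a legitimate character value, a caller of this sketch has to inspect the stored length, not the return value, to detect failure.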
=item utf8_distance
Returns the number of UTF8 characters between the UTF-8 pointers C<a>
=for hackers
Found in file utf8.c
-=item utf8_to_uv
-
-Returns the character value of the first character in the string C<s>
-which is assumed to be in UTF8 encoding and no longer than C<curlen>;
-C<retlen> will be set to the length, in bytes, of that character.
+=item utf8_to_uvchr
-If C<s> does not point to a well-formed UTF8 character, the behaviour
-is dependent on the value of C<flags>: if it contains UTF8_CHECK_ONLY,
-it is assumed that the caller will raise a warning, and this function
-will silently just set C<retlen> to C<-1> and return zero. If the
-C<flags> does not contain UTF8_CHECK_ONLY, warnings about
-malformations will be given, C<retlen> will be set to the expected
-length of the UTF-8 character in bytes, and zero will be returned.
+Returns the native character value of the first character in the string C<s>
+which is assumed to be in UTF8 encoding; C<retlen> will be set to the
+length, in bytes, of that character.
-The C<flags> can also contain various flags to allow deviations from
-the strict UTF-8 encoding (see F<utf8.h>).
+If C<s> does not point to a well-formed UTF8 character, zero is
+returned and C<retlen> is set, if possible, to -1.
- UV utf8_to_uv(U8 *s, STRLEN curlen, STRLEN* retlen, U32 flags)
+ UV utf8_to_uvchr(U8 *s, STRLEN* retlen)
=for hackers
Found in file utf8.c
-=item utf8_to_uv_simple
+=item utf8_to_uvuni
-Returns the character value of the first character in the string C<s>
+Returns the Unicode code point of the first character in the string C<s>
which is assumed to be in UTF8 encoding; C<retlen> will be set to the
length, in bytes, of that character.
+This function should only be used when the returned UV is considered
+an index into the Unicode semantic tables (e.g. swashes).
+
If C<s> does not point to a well-formed UTF8 character, zero is
returned and retlen is set, if possible, to -1.
- UV utf8_to_uv_simple(U8 *s, STRLEN* retlen)
+ UV utf8_to_uvuni(U8 *s, STRLEN* retlen)
+
+=for hackers
+Found in file utf8.c
+
+=item uvchr_to_utf8
+
+Adds the UTF8 representation of the Native codepoint C<uv> to the end
+of the string C<d>; C<d> should have at least C<UTF8_MAXLEN+1> free
+bytes available. The return value is the pointer to the byte after the
+end of the new character. In other words,
+
+ d = uvchr_to_utf8(d, uv);
+
+is the recommended wide native character-aware way of saying
+
+ *(d++) = uv;
+
+ U8* uvchr_to_utf8(U8 *d, UV uv)
=for hackers
Found in file utf8.c
-=item uv_to_utf8
+=item uvuni_to_utf8
Adds the UTF8 representation of the Unicode codepoint C<uv> to the end
of the string C<d>; C<d> should be have at least C<UTF8_MAXLEN+1> free
bytes available. The return value is the pointer to the byte after the
-end of the new character. In other words,
+end of the new character. In other words,
- d = uv_to_utf8(d, uv);
+ d = uvuni_to_utf8(d, uv);
is the recommended Unicode-aware way of saying
*(d++) = uv;
- U8* uv_to_utf8(U8 *d, UV uv)
+ U8* uvuni_to_utf8(U8 *d, UV uv)
=for hackers
Found in file utf8.c
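
The "append and return the new end pointer" idiom documented above can be illustrated with a standalone C sketch. This is not the code in F<utf8.c>: the name `sketch_uv_to_utf8` is hypothetical, and the sketch assumes standard UTF-8 limited to U+10FFFF, so four free bytes suffice here, whereas the documented functions ask for C<UTF8_MAXLEN+1>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Perl API.
 * Writes the UTF-8 encoding of uv at d and returns a pointer to the
 * byte after the new character. Handles only U+0000..U+10FFFF and
 * needs up to 4 free bytes at d. */
static uint8_t *sketch_uv_to_utf8(uint8_t *d, uint32_t uv)
{
    assert(d != NULL);
    assert(uv <= 0x10FFFF);
    if (uv < 0x80) {                    /* 1 byte: 0xxxxxxx */
        *d++ = (uint8_t)uv;
    } else if (uv < 0x800) {            /* 2 bytes: 110xxxxx 10xxxxxx */
        *d++ = (uint8_t)(0xC0 | (uv >> 6));
        *d++ = (uint8_t)(0x80 | (uv & 0x3F));
    } else if (uv < 0x10000) {          /* 3 bytes */
        *d++ = (uint8_t)(0xE0 | (uv >> 12));
        *d++ = (uint8_t)(0x80 | ((uv >> 6) & 0x3F));
        *d++ = (uint8_t)(0x80 | (uv & 0x3F));
    } else {                            /* 4 bytes */
        *d++ = (uint8_t)(0xF0 | (uv >> 18));
        *d++ = (uint8_t)(0x80 | ((uv >> 12) & 0x3F));
        *d++ = (uint8_t)(0x80 | ((uv >> 6) & 0x3F));
        *d++ = (uint8_t)(0x80 | (uv & 0x3F));
    }
    return d;
}
```

Returning the advanced pointer lets a caller chain appends as `d = sketch_uv_to_utf8(d, uv);` without tracking the encoded length separately.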