package Unicode::Normalize;

unless ("A" eq pack('U', 0x41) || "A" eq pack('U', ord("A"))) {
    die "Unicode::Normalize cannot stringify a Unicode code point\n";
}

our $VERSION = '0.21';
our $PACKAGE = __PACKAGE__;

our @ISA = qw(Exporter DynaLoader);
our @EXPORT = qw( NFC NFD NFKC NFKD );
our @EXPORT_OK = qw(
    normalize decompose reorder compose
    checkNFD checkNFKD checkNFC checkNFKC check
    getCanon getCompat getComposite getCombinClass
    isExclusion isSingleton isNonStDecomp isComp2nd isComp_Ex
    isNFD_NO isNFC_NO isNFC_MAYBE isNFKD_NO isNFKC_NO isNFKC_MAYBE
);
our %EXPORT_TAGS = (
    all       => [ @EXPORT, @EXPORT_OK ],
    normalize => [ @EXPORT, qw/normalize decompose reorder compose/ ],
    check     => [ qw/checkNFD checkNFKD checkNFC checkNFKC check/ ],
);

bootstrap Unicode::Normalize $VERSION;

    return pack('U*', @_);

    return unpack('U*', pack('U*').shift);

use constant COMPAT => 1;

sub NFD  ($) { reorder(decompose($_[0])) }
sub NFKD ($) { reorder(decompose($_[0], COMPAT)) }
sub NFC  ($) { compose(reorder(decompose($_[0]))) }
sub NFKC ($) { compose(reorder(decompose($_[0], COMPAT))) }

        $form eq 'D'  ? NFD ($str) :
        $form eq 'C'  ? NFC ($str) :
        $form eq 'KD' ? NFKD($str) :
        $form eq 'KC' ? NFKC($str) :
          croak $PACKAGE."::normalize: invalid form name: $form";

        $form eq 'D'  ? checkNFD ($str) :
        $form eq 'C'  ? checkNFC ($str) :
        $form eq 'KD' ? checkNFKD($str) :
        $form eq 'KC' ? checkNFKC($str) :
          croak $PACKAGE."::check: invalid form name: $form";

Unicode::Normalize - Unicode Normalization Forms

  use Unicode::Normalize;

  $NFD_string  = NFD($string);  # Normalization Form D
  $NFC_string  = NFC($string);  # Normalization Form C
  $NFKD_string = NFKD($string); # Normalization Form KD
  $NFKC_string = NFKC($string); # Normalization Form KC

  use Unicode::Normalize 'normalize';

  $NFD_string  = normalize('D',  $string);  # Normalization Form D
  $NFC_string  = normalize('C',  $string);  # Normalization Form C
  $NFKD_string = normalize('KD', $string);  # Normalization Form KD
  $NFKC_string = normalize('KC', $string);  # Normalization Form KC

=head2 Normalization Forms

=item C<$NFD_string = NFD($string)>

returns the Normalization Form D (formed by canonical decomposition).

=item C<$NFC_string = NFC($string)>

returns the Normalization Form C (formed by canonical decomposition
followed by canonical composition).

=item C<$NFKD_string = NFKD($string)>

returns the Normalization Form KD (formed by compatibility decomposition).

=item C<$NFKC_string = NFKC($string)>

returns the Normalization Form KC (formed by compatibility decomposition
followed by B<canonical> composition).

=item C<$normalized_string = normalize($form_name, $string)>

C<$form_name> must be one of the following names:

  'C'  or 'NFC'  for Normalization Form C
  'D'  or 'NFD'  for Normalization Form D
  'KC' or 'NFKC' for Normalization Form KC
  'KD' or 'NFKD' for Normalization Form KD
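
As a rough illustration of the four forms, the sketch below normalizes two sample strings chosen for this example (they are not taken from the module's own documentation): a precomposed C<e> with acute accent, and the C<fi> ligature, which only the compatibility forms fold.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKD NFKC normalize);

# U+00E9 LATIN SMALL LETTER E WITH ACUTE (precomposed)
my $composed   = "\x{E9}";
# "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed)
my $decomposed = "e\x{301}";

# The canonical forms convert between the two spellings.
print "NFD decomposes\n" if NFD($composed)   eq $decomposed;
print "NFC composes\n"   if NFC($decomposed) eq $composed;

# The compatibility forms also replace compatibility characters:
# U+FB01 LATIN SMALL LIGATURE FI becomes the two letters "fi".
my $ligature = "\x{FB01}";
print "NFC keeps the ligature\n" if NFC($ligature)  eq $ligature;
print "NFKC folds the ligature\n" if NFKC($ligature) eq "fi";

# normalize() takes the form name with or without the "NF" prefix.
print "form names agree\n"
    if normalize('NFD', $composed) eq normalize('D', $composed);
```

Only ASCII messages are printed here, so no output encoding layer is needed.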

=head2 Decomposition and Composition

=item C<$decomposed_string = decompose($string)>

=item C<$decomposed_string = decompose($string, $useCompatMapping)>

Decomposes the specified string and returns the result.

If the second parameter (a boolean) is omitted or false, the string is
decomposed using the Canonical Decomposition Mapping.
If it is true, the Compatibility Decomposition Mapping is used instead.

The string returned is not always in NFD/NFKD.
Reordering may be required.

  $NFD_string  = reorder(decompose($string));     # equivalent to NFD()
  $NFKD_string = reorder(decompose($string, 1));  # equivalent to NFKD()
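
To see why decomposition alone may fall short of NFD, consider the following sketch. The input string is an assumption picked for illustration: a precomposed letter followed by a combining mark of lower combining class.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD decompose reorder);

# U+1E0B LATIN SMALL LETTER D WITH DOT ABOVE, followed by
# U+0323 COMBINING DOT BELOW (combining class 220).
my $string = "\x{1E0B}\x{323}";

# Decomposition alone yields "d", U+0307 (class 230), U+0323 (class 220):
# the two marks are not yet in canonical order.
my $decomposed = decompose($string);

# Reordering sorts the marks by combining class, which gives NFD.
my $reordered = reorder($decomposed);

# Print both results as lists of code points.
printf "decomposed: %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $decomposed;
printf "reordered:  %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $reordered;
print "reordered result equals NFD\n" if $reordered eq NFD($string);
```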

=item C<$reordered_string = reorder($string)>

Reorders the combining characters and the like into the canonical ordering
and returns the result.

E.g., when you have a list of NFD/NFKD strings,
you can get the concatenated NFD/NFKD string from them by saying

  $concat_NFD  = reorder(join '', @NFD_strings);
  $concat_NFKD = reorder(join '', @NFKD_strings);
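
The concatenation case can be sketched as follows. The two pieces are illustrative assumptions: each is in NFD on its own, but joining them puts a cedilla after a circumflex, which breaks canonical order at the seam.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD reorder);

# Each piece is in NFD on its own:
#   "e" + U+0302 COMBINING CIRCUMFLEX ACCENT (combining class 230)
#   U+0327 COMBINING CEDILLA (combining class 202) + "x"
my @NFD_strings = ("e\x{302}", "\x{327}x");

# Plain concatenation leaves the cedilla after the circumflex.
my $joined = join '', @NFD_strings;

# reorder() restores canonical order across the seam.
my $concat_NFD = reorder($joined);

print "join alone is not NFD\n" if $joined     ne NFD($joined);
print "reordered join is NFD\n" if $concat_NFD eq NFD($joined);
```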

=item C<$composed_string = compose($string)>

Returns the string where composable pairs are composed.

E.g., when you have an NFD/NFKD string,
you can get its NFC/NFKC string by saying

  $NFC_string  = compose($NFD_string);
  $NFKC_string = compose($NFKD_string);
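
A short sketch of that relationship, using a sample string assumed for this example (a ligature plus a precomposed accented letter):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKD NFKC compose);

# U+FB01 LATIN SMALL LIGATURE FI, "anc", U+00E9 E WITH ACUTE
my $string = "\x{FB01}anc\x{E9}";

# Composing the decomposed forms gives the composed forms.
my $NFC_string  = compose(NFD($string));
my $NFKC_string = compose(NFKD($string));

print "compose(NFD)  matches NFC\n"  if $NFC_string  eq NFC($string);
print "compose(NFKD) matches NFKC\n" if $NFKC_string eq NFKC($string);
```

In this example only the NFKC result replaces the ligature with the letters C<fi>; the NFC result keeps it.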

(see Annex 8, UAX #15, and F<DerivedNormalizationProps.txt>)

The following functions check whether a string is in the corresponding
normalization form.

The result returned will be:

  YES     The string is in that normalization form.
  NO      The string is not in that normalization form.
  MAYBE   Undetermined. The string may or may not be in that form.

=item C<$result = checkNFD($string)>

returns C<YES> (C<1>) or C<NO> (C<empty string>).

=item C<$result = checkNFC($string)>

returns C<YES> (C<1>), C<NO> (C<empty string>), or C<MAYBE> (C<undef>).

=item C<$result = checkNFKD($string)>

returns C<YES> (C<1>) or C<NO> (C<empty string>).

=item C<$result = checkNFKC($string)>

returns C<YES> (C<1>), C<NO> (C<empty string>), or C<MAYBE> (C<undef>).

=item C<$result = check($form_name, $string)>

returns C<YES> (C<1>), C<NO> (C<empty string>), or C<MAYBE> (C<undef>).

C<$form_name> takes the same values as for C<normalize()>.

In the cases of NFD and NFKD, the answer must be either C<YES> or C<NO>.
The answer C<MAYBE> may be returned in the cases of NFC and NFKC.

A string for which C<MAYBE> is returned contains at least
one combining character or the like.
For example, C<COMBINING ACUTE ACCENT> has
the MAYBE_NFC/MAYBE_NFKC property.
Both C<checkNFC("A\N{COMBINING ACUTE ACCENT}")>
and C<checkNFC("B\N{COMBINING ACUTE ACCENT}")> will return C<MAYBE>.
C<"A\N{COMBINING ACUTE ACCENT}"> is not in NFC
(its NFC is C<"\N{LATIN CAPITAL LETTER A WITH ACUTE}">),
while C<"B\N{COMBINING ACUTE ACCENT}"> is in NFC.

If you want an exact check, compare the string with its NFC/NFKC; i.e.,

  $string eq NFC($string)    # more thorough than checkNFC($string)
  $string eq NFKC($string)   # more thorough than checkNFKC($string)
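
One way to combine the quick check with the exact comparison is sketched below. The helper name C<is_nfc> is hypothetical and not part of this module; it assumes only the return values described above (C<1>, empty string, C<undef>).

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC checkNFC checkNFD);

# Hypothetical helper: use the quick check first, and fall back to a
# full comparison only when the quick check answers MAYBE (undef).
sub is_nfc {
    my ($string) = @_;
    my $quick = checkNFC($string);
    return $quick ? 1 : 0 if defined $quick;
    return $string eq NFC($string) ? 1 : 0;
}

my $a_acute = "A\x{301}";   # composes to U+00C1, so it is not in NFC
my $b_acute = "B\x{301}";   # has no precomposed form, so it is in NFC

for my $string ($a_acute, $b_acute) {
    my $quick = checkNFC($string);
    printf "quick check: %s, exact check: %s\n",
        defined $quick ? ($quick ? 'YES' : 'NO') : 'MAYBE',
        is_nfc($string) ? 'in NFC' : 'not in NFC';
}
```

The fallback costs a full normalization, so it is only worth paying when the quick check cannot decide.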

=head2 Character Data

These functions provide an interface to the character data used internally.
If you only want to get Unicode normalization forms, you don't need
to call them yourself.

=item C<$canonical_decomposed = getCanon($codepoint)>

If the character of the specified codepoint is canonically
decomposable (including Hangul Syllables),
returns the B<completely decomposed> string canonically equivalent to it.

If it is not decomposable, returns C<undef>.

=item C<$compatibility_decomposed = getCompat($codepoint)>

If the character of the specified codepoint is compatibility
decomposable (including Hangul Syllables),
returns the B<completely decomposed> string compatibility equivalent to it.

If it is not decomposable, returns C<undef>.

=item C<$codepoint_composite = getComposite($codepoint_here, $codepoint_next)>

If two characters here and next (as codepoints) are composable
(including Hangul Jamo/Syllables and Composition Exclusions),
returns the codepoint of the composite.

If they are not composable, returns C<undef>.

=item C<$combining_class = getCombinClass($codepoint)>

Returns the combining class of the character as an integer.
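
The four lookups above can be tried on a few well-known characters. This is only a sketch; the code points are examples chosen here, and the functions take integer codepoints rather than strings.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(getCanon getCompat getComposite getCombinClass);

# Canonical decomposition: U+00E9 becomes "e" + U+0301.
my $canon     = getCanon(0x00E9);
# Hangul syllable U+AC00 decomposes into the jamo U+1100 U+1161.
my $hangul    = getCanon(0xAC00);
# "A" has no decomposition, so undef is returned.
my $plain     = getCanon(0x0041);

# U+FB01 (fi ligature) has only a compatibility decomposition.
my $compat    = getCompat(0xFB01);
my $no_canon  = getCanon(0xFB01);

# "e" and U+0301 compose to U+00E9; "A" and "B" do not compose.
my $composite = getComposite(0x0065, 0x0301);
my $nothing   = getComposite(0x0041, 0x0042);

# Combining classes: 230 for the acute accent, 0 for a base letter.
my $class_acute = getCombinClass(0x0301);
my $class_base  = getCombinClass(0x0041);

printf "composite: U+%04X, classes: %d and %d\n",
    $composite, $class_acute, $class_base;
```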

=item C<$is_exclusion = isExclusion($codepoint)>

Returns a boolean indicating whether the character of the specified
codepoint is a composition exclusion.

=item C<$is_singleton = isSingleton($codepoint)>

Returns a boolean indicating whether the character of the specified
codepoint is a singleton.

=item C<$is_non_starter_decomposition = isNonStDecomp($codepoint)>

Returns a boolean indicating whether the canonical decomposition
of the character of the specified codepoint
is a Non-Starter Decomposition.

=item C<$may_be_composed_with_prev_char = isComp2nd($codepoint)>

Returns a boolean indicating whether the character of the specified
codepoint may be composed with the previous one in a certain composition
(including Hangul Compositions, but excluding
Composition Exclusions and Non-Starter Decompositions).
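
The predicates above can be related to visible normalization behaviour. The sketch below uses code points picked for illustration, on the assumption that the installed Unicode data classifies them as the Unicode Character Database does.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC isExclusion isSingleton isNonStDecomp isComp2nd);

# U+0958 DEVANAGARI LETTER QA is a composition exclusion:
# it decomposes, and NFC does not put it back together.
print "U+0958 is an exclusion\n" if isExclusion(0x0958);
print "NFC leaves it decomposed\n" if NFC("\x{958}") eq "\x{915}\x{93C}";

# U+212B ANGSTROM SIGN is a singleton: its canonical decomposition
# is the single character U+00C5, which NFC prefers.
print "U+212B is a singleton\n" if isSingleton(0x212B);
print "NFC replaces it with U+00C5\n" if NFC("\x{212B}") eq "\x{C5}";

# U+0344 COMBINING GREEK DIALYTIKA TONOS decomposes into two
# combining marks, i.e. a Non-Starter Decomposition.
print "U+0344 is a non-starter decomposition\n" if isNonStDecomp(0x0344);

# U+0301 COMBINING ACUTE ACCENT can be the second character of a
# composition; the letter "A" cannot.
print "U+0301 may compose with the previous character\n" if isComp2nd(0x0301);
print "U+0041 may not\n" unless isComp2nd(0x0041);
```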

C<NFC>, C<NFD>, C<NFKC>, C<NFKD>: by default.

C<normalize> and some other functions: on request.

SADAHIRO Tomoyuki, E<lt>SADAHIRO@cpan.orgE<gt>

  http://homepage1.nifty.com/nomenclator/perl/

  Copyright(C) 2001-2003, SADAHIRO Tomoyuki. Japan. All rights reserved.

  This module is free software; you can redistribute it
  and/or modify it under the same terms as Perl itself.

=item http://www.unicode.org/unicode/reports/tr15/

Unicode Normalization Forms - UAX #15

=item http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt

Derived Normalization Properties