Commit | Line | Data |
15ca46df |
1 | <html> |
2 | |
3 | <head> |
4 | <meta name="GENERATOR" content="Microsoft FrontPage 4.0"> |
5 | <meta name="ProgId" content="FrontPage.Editor.Document"> |
6 | <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> |
7 | <link rel="stylesheet" href="http://www.unicode.org/unicode.css" type="text/css"> |
8 | <title>UnicodeData File Format</title> |
9 | </head> |
10 | |
11 | <body> |
12 | |
06bfd75b |
13 | <table width="100%" cellpadding="0" cellspacing="0" border="0"> |
14 | <tr> |
15 | <td> |
16 | <table width="100%" border="0" cellpadding="0" cellspacing="0"> |
17 | <tr> |
18 | <td class="icon"><a href="http://www.unicode.org"><img border="0" |
19 | src="http://www.unicode.org/webscripts/logo60s2.gif" align="middle" |
20 | alt="[Unicode]" width="34" height="33"></a> <a |
21 | class="bar" href="UnicodeCharacterDatabase.html">Unicode Character |
22 | Database</a></td> |
23 | </tr> |
24 | </table> |
25 | </td> |
26 | </tr> |
27 | <tr> |
28 | <td class="gray"> </td> |
29 | </tr> |
30 | </table> |
31 | <h1>Unicode Data File Format</h1> |
15ca46df |
32 | <table border="1" cellspacing="2" cellpadding="0" height="87" width="100%"> |
33 | <tr> |
34 | <td valign="TOP" width="144">Revision</td> |
06bfd75b |
35 | <td valign="TOP">3.1.0</td> |
15ca46df |
36 | </tr> |
37 | <tr> |
38 | <td valign="TOP" width="144">Authors</td> |
39 | <td valign="TOP">Mark Davis and Ken Whistler</td> |
40 | </tr> |
41 | <tr> |
42 | <td valign="TOP" width="144">Date</td> |
06bfd75b |
43 | <td valign="TOP">2001-02-28</td> |
15ca46df |
44 | </tr> |
45 | <tr> |
46 | <td valign="TOP" width="144">This Version</td> |
47 | <td valign="TOP"><a |
06bfd75b |
48 | href="http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.html">http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.html</a></td> |
15ca46df |
49 | </tr> |
50 | <tr> |
51 | <td valign="TOP" width="144">Previous Version</td> |
52 | <td valign="TOP"><a |
06bfd75b |
53 | href="http://www.unicode.org/Public/3.0-Update1/UnicodeData-3.0.1.html">http://www.unicode.org/Public/3.0-Update1/UnicodeData-3.0.1.html</a></td> |
15ca46df |
54 | </tr> |
55 | <tr> |
56 | <td valign="TOP" width="144">Latest Version</td> |
57 | <td valign="TOP"><a |
58 | href="http://www.unicode.org/Public/UNIDATA/UnicodeData.html">http://www.unicode.org/Public/UNIDATA/UnicodeData.html</a></td> |
59 | </tr> |
60 | </table> |
06bfd75b |
61 | <h3><br> |
62 | <i>Summary</i></h3> |
63 | <blockquote> |
64 | <p><i>This document describes the format and content of the UnicodeData.txt |
65 | file in the Unicode Character Database (UCD).</i></p> |
66 | </blockquote> |
67 | <h3><i>Status</i></h3> |
68 | <blockquote> |
69 | <p><i>The file and the files described herein are part of the Unicode |
70 | Character Database and governed by the <a href="#UCD_Terms">UCD Terms of |
71 | Use</a> given below.</i></p> |
72 | <p><i>For general information on file formats and table formats, and the |
73 | implications of normative vs informative properties, see |
74 | UnicodeCharacterDatabase.html. </i></p> |
75 | <p><i><b>Warning: </b>the information in this file does not completely |
76 | describe the use and interpretation of Unicode character properties and |
77 | behavior. It must be used in conjunction with the data in the other files in |
78 | the UCD, and relies on the notation and definitions supplied in <a |
79 | href="http://www.unicode.org/unicode/standard/versions/Unicode3.0.html">The |
80 | Unicode Standard</a>. All chapter references are to Version 3.1.0 of the |
81 | standard.</i></p> |
82 | </blockquote> |
83 | <h2>Introduction</h2> |
15ca46df |
84 | <p>This document describes the format of the UnicodeData.txt file, which is one |
85 | of the files in the Unicode Character Database. The document is divided into the |
86 | following sections: |
87 | <ul> |
88 | <li><a href="#Field Formats">Field Formats</a> |
89 | <ul> |
90 | <li><a href="#General Category">General Category</a></li> |
91 | <li><a href="#Bidirectional Category">Bidirectional Category</a></li> |
92 | <li><a href="#Character Decomposition">Character Decomposition Mapping</a></li> |
93 | <li><a href="#Canonical Combining Classes">Canonical Combining Classes</a></li> |
94 | <li><a href="#Decompositions and Normalization">Decompositions and |
95 | Normalization</a></li> |
96 | <li><a href="#Case Mappings">Case Mappings</a></li> |
97 | </ul> |
98 | </li> |
99 | <li><a href="#Property Invariants">Property Invariants</a></li> |
100 | <li><a href="#Modification History">Modification History</a></li> |
101 | </ul> |
15ca46df |
102 | <h2><a name="Field Formats"></a>Field Formats</h2> |
06bfd75b |
103 | <p>Each line represents the data for one encoded character in the Unicode |
104 | Standard. (For information on the file format, see UCD File Format in |
105 | UnicodeCharacterDatabase.html.) |
106 | <p>Every encoded character has a data entry, with the exception of certain |
107 | special ranges, as detailed below. |
15ca46df |
108 | <ul> |
06bfd75b |
109 | <li>These ranges are represented only by their start and end characters, since the |
110 | properties in the file are uniform, except for code values (which are all |
111 | sequential and assigned).</li> |
15ca46df |
112 | <li>The names of CJK ideograph characters and the names and decompositions of |
113 | Hangul syllable characters are algorithmically derivable. (See the Unicode |
114 | Standard and <a href="http://www.unicode.org/unicode/reports/tr15/">Unicode |
115 | Standard Annex #15</a> for more information.)</li> |
116 | <li>Surrogate code values and private use characters have no names.</li> |
06bfd75b |
117 | <li>The supplementary Private Use characters (U+F0000 .. U+FFFFD, U+100000 .. |
118 | U+10FFFD) are listed as distinct ranges. These correspond to surrogate pairs |
15ca46df |
119 | where the first surrogate is in the High Surrogate Private Use section.</li> |
120 | </ul> |
121 | <p>The exact ranges represented by start and end characters are: |
122 | <ul> |
06bfd75b |
123 | <li>CJK Ideographs Extension A (U+3400 .. U+4DB5)</li> |
124 | <li>CJK Ideographs (U+4E00 .. U+9FA5)</li> |
125 | <li>Hangul Syllables (U+AC00 .. U+D7A3)</li> |
126 | <li>Non-Private Use High Surrogates (U+D800 .. U+DB7F)</li> |
127 | <li>Private Use High Surrogates (U+DB80 .. U+DBFF)</li> |
128 | <li>Low Surrogates (U+DC00 .. U+DFFF)</li> |
129 | <li>The Private Use Area (U+E000 .. U+F8FF)</li> |
130 | <li>CJK Ideographs Extension B (U+20000 .. U+2A6D6)</li> |
131 | <li>Plane 15 Private Use Area (U+F0000 .. U+FFFFD)</li> |
132 | <li>Plane 16 Private Use Area (U+100000 .. U+10FFFD)</li> |
15ca46df |
133 | </ul> |
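<p>As an illustration only, the range list above can be kept as a small lookup table, so that a code value with no individual data entry can still be matched to its range. This is a minimal sketch: the names <tt>RANGES</tt> and <tt>find_range</tt> are hypothetical and are not part of the Unicode Character Database.</p>

```python
# Ranges represented only by their start and end characters,
# copied from the list above as (description, first, last).
RANGES = [
    ("CJK Ideographs Extension A", 0x3400, 0x4DB5),
    ("CJK Ideographs", 0x4E00, 0x9FA5),
    ("Hangul Syllables", 0xAC00, 0xD7A3),
    ("Non-Private Use High Surrogates", 0xD800, 0xDB7F),
    ("Private Use High Surrogates", 0xDB80, 0xDBFF),
    ("Low Surrogates", 0xDC00, 0xDFFF),
    ("The Private Use Area", 0xE000, 0xF8FF),
    ("CJK Ideographs Extension B", 0x20000, 0x2A6D6),
    ("Plane 15 Private Use Area", 0xF0000, 0xFFFFD),
    ("Plane 16 Private Use Area", 0x100000, 0x10FFFD),
]


def find_range(code_value):
    """Return the description of the range containing code_value, or None.

    Hypothetical helper: a linear scan is enough for ten ranges.
    """
    for description, first, last in RANGES:
        if first <= code_value <= last:
            return description
    return None
```

<p>A parser that finds no individual entry for a code value could fall back to <tt>find_range</tt> and apply the uniform properties of the matching range.</p>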
134 | <p>The following table describes the format and meaning of each field in a data |
06bfd75b |
135 | entry in the UnicodeData file.</p> |
15ca46df |
136 | <table border="1" cellspacing="2" cellpadding="2"> |
137 | <tr> |
138 | <th valign="top" align="LEFT"> |
139 | <p align="LEFT">Field</th> |
140 | <th valign="top" align="LEFT"> |
141 | <p align="LEFT">Name</th> |
06bfd75b |
142 | <th valign="top" align="center"> |
143 | <p align="LEFT">N/I</th> |
15ca46df |
144 | <th valign="top" align="LEFT"> |
145 | <p align="LEFT">Explanation</th> |
146 | </tr> |
147 | <tr> |
148 | <th valign="top">0</th> |
149 | <td valign="top">Code value</td> |
06bfd75b |
150 | <td valign="top" align="center">N</td> |
151 | <td valign="top">Code value, expressed in hexadecimal.</td> |
15ca46df |
152 | </tr> |
153 | <tr> |
154 | <th valign="top">1</th> |
155 | <td valign="top">Character name</td> |
06bfd75b |
156 | <td valign="top" align="center">N</td> |
15ca46df |
157 | <td valign="top">These names match exactly the names published in Chapter 14 |
158 | of the Unicode Standard, Version 3.0.</td> |
159 | </tr> |
160 | <tr> |
161 | <th valign="top">2</th> |
162 | <td valign="top"><a href="#General Category">General Category</a></td> |
06bfd75b |
163 | <td valign="top" align="center">N</td> |
15ca46df |
164 | <td valign="top">This is a useful breakdown into various "character |
165 | types" which can be used as a default categorization in |
166 | implementations. See below for a brief explanation.</td> |
167 | </tr> |
168 | <tr> |
169 | <th valign="top">3</th> |
170 | <td valign="top"><a href="#Canonical Combining Classes">Canonical Combining |
171 | Classes</a></td> |
06bfd75b |
172 | <td valign="top" align="center">N</td> |
15ca46df |
173 | <td valign="top">The classes used for the Canonical Ordering Algorithm in |
174 | the Unicode Standard. These classes are also printed in Chapter 4 of the |
175 | Unicode Standard.</td> |
176 | </tr> |
177 | <tr> |
178 | <th valign="top">4</th> |
179 | <td valign="top"><a href="#Bidirectional Category">Bidirectional Category</a></td> |
06bfd75b |
180 | <td valign="top" align="center">N</td> |
15ca46df |
181 | <td valign="top">See the list below for an explanation of the abbreviations |
182 | used in this field. These are the categories required by the Bidirectional |
183 | Behavior Algorithm in the Unicode Standard. These categories are |
184 | summarized in Chapter 3 of the Unicode Standard.</td> |
185 | </tr> |
186 | <tr> |
187 | <th valign="top">5</th> |
188 | <td valign="top"><a href="#Character Decomposition">Character Decomposition |
189 | Mapping</a></td> |
06bfd75b |
190 | <td valign="top" align="center">N</td> |
15ca46df |
191 | <td valign="top">In the Unicode Standard, not all of the mappings are full |
192 | (maximal) decompositions. Recursive application of look-up for |
193 | decompositions will, in all cases, lead to a maximal decomposition. The |
194 | decomposition mappings match exactly the decomposition mappings published |
195 | with the character names in the Unicode Standard.</td> |
196 | </tr> |
197 | <tr> |
198 | <th valign="top">6</th> |
199 | <td valign="top">Decimal digit value</td> |
06bfd75b |
200 | <td valign="top" align="center">N</td> |
15ca46df |
201 | <td valign="top">This is a numeric field. If the character has the decimal |
202 | digit property, as specified in Chapter 4 of the Unicode Standard, the |
203 | value of that digit is represented with an integer value in this field.</td> |
204 | </tr> |
205 | <tr> |
206 | <th valign="top">7</th> |
207 | <td valign="top">Digit value</td> |
06bfd75b |
208 | <td valign="top" align="center">N</td> |
15ca46df |
209 | <td valign="top">This is a numeric field. If the character represents a |
210 | digit, not necessarily a decimal digit, the value is here. This covers |
211 | digits which do not form decimal radix forms, such as the compatibility |
212 | superscript digits.</td> |
213 | </tr> |
214 | <tr> |
215 | <th valign="top">8</th> |
216 | <td valign="top">Numeric value</td> |
06bfd75b |
217 | <td valign="top" align="center">N</td> |
15ca46df |
218 | <td valign="top">This is a numeric field. If the character has the numeric |
219 | property, as specified in Chapter 4 of the Unicode Standard, the value of |
220 | that character is represented with an integer or rational number in this |
221 | field. This includes fractions, e.g., "1/5" for U+2155 VULGAR |
222 | FRACTION ONE FIFTH. Also included are numerical values for compatibility |
223 | characters such as circled numbers.</td> |
224 | </tr> |
225 | <tr> |
226 | <th valign="top">9</th> |
227 | <td valign="top">Mirrored</td> |
06bfd75b |
228 | <td valign="top" align="center">N</td> |
15ca46df |
229 | <td valign="top">If the character has been identified as a |
230 | "mirrored" character in bidirectional text, this field has the |
231 | value "Y"; otherwise "N". The list of mirrored |
232 | characters is also printed in Chapter 4 of the Unicode Standard.</td> |
233 | </tr> |
234 | <tr> |
235 | <th valign="top">10</th> |
236 | <td valign="top">Unicode 1.0 Name</td> |
06bfd75b |
237 | <td valign="top" align="center">I</td> |
15ca46df |
238 | <td valign="top">This is the old name as published in Unicode 1.0. This name |
06bfd75b |
239 | is only provided when it is significantly different from the current name |
240 | for the character.</td> |
15ca46df |
241 | </tr> |
242 | <tr> |
243 | <th valign="top">11</th> |
244 | <td valign="top">10646 comment field</td> |
06bfd75b |
245 | <td valign="top" align="center">I</td> |
246 | <td valign="top">This is the ISO 10646 comment field. It appears in |
247 | parentheses in the 10646 names list, or contains an asterisk to mark an |
248 | Annex P note.</td> |
15ca46df |
249 | </tr> |
250 | <tr> |
251 | <th valign="top">12</th> |
252 | <td valign="top"><a href="#Case Mappings">Uppercase Mapping</a></td> |
06bfd75b |
253 | <td valign="top" align="center">N</td> |
15ca46df |
254 | <td valign="top">Upper case equivalent mapping. If a character is part of an |
06bfd75b |
255 | alphabet with case distinctions, and has a simple upper case equivalent, |
256 | then the upper case equivalent is in this field. See the explanation below |
257 | on case distinctions. These mappings are always one-to-one, not |
258 | one-to-many or many-to-one. |
259 | <p><i>For full case mappings, see <a |
260 | href="http://www.unicode.org/unicode/reports/tr21/">UTR #21</a> and |
261 | SpecialCasing.txt.</i></p> |
262 | </td> |
15ca46df |
263 | </tr> |
264 | <tr> |
265 | <th valign="top">13</th> |
266 | <td valign="top"><a href="#Case Mappings">Lowercase Mapping</a></td> |
06bfd75b |
267 | <td valign="top" align="center">N</td> |
15ca46df |
268 | <td valign="top">Lower case equivalent mapping. Follows the same conventions as the Uppercase Mapping field.</td> |
269 | </tr> |
270 | <tr> |
271 | <th valign="top">14</th> |
272 | <td valign="top"><a href="#Case Mappings">Titlecase Mapping</a></td> |
06bfd75b |
273 | <td valign="top" align="center">N</td> |
15ca46df |
274 | <td valign="top">Title case equivalent mapping. Follows the same conventions as the Uppercase Mapping field.</td> |
275 | </tr> |
276 | </table> |
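<p>The field table above can be turned into a small parser. The following is a sketch under stated assumptions, not a reference implementation: it assumes that the fields of a data entry are separated by semicolons (the file format itself is described in UnicodeCharacterDatabase.html), and the names <tt>FIELD_NAMES</tt> and <tt>parse_entry</tt> are hypothetical. The sample lines in the comments are illustrative entries for U+0041 and U+2155.</p>

```python
from fractions import Fraction

# One name per field, in the fixed order 0..14 given in the table above.
FIELD_NAMES = [
    "code_value", "name", "general_category", "combining_class",
    "bidi_category", "decomposition_mapping", "decimal_digit_value",
    "digit_value", "numeric_value", "mirrored", "unicode_1_name",
    "comment_10646", "uppercase_mapping", "lowercase_mapping",
    "titlecase_mapping",
]


def parse_entry(line):
    """Parse one data entry into a dict keyed by FIELD_NAMES.

    Assumes semicolon-separated fields; empty fields stay as "" unless
    converted below, in which case they become None.
    """
    fields = line.rstrip("\n").split(";")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError("expected %d fields, got %d"
                         % (len(FIELD_NAMES), len(fields)))
    entry = dict(zip(FIELD_NAMES, fields))
    entry["code_value"] = int(entry["code_value"], 16)
    entry["combining_class"] = int(entry["combining_class"])
    entry["mirrored"] = entry["mirrored"] == "Y"
    # Numeric values may be integers or rationals such as "1/5".
    numeric = entry["numeric_value"]
    entry["numeric_value"] = Fraction(numeric) if numeric else None
    # Simple case mappings are single code values (one-to-one).
    for key in ("uppercase_mapping", "lowercase_mapping", "titlecase_mapping"):
        entry[key] = int(entry[key], 16) if entry[key] else None
    return entry


# Illustrative entries:
# parse_entry("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
# parse_entry("2155;VULGAR FRACTION ONE FIFTH;No;0;ON;"
#             "<fraction> 0031 2044 0035;;;1/5;N;FRACTION ONE FIFTH;;;;")
```

<p>Checking the field count up front relies on the invariant, stated later in this document, that the number and order of fields in UnicodeData.txt are fixed.</p>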
277 | <h3><a name="General Category"></a>General Category</h3> |
06bfd75b |
278 | <p>The values in this field are abbreviations for the following values. For more |
279 | information, see the Unicode Standard.</p> |
280 | <blockquote> |
281 | <p><b>Note:</b> the standard does not assign information to control characters |
282 | (except for certain cases in the Bidirectional Algorithm). Implementations |
283 | will generally also assign categories to certain control characters, notably |
284 | CR and LF, according to platform conventions. See <a |
285 | href="http://www.unicode.org/unicode/reports/tr13/">UAX #13: Unicode Newline |
286 | Guidelines</a> for more information.</p> |
287 | </blockquote> |
288 | <table border="0" cellspacing="0" cellpadding="4"> |
15ca46df |
289 | <tr> |
290 | <th> |
291 | <p align="LEFT">Abbr.</th> |
292 | <th> |
293 | <p align="LEFT">Description</th> |
294 | </tr> |
295 | <tr> |
296 | <td align="CENTER">Lu</td> |
297 | <td>Letter, Uppercase</td> |
298 | </tr> |
299 | <tr> |
300 | <td align="CENTER">Ll</td> |
301 | <td>Letter, Lowercase</td> |
302 | </tr> |
303 | <tr> |
304 | <td align="CENTER">Lt</td> |
305 | <td>Letter, Titlecase</td> |
306 | </tr> |
307 | <tr> |
06bfd75b |
308 | <td align="CENTER">Lm</td> |
309 | <td>Letter, Modifier</td> |
310 | </tr> |
311 | <tr> |
312 | <td align="CENTER">Lo</td> |
313 | <td>Letter, Other</td> |
314 | </tr> |
315 | <tr> |
15ca46df |
316 | <td align="CENTER">Mn</td> |
317 | <td>Mark, Non-Spacing</td> |
318 | </tr> |
319 | <tr> |
320 | <td align="CENTER">Mc</td> |
321 | <td>Mark, Spacing Combining</td> |
322 | </tr> |
323 | <tr> |
324 | <td align="CENTER">Me</td> |
325 | <td>Mark, Enclosing</td> |
326 | </tr> |
327 | <tr> |
328 | <td align="CENTER">Nd</td> |
329 | <td>Number, Decimal Digit</td> |
330 | </tr> |
331 | <tr> |
332 | <td align="CENTER">Nl</td> |
333 | <td>Number, Letter</td> |
334 | </tr> |
335 | <tr> |
336 | <td align="CENTER">No</td> |
337 | <td>Number, Other</td> |
338 | </tr> |
339 | <tr> |
15ca46df |
340 | <td align="CENTER">Pc</td> |
341 | <td>Punctuation, Connector</td> |
342 | </tr> |
343 | <tr> |
344 | <td align="CENTER">Pd</td> |
345 | <td>Punctuation, Dash</td> |
346 | </tr> |
347 | <tr> |
348 | <td align="CENTER">Ps</td> |
349 | <td>Punctuation, Open</td> |
350 | </tr> |
351 | <tr> |
352 | <td align="CENTER">Pe</td> |
353 | <td>Punctuation, Close</td> |
354 | </tr> |
355 | <tr> |
356 | <td align="CENTER">Pi</td> |
357 | <td>Punctuation, Initial quote (may behave like Ps or Pe depending on usage)</td> |
358 | </tr> |
359 | <tr> |
360 | <td align="CENTER">Pf</td> |
361 | <td>Punctuation, Final quote (may behave like Ps or Pe depending on usage)</td> |
362 | </tr> |
363 | <tr> |
364 | <td align="CENTER">Po</td> |
365 | <td>Punctuation, Other</td> |
366 | </tr> |
367 | <tr> |
368 | <td align="CENTER">Sm</td> |
369 | <td>Symbol, Math</td> |
370 | </tr> |
371 | <tr> |
372 | <td align="CENTER">Sc</td> |
373 | <td>Symbol, Currency</td> |
374 | </tr> |
375 | <tr> |
376 | <td align="CENTER">Sk</td> |
377 | <td>Symbol, Modifier</td> |
378 | </tr> |
379 | <tr> |
380 | <td align="CENTER">So</td> |
381 | <td>Symbol, Other</td> |
382 | </tr> |
06bfd75b |
383 | <tr> |
384 | <td align="CENTER">Zs</td> |
385 | <td>Separator, Space</td> |
386 | </tr> |
387 | <tr> |
388 | <td align="CENTER">Zl</td> |
389 | <td>Separator, Line</td> |
390 | </tr> |
391 | <tr> |
392 | <td align="CENTER">Zp</td> |
393 | <td>Separator, Paragraph</td> |
394 | </tr> |
395 | <tr> |
396 | <td align="CENTER">Cc</td> |
397 | <td>Other, Control</td> |
398 | </tr> |
399 | <tr> |
400 | <td align="CENTER">Cf</td> |
401 | <td>Other, Format</td> |
402 | </tr> |
403 | <tr> |
404 | <td align="CENTER">Cs</td> |
405 | <td>Other, Surrogate</td> |
406 | </tr> |
407 | <tr> |
408 | <td align="CENTER">Co</td> |
409 | <td>Other, Private Use</td> |
410 | </tr> |
411 | <tr> |
412 | <td align="CENTER">Cn</td> |
413 | <td>Other, Not Assigned (no characters in the file have this property)</td> |
414 | </tr> |
15ca46df |
415 | </table> |
06bfd75b |
416 | <blockquote> |
417 | <p><b>Note:</b> The term "L&" is sometimes used to stand for |
418 | Uppercase, Lowercase or Titlecase letters (Lu, Ll, or Lt).</p> |
419 | </blockquote> |
15ca46df |
420 | <h3><a name="Bidirectional Category"></a>Bidirectional Category</h3> |
421 | <p>Please refer to Chapter 3 for an explanation of the algorithm for |
422 | Bidirectional Behavior and an explanation of the significance of these |
423 | categories. An up-to-date version can be found in <a |
424 | href="http://www.unicode.org/unicode/reports/tr9/">Unicode Standard Annex #9: |
06bfd75b |
425 | The Bidirectional Algorithm</a>.</p> |
426 | <table border="0" cellpadding="4" cellspacing="0"> |
15ca46df |
427 | <tr> |
428 | <th valign="TOP" align="LEFT"> |
429 | <p align="LEFT">Type</th> |
430 | <th valign="TOP" align="LEFT"> |
431 | <p align="LEFT">Description</th> |
432 | </tr> |
433 | <tr> |
434 | <td valign="TOP"><b>L</b></td> |
435 | <td valign="TOP">Left-to-Right</td> |
436 | </tr> |
437 | <tr> |
438 | <td valign="TOP"><b>LRE</b></td> |
439 | <td valign="TOP">Left-to-Right Embedding</td> |
440 | </tr> |
441 | <tr> |
442 | <td valign="TOP"><b>LRO</b></td> |
443 | <td valign="TOP">Left-to-Right Override</td> |
444 | </tr> |
445 | <tr> |
446 | <td valign="TOP"><b>R</b></td> |
447 | <td valign="TOP">Right-to-Left</td> |
448 | </tr> |
449 | <tr> |
450 | <td valign="TOP"><b>AL</b></td> |
451 | <td valign="TOP">Right-to-Left Arabic</td> |
452 | </tr> |
453 | <tr> |
454 | <td valign="TOP"><b>RLE</b></td> |
455 | <td valign="TOP">Right-to-Left Embedding</td> |
456 | </tr> |
457 | <tr> |
458 | <td valign="TOP"><b>RLO</b></td> |
459 | <td valign="TOP">Right-to-Left Override</td> |
460 | </tr> |
461 | <tr> |
462 | <td valign="TOP"><b>PDF</b></td> |
463 | <td valign="TOP">Pop Directional Format</td> |
464 | </tr> |
465 | <tr> |
466 | <td valign="TOP"><b>EN</b></td> |
467 | <td valign="TOP">European Number</td> |
468 | </tr> |
469 | <tr> |
470 | <td valign="TOP"><b>ES</b></td> |
471 | <td valign="TOP">European Number Separator</td> |
472 | </tr> |
473 | <tr> |
474 | <td valign="TOP"><b>ET</b></td> |
475 | <td valign="TOP">European Number Terminator</td> |
476 | </tr> |
477 | <tr> |
478 | <td valign="TOP"><b>AN</b></td> |
479 | <td valign="TOP">Arabic Number</td> |
480 | </tr> |
481 | <tr> |
482 | <td valign="TOP"><b>CS</b></td> |
483 | <td valign="TOP">Common Number Separator</td> |
484 | </tr> |
485 | <tr> |
486 | <td valign="TOP"><b>NSM</b></td> |
487 | <td valign="TOP">Non-Spacing Mark</td> |
488 | </tr> |
489 | <tr> |
490 | <td valign="TOP"><b>BN</b></td> |
491 | <td valign="TOP">Boundary Neutral</td> |
492 | </tr> |
493 | <tr> |
494 | <td valign="TOP"><b>B</b></td> |
495 | <td valign="TOP">Paragraph Separator</td> |
496 | </tr> |
497 | <tr> |
498 | <td valign="TOP"><b>S</b></td> |
499 | <td valign="TOP">Segment Separator</td> |
500 | </tr> |
501 | <tr> |
502 | <td valign="TOP"><b>WS</b></td> |
503 | <td valign="TOP">Whitespace</td> |
504 | </tr> |
505 | <tr> |
506 | <td valign="TOP"><b>ON</b></td> |
507 | <td valign="TOP">Other Neutrals</td> |
508 | </tr> |
509 | </table> |
510 | <h3><a name="Character Decomposition"></a>Character Decomposition Mapping</h3> |
06bfd75b |
511 | <p>The tags supplied with certain decomposition mappings generally indicate |
512 | formatting information. Where no such tag is given, the mapping is designated as |
513 | canonical. Conversely, the presence of a formatting tag also indicates that the |
514 | mapping is a compatibility mapping and not a canonical mapping. In the absence |
515 | of other formatting information in a compatibility mapping, the tag is used to |
15ca46df |
516 | distinguish it from canonical mappings.</p> |
517 | <p>In some instances a canonical mapping or a compatibility mapping may consist |
518 | of a single character. For a canonical mapping, this indicates that the |
519 | character is a canonical equivalent of another single character. For a |
520 | compatibility mapping, this indicates that the character is a compatibility |
521 | equivalent of another single character. The compatibility formatting tags used |
522 | are:</p> |
06bfd75b |
523 | <table border="0" cellspacing="0" cellpadding="4"> |
15ca46df |
524 | <tr> |
525 | <th>Tag</th> |
526 | <th> |
527 | <p align="LEFT">Description</th> |
528 | </tr> |
529 | <tr> |
530 | <td align="CENTER"><font> </td> |
531 | <td>A font variant (e.g. a blackletter form).</td> |
532 | </tr> |
533 | <tr> |
534 | <td align="CENTER"><noBreak> </td> |
535 | <td>A no-break version of a space or hyphen.</td> |
536 | </tr> |
537 | <tr> |
538 | <td align="CENTER"><initial> </td> |
539 | <td>An initial presentation form (Arabic).</td> |
540 | </tr> |
541 | <tr> |
542 | <td align="CENTER"><medial> </td> |
543 | <td>A medial presentation form (Arabic).</td> |
544 | </tr> |
545 | <tr> |
546 | <td align="CENTER"><final> </td> |
547 | <td>A final presentation form (Arabic).</td> |
548 | </tr> |
549 | <tr> |
550 | <td align="CENTER"><isolated> </td> |
551 | <td>An isolated presentation form (Arabic).</td> |
552 | </tr> |
553 | <tr> |
554 | <td align="CENTER"><circle> </td> |
555 | <td>An encircled form.</td> |
556 | </tr> |
557 | <tr> |
558 | <td align="CENTER"><super> </td> |
559 | <td>A superscript form.</td> |
560 | </tr> |
561 | <tr> |
562 | <td align="CENTER"><sub> </td> |
563 | <td>A subscript form.</td> |
564 | </tr> |
565 | <tr> |
566 | <td align="CENTER"><vertical> </td> |
567 | <td>A vertical layout presentation form.</td> |
568 | </tr> |
569 | <tr> |
570 | <td align="CENTER"><wide> </td> |
571 | <td>A wide (or zenkaku) compatibility character.</td> |
572 | </tr> |
573 | <tr> |
574 | <td align="CENTER"><narrow> </td> |
575 | <td>A narrow (or hankaku) compatibility character.</td> |
576 | </tr> |
577 | <tr> |
578 | <td align="CENTER"><small> </td> |
579 | <td>A small variant form (CNS compatibility).</td> |
580 | </tr> |
581 | <tr> |
582 | <td align="CENTER"><square> </td> |
583 | <td>A CJK squared font variant.</td> |
584 | </tr> |
585 | <tr> |
586 | <td align="CENTER"><fraction> </td> |
587 | <td>A vulgar fraction form.</td> |
588 | </tr> |
589 | <tr> |
590 | <td align="CENTER"><compat> </td> |
591 | <td>Otherwise unspecified compatibility character.</td> |
592 | </tr> |
593 | </table> |
594 | <p><b>Reminder: </b>There is a difference between decomposition and |
595 | decomposition mapping. The decomposition mappings are defined in UnicodeData.txt, |
596 | while the decomposition (also termed "full decomposition") is defined |
597 | in Chapter 3 to use those mappings <i>recursively.</i> |
598 | <ul> |
599 | <li>The canonical decomposition is formed by recursively applying the |
600 | canonical mappings, then applying the canonical reordering algorithm.</li> |
601 | <li>The compatibility decomposition is formed by recursively applying the |
602 | canonical <em>and</em> compatibility mappings, then applying the canonical |
603 | reordering algorithm.</li> |
604 | </ul> |
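<p>The distinction between a decomposition mapping and a full decomposition can be sketched as follows. This is an illustration under assumptions, not the normative algorithm: the helper names <tt>parse_mapping</tt> and <tt>decompose</tt> are hypothetical, the mapping field is assumed to be an optional tag followed by space-separated hexadecimal code values, canonical reordering is deliberately left out, and the algorithmic Hangul syllable decompositions are not handled.</p>

```python
def parse_mapping(field):
    """Split a decomposition mapping field into (tag, code_values).

    tag is None for a canonical mapping, or the tag name (for example
    "fraction") for a compatibility mapping.
    """
    parts = field.split()
    tag = None
    if parts and parts[0].startswith("<"):
        tag = parts[0].strip("<>")
        parts = parts[1:]
    return tag, [int(part, 16) for part in parts]


def decompose(code_value, mappings, compatibility=False):
    """Recursively apply mappings to a single code value.

    mappings maps a code value to the result of parse_mapping().
    With compatibility=False only untagged (canonical) mappings are
    followed; with compatibility=True tagged mappings are followed too.
    Canonical reordering is NOT applied here.
    """
    tag, targets = mappings.get(code_value, (None, []))
    if not targets or (tag is not None and not compatibility):
        return [code_value]
    expanded = []
    for target in targets:
        expanded.extend(decompose(target, mappings, compatibility))
    return expanded


# Illustrative sample: U+01D5 maps to U+00DC U+0304, and U+00DC maps
# in turn to U+0055 U+0308; U+00BD carries a compatibility mapping.
SAMPLE_MAPPINGS = {
    0x01D5: parse_mapping("00DC 0304"),
    0x00DC: parse_mapping("0055 0308"),
    0x00BD: parse_mapping("<fraction> 0031 2044 0032"),
}
```

<p>With this sample, <tt>decompose(0x01D5, SAMPLE_MAPPINGS)</tt> follows two levels of canonical mappings, while U+00BD is left unchanged unless <tt>compatibility=True</tt> is passed.</p>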
605 | <h3><a name="Canonical Combining Classes"></a>Canonical Combining Classes</h3> |
06bfd75b |
606 | <table border="0" cellspacing="0" cellpadding="4"> |
15ca46df |
607 | <tr> |
608 | <th> |
609 | <p align="LEFT">Value</th> |
610 | <th> |
611 | <p align="LEFT">Description</th> |
612 | </tr> |
613 | <tr> |
614 | <td align="RIGHT">0:</td> |
615 | <td>Spacing, split, enclosing, reordrant, and Tibetan subjoined</td> |
616 | </tr> |
617 | <tr> |
618 | <td align="RIGHT">1:</td> |
619 | <td>Overlays and interior</td> |
620 | </tr> |
621 | <tr> |
622 | <td align="RIGHT">7:</td> |
623 | <td>Nuktas</td> |
624 | </tr> |
625 | <tr> |
626 | <td align="RIGHT">8:</td> |
627 | <td>Hiragana/Katakana voicing marks</td> |
628 | </tr> |
629 | <tr> |
630 | <td align="RIGHT">9:</td> |
631 | <td>Viramas</td> |
632 | </tr> |
633 | <tr> |
634 | <td align="RIGHT">10:</td> |
635 | <td>Start of fixed position classes</td> |
636 | </tr> |
637 | <tr> |
638 | <td align="RIGHT">199:</td> |
639 | <td>End of fixed position classes</td> |
640 | </tr> |
641 | <tr> |
642 | <td align="RIGHT">200:</td> |
643 | <td>Below left attached</td> |
644 | </tr> |
645 | <tr> |
646 | <td align="RIGHT">202:</td> |
647 | <td>Below attached</td> |
648 | </tr> |
649 | <tr> |
650 | <td align="RIGHT">204:</td> |
651 | <td>Below right attached</td> |
652 | </tr> |
653 | <tr> |
654 | <td align="RIGHT">208:</td> |
655 | <td>Left attached (reordrant around single base character)</td> |
656 | </tr> |
657 | <tr> |
658 | <td align="RIGHT">210:</td> |
659 | <td>Right attached</td> |
660 | </tr> |
661 | <tr> |
662 | <td align="RIGHT">212:</td> |
663 | <td>Above left attached</td> |
664 | </tr> |
665 | <tr> |
666 | <td align="RIGHT">214:</td> |
667 | <td>Above attached</td> |
668 | </tr> |
669 | <tr> |
670 | <td align="RIGHT">216:</td> |
671 | <td>Above right attached</td> |
672 | </tr> |
673 | <tr> |
674 | <td align="RIGHT">218:</td> |
675 | <td>Below left</td> |
676 | </tr> |
677 | <tr> |
678 | <td align="RIGHT">220:</td> |
679 | <td>Below</td> |
680 | </tr> |
681 | <tr> |
682 | <td align="RIGHT">222:</td> |
683 | <td>Below right</td> |
684 | </tr> |
685 | <tr> |
686 | <td align="RIGHT">224:</td> |
687 | <td>Left (reordrant around single base character)</td> |
688 | </tr> |
689 | <tr> |
690 | <td align="RIGHT">226:</td> |
691 | <td>Right</td> |
692 | </tr> |
693 | <tr> |
694 | <td align="RIGHT">228:</td> |
695 | <td>Above left</td> |
696 | </tr> |
697 | <tr> |
698 | <td align="RIGHT">230:</td> |
699 | <td>Above</td> |
700 | </tr> |
701 | <tr> |
702 | <td align="RIGHT">232:</td> |
703 | <td>Above right</td> |
704 | </tr> |
705 | <tr> |
706 | <td align="RIGHT">233:</td> |
707 | <td>Double below</td> |
708 | </tr> |
709 | <tr> |
710 | <td align="RIGHT">234:</td> |
711 | <td>Double above</td> |
712 | </tr> |
713 | <tr> |
714 | <td align="RIGHT">240:</td> |
715 | <td>Below (iota subscript)</td> |
716 | </tr> |
717 | </table> |
718 | <p><strong>Note: </strong>some of the combining classes in this list do not |
719 | currently have members but are specified here for completeness.</p> |
720 | <h3><a name="Decompositions and Normalization"></a>Decompositions and |
721 | Normalization</h3> |
722 | <p>Decomposition is specified in Chapter 3. <a |
723 | href="http://www.unicode.org/unicode/reports/tr15/"><i>Unicode Standard Annex |
06bfd75b |
724 | #15: Unicode Normalization Forms</i></a> specifies the interaction between |
725 | decomposition and normalization. That report specifies how the decompositions |
726 | defined in UnicodeData.txt are used to derive normalized forms of Unicode text.</p> |
15ca46df |
727 | <p>Note that as of the 2.1.9 update of the Unicode Character Database, the |
728 | decompositions in the UnicodeData.txt file can be used to recursively derive the |
729 | full decomposition in canonical order, without the need to separately apply |
730 | canonical reordering. However, canonical reordering of combining character |
06bfd75b |
731 | sequences <b><i>must</i></b> still be applied in decomposition when normalizing |
732 | source text which contains any combining marks.</p> |
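<p>Canonical reordering of arbitrary source text can be sketched as a stable sort of each run of characters with non-zero combining class. The sketch below is an illustration only; the function name <tt>canonical_reorder</tt> and the small <tt>SAMPLE_CLASSES</tt> table are hypothetical stand-ins for a lookup built from field 3 of UnicodeData.txt.</p>

```python
def canonical_reorder(code_values, combining_class):
    """Return a copy of code_values with combining marks reordered.

    combining_class is a callable returning the canonical combining
    class of a code value. Characters of class 0 never move; each run
    of non-zero classes between them is sorted by class. Python's
    sorted() is stable, so marks of equal class keep their order.
    """
    result = list(code_values)
    i = 0
    while i < len(result):
        if combining_class(result[i]) == 0:
            i += 1
            continue
        j = i
        while j < len(result) and combining_class(result[j]) != 0:
            j += 1
        result[i:j] = sorted(result[i:j], key=combining_class)
        i = j
    return result


# Illustrative classes: U+0301 and U+0308 are "Above" (230),
# U+0323 is "Below" (220); everything else is treated as class 0.
SAMPLE_CLASSES = {0x0301: 230, 0x0308: 230, 0x0323: 220}


def sample_class(code_value):
    return SAMPLE_CLASSES.get(code_value, 0)
```

<p>For example, <tt>canonical_reorder([0x0061, 0x0301, 0x0323], sample_class)</tt> moves the class 220 mark ahead of the class 230 mark, while two marks of the same class are left in their original order.</p>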
15ca46df |
733 | <h3><a name="Case Mappings"></a>Case Mappings</h3> |
06bfd75b |
734 | <p>There are a number of complications to case mappings that occur once the |
735 | repertoire of characters is expanded beyond ASCII. For more information, see <a |
736 | href="http://www.unicode.org/unicode/reports/tr21/">UTR #21: Case Mappings</a>.</p> |
737 | <p>For compatibility with existing parsers, UnicodeData.txt only contains case |
15ca46df |
738 | mappings for characters where they are one-to-one mappings; it also omits |
739 | information about context-sensitive case mappings. Information about these |
06bfd75b |
740 | special cases can be found in a separate data file, SpecialCasing.txt.</p> |
15ca46df |
741 | <h2><a name="Property Invariants"></a>Property Invariants</h2> |
742 | <p>Values in UnicodeData.txt are subject to correction as errors are found; |
743 | however, some characteristics of the categories themselves can be considered |
744 | invariants. Applications may wish to take these invariants into account when |
06bfd75b |
745 | choosing how to implement character properties. For more information, see <a |
746 | href="http://www.unicode.org/unicode/standard/policies.html">Unicode Policies</a>.</p> |
747 | <p>The following is a partial list of known invariants for the Unicode Character |
748 | Database.</p> |
15ca46df |
749 | <h4>Database Fields</h4> |
750 | <ul> |
751 | <li>The number of fields in UnicodeData.txt is fixed.</li> |
752 | <li>The order of the fields is also fixed. |
753 | <ul> |
754 | <li>Any additional information about character properties to be added in |
755 | the future will appear in separate data tables, rather than being added |
756 | on to the existing table or by subdivision or reinterpretation of |
757 | existing fields.</li> |
758 | </ul> |
759 | </li> |
760 | </ul> |
761 | <h4>General Category</h4> |
762 | <ul> |
763 | <li>There will never be more than 32 General Category values. |
764 | <ul> |
765 | <li>It is very unlikely that the Unicode Technical Committee will |
766 | subdivide the General Category partition any further, since that can |
767 | cause implementations to misbehave. Because the General Category is |
768 | limited to 32 values, 5 bits can be used to represent the information, |
769 | and a 32-bit integer can be used as a bitmask to represent arbitrary |
770 | sets of categories.</li> |
771 | </ul> |
772 | </li> |
773 | </ul> |
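<p>The bitmask representation mentioned above can be sketched as follows. This is one possible layout, not a prescribed one: the bit assignment simply follows the order of the General Category table in this document, and the names <tt>CATEGORY_BIT</tt>, <tt>category_mask</tt> and <tt>in_set</tt> are hypothetical.</p>

```python
# The 30 General Category abbreviations, in the order of the table above.
GENERAL_CATEGORIES = [
    "Lu", "Ll", "Lt", "Lm", "Lo",
    "Mn", "Mc", "Me",
    "Nd", "Nl", "No",
    "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
    "Sm", "Sc", "Sk", "So",
    "Zs", "Zl", "Zp",
    "Cc", "Cf", "Cs", "Co", "Cn",
]

# Each category gets one bit; with at most 32 categories, every set of
# categories fits in a 32-bit integer.
CATEGORY_BIT = {abbr: 1 << index
                for index, abbr in enumerate(GENERAL_CATEGORIES)}


def category_mask(*abbrs):
    """Combine category abbreviations into a single bitmask."""
    mask = 0
    for abbr in abbrs:
        mask |= CATEGORY_BIT[abbr]
    return mask


def in_set(abbr, mask):
    """Return True if the category abbr belongs to the set in mask."""
    return bool(CATEGORY_BIT[abbr] & mask)


# "L&" as described in the note on cased letters: Lu, Ll, or Lt.
CASED_LETTER = category_mask("Lu", "Ll", "Lt")
```

<p>Testing membership in a set such as <tt>CASED_LETTER</tt> then costs a single bitwise AND, which is the implementation advantage the 32-value limit is meant to preserve.</p>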
774 | <h4>Combining Classes</h4> |
775 | <ul> |
776 | <li>Combining classes are limited to the values 0 to 255. |
777 | <ul> |
778 | <li>In practice, there are far fewer than 256 values used. Implementations |
779 | may take advantage of this fact for compression, since only the ordering |
780 | of the non-zero values matters for the Canonical Reordering Algorithm. |
781 | It is possible for up to 256 values to be used in the future; however, |
782 | UTC decisions in the future may restrict the number of values to 128, |
783 | since this has implementation advantages. [Signed bytes can be used |
784 | without widening to ints in Java, for example.]</li> |
785 | </ul> |
786 | </li> |
787 | <li>All characters other than those of General Category M* have the combining |
788 | class 0. |
789 | <ul> |
790 | <li>Currently, all characters other than those of General Category Mn have |
791 | the value 0. However, some characters of General Category Me or Mc may |
792 | be given non-zero values in the future.</li> |
793 | <li>The precise values above the value 0 are not invariant--only the |
794 | relative ordering is considered normative. For example, it is not |
795 | guaranteed in future versions that the class of U+05B4 will be precisely |
796 | 14.</li> |
797 | </ul> |
798 | </li> |
799 | </ul> |
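<p>Since only the relative ordering of non-zero combining classes matters, an
implementation can replace the class values with dense ranks. The sketch below
shows one way to do this; it assumes Python's <tt>unicodedata</tt> module as
the data source (so the set of values reflects that module's Unicode version),
and <tt>build_rank_table</tt> is a hypothetical helper name.</p>

```python
import sys
import unicodedata


def build_rank_table():
    """Map each combining class in use to a dense, order-preserving rank."""
    used = sorted(
        {unicodedata.combining(chr(cp)) for cp in range(sys.maxunicode + 1)}
    )
    # Class 0 is the smallest value, so it always receives rank 0.
    return {ccc: rank for rank, ccc in enumerate(used)}


RANKS = build_rank_table()


def compressed_class(ch):
    """Return the dense rank standing in for the combining class of ch."""
    return RANKS[unicodedata.combining(ch)]


# Relative order is preserved: a below-attached mark still sorts
# before an above-attached mark after compression.
print(compressed_class("\u0323") < compressed_class("\u0301"))
```

<p>Because the precise non-zero values are not invariant, such a table should
be regenerated from the data for each version rather than hard-coded.</p>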
15ca46df |
800 | <h4>Canonical Decomposition</h4> |
801 | <ul> |
802 | <li>Canonical mappings are always in canonical order.</li> |
803 | <li>In a canonical mapping to a pair of characters, only the first character |
804 | of the pair can decompose further.</li> |
805 | <li>Canonical decompositions are "transparent" to other character |
806 | data: |
807 | <ul> |
808 | <li><tt>BIDI(a) = BIDI(principal(canonicalDecomposition(a)))</tt></li> |
809 | <li><tt>Category(a) = Category(principal(canonicalDecomposition(a)))</tt></li> |
810 | <li><tt>CombiningClass(a) = |
811 | CombiningClass(principal(canonicalDecomposition(a)))</tt><br> |
812 | where principal(a) is the first character not of type Mn, or the first |
813 | character if all characters are of type Mn.</li> |
814 | </ul> |
815 | </li> |
816 | <li>However, because some case pairs are missing and because of some legacy |
817 | characters, the following hold for most characters but not all: |
818 | <ul> |
819 | <li><tt>upper(canonicalDecomposition(a)) = canonicalDecomposition(upper(a))</tt></li> |
820 | <li><tt>lower(canonicalDecomposition(a)) = canonicalDecomposition(lower(a))</tt></li> |
821 | <li><tt>title(canonicalDecomposition(a)) = canonicalDecomposition(title(a))</tt></li> |
822 | </ul> |
823 | </li> |
824 | </ul> |
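<p>The transparency relations above can be checked for individual characters
with a short sketch. It assumes Python's <tt>unicodedata</tt> module as the
data source, so results reflect that module's Unicode version rather than this
one; the function names mirror the notation used above and are otherwise
hypothetical.</p>

```python
import unicodedata


def canonical_decomposition(ch):
    """Return the one-level canonical mapping of ch, or ch itself if none."""
    fields = unicodedata.decomposition(ch).split()
    # An empty mapping, or one starting with a <tag>, is not canonical.
    if not fields or fields[0].startswith("<"):
        return ch
    return "".join(chr(int(field, 16)) for field in fields)


def principal(s):
    """First character not of type Mn, or the first character if all are Mn."""
    for c in s:
        if unicodedata.category(c) != "Mn":
            return c
    return s[0]


def is_transparent(ch):
    """Check the three transparency relations for a single character."""
    p = principal(canonical_decomposition(ch))
    return (
        unicodedata.bidirectional(ch) == unicodedata.bidirectional(p)
        and unicodedata.category(ch) == unicodedata.category(p)
        and unicodedata.combining(ch) == unicodedata.combining(p)
    )


# U+00C5 decomposes to U+0041 U+030A, so its principal character is "A".
print(is_transparent("\u00c5"))
```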
825 | <h2><a name="Modification History"></a>Modification History</h2> |
826 | <p>This section provides a summary of the changes between update versions of the |
827 | Unicode Standard.</p> |
828 | <h3><a |
06bfd75b |
829 | href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 3.1">Unicode |
830 | 3.1</a></h3> |
831 | <p>Modifications made for Version 3.1.0 of UnicodeData.txt include: |
832 | <ul> |
833 | <li>Addition of 2237 new entries, to cover new characters and new ranges of |
834 | unified Han characters encoded in Unicode 3.1.</li> |
835 | <li>Changed General Category value of 16EE..16F0 (Runic golden numbers) from |
836 | No to Nl.</li> |
837 | </ul> |
838 | <h3><a |
15ca46df |
839 | href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 3.0.1">Unicode |
840 | 3.0.1</a></h3> |
841 | <p>Modifications made for Version 3.0.1 of UnicodeData.txt include: |
842 | <ul> |
843 | <li>Added 5- and 6-digit representation of code points past U+FFFF.</li> |
844 | <li>Added Private Use range definitions for Planes 15 and 16.</li> |
845 | <li>Minor additions for the 10646 comment field.</li> |
846 | </ul> |
847 | <h3><a |
848 | href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 3.0.0">Unicode |
849 | 3.0.0</a></h3> |
850 | <p>Modifications made for Version 3.0.0 of UnicodeData.txt include many new |
851 | characters and a number of property changes. These are summarized in Appendix D |
852 | of <em>The Unicode Standard, Version 3.0.</em></p> |
853 | <h3><a |
854 | href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.9">Unicode |
855 | 2.1.9</a></h3> |
856 | <p>Modifications made for Version 2.1.9 of UnicodeData.txt include: |
857 | <ul> |
858 | <li>Corrected combining class for U+05AE HEBREW ACCENT ZINOR.</li> |
859 | <li>Corrected combining class for U+20E1 COMBINING LEFT RIGHT ARROW ABOVE.</li> |
860 | <li>Corrected combining class for U+0F35 and U+0F37 to 220.</li> |
861 | <li>Corrected combining class for U+0F71 to 129.</li> |
862 | <li>Added a decomposition for U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR.</li> |
863 | <li>Added decompositions for several Greek symbol letters: |
864 | U+03D0..U+03D2, U+03D5, U+03D6, U+03F0..U+03F2.</li> |
865 | <li>Removed decompositions from the conjoining jamo block: |
866 | U+1100..U+11F8.</li> |
867 | <li>Changes to decomposition mappings for some Tibetan vowels for consistency |
868 | in normalization. (U+0F71, U+0F73, U+0F77, U+0F79, U+0F81)</li> |
869 | <li>Updated the decomposition mappings for several Vietnamese characters with |
870 | two diacritics (U+1EAC, U+1EAD, U+1EB6, U+1EB7, U+1EC6, U+1EC7, U+1ED8, |
871 | U+1ED9), so that the recursive decomposition can be generated directly in |
872 | canonically reordered form (not a normative change).</li> |
873 | <li>Updated the decomposition mappings for several Arabic compatibility |
874 | characters involving shadda (U+FC5E..U+FC62, U+FCF2..U+FCF4), and two Latin |
875 | characters (U+1E1C, U+1E1D), so that the decompositions are generated |
876 | directly in canonically reordered form (not a normative change).</li> |
877 | <li>Changed BIDI category for: U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE, |
878 | U+2028 LINE SEPARATOR.</li> |
879 | <li>Changed BIDI category for extenders of General Category Lm: U+3005, |
880 | U+3021..U+3035, U+FF9E, U+FF9F.</li> |
881 | <li>Changed General Category and BIDI category for the Greek numeral signs: |
882 | U+0374, U+0375.</li> |
883 | <li>Corrected General Category for U+FFE8 HALFWIDTH FORMS LIGHT VERTICAL.</li> |
884 | <li>Added Unicode 1.0 names for many Tibetan characters (informative).</li> |
885 | </ul> |
886 | <h3><a |
887 | href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.8">Unicode |
888 | 2.1.8</a></h3> |
889 | <p>Modifications made for Version 2.1.8 of UnicodeData.txt include: |
890 | <ul> |
891 | <li>Added combining class 240 for U+0345 COMBINING GREEK YPOGEGRAMMENI so that |
892 | decompositions involving iota subscript are derivable directly in |
893 | canonically reordered form; this also has a bearing on simplification of |
894 | casing of polytonic Greek.</li> |
895 | <li>Changes in decompositions related to Greek tonos. These result from the |
896 | clarification that monotonic Greek "tonos" should be equated with |
897 | U+0301 COMBINING ACUTE, rather than with U+030D COMBINING VERTICAL LINE |
898 | ABOVE. (All Greek characters in the Greek block involving "tonos"; |
899 | some Greek characters in the polytonic Greek in the 1FXX block.)</li> |
900 | <li>Changed decompositions involving dialytika tonos. (U+0390, U+03B0)</li> |
901 | <li>Changed ternary decompositions to binary. (U+0CCB, U+FB2C, U+FB2D) These |
902 | changes simplify normalization.</li> |
903 | <li>Removed canonical decomposition for Latin Candrabindu. (U+0310)</li> |
904 | <li>Corrected error in canonical decomposition for U+1FF4.</li> |
905 | <li>Added compatibility decompositions to clarify collation tables. (U+2100, |
906 | U+2101, U+2105, U+2106, U+1E9A)</li> |
907 | <li>A series of general category changes to assist the convergence of the |
908 | Unicode definition of identifier with ISO TR 10176: |
909 | <ul> |
910 | <li>So > Lo: U+0950, U+0AD0, U+0F00, U+0F88..U+0F8B</li> |
911 | <li>Po > Lo: U+0E2F, U+0EAF, U+3006</li> |
912 | <li>Lm > Sk: U+309B, U+309C</li> |
913 | <li>Po > Pc: U+30FB, U+FF65</li> |
914 | <li>Ps/Pe > Mn: U+0F3E, U+0F3F</li> |
915 | </ul> |
916 | </li> |
917 | <li>A series of bidi property changes for consistency. |
918 | <ul> |
919 | <li>L > ET: U+09F2, U+09F3</li> |
920 | <li>ON > L: U+3007</li> |
921 | <li>L > ON: U+0F3A..U+0F3D, U+037E, U+0387</li> |
922 | </ul> |
923 | </li> |
924 | <li>Added case mapping: U+01A6 <-> U+0280.</li> |
925 | <li>Updated symmetric swapping value for guillemets: U+00AB, U+00BB, U+2039, |
926 | U+203A.</li> |
927 | <li>Changes to combining class values. Most Indic fixed position class |
928 | non-spacing marks were changed to combining class 0. This fixes some |
929 | inconsistencies in how canonical reordering would apply to Indic scripts, |
930 | including Tibetan. Indic interacting top/bottom fixed position classes were |
931 | merged into single (non-zero) classes as part of this change. Tibetan |
932 | subjoined consonants are changed from combining class 6 to combining class |
933 | 0. Thai pinthu (U+0E3A) moved to combining class 9. Moved two Devanagari |
934 | stress marks into generic above and below combining classes (U+0951, |
935 | U+0952).</li> |
936 | <li>Corrected placement of semicolon near symmetric swapping field. (U+FA0E, |
937 | etc., scattered positions to U+FA29)</li> |
938 | </ul> |
939 | <h3>Version 2.1.7</h3> |
940 | <p><i>This version was for internal change tracking only, and never publicly |
941 | released.</i></p> |
942 | <h3>Version 2.1.6</h3> |
943 | <p><i>This version was for internal change tracking only, and never publicly |
944 | released.</i></p> |
945 | <h3><a |
946 | href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.5">Unicode |
947 | 2.1.5</a></h3> |
948 | <p>Modifications made for Version 2.1.5 of UnicodeData.txt include: |
949 | <ul> |
950 | <li>Changed decomposition for U+FF9E and U+FF9F so that correct collation |
951 | weighting will automatically result from the canonical equivalences.</li> |
952 | <li>Removed canonical decompositions for U+04D4, U+04D5, U+04D8, U+04D9, |
953 | U+04E0, U+04E1, U+04E8, U+04E9 (the implication being that no canonical |
954 | equivalence is claimed between these 8 characters and similar Latin |
955 | letters), and updated 4 canonical decompositions for U+04DB, U+04DC, U+04EA, |
956 | U+04EB to reflect the implied difference in the base character.</li> |
957 | <li>Added Pi and Pf categories and assigned the relevant quotation marks to |
958 | those categories, based on the Unicode Technical Corrigendum on Quotation |
959 | Characters.</li> |
960 | <li>Updated many bidi properties, following the advice of the ad hoc |
961 | committee on bidi, and to make the bidi properties of compatibility |
962 | characters more consistent.</li> |
963 | <li>Changed category of several Tibetan characters: U+0F3E, U+0F3F, |
964 | U+0F88..U+0F8B to make them non-combining, reflecting the combined opinion |
965 | of Tibetan experts.</li> |
966 | <li>Added case mapping for U+03F2.</li> |
967 | <li>Corrected case mapping for U+0275.</li> |
968 | <li>Added titlecase mappings for U+03D0, U+03D1, U+03D5, U+03D6, U+03F0.. |
969 | U+03F2.</li> |
970 | <li>Corrected compatibility label for U+2121.</li> |
971 | <li>Added specific entries for all the CJK compatibility ideographs, |
972 | U+F900..U+FA2D, so the canonical decomposition for each (the URO character |
973 | it is equivalent to) can be carried in the database.</li> |
974 | </ul> |
975 | <h3>Version 2.1.4</h3> |
976 | <p><i>This version was for internal change tracking only, and never publicly |
977 | released.</i></p> |
978 | <h3>Version 2.1.3</h3> |
979 | <p><i>This version was for internal change tracking only, and never publicly |
980 | released.</i></p> |
981 | <h3><a |
982 | href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.2">Unicode |
983 | 2.1.2</a></h3> |
984 | <p>Modifications made in updating UnicodeData.txt to Version 2.1.2 for the |
985 | Unicode Standard, Version 2.1 (from Version 2.0) include: |
986 | <ul> |
987 | <li>Added two characters (U+20AC and U+FFFC).</li> |
988 | <li>Amended bidi properties for U+0026, U+002E, U+0040, U+2007.</li> |
989 | <li>Corrected case mappings for U+018E, U+019F, U+01DD, U+0258, U+0275, |
990 | U+03C2, U+1E9B.</li> |
991 | <li>Changed combining order class for U+0F71.</li> |
992 | <li>Corrected canonical decompositions for U+0F73, U+1FBE.</li> |
993 | <li>Changed decomposition for U+FB1F from compatibility to canonical.</li> |
994 | <li>Added compatibility decompositions for U+FBE8, U+FBE9, U+FBF9..U+FBFB.</li> |
995 | <li>Corrected compatibility decompositions for U+2469, U+246A, U+3358.</li> |
996 | </ul> |
997 | <h3>Version 2.1.1</h3> |
998 | <p><i>This version was for internal change tracking only, and never publicly |
999 | released.</i></p> |
1000 | <h3><a |
1001 | href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.0.0">Unicode |
1002 | 2.0.0</a></h3> |
1003 | <p>The modifications made in updating UnicodeData.txt for the Unicode Standard, |
1004 | Version 2.0 include: |
1005 | <ul> |
1006 | <li>Fixed decompositions with TONOS to use correct NSM: 030D.</li> |
1007 | <li>Removed old Hangul Syllables; mappings to new characters are in a separate |
1008 | table.</li> |
1009 | <li>Marked compatibility decompositions with additional tags.</li> |
1010 | <li>Changed old tag names for clarity.</li> |
1011 | <li>Revision of decompositions to use first-level decomposition, instead of |
1012 | maximal decomposition.</li> |
1013 | <li>Correction of all known errors in decompositions from earlier versions.</li> |
1014 | <li>Added control code names (as old Unicode names).</li> |
1015 | <li>Added Hangul Jamo decompositions.</li> |
1016 | <li>Added Number category to match properties list in book.</li> |
1017 | <li>Fixed categories of Koranic Arabic marks.</li> |
1018 | <li>Fixed categories of precomposed characters to match decomposition where |
1019 | possible.</li> |
1020 | <li>Added Hebrew cantillation marks and the Tibetan script.</li> |
1021 | <li>Added place holders for ranges such as CJK Ideographic Area and the |
1022 | Private Use Area.</li> |
1023 | <li>Added categories Me, Sk, Pc, Nl, Cs, Cf, and rectified a number of |
1024 | mistakes in the database.</li> |
1025 | </ul> |
06bfd75b |
1026 | <h2><i><a name="UCD_Terms">UCD Terms of Use</a></i></h2> |
1027 | <h3><i>Disclaimer</i></h3> |
1028 | <blockquote> |
1029 | <p><i>The Unicode Character Database is provided as is by Unicode, Inc. No |
1030 | claims are made as to fitness for any particular purpose. No warranties of any |
1031 | kind are expressed or implied. The recipient agrees to determine applicability |
1032 | of information provided. If this file has been purchased on magnetic or |
1033 | optical media from Unicode, Inc., the sole remedy for any claim will be |
1034 | exchange of defective media within 90 days of receipt.</i></p> |
1035 | <p><i>This disclaimer is applicable for all other data files accompanying the |
1036 | Unicode Character Database, some of which have been compiled by the Unicode |
1037 | Consortium, and some of which have been supplied by other sources.</i></p> |
1038 | </blockquote> |
1039 | <h3><i>Limitations on Rights to Redistribute This Data</i></h3> |
1040 | <blockquote> |
1041 | <p><i>Recipient is granted the right to make copies in any form for internal |
1042 | distribution and to freely use the information supplied in the creation of |
1043 | products supporting the Unicode<sup>TM</sup> Standard. The files in the |
1044 | Unicode Character Database can be redistributed to third parties or other |
1045 | organizations (whether for profit or not) as long as this notice and the |
1046 | disclaimer notice are retained. Information can be extracted from these files |
1047 | and used in documentation or programs, as long as there is an accompanying |
1048 | notice indicating the source.</i></p> |
1049 | </blockquote> |
1050 | <hr width="50%"> |
1051 | <div align="center"> |
1052 | <center> |
1053 | <table cellspacing="0" cellpadding="0" border="0"> |
1054 | <tr> |
1055 | <td><a href="http://www.unicode.org/unicode/copyright.html"><img |
1056 | src="http://www.unicode.org/img/hb_home.gif" border="0" alt="Home" |
1057 | width="40" height="49"><img src="http://www.unicode.org/img/hb_mid.gif" |
1058 | border="0" alt="Terms of Use" width="152" height="49"><img |
1059 | src="http://www.unicode.org/img/hb_mail.gif" border="0" alt="E-mail" |
1060 | width="46" height="49"></a></td> |
1061 | </tr> |
1062 | </table> |
1063 | </center> |
1064 | </div> |
15ca46df |
1065 | |
1066 | </body> |
1067 | |
1068 | </html> |