Commit | Line | Data |
a0ed51b3 |
1 | |
2 | UNICODE 2.0 CHARACTER DATABASE |
3 | |
4 | Copyright (c) 1991-1996 Unicode, Inc. |
5 | All Rights reserved. |
6 | |
7 | DISCLAIMER |
8 | |
9 | The Unicode Character Database "UNIDATA2.TXT" is provided as-is by |
10 | Unicode, Inc. (The Unicode Consortium). No claims are made as to fitness for any |
11 | particular purpose. No warranties of any kind are expressed or implied. The |
12 | recipient agrees to determine applicability of information provided. If this |
13 | file has been purchased on magnetic or optical media from Unicode, Inc., |
14 | the sole remedy for any claim will be exchange of defective media within |
15 | 90 days of receipt. |
16 | |
17 | This disclaimer is applicable for all other data files accompanying the |
18 | Unicode Character Database, some of which have been compiled by the |
19 | Unicode Consortium, and some of which have been supplied by other vendors. |
20 | |
21 | LIMITATIONS ON RIGHTS TO REDISTRIBUTE THIS DATA |
22 | |
23 | Recipient is granted the right to make copies in any form for internal |
24 | distribution and to freely use the information supplied in the creation of |
25 | products supporting the Unicode (TM) Standard. This file can be redistributed |
26 | to third parties or other organizations (whether for profit or not) as long |
27 | as this notice and the disclaimer notice are retained. |
28 | |
29 | EXPLANATORY INFORMATION |
30 | |
31 | The Unicode Character Database defines the default Unicode character |
32 | properties and internal mappings. Particular implementations may choose to |
33 | override the properties and mappings that are not normative. If that is done, |
34 | it is up to the implementer to establish a protocol to convey that |
35 | information. For more information about character properties and mappings, |
36 | see "The Unicode Standard, Worldwide Character Encoding, Version 2.0", |
37 | published by Addison-Wesley. For information about other data files |
38 | accompanying the Unicode Character Database, see the section of the |
39 | Unicode Standard they were extracted from, or the explanatory readme |
40 | files and/or header sections with those files. |
41 | |
42 | The Unicode Character Database is a plain ASCII text file consisting of lines |
43 | containing fields terminated by semicolons. Each line represents the data for |
44 | one encoded character in the Unicode Standard, Version 2.0. Every encoded |
45 | character has a data entry, with the exception of certain special ranges, as |
46 | detailed below. |
47 | |
48 | There are five special ranges of characters that are represented only by |
49 | their start and end characters, since the properties in the file are uniform, |
50 | except for code values (which are all sequential and assigned). The names of CJK |
51 | ideograph characters and Hangul syllable characters are algorithmically |
52 | derivable. (See the Unicode Standard for more information.) Surrogate |
53 | characters and private use characters have no names. |
54 | |
55 | The exact ranges represented by start and end characters are: |
56 | |
57 | The CJK Ideographs Area (U+4E00 - U+9FFF) |
58 | The Hangul Syllables Area (U+AC00 - U+D7A3) |
59 | The Surrogates Area (U+D800 - U+DFFF) |
60 | The Private Use Area (U+E000 - U+F8FF) |
61 | CJK Compatibility Ideographs (U+F900 - U+FAFF) |
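
A reader of the data file has to treat these five ranges specially, because only their first and last code points appear as entries. The following is a minimal sketch of such a check, using the range boundaries listed above; the names SPECIAL_RANGES and special_range are illustrative and not part of the database.

```python
# The five ranges listed above. Each is represented in the data file
# only by its start and end characters.
SPECIAL_RANGES = [
    (0x4E00, 0x9FFF, "CJK Ideographs Area"),
    (0xAC00, 0xD7A3, "Hangul Syllables Area"),
    (0xD800, 0xDFFF, "Surrogates Area"),
    (0xE000, 0xF8FF, "Private Use Area"),
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
]


def special_range(code_point):
    """Return the name of the special range containing code_point, or None."""
    for first, last, name in SPECIAL_RANGES:
        if first <= code_point <= last:
            return name
    return None
```

A lookup for a code point that has no entry of its own could first call special_range() and, if it returns a name, reuse the properties of that range's start entry.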
62 | |
63 | The following table describes the format and meaning of each field in a |
64 | data entry in the Unicode Character Database. Fields which contain |
65 | normative information are so indicated. |
66 | |
67 | Field Explanation |
68 | ----- ----------- |
69 | |
70 | 0 Code value in 4-digit hexadecimal format. |
71 | This field is normative. |
72 | |
73 | 1 Unicode 2.0 Character Name. These names match exactly the |
74 | names published in Chapter 7 of the Unicode Standard. |
75 | This field is normative. |
76 | |
77 | 2 General Category. This is a useful breakdown into various "character |
78 | types" which can be used as a default categorization in implementations. |
79 | Some of the values are normative, and some are informative. |
80 | See below for a brief explanation. |
81 | |
82 | 3 Canonical Combining Classes. The classes used for the |
83 | Canonical Ordering Algorithm in the Unicode Standard. These |
84 | classes are also printed in Chapter 4 of the Unicode Standard. |
85 | This field is normative. See below for a brief explanation. |
86 | |
87 | 4 Bidirectional Category. See the list below for an explanation of the |
88 | abbreviations used in this field. These are the categories required |
89 | by the Bidirectional Behavior Algorithm in the Unicode Standard. |
90 | These categories are summarized in Chapter 4 of the Unicode Standard. |
91 | This field is normative. |
92 | |
93 | 5 Character Decomposition. In the Unicode Standard, Version 2.0, not all of |
94 | the decompositions are full decompositions. Recursive |
95 | application of look-up for decompositions will, in all cases, lead to |
96 | a maximal decomposition. The decompositions match exactly the |
97 | decompositions published with the character names in Chapter 7 |
98 | of the Unicode Standard. This field is normative. |
99 | |
100 | 6 Decimal digit value. This is a numeric field. If the character |
101 | has the decimal digit property, as specified in Chapter 4 of |
102 | the Unicode Standard, the value of that digit is represented |
103 | with an integer value in this field. This field is normative. |
104 | |
105 | 7 Digit value. This is a numeric field. If the character represents a |
106 | digit, not necessarily a decimal digit, the value is here. This |
107 | covers digits which do not form decimal radix forms, such as the |
108 | compatibility superscript digits. This field is informative. |
109 | |
110 | 8 Numeric value. This is a numeric field. If the character has the |
111 | numeric property, as specified in Chapter 4 of the Unicode |
112 | Standard, the value of that character is represented with an |
113 | integer or rational number in this field. This includes fractions such as |
114 | "1/5" for U+2155 VULGAR FRACTION ONE FIFTH. |
115 | Also included are numerical values for compatibility characters |
116 | such as circled numbers. This field is normative. |
117 | |
118 | 9 If the character has been identified as a "mirrored" character in |
119 | bidirectional text, this field has the value "Y"; otherwise "N". |
120 | The list of mirrored characters is also printed in Chapter 4 of |
121 | the Unicode Standard. This field is normative. |
122 | |
123 | 10 Unicode 1.0 Name. This is the old name as published in Unicode 1.0. |
124 | This name is only provided when it is significantly different from |
125 | the Unicode 2.0 name for the character. This field is informative. |
126 | |
127 | 11 10646 Comment field. This field is informative. |
128 | |
129 | 12 Upper case equivalent mapping. If a character is part of an |
130 | alphabet with case distinctions, and has an upper case equivalent, |
131 | then the upper case equivalent is in this field. See the explanation |
132 | below on case distinctions. These mappings are always one-to-one, |
133 | not one-to-many or many-to-one. This field is informative. |
134 | |
135 | 13 Lower case equivalent mapping. Similar to 12. This field is informative. |
136 | |
137 | 14 Title case equivalent mapping. Similar to 12. This field is informative. |
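
The field table above maps directly onto a small record type. The sketch below splits one data line on semicolons and converts each field as the table describes. It is an illustration only: the names Entry and parse_entry are invented here, and the sketch assumes a line carries exactly fields 0 through 14, optionally followed by one extra trailing semicolon.

```python
from fractions import Fraction
from typing import NamedTuple, Optional

FIELD_COUNT = 15  # fields 0 through 14, as described in the table above


class Entry(NamedTuple):
    code: int                       # field 0
    name: str                       # field 1
    general_category: str           # field 2
    combining_class: int            # field 3
    bidi_category: str              # field 4
    decomposition: str              # field 5, left unparsed here
    decimal_digit: Optional[int]    # field 6
    digit: Optional[int]            # field 7
    numeric: Optional[Fraction]     # field 8, integer or rational
    mirrored: bool                  # field 9, "Y" or "N"
    unicode_1_name: str             # field 10
    comment: str                    # field 11
    upper: Optional[int]            # field 12
    lower: Optional[int]            # field 13
    title: Optional[int]            # field 14


def _hex(text):
    return int(text, 16) if text else None


def _int(text):
    return int(text) if text else None


def parse_entry(line):
    """Parse one semicolon-separated data line into an Entry."""
    parts = line.rstrip("\r\n").split(";")
    # Assumption: tolerate one extra trailing semicolon after field 14.
    if len(parts) == FIELD_COUNT + 1 and parts[-1] == "":
        parts.pop()
    if len(parts) != FIELD_COUNT:
        raise ValueError(f"expected {FIELD_COUNT} fields, got {len(parts)}")
    return Entry(
        code=int(parts[0], 16),
        name=parts[1],
        general_category=parts[2],
        combining_class=int(parts[3]),
        bidi_category=parts[4],
        decomposition=parts[5],
        decimal_digit=_int(parts[6]),
        digit=_int(parts[7]),
        numeric=Fraction(parts[8]) if parts[8] else None,
        mirrored=parts[9] == "Y",
        unicode_1_name=parts[10],
        comment=parts[11],
        upper=_hex(parts[12]),
        lower=_hex(parts[13]),
        title=_hex(parts[14]),
    )
```

For example, parse_entry("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;") yields an Entry whose code is 0x41 and whose lower field is 0x61. That sample line is written by hand in the format described above, not copied from the data file.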
138 | |
139 | GENERAL CATEGORY |
140 | |
141 | The values in this field are abbreviations for the following. Some of the |
142 | values are normative, and some are informative. For more information, see |
143 | the Unicode Standard. |
144 | |
145 | Normative |
146 | Mn = Mark, Non-Spacing |
147 | Mc = Mark, Combining |
148 | Nd = Number, Decimal Digit |
149 | No = Number, Other |
150 | Zs = Separator, Space |
151 | Zl = Separator, Line |
152 | Zp = Separator, Paragraph |
153 | Cc = Other, Control or Format |
154 | Co = Other, Private Use |
155 | Cn = Other, Not Assigned |
156 | |
157 | Informative |
158 | Lu = Letter, Uppercase |
159 | Ll = Letter, Lowercase |
160 | Lt = Letter, Titlecase |
161 | Lm = Letter, Modifier |
162 | Lo = Letter, Other |
163 | Pd = Punctuation, Dash |
164 | Ps = Punctuation, Open |
165 | Pe = Punctuation, Close |
166 | Po = Punctuation, Other |
167 | Sm = Symbol, Math |
168 | Sc = Symbol, Currency |
169 | So = Symbol, Other |
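
An implementation that allows overriding of non-normative properties needs to know which category values fall on which side of the split above. A minimal sketch, with the two lists transcribed from this section; the function name describe_category is illustrative.

```python
# Category abbreviations transcribed from the two lists above.
NORMATIVE_CATEGORIES = {
    "Mn": "Mark, Non-Spacing",
    "Mc": "Mark, Combining",
    "Nd": "Number, Decimal Digit",
    "No": "Number, Other",
    "Zs": "Separator, Space",
    "Zl": "Separator, Line",
    "Zp": "Separator, Paragraph",
    "Cc": "Other, Control or Format",
    "Co": "Other, Private Use",
    "Cn": "Other, Not Assigned",
}

INFORMATIVE_CATEGORIES = {
    "Lu": "Letter, Uppercase",
    "Ll": "Letter, Lowercase",
    "Lt": "Letter, Titlecase",
    "Lm": "Letter, Modifier",
    "Lo": "Letter, Other",
    "Pd": "Punctuation, Dash",
    "Ps": "Punctuation, Open",
    "Pe": "Punctuation, Close",
    "Po": "Punctuation, Other",
    "Sm": "Symbol, Math",
    "Sc": "Symbol, Currency",
    "So": "Symbol, Other",
}


def describe_category(abbreviation):
    """Return (description, is_normative) for a General Category value."""
    if abbreviation in NORMATIVE_CATEGORIES:
        return NORMATIVE_CATEGORIES[abbreviation], True
    if abbreviation in INFORMATIVE_CATEGORIES:
        return INFORMATIVE_CATEGORIES[abbreviation], False
    raise KeyError(f"unknown General Category: {abbreviation!r}")
```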
170 | |
171 | BIDIRECTIONAL PROPERTIES |
172 | |
173 | Please refer to the Unicode Standard for an explanation of the algorithm for |
174 | Bidirectional Behavior and an explanation of the significance of these categories. |
175 | These values are normative. |
176 | |
177 | Strong types: |
178 | L Left-Right; Most alphabetic, syllabic, and logographic |
179 | characters (e.g., CJK ideographs) |
180 | R Right-Left; Arabic, Hebrew, and |
181 | punctuation specific to those scripts |
182 | Weak types: |
183 | EN European Number |
184 | ES European Number Separator |
185 | ET European Number Terminator |
186 | AN Arabic Number |
187 | CS Common Number Separator |
188 | |
189 | Separators: |
190 | B Block Separator |
191 | S Segment Separator |
192 | |
193 | Neutrals: |
194 | WS Whitespace |
195 | ON Other Neutrals; All other characters: punctuation, symbols |
196 | |
197 | CHARACTER DECOMPOSITION TAGS |
198 | |
199 | The decomposition is a normative property of a character. The tags supplied |
200 | with certain decompositions generally indicate formatting information. |
201 | Where no such tag is given, the decomposition is designated as canonical. |
202 | Conversely, the presence of a formatting tag also indicates |
203 | that the decomposition is a compatibility decomposition and not a canonical |
204 | decomposition. In the absence of other formatting information in a |
205 | compatibility decomposition, the tag <compat> is used to distinguish it from |
206 | canonical decompositions. |
207 | |
208 | In some instances a canonical decomposition or a compatibility decomposition |
209 | may consist of a single character. For a canonical decomposition, this |
210 | indicates that the character is a canonical equivalent of another single |
211 | character. For a compatibility decomposition, this indicates that the |
212 | character is a compatibility equivalent of another single character. |
213 | |
214 | The compatibility formatting tags used are: |
215 | |
216 | <font> A font variant (e.g. a blackletter form). |
217 | <noBreak> A no-break version of a space or hyphen. |
218 | <initial> An initial presentation form (Arabic). |
219 | <medial> A medial presentation form (Arabic). |
220 | <final> A final presentation form (Arabic). |
221 | <isolated> An isolated presentation form (Arabic). |
222 | <circle> An encircled form. |
223 | <super> A superscript form. |
224 | <sub> A subscript form. |
225 | <vertical> A vertical layout presentation form. |
226 | <wide> A wide (or zenkaku) compatibility character. |
227 | <narrow> A narrow (or hankaku) compatibility character. |
228 | <small> A small variant form (CNS compatibility). |
229 | <square> A CJK squared font variant. |
230 | <compat> Otherwise unspecified compatibility character. |
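
The rules above can be sketched in code: a leading tag in angle brackets marks a compatibility decomposition, its absence marks a canonical one, and repeated look-up of the first-level decompositions gives the maximal decomposition. The sketch assumes field 5 holds an optional tag followed by space-separated hexadecimal code values. The function names and the three-entry sample table are illustrative; the sample entries are written by hand and not read from the data file.

```python
def parse_decomposition(field):
    """Split field 5 into (tag, code_points); tag is None when canonical."""
    tokens = field.split()
    tag = None
    if tokens and tokens[0].startswith("<") and tokens[0].endswith(">"):
        tag = tokens.pop(0)[1:-1]
    return tag, [int(token, 16) for token in tokens]


def decompose(code_point, table, canonical_only=False):
    """Recursively expand code_point using first-level decompositions.

    table maps a code point to its field 5 text. With canonical_only set,
    tagged (compatibility) decompositions are left unexpanded.
    """
    tag, parts = parse_decomposition(table.get(code_point, ""))
    if not parts or (canonical_only and tag is not None):
        return [code_point]
    result = []
    for part in parts:
        result.extend(decompose(part, table, canonical_only))
    return result


# Hand-written sample entries, assumed for illustration only.
SAMPLE_TABLE = {
    0x01D5: "00DC 0304",     # first level only: U WITH DIAERESIS + MACRON
    0x00DC: "0055 0308",     # U + DIAERESIS
    0x00B2: "<super> 0032",  # compatibility: superscript form of DIGIT TWO
}
```

With this table, decompose(0x01D5, SAMPLE_TABLE) expands in two steps to the code points 0055, 0308, 0304, while decompose(0x00B2, SAMPLE_TABLE, canonical_only=True) leaves 00B2 unchanged because its decomposition carries a tag.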
231 | |
232 | CANONICAL COMBINING CLASSES |
233 | |
234 | 0: Spacing, enclosing, reordrant, and surrounding |
235 | 1: Overlays and interior |
236 | 6: Tibetan subjoined Letters |
237 | 7: Nuktas |
238 | 8: Hiragana/Katakana voiced marks |
239 | 9: Viramas |
240 | 10: Start of fixed position classes |
241 | 199: End of fixed position classes |
242 | 200: Below left attached |
243 | 202: Below attached |
244 | 204: Below right attached |
245 | 208: Left attached (reordrant around single base character) |
246 | 210: Right attached |
247 | 212: Above left attached |
248 | 214: Above attached |
249 | 216: Above right attached |
250 | 218: Below left |
251 | 220: Below |
252 | 222: Below right |
253 | 224: Left (reordrant around single base character) |
254 | 226: Right |
255 | 228: Above left |
256 | 230: Above |
257 | 232: Above right |
258 | 234: Double above |
259 | |
260 | Note: some of the combining classes in this list do not currently have |
261 | members but are specified here for completeness. |
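
These classes drive the Canonical Ordering Algorithm mentioned under field 3. As a rough sketch of the idea, and not a substitute for the definition in the Unicode Standard, each run of characters with a non-zero combining class is sorted by class, keeping the original order of characters that share a class; characters of class 0 are never moved. The function name and the two sample class values are assumptions made for illustration.

```python
def canonical_order(code_points, combining_class):
    """Return code_points with each run of non-zero-class marks sorted by class.

    combining_class is a function from a code point to its field 3 value.
    Python's sorted() is stable, so marks of equal class keep their order.
    """
    result = list(code_points)
    start = 0
    while start < len(result):
        if combining_class(result[start]) == 0:
            start += 1
            continue
        end = start
        while end < len(result) and combining_class(result[end]) != 0:
            end += 1
        result[start:end] = sorted(result[start:end], key=combining_class)
        start = end
    return result


# Assumed sample classes: 0301 is class 230 (Above), 0323 is class 220 (Below).
SAMPLE_CLASSES = {0x0301: 230, 0x0323: 220}


def sample_class(code_point):
    return SAMPLE_CLASSES.get(code_point, 0)
```

Under these sample values, canonical_order([0x0061, 0x0301, 0x0323], sample_class) moves 0323 ahead of 0301, since 220 sorts before 230, and leaves the base letter in place.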
262 | |
263 | CASE MAPPINGS |
264 | |
265 | In addition to uppercase and lowercase, because of the inclusion of certain |
266 | composite characters for compatibility, such as "01F1;LATIN CAPITAL LETTER |
267 | DZ", there is a third case, called titlecase, which is used where the first |
268 | character of a word is to be capitalized (e.g. UPPERCASE, Titlecase, |
269 | lowercase). An example of such a character is "01F2;LATIN CAPITAL LETTER D |
270 | WITH SMALL LETTER Z". |
271 | |
272 | The uppercase, titlecase and lowercase fields are only included for characters |
273 | that have a single corresponding character of that type. Composite characters |
274 | (such as "339D;SQUARE CM") that do not have a single corresponding character |
275 | of that type can be cased by decomposition. |
276 | |
277 | The case mapping is an informative, default mapping. Certain languages, such |
278 | as Turkish, German, French, or Greek, may have small deviations from the |
279 | default mappings listed in the Unicode Character Database. |
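
The three case fields can be applied as simple one-to-one look-ups. The sketch below uses the DZ digraph from the text above; the table is written by hand, the code value 01F3 for the lowercase form is supplied here as an assumption, and the function names are illustrative. Falling back to the uppercase mapping when a character has no titlecase mapping is also a choice made for this sketch, not something the database prescribes.

```python
# (upper, lower, title) per code point, mirroring fields 12, 13 and 14.
# None means the field is empty. Hand-written sample entries.
SAMPLE_CASE_FIELDS = {
    0x01F1: (None, 0x01F3, 0x01F2),    # LATIN CAPITAL LETTER DZ
    0x01F2: (0x01F1, 0x01F3, None),    # CAPITAL LETTER D WITH SMALL LETTER Z
    0x01F3: (0x01F1, None, 0x01F2),    # lowercase form (assumed code value)
}

_FIELD_INDEX = {"upper": 0, "lower": 1, "title": 2}


def map_case(code_point, kind, table=SAMPLE_CASE_FIELDS):
    """Return the mapped code point, or code_point itself if the field is empty."""
    entry = table.get(code_point)
    if entry is None or entry[_FIELD_INDEX[kind]] is None:
        return code_point
    return entry[_FIELD_INDEX[kind]]


def titlecase_word(word, table=SAMPLE_CASE_FIELDS):
    """Titlecase the first character of word and lowercase the rest."""
    if not word:
        return word
    first = ord(word[0])
    mapped = map_case(first, "title", table)
    if mapped == first:
        # Sketch-level fallback: use the uppercase mapping if there is no
        # titlecase mapping.
        mapped = map_case(first, "upper", table)
    rest = "".join(chr(map_case(ord(ch), "lower", table)) for ch in word[1:])
    return chr(mapped) + rest
```

Language-specific deviations, such as those noted above for Turkish, would need a separate override table layered on top of these defaults.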
280 | |
281 | MODIFICATION HISTORY |
282 | |
283 | Some of the modifications made in updating the Unicode Character Database |
284 | for the Unicode Standard, Version 2.0 are: |
285 | * Fixed decompositions with TONOS to use correct NSM: 030D. |
286 | * Removed old Hangul Syllables; mappings to new characters are |
287 | in a separate table. |
288 | * Marked compatibility decompositions with additional tags. |
289 | * Changed old tag names for clarity. |
290 | * Revision of decompositions to use first-level decomposition, instead |
291 | of maximal decomposition. |
292 | * Correction of all known errors in decompositions from earlier versions. |
293 | * Added control code names (as old Unicode names). |
294 | * Added Hangul Jamo decompositions. |
295 | * Added Number category to match properties list in book. |
296 | * Fixed categories of Koranic Arabic marks. |
297 | * Fixed categories of precomposed characters to match decomposition where possible. |
298 | * Added Hebrew cantillation marks and the Tibetan script. |
299 | * Added place holders for ranges such as CJK Ideographic Area and the |
300 | Private Use Area. |
301 | * Eliminated "Nd" as a category. |