Commit | Line | Data |
c7b62a68 |
1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" |
2 | |
3 | "http://www.w3.org/TR/REC-html40/loose.dtd"> |
4 | |
5 | <html> |
6 | |
c7b62a68 |
7 | <head> |
c7b62a68 |
8 | <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> |
c7b62a68 |
9 | <meta http-equiv="Content-Language" content="en-us"> |
c7b62a68 |
10 | <meta name="GENERATOR" content="Microsoft FrontPage 4.0"> |
c7b62a68 |
11 | <meta name="ProgId" content="FrontPage.Editor.Document"> |
c7b62a68 |
12 | <link rel="stylesheet" href="http://www.unicode.org/unicode.css" type="text/css"> |
c7b62a68 |
13 | <title>Unicode Character Database</title> |
c7b62a68 |
14 | </head> |
15 | |
c7b62a68 |
16 | <body> |
17 | |
06bfd75b |
18 | <table width="100%" cellpadding="0" cellspacing="0" border="0"> |
19 | <tr> |
20 | <td> |
21 | <table width="100%" border="0" cellpadding="0" cellspacing="0"> |
22 | <tr> |
23 | <td class="icon"><a href="http://www.unicode.org"><img border="0" |
24 | src="http://www.unicode.org/webscripts/logo60s2.gif" align="middle" |
25 | alt="[Unicode]" width="34" height="33"></a> <a |
26 | class="bar" href="UnicodeCharacterDatabase.html">Unicode Character |
27 | Database</a></td> |
28 | </tr> |
29 | </table> |
30 | </td> |
31 | </tr> |
32 | <tr> |
33 | <td class="gray"> </td> |
34 | </tr> |
35 | </table> |
36 | <h1>UNICODE CHARACTER DATABASE</h1> |
c7b62a68 |
37 | <table border="1" cellspacing="2" cellpadding="0" height="87" width="100%"> |
c7b62a68 |
38 | <tr> |
c7b62a68 |
39 | <td valign="TOP" width="144">Revision</td> |
06bfd75b |
40 | <td valign="TOP">3.1.0</td> |
c7b62a68 |
41 | </tr> |
c7b62a68 |
42 | <tr> |
c7b62a68 |
43 | <td valign="TOP" width="144">Authors</td> |
c7b62a68 |
44 | <td valign="TOP">Mark Davis and Ken Whistler</td> |
c7b62a68 |
45 | </tr> |
c7b62a68 |
46 | <tr> |
c7b62a68 |
47 | <td valign="TOP" width="144">Date</td> |
06bfd75b |
48 | <td valign="TOP">2001-02-28</td> |
c7b62a68 |
49 | </tr> |
c7b62a68 |
50 | <tr> |
c7b62a68 |
51 | <td valign="TOP" width="144">This Version</td> |
8836d2a5 |
52 | <td valign="TOP"><a |
06bfd75b |
53 | href="http://www.unicode.org/Public/3.1-Update/UnicodeCharacterDatabase-3.1.0.html">http://www.unicode.org/Public/3.1-Update/UnicodeCharacterDatabase-3.1.0.html</a></td> |
c7b62a68 |
54 | </tr> |
c7b62a68 |
55 | <tr> |
c7b62a68 |
56 | <td valign="TOP" width="144">Previous Version</td> |
8836d2a5 |
57 | <td valign="TOP"><a |
06bfd75b |
58 | href="http://www.unicode.org/Public/3.0-Update1/UnicodeCharacterDatabase-3.0.1.html">http://www.unicode.org/Public/3.0-Update1/UnicodeCharacterDatabase-3.0.1.html</a></td> |
c7b62a68 |
59 | </tr> |
c7b62a68 |
60 | <tr> |
c7b62a68 |
61 | <td valign="TOP" width="144">Latest Version</td> |
8836d2a5 |
62 | <td valign="TOP"><a |
63 | href="http://www.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html">http://www.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html</a></td> |
c7b62a68 |
64 | </tr> |
c7b62a68 |
65 | </table> |
06bfd75b |
66 | <h3><br> |
67 | <i>Summary</i></h3> |
68 | <blockquote> |
69 | <p><i>This document describes the format and content of the Unicode Character |
70 | Database (UCD).</i></p> |
71 | </blockquote> |
72 | <h3><i>Status</i></h3> |
73 | <blockquote> |
74 | <p><i>This file and the files described herein are part of the Unicode |
75 | Character Database and are governed by the <a href="#UCD_Terms">UCD Terms of |
76 | Use</a> given below.</i></p> |
77 | <p><i>The <a href="#References">References</a> provide related information |
78 | that is useful in understanding this document.</i></p> |
79 | <p><i><b>Warning: </b>the information in this file does not completely |
80 | describe the use and interpretation of Unicode character properties and |
81 | behavior. It must be used in conjunction with the data in the other files in |
82 | the Unicode Character Database, and relies on the notation and definitions |
83 | supplied in <a |
84 | href="http://www.unicode.org/unicode/standard/versions/Unicode3.0.html">The |
85 | Unicode Standard</a>. All chapter references are to Version 3.1.0 of the |
86 | standard.</i></p> |
87 | </blockquote> |
8836d2a5 |
88 | <h2>Introduction</h2> |
06bfd75b |
89 | <p>The Unicode Character Database (UCD) is a set of files that define the |
90 | Unicode character properties and internal mappings. This document describes the |
91 | files that are part of <a href="http://www.unicode.org/unicode/reports/tr27/">The |
92 | Unicode Standard, Version 3.1</a> [<a href="#U3.1">U3.1</a>]. The main changes |
93 | in this version are:</p> |
8836d2a5 |
94 | <ul> |
06bfd75b |
95 | <li>All of the data files have been updated to account for the large number of |
96 | additional characters in Unicode 3.1.</li> |
97 | <li>PropList.txt has been extensively reorganized and reformatted.</li> |
98 | <li>Scripts.txt has been added to the UCD.</li> |
99 | <li>A large number of informative derived property files have been added to |
100 | the UCD.</li> |
8836d2a5 |
101 | </ul> |
06bfd75b |
102 | <p><i>Files in the UCD use a common format unless otherwise specified. For |
103 | details, see <a href="#UCD_File_Format">UCD File Format</a>.</i></p> |
104 | <h2><a name="Conformance">Conformance</a></h2> |
105 | <p>For information on the meaning and application of the terms normative and |
106 | informative, see "Chapter 4, Character Properties (revision)" in <a |
107 | href="http://www.unicode.org/unicode/reports/tr27/#conformance">UAX #27, Unicode |
108 | 3.1</a>.</p> |
109 | <p>Some informative data files contain derived properties, properties that can |
110 | be derived from other properties in the UCD. The derived properties that are |
111 | computed from solely normative properties are themselves normative, while the |
112 | others are informative.</p> |
113 | <h2>UCD Files</h2> |
114 | <p>The following table summarizes the files in the Unicode Character Database. |
115 | For more information about these files, see the referenced technical |
116 | report(s), files, or section of the Unicode Standard, Version 3.1.</p> |
117 | <table border="1" cellspacing="0" cellpadding="4"> |
118 | <tr> |
119 | <th>".txt" File</th> |
120 | <th>Description</th> |
121 | <th align="center">N/I</th> |
122 | <th>Summary</th> |
123 | </tr> |
124 | <tr> |
125 | <td>ArabicShaping</td> |
126 | <td>Section 8.2</td> |
127 | <td align="center">N</td> |
128 | <td>Basic Arabic and Syriac character shaping properties, such as initial, |
129 | medial and final shapes.</td> |
130 | </tr> |
131 | <tr> |
132 | <td>BidiMirroring</td> |
133 | <td><a href="http://www.unicode.org/unicode/reports/tr9/">UAX #9</a></td> |
134 | <td align="center">I</td> |
135 | <td>Properties for substituting characters in an implementation of |
136 | bidirectional mirroring.</td> |
137 | </tr> |
138 | <tr> |
139 | <td>Blocks</td> |
140 | <td>Chapter 14</td> |
141 | <td align="center">N</td> |
142 | <td>List of block names.</td> |
143 | </tr> |
144 | <tr> |
145 | <td>CaseFolding</td> |
146 | <td><a href="http://www.unicode.org/unicode/reports/tr21/">UTR #21</a></td> |
147 | <td align="center">N</td> |
148 | <td>Mapping from characters to their case-folded forms. This is an |
149 | informative file containing normative derived properties. |
150 | <p><i>Derived from UnicodeData and SpecialCasing.</i></p> |
151 | </td> |
152 | </tr> |
153 | <tr> |
154 | <td>CompositionExclusions</td> |
155 | <td><a href="http://www.unicode.org/unicode/reports/tr15/">UAX #15</a></td> |
156 | <td align="center">N</td> |
157 | <td>Properties for normalization.</td> |
158 | </tr> |
159 | <tr> |
160 | <td><i>DerivedXXX</i></td> |
161 | <td>DerivedProperties.html</td> |
162 | <td align="center">N/I</td> |
163 | <td>Various informative derived files, described in the documentation file. |
164 | Some of the derived properties are normative and some are informative.</td> |
165 | </tr> |
166 | <tr> |
167 | <td>EastAsianWidth</td> |
168 | <td><a href="http://www.unicode.org/unicode/reports/tr11/">UAX #11</a></td> |
169 | <td align="center">I</td> |
170 | <td>Properties for determining the choice of wide vs. narrow glyphs in East |
171 | Asian contexts.</td> |
172 | </tr> |
173 | <tr> |
174 | <td>Index</td> |
175 | <td>Chapter 14</td> |
176 | <td align="center">I</td> |
177 | <td>Index to Unicode characters, as printed in the Unicode Standard. (See <a |
178 | href="#Update_Note">Update Note</a>.)</td> |
179 | </tr> |
180 | <tr> |
181 | <td>Jamo</td> |
182 | <td>Chapter 4</td> |
183 | <td align="center">N</td> |
184 | <td>List of Jamo short names, used in deriving HANGUL SYLLABLE names |
185 | algorithmically.</td> |
186 | </tr> |
187 | <tr> |
188 | <td>LineBreak</td> |
189 | <td><a href="http://www.unicode.org/unicode/reports/tr14/">UAX #14</a></td> |
190 | <td align="center">N/I</td> |
191 | <td>Properties for line breaking.</td> |
192 | </tr> |
193 | <tr> |
194 | <td>NamesList</td> |
195 | <td>Chapter 14</td> |
196 | <td align="center">I</td> |
197 | <td>This file duplicates some of the material in the UnicodeData file, and |
198 | adds annotations used in the character charts.</td> |
199 | </tr> |
200 | <tr> |
201 | <td>NormalizationTest</td> |
202 | <td><a href="http://www.unicode.org/unicode/reports/tr15/">UAX #15</a></td> |
203 | <td align="center">N</td> |
204 | <td>Test file for conformance to Unicode Normalization Forms.</td> |
205 | </tr> |
206 | <tr> |
207 | <td>PropList</td> |
208 | <td>PropList.html</td> |
209 | <td align="center">N/I</td> |
210 | <td>Extended character properties.</td> |
211 | </tr> |
212 | <tr> |
213 | <td>Scripts</td> |
214 | <td><a href="http://www.unicode.org/unicode/reports/tr24/">UTR #24</a></td> |
215 | <td align="center">I</td> |
216 | <td>Default script values for use in regular expressions.</td> |
217 | </tr> |
218 | <tr> |
219 | <td>SpecialCasing</td> |
220 | <td>Chapter 4,<br> |
221 | <a href="http://www.unicode.org/unicode/reports/tr21/">UTR #21</a></td> |
222 | <td align="center">N</td> |
223 | <td>List of properties required for full case mapping.</td> |
224 | </tr> |
225 | <tr> |
226 | <td>UnicodeData</td> |
227 | <td>UnicodeData.html,<br> |
228 | Chapter 4,<br> |
229 | <a href="http://www.unicode.org/unicode/reports/tr21/">UTR #21</a>,<br> |
230 | <a href="http://www.unicode.org/unicode/reports/tr15/">UAX #15</a></td> |
231 | <td align="center">N/I</td> |
232 | <td>The main file in the UCD. </td> |
233 | </tr> |
234 | <tr> |
235 | <td>Unihan</td> |
236 | <td>Unihan.txt</td> |
237 | <td align="center">N/I</td> |
238 | <td>Extended properties of Han (CJK) characters. (See <a href="#Format_Note">Format |
239 | Note</a>.)</td> |
240 | </tr> |
241 | </table> |
242 | <blockquote> |
243 | <p><b><a name="Update_Note">Update Note</a>: </b>The information in Index.txt |
244 | files matches the appropriate version of the book. Changes in the Unicode |
245 | Character Database since then may not be reflected in these files, since they |
246 | are primarily of archival interest.</p> |
247 | <p><b><a name="Format_Note">Format Note</a>: </b>The file data format differs |
248 | from the standard format, and is described in the header of the file. The |
249 | header also describes which properties are informative and which are |
250 | normative.</p> |
251 | </blockquote> |
252 | <h2><a name="UCD_File_Format">UCD File Format</a></h2> |
253 | <p>Files in the UCD use the following format, unless otherwise specified.</p> |
8836d2a5 |
254 | <ul> |
06bfd75b |
255 | <li>Each line of data consists of fields separated by semicolons. The fields |
256 | are numbered starting with zero. Code points are expressed as hexadecimal |
257 | numbers with four to six digits. They are written without "U+". |
258 | Within a sequence of code points, spaces are used for separation. Leading |
259 | and trailing spaces within a field are not significant.</li> |
8836d2a5 |
260 | </ul> |
8836d2a5 |
261 | <ul> |
06bfd75b |
262 | <li>The first field (0) of each line in the Unicode Character Database files |
263 | represents a code point or range. The remaining fields (1..n) are properties |
264 | associated with that code point.</li> |
8836d2a5 |
265 | </ul> |
8836d2a5 |
266 | <ul> |
06bfd75b |
267 | <li>A range of code points is specified by the form "X..Y". Each |
268 | code point from X to Y has the associated properties. For example:</li> |
8836d2a5 |
269 | </ul> |
06bfd75b |
270 | <blockquote> |
271 | <pre>0000..007F; Basic Latin |
272 | 0080..00FF; Latin-1 Supplement |
273 | |
274 | 1680 ; White_space # Zs OGHAM SPACE MARK |
275 | 2000..200A; White_space # Zs [11] EN QUAD..HAIR SPACE</pre> |
276 | </blockquote> |
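<p>As a minimal sketch of how field 0 might be interpreted, the following Python function turns either a single code point or an "X..Y" range into a pair of integers. The function name <code>parse_code_point_field</code> is hypothetical and not part of the UCD; the sketch assumes the field holds one code point or one range, not a space-separated sequence.</p>

```python
def parse_code_point_field(field):
    """Return (first, last) as inclusive integer code points.

    Assumes `field` is field 0 of a UCD data line: either a single
    hexadecimal code point ("1680") or a range ("2000..200A").
    """
    text = field.strip()  # leading and trailing spaces are not significant
    if ".." in text:
        start, end = text.split("..", 1)
    else:
        start = end = text
    # Code points are written in hexadecimal without "U+".
    return int(start.strip(), 16), int(end.strip(), 16)


first, last = parse_code_point_field("2000..200A")
count = last - first + 1  # number of code points covered by the range
```

<p>For the White_space example above, <code>count</code> agrees with the bracketed item count shown in the comment on that line.</p>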
8836d2a5 |
277 | <ul> |
06bfd75b |
278 | <li>Hash marks ("#") are used to indicate comments: all characters |
279 | from the hash mark to the end of the line are comments, and disregarded when |
280 | parsing data. In many files, the comments on data lines use a common format.</li> |
8836d2a5 |
281 | </ul> |
06bfd75b |
282 | <blockquote> |
283 | <pre>00BC..00BE ; numeric # No [3] VULGAR FRACTION ONE QUARTER..VULGAR FRACTION THREE QUARTERS</pre> |
284 | </blockquote> |
8836d2a5 |
285 | <ul> |
06bfd75b |
286 | <li>The first part of the comment is the UCD general category. The symbol |
287 | "L&" indicates characters of type Lu, Ll, or Lt. The code |
288 | point ranges are calculated so that they all have the same General Category |
289 | (or L&). While this results in more ranges than are strictly necessary, |
290 | it makes the contents of the ranges clearer. The second part of the comment |
291 | (in square brackets) indicates the number of items in a range, if there is |
292 | one. The third part is the name of the character in field zero: if it is a |
293 | range, then the character names for the ends of the range are separated by |
294 | "..".</li> |
8836d2a5 |
295 | </ul> |
06bfd75b |
296 | <p>However, the comments are purely informational, and may change format or be |
297 | omitted in the future. They should not be parsed for content.</p> |
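<p>The rules in this section can be sketched as a small line parser. This is an illustration only, not a reference implementation: the name <code>parse_ucd_line</code> is hypothetical, and the sketch assumes a file in the common format whose field 0 is a single code point or range (files that differ, such as Unihan, would need their own handling). Comments are discarded rather than parsed, in line with the advice above.</p>

```python
def parse_ucd_line(line):
    """Parse one line in the common UCD format.

    Returns (first, last, properties), or None for blank or
    comment-only lines. Assumes field 0 is a single code point
    or an "X..Y" range.
    """
    # Everything from the hash mark to the end of the line is a comment.
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    # Fields are separated by semicolons; surrounding spaces are not significant.
    fields = [field.strip() for field in data.split(";")]
    first_text, _, last_text = fields[0].partition("..")
    first = int(first_text, 16)
    last = int(last_text, 16) if last_text else first
    return first, last, fields[1:]


sample = "00BC..00BE    ; numeric # No   [3] VULGAR FRACTION ONE QUARTER..VULGAR FRACTION THREE QUARTERS"
first, last, properties = parse_ucd_line(sample)
```

<p>Returning the range as a pair instead of expanding it keeps memory use small for large ranges; a caller that needs every code point can iterate from <code>first</code> to <code>last</code> inclusive.</p>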
298 | <h2><a name="References">References</a></h2> |
299 | <table cellspacing="12" cellpadding="0" width="100%" border="0"> |
300 | <tbody> |
301 | <tr> |
302 | <td valign="top" width="1">[<a name="FAQ">FAQ</a>]</td> |
303 | <td valign="top">Unicode Frequently Asked Questions<br> |
304 | <a href="http://www.unicode.org/unicode/faq/">http://www.unicode.org/unicode/faq/<br> |
305 | </a><i>For answers to common questions on technical issues.</i></td> |
306 | </tr> |
307 | <tr> |
308 | <td valign="top" width="1">[<a name="Glossary">Glossary</a>]</td> |
309 | <td valign="top">Unicode Glossary<a |
310 | href="http://www.unicode.org/glossary/"><br> |
311 | http://www.unicode.org/glossary/<br> |
312 | </a><i>For explanations of terminology used in this and other documents.</i></td> |
313 | </tr> |
314 | <tr> |
315 | <td valign="top" width="1">[<a name="Reports">Reports</a>]</td> |
316 | <td valign="top">Unicode Technical Reports<br> |
317 | <a href="http://www.unicode.org/unicode/reports/">http://www.unicode.org/unicode/reports/<br> |
318 | </a><i>For information on the status and development process for |
319 | technical reports, and for a list of technical reports.</i></td> |
320 | </tr> |
321 | <tr> |
322 | <td valign="top" width="1">[<a name="U3.1">U3.1</a>]</td> |
323 | <td valign="top">Unicode Standard Annex #27: Unicode 3.1<a |
324 | href="http://www.unicode.org/unicode/reports/tr27/"><br> |
325 | http://www.unicode.org/unicode/reports/tr27/</a></td> |
326 | </tr> |
327 | <tr> |
328 | <td valign="top" width="1">[<a name="Versions">Versions</a>]</td> |
329 | <td valign="top">Versions of the Unicode Standard<br> |
330 | <a href="http://www.unicode.org/unicode/standard/versions/">http://www.unicode.org/unicode/standard/versions/<br> |
331 | </a><i>For details on the precise contents of each version of the |
332 | Unicode Standard, and how to cite them.</i></td> |
333 | </tr> |
334 | </tbody> |
335 | </table> |
336 | <h2><br> |
337 | <i><a name="UCD_Terms">UCD Terms of Use</a></i></h2> |
338 | <h3><i>Disclaimer</i></h3> |
339 | <blockquote> |
340 | <p><i>The Unicode Character Database is provided as is by Unicode, Inc. No |
341 | claims are made as to fitness for any particular purpose. No warranties of any |
342 | kind are expressed or implied. The recipient agrees to determine applicability |
343 | of information provided. If this file has been purchased on magnetic or |
344 | optical media from Unicode, Inc., the sole remedy for any claim will be |
345 | exchange of defective media within 90 days of receipt.</i></p> |
346 | <p><i>This disclaimer is applicable for all other data files accompanying the |
347 | Unicode Character Database, some of which have been compiled by the Unicode |
348 | Consortium, and some of which have been supplied by other sources.</i></p> |
349 | </blockquote> |
350 | <h3><i>Limitations on Rights to Redistribute This Data</i></h3> |
351 | <blockquote> |
352 | <p><i>Recipient is granted the right to make copies in any form for internal |
353 | distribution and to freely use the information supplied in the creation of |
354 | products supporting the Unicode<sup>TM</sup> Standard. The files in the |
355 | Unicode Character Database can be redistributed to third parties or other |
356 | organizations (whether for profit or not) as long as this notice and the |
357 | disclaimer notice are retained. Information can be extracted from these files |
358 | and used in documentation or programs, as long as there is an accompanying |
359 | notice indicating the source.</i></p> |
360 | </blockquote> |
361 | <hr width="50%"> |
362 | <p align="center"><a href="http://www.unicode.org/unicode/copyright.html"><img |
363 | src="http://www.unicode.org/img/hb_home.gif" border="0" alt="Home" width="40" |
364 | height="49"><img src="http://www.unicode.org/img/hb_mid.gif" border="0" |
365 | alt="Terms of Use" width="152" height="49"><img |
366 | src="http://www.unicode.org/img/hb_mail.gif" border="0" alt="E-mail" width="46" |
367 | height="49"></a> |
8836d2a5 |
368 | |
369 | </body> |
370 | |
371 | </html> |