Commit | Line | Data |
1911be83 |
1 | The *.txt files were copied from |
8836d2a5 |
2 | |
99870f4d |
3 | ftp://www.unicode.org/Public/UNIDATA |
b6922eda |
4 | |
99870f4d |
5 | with subdirectories 'extracted' and 'auxiliary' |
61131c94 |
6 | |
99870f4d |
7 | The Unihan files were not included due to space considerations. Also NOT |
37e2e78e |
8 | included were any *.html files. It is possible to add the Unihan files, and |
9 | edit mktables (see instructions near its beginning) to look at them. |
99870f4d |
10 | |
11 | The file 'version' should exist and be a single line with the Unicode version, |
12 | like: |
13 | 5.2.0 |
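A quick sanity check of the 'version' file can be sketched as below. The function name check_version_file is hypothetical, and the three-part numeric pattern is an assumption based only on the 5.2.0 example above; nothing here is taken from mktables itself.

```shell
# Hypothetical helper (not part of mktables): succeed only if the given file
# is exactly one line that looks like a dotted version such as 5.2.0.
check_version_file() {
    file=${1:-version}
    [ -f "$file" ] || return 1
    # Exactly one newline-terminated line (tr strips the padding that some
    # wc implementations add).
    [ "$(wc -l < "$file" | tr -d ' ')" -eq 1 ] || return 1
    # Assumed shape: three dot-separated numbers and nothing else.
    grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$' "$file"
}
```

Running `check_version_file version && echo ok` in this directory would print ok only when the file passes both checks.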
61131c94 |
14 | |
15 | To be 8.3 filesystem friendly, the names of some of the input files have been |
37e2e78e |
16 | changed from the names used in the Unicode DB. Not all of the Test files are |
17 | currently used, so some may be absent, in which case the corresponding mv's |
18 | will fail. The .html Test files are not touched. |
61131c94 |
19 | |
20 | mv PropertyValueAliases.txt PropValueAliases.txt |
21 | mv NamedSequencesProv.txt NamedSqProv.txt |
22 | mv DerivedAge.txt DAge.txt |
23 | mv DerivedCoreProperties.txt DCoreProperties.txt |
24 | mv DerivedNormalizationProps.txt DNormalizationProps.txt |
25 | mv extracted/DerivedBidiClass.txt extracted/DBidiClass.txt |
26 | mv extracted/DerivedBinaryProperties.txt extracted/DBinaryProperties.txt |
27 | mv extracted/DerivedCombiningClass.txt extracted/DCombiningClass.txt |
28 | mv extracted/DerivedDecompositionType.txt extracted/DDecompositionType.txt |
29 | mv extracted/DerivedEastAsianWidth.txt extracted/DEastAsianWidth.txt |
30 | mv extracted/DerivedGeneralCategory.txt extracted/DGeneralCategory.txt |
31 | mv extracted/DerivedJoiningGroup.txt extracted/DJoinGroup.txt |
32 | mv extracted/DerivedJoiningType.txt extracted/DJoinType.txt |
33 | mv extracted/DerivedLineBreak.txt extracted/DLineBreak.txt |
34 | mv extracted/DerivedNumericType.txt extracted/DNumType.txt |
35 | mv extracted/DerivedNumericValues.txt extracted/DNumValues.txt |
36 | |
37e2e78e |
37 | mv auxiliary/GraphemeBreakTest.txt auxiliary/GCBTest.txt |
38 | mv auxiliary/LineBreakTest.txt auxiliary/LBTest.txt |
39 | mv auxiliary/SentenceBreakTest.txt auxiliary/SBTest.txt |
40 | mv auxiliary/WordBreakTest.txt auxiliary/WBTest.txt |
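Since some of the Test files may be absent, the renames can be wrapped so that a missing source is skipped rather than reported as an mv failure. This is only a sketch: mv_if_present is a hypothetical helper, not something shipped with Perl or used by mktables.

```shell
# Hypothetical helper: rename only when the source exists, so that an
# absent (unused) Test file is skipped instead of making mv fail.
mv_if_present() {
    if [ -e "$1" ]; then
        mv "$1" "$2"
    else
        echo "skipping $1 (not present)" >&2
    fi
}

mv_if_present auxiliary/GraphemeBreakTest.txt auxiliary/GCBTest.txt
mv_if_present auxiliary/LineBreakTest.txt auxiliary/LBTest.txt
mv_if_present auxiliary/SentenceBreakTest.txt auxiliary/SBTest.txt
mv_if_present auxiliary/WordBreakTest.txt auxiliary/WBTest.txt
```

The same wrapper could be applied to the other mv lines in this file; a skipped file is reported on standard error and does not count as a failure.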
41 | |
99870f4d |
42 | If you have the Unihan database (5.2 and above), you should also do the |
43 | following: |
61131c94 |
44 | |
99870f4d |
45 | mv Unihan_DictionaryIndices.txt UnihanIndicesDictionary.txt |
46 | mv Unihan_DictionaryLikeData.txt UnihanDataDictionaryLike.txt |
47 | mv Unihan_IRGSources.txt UnihanIRGSources.txt |
48 | mv Unihan_NumericValues.txt UnihanNumericValues.txt |
49 | mv Unihan_OtherMappings.txt UnihanOtherMappings.txt |
50 | mv Unihan_RadicalStrokeCounts.txt UnihanRadicalStrokeCounts.txt |
51 | mv Unihan_Readings.txt UnihanReadings.txt |
52 | mv Unihan_Variants.txt UnihanVariants.txt |
53 | |
37e2e78e |
54 | If you download everything, the names of files that are not used by mktables |
55 | are not changed by the above, and will not work correctly as-is on 8.3 |
56 | filesystems. |
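To see which downloaded names might clash on such a filesystem, one rough check is to look for *.txt files that become identical once the part before the extension is cut to 8 characters and case is ignored. That reading of "8.3 friendly" is an assumption made for this sketch, and find_83_collisions is a hypothetical helper; it also assumes file names contain no spaces.

```shell
# Hypothetical helper: print groups of *.txt files in directory $1 whose
# names coincide after truncating the stem to 8 characters, ignoring case.
find_83_collisions() {
    dir=${1:-.}
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue          # no matches: the glob stays literal
        name=${f##*/}                    # strip the directory part
        stem=${name%.txt}                # strip the extension
        short=$(printf '%s' "$stem" | cut -c1-8 | tr '[:upper:]' '[:lower:]')
        printf '%s %s\n' "$short" "$name"
    done | sort | awk '
        { names[$1] = names[$1] " " $2; count[$1]++ }
        END { for (k in count) if (count[k] > 1) print k ":" names[k] }
    '
}
```

For example, `find_83_collisions extracted` would list any groups of names in that subdirectory that share their first 8 characters; no output means no clash was found under this assumption.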
99870f4d |
57 | |
58 | mktables is used to generate the tables used by the rest of Perl. It will warn |
59 | you about any *.txt files in the directory substructure that it doesn't know |
60 | about. You should remove any files so identified, or edit mktables to add them |
61 | to its lists of files to process. You can run |
62 | |
63 | mktables -globlist |
64 | |
65 | to have it try to process these tables generically. |
66 | |
0fa75b59 |
67 | FOR PUMPKINS |
68 | |
99870f4d |
69 | The files are inter-related. If you take the latest UnicodeData.txt, for |
70 | example, but leave the older versions of other files, there can be subtle |
37e2e78e |
71 | problems. So get everything available from Unicode, and delete the files that |
72 | aren't needed. |
99870f4d |
73 | |
74 | When moving to a new version of Unicode, you need to update 'version' by hand: |
75 | |
76 | p4 edit version |
77 | ... |
78 | |
79 | You should look in the Unicode release notes (which are probably towards the |
80 | bottom of http://www.unicode.org/reports/tr44/) to see if any properties have |
81 | newly been moved to be Obsolete, Deprecated, or Stabilized. The full names for |
82 | these should be added to the respective lists near the beginning of mktables, |
83 | using an 'if' to add them for just this Unicode version going forward, so that |
84 | mktables can continue to be used for earlier Unicode versions. |
85 | |
86 | When putting out a new Perl release, consider whether any of the Deprecated |
87 | properties should be moved to Suppressed. |
b6922eda |
88 | |
272d2fcc |
89 | perlrecharclass.pod has a list of all the characters that are white space, |
90 | which needs to be updated if there are changes. A quick way to check if there |
91 | have been changes would be to see if the number of such characters listed in |
92 | perluniprops.pod (generated by running mktables) for the property |
93 | \p{White_Space} is no longer 26. Further investigation would then be necessary |
94 | to classify each new character as horizontal or vertical. |
95 | |
37e2e78e |
96 | The code in regexec.c for the \X match construct is intimately tied to the |
97 | regular expression in UAX #29 (http://www.unicode.org/reports/tr29/). You |
98 | should check whether it has changed and, if so, modify regexec.c to match. The |
99 | current one is: |
100 | ( CRLF |
101 | | Prepend* ( Hangul-syllable | !Control ) |
102 | ( Grapheme_Extend | Spacing_Mark)* |
103 | | . ) |
104 | |
105 | mktables has many checks to warn you if there are unexpected or novel things |
106 | that it doesn't know how to handle. |
0fa75b59 |
107 | |
37e2e78e |
108 | Finally: |
0fa75b59 |
109 | |
110 | p4 submit |
8836d2a5 |
111 | |
112 | -- |
99870f4d |
113 | jhi@iki.fi; updated by nick@ccl4.org, public@khwilliamson.com |