The *.txt files were copied from

    ftp://www.unicode.org/Public/UNIDATA

with subdirectories 'extracted' and 'auxiliary'.

The Unihan files were not included due to space considerations. Also NOT
included were any *.html files. It is possible to add the Unihan files and
edit mktables (see the instructions near its beginning) so that it looks at
them.

The file 'version' should exist and contain a single line with the Unicode
version, like:

    5.2.0

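A quick sanity check of 'version' can be sketched in shell. This is only an
illustration: check_version_file is a hypothetical helper, and the
three-number pattern is an assumption based on the 5.2.0 example above.

```shell
# Succeed only if FILE exists and holds exactly one line of the form
# N.N.N (for example 5.2.0). Hypothetical helper; the N.N.N shape is an
# assumption taken from the example in this README.
check_version_file() {
    file=$1
    [ -f "$file" ] || return 1
    # grep -c '' counts lines, including a last line with no newline.
    [ "$(grep -c '' "$file")" -eq 1 ] || return 1
    # The single line must be three dot-separated numbers.
    grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$' "$file"
}
```

Run from this directory as 'check_version_file version'; a non-zero exit
status means the file is missing or does not look like a version line.
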
To be 8.3 filesystem friendly, the names of some of the input files have been
changed from the values that are in the Unicode DB. Not all of the Test files
are currently used, so some may not be present and the corresponding mv's can
fail. The .html Test files are not touched.

mv PropertyValueAliases.txt PropValueAliases.txt
mv NamedSequencesProv.txt NamedSqProv.txt
mv DerivedAge.txt DAge.txt
mv DerivedCoreProperties.txt DCoreProperties.txt
mv DerivedNormalizationProps.txt DNormalizationProps.txt
mv extracted/DerivedBidiClass.txt extracted/DBidiClass.txt
mv extracted/DerivedBinaryProperties.txt extracted/DBinaryProperties.txt
mv extracted/DerivedCombiningClass.txt extracted/DCombiningClass.txt
mv extracted/DerivedDecompositionType.txt extracted/DDecompositionType.txt
mv extracted/DerivedEastAsianWidth.txt extracted/DEastAsianWidth.txt
mv extracted/DerivedGeneralCategory.txt extracted/DGeneralCategory.txt
mv extracted/DerivedJoiningGroup.txt extracted/DJoinGroup.txt
mv extracted/DerivedJoiningType.txt extracted/DJoinType.txt
mv extracted/DerivedLineBreak.txt extracted/DLineBreak.txt
mv extracted/DerivedNumericType.txt extracted/DNumType.txt
mv extracted/DerivedNumericValues.txt extracted/DNumValues.txt

mv auxiliary/GraphemeBreakTest.txt auxiliary/GCBTest.txt
mv auxiliary/LineBreakTest.txt auxiliary/LBTest.txt
mv auxiliary/SentenceBreakTest.txt auxiliary/SBTest.txt
mv auxiliary/WordBreakTest.txt auxiliary/WBTest.txt

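Because some of the files may be absent, the renames above can be wrapped so
that a missing file is skipped rather than reported as an error. A minimal
sketch, assuming a POSIX shell; rename_if_present is a hypothetical helper,
not something shipped with Perl:

```shell
# Rename SRC to DST only when SRC is present, so that an absent optional
# file (for example an unused Test file) is skipped instead of failing.
# Hypothetical helper, not part of mktables.
rename_if_present() {
    src=$1
    dst=$2
    if [ -e "$src" ]; then
        mv -- "$src" "$dst"
    else
        # Report the skip on stderr and still return success.
        echo "skipping $src (not present)" >&2
    fi
}
```

Each mv line then becomes, for example:

    rename_if_present auxiliary/LineBreakTest.txt auxiliary/LBTest.txt
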
If you have the Unihan database (5.2 and above), you should also do the
following:

mv Unihan_DictionaryIndices.txt UnihanIndicesDictionary.txt
mv Unihan_DictionaryLikeData.txt UnihanDataDictionaryLike.txt
mv Unihan_IRGSources.txt UnihanIRGSources.txt
mv Unihan_NumericValues.txt UnihanNumericValues.txt
mv Unihan_OtherMappings.txt UnihanOtherMappings.txt
mv Unihan_RadicalStrokeCounts.txt UnihanRadicalStrokeCounts.txt
mv Unihan_Readings.txt UnihanReadings.txt
mv Unihan_Variants.txt UnihanVariants.txt

If you download everything, the names of files that are not used by mktables
are not changed by the above, and will not work correctly as-is on 8.3
filesystems.

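One way to spot such names is to look for files that become indistinguishable
once truncated. The sketch below assumes that an 8.3 filesystem keeps only the
first 8 characters before the last dot and the first 3 after it, ignoring
case; real filesystems may mangle long names differently.
list_83_collisions is a hypothetical helper, not part of mktables.

```shell
# Print each truncated 8.3 key that more than one file in DIR maps to.
# Assumption: a name is cut to 8 characters before the last dot and
# 3 characters after it, and compared case-insensitively.
list_83_collisions() {
    for path in "${1:-.}"/*; do
        [ -f "$path" ] || continue
        name=${path##*/}
        case $name in
            *.*) base=${name%.*}; ext=${name##*.} ;;
            *)   base=$name; ext= ;;
        esac
        # The %.8s and %.3s precisions truncate the two parts.
        printf '%.8s.%.3s\n' "$base" "$ext"
    done | tr '[:lower:]' '[:upper:]' | sort | uniq -d
}
```

For example, 'list_83_collisions extracted' prints nothing when every file in
'extracted' stays distinct after truncation.
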
mktables is used to generate the tables used by the rest of Perl. It will warn
you about any *.txt files in the directory substructure that it doesn't know
about. You should remove any files so identified, or edit mktables to add them
to its lists to process. You can run

    mktables -globlist

to have it try to process these tables generically.

FOR PUMPKINS

The files are inter-related. If you take the latest UnicodeData.txt, for
example, but leave the older versions of other files, there can be subtle
problems. So get everything available from Unicode, and delete those which
aren't needed.

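A mismatch between files can sometimes be caught mechanically. The sketch
below assumes that a data file starts with a comment naming its version, such
as '# PropList-5.2.0.txt'; files without such a first line (UnicodeData.txt
has none) are silently skipped, so this is a partial check at best.
list_version_mismatches is a hypothetical helper, not part of mktables.

```shell
# Print "FILE: VERSION" for each *.txt file in DIR whose first line names
# a version other than WANT. Assumption: the first line looks like
# "# Name-X.Y.Z.txt"; files that do not match are not reported.
list_version_mismatches() {
    dir=$1
    want=$2
    for f in "$dir"/*.txt; do
        [ -f "$f" ] || continue
        # Pull the version out of the first line, if it has the shape above.
        found=$(sed -n '1s/^# .*-\([0-9][0-9.]*\)\.txt.*$/\1/p' "$f")
        if [ -n "$found" ] && [ "$found" != "$want" ]; then
            printf '%s: %s\n' "$f" "$found"
        fi
    done
}
```

For example, 'list_version_mismatches . 5.2.0' lists the files in the current
directory whose header names some version other than 5.2.0.
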
When moving to a new version of Unicode, you need to update 'version' by hand:

    p4 edit version
    ...

You should look in the Unicode release notes (which are probably towards the
bottom of http://www.unicode.org/reports/tr44/) to see if any properties have
newly been moved to be Obsolete, Deprecated, or Stabilized. The full names for
these should be added to the respective lists near the beginning of mktables,
using an 'if' to add them for just this Unicode version going forward, so that
mktables can continue to be used for earlier Unicode versions.

When putting out a new Perl release, consider whether any of the Deprecated
properties should be moved to Suppressed.

The code in regexec.c for the \X match construct is intimately tied to the
regular expression in UAX #29 (http://www.unicode.org/reports/tr29/). You
should see if it has changed, and if so regexec.c should be modified. The
current one is

    ( CRLF
    | Prepend* ( Hangul-syllable | !Control )
      ( Grapheme_Extend | Spacing_Mark)*
    | . )

mktables has many checks to warn you if there are unexpected or novel things
that it doesn't know how to handle.

Finally:

    p4 submit

--
jhi@iki.fi; updated by nick@ccl4.org, public@khwilliamson.com