The *.txt files were copied from

        ftp://www.unicode.org/Public/UNIDATA

along with its subdirectories 'extracted' and 'auxiliary'.

The Unihan files were not included due to space considerations.  Also NOT
included were any *.html files.  It is possible to add the Unihan files, and
edit mktables (see instructions near its beginning) to look at them.

The file 'version' should exist and be a single line with the Unicode version,
like:
5.2.0

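The single-line requirement on 'version' can be checked mechanically.  The
following is a minimal sketch, not part of the Perl distribution:
check_version_file is a hypothetical helper name, and the three-number pattern
is an assumption based on the 5.2.0 example above.

```shell
# Hypothetical helper: succeeds only if FILE is exactly one line of the
# form MAJOR.MINOR.UPDATE (for example 5.2.0).  The three-number format
# is an assumption taken from the example, not a documented rule.
check_version_file() {
    file=$1
    [ -f "$file" ] || return 1
    # wc -l counts newline characters, so the line must be newline-terminated.
    [ "$(wc -l < "$file" | tr -d ' ')" -eq 1 ] || return 1
    # The single line must be three dot-separated numbers and nothing else.
    grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$' "$file"
}
```

For example, from this directory:

    check_version_file version || echo "version file looks wrong"
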
To be 8.3 filesystem friendly, the names of some of the input files have been
changed from the values that are in the Unicode DB.  Not all of the Test files
are currently used, so some may not be present, in which case the
corresponding mv's will fail.  The .html Test files are not touched.

mv PropertyValueAliases.txt PropValueAliases.txt
mv NamedSequencesProv.txt NamedSqProv.txt
mv DerivedAge.txt DAge.txt
mv DerivedCoreProperties.txt DCoreProperties.txt
mv DerivedNormalizationProps.txt DNormalizationProps.txt
mv extracted/DerivedBidiClass.txt extracted/DBidiClass.txt
mv extracted/DerivedBinaryProperties.txt extracted/DBinaryProperties.txt
mv extracted/DerivedCombiningClass.txt extracted/DCombiningClass.txt
mv extracted/DerivedDecompositionType.txt extracted/DDecompositionType.txt
mv extracted/DerivedEastAsianWidth.txt extracted/DEastAsianWidth.txt
mv extracted/DerivedGeneralCategory.txt extracted/DGeneralCategory.txt
mv extracted/DerivedJoiningGroup.txt extracted/DJoinGroup.txt
mv extracted/DerivedJoiningType.txt extracted/DJoinType.txt
mv extracted/DerivedLineBreak.txt extracted/DLineBreak.txt
mv extracted/DerivedNumericType.txt extracted/DNumType.txt
mv extracted/DerivedNumericValues.txt extracted/DNumValues.txt

mv auxiliary/GraphemeBreakTest.txt auxiliary/GCBTest.txt
mv auxiliary/LineBreakTest.txt auxiliary/LBTest.txt
mv auxiliary/SentenceBreakTest.txt auxiliary/SBTest.txt
mv auxiliary/WordBreakTest.txt auxiliary/WBTest.txt

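Because some of the files above may be absent, the renames can be wrapped so
that a missing file is skipped rather than reported as an mv error.  This is a
minimal sketch; rename_if_present is a hypothetical helper, not something that
mktables or the Perl build provides.

```shell
# rename_if_present OLD NEW
# Hypothetical helper: rename OLD to NEW only when OLD exists, so files
# that were never downloaded are skipped with a note on stderr.
rename_if_present() {
    if [ -f "$1" ]; then
        mv "$1" "$2"
    else
        echo "skipping $1 (not present)" >&2
    fi
}
```

Each mv line then becomes, for example:

    rename_if_present auxiliary/LineBreakTest.txt auxiliary/LBTest.txt
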
If you have the Unihan database (5.2 and above), you should also do the
following:

mv Unihan_DictionaryIndices.txt UnihanIndicesDictionary.txt
mv Unihan_DictionaryLikeData.txt UnihanDataDictionaryLike.txt
mv Unihan_IRGSources.txt UnihanIRGSources.txt
mv Unihan_NumericValues.txt UnihanNumericValues.txt
mv Unihan_OtherMappings.txt UnihanOtherMappings.txt
mv Unihan_RadicalStrokeCounts.txt UnihanRadicalStrokeCounts.txt
mv Unihan_Readings.txt UnihanReadings.txt
mv Unihan_Variants.txt UnihanVariants.txt

If you download everything, note that the above does not rename the files
that mktables does not use, so those files will not work correctly as-is on
8.3 filesystems.

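One way to spot names that would still cause trouble is to truncate each name
to eight characters plus a three-character extension and look for duplicates.
The sketch below assumes that such collisions (compared case-insensitively)
are what makes a name unfriendly to 8.3 filesystems; that reading is an
inference from the renames above, and find_8_3_collisions is a hypothetical
helper, not part of mktables.

```shell
# find_8_3_collisions [DIR]
# Hypothetical helper: print each truncated, lowercased 8.3 name that
# more than one regular file in DIR (default: current directory) maps to.
find_8_3_collisions() {
    dir=${1:-.}
    for path in "$dir"/*; do
        [ -f "$path" ] || continue
        name=${path##*/}
        # Split at the last dot into base name and extension.
        case $name in
            *.*) base=${name%.*}; ext=${name##*.} ;;
            *)   base=$name; ext= ;;
        esac
        # Keep the first 8 characters of the base and 3 of the extension.
        printf '%s.%s\n' \
            "$(printf '%s' "$base" | cut -c1-8)" \
            "$(printf '%s' "$ext" | cut -c1-3)"
    done | tr '[:upper:]' '[:lower:]' | sort | uniq -d
}
```

It prints nothing when no two names collide.  For example:

    find_8_3_collisions extracted
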
mktables is used to generate the tables used by the rest of Perl.  It will warn
you about any *.txt files in the directory substructure that it doesn't know
about.  You should remove any files so identified, or edit mktables to add
them to its lists of files to process.  You can also run

    mktables -globlist

to have it try to process these tables generically.

FOR PUMPKINS

The files are inter-related.  If you take the latest UnicodeData.txt, for
example, but leave the older versions of other files, there can be subtle
problems.  So get everything available from Unicode, and delete those which
aren't needed.

When moving to a new version of Unicode, you need to update 'version' by hand:

        p4 edit version
        ...

You should look in the Unicode release notes (which are probably towards the
bottom of http://www.unicode.org/reports/tr44/) to see if any properties have
newly been moved to be Obsolete, Deprecated, or Stabilized.  The full names of
these should be added to the respective lists near the beginning of mktables,
using an 'if' to add them for just this Unicode version going forward, so that
mktables can continue to be used for earlier Unicode versions.

When putting out a new Perl release, consider whether any of the Deprecated
properties should be moved to Suppressed.

The code in regexec.c for the \X match construct is intimately tied to the
regular expression in UAX #29 (http://www.unicode.org/reports/tr29/).  You
should check whether that expression has changed, and if so, modify regexec.c
to match.  The current one is:
( CRLF
| Prepend* ( Hangul-syllable | !Control )
  ( Grapheme_Extend | Spacing_Mark)*
| . )

mktables has many checks to warn you if there are unexpected or novel things
that it doesn't know how to handle.

Finally:

        p4 submit

-- 
jhi@iki.fi; updated by nick@ccl4.org, public@khwilliamson.com