initial commit
[urisagit/Sort-Maker.git] / slides / sort_maker.txt
CommitLineData
7468c584 1
2book: Sort::Maker
3chapter: Sort::Maker
4
5title: A Little Past History
6
7*: TPC3 in Monterey, CA 1999
8*: Co-wrote and presented "A Fresh Look at Efficient Perl Sorting"
9html:
10 <A href="http://www.sysarch.com/perl/sort_paper.html">http://www.sysarch.com/perl/sort_paper.html</A>
11*: Co-author was Larry Rosler (retired from HP and Perl)
12*: Introduced the GRT as a way to speed up Perl sorts
13*: It won the prize for Best Technical Paper
14*: A module Sort::Records was promised which would implement the GRT
15*: Sort::Records was started but a code and design review showed many
16 problems.
17*: Sort::Records was shelved during development because of a code review
18*: People still write to me asking about that module
19
20PAGE_END
21
22title: Recent History
23
24*: I have been thinking about this module for 5 years
25*: Solved the major design flaw - have the user provide key extraction
26 code
27*: A thread in perl6-language gave needed inspiration
28*: Sort::Maker emerges
29*: Much help in the API design and some coding tricks from Damian Conway
30PAGE_END
31
32title: Introducing Sort::Maker
33
34*: Easy description of sorts
35*: Generates a sort sub
36*: Choice of generated sort styles
37*: Can generate the highest speed sorts in Perl
38*: Generated source can be printed
39**: For study
40**: For pasting in code to save the generation time overhead
41*: These slides can be found at
42
43html:
44 <A href="http://www.stemsystems.com/sort/slides/slides/index.html">http://www.stemsystems.com/slides/slides/index.html</A><BR>
45*: The current tarball is at
46
47html:
48 <A href="http://www.stemsystems.com/sort/Sort-Maker-0.01.tar.gz">http://www.stemsystems.com/sort/Sort-Maker-0.01.tar.gz</A><BR>
49
50PAGE_END
51
52title: Sort::Maker Synopsis
53
54*: Exports a single sub 'make_sorter'
55*: Each key has its own description
56*: Call make_sorter with sort description arguments and it returns a
57 code reference.
58*: Call that reference with unsorted input records
59*: Returns sorted records
60
61code:
62
63 my $sorter = make_sorter(
64 qw( plain ),
65 number => [
66 code => '/(\d+)/',
67 'descending',
68 ],
69 ) ;
70
71 my @sorted = $sorter->( @unsorted ) ;
72
73PAGE_END
74
75title: Key Descriptions
76
77*: Each key has a description
78*: Keys must be in the order they will sort (higher level keys must be
79 earlier)
80*: The key type must be specified as 'string' or 'number'
81*: Keys can have attributes such as 'descending' or 'case'
82*: Attributes can have defaults set for all the keys
83*: Default attributes can be overridden in any key description
84
85PAGE_END
86
87title: Extraction Code
88
89*: Each key needs code to extract it from a record
90*: Each record is aliased to $_ (via map)
91*: Key extraction code operates on $_ and gets the value for this key
92*: The extraction code is executed in list context so m/(foo)/ works
93*: The code is inside a do{} block so you can have multiple statements
94
95PAGE_END
96
97title: Sort Styles
98
99*: Four different sorting styles to choose from
100**: Plain
101**: Orcish Manouvre
102**: Schwartian Transform (ST)
103**: Guttman-Rosler Transform (GRT)
104*: Each has its uses and advantages
105*: Styles are really different ways to cache extracted keys
106*: Caching keys moves key extraction from O( N log N ) to O( N )
107*: In larger sizes of sort sets, caching keys is a very big win
108*: This is a classic sort of arrays of numbers
109**: Compare this code to the generated code for the different sort styles
110code:
111
112 sort { $a->[0] cmp $b->[0] ||
113 $a->[1] cmp $b->[1]
114 }
115
116PAGE_END
117
118title: Plain Sort
119
120*: No key caching
121*: Similar to common sorts with a code block
122*: Good for small sort sets as there is no caching
123*: Pass 'plain' option to make_sorter
124
125code:
126
127 sub {
128
129 sort {
130 do{ my( $left, $right ) = map { $_->[0] } $a, $b;
131 $left cmp $right }
132 ||
133 do{ my( $left, $right ) = map { $_->[1] } $a, $b;
134 $left cmp $right }
135
136 } @_ ;
137 }
138
139PAGE_END
140
141title: Orcish Manouvre Sort
142
143*: Caches extracted keys in a hash
144*: Assignments to the hash in ||=
145*: Called the orcish because of Or-Cache
146*: Created by Joseph Hall
147*: It will re-extract keys for any record with a false value
148*: The caching hash must be cleared beforehand
149**: Sort::Maker declares a caching hash in the anonymous sub
150*: Hash lookups are still O( N log N )
151*: Good for medium sort sets
152*: Pass 'orcish' option to make_sorter
153
154code:
155 sub {
156 my %or_cache ;
157
158
159 sort {
160 (
161 ( $or_cache{$a} ||=
162 do{ my ($val) = map { $_->[0] } $a ; $val } )
163 cmp
164 ( $or_cache{$b} ||=
165 do{ my ($val) = map { $_->[0] } $b ; $val } )
166 )
167 ||
168 (
169 ( $or_cache{$a} ||=
170 do{ my ($val) = map { $_->[1] } $a ; $val } )
171 cmp
172 ( $or_cache{$b} ||=
173 do{ my ($val) = map { $_->[1] } $b ; $val } )
174 )
175
176 } @_ ;
177 }
178
179PAGE_END
180
181title: Schwartian Transform (ST) Sort
182
183*: Caches extracted keys in an anonymous array for each input record
184*: Stores the record itself in slot 0 of the array
185*: Uses the map/sort/map idiom
186*: Popularized by Randal Schwartz
187*: Good for medium to large sort sets
188*: Key extraction is O( N )
189*: Does a full callback to Perl in the comparison block
190*: Pass 'ST' option to make_sorter
191
192code:
193sub {
194 map $_->[0],
195 sort {
196 $a->[1] cmp $b->[1]
197 ||
198 $a->[2] cmp $b->[2]
199
200 }
201 map [ $_,
202 do{ my ($val) = $_->[0] ; $val },
203 do{ my ($val) = $_->[1] ; $val }
204 ], @_ ;
205}
206
207PAGE_END
208
209title: Guttman-Rosler Transform (GRT) Sort
210
211*: Caches extracted keys in a single string for each input record
212*: String records can be stored in that string
213*: Reference records have their index stored in the cache string
214**: The sorted indices are used to slice into the input array to create
215 sorted records
216*: Key extraction is O( N )
217*: The comparison code is internal and uses C with no Perl callback (big
218 win)
219*: More complex to use properly and efficiently
220*: GRT has special key description attributes for optimization
221**: Numbers can be integer/float, signed/unsigned
222**: Strings can be fixed/varying
223**: Fastest sort style for larger sets of sort records
224*: Pass 'GRT' option to make_sorter
225code:
226sub {
227 my $rec_ind = 0 ;
228
229 return @_[
230 map unpack( 'N', substr( $_, -4 ) ),
231 sort
232 map pack( "Z*Z*N",
233 do{ my( $val ) = $_->[0] ; ( $val ) },
234 do{ my( $val ) = $_->[1] ; ( $val ) },
235 $rec_ind++
236 ), @_
237 ] ;
238}
239
240PAGE_END
241
242title: Tests and Benchmarks
243
244*: Test system is table driven
245*: Scripts can do both tests and benchmarks
246*: Provide a set of data
247**: Manually
248**: Generate in place with map
249**: Generate with a anonymous sub
250*: Provide a hand written 'golden' sort sub
251*: Provide the arguments to make_sorter()
252*: Many tests are in but more are wanted
253**: Send in tests if you want
254*: More complex tests are needed
255code:
256
257 {
258 skip => 0,
259 name => 'arrays of multiple strings',
260 source => 1,
261 data => [ map {
262 [ rand_token( 8, 20 ), rand_token( 8, 20 ), ]
263 } 1 .. 100
264 ],
265 gold => sub { $a->[0] cmp $b->[0] ||
266 $a->[1] cmp $b->[1] },
267 args => [ qw( string $_->[0] string $_->[1] ) ],
268 },
269
270PAGE_END
271