Commit | Line | Data |
7468c584 |
1 | Return-Path: <jdporter@min.net> |
2 | From: John Porter <jdporter@min.net> |
3 | Subject: Re: sort article outline |
4 | To: uri@sysarch.com |
5 | Date: Mon, 28 Dec 1998 11:46:32 -0500 (EST) |
6 | Cc: lr@hpl.hp.com |
7 | In-Reply-To: <199812280441.XAA11943@home.sysarch.com.> from "Uri Guttman" at Dec 27, 98 11:41:54 pm |
8 | Content-Type: text/plain; charset=US-ASCII |
9 | Content-Length: 1680 |
10 | |
11 | > JP> Still linear. Maybe the constants are painfully large, but we're |
12 | > JP> still looking at O(n). |
13 | > |
14 | > i think i see why we don't agree here. i am talking about the linear set |
15 | > of bytes and you are talking about liner O(n) for the work. is that it? |
16 | |
17 | Yep. |
18 | |
19 | |
20 | > >> hey, it's english. |
21 | > |
22 | > JP> Marginally. |
23 | > |
24 | > ??? marginal quality english or marginally in english? |
25 | |
26 | The latter. My point is that wrt computer jargon, esp. Perl jargon, |
27 | english is not much of a standard. I.e. what "string" means in |
28 | english has only marginal bearing on what it means in Perl. |
29 | |
30 | |
31 | > different names. each has advantages. the ascii version can use short |
32 | > integer strings (<4 bytes) and is always endian independent. you could |
33 | > pack into a byte or short if you want to. binary can |
34 | > be more compact for some data formats (floats in particular) but endian |
35 | > issues come up and have to be dealt with. |
36 | > |
37 | > you can cover the concept as one but the conversion process is different |
38 | > as one uses sprintf and one uses pack. though the sprintf could use pack |
39 | > is some ways. the sprintf is used to zero pad integers and print floats |
40 | > in a sortable format. pack is used to store the integers in a byte order |
41 | > that cmp can sort. |
42 | |
43 | O.k, I think I see what you're getting at. "Stringify" (in Perl) has |
44 | always refered to the conversion which occurs transparently when a |
45 | non-string SV gets used as a string. And more generally, the same |
46 | effect can be achieved by sprintf. The result is a string which, when |
47 | printed, represents the non-string value of the SV (be it integer, float, |
48 | reference, whatever). |
49 | |
50 | So maybe the right way to describe a string which is produced by pack |
51 | is "packed". |
52 | |
53 | John |
54 | |
55 | |
56 | |
57 | |
58 | Return-Path: <jdporter@min.net> |
59 | From: John Porter <jdporter@min.net> |
60 | Subject: Re: sort article outline |
61 | To: uri@sysarch.com |
62 | Date: Sun, 27 Dec 1998 18:15:36 -0500 (EST) |
63 | In-Reply-To: <199812262117.QAA09392@sysarch.com.> from "Uri Guttman" at Dec 26, 98 04:17:34 pm |
64 | Content-Type: text/plain; charset=US-ASCII |
65 | Content-Length: 3189 |
66 | |
67 | |
68 | Uri wrote: |
69 | > not if you are packing integers and floats with optional endian |
70 | > reversals. the whole concept is to convert various data types to one |
71 | > concatenated set of keys sortable by cmp. cmp can't handle ints and |
72 | > floats normally. larry showed how to handle signed ints and floats with |
73 | > endian issues covered. so there is a conversion to a linear set of bytes |
74 | > sortable by cmp. the original "disparate set of keys" has been mapped to |
75 | > a single sortable key. |
76 | |
77 | Still linear. Maybe the constants are painfully large, but we're still |
78 | looking at O(n). |
79 | |
80 | |
81 | > JP> Why should we pander to the lcd? |
82 | > |
83 | > i don't think it is pandering at all. all i think is one form is a |
84 | > subset of the other form and should have a different name. |
85 | |
86 | Well clearly, printable characters are a subset (variously defined) |
87 | of the set of octets. My point is that Perl doesn't care, and |
88 | Perl programmers shouldn't care. In Perl terminology, a string is |
89 | the octet buffer part of a SV. Printability is irrelevant. |
90 | |
91 | |
92 | > there are two issues we are arguing |
93 | > here, whether there should be two names and what those names are. |
94 | |
95 | Yep. |
96 | |
97 | > JP> This is where you and I fundamentally disagree, I guess. |
98 | > |
99 | > hey, it's english. |
100 | |
101 | Marginally. |
102 | |
103 | > JP> Nothing wrong with that. The raw printability of the characters |
104 | > JP> on any given terminal is irrelevant, imho. |
105 | > |
106 | > i disagree. some may choose to use the printable form for that very |
107 | > attribute. sometime the converted keys may be kept around and used for |
108 | > other long term purposes where printability may be a needed attribute. |
109 | |
110 | That's why this sorting issue is so open-ended. We can't lay down |
111 | some hard-and-fast technique (like the ST). If the printability is |
112 | critical, then the programmer has to work within that constraint. |
113 | But if the performance of the sort is of primary concern, then she |
114 | may have to sacrifice the intrinsic printability of the comparison |
115 | keys; then they can do a transformation back into printable form, |
116 | or cache the printable form, or whatever. |
117 | |
118 | |
119 | > by having 2 names for 2 subtly different packing methods i think |
120 | > that helps them decide which to choose. |
121 | |
122 | I'm all for educating the programmer on the issues and choices. |
123 | |
124 | |
125 | > JP> I think this is a bad misuse of the term "linear". |
126 | > JP> Maybe you could come up with some other term that expresses |
127 | > JP> that notion. |
128 | > |
129 | > well, i agree the term may not be the best, but do you agree on the |
130 | > difference between the methods, as i hve described above? |
131 | |
132 | Clearly not. |
133 | |
134 | |
135 | > JP> You and I don't see eye-to-eye on this. |
136 | > |
137 | > maybe my argument above will convince you more. or i can get your |
138 | > mother-in-law to agree with me and then you will be sorry! |
139 | |
140 | Um, unfortunately, you have only clarified your position. |
141 | You have not added any persuasive content. |
142 | |
143 | |
144 | > JP> I remember he said something like that about the ST specifically; |
145 | > JP> don't remember anything about Perl generally... |
146 | > |
147 | > maybe my memory was faulty but he was so negative on others that i |
148 | > remember a slam on perl in some letters. it would be tricky to scan him |
149 | > on dejanews for just that. |
150 | |
151 | I'm not saying I don't believe he said such things; I just don't |
152 | remember them. Knowing Rutka, he probably did! |
153 | |
154 | John |
155 | |
156 | |
157 | |
158 | |
159 | Return-Path: <jdporter@min.net> |
160 | From: John Porter <jdporter@min.net> |
161 | Subject: Re: sort article outline |
162 | To: uri@sysarch.com |
163 | Date: Sat, 26 Dec 1998 11:30:34 -0500 (EST) |
164 | Cc: lr@hpl.hp.com |
165 | In-Reply-To: <199812260558.AAA08301@sysarch.com.> from "Uri Guttman" at Dec 26, 98 00:58:39 am |
166 | Content-Type: text/plain; charset=US-ASCII |
167 | Content-Length: 2668 |
168 | |
169 | |
170 | Uri wrote: |
171 | > >>>>> "JP" == John Porter <jdporter@min.net> writes: |
172 | > |
173 | > >> also if you |
174 | > >> put integers in the perl byte string, then it is not stringifying which |
175 | > >> slightly implies ascii. |
176 | > |
177 | > JP> Not in Perl, imho. |
178 | > JP> I think cmp implies strings, regardless of the character value range. |
179 | > JP> We're talking about comparing arrays of byte-sized "characters", |
180 | > JP> i.e. strings. |
181 | > |
182 | > >> so linearization is a better overall term, |
183 | > |
184 | > JP> I find your argument, such as it is, unconvincing. |
185 | > JP> In what way has the process made anything linear? |
186 | > JP> That's what "linearization" means, after all. |
187 | > |
188 | > you are taking a disparate set of keys and making them into a single key |
189 | > of a linear set of bytes. |
190 | |
191 | Even so, the "disparate set of keys" is already linear, wrt |
192 | the sorting algorithm. At least if it's done right, which |
193 | presumably it is. |
194 | |
195 | |
196 | > i understand that perl is 8 bit clean regarding strings and cmp. and cmp |
197 | > is an unsigned compare (IIR). but i (and some others) feel that the word |
198 | > string connotes some printable attributes. |
199 | |
200 | Anyone harboring that notion should be edified. Why should we |
201 | pander to the lcd? |
202 | |
203 | |
204 | > >> (ascii really means printing chars, and probably just digits and |
205 | > >> us-ascii unless you use locale and it works for extendied ascii). |
206 | > |
207 | > JP> Um, imho, all this stuff about ascii is irrelevant. |
208 | > |
209 | > well how would you differentiate converting a key to printable ascii |
210 | > vs. integer bytes? |
211 | |
212 | The important question is, *why* would you? |
213 | I don't think we should, in this case at least. |
214 | |
215 | |
216 | > stringifying implies printable. |
217 | |
218 | This is where you and I fundamentally disagree, I guess. |
219 | |
220 | > but if you create a 8 bit set of bytes that would be something |
221 | > more complex that won't print (without going to hex or something). |
222 | |
223 | Nothing wrong with that. The raw printability of the characters |
224 | on any given terminal is irrelevant, imho. |
225 | |
226 | |
227 | > that |
228 | > is linearization since a set of integers and floats, are now a linear |
229 | > set of sortable bytes, but not a printable string. |
230 | |
231 | I think this is a bad misuse of the term "linear". |
232 | Maybe you could come up with some other term that expresses |
233 | that notion. |
234 | |
235 | |
236 | > this is fine edge semantics, but i do like having two terms for the |
237 | > printable and non-printable converted keys. maybe linearization is not a |
238 | > great term but i want to reserve stringifying for printable only. |
239 | |
240 | Whatever. You and I don't see eye-to-eye on this. |
241 | |
242 | |
243 | > i recall [Rutka] had some early misstatements about not liking perl |
244 | > (or we interpreted his poor wording to mean that). |
245 | |
246 | I remember he said something like that about the ST specifically; |
247 | don't remember anything about Perl generally... |
248 | |
249 | John |
250 | |
251 | |
252 | |
253 | |
254 | Return-Path: <jdporter@min.net> |
255 | From: John Porter <jdporter@min.net> |
256 | Subject: Sorting article |
257 | To: uri@sysarch.com |
258 | Date: Sat, 2 Jan 1999 23:45:47 -0500 (EST) |
259 | Cc: lr@hpl.hp.com |
260 | In-Reply-To: <199901021831.NAA12831@home.sysarch.com.> from "Uri Guttman" at Jan 2, 99 01:31:45 pm |
261 | Content-Type: text/plain; charset=US-ASCII |
262 | Content-Length: 1354 |
263 | |
264 | |
265 | > John wrote: |
266 | > > Uri wrote: |
267 | > > > in both cases, the original data is usually concatenated onto the end of |
268 | > > > the created sort key and then extracted (usually with substr) after the |
269 | > > > sort is done. neither string sort can handle perl references directly, |
270 | > > > though the referenced data can be used to create a single sort key and |
271 | > > > then ST or OM can then be used for the actual sort. |
272 | > > |
273 | > > I don't think I'd say "usually". You should say "possibly", with |
274 | > > a description of the situations that allow it. |
275 | > |
276 | > fine. what situations exist where you would not need to concatenate the |
277 | > original data onto the key? |
278 | |
279 | I'm not talking not needing to do it, I'm talking about not being able |
280 | to do it. If you're sorting a list of "objects" e.g., you can't simply |
281 | concat a reference onto the end of a string. However, as you pointed out, |
282 | one could simply concat an index for the object; so now I can't think of |
283 | any situations where the pack technique can't be used. |
284 | |
285 | The slice of the pie I'd like to tackle is addressing certain |
286 | issues, namely: |
287 | 1. Comparison keys are unique, slightly non-unique, or highly non-unique; |
288 | 2. Cost of deriving the comparison key from each datum is nil, small, |
289 | or large. |
290 | According to my benchmarks, these had a significant influence on the |
291 | relative performance of various techniques (ST, OM, etc.). |
292 | |
293 | John |
294 | |
295 | |
296 | |
297 | |
298 | Return-Path: <jdporter@min.net> |
299 | From: John Porter <jdporter@min.net> |
300 | Subject: Re: Sorting article |
301 | To: uri@sysarch.com |
302 | Date: Mon, 4 Jan 1999 08:59:18 -0500 (EST) |
303 | Cc: lr@hpl.hp.com |
304 | In-Reply-To: <199901030455.XAA13708@home.sysarch.com.> from "Uri Guttman" at Jan 2, 99 11:55:35 pm |
305 | Content-Type: text/plain; charset=US-ASCII |
306 | Content-Length: 1306 |
307 | |
308 | Uri wrote: |
309 | > good. i like the point you make about the index vs. the reference. this |
310 | > could be a sort of combination of swchartzian and stringify (either |
311 | > flavor). if you have to sort a list of refs (or complex stuff), |
312 | > preprocess into a array with a single ref to the real data, make a list |
313 | > of stringified keys and append the index to the key and sort with |
314 | > builtin cmp. then extract out the indexes and original ref(s). the |
315 | > schwartzian part is doing it with maps. |
316 | |
317 | The similary to ST had occured to me. |
318 | map { (split("\0",$_,2))[0] } |
319 | sort |
320 | map { join "\0", $_->cmpkey(), $_ } |
321 | @a |
322 | |
323 | But this is as close as it gets. The other "advantage" of ST -- |
324 | multiple key comparisons -- is obviated by our technique. |
325 | |
326 | |
327 | > tho i beleive temp arrays and |
328 | > foreach modifiers can be or are faster than maps. we have seen cases |
329 | > where multiple maps ar slow. i think it is because it has to recopy all |
330 | > the values for each map. a foreach on a temp array would use the same |
331 | > array without copying. |
332 | |
333 | I thought the slow part was all those anonymous arrays which are |
334 | created in the first map. map itself doesn't allocate an AV, it makes |
335 | a list. Same with sort. So if you're sorting a list of 1000 items, |
336 | you have three lists created (map, sort, map), but 1000 anonymous AV's |
337 | allocated in the first map. |
338 | |
339 | John |
340 | |
341 | |
342 | |
343 | |
344 | Return-Path: <lr@hpl.hp.com> |
345 | From: Larry Rosler <lr@hpl.hp.com> |
346 | To: "'Uri Guttman'" <uri@sysarch.com>, jdporter@min.net |
347 | Subject: RE: Sorting article |
348 | Date: Mon, 4 Jan 1999 12:55:54 -0800 |
349 | Content-Type: text/plain; |
350 | charset="iso-8859-1" |
351 | Content-Length: 4272 |
352 | |
353 | > From: Uri Guttman [mailto:uri@sysarch.com] |
354 | > Sent: Monday, January 04, 1999 08:23 |
355 | > To: jdporter@min.net |
356 | > Cc: lr@hpl.hp.com |
357 | > Subject: Re: Sorting article |
358 | ... |
359 | > P.S. larry has some catchup reading and replying to do now that he |
360 | > should be back from vacation. |
361 | |
362 | Yes, I am back. We drove 7 hours yesterday from Southern California, |
363 | much of it through some heavy fog in the Central Valley, so I am still a |
364 | little bleary. And I have tried to plow through over 1200 articles in |
365 | c.l.p.misc, with so far only two conclusions: |
366 | |
367 | 1. The group got along fine without me (something to think about for |
368 | the future :-). |
369 | |
370 | 2. Uri seems to have had a couple of really bad days through the past |
371 | week or so. :-( |
372 | |
373 | I haven't had a chance to read all of the correspondence between you two |
374 | yet, but I did skim a couple, and noticed the mention of sorting |
375 | indexes. I "solved" the problem of using the default sort on references |
376 | while on vacation, *literally* while I was sleeping (i.e., I woke up and |
377 | wrote it down!). I then generalized it some, and thought it would be a |
378 | world-beater speedwise. But my first measurements are somewhat |
379 | disappointing. |
380 | |
381 | Here is what I have so far; I'm sharing it before doing much else |
382 | because your comments and observations would be most useful. It is |
383 | actually a different sorting paradigm, as you will see: |
384 | |
385 | EVOLUTION OF EFFICIENT PERL SORTS |
386 | |
387 | Naive { sortkey(value_a) cmp sortkey(value_b) } |
388 | Orcish $h{ value } ||= sortkey OR @h{ values } = sortkeys |
389 | Schwartz [ value, sortkey, ... ] |
390 | Linearized sortkey . value |
391 | New @h{ sortkeys } = values |
392 | |
393 | |
394 | |
395 | #!/usr/local/bin/perl -w |
396 | use Benchmark; |
397 | |
398 | my $n = 10000; |
399 | my @a = map { [ int(rand 1000), 'foo' . int(rand 1000) ] } 1 .. $n; |
400 | |
401 | sub new { |
402 | my $i = 'aaa'; |
403 | my %h; |
404 | @h{ map pack('N A* A*' => $_->[0], $_->[1], $i++) => @a } = @a; |
405 | my @out = @h{ sort keys %h }; |
406 | } |
407 | |
408 | sub usual { |
409 | my @out = sort { $a->[0] <=> $b->[0] || $a->[1] cmp $b->[1] } @a; |
410 | } |
411 | |
412 | timethese(1 << (shift || 0), { |
413 | New => sub { &new }, |
414 | Usual => sub { &usual }, |
415 | }); |
416 | __END__ |
417 | |
418 | Benchmark: timing 16 iterations of New, Usual... |
419 | New: 10 wallclock secs ( 9.56 usr + 0.14 sys = 9.70 CPU) |
420 | Usual: 13 wallclock secs (12.92 usr + 0.00 sys = 12.92 CPU) |
421 | |
422 | Comments: |
423 | |
424 | $i serves two functions: insuring the uniqueness of the sort key (which |
425 | is saved as the key to a hash, so must be unique); preserving the |
426 | stability of the sort (entries that sort equal preserve their order in |
427 | the output array). |
428 | |
429 | Do you like those two hash slices, Uri? Take a look at the next |
430 | attempt, which adds an array slice. |
431 | |
432 | In the above case, storing the references themselves as values in the |
433 | hash is efficient. But I was concerned about generalizing this sorting |
434 | paradigm to values that were not references, but might be arbitrarily |
435 | long strings. Before I left on vacation, someone had posted about |
436 | perhaps sorting references to the strings, or indexes into the array. |
437 | So I tried the following: |
438 | |
439 | #!/usr/local/bin/perl -w |
440 | use Benchmark; |
441 | |
442 | my $n = 10000; |
443 | my @a = map { sprintf "XXX %d %s %s", int(rand 1000), 'foo' . int(rand |
444 | 1000), |
445 | 'x' x 100 . "\n" } 1 .. $n; |
446 | |
447 | sub indexed { |
448 | my $i = 'aaa'; |
449 | my %h; |
450 | @h{ map pack('N A* A*' => /(\d+) (\w+)/, $i++) => @a } = 0 .. $#a; |
451 | my @out = @a[ @h{ sort keys %h } ]; |
452 | } |
453 | |
454 | sub linear { |
455 | my @out = map { substr($_, 1 + rindex $_, "\0") } sort |
456 | map pack('N A* x A*' => /(\d+) (\w+)/, $_) => @a; |
457 | } |
458 | |
459 | timethese(1 << (shift || 0), { |
460 | Indexed => sub { &indexed }, |
461 | Linear => sub { &linear }, |
462 | }); |
463 | __END__ |
464 | |
465 | Benchmark: timing 16 iterations of Indexed, Linear... |
466 | Indexed: 15 wallclock secs (14.06 usr + 0.58 sys = 14.64 CPU) |
467 | Linear: 13 wallclock secs (12.44 usr + 0.61 sys = 13.05 CPU) |
468 | |
469 | Comments: |
470 | |
471 | I had hoped that this would be better, because it seemingly avoids |
472 | appending the data to the sortkeys and then extracting via the substrs. |
473 | But both of those operations take place in the O(n) part of the sort, |
474 | not in the O(n log n). So perhaps I was expecting too much. But why is |
475 | it *slower*??? |
476 | |
477 | I have to catch up now on the rest of the world, including my *real* |
478 | job. But I wanted to get this off my chest (and onto yours? :-) ASAP. |
479 | |
480 | HNY!!! |
481 | |
482 | Larry |
483 | |
484 | -- |
485 | Larry Rosler |
486 | Hewlett-Packard Company |
487 | http://www.hpl.hp.com/personal/Larry_Rosler/ |
488 | lr@hpl.hp.com |
489 | |
490 | |
491 | |
492 | From: Larry Rosler <lr@hpl.hp.com> |
493 | To: "'John Porter'" <jdporter@min.net> |
494 | Cc: uri@sysarch.com |
495 | Subject: RE: Sorting article |
496 | Date: Wed, 6 Jan 1999 13:27:16 -0800 |
497 | Content-Type: text/plain |
498 | Content-Length: 1096 |
499 | |
500 | > From: John Porter [mailto:jdporter@min.net] |
501 | > Sent: Wednesday, January 06, 1999 13:02 |
502 | > To: lr@hplb.hpl.hp.com |
503 | > Cc: uri@sysarch.com |
504 | > Subject: Re: Sorting article |
505 | > |
506 | > |
507 | > Larry wrote: |
508 | > > New @h{ sortkeys } = values |
509 | > >... |
510 | > > @h{ map pack('N A* A*' => $_->[0], $_->[1], $i++) => @a } = @a; |
511 | > > my @out = @h{ sort keys %h }; |
512 | > |
513 | > This actually reminds me of a technique I first learned about from |
514 | > Joseph Hall, see EPP, the bottom of p. 47, where he does something |
515 | > similar with an array. He calls it a "mundane" way to sort. ! |
516 | > |
517 | > John |
518 | |
519 | I think you clipped the wrong piece from my letter. Hall is |
520 | demonstrating the sort-the-indexes sort, but his sort function uses the |
521 | 'mundane' key comparisons (from an array instead of a hash). I was |
522 | trying to combine the index sort with the default sort. This is the |
523 | right clip: |
524 | |
525 | sub indexed { |
526 | my $i = 'aaa'; |
527 | my %h; |
528 | @h{ map pack('N A* A*' => /(\d+) (\w+)/, $i++) => @a } = 0 .. $#a; |
529 | my @out = @a[ @h{ sort keys %h } ]; |
530 | } |
531 | |
532 | The nesting of the hash slice inside of the array slice is not |
533 | 'mundane', is it? |
534 | |
535 | Larry |
536 | |
537 | |
538 | |
539 | From: Larry Rosler <lr@hpl.hp.com> |
540 | To: "'Uri Guttman'" <uri@sysarch.com>, jdporter@min.net |
541 | Subject: RE: sort article outline |
542 | Date: Thu, 7 Jan 1999 23:05:27 -0800 |
543 | Content-Type: text/plain |
544 | Content-Length: 1258 |
545 | |
546 | Hi. I'm still way behind on stuff, especially this work. But here is a |
547 | thought on terminology. |
548 | |
549 | Rather than 'stringified sort' or 'linearized sort', how about |
550 | 'lexicographic sort'? If that isn't definitive enough, consider 'pure |
551 | lexicographic sort' or 'default lexicographic sort'. |
552 | |
553 | Which reminds me: Whoever now owns the draft of the paper (Uri?) -- |
554 | make sure to include a note that all this requires that 'no locale;' be |
555 | in effect, or the idea falls apart. This is from perllocale: |
556 | |
557 | <QUOTE> |
558 | no locale; |
559 | print +(sort grep /\w/, map { chr() } 0..255), "\n"; |
560 | |
561 | This machine-native collation (which is what you get unless use locale |
562 | has appeared earlier in the same block) must be used for sorting raw |
563 | binary data, whereas the locale-dependent collation of the first example |
564 | is useful for natural text. |
565 | </QUOTE> |
566 | |
567 | Note the 'sorting raw binary data' which is what we are talking about. |
568 | |
569 | I dislike 'stringified' or 'stringized' for several reasons, not the |
570 | least of which is that verbification of adjectives isn't good English, |
571 | IMO. 'Linearized' (also a verbification of an adjective, but with a |
572 | better pedigree) may be a little off-beat semantically. I think |
573 | 'lexicographic' comes from the C Standard -- I'll take a look tomorrow. |
574 | |
575 | Larry |
576 | |
577 | |
578 | |
579 | From: Larry Rosler <lr@hpl.hp.com> |
580 | To: "'Uri Guttman'" <uri@sysarch.com> |
581 | Subject: RE: sort article outline |
582 | Date: Fri, 8 Jan 1999 10:36:21 -0800 |
583 | Content-Type: text/plain; |
584 | charset="iso-8859-1" |
585 | Content-Length: 7713 |
586 | |
587 | > From: Uri Guttman [mailto:uri@sysarch.com] |
588 | > Sent: Friday, January 08, 1999 10:07 |
589 | > To: lr@hpl.hp.com |
590 | > Cc: jdporter@min.net |
591 | > Subject: Re: sort article outline |
592 | > |
593 | > larry, |
594 | > |
595 | > i seemed to have misplaced (lost) your comments to my outline. do you |
596 | > have that letter and can you forward it to me? i suspect you keep all |
597 | > you old email around. you seem that type. i delete most of mine. i |
598 | will |
599 | > be switching to vm on emacs which has folders and stuff and it might |
600 | > improve that aspect of things. |
601 | > |
602 | > uri |
603 | |
604 | All it costs is disk space, now on someone else's server (because I have |
605 | switched from running elm on my Unix box to using Microsoft Exchange, |
606 | which the rest of my group uses). |
607 | |
608 | The letter is below, and I have not re-edited it. But note some things: |
609 | the use of 'lexicographic' (as in my most recent suggeston) and the |
610 | mention of 'index sorts' which have since received more discussion |
611 | between me and John. |
612 | |
613 | Larry |
614 | |
615 | > From: Uri Guttman [mailto:uri@sysarch.com] |
616 | > Sent: Tuesday, December 22, 1998 21:47 |
617 | > To: jdporter@min.net; lr@hpl.hp.com |
618 | > Subject: sort article outline |
619 | > |
620 | > hi guys, |
621 | > |
622 | > i felt productive tonight and i didn't have other work and news was |
623 | slow |
624 | > so i wrote an outline for the sort article. i like it so far and i |
625 | think |
626 | > it will make for a good article (or 2). |
627 | |
628 | Likely the latter, if it is thorough (as the outline seems to be). |
629 | |
630 | > so please shred it up and send your comments. when we agree on a |
631 | > (semi-) final outline, then we can start fleshing out the paragraphs |
632 | > with text and code and benchmarks. |
633 | > |
634 | > an open area is benchmarking and describing when to use each of the |
635 | > speedup tricks. this will need some research and playing with code and |
636 | > data. |
637 | |
638 | 'When to use' is what the summary is about. See below. |
639 | |
640 | > we should have a set of benchmark data we create to drive this so we |
641 | can |
642 | > all use it and cme up with useful benchmarks. having this stuff on a |
643 | > website (maybe MJD could help us) and having others playing with our |
644 | > stuff could be an intersting idea. |
645 | > |
646 | > otherwise, a belated happy hannukah to larry (we threw a sitdown |
647 | dinner |
648 | > party for 14 on sunday, the eigth night. we had 8 menorahs going!), |
649 | and |
650 | > happy holidays to you john. |
651 | |
652 | Nice image (8 menorahs * 9 candles each). And to you both. |
653 | |
654 | > uri |
655 | > |
656 | > |
657 | > Outline for TPJ Sort Article |
658 | > |
659 | > (Note the use of caps here :-) |
660 | > |
661 | > I. Overview of sorting in the real world. |
662 | |
663 | Refs. Knuth Vol. 3, Chap. 5 |
664 | FMTYEWTK |
665 | |
666 | Introduce O() concept and notation briefly |
667 | |
668 | > A. What is sorting |
669 | > |
670 | > B. Why is it needed |
671 | |
672 | Any program that produces tabular output is almost certain to need it! |
673 | |
674 | > C. Sort algorithms |
675 | > |
676 | > 1. bubble |
677 | |
678 | O(n**2) |
679 | |
680 | > 2. insertion |
681 | |
682 | O(n**2) |
683 | |
684 | > 3. tree |
685 | |
686 | O(n log n) |
687 | |
688 | > 4. shell |
689 | |
690 | O(n log n) |
691 | |
692 | > 5. quicksort |
693 | |
694 | O(n log n) |
695 | |
696 | > 6. others |
697 | > |
698 | > D. O(n) order function |
699 | |
700 | Is this later than needed, in order to characterize the sort |
701 | algorithms mentioned above? |
702 | |
703 | > 1. What it means |
704 | > |
705 | > 2. Constant and linear overhead for pre- and post- |
706 | > processing |
707 | > |
708 | > 3. Constant overhead per compare |
709 | > |
710 | > 4. Why reducing compare count is critical |
711 | > |
712 | > 5. Why amount of sort data matters |
713 | > |
714 | > II. Perl's sort function |
715 | |
716 | Implementation -- C qsort() with optional callbacks to a Perl function? |
717 | Limitations of Perl internal sorting -- size of data set, w/o |
718 | sort-merge. |
719 | Usefulness of highly tuned system sort command or commercial |
720 | sort packages (e.g., SyncSort) |
721 | |
722 | > A. Builtin compare is cmp |
723 | |
724 | and implementation is optimal (C or assembly memcmp function) |
725 | |
726 | > B. Custom compare code |
727 | |
728 | (in callback function) |
729 | |
730 | > 1. block of code |
731 | > |
732 | > 2. sub name |
733 | > |
734 | > 3. sub type glob |
735 | > |
736 | > 4. sub ref (to be done) |
737 | > |
738 | > C. Special vars $a and $b in compare |
739 | > |
740 | > III Ways to sort in Perl |
741 | |
742 | Ascending alphabetic (lexicographic) sort (default) |
743 | |
744 | > A. Numeric sort |
745 | > |
746 | > B. Reverse sort |
747 | > |
748 | > C. Multi-level sort |
749 | |
750 | Are these 'field' sorts? |
751 | |
752 | > D. Array index sort |
753 | |
754 | This is a derived-key sort. |
755 | |
756 | > E. Hash value sort |
757 | |
758 | This is a derived-key sort. |
759 | |
760 | > F. Complex compare code |
761 | |
762 | No single-valued sort key per datum. Beyond the scope of |
763 | this article. |
764 | |
765 | > IV. Efficient sorting in Perl |
766 | > |
767 | > A. When efficiency is not warranted |
768 | ^ an investment in improving |
769 | |
770 | > 1. Short data sets |
771 | > |
772 | > 2. Rarely executed program |
773 | |
774 | More generally, low cost compared to other parts of the |
775 | program. Quote from Rob Pike about when to optimize (don't!). |
776 | |
777 | > B. Sort overhead |
778 | > |
779 | > 1. builtin vs. {$a cmp $b} |
780 | |
781 | Trivial sorts (key is trivially derived from datum): |
782 | a. {$a <=> $b} or reverse |
783 | b. {$hash{$a} cmp $hash{$b}} or numeric; or reverse |
784 | c. {$a->[CONST] cmp $b->[CONST]} or numeric; or reverse |
785 | |
786 | > C. General ideas on how to speed up sort |
787 | > |
788 | > 1. Preprocessing data |
789 | to achieve a trivial sort |
790 | |
791 | > 2. Simple and fast compare code (builtin if possible) |
792 | |
793 | Trivial sort. |
794 | |
795 | > 3. Postprocessing data from LOL (if needed) |
796 | > |
797 | > D. Speed up techniques |
798 | > |
799 | > 1. Schwartzian transform (from Randal Schwartz) |
800 | > |
801 | > a. Converts input to LOL, index sort, convert LOL to output |
802 | |
803 | index sort? This is Trivial Sort c (see above). |
804 | |
805 | > b. uses maps or temp arrays |
806 | > |
807 | > 2. Orcish (Or-cache) manoever (from Joseph Hall) |
808 | > |
809 | > a. Converts sort data in compare code |
810 | > |
811 | > b. Converts only once per unique sort data |
812 | |
813 | but checks every time to see if key derivation is needed. |
814 | |
815 | > c. Stores converted keys in private hash |
816 | |
817 | This is Trivial Sort b (see above). |
818 | |
819 | > d. No need for post processing, as sort is on real data |
820 | |
821 | Where is the discussion about creating the hash during a |
822 | pre-processing pass, obviating the test on each comparand? |
823 | |
824 | > 3. Stringifying (need better name) (from many people) |
825 | |
826 | How about 'Linearized Sort'? |
827 | |
828 | > a. Preprocessing of input data to sortable strings |
829 | |
830 | 'string' == arbitrary sequence of bytes, stored in a Perl |
831 | scalar variable (u.e., linear). |
832 | |
833 | > i. uses either pack or sprintf (or other perl code). |
834 | > |
835 | > ii. Converted key is combination of binary and ascii |
836 | ^^ may be |
837 | |
838 | > iii. Original data is appended onto converted key |
839 | |
840 | See below -- cannot be references. |
841 | |
842 | > b. Uses builtin compare (fastest) |
843 | > |
844 | > c. Postprocess keys to get original data (use substr |
845 | or unpack) |
846 | |
847 | Note that this is isomorphic to the Schwartz Transform |
848 | map/sort/map paradigm. |
849 | |
850 | > d. Can handle integers (signed or unsigned), ascii |
851 | strings, and |
852 | ^^^^^ |
853 | ??? see above |
854 | > normal or reverse sorts on a per key basis. |
855 | > |
856 | > e. Can handle floats (IEEE format only) with code to |
857 | deal with big |
858 | ^ non-negative, until someone gets smarter! |
859 | > vs. little endian. |
860 | |
861 | Cannot handle references! |
862 | |
863 | 4. Summary |
864 | |
865 | a. Examples |
866 | IP addresses |
867 | Fully-qualified domain names (right to left) |
868 | ... |
869 | |
870 | b. When to use |
871 | Criteria |
872 | Benchmarks |
873 | |
874 | > |
875 | > -- |
876 | > Uri Guttman ----------------- SYStems ARCHitecture and |
877 | Software Engineering |
878 | > Perl Hacker for Hire ---------------------- Perl, |
879 | Internet, UNIX Consulting |
880 | > uri@sysarch.com ------------------------------------ |
881 | http://www.sysarch.com |
882 | The Best Search Engine on the Net ------------- |
883 | http://www.northernlight.com |
884 | |
885 | Good start. Excelsior! |
886 | |
887 | I am having trouble minding my tongue about Rutka's obnoxious posts, but |
888 | he won't disappear (and he now works for Lucent, my alma mater. |
889 | Sigh...) |
890 | |
891 | I will be away from December 25 through January 3; real vacation (no |
892 | electronic communication). |
893 | |
894 | Happy New Year!!! |
895 | |
896 | Larry |
897 | |
898 | -- |
899 | Larry Rosler |
900 | Hewlett-Packard Company |
901 | http://www.hpl.hp.com/personal/Larry_Rosler/ |
902 | lr@hpl.hp.com |
903 | |
904 | |
905 | |
906 | From: Larry Rosler <lr@hpl.hp.com> |
907 | To: "'Uri Guttman'" <uri@sysarch.com> |
908 | Cc: jdporter@min.net |
909 | Subject: RE: sort article outline |
910 | Date: Fri, 8 Jan 1999 12:32:49 -0800 |
911 | Content-Type: text/plain |
912 | Content-Length: 2465 |
913 | |
914 | > From: Uri Guttman [mailto:uri@sysarch.com] |
915 | > Sent: Friday, January 08, 1999 09:59 |
916 | > To: lr@hpl.hp.com |
917 | > Cc: jdporter@min.net |
918 | > Subject: Re: sort article outline |
919 | > |
920 | ... |
921 | > LR> make sure to include a note that all this requires that 'no |
922 | locale;' be |
923 | > LR> in effect, or the idea falls apart. This is from perllocale: |
924 | > |
925 | > LR> <QUOTE> |
926 | > LR> no locale; |
927 | > LR> print +(sort grep /\w/, map { chr() } 0..255), "\n"; |
928 | > |
929 | > LR> This machine-native collation (which is what you get unless use |
930 | locale |
931 | > LR> has appeared earlier in the same block) must be used for sorting |
932 | raw |
933 | > LR> binary data, whereas the locale-dependent collation of the first |
934 | example |
935 | > LR> is useful for natural text. |
936 | > LR> </QUOTE> |
937 | > |
938 | > LR> Note the 'sorting raw binary data' which is what we are talking |
939 | about. |
940 | > |
941 | > is this due to signed/unsigned cmp issues? that 0 .. 255 might map to |
942 | > 128 ..255, 0 .. 127 on some machines because of signed char compares? |
943 | > this is another porting issue for packed sorts. not just endianess. |
944 | |
945 | No. Without looking at the perl sources, I venture to guess that it is |
946 | because the 'no locale;' default sort uses 'memcmp' while the 'use |
947 | locale;' default sort uses 'strcoll' which relies on LC_COLLATE. There |
948 | might be a signedness problem if 'strncmp' were used instead of |
949 | 'memcmp'. But I recall John saying that 'memcmp' was used (and it would |
950 | *have* to be to conform to Perl's definition of a string, which is a |
951 | sequence of arbitrary bytes whose length is specified by a separate |
952 | integer, vs. C's definition of a string, which is a sequence of bytes |
953 | terminated by a zero byte). In any case, I have yet to see any machine |
954 | dependencies in my tests on PA-RISC or Intel architectures. Does anyone |
955 | know of any target that uses signed-char compares that this could be |
956 | verified against -- or just look into the perl source (which I haven't |
957 | done yet) to make sure it uses 'memcmp'. |
958 | |
959 | > printable sorts (hey i like these terms) need to worry about |
960 | > locale tho. |
961 | |
962 | Well, that is interesting. You distinguish between a 'packed' sort key |
963 | (which *surely* demands 'no locale;') and a 'printable' ('stringized') |
964 | sort key (which perhaps relies on 'use locale;'). Maybe that is a real |
965 | distinction that should be understood and documented. A mixed sort key |
966 | (packed number and a string, for example) would have to be sorted |
967 | without locale, though the number could be stringized by 'sprintf |
968 | "%10u"' in which case locale could be observed. |
969 | |
970 | Larry |
971 | |
972 | |
973 | |
974 | From: Larry Rosler <lr@hpl.hp.com> |
975 | To: "'Uri Guttman'" <uri@sysarch.com> |
976 | Cc: "'jdporter@min.net'" <jdporter@min.net> |
977 | Subject: RE: sort article outline |
978 | Date: Fri, 8 Jan 1999 12:59:24 -0800 |
979 | Content-Type: text/plain |
980 | Content-Length: 1266 |
981 | |
982 | ... |
983 | > No. Without looking at the perl sources, I venture to guess |
984 | > that it is because the 'no locale;' default sort uses |
985 | > 'memcmp' while the 'use locale;' default sort uses 'strcoll' |
986 | > which relies on LC_COLLATE. There might be a signedness |
987 | > problem if 'strncmp' were used instead of 'memcmp'. But I |
988 | > recall John saying that 'memcmp' was used (and it would |
989 | > *have* to be to conform to Perl's definition of a string, |
990 | > which is a sequence of arbitrary bytes whose length is |
991 | > specified by a separate integer, vs. C's definition of a |
992 | > string, which is a sequence of bytes terminated by a zero |
993 | > byte). In any case, I have yet to see any machine |
994 | > dependencies in my tests on PA-RISC or Intel architectures. |
995 | > Does anyone know of any target that uses signed-char compares |
996 | > that this could be verified against -- or just look into the |
997 | > perl source (which I haven't done yet) to make sure it uses 'memcmp'. |
998 | |
999 | >From 'sv.c', the only comparison routine used is 'memcmp'. When locale |
1000 | is in effect, each comparand is passed once through strxfrm and the |
1001 | transformed value is saved for sort comparisons. (The mechanism must be |
1002 | like that which converts numbers to strings or v.v. once only.) |
1003 | |
1004 | So I think character-signedness is a non-issue. |
1005 | |
1006 | Larry |
1007 | |
1008 | |
1009 | |
1010 | From: Larry Rosler <lr@hpl.hp.com> |
1011 | To: "'John Porter'" <jdporter@min.net>, uri@sysarch.com |
1012 | Subject: RE: sort article outline |
1013 | Date: Fri, 8 Jan 1999 14:29:13 -0800 |
1014 | Content-Type: text/plain |
1015 | Content-Length: 2430 |
1016 | |
1017 | > From: John Porter [mailto:jdporter@min.net] |
1018 | > Sent: Friday, January 08, 1999 11:03 |
1019 | > To: uri@sysarch.com |
1020 | > Cc: lr@hpl.hp.com |
1021 | > Subject: Re: sort article outline |
1022 | > |
1023 | > Uri wrote: |
1024 | > > JP> But it's also long-winded. Can't we simply talk about the |
1025 | > > JP> "default sort" (locale issues aside)? |
1026 | > > |
1027 | > > i don't like that because it doesn't describe what transforms are |
1028 | being |
1029 | > > done to get it to use the builtin cmp. just oo vague a term. |
1030 | > |
1031 | > Same applies to "default lexicographic sort". It's not really for |
1032 | > describing any of the transforms we're investigation, but for the |
1033 | > underlying sort (qsort/cmp) which we're exploiting. |
1034 | > |
1035 | > > what is important is the method of transforms that |
1036 | > > allows you to use the builtin cmp. so i feel the name |
1037 | > should reflect that. |
1038 | > |
1039 | > Right. But everything needs a name, even the "default sort". |
1040 | |
1041 | Here, in journalistic format, is a synopsis of where we stand. You may |
1042 | prefer 'Advanced' to 'Enhanced'. |
1043 | |
1044 | WHAT: The Enhanced Default Sort |
1045 | |
1046 | WHY: For optimal performance on large sorts (often with several sort |
1047 | fields), compare sort keys using built-in C functions rather than |
1048 | using Perl code in a sort subroutine. |
1049 | |
1050 | WHERE: Wherever the sort is based on comparing a single-valued mapping |
1051 | function of each value to be sorted. (This is true for all |
1052 | enhanced sorts -- ST, Orcish, ...) |
1053 | |
1054 | HOW: For each value to be sorted, create a sort key that is a Perl |
1055 | string (a sequence of bytes) corresponding lexicographically to |
1056 | the |
1057 | position of the value in the sorted output; append the value to |
1058 | the |
1059 | sort key so that it can readily be retrieved. The sort key can |
1060 | be |
1061 | created using any combination of 'pack', 'sprintf' or |
1062 | concatenation. |
1063 | Each field to be sorted may comprise a signed or unsigned |
1064 | integer, |
1065 | a floating-point number, or a string of characters. (Special |
1066 | handling is required if the values to be sorted are references -- |
1067 | LoL or LoH.) |
1068 | |
1069 | WHEN: Compute the keys in a single pass through the list of values |
1070 | (map). |
1071 | Sort. Retrieve the values from the keys (map). |
1072 | |
1073 | Other minor issues: As stated, this method sorts two values whose keys |
1074 | compare equal according to the values themselves (which is the usual |
1075 | fall-through condition for a fielded sort). A 'stable' sort can be |
1076 | forced (using any sorting method) by adding a sequential index as the |
1077 | last sort key. |
1078 | |
1079 | GO NINERS!!! (Bye-bye, Pats :-() |
1080 | |
1081 | Larry |
1082 | |
1083 | |
1084 | |
1085 | i have to write these down now or i may forget them. |
1086 | |
1087 | a sort module api could look like this: |
1088 | |
1089 | $sort_obj = Sort->new( [ sort keys desriptions ], gloabl sort options |
1090 | |
1091 | sort key description => { |
1092 | |
1093 | 'type' => integer, string, float, etc. |
1094 | |
1095 | 'down' => yes |
1096 | |
1097 | 'code_to_get_key_value' => code ref OR code text. |
1098 | |
1099 | } |
1100 | |
1101 | $sort_obj->values( list of values or refs) |
1102 | |
1103 | can be called again to add more to list. |
1104 | |
1105 | $sort_obj->sort returns sorted data |
1106 | |
1107 | |
1108 | if the option is to build a preprocess sub, we take the code texts and |
1109 | create the text for it. then it will run fast, an optimization for large |
1110 | sorts. otherwise, we use the code ref which takes a ref (and key |
1111 | number?) for an argument and we call it to return the nth key value. |
1112 | |
1113 | from this info we create a pack format string. |
1114 | |
1115 | any inversions or endian things are handled here in the preprocess code |
1116 | (either version) |
1117 | |
1118 | for each sort value we build up an array of key values. then we just |
1119 | do a pack and push its value. |
1120 | |
1121 | we also generate an unpack format (not same as pack format but related) |
1122 | to get the index and/or the original values to return. |
1123 | |
1124 | the pack formats, refs to processing code and other stuff is stored in |
1125 | the sort object for reuse. |
1126 | |
1127 | we can even have special case packs for common things like ip addresses |
1128 | so we can optimize even further there. ip can be designated as text or |
1129 | binary. |
1130 | |
1131 | well, what do you think? |
1132 | |
1133 | uri |