initial commit
[urisagit/Sort-Maker.git] / paper / outline_notes
CommitLineData
7468c584 1Return-Path: <jdporter@min.net>
2From: John Porter <jdporter@min.net>
3Subject: Re: sort article outline
4To: uri@sysarch.com
5Date: Mon, 28 Dec 1998 11:46:32 -0500 (EST)
6Cc: lr@hpl.hp.com
7In-Reply-To: <199812280441.XAA11943@home.sysarch.com.> from "Uri Guttman" at Dec 27, 98 11:41:54 pm
8Content-Type: text/plain; charset=US-ASCII
9Content-Length: 1680
10
11> JP> Still linear. Maybe the constants are painfully large, but we're
12> JP> still looking at O(n).
13>
14> i think i see why we don't agree here. i am talking about the linear set
15> of bytes and you are talking about liner O(n) for the work. is that it?
16
17Yep.
18
19
20> >> hey, it's english.
21>
22> JP> Marginally.
23>
24> ??? marginal quality english or marginally in english?
25
26The latter. My point is that wrt computer jargon, esp. Perl jargon,
27english is not much of a standard. I.e. what "string" means in
28english has only marginal bearing on what it means in Perl.
29
30
31> different names. each has advantages. the ascii version can use short
32> integer strings (<4 bytes) and is always endian independent. you could
33> pack into a byte or short if you want to. binary can
34> be more compact for some data formats (floats in particular) but endian
35> issues come up and have to be dealt with.
36>
37> you can cover the concept as one but the conversion process is different
38> as one uses sprintf and one uses pack. though the sprintf could use pack
39> is some ways. the sprintf is used to zero pad integers and print floats
40> in a sortable format. pack is used to store the integers in a byte order
41> that cmp can sort.
42
43O.k, I think I see what you're getting at. "Stringify" (in Perl) has
44always refered to the conversion which occurs transparently when a
45non-string SV gets used as a string. And more generally, the same
46effect can be achieved by sprintf. The result is a string which, when
47printed, represents the non-string value of the SV (be it integer, float,
48reference, whatever).
49
50So maybe the right way to describe a string which is produced by pack
51is "packed".
52
53John
54
55
56
57
58Return-Path: <jdporter@min.net>
59From: John Porter <jdporter@min.net>
60Subject: Re: sort article outline
61To: uri@sysarch.com
62Date: Sun, 27 Dec 1998 18:15:36 -0500 (EST)
63In-Reply-To: <199812262117.QAA09392@sysarch.com.> from "Uri Guttman" at Dec 26, 98 04:17:34 pm
64Content-Type: text/plain; charset=US-ASCII
65Content-Length: 3189
66
67
68Uri wrote:
69> not if you are packing integers and floats with optional endian
70> reversals. the whole concept is to convert various data types to one
71> concatenated set of keys sortable by cmp. cmp can't handle ints and
72> floats normally. larry showed how to handle signed ints and floats with
73> endian issues covered. so there is a conversion to a linear set of bytes
74> sortable by cmp. the original "disparate set of keys" has been mapped to
75> a single sortable key.
76
77Still linear. Maybe the constants are painfully large, but we're still
78looking at O(n).
79
80
81> JP> Why should we pander to the lcd?
82>
83> i don't think it is pandering at all. all i think is one form is a
84> subset of the other form and should have a different name.
85
86Well clearly, printable characters are a subset (variously defined)
87of the set of octets. My point is that Perl doesn't care, and
88Perl programmers shouldn't care. In Perl terminology, a string is
89the octet buffer part of a SV. Printability is irrelevant.
90
91
92> there are two issues we are arguing
93> here, whether there should be two names and what those names are.
94
95Yep.
96
97> JP> This is where you and I fundamentally disagree, I guess.
98>
99> hey, it's english.
100
101Marginally.
102
103> JP> Nothing wrong with that. The raw printability of the characters
104> JP> on any given terminal is irrelevant, imho.
105>
106> i disagree. some may choose to use the printable form for that very
107> attribute. sometime the converted keys may be kept around and used for
108> other long term purposes where printability may be a needed attribute.
109
110That's why this sorting issue is so open-ended. We can't lay down
111some hard-and-fast technique (like the ST). If the printability is
112critical, then the programmer has to work within that constraint.
113But if the performance of the sort is of primary concern, then she
114may have to sacrifice the intrinsic printability of the comparison
115keys; then they can do a transformation back into printable form,
116or cache the printable form, or whatever.
117
118
119> by having 2 names for 2 subtly different packing methods i think
120> that helps them decide which to choose.
121
122I'm all for educating the programmer on the issues and choices.
123
124
125> JP> I think this is a bad misuse of the term "linear".
126> JP> Maybe you could come up with some other term that expresses
127> JP> that notion.
128>
129> well, i agree the term may not be the best, but do you agree on the
130> difference between the methods, as i hve described above?
131
132Clearly not.
133
134
135> JP> You and I don't see eye-to-eye on this.
136>
137> maybe my argument above will convince you more. or i can get your
138> mother-in-law to agree with me and then you will be sorry!
139
140Um, unfortunately, you have only clarified your position.
141You have not added any persuasive content.
142
143
144> JP> I remember he said something like that about the ST specifically;
145> JP> don't remember anything about Perl generally...
146>
147> maybe my memory was faulty but he was so negative on others that i
148> remember a slam on perl in some letters. it would be tricky to scan him
149> on dejanews for just that.
150
151I'm not saying I don't believe he said such things; I just don't
152remember them. Knowing Rutka, he probably did!
153
154John
155
156
157
158
159Return-Path: <jdporter@min.net>
160From: John Porter <jdporter@min.net>
161Subject: Re: sort article outline
162To: uri@sysarch.com
163Date: Sat, 26 Dec 1998 11:30:34 -0500 (EST)
164Cc: lr@hpl.hp.com
165In-Reply-To: <199812260558.AAA08301@sysarch.com.> from "Uri Guttman" at Dec 26, 98 00:58:39 am
166Content-Type: text/plain; charset=US-ASCII
167Content-Length: 2668
168
169
170Uri wrote:
171> >>>>> "JP" == John Porter <jdporter@min.net> writes:
172>
173> >> also if you
174> >> put integers in the perl byte string, then it is not stringifying which
175> >> slightly implies ascii.
176>
177> JP> Not in Perl, imho.
178> JP> I think cmp implies strings, regardless of the character value range.
179> JP> We're talking about comparing arrays of byte-sized "characters",
180> JP> i.e. strings.
181>
182> >> so linearization is a better overall term,
183>
184> JP> I find your argument, such as it is, unconvincing.
185> JP> In what way has the process made anything linear?
186> JP> That's what "linearization" means, after all.
187>
188> you are taking a disparate set of keys and making them into a single key
189> of a linear set of bytes.
190
191Even so, the "disparate set of keys" is already linear, wrt
192the sorting algorithm. At least if it's done right, which
193presumably it is.
194
195
196> i understand that perl is 8 bit clean regarding strings and cmp. and cmp
197> is an unsigned compare (IIR). but i (and some others) feel that the word
198> string connotes some printable attributes.
199
200Anyone harboring that notion should be edified. Why should we
201pander to the lcd?
202
203
204> >> (ascii really means printing chars, and probably just digits and
205> >> us-ascii unless you use locale and it works for extendied ascii).
206>
207> JP> Um, imho, all this stuff about ascii is irrelevant.
208>
209> well how would you differentiate converting a key to printable ascii
210> vs. integer bytes?
211
212The important question is, *why* would you?
213I don't think we should, in this case at least.
214
215
216> stringifying implies printable.
217
218This is where you and I fundamentally disagree, I guess.
219
220> but if you create a 8 bit set of bytes that would be something
221> more complex that won't print (without going to hex or something).
222
223Nothing wrong with that. The raw printability of the characters
224on any given terminal is irrelevant, imho.
225
226
227> that
228> is linearization since a set of integers and floats, are now a linear
229> set of sortable bytes, but not a printable string.
230
231I think this is a bad misuse of the term "linear".
232Maybe you could come up with some other term that expresses
233that notion.
234
235
236> this is fine edge semantics, but i do like having two terms for the
237> printable and non-printable converted keys. maybe linearization is not a
238> great term but i want to reserve stringifying for printable only.
239
240Whatever. You and I don't see eye-to-eye on this.
241
242
243> i recall [Rutka] had some early misstatements about not liking perl
244> (or we interpreted his poor wording to mean that).
245
246I remember he said something like that about the ST specifically;
247don't remember anything about Perl generally...
248
249John
250
251
252
253
254Return-Path: <jdporter@min.net>
255From: John Porter <jdporter@min.net>
256Subject: Sorting article
257To: uri@sysarch.com
258Date: Sat, 2 Jan 1999 23:45:47 -0500 (EST)
259Cc: lr@hpl.hp.com
260In-Reply-To: <199901021831.NAA12831@home.sysarch.com.> from "Uri Guttman" at Jan 2, 99 01:31:45 pm
261Content-Type: text/plain; charset=US-ASCII
262Content-Length: 1354
263
264
265> John wrote:
266> > Uri wrote:
267> > > in both cases, the original data is usually concatenated onto the end of
268> > > the created sort key and then extracted (usually with substr) after the
269> > > sort is done. neither string sort can handle perl references directly,
270> > > though the referenced data can be used to create a single sort key and
271> > > then ST or OM can then be used for the actual sort.
272> >
273> > I don't think I'd say "usually". You should say "possibly", with
274> > a description of the situations that allow it.
275>
276> fine. what situations exist where you would not need to concatenate the
277> original data onto the key?
278
279I'm not talking not needing to do it, I'm talking about not being able
280to do it. If you're sorting a list of "objects" e.g., you can't simply
281concat a reference onto the end of a string. However, as you pointed out,
282one could simply concat an index for the object; so now I can't think of
283any situations where the pack technique can't be used.
284
285The slice of the pie I'd like to tackle is addressing certain
286issues, namely:
2871. Comparison keys are unique, slightly non-unique, or highly non-unique;
2882. Cost of deriving the comparison key from each datum is nil, small,
289 or large.
290According to my benchmarks, these had a significant influence on the
291relative performance of various techniques (ST, OM, etc.).
292
293John
294
295
296
297
298Return-Path: <jdporter@min.net>
299From: John Porter <jdporter@min.net>
300Subject: Re: Sorting article
301To: uri@sysarch.com
302Date: Mon, 4 Jan 1999 08:59:18 -0500 (EST)
303Cc: lr@hpl.hp.com
304In-Reply-To: <199901030455.XAA13708@home.sysarch.com.> from "Uri Guttman" at Jan 2, 99 11:55:35 pm
305Content-Type: text/plain; charset=US-ASCII
306Content-Length: 1306
307
308Uri wrote:
309> good. i like the point you make about the index vs. the reference. this
310> could be a sort of combination of swchartzian and stringify (either
311> flavor). if you have to sort a list of refs (or complex stuff),
312> preprocess into a array with a single ref to the real data, make a list
313> of stringified keys and append the index to the key and sort with
314> builtin cmp. then extract out the indexes and original ref(s). the
315> schwartzian part is doing it with maps.
316
317The similary to ST had occured to me.
318 map { (split("\0",$_,2))[0] }
319 sort
320 map { join "\0", $_->cmpkey(), $_ }
321 @a
322
323But this is as close as it gets. The other "advantage" of ST --
324multiple key comparisons -- is obviated by our technique.
325
326
327> tho i beleive temp arrays and
328> foreach modifiers can be or are faster than maps. we have seen cases
329> where multiple maps ar slow. i think it is because it has to recopy all
330> the values for each map. a foreach on a temp array would use the same
331> array without copying.
332
333I thought the slow part was all those anonymous arrays which are
334created in the first map. map itself doesn't allocate an AV, it makes
335a list. Same with sort. So if you're sorting a list of 1000 items,
336you have three lists created (map, sort, map), but 1000 anonymous AV's
337allocated in the first map.
338
339John
340
341
342
343
344Return-Path: <lr@hpl.hp.com>
345From: Larry Rosler <lr@hpl.hp.com>
346To: "'Uri Guttman'" <uri@sysarch.com>, jdporter@min.net
347Subject: RE: Sorting article
348Date: Mon, 4 Jan 1999 12:55:54 -0800
349Content-Type: text/plain;
350 charset="iso-8859-1"
351Content-Length: 4272
352
353> From: Uri Guttman [mailto:uri@sysarch.com]
354> Sent: Monday, January 04, 1999 08:23
355> To: jdporter@min.net
356> Cc: lr@hpl.hp.com
357> Subject: Re: Sorting article
358...
359> P.S. larry has some catchup reading and replying to do now that he
360> should be back from vacation.
361
362Yes, I am back. We drove 7 hours yesterday from Southern California,
363much of it through some heavy fog in the Central Valley, so I am still a
364little bleary. And I have tried to plow through over 1200 articles in
365c.l.p.misc, with so far only two conclusions:
366
3671. The group got along fine without me (something to think about for
368the future :-).
369
3702. Uri seems to have had a couple of really bad days through the past
371week or so. :-(
372
373I haven't had a chance to read all of the correspondence between you two
374yet, but I did skim a couple, and noticed the mention of sorting
375indexes. I "solved" the problem of using the default sort on references
376while on vacation, *literally* while I was sleeping (i.e., I woke up and
377wrote it down!). I then generalized it some, and thought it would be a
378world-beater speedwise. But my first measurements are somewhat
379disappointing.
380
381Here is what I have so far; I'm sharing it before doing much else
382because your comments and observations would be most useful. It is
383actually a different sorting paradigm, as you will see:
384
385EVOLUTION OF EFFICIENT PERL SORTS
386
387Naive { sortkey(value_a) cmp sortkey(value_b) }
388Orcish $h{ value } ||= sortkey OR @h{ values } = sortkeys
389Schwartz [ value, sortkey, ... ]
390Linearized sortkey . value
391New @h{ sortkeys } = values
392
393
394
395#!/usr/local/bin/perl -w
396use Benchmark;
397
398my $n = 10000;
399my @a = map { [ int(rand 1000), 'foo' . int(rand 1000) ] } 1 .. $n;
400
401sub new {
402 my $i = 'aaa';
403 my %h;
404 @h{ map pack('N A* A*' => $_->[0], $_->[1], $i++) => @a } = @a;
405 my @out = @h{ sort keys %h };
406}
407
408sub usual {
409 my @out = sort { $a->[0] <=> $b->[0] || $a->[1] cmp $b->[1] } @a;
410}
411
412timethese(1 << (shift || 0), {
413 New => sub { &new },
414 Usual => sub { &usual },
415});
416__END__
417
418Benchmark: timing 16 iterations of New, Usual...
419 New: 10 wallclock secs ( 9.56 usr + 0.14 sys = 9.70 CPU)
420 Usual: 13 wallclock secs (12.92 usr + 0.00 sys = 12.92 CPU)
421
422Comments:
423
424$i serves two functions: insuring the uniqueness of the sort key (which
425is saved as the key to a hash, so must be unique); preserving the
426stability of the sort (entries that sort equal preserve their order in
427the output array).
428
429Do you like those two hash slices, Uri? Take a look at the next
430attempt, which adds an array slice.
431
432In the above case, storing the references themselves as values in the
433hash is efficient. But I was concerned about generalizing this sorting
434paradigm to values that were not references, but might be arbitrarily
435long strings. Before I left on vacation, someone had posted about
436perhaps sorting references to the strings, or indexes into the array.
437So I tried the following:
438
439#!/usr/local/bin/perl -w
440use Benchmark;
441
442my $n = 10000;
443my @a = map { sprintf "XXX %d %s %s", int(rand 1000), 'foo' . int(rand
4441000),
445 'x' x 100 . "\n" } 1 .. $n;
446
447sub indexed {
448 my $i = 'aaa';
449 my %h;
450 @h{ map pack('N A* A*' => /(\d+) (\w+)/, $i++) => @a } = 0 .. $#a;
451 my @out = @a[ @h{ sort keys %h } ];
452}
453
454sub linear {
455 my @out = map { substr($_, 1 + rindex $_, "\0") } sort
456 map pack('N A* x A*' => /(\d+) (\w+)/, $_) => @a;
457}
458
459timethese(1 << (shift || 0), {
460 Indexed => sub { &indexed },
461 Linear => sub { &linear },
462});
463__END__
464
465Benchmark: timing 16 iterations of Indexed, Linear...
466 Indexed: 15 wallclock secs (14.06 usr + 0.58 sys = 14.64 CPU)
467 Linear: 13 wallclock secs (12.44 usr + 0.61 sys = 13.05 CPU)
468
469Comments:
470
471I had hoped that this would be better, because it seemingly avoids
472appending the data to the sortkeys and then extracting via the substrs.
473But both of those operations take place in the O(n) part of the sort,
474not in the O(n log n). So perhaps I was expecting too much. But why is
475it *slower*???
476
477I have to catch up now on the rest of the world, including my *real*
478job. But I wanted to get this off my chest (and onto yours? :-) ASAP.
479
480HNY!!!
481
482Larry
483
484--
485Larry Rosler
486Hewlett-Packard Company
487http://www.hpl.hp.com/personal/Larry_Rosler/
488lr@hpl.hp.com
489
490
491
492From: Larry Rosler <lr@hpl.hp.com>
493To: "'John Porter'" <jdporter@min.net>
494Cc: uri@sysarch.com
495Subject: RE: Sorting article
496Date: Wed, 6 Jan 1999 13:27:16 -0800
497Content-Type: text/plain
498Content-Length: 1096
499
500> From: John Porter [mailto:jdporter@min.net]
501> Sent: Wednesday, January 06, 1999 13:02
502> To: lr@hplb.hpl.hp.com
503> Cc: uri@sysarch.com
504> Subject: Re: Sorting article
505>
506>
507> Larry wrote:
508> > New @h{ sortkeys } = values
509> >...
510> > @h{ map pack('N A* A*' => $_->[0], $_->[1], $i++) => @a } = @a;
511> > my @out = @h{ sort keys %h };
512>
513> This actually reminds me of a technique I first learned about from
514> Joseph Hall, see EPP, the bottom of p. 47, where he does something
515> similar with an array. He calls it a "mundane" way to sort. !
516>
517> John
518
519I think you clipped the wrong piece from my letter. Hall is
520demonstrating the sort-the-indexes sort, but his sort function uses the
521'mundane' key comparisons (from an array instead of a hash). I was
522trying to combine the index sort with the default sort. This is the
523right clip:
524
525sub indexed {
526 my $i = 'aaa';
527 my %h;
528 @h{ map pack('N A* A*' => /(\d+) (\w+)/, $i++) => @a } = 0 .. $#a;
529 my @out = @a[ @h{ sort keys %h } ];
530}
531
532The nesting of the hash slice inside of the array slice is not
533'mundane', is it?
534
535Larry
536
537
538
539From: Larry Rosler <lr@hpl.hp.com>
540To: "'Uri Guttman'" <uri@sysarch.com>, jdporter@min.net
541Subject: RE: sort article outline
542Date: Thu, 7 Jan 1999 23:05:27 -0800
543Content-Type: text/plain
544Content-Length: 1258
545
546Hi. I'm still way behind on stuff, especially this work. But here is a
547thought on terminology.
548
549Rather than 'stringified sort' or 'linearized sort', how about
550'lexicographic sort'? If that isn't definitive enough, consider 'pure
551lexicographic sort' or 'default lexicographic sort'.
552
553Which reminds me: Whoever now owns the draft of the paper (Uri?) --
554make sure to include a note that all this requires that 'no locale;' be
555in effect, or the idea falls apart. This is from perllocale:
556
557<QUOTE>
558 no locale;
559 print +(sort grep /\w/, map { chr() } 0..255), "\n";
560
561This machine-native collation (which is what you get unless use locale
562has appeared earlier in the same block) must be used for sorting raw
563binary data, whereas the locale-dependent collation of the first example
564is useful for natural text.
565</QUOTE>
566
567Note the 'sorting raw binary data' which is what we are talking about.
568
569I dislike 'stringified' or 'stringized' for several reasons, not the
570least of which is that verbification of adjectives isn't good English,
571IMO. 'Linearized' (also a verbification of an adjective, but with a
572better pedigree) may be a little off-beat semantically. I think
573'lexicographic' comes from the C Standard -- I'll take a look tomorrow.
574
575Larry
576
577
578
579From: Larry Rosler <lr@hpl.hp.com>
580To: "'Uri Guttman'" <uri@sysarch.com>
581Subject: RE: sort article outline
582Date: Fri, 8 Jan 1999 10:36:21 -0800
583Content-Type: text/plain;
584 charset="iso-8859-1"
585Content-Length: 7713
586
587> From: Uri Guttman [mailto:uri@sysarch.com]
588> Sent: Friday, January 08, 1999 10:07
589> To: lr@hpl.hp.com
590> Cc: jdporter@min.net
591> Subject: Re: sort article outline
592>
593> larry,
594>
595> i seemed to have misplaced (lost) your comments to my outline. do you
596> have that letter and can you forward it to me? i suspect you keep all
597> you old email around. you seem that type. i delete most of mine. i
598will
599> be switching to vm on emacs which has folders and stuff and it might
600> improve that aspect of things.
601>
602> uri
603
604All it costs is disk space, now on someone else's server (because I have
605switched from running elm on my Unix box to using Microsoft Exchange,
606which the rest of my group uses).
607
608The letter is below, and I have not re-edited it. But note some things:
609the use of 'lexicographic' (as in my most recent suggeston) and the
610mention of 'index sorts' which have since received more discussion
611between me and John.
612
613Larry
614
615> From: Uri Guttman [mailto:uri@sysarch.com]
616> Sent: Tuesday, December 22, 1998 21:47
617> To: jdporter@min.net; lr@hpl.hp.com
618> Subject: sort article outline
619>
620> hi guys,
621>
622> i felt productive tonight and i didn't have other work and news was
623slow
624> so i wrote an outline for the sort article. i like it so far and i
625think
626> it will make for a good article (or 2).
627
628Likely the latter, if it is thorough (as the outline seems to be).
629
630> so please shred it up and send your comments. when we agree on a
631> (semi-) final outline, then we can start fleshing out the paragraphs
632> with text and code and benchmarks.
633>
634> an open area is benchmarking and describing when to use each of the
635> speedup tricks. this will need some research and playing with code and
636> data.
637
638'When to use' is what the summary is about. See below.
639
640> we should have a set of benchmark data we create to drive this so we
641can
642> all use it and cme up with useful benchmarks. having this stuff on a
643> website (maybe MJD could help us) and having others playing with our
644> stuff could be an intersting idea.
645>
646> otherwise, a belated happy hannukah to larry (we threw a sitdown
647dinner
648> party for 14 on sunday, the eigth night. we had 8 menorahs going!),
649and
650> happy holidays to you john.
651
652Nice image (8 menorahs * 9 candles each). And to you both.
653
654> uri
655>
656>
657> Outline for TPJ Sort Article
658>
659> (Note the use of caps here :-)
660>
661> I. Overview of sorting in the real world.
662
663Refs. Knuth Vol. 3, Chap. 5
664FMTYEWTK
665
666Introduce O() concept and notation briefly
667
668> A. What is sorting
669>
670> B. Why is it needed
671
672Any program that produces tabular output is almost certain to need it!
673
674> C. Sort algorithms
675>
676> 1. bubble
677
678O(n**2)
679
680> 2. insertion
681
682O(n**2)
683
684> 3. tree
685
686O(n log n)
687
688> 4. shell
689
690O(n log n)
691
692> 5. quicksort
693
694O(n log n)
695
696> 6. others
697>
698> D. O(n) order function
699
700Is this later than needed, in order to characterize the sort
701algorithms mentioned above?
702
703> 1. What it means
704>
705> 2. Constant and linear overhead for pre- and post-
706> processing
707>
708> 3. Constant overhead per compare
709>
710> 4. Why reducing compare count is critical
711>
712> 5. Why amount of sort data matters
713>
714> II. Perl's sort function
715
716Implementation -- C qsort() with optional callbacks to a Perl function?
717Limitations of Perl internal sorting -- size of data set, w/o
718sort-merge.
719Usefulness of highly tuned system sort command or commercial
720sort packages (e.g., SyncSort)
721
722> A. Builtin compare is cmp
723
724and implementation is optimal (C or assembly memcmp function)
725
726> B. Custom compare code
727
728(in callback function)
729
730> 1. block of code
731>
732> 2. sub name
733>
734> 3. sub type glob
735>
736> 4. sub ref (to be done)
737>
738> C. Special vars $a and $b in compare
739>
740> III Ways to sort in Perl
741
742Ascending alphabetic (lexicographic) sort (default)
743
744> A. Numeric sort
745>
746> B. Reverse sort
747>
748> C. Multi-level sort
749
750Are these 'field' sorts?
751
752> D. Array index sort
753
754This is a derived-key sort.
755
756> E. Hash value sort
757
758This is a derived-key sort.
759
760> F. Complex compare code
761
762No single-valued sort key per datum. Beyond the scope of
763this article.
764
765> IV. Efficient sorting in Perl
766>
767> A. When efficiency is not warranted
768 ^ an investment in improving
769
770> 1. Short data sets
771>
772> 2. Rarely executed program
773
774More generally, low cost compared to other parts of the
775program. Quote from Rob Pike about when to optimize (don't!).
776
777> B. Sort overhead
778>
779> 1. builtin vs. {$a cmp $b}
780
781Trivial sorts (key is trivially derived from datum):
782a. {$a <=> $b} or reverse
783b. {$hash{$a} cmp $hash{$b}} or numeric; or reverse
784c. {$a->[CONST] cmp $b->[CONST]} or numeric; or reverse
785
786> C. General ideas on how to speed up sort
787>
788> 1. Preprocessing data
789to achieve a trivial sort
790
791> 2. Simple and fast compare code (builtin if possible)
792
793Trivial sort.
794
795> 3. Postprocessing data from LOL (if needed)
796>
797> D. Speed up techniques
798>
799> 1. Schwartzian transform (from Randal Schwartz)
800>
801> a. Converts input to LOL, index sort, convert LOL to output
802
803index sort? This is Trivial Sort c (see above).
804
805> b. uses maps or temp arrays
806>
807> 2. Orcish (Or-cache) manoever (from Joseph Hall)
808>
809> a. Converts sort data in compare code
810>
811> b. Converts only once per unique sort data
812
813but checks every time to see if key derivation is needed.
814
815> c. Stores converted keys in private hash
816
817This is Trivial Sort b (see above).
818
819> d. No need for post processing, as sort is on real data
820
821Where is the discussion about creating the hash during a
822pre-processing pass, obviating the test on each comparand?
823
824> 3. Stringifying (need better name) (from many people)
825
826How about 'Linearized Sort'?
827
828> a. Preprocessing of input data to sortable strings
829
830'string' == arbitrary sequence of bytes, stored in a Perl
831scalar variable (u.e., linear).
832
833> i. uses either pack or sprintf (or other perl code).
834>
835> ii. Converted key is combination of binary and ascii
836 ^^ may be
837
838> iii. Original data is appended onto converted key
839
840See below -- cannot be references.
841
842> b. Uses builtin compare (fastest)
843>
844> c. Postprocess keys to get original data (use substr
845or unpack)
846
847Note that this is isomorphic to the Schwartz Transform
848map/sort/map paradigm.
849
850> d. Can handle integers (signed or unsigned), ascii
851strings, and
852 ^^^^^
853??? see above
854> normal or reverse sorts on a per key basis.
855>
856> e. Can handle floats (IEEE format only) with code to
857deal with big
858 ^ non-negative, until someone gets smarter!
859> vs. little endian.
860
861Cannot handle references!
862
863 4. Summary
864
865 a. Examples
866 IP addresses
867 Fully-qualified domain names (right to left)
868 ...
869
870 b. When to use
871 Criteria
872 Benchmarks
873
874>
875> --
876> Uri Guttman ----------------- SYStems ARCHitecture and
877Software Engineering
878> Perl Hacker for Hire ---------------------- Perl,
879Internet, UNIX Consulting
880> uri@sysarch.com ------------------------------------
881http://www.sysarch.com
882The Best Search Engine on the Net -------------
883http://www.northernlight.com
884
885Good start. Excelsior!
886
887I am having trouble minding my tongue about Rutka's obnoxious posts, but
888he won't disappear (and he now works for Lucent, my alma mater.
889Sigh...)
890
891I will be away from December 25 through January 3; real vacation (no
892electronic communication).
893
894Happy New Year!!!
895
896Larry
897
898--
899Larry Rosler
900Hewlett-Packard Company
901http://www.hpl.hp.com/personal/Larry_Rosler/
902lr@hpl.hp.com
903
904
905
906From: Larry Rosler <lr@hpl.hp.com>
907To: "'Uri Guttman'" <uri@sysarch.com>
908Cc: jdporter@min.net
909Subject: RE: sort article outline
910Date: Fri, 8 Jan 1999 12:32:49 -0800
911Content-Type: text/plain
912Content-Length: 2465
913
914> From: Uri Guttman [mailto:uri@sysarch.com]
915> Sent: Friday, January 08, 1999 09:59
916> To: lr@hpl.hp.com
917> Cc: jdporter@min.net
918> Subject: Re: sort article outline
919>
920...
921> LR> make sure to include a note that all this requires that 'no
922locale;' be
923> LR> in effect, or the idea falls apart. This is from perllocale:
924>
925> LR> <QUOTE>
926> LR> no locale;
927> LR> print +(sort grep /\w/, map { chr() } 0..255), "\n";
928>
929> LR> This machine-native collation (which is what you get unless use
930locale
931> LR> has appeared earlier in the same block) must be used for sorting
932raw
933> LR> binary data, whereas the locale-dependent collation of the first
934example
935> LR> is useful for natural text.
936> LR> </QUOTE>
937>
938> LR> Note the 'sorting raw binary data' which is what we are talking
939about.
940>
941> is this due to signed/unsigned cmp issues? that 0 .. 255 might map to
942> 128 ..255, 0 .. 127 on some machines because of signed char compares?
943> this is another porting issue for packed sorts. not just endianess.
944
945No. Without looking at the perl sources, I venture to guess that it is
946because the 'no locale;' default sort uses 'memcmp' while the 'use
947locale;' default sort uses 'strcoll' which relies on LC_COLLATE. There
948might be a signedness problem if 'strncmp' were used instead of
949'memcmp'. But I recall John saying that 'memcmp' was used (and it would
950*have* to be to conform to Perl's definition of a string, which is a
951sequence of arbitrary bytes whose length is specified by a separate
952integer, vs. C's definition of a string, which is a sequence of bytes
953terminated by a zero byte). In any case, I have yet to see any machine
954dependencies in my tests on PA-RISC or Intel architectures. Does anyone
955know of any target that uses signed-char compares that this could be
956verified against -- or just look into the perl source (which I haven't
957done yet) to make sure it uses 'memcmp'.
958
959> printable sorts (hey i like these terms) need to worry about
960> locale tho.
961
962Well, that is interesting. You distinguish between a 'packed' sort key
963(which *surely* demands 'no locale;') and a 'printable' ('stringized')
964sort key (which perhaps relies on 'use locale;'). Maybe that is a real
965distinction that should be understood and documented. A mixed sort key
966(packed number and a string, for example) would have to be sorted
967without locale, though the number could be stringized by 'sprintf
968"%10u"' in which case locale could be observed.
969
970Larry
971
972
973
974From: Larry Rosler <lr@hpl.hp.com>
975To: "'Uri Guttman'" <uri@sysarch.com>
976Cc: "'jdporter@min.net'" <jdporter@min.net>
977Subject: RE: sort article outline
978Date: Fri, 8 Jan 1999 12:59:24 -0800
979Content-Type: text/plain
980Content-Length: 1266
981
982...
983> No. Without looking at the perl sources, I venture to guess
984> that it is because the 'no locale;' default sort uses
985> 'memcmp' while the 'use locale;' default sort uses 'strcoll'
986> which relies on LC_COLLATE. There might be a signedness
987> problem if 'strncmp' were used instead of 'memcmp'. But I
988> recall John saying that 'memcmp' was used (and it would
989> *have* to be to conform to Perl's definition of a string,
990> which is a sequence of arbitrary bytes whose length is
991> specified by a separate integer, vs. C's definition of a
992> string, which is a sequence of bytes terminated by a zero
993> byte). In any case, I have yet to see any machine
994> dependencies in my tests on PA-RISC or Intel architectures.
995> Does anyone know of any target that uses signed-char compares
996> that this could be verified against -- or just look into the
997> perl source (which I haven't done yet) to make sure it uses 'memcmp'.
998
999>From 'sv.c', the only comparison routine used is 'memcmp'. When locale
1000is in effect, each comparand is passed once through strxfrm and the
1001transformed value is saved for sort comparisons. (The mechanism must be
1002like that which converts numbers to strings or v.v. once only.)
1003
1004So I think character-signedness is a non-issue.
1005
1006Larry
1007
1008
1009
1010From: Larry Rosler <lr@hpl.hp.com>
1011To: "'John Porter'" <jdporter@min.net>, uri@sysarch.com
1012Subject: RE: sort article outline
1013Date: Fri, 8 Jan 1999 14:29:13 -0800
1014Content-Type: text/plain
1015Content-Length: 2430
1016
1017> From: John Porter [mailto:jdporter@min.net]
1018> Sent: Friday, January 08, 1999 11:03
1019> To: uri@sysarch.com
1020> Cc: lr@hpl.hp.com
1021> Subject: Re: sort article outline
1022>
1023> Uri wrote:
1024> > JP> But it's also long-winded. Can't we simply talk about the
1025> > JP> "default sort" (locale issues aside)?
1026> >
1027> > i don't like that because it doesn't describe what transforms are
1028being
1029> > done to get it to use the builtin cmp. just oo vague a term.
1030>
1031> Same applies to "default lexicographic sort". It's not really for
1032> describing any of the transforms we're investigation, but for the
1033> underlying sort (qsort/cmp) which we're exploiting.
1034>
1035> > what is important is the method of transforms that
1036> > allows you to use the builtin cmp. so i feel the name
1037> should reflect that.
1038>
1039> Right. But everything needs a name, even the "default sort".
1040
1041Here, in journalistic format, is a synopsis of where we stand. You may
1042prefer 'Advanced' to 'Enhanced'.
1043
1044WHAT: The Enhanced Default Sort
1045
1046WHY: For optimal performance on large sorts (often with several sort
1047 fields), compare sort keys using built-in C functions rather than
1048 using Perl code in a sort subroutine.
1049
1050WHERE: Wherever the sort is based on comparing a single-valued mapping
1051 function of each value to be sorted. (This is true for all
1052 enhanced sorts -- ST, Orcish, ...)
1053
1054HOW: For each value to be sorted, create a sort key that is a Perl
1055 string (a sequence of bytes) corresponding lexicographically to
1056the
1057 position of the value in the sorted output; append the value to
1058the
1059 sort key so that it can readily be retrieved. The sort key can
1060be
1061 created using any combination of 'pack', 'sprintf' or
1062concatenation.
1063 Each field to be sorted may comprise a signed or unsigned
1064integer,
1065 a floating-point number, or a string of characters. (Special
1066 handling is required if the values to be sorted are references --
1067 LoL or LoH.)
1068
1069WHEN: Compute the keys in a single pass through the list of values
1070(map).
1071 Sort. Retrieve the values from the keys (map).
1072
1073Other minor issues: As stated, this method sorts two values whose keys
1074compare equal according to the values themselves (which is the usual
1075fall-through condition for a fielded sort). A 'stable' sort can be
1076forced (using any sorting method) by adding a sequential index as the
1077last sort key.
1078
1079GO NINERS!!! (Bye-bye, Pats :-()
1080
1081Larry
1082
1083
1084
1085i have to write these down now or i may forget them.
1086
1087a sort module api could look like this:
1088
1089$sort_obj = Sort->new( [ sort keys desriptions ], gloabl sort options
1090
1091sort key description => {
1092
1093 'type' => integer, string, float, etc.
1094
1095 'down' => yes
1096
1097 'code_to_get_key_value' => code ref OR code text.
1098
1099}
1100
1101$sort_obj->values( list of values or refs)
1102
1103can be called again to add more to list.
1104
1105$sort_obj->sort returns sorted data
1106
1107
1108if the option is to build a preprocess sub, we take the code texts and
1109create the text for it. then it will run fast, an optimization for large
1110sorts. otherwise, we use the code ref which takes a ref (and key
1111number?) for an argument and we call it to return the nth key value.
1112
1113from this info we create a pack format string.
1114
1115any inversions or endian things are handled here in the preprocess code
1116(either version)
1117
1118for each sort value we build up an array of key values. then we just
1119do a pack and push its value.
1120
1121we also generate an unpack format (not same as pack format but related)
1122to get the index and/or the original values to return.
1123
1124the pack formats, refs to processing code and other stuff is stored in
1125the sort object for reuse.
1126
1127we can even have special case packs for common things like ip addresses
1128so we can optimize even further there. ip can be designated as text or
1129binary.
1130
1131well, what do you think?
1132
1133uri