1 Return-Path: <jdporter@min.net>
2 From: John Porter <jdporter@min.net>
3 Subject: Re: sort article outline
5 Date: Mon, 28 Dec 1998 11:46:32 -0500 (EST)
7 In-Reply-To: <199812280441.XAA11943@home.sysarch.com.> from "Uri Guttman" at Dec 27, 98 11:41:54 pm
8 Content-Type: text/plain; charset=US-ASCII
11 > JP> Still linear. Maybe the constants are painfully large, but we're
12 > JP> still looking at O(n).
14 > i think i see why we don't agree here. i am talking about the linear set
15 > of bytes and you are talking about liner O(n) for the work. is that it?
20 > >> hey, it's english.
24 > ??? marginal quality english or marginally in english?
26 The latter. My point is that wrt computer jargon, esp. Perl jargon,
27 english is not much of a standard. I.e. what "string" means in
28 english has only marginal bearing on what it means in Perl.
31 > different names. each has advantages. the ascii version can use short
32 > integer strings (<4 bytes) and is always endian independent. you could
33 > pack into a byte or short if you want to. binary can
34 > be more compact for some data formats (floats in particular) but endian
35 > issues come up and have to be dealt with.
37 > you can cover the concept as one but the conversion process is different
38 > as one uses sprintf and one uses pack. though the sprintf could use pack
39 > is some ways. the sprintf is used to zero pad integers and print floats
40 > in a sortable format. pack is used to store the integers in a byte order
43 O.k, I think I see what you're getting at. "Stringify" (in Perl) has
44 always refered to the conversion which occurs transparently when a
45 non-string SV gets used as a string. And more generally, the same
46 effect can be achieved by sprintf. The result is a string which, when
47 printed, represents the non-string value of the SV (be it integer, float,
50 So maybe the right way to describe a string which is produced by pack
58 Return-Path: <jdporter@min.net>
59 From: John Porter <jdporter@min.net>
60 Subject: Re: sort article outline
62 Date: Sun, 27 Dec 1998 18:15:36 -0500 (EST)
63 In-Reply-To: <199812262117.QAA09392@sysarch.com.> from "Uri Guttman" at Dec 26, 98 04:17:34 pm
64 Content-Type: text/plain; charset=US-ASCII
69 > not if you are packing integers and floats with optional endian
70 > reversals. the whole concept is to convert various data types to one
71 > concatenated set of keys sortable by cmp. cmp can't handle ints and
72 > floats normally. larry showed how to handle signed ints and floats with
73 > endian issues covered. so there is a conversion to a linear set of bytes
74 > sortable by cmp. the original "disparate set of keys" has been mapped to
75 > a single sortable key.
77 Still linear. Maybe the constants are painfully large, but we're still
81 > JP> Why should we pander to the lcd?
83 > i don't think it is pandering at all. all i think is one form is a
84 > subset of the other form and should have a different name.
86 Well clearly, printable characters are a subset (variously defined)
87 of the set of octets. My point is that Perl doesn't care, and
88 Perl programmers shouldn't care. In Perl terminology, a string is
89 the octet buffer part of a SV. Printability is irrelevant.
92 > there are two issues we are arguing
93 > here, whether there should be two names and what those names are.
97 > JP> This is where you and I fundamentally disagree, I guess.
103 > JP> Nothing wrong with that. The raw printability of the characters
104 > JP> on any given terminal is irrelevant, imho.
106 > i disagree. some may choose to use the printable form for that very
107 > attribute. sometime the converted keys may be kept around and used for
108 > other long term purposes where printability may be a needed attribute.
110 That's why this sorting issue is so open-ended. We can't lay down
111 some hard-and-fast technique (like the ST). If the printability is
112 critical, then the programmer has to work within that constraint.
113 But if the performance of the sort is of primary concern, then she
114 may have to sacrifice the intrinsic printability of the comparison
115 keys; then they can do a transformation back into printable form,
116 or cache the printable form, or whatever.
119 > by having 2 names for 2 subtly different packing methods i think
120 > that helps them decide which to choose.
122 I'm all for educating the programmer on the issues and choices.
125 > JP> I think this is a bad misuse of the term "linear".
126 > JP> Maybe you could come up with some other term that expresses
129 > well, i agree the term may not be the best, but do you agree on the
130 > difference between the methods, as i hve described above?
135 > JP> You and I don't see eye-to-eye on this.
137 > maybe my argument above will convince you more. or i can get your
138 > mother-in-law to agree with me and then you will be sorry!
140 Um, unfortunately, you have only clarified your position.
141 You have not added any persuasive content.
144 > JP> I remember he said something like that about the ST specifically;
145 > JP> don't remember anything about Perl generally...
147 > maybe my memory was faulty but he was so negative on others that i
148 > remember a slam on perl in some letters. it would be tricky to scan him
149 > on dejanews for just that.
151 I'm not saying I don't believe he said such things; I just don't
152 remember them. Knowing Rutka, he probably did!
159 Return-Path: <jdporter@min.net>
160 From: John Porter <jdporter@min.net>
161 Subject: Re: sort article outline
163 Date: Sat, 26 Dec 1998 11:30:34 -0500 (EST)
165 In-Reply-To: <199812260558.AAA08301@sysarch.com.> from "Uri Guttman" at Dec 26, 98 00:58:39 am
166 Content-Type: text/plain; charset=US-ASCII
171 > >>>>> "JP" == John Porter <jdporter@min.net> writes:
174 > >> put integers in the perl byte string, then it is not stringifying which
175 > >> slightly implies ascii.
177 > JP> Not in Perl, imho.
178 > JP> I think cmp implies strings, regardless of the character value range.
179 > JP> We're talking about comparing arrays of byte-sized "characters",
182 > >> so linearization is a better overall term,
184 > JP> I find your argument, such as it is, unconvincing.
185 > JP> In what way has the process made anything linear?
186 > JP> That's what "linearization" means, after all.
188 > you are taking a disparate set of keys and making them into a single key
189 > of a linear set of bytes.
191 Even so, the "disparate set of keys" is already linear, wrt
192 the sorting algorithm. At least if it's done right, which
196 > i understand that perl is 8 bit clean regarding strings and cmp. and cmp
197 > is an unsigned compare (IIR). but i (and some others) feel that the word
198 > string connotes some printable attributes.
200 Anyone harboring that notion should be edified. Why should we
204 > >> (ascii really means printing chars, and probably just digits and
205 > >> us-ascii unless you use locale and it works for extendied ascii).
207 > JP> Um, imho, all this stuff about ascii is irrelevant.
209 > well how would you differentiate converting a key to printable ascii
212 The important question is, *why* would you?
213 I don't think we should, in this case at least.
216 > stringifying implies printable.
218 This is where you and I fundamentally disagree, I guess.
220 > but if you create a 8 bit set of bytes that would be something
221 > more complex that won't print (without going to hex or something).
223 Nothing wrong with that. The raw printability of the characters
224 on any given terminal is irrelevant, imho.
228 > is linearization since a set of integers and floats, are now a linear
229 > set of sortable bytes, but not a printable string.
231 I think this is a bad misuse of the term "linear".
232 Maybe you could come up with some other term that expresses
236 > this is fine edge semantics, but i do like having two terms for the
237 > printable and non-printable converted keys. maybe linearization is not a
238 > great term but i want to reserve stringifying for printable only.
240 Whatever. You and I don't see eye-to-eye on this.
243 > i recall [Rutka] had some early misstatements about not liking perl
244 > (or we interpreted his poor wording to mean that).
246 I remember he said something like that about the ST specifically;
247 don't remember anything about Perl generally...
254 Return-Path: <jdporter@min.net>
255 From: John Porter <jdporter@min.net>
256 Subject: Sorting article
258 Date: Sat, 2 Jan 1999 23:45:47 -0500 (EST)
260 In-Reply-To: <199901021831.NAA12831@home.sysarch.com.> from "Uri Guttman" at Jan 2, 99 01:31:45 pm
261 Content-Type: text/plain; charset=US-ASCII
267 > > > in both cases, the original data is usually concatenated onto the end of
268 > > > the created sort key and then extracted (usually with substr) after the
269 > > > sort is done. neither string sort can handle perl references directly,
270 > > > though the referenced data can be used to create a single sort key and
271 > > > then ST or OM can then be used for the actual sort.
273 > > I don't think I'd say "usually". You should say "possibly", with
274 > > a description of the situations that allow it.
276 > fine. what situations exist where you would not need to concatenate the
277 > original data onto the key?
279 I'm not talking not needing to do it, I'm talking about not being able
280 to do it. If you're sorting a list of "objects" e.g., you can't simply
281 concat a reference onto the end of a string. However, as you pointed out,
282 one could simply concat an index for the object; so now I can't think of
283 any situations where the pack technique can't be used.
285 The slice of the pie I'd like to tackle is addressing certain
287 1. Comparison keys are unique, slightly non-unique, or highly non-unique;
288 2. Cost of deriving the comparison key from each datum is nil, small,
290 According to my benchmarks, these had a significant influence on the
291 relative performance of various techniques (ST, OM, etc.).
298 Return-Path: <jdporter@min.net>
299 From: John Porter <jdporter@min.net>
300 Subject: Re: Sorting article
302 Date: Mon, 4 Jan 1999 08:59:18 -0500 (EST)
304 In-Reply-To: <199901030455.XAA13708@home.sysarch.com.> from "Uri Guttman" at Jan 2, 99 11:55:35 pm
305 Content-Type: text/plain; charset=US-ASCII
309 > good. i like the point you make about the index vs. the reference. this
310 > could be a sort of combination of swchartzian and stringify (either
311 > flavor). if you have to sort a list of refs (or complex stuff),
312 > preprocess into a array with a single ref to the real data, make a list
313 > of stringified keys and append the index to the key and sort with
314 > builtin cmp. then extract out the indexes and original ref(s). the
315 > schwartzian part is doing it with maps.
317 The similary to ST had occured to me.
318 map { (split("\0",$_,2))[0] }
320 map { join "\0", $_->cmpkey(), $_ }
323 But this is as close as it gets. The other "advantage" of ST --
324 multiple key comparisons -- is obviated by our technique.
327 > tho i beleive temp arrays and
328 > foreach modifiers can be or are faster than maps. we have seen cases
329 > where multiple maps ar slow. i think it is because it has to recopy all
330 > the values for each map. a foreach on a temp array would use the same
331 > array without copying.
333 I thought the slow part was all those anonymous arrays which are
334 created in the first map. map itself doesn't allocate an AV, it makes
335 a list. Same with sort. So if you're sorting a list of 1000 items,
336 you have three lists created (map, sort, map), but 1000 anonymous AV's
337 allocated in the first map.
344 Return-Path: <lr@hpl.hp.com>
345 From: Larry Rosler <lr@hpl.hp.com>
346 To: "'Uri Guttman'" <uri@sysarch.com>, jdporter@min.net
347 Subject: RE: Sorting article
348 Date: Mon, 4 Jan 1999 12:55:54 -0800
349 Content-Type: text/plain;
353 > From: Uri Guttman [mailto:uri@sysarch.com]
354 > Sent: Monday, January 04, 1999 08:23
355 > To: jdporter@min.net
357 > Subject: Re: Sorting article
359 > P.S. larry has some catchup reading and replying to do now that he
360 > should be back from vacation.
362 Yes, I am back. We drove 7 hours yesterday from Southern California,
363 much of it through some heavy fog in the Central Valley, so I am still a
364 little bleary. And I have tried to plow through over 1200 articles in
365 c.l.p.misc, with so far only two conclusions:
367 1. The group got along fine without me (something to think about for
370 2. Uri seems to have had a couple of really bad days through the past
373 I haven't had a chance to read all of the correspondence between you two
374 yet, but I did skim a couple, and noticed the mention of sorting
375 indexes. I "solved" the problem of using the default sort on references
376 while on vacation, *literally* while I was sleeping (i.e., I woke up and
377 wrote it down!). I then generalized it some, and thought it would be a
378 world-beater speedwise. But my first measurements are somewhat
381 Here is what I have so far; I'm sharing it before doing much else
382 because your comments and observations would be most useful. It is
383 actually a different sorting paradigm, as you will see:
385 EVOLUTION OF EFFICIENT PERL SORTS
387 Naive { sortkey(value_a) cmp sortkey(value_b) }
388 Orcish $h{ value } ||= sortkey OR @h{ values } = sortkeys
389 Schwartz [ value, sortkey, ... ]
390 Linearized sortkey . value
391 New @h{ sortkeys } = values
395 #!/usr/local/bin/perl -w
399 my @a = map { [ int(rand 1000), 'foo' . int(rand 1000) ] } 1 .. $n;
404 @h{ map pack('N A* A*' => $_->[0], $_->[1], $i++) => @a } = @a;
405 my @out = @h{ sort keys %h };
409 my @out = sort { $a->[0] <=> $b->[0] || $a->[1] cmp $b->[1] } @a;
412 timethese(1 << (shift || 0), {
414 Usual => sub { &usual },
418 Benchmark: timing 16 iterations of New, Usual...
419 New: 10 wallclock secs ( 9.56 usr + 0.14 sys = 9.70 CPU)
420 Usual: 13 wallclock secs (12.92 usr + 0.00 sys = 12.92 CPU)
424 $i serves two functions: insuring the uniqueness of the sort key (which
425 is saved as the key to a hash, so must be unique); preserving the
426 stability of the sort (entries that sort equal preserve their order in
429 Do you like those two hash slices, Uri? Take a look at the next
430 attempt, which adds an array slice.
432 In the above case, storing the references themselves as values in the
433 hash is efficient. But I was concerned about generalizing this sorting
434 paradigm to values that were not references, but might be arbitrarily
435 long strings. Before I left on vacation, someone had posted about
436 perhaps sorting references to the strings, or indexes into the array.
437 So I tried the following:
439 #!/usr/local/bin/perl -w
443 my @a = map { sprintf "XXX %d %s %s", int(rand 1000), 'foo' . int(rand
445 'x' x 100 . "\n" } 1 .. $n;
450 @h{ map pack('N A* A*' => /(\d+) (\w+)/, $i++) => @a } = 0 .. $#a;
451 my @out = @a[ @h{ sort keys %h } ];
455 my @out = map { substr($_, 1 + rindex $_, "\0") } sort
456 map pack('N A* x A*' => /(\d+) (\w+)/, $_) => @a;
459 timethese(1 << (shift || 0), {
460 Indexed => sub { &indexed },
461 Linear => sub { &linear },
465 Benchmark: timing 16 iterations of Indexed, Linear...
466 Indexed: 15 wallclock secs (14.06 usr + 0.58 sys = 14.64 CPU)
467 Linear: 13 wallclock secs (12.44 usr + 0.61 sys = 13.05 CPU)
471 I had hoped that this would be better, because it seemingly avoids
472 appending the data to the sortkeys and then extracting via the substrs.
473 But both of those operations take place in the O(n) part of the sort,
474 not in the O(n log n). So perhaps I was expecting too much. But why is
477 I have to catch up now on the rest of the world, including my *real*
478 job. But I wanted to get this off my chest (and onto yours? :-) ASAP.
486 Hewlett-Packard Company
487 http://www.hpl.hp.com/personal/Larry_Rosler/
492 From: Larry Rosler <lr@hpl.hp.com>
493 To: "'John Porter'" <jdporter@min.net>
495 Subject: RE: Sorting article
496 Date: Wed, 6 Jan 1999 13:27:16 -0800
497 Content-Type: text/plain
500 > From: John Porter [mailto:jdporter@min.net]
501 > Sent: Wednesday, January 06, 1999 13:02
502 > To: lr@hplb.hpl.hp.com
503 > Cc: uri@sysarch.com
504 > Subject: Re: Sorting article
508 > > New @h{ sortkeys } = values
510 > > @h{ map pack('N A* A*' => $_->[0], $_->[1], $i++) => @a } = @a;
511 > > my @out = @h{ sort keys %h };
513 > This actually reminds me of a technique I first learned about from
514 > Joseph Hall, see EPP, the bottom of p. 47, where he does something
515 > similar with an array. He calls it a "mundane" way to sort. !
519 I think you clipped the wrong piece from my letter. Hall is
520 demonstrating the sort-the-indexes sort, but his sort function uses the
521 'mundane' key comparisons (from an array instead of a hash). I was
522 trying to combine the index sort with the default sort. This is the
528 @h{ map pack('N A* A*' => /(\d+) (\w+)/, $i++) => @a } = 0 .. $#a;
529 my @out = @a[ @h{ sort keys %h } ];
532 The nesting of the hash slice inside of the array slice is not
539 From: Larry Rosler <lr@hpl.hp.com>
540 To: "'Uri Guttman'" <uri@sysarch.com>, jdporter@min.net
541 Subject: RE: sort article outline
542 Date: Thu, 7 Jan 1999 23:05:27 -0800
543 Content-Type: text/plain
546 Hi. I'm still way behind on stuff, especially this work. But here is a
547 thought on terminology.
549 Rather than 'stringified sort' or 'linearized sort', how about
550 'lexicographic sort'? If that isn't definitive enough, consider 'pure
551 lexicographic sort' or 'default lexicographic sort'.
553 Which reminds me: Whoever now owns the draft of the paper (Uri?) --
554 make sure to include a note that all this requires that 'no locale;' be
555 in effect, or the idea falls apart. This is from perllocale:
559 print +(sort grep /\w/, map { chr() } 0..255), "\n";
561 This machine-native collation (which is what you get unless use locale
562 has appeared earlier in the same block) must be used for sorting raw
563 binary data, whereas the locale-dependent collation of the first example
564 is useful for natural text.
567 Note the 'sorting raw binary data' which is what we are talking about.
569 I dislike 'stringified' or 'stringized' for several reasons, not the
570 least of which is that verbification of adjectives isn't good English,
571 IMO. 'Linearized' (also a verbification of an adjective, but with a
572 better pedigree) may be a little off-beat semantically. I think
573 'lexicographic' comes from the C Standard -- I'll take a look tomorrow.
579 From: Larry Rosler <lr@hpl.hp.com>
580 To: "'Uri Guttman'" <uri@sysarch.com>
581 Subject: RE: sort article outline
582 Date: Fri, 8 Jan 1999 10:36:21 -0800
583 Content-Type: text/plain;
587 > From: Uri Guttman [mailto:uri@sysarch.com]
588 > Sent: Friday, January 08, 1999 10:07
590 > Cc: jdporter@min.net
591 > Subject: Re: sort article outline
595 > i seemed to have misplaced (lost) your comments to my outline. do you
596 > have that letter and can you forward it to me? i suspect you keep all
597 > you old email around. you seem that type. i delete most of mine. i
599 > be switching to vm on emacs which has folders and stuff and it might
600 > improve that aspect of things.
604 All it costs is disk space, now on someone else's server (because I have
605 switched from running elm on my Unix box to using Microsoft Exchange,
606 which the rest of my group uses).
608 The letter is below, and I have not re-edited it. But note some things:
609 the use of 'lexicographic' (as in my most recent suggeston) and the
610 mention of 'index sorts' which have since received more discussion
615 > From: Uri Guttman [mailto:uri@sysarch.com]
616 > Sent: Tuesday, December 22, 1998 21:47
617 > To: jdporter@min.net; lr@hpl.hp.com
618 > Subject: sort article outline
622 > i felt productive tonight and i didn't have other work and news was
624 > so i wrote an outline for the sort article. i like it so far and i
626 > it will make for a good article (or 2).
628 Likely the latter, if it is thorough (as the outline seems to be).
630 > so please shred it up and send your comments. when we agree on a
631 > (semi-) final outline, then we can start fleshing out the paragraphs
632 > with text and code and benchmarks.
634 > an open area is benchmarking and describing when to use each of the
635 > speedup tricks. this will need some research and playing with code and
638 'When to use' is what the summary is about. See below.
640 > we should have a set of benchmark data we create to drive this so we
642 > all use it and cme up with useful benchmarks. having this stuff on a
643 > website (maybe MJD could help us) and having others playing with our
644 > stuff could be an intersting idea.
646 > otherwise, a belated happy hannukah to larry (we threw a sitdown
648 > party for 14 on sunday, the eigth night. we had 8 menorahs going!),
650 > happy holidays to you john.
652 Nice image (8 menorahs * 9 candles each). And to you both.
657 > Outline for TPJ Sort Article
659 > (Note the use of caps here :-)
661 > I. Overview of sorting in the real world.
663 Refs. Knuth Vol. 3, Chap. 5
666 Introduce O() concept and notation briefly
670 > B. Why is it needed
672 Any program that produces tabular output is almost certain to need it!
698 > D. O(n) order function
700 Is this later than needed, in order to characterize the sort
701 algorithms mentioned above?
705 > 2. Constant and linear overhead for pre- and post-
708 > 3. Constant overhead per compare
710 > 4. Why reducing compare count is critical
712 > 5. Why amount of sort data matters
714 > II. Perl's sort function
716 Implementation -- C qsort() with optional callbacks to a Perl function?
717 Limitations of Perl internal sorting -- size of data set, w/o
719 Usefulness of highly tuned system sort command or commercial
720 sort packages (e.g., SyncSort)
722 > A. Builtin compare is cmp
724 and implementation is optimal (C or assembly memcmp function)
726 > B. Custom compare code
728 (in callback function)
736 > 4. sub ref (to be done)
738 > C. Special vars $a and $b in compare
740 > III Ways to sort in Perl
742 Ascending alphabetic (lexicographic) sort (default)
748 > C. Multi-level sort
750 Are these 'field' sorts?
752 > D. Array index sort
754 This is a derived-key sort.
758 This is a derived-key sort.
760 > F. Complex compare code
762 No single-valued sort key per datum. Beyond the scope of
765 > IV. Efficient sorting in Perl
767 > A. When efficiency is not warranted
768 ^ an investment in improving
772 > 2. Rarely executed program
774 More generally, low cost compared to other parts of the
775 program. Quote from Rob Pike about when to optimize (don't!).
779 > 1. builtin vs. {$a cmp $b}
781 Trivial sorts (key is trivially derived from datum):
782 a. {$a <=> $b} or reverse
783 b. {$hash{$a} cmp $hash{$b}} or numeric; or reverse
784 c. {$a->[CONST] cmp $b->[CONST]} or numeric; or reverse
786 > C. General ideas on how to speed up sort
788 > 1. Preprocessing data
789 to achieve a trivial sort
791 > 2. Simple and fast compare code (builtin if possible)
795 > 3. Postprocessing data from LOL (if needed)
797 > D. Speed up techniques
799 > 1. Schwartzian transform (from Randal Schwartz)
801 > a. Converts input to LOL, index sort, convert LOL to output
803 index sort? This is Trivial Sort c (see above).
805 > b. uses maps or temp arrays
807 > 2. Orcish (Or-cache) manoever (from Joseph Hall)
809 > a. Converts sort data in compare code
811 > b. Converts only once per unique sort data
813 but checks every time to see if key derivation is needed.
815 > c. Stores converted keys in private hash
817 This is Trivial Sort b (see above).
819 > d. No need for post processing, as sort is on real data
821 Where is the discussion about creating the hash during a
822 pre-processing pass, obviating the test on each comparand?
824 > 3. Stringifying (need better name) (from many people)
826 How about 'Linearized Sort'?
828 > a. Preprocessing of input data to sortable strings
830 'string' == arbitrary sequence of bytes, stored in a Perl
831 scalar variable (u.e., linear).
833 > i. uses either pack or sprintf (or other perl code).
835 > ii. Converted key is combination of binary and ascii
838 > iii. Original data is appended onto converted key
840 See below -- cannot be references.
842 > b. Uses builtin compare (fastest)
844 > c. Postprocess keys to get original data (use substr
847 Note that this is isomorphic to the Schwartz Transform
848 map/sort/map paradigm.
850 > d. Can handle integers (signed or unsigned), ascii
854 > normal or reverse sorts on a per key basis.
856 > e. Can handle floats (IEEE format only) with code to
858 ^ non-negative, until someone gets smarter!
861 Cannot handle references!
867 Fully-qualified domain names (right to left)
876 > Uri Guttman ----------------- SYStems ARCHitecture and
878 > Perl Hacker for Hire ---------------------- Perl,
879 Internet, UNIX Consulting
880 > uri@sysarch.com ------------------------------------
881 http://www.sysarch.com
882 The Best Search Engine on the Net -------------
883 http://www.northernlight.com
885 Good start. Excelsior!
887 I am having trouble minding my tongue about Rutka's obnoxious posts, but
888 he won't disappear (and he now works for Lucent, my alma mater.
891 I will be away from December 25 through January 3; real vacation (no
892 electronic communication).
900 Hewlett-Packard Company
901 http://www.hpl.hp.com/personal/Larry_Rosler/
906 From: Larry Rosler <lr@hpl.hp.com>
907 To: "'Uri Guttman'" <uri@sysarch.com>
909 Subject: RE: sort article outline
910 Date: Fri, 8 Jan 1999 12:32:49 -0800
911 Content-Type: text/plain
914 > From: Uri Guttman [mailto:uri@sysarch.com]
915 > Sent: Friday, January 08, 1999 09:59
917 > Cc: jdporter@min.net
918 > Subject: Re: sort article outline
921 > LR> make sure to include a note that all this requires that 'no
923 > LR> in effect, or the idea falls apart. This is from perllocale:
927 > LR> print +(sort grep /\w/, map { chr() } 0..255), "\n";
929 > LR> This machine-native collation (which is what you get unless use
931 > LR> has appeared earlier in the same block) must be used for sorting
933 > LR> binary data, whereas the locale-dependent collation of the first
935 > LR> is useful for natural text.
938 > LR> Note the 'sorting raw binary data' which is what we are talking
941 > is this due to signed/unsigned cmp issues? that 0 .. 255 might map to
942 > 128 ..255, 0 .. 127 on some machines because of signed char compares?
943 > this is another porting issue for packed sorts. not just endianess.
945 No. Without looking at the perl sources, I venture to guess that it is
946 because the 'no locale;' default sort uses 'memcmp' while the 'use
947 locale;' default sort uses 'strcoll' which relies on LC_COLLATE. There
948 might be a signedness problem if 'strncmp' were used instead of
949 'memcmp'. But I recall John saying that 'memcmp' was used (and it would
950 *have* to be to conform to Perl's definition of a string, which is a
951 sequence of arbitrary bytes whose length is specified by a separate
952 integer, vs. C's definition of a string, which is a sequence of bytes
953 terminated by a zero byte). In any case, I have yet to see any machine
954 dependencies in my tests on PA-RISC or Intel architectures. Does anyone
955 know of any target that uses signed-char compares that this could be
956 verified against -- or just look into the perl source (which I haven't
957 done yet) to make sure it uses 'memcmp'.
959 > printable sorts (hey i like these terms) need to worry about
962 Well, that is interesting. You distinguish between a 'packed' sort key
963 (which *surely* demands 'no locale;') and a 'printable' ('stringized')
964 sort key (which perhaps relies on 'use locale;'). Maybe that is a real
965 distinction that should be understood and documented. A mixed sort key
966 (packed number and a string, for example) would have to be sorted
967 without locale, though the number could be stringized by 'sprintf
968 "%10u"' in which case locale could be observed.
974 From: Larry Rosler <lr@hpl.hp.com>
975 To: "'Uri Guttman'" <uri@sysarch.com>
976 Cc: "'jdporter@min.net'" <jdporter@min.net>
977 Subject: RE: sort article outline
978 Date: Fri, 8 Jan 1999 12:59:24 -0800
979 Content-Type: text/plain
983 > No. Without looking at the perl sources, I venture to guess
984 > that it is because the 'no locale;' default sort uses
985 > 'memcmp' while the 'use locale;' default sort uses 'strcoll'
986 > which relies on LC_COLLATE. There might be a signedness
987 > problem if 'strncmp' were used instead of 'memcmp'. But I
988 > recall John saying that 'memcmp' was used (and it would
989 > *have* to be to conform to Perl's definition of a string,
990 > which is a sequence of arbitrary bytes whose length is
991 > specified by a separate integer, vs. C's definition of a
992 > string, which is a sequence of bytes terminated by a zero
993 > byte). In any case, I have yet to see any machine
994 > dependencies in my tests on PA-RISC or Intel architectures.
995 > Does anyone know of any target that uses signed-char compares
996 > that this could be verified against -- or just look into the
997 > perl source (which I haven't done yet) to make sure it uses 'memcmp'.
999 >From 'sv.c', the only comparison routine used is 'memcmp'. When locale
1000 is in effect, each comparand is passed once through strxfrm and the
1001 transformed value is saved for sort comparisons. (The mechanism must be
1002 like that which converts numbers to strings or v.v. once only.)
1004 So I think character-signedness is a non-issue.
1010 From: Larry Rosler <lr@hpl.hp.com>
1011 To: "'John Porter'" <jdporter@min.net>, uri@sysarch.com
1012 Subject: RE: sort article outline
1013 Date: Fri, 8 Jan 1999 14:29:13 -0800
1014 Content-Type: text/plain
1015 Content-Length: 2430
1017 > From: John Porter [mailto:jdporter@min.net]
1018 > Sent: Friday, January 08, 1999 11:03
1019 > To: uri@sysarch.com
1021 > Subject: Re: sort article outline
1024 > > JP> But it's also long-winded. Can't we simply talk about the
1025 > > JP> "default sort" (locale issues aside)?
1027 > > i don't like that because it doesn't describe what transforms are
1029 > > done to get it to use the builtin cmp. just oo vague a term.
1031 > Same applies to "default lexicographic sort". It's not really for
1032 > describing any of the transforms we're investigation, but for the
1033 > underlying sort (qsort/cmp) which we're exploiting.
1035 > > what is important is the method of transforms that
1036 > > allows you to use the builtin cmp. so i feel the name
1037 > should reflect that.
1039 > Right. But everything needs a name, even the "default sort".
1041 Here, in journalistic format, is a synopsis of where we stand. You may
1042 prefer 'Advanced' to 'Enhanced'.
1044 WHAT: The Enhanced Default Sort
1046 WHY: For optimal performance on large sorts (often with several sort
1047 fields), compare sort keys using built-in C functions rather than
1048 using Perl code in a sort subroutine.
1050 WHERE: Wherever the sort is based on comparing a single-valued mapping
1051 function of each value to be sorted. (This is true for all
1052 enhanced sorts -- ST, Orcish, ...)
1054 HOW: For each value to be sorted, create a sort key that is a Perl
1055 string (a sequence of bytes) corresponding lexicographically to
1057 position of the value in the sorted output; append the value to
1059 sort key so that it can readily be retrieved. The sort key can
1061 created using any combination of 'pack', 'sprintf' or
1063 Each field to be sorted may comprise a signed or unsigned
1065 a floating-point number, or a string of characters. (Special
1066 handling is required if the values to be sorted are references --
1069 WHEN: Compute the keys in a single pass through the list of values
1071 Sort. Retrieve the values from the keys (map).
1073 Other minor issues: As stated, this method sorts two values whose keys
1074 compare equal according to the values themselves (which is the usual
1075 fall-through condition for a fielded sort). A 'stable' sort can be
1076 forced (using any sorting method) by adding a sequential index as the
1079 GO NINERS!!! (Bye-bye, Pats :-()
1085 i have to write these down now or i may forget them.
1087 a sort module api could look like this:
1089 $sort_obj = Sort->new( [ sort keys desriptions ], gloabl sort options
1091 sort key description => {
1093 'type' => integer, string, float, etc.
1097 'code_to_get_key_value' => code ref OR code text.
1101 $sort_obj->values( list of values or refs)
1103 can be called again to add more to list.
1105 $sort_obj->sort returns sorted data
1108 if the option is to build a preprocess sub, we take the code texts and
1109 create the text for it. then it will run fast, an optimization for large
1110 sorts. otherwise, we use the code ref which takes a ref (and key
1111 number?) for an argument and we call it to return the nth key value.
1113 from this info we create a pack format string.
1115 any inversions or endian things are handled here in the preprocess code
1118 for each sort value we build up an array of key values. then we just
1119 do a pack and push its value.
1121 we also generate an unpack format (not same as pack format but related)
1122 to get the index and/or the original values to return.
1124 the pack formats, refs to processing code and other stuff is stored in
1125 the sort object for reuse.
1127 we can even have special case packs for common things like ip addresses
1128 so we can optimize even further there. ip can be designated as text or
1131 well, what do you think?