Return-Path: From: John Porter Subject: Re: sort article outline To: uri@sysarch.com Date: Mon, 28 Dec 1998 11:46:32 -0500 (EST) Cc: lr@hpl.hp.com In-Reply-To: <199812280441.XAA11943@home.sysarch.com.> from "Uri Guttman" at Dec 27, 98 11:41:54 pm Content-Type: text/plain; charset=US-ASCII Content-Length: 1680 > JP> Still linear. Maybe the constants are painfully large, but we're > JP> still looking at O(n). > > i think i see why we don't agree here. i am talking about the linear set > of bytes and you are talking about liner O(n) for the work. is that it? Yep. > >> hey, it's english. > > JP> Marginally. > > ??? marginal quality english or marginally in english? The latter. My point is that wrt computer jargon, esp. Perl jargon, english is not much of a standard. I.e. what "string" means in english has only marginal bearing on what it means in Perl. > different names. each has advantages. the ascii version can use short > integer strings (<4 bytes) and is always endian independent. you could > pack into a byte or short if you want to. binary can > be more compact for some data formats (floats in particular) but endian > issues come up and have to be dealt with. > > you can cover the concept as one but the conversion process is different > as one uses sprintf and one uses pack. though the sprintf could use pack > is some ways. the sprintf is used to zero pad integers and print floats > in a sortable format. pack is used to store the integers in a byte order > that cmp can sort. O.k, I think I see what you're getting at. "Stringify" (in Perl) has always refered to the conversion which occurs transparently when a non-string SV gets used as a string. And more generally, the same effect can be achieved by sprintf. The result is a string which, when printed, represents the non-string value of the SV (be it integer, float, reference, whatever). So maybe the right way to describe a string which is produced by pack is "packed". John Return-Path: From: John Porter Subject: Re: sort article outline To: uri@sysarch.com Date: Sun, 27 Dec 1998 18:15:36 -0500 (EST) In-Reply-To: <199812262117.QAA09392@sysarch.com.> from "Uri Guttman" at Dec 26, 98 04:17:34 pm Content-Type: text/plain; charset=US-ASCII Content-Length: 3189 Uri wrote: > not if you are packing integers and floats with optional endian > reversals. the whole concept is to convert various data types to one > concatenated set of keys sortable by cmp. cmp can't handle ints and > floats normally. larry showed how to handle signed ints and floats with > endian issues covered. so there is a conversion to a linear set of bytes > sortable by cmp. the original "disparate set of keys" has been mapped to > a single sortable key. Still linear. Maybe the constants are painfully large, but we're still looking at O(n). > JP> Why should we pander to the lcd? > > i don't think it is pandering at all. all i think is one form is a > subset of the other form and should have a different name. Well clearly, printable characters are a subset (variously defined) of the set of octets. My point is that Perl doesn't care, and Perl programmers shouldn't care. In Perl terminology, a string is the octet buffer part of a SV. Printability is irrelevant. > there are two issues we are arguing > here, whether there should be two names and what those names are. Yep. > JP> This is where you and I fundamentally disagree, I guess. > > hey, it's english. Marginally. > JP> Nothing wrong with that. The raw printability of the characters > JP> on any given terminal is irrelevant, imho. > > i disagree. some may choose to use the printable form for that very > attribute. sometime the converted keys may be kept around and used for > other long term purposes where printability may be a needed attribute. That's why this sorting issue is so open-ended. We can't lay down some hard-and-fast technique (like the ST). If the printability is critical, then the programmer has to work within that constraint. But if the performance of the sort is of primary concern, then she may have to sacrifice the intrinsic printability of the comparison keys; then they can do a transformation back into printable form, or cache the printable form, or whatever. > by having 2 names for 2 subtly different packing methods i think > that helps them decide which to choose. I'm all for educating the programmer on the issues and choices. > JP> I think this is a bad misuse of the term "linear". > JP> Maybe you could come up with some other term that expresses > JP> that notion. > > well, i agree the term may not be the best, but do you agree on the > difference between the methods, as i hve described above? Clearly not. > JP> You and I don't see eye-to-eye on this. > > maybe my argument above will convince you more. or i can get your > mother-in-law to agree with me and then you will be sorry! Um, unfortunately, you have only clarified your position. You have not added any persuasive content. > JP> I remember he said something like that about the ST specifically; > JP> don't remember anything about Perl generally... > > maybe my memory was faulty but he was so negative on others that i > remember a slam on perl in some letters. it would be tricky to scan him > on dejanews for just that. I'm not saying I don't believe he said such things; I just don't remember them. Knowing Rutka, he probably did! John Return-Path: From: John Porter Subject: Re: sort article outline To: uri@sysarch.com Date: Sat, 26 Dec 1998 11:30:34 -0500 (EST) Cc: lr@hpl.hp.com In-Reply-To: <199812260558.AAA08301@sysarch.com.> from "Uri Guttman" at Dec 26, 98 00:58:39 am Content-Type: text/plain; charset=US-ASCII Content-Length: 2668 Uri wrote: > >>>>> "JP" == John Porter writes: > > >> also if you > >> put integers in the perl byte string, then it is not stringifying which > >> slightly implies ascii. > > JP> Not in Perl, imho. > JP> I think cmp implies strings, regardless of the character value range. > JP> We're talking about comparing arrays of byte-sized "characters", > JP> i.e. strings. > > >> so linearization is a better overall term, > > JP> I find your argument, such as it is, unconvincing. > JP> In what way has the process made anything linear? > JP> That's what "linearization" means, after all. > > you are taking a disparate set of keys and making them into a single key > of a linear set of bytes. Even so, the "disparate set of keys" is already linear, wrt the sorting algorithm. At least if it's done right, which presumably it is. > i understand that perl is 8 bit clean regarding strings and cmp. and cmp > is an unsigned compare (IIR). but i (and some others) feel that the word > string connotes some printable attributes. Anyone harboring that notion should be edified. Why should we pander to the lcd? > >> (ascii really means printing chars, and probably just digits and > >> us-ascii unless you use locale and it works for extendied ascii). > > JP> Um, imho, all this stuff about ascii is irrelevant. > > well how would you differentiate converting a key to printable ascii > vs. integer bytes? The important question is, *why* would you? I don't think we should, in this case at least. > stringifying implies printable. This is where you and I fundamentally disagree, I guess. > but if you create a 8 bit set of bytes that would be something > more complex that won't print (without going to hex or something). Nothing wrong with that. The raw printability of the characters on any given terminal is irrelevant, imho. > that > is linearization since a set of integers and floats, are now a linear > set of sortable bytes, but not a printable string. I think this is a bad misuse of the term "linear". Maybe you could come up with some other term that expresses that notion. > this is fine edge semantics, but i do like having two terms for the > printable and non-printable converted keys. maybe linearization is not a > great term but i want to reserve stringifying for printable only. Whatever. You and I don't see eye-to-eye on this. > i recall [Rutka] had some early misstatements about not liking perl > (or we interpreted his poor wording to mean that). I remember he said something like that about the ST specifically; don't remember anything about Perl generally... John Return-Path: From: John Porter Subject: Sorting article To: uri@sysarch.com Date: Sat, 2 Jan 1999 23:45:47 -0500 (EST) Cc: lr@hpl.hp.com In-Reply-To: <199901021831.NAA12831@home.sysarch.com.> from "Uri Guttman" at Jan 2, 99 01:31:45 pm Content-Type: text/plain; charset=US-ASCII Content-Length: 1354 > John wrote: > > Uri wrote: > > > in both cases, the original data is usually concatenated onto the end of > > > the created sort key and then extracted (usually with substr) after the > > > sort is done. neither string sort can handle perl references directly, > > > though the referenced data can be used to create a single sort key and > > > then ST or OM can then be used for the actual sort. > > > > I don't think I'd say "usually". You should say "possibly", with > > a description of the situations that allow it. > > fine. what situations exist where you would not need to concatenate the > original data onto the key? I'm not talking not needing to do it, I'm talking about not being able to do it. If you're sorting a list of "objects" e.g., you can't simply concat a reference onto the end of a string. However, as you pointed out, one could simply concat an index for the object; so now I can't think of any situations where the pack technique can't be used. The slice of the pie I'd like to tackle is addressing certain issues, namely: 1. Comparison keys are unique, slightly non-unique, or highly non-unique; 2. Cost of deriving the comparison key from each datum is nil, small, or large. According to my benchmarks, these had a significant influence on the relative performance of various techniques (ST, OM, etc.). John Return-Path: From: John Porter Subject: Re: Sorting article To: uri@sysarch.com Date: Mon, 4 Jan 1999 08:59:18 -0500 (EST) Cc: lr@hpl.hp.com In-Reply-To: <199901030455.XAA13708@home.sysarch.com.> from "Uri Guttman" at Jan 2, 99 11:55:35 pm Content-Type: text/plain; charset=US-ASCII Content-Length: 1306 Uri wrote: > good. i like the point you make about the index vs. the reference. this > could be a sort of combination of swchartzian and stringify (either > flavor). if you have to sort a list of refs (or complex stuff), > preprocess into a array with a single ref to the real data, make a list > of stringified keys and append the index to the key and sort with > builtin cmp. then extract out the indexes and original ref(s). the > schwartzian part is doing it with maps. The similary to ST had occured to me. map { (split("\0",$_,2))[0] } sort map { join "\0", $_->cmpkey(), $_ } @a But this is as close as it gets. The other "advantage" of ST -- multiple key comparisons -- is obviated by our technique. > tho i beleive temp arrays and > foreach modifiers can be or are faster than maps. we have seen cases > where multiple maps ar slow. i think it is because it has to recopy all > the values for each map. a foreach on a temp array would use the same > array without copying. I thought the slow part was all those anonymous arrays which are created in the first map. map itself doesn't allocate an AV, it makes a list. Same with sort. So if you're sorting a list of 1000 items, you have three lists created (map, sort, map), but 1000 anonymous AV's allocated in the first map. John Return-Path: From: Larry Rosler To: "'Uri Guttman'" , jdporter@min.net Subject: RE: Sorting article Date: Mon, 4 Jan 1999 12:55:54 -0800 Content-Type: text/plain; charset="iso-8859-1" Content-Length: 4272 > From: Uri Guttman [mailto:uri@sysarch.com] > Sent: Monday, January 04, 1999 08:23 > To: jdporter@min.net > Cc: lr@hpl.hp.com > Subject: Re: Sorting article ... > P.S. larry has some catchup reading and replying to do now that he > should be back from vacation. Yes, I am back. We drove 7 hours yesterday from Southern California, much of it through some heavy fog in the Central Valley, so I am still a little bleary. And I have tried to plow through over 1200 articles in c.l.p.misc, with so far only two conclusions: 1. The group got along fine without me (something to think about for the future :-). 2. Uri seems to have had a couple of really bad days through the past week or so. :-( I haven't had a chance to read all of the correspondence between you two yet, but I did skim a couple, and noticed the mention of sorting indexes. I "solved" the problem of using the default sort on references while on vacation, *literally* while I was sleeping (i.e., I woke up and wrote it down!). I then generalized it some, and thought it would be a world-beater speedwise. But my first measurements are somewhat disappointing. Here is what I have so far; I'm sharing it before doing much else because your comments and observations would be most useful. It is actually a different sorting paradigm, as you will see: EVOLUTION OF EFFICIENT PERL SORTS Naive { sortkey(value_a) cmp sortkey(value_b) } Orcish $h{ value } ||= sortkey OR @h{ values } = sortkeys Schwartz [ value, sortkey, ... ] Linearized sortkey . value New @h{ sortkeys } = values #!/usr/local/bin/perl -w use Benchmark; my $n = 10000; my @a = map { [ int(rand 1000), 'foo' . int(rand 1000) ] } 1 .. $n; sub new { my $i = 'aaa'; my %h; @h{ map pack('N A* A*' => $_->[0], $_->[1], $i++) => @a } = @a; my @out = @h{ sort keys %h }; } sub usual { my @out = sort { $a->[0] <=> $b->[0] || $a->[1] cmp $b->[1] } @a; } timethese(1 << (shift || 0), { New => sub { &new }, Usual => sub { &usual }, }); __END__ Benchmark: timing 16 iterations of New, Usual... New: 10 wallclock secs ( 9.56 usr + 0.14 sys = 9.70 CPU) Usual: 13 wallclock secs (12.92 usr + 0.00 sys = 12.92 CPU) Comments: $i serves two functions: insuring the uniqueness of the sort key (which is saved as the key to a hash, so must be unique); preserving the stability of the sort (entries that sort equal preserve their order in the output array). Do you like those two hash slices, Uri? Take a look at the next attempt, which adds an array slice. In the above case, storing the references themselves as values in the hash is efficient. But I was concerned about generalizing this sorting paradigm to values that were not references, but might be arbitrarily long strings. Before I left on vacation, someone had posted about perhaps sorting references to the strings, or indexes into the array. So I tried the following: #!/usr/local/bin/perl -w use Benchmark; my $n = 10000; my @a = map { sprintf "XXX %d %s %s", int(rand 1000), 'foo' . int(rand 1000), 'x' x 100 . "\n" } 1 .. $n; sub indexed { my $i = 'aaa'; my %h; @h{ map pack('N A* A*' => /(\d+) (\w+)/, $i++) => @a } = 0 .. $#a; my @out = @a[ @h{ sort keys %h } ]; } sub linear { my @out = map { substr($_, 1 + rindex $_, "\0") } sort map pack('N A* x A*' => /(\d+) (\w+)/, $_) => @a; } timethese(1 << (shift || 0), { Indexed => sub { &indexed }, Linear => sub { &linear }, }); __END__ Benchmark: timing 16 iterations of Indexed, Linear... Indexed: 15 wallclock secs (14.06 usr + 0.58 sys = 14.64 CPU) Linear: 13 wallclock secs (12.44 usr + 0.61 sys = 13.05 CPU) Comments: I had hoped that this would be better, because it seemingly avoids appending the data to the sortkeys and then extracting via the substrs. But both of those operations take place in the O(n) part of the sort, not in the O(n log n). So perhaps I was expecting too much. But why is it *slower*??? I have to catch up now on the rest of the world, including my *real* job. But I wanted to get this off my chest (and onto yours? :-) ASAP. HNY!!! Larry -- Larry Rosler Hewlett-Packard Company http://www.hpl.hp.com/personal/Larry_Rosler/ lr@hpl.hp.com From: Larry Rosler To: "'John Porter'" Cc: uri@sysarch.com Subject: RE: Sorting article Date: Wed, 6 Jan 1999 13:27:16 -0800 Content-Type: text/plain Content-Length: 1096 > From: John Porter [mailto:jdporter@min.net] > Sent: Wednesday, January 06, 1999 13:02 > To: lr@hplb.hpl.hp.com > Cc: uri@sysarch.com > Subject: Re: Sorting article > > > Larry wrote: > > New @h{ sortkeys } = values > >... > > @h{ map pack('N A* A*' => $_->[0], $_->[1], $i++) => @a } = @a; > > my @out = @h{ sort keys %h }; > > This actually reminds me of a technique I first learned about from > Joseph Hall, see EPP, the bottom of p. 47, where he does something > similar with an array. He calls it a "mundane" way to sort. ! > > John I think you clipped the wrong piece from my letter. Hall is demonstrating the sort-the-indexes sort, but his sort function uses the 'mundane' key comparisons (from an array instead of a hash). I was trying to combine the index sort with the default sort. This is the right clip: sub indexed { my $i = 'aaa'; my %h; @h{ map pack('N A* A*' => /(\d+) (\w+)/, $i++) => @a } = 0 .. $#a; my @out = @a[ @h{ sort keys %h } ]; } The nesting of the hash slice inside of the array slice is not 'mundane', is it? Larry From: Larry Rosler To: "'Uri Guttman'" , jdporter@min.net Subject: RE: sort article outline Date: Thu, 7 Jan 1999 23:05:27 -0800 Content-Type: text/plain Content-Length: 1258 Hi. I'm still way behind on stuff, especially this work. But here is a thought on terminology. Rather than 'stringified sort' or 'linearized sort', how about 'lexicographic sort'? If that isn't definitive enough, consider 'pure lexicographic sort' or 'default lexicographic sort'. Which reminds me: Whoever now owns the draft of the paper (Uri?) -- make sure to include a note that all this requires that 'no locale;' be in effect, or the idea falls apart. This is from perllocale: no locale; print +(sort grep /\w/, map { chr() } 0..255), "\n"; This machine-native collation (which is what you get unless use locale has appeared earlier in the same block) must be used for sorting raw binary data, whereas the locale-dependent collation of the first example is useful for natural text. Note the 'sorting raw binary data' which is what we are talking about. I dislike 'stringified' or 'stringized' for several reasons, not the least of which is that verbification of adjectives isn't good English, IMO. 'Linearized' (also a verbification of an adjective, but with a better pedigree) may be a little off-beat semantically. I think 'lexicographic' comes from the C Standard -- I'll take a look tomorrow. Larry From: Larry Rosler To: "'Uri Guttman'" Subject: RE: sort article outline Date: Fri, 8 Jan 1999 10:36:21 -0800 Content-Type: text/plain; charset="iso-8859-1" Content-Length: 7713 > From: Uri Guttman [mailto:uri@sysarch.com] > Sent: Friday, January 08, 1999 10:07 > To: lr@hpl.hp.com > Cc: jdporter@min.net > Subject: Re: sort article outline > > larry, > > i seemed to have misplaced (lost) your comments to my outline. do you > have that letter and can you forward it to me? i suspect you keep all > you old email around. you seem that type. i delete most of mine. i will > be switching to vm on emacs which has folders and stuff and it might > improve that aspect of things. > > uri All it costs is disk space, now on someone else's server (because I have switched from running elm on my Unix box to using Microsoft Exchange, which the rest of my group uses). The letter is below, and I have not re-edited it. But note some things: the use of 'lexicographic' (as in my most recent suggeston) and the mention of 'index sorts' which have since received more discussion between me and John. Larry > From: Uri Guttman [mailto:uri@sysarch.com] > Sent: Tuesday, December 22, 1998 21:47 > To: jdporter@min.net; lr@hpl.hp.com > Subject: sort article outline > > hi guys, > > i felt productive tonight and i didn't have other work and news was slow > so i wrote an outline for the sort article. i like it so far and i think > it will make for a good article (or 2). Likely the latter, if it is thorough (as the outline seems to be). > so please shred it up and send your comments. when we agree on a > (semi-) final outline, then we can start fleshing out the paragraphs > with text and code and benchmarks. > > an open area is benchmarking and describing when to use each of the > speedup tricks. this will need some research and playing with code and > data. 'When to use' is what the summary is about. See below. > we should have a set of benchmark data we create to drive this so we can > all use it and cme up with useful benchmarks. having this stuff on a > website (maybe MJD could help us) and having others playing with our > stuff could be an intersting idea. > > otherwise, a belated happy hannukah to larry (we threw a sitdown dinner > party for 14 on sunday, the eigth night. we had 8 menorahs going!), and > happy holidays to you john. Nice image (8 menorahs * 9 candles each). And to you both. > uri > > > Outline for TPJ Sort Article > > (Note the use of caps here :-) > > I. Overview of sorting in the real world. Refs. Knuth Vol. 3, Chap. 5 FMTYEWTK Introduce O() concept and notation briefly > A. What is sorting > > B. Why is it needed Any program that produces tabular output is almost certain to need it! > C. Sort algorithms > > 1. bubble O(n**2) > 2. insertion O(n**2) > 3. tree O(n log n) > 4. shell O(n log n) > 5. quicksort O(n log n) > 6. others > > D. O(n) order function Is this later than needed, in order to characterize the sort algorithms mentioned above? > 1. What it means > > 2. Constant and linear overhead for pre- and post- > processing > > 3. Constant overhead per compare > > 4. Why reducing compare count is critical > > 5. Why amount of sort data matters > > II. Perl's sort function Implementation -- C qsort() with optional callbacks to a Perl function? Limitations of Perl internal sorting -- size of data set, w/o sort-merge. Usefulness of highly tuned system sort command or commercial sort packages (e.g., SyncSort) > A. Builtin compare is cmp and implementation is optimal (C or assembly memcmp function) > B. Custom compare code (in callback function) > 1. block of code > > 2. sub name > > 3. sub type glob > > 4. sub ref (to be done) > > C. Special vars $a and $b in compare > > III Ways to sort in Perl Ascending alphabetic (lexicographic) sort (default) > A. Numeric sort > > B. Reverse sort > > C. Multi-level sort Are these 'field' sorts? > D. Array index sort This is a derived-key sort. > E. Hash value sort This is a derived-key sort. > F. Complex compare code No single-valued sort key per datum. Beyond the scope of this article. > IV. Efficient sorting in Perl > > A. When efficiency is not warranted ^ an investment in improving > 1. Short data sets > > 2. Rarely executed program More generally, low cost compared to other parts of the program. Quote from Rob Pike about when to optimize (don't!). > B. Sort overhead > > 1. builtin vs. {$a cmp $b} Trivial sorts (key is trivially derived from datum): a. {$a <=> $b} or reverse b. {$hash{$a} cmp $hash{$b}} or numeric; or reverse c. {$a->[CONST] cmp $b->[CONST]} or numeric; or reverse > C. General ideas on how to speed up sort > > 1. Preprocessing data to achieve a trivial sort > 2. Simple and fast compare code (builtin if possible) Trivial sort. > 3. Postprocessing data from LOL (if needed) > > D. Speed up techniques > > 1. Schwartzian transform (from Randal Schwartz) > > a. Converts input to LOL, index sort, convert LOL to output index sort? This is Trivial Sort c (see above). > b. uses maps or temp arrays > > 2. Orcish (Or-cache) manoever (from Joseph Hall) > > a. Converts sort data in compare code > > b. Converts only once per unique sort data but checks every time to see if key derivation is needed. > c. Stores converted keys in private hash This is Trivial Sort b (see above). > d. No need for post processing, as sort is on real data Where is the discussion about creating the hash during a pre-processing pass, obviating the test on each comparand? > 3. Stringifying (need better name) (from many people) How about 'Linearized Sort'? > a. Preprocessing of input data to sortable strings 'string' == arbitrary sequence of bytes, stored in a Perl scalar variable (u.e., linear). > i. uses either pack or sprintf (or other perl code). > > ii. Converted key is combination of binary and ascii ^^ may be > iii. Original data is appended onto converted key See below -- cannot be references. > b. Uses builtin compare (fastest) > > c. Postprocess keys to get original data (use substr or unpack) Note that this is isomorphic to the Schwartz Transform map/sort/map paradigm. > d. Can handle integers (signed or unsigned), ascii strings, and ^^^^^ ??? see above > normal or reverse sorts on a per key basis. > > e. Can handle floats (IEEE format only) with code to deal with big ^ non-negative, until someone gets smarter! > vs. little endian. Cannot handle references! 4. Summary a. Examples IP addresses Fully-qualified domain names (right to left) ... b. When to use Criteria Benchmarks > > -- > Uri Guttman ----------------- SYStems ARCHitecture and Software Engineering > Perl Hacker for Hire ---------------------- Perl, Internet, UNIX Consulting > uri@sysarch.com ------------------------------------ http://www.sysarch.com The Best Search Engine on the Net ------------- http://www.northernlight.com Good start. Excelsior! I am having trouble minding my tongue about Rutka's obnoxious posts, but he won't disappear (and he now works for Lucent, my alma mater. Sigh...) I will be away from December 25 through January 3; real vacation (no electronic communication). Happy New Year!!! Larry -- Larry Rosler Hewlett-Packard Company http://www.hpl.hp.com/personal/Larry_Rosler/ lr@hpl.hp.com From: Larry Rosler To: "'Uri Guttman'" Cc: jdporter@min.net Subject: RE: sort article outline Date: Fri, 8 Jan 1999 12:32:49 -0800 Content-Type: text/plain Content-Length: 2465 > From: Uri Guttman [mailto:uri@sysarch.com] > Sent: Friday, January 08, 1999 09:59 > To: lr@hpl.hp.com > Cc: jdporter@min.net > Subject: Re: sort article outline > ... > LR> make sure to include a note that all this requires that 'no locale;' be > LR> in effect, or the idea falls apart. This is from perllocale: > > LR> > LR> no locale; > LR> print +(sort grep /\w/, map { chr() } 0..255), "\n"; > > LR> This machine-native collation (which is what you get unless use locale > LR> has appeared earlier in the same block) must be used for sorting raw > LR> binary data, whereas the locale-dependent collation of the first example > LR> is useful for natural text. > LR> > > LR> Note the 'sorting raw binary data' which is what we are talking about. > > is this due to signed/unsigned cmp issues? that 0 .. 255 might map to > 128 ..255, 0 .. 127 on some machines because of signed char compares? > this is another porting issue for packed sorts. not just endianess. No. Without looking at the perl sources, I venture to guess that it is because the 'no locale;' default sort uses 'memcmp' while the 'use locale;' default sort uses 'strcoll' which relies on LC_COLLATE. There might be a signedness problem if 'strncmp' were used instead of 'memcmp'. But I recall John saying that 'memcmp' was used (and it would *have* to be to conform to Perl's definition of a string, which is a sequence of arbitrary bytes whose length is specified by a separate integer, vs. C's definition of a string, which is a sequence of bytes terminated by a zero byte). In any case, I have yet to see any machine dependencies in my tests on PA-RISC or Intel architectures. Does anyone know of any target that uses signed-char compares that this could be verified against -- or just look into the perl source (which I haven't done yet) to make sure it uses 'memcmp'. > printable sorts (hey i like these terms) need to worry about > locale tho. Well, that is interesting. You distinguish between a 'packed' sort key (which *surely* demands 'no locale;') and a 'printable' ('stringized') sort key (which perhaps relies on 'use locale;'). Maybe that is a real distinction that should be understood and documented. A mixed sort key (packed number and a string, for example) would have to be sorted without locale, though the number could be stringized by 'sprintf "%10u"' in which case locale could be observed. Larry From: Larry Rosler To: "'Uri Guttman'" Cc: "'jdporter@min.net'" Subject: RE: sort article outline Date: Fri, 8 Jan 1999 12:59:24 -0800 Content-Type: text/plain Content-Length: 1266 ... > No. Without looking at the perl sources, I venture to guess > that it is because the 'no locale;' default sort uses > 'memcmp' while the 'use locale;' default sort uses 'strcoll' > which relies on LC_COLLATE. There might be a signedness > problem if 'strncmp' were used instead of 'memcmp'. But I > recall John saying that 'memcmp' was used (and it would > *have* to be to conform to Perl's definition of a string, > which is a sequence of arbitrary bytes whose length is > specified by a separate integer, vs. C's definition of a > string, which is a sequence of bytes terminated by a zero > byte). In any case, I have yet to see any machine > dependencies in my tests on PA-RISC or Intel architectures. > Does anyone know of any target that uses signed-char compares > that this could be verified against -- or just look into the > perl source (which I haven't done yet) to make sure it uses 'memcmp'. >From 'sv.c', the only comparison routine used is 'memcmp'. When locale is in effect, each comparand is passed once through strxfrm and the transformed value is saved for sort comparisons. (The mechanism must be like that which converts numbers to strings or v.v. once only.) So I think character-signedness is a non-issue. Larry From: Larry Rosler To: "'John Porter'" , uri@sysarch.com Subject: RE: sort article outline Date: Fri, 8 Jan 1999 14:29:13 -0800 Content-Type: text/plain Content-Length: 2430 > From: John Porter [mailto:jdporter@min.net] > Sent: Friday, January 08, 1999 11:03 > To: uri@sysarch.com > Cc: lr@hpl.hp.com > Subject: Re: sort article outline > > Uri wrote: > > JP> But it's also long-winded. Can't we simply talk about the > > JP> "default sort" (locale issues aside)? > > > > i don't like that because it doesn't describe what transforms are being > > done to get it to use the builtin cmp. just oo vague a term. > > Same applies to "default lexicographic sort". It's not really for > describing any of the transforms we're investigation, but for the > underlying sort (qsort/cmp) which we're exploiting. > > > what is important is the method of transforms that > > allows you to use the builtin cmp. so i feel the name > should reflect that. > > Right. But everything needs a name, even the "default sort". Here, in journalistic format, is a synopsis of where we stand. You may prefer 'Advanced' to 'Enhanced'. WHAT: The Enhanced Default Sort WHY: For optimal performance on large sorts (often with several sort fields), compare sort keys using built-in C functions rather than using Perl code in a sort subroutine. WHERE: Wherever the sort is based on comparing a single-valued mapping function of each value to be sorted. (This is true for all enhanced sorts -- ST, Orcish, ...) HOW: For each value to be sorted, create a sort key that is a Perl string (a sequence of bytes) corresponding lexicographically to the position of the value in the sorted output; append the value to the sort key so that it can readily be retrieved. The sort key can be created using any combination of 'pack', 'sprintf' or concatenation. Each field to be sorted may comprise a signed or unsigned integer, a floating-point number, or a string of characters. (Special handling is required if the values to be sorted are references -- LoL or LoH.) WHEN: Compute the keys in a single pass through the list of values (map). Sort. Retrieve the values from the keys (map). Other minor issues: As stated, this method sorts two values whose keys compare equal according to the values themselves (which is the usual fall-through condition for a fielded sort). A 'stable' sort can be forced (using any sorting method) by adding a sequential index as the last sort key. GO NINERS!!! (Bye-bye, Pats :-() Larry i have to write these down now or i may forget them. a sort module api could look like this: $sort_obj = Sort->new( [ sort keys desriptions ], gloabl sort options sort key description => { 'type' => integer, string, float, etc. 'down' => yes 'code_to_get_key_value' => code ref OR code text. } $sort_obj->values( list of values or refs) can be called again to add more to list. $sort_obj->sort returns sorted data if the option is to build a preprocess sub, we take the code texts and create the text for it. then it will run fast, an optimization for large sorts. otherwise, we use the code ref which takes a ref (and key number?) for an argument and we call it to return the nth key value. from this info we create a pack format string. any inversions or endian things are handled here in the preprocess code (either version) for each sort value we build up an array of key values. then we just do a pack and push its value. we also generate an unpack format (not same as pack format but related) to get the index and/or the original values to return. the pack formats, refs to processing code and other stuff is stored in the sort object for reuse. we can even have special case packs for common things like ip addresses so we can optimize even further there. ip can be designated as text or binary. well, what do you think? uri