[urisagit/Sort-Maker.git] / paper / outline_notes

Return-Path: <jdporter@min.net>
From: John Porter <jdporter@min.net>
Subject: Re: sort article outline
To: uri@sysarch.com
Date: Mon, 28 Dec 1998 11:46:32 -0500 (EST)
Cc: lr@hpl.hp.com
In-Reply-To: <199812280441.XAA11943@home.sysarch.com.> from "Uri Guttman" at Dec 27, 98 11:41:54 pm
Content-Type: text/plain; charset=US-ASCII
Content-Length: 1680

>   JP> Still linear.  Maybe the constants are painfully large, but we're
>   JP> still looking at O(n).
> 
> i think i see why we don't agree here. i am talking about the linear set
> of bytes and you are talking about liner O(n) for the work. is that it?

Yep.  


>   >> hey, it's english. 
> 
>   JP> Marginally.
> 
> ??? marginal quality english or marginally in english?

The latter.  My point is that wrt computer jargon, esp. Perl jargon,
english is not much of a standard.  I.e. what "string" means in
english has only marginal bearing on what it means in Perl.


> different names. each has advantages. the ascii version can use short
> integer strings (<4 bytes) and is always endian independent. you could
> pack into a byte or short if you want to. binary can
> be more compact for some data formats (floats in particular) but endian
> issues come up and have to be dealt with.
> 
> you can cover the concept as one but the conversion process is different
> as one uses sprintf and one uses pack. though the sprintf could use pack
> is some ways. the sprintf is used to zero pad integers and print floats
> in a sortable format. pack is used to store the integers in a byte order
> that cmp can sort.

O.k, I think I see what you're getting at.  "Stringify" (in Perl) has
always refered to the conversion which occurs transparently when a
non-string SV gets used as a string.  And more generally, the same
effect can be achieved by sprintf.  The result is a string which, when
printed, represents the non-string value of the SV (be it integer, float,
reference, whatever). 

So maybe the right way to describe a string which is produced by pack
is "packed".

John


Return-Path: <jdporter@min.net>
From: John Porter <jdporter@min.net>
Subject: Re: sort article outline
To: uri@sysarch.com
Date: Sun, 27 Dec 1998 18:15:36 -0500 (EST)
In-Reply-To: <199812262117.QAA09392@sysarch.com.> from "Uri Guttman" at Dec 26, 98 04:17:34 pm
Content-Type: text/plain; charset=US-ASCII
Content-Length: 3189


Uri wrote:
> not if you are packing integers and floats with optional endian
> reversals. the whole concept is to convert various data types to one
> concatenated set of keys sortable by cmp. cmp can't handle ints and
> floats normally. larry showed how to handle signed ints and floats with
> endian issues covered. so there is a conversion to a linear set of bytes
> sortable by cmp. the original "disparate set of keys" has been mapped to
> a single sortable key. 

Still linear.  Maybe the constants are painfully large, but we're still
looking at O(n).


>   JP> Why should we pander to the lcd?
> 
> i don't think it is pandering at all. all i think is one form is a
> subset of the other form and should have a different name.

Well clearly, printable characters are a subset (variously defined)
of the set of octets.  My point is that Perl doesn't care, and 
Perl programmers shouldn't care.  In Perl terminology, a string is
the octet buffer part of a SV.  Printability is irrelevant.


> there are two issues we are arguing
> here, whether there should be two names and what those names are.

Yep.

>   JP> This is where you and I fundamentally disagree, I guess.
> 
> hey, it's english. 

Marginally.

>   JP> Nothing wrong with that.  The raw printability of the characters
>   JP> on any given terminal is irrelevant, imho.
> 
> i disagree. some may choose to use the printable form for that very
> attribute. sometime the converted keys may be kept around and used for
> other long term purposes where printability may be a needed attribute.

That's why this sorting issue is so open-ended.  We can't lay down
some hard-and-fast technique (like the ST).  If the printability is
critical, then the programmer has to work within that constraint.
But if the performance of the sort is of primary concern, then she
may have to sacrifice the intrinsic printability of the comparison
keys; then they can do a transformation back into printable form,
or cache the printable form, or whatever.


> by having 2 names for 2 subtly different packing methods i think
> that helps them decide which to choose.

I'm all for educating the programmer on the issues and choices.


>   JP> I think this is a bad misuse of the term "linear".
>   JP> Maybe you could come up with some other term that expresses
>   JP> that notion.
> 
> well, i agree the term may not be the best, but do you agree on the
> difference between the methods, as i hve described above?

Clearly not.


>   JP> You and I don't see eye-to-eye on this.
> 
> maybe my argument above will convince you more. or i can get your
> mother-in-law to agree with me and then you will be sorry! 

Um, unfortunately, you have only clarified your position.
You have not added any persuasive content.


>   JP> I remember he said something like that about the ST specifically;
>   JP> don't remember anything about Perl generally...
> 
> maybe my memory was faulty but he was so negative on others that i
> remember a slam on perl in some letters. it would be tricky to scan him
> on dejanews for just that.

I'm not saying I don't believe he said such things; I just don't
remember them.  Knowing Rutka, he probably did!

John


Return-Path: <jdporter@min.net>
From: John Porter <jdporter@min.net>
Subject: Re: sort article outline
To: uri@sysarch.com
Date: Sat, 26 Dec 1998 11:30:34 -0500 (EST)
Cc: lr@hpl.hp.com
In-Reply-To: <199812260558.AAA08301@sysarch.com.> from "Uri Guttman" at Dec 26, 98 00:58:39 am
Content-Type: text/plain; charset=US-ASCII
Content-Length: 2668


Uri wrote:
> >>>>> "JP" == John Porter <jdporter@min.net> writes:
> 
>   >> also if you
>   >> put integers in the perl byte string, then it is not stringifying which
>   >> slightly implies ascii.
> 
>   JP> Not in Perl, imho.
>   JP> I think cmp implies strings, regardless of the character value range.
>   JP> We're talking about comparing arrays of byte-sized "characters",
>   JP> i.e. strings.
> 
>   >> so linearization is a better overall term,
> 
>   JP> I find your argument, such as it is, unconvincing.
>   JP> In what way has the process made anything linear?
>   JP> That's what "linearization" means, after all.
> 
> you are taking a disparate set of keys and making them into a single key
> of a linear set of bytes.

Even so, the "disparate set of keys" is already linear, wrt 
the sorting algorithm.  At least if it's done right, which
presumably it is.


> i understand that perl is 8 bit clean regarding strings and cmp. and cmp
> is an unsigned compare (IIR). but i (and some others) feel that the word
> string connotes some printable attributes. 

Anyone harboring that notion should be edified.  Why should we
pander to the lcd?


>   >> (ascii really means printing chars, and probably just digits and
>   >> us-ascii unless you use locale and it works for extendied ascii).
> 
>   JP> Um, imho, all this stuff about ascii is irrelevant.
> 
> well how would you differentiate converting a key to printable ascii
> vs. integer bytes?

The important question is, *why* would you?
I don't think we should, in this case at least.


> stringifying implies printable. 

This is where you and I fundamentally disagree, I guess.

> but if you create a 8 bit set of bytes that would be something
> more complex that won't print (without going to hex or something).

Nothing wrong with that.  The raw printability of the characters
on any given terminal is irrelevant, imho.


> that
> is linearization since a set of integers and floats, are now a linear
> set of sortable bytes, but not a printable string. 

I think this is a bad misuse of the term "linear".
Maybe you could come up with some other term that expresses
that notion.


> this is fine edge semantics, but i do like having two terms for the
> printable and non-printable converted keys. maybe linearization is not a
> great term but i want to reserve stringifying for printable only.

Whatever.  You and I don't see eye-to-eye on this.


> i recall [Rutka] had some early misstatements about not liking perl
> (or we interpreted his poor wording to mean that).

I remember he said something like that about the ST specifically;
don't remember anything about Perl generally...

John


Return-Path: <jdporter@min.net>
From: John Porter <jdporter@min.net>
Subject: Sorting article
To: uri@sysarch.com
Date: Sat, 2 Jan 1999 23:45:47 -0500 (EST)
Cc: lr@hpl.hp.com
In-Reply-To: <199901021831.NAA12831@home.sysarch.com.> from "Uri Guttman" at Jan 2, 99 01:31:45 pm
Content-Type: text/plain; charset=US-ASCII
Content-Length: 1354


> John wrote:
> > Uri wrote:
> > > in both cases, the original data is usually concatenated onto the end of
> > > the created sort key and then extracted (usually with substr) after the
> > > sort is done.  neither string sort can handle perl references directly,
> > > though the referenced data can be used to create a single sort key and
> > > then ST or OM can then be used for the actual sort.
> > 
> > I don't think I'd say "usually".  You should say "possibly", with
> > a description of the situations that allow it.
> 
> fine. what situations exist where you would not need to concatenate the
> original data onto the key? 

I'm not talking not needing to do it, I'm talking about not being able
to do it.  If you're sorting a list of "objects" e.g., you can't simply
concat a reference onto the end of a string.  However, as you pointed out,
one could simply concat an index for the object; so now I can't think of
any situations where the pack technique can't be used.

The slice of the pie I'd like to tackle is addressing certain
issues, namely:
1. Comparison keys are unique, slightly non-unique, or highly non-unique;
2. Cost of deriving the comparison key from each datum is nil, small,
   or large.
According to my benchmarks, these had a significant influence on the
relative performance of various techniques (ST, OM, etc.).

John


Return-Path: <jdporter@min.net>
From: John Porter <jdporter@min.net>
Subject: Re: Sorting article
To: uri@sysarch.com
Date: Mon, 4 Jan 1999 08:59:18 -0500 (EST)
Cc: lr@hpl.hp.com
In-Reply-To: <199901030455.XAA13708@home.sysarch.com.> from "Uri Guttman" at Jan 2, 99 11:55:35 pm
Content-Type: text/plain; charset=US-ASCII
Content-Length: 1306

Uri wrote:
> good. i like the point you make about the index vs. the reference. this
> could be a sort of combination of swchartzian and stringify (either
> flavor). if you have to sort a list of refs (or complex stuff),
> preprocess into a array with a single ref to the real data, make a list
> of stringified keys and append the index to the key and sort with
> builtin cmp. then extract out the indexes and original ref(s). the
> schwartzian part is doing it with maps.

The similary to ST had occured to me.
	map { (split("\0",$_,2))[0] }
	sort
	map { join "\0", $_->cmpkey(), $_ }
	@a

But this is as close as it gets.  The other "advantage" of ST -- 
multiple key comparisons -- is obviated by our technique.


> tho i beleive temp arrays and
> foreach modifiers can be or are faster than maps. we have seen cases
> where multiple maps ar slow. i think it is because it has to recopy all
> the values for each map. a foreach on a temp array would use the same
> array without copying.

I thought the slow part was all those anonymous arrays which are
created in the first map.  map itself doesn't allocate an AV, it makes
a list.  Same with sort.  So if you're sorting a list of 1000 items,
you have three lists created (map, sort, map), but 1000 anonymous AV's
allocated in the first map.

John


Return-Path: <lr@hpl.hp.com>
From: Larry Rosler <lr@hpl.hp.com>
To: "'Uri Guttman'" <uri@sysarch.com>, jdporter@min.net
Subject: RE: Sorting article
Date: Mon, 4 Jan 1999 12:55:54 -0800 
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Length: 4272

> From: Uri Guttman [mailto:uri@sysarch.com]
> Sent: Monday, January 04, 1999 08:23
> To: jdporter@min.net
> Cc: lr@hpl.hp.com
> Subject: Re: Sorting article
...
> P.S. larry has some catchup reading and replying to do now that he
> should be back from vacation.

Yes, I am back.  We drove 7 hours yesterday from Southern California,
much of it through some heavy fog in the Central Valley, so I am still a
little bleary.  And I have tried to plow through over 1200 articles in
c.l.p.misc, with so far only two conclusions:

1.  The group got along fine without me (something to think about for
the future :-).

2.  Uri seems to have had a couple of really bad days through the past
week or so.  :-(

I haven't had a chance to read all of the correspondence between you two
yet, but I did skim a couple, and noticed the mention of sorting
indexes.  I "solved" the problem of using the default sort on references
while on vacation, *literally* while I was sleeping (i.e., I woke up and
wrote it down!).  I then generalized it some, and thought it would be a
world-beater speedwise.  But my first measurements are somewhat
disappointing.

Here is what I have so far; I'm sharing it before doing much else
because your comments and observations would be most useful.  It is
actually a different sorting paradigm, as you will see:

EVOLUTION OF EFFICIENT PERL SORTS

Naive          { sortkey(value_a) cmp sortkey(value_b) }
Orcish         $h{ value } ||= sortkey  OR  @h{ values } = sortkeys
Schwartz       [ value, sortkey, ... ]
Linearized     sortkey . value
New            @h{ sortkeys } = values


#!/usr/local/bin/perl -w
use Benchmark;

my $n = 10000;
my @a = map { [ int(rand 1000), 'foo' . int(rand 1000) ] } 1 .. $n;

sub new {
    my $i = 'aaa';
    my %h;
    @h{ map pack('N A* A*' => $_->[0], $_->[1], $i++) => @a } = @a;
    my @out = @h{ sort keys %h };
}

sub usual {
    my @out = sort { $a->[0] <=> $b->[0] || $a->[1] cmp $b->[1] } @a;
}

timethese(1 << (shift || 0), {
    New   => sub { &new   },
    Usual => sub { &usual },
});
__END__

Benchmark: timing 16 iterations of New, Usual...
       New: 10 wallclock secs ( 9.56 usr +  0.14 sys =  9.70 CPU)
     Usual: 13 wallclock secs (12.92 usr +  0.00 sys = 12.92 CPU)

Comments:

$i serves two functions:  insuring the uniqueness of the sort key (which
is saved as the key to a hash, so must be unique);  preserving the
stability of the sort (entries that sort equal preserve their order in
the output array).

Do you like those two hash slices, Uri?  Take a look at the next
attempt, which adds an array slice.

In the above case, storing the references themselves as values in the
hash is efficient.  But I was concerned about generalizing this sorting
paradigm to values that were not references, but might be arbitrarily
long strings.  Before I left on vacation, someone had posted about
perhaps sorting references to the strings, or indexes into the array.
So I tried the following:

#!/usr/local/bin/perl -w
use Benchmark;

my $n = 10000;
my @a = map { sprintf "XXX %d %s %s", int(rand 1000), 'foo' . int(rand
1000),
	'x' x 100 . "\n" } 1 .. $n;

sub indexed {
    my $i = 'aaa';
    my %h;
    @h{ map pack('N A* A*' => /(\d+) (\w+)/, $i++) => @a } = 0 .. $#a;
    my @out = @a[ @h{ sort keys %h } ];
}

sub linear  {
    my @out = map { substr($_, 1 + rindex $_, "\0") } sort
	map pack('N A* x A*' => /(\d+) (\w+)/, $_) => @a;
}

timethese(1 << (shift || 0), {
    Indexed => sub { &indexed },
    Linear  => sub { &linear  },
});
__END__

Benchmark: timing 16 iterations of Indexed, Linear...
   Indexed: 15 wallclock secs (14.06 usr +  0.58 sys = 14.64 CPU)
    Linear: 13 wallclock secs (12.44 usr +  0.61 sys = 13.05 CPU)

Comments:

I had hoped that this would be better, because it seemingly avoids
appending the data to the sortkeys and then extracting via the substrs.
But both of those operations take place in the O(n) part of the sort,
not in the O(n log n).  So perhaps I was expecting too much.  But why is
it *slower*???

I have to catch up now on the rest of the world, including my *real*
job.  But I wanted to get this off my chest (and onto yours? :-) ASAP.

HNY!!!

Larry

-- 
Larry Rosler
Hewlett-Packard Company
http://www.hpl.hp.com/personal/Larry_Rosler/
lr@hpl.hp.com


From: Larry Rosler <lr@hpl.hp.com>
To: "'John Porter'" <jdporter@min.net>
Cc: uri@sysarch.com
Subject: RE: Sorting article
Date: Wed, 6 Jan 1999 13:27:16 -0800 
Content-Type: text/plain
Content-Length: 1096

> From: John Porter [mailto:jdporter@min.net]
> Sent: Wednesday, January 06, 1999 13:02
> To: lr@hplb.hpl.hp.com
> Cc: uri@sysarch.com
> Subject: Re: Sorting article
> 
> 
> Larry wrote:
> > New            @h{ sortkeys } = values
> >...
> >     @h{ map pack('N A* A*' => $_->[0], $_->[1], $i++) => @a } = @a;
> >     my @out = @h{ sort keys %h };
> 
> This actually reminds me of a technique I first learned about from
> Joseph Hall, see EPP, the bottom of p. 47, where he does something
> similar with an array.  He calls it a "mundane" way to sort.  !
> 
> John

I think you clipped the wrong piece from my letter.  Hall is
demonstrating the sort-the-indexes sort, but his sort function uses the
'mundane' key comparisons (from an array instead of a hash).  I was
trying to combine the index sort with the default sort.  This is the
right clip:

sub indexed {
    my $i = 'aaa';
    my %h;
    @h{ map pack('N A* A*' => /(\d+) (\w+)/, $i++) => @a } = 0 .. $#a;
    my @out = @a[ @h{ sort keys %h } ];
}

The nesting of the hash slice inside of the array slice is not
'mundane', is it?

Larry


From: Larry Rosler <lr@hpl.hp.com>
To: "'Uri Guttman'" <uri@sysarch.com>, jdporter@min.net
Subject: RE: sort article outline
Date: Thu, 7 Jan 1999 23:05:27 -0800 
Content-Type: text/plain
Content-Length: 1258

Hi.  I'm still way behind on stuff, especially this work.  But here is a
thought on terminology.

Rather than 'stringified sort' or 'linearized sort', how about
'lexicographic sort'?  If that isn't definitive enough, consider 'pure
lexicographic sort' or 'default lexicographic sort'.

Which reminds me:  Whoever now owns the draft of the paper (Uri?) --
make sure to include a note that all this requires that 'no locale;' be
in effect, or the idea falls apart.  This is from perllocale:

<QUOTE>
        no locale;
        print +(sort grep /\w/, map { chr() } 0..255), "\n";

This machine-native collation (which is what you get unless use locale
has appeared earlier in the same block) must be used for sorting raw
binary data, whereas the locale-dependent collation of the first example
is useful for natural text.
</QUOTE>

Note the 'sorting raw binary data' which is what we are talking about.

I dislike 'stringified' or 'stringized' for several reasons, not the
least of which is that verbification of adjectives isn't good English,
IMO.  'Linearized' (also a verbification of an adjective, but with a
better pedigree) may be a little off-beat semantically.  I think
'lexicographic' comes from the C Standard -- I'll take a look tomorrow. 

Larry


From: Larry Rosler <lr@hpl.hp.com>
To: "'Uri Guttman'" <uri@sysarch.com>
Subject: RE: sort article outline
Date: Fri, 8 Jan 1999 10:36:21 -0800 
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Length: 7713

> From: Uri Guttman [mailto:uri@sysarch.com]
> Sent: Friday, January 08, 1999 10:07
> To: lr@hpl.hp.com
> Cc: jdporter@min.net
> Subject: Re: sort article outline
> 
> larry,
> 
> i seemed to have misplaced (lost) your comments to my outline. do you
> have that letter and can you forward it to me? i suspect you keep all
> you old email around. you seem that type. i delete most of mine. i
will
> be switching to vm on emacs which has folders and stuff and it might
> improve that aspect of things.
> 
> uri

All it costs is disk space, now on someone else's server (because I have
switched from running elm on my Unix box to using Microsoft Exchange,
which the rest of my group uses).

The letter is below, and I have not re-edited it.  But note some things:
the use of 'lexicographic' (as in my most recent suggeston) and the
mention of 'index sorts' which have since received more discussion
between me and John.

Larry

> From: Uri Guttman [mailto:uri@sysarch.com]
> Sent: Tuesday, December 22, 1998 21:47
> To: jdporter@min.net; lr@hpl.hp.com
> Subject: sort article outline
> 
> hi guys,
> 
> i felt productive tonight and i didn't have other work and news was
slow
> so i wrote an outline for the sort article. i like it so far and i
think
> it will make for a good article (or 2).

Likely the latter, if it is thorough (as the outline seems to be).
 
> so please shred it up and send your comments. when we agree on a
> (semi-) final outline, then we can start fleshing out the paragraphs
> with text and code and benchmarks.
> 
> an open area is benchmarking and describing when to use each of the
> speedup tricks. this will need some research and playing with code and
> data.

'When to use' is what the summary is about.  See below.

> we should have a set of benchmark data we create to drive this so we
can
> all use it and cme up with useful benchmarks. having this stuff on a
> website (maybe MJD could help us) and having others playing with our
> stuff could be an intersting idea.
> 
> otherwise, a belated happy hannukah to larry (we threw a sitdown
dinner
> party for 14 on sunday, the eigth night. we had 8 menorahs going!),
and
> happy holidays to you john.

Nice image (8 menorahs * 9 candles each).  And to you both.
 
> uri
> 
> 
> 	Outline for TPJ Sort Article
> 
> (Note the use of caps here :-)
> 
> I. Overview of sorting in the real world.

Refs.  Knuth Vol. 3, Chap. 5
FMTYEWTK

Introduce O() concept and notation briefly

>   A. What is sorting
> 
>   B. Why is it needed

Any program that produces tabular output is almost certain to need it!
 
>   C. Sort algorithms
> 
>     1. bubble

O(n**2)

>     2. insertion

O(n**2)

>     3. tree

O(n log n)

>     4. shell

O(n log n)

>     5. quicksort

O(n log n)

>     6. others
> 
>   D. O(n) order function

Is this later than needed, in order to characterize the sort 
algorithms mentioned above?

>      1. What it means
> 
>      2. Constant and linear overhead for pre- and post-
>         processing
> 
>      3. Constant overhead per compare 
> 
>      4. Why reducing compare count is critical
> 
>      5. Why amount of sort data matters
> 
> II. Perl's sort function

Implementation -- C qsort() with optional callbacks to a Perl function?
Limitations of Perl internal sorting -- size of data set, w/o
sort-merge.
Usefulness of highly tuned system sort command or commercial 
sort packages (e.g., SyncSort)
 
>   A. Builtin compare is cmp

and implementation is optimal (C or assembly memcmp function)

>   B. Custom compare code

(in callback function)

> 	1. block of code
> 
> 	2. sub name
> 
> 	3. sub type glob
> 
> 	4. sub ref (to be done)
> 
>   C. Special vars $a and $b in compare
> 
> III Ways to sort in Perl

Ascending alphabetic (lexicographic) sort (default)
 
>   A. Numeric sort
> 
>   B. Reverse sort
> 
>   C. Multi-level sort

Are these 'field' sorts?

>   D. Array index sort

This is a derived-key sort.
 
>   E. Hash value sort

This is a derived-key sort.
 
>   F. Complex compare code

No single-valued sort key per datum.  Beyond the scope of 
this article.

> IV. Efficient sorting in Perl
> 
>   A. When efficiency is not warranted
           ^ an investment in improving
  
>      1. Short data sets
> 
>      2. Rarely executed program

More generally, low cost compared to other parts of the 
program.  Quote from Rob Pike about when to optimize (don't!).
 
>   B. Sort overhead
> 
>     1. builtin vs. {$a cmp $b}

Trivial sorts (key is trivially derived from datum):
a.  {$a <=> $b} or reverse
b.  {$hash{$a} cmp $hash{$b}} or numeric; or reverse
c.  {$a->[CONST] cmp $b->[CONST]} or numeric; or reverse
 
>   C. General ideas on how to speed up sort
> 
>     1. Preprocessing data
to achieve a trivial sort
 
>     2. Simple and fast compare code (builtin if possible)

Trivial sort.

>     3. Postprocessing data from LOL (if needed)
> 
>   D. Speed up techniques
> 
>     1. Schwartzian transform (from Randal Schwartz)
> 
>       a. Converts input to LOL, index sort, convert LOL to output

index sort?  This is Trivial Sort c (see above).

>       b. uses maps or temp arrays
> 
>     2. Orcish (Or-cache) manoever (from Joseph Hall)
> 
>       a. Converts sort data in compare code
> 
>       b. Converts only once per unique sort data

but checks every time to see if key derivation is needed.

>       c. Stores converted keys in private hash

This is Trivial Sort b (see above).

>       d. No need for post processing, as sort is on real data

Where is the discussion about creating the hash during a 
pre-processing pass, obviating the test on each comparand?
 
>     3. Stringifying (need better name) (from many people)

How about 'Linearized Sort'?

>       a. Preprocessing of input data to sortable strings

'string' == arbitrary sequence of bytes, stored in a Perl 
scalar variable (u.e., linear).
 
> 	 i. uses either pack or sprintf (or other perl code).
> 
>          ii. Converted key is combination of binary and ascii
                             ^^  may be
 
>          iii. Original data is appended onto converted key

See below -- cannot be references.
 
>       b. Uses builtin compare (fastest)
> 
>       c. Postprocess keys to get original data (use substr 
or unpack)

Note that this is isomorphic to the Schwartz Transform 
map/sort/map paradigm.
 
>       d. Can handle integers (signed or unsigned), ascii 
strings, and
                                                     ^^^^^ 
??? see above
>          normal or reverse sorts on a per key basis.
> 
>       e. Can handle floats (IEEE format only) with code to 
deal with big
                     ^ non-negative, until someone gets smarter!
>          vs. little endian.

Cannot handle references!

       4.  Summary
  
         a.  Examples
              IP addresses
              Fully-qualified domain names (right to left)
              ...

         b. When to use
              Criteria
              Benchmarks

> 
> -- 
> Uri Guttman  -----------------  SYStems ARCHitecture and 
Software Engineering
> Perl Hacker for Hire  ----------------------  Perl, 
Internet, UNIX Consulting
> uri@sysarch.com  ------------------------------------  
http://www.sysarch.com
The Best Search Engine on the Net -------------
http://www.northernlight.com

Good start.  Excelsior!

I am having trouble minding my tongue about Rutka's obnoxious posts, but
he won't disappear (and he now works for Lucent, my alma mater.
Sigh...)

I will be away from December 25 through January 3; real vacation (no
electronic communication).

Happy New Year!!!

Larry

-- 
Larry Rosler
Hewlett-Packard Company
http://www.hpl.hp.com/personal/Larry_Rosler/
lr@hpl.hp.com


From: Larry Rosler <lr@hpl.hp.com>
To: "'Uri Guttman'" <uri@sysarch.com>
Cc: jdporter@min.net
Subject: RE: sort article outline
Date: Fri, 8 Jan 1999 12:32:49 -0800 
Content-Type: text/plain
Content-Length: 2465

> From: Uri Guttman [mailto:uri@sysarch.com]
> Sent: Friday, January 08, 1999 09:59
> To: lr@hpl.hp.com
> Cc: jdporter@min.net
> Subject: Re: sort article outline
> 
...
>   LR> make sure to include a note that all this requires that 'no
locale;' be
>   LR> in effect, or the idea falls apart.  This is from perllocale:
> 
>   LR> <QUOTE>
>   LR>         no locale;
>   LR>         print +(sort grep /\w/, map { chr() } 0..255), "\n";
> 
>   LR> This machine-native collation (which is what you get unless use
locale
>   LR> has appeared earlier in the same block) must be used for sorting
raw
>   LR> binary data, whereas the locale-dependent collation of the first
example
>   LR> is useful for natural text.
>   LR> </QUOTE>
> 
>   LR> Note the 'sorting raw binary data' which is what we are talking
about.
> 
> is this due to signed/unsigned cmp issues? that 0 .. 255 might map to
> 128 ..255, 0 .. 127 on some machines because of signed char compares?
> this is another porting issue for packed sorts. not just endianess.

No.  Without looking at the perl sources, I venture to guess that it is
because the 'no locale;' default sort uses 'memcmp' while the 'use
locale;' default sort uses 'strcoll' which relies on LC_COLLATE.  There
might be a signedness problem if 'strncmp' were used instead of
'memcmp'.  But I recall John saying that 'memcmp' was used (and it would
*have* to be to conform to Perl's definition of a string, which is a
sequence of arbitrary bytes whose length is specified by a separate
integer, vs. C's definition of a string, which is a sequence of bytes
terminated by a zero byte).  In any case, I have yet to see any machine
dependencies in my tests on PA-RISC or Intel architectures.  Does anyone
know of any target that uses signed-char compares that this could be
verified against -- or just look into the perl source (which I haven't
done yet) to make sure it uses 'memcmp'.

> printable sorts (hey i like these terms) need to worry about 
> locale tho.

Well, that is interesting.  You distinguish between a 'packed' sort key
(which *surely* demands 'no locale;') and a 'printable' ('stringized')
sort key (which perhaps relies on 'use locale;').  Maybe that is a real
distinction that should be understood and documented.  A mixed sort key
(packed number and a string, for example) would have to be sorted
without locale, though the number could be stringized by 'sprintf
"%10u"' in which case locale could be observed.

Larry


From: Larry Rosler <lr@hpl.hp.com>
To: "'Uri Guttman'" <uri@sysarch.com>
Cc: "'jdporter@min.net'" <jdporter@min.net>
Subject: RE: sort article outline
Date: Fri, 8 Jan 1999 12:59:24 -0800 
Content-Type: text/plain
Content-Length: 1266

...
> No.  Without looking at the perl sources, I venture to guess 
> that it is because the 'no locale;' default sort uses 
> 'memcmp' while the 'use locale;' default sort uses 'strcoll' 
> which relies on LC_COLLATE.  There might be a signedness 
> problem if 'strncmp' were used instead of 'memcmp'.  But I 
> recall John saying that 'memcmp' was used (and it would 
> *have* to be to conform to Perl's definition of a string, 
> which is a sequence of arbitrary bytes whose length is 
> specified by a separate integer, vs. C's definition of a 
> string, which is a sequence of bytes terminated by a zero 
> byte).  In any case, I have yet to see any machine 
> dependencies in my tests on PA-RISC or Intel architectures.  
> Does anyone know of any target that uses signed-char compares 
> that this could be verified against -- or just look into the 
> perl source (which I haven't done yet) to make sure it uses 'memcmp'.

>From 'sv.c', the only comparison routine used is 'memcmp'.  When locale
is in effect, each comparand is passed once through strxfrm and the
transformed value is saved for sort comparisons.  (The mechanism must be
like that which converts numbers to strings or v.v. once only.)

So I think character-signedness is a non-issue.

Larry


From: Larry Rosler <lr@hpl.hp.com>
To: "'John Porter'" <jdporter@min.net>, uri@sysarch.com
Subject: RE: sort article outline
Date: Fri, 8 Jan 1999 14:29:13 -0800 
Content-Type: text/plain
Content-Length: 2430

> From: John Porter [mailto:jdporter@min.net]
> Sent: Friday, January 08, 1999 11:03
> To: uri@sysarch.com
> Cc: lr@hpl.hp.com
> Subject: Re: sort article outline
> 
> Uri wrote:
> >   JP> But it's also long-winded.  Can't we simply talk about the
> >   JP> "default sort" (locale issues aside)?
> > 
> > i don't like that because it doesn't describe what transforms are
being
> > done to get it to use the builtin cmp. just oo vague a term. 
> 
> Same applies to "default lexicographic sort".  It's not really for
> describing any of the transforms we're investigation, but for the
> underlying sort (qsort/cmp) which we're exploiting.
> 
> > what is important is the method of transforms that
> > allows you to use the builtin cmp. so i feel the name 
> should reflect that.
> 
> Right. But everything needs a name, even the "default sort".

Here, in journalistic format, is a synopsis of where we stand.  You may
prefer 'Advanced' to 'Enhanced'.

WHAT:  The Enhanced Default Sort

WHY:   For optimal performance on large sorts (often with several sort
       fields), compare sort keys using built-in C functions rather than
       using Perl code in a sort subroutine.

WHERE: Wherever the sort is based on comparing a single-valued mapping
       function of each value to be sorted.  (This is true for all
       enhanced sorts -- ST, Orcish, ...)

HOW:   For each value to be sorted, create a sort key that is a Perl
       string (a sequence of bytes) corresponding lexicographically to
the
       position of the value in the sorted output; append the value to
the
       sort key so that it can readily be retrieved.  The sort key can
be
       created using any combination of 'pack', 'sprintf' or
concatenation.
       Each field to be sorted may comprise a signed or unsigned
integer,
       a floating-point number, or a string of characters.  (Special
       handling is required if the values to be sorted are references --
       LoL or LoH.)

WHEN:  Compute the keys in a single pass through the list of values
(map).
       Sort.  Retrieve the values from the keys (map).

Other minor issues:  As stated, this method sorts two values whose keys
compare equal according to the values themselves (which is the usual
fall-through condition for a fielded sort).  A 'stable' sort can be
forced (using any sorting method) by adding a sequential index as the
last sort key.

GO NINERS!!!  (Bye-bye, Pats :-()

Larry


i have to write these down now or i may forget them.

a sort module api could look like this:

$sort_obj = Sort->new( [ sort keys desriptions ], gloabl sort options

sort key description => {

		'type'	=> integer, string, float, etc.

		'down'	=> yes

		'code_to_get_key_value' => code ref OR code text.

}

$sort_obj->values( list of values or refs)

can be called again to add more to list.

$sort_obj->sort			returns sorted data


if the option is to build a preprocess sub, we take the code texts and
create the text for it. then it will run fast, an optimization for large
sorts. otherwise, we use the code ref which takes a ref (and key
number?) for an argument and we call it to return the nth key value.

from this info we create a pack format string.

any inversions or endian things are handled here in the preprocess code
(either version)

for each sort value we build up an array of key values.  then we just
do a pack and push its value.

we also generate an unpack format (not same as pack format but related)
to get the index and/or the original values to return.

the pack formats, refs to processing code and other stuff is stored in
the sort object for reuse.

we can even have special case packs for common things like ip addresses
so we can optimize even further there. ip can be designated as text or
binary.

well, what do you think?

uri