paper/outline_notes

   1 Return-Path: <jdporter@min.net>
   2 From: John Porter <jdporter@min.net>
   3 Subject: Re: sort article outline
   4 To: uri@sysarch.com
   5 Date: Mon, 28 Dec 1998 11:46:32 -0500 (EST)
   6 Cc: lr@hpl.hp.com
   7 In-Reply-To: <199812280441.XAA11943@home.sysarch.com.> from "Uri Guttman" at Dec 27, 98 11:41:54 pm
   8 Content-Type: text/plain; charset=US-ASCII
   9 Content-Length: 1680
  10
  11 >   JP> Still linear.  Maybe the constants are painfully large, but we're
  12 >   JP> still looking at O(n).
  13 >
  14 > i think i see why we don't agree here. i am talking about the linear set
  15 > of bytes and you are talking about liner O(n) for the work. is that it?
  16
  17 Yep.
  18
  19
  20 >   >> hey, it's english.
  21 >
  22 >   JP> Marginally.
  23 >
  24 > ??? marginal quality english or marginally in english?
  25
  26 The latter.  My point is that wrt computer jargon, esp. Perl jargon,
  27 english is not much of a standard.  I.e. what "string" means in
  28 english has only marginal bearing on what it means in Perl.
  29
  30
  31 > different names. each has advantages. the ascii version can use short
  32 > integer strings (<4 bytes) and is always endian independent. you could
  33 > pack into a byte or short if you want to. binary can
  34 > be more compact for some data formats (floats in particular) but endian
  35 > issues come up and have to be dealt with.
  36 >
  37 > you can cover the concept as one but the conversion process is different
  38 > as one uses sprintf and one uses pack. though the sprintf could use pack
  39 > is some ways. the sprintf is used to zero pad integers and print floats
  40 > in a sortable format. pack is used to store the integers in a byte order
  41 > that cmp can sort.
  42
  43 O.k, I think I see what you're getting at.  "Stringify" (in Perl) has
  44 always refered to the conversion which occurs transparently when a
  45 non-string SV gets used as a string.  And more generally, the same
  46 effect can be achieved by sprintf.  The result is a string which, when
  47 printed, represents the non-string value of the SV (be it integer, float,
  48 reference, whatever).
  49
  50 So maybe the right way to describe a string which is produced by pack
  51 is "packed".
  52
  53 John
  54
  55
  56
  57
  58 Return-Path: <jdporter@min.net>
  59 From: John Porter <jdporter@min.net>
  60 Subject: Re: sort article outline
  61 To: uri@sysarch.com
  62 Date: Sun, 27 Dec 1998 18:15:36 -0500 (EST)
  63 In-Reply-To: <199812262117.QAA09392@sysarch.com.> from "Uri Guttman" at Dec 26, 98 04:17:34 pm
  64 Content-Type: text/plain; charset=US-ASCII
  65 Content-Length: 3189
  66
  67
  68 Uri wrote:
  69 > not if you are packing integers and floats with optional endian
  70 > reversals. the whole concept is to convert various data types to one
  71 > concatenated set of keys sortable by cmp. cmp can't handle ints and
  72 > floats normally. larry showed how to handle signed ints and floats with
  73 > endian issues covered. so there is a conversion to a linear set of bytes
  74 > sortable by cmp. the original "disparate set of keys" has been mapped to
  75 > a single sortable key.
  76
  77 Still linear.  Maybe the constants are painfully large, but we're still
  78 looking at O(n).
  79
  80
  81 >   JP> Why should we pander to the lcd?
  82 >
  83 > i don't think it is pandering at all. all i think is one form is a
  84 > subset of the other form and should have a different name.
  85
  86 Well clearly, printable characters are a subset (variously defined)
  87 of the set of octets.  My point is that Perl doesn't care, and
  88 Perl programmers shouldn't care.  In Perl terminology, a string is
  89 the octet buffer part of a SV.  Printability is irrelevant.
  90
  91
  92 > there are two issues we are arguing
  93 > here, whether there should be two names and what those names are.
  94
  95 Yep.
  96
  97 >   JP> This is where you and I fundamentally disagree, I guess.
  98 >
  99 > hey, it's english.
 100
 101 Marginally.
 102
 103 >   JP> Nothing wrong with that.  The raw printability of the characters
 104 >   JP> on any given terminal is irrelevant, imho.
 105 >
 106 > i disagree. some may choose to use the printable form for that very
 107 > attribute. sometime the converted keys may be kept around and used for
 108 > other long term purposes where printability may be a needed attribute.
 109
 110 That's why this sorting issue is so open-ended.  We can't lay down
 111 some hard-and-fast technique (like the ST).  If the printability is
 112 critical, then the programmer has to work within that constraint.
 113 But if the performance of the sort is of primary concern, then she
 114 may have to sacrifice the intrinsic printability of the comparison
 115 keys; then they can do a transformation back into printable form,
 116 or cache the printable form, or whatever.
 117
 118
 119 > by having 2 names for 2 subtly different packing methods i think
 120 > that helps them decide which to choose.
 121
 122 I'm all for educating the programmer on the issues and choices.
 123
 124
 125 >   JP> I think this is a bad misuse of the term "linear".
 126 >   JP> Maybe you could come up with some other term that expresses
 127 >   JP> that notion.
 128 >
 129 > well, i agree the term may not be the best, but do you agree on the
 130 > difference between the methods, as i hve described above?
 131
 132 Clearly not.
 133
 134
 135 >   JP> You and I don't see eye-to-eye on this.
 136 >
 137 > maybe my argument above will convince you more. or i can get your
 138 > mother-in-law to agree with me and then you will be sorry!
 139
 140 Um, unfortunately, you have only clarified your position.
 141 You have not added any persuasive content.
 142
 143
 144 >   JP> I remember he said something like that about the ST specifically;
 145 >   JP> don't remember anything about Perl generally...
 146 >
 147 > maybe my memory was faulty but he was so negative on others that i
 148 > remember a slam on perl in some letters. it would be tricky to scan him
 149 > on dejanews for just that.
 150
 151 I'm not saying I don't believe he said such things; I just don't
 152 remember them.  Knowing Rutka, he probably did!
 153
 154 John
 155
 156
 157
 158
 159 Return-Path: <jdporter@min.net>
 160 From: John Porter <jdporter@min.net>
 161 Subject: Re: sort article outline
 162 To: uri@sysarch.com
 163 Date: Sat, 26 Dec 1998 11:30:34 -0500 (EST)
 164 Cc: lr@hpl.hp.com
 165 In-Reply-To: <199812260558.AAA08301@sysarch.com.> from "Uri Guttman" at Dec 26, 98 00:58:39 am
 166 Content-Type: text/plain; charset=US-ASCII
 167 Content-Length: 2668
 168
 169
 170 Uri wrote:
 171 > >>>>> "JP" == John Porter <jdporter@min.net> writes:
 172 >
 173 >   >> also if you
 174 >   >> put integers in the perl byte string, then it is not stringifying which
 175 >   >> slightly implies ascii.
 176 >
 177 >   JP> Not in Perl, imho.
 178 >   JP> I think cmp implies strings, regardless of the character value range.
 179 >   JP> We're talking about comparing arrays of byte-sized "characters",
 180 >   JP> i.e. strings.
 181 >
 182 >   >> so linearization is a better overall term,
 183 >
 184 >   JP> I find your argument, such as it is, unconvincing.
 185 >   JP> In what way has the process made anything linear?
 186 >   JP> That's what "linearization" means, after all.
 187 >
 188 > you are taking a disparate set of keys and making them into a single key
 189 > of a linear set of bytes.
 190
 191 Even so, the "disparate set of keys" is already linear, wrt
 192 the sorting algorithm.  At least if it's done right, which
 193 presumably it is.
 194
 195
 196 > i understand that perl is 8 bit clean regarding strings and cmp. and cmp
 197 > is an unsigned compare (IIR). but i (and some others) feel that the word
 198 > string connotes some printable attributes.
 199
 200 Anyone harboring that notion should be edified.  Why should we
 201 pander to the lcd?
 202
 203
 204 >   >> (ascii really means printing chars, and probably just digits and
 205 >   >> us-ascii unless you use locale and it works for extendied ascii).
 206 >
 207 >   JP> Um, imho, all this stuff about ascii is irrelevant.
 208 >
 209 > well how would you differentiate converting a key to printable ascii
 210 > vs. integer bytes?
 211
 212 The important question is, *why* would you?
 213 I don't think we should, in this case at least.
 214
 215
 216 > stringifying implies printable.
 217
 218 This is where you and I fundamentally disagree, I guess.
 219
 220 > but if you create a 8 bit set of bytes that would be something
 221 > more complex that won't print (without going to hex or something).
 222
 223 Nothing wrong with that.  The raw printability of the characters
 224 on any given terminal is irrelevant, imho.
 225
 226
 227 > that
 228 > is linearization since a set of integers and floats, are now a linear
 229 > set of sortable bytes, but not a printable string.
 230
 231 I think this is a bad misuse of the term "linear".
 232 Maybe you could come up with some other term that expresses
 233 that notion.
 234
 235
 236 > this is fine edge semantics, but i do like having two terms for the
 237 > printable and non-printable converted keys. maybe linearization is not a
 238 > great term but i want to reserve stringifying for printable only.
 239
 240 Whatever.  You and I don't see eye-to-eye on this.
 241
 242
 243 > i recall [Rutka] had some early misstatements about not liking perl
 244 > (or we interpreted his poor wording to mean that).
 245
 246 I remember he said something like that about the ST specifically;
 247 don't remember anything about Perl generally...
 248
 249 John
 250
 251
 252
 253
 254 Return-Path: <jdporter@min.net>
 255 From: John Porter <jdporter@min.net>
 256 Subject: Sorting article
 257 To: uri@sysarch.com
 258 Date: Sat, 2 Jan 1999 23:45:47 -0500 (EST)
 259 Cc: lr@hpl.hp.com
 260 In-Reply-To: <199901021831.NAA12831@home.sysarch.com.> from "Uri Guttman" at Jan 2, 99 01:31:45 pm
 261 Content-Type: text/plain; charset=US-ASCII
 262 Content-Length: 1354
 263
 264
 265 > John wrote:
 266 > > Uri wrote:
 267 > > > in both cases, the original data is usually concatenated onto the end of
 268 > > > the created sort key and then extracted (usually with substr) after the
 269 > > > sort is done.  neither string sort can handle perl references directly,
 270 > > > though the referenced data can be used to create a single sort key and
 271 > > > then ST or OM can then be used for the actual sort.
 272 > >
 273 > > I don't think I'd say "usually".  You should say "possibly", with
 274 > > a description of the situations that allow it.
 275 >
 276 > fine. what situations exist where you would not need to concatenate the
 277 > original data onto the key?
 278
 279 I'm not talking not needing to do it, I'm talking about not being able
 280 to do it.  If you're sorting a list of "objects" e.g., you can't simply
 281 concat a reference onto the end of a string.  However, as you pointed out,
 282 one could simply concat an index for the object; so now I can't think of
 283 any situations where the pack technique can't be used.
 284
 285 The slice of the pie I'd like to tackle is addressing certain
 286 issues, namely:
 287 1. Comparison keys are unique, slightly non-unique, or highly non-unique;
 288 2. Cost of deriving the comparison key from each datum is nil, small,
 289    or large.
 290 According to my benchmarks, these had a significant influence on the
 291 relative performance of various techniques (ST, OM, etc.).
 292
 293 John
 294
 295
 296
 297
 298 Return-Path: <jdporter@min.net>
 299 From: John Porter <jdporter@min.net>
 300 Subject: Re: Sorting article
 301 To: uri@sysarch.com
 302 Date: Mon, 4 Jan 1999 08:59:18 -0500 (EST)
 303 Cc: lr@hpl.hp.com
 304 In-Reply-To: <199901030455.XAA13708@home.sysarch.com.> from "Uri Guttman" at Jan 2, 99 11:55:35 pm
 305 Content-Type: text/plain; charset=US-ASCII
 306 Content-Length: 1306
 307
 308 Uri wrote:
 309 > good. i like the point you make about the index vs. the reference. this
 310 > could be a sort of combination of swchartzian and stringify (either
 311 > flavor). if you have to sort a list of refs (or complex stuff),
 312 > preprocess into a array with a single ref to the real data, make a list
 313 > of stringified keys and append the index to the key and sort with
 314 > builtin cmp. then extract out the indexes and original ref(s). the
 315 > schwartzian part is doing it with maps.
 316
 317 The similary to ST had occured to me.
 318         map { (split("\0",$_,2))[0] }
 319         sort
 320         map { join "\0", $_->cmpkey(), $_ }
 321         @a
 322
 323 But this is as close as it gets.  The other "advantage" of ST --
 324 multiple key comparisons -- is obviated by our technique.
 325
 326
 327 > tho i beleive temp arrays and
 328 > foreach modifiers can be or are faster than maps. we have seen cases
 329 > where multiple maps ar slow. i think it is because it has to recopy all
 330 > the values for each map. a foreach on a temp array would use the same
 331 > array without copying.
 332
 333 I thought the slow part was all those anonymous arrays which are
 334 created in the first map.  map itself doesn't allocate an AV, it makes
 335 a list.  Same with sort.  So if you're sorting a list of 1000 items,
 336 you have three lists created (map, sort, map), but 1000 anonymous AV's
 337 allocated in the first map.
 338
 339 John
 340
 341
 342
 343
 344 Return-Path: <lr@hpl.hp.com>
 345 From: Larry Rosler <lr@hpl.hp.com>
 346 To: "'Uri Guttman'" <uri@sysarch.com>, jdporter@min.net
 347 Subject: RE: Sorting article
 348 Date: Mon, 4 Jan 1999 12:55:54 -0800
 349 Content-Type: text/plain;
 350         charset="iso-8859-1"
 351 Content-Length: 4272
 352
 353 > From: Uri Guttman [mailto:uri@sysarch.com]
 354 > Sent: Monday, January 04, 1999 08:23
 355 > To: jdporter@min.net
 356 > Cc: lr@hpl.hp.com
 357 > Subject: Re: Sorting article
 358 ...
 359 > P.S. larry has some catchup reading and replying to do now that he
 360 > should be back from vacation.
 361
 362 Yes, I am back.  We drove 7 hours yesterday from Southern California,
 363 much of it through some heavy fog in the Central Valley, so I am still a
 364 little bleary.  And I have tried to plow through over 1200 articles in
 365 c.l.p.misc, with so far only two conclusions:
 366
 367 1.  The group got along fine without me (something to think about for
 368 the future :-).
 369
 370 2.  Uri seems to have had a couple of really bad days through the past
 371 week or so.  :-(
 372
 373 I haven't had a chance to read all of the correspondence between you two
 374 yet, but I did skim a couple, and noticed the mention of sorting
 375 indexes.  I "solved" the problem of using the default sort on references
 376 while on vacation, *literally* while I was sleeping (i.e., I woke up and
 377 wrote it down!).  I then generalized it some, and thought it would be a
 378 world-beater speedwise.  But my first measurements are somewhat
 379 disappointing.
 380
 381 Here is what I have so far; I'm sharing it before doing much else
 382 because your comments and observations would be most useful.  It is
 383 actually a different sorting paradigm, as you will see:
 384
 385 EVOLUTION OF EFFICIENT PERL SORTS
 386
 387 Naive          { sortkey(value_a) cmp sortkey(value_b) }
 388 Orcish         $h{ value } ||= sortkey  OR  @h{ values } = sortkeys
 389 Schwartz       [ value, sortkey, ... ]
 390 Linearized     sortkey . value
 391 New            @h{ sortkeys } = values
 392
 393
 394
 395 #!/usr/local/bin/perl -w
 396 use Benchmark;
 397
 398 my $n = 10000;
 399 my @a = map { [ int(rand 1000), 'foo' . int(rand 1000) ] } 1 .. $n;
 400
 401 sub new {
 402     my $i = 'aaa';
 403     my %h;
 404     @h{ map pack('N A* A*' => $_->[0], $_->[1], $i++) => @a } = @a;
 405     my @out = @h{ sort keys %h };
 406 }
 407
 408 sub usual {
 409     my @out = sort { $a->[0] <=> $b->[0] || $a->[1] cmp $b->[1] } @a;
 410 }
 411
 412 timethese(1 << (shift || 0), {
 413     New   => sub { &new   },
 414     Usual => sub { &usual },
 415 });
 416 __END__
 417
 418 Benchmark: timing 16 iterations of New, Usual...
 419        New: 10 wallclock secs ( 9.56 usr +  0.14 sys =  9.70 CPU)
 420      Usual: 13 wallclock secs (12.92 usr +  0.00 sys = 12.92 CPU)
 421
 422 Comments:
 423
 424 $i serves two functions:  insuring the uniqueness of the sort key (which
 425 is saved as the key to a hash, so must be unique);  preserving the
 426 stability of the sort (entries that sort equal preserve their order in
 427 the output array).
 428
 429 Do you like those two hash slices, Uri?  Take a look at the next
 430 attempt, which adds an array slice.
 431
 432 In the above case, storing the references themselves as values in the
 433 hash is efficient.  But I was concerned about generalizing this sorting
 434 paradigm to values that were not references, but might be arbitrarily
 435 long strings.  Before I left on vacation, someone had posted about
 436 perhaps sorting references to the strings, or indexes into the array.
 437 So I tried the following:
 438
 439 #!/usr/local/bin/perl -w
 440 use Benchmark;
 441
 442 my $n = 10000;
 443 my @a = map { sprintf "XXX %d %s %s", int(rand 1000), 'foo' . int(rand
 444 1000),
 445         'x' x 100 . "\n" } 1 .. $n;
 446
 447 sub indexed {
 448     my $i = 'aaa';
 449     my %h;
 450     @h{ map pack('N A* A*' => /(\d+) (\w+)/, $i++) => @a } = 0 .. $#a;
 451     my @out = @a[ @h{ sort keys %h } ];
 452 }
 453
 454 sub linear  {
 455     my @out = map { substr($_, 1 + rindex $_, "\0") } sort
 456         map pack('N A* x A*' => /(\d+) (\w+)/, $_) => @a;
 457 }
 458
 459 timethese(1 << (shift || 0), {
 460     Indexed => sub { &indexed },
 461     Linear  => sub { &linear  },
 462 });
 463 __END__
 464
 465 Benchmark: timing 16 iterations of Indexed, Linear...
 466    Indexed: 15 wallclock secs (14.06 usr +  0.58 sys = 14.64 CPU)
 467     Linear: 13 wallclock secs (12.44 usr +  0.61 sys = 13.05 CPU)
 468
 469 Comments:
 470
 471 I had hoped that this would be better, because it seemingly avoids
 472 appending the data to the sortkeys and then extracting via the substrs.
 473 But both of those operations take place in the O(n) part of the sort,
 474 not in the O(n log n).  So perhaps I was expecting too much.  But why is
 475 it *slower*???
 476
 477 I have to catch up now on the rest of the world, including my *real*
 478 job.  But I wanted to get this off my chest (and onto yours? :-) ASAP.
 479
 480 HNY!!!
 481
 482 Larry
 483
 484 --
 485 Larry Rosler
 486 Hewlett-Packard Company
 487 http://www.hpl.hp.com/personal/Larry_Rosler/
 488 lr@hpl.hp.com
 489
 490
 491
 492 From: Larry Rosler <lr@hpl.hp.com>
 493 To: "'John Porter'" <jdporter@min.net>
 494 Cc: uri@sysarch.com
 495 Subject: RE: Sorting article
 496 Date: Wed, 6 Jan 1999 13:27:16 -0800
 497 Content-Type: text/plain
 498 Content-Length: 1096
 499
 500 > From: John Porter [mailto:jdporter@min.net]
 501 > Sent: Wednesday, January 06, 1999 13:02
 502 > To: lr@hplb.hpl.hp.com
 503 > Cc: uri@sysarch.com
 504 > Subject: Re: Sorting article
 505 >
 506 >
 507 > Larry wrote:
 508 > > New            @h{ sortkeys } = values
 509 > >...
 510 > >     @h{ map pack('N A* A*' => $_->[0], $_->[1], $i++) => @a } = @a;
 511 > >     my @out = @h{ sort keys %h };
 512 >
 513 > This actually reminds me of a technique I first learned about from
 514 > Joseph Hall, see EPP, the bottom of p. 47, where he does something
 515 > similar with an array.  He calls it a "mundane" way to sort.  !
 516 >
 517 > John
 518
 519 I think you clipped the wrong piece from my letter.  Hall is
 520 demonstrating the sort-the-indexes sort, but his sort function uses the
 521 'mundane' key comparisons (from an array instead of a hash).  I was
 522 trying to combine the index sort with the default sort.  This is the
 523 right clip:
 524
 525 sub indexed {
 526     my $i = 'aaa';
 527     my %h;
 528     @h{ map pack('N A* A*' => /(\d+) (\w+)/, $i++) => @a } = 0 .. $#a;
 529     my @out = @a[ @h{ sort keys %h } ];
 530 }
 531
 532 The nesting of the hash slice inside of the array slice is not
 533 'mundane', is it?
 534
 535 Larry
 536
 537
 538
 539 From: Larry Rosler <lr@hpl.hp.com>
 540 To: "'Uri Guttman'" <uri@sysarch.com>, jdporter@min.net
 541 Subject: RE: sort article outline
 542 Date: Thu, 7 Jan 1999 23:05:27 -0800
 543 Content-Type: text/plain
 544 Content-Length: 1258
 545
 546 Hi.  I'm still way behind on stuff, especially this work.  But here is a
 547 thought on terminology.
 548
 549 Rather than 'stringified sort' or 'linearized sort', how about
 550 'lexicographic sort'?  If that isn't definitive enough, consider 'pure
 551 lexicographic sort' or 'default lexicographic sort'.
 552
 553 Which reminds me:  Whoever now owns the draft of the paper (Uri?) --
 554 make sure to include a note that all this requires that 'no locale;' be
 555 in effect, or the idea falls apart.  This is from perllocale:
 556
 557 <QUOTE>
 558         no locale;
 559         print +(sort grep /\w/, map { chr() } 0..255), "\n";
 560
 561 This machine-native collation (which is what you get unless use locale
 562 has appeared earlier in the same block) must be used for sorting raw
 563 binary data, whereas the locale-dependent collation of the first example
 564 is useful for natural text.
 565 </QUOTE>
 566
 567 Note the 'sorting raw binary data' which is what we are talking about.
 568
 569 I dislike 'stringified' or 'stringized' for several reasons, not the
 570 least of which is that verbification of adjectives isn't good English,
 571 IMO.  'Linearized' (also a verbification of an adjective, but with a
 572 better pedigree) may be a little off-beat semantically.  I think
 573 'lexicographic' comes from the C Standard -- I'll take a look tomorrow.
 574
 575 Larry
 576
 577
 578
 579 From: Larry Rosler <lr@hpl.hp.com>
 580 To: "'Uri Guttman'" <uri@sysarch.com>
 581 Subject: RE: sort article outline
 582 Date: Fri, 8 Jan 1999 10:36:21 -0800
 583 Content-Type: text/plain;
 584         charset="iso-8859-1"
 585 Content-Length: 7713
 586
 587 > From: Uri Guttman [mailto:uri@sysarch.com]
 588 > Sent: Friday, January 08, 1999 10:07
 589 > To: lr@hpl.hp.com
 590 > Cc: jdporter@min.net
 591 > Subject: Re: sort article outline
 592 >
 593 > larry,
 594 >
 595 > i seemed to have misplaced (lost) your comments to my outline. do you
 596 > have that letter and can you forward it to me? i suspect you keep all
 597 > you old email around. you seem that type. i delete most of mine. i
 598 will
 599 > be switching to vm on emacs which has folders and stuff and it might
 600 > improve that aspect of things.
 601 >
 602 > uri
 603
 604 All it costs is disk space, now on someone else's server (because I have
 605 switched from running elm on my Unix box to using Microsoft Exchange,
 606 which the rest of my group uses).
 607
 608 The letter is below, and I have not re-edited it.  But note some things:
 609 the use of 'lexicographic' (as in my most recent suggeston) and the
 610 mention of 'index sorts' which have since received more discussion
 611 between me and John.
 612
 613 Larry
 614
 615 > From: Uri Guttman [mailto:uri@sysarch.com]
 616 > Sent: Tuesday, December 22, 1998 21:47
 617 > To: jdporter@min.net; lr@hpl.hp.com
 618 > Subject: sort article outline
 619 >
 620 > hi guys,
 621 >
 622 > i felt productive tonight and i didn't have other work and news was
 623 slow
 624 > so i wrote an outline for the sort article. i like it so far and i
 625 think
 626 > it will make for a good article (or 2).
 627
 628 Likely the latter, if it is thorough (as the outline seems to be).
 629
 630 > so please shred it up and send your comments. when we agree on a
 631 > (semi-) final outline, then we can start fleshing out the paragraphs
 632 > with text and code and benchmarks.
 633 >
 634 > an open area is benchmarking and describing when to use each of the
 635 > speedup tricks. this will need some research and playing with code and
 636 > data.
 637
 638 'When to use' is what the summary is about.  See below.
 639
 640 > we should have a set of benchmark data we create to drive this so we
 641 can
 642 > all use it and cme up with useful benchmarks. having this stuff on a
 643 > website (maybe MJD could help us) and having others playing with our
 644 > stuff could be an intersting idea.
 645 >
 646 > otherwise, a belated happy hannukah to larry (we threw a sitdown
 647 dinner
 648 > party for 14 on sunday, the eigth night. we had 8 menorahs going!),
 649 and
 650 > happy holidays to you john.
 651
 652 Nice image (8 menorahs * 9 candles each).  And to you both.
 653
 654 > uri
 655 >
 656 >
 657 >       Outline for TPJ Sort Article
 658 >
 659 > (Note the use of caps here :-)
 660 >
 661 > I. Overview of sorting in the real world.
 662
 663 Refs.  Knuth Vol. 3, Chap. 5
 664 FMTYEWTK
 665
 666 Introduce O() concept and notation briefly
 667
 668 >   A. What is sorting
 669 >
 670 >   B. Why is it needed
 671
 672 Any program that produces tabular output is almost certain to need it!
 673
 674 >   C. Sort algorithms
 675 >
 676 >     1. bubble
 677
 678 O(n**2)
 679
 680 >     2. insertion
 681
 682 O(n**2)
 683
 684 >     3. tree
 685
 686 O(n log n)
 687
 688 >     4. shell
 689
 690 O(n log n)
 691
 692 >     5. quicksort
 693
 694 O(n log n)
 695
 696 >     6. others
 697 >
 698 >   D. O(n) order function
 699
 700 Is this later than needed, in order to characterize the sort
 701 algorithms mentioned above?
 702
 703 >      1. What it means
 704 >
 705 >      2. Constant and linear overhead for pre- and post-
 706 >         processing
 707 >
 708 >      3. Constant overhead per compare
 709 >
 710 >      4. Why reducing compare count is critical
 711 >
 712 >      5. Why amount of sort data matters
 713 >
 714 > II. Perl's sort function
 715
 716 Implementation -- C qsort() with optional callbacks to a Perl function?
 717 Limitations of Perl internal sorting -- size of data set, w/o
 718 sort-merge.
 719 Usefulness of highly tuned system sort command or commercial
 720 sort packages (e.g., SyncSort)
 721
 722 >   A. Builtin compare is cmp
 723
 724 and implementation is optimal (C or assembly memcmp function)
 725
 726 >   B. Custom compare code
 727
 728 (in callback function)
 729
 730 >       1. block of code
 731 >
 732 >       2. sub name
 733 >
 734 >       3. sub type glob
 735 >
 736 >       4. sub ref (to be done)
 737 >
 738 >   C. Special vars $a and $b in compare
 739 >
 740 > III Ways to sort in Perl
 741
 742 Ascending alphabetic (lexicographic) sort (default)
 743
 744 >   A. Numeric sort
 745 >
 746 >   B. Reverse sort
 747 >
 748 >   C. Multi-level sort
 749
 750 Are these 'field' sorts?
 751
 752 >   D. Array index sort
 753
 754 This is a derived-key sort.
 755
 756 >   E. Hash value sort
 757
 758 This is a derived-key sort.
 759
 760 >   F. Complex compare code
 761
 762 No single-valued sort key per datum.  Beyond the scope of
 763 this article.
 764
 765 > IV. Efficient sorting in Perl
 766 >
 767 >   A. When efficiency is not warranted
 768            ^ an investment in improving
 769
 770 >      1. Short data sets
 771 >
 772 >      2. Rarely executed program
 773
 774 More generally, low cost compared to other parts of the
 775 program.  Quote from Rob Pike about when to optimize (don't!).
 776
 777 >   B. Sort overhead
 778 >
 779 >     1. builtin vs. {$a cmp $b}
 780
 781 Trivial sorts (key is trivially derived from datum):
 782 a.  {$a <=> $b} or reverse
 783 b.  {$hash{$a} cmp $hash{$b}} or numeric; or reverse
 784 c.  {$a->[CONST] cmp $b->[CONST]} or numeric; or reverse
 785
 786 >   C. General ideas on how to speed up sort
 787 >
 788 >     1. Preprocessing data
 789 to achieve a trivial sort
 790
 791 >     2. Simple and fast compare code (builtin if possible)
 792
 793 Trivial sort.
 794
 795 >     3. Postprocessing data from LOL (if needed)
 796 >
 797 >   D. Speed up techniques
 798 >
 799 >     1. Schwartzian transform (from Randal Schwartz)
 800 >
 801 >       a. Converts input to LOL, index sort, convert LOL to output
 802
 803 index sort?  This is Trivial Sort c (see above).
 804
 805 >       b. uses maps or temp arrays
 806 >
 807 >     2. Orcish (Or-cache) manoever (from Joseph Hall)
 808 >
 809 >       a. Converts sort data in compare code
 810 >
 811 >       b. Converts only once per unique sort data
 812
 813 but checks every time to see if key derivation is needed.
 814
 815 >       c. Stores converted keys in private hash
 816
 817 This is Trivial Sort b (see above).
 818
 819 >       d. No need for post processing, as sort is on real data
 820
 821 Where is the discussion about creating the hash during a
 822 pre-processing pass, obviating the test on each comparand?
 823
 824 >     3. Stringifying (need better name) (from many people)
 825
 826 How about 'Linearized Sort'?
 827
 828 >       a. Preprocessing of input data to sortable strings
 829
 830 'string' == arbitrary sequence of bytes, stored in a Perl
 831 scalar variable (u.e., linear).
 832
 833 >        i. uses either pack or sprintf (or other perl code).
 834 >
 835 >          ii. Converted key is combination of binary and ascii
 836                              ^^  may be
 837
 838 >          iii. Original data is appended onto converted key
 839
 840 See below -- cannot be references.
 841
 842 >       b. Uses builtin compare (fastest)
 843 >
 844 >       c. Postprocess keys to get original data (use substr
 845 or unpack)
 846
 847 Note that this is isomorphic to the Schwartz Transform
 848 map/sort/map paradigm.
 849
 850 >       d. Can handle integers (signed or unsigned), ascii
 851 strings, and
 852                                                      ^^^^^
 853 ??? see above
 854 >          normal or reverse sorts on a per key basis.
 855 >
 856 >       e. Can handle floats (IEEE format only) with code to
 857 deal with big
 858                      ^ non-negative, until someone gets smarter!
 859 >          vs. little endian.
 860
 861 Cannot handle references!
 862
 863        4.  Summary
 864
 865          a.  Examples
 866               IP addresses
 867               Fully-qualified domain names (right to left)
 868               ...
 869
 870          b. When to use
 871               Criteria
 872               Benchmarks
 873
 874 >
 875 > --
 876 > Uri Guttman  -----------------  SYStems ARCHitecture and
 877 Software Engineering
 878 > Perl Hacker for Hire  ----------------------  Perl,
 879 Internet, UNIX Consulting
 880 > uri@sysarch.com  ------------------------------------
 881 http://www.sysarch.com
 882 The Best Search Engine on the Net -------------
 883 http://www.northernlight.com
 884
 885 Good start.  Excelsior!
 886
 887 I am having trouble minding my tongue about Rutka's obnoxious posts, but
 888 he won't disappear (and he now works for Lucent, my alma mater.
 889 Sigh...)
 890
 891 I will be away from December 25 through January 3; real vacation (no
 892 electronic communication).
 893
 894 Happy New Year!!!
 895
 896 Larry
 897
 898 --
 899 Larry Rosler
 900 Hewlett-Packard Company
 901 http://www.hpl.hp.com/personal/Larry_Rosler/
 902 lr@hpl.hp.com
 903
 904
 905
 906 From: Larry Rosler <lr@hpl.hp.com>
 907 To: "'Uri Guttman'" <uri@sysarch.com>
 908 Cc: jdporter@min.net
 909 Subject: RE: sort article outline
 910 Date: Fri, 8 Jan 1999 12:32:49 -0800
 911 Content-Type: text/plain
 912 Content-Length: 2465
 913
 914 > From: Uri Guttman [mailto:uri@sysarch.com]
 915 > Sent: Friday, January 08, 1999 09:59
 916 > To: lr@hpl.hp.com
 917 > Cc: jdporter@min.net
 918 > Subject: Re: sort article outline
 919 >
 920 ...
 921 >   LR> make sure to include a note that all this requires that 'no
 922 locale;' be
 923 >   LR> in effect, or the idea falls apart.  This is from perllocale:
 924 >
 925 >   LR> <QUOTE>
 926 >   LR>         no locale;
 927 >   LR>         print +(sort grep /\w/, map { chr() } 0..255), "\n";
 928 >
 929 >   LR> This machine-native collation (which is what you get unless use
 930 locale
 931 >   LR> has appeared earlier in the same block) must be used for sorting
 932 raw
 933 >   LR> binary data, whereas the locale-dependent collation of the first
 934 example
 935 >   LR> is useful for natural text.
 936 >   LR> </QUOTE>
 937 >
 938 >   LR> Note the 'sorting raw binary data' which is what we are talking
 939 about.
 940 >
 941 > is this due to signed/unsigned cmp issues? that 0 .. 255 might map to
 942 > 128 ..255, 0 .. 127 on some machines because of signed char compares?
 943 > this is another porting issue for packed sorts. not just endianess.
 944
 945 No.  Without looking at the perl sources, I venture to guess that it is
 946 because the 'no locale;' default sort uses 'memcmp' while the 'use
 947 locale;' default sort uses 'strcoll' which relies on LC_COLLATE.  There
 948 might be a signedness problem if 'strncmp' were used instead of
 949 'memcmp'.  But I recall John saying that 'memcmp' was used (and it would
 950 *have* to be to conform to Perl's definition of a string, which is a
 951 sequence of arbitrary bytes whose length is specified by a separate
 952 integer, vs. C's definition of a string, which is a sequence of bytes
 953 terminated by a zero byte).  In any case, I have yet to see any machine
 954 dependencies in my tests on PA-RISC or Intel architectures.  Does anyone
 955 know of any target that uses signed-char compares that this could be
 956 verified against -- or just look into the perl source (which I haven't
 957 done yet) to make sure it uses 'memcmp'.
 958
 959 > printable sorts (hey i like these terms) need to worry about
 960 > locale tho.
 961
 962 Well, that is interesting.  You distinguish between a 'packed' sort key
 963 (which *surely* demands 'no locale;') and a 'printable' ('stringized')
 964 sort key (which perhaps relies on 'use locale;').  Maybe that is a real
 965 distinction that should be understood and documented.  A mixed sort key
 966 (packed number and a string, for example) would have to be sorted
 967 without locale, though the number could be stringized by 'sprintf
 968 "%10u"' in which case locale could be observed.
 969
 970 Larry
 971
 972
 973
 974 From: Larry Rosler <lr@hpl.hp.com>
 975 To: "'Uri Guttman'" <uri@sysarch.com>
 976 Cc: "'jdporter@min.net'" <jdporter@min.net>
 977 Subject: RE: sort article outline
 978 Date: Fri, 8 Jan 1999 12:59:24 -0800
 979 Content-Type: text/plain
 980 Content-Length: 1266
 981
 982 ...
 983 > No.  Without looking at the perl sources, I venture to guess
 984 > that it is because the 'no locale;' default sort uses
 985 > 'memcmp' while the 'use locale;' default sort uses 'strcoll'
 986 > which relies on LC_COLLATE.  There might be a signedness
 987 > problem if 'strncmp' were used instead of 'memcmp'.  But I
 988 > recall John saying that 'memcmp' was used (and it would
 989 > *have* to be to conform to Perl's definition of a string,
 990 > which is a sequence of arbitrary bytes whose length is
 991 > specified by a separate integer, vs. C's definition of a
 992 > string, which is a sequence of bytes terminated by a zero
 993 > byte).  In any case, I have yet to see any machine
 994 > dependencies in my tests on PA-RISC or Intel architectures.
 995 > Does anyone know of any target that uses signed-char compares
 996 > that this could be verified against -- or just look into the
 997 > perl source (which I haven't done yet) to make sure it uses 'memcmp'.
 998
 999 >From 'sv.c', the only comparison routine used is 'memcmp'.  When locale
1000 is in effect, each comparand is passed once through strxfrm and the
1001 transformed value is saved for sort comparisons.  (The mechanism must be
1002 like that which converts numbers to strings or v.v. once only.)
1003
1004 So I think character-signedness is a non-issue.
1005
1006 Larry
1007
1008
1009
1010 From: Larry Rosler <lr@hpl.hp.com>
1011 To: "'John Porter'" <jdporter@min.net>, uri@sysarch.com
1012 Subject: RE: sort article outline
1013 Date: Fri, 8 Jan 1999 14:29:13 -0800
1014 Content-Type: text/plain
1015 Content-Length: 2430
1016
1017 > From: John Porter [mailto:jdporter@min.net]
1018 > Sent: Friday, January 08, 1999 11:03
1019 > To: uri@sysarch.com
1020 > Cc: lr@hpl.hp.com
1021 > Subject: Re: sort article outline
1022 >
1023 > Uri wrote:
1024 > >   JP> But it's also long-winded.  Can't we simply talk about the
1025 > >   JP> "default sort" (locale issues aside)?
1026 > >
1027 > > i don't like that because it doesn't describe what transforms are
1028 being
1029 > > done to get it to use the builtin cmp. just oo vague a term.
1030 >
1031 > Same applies to "default lexicographic sort".  It's not really for
1032 > describing any of the transforms we're investigation, but for the
1033 > underlying sort (qsort/cmp) which we're exploiting.
1034 >
1035 > > what is important is the method of transforms that
1036 > > allows you to use the builtin cmp. so i feel the name
1037 > should reflect that.
1038 >
1039 > Right. But everything needs a name, even the "default sort".
1040
1041 Here, in journalistic format, is a synopsis of where we stand.  You may
1042 prefer 'Advanced' to 'Enhanced'.
1043
1044 WHAT:  The Enhanced Default Sort
1045
1046 WHY:   For optimal performance on large sorts (often with several sort
1047        fields), compare sort keys using built-in C functions rather than
1048        using Perl code in a sort subroutine.
1049
1050 WHERE: Wherever the sort is based on comparing a single-valued mapping
1051        function of each value to be sorted.  (This is true for all
1052        enhanced sorts -- ST, Orcish, ...)
1053
1054 HOW:   For each value to be sorted, create a sort key that is a Perl
1055        string (a sequence of bytes) corresponding lexicographically to
1056 the
1057        position of the value in the sorted output; append the value to
1058 the
1059        sort key so that it can readily be retrieved.  The sort key can
1060 be
1061        created using any combination of 'pack', 'sprintf' or
1062 concatenation.
1063        Each field to be sorted may comprise a signed or unsigned
1064 integer,
1065        a floating-point number, or a string of characters.  (Special
1066        handling is required if the values to be sorted are references --
1067        LoL or LoH.)
1068
1069 WHEN:  Compute the keys in a single pass through the list of values
1070 (map).
1071        Sort.  Retrieve the values from the keys (map).
1072
1073 Other minor issues:  As stated, this method sorts two values whose keys
1074 compare equal according to the values themselves (which is the usual
1075 fall-through condition for a fielded sort).  A 'stable' sort can be
1076 forced (using any sorting method) by adding a sequential index as the
1077 last sort key.
1078
1079 GO NINERS!!!  (Bye-bye, Pats :-()
1080
1081 Larry
1082
1083
1084
1085 i have to write these down now or i may forget them.
1086
1087 a sort module api could look like this:
1088
1089 $sort_obj = Sort->new( [ sort keys desriptions ], gloabl sort options
1090
1091 sort key description => {
1092
1093                 'type'  => integer, string, float, etc.
1094
1095                 'down'  => yes
1096
1097                 'code_to_get_key_value' => code ref OR code text.
1098
1099 }
1100
1101 $sort_obj->values( list of values or refs)
1102
1103 can be called again to add more to list.
1104
1105 $sort_obj->sort                 returns sorted data
1106
1107
1108 if the option is to build a preprocess sub, we take the code texts and
1109 create the text for it. then it will run fast, an optimization for large
1110 sorts. otherwise, we use the code ref which takes a ref (and key
1111 number?) for an argument and we call it to return the nth key value.
1112
1113 from this info we create a pack format string.
1114
1115 any inversions or endian things are handled here in the preprocess code
1116 (either version)
1117
1118 for each sort value we build up an array of key values.  then we just
1119 do a pack and push its value.
1120
1121 we also generate an unpack format (not same as pack format but related)
1122 to get the index and/or the original values to return.
1123
1124 the pack formats, refs to processing code and other stuff is stored in
1125 the sort object for reuse.
1126
1127 we can even have special case packs for common things like ip addresses
1128 so we can optimize even further there. ip can be designated as text or
1129 binary.
1130
1131 well, what do you think?
1132
1133 uri