book: Sort::Maker
chapter: Sort::Maker
title: A Little Past History
*: TPC3 in Monterey, CA 1999
*: Co-wrote and presented "A Fresh Look at Efficient Perl Sorting"
html:
http://www.sysarch.com/perl/sort_paper.html
*: Co-author was Larry Rosler (retired from HP and Perl)
*: Introduced the GRT as a way to speed up Perl sorts
*: It won the prize for Best Technical Paper
*: A module Sort::Records was promised which would implement the GRT
*: Sort::Records was started but a code and design review showed many
problems.
*: Sort::Records was shelved during development because of a code review
*: People still write to me asking about that module
PAGE_END
title: Recent History
*: I have been thinking about this module for 5 years
*: Solved the major design flaw - have the user provide key extraction
code
*: A thread in perl6-language gave needed inspiration
*: Sort::Maker emerges
*: Much help in the API design and some coding tricks from Damian Conway
PAGE_END
title: Introducing Sort::Maker
*: Easy description of sorts
*: Generates a sort sub
*: Choice of generated sort styles
*: Can generate the highest speed sorts in Perl
*: Generated source can be printed
**: For study
**: For pasting in code to save the generation time overhead
*: These slides can be found at
html:
http://www.stemsystems.com/slides/slides/index.html
*: The current tarball is at
html:
http://www.stemsystems.com/sort/Sort-Maker-0.01.tar.gz
PAGE_END
title: Sort::Maker Synopsis
*: Exports a single sub 'make_sorter'
*: Each key has its own description
*: Call make_sorter with sort description arguments and it returns a
code reference.
*: Call that reference with unsorted input records
*: Returns sorted records
code:
my $sorter = make_sorter(
qw( plain ),
number => [
code => '/(\d+)/',
'descending',
],
) ;
my @sorted = $sorter->( @unsorted ) ;
PAGE_END
title: Key Descriptions
*: Each key has a description
*: Keys must be in the order they will sort (higher level keys must be
earlier)
*: The key type must be specified as 'string' or 'number'
*: Keys can have attributes such as 'descending' or 'case'
*: Attributes can have defaults set for all the keys
*: Default attributes can be overridden in any key description
PAGE_END
title: Extraction Code
*: Each key needs code to extract it from a record
*: Each record is aliased to $_ (via map)
*: Key extraction code operates on $_ and gets the value for this key
*: The extraction code is executed in list context so m/(foo)/ works
*: The code is inside a do{} block so you can have multiple statements
PAGE_END
title: Sort Styles
*: Four different sorting styles to choose from
**: Plain
**: Orcish Manouvre
**: Schwartian Transform (ST)
**: Guttman-Rosler Transform (GRT)
*: Each has its uses and advantages
*: Styles are really different ways to cache extracted keys
*: Caching keys moves key extraction from O( N log N ) to O( N )
*: In larger sizes of sort sets, caching keys is a very big win
*: This is a classic sort of arrays of numbers
**: Compare this code to the generated code for the different sort styles
code:
sort { $a->[0] cmp $b->[0] ||
$a->[1] cmp $b->[1]
}
PAGE_END
title: Plain Sort
*: No key caching
*: Similar to common sorts with a code block
*: Good for small sort sets as there is no caching
*: Pass 'plain' option to make_sorter
code:
sub {
sort {
do{ my( $left, $right ) = map { $_->[0] } $a, $b;
$left cmp $right }
||
do{ my( $left, $right ) = map { $_->[1] } $a, $b;
$left cmp $right }
} @_ ;
}
PAGE_END
title: Orcish Manouvre Sort
*: Caches extracted keys in a hash
*: Assignments to the hash in ||=
*: Called the orcish because of Or-Cache
*: Created by Joseph Hall
*: It will re-extract keys for any record with a false value
*: The caching hash must be cleared beforehand
**: Sort::Maker declares a caching hash in the anonymous sub
*: Hash lookups are still O( N log N )
*: Good for medium sort sets
*: Pass 'orcish' option to make_sorter
code:
sub {
my %or_cache ;
sort {
(
( $or_cache{$a} ||=
do{ my ($val) = map { $_->[0] } $a ; $val } )
cmp
( $or_cache{$b} ||=
do{ my ($val) = map { $_->[0] } $b ; $val } )
)
||
(
( $or_cache{$a} ||=
do{ my ($val) = map { $_->[1] } $a ; $val } )
cmp
( $or_cache{$b} ||=
do{ my ($val) = map { $_->[1] } $b ; $val } )
)
} @_ ;
}
PAGE_END
title: Schwartian Transform (ST) Sort
*: Caches extracted keys in an anonymous array for each input record
*: Stores the record itself in slot 0 of the array
*: Uses the map/sort/map idiom
*: Popularized by Randal Schwartz
*: Good for medium to large sort sets
*: Key extraction is O( N )
*: Does a full callback to Perl in the comparison block
*: Pass 'ST' option to make_sorter
code:
sub {
map $_->[0],
sort {
$a->[1] cmp $b->[1]
||
$a->[2] cmp $b->[2]
}
map [ $_,
do{ my ($val) = $_->[0] ; $val },
do{ my ($val) = $_->[1] ; $val }
], @_ ;
}
PAGE_END
title: Guttman-Rosler Transform (GRT) Sort
*: Caches extracted keys in a single string for each input record
*: String records can be stored in that string
*: Reference records have their index stored in the cache string
**: The sorted indices are used to slice into the input array to create
sorted records
*: Key extraction is O( N )
*: The comparison code is internal and uses C with no Perl callback (big
win)
*: More complex to use properly and efficiently
*: GRT has special key description attributes for optimization
**: Numbers can be integer/float, signed/unsigned
**: Strings can be fixed/varying
**: Fastest sort style for larger sets of sort records
*: Pass 'GRT' option to make_sorter
code:
sub {
my $rec_ind = 0 ;
return @_[
map unpack( 'N', substr( $_, -4 ) ),
sort
map pack( "Z*Z*N",
do{ my( $val ) = $_->[0] ; ( $val ) },
do{ my( $val ) = $_->[1] ; ( $val ) },
$rec_ind++
), @_
] ;
}
PAGE_END
title: Tests and Benchmarks
*: Test system is table driven
*: Scripts can do both tests and benchmarks
*: Provide a set of data
**: Manually
**: Generate in place with map
**: Generate with a anonymous sub
*: Provide a hand written 'golden' sort sub
*: Provide the arguments to make_sorter()
*: Many tests are in but more are wanted
**: Send in tests if you want
*: More complex tests are needed
code:
{
skip => 0,
name => 'arrays of multiple strings',
source => 1,
data => [ map {
[ rand_token( 8, 20 ), rand_token( 8, 20 ), ]
} 1 .. 100
],
gold => sub { $a->[0] cmp $b->[0] ||
$a->[1] cmp $b->[1] },
args => [ qw( string $_->[0] string $_->[1] ) ],
},
PAGE_END