If you are entirely unfamiliar with using Perl to sort data, read the "Sorting Data with Perl - Part One and Two" articles before reading this article. Beginning Perl coders may find this article uses unfamiliar terms and syntax. Intermediate and advanced Perl coders should find this article useful. The object of the article is to inform the reader, it is not about how to code Perl or how to write good Perl code, but to teach the Schwartzian Transform technique in a manner that is hopefully easily understood.
The Schwartzian Transform
The Schwartzian Transform (referred to as ST from now on) is a good balance between implementation and efficiency making it an attractive option for sorting data. For the curious, the name of this technique is in honor of its originator, Randal Schwartz. This article is a detailed look at how the ST works.
ACME INC.
First we need some data to sort. At the end of this article a data file is attached (employees.txt) . It's a tab separated values list of employees of ACME INC. It's a list of ten employees with some fields that describe their employment status with ACME INC. The fields are:
EmployeeID, Name, HireDate, Position, Department, Salary
[code=text]
EmployeeID Name HireDate Position Department Salary
M011 Smith,John,T 01-12-1981 Welder Maintenance 20.00 p/h
M102 Hart,Thomas,J 06-30-1982 Supervisor Maintenance 28.50 p/h
J309 Jones,Steve,W 02-23-1990 Janitor Janitorial 15.50 p/h
A124 White,Mary,H 03-15-1990 Assembler Assembly 8.75 p/h
Q365 Miles,Frank,R 12-01-1999 Inspector Quality 16.25 p/h
A316 Roberts,Andy,P 07-30-2006 Assembler Assembly 8.00 p/h
S554 William,Terry,K 03-3-2005 Expeditor Stock Room 9.00 p/h
R078 Norris,Chris,K 04-17-2003 Clerk Shipping/Recieving 9.00 p/h
X832 Anderson,Jane,M 03-23-1992 VP Operations Mangement 33.00 p/h
X111 Kelly,Mark,D 05-29-1989 Engineer Engineering 26.75 p/h
[/code]
The data file will be read into a hash of hashes (see the Online Resources section at the end of the article if you are unfamiliar with what a hash of hashes is).
[code=perl]
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper; # for printing a "picture" of the data
my %employees;
open (my $in, 'path/to/employees.txt') or die "$!";
readline <$in>; #skip file header
while (<$in>) {
chomp;
my ($id,$name,$hda te,$pos,$dept,$ sal) = split/\t/;
$employees{$id} = {NAME => $name, HIREDATE => $hdate, POSITION => $pos, DEPT => $dept, SALARY => $sal};
}
close $in;
print Dumper \%employees;
[/code]
The above code creates a hash of hashes from the employees file with the EmplyeeID being used as the top-level hash keys. It produces a series of records like this (partial output of Data::Dumper):
'R078' (the employee ID) is the key to an anonymous hash (see the Resources section at the end of the article if you are not familiar with what an anonymous hash is) that has the employee data stored in key/value pairs. The HIREDATE is in American format of MM-DD-YYYY. We are going to order the records by hire date in ascending order. To do that we will need to munge the date into sortable keys.
[code=perl]
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %employees;
open (my $in, 'path/to/employees.txt') or die "$!";
readline <$in>; #skip header
while (<$in>) {
chomp;
my ($id,$name,$hda te,$pos,$dept,$ sal) = split/\t/;
$employees{$id} = {NAME => $name, HIREDATE => $hdate, POSITION => $pos, DEPT => $dept, SALARY => $sal};
}
close $in;
my @sorted = map {$employees{$_->[0]}{NAME} was hired on $employees{$_->[0]}{HIREDATE}}
sort {$a->[1] cmp $b->[1]}
map {my ($m,$d,$y) = split/-/,$employees{$_} {HIREDATE};
[$_, "$y$m$d"] } keys %employees;
print Dumper \@sorted;
[/code]
What we end up with is a list of employees sorted by hire date:
'Smith,John,T was hired on 01-12-1981',
'Hart,Thomas,J was hired on 06-30-1982',
'Kelly,Mark,D was hired on 05-29-1989',
'Jones,Steve,W was hired on 02-23-1990',
'White,Mary,H was hired on 03-15-1990',
'Anderson,Jane, M was hired on 03-23-1992',
'Miles,Frank,R was hired on 12-01-1999',
'Norris,Chris,K was hired on 04-17-2003',
'William,Terry, K was hired on 03-3-2005',
'Roberts,Andy,P was hired on 07-30-2006'
Lets take a closer look at the ST with most of the code stripped away to reveal what's going on.
[code=perl]
my @sorted = map {}
sort {}
map {} keys %employees;
[/code]
That is the bare essence of the ST: a sort block wrapped in two map{} blocks.
What Makes the Schwartzian Transform Fly
What makes the ST fly with so little code and no temporary variables is the ability to chain together list processing functions and Perls support of anonymous data storage.
It's maybe easier to visualize the ST technique linearly. As a flow chart it would look like this with the arrows indicating input/output direction:
The first map block is really the trickiest part when using this technique. In that block is where you decide what you need to do to create the list of sort keys that the sort block will sort. Lets take a closer look.
The Map Function
A map block is similar to a subroutine:
[code=perl]map {
my ($m,$d,$y) = split/-/,$employees{$_} {HIREDATE};
[$_, "$y$m$d"]
} keys %employees;
[/code]
The data is read into the map block from the list appended to the end instead of imported with @_ like a subroutine. The list can be any list: an array, a hash, a (list of things), another function that creates a list, like the grep() and split() functions, a filehandle, or even a subroutine that returns a list.
The map block is also very much like a loop, but not the same as for/foreach/while/until loops. Map will process all the elements of a list, like grep() does. You can't use loop controls like "last" or "next" or "redo" with map.
In our example we used only one comparison to sort the data:
But you can use as many comparisons as you need to sort your data using the || (or) operator:
Note the mixed use of "cmp" and "<=>" and ascending and descending ordering all in the same sort block. This is perfectly valid and very useful depending on your particular needs.
The last map block can also munge the final output however you need it to for your particular application. In our example we just added a little text to make the output more readable. But you can do whatever you want in that map block: omit some of the data, add some markup code (like html code), change the order of the fields, whatever you want.
Why Use a Schwartzian Transform?
The ST is a good option when the sort keys do not actually exist within the data and the sort keys will only be used to sort the data and not for any other purpose. Like in my example we sorted the list by the hire date, but the sort key had to be made on-the-fly by splitting the date and reordering the American style date of MM-DD-YYYY into a sortable ASCII string, YYYYMMDD. The date could have been sorted numerically but sorting it as a string is really easier.
Another good use of the ST is when you have to sort some data just to test some sort keys that you will use in a larger more robust program.
Why Not Use a Schwartzian Transform?
If you have data with keys that are already is a sortable format there is no need to use the ST technique but you still could if you wanted to. You may also consider using a database that can sort data much more efficiently than a Perl script can.
Conclusion
The Schwartzian Transform is a powerful yet convenient method of sorting data. If after reading the article you are still a bit confused, try experimenting with some code to get a feel for how it works. Once you master this technique you will find yourself using it all the time.
Kevin (aka KevinADC)
Online Resources
Perldoc Website All the Perl documentation online.
perlreftut - Mark's very short tutorial about references
perldsc - data structure complex data structure struct
perllol - Manipulating Arrays of Arrays in Perl
map - apply a change to a list to get back a new list with the changes
This article is protected under the Creative Commons License.
The Schwartzian Transform
The Schwartzian Transform (referred to as ST from now on) is a good balance between implementation and efficiency making it an attractive option for sorting data. For the curious, the name of this technique is in honor of its originator, Randal Schwartz. This article is a detailed look at how the ST works.
ACME INC.
First we need some data to sort. At the end of this article a data file is attached (employees.txt) . It's a tab separated values list of employees of ACME INC. It's a list of ten employees with some fields that describe their employment status with ACME INC. The fields are:
EmployeeID, Name, HireDate, Position, Department, Salary
[code=text]
EmployeeID Name HireDate Position Department Salary
M011 Smith,John,T 01-12-1981 Welder Maintenance 20.00 p/h
M102 Hart,Thomas,J 06-30-1982 Supervisor Maintenance 28.50 p/h
J309 Jones,Steve,W 02-23-1990 Janitor Janitorial 15.50 p/h
A124 White,Mary,H 03-15-1990 Assembler Assembly 8.75 p/h
Q365 Miles,Frank,R 12-01-1999 Inspector Quality 16.25 p/h
A316 Roberts,Andy,P 07-30-2006 Assembler Assembly 8.00 p/h
S554 William,Terry,K 03-3-2005 Expeditor Stock Room 9.00 p/h
R078 Norris,Chris,K 04-17-2003 Clerk Shipping/Recieving 9.00 p/h
X832 Anderson,Jane,M 03-23-1992 VP Operations Mangement 33.00 p/h
X111 Kelly,Mark,D 05-29-1989 Engineer Engineering 26.75 p/h
[/code]
The data file will be read into a hash of hashes (see the Online Resources section at the end of the article if you are unfamiliar with what a hash of hashes is).
[code=perl]
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper; # for printing a "picture" of the data
my %employees;
open (my $in, 'path/to/employees.txt') or die "$!";
readline <$in>; #skip file header
while (<$in>) {
chomp;
my ($id,$name,$hda te,$pos,$dept,$ sal) = split/\t/;
$employees{$id} = {NAME => $name, HIREDATE => $hdate, POSITION => $pos, DEPT => $dept, SALARY => $sal};
}
close $in;
print Dumper \%employees;
[/code]
The above code creates a hash of hashes from the employees file with the EmplyeeID being used as the top-level hash keys. It produces a series of records like this (partial output of Data::Dumper):
Code:
'R078' => {
'DEPT' => 'Shipping/Recieving',
'POSITION' => 'Clerk',
'NAME' => 'Norris,Chris,K',
'HIREDATE' => '04-17-2003',
'SALARY' => '9.00 p/h'
},
[code=perl]
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %employees;
open (my $in, 'path/to/employees.txt') or die "$!";
readline <$in>; #skip header
while (<$in>) {
chomp;
my ($id,$name,$hda te,$pos,$dept,$ sal) = split/\t/;
$employees{$id} = {NAME => $name, HIREDATE => $hdate, POSITION => $pos, DEPT => $dept, SALARY => $sal};
}
close $in;
my @sorted = map {$employees{$_->[0]}{NAME} was hired on $employees{$_->[0]}{HIREDATE}}
sort {$a->[1] cmp $b->[1]}
map {my ($m,$d,$y) = split/-/,$employees{$_} {HIREDATE};
[$_, "$y$m$d"] } keys %employees;
print Dumper \@sorted;
[/code]
What we end up with is a list of employees sorted by hire date:
'Smith,John,T was hired on 01-12-1981',
'Hart,Thomas,J was hired on 06-30-1982',
'Kelly,Mark,D was hired on 05-29-1989',
'Jones,Steve,W was hired on 02-23-1990',
'White,Mary,H was hired on 03-15-1990',
'Anderson,Jane, M was hired on 03-23-1992',
'Miles,Frank,R was hired on 12-01-1999',
'Norris,Chris,K was hired on 04-17-2003',
'William,Terry, K was hired on 03-3-2005',
'Roberts,Andy,P was hired on 07-30-2006'
Lets take a closer look at the ST with most of the code stripped away to reveal what's going on.
[code=perl]
my @sorted = map {}
sort {}
map {} keys %employees;
[/code]
That is the bare essence of the ST: a sort block wrapped in two map{} blocks.
What Makes the Schwartzian Transform Fly
What makes the ST fly with so little code and no temporary variables is the ability to chain together list processing functions and Perls support of anonymous data storage.
It's maybe easier to visualize the ST technique linearly. As a flow chart it would look like this with the arrows indicating input/output direction:
Code:
END output <-- map {} <-- sort {} <-- map {} <-- input START;
The Map Function
A map block is similar to a subroutine:
[code=perl]map {
my ($m,$d,$y) = split/-/,$employees{$_} {HIREDATE};
[$_, "$y$m$d"]
} keys %employees;
[/code]
The data is read into the map block from the list appended to the end instead of imported with @_ like a subroutine. The list can be any list: an array, a hash, a (list of things), another function that creates a list, like the grep() and split() functions, a filehandle, or even a subroutine that returns a list.
The map block is also very much like a loop, but not the same as for/foreach/while/until loops. Map will process all the elements of a list, like grep() does. You can't use loop controls like "last" or "next" or "redo" with map.
In our example we used only one comparison to sort the data:
Code:
sort {$a->[1] cmp $b->[1]}
Code:
sort {$a->[1] cmp $b->[1] ||
$b->[2] <=> $a->[2] ||
$a->[3] cmp $b->[3]}
The last map block can also munge the final output however you need it to for your particular application. In our example we just added a little text to make the output more readable. But you can do whatever you want in that map block: omit some of the data, add some markup code (like html code), change the order of the fields, whatever you want.
Why Use a Schwartzian Transform?
The ST is a good option when the sort keys do not actually exist within the data and the sort keys will only be used to sort the data and not for any other purpose. Like in my example we sorted the list by the hire date, but the sort key had to be made on-the-fly by splitting the date and reordering the American style date of MM-DD-YYYY into a sortable ASCII string, YYYYMMDD. The date could have been sorted numerically but sorting it as a string is really easier.
Another good use of the ST is when you have to sort some data just to test some sort keys that you will use in a larger more robust program.
Why Not Use a Schwartzian Transform?
If you have data with keys that are already is a sortable format there is no need to use the ST technique but you still could if you wanted to. You may also consider using a database that can sort data much more efficiently than a Perl script can.
Conclusion
The Schwartzian Transform is a powerful yet convenient method of sorting data. If after reading the article you are still a bit confused, try experimenting with some code to get a feel for how it works. Once you master this technique you will find yourself using it all the time.
Kevin (aka KevinADC)
Online Resources
Perldoc Website All the Perl documentation online.
perlreftut - Mark's very short tutorial about references
perldsc - data structure complex data structure struct
perllol - Manipulating Arrays of Arrays in Perl
map - apply a change to a list to get back a new list with the changes
This article is protected under the Creative Commons License.
Comment