“My Neckbeard Grew Three Sizes That Day” or How I Beat a GNU tool with Perl

(Today we have another guest post from security expert Scott Pack!)

I spend a lot of time doing text-based data processing. A *lot* of time. During an analysis, I often want to do things like look at ‘Top Talkers’, ‘Most Frequent Visitors’, or really anything that comprises a list of unique identifiers sorted by count. As a result, I’ve translated two actions into a pair of pipelines:

  1. What’s the count of events per thingy: | sort | uniq -c | sort -n
  2. Who has been doing whatever: | sort -u

This tends to work pretty well in most cases. Today, however, was not one of those cases. While attempting to get a list of unique MACs, I started out with a source (i.e. non-uniqued) 16GB text file with one MAC per line. This is where things got annoying. Muscle memory kicked in, and since this matched Action #2, I ran the following command: cat macs_all.txt | sort -u > macs_unique.txt

I expected it to take a few minutes, so I went back to the other things I was doing and let it go. I checked back 15 minutes later, and it was still running. Waited 5 minutes…still running. When the command had been running for 45 minutes, I got fed up and decided that I could do better. Perl, being my go-to tool, came to the rescue in the form of hashes. I won’t go into gritty detail, but a Perl hash is a data structure that consists of a list of key/value pairs. Whenever you assign a value to a key, it will add an entry for the key if it doesn’t exist, or update the value if it does. Since a key cannot appear in the same hash multiple times, it makes for a pretty good hack to generate a unique list. This is what I ended up doing:

#!/usr/bin/perl -w
use strict;

# Store each MAC as a hash key; duplicate keys simply overwrite each other,
# so the hash ends up holding one entry per unique line.
my %unique;
while ( my $line = <STDIN> ) {
    chomp $line;
    next unless $line;
    $unique{$line} = '';
}

# Print each unique line once (in whatever order the hash hands them back).
for my $key ( keys %unique ) {
    print "$key\n";
}

This worked significantly better for me. The output was not sorted, but that’s fine, I didn’t need it sorted, only unique. The timing information looked a lot better too.

packs@node1:~> time cat macs_all.txt | sort -u > macs_unique.txt
real    181m12.417s
user    176m13.926s
sys     1m42.335s
packs@node1:~> time cat macs_all.txt | ./fast_uniq.pl > macs_fast_uniqed.txt
real    8m9.074s
user    7m28.176s
sys     0m46.271s

The times can’t really be directly compared, since the output from fast_uniq.pl isn’t actually sorted. Given the pretty substantial difference, though, I think we can reasonably accept that fast_uniq.pl is better for this use case. After seeing this, I’m tempted to add some functionality so I stop using both sort and uniq entirely.
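
For what it’s worth, here’s a rough sketch of what that added functionality might look like: the same hash trick, but with the value used as a counter and the (much smaller) set of unique keys sorted by count at the end, roughly what Action #1’s sort | uniq -c | sort -n gives you. The script name and the assumption that input arrives on STDIN are just placeholders:

#!/usr/bin/perl -w
# count_uniq.pl -- sketch of a hash-based stand-in for "sort | uniq -c | sort -n"
use strict;

my %count;
while ( my $line = <STDIN> ) {
    chomp $line;
    next unless $line;
    $count{$line}++;    # the hash value doubles as an occurrence counter
}

# Only the unique keys need sorting, a far smaller set than the raw input.
for my $key ( sort { $count{$a} <=> $count{$b} } keys %count ) {
    printf "%8d %s\n", $count{$key}, $key;
}

It would get invoked the same way: cat macs_all.txt | ./count_uniq.pl > macs_counted.txt (or, heeding the comments below, ./count_uniq.pl < macs_all.txt).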

I’m interested to hear whether anyone else has done something similar, or to have you explain to me how much my code sucks.


  1. pauska

    May 14, 2012 at 2:58 am

    A 16GB flat file with -unique- MAC addresses? That’s the biggest L2 network I’ve ever heard of! Were these spoofed?


  2. kristoffer

    May 14, 2012 at 3:57 am

    I wonder how fast it would have been had you increased the buffer size of sort (the -S flag) and removed the “cat” command from the pipe.
    Something like this:

    sort -S 1G -u macs_all.txt


  3. russm

    May 14, 2012 at 4:36 am

    perl -ne 'print unless $seen{$_}++' macs_all.txt


  4. russm

    May 14, 2012 at 4:44 am

    or if you want sorted output with counts,

    perl -ne 'chomp; $seen{$_}++; END {foreach $mac (sort keys %seen) {print "$seen{$mac} $mac\n"}}' macs_all.txt


  5. ScottPack

    May 14, 2012 at 5:17 am

    Sorry for the lack of clarity! That’s 16GB *un*uniqued, Pauska. Once I got it uniqued the file was only a few MB.


  6. russm

    May 14, 2012 at 5:45 am

    kristoffer – that certainly is a useless use of cat, but the cat process is just going to be shuffling data from fd3 to fd1 so it’s not going to be using much CPU compared to managing the hash table in perl…


    • kristoffer

      May 14, 2012 at 5:52 am

      True, but the orders of magnitude larger buffer would speed sort up a great deal.


  7. Barry Morrison (@esacteksab)

    May 14, 2012 at 5:50 am

    Scott (@vassago) and I had a similar venture in shell vs. Perl. http://codetwit.com/interview-questions/2012/04/iq_renaming_files_en_masse/

    Posts like these make me further aware that I should learn Perl.


  8. Twirrim

    May 14, 2012 at 9:47 am

    No particular observations on that sort situation, but I would like to mention the utility pipe-viewer, or pv (http://www.ivarch.com/programs/pv.shtml). It’s a drop-in replacement for cat that will actually provide you with a progress bar. In your case:

    pv macs_all.txt | sort -u > macs_unique.txt

    That at least gives you some kind of idea how things are going. I find the tool invaluable!


    • Wesley David

      May 14, 2012 at 1:19 pm

      The -L option is also great for throttling I/O or for sending a stream across a network.


  9. Jason Healy

    May 14, 2012 at 6:28 pm

    A classic memory-vs-efficiency tradeoff at work (my data structures students would love this). As long as you’ve got the RAM to hold all the uniq’d values (plus hash overhead), this is a speedy implementation. Note that sort is built to work on arbitrarily-sized data (files especially), so it trades speed for memory efficiency, ensuring that even large files can be sorted with little RAM.

    So long as the working (unique) set fits comfortably in memory, this is indeed faster than a full sort followed by unique.

    Note that if Perl hashes used a tree structure (O(log n) insertion time instead of O(1)), you’d get the unique results in sorted order. However, if the number of unique items is small compared to the input size, it’s probably best to hash first and then sort if necessary.


  10. Jesus

    May 22, 2012 at 12:29 am

    Is your second time command much quicker because some of the data is cached in memory, or did you do something like this in between?

    sync ; echo 3 > /proc/sys/vm/drop_caches

