Whitfin's Blog: Neek - A faster way to remove duplicates in files

You may be aware of the *nix tool uniq, which removes duplicated lines from a file with a single command. The downside to uniq is that it only compares adjacent lines, so the input has to be sorted first, which can be a huge drawback and sometimes isn't even an option. Due to this, many people prefer another *nix tool: sort. With the -u flag, sort can filter out duplicated lines without the file being sorted in advance, but it does this by sorting the text as part of the removal, so your output will be sorted upon completion. Another downside is that sort has to consume the entire input (buffering it in memory or in temporary files) before it can write any output, and the sort itself makes it an exceptionally slow way to go. Because of this, I decided to make a new, faster and more efficient tool for removing duplicate lines from text, which is where Neek comes in.
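To make the sorting requirement concrete, here is a minimal JavaScript sketch of adjacent-only comparison, which is the behaviour that makes uniq depend on sorted input. The helper name adjacentUnique is made up for illustration; it is not part of uniq or Neek.

```javascript
// Illustrative only: mimics comparing each line with the previous kept line.
// `adjacentUnique` is a hypothetical helper, not part of uniq or Neek.
function adjacentUnique(lines) {
  const out = [];
  for (const line of lines) {
    // Keep the line only if it differs from the last line we kept.
    if (out.length === 0 || out[out.length - 1] !== line) {
      out.push(line);
    }
  }
  return out;
}

const unsorted = ['b', 'a', 'b', 'a'];
console.log(adjacentUnique(unsorted));             // → [ 'b', 'a', 'b', 'a' ]
console.log(adjacentUnique([...unsorted].sort())); // → [ 'a', 'b' ]
```

On the unsorted input no two equal lines are neighbours, so nothing is removed; only after sorting do the duplicates collapse.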

Neek is a NodeJS command line tool which filters out duplicate lines from text the same way that sort and uniq would, except faster and without the need for sorted text. It does this by hashing each line of input and checking the hash of every subsequent line against those it has already seen. Several hashing algorithms are supported (defaulting to SHA-1). You can configure this to fit your situation, although SHA-1 will probably always be sufficient. Neek also takes advantage of NodeJS streams, meaning that there's no need to buffer the entire file into memory, and output flows while input is still being read. Neek is extremely simple to use, and in most (if not all) cases you can swap it in any place you would use uniq.
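The general idea of hashing lines as they stream past can be sketched in a few lines of NodeJS. This is not Neek's actual source, just a rough illustration under my own assumptions: the uniqueLines helper and the sample input are hypothetical, and only the built-in crypto, readline and stream modules are used.

```javascript
const crypto = require('crypto');
const readline = require('readline');
const { Readable } = require('stream');

// Hypothetical helper (not Neek's API): yields each line the first time
// it is seen, in the original order, without sorting anything.
async function* uniqueLines(input, algorithm = 'sha1') {
  const seen = new Set(); // digests of the lines emitted so far
  const rl = readline.createInterface({ input, crlfDelay: Infinity });
  for await (const line of rl) {
    const digest = crypto.createHash(algorithm).update(line).digest('hex');
    if (!seen.has(digest)) {
      seen.add(digest);
      yield line; // emitted immediately, while later input is still unread
    }
  }
}

async function main() {
  // Sample input arriving in two chunks, standing in for a file or stdin.
  const input = Readable.from(['b\na\nb\n', 'c\na\n']);
  for await (const line of uniqueLines(input)) {
    console.log(line); // prints b, a, c on separate lines
  }
}

main();
```

Passing process.stdin as the input would filter piped text in the same way. Note that the set of digests still grows with the number of unique lines, so memory use depends on how many distinct lines there are rather than on the total size of the file.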

Examples

$ cat my_file.txt | neek > my_new_file.txt
$ cat my_file.txt | neek -o my_new_file.txt
$ neek -i my_file.txt -o my_new_file.txt -a MD5

Just for comparison, I took some benchmarks against both uniq and sort on a reasonably large dataset:

293MB
576,905 total lines
322,392 unique lines

For reference, the contents of this file were a series of JSON documents separated by newlines. When using uniq, we can see that filtering takes roughly 34s:

$ time uniq test-set.txt

# output

real    0m33.951s
user    0m27.086s
sys     0m2.161s

Of course, the above requires that the file is already sorted. In the unfortunate case that the file isn't sorted, you'd have to fall back to sort -u. This works for both cases, but we can see it comes at a much higher time cost:

$ time sort -u test-set.txt

# output

real    1m39.203s
user    1m32.484s
sys     0m1.518s

Neek doesn't require sorting, so we can compare its run time directly against both of the above to see how it fares:

$ time bin/neek --input test-set.txt

# output

real    0m16.354s
user    0m13.733s
sys     0m2.217s

As you can see, Neek takes roughly 52% less time to run than uniq (16.4s against 34.0s) and about 84% less than sort -u (16.4s against 99.2s), meaning it's invaluable for larger files! Awesome!

So how do I get it? Neek is extremely easy to install. Assuming you have NodeJS and NPM installed, just run the following command to install the latest version from NPM:

$ npm install -g neek

Neek is pretty thoroughly tested, but if you find any issues feel free to leave a comment here or create an issue on GitHub. It's also usable from inside NodeJS alongside streams, and if you wish to use it in that way, please refer to the README.md in the repo.