You may be aware of the *nix tool uniq, used to filter duplicate lines out of a file with a single command. The downside to uniq is that it only removes adjacent duplicates, so the input has to be sorted first; that can be a serious drawback, and sometimes isn't even an option. Because of this, many people prefer another *nix tool: sort. With the -u flag, sort can filter out duplicated text without the file being sorted in advance; however, it does this by sorting the text as part of the removal, so your output will be sorted upon completion. Another downside is that the entire input has to be read in before any output is produced, and the need for a sort makes it an exceptionally slow way to go. Because of this, I decided to make a new, faster and more efficient tool for removing duplicate lines from text, which is where Neek comes in.
Neek is a NodeJS command line tool which filters out duplicate lines from text the same way that sort and uniq would, except faster and without the need for sorted text. It does this by hashing each line as it is read and checking every subsequent line against the hashes already seen. Several hashing methods are supported (defaulting to SHA-1). You can configure this to fit your situation, although SHA-1 will probably always be sufficient. Neek also takes advantage of NodeJS streams, meaning that there's no need to buffer the entire file into memory, and output flows while input is still being read in. Neek is extremely simple to use, and in most (if not all) cases you can swap it in any place you would use uniq.
$ cat my_file.txt | neek > my_new_file.txt
$ cat my_file.txt | neek -o my_new_file.txt
$ neek -i my_file.txt -o my_new_file.txt -a MD5
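To make the hashing idea concrete, here is a minimal sketch of the general technique in NodeJS. It is not Neek's actual source: the names createSeenFilter and dedupeStream are made up for illustration, and it only assumes the built-in crypto and readline modules. Each line is hashed, the digest is looked up in a set, and the line is written out only the first time its digest appears, so the original order is kept and only digests are held in memory.

```javascript
const crypto = require('crypto');
const readline = require('readline');

// Hypothetical helper (not part of Neek): returns a predicate that is true
// only the first time a given line is seen. Only the digest of each line
// is stored, never the line itself.
function createSeenFilter(algorithm = 'sha1') {
  const seen = new Set();
  return function isFirstOccurrence(line) {
    const digest = crypto.createHash(algorithm).update(line).digest('hex');
    if (seen.has(digest)) {
      return false;
    }
    seen.add(digest);
    return true;
  };
}

// Hypothetical helper (not part of Neek): reads `input` line by line and
// writes each distinct line to `output` once, in its original order.
// Backpressure on `output` is ignored to keep the sketch short.
function dedupeStream(input, output, algorithm = 'sha1') {
  const isFirstOccurrence = createSeenFilter(algorithm);
  const rl = readline.createInterface({ input, crlfDelay: Infinity });
  rl.on('line', (line) => {
    if (isFirstOccurrence(line)) {
      output.write(line + '\n');
    }
  });
  // Resolves once the input has been fully consumed.
  return new Promise((resolve) => rl.on('close', resolve));
}
```

Calling dedupeStream(process.stdin, process.stdout) from a script would give a rough stand-in for the piped usage above. Because lines are written as soon as they are seen for the first time, output starts flowing before the input has been read to the end.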
Just for comparison, I took some benchmarks against both uniq and sort on a large dataset. Below are the results:
576,905 total lines
322,392 unique lines
$ time uniq test-set.txt
# output
real 0m33.951s
user 0m27.086s
sys 0m2.161s
Of course, the above only gives the right result if the file is already sorted, because uniq only collapses adjacent duplicates. In the unfortunate case that the file isn't sorted, you'd have to fall back to sort -u.
$ time sort -u test-set.txt
# output
real 1m39.203s
user 1m32.484s
sys 0m1.518s
As you can see, this is extremely inefficient for our needs. So now let's take a look at how Neek fares.
$ time bin/neek --input test-set.txt
# output
real 0m16.354s
user 0m13.733s
sys 0m2.217s
As you can see, on this dataset Neek finishes in roughly 52% less wall-clock time than uniq and about 84% less than sort -u, which makes it invaluable for larger files.
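For anyone who wants to check the comparison, the percentages can be derived from the "real" times reported above. A small sketch follows; the helper name percentLessTime is only for illustration.

```javascript
// Wall-clock ("real") times from the benchmark runs above, in seconds.
const times = {
  uniq: 33.951,
  sort: 60 + 39.203, // 1m39.203s
  neek: 16.354,
};

// Percentage of the baseline's run time that the candidate saves.
function percentLessTime(baseline, candidate) {
  return (1 - candidate / baseline) * 100;
}

console.log(percentLessTime(times.uniq, times.neek).toFixed(1)); // 51.8
console.log(percentLessTime(times.sort, times.neek).toFixed(1)); // 83.5
```

These are single runs on one dataset, so treat the figures as a rough indication only.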
Awesome! So how do I get it?
Neek is extremely easy to install. Assuming you have NodeJS and NPM installed, just run the following command to install the latest version from NPM:
$ npm install -g neek
Neek is pretty thoroughly tested, but if you find any issues, feel free to leave a comment here or create an issue on GitHub. It's also usable from inside NodeJS alongside streams; if you wish to use it that way, please refer to the README.md in the repo.