If you work on any kind of backend system, odds are you're familiar with the pain of trawling through files looking for something specific, only to discover huge numbers of entries which are basically all the same thing. Perhaps you have a really cool API that everyone calls all the time, or perhaps you have noisy logging inside a hot chunk of code.

In my own day-to-day, I work with a bunch of messages flowing through a distributed system where the delivery guarantee is treated as "at least once". All of these messages hit disk and are archived as part of a routine backup, both for disaster recovery and for general debuggability. Naturally, the 1..N repetition of any given message means the archive isn't a great deal of use for debugging as-is, because any numbers you derive from it will be skewed all over the place.

In the past, I would use a couple of techniques to remove duplicates from these files before running any kind of aggregation over them. A couple of popular tools for this type of thing are the Unix tools sort and uniq; if you're unfamiliar with these tools, they're pretty much what they sound like. The sort tool will sort data, and accepts a -u flag to remove duplicated values. The uniq tool is similar in that it will remove duplicated values from input, with the caveat that it only operates on consecutive entries (i.e. a,b,a would not remove the second a). For a time (a long time) these tools were enough, but once I hit larger and larger files it became evident that a better solution was needed.
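
To make the difference concrete, here's the a,b,a example from above on the command line:

$ printf 'a\nb\na\n' | uniq      # only consecutive duplicates are removed
a
b
a

$ printf 'a\nb\na\n' | sort -u   # sorted first, so all duplicates are removed
a
b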

Whilst sort -u is super easy to use, the requirement of sorting your data cannot be overlooked, as it means your entire file needs to be buffered into memory at once. This is not always possible, and even when possible it's not always desirable. On the other hand uniq is great, but because it requires sorted input you have to run a sort in advance and hit the same issue. It was due to this that I wrote runiq, which is essentially an optimized and more flexible implementation of both of the aforementioned tools for the purposes of duplicate filtering. It runs much faster, uses much less memory, and does not require sorted input (although it can optimize for this case).
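
Just to set the scene before the numbers, a basic invocation is simply runiq over one or more files (the file names here are placeholders):

$ runiq app-logs.jsonl > app-logs.unique.jsonl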

Comparisons

There are a couple of things to look at when comparing runiq to the other solutions; first how it compares on sorted data, and secondly how it compares on unsorted data. This is essentially comparing it separately to uniq and sort -u when playing to their strengths. Rather than having to choose one way or the other, runiq tries to offer both efficiently (as sometimes you cannot control your data). The ability of runiq to execute on unsorted data means that it's also suitable for streaming data without having to buffer.

Test Data

The following is a bunch of metadata about the test file which will be used as a rough benchmark. This file consists of newline-separated JSON documents and is actually the file which pushed me into writing runiq in the first place.

File size:     3,290,971,321
                   (~3.29GB)
Line count:        5,784,383
Unique count:      2,715,727
Duplicates:        3,068,656

Yes, this is from a production system. I only spend my time with the best data.

Sorted Data

We'll start with sorted data, because it's doubtful that there will be a huge variance here. Instead of using the raw test data, the data for this test is first written to a temporary file using sort, so that duplicates are actually stored consecutively. Results are averaged over 5 runs each. Of course, this is by no means extensive benchmarking - it only serves as a rough guide. The results were as follows:

| Tool  | Flags | Time Taken | Peak Memory |
|-------|-------|------------|-------------|
| uniq  | N/A   | 52.6s      | 1.5MB       |
| runiq | N/A   | 24.8s      | 67.7MB      |

The first thing to note is that runiq is way faster, taking roughly 47% of the time to complete. This is a good start! Unfortunately though, the memory requirement is pretty poor. This is because uniq is optimized for operating on sorted data, whereas runiq optimizes for unsorted data by default, and thus requires more memory.

However, let's not despair! Thanks to the configurable filtering of runiq, we can provide the argument -f sorted to use a filtering system which only checks for consecutive duplicates, much like uniq does. If we run the tests again with this flag, we should see much better memory usage:

| Tool  | Flags     | Time Taken | Peak Memory |
|-------|-----------|------------|-------------|
| uniq  | N/A       | 52.6s      | 1.5MB       |
| runiq | -f sorted | 22.1s      | 800KB       |

Ok, that's more like it. From this we can infer that for the same use case where one would use uniq, runiq takes roughly 42% of the time to complete and requires only 53% of the memory (although memory here is so low that it doesn't really matter anyway). That's great! We cover the uniq case just fine.
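
Roughly speaking, the comparison in this section boils down to commands shaped like the following (file names are placeholders; the pre-sort into a temporary file is the step described above):

$ sort app-logs.jsonl > app-logs.sorted.jsonl                      # pre-sort so duplicates are consecutive
$ uniq app-logs.sorted.jsonl > app-logs.unique.jsonl
$ runiq -f sorted app-logs.sorted.jsonl > app-logs.unique.jsonl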

Unsorted Data

Based on the slight detour we took above, you might already be able to guess what the results here will look like. In addition to testing against sort -u, I am also going to compare against a couple of other tools in this section:

  • neek - a Node.js implementation of a similar tool I made a long time ago
  • uq - a similar tool (also written in Rust) designed to replace sort | uniq

I might add more comparisons in future, but for now these will suffice for the purposes of this post. The results of runiq below represent a number of flags and configuration options, as well as the defaults. Results are as follows (again averaged over multiple runs):

| Tool  | Flags     | Time Taken | Peak Memory |
|-------|-----------|------------|-------------|
| neek  | N/A       | 55.8s      | 313MB       |
| sort  | -u        | 595s       | 9.07GB      |
| uq    | N/A       | 32.3s      | 1.66GB      |
| runiq | -f digest | 25.3s      | 64.6MB      |
| runiq | -f naive  | 32.2s      | 1.62GB      |
| runiq | -f bloom  | 44.5s      | 13MB        |

The results are pretty fantastic (and demonstrate why I felt something better could be done in the first place). The sort -u implementation requires a whopping 9GB of memory to run, and takes 10 minutes to complete. This is beaten by every filter runiq offers, with the default (digest) taking only ~4% of the time to complete and requiring a minuscule 0.7% of the memory. The other tools aren't far off in terms of execution time, but their memory use is still quite high (although not terrible).
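
Again for reference, the sort -u and default runiq invocations measured here look along the lines of the following, with placeholder file names (neek and uq have their own interfaces, so I've omitted them):

$ sort -u app-logs.jsonl > app-logs.unique.jsonl
$ runiq app-logs.jsonl > app-logs.unique.jsonl    # uses the digest filter by default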

Design & Reasoning

The default filtering mechanism for runiq requires little memory as it uses hashes to determine uniqueness, and so only a single usize is stored to represent a unique entry. This is in contrast to other tools where the entry itself is stored, and thus memory usage increases rather drastically. The xxhash64 algorithm is used to generate hashes as it is sufficiently fast whilst staying resistant to collisions (which is why it's safe to use as the default).

Collisions are naturally possible in theory, regardless of likelihood. If you'd rather use raw values for comparisons, you can toggle this using the -f flag with a value of naive. This uses a simple filtering mechanism which skips the hashing and stores the raw values instead. This increases memory usage, and appears to behave much like the filtering used by uq above (around ~1.6GB when tested).

The intent is to add other filters in future based on the required level of accuracy, but at the time of writing the available filters are naive, digest, bloom and sorted, with digest being the default.

Interface

The CLI is very simple for the time being (and was created using clap for those interested).

runiq 0.1.0
An efficient way to filter duplicate lines from input, à la uniq.

USAGE:
    runiq [FLAGS] [OPTIONS] <inputs>...

FLAGS:
    -h, --help          Prints help information
    -i, --invert        Prints duplicates instead of uniques
    -s, --statistics    Prints statistics instead of entries
    -V, --version       Prints version information

OPTIONS:
    -f, --filter <filter>    Filter to use to determine uniqueness

ARGS:
    <inputs>...    Input sources to filter

There are a couple of flags which might be useful; the first is the --invert flag, which flips the output to print only the duplicates rather than the uniques - handy if you're trying to locate the source of duplication. There's also the --statistics flag, which skips printing entries altogether and dumps some general counters to the terminal instead, if you're interested in the overall distribution.
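
As a quick sketch of both (placeholder file name):

$ runiq --invert app-logs.jsonl        # print the duplicates instead of the uniques
$ runiq --statistics app-logs.jsonl    # print counters instead of the entries themselves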

There are several ways to filter uniques, and so runiq offers the aforementioned -f flag to customize which type of filtering is used at runtime. There are currently four filtering types, all of which are discussed above. More will be added in future as appropriate, so feel free to contribute ideas :).
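
Selecting a filter is just a matter of passing its name via -f (placeholder file name again):

$ runiq -f digest app-logs.jsonl    # the default; hash-based
$ runiq -f naive app-logs.jsonl     # stores raw values, no hashing
$ runiq -f bloom app-logs.jsonl     # lowest memory of the unsorted filters in the tests above
$ runiq -f sorted app-logs.jsonl    # consecutive duplicates only, à la uniq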