Amazon S3 is a storage solution used by pretty much everyone these days, so naturally there are tools for doing almost everything you can think of with it. This post is about a tool I wrote to retrieve metadata about S3 buckets. If you don't care for the post, feel free to skip straight to the repo.

In the background recently, I've been working on something compression-related with a friend at work. I figured one of the best ways to test it would be to grab one of the largest files in our S3 setup, compress it, and compare the results against other tools like gzip. It turns out this is easier said than done, because S3 offers no easy way to search for a file by size.

The S3 bucket in question is over 10TB, so going through it by hand wasn't an option either. It's possible to do something like this (I think?) using the AWS CLI, but I gave up on that after sitting for 10 minutes without a result. So I decided to write a tool for exactly this kind of thing. Well, that, and I was looking for an excuse to write more Rust.

All I really wanted was the superlatives of S3 (which, now that I come to think of it, would have been an awesome name!): pieces of information like the largest file, the latest file, and so on. This is all pretty standard stuff, and it can all be surfaced by a single AWS API, ListObjectsV2. Skip ahead a few hours and I had something that pretty much did what I needed, which currently generates a report something like this:

[general]
total_time=7s
total_space=1.94TB
total_files=51,152

[file_size]
average_file_size=37.95MB
average_file_bytes=37949529
largest_file_size=1.82GB
largest_file_bytes=1818900684
largest_file_name=path/to/my_largest_file.txt.gz
smallest_file_size=54B
smallest_file_bytes=54
smallest_file_name=path/to/my_smallest_file.txt.gz
smallest_file_others=12

[extensions]
unique_extensions=1
most_frequent_extension=gz

[modification]
earliest_file_date=2016-06-11T17:36:57.000Z
earliest_file_name=path/to/my_earliest_file.txt.gz
earliest_file_others=3
latest_file_date=2017-01-01T00:03:19.000Z
latest_file_name=path/to/my_latest_file.txt.gz

There's not much to it; you get a quick metadata rollup of a given bucket, which is great, and it doesn't take (too) long. It's designed to be easily readable and consumable via the shell (e.g. if you pipe it somewhere, you can do stuff with most of it). You can also provide a key prefix (e.g. a directory) to look at only a subsection of a bucket, rather than the entire thing. I plan to keep extending the metadata as it becomes available via the API; I don't have much interest in calling various APIs and stitching things together (yet). Plans for the future also include figuring out how to speed it up a little (the ListObjectsV2 API uses a continuation token, so you can't divide and conquer easily).
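
For the curious, here's a rough sketch of what that single-API approach looks like. This isn't the actual s3-meta code; it's a minimal example assuming the official aws-sdk-s3 crate (plus aws-config and tokio), with a placeholder bucket name, but it shows the continuation-token loop and the kind of superlative tracking involved:

use aws_sdk_s3::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Credentials and region come from the usual environment/config chain.
    let config = aws_config::load_from_env().await;
    let client = Client::new(&config);

    let mut continuation: Option<String> = None;
    let mut total_files: u64 = 0;
    let mut total_bytes: i64 = 0;
    let mut largest: Option<(String, i64)> = None;

    // ListObjectsV2 returns at most 1,000 keys per page, so we have to walk
    // the continuation token until the listing is exhausted.
    loop {
        let page = client
            .list_objects_v2()
            .bucket("my-bucket") // placeholder bucket name
            // .prefix("some/key/prefix/") // optionally restrict to a subsection
            .set_continuation_token(continuation.take())
            .send()
            .await?;

        for object in page.contents() {
            let size = object.size().unwrap_or(0);
            total_files += 1;
            total_bytes += size;

            // Track the largest object seen so far; the other superlatives
            // (smallest, earliest, latest, extension counts) follow the same pattern.
            if largest.as_ref().map_or(true, |(_, s)| size > *s) {
                if let Some(key) = object.key() {
                    largest = Some((key.to_string(), size));
                }
            }
        }

        match page.next_continuation_token() {
            Some(token) => continuation = Some(token.to_string()),
            None => break,
        }
    }

    println!("total_files={}", total_files);
    println!("total_space_bytes={}", total_bytes);
    if let Some((name, bytes)) = largest {
        println!("largest_file_name={}", name);
        println!("largest_file_bytes={}", bytes);
    }

    Ok(())
}

Because each page's continuation token only comes back in the previous response, the pages have to be fetched one after another, which is exactly why the speed-up mentioned above isn't straightforward.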

Anyway, that's s3-meta! Feel free to try it out and let me know any thoughts or feedback. You can install it via cargo, and it can be found here!