Easily Analyzing Your S3 Buckets
Amazon S3 is a storage solution used by pretty much everyone these days, so naturally there are a bunch of tools for doing almost everything you can think of with S3. This post is about a tool I wrote to retrieve metadata about S3 buckets. If you don't care for the post, feel free to skip straight to the repo.
In the background recently, I've been working on something related to compression with a friend at work. I figured that one of the best ways to test it would be to just grab one of the largest files in our S3 setup, compress it, and compare the results against other tools like gzip. Of course, it turns out this is easier said than done, because there's no easy way to search for a file like that in S3.
The S3 bucket in question is >10TB, so going through it by hand was not an option either. It's possible to do something similar (I think?) using the AWS CLI, but I gave up on that after sitting for 10 minutes without a result. That's why I decided to write a tool for exactly this kind of thing. Well, that and the fact that I was looking for an excuse to write more Rust.
All I really wanted were the superlatives of S3 (which, now I come to think of it, would have been an awesome name!): pieces of information like the largest file, the latest file, and so on. This is all pretty standard stuff, and can actually be surfaced via a single AWS API call, ListObjectsV2.
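To make the idea concrete, here's a rough Python sketch of the kind of rollup involved. This is not the tool's actual code (which is written in Rust), and the dicts below just mimic the shape of entries in a ListObjectsV2 response; in real use the objects would come from paginated API calls (e.g. via boto3).

```python
# Hypothetical sketch: computing bucket "superlatives" from an object
# listing. Each object dict mimics an entry from a ListObjectsV2
# response ("Key", "Size", "LastModified"); this is illustrative only.

def summarize(objects):
    """Roll an object listing up into a few simple superlatives."""
    objects = list(objects)
    return {
        "total_files": len(objects),
        "total_bytes": sum(o["Size"] for o in objects),
        "largest_file_name": max(objects, key=lambda o: o["Size"])["Key"],
        "smallest_file_name": min(objects, key=lambda o: o["Size"])["Key"],
        # ISO 8601 timestamps sort correctly as plain strings
        "earliest_file_name": min(objects, key=lambda o: o["LastModified"])["Key"],
        "latest_file_name": max(objects, key=lambda o: o["LastModified"])["Key"],
    }
```

The interesting part is that all of these fields come from metadata the listing API already returns, so no object content ever needs to be downloaded.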
Skip ahead a few hours and I had something that pretty much satisfied what I needed; it currently generates a report something like this:
```
[general]
total_time=7s
total_space=1.94TB
total_files=51,152

[file_size]
average_file_size=37.95MB
average_file_bytes=37949529
largest_file_size=1.82GB
largest_file_bytes=1818900684
largest_file_name=path/to/my_largest_file.txt.gz
smallest_file_size=54B
smallest_file_bytes=54
smallest_file_name=path/to/my_smallest_file.txt.gz
smallest_file_others=12

[extensions]
unique_extensions=1
most_frequent_extension=gz

[modification]
earliest_file_date=2016-06-11T17:36:57.000Z
earliest_file_name=path/to/my_earliest_file.txt.gz
earliest_file_others=3
latest_file_date=2017-01-01T00:03:19.000Z
latest_file_name=path/to/my_latest_file.txt.gz
```
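Because the report is just flat `key=value` lines grouped under `[section]` headers, it's easy to consume programmatically as well as by eye. As a hedged sketch (the section and field names are simply those from the sample output above):

```python
# Sketch: parsing the flat [section]/key=value report format into
# nested dicts. Values are kept as strings, since some (e.g. "1.94TB",
# "51,152") are human-formatted rather than plain numbers.

def parse_report(text):
    sections, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # a [section] header starts a new group of keys
            current = sections.setdefault(line[1:-1], {})
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[key] = value
    return sections
```

The same flatness is what makes the shell-pipe use case work: a `grep`/`cut` pipeline can pull out any single field without a parser at all.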
There's not much to it; you get a quick meta rollup of a provided bucket, which is great, and it doesn't take (too) long. It's designed to be easily readable and consumable via shell (i.e. if you pipe it, you can do stuff with most of it). You can also provide a key prefix (e.g. a directory) to only look at a subsection of a bucket, rather than the entire thing. I plan to keep extending the metadata as it becomes available via the API; I don't have much interest in calling various APIs and stitching things together (yet). Plans for the future also include figuring out how to speed it up a little, although the ListObjectsV2 API uses a cursor token, so you can't simply divide and conquer the listing.
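That pagination constraint can be sketched like this. `FakeS3` below is a hypothetical stand-in for a real client (with boto3 you'd call `list_objects_v2` on a real client and thread `NextContinuationToken` through in the same way); the point is that each page's cursor is only known once the previous page has been fetched, so the listing is inherently sequential.

```python
# Sketch of cursor-based pagination: each page returns an opaque
# continuation token, and the next request needs that token, so pages
# must be fetched one after another. FakeS3 is a hypothetical stand-in
# for a real S3 client.

class FakeS3:
    def __init__(self, keys, page_size=2):
        self._keys = keys
        self._page = page_size

    def list_objects_v2(self, ContinuationToken=None):
        start = int(ContinuationToken or 0)
        end = start + self._page
        resp = {"Contents": [{"Key": k} for k in self._keys[start:end]]}
        if end < len(self._keys):
            # the cursor for the *next* page; opaque to the caller
            resp["NextContinuationToken"] = str(end)
        return resp

def list_all_keys(client):
    keys, token = [], None
    while True:
        resp = client.list_objects_v2(ContinuationToken=token)
        keys.extend(o["Key"] for o in resp.get("Contents", []))
        token = resp.get("NextContinuationToken")
        if token is None:  # no cursor means we've seen the last page
            return keys
```

If the API instead let you list by offset or key range up front, the pages could be fetched in parallel; the opaque cursor is exactly what rules that out.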
The tool is called s3-meta! Feel free to try it out and let me know any thoughts or feedback. You can install it via cargo, and the source can be found in the repo linked above.