Quickly Concatenating Files in Amazon S3
Amazon S3 is a storage solution used by pretty much everyone these days. Because of this, there are naturally a bunch of tools for doing almost everything you can think of with S3. This post is about a tool I wrote to concatenate files efficiently in S3 buckets. If you don't care for the post, feel free to skip straight to the repo.
Over the last few years, I've been working closely with Hadoop MapReduce in my day job. A typical Hadoop job writes one part-* file per task that produces output, so you usually end up with a whole bunch of part-r-<number> files. Although there are ways around this, they're typically more complicated and require changing your actual MapReduce code flow (which isn't always possible).
In the last week or so, I've been working with MapReduce jobs using Amazon EMR, which is the AWS managed Hadoop service. The job was nothing particularly special, but it resulted in a whole bunch of output files going into another S3 bucket (this was essentially a re-sharding of some archive data). We had to shard the job across many nodes to improve its throughput, which naturally results in a tonne of outputs.
Rather than write another bunch of MR code to concatenate these files, I was hopeful this was something offered by the Amazon S3 API. Unfortunately this API has never (as far as I know) offered the ability to either append or concatenate files explicitly, so I decided that it was time to write a tool for this. This led to s3-concat, a small CLI tool written in Rust that allows you to remotely concatenate files together using patterns. The tool can be installed via cargo and provides a very simple interface:
    s3-concat 1.0.0
    Isaac Whitfield <firstname.lastname@example.org>
    Concatenate Amazon S3 files remotely using flexible patterns

    USAGE:
        s3-concat [FLAGS] <bucket> <source> <target>

    FLAGS:
        -c, --cleanup    Removes source files after concatenation
        -d, --dry-run    Only print out the calculated writes
        -h, --help       Prints help information
        -q, --quiet      Only prints errors during execution
        -V, --version    Prints version information

    ARGS:
        <bucket>    An S3 bucket prefix to work within
        <source>    A source pattern to use to locate files
        <target>    A target pattern to use to concatenate files into
It can be used in a number of ways (although check the README for further details). The tool supports patterns, and pattern replacement, which makes it surprisingly powerful for file concatenation. It works using the Multipart API for S3, and essentially uses each source file as a part of the upload. This comes with limitations (again, check the README), but allows for efficient concatenation as you don't even have to download the files - it all runs within S3 itself.
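To make the multipart approach a little more concrete, here is a minimal Python sketch of the same idea. It is not s3-concat's actual code (the tool itself is written in Rust); it assumes `s3` is a boto3 S3 client (or anything exposing the same four methods), and the function name `concat_objects` is purely illustrative.

```python
def concat_objects(s3, bucket, source_keys, target_key):
    """Concatenate source_keys into target_key without downloading any data.

    Assumes `s3` behaves like a boto3 S3 client. Hypothetical helper, not
    part of s3-concat.
    """
    upload = s3.create_multipart_upload(Bucket=bucket, Key=target_key)
    upload_id = upload["UploadId"]
    parts = []
    try:
        # Part numbers start at 1 and decide the order of the final object.
        for number, key in enumerate(source_keys, start=1):
            copied = s3.upload_part_copy(
                Bucket=bucket,
                Key=target_key,
                UploadId=upload_id,
                PartNumber=number,
                CopySource={"Bucket": bucket, "Key": key},
            )
            parts.append(
                {"ETag": copied["CopyPartResult"]["ETag"], "PartNumber": number}
            )
        # Completing the upload stitches the copied parts into one object.
        s3.complete_multipart_upload(
            Bucket=bucket,
            Key=target_key,
            UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so S3 does not keep the already-copied parts around.
        s3.abort_multipart_upload(
            Bucket=bucket, Key=target_key, UploadId=upload_id
        )
        raise
    return parts
```

With boto3 installed you would pass in `boto3.client("s3")` as the first argument. One limitation that falls straight out of this design: S3 requires every part except the last to be at least 5 MiB, and a single upload can hold at most 10,000 parts, so lots of tiny source files cannot be stitched together this way without extra handling.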
It quickly proved useful (for me, at least), as I could concatenate all of my part files together very easily. The command below concatenates every file in the directory matching part-* into a single file named aggregated-output. Because the -c flag is provided, the source files are also cleaned up after a successful concatenation.
s3-concat -c my.bucket.name/my-job-output 'part-*' 'aggregated-output'
However, there are cases where you can't concatenate everything into a single target. You might instead want to group the output based on the input file name, such as by the first digit of the part number in the MapReduce case. Fortunately this is also supported via pattern replacement; the following example matches the first digit of the part file, and then uses the captured group in the target path.
s3-concat -c my.bucket.name/my-job-output 'part-r-(\d).*' 'aggregated-part-$1'
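If the capture replacement is hard to picture, the grouping step can be sketched in a few lines of Python. This is only an illustration of the idea, not how s3-concat is implemented: the function name `plan_concatenation` is made up, the unanchored matching and sorted ordering are assumptions, and because Python's `re` module writes group references as `\g<1>` rather than `$1`, the sketch translates the placeholders first.

```python
import re
from collections import defaultdict


def plan_concatenation(keys, source_pattern, target_pattern):
    """Map each target name to the list of source keys that feed into it."""
    source = re.compile(source_pattern)
    # Translate $1-style placeholders into Python's \g<1> template syntax.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", target_pattern)
    plan = defaultdict(list)
    # Assumption: sources are concatenated in sorted key order.
    for key in sorted(keys):
        # Assumption: unanchored search; the real tool may anchor differently.
        match = source.search(key)
        if match:
            plan[match.expand(template)].append(key)
    return dict(plan)


keys = ["part-r-00000", "part-r-00001", "part-r-10000", "_SUCCESS"]
plan = plan_concatenation(keys, r"part-r-(\d).*", "aggregated-part-$1")
for target, sources in plan.items():
    print(target, "<-", sources)
```

Keys that don't match the source pattern (such as the `_SUCCESS` marker here) are simply left out of the plan, and each distinct captured digit ends up as its own target.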
The s3-concat command above would result in up to 10 files (one for each leading digit that actually appears), named aggregated-part- followed by the first digit of the parts they represent. Of course this is a simple example, but you can use any pattern supported by the official regex engine, and create complex paths using the capture replacement. The tool will keep you informed of what's happening, and can be run as a dry-run via the -d flag (which only tells you what would be concatenated; nothing is written). It's extremely simple, but works well. Feel free
to try it out and let me know any thoughts or feedback. You can install it via cargo, and it can be