Whitfin's Blog: Quickly Concatenating Files in Amazon S3

Amazon S3 is a storage solution used by pretty much everyone these days. Because of this, there are naturally a bunch of tools for doing almost everything you can think of with S3. This post is about a tool I wrote to concatenate files efficiently in S3 buckets. If you don't care for the post, feel free to skip straight to the repo.

Over the last few years, I've been working closely with Hadoop MapReduce in my day job. A typical Hadoop job will output a part-* file based on the task writing the result - so you'd usually end up with a whole bunch of part-r-<number> files. Although there are ways around this, they're typically more complicated and require changing your actual MapReduce code flow (which isn't always possible).

In the last week or so, I've been working with MapReduce jobs using Amazon EMR, which is the AWS managed Hadoop service. The job was nothing particularly special, but it resulted in a whole bunch of output files going into another S3 bucket (this was essentially a re-sharding of some archive data). We had to shard the job across many nodes to improve its throughput, and that naturally resulted in a tonne of outputs.

Rather than write another bunch of MR code to concatenate these files, I was hopeful this was something offered by the Amazon S3 API. Unfortunately this API has never (as far as I know) offered the ability to either append or concatenate files explicitly, so I decided that it was time to write a tool for this. This led to s3-concat, a small CLI tool written in Rust that allows you to remotely concatenate files together using patterns. This tool can be installed via cargo, and provides a very simple interface:

s3-concat 1.0.0
Isaac Whitfield <[email protected]>
Concatenate Amazon S3 files remotely using flexible patterns

USAGE:
    s3-concat [FLAGS] <bucket> <source> <target>

FLAGS:
    -c, --cleanup    Removes source files after concatenation
    -d, --dry-run    Only print out the calculated writes
    -h, --help       Prints help information
    -q, --quiet      Only prints errors during execution
    -V, --version    Prints version information

ARGS:
    <bucket>    An S3 bucket prefix to work within
    <source>    A source pattern to use to locate files
    <target>    A target pattern to use to concatenate files into

It can be used in a number of ways (although check the README for further details). The tool supports patterns, and pattern replacement, which makes it surprisingly powerful for file concatenation. It works using the Multipart API for S3, and essentially uses each source file as a part of the upload. This comes with limitations (again, check the README), but allows for efficient concatenation as you don't even have to download the files - it all runs within S3 itself.
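To make those limitations a little more concrete, here is a minimal sketch of the kind of check a one-source-per-part approach implies. It is not taken from s3-concat itself; the Source struct and plan_parts function are hypothetical names used only for illustration. The two limits are the ones S3 documents for multipart uploads: every part except the last must be at least 5 MiB, and a single upload can have at most 10,000 parts.

```rust
// S3 multipart limits as documented by AWS: every part except the last
// must be at least 5 MiB, and one upload may contain at most 10,000 parts.
const MIN_PART_SIZE: u64 = 5 * 1024 * 1024;
const MAX_PARTS: usize = 10_000;

/// Hypothetical description of one source object (key and size in bytes).
#[derive(Debug, Clone, PartialEq)]
struct Source {
    key: String,
    size: u64,
}

/// Checks whether each source could become one part of a single multipart
/// upload, in the order given. Returns 1-based part numbers paired with
/// keys, or a message describing the first problem found.
fn plan_parts(sources: &[Source]) -> Result<Vec<(usize, String)>, String> {
    if sources.is_empty() {
        return Err("no source files matched".to_string());
    }
    if sources.len() > MAX_PARTS {
        return Err(format!(
            "{} sources exceeds the {} part limit",
            sources.len(),
            MAX_PARTS
        ));
    }
    let last = sources.len() - 1;
    for (i, source) in sources.iter().enumerate() {
        // Only the final part is allowed to be smaller than the minimum.
        if i != last && source.size < MIN_PART_SIZE {
            return Err(format!(
                "{} is {} bytes, below the 5 MiB minimum for a non-final part",
                source.key, source.size
            ));
        }
    }
    Ok(sources
        .iter()
        .enumerate()
        .map(|(i, source)| (i + 1, source.key.clone()))
        .collect())
}

fn main() {
    // Made-up sizes, purely for illustration.
    let sources = vec![
        Source { key: "part-r-00000".to_string(), size: 8 * 1024 * 1024 },
        Source { key: "part-r-00001".to_string(), size: 6 * 1024 * 1024 },
        Source { key: "part-r-00002".to_string(), size: 1024 },
    ];
    match plan_parts(&sources) {
        Ok(plan) => {
            for (number, key) in plan {
                println!("part {}: {}", number, key);
            }
        }
        Err(message) => eprintln!("cannot concatenate: {}", message),
    }
}
```

A real implementation would then issue one server-side copy request per planned part, which is what keeps the data inside S3. The sketch stops at the planning step so it stays runnable without any AWS dependencies.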

It quickly proved useful (for me, at least) as I could concatenate all of my part-* files together very easily. The command below would concatenate all files in the directory s3://my.bucket.name/my-job-output/ matching part-* into a single file named aggregated-output. Because the -c flag is provided, this will also clean up the source files after the successful concatenation.

s3-concat -c my.bucket.name/my-job-output 'part-*' 'aggregated-output'

However, there are cases where you can't explicitly concatenate everything into a single target. You might elect to concatenate based on the input file name; for example, by the first digit of the part number in the MapReduce case. Fortunately this is also supported via pattern replacement; the following example matches the first digit of the part file, and then uses the captured group in the target path.

s3-concat -c my.bucket.name/my-job-output 'part-r-(\d).*' 'aggregated-part-$1'

This would result in up to 10 files (one per leading digit that actually appears), each named aggregated-part- + the first digit of the parts it represents. Of course this is a simple example, but you can use any pattern supported by the official engine, and create complex paths using the capture replacement. The tool will keep you informed of what's happening, and can be used as a dry-run via the -d flag (which would only tell you what will be concatenated - nothing will be written). It's extremely simple, but works well. Feel free to try it out and let me know any thoughts or feedback. You can install it via cargo, and it can be found here!
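For anyone curious how the digit example groups its inputs, here is a rough sketch of the idea using only the Rust standard library. It is an illustration rather than the tool's actual code: first_digit and group_by_target are hypothetical helpers, and first_digit is a hand-rolled stand-in for the pattern part-r-(\d).* that assumes the pattern is matched from the start of the file name and only considers ASCII digits.

```rust
use std::collections::BTreeMap;

/// Hand-rolled stand-in for the pattern `part-r-(\d).*`: returns the first
/// ASCII digit after the `part-r-` prefix, or None if the key does not match.
fn first_digit(key: &str) -> Option<char> {
    key.strip_prefix("part-r-")?
        .chars()
        .next()
        .filter(|c| c.is_ascii_digit())
}

/// Groups source keys by the target name they would be concatenated into,
/// mimicking the replacement `aggregated-part-$1`. Keys that do not match
/// are skipped. A BTreeMap keeps the targets in sorted order.
fn group_by_target(keys: &[&str]) -> BTreeMap<String, Vec<String>> {
    let mut groups: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for &key in keys {
        if let Some(digit) = first_digit(key) {
            groups
                .entry(format!("aggregated-part-{}", digit))
                .or_default()
                .push(key.to_string());
        }
    }
    groups
}

fn main() {
    // Made-up file names, purely for illustration.
    let keys = ["part-r-00000", "part-r-00001", "part-r-10000", "_SUCCESS"];
    for (target, sources) in group_by_target(&keys) {
        println!("{} <- {:?}", target, sources);
    }
}
```

Each resulting group corresponds to one target file, and the sources within a group keep the order in which they were listed. Running the real tool with the -d flag is the way to see the writes it would actually perform.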