
Determining optimal Amazon S3 transfer parallelism

Author: Jeremiah Wilton | April 23, 2011

Amazon’s Simple Storage Service (S3) is a robust, inexpensive and highly available internet data storage service. At Datavail, we occasionally help our customers design and implement S3-based backup strategies.

Compared to conventional offsite tape vaulting services, the advantages of vaulting database and other backups to S3 are many. S3 backups are always online, so you never have to wait for a truck to arrive with your tapes. S3 backups are replicated, so if one of Amazon’s availability zones experiences a failure, your data is still intact and available in one of their other zones. Best of all, Amazon also offers the Elastic Compute Cloud (a.k.a. EC2, virtual servers by the hour), so your S3 backups double as a super-low-cost disaster recovery strategy. S3 is low-cost, starting at just 3.7¢ / GB / month for storage, and 10¢ / GB for uploads.

I back up all my home computers to S3 using third-party software called Jungle Disk. Jungle Disk runs in the background on a Mac or PC and backs up new data to the cloud every five minutes (or whatever frequency you specify). These backups have come in handy for me many times: I can browse and retrieve files from my home computers (such as photos and documents) from the office, without my home computers even being powered on.

Sounds like an ideal backup solution for small to medium-sized businesses, right? So what’s the catch?

The catch is your Internet connection.  Consider the following:

A typical cable Internet connection with an actual upload rate of 2.8 Mb/s would need more than three days to push a 100 GB file to Amazon S3, and even a 12 GB transfer takes over ten hours. For many businesses, ten hours is the whole backup window. If there is more than that to transfer, you’re out of luck. If your servers are co-located somewhere with a fast connection to the Internet, you might get better transfer rates, but there are other limiting factors, like the number of hops to Amazon S3 and overall latency.
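To put numbers on it: 2.8 Mb/s is about 0.35 MB/s, or roughly 1.2 GB per hour, so a ten-hour window buys you only a little over 12 GB of upload.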

Compress before uploading

If the backup files are not already compressed, you can almost always improve upload times dramatically by compressing them before uploading.
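
For example, with the GnuWin32 gzip that step 3 below installs, a nightly SQL Server backup file (the file name here is only a placeholder) can be compressed in place before it is uploaded:

[code type="ps"]
C:\backups> "C:\Program Files\GnuWin32\bin\gzip.exe" -9 nightly_full.bak
[/code]

gzip replaces the original file with nightly_full.bak.gz; upload the compressed file instead.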

Parallel uploads save the day

Upload performance to Amazon S3 can almost always be improved by running uploads in parallel. The right degree of parallelism depends on the connection between your site and Amazon, the host from which you are uploading, and several other factors. The best way to determine your optimal degree of parallelism is to test it!

Blue Gecko happens to have a customer who wants to vault their Microsoft SQL Server backups to Amazon S3. Unlike Oracle, SQL Server has no native facility for streaming backups directly to S3 as if it were tape. Instead, for this customer, we will compress and upload their database backups after they complete each night. To make it easy to find the optimal degree of parallelism, I delved into the murky world of Windows command shell programming. Against all instincts, I wrote this tool in Batch so that it would work easily on any of this customer’s SQL Server hosts.

This tool lets you determine the optimal degree of parallelism for uploading data from a particular server over the Internet to Amazon S3. It generates its own large files to upload. All you need is an Amazon S3 account. The tool comes as a pair of batch scripts that call s3cmd.rb, a Ruby script included in the S3Sync package. To use it, follow these steps:

  1. Download and install Ruby 1.8.7-p334 for Windows.
  2. Download S3Sync into a convenient directory.
  3. Download and install GnuWin32 gzip and GnuWin32 tar for Windows.
  4. Open a command prompt window, and change to the directory where you downloaded S3Sync:
    [code type="ps"]C:\> cd my_directory[/code]
  5. Unzip and untar the S3Sync package, then change to the S3Sync directory:
    [code type="ps"]
    C:\my_directory> "C:\Program Files\GnuWin32\bin\gzip.exe" -d s3sync.tar.gz
    C:\my_directory> "C:\Program Files\GnuWin32\bin\tar.exe" xvf s3sync.tar
    C:\my_directory> cd s3sync
    [/code]
  6. Create a file called test_parallel.bat in the s3sync directory, and paste the following contents into it:
    [code type="ps"]
    @echo off
    rem Usage: test_parallel total_size_in_bytes degree bucket
    rem Split the requested total size evenly across the parallel streams.
    set /a filesize=%1 / %2
    rem Create one dummy file per parallel stream.
    for /l %%v in (1,1,%2) do (
    fsutil file createnew dummy.%%v %filesize%
    )
    rem Launch one background upload per dummy file.
    for /l %%v in (1,1,%2) do (
    start /b upload %3 %%v
    )
    echo on
    [/code]
  7. Create a bucket for testing. Make sure to substitute your AWS security credentials in the appropriate places:
    [code type="ps"]
    C:\my_directory\s3sync> set AWS_ACCESS_KEY_ID=your AWS access key ID
    C:\my_directory\s3sync> set AWS_SECRET_ACCESS_KEY=your AWS secret access key
    C:\my_directory\s3sync> set AWS_CALLING_FORMAT=SUBDOMAIN
    C:\my_directory\s3sync> s3cmd.rb createbucket my_test_bucket_1234
    [/code]
  8. Create a file called upload.bat in the s3sync directory, and paste the following contents into it. Make sure to substitute your AWS security credentials in the appropriate places:
    [code type="ps"]
    @echo off
    set AWS_ACCESS_KEY_ID=your AWS access key id
    set AWS_SECRET_ACCESS_KEY=your AWS secret access key
    set AWS_CALLING_FORMAT=SUBDOMAIN
    rem Print a timestamp, upload one dummy file to the bucket, then print another timestamp.
    echo %time%
    s3cmd.rb put %1:dummy.%2 dummy.%2
    echo %time%
    rem Remove the local dummy file once it has been uploaded.
    del dummy.%2
    [/code]
  9. Now you can start testing parallel uploads to S3. The syntax for calling test_parallel.bat is:
    [code type="ps"]
    C:\my_directory\s3sync> test_parallel total_size_in_bytes degree bucket
    [/code]
    Here is an example that uploads 120 MB in total at parallel degree 10 to a bucket called my_test_bucket_1234 (a quick way to confirm the uploads actually landed is shown just after this list):
    [code type="ps"]
    C:\my_directory\s3sync> test_parallel 125829120 10 my_test_bucket_1234
    [/code]
    The elapsed time can be determined as follows:
  • Look for the first timestamp displayed in the output and note it.
  • Wait for the last timestamp to display, and note it as well.
  • The intervening time is the elapsed time for the upload (%time% typically prints as HH:MM:SS.cc). You can use Excel or any other tool you like to calculate the time delta.
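
As a sanity check between runs, you can confirm that the dummy objects actually arrived by listing the test bucket. This is a sketch based on the list command of the s3cmd.rb that ships with S3Sync; if your copy behaves differently, run s3cmd.rb with no arguments to see its usage summary.

[code type="ps"]
C:\my_directory\s3sync> s3cmd.rb list my_test_bucket_1234
[/code]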

I typically run the upload without parallelism (degree = 1), then increase it in increments of five.  If there is any doubt as to which S3 region will provide the best performance, I create a bucket in each region (US-West and US-East), then perform identical tests against each.
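
When you have finished testing, keep in mind that the dummy objects stay in the test bucket and continue to accrue storage charges. Here is a cleanup sketch, again assuming the delete and deletebucket commands of the S3Sync s3cmd.rb; repeat the delete for each dummy object you uploaded (a bucket must be empty before it can be deleted):

[code type="ps"]
C:\my_directory\s3sync> s3cmd.rb delete my_test_bucket_1234:dummy.1
C:\my_directory\s3sync> s3cmd.rb delete my_test_bucket_1234:dummy.2
C:\my_directory\s3sync> s3cmd.rb deletebucket my_test_bucket_1234
[/code]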
