
Motivation

Highly scalable distributed applications produce huge amounts of data (user content as well as application data such as logs) every minute. To monitor and analyze those applications, it is essential that this data be partitioned in a meaningful way.

tn_s3_file_uploader is a Ruby gem that uploads files to S3 and lets the user partition them into folders programmatically. It supports a number of substitutions that build destination folders in S3 from the IP address of the machine performing the upload, from file characteristics (such as the file extension), and from the current date and time.

At ThinkNear, we install tn_s3_file_uploader on our EC2 instances to upload our applications' logs to S3. For more information, check out the logrotate scripts under our AWS templates.
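As an illustration of how such an integration might look, here is a minimal logrotate sketch that uploads freshly rotated logs after each rotation. The paths, bucket name, and rotation settings are hypothetical placeholders, not ThinkNear's actual template:

    /var/log/myapp/*.log {
        daily
        rotate 7
        missingok
        postrotate
            # Hypothetical: upload the rotated .log.1 files, partitioned by date and host IP
            tn_s3_file_uploader --input-file-pattern='/var/log/myapp/*.log.1' --s3-output-pattern=my-bucket/logs/%Y/%m/%d/%{ip-address}/%{file-name}.%{file-extension}
        endscript
    }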

Install

gem install tn_s3_file_uploader
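To confirm the gem installed correctly, you can list it with RubyGems (a standard gem command, not specific to this tool):

gem list tn_s3_file_uploader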

Upload files to S3

tn_s3_file_uploader --input-file-pattern=logs.*.log --s3-output-pattern=bucket/folder/ip-%{ip-address}-%{file-name}.%{file-extension}

When the current directory contains logs.1.log and logs.2.log, the above command uploads both files to the S3 bucket bucket, inside the folder folder, with the names ip-158-10-50-100-logs.1.log and ip-158-10-50-100-logs.2.log, assuming the IP address of the machine is 158.10.50.100.

tn_s3_file_uploader arguments

Two arguments are mandatory for tn_s3_file_uploader: --input-file-pattern and --s3-output-pattern.

input-file-pattern

Accepts a shell glob that is used to find which files to upload. The glob provided must match at least one file.

Example: --input-file-pattern=/etc/sonarqube/logs/*.log will match all files with the '.log' extension inside the folder '/etc/sonarqube/logs'.
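Because the glob must match at least one file, it can help to let the shell preview the match list before invoking the uploader (a plain shell check, not a feature of the gem):

ls /etc/sonarqube/logs/*.log

Quoting the pattern (e.g. --input-file-pattern='/etc/sonarqube/logs/*.log') keeps the shell from expanding the glob before tn_s3_file_uploader sees it.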

s3-output-pattern

Accepts an S3 destination path with a number of supported macro substitutions. The s3-output-pattern option must begin with the S3 bucket name (without a leading '/' character), and the bucket must already exist.

The accepted macros are:

  1. Anything that Ruby's strftime accepts (%Y, %m, etc.)
  2. %{file-name} will be substituted with each input file's name (everything before the last dot)
  3. %{file-extension} will be substituted with each input file's extension (everything after the last dot)
  4. %{file-timestamp} will be substituted with the time at which tn_s3_file_uploader ran
  5. %{ip-address} will be substituted with the IP address of the local machine, with dashes instead of dots
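These macros can be freely combined within a single pattern. For instance, the pattern below (a hypothetical example; bucket is a placeholder, and the exact rendering of %{file-timestamp} is not specified here) partitions by date and tags each file with the origin machine and upload time:

--s3-output-pattern=bucket/%Y/%m/%d/ip-%{ip-address}/%{file-name}-%{file-timestamp}.%{file-extension}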

Examples:


Using the above macro substitutions, tn_s3_file_uploader can build destination folders based on time, the IP address of the origin machine, or the file extension.

Partition uploaded files based on date:


The output pattern below will create a folder for the year, a sub-folder for the month, and a sub-folder for the day. All log files will be stored in S3, partitioned by the day they were uploaded.

--s3-output-pattern=bucket/y=%Y/m=%m/d=%d/%{file-name}.%{file-extension}

If the script runs on the 20th of September 2014, the file warning.log will be uploaded to bucket/y=2014/m=09/d=20/warning.log on S3.
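The same idea extends to any strftime directive. For example, if logs rotate hourly, adding the standard %H directive yields one folder per hour (a sketch, not a pattern from the original docs):

--s3-output-pattern=bucket/y=%Y/m=%m/d=%d/h=%H/%{file-name}.%{file-extension}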

Partition uploaded files based on IP:


The output pattern below will create a folder for each different IP address the script is run from:

--s3-output-pattern=bucket/ip-%{ip-address}/%{file-name}.%{file-extension}

With the above output pattern, all files uploaded by running the script on a machine with IP address 158.10.25.50 will land in the S3 folder bucket/ip-158-10-25-50.
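When a fleet of machines uploads to the same bucket, the IP and date macros combine naturally to give per-host daily folders (a hypothetical pattern, not taken from the original docs):

--s3-output-pattern=bucket/ip-%{ip-address}/y=%Y/m=%m/d=%d/%{file-name}.%{file-extension}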

Partition uploaded files based on file type:


The output pattern below will create a folder for each different file type (extension) that was uploaded:

--s3-output-pattern=bucket/%{file-extension}/%{file-name}.%{file-extension}

Using the above output pattern, all .log files will be uploaded to S3 under bucket/log, all .pdf files under bucket/pdf, and so on.

For the optional arguments to tn_s3_file_uploader, please check the Optional Parameters page.