bodyshop-media-server/S3_SYNC_README.md
2025-10-31 15:43:22 -07:00

S3 Daily Sync Configuration

This application now includes automatic daily synchronization of the Jobs directory to an S3 bucket using the AWS CLI.

Prerequisites

AWS CLI Installation

The sync functionality requires the AWS CLI to be installed on your system:

macOS:

curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /

Linux (x86_64):

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

Docker (if running in a container; the base image must provide curl and unzip):

RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
    unzip awscliv2.zip && \
    ./aws/install && \
    rm -rf awscliv2.zip aws/

Required Environment Variables

Add the following environment variables to your .env file:

# S3 Configuration (Required for daily sync)
S3_BUCKET_NAME=your-s3-bucket-name
S3_REGION=us-east-1
S3_ACCESS_KEY_ID=your-access-key-id
S3_SECRET_ACCESS_KEY=your-secret-access-key

# Optional: S3 key prefix (defaults to "jobs/" if not specified)
S3_KEY_PREFIX=jobs/
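
As a minimal sketch, the variables above could be read and validated at startup along these lines. The helper name `loadS3Config` and the shape of the object it returns are assumptions made for illustration, not the application's actual code; only the variable names and the `jobs/` default come from this README.

```javascript
// Hypothetical helper: reads the S3 settings documented above from an env-like object.
const REQUIRED_VARS = [
  "S3_BUCKET_NAME",
  "S3_REGION",
  "S3_ACCESS_KEY_ID",
  "S3_SECRET_ACCESS_KEY",
];

function loadS3Config(env = process.env) {
  // Collect every required variable that is unset or blank.
  const missing = REQUIRED_VARS.filter((name) => !(env[name] || "").trim());
  if (missing.length > 0) {
    return { configured: false, missing };
  }
  return {
    configured: true,
    missing: [],
    bucket: env.S3_BUCKET_NAME,
    region: env.S3_REGION,
    accessKeyId: env.S3_ACCESS_KEY_ID,
    secretAccessKey: env.S3_SECRET_ACCESS_KEY,
    // Documented default when S3_KEY_PREFIX is not set.
    keyPrefix: env.S3_KEY_PREFIX || "jobs/",
  };
}
```

Returning the list of missing names, rather than throwing, would let a status endpoint report an incomplete configuration without stopping the server.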

Features

Automatic Daily Sync

  • Runs every day at midnight Pacific time (PST/PDT) using aws s3 sync
  • Uses the --delete flag to remove files from S3 that no longer exist locally
  • Efficient incremental sync (only uploads changed files)
  • Comprehensive logging of sync operations

Jobs Directory Analysis

  • Real-time analysis of all job folders in the Jobs directory
  • Recursive document counting and size calculation
  • Detailed per-job statistics including document counts and sizes
  • Useful for monitoring storage usage and job completion status
  • No S3 configuration required for analysis functionality

API Endpoints

Check Sync Status

GET /sync/status

Returns the current status of the S3 sync scheduler including:

  • Configuration status
  • Scheduler running status
  • Next scheduled run time
  • S3 connection availability

Manual Sync Trigger

POST /sync/trigger

Manually triggers an S3 sync operation (useful for testing).

Jobs Directory Analysis

GET /jobs/analysis

Analyzes the Jobs directory and returns detailed statistics:

  • Total number of job folders
  • Total documents across all jobs
  • Total size in bytes and MB
  • Per-job statistics including:
    • Job ID (folder name)
    • Relative path
    • Document count in that job
    • Total size for that job

Example Response:

{
  "totalJobs": 150,
  "totalDocuments": 1250,
  "totalSizeBytes": 2147483648,
  "totalSizeMB": 2048.0,
  "jobs": [
    {
      "jobId": "JOB-001",
      "relativePath": "Jobs/JOB-001",
      "documentCount": 8,
      "totalSizeBytes": 15728640,
      "totalSizeMB": 15.0
    },
    ...
  ]
}
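
The recursive counting behind a response of this shape can be sketched as follows. This is an illustrative approximation using only Node.js built-ins, not the server's actual implementation: the function names are hypothetical, every top-level folder is assumed to be one job, and MB is taken as 1024 × 1024 bytes, which matches the figures in the example response.

```javascript
const fs = require("fs");
const path = require("path");

const BYTES_PER_MB = 1024 * 1024;

// Recursively counts files and sums their sizes under `dir`.
function scanDirectory(dir) {
  let documentCount = 0;
  let totalSizeBytes = 0;
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      const nested = scanDirectory(fullPath);
      documentCount += nested.documentCount;
      totalSizeBytes += nested.totalSizeBytes;
    } else if (entry.isFile()) {
      documentCount += 1;
      totalSizeBytes += fs.statSync(fullPath).size;
    }
  }
  return { documentCount, totalSizeBytes };
}

// Treats each top-level folder in `jobsRoot` as one job (an assumption).
function analyzeJobsDirectory(jobsRoot) {
  const jobs = fs
    .readdirSync(jobsRoot, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => {
      const stats = scanDirectory(path.join(jobsRoot, entry.name));
      return {
        jobId: entry.name,
        relativePath: path.join(path.basename(jobsRoot), entry.name),
        documentCount: stats.documentCount,
        totalSizeBytes: stats.totalSizeBytes,
        totalSizeMB: stats.totalSizeBytes / BYTES_PER_MB,
      };
    });
  const totalSizeBytes = jobs.reduce((sum, job) => sum + job.totalSizeBytes, 0);
  return {
    totalJobs: jobs.length,
    totalDocuments: jobs.reduce((sum, job) => sum + job.documentCount, 0),
    totalSizeBytes,
    totalSizeMB: totalSizeBytes / BYTES_PER_MB,
    jobs,
  };
}
```

The sketch uses synchronous file system calls for brevity; a request handler serving a large Jobs directory would more likely use the asynchronous equivalents so the scan does not block the event loop.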

Setup Instructions

  1. Install AWS CLI: Follow the installation instructions above for your platform

  2. Configure S3 Bucket:

    • Create an S3 bucket in your AWS account
    • Create an IAM user with S3 permissions for the bucket
    • Generate access keys for the IAM user
  3. Set Environment Variables:

    • Add the S3 configuration to your environment file
    • Restart the server
  4. Test the Setup:

    • Check the sync status: GET /sync/status
    • Trigger a manual sync: POST /sync/trigger
    • Monitor the logs for sync operations

IAM Permissions

Your IAM user needs the following permissions for the S3 bucket (replace your-bucket-name with the value of S3_BUCKET_NAME):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::your-bucket-name",
                "arn:aws:s3:::your-bucket-name/*"
            ]
        }
    ]
}
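
The policy above works as written. If a tighter scope is wanted, one possible variant splits the bucket-level and object-level actions and limits object access to the sync prefix. The sketch below generates such a document; the helper name `buildSyncPolicy` is hypothetical, and restricting access to the prefix is an assumption that only holds if nothing else in the bucket needs to be reached with these credentials.

```javascript
// Hypothetical helper: builds an IAM policy document scoped to one bucket and key prefix.
function buildSyncPolicy(bucketName, keyPrefix = "jobs/") {
  const bucketArn = `arn:aws:s3:::${bucketName}`;
  return {
    Version: "2012-10-17",
    Statement: [
      {
        // s3:ListBucket is a bucket-level action, so it targets the bucket ARN.
        Effect: "Allow",
        Action: ["s3:ListBucket"],
        Resource: [bucketArn],
      },
      {
        // Object-level actions target object ARNs under the sync prefix.
        Effect: "Allow",
        Action: ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
        Resource: [`${bucketArn}/${keyPrefix}*`],
      },
    ],
  };
}
```

Calling `JSON.stringify(buildSyncPolicy("your-bucket-name"), null, 4)` produces JSON that can be pasted into the IAM console.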

Advantages of Using AWS CLI

Why AWS CLI vs SDK?

  • Simplicity: Single command (aws s3 sync) handles everything
  • Efficiency: AWS CLI is optimized for bulk operations
  • Robustness: Built-in retry logic and error handling
  • Features: Automatic multipart uploads, checksums, and progress tracking
  • Maintenance: No need to manage complex SDK code for file operations

AWS CLI Sync Command

The sync uses: aws s3 sync /local/jobs/path s3://bucket/jobs/ --delete

This command:

  • Only uploads files that are new or modified (based on size and timestamp)
  • Automatically handles large files with multipart uploads
  • Deletes files from S3 that no longer exist locally (with --delete flag)
  • Provides detailed output of what was transferred
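
A minimal sketch of how this command might be assembled and run from Node.js with the built-in child_process module. The function names and the shape of the `config` object are illustrative assumptions, not the application's code. `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION` are the standard environment variables the AWS CLI reads, and `--dryrun` is the CLI's own flag for previewing a sync without transferring anything.

```javascript
const { spawn } = require("child_process");

// Builds the argument list for `aws s3 sync`; `dryRun` adds the CLI's --dryrun flag.
function buildSyncArgs({ localPath, bucket, keyPrefix = "jobs/", dryRun = false }) {
  const args = ["s3", "sync", localPath, `s3://${bucket}/${keyPrefix}`, "--delete"];
  if (dryRun) args.push("--dryrun");
  return args;
}

// Maps the S3_* settings onto the variables the AWS CLI reads.
function buildAwsEnv(config, baseEnv = process.env) {
  return {
    ...baseEnv,
    AWS_ACCESS_KEY_ID: config.accessKeyId,
    AWS_SECRET_ACCESS_KEY: config.secretAccessKey,
    AWS_DEFAULT_REGION: config.region,
  };
}

// Runs the sync and resolves with the exit code and captured output.
function runSync(config) {
  return new Promise((resolve, reject) => {
    const child = spawn("aws", buildSyncArgs(config), { env: buildAwsEnv(config) });
    let stdout = "";
    let stderr = "";
    child.stdout.on("data", (chunk) => (stdout += chunk));
    child.stderr.on("data", (chunk) => (stderr += chunk));
    child.on("error", reject); // e.g. ENOENT when the AWS CLI is not installed
    child.on("close", (code) => resolve({ code, stdout, stderr }));
  });
}
```

Passing the arguments as an array to `spawn`, rather than building a single shell string, avoids quoting problems when the local path contains spaces. Because `--delete` removes remote files, running once with `dryRun: true` is a cheap way to check the paths before the first real sync.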

Troubleshooting

  • AWS CLI not found: Install AWS CLI using the instructions above
  • Permission denied: Check IAM permissions and access keys
  • Sync fails: Check the application logs for detailed error messages
  • Connection issues: Verify S3 bucket name and region
  • Test connection: Use the status endpoint to verify S3 connectivity

How It Works

  1. Scheduler: Uses node-cron to schedule daily execution at midnight Pacific time (PST/PDT)
  2. AWS CLI Check: Verifies AWS CLI is installed and available
  3. Credential Setup: Sets AWS credentials as environment variables
  4. Sync Execution: Runs aws s3 sync with appropriate parameters
  5. Logging: Captures and logs all command output and errors
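
The scheduling and CLI-check steps above can be sketched roughly as follows. This is an approximation under stated assumptions, not the server's actual code: `scheduleDailySync` and `isAwsCliAvailable` are hypothetical names, the node-cron module is passed in as a parameter so the sketch itself depends only on Node.js built-ins, and `syncFn` stands for whatever function performs the sync.

```javascript
const { spawnSync } = require("child_process");

// Step 2: `aws --version` exits with 0 when the CLI is installed and on PATH.
function isAwsCliAvailable() {
  const result = spawnSync("aws", ["--version"], { stdio: "ignore" });
  return !result.error && result.status === 0;
}

// Step 1: registers a job for midnight in the Pacific time zone.
// `cron` is the node-cron module, injected by the caller.
function scheduleDailySync(cron, syncFn, log = console) {
  return cron.schedule(
    "0 0 * * *", // minute 0, hour 0, every day
    async () => {
      if (!isAwsCliAvailable()) {
        log.error("AWS CLI not found; skipping S3 sync");
        return;
      }
      try {
        // Steps 3-4 happen inside syncFn; step 5 logs its result.
        const result = await syncFn();
        log.info("S3 sync finished", result);
      } catch (err) {
        log.error("S3 sync failed", err);
      }
    },
    { timezone: "America/Los_Angeles" } // follows PST/PDT automatically
  );
}
```

In the application this would be called as `scheduleDailySync(require("node-cron"), runTheSync)`. Using the `America/Los_Angeles` zone instead of a fixed UTC offset keeps the job at local midnight across daylight saving changes.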

Dependencies

The implementation requires only:

  • node-cron for scheduling
  • fs-extra for file system operations
  • Node.js built-in child_process for executing AWS CLI commands

No AWS SDK dependencies are needed!