S3 Daily Sync Configuration
This application now includes automatic daily synchronization of the Jobs directory to an S3 bucket using the AWS CLI.
Prerequisites
AWS CLI Installation
The sync functionality requires the AWS CLI to be installed on your system:
macOS:
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /
Linux:
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
Docker (if running in container):
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
unzip awscliv2.zip && \
./aws/install && \
rm -rf awscliv2.zip aws/
Required Environment Variables
Add the following environment variables to your .env file:
# S3 Configuration (Required for daily sync)
S3_BUCKET_NAME=your-s3-bucket-name
S3_REGION=us-east-1
S3_ACCESS_KEY_ID=your-access-key-id
S3_SECRET_ACCESS_KEY=your-secret-access-key
# Optional: S3 key prefix (defaults to "jobs/" if not specified)
S3_KEY_PREFIX=jobs/
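The variables above could be read and validated at startup along these lines. This is a minimal sketch, not the application's actual code: loadS3Config and the shape of the object it returns are hypothetical, and only the variable names and the "jobs/" default come from this document.

```javascript
// Hypothetical helper: collect the S3 settings from an environment-like object.
const REQUIRED_S3_VARS = [
  'S3_BUCKET_NAME',
  'S3_REGION',
  'S3_ACCESS_KEY_ID',
  'S3_SECRET_ACCESS_KEY',
];

function loadS3Config(env = process.env) {
  // Report every missing variable at once instead of stopping at the first.
  const missing = REQUIRED_S3_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    return { configured: false, missing };
  }
  return {
    configured: true,
    missing: [],
    bucket: env.S3_BUCKET_NAME,
    region: env.S3_REGION,
    accessKeyId: env.S3_ACCESS_KEY_ID,
    secretAccessKey: env.S3_SECRET_ACCESS_KEY,
    // Documented default when S3_KEY_PREFIX is not set.
    keyPrefix: env.S3_KEY_PREFIX || 'jobs/',
  };
}
```

Returning the list of missing names, rather than throwing, lets the rest of the application keep running when sync is not configured.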
Features
Automatic Daily Sync
- Runs every day at midnight PST/PDT using aws s3 sync
- Uses the --delete flag to remove files from S3 that no longer exist locally
- Efficient incremental sync (only uploads changed files)
- Comprehensive logging of sync operations
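A daily midnight schedule that follows the PST/PDT switch might be registered as sketched below. The assumptions here: scheduleDailySync is a hypothetical name, the cron module is passed in as a parameter (in the application it would presumably be require('node-cron')), and the zone name America/Los_Angeles is my choice for "PST/PDT", not something this document specifies.

```javascript
// Cron expression: minute 0, hour 0, every day of every month.
const DAILY_MIDNIGHT = '0 0 * * *';

// Hypothetical helper: register runSync to fire once a day at local midnight.
function scheduleDailySync(cron, runSync) {
  // A named time zone tracks daylight saving changes automatically,
  // which a fixed UTC offset would not.
  return cron.schedule(DAILY_MIDNIGHT, runSync, {
    timezone: 'America/Los_Angeles',
  });
}
```

A call such as scheduleDailySync(require('node-cron'), () => console.log('sync started')) would then register the job. Passing the module in keeps the function easy to exercise with a stub.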
Jobs Directory Analysis
- Real-time analysis of all job folders in the Jobs directory
- Recursive document counting and size calculation
- Detailed per-job statistics including document counts and sizes
- Useful for monitoring storage usage and job completion status
- No S3 configuration required for analysis functionality
API Endpoints
Check Sync Status
GET /sync/status
Returns the current status of the S3 sync scheduler including:
- Configuration status
- Scheduler running status
- Next scheduled run time
- S3 connection availability
Manual Sync Trigger
POST /sync/trigger
Manually triggers an S3 sync operation (useful for testing).
Jobs Directory Analysis
GET /jobs/analysis
Analyzes the Jobs directory and returns detailed statistics:
- Total number of job folders
- Total documents across all jobs
- Total size in bytes and MB
- Per-job statistics including:
  - Job ID (folder name)
  - Relative path
  - Document count in that job
  - Total size for that job
Example Response:
{
  "totalJobs": 150,
  "totalDocuments": 1250,
  "totalSizeBytes": 2147483648,
  "totalSizeMB": 2048.0,
  "jobs": [
    {
      "jobId": "JOB-001",
      "relativePath": "Jobs/JOB-001",
      "documentCount": 8,
      "totalSizeBytes": 15728640,
      "totalSizeMB": 15.0
    },
    ...
  ]
}
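A response of this shape could be produced by a recursive walk such as the one below. This is a minimal sketch using only Node.js built-ins, not the application's actual code: analyzeJobsDirectory and measureFolder are hypothetical names, every regular file is assumed to count as a document, and 1 MB is taken as 1024 × 1024 bytes, which matches the example figures above.

```javascript
const fs = require('fs');
const path = require('path');

// Recursively count regular files and sum their sizes under dir.
function measureFolder(dir) {
  let documentCount = 0;
  let totalSizeBytes = 0;
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      const nested = measureFolder(fullPath);
      documentCount += nested.documentCount;
      totalSizeBytes += nested.totalSizeBytes;
    } else if (entry.isFile()) {
      documentCount += 1;
      totalSizeBytes += fs.statSync(fullPath).size;
    }
  }
  return { documentCount, totalSizeBytes };
}

// Convert bytes to MB (1024 * 1024 bytes), rounded to two decimals.
const toMB = (bytes) => Math.round((bytes / (1024 * 1024)) * 100) / 100;

// Treat each top-level folder in jobsDir as one job.
function analyzeJobsDirectory(jobsDir) {
  const jobs = fs
    .readdirSync(jobsDir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => {
      const stats = measureFolder(path.join(jobsDir, entry.name));
      return {
        jobId: entry.name,
        relativePath: path.join(path.basename(jobsDir), entry.name),
        documentCount: stats.documentCount,
        totalSizeBytes: stats.totalSizeBytes,
        totalSizeMB: toMB(stats.totalSizeBytes),
      };
    });
  const totalSizeBytes = jobs.reduce((sum, job) => sum + job.totalSizeBytes, 0);
  return {
    totalJobs: jobs.length,
    totalDocuments: jobs.reduce((sum, job) => sum + job.documentCount, 0),
    totalSizeBytes,
    totalSizeMB: toMB(totalSizeBytes),
    jobs,
  };
}
```

The synchronous fs calls keep the sketch short. For a large Jobs directory, the fs.promises equivalents would avoid blocking the server while the walk runs.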
Setup Instructions
- Install AWS CLI: Follow the installation instructions above for your platform
- Configure S3 Bucket:
  - Create an S3 bucket in your AWS account
  - Create an IAM user with S3 permissions for the bucket
  - Generate access keys for the IAM user
- Set Environment Variables:
  - Add the S3 configuration to your environment file
  - Restart the server
- Test the Setup:
  - Check the sync status: GET /s3-sync/status
  - Trigger a manual sync: POST /s3-sync/trigger
  - Monitor the logs for sync operations
IAM Permissions
Your IAM user needs the following permissions for the S3 bucket:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
Advantages of Using AWS CLI
Why AWS CLI vs SDK?
- Simplicity: Single command (aws s3 sync) handles everything
- Efficiency: AWS CLI is optimized for bulk operations
- Robustness: Built-in retry logic and error handling
- Features: Automatic multipart uploads, checksums, and progress tracking
- Maintenance: No need to manage complex SDK code for file operations
AWS CLI Sync Command
The sync uses: aws s3 sync /local/jobs/path s3://bucket/jobs/ --delete
This command:
- Only uploads files that are new or modified (based on size and timestamp)
- Automatically handles large files with multipart uploads
- Deletes files from S3 that no longer exist locally (with the --delete flag)
- Provides detailed output of what was transferred
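The command line above might be assembled from the configured bucket and key prefix as in this sketch. buildSyncArgs and its dryRun option are hypothetical, not part of the application. The --dryrun flag itself is a standard aws s3 sync option that prints the planned operations without performing them, which is a cautious way to preview what --delete would remove.

```javascript
// Hypothetical helper: build the argument list for `aws s3 sync`.
function buildSyncArgs(localDir, bucket, keyPrefix = 'jobs/', { dryRun = false } = {}) {
  // Strip leading and trailing slashes so the destination is well formed.
  const trimmed = keyPrefix.replace(/^\/+|\/+$/g, '');
  const destination = trimmed ? `s3://${bucket}/${trimmed}/` : `s3://${bucket}/`;
  const args = ['s3', 'sync', localDir, destination, '--delete'];
  if (dryRun) {
    args.push('--dryrun'); // preview only, nothing is uploaded or deleted
  }
  return args;
}
```

Keeping the arguments as an array, to be passed to child_process.execFile or spawn, avoids shell quoting problems when the local path contains spaces.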
Troubleshooting
- AWS CLI not found: Install AWS CLI using the instructions above
- Permission denied: Check IAM permissions and access keys
- Sync fails: Check the application logs for detailed error messages
- Connection issues: Verify S3 bucket name and region
- Test connection: Use the status endpoint to verify S3 connectivity
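When diagnosing the "AWS CLI not found" case, a quick availability check can help. The sketch below is one way to do it and is not taken from the application: isCliAvailable is a hypothetical name, and the command is a parameter only so the check can be tried against any executable.

```javascript
const { spawnSync } = require('child_process');

// Returns true when `<command> --version` starts and exits with status 0.
function isCliAvailable(command = 'aws') {
  const result = spawnSync(command, ['--version'], { stdio: 'ignore' });
  // result.error is set (for example with code ENOENT) when the
  // executable cannot be found on PATH.
  return !result.error && result.status === 0;
}
```

If isCliAvailable() returns false inside a container but true on the host, the Docker install step above probably did not run in the image being used.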
How It Works
- Scheduler: Uses node-cron to schedule daily execution at midnight PST/PDT
- AWS CLI Check: Verifies AWS CLI is installed and available
- Credential Setup: Sets AWS credentials as environment variables
- Sync Execution: Runs aws s3 sync with appropriate parameters
- Logging: Captures and logs all command output and errors
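The credential setup and sync execution steps could look roughly like this. It is a sketch under stated assumptions rather than the application's code: buildAwsEnv and runS3Sync are hypothetical names, and the mapping assumes the S3_* variables from the .env file are translated to the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION variables that the AWS CLI reads.

```javascript
const { spawn } = require('child_process');

// Map the app's S3_* settings onto the variable names the AWS CLI reads.
function buildAwsEnv(appEnv, baseEnv = process.env) {
  return {
    ...baseEnv, // keep PATH and friends so the aws executable can be found
    AWS_ACCESS_KEY_ID: appEnv.S3_ACCESS_KEY_ID,
    AWS_SECRET_ACCESS_KEY: appEnv.S3_SECRET_ACCESS_KEY,
    AWS_DEFAULT_REGION: appEnv.S3_REGION,
  };
}

// Run `aws s3 sync` and resolve with the captured output for logging.
function runS3Sync(localDir, destination, env) {
  return new Promise((resolve, reject) => {
    const args = ['s3', 'sync', localDir, destination, '--delete'];
    const child = spawn('aws', args, { env });
    let stdout = '';
    let stderr = '';
    child.stdout.on('data', (chunk) => { stdout += chunk; });
    child.stderr.on('data', (chunk) => { stderr += chunk; });
    child.on('error', reject); // for example, AWS CLI not installed
    child.on('close', (code) => {
      if (code === 0) {
        resolve({ stdout, stderr });
      } else {
        reject(new Error(`aws s3 sync exited with code ${code}: ${stderr}`));
      }
    });
  });
}
```

Passing the credentials through the child process environment keeps them off the command line, where they could otherwise show up in process listings.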
Dependencies
The implementation now only requires:
- node-cron for scheduling
- fs-extra for file system operations
- Node.js built-in child_process for executing AWS CLI commands
No AWS SDK dependencies are needed!