# S3 Daily Sync Configuration

This application now includes automatic daily synchronization of the Jobs directory to an S3 bucket using the AWS CLI.

## Prerequisites

### AWS CLI Installation

The sync functionality requires the AWS CLI to be installed on your system:

**macOS:**

```bash
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /
```

**Linux:**

```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
```

**Docker (if running in a container):**

```dockerfile
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
    unzip awscliv2.zip && \
    ./aws/install && \
    rm -rf awscliv2.zip aws/
```

## Required Environment Variables

Add the following environment variables to your `.env` file:

```env
# S3 Configuration (Required for daily sync)
S3_BUCKET_NAME=your-s3-bucket-name
S3_REGION=us-east-1
S3_ACCESS_KEY_ID=your-access-key-id
S3_SECRET_ACCESS_KEY=your-secret-access-key

# Optional: S3 key prefix (defaults to "jobs/" if not specified)
S3_KEY_PREFIX=jobs/
```

## Features

### Automatic Daily Sync

- Runs every day at midnight PST/PDT using `aws s3 sync`
- Uses the `--delete` flag to remove files from S3 that no longer exist locally
- Efficient incremental sync (only uploads changed files)
- Comprehensive logging of sync operations

### Jobs Directory Analysis

- Real-time analysis of all job folders in the Jobs directory
- Recursive document counting and size calculation
- Detailed per-job statistics including document counts and sizes
- Useful for monitoring storage usage and job completion status
- No S3 configuration required for analysis functionality

### API Endpoints

#### Check Sync Status

```
GET /sync/status
```

Returns the current status of the S3 sync scheduler, including:

- Configuration status
- Scheduler running status
- Next scheduled run time
- S3 connection availability

#### Manual Sync Trigger

```
POST /sync/trigger
```

Manually triggers an S3 sync operation (useful for testing).

#### Jobs Directory Analysis

```
GET /jobs/analysis
```

Analyzes the Jobs directory and returns detailed statistics:

- Total number of job folders
- Total documents across all jobs
- Total size in bytes and MB
- Per-job statistics including:
  - Job ID (folder name)
  - Relative path
  - Document count in that job
  - Total size for that job

**Example Response:**

```json
{
  "totalJobs": 150,
  "totalDocuments": 1250,
  "totalSizeBytes": 2147483648,
  "totalSizeMB": 2048.0,
  "jobs": [
    {
      "jobId": "JOB-001",
      "relativePath": "Jobs/JOB-001",
      "documentCount": 8,
      "totalSizeBytes": 15728640,
      "totalSizeMB": 15.0
    },
    ...
  ]
}
```

## Setup Instructions

1. **Install AWS CLI**: Follow the installation instructions above for your platform
2. **Configure S3 Bucket**:
   - Create an S3 bucket in your AWS account
   - Create an IAM user with S3 permissions for the bucket
   - Generate access keys for the IAM user
3. **Set Environment Variables**:
   - Add the S3 configuration to your environment file
   - Restart the server
4. **Test the Setup**:
   - Check the sync status: `GET /sync/status`
   - Trigger a manual sync: `POST /sync/trigger`
   - Monitor the logs for sync operations

## IAM Permissions

Your IAM user needs the following permissions for the S3 bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
```

## Advantages of Using AWS CLI

### Why AWS CLI vs SDK?
- **Simplicity**: A single command (`aws s3 sync`) handles everything
- **Efficiency**: The AWS CLI is optimized for bulk operations
- **Robustness**: Built-in retry logic and error handling
- **Features**: Automatic multipart uploads, checksums, and progress tracking
- **Maintenance**: No need to manage complex SDK code for file operations

### AWS CLI Sync Command

The sync uses: `aws s3 sync /local/jobs/path s3://bucket/jobs/ --delete`

This command:

- Only uploads files that are new or modified (based on size and timestamp)
- Automatically handles large files with multipart uploads
- Deletes files from S3 that no longer exist locally (with the `--delete` flag)
- Provides detailed output of what was transferred

## Troubleshooting

- **AWS CLI not found**: Install the AWS CLI using the instructions above
- **Permission denied**: Check IAM permissions and access keys
- **Sync fails**: Check the application logs for detailed error messages
- **Connection issues**: Verify the S3 bucket name and region
- **Test connection**: Use the status endpoint to verify S3 connectivity

## How It Works

1. **Scheduler**: Uses `node-cron` to schedule daily execution at midnight PST/PDT
2. **AWS CLI Check**: Verifies the AWS CLI is installed and available
3. **Credential Setup**: Sets AWS credentials as environment variables
4. **Sync Execution**: Runs `aws s3 sync` with the appropriate parameters
5. **Logging**: Captures and logs all command output and errors

## Dependencies

The implementation now only requires:

- `node-cron` for scheduling
- `fs-extra` for file system operations
- Node.js built-in `child_process` for executing AWS CLI commands

No AWS SDK dependencies are needed!