# S3 Daily Sync Configuration

This application now includes automatic daily synchronization of the Jobs directory to an S3 bucket using the AWS CLI.

## Prerequisites

### AWS CLI Installation
The sync functionality requires the AWS CLI to be installed on your system:

**macOS:**
```bash
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /
```

**Linux:**
```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
```

**Docker (if running in a container):**
```dockerfile
# Assumes curl and unzip are already available in the base image
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
    unzip awscliv2.zip && \
    ./aws/install && \
    rm -rf awscliv2.zip aws/
```

## Required Environment Variables

Add the following environment variables to your `.env` file:

```env
# S3 Configuration (Required for daily sync)
S3_BUCKET_NAME=your-s3-bucket-name
S3_REGION=us-east-1
S3_ACCESS_KEY_ID=your-access-key-id
S3_SECRET_ACCESS_KEY=your-secret-access-key

# Optional: S3 key prefix (defaults to "jobs/" if not specified)
S3_KEY_PREFIX=jobs/
```

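As an illustration of how these settings might be read at startup, here is a minimal sketch. The helper name `loadS3Config` and the shape of the returned object are hypothetical and not taken from the application; only the variable names and the `jobs/` default come from the configuration above.

```javascript
// Hypothetical helper: reads the S3_* variables documented above from an
// environment-like object and fails fast when a required one is missing.
const REQUIRED_VARS = [
  "S3_BUCKET_NAME",
  "S3_REGION",
  "S3_ACCESS_KEY_ID",
  "S3_SECRET_ACCESS_KEY",
];

function loadS3Config(env = process.env) {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required S3 settings: ${missing.join(", ")}`);
  }
  return {
    bucket: env.S3_BUCKET_NAME,
    region: env.S3_REGION,
    accessKeyId: env.S3_ACCESS_KEY_ID,
    secretAccessKey: env.S3_SECRET_ACCESS_KEY,
    // Documented default: "jobs/" when S3_KEY_PREFIX is not set.
    keyPrefix: env.S3_KEY_PREFIX || "jobs/",
  };
}
```

Failing early with the list of missing names tends to be easier to diagnose than a failed sync at midnight.
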
## Features

### Automatic Daily Sync
- Runs every day at midnight Pacific time (PST/PDT) using `aws s3 sync`
- Uses the `--delete` flag to remove files from S3 that no longer exist locally
- Efficient incremental sync (only uploads new or changed files)
- Logs the output and errors of every sync operation

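A daily midnight schedule in Pacific time could be registered roughly as follows. This is a sketch, not the application's actual scheduler: the function name `scheduleDailySync`, the cron expression, and the `America/Los_Angeles` zone name are assumptions chosen to match the behaviour described above. The cron module is passed in as a parameter so the sketch has no hard dependency.

```javascript
// The cron module is injected; in the application this would presumably be
// require("node-cron").
function scheduleDailySync(cron, runSync) {
  // "0 0 * * *" means minute 0, hour 0, every day. The timezone option keeps
  // the job at local midnight in Pacific time across PST/PDT changes.
  return cron.schedule("0 0 * * *", runSync, {
    timezone: "America/Los_Angeles",
  });
}

// Usage (assumes node-cron is installed and runS3Sync is your sync function):
// const cron = require("node-cron");
// scheduleDailySync(cron, () => runS3Sync());
```

Naming the time zone explicitly avoids the job drifting by an hour when the server itself runs in UTC.
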
### Jobs Directory Analysis
- Real-time analysis of all job folders in the Jobs directory
- Recursive document counting and size calculation
- Detailed per-job statistics including document counts and sizes
- Useful for monitoring storage usage and job completion status
- No S3 configuration required for analysis functionality

### API Endpoints

#### Check Sync Status
```
GET /sync/status
```
Returns the current status of the S3 sync scheduler, including:
- Configuration status
- Scheduler running status
- Next scheduled run time
- S3 connection availability

#### Manual Sync Trigger
```
POST /sync/trigger
```
Manually triggers an S3 sync operation (useful for testing).

#### Jobs Directory Analysis
```
GET /jobs/analysis
```
Analyzes the Jobs directory and returns detailed statistics:
- Total number of job folders
- Total documents across all jobs
- Total size in bytes and MB
- Per-job statistics including:
  - Job ID (folder name)
  - Relative path
  - Document count in that job
  - Total size for that job

**Example Response:**
```json
{
  "totalJobs": 150,
  "totalDocuments": 1250,
  "totalSizeBytes": 2147483648,
  "totalSizeMB": 2048.0,
  "jobs": [
    {
      "jobId": "JOB-001",
      "relativePath": "Jobs/JOB-001",
      "documentCount": 8,
      "totalSizeBytes": 15728640,
      "totalSizeMB": 15.0
    },
    ...
  ]
}
```

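The recursive counting behind a response like this can be sketched with Node's built-in `fs` module. The function names `scanDirectory` and `analyzeJobs` are hypothetical, and the sketch assumes that every regular file counts as a document and that 1 MB is 1024 × 1024 bytes, which is what the example figures above imply (2147483648 bytes = 2048.0 MB).

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Recursively count files and sum their sizes under one directory.
function scanDirectory(dir) {
  let documentCount = 0;
  let totalSizeBytes = 0;
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      const nested = scanDirectory(fullPath);
      documentCount += nested.documentCount;
      totalSizeBytes += nested.totalSizeBytes;
    } else if (entry.isFile()) {
      documentCount += 1;
      totalSizeBytes += fs.statSync(fullPath).size;
    }
  }
  return { documentCount, totalSizeBytes };
}

// Convert bytes to MB (1024 * 1024 bytes), rounded to two decimals.
const toMB = (bytes) => Number((bytes / (1024 * 1024)).toFixed(2));

// Treat each top-level folder in jobsDir as one job.
function analyzeJobs(jobsDir) {
  const jobs = fs
    .readdirSync(jobsDir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => {
      const stats = scanDirectory(path.join(jobsDir, entry.name));
      return {
        jobId: entry.name,
        relativePath: `${path.basename(jobsDir)}/${entry.name}`,
        documentCount: stats.documentCount,
        totalSizeBytes: stats.totalSizeBytes,
        totalSizeMB: toMB(stats.totalSizeBytes),
      };
    });
  const totalSizeBytes = jobs.reduce((sum, job) => sum + job.totalSizeBytes, 0);
  return {
    totalJobs: jobs.length,
    totalDocuments: jobs.reduce((sum, job) => sum + job.documentCount, 0),
    totalSizeBytes,
    totalSizeMB: toMB(totalSizeBytes),
    jobs,
  };
}
```

The synchronous `fs` calls keep the sketch short; for a large Jobs directory an asynchronous walk would avoid blocking the server while the analysis runs.
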
## Setup Instructions

1. **Install AWS CLI**: Follow the installation instructions above for your platform

2. **Configure S3 Bucket**:
   - Create an S3 bucket in your AWS account
   - Create an IAM user with S3 permissions for the bucket
   - Generate access keys for the IAM user

3. **Set Environment Variables**:
   - Add the S3 configuration to your environment file
   - Restart the server

4. **Test the Setup**:
   - Check the sync status: `GET /sync/status`
   - Trigger a manual sync: `POST /sync/trigger`
   - Monitor the logs for sync operations

## IAM Permissions

Your IAM user needs the following permissions for the S3 bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
```

## Advantages of Using AWS CLI

### Why AWS CLI vs SDK?
- **Simplicity**: Single command (`aws s3 sync`) handles everything
- **Efficiency**: AWS CLI is optimized for bulk operations
- **Robustness**: Built-in retry logic and error handling
- **Features**: Automatic multipart uploads, checksums, and progress tracking
- **Maintenance**: No need to manage complex SDK code for file operations

### AWS CLI Sync Command
The sync uses: `aws s3 sync /local/jobs/path s3://bucket/jobs/ --delete`

This command:
- Only uploads files that are new or modified (based on size and timestamp)
- Automatically handles large files with multipart uploads
- Deletes files from S3 that no longer exist locally (with `--delete` flag)
- Provides detailed output of what was transferred

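From Node.js, the command above might be assembled and executed along these lines. This is a sketch under assumptions: `buildSyncArgs` and `runSync` are hypothetical names, and the optional `--dryrun` flag (a standard `aws s3 sync` option that lists planned transfers and deletions without performing them) is added here as a safety aid rather than something the application is known to use.

```javascript
const { spawn } = require("node:child_process");

// Build the argument list for `aws s3 sync`. Passing arguments as an array
// (instead of one shell string) avoids quoting problems with spaces in paths.
function buildSyncArgs(localPath, bucket, keyPrefix = "jobs/", { dryRun = false } = {}) {
  const args = ["s3", "sync", localPath, `s3://${bucket}/${keyPrefix}`, "--delete"];
  // --dryrun prints the planned uploads and deletions without performing them.
  if (dryRun) args.push("--dryrun");
  return args;
}

// Run the CLI and resolve with its combined stdout/stderr output.
function runSync(args, env = process.env) {
  return new Promise((resolve, reject) => {
    const child = spawn("aws", args, { env });
    let output = "";
    child.stdout.on("data", (chunk) => (output += chunk));
    child.stderr.on("data", (chunk) => (output += chunk));
    child.on("error", reject); // e.g. the aws binary is not on PATH
    child.on("close", (code) =>
      code === 0
        ? resolve(output)
        : reject(new Error(`aws s3 sync exited with code ${code}: ${output}`))
    );
  });
}

// Usage: preview what a sync would change before running it for real.
// runSync(buildSyncArgs("/local/jobs/path", "bucket", "jobs/", { dryRun: true }))
//   .then(console.log)
//   .catch(console.error);
```

Because `--delete` removes remote objects, a dry run against a new bucket or prefix is a cheap way to confirm the paths are correct first.
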
## Troubleshooting

- **AWS CLI not found**: Install AWS CLI using the instructions above
- **Permission denied**: Check IAM permissions and access keys
- **Sync fails**: Check the application logs for detailed error messages
- **Connection issues**: Verify S3 bucket name and region
- **Test connection**: Use the status endpoint to verify S3 connectivity

## How It Works

1. **Scheduler**: Uses `node-cron` to schedule daily execution at midnight Pacific time (PST/PDT)
2. **AWS CLI Check**: Verifies AWS CLI is installed and available
3. **Credential Setup**: Sets AWS credentials as environment variables
4. **Sync Execution**: Runs `aws s3 sync` with appropriate parameters
5. **Logging**: Captures and logs all command output and errors

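Steps 2 and 3 could look roughly like the sketch below. The function names are hypothetical, and the mapping from the application's `S3_*` variables to `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION` is an assumption about how the credentials reach the CLI; those three are the standard variable names the AWS CLI reads.

```javascript
const { spawnSync } = require("node:child_process");

// Step 2: `aws --version` exits with status 0 when the CLI is installed and
// on PATH. If the binary is missing, spawnSync reports it via result.error.
function isAwsCliAvailable() {
  const result = spawnSync("aws", ["--version"], { stdio: "ignore" });
  return !result.error && result.status === 0;
}

// Step 3: copy the application's S3_* settings onto the variable names the
// AWS CLI reads, leaving the rest of the environment (PATH, HOME) intact.
function buildAwsEnv(env = process.env) {
  return {
    ...env,
    AWS_ACCESS_KEY_ID: env.S3_ACCESS_KEY_ID,
    AWS_SECRET_ACCESS_KEY: env.S3_SECRET_ACCESS_KEY,
    AWS_DEFAULT_REGION: env.S3_REGION,
  };
}
```

Passing the result of `buildAwsEnv()` as the child process environment scopes the credentials to the sync command instead of changing the server's own environment.
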
## Dependencies

The implementation now only requires:
- `node-cron` for scheduling
- `fs-extra` for file system operations
- Node.js built-in `child_process` for executing AWS CLI commands

No AWS SDK dependencies are needed!