WIP s3 sync and directory stats.

Patrick Fic
2025-10-31 15:43:22 -07:00
parent 080ed8e335
commit be3918cbaf
10 changed files with 926 additions and 5 deletions

S3_SYNC_README.md Normal file

@@ -0,0 +1,197 @@
# S3 Daily Sync Configuration
This application now includes automatic daily synchronization of the Jobs directory to an S3 bucket using the AWS CLI.
## Prerequisites
### AWS CLI Installation
The sync functionality requires the AWS CLI to be installed on your system:
**macOS:**
```bash
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /
```
**Linux (x86_64):**
```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
```
**Docker (if running in container):**
```dockerfile
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
    unzip awscliv2.zip && \
    ./aws/install && \
    rm -rf awscliv2.zip aws/
```
## Required Environment Variables
Add the following environment variables to your `.env` file:
```env
# S3 Configuration (Required for daily sync)
S3_BUCKET_NAME=your-s3-bucket-name
S3_REGION=us-east-1
S3_ACCESS_KEY_ID=your-access-key-id
S3_SECRET_ACCESS_KEY=your-secret-access-key
# Optional: S3 key prefix (defaults to "jobs/" if not specified)
S3_KEY_PREFIX=jobs/
```
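As a rough sketch of how these variables might be validated at startup, the helper below reads them from an environment object and applies the documented `jobs/` default. The `S3SyncConfig` type and the `loadS3Config` function are hypothetical names used for illustration and are not taken from the application source.

```typescript
// Hypothetical shape for the settings listed above (an assumption, not the app's actual type).
interface S3SyncConfig {
  bucketName: string;
  region: string;
  accessKeyId: string;
  secretAccessKey: string;
  keyPrefix: string;
}

// The four variables documented as required for the daily sync.
const REQUIRED_KEYS = [
  "S3_BUCKET_NAME",
  "S3_REGION",
  "S3_ACCESS_KEY_ID",
  "S3_SECRET_ACCESS_KEY",
] as const;

// Returns the parsed config, or the names of the required variables that are missing or empty.
function loadS3Config(
  env: Record<string, string | undefined>
): { config: S3SyncConfig } | { missing: string[] } {
  const missing: string[] = REQUIRED_KEYS.filter((key) => !env[key]);
  if (missing.length > 0) {
    return { missing };
  }
  return {
    config: {
      bucketName: env["S3_BUCKET_NAME"] as string,
      region: env["S3_REGION"] as string,
      accessKeyId: env["S3_ACCESS_KEY_ID"] as string,
      secretAccessKey: env["S3_SECRET_ACCESS_KEY"] as string,
      // Optional: falls back to the documented default prefix.
      keyPrefix: env["S3_KEY_PREFIX"] || "jobs/",
    },
  };
}
```

In a Node.js process this would typically be called as `loadS3Config(process.env)` after the `.env` file has been loaded.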
## Features
### Automatic Daily Sync
- Runs every day at midnight Pacific time (PST/PDT) using `aws s3 sync`
- Uses the `--delete` flag to remove files from S3 that no longer exist locally
- Efficient incremental sync (only uploads changed files)
- Comprehensive logging of sync operations
### Jobs Directory Analysis
- Real-time analysis of all job folders in the Jobs directory
- Recursive document counting and size calculation
- Detailed per-job statistics including document counts and sizes
- Useful for monitoring storage usage and job completion status
- No S3 configuration required for analysis functionality
### API Endpoints
#### Check Sync Status
```
GET /sync/status
```
Returns the current status of the S3 sync scheduler including:
- Configuration status
- Scheduler running status
- Next scheduled run time
- S3 connection availability
#### Manual Sync Trigger
```
POST /sync/trigger
```
Manually triggers an S3 sync operation (useful for testing).
#### Jobs Directory Analysis
```
GET /jobs/analysis
```
Analyzes the Jobs directory and returns detailed statistics:
- Total number of job folders
- Total documents across all jobs
- Total size in bytes and MB
- Per-job statistics including:
  - Job ID (folder name)
  - Relative path
  - Document count in that job
  - Total size for that job
**Example Response:**
```json
{
  "totalJobs": 150,
  "totalDocuments": 1250,
  "totalSizeBytes": 2147483648,
  "totalSizeMB": 2048.0,
  "jobs": [
    {
      "jobId": "JOB-001",
      "relativePath": "Jobs/JOB-001",
      "documentCount": 8,
      "totalSizeBytes": 15728640,
      "totalSizeMB": 15.0
    },
    ...
  ]
}
```
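The recursive counting described above could be sketched roughly as follows. This is an illustration under assumptions, not the application's actual code: the `scanDirectory` and `analyzeJobs` names are hypothetical, every regular file is treated as a document, and sizes are converted with 1 MB = 1024 × 1024 bytes, which matches the figures in the example response.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Field names mirror the example response; the types themselves are assumptions.
interface JobStats {
  jobId: string;
  relativePath: string;
  documentCount: number;
  totalSizeBytes: number;
  totalSizeMB: number;
}

interface JobsAnalysis {
  totalJobs: number;
  totalDocuments: number;
  totalSizeBytes: number;
  totalSizeMB: number;
  jobs: JobStats[];
}

// Bytes to megabytes (1 MB = 1024 * 1024 bytes), rounded to two decimals.
const toMB = (bytes: number): number =>
  Math.round((bytes / (1024 * 1024)) * 100) / 100;

// Recursively counts regular files under dir and sums their sizes.
// Symbolic links and other special entries are skipped.
function scanDirectory(dir: string): { count: number; bytes: number } {
  let count = 0;
  let bytes = 0;
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      const nested = scanDirectory(fullPath);
      count += nested.count;
      bytes += nested.bytes;
    } else if (entry.isFile()) {
      count += 1;
      bytes += fs.statSync(fullPath).size;
    }
  }
  return { count, bytes };
}

// Treats each immediate subdirectory of jobsRoot as one job folder.
function analyzeJobs(jobsRoot: string): JobsAnalysis {
  const jobs: JobStats[] = fs
    .readdirSync(jobsRoot, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => {
      const { count, bytes } = scanDirectory(path.join(jobsRoot, entry.name));
      return {
        jobId: entry.name,
        relativePath: path.join(path.basename(jobsRoot), entry.name),
        documentCount: count,
        totalSizeBytes: bytes,
        totalSizeMB: toMB(bytes),
      };
    });
  const totalSizeBytes = jobs.reduce((sum, job) => sum + job.totalSizeBytes, 0);
  return {
    totalJobs: jobs.length,
    totalDocuments: jobs.reduce((sum, job) => sum + job.documentCount, 0),
    totalSizeBytes,
    totalSizeMB: toMB(totalSizeBytes),
    jobs,
  };
}
```

The synchronous `fs` calls keep the sketch short; for a large Jobs directory an asynchronous walk would avoid blocking the server while the analysis runs.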
## Setup Instructions
1. **Install AWS CLI**: Follow the installation instructions above for your platform
2. **Configure S3 Bucket**:
   - Create an S3 bucket in your AWS account
   - Create an IAM user with S3 permissions for the bucket
   - Generate access keys for the IAM user
3. **Set Environment Variables**:
   - Add the S3 configuration to your environment file
   - Restart the server
4. **Test the Setup**:
   - Check the sync status: `GET /sync/status`
   - Trigger a manual sync: `POST /sync/trigger`
   - Monitor the logs for sync operations
## IAM Permissions
Your IAM user needs the following permissions for the S3 bucket:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
```
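If the policy needs to be generated for a specific bucket, a small helper along these lines could fill in the bucket name. The `buildSyncPolicy` function is a hypothetical convenience for illustration and is not part of the application.

```typescript
// Builds the policy document shown above for a given bucket name.
// Both resources are needed: s3:ListBucket applies to the bucket ARN,
// while the object actions apply to the objects under "/*".
function buildSyncPolicy(bucketName: string) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: [
          "s3:GetObject",
          "s3:PutObject",
          "s3:DeleteObject",
          "s3:ListBucket",
        ],
        Resource: [
          `arn:aws:s3:::${bucketName}`,
          `arn:aws:s3:::${bucketName}/*`,
        ],
      },
    ],
  };
}
```

`JSON.stringify(buildSyncPolicy("my-bucket"), null, 2)` then yields a document that can be pasted into the IAM console.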
## Advantages of Using AWS CLI
### Why AWS CLI vs SDK?
- **Simplicity**: Single command (`aws s3 sync`) handles everything
- **Efficiency**: AWS CLI is optimized for bulk operations
- **Robustness**: Built-in retry logic and error handling
- **Features**: Automatic multipart uploads, checksums, and progress tracking
- **Maintenance**: No need to manage complex SDK code for file operations
### AWS CLI Sync Command
The sync uses: `aws s3 sync /local/jobs/path s3://bucket/jobs/ --delete`
This command:
- Only uploads files that are new or modified (based on size and timestamp)
- Automatically handles large files with multipart uploads
- Deletes files from S3 that no longer exist locally (with `--delete` flag)
- Provides detailed output of what was transferred
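A minimal sketch of how the argument list for this command might be assembled is shown below. The `buildSyncArgs` function and its parameters are hypothetical. The `--dryrun` option is a standard `aws s3 sync` flag that prints the planned operations without performing them, which can be helpful before enabling `--delete` against a real bucket.

```typescript
// Builds the argument list for `aws s3 sync` (illustrative names, not the app's code).
function buildSyncArgs(
  localPath: string,
  bucket: string,
  keyPrefix: string = "jobs/",
  options: { deleteRemoved?: boolean; dryRun?: boolean } = {}
): string[] {
  const { deleteRemoved = true, dryRun = false } = options;
  // Normalize the prefix so the destination has no leading or doubled slashes.
  const prefix = keyPrefix.replace(/^\/+/, "").replace(/\/+$/, "");
  const destination = prefix
    ? `s3://${bucket}/${prefix}/`
    : `s3://${bucket}/`;
  const args = ["s3", "sync", localPath, destination];
  if (deleteRemoved) args.push("--delete");
  if (dryRun) args.push("--dryrun");
  return args;
}
```

Returning an array rather than a single string lets the arguments be passed to `child_process.execFile` or `spawn` directly, so paths containing spaces need no shell quoting.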
## Troubleshooting
- **AWS CLI not found**: Install AWS CLI using the instructions above
- **Permission denied**: Check IAM permissions and access keys
- **Sync fails**: Check the application logs for detailed error messages
- **Connection issues**: Verify S3 bucket name and region
- **Test connection**: Use the status endpoint to verify S3 connectivity
## How It Works
1. **Scheduler**: Uses `node-cron` to schedule daily execution at midnight Pacific time
2. **AWS CLI Check**: Verifies AWS CLI is installed and available
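3. **Credential Setup**: Passes the configured S3 credentials to the AWS CLI as environment variables (the CLI reads `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION`)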
4. **Sync Execution**: Runs `aws s3 sync` with appropriate parameters
5. **Logging**: Captures and logs all command output and errors
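Steps 2 to 5 above can be sketched roughly as follows. This is an illustration under assumptions rather than the application's implementation: `Runner`, `execRunner`, `buildAwsEnv`, and `runSync` are hypothetical names, and the command runner is injectable so the flow can be exercised without the AWS CLI installed. Scheduling (step 1) is left out.

```typescript
import { execFile } from "node:child_process";

type Env = Record<string, string | undefined>;

// A command runner; injectable so the flow can be tested without the AWS CLI.
type Runner = (
  file: string,
  args: string[],
  env: Env
) => Promise<{ stdout: string; stderr: string }>;

// Default runner: executes the command without a shell via child_process.execFile.
const execRunner: Runner = (file, args, env) =>
  new Promise((resolve, reject) => {
    execFile(file, args, { env }, (error, stdout, stderr) => {
      if (error) reject(error);
      else resolve({ stdout: String(stdout), stderr: String(stderr) });
    });
  });

// Step 3: map the application's S3_* settings onto the variable names the AWS CLI reads.
function buildAwsEnv(env: Env): Env {
  return {
    ...env,
    AWS_ACCESS_KEY_ID: env["S3_ACCESS_KEY_ID"],
    AWS_SECRET_ACCESS_KEY: env["S3_SECRET_ACCESS_KEY"],
    AWS_DEFAULT_REGION: env["S3_REGION"],
  };
}

async function runSync(
  syncArgs: string[],
  env: Env,
  run: Runner = execRunner
): Promise<{ ok: boolean; output: string }> {
  const childEnv = buildAwsEnv(env);
  try {
    // Step 2: confirm the CLI is available before attempting a sync.
    await run("aws", ["--version"], childEnv);
  } catch {
    return { ok: false, output: "AWS CLI not found" };
  }
  try {
    // Step 4: run the sync. Step 5: return the captured output for logging.
    const { stdout, stderr } = await run("aws", syncArgs, childEnv);
    return { ok: true, output: stdout + stderr };
  } catch (error) {
    return { ok: false, output: String(error) };
  }
}
```

For step 1, `node-cron` accepts a cron expression and a `timezone` option, so a call such as `cron.schedule("0 0 * * *", task, { timezone: "America/Los_Angeles" })` would be one plausible way to run the task at midnight Pacific time; whether the application uses exactly this expression is an assumption.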
## Dependencies
The implementation now only requires:
- `node-cron` for scheduling
- `fs-extra` for file system operations
- Node.js built-in `child_process` for executing AWS CLI commands
No AWS SDK dependencies are needed!