WIP s3 sync and directory stats.
197
S3_SYNC_README.md
Normal file
@@ -0,0 +1,197 @@
# S3 Daily Sync Configuration

This application now includes automatic daily synchronization of the Jobs directory to an S3 bucket using the AWS CLI.

## Prerequisites

### AWS CLI Installation

The sync functionality requires the AWS CLI to be installed on your system:

**macOS:**
```bash
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /
```

**Linux (x86_64):**
```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
```

**Docker (if running in a container):**
```dockerfile
# Requires curl and unzip to be available in the base image
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
    unzip awscliv2.zip && \
    ./aws/install && \
    rm -rf awscliv2.zip aws/
```

## Required Environment Variables

Add the following environment variables to your `.env` file:

```env
# S3 Configuration (Required for daily sync)
S3_BUCKET_NAME=your-s3-bucket-name
S3_REGION=us-east-1
S3_ACCESS_KEY_ID=your-access-key-id
S3_SECRET_ACCESS_KEY=your-secret-access-key

# Optional: S3 key prefix (defaults to "jobs/" if not specified)
S3_KEY_PREFIX=jobs/
```
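
The variables above could be read and validated along these lines. This is a minimal sketch, not the application's actual loader: the `S3SyncConfig` shape and the `loadS3Config` name are hypothetical, and returning `null` when a required variable is missing is an assumed behavior. Only the variable names and the `jobs/` default come from this README.

```typescript
// Hypothetical shape for the validated settings; field names are illustrative.
interface S3SyncConfig {
  bucket: string;
  region: string;
  accessKeyId: string;
  secretAccessKey: string;
  keyPrefix: string;
}

// The four variables this README marks as required.
const REQUIRED_VARS = [
  "S3_BUCKET_NAME",
  "S3_REGION",
  "S3_ACCESS_KEY_ID",
  "S3_SECRET_ACCESS_KEY",
];

// Returns the config, or null when any required variable is missing or empty
// (assumed behavior; the real application may report this differently).
function loadS3Config(env: Record<string, string | undefined>): S3SyncConfig | null {
  for (const name of REQUIRED_VARS) {
    if (!env[name]) {
      return null;
    }
  }
  return {
    bucket: env["S3_BUCKET_NAME"] as string,
    region: env["S3_REGION"] as string,
    accessKeyId: env["S3_ACCESS_KEY_ID"] as string,
    secretAccessKey: env["S3_SECRET_ACCESS_KEY"] as string,
    // Optional prefix falls back to the documented default.
    keyPrefix: env["S3_KEY_PREFIX"] || "jobs/",
  };
}
```

In a Node.js process this would typically be called as `loadS3Config(process.env)` after the `.env` file has been loaded.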

## Features

### Automatic Daily Sync
- Runs every day at midnight Pacific Time (PST/PDT) using `aws s3 sync`
- Uses the `--delete` flag to remove files from S3 that no longer exist locally
- Efficient incremental sync (only uploads changed files)
- Comprehensive logging of sync operations
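
Because midnight Pacific Time falls at a different UTC instant in winter (PST) and summer (PDT), the schedule has to be tied to a time zone rather than a fixed UTC offset. The sketch below illustrates this with the built-in `Intl` API. The constant and function names are hypothetical. node-cron's `schedule` accepts a `timezone` option, so passing these two constants is one way the job might be configured; whether the application does exactly this is an assumption.

```typescript
// Standard five-field cron expression: minute 0 of hour 0, every day.
const DAILY_MIDNIGHT = "0 0 * * *";
// IANA zone that covers both PST and PDT.
const PACIFIC_ZONE = "America/Los_Angeles";

// Returns the wall-clock hour (0-23) of `date` in the given IANA time zone.
function hourInZone(date: Date, timeZone: string): number {
  const text = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hour: "2-digit",
    hour12: false,
  }).format(date);
  // Some runtimes render midnight as "24" in 24-hour mode, so normalize it.
  return parseInt(text, 10) % 24;
}

// Midnight Pacific is 08:00 UTC under PST but 07:00 UTC under PDT.
const winterHour = hourInZone(new Date("2024-01-15T08:00:00Z"), PACIFIC_ZONE);
const summerHour = hourInZone(new Date("2024-07-15T07:00:00Z"), PACIFIC_ZONE);
console.log(DAILY_MIDNIGHT, PACIFIC_ZONE, winterHour, summerHour);
```

A job pinned to 08:00 UTC would therefore run at 1 a.m. local time for most of the year, which is why a zone-aware schedule is the safer choice.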

### Jobs Directory Analysis
- Real-time analysis of all job folders in the Jobs directory
- Recursive document counting and size calculation
- Detailed per-job statistics including document counts and sizes
- Useful for monitoring storage usage and job completion status
- No S3 configuration required for analysis functionality

### API Endpoints

#### Check Sync Status
```
GET /sync/status
```
Returns the current status of the S3 sync scheduler, including:
- Configuration status
- Scheduler running status
- Next scheduled run time
- S3 connection availability

#### Manual Sync Trigger
```
POST /sync/trigger
```
Manually triggers an S3 sync operation (useful for testing).

#### Jobs Directory Analysis
```
GET /jobs/analysis
```
Analyzes the Jobs directory and returns detailed statistics:
- Total number of job folders
- Total documents across all jobs
- Total size in bytes and MB
- Per-job statistics including:
  - Job ID (folder name)
  - Relative path
  - Document count in that job
  - Total size for that job

**Example Response:**
```json
{
  "totalJobs": 150,
  "totalDocuments": 1250,
  "totalSizeBytes": 2147483648,
  "totalSizeMB": 2048.0,
  "jobs": [
    {
      "jobId": "JOB-001",
      "relativePath": "Jobs/JOB-001",
      "documentCount": 8,
      "totalSizeBytes": 15728640,
      "totalSizeMB": 15.0
    },
    ...
  ]
}
```
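
A response of this shape could be produced by a recursive directory walk such as the one below. This is a sketch under stated assumptions rather than the application's implementation: `walk` and `analyzeJobs` are hypothetical names, every regular file is counted as a document, and megabytes are computed as bytes / 1024², which matches the figures in the example response. It uses only Node.js built-in modules.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Field names mirror the example response above.
interface JobStats {
  jobId: string;
  relativePath: string;
  documentCount: number;
  totalSizeBytes: number;
  totalSizeMB: number;
}

interface JobsAnalysis {
  totalJobs: number;
  totalDocuments: number;
  totalSizeBytes: number;
  totalSizeMB: number;
  jobs: JobStats[];
}

const BYTES_PER_MB = 1024 * 1024;

// Recursively counts regular files under `dir` and sums their sizes.
function walk(dir: string): { count: number; bytes: number } {
  let count = 0;
  let bytes = 0;
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      const sub = walk(full);
      count += sub.count;
      bytes += sub.bytes;
    } else if (entry.isFile()) {
      count += 1;
      bytes += fs.statSync(full).size;
    }
  }
  return { count, bytes };
}

// Treats each immediate subdirectory of `jobsDir` as one job folder.
function analyzeJobs(jobsDir: string): JobsAnalysis {
  const jobs: JobStats[] = fs
    .readdirSync(jobsDir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => {
      const { count, bytes } = walk(path.join(jobsDir, entry.name));
      return {
        jobId: entry.name,
        relativePath: path.join(path.basename(jobsDir), entry.name),
        documentCount: count,
        totalSizeBytes: bytes,
        totalSizeMB: bytes / BYTES_PER_MB,
      };
    });
  const totalSizeBytes = jobs.reduce((sum, job) => sum + job.totalSizeBytes, 0);
  return {
    totalJobs: jobs.length,
    totalDocuments: jobs.reduce((sum, job) => sum + job.documentCount, 0),
    totalSizeBytes,
    totalSizeMB: totalSizeBytes / BYTES_PER_MB,
    jobs,
  };
}

// Demo against a throwaway directory: one job holding a 5-byte and a 3-byte file.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "jobs-demo-"));
const jobsDir = path.join(root, "Jobs");
fs.mkdirSync(path.join(jobsDir, "JOB-001", "scans"), { recursive: true });
fs.writeFileSync(path.join(jobsDir, "JOB-001", "a.pdf"), "12345");
fs.writeFileSync(path.join(jobsDir, "JOB-001", "scans", "b.pdf"), "123");
const demo = analyzeJobs(jobsDir);
console.log(demo.totalJobs, demo.totalDocuments, demo.totalSizeBytes);
fs.rmSync(root, { recursive: true, force: true });
```

The synchronous `fs` calls keep the sketch short; a server handling requests concurrently would more likely use the promise-based equivalents.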

## Setup Instructions

1. **Install AWS CLI**: Follow the installation instructions above for your platform

2. **Configure S3 Bucket**:
   - Create an S3 bucket in your AWS account
   - Create an IAM user with S3 permissions for the bucket
   - Generate access keys for the IAM user

3. **Set Environment Variables**:
   - Add the S3 configuration to your environment file
   - Restart the server

4. **Test the Setup**:
   - Check the sync status: `GET /sync/status`
   - Trigger a manual sync: `POST /sync/trigger`
   - Monitor the logs for sync operations

## IAM Permissions

Your IAM user needs the following permissions for the S3 bucket (replace `your-s3-bucket-name` with the value of `S3_BUCKET_NAME`):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-s3-bucket-name",
        "arn:aws:s3:::your-s3-bucket-name/*"
      ]
    }
  ]
}
```

## Advantages of Using AWS CLI

### Why the AWS CLI Instead of the SDK?
- **Simplicity**: Single command (`aws s3 sync`) handles everything
- **Efficiency**: AWS CLI is optimized for bulk operations
- **Robustness**: Built-in retry logic and error handling
- **Features**: Automatic multipart uploads, checksums, and progress tracking
- **Maintenance**: No need to manage complex SDK code for file operations

### AWS CLI Sync Command
The sync uses: `aws s3 sync /local/jobs/path s3://bucket/jobs/ --delete`

This command:
- Only uploads files that are new or modified (based on size and timestamp)
- Automatically handles large files with multipart uploads
- Deletes files from S3 that no longer exist locally (because of the `--delete` flag)
- Provides detailed output of what was transferred
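
The command line above could be assembled from the configured bucket and prefix roughly as follows. This is a sketch only: `buildSyncArgs` is a hypothetical helper, and normalizing the slashes around the prefix is an assumption about how `S3_KEY_PREFIX` might be handled.

```typescript
// Builds the argument list for `aws s3 sync <local> s3://<bucket>/<prefix>/ --delete`.
function buildSyncArgs(localDir: string, bucket: string, keyPrefix: string = "jobs/"): string[] {
  // Strip leading and trailing slashes so "jobs", "/jobs/" and "jobs/" are equivalent.
  const prefix = keyPrefix.replace(/^\/+|\/+$/g, "");
  const destination = prefix ? `s3://${bucket}/${prefix}/` : `s3://${bucket}/`;
  return ["s3", "sync", localDir, destination, "--delete"];
}

console.log(buildSyncArgs("/local/jobs/path", "bucket").join(" "));
// prints: s3 sync /local/jobs/path s3://bucket/jobs/ --delete
```

Returning an array rather than a single string lets the arguments be passed directly to `child_process` without going through a shell, which avoids quoting problems with paths that contain spaces.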

## Troubleshooting

- **AWS CLI not found**: Install AWS CLI using the instructions above
- **Permission denied**: Check IAM permissions and access keys
- **Sync fails**: Check the application logs for detailed error messages
- **Connection issues**: Verify S3 bucket name and region
- **Test connection**: Use the status endpoint to verify S3 connectivity

## How It Works

1. **Scheduler**: Uses `node-cron` to schedule daily execution at midnight Pacific Time (PST/PDT)
2. **AWS CLI Check**: Verifies AWS CLI is installed and available
3. **Credential Setup**: Sets AWS credentials as environment variables
4. **Sync Execution**: Runs `aws s3 sync` with appropriate parameters
5. **Logging**: Captures and logs all command output and errors
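
Steps 2 through 5 could look roughly like the sketch below, using Node.js's built-in `child_process` module. The function names, the `AwsCredentials` shape, and the `[s3-sync]` log prefix are hypothetical. `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION` are the standard variable names the AWS CLI reads; that the application maps its `S3_*` settings onto them in this way is an assumption.

```typescript
import { spawn, spawnSync } from "node:child_process";

// Hypothetical holder for the values read from the S3_* settings.
interface AwsCredentials {
  accessKeyId: string;
  secretAccessKey: string;
  region: string;
}

// Step 2: `aws --version` exits with status 0 when the CLI is on the PATH.
function isAwsCliAvailable(): boolean {
  const result = spawnSync("aws", ["--version"], { stdio: "ignore" });
  return result.error === undefined && result.status === 0;
}

// Step 3: copy the base environment and add the variables the AWS CLI reads.
function buildAwsEnv(
  base: Record<string, string | undefined>,
  creds: AwsCredentials
): Record<string, string | undefined> {
  return {
    ...base,
    AWS_ACCESS_KEY_ID: creds.accessKeyId,
    AWS_SECRET_ACCESS_KEY: creds.secretAccessKey,
    AWS_DEFAULT_REGION: creds.region,
  };
}

// Steps 4 and 5: run the CLI, forward its output to the log,
// and resolve with the process exit code.
function runAwsCli(args: string[], env: Record<string, string | undefined>): Promise<number> {
  return new Promise((resolve, reject) => {
    const child = spawn("aws", args, { env });
    child.stdout.on("data", (chunk) => console.log(`[s3-sync] ${chunk}`));
    child.stderr.on("data", (chunk) => console.error(`[s3-sync] ${chunk}`));
    child.on("error", reject);
    child.on("close", (code) => resolve(code ?? 1));
  });
}
```

Passing the credentials through the child's environment keeps them out of the command line, where they could otherwise show up in process listings.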

## Dependencies

The implementation requires only:
- `node-cron` for scheduling
- `fs-extra` for file system operations
- Node.js built-in `child_process` for executing AWS CLI commands

No AWS SDK dependencies are needed.