High-performance AWS Lambda function demonstrating CPU-intensive parallel processing with Rust and Rayon.
AWS Lambda allocates CPU power in proportion to the configured memory: roughly one full vCPU at 1,769 MB, scaling up to 6 vCPUs at 10,240 MB. Handler code, however, is typically written single-threaded, so for CPU-intensive workloads like cryptographic operations, image processing, or data transformations, leaving those extra vCPUs idle wastes both time and money.
This project demonstrates how to unlock Lambda's full CPU potential using Rust and Rayon, achieving:
- Up to 6.73x speedup with 6 vCPUs vs single-threaded processing
- Near-linear scaling with proper workload distribution
- Cost optimization - ARM64 (Graviton2) offers best price/performance
This project implements a multi-threaded Rust Lambda function that processes bcrypt password hashes in parallel across multiple vCPUs. It demonstrates practical techniques for CPU-bound parallel processing on AWS Lambda, with comprehensive benchmarks across both ARM64 (Graviton2) and x86_64 architectures.
Key Features:
- Multi-threaded processing using Rayon's work-stealing scheduler
- Configurable worker count via environment variables
- Support for ARM64 (Graviton2) and x86_64 architectures
- Proper thread pool initialization during cold start
- Input validation and error handling
- Comprehensive performance benchmarks with P95/P99 metrics
```
rust-multithread-lambda/
├── src/
│   ├── main.rs                 # Lambda entry point with thread pool initialization
│   └── handler.rs              # Request handler with Rayon implementation
├── scripts/
│   ├── comprehensive_test.sh   # Full benchmark suite (ARM64 & x86_64, 20 runs per config)
│   ├── validation_test.sh      # Quick validation test for deployments
│   ├── burst_test.sh           # Work-stealing scheduler demonstration
│   └── cloudwatch_metrics.sh   # CloudWatch metrics collection script
├── template.yaml               # SAM template for deploying all 12 configurations
├── Cargo.toml                  # Dependencies and build configuration
└── README.md                   # This file
```
Before starting, you should have:
- An AWS account with Lambda permissions
- Rust 1.70 or later installed (Installation guide)
- Cargo Lambda installed (Installation guide)
- AWS CLI configured with your credentials
- Basic understanding of Rust and concurrency concepts
```bash
# Build for ARM64 (recommended)
cargo lambda build --release --arm64

# Or build for x86_64
cargo lambda build --release --x86-64
```

Binary size: ~1.7 MB (uncompressed), ~0.8 MB (zipped)
```bash
# Deploy with 6144 MB memory (4 vCPUs) and 4 workers
cargo lambda deploy rust-multithread-lambda \
  --memory 6144 \
  --timeout 30 \
  --env-vars WORKER_COUNT=4
```

Deploy all 12 benchmark configurations (6 ARM64 + 6 x86_64) using the SAM template:
```bash
# Build for both architectures
cargo lambda build --release --arm64 --output-format zip
mv target/lambda/rust-multithread-lambda target/lambda/rust-multithread-lambda-arm64
cargo lambda build --release --x86-64 --output-format zip
mv target/lambda/rust-multithread-lambda target/lambda/rust-multithread-lambda-x86

# Extract bootstrap binaries (required by SAM)
cd target/lambda/rust-multithread-lambda-arm64 && unzip -o bootstrap.zip && cd -
cd target/lambda/rust-multithread-lambda-x86 && unzip -o bootstrap.zip && cd -

# Deploy with SAM
sam deploy --guided --stack-name rust-multithread-benchmark
```

This deploys Lambda functions with the following naming convention:
- ARM64: `rust-bench-arm64-{vcpu}vcpu-{memory}mb`
- x86_64: `rust-bench-x86-{vcpu}vcpu-{memory}mb`
```bash
# Test the function
aws lambda invoke \
  --function-name rust-multithread-lambda \
  --payload '{"count":20,"mode":"parallel"}' \
  --cli-binary-format raw-in-base64-out \
  response.json

# View results
cat response.json | jq .
```

Example response:
```json
{
  "processed": 20,
  "duration_ms": 463,
  "mode": "parallel",
  "workers": 4,
  "detected_cpus": 4,
  "avg_ms_per_item": 23.15,
  "memory_used_kb": 3508,
  "threads_used": 4
}
```

The request payload has the following shape:

```json
{
  "count": 20,       // Number of items to process (1-1000)
  "mode": "parallel" // "parallel", "sequential", or "auto"
}
```

Input Validation:
- `count` must be between 1 and 1000
- Invalid inputs return descriptive error messages
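For illustration, the documented schema and validation rule map naturally onto a serde struct along these lines (a sketch; the default handling and error wording are assumptions, not the project's exact code):

```rust
use serde::Deserialize;

// Hypothetical request type matching the documented payload shape.
#[derive(Deserialize)]
struct Request {
    count: usize,
    #[serde(default = "default_mode")]
    mode: String,
}

fn default_mode() -> String {
    "auto".to_string()
}

// Enforce the documented 1-1000 range on `count`.
fn validate(req: &Request) -> Result<(), String> {
    if !(1..=1000).contains(&req.count) {
        return Err(format!("count must be between 1 and 1000, got {}", req.count));
    }
    Ok(())
}
```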
Benchmarks ran in us-east-1 with bcrypt hashing (cost factor 10) on both ARM64 (Graviton2) and x86_64. All results are averages over 20 warm invocations per configuration.
Understanding the metrics:
- Speedup: Performance improvement vs single-threaded baseline (1 worker)
- Efficiency: How well additional vCPUs are utilized (100% = perfect linear scaling)
- Cold Init: First invocation latency (includes thread pool initialization)
- Warm Processing: Average execution time after container warm-up
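Concretely, efficiency is the standard ratio `speedup / workers`: for example, 4 workers at 4.07x speedup gives 4.07 / 4 ≈ 102%. Values slightly above 100% are plausible here because the 1536 MB baseline is allocated less than one full vCPU.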
ARM64 (Graviton2):

| Memory | vCPUs | Workers | Warm Processing | Speedup |
|---|---|---|---|---|
| 1536 MB | ~1 | 1 | 1,885 ms | 1.00x |
| 2048 MB | ~2 | 2 | 1,334 ms | 1.41x |
| 4096 MB | ~3 | 3 | 685 ms | 2.75x |
| 6144 MB | ~4 | 4 | 463 ms | 4.07x |
| 8192 MB | ~5 | 5 | 338 ms | 5.57x |
| 10240 MB | ~6 | 6 | 280 ms | 6.73x |
6144 MB (4 workers) offers the best balance of cost and performance - see Cost Analysis section below.
x86_64:

| Memory | vCPUs | Workers | Warm Processing | Speedup |
|---|---|---|---|---|
| 1536 MB | ~1 | 1 | 1,671 ms | 1.00x |
| 2048 MB | ~2 | 2 | 1,253 ms | 1.33x |
| 4096 MB | ~3 | 3 | 892 ms | 1.87x |
| 6144 MB | ~4 | 4 | 429 ms | 3.89x |
| 8192 MB | ~5 | 5 | 330 ms | 5.06x |
| 10240 MB | ~6 | 6 | 292 ms | 5.72x |
Recommended Configuration:
- Cost-Optimized: ARM64 with 6144 MB (4 workers) - best price/performance
- Performance: ARM64 with 10240 MB (6 workers) - fastest at 280ms (6.73x speedup)
- Cold starts: Similar to warm invocations (thread pool init is lightweight)
- Near-linear scaling: Up to 6.73x speedup with 6 workers on ARM64
- Memory efficiency: Uses only 1.5-3.5 MB regardless of allocation
- Multi-threading validated: the `threads_used` field proves actual parallel execution
- Architecture: ARM64 recommended for better price/performance at higher vCPU counts
- `WORKER_COUNT`: Number of parallel workers (1-6, default: auto-detect from CPU count)
- Memory: 1536-10240 MB
- Timeout: 30-120 seconds (depending on workload)
- Architecture: ARM64 or x86_64
- Runtime: provided.al2023
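A minimal sketch of how the default could be resolved at startup (`resolve_worker_count` is a hypothetical helper; the fallback to `num_cpus` matches the auto-detect behavior described above):

```rust
// Read WORKER_COUNT, keep it within the documented 1-6 range,
// and fall back to the detected CPU count.
fn resolve_worker_count() -> usize {
    std::env::var("WORKER_COUNT")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .filter(|n| (1..=6).contains(n))
        .unwrap_or_else(num_cpus::get)
}
```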
The function uses `std::sync::Once` to ensure the Rayon global thread pool is initialized exactly once during cold start:

```rust
use std::sync::Once;

static INIT: Once = Once::new();

pub fn init_thread_pool(workers: usize) {
    INIT.call_once(|| {
        // build_global() errors if a global pool already exists; safe to ignore here.
        let _ = rayon::ThreadPoolBuilder::new()
            .num_threads(workers)
            .build_global();
    });
}
```

The request handler uses Rayon's `par_iter()` for data parallelism, with thread ID tracking to prove multi-threading:
```rust
use rayon::prelude::*;
use std::collections::HashSet;
use std::sync::Mutex;

// Record the ID of every OS thread that processes an item.
let thread_ids: Mutex<HashSet<std::thread::ThreadId>> = Mutex::new(HashSet::new());

let results: Vec<_> = items
    .par_iter()
    .map(|item| {
        thread_ids.lock().unwrap().insert(std::thread::current().id());
        hash_password(item)
    })
    .collect();

let threads_used = thread_ids.lock().unwrap().len();
```

Use Multi-threading (Rayon) for:
- CPU-bound workloads where the bottleneck is computation
- Tasks that can be parallelized independently (embarrassingly parallel problems)
- Operations like:
- Cryptographic hashing/encryption
- Image/video processing
- Data transformations and calculations
- Scientific computing
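Building on this list, here is a hedged sketch of how the documented `mode` switch between sequential and parallel processing could look (the dispatch logic is an assumption; bcrypt cost factor 10 matches the benchmarks above):

```rust
use rayon::prelude::*;

// bcrypt with cost factor 10, as used in the benchmarks.
fn hash_password(password: &str) -> String {
    bcrypt::hash(password, 10).expect("bcrypt hashing failed")
}

// Hypothetical dispatch between the documented modes.
fn process(items: &[String], mode: &str, workers: usize) -> Vec<String> {
    let parallel = match mode {
        "parallel" => true,
        "sequential" => false,
        _ => workers > 1, // "auto": parallelize only when it can help
    };
    if parallel {
        items.par_iter().map(|s| hash_password(s)).collect()
    } else {
        items.iter().map(|s| hash_password(s)).collect()
    }
}
```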
Use Async (Tokio) for:
- I/O-bound workloads where the bottleneck is waiting for external resources
- Operations like:
- API calls to external services
- Database queries
- File I/O operations
- Network requests
Why Tokio is included:
- AWS Lambda runtime requires an async runtime (Tokio)
- Tokio handles the Lambda event loop and invocation management
- Rayon handles the parallel CPU work within each invocation
- They complement each other: Tokio for concurrency, Rayon for parallelism
Key Difference:
- Async (Tokio): Handles many tasks concurrently by switching between them while waiting for I/O
- Multi-threading (Rayon): Executes multiple computations simultaneously on different CPU cores
This project uses both: Tokio for the Lambda runtime, and Rayon for parallel CPU processing.
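A minimal `main.rs` sketch of that division of labor (reusing `init_thread_pool` from above; `resolve_worker_count` and `do_parallel_work` are hypothetical stand-ins for the helpers described in this README):

```rust
use lambda_runtime::{run, service_fn, Error, LambdaEvent};
use serde_json::{json, Value};

#[tokio::main]
async fn main() -> Result<(), Error> {
    // Cold start: build the Rayon pool once, before serving any events.
    init_thread_pool(resolve_worker_count());
    // Tokio drives the Lambda event loop.
    run(service_fn(handler)).await
}

async fn handler(event: LambdaEvent<Value>) -> Result<Value, Error> {
    // Lambda delivers one event at a time, so blocking this handler
    // on Rayon's CPU-bound work is acceptable here.
    let processed = do_parallel_work(&event.payload);
    Ok(json!({ "processed": processed }))
}
```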
All scripts are located in the `scripts/` directory.

Full benchmark suite that tests all 12 configurations (6 ARM64 + 6 x86_64):

```bash
./scripts/comprehensive_test.sh
```

- Runs 20 warm invocations per configuration
- Calculates P50, P90, P95, P99 percentiles
- Generates CSV results in `test-results/`
- Compares ARM64 vs x86_64 performance

Quick validation test for verifying deployments:

```bash
./scripts/validation_test.sh
```

Demonstrates Rayon's work-stealing scheduler behavior:

```bash
./scripts/burst_test.sh
```

Collects CloudWatch metrics for deployed functions:

```bash
./scripts/cloudwatch_metrics.sh
```

For processing 20 bcrypt hashes (us-east-1 pricing):
Pricing: ARM64 $0.0000133334/GB-second, x86_64 $0.0000166667/GB-second
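As a sanity check on the tables, per-invocation compute cost is roughly memory (GB) × billed duration (s) × rate. For the ARM64 4-worker row: 6 GB × 0.463 s × $0.0000133334/GB-s ≈ $0.000037, i.e., about $37 per million invocations before the $0.20/1M request charge; the slightly higher figures below presumably reflect total billed duration (runtime overhead included) rather than processing time alone.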
ARM64:

| Config | Memory | Duration | Cost per 1M Invocations |
|---|---|---|---|
| 1 worker | 1536 MB | 1,885 ms | $38.60 |
| 2 workers | 2048 MB | 1,334 ms | $36.46 |
| 3 workers | 4096 MB | 685 ms | $37.47 |
| 4 workers | 6144 MB | 463 ms | $37.97 |
| 5 workers | 8192 MB | 338 ms | $36.94 |
| 6 workers | 10240 MB | 280 ms | $38.27 |
x86_64:

| Config | Memory | Duration | Cost per 1M Invocations |
|---|---|---|---|
| 1 worker | 1536 MB | 1,671 ms | $42.78 |
| 2 workers | 2048 MB | 1,253 ms | $42.77 |
| 3 workers | 4096 MB | 892 ms | $60.80 |
| 4 workers | 6144 MB | 429 ms | $44.00 |
| 5 workers | 8192 MB | 330 ms | $45.10 |
| 6 workers | 10240 MB | 292 ms | $49.87 |
Key Insights:
- ARM64 consistently cheaper than x86_64 (15-30% savings)
- ARM64 5-6 workers offer best price/performance for high throughput
- ARM64 2 workers cheapest overall ($36.46 per 1M)
- x86_64 has anomalous performance at 3 vCPUs (possibly throttling)
Recommended for:
- Batch data processing
- Cryptographic operations (hashing, encryption)
- Image/video processing
- Scientific computing
- High-volume workloads (>100K invocations/day)
Not recommended for:
- I/O-bound operations (use async instead)
- Simple transformations (<100ms)
- Low-volume workloads (<10K invocations/day)
Core dependencies from `Cargo.toml`:

```toml
[dependencies]
lambda_runtime = "1.0.0"
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
bcrypt = "0.15"
rayon = "1.7"
num_cpus = "1.16"
```

The function needs only basic CloudWatch Logs permissions:

```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "arn:aws:logs:*:*:*"
}
]
}
```

```bash
# Delete Lambda function
aws lambda delete-function --function-name rust-multithread-lambda
# Delete CloudWatch logs
aws logs delete-log-group --log-group-name /aws/lambda/rust-multithread-lambda
```

```bash
# Delete all 12 Lambda functions deployed via SAM
aws cloudformation delete-stack --stack-name rust-multithread-benchmark
```

This project demonstrates that multi-threaded processing in AWS Lambda is not only possible but highly effective for CPU-bound workloads. Key takeaways:
- Rayon makes parallelism simple: just replace `.iter()` with `.par_iter()` and Rayon handles thread management automatically.
- Thread pool initialization is critical: initialize the global thread pool during cold start, not during request processing, using `std::sync::Once`.
- ARM64 (Graviton2) delivers best value: 15-30% cheaper than x86_64 with comparable or better performance at higher vCPU counts.
- Measure actual thread usage: the `threads_used` field provides empirical proof that multi-threading is working correctly.
- Sweet spot varies by workload: for bcrypt hashing, 4-6 workers provide the best throughput-to-cost ratio.
MIT