Cluster Configuration
Cluster Configuration is where you control the computational infrastructure that powers PJI's data processing workloads. This module allows you to allocate computing resources, optimize performance, and manage costs for heavy-duty processing tasks.
Who is this for? Platform administrators and technical operations teams responsible for infrastructure management, performance optimization, and cost control.
Allocate Compute Resources for Data Processing
Compute clusters provide the processing power for PJI's most resource-intensive operations. When you configure a cluster, you're determining where and how these workloads run:
- Data Ingestion Jobs: Processing large volumes of incoming clinical data from external sources
- NLP Processing Pipelines: Running natural language processing models to extract insights from unstructured text
- De-identification Workloads: Applying privacy-preserving transformations to protect patient identities
- Analytics Queries: Executing complex analytical computations across large datasets
Think of clusters as dedicated computing environments that handle background processing while keeping the interactive platform responsive.
Cluster Configuration Options
Configure your compute resources to match your organization's processing needs and budget. These settings determine how much computational power is available and how it's allocated.
Cluster Size
Choose the baseline computing capacity for your workloads:
| Size | When to Use It |
|---|---|
| Small | Ideal for light workloads, testing environments, or organizations processing small data volumes |
| Medium | Suitable for moderate processing needs with regular data ingestion and NLP tasks |
| Large | Recommended for high-volume environments with frequent batch processing and multiple concurrent workloads |
| X-Large | Designed for enterprise-scale operations with continuous, heavy processing demands |
How to choose: Start with the smallest size that meets your needs, then scale up if you observe performance issues or long job queues.
Auto-Scaling
Control whether your cluster automatically adjusts its capacity based on demand.
When enabled:
- The cluster automatically adds computing resources during peak processing times
- Resources scale back down during quiet periods to reduce costs
- You set minimum and maximum capacity limits to control scaling boundaries
When disabled:
- The cluster maintains a fixed capacity regardless of workload
- Performance and costs stay predictable
- This mode is recommended when processing demand is consistent and steady
Best practice: Enable auto-scaling if your workloads are unpredictable or have clear peak periods (e.g., nightly batch processing). Disable it for steady-state workloads where predictable costs matter more than the ability to absorb demand spikes.
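To make the role of the minimum and maximum limits concrete, here is a minimal sketch of a scaling decision. It is illustrative only: the `ScalingPolicy` fields, the utilization thresholds, and the one-node-at-a-time step are assumptions made for this example and do not describe PJI's actual scaling logic or settings.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical auto-scaling settings (not PJI's real configuration schema)."""
    min_nodes: int
    max_nodes: int
    scale_up_above: float = 0.80    # assumed threshold for adding capacity
    scale_down_below: float = 0.30  # assumed threshold for removing capacity


def target_nodes(current: int, utilization: float, policy: ScalingPolicy) -> int:
    """Return the next node count: one step up or down, clamped to the limits."""
    if utilization > policy.scale_up_above:
        desired = current + 1
    elif utilization < policy.scale_down_below:
        desired = current - 1
    else:
        desired = current
    return max(policy.min_nodes, min(policy.max_nodes, desired))


policy = ScalingPolicy(min_nodes=2, max_nodes=6)
print(target_nodes(current=3, utilization=0.92, policy=policy))  # 4: peak load adds a node
print(target_nodes(current=6, utilization=0.95, policy=policy))  # 6: already at the maximum
print(target_nodes(current=2, utilization=0.10, policy=policy))  # 2: already at the minimum
```

The clamp on the last line of `target_nodes` is what keeps costs bounded during spikes and keeps a baseline of capacity available during quiet periods.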
Instance Types
Select the type of computing hardware that best matches your workload characteristics:
| Instance Type | Optimized For | Best Used When |
|---|---|---|
| CPU-optimized | General processing tasks with high computational needs | Running standard data transformations, business logic, and general-purpose NLP |
| Memory-optimized | Large datasets that need to be held in RAM | Processing very large documents, maintaining large vocabularies, or working with memory-intensive algorithms |
| GPU | Parallel processing and deep learning models | Running advanced neural network models, accelerating complex NLP transformations, or processing medical imaging |
Note: Different workload types can use different instance types. For example, you might use GPU instances for specialized NLP models while using CPU instances for routine data ingestion.
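The guidance in the table can be expressed as a simple decision rule. The sketch below is one possible reading of that guidance; the function name, its two inputs, and the order in which the checks run are assumptions for illustration, not a PJI feature.

```python
from enum import Enum


class InstanceType(Enum):
    CPU = "CPU-optimized"
    MEMORY = "Memory-optimized"
    GPU = "GPU"


def suggest_instance_type(uses_deep_learning: bool, large_in_memory_data: bool) -> InstanceType:
    """Pick an instance type from two coarse workload traits (hypothetical rule)."""
    if uses_deep_learning:
        return InstanceType.GPU      # neural network models, medical imaging
    if large_in_memory_data:
        return InstanceType.MEMORY   # very large documents or vocabularies
    return InstanceType.CPU          # routine ingestion and transformations


# Different workloads can map to different instance types.
print(suggest_instance_type(uses_deep_learning=False, large_in_memory_data=False).value)  # CPU-optimized
print(suggest_instance_type(uses_deep_learning=True, large_in_memory_data=False).value)   # GPU
```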
Scheduling
Manage how multiple jobs compete for cluster resources:
Job Priorities: Assign importance levels to different types of workloads:
- High-priority jobs (e.g., urgent clinical analytics) run before lower-priority tasks
- Critical processing is more likely to complete on time, even when the cluster is busy
- Less important jobs cannot block time-sensitive operations
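Priority-based ordering is commonly implemented with a priority queue. The following is a minimal sketch using Python's standard `heapq` module; the priority labels, the `JobQueue` class, and the job names are hypothetical and only illustrate the ordering behavior described above.

```python
import heapq
import itertools

# Assumed priority levels: a lower number runs first.
PRIORITY = {"high": 0, "normal": 1, "low": 2}


class JobQueue:
    """Orders jobs by priority, then by submission order within a priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves submission order

    def submit(self, name, priority="normal"):
        heapq.heappush(self._heap, (PRIORITY[priority], next(self._counter), name))

    def next_job(self):
        """Return the name of the next job to run, or None if the queue is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None


queue = JobQueue()
queue.submit("monthly-report", priority="low")
queue.submit("urgent-clinical-analytics", priority="high")
queue.submit("nightly-ingestion")

print(queue.next_job())  # urgent-clinical-analytics
print(queue.next_job())  # nightly-ingestion
print(queue.next_job())  # monthly-report
```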
Resource Allocation: Control how much of the cluster each job type can use:
- Set limits to prevent any single workload from consuming all available resources
- Allocate guaranteed minimums for critical operations
- Balance competing demands from data ingestion, NLP, and analytics tasks
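Guaranteed minimums and upper limits can be combined in a two-pass allocation: first honor every guarantee, then hand out what is left up to each cap. The sketch below assumes capacity is measured in abstract units and that job types are listed in priority order; the function, the rule format, and the numbers are hypothetical.

```python
def allocate(total, rules, demand):
    """Split `total` capacity units across job types.

    rules:  {job_type: (guaranteed_minimum, cap)}, listed in priority order
    demand: {job_type: units currently requested}
    """
    if sum(guaranteed for guaranteed, _ in rules.values()) > total:
        raise ValueError("guaranteed minimums exceed total capacity")

    # Pass 1: each job type gets its guarantee, but never more than it asked for.
    grants = {job: min(demand.get(job, 0), guaranteed)
              for job, (guaranteed, _) in rules.items()}
    remaining = total - sum(grants.values())

    # Pass 2: distribute leftover capacity in priority order, up to each cap.
    for job, (_, cap) in rules.items():
        wanted = min(demand.get(job, 0), cap) - grants[job]
        extra = max(0, min(wanted, remaining))
        grants[job] += extra
        remaining -= extra
    return grants


rules = {"analytics": (20, 50), "ingestion": (20, 60), "nlp": (10, 60)}
demand = {"analytics": 30, "ingestion": 80, "nlp": 40}
print(allocate(100, rules, demand))
# {'analytics': 30, 'ingestion': 60, 'nlp': 10}
```

In this example ingestion asks for 80 units but is held to its cap of 60, and NLP still receives its guaranteed 10 units even though the cluster is fully committed.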
Common scheduling strategies:
- Reserve dedicated capacity for time-sensitive clinical workflows
- Allow batch processing jobs to use all available capacity during off-peak hours
- Queue lower-priority analytics during business hours and run them overnight
Cluster Monitoring
Monitoring tools help you understand how your cluster is performing and identify opportunities for optimization. Regular monitoring helps you confirm that the capacity you pay for is being used effectively.
Cluster Utilization Metrics
Track how much of your available computing capacity is actually being used:
What to monitor:
- CPU utilization: Percentage of processing power in use vs. sitting idle
- Memory usage: How much RAM is consumed by active jobs
- Storage I/O: Read/write activity on data storage systems
- Network throughput: Data transfer rates between cluster components
What to look for:
- Consistently low utilization: You may be over-provisioned and paying for unused capacity
- Frequently maxed-out resources: You may need a larger cluster or auto-scaling limits that allow more capacity
- Unbalanced usage: One resource (e.g., CPU) is maxed out while another (e.g., memory) sits idle, which suggests the instance type does not match the workload
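The three patterns above can be checked mechanically once you have utilization samples. This is a minimal sketch, assuming samples are fractions between 0 and 1 exported from your monitoring data; the 30% and 85% thresholds and the label strings are illustrative assumptions, not values defined by PJI.

```python
from statistics import mean


def assess_utilization(cpu_samples, memory_samples, low=0.30, high=0.85):
    """Classify average CPU and memory utilization (fractions between 0 and 1)."""
    cpu, mem = mean(cpu_samples), mean(memory_samples)
    if max(cpu, mem) > high:
        # One resource is saturated; if the other is idle, the mix is wrong.
        return "unbalanced" if min(cpu, mem) < low else "under-provisioned"
    if max(cpu, mem) < low:
        return "over-provisioned"
    return "ok"


print(assess_utilization([0.10, 0.20], [0.15, 0.25]))  # over-provisioned
print(assess_utilization([0.90, 0.95], [0.90, 0.88]))  # under-provisioned
print(assess_utilization([0.90, 0.95], [0.10, 0.20]))  # unbalanced
```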
Job Execution Statistics
Understand how individual processing jobs are performing:
Key metrics:
- Job completion times: How long each type of workload takes to finish
- Success vs. failure rates: Track which jobs complete successfully and which encounter errors
- Queue wait times: How long jobs wait before starting execution
- Concurrent job counts: Number of jobs running simultaneously
Use these insights to:
- Identify bottlenecks where jobs are waiting too long to start
- Spot failing jobs that need troubleshooting
- Understand typical processing times for capacity planning
- Optimize scheduling to reduce conflicts between concurrent workloads
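If you export job records for your own analysis, the key metrics reduce to simple arithmetic on timestamps. The sketch below assumes a hypothetical record format (`JobRecord`) with times in seconds; it does not reflect the actual shape of PJI's job data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobRecord:
    """Hypothetical exported job record; times are in seconds."""
    job_type: str
    queued_at: float
    started_at: float
    finished_at: float
    succeeded: bool


def summarize(records):
    """Compute average queue wait, average run time, and failure rate."""
    return {
        "avg_wait_s": mean([r.started_at - r.queued_at for r in records]),
        "avg_run_s": mean([r.finished_at - r.started_at for r in records]),
        "failure_rate": sum(not r.succeeded for r in records) / len(records),
    }


records = [
    JobRecord("ingestion", 0.0, 10.0, 70.0, True),    # waited 10 s, ran 60 s
    JobRecord("nlp", 0.0, 30.0, 150.0, True),         # waited 30 s, ran 120 s
    JobRecord("analytics", 5.0, 25.0, 85.0, False),   # waited 20 s, ran 60 s, failed
]
stats = summarize(records)
print(stats["avg_wait_s"], stats["avg_run_s"])  # 20.0 80.0
```

A rising `avg_wait_s` with a stable `avg_run_s` points to a queueing bottleneck rather than slower jobs.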
Resource Consumption Tracking
Monitor resource usage at a granular level to understand what's driving costs:
Track consumption by:
- Job type: Compare resource usage across data ingestion, NLP processing, and analytics
- Time period: Identify daily, weekly, or monthly patterns in resource demand
- User or department: Attribute costs to specific teams or use cases (if configured)
- Individual pipelines: See which NLP models or data transformations are most resource-intensive
This information helps you allocate costs accurately and make informed decisions about infrastructure investments.
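As an illustration of attributing consumption along different dimensions, the sketch below totals usage by any chosen field. The row format and the `node_hours` unit are assumptions for this example; substitute whatever fields your exported usage data contains.

```python
from collections import defaultdict


def consumption_by(usage_rows, key):
    """Total node-hours grouped by the given field (e.g., job type or department)."""
    totals = defaultdict(float)
    for row in usage_rows:
        totals[row[key]] += row["node_hours"]
    return dict(totals)


# Hypothetical usage export.
rows = [
    {"job_type": "nlp", "department": "research", "node_hours": 12.0},
    {"job_type": "ingestion", "department": "operations", "node_hours": 4.5},
    {"job_type": "nlp", "department": "operations", "node_hours": 3.0},
]

print(consumption_by(rows, "job_type"))    # {'nlp': 15.0, 'ingestion': 4.5}
print(consumption_by(rows, "department"))  # {'research': 12.0, 'operations': 7.5}
```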
Cost Optimization Recommendations
The monitoring system analyzes your usage patterns and suggests ways to reduce expenses without sacrificing performance:
Common recommendations:
- Rightsize your cluster: Move to a smaller (or larger) size based on actual utilization
- Adjust auto-scaling thresholds: Optimize when the cluster scales up or down to match your workload patterns
- Change instance types: Switch to more cost-effective hardware that better matches your workload characteristics
- Optimize job scheduling: Shift non-urgent workloads to off-peak times when compute costs may be lower
- Consolidate underutilized clusters: Combine separate clusters that aren't being fully used
Best practice: Review cost optimization recommendations monthly and implement changes during planned maintenance windows to avoid disrupting production workloads.
Best Practices for Cluster Management
Start Small and Scale Up: Begin with conservative cluster sizes and expand based on actual performance data rather than theoretical needs.
Enable Monitoring from Day One: Track metrics from the start so you have baseline data for optimization decisions.
Schedule Resource-Intensive Jobs Strategically: Run heavy batch processing during off-peak hours to keep the platform responsive during business hours.
Test Configuration Changes in Non-Production: Before adjusting production cluster settings, validate changes in a test environment to avoid unexpected performance impacts.
Review Costs Regularly: Monthly reviews of resource consumption and costs help catch inefficiencies before they become expensive problems.
Getting Help
If you need assistance with cluster configuration:
- Performance Issues: Contact your PJI technical team or implementation partner for optimization guidance
- Cost Concerns: Review the cost optimization recommendations in the monitoring dashboard or consult with your account manager
- Configuration Questions: Reach out to PJI support for help choosing the right cluster settings for your workload