Cluster Configuration

Cluster Configuration is where you control the computational infrastructure that powers PJI's data processing workloads. This module allows you to allocate computing resources, optimize performance, and manage costs for heavy-duty processing tasks.

Who is this for? Platform administrators and technical operations teams responsible for infrastructure management, performance optimization, and cost control.

Allocate Compute Resources for Data Processing

Compute clusters provide the processing power for PJI's most resource-intensive operations. When you configure a cluster, you're determining where and how these workloads run:

Data Ingestion Jobs: Processing large volumes of incoming clinical data from external sources
NLP Processing Pipelines: Running natural language processing models to extract insights from unstructured text
De-identification Workloads: Applying privacy-preserving transformations to protect patient identities
Analytics Queries: Executing complex analytical computations across large datasets

Think of clusters as dedicated computing environments that handle background processing while keeping the interactive platform responsive.

Cluster Configuration Options

Configure your compute resources to match your organization's processing needs and budget. These settings determine how much computational power is available and how it's allocated.

Cluster Size

Choose the baseline computing capacity for your workloads:

Size	When to Use It
Small	Ideal for light workloads, testing environments, or organizations processing small data volumes
Medium	Suitable for moderate processing needs with regular data ingestion and NLP tasks
Large	Recommended for high-volume environments with frequent batch processing and multiple concurrent workloads
X-Large	Designed for enterprise-scale operations with continuous, heavy processing demands

How to choose: Start with the smallest size that meets your needs, then scale up if you observe performance issues or long job queues.
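
The "start small, then scale up" rule can be sketched as a simple check. This is a minimal illustration, not PJI's sizing logic: the function name `recommend_size` and the 300-second wait threshold are assumptions chosen for the example, and only the four size names come from the table above.

```python
# Size names come from the table above; everything else is hypothetical.
SIZES = ["Small", "Medium", "Large", "X-Large"]


def recommend_size(current: str, avg_queue_wait_s: float, max_wait_s: float = 300.0) -> str:
    """Suggest the next size up when the average queue wait exceeds the threshold."""
    index = SIZES.index(current)
    # Only step up one size at a time, and never past the largest size.
    if avg_queue_wait_s > max_wait_s and index < len(SIZES) - 1:
        return SIZES[index + 1]
    return current


print(recommend_size("Small", avg_queue_wait_s=45.0))
print(recommend_size("Small", avg_queue_wait_s=900.0))
```

Stepping one size at a time keeps each change small enough to evaluate against fresh performance data before scaling again.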

Auto-Scaling

Control whether your cluster automatically adjusts its capacity based on demand.

When enabled:

The cluster automatically adds computing resources during peak processing times
Resources scale back down during quiet periods to reduce costs
You set minimum and maximum capacity limits to control scaling boundaries

When disabled:

The cluster maintains a fixed capacity regardless of workload
Performance and costs stay predictable
This mode is recommended when you have consistent, steady processing demands

Best practice: Enable auto-scaling if your workloads are unpredictable or have clear peak periods (e.g., nightly batch processing). Disable it for steady-state workloads where predictable costs matter more than dynamic optimization.
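
To make the role of the minimum and maximum limits concrete, here is a minimal sketch of a scaling decision. It assumes a node-based cluster and utilization thresholds of 80% and 30%; the class `AutoScalingPolicy`, its field names, and the thresholds are hypothetical and do not reflect PJI's actual settings or scaling algorithm.

```python
from dataclasses import dataclass


@dataclass
class AutoScalingPolicy:
    """Hypothetical policy; field names are illustrative, not PJI settings."""
    min_nodes: int
    max_nodes: int
    scale_up_at: float = 0.80    # add a node above this utilization
    scale_down_at: float = 0.30  # remove a node below this utilization


def next_node_count(current: int, utilization: float, policy: AutoScalingPolicy) -> int:
    """Return the node count for the next interval, clamped to the policy limits."""
    if utilization > policy.scale_up_at:
        target = current + 1
    elif utilization < policy.scale_down_at:
        target = current - 1
    else:
        target = current
    # The limits bound scaling in both directions.
    return max(policy.min_nodes, min(policy.max_nodes, target))


policy = AutoScalingPolicy(min_nodes=2, max_nodes=6)
nodes = 2
for load in [0.35, 0.90, 0.95, 0.50, 0.10, 0.10, 0.10]:
    nodes = next_node_count(nodes, load, policy)
    print(f"utilization={load:.2f} -> nodes={nodes}")
```

In this run the cluster grows from 2 to 4 nodes during the peak and returns to the 2-node minimum once load drops, never going below it.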

Instance Types

Select the type of computing hardware that best matches your workload characteristics:

Instance Type	Optimized For	Best Used When
CPU-optimized	General processing tasks with high computational needs	Running standard data transformations, business logic, and general-purpose NLP
Memory-optimized	Large datasets that need to be held in RAM	Processing very large documents, maintaining large vocabularies, or working with memory-intensive algorithms
GPU	Parallel processing and deep learning models	Running advanced neural network models, accelerating complex NLP transformations, or processing medical imaging

Note: Different workload types can use different instance types. For example, you might use GPU instances for specialized NLP models while using CPU instances for routine data ingestion.
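
One way to picture per-workload instance selection is a lookup table. The sketch below is an assumption made for illustration: the workload keys, the instance labels, and the helper `instance_type_for` are hypothetical names, not PJI configuration values.

```python
# Hypothetical mapping; workload keys and instance labels are illustrative.
INSTANCE_TYPE_BY_WORKLOAD = {
    "data_ingestion": "cpu-optimized",
    "general_nlp": "cpu-optimized",
    "large_document_processing": "memory-optimized",
    "deep_learning_nlp": "gpu",
    "medical_imaging": "gpu",
}


def instance_type_for(workload: str, default: str = "cpu-optimized") -> str:
    """Look up the instance type for a workload, falling back to a default."""
    return INSTANCE_TYPE_BY_WORKLOAD.get(workload, default)


print(instance_type_for("deep_learning_nlp"))
print(instance_type_for("data_ingestion"))
```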

Scheduling

Manage how multiple jobs compete for cluster resources:

Job Priorities: Assign importance levels to different types of workloads:

High-priority jobs (e.g., urgent clinical analytics) run before lower-priority tasks
Critical processing completes on time even when the cluster is busy
Less important jobs cannot block time-sensitive operations

Resource Allocation: Control how much of the cluster each job type can use:

Set limits to prevent any single workload from consuming all available resources
Allocate guaranteed minimums for critical operations
Balance competing demands from data ingestion, NLP, and analytics tasks

Common scheduling strategies:

Reserve dedicated capacity for time-sensitive clinical workflows
Allow batch processing jobs to use all available capacity during off-peak hours
Queue lower-priority analytics during business hours and run them overnight
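
The interaction between job priorities and per-type resource limits can be sketched with a priority queue. This is an illustrative model only: the "slots" unit, the priority labels, and the functions `enqueue` and `dispatch` are assumptions for the example and are not part of any PJI API.

```python
import heapq
import itertools
from dataclasses import dataclass, field

PRIORITY = {"high": 0, "normal": 1, "low": 2}  # lower rank runs first
_counter = itertools.count()  # preserves submission order within a priority


@dataclass(order=True)
class QueuedJob:
    rank: int
    seq: int
    name: str = field(compare=False)
    job_type: str = field(compare=False)
    slots: int = field(compare=False)


def enqueue(queue, name, job_type, priority, slots):
    heapq.heappush(queue, QueuedJob(PRIORITY[priority], next(_counter), name, job_type, slots))


def dispatch(queue, free_slots, caps):
    """Start jobs in priority order within total and per-type slot limits.

    Jobs that do not fit are put back on the queue, so a smaller
    lower-priority job may start ahead of a larger one that is waiting.
    """
    started, deferred, used = [], [], {}
    while queue:
        job = heapq.heappop(queue)
        type_used = used.get(job.job_type, 0)
        fits_cluster = job.slots <= free_slots
        fits_cap = type_used + job.slots <= caps.get(job.job_type, float("inf"))
        if fits_cluster and fits_cap:
            started.append(job.name)
            free_slots -= job.slots
            used[job.job_type] = type_used + job.slots
        else:
            deferred.append(job)
    for job in deferred:
        heapq.heappush(queue, job)
    return started


queue = []
enqueue(queue, "nightly-analytics", "analytics", "low", 4)
enqueue(queue, "ingest-batch-1", "ingestion", "normal", 3)
enqueue(queue, "ingest-batch-2", "ingestion", "normal", 3)
enqueue(queue, "urgent-clinical-report", "analytics", "high", 2)

print(dispatch(queue, free_slots=8, caps={"ingestion": 4}))
```

Here the urgent report starts first although it was submitted last, the second ingestion batch waits because ingestion is capped at 4 slots, and the low-priority analytics job stays queued until capacity frees up.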

Cluster Monitoring

Monitoring tools help you understand how your cluster is performing and identify opportunities for optimization. Regular monitoring shows whether the capacity you pay for matches the performance you actually get.

Cluster Utilization Metrics

Track how much of your available computing capacity is actually being used:

What to monitor:

CPU utilization: Percentage of processing power in use vs. sitting idle
Memory usage: How much RAM is consumed by active jobs
Storage I/O: Read/write activity on data storage systems
Network throughput: Data transfer rates between cluster components

What to look for:

Consistently low utilization: You may be over-provisioned and paying for unused capacity
Frequently maxed-out resources: You likely need a larger cluster or higher auto-scaling limits
Unbalanced usage: If one resource (e.g., CPU) is maxed out while another (e.g., memory) sits idle, the instance type is probably a poor match for the workload
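
The three patterns above can be expressed as a small classification rule. The sketch below is a simplification under stated assumptions: it averages utilization samples (expressed as fractions between 0 and 1), and the 30% and 85% thresholds and the function name `assess_utilization` are illustrative choices, not values from PJI's monitoring system.

```python
from statistics import mean


def assess_utilization(cpu_samples, memory_samples, low=0.30, high=0.85):
    """Classify average CPU and memory utilization into one of the patterns above."""
    cpu, mem = mean(cpu_samples), mean(memory_samples)
    if cpu < low and mem < low:
        return "over-provisioned"
    if cpu > high and mem < low:
        return "unbalanced: CPU-bound"
    if mem > high and cpu < low:
        return "unbalanced: memory-bound"
    if cpu > high or mem > high:
        return "saturated"
    return "healthy"


print(assess_utilization([0.10, 0.20], [0.15, 0.25]))
print(assess_utilization([0.90, 0.95], [0.10, 0.20]))
print(assess_utilization([0.90, 0.95], [0.90, 0.90]))
```

A real assessment would look at how often a threshold is crossed over days or weeks rather than at a single average, since short spikes can hide behind a moderate mean.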

Job Execution Statistics

Understand how individual processing jobs are performing:

Key metrics:

Job completion times: How long each type of workload takes to finish
Success vs. failure rates: Track which jobs complete successfully and which encounter errors
Queue wait times: How long jobs wait before starting execution
Concurrent job counts: Number of jobs running simultaneously

Use these insights to:

Identify bottlenecks where jobs are waiting too long to start
Spot failing jobs that need troubleshooting
Understand typical processing times for capacity planning
Optimize scheduling to reduce conflicts between concurrent workloads
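
As an illustration of how these metrics combine, the sketch below summarizes a handful of job records by workload type. The record fields (`type`, `wait_s`, `run_s`, `ok`) and the sample values are invented for the example; they do not represent an export format provided by PJI.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical job records; field names and values are illustrative.
jobs = [
    {"type": "ingestion", "wait_s": 10, "run_s": 300, "ok": True},
    {"type": "ingestion", "wait_s": 30, "run_s": 500, "ok": True},
    {"type": "nlp", "wait_s": 120, "run_s": 900, "ok": False},
    {"type": "nlp", "wait_s": 240, "run_s": 1100, "ok": True},
]


def summarize_jobs(records):
    """Group jobs by type and compute average wait, average run time, and failure rate."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["type"]].append(record)
    summary = {}
    for job_type, group in grouped.items():
        summary[job_type] = {
            "jobs": len(group),
            "avg_wait_s": mean(r["wait_s"] for r in group),
            "avg_run_s": mean(r["run_s"] for r in group),
            "failure_rate": sum(not r["ok"] for r in group) / len(group),
        }
    return summary


summary = summarize_jobs(jobs)
for job_type, stats in summary.items():
    print(job_type, stats)
```

In this sample the NLP jobs wait far longer than ingestion jobs and fail half the time, which would point to both a queueing bottleneck and a job that needs troubleshooting.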

Resource Consumption Tracking

Monitor resource usage at a granular level to understand what's driving costs:

Track consumption by:

Job type: Compare resource usage across data ingestion, NLP processing, and analytics
Time period: Identify daily, weekly, or monthly patterns in resource demand
User or department: Attribute costs to specific teams or use cases (if configured)
Individual pipelines: See which NLP models or data transformations are most resource-intensive

This information helps you allocate costs accurately and make informed decisions about infrastructure investments.
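
A minimal sketch of this kind of attribution, assuming usage is recorded in node-hours: the same records are totaled along different dimensions. The record fields, the node-hour unit, and the helper `total_by` are assumptions made for the example, not a PJI data format.

```python
from collections import defaultdict

# Hypothetical usage records; field names and values are illustrative.
usage = [
    {"job_type": "ingestion", "department": "operations", "node_hours": 6.0},
    {"job_type": "nlp", "department": "research", "node_hours": 12.5},
    {"job_type": "nlp", "department": "operations", "node_hours": 3.5},
    {"job_type": "analytics", "department": "research", "node_hours": 8.0},
]


def total_by(records, key):
    """Sum node-hours grouped by the given record field."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["node_hours"]
    return dict(totals)


print(total_by(usage, "job_type"))
print(total_by(usage, "department"))
```

Grouping one set of records by several keys lets you answer "which workload costs the most" and "which team should be charged" without collecting the data twice.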

Cost Optimization Recommendations

The monitoring system analyzes your usage patterns and suggests ways to reduce expenses without sacrificing performance:

Common recommendations:

Rightsize your cluster: Suggestions to move to a smaller (or larger) size based on actual utilization
Adjust auto-scaling thresholds: Optimize when the cluster scales up or down to match your workload patterns
Change instance types: Switch to more cost-effective hardware that better matches your workload characteristics
Optimize job scheduling: Shift non-urgent workloads to off-peak times when compute costs may be lower
Consolidate underutilized clusters: Combine separate clusters that aren't being fully used

Best practice: Review cost optimization recommendations monthly and implement changes during planned maintenance windows to avoid disrupting production workloads.

Best Practices for Cluster Management

Start Small and Scale Up: Begin with conservative cluster sizes and expand based on actual performance data rather than theoretical needs.

Enable Monitoring from Day One: Track metrics from the start so you have baseline data for optimization decisions.

Schedule Resource-Intensive Jobs Strategically: Run heavy batch processing during off-peak hours to keep the platform responsive during business hours.

Test Configuration Changes in Non-Production: Before adjusting production cluster settings, validate changes in a test environment to avoid unexpected performance impacts.

Review Costs Regularly: Monthly reviews of resource consumption and costs help catch inefficiencies before they become expensive problems.

Getting Help

If you need assistance with cluster configuration:

Performance Issues: Contact your PJI technical team or implementation partner for optimization guidance
Cost Concerns: Review the cost optimization recommendations in the monitoring dashboard or consult with your account manager
Configuration Questions: Reach out to PJI support for help choosing the right cluster settings for your workload
