Deploying Patient Journey Intelligence on Databricks
Patient Journey Intelligence integrates with Databricks to provide a unified platform for healthcare data engineering, NLP processing, and analytics, all running on your existing Databricks infrastructure.
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ Databricks Workspace │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Compute Layer │ │
│ │ │ │
│ │ ┌──────────────┐ ┌───────────────┐ ┌────────────────┐ │ │
│ │ │ Interactive │ │ Job Clusters │ │ SQL Warehouse │ │ │
│ │ │ Clusters │ │ │ │ │ │ │
│ │ │ │ │ - Ingestion │ │ - OMOP Query │ │ │
│ │ │ - Ad-hoc │ │ - NLP │ │ - Analytics │ │ │
│ │ │ - Analysis │ │ - De-ID │ │ │ │ │
│ │ └──────────────┘ └───────────────┘ └────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Storage Layer │ │
│ │ │ │
│ │ ┌──────────────┐ ┌───────────────┐ ┌────────────────┐ │ │
│ │ │ Delta Lake │ │ DBFS/Unity │ │ External │ │ │
│ │ │ │ │ Catalog │ │ Storage │ │ │
│ │ │ - OMOP CDM │ │ │ │ │ │ │
│ │ │ - Curated │ │ - Metadata │ │ - S3/ADLS/GCS │ │ │
│ │ │ - Bronze/ │ │ - Lineage │ │ - Raw Files │ │ │
│ │ │ Silver/ │ │ - Governance │ │ │ │ │
│ │ │ Gold │ │ │ │ │ │ │
│ │ └──────────────┘ └───────────────┘ └────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Processing Frameworks │ │
│ │ │ │
│ │ ┌──────────────┐ ┌───────────────┐ ┌────────────────┐ │ │
│ │ │ Spark NLP │ │ Delta Live │ │ MLflow │ │ │
│ │ │ │ │ Tables │ │ │ │ │
│ │ │ - Clinical │ │ │ │ - Model Mgmt │ │ │
│ │ │ NER/RE │ │ - Pipelines │ │ - Tracking │ │ │
│ │ │ - De-ID │ │ - Quality │ │ │ │ │
│ │ └──────────────┘ └───────────────┘ └────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────┐
│ Patient Journey Intelligence Application Layer (Optional) │
│ │
│ ┌──────────────┐ ┌───────────────┐ ┌─────────────────────┐ │
│ │ Web UI │ │ API Server │ │ Database (RDS/ │ │
│ │ (React) │ │ (REST) │ │ Azure DB/ │ │
│ │ │ │ │ │ Cloud SQL) │ │
│ │ - Cohorts │ │ - Workflow │ │ │ │
│ │ - Journeys │ │ - Metadata │ │ - User Data │ │
│ │ - Copilot │ │ │ │ - Config │ │
│ └──────────────┘ └───────────────┘ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Deployment Models
Model 1: Databricks-Native (Recommended)
Best for: Organizations with existing Databricks investments
- All data processing runs on Databricks
- OMOP CDM stored in Delta Lake tables
- Unity Catalog for governance
- Notebooks for exploration and development
- Delta Live Tables for ingestion pipelines
- Optional Patient Journey Intelligence UI deployed separately (EKS/AKS/GKE)
Advantages:
- Leverage existing Databricks infrastructure
- Unified platform for data engineering and analytics
- Native lineage and governance via Unity Catalog
- Cost-effective (no separate compute infrastructure)
Model 2: Hybrid Deployment
- Patient Journey Intelligence application layer on Kubernetes (AWS/Azure/GCP)
- Databricks for batch processing and analytics
- Data synchronized between platforms
Best for: Organizations requiring Patient Journey Intelligence's full UI capabilities with Databricks analytics
Key Features on Databricks
Delta Lake Storage
OMOP CDM tables stored as Delta tables:
- Bronze Layer: Raw ingested data
- Silver Layer: Cleaned and validated
- Gold Layer: OMOP CDM v5.4 tables
Benefits:
- ACID transactions
- Time travel (data versioning)
- Schema evolution
- Efficient upserts
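The "efficient upserts" point refers to Delta's MERGE semantics: incoming rows are matched to existing rows on a key, matches are updated, and the rest are inserted. A minimal pure-Python sketch of that behavior, with plain dicts standing in for Delta tables and illustrative column names:

```python
def merge_upsert(target, updates, key="person_id"):
    """Sketch of Delta MERGE INTO semantics: update rows whose key
    matches, insert rows whose key does not (plain dicts stand in
    for Delta tables)."""
    index = {row[key]: dict(row) for row in target}
    for row in updates:
        if row[key] in index:
            index[row[key]].update(row)   # WHEN MATCHED THEN UPDATE
        else:
            index[row[key]] = dict(row)   # WHEN NOT MATCHED THEN INSERT
    return sorted(index.values(), key=lambda r: r[key])

target = [{"person_id": 1, "year_of_birth": 1980},
          {"person_id": 2, "year_of_birth": 1975}]
updates = [{"person_id": 2, "year_of_birth": 1976},   # correction
           {"person_id": 3, "year_of_birth": 1990}]   # new patient
merged = merge_upsert(target, updates)
```

On Databricks itself this is a single `MERGE INTO ... WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT` statement, executed atomically under Delta's ACID guarantees.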
Unity Catalog Integration
- Centralized metadata management
- Data lineage tracking
- Fine-grained access control
- Audit logging
- Data discovery
Spark NLP for Healthcare
Native integration with Spark NLP library:
- Clinical Named Entity Recognition (NER)
- Relation Extraction
- Assertion detection
- De-identification
- 1000+ pre-trained healthcare models
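To make the de-identification output concrete, here is a plain-Python sketch of the kind of masked text that stage produces. The regex patterns are illustrative only; the actual Spark NLP for Healthcare DeIdentification annotator uses trained clinical NER models, not regexes:

```python
import re

# Illustrative PHI patterns only -- the real de-identification stage
# relies on trained clinical NER models, not regexes.
PHI_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN:\s*\d+\b"),
}

def deidentify(note_text: str) -> str:
    """Replace PHI spans with typed placeholders, mirroring the
    masked output a de-identification stage emits."""
    for label, pattern in PHI_PATTERNS.items():
        note_text = pattern.sub(f"<{label}>", note_text)
    return note_text

note = "Seen on 03/14/2024, MRN: 12345, callback 555-867-5309."
masked = deidentify(note)
```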
Delta Live Tables (DLT)
Declarative data pipelines for:
- Source data ingestion
- Data quality validation
- NLP processing workflows
- OMOP transformation
- Automated monitoring and recovery
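The data quality validation works through DLT expectations: `@dlt.expect_or_drop` keeps rows satisfying a predicate, drops the rest, and surfaces pass/fail counts as pipeline metrics. A plain-Python sketch of those semantics, assuming an illustrative `valid_note` expectation:

```python
def expect_or_drop(rows, name, predicate):
    """Sketch of Delta Live Tables' expect_or_drop semantics: rows
    failing the expectation are dropped, and the pass/drop counts
    are what DLT surfaces as data-quality metrics."""
    kept = [r for r in rows if predicate(r)]
    metrics = {"expectation": name,
               "passed": len(kept),
               "dropped": len(rows) - len(kept)}
    return kept, metrics

rows = [{"note_id": 1, "note_text": "cough, fever"},
        {"note_id": 2, "note_text": None},
        {"note_id": 3, "note_text": "chest pain"}]
kept, metrics = expect_or_drop(rows, "valid_note",
                               lambda r: r["note_text"] is not None)
```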
Prerequisites
Databricks Workspace
- Edition: Premium or Enterprise
- Region: Choose based on data residency requirements
- Runtime: DBR 12.2 LTS or higher
- Unity Catalog: Enabled (recommended)
Cloud Provider Resources
AWS:
- S3 buckets for external storage
- IAM roles for cross-account access
- VPC peering (if needed)
Azure:
- Azure Data Lake Storage Gen2
- Service principals for access
- VNet peering (if needed)
GCP:
- Google Cloud Storage
- Service accounts
- VPC peering (if needed)
Compute Quotas
- Budget for at least ~100 DBUs/day (see Resource Sizing below)
- Photon-enabled clusters (recommended)
- GPU instances for large-scale NLP (optional)
Installation Process
1. Workspace Configuration (Week 1)
# Install Patient Journey Intelligence libraries on Databricks cluster
%pip install johnsnowlabs==5.2.0
%pip install sparknlp-jsl==5.2.0

# Initialize John Snow Labs session
from johnsnowlabs import nlp

nlp.start(
    spark=spark,
    license_keys={"AWS_ACCESS_KEY_ID": "...", "AWS_SECRET_ACCESS_KEY": "..."}
)
2. Unity Catalog Setup (Week 1)
-- Create catalog for OMOP CDM
CREATE CATALOG IF NOT EXISTS omop_cdm;
-- Create schemas
CREATE SCHEMA IF NOT EXISTS omop_cdm.cdm_v5_4;
CREATE SCHEMA IF NOT EXISTS omop_cdm.vocabulary;
CREATE SCHEMA IF NOT EXISTS omop_cdm.metadata;
-- Set permissions
GRANT USAGE ON CATALOG omop_cdm TO `data_analysts`;
GRANT SELECT ON SCHEMA omop_cdm.cdm_v5_4 TO `data_analysts`;
3. Data Ingestion Pipeline (Weeks 2-3)
Create Delta Live Tables pipeline:
import dlt
from pyspark.sql.functions import *

@dlt.table(
    comment="Raw clinical notes ingested from source systems",
    table_properties={"quality": "bronze"}
)
def bronze_clinical_notes():
    return (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/raw/clinical_notes/")
    )

@dlt.table(
    comment="De-identified clinical notes",
    table_properties={"quality": "silver"}
)
@dlt.expect_or_drop("valid_note", "note_text IS NOT NULL")
def silver_clinical_notes():
    from sparknlp_jsl.annotator import DeIdentification

    # Apply de-identification
    df = dlt.read("bronze_clinical_notes")
    # ... NLP pipeline code ...
    return deidentified_df

@dlt.table(
    comment="OMOP Note table with extracted entities",
    table_properties={"quality": "gold"}
)
def gold_omop_note():
    return transform_to_omop_note(dlt.read("silver_clinical_notes"))
4. OMOP CDM Creation (Weeks 3-4)
Deploy OMOP CDM v5.4 schema as Delta tables:
-- Person table
CREATE TABLE IF NOT EXISTS omop_cdm.cdm_v5_4.person (
    person_id BIGINT NOT NULL,
    gender_concept_id INT NOT NULL,
    year_of_birth INT NOT NULL,
    month_of_birth INT,
    day_of_birth INT,
    birth_datetime TIMESTAMP,
    race_concept_id INT NOT NULL,
    ethnicity_concept_id INT NOT NULL,
    -- ... additional fields ...
) USING DELTA
LOCATION '/mnt/omop/person'
TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');
-- Create all OMOP tables...
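Once the tables are loaded, the NOT NULL constraints in the DDL above are a natural first target for the validation step. A plain-Python sketch of such a check; the concept IDs in the sample rows are illustrative:

```python
# Required (NOT NULL) columns from the person DDL above.
REQUIRED = ["person_id", "gender_concept_id", "year_of_birth",
            "race_concept_id", "ethnicity_concept_id"]

def validate_person(row: dict) -> list:
    """Return the list of NOT NULL violations for one person row --
    the kind of check the validation step runs against the Delta
    tables before sign-off."""
    return [col for col in REQUIRED if row.get(col) is None]

# Sample rows; the concept IDs are illustrative.
good = {"person_id": 1, "gender_concept_id": 8507, "year_of_birth": 1980,
        "race_concept_id": 8527, "ethnicity_concept_id": 38003564}
bad = {"person_id": 2, "gender_concept_id": None, "year_of_birth": 1975,
       "race_concept_id": 8527, "ethnicity_concept_id": 38003564}
```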
5. Validation & Testing (Weeks 5-6)
- Data quality validation
- Performance benchmarking
- User acceptance testing
Resource Sizing
Small Deployment (< 100K patients)
| Resource | Specification | Monthly Cost |
|---|---|---|
| Job Clusters | 2 x i3.xlarge (4 vCPUs, 30.5 GB RAM each) | ~100 DBUs/day = $1,500 |
| SQL Warehouse | 2X-Small | ~50 DBUs/day = $750 |
| Storage (Delta) | 500 GB | $15 |
| Total | | ~$2,300 |
Medium Deployment (100K - 1M patients)
| Resource | Specification | Monthly Cost |
|---|---|---|
| Job Clusters | 4 x i3.2xlarge (8 vCPUs, 61 GB RAM each) | ~400 DBUs/day = $6,000 |
| SQL Warehouse | X-Small | ~150 DBUs/day = $2,250 |
| Storage (Delta) | 5 TB | $150 |
| Total | | ~$8,500 |
Large Deployment (> 1M patients)
| Resource | Specification | Monthly Cost |
|---|---|---|
| Job Clusters | 8 x i3.4xlarge (16 vCPUs, 122 GB RAM each) | ~1,000 DBUs/day = $15,000 |
| SQL Warehouse | Small | ~400 DBUs/day = $6,000 |
| Storage (Delta) | 50 TB | $1,500 |
| Total | | ~$23,000 |
Costs are rough estimates based on AWS Databricks list pricing; Photon-enabled clusters consume DBUs at roughly 2x the rate.
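The monthly figures above follow from simple DBU arithmetic. A sketch of that calculation; the ~$0.50/DBU effective rate is inferred from the sizing tables, not an official Databricks price:

```python
def monthly_cost(dbus_per_day: float, rate_per_dbu: float = 0.50,
                 days: int = 30) -> float:
    """Estimate monthly compute spend from daily DBU consumption.
    The default rate is the effective $/DBU implied by the sizing
    tables above, not a quoted Databricks price."""
    return dbus_per_day * days * rate_per_dbu

# Small deployment compute: job clusters + SQL warehouse
small = monthly_cost(100) + monthly_cost(50)   # 1500 + 750
```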
Performance Optimization
Cluster Configuration
# Recommended Spark configuration for NLP processing.
# Note: spark.driver.memory and the spark.executor.* settings cannot be
# changed at runtime with spark.conf.set() -- set them in the cluster's
# Spark config (Compute > Advanced Options) when the cluster is created.
spark.driver.memory 32g
spark.executor.memory 32g
spark.executor.cores 8
spark.sql.adaptive.enabled true
spark.databricks.delta.optimizeWrite.enabled true
Delta Table Optimization
-- Optimize tables regularly
OPTIMIZE omop_cdm.cdm_v5_4.person;
OPTIMIZE omop_cdm.cdm_v5_4.observation_period;
-- Z-order by common filter columns
OPTIMIZE omop_cdm.cdm_v5_4.condition_occurrence ZORDER BY (person_id, condition_start_date);
-- Vacuum old versions
VACUUM omop_cdm.cdm_v5_4.person RETAIN 168 HOURS;
Caching
# Cache frequently accessed tables
spark.sql("CACHE TABLE omop_cdm.vocabulary.concept")
Security & Compliance
Data Encryption
- Encryption at rest (cloud provider default)
- Encryption in transit (TLS 1.2+)
- Customer-managed keys (optional)
Access Control
- Unity Catalog fine-grained ACLs
- SSO via Azure AD / Okta / SAML
- Row-level and column-level security
- Dynamic view masking for PHI
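Dynamic view masking can be sketched in plain Python. On Databricks itself this is a SQL view that branches on a group-membership function such as `is_account_group_member()`; the column and group names below are illustrative:

```python
# Illustrative PHI column and group names.
PHI_COLUMNS = {"birth_datetime", "person_source_value"}

def masked_view(row: dict, user_groups: set) -> dict:
    """Return the row as the querying user would see it: members of
    the (illustrative) phi_readers group see everything; everyone
    else sees PHI columns redacted."""
    if "phi_readers" in user_groups:
        return dict(row)
    return {col: ("***" if col in PHI_COLUMNS else val)
            for col, val in row.items()}

row = {"person_id": 1, "birth_datetime": "1980-02-01", "year_of_birth": 1980}
analyst_view = masked_view(row, {"data_analysts"})
```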
Audit Logging
- Unity Catalog audit logs
- Cluster and job logs
- Query history tracking
Compliance
- HIPAA-eligible Databricks accounts
- BAA with Databricks
- Compliance logging and reporting
Integration with Patient Journey Intelligence Features
Cohort Builder
- Execute cohort queries via Databricks SQL
- Export cohorts as Delta tables
- Real-time cohort refresh
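A cohort query ultimately reduces to filtering OMOP event tables and collecting distinct person IDs. A pure-Python sketch over in-memory rows; the concept IDs (201826 for type 2 diabetes, 316866 for another condition) are illustrative:

```python
def build_cohort(condition_rows, concept_id, start_date):
    """Return the distinct person_ids whose condition_occurrence rows
    match the concept on/after the index date -- the same filter a
    cohort query pushes down to Databricks SQL."""
    return sorted({r["person_id"] for r in condition_rows
                   if r["condition_concept_id"] == concept_id
                   and r["condition_start_date"] >= start_date})

conditions = [
    {"person_id": 1, "condition_concept_id": 201826,
     "condition_start_date": "2023-05-01"},
    {"person_id": 2, "condition_concept_id": 201826,
     "condition_start_date": "2019-01-10"},   # before index date
    {"person_id": 3, "condition_concept_id": 316866,
     "condition_start_date": "2023-07-04"},   # different condition
]
cohort = build_cohort(conditions, 201826, "2023-01-01")
```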
Patient Journeys
- Query timeline data from OMOP tables
- Join across person, visit, condition, drug, procedure tables
Copilot
- Natural language to SQL via Databricks AI
- Query OMOP data through conversational interface
Cost Optimization
Auto-Scaling
- Enable autoscaling for job clusters
- Scale down during off-peak hours
- Use spot instances for fault-tolerant workloads
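The bullets above map onto a Databricks Clusters API payload. A sketch of such a spec; the node type, pool sizes, and timeout are illustrative, so verify field names against the API version in your workspace:

```python
# Sketch of a Clusters API spec implementing the recommendations
# above; sizes, node type, and timeout are illustrative.
cluster_spec = {
    "spark_version": "12.2.x-scala2.12",
    "node_type_id": "i3.2xlarge",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,            # terminate when idle
    "aws_attributes": {
        "first_on_demand": 1,                 # keep the driver on-demand
        "availability": "SPOT_WITH_FALLBACK"  # spot workers, on-demand fallback
    },
}
```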
Photon Engine
- 2-3x faster queries
- Higher throughput for NLP
- Consider cost/performance trade-off
Cluster Policies
- Enforce maximum cluster sizes
- Automatic termination after idle time
- Restrict instance types
Monitoring
Databricks Metrics
- Cluster utilization
- Job success/failure rates
- Query performance
- DBU consumption
Data Quality
- Delta Live Tables expectations
- Data freshness monitoring
- Schema drift detection
Next Steps
- Discovery Call: Discuss your Databricks environment with John Snow Labs
- Workspace Setup: Configure Unity Catalog and permissions
- Pilot Project: Process sample dataset
- Production Rollout: Scale to full data volume