
Deploying Patient Journey Intelligence on Databricks

Patient Journey Intelligence integrates with Databricks to provide a unified platform for healthcare data engineering, NLP processing, and analytics on your existing Databricks infrastructure.

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│ Databricks Workspace │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Compute Layer │ │
│ │ │ │
│ │ ┌──────────────┐ ┌───────────────┐ ┌────────────────┐ │ │
│ │ │ Interactive │ │ Job Clusters │ │ SQL Warehouse │ │ │
│ │ │ Clusters │ │ │ │ │ │ │
│ │ │ │ │ - Ingestion │ │ - OMOP Query │ │ │
│ │ │ - Ad-hoc │ │ - NLP │ │ - Analytics │ │ │
│ │ │ - Analysis │ │ - De-ID │ │ │ │ │
│ │ └──────────────┘ └───────────────┘ └────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Storage Layer │ │
│ │ │ │
│ │ ┌──────────────┐ ┌───────────────┐ ┌────────────────┐ │ │
│ │ │ Delta Lake │ │ DBFS/Unity │ │ External │ │ │
│ │ │ │ │ Catalog │ │ Storage │ │ │
│ │ │ - OMOP CDM │ │ │ │ │ │ │
│ │ │ - Curated │ │ - Metadata │ │ - S3/ADLS/GCS │ │ │
│ │ │ - Bronze/ │ │ - Lineage │ │ - Raw Files │ │ │
│ │ │ Silver/ │ │ - Governance │ │ │ │ │
│ │ │ Gold │ │ │ │ │ │ │
│ │ └──────────────┘ └───────────────┘ └────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Processing Frameworks │ │
│ │ │ │
│ │ ┌──────────────┐ ┌───────────────┐ ┌────────────────┐ │ │
│ │ │ Spark NLP │ │ Delta Live │ │ MLflow │ │ │
│ │ │ │ │ Tables │ │ │ │ │
│ │ │ - Clinical │ │ │ │ - Model Mgmt │ │ │
│ │ │ NER/RE │ │ - Pipelines │ │ - Tracking │ │ │
│ │ │ - De-ID │ │ - Quality │ │ │ │ │
│ │ └──────────────┘ └───────────────┘ └────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────────────┐
│ Patient Journey Intelligence Application Layer (Optional) │
│ │
│ ┌──────────────┐ ┌───────────────┐ ┌─────────────────────┐ │
│ │ Web UI │ │ API Server │ │ Database (RDS/ │ │
│ │ (React) │ │ (REST) │ │ Azure DB/ │ │
│ │ │ │ │ │ Cloud SQL) │ │
│ │ - Cohorts │ │ - Workflow │ │ │ │
│ │ - Journeys │ │ - Metadata │ │ - User Data │ │
│ │ - Copilot │ │ │ │ - Config │ │
│ └──────────────┘ └───────────────┘ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘

Deployment Models

Model 1: Native Databricks Deployment

Best for: Organizations with existing Databricks investments

  • All data processing runs on Databricks
  • OMOP CDM stored in Delta Lake tables
  • Unity Catalog for governance
  • Notebooks for exploration and development
  • Delta Live Tables for ingestion pipelines
  • Optional Patient Journey Intelligence UI deployed separately (EKS/AKS/GKE)

Advantages:

  • Leverage existing Databricks infrastructure
  • Unified platform for data engineering and analytics
  • Native lineage and governance via Unity Catalog
  • Cost-effective (no separate compute infrastructure)

Model 2: Hybrid Deployment

Best for: Organizations requiring Patient Journey Intelligence's full UI capabilities with Databricks analytics

  • Patient Journey Intelligence application layer on Kubernetes (AWS/Azure/GCP)
  • Databricks for batch processing and analytics
  • Data synchronized between platforms

Key Features on Databricks

Delta Lake Storage

OMOP CDM tables stored as Delta tables:

  • Bronze Layer: Raw ingested data
  • Silver Layer: Cleaned and validated
  • Gold Layer: OMOP CDM v5.4 tables

Benefits:

  • ACID transactions
  • Time travel (data versioning)
  • Schema evolution
  • Efficient upserts
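Time travel and upserts can both be exercised directly in Databricks SQL. The snippet below is a sketch using the catalog layout created later in this guide; `staging_person` is a hypothetical staging table, and the version number is illustrative:

```sql
-- Query the person table as of an earlier version (time travel)
SELECT COUNT(*) FROM omop_cdm.cdm_v5_4.person VERSION AS OF 12;

-- Upsert new or changed patient records from a staging table
MERGE INTO omop_cdm.cdm_v5_4.person AS target
USING staging_person AS source
  ON target.person_id = source.person_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```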

Unity Catalog Integration

  • Centralized metadata management
  • Data lineage tracking
  • Fine-grained access control
  • Audit logging
  • Data discovery

Spark NLP for Healthcare

Native integration with Spark NLP library:

  • Clinical Named Entity Recognition (NER)
  • Relation Extraction
  • Assertion detection
  • De-identification
  • 1000+ pre-trained healthcare models

Delta Live Tables (DLT)

Declarative data pipelines for:

  • Source data ingestion
  • Data quality validation
  • NLP processing workflows
  • OMOP transformation
  • Automated monitoring and recovery

Prerequisites

Databricks Workspace

  • Edition: Premium or Enterprise
  • Region: Choose based on data residency requirements
  • Runtime: DBR 12.2 LTS or higher
  • Unity Catalog: Enabled (recommended)

Cloud Provider Resources

AWS:

  • S3 buckets for external storage
  • IAM roles for cross-account access
  • VPC peering (if needed)

Azure:

  • Azure Data Lake Storage Gen2
  • Service principals for access
  • VNet peering (if needed)

GCP:

  • Google Cloud Storage
  • Service accounts
  • VPC peering (if needed)

Compute Quotas

  • Minimum ~100 DBUs/day of available compute quota
  • Photon-enabled clusters (recommended)
  • GPU instances for large-scale NLP (optional)

Installation Process

1. Workspace Configuration (Week 1)

# Install Patient Journey Intelligence libraries on the Databricks cluster
%pip install johnsnowlabs==5.2.0
%pip install sparknlp-jsl==5.2.0

# Initialize the John Snow Labs session
from johnsnowlabs import nlp

nlp.start(
    spark=spark,
    license_keys={"AWS_ACCESS_KEY_ID": "...", "AWS_SECRET_ACCESS_KEY": "..."},
)

2. Unity Catalog Setup (Week 1)

-- Create catalog for OMOP CDM
CREATE CATALOG IF NOT EXISTS omop_cdm;

-- Create schemas
CREATE SCHEMA IF NOT EXISTS omop_cdm.cdm_v5_4;
CREATE SCHEMA IF NOT EXISTS omop_cdm.vocabulary;
CREATE SCHEMA IF NOT EXISTS omop_cdm.metadata;

-- Set permissions (Unity Catalog privilege names)
GRANT USE CATALOG ON CATALOG omop_cdm TO `data_analysts`;
GRANT USE SCHEMA ON SCHEMA omop_cdm.cdm_v5_4 TO `data_analysts`;
GRANT SELECT ON SCHEMA omop_cdm.cdm_v5_4 TO `data_analysts`;

3. Data Ingestion Pipeline (Week 2-3)

Create Delta Live Tables pipeline:

import dlt
from pyspark.sql.functions import *

@dlt.table(
    comment="Raw clinical notes ingested from source systems",
    table_properties={"quality": "bronze"},
)
def bronze_clinical_notes():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/clinical_notes/")
    )

@dlt.table(
    comment="De-identified clinical notes",
    table_properties={"quality": "silver"},
)
@dlt.expect_or_drop("valid_note", "note_text IS NOT NULL")
def silver_clinical_notes():
    from sparknlp_jsl.annotator import DeIdentification

    # Apply de-identification
    df = dlt.read("bronze_clinical_notes")
    # ... NLP pipeline code ...
    return deidentified_df

@dlt.table(
    comment="OMOP Note table with extracted entities",
    table_properties={"quality": "gold"},
)
def gold_omop_note():
    return transform_to_omop_note(dlt.read("silver_clinical_notes"))

4. OMOP CDM Creation (Week 3-4)

Deploy OMOP CDM v5.4 schema as Delta tables:

-- Person table
CREATE TABLE IF NOT EXISTS omop_cdm.cdm_v5_4.person (
  person_id BIGINT NOT NULL,
  gender_concept_id INT NOT NULL,
  year_of_birth INT NOT NULL,
  month_of_birth INT,
  day_of_birth INT,
  birth_datetime TIMESTAMP,
  race_concept_id INT NOT NULL,
  ethnicity_concept_id INT NOT NULL,
  -- ... additional fields ...
) USING DELTA
LOCATION '/mnt/omop/person'
TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

-- Create all OMOP tables...

5. Validation & Testing (Week 5-6)

  • Data quality validation
  • Performance benchmarking
  • User acceptance testing

Resource Sizing

Small Deployment (< 100K patients)

Resource           Specification                         Monthly Cost
Job Clusters       2 x i3.xlarge (8 cores, 30 GB RAM)    ~100 DBUs/day = $1,500
SQL Warehouse      Small (2X-Small)                      ~50 DBUs/day = $750
Storage (Delta)    500 GB                                $15
Total                                                    ~$2,300

Medium Deployment (100K - 1M patients)

Resource           Specification                          Monthly Cost
Job Clusters       4 x i3.2xlarge (16 cores, 60 GB RAM)   ~400 DBUs/day = $6,000
SQL Warehouse      Medium (X-Small)                       ~150 DBUs/day = $2,250
Storage (Delta)    5 TB                                   $150
Total                                                     ~$8,500

Large Deployment (> 1M patients)

Resource           Specification                           Monthly Cost
Job Clusters       8 x i3.4xlarge (32 cores, 120 GB RAM)   ~1,000 DBUs/day = $15,000
SQL Warehouse      Large (Small)                           ~400 DBUs/day = $6,000
Storage (Delta)    50 TB                                   $1,500
Total                                                      ~$23,000

Costs based on AWS Databricks Standard pricing. Photon-enabled clusters cost ~2x.
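As a sanity check on these estimates, monthly compute cost is roughly daily DBU consumption × 30 days × the per-DBU rate; the tables above imply a rate of about $0.50/DBU, though actual rates vary by SKU and cloud. A minimal sketch:

```python
def monthly_compute_cost(dbus_per_day: float, rate_per_dbu: float = 0.50,
                         days: int = 30) -> float:
    """Estimate monthly Databricks compute cost from daily DBU usage.

    The $0.50/DBU default is an assumption inferred from the sizing
    tables above; substitute your contracted rate.
    """
    return dbus_per_day * days * rate_per_dbu

# Small deployment job clusters: ~100 DBUs/day
print(monthly_compute_cost(100))   # 1500.0
# Large deployment SQL warehouse: ~400 DBUs/day
print(monthly_compute_cost(400))   # 6000.0
```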

Performance Optimization

Cluster Configuration

# Recommended settings for NLP processing.
# Memory and core sizing must go in the cluster's Spark config (or the
# cluster UI) at creation time -- they cannot be changed on a running session:
#   spark.driver.memory 32g
#   spark.executor.memory 32g
#   spark.executor.cores 8

# These can be set on a running session:
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

Delta Table Optimization

-- Optimize tables regularly
OPTIMIZE omop_cdm.cdm_v5_4.person;
OPTIMIZE omop_cdm.cdm_v5_4.observation_period;

-- Z-order by common filter columns
OPTIMIZE omop_cdm.cdm_v5_4.condition_occurrence ZORDER BY (person_id, condition_start_date);

-- Vacuum old versions
VACUUM omop_cdm.cdm_v5_4.person RETAIN 168 HOURS;

Caching

# Cache frequently accessed tables
spark.sql("CACHE TABLE omop_cdm.vocabulary.concept")

Security & Compliance

Data Encryption

  • Encryption at rest (cloud provider default)
  • Encryption in transit (TLS 1.2+)
  • Customer-managed keys (optional)

Access Control

  • Unity Catalog fine-grained ACLs
  • SSO via Azure AD / Okta / SAML
  • Row-level and column-level security
  • Dynamic view masking for PHI
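Dynamic view masking can be sketched in Databricks SQL. The view and group names below are illustrative; `is_account_group_member` is Unity Catalog's group-membership function:

```sql
-- Mask direct identifiers for everyone outside the phi_readers group
CREATE OR REPLACE VIEW omop_cdm.cdm_v5_4.person_masked AS
SELECT
  person_id,
  gender_concept_id,
  CASE
    WHEN is_account_group_member('phi_readers') THEN birth_datetime
    ELSE NULL
  END AS birth_datetime,
  race_concept_id,
  ethnicity_concept_id
FROM omop_cdm.cdm_v5_4.person;
```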

Audit Logging

  • Unity Catalog audit logs
  • Cluster and job logs
  • Query history tracking
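With Unity Catalog system tables enabled, audit events can be queried directly. The sketch below uses the standard `system.access.audit` table; verify the column names against your workspace's schema:

```sql
-- Recent table reads against the workspace
SELECT event_time, user_identity.email, action_name, request_params
FROM system.access.audit
WHERE action_name = 'getTable'
  AND event_time > current_timestamp() - INTERVAL 7 DAYS
ORDER BY event_time DESC;
```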

Compliance

  • HIPAA-eligible Databricks accounts
  • BAA with Databricks
  • Compliance logging and reporting

Integration with Patient Journey Intelligence Features

Cohort Builder

  • Execute cohort queries via Databricks SQL
  • Export cohorts as Delta tables
  • Real-time cohort refresh
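A cohort export of this kind can be sketched as a CTAS in Databricks SQL. The cohort definition here, adults with a type 2 diabetes condition record, is illustrative (201826 is the standard OMOP concept for type 2 diabetes mellitus):

```sql
CREATE OR REPLACE TABLE omop_cdm.cdm_v5_4.cohort_t2dm AS
SELECT DISTINCT p.person_id
FROM omop_cdm.cdm_v5_4.person p
JOIN omop_cdm.cdm_v5_4.condition_occurrence co
  ON co.person_id = p.person_id
WHERE co.condition_concept_id = 201826          -- Type 2 diabetes mellitus
  AND year(current_date()) - p.year_of_birth >= 18;
```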

Patient Journeys

  • Query timeline data from OMOP tables
  • Join across person, visit, condition, drug, procedure tables
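A timeline query joining the core tables might look like the following sketch (table and column names follow OMOP CDM v5.4; the person_id is a placeholder):

```sql
SELECT
  p.person_id,
  vo.visit_start_date,
  vo.visit_concept_id,
  co.condition_concept_id,
  de.drug_concept_id
FROM omop_cdm.cdm_v5_4.person p
JOIN omop_cdm.cdm_v5_4.visit_occurrence vo
  ON vo.person_id = p.person_id
LEFT JOIN omop_cdm.cdm_v5_4.condition_occurrence co
  ON co.visit_occurrence_id = vo.visit_occurrence_id
LEFT JOIN omop_cdm.cdm_v5_4.drug_exposure de
  ON de.visit_occurrence_id = vo.visit_occurrence_id
WHERE p.person_id = 12345
ORDER BY vo.visit_start_date;
```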

Copilot

  • Natural language to SQL via Databricks AI
  • Query OMOP data through conversational interface

Cost Optimization

Auto-Scaling

  • Enable autoscaling for job clusters
  • Scale down during off-peak hours
  • Use spot instances for fault-tolerant workloads
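In a cluster definition these settings look roughly like the JSON fragment below (Databricks Clusters API fields; the instance type and worker bounds are illustrative):

```json
{
  "autoscale": { "min_workers": 2, "max_workers": 8 },
  "autotermination_minutes": 30,
  "aws_attributes": {
    "availability": "SPOT_WITH_FALLBACK",
    "first_on_demand": 1
  },
  "node_type_id": "i3.2xlarge"
}
```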

Photon Engine

  • 2-3x faster queries
  • Higher throughput for NLP
  • Consider cost/performance trade-off

Cluster Policies

  • Enforce maximum cluster sizes
  • Automatic termination after idle time
  • Restrict instance types
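A cluster policy enforcing these limits can be sketched as follows, using the Databricks policy definition format (`fixed`, `range`, and `allowlist` attribute types); the specific values are illustrative:

```json
{
  "autoscale.max_workers": { "type": "range", "maxValue": 16 },
  "autotermination_minutes": { "type": "fixed", "value": 30 },
  "node_type_id": {
    "type": "allowlist",
    "values": ["i3.xlarge", "i3.2xlarge", "i3.4xlarge"]
  }
}
```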

Monitoring

Databricks Metrics

  • Cluster utilization
  • Job success/failure rates
  • Query performance
  • DBU consumption

Data Quality

  • Delta Live Tables expectations
  • Data freshness monitoring
  • Schema drift detection

Next Steps

  1. Discovery Call: Discuss your Databricks environment with John Snow Labs
  2. Workspace Setup: Configure Unity Catalog and permissions
  3. Pilot Project: Process sample dataset
  4. Production Rollout: Scale to full data volume

Additional Resources