Deploying Patient Journey Intelligence on Databricks
Patient Journey Intelligence integrates with Databricks to provide a unified platform for healthcare data engineering, NLP processing, and analytics, all running on your existing Databricks infrastructure.
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ Databricks Workspace │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Compute Layer │ │
│ │ │ │
│ │ ┌──────────────┐ ┌───────────────┐ ┌────────────────┐ │ │
│ │ │ Interactive │ │ Job Clusters │ │ SQL Warehouse │ │ │
│ │ │ Clusters │ │ │ │ │ │ │
│ │ │ │ │ - Ingestion │ │ - OMOP Query │ │ │
│ │ │ - Ad-hoc │ │ - NLP │ │ - Analytics │ │ │
│ │ │ - Analysis │ │ - De-ID │ │ │ │ │
│ │ └──────────────┘ └───────────────┘ └────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Storage Layer │ │
│ │ │ │
│ │ ┌──────────────┐ ┌───────────────┐ ┌────────────────┐ │ │
│ │ │ Delta Lake │ │ DBFS/Unity │ │ External │ │ │
│ │ │ │ │ Catalog │ │ Storage │ │ │
│ │ │ - OMOP CDM │ │ │ │ │ │ │
│ │ │ - Curated │ │ - Metadata │ │ - S3/ADLS/GCS │ │ │
│ │ │ - Bronze/ │ │ - Lineage │ │ - Raw Files │ │ │
│ │ │ Silver/ │ │ - Governance │ │ │ │ │
│ │ │ Gold │ │ │ │ │ │ │
│ │ └──────────────┘ └───────────────┘ └────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Processing Frameworks │ │
│ │ │ │
│ │ ┌──────────────┐ ┌───────────────┐ ┌────────────────┐ │ │
│ │ │ Spark NLP │ │ Delta Live │ │ MLflow │ │ │
│ │ │ │ │ Tables │ │ │ │ │
│ │ │ - Clinical │ │ │ │ - Model Mgmt │ │ │
│ │ │ NER/RE │ │ - Pipelines │ │ - Tracking │ │ │
│ │ │ - De-ID │ │ - Quality │ │ │ │ │
│ │ └──────────────┘ └───────────────┘ └────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────┐
│ Patient Journey Intelligence Application Layer (Optional) │
│ │
│ ┌──────────────┐ ┌───────────────┐ ┌─────────────────────┐ │
│ │ Web UI │ │ API Server │ │ Database (RDS/ │ │
│ │ (React) │ │ (REST) │ │ Azure DB/ │ │
│ │ │ │ │ │ Cloud SQL) │ │
│ │ - Cohorts │ │ - Workflow │ │ │ │
│ │ - Journeys │ │ - Metadata │ │ - User Data │ │
│ │ - Copilot │ │ │ │ - Config │ │
│ └──────────────┘ └───────────────┘ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Deployment Models
Model 1: Databricks-Native (Recommended)
Best for: Organizations with existing Databricks investments
- All data processing runs on Databricks
- OMOP CDM stored in Delta Lake tables
- Unity Catalog for governance
- Notebooks for exploration and development
- Delta Live Tables for ingestion pipelines
- Optional Patient Journey Intelligence UI deployed separately (EKS/AKS/GKE)
Advantages:
- Leverage existing Databricks infrastructure
- Unified platform for data engineering and analytics
- Native lineage and governance via Unity Catalog
- Cost-effective (no separate compute infrastructure)
Model 2: Hybrid Deployment
- Patient Journey Intelligence application layer on Kubernetes (AWS/Azure/GCP)
- Databricks for batch processing and analytics
- Data synchronized between platforms
Best for: Organizations requiring Patient Journey Intelligence's full UI capabilities with Databricks analytics
Key Features on Databricks
Delta Lake Storage
OMOP CDM tables stored as Delta tables:
- Bronze Layer: Raw ingested data
- Silver Layer: Cleaned and validated
- Gold Layer: OMOP CDM v5.4 tables
Benefits:
- ACID transactions
- Time travel (data versioning)
- Schema evolution
- Efficient upserts
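The "efficient upserts" point refers to Delta's MERGE semantics: incoming rows are matched to existing rows on a key, matches are updated, and the rest are inserted. A minimal pure-Python sketch of that behavior, with plain dicts standing in for Delta tables and illustrative column names:

```python
def merge_upsert(target, updates, key="person_id"):
    """Sketch of Delta MERGE INTO semantics: update rows whose key
    matches, insert rows whose key does not (plain dicts stand in
    for Delta tables)."""
    index = {row[key]: dict(row) for row in target}
    for row in updates:
        if row[key] in index:
            index[row[key]].update(row)   # WHEN MATCHED THEN UPDATE
        else:
            index[row[key]] = dict(row)   # WHEN NOT MATCHED THEN INSERT
    return sorted(index.values(), key=lambda r: r[key])

target = [{"person_id": 1, "year_of_birth": 1980},
          {"person_id": 2, "year_of_birth": 1975}]
updates = [{"person_id": 2, "year_of_birth": 1976},   # correction
           {"person_id": 3, "year_of_birth": 1990}]   # new patient
merged = merge_upsert(target, updates)
```

On Databricks itself this is a single `MERGE INTO ... WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT` statement, executed atomically under Delta's ACID guarantees.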
Unity Catalog Integration
- Centralized metadata management
- Data lineage tracking
- Fine-grained access control
- Audit logging
- Data discovery
Spark NLP for Healthcare
Native integration with Spark NLP library:
- Clinical Named Entity Recognition (NER)
- Relation Extraction
- Assertion detection
- De-identification
- 1000+ pre-trained healthcare models
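To make the de-identification output concrete, here is a plain-Python sketch of the kind of masked text that stage produces. The regex patterns are illustrative only; the actual Spark NLP for Healthcare DeIdentification annotator uses trained clinical NER models, not regexes:

```python
import re

# Illustrative PHI patterns only -- the real de-identification stage
# relies on trained clinical NER models, not regexes.
PHI_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN:\s*\d+\b"),
}

def deidentify(note_text: str) -> str:
    """Replace PHI spans with typed placeholders, mirroring the
    masked output a de-identification stage emits."""
    for label, pattern in PHI_PATTERNS.items():
        note_text = pattern.sub(f"<{label}>", note_text)
    return note_text

note = "Seen on 03/14/2024, MRN: 12345, callback 555-867-5309."
masked = deidentify(note)
```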
Delta Live Tables (DLT)
Declarative data pipelines for:
- Source data ingestion
- Data quality validation
- NLP processing workflows
- OMOP transformation
- Automated monitoring and recovery
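The data quality validation works through DLT expectations: `@dlt.expect_or_drop` keeps rows satisfying a predicate, drops the rest, and surfaces pass/fail counts as pipeline metrics. A plain-Python sketch of those semantics, assuming an illustrative `valid_note` expectation:

```python
def expect_or_drop(rows, name, predicate):
    """Sketch of Delta Live Tables' expect_or_drop semantics: rows
    failing the expectation are dropped, and the pass/drop counts
    are what DLT surfaces as data-quality metrics."""
    kept = [r for r in rows if predicate(r)]
    metrics = {"expectation": name,
               "passed": len(kept),
               "dropped": len(rows) - len(kept)}
    return kept, metrics

rows = [{"note_id": 1, "note_text": "cough, fever"},
        {"note_id": 2, "note_text": None},
        {"note_id": 3, "note_text": "chest pain"}]
kept, metrics = expect_or_drop(rows, "valid_note",
                               lambda r: r["note_text"] is not None)
```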
Prerequisites
Databricks Workspace
- Edition: Premium or Enterprise
- Region: Choose based on data residency requirements
- Runtime: DBR 12.2 LTS or higher
- Unity Catalog: Enabled (recommended)
Cloud Provider Resources
AWS:
- S3 buckets for external storage
- IAM roles for cross-account access
- VPC peering (if needed)
Azure:
- Azure Data Lake Storage Gen2
- Service principals for access
- VNet peering (if needed)
GCP:
- Google Cloud Storage
- Service accounts
- VPC peering (if needed)
Compute Quotas
- Budget for at least ~100 DBUs/day (see Resource Sizing below)
- Photon-enabled clusters (recommended)
- GPU instances for large-scale NLP (optional)
Installation Process
1. Workspace Configuration (Week 1)
# Install Patient Journey Intelligence libraries on Databricks cluster
%pip install johnsnowlabs==5.2.0
%pip install sparknlp-jsl==5.2.0

# Initialize John Snow Labs session
from johnsnowlabs import nlp

nlp.start(
    spark=spark,
    license_keys={"AWS_ACCESS_KEY_ID": "...", "AWS_SECRET_ACCESS_KEY": "..."}
)
2. Unity Catalog Setup (Week 1)
-- Create catalog for OMOP CDM
CREATE CATALOG IF NOT EXISTS omop_cdm;
-- Create schemas
CREATE SCHEMA IF NOT EXISTS omop_cdm.cdm_v5_4;
CREATE SCHEMA IF NOT EXISTS omop_cdm.vocabulary;
CREATE SCHEMA IF NOT EXISTS omop_cdm.metadata;
-- Set permissions
GRANT USAGE ON CATALOG omop_cdm TO `data_analysts`;
GRANT SELECT ON SCHEMA omop_cdm.cdm_v5_4 TO `data_analysts`;
3. Data Ingestion Pipeline (Weeks 2-3)
Create Delta Live Tables pipeline:
import dlt
from pyspark.sql.functions import *

@dlt.table(
    comment="Raw clinical notes ingested from source systems",
    table_properties={"quality": "bronze"}
)
def bronze_clinical_notes():
    return (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/raw/clinical_notes/")
    )

@dlt.table(
    comment="De-identified clinical notes",
    table_properties={"quality": "silver"}
)
@dlt.expect_or_drop("valid_note", "note_text IS NOT NULL")
def silver_clinical_notes():
    from sparknlp_jsl.annotator import DeIdentification

    # Apply de-identification
    df = dlt.read("bronze_clinical_notes")
    # ... NLP pipeline code ...
    return deidentified_df

@dlt.table(
    comment="OMOP Note table with extracted entities",
    table_properties={"quality": "gold"}
)
def gold_omop_note():
    return transform_to_omop_note(dlt.read("silver_clinical_notes"))
4. OMOP CDM Creation (Weeks 3-4)
Deploy OMOP CDM v5.4 schema as Delta tables:
-- Person table
CREATE TABLE IF NOT EXISTS omop_cdm.cdm_v5_4.person (
    person_id BIGINT NOT NULL,
    gender_concept_id INT NOT NULL,
    year_of_birth INT NOT NULL,
    month_of_birth INT,
    day_of_birth INT,
    birth_datetime TIMESTAMP,
    race_concept_id INT NOT NULL,
    ethnicity_concept_id INT NOT NULL,
    -- ... additional fields ...
) USING DELTA
LOCATION '/mnt/omop/person'
TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');
-- Create all OMOP tables...
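Once the tables are loaded, the NOT NULL constraints in the DDL above are a natural first target for the validation step. A plain-Python sketch of such a check; the concept IDs in the sample rows are illustrative:

```python
# Required (NOT NULL) columns from the person DDL above.
REQUIRED = ["person_id", "gender_concept_id", "year_of_birth",
            "race_concept_id", "ethnicity_concept_id"]

def validate_person(row: dict) -> list:
    """Return the list of NOT NULL violations for one person row --
    the kind of check the validation step runs against the Delta
    tables before sign-off."""
    return [col for col in REQUIRED if row.get(col) is None]

# Sample rows; the concept IDs are illustrative.
good = {"person_id": 1, "gender_concept_id": 8507, "year_of_birth": 1980,
        "race_concept_id": 8527, "ethnicity_concept_id": 38003564}
bad = {"person_id": 2, "gender_concept_id": None, "year_of_birth": 1975,
       "race_concept_id": 8527, "ethnicity_concept_id": 38003564}
```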
5. Validation & Testing (Weeks 5-6)
- Data quality validation
- Performance benchmarking
- User acceptance testing
Resource Sizing
Small Deployment (< 100K patients)
| Resource | Specification | Monthly Cost |
|---|---|---|
| Job Clusters | 2 x i3.xlarge (4 vCPUs, 30.5 GB RAM each) | ~100 DBUs/day = $1,500 |
| SQL Warehouse | 2X-Small | ~50 DBUs/day = $750 |
| Storage (Delta) | 500 GB | $15 |
| Total | | ~$2,300 |
Medium Deployment (100K - 1M patients)
| Resource | Specification | Monthly Cost |
|---|---|---|
| Job Clusters | 4 x i3.2xlarge (8 vCPUs, 61 GB RAM each) | ~400 DBUs/day = $6,000 |
| SQL Warehouse | X-Small | ~150 DBUs/day = $2,250 |
| Storage (Delta) | 5 TB | $150 |
| Total | | ~$8,500 |
Large Deployment (> 1M patients)
| Resource | Specification | Monthly Cost |
|---|---|---|
| Job Clusters | 8 x i3.4xlarge (16 vCPUs, 122 GB RAM each) | ~1,000 DBUs/day = $15,000 |
| SQL Warehouse | Small | ~400 DBUs/day = $6,000 |
| Storage (Delta) | 50 TB | $1,500 |
| Total | | ~$23,000 |
Costs are rough estimates based on AWS Databricks list pricing; Photon-enabled clusters consume DBUs at roughly 2x the rate.
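The monthly figures above follow from simple DBU arithmetic. A sketch of that calculation; the ~$0.50/DBU effective rate is inferred from the sizing tables, not an official Databricks price:

```python
def monthly_cost(dbus_per_day: float, rate_per_dbu: float = 0.50,
                 days: int = 30) -> float:
    """Estimate monthly compute spend from daily DBU consumption.
    The default rate is the effective $/DBU implied by the sizing
    tables above, not a quoted Databricks price."""
    return dbus_per_day * days * rate_per_dbu

# Small deployment compute: job clusters + SQL warehouse
small = monthly_cost(100) + monthly_cost(50)   # 1500 + 750
```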
Performance Optimization
Cluster Configuration
# Recommended Spark configuration for NLP processing.
# Note: spark.driver.memory and the spark.executor.* settings cannot be
# changed at runtime with spark.conf.set() -- set them in the cluster's
# Spark config (Compute > Advanced Options) when the cluster is created.
spark.driver.memory 32g
spark.executor.memory 32g
spark.executor.cores 8
spark.sql.adaptive.enabled true
spark.databricks.delta.optimizeWrite.enabled true
Delta Table Optimization
-- Optimize tables regularly
OPTIMIZE omop_cdm.cdm_v5_4.person;
OPTIMIZE omop_cdm.cdm_v5_4.observation_period;
-- Z-order by common filter columns
OPTIMIZE omop_cdm.cdm_v5_4.condition_occurrence ZORDER BY (person_id, condition_start_date);
-- Vacuum old versions
VACUUM omop_cdm.cdm_v5_4.person RETAIN 168 HOURS;
Caching
# Cache frequently accessed tables
spark.sql("CACHE TABLE omop_cdm.vocabulary.concept")
Security & Compliance
Data Encryption
- Encryption at rest (cloud provider default)
- Encryption in transit (TLS 1.2+)
- Customer-managed keys (optional)
Access Control
- Unity Catalog fine-grained ACLs
- SSO via Azure AD / Okta / SAML
- Row-level and column-level security
- Dynamic view masking for PHI
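Dynamic view masking can be sketched in plain Python. On Databricks itself this is a SQL view that branches on a group-membership function such as `is_account_group_member()`; the column and group names below are illustrative:

```python
# Illustrative PHI column and group names.
PHI_COLUMNS = {"birth_datetime", "person_source_value"}

def masked_view(row: dict, user_groups: set) -> dict:
    """Return the row as the querying user would see it: members of
    the (illustrative) phi_readers group see everything; everyone
    else sees PHI columns redacted."""
    if "phi_readers" in user_groups:
        return dict(row)
    return {col: ("***" if col in PHI_COLUMNS else val)
            for col, val in row.items()}

row = {"person_id": 1, "birth_datetime": "1980-02-01", "year_of_birth": 1980}
analyst_view = masked_view(row, {"data_analysts"})
```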
Audit Logging
- Unity Catalog audit logs
- Cluster and job logs
- Query history tracking
Compliance
- HIPAA-eligible Databricks accounts
- BAA with Databricks
- Compliance logging and reporting
Integration with Patient Journey Intelligence Features
Cohort Builder
- Execute cohort queries via Databricks SQL
- Export cohorts as Delta tables
- Real-time cohort refresh
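A cohort query ultimately reduces to filtering OMOP event tables and collecting distinct person IDs. A pure-Python sketch over in-memory rows; the concept IDs (201826 for type 2 diabetes, 316866 for another condition) are illustrative:

```python
def build_cohort(condition_rows, concept_id, start_date):
    """Return the distinct person_ids whose condition_occurrence rows
    match the concept on/after the index date -- the same filter a
    cohort query pushes down to Databricks SQL."""
    return sorted({r["person_id"] for r in condition_rows
                   if r["condition_concept_id"] == concept_id
                   and r["condition_start_date"] >= start_date})

conditions = [
    {"person_id": 1, "condition_concept_id": 201826,
     "condition_start_date": "2023-05-01"},
    {"person_id": 2, "condition_concept_id": 201826,
     "condition_start_date": "2019-01-10"},   # before index date
    {"person_id": 3, "condition_concept_id": 316866,
     "condition_start_date": "2023-07-04"},   # different condition
]
cohort = build_cohort(conditions, 201826, "2023-01-01")
```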
Patient Journeys
- Query timeline data from OMOP tables
- Join across person, visit, condition, drug, procedure tables
Copilot
- Natural language to SQL via Databricks AI
- Query OMOP data through conversational interface
Cost Optimization
Auto-Scaling
- Enable autoscaling for job clusters
- Scale down during off-peak hours
- Use spot instances for fault-tolerant workloads
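The bullets above map onto a Databricks Clusters API payload. A sketch of such a spec; the node type, pool sizes, and timeout are illustrative, so verify field names against the API version in your workspace:

```python
# Sketch of a Clusters API spec implementing the recommendations
# above; sizes, node type, and timeout are illustrative.
cluster_spec = {
    "spark_version": "12.2.x-scala2.12",
    "node_type_id": "i3.2xlarge",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,            # terminate when idle
    "aws_attributes": {
        "first_on_demand": 1,                 # keep the driver on-demand
        "availability": "SPOT_WITH_FALLBACK"  # spot workers, on-demand fallback
    },
}
```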
Photon Engine
- 2-3x faster queries
- Higher throughput for NLP
- Consider cost/performance trade-off
Cluster Policies
- Enforce maximum cluster sizes
- Automatic termination after idle time
- Restrict instance types
Monitoring
Databricks Metrics
- Cluster utilization
- Job success/failure rates
- Query performance
- DBU consumption
Data Quality
- Delta Live Tables expectations
- Data freshness monitoring
- Schema drift detection
Next Steps
- Discovery Call: Discuss your Databricks environment with John Snow Labs
- Workspace Setup: Configure Unity Catalog and permissions
- Pilot Project: Process sample dataset
- Production Rollout: Scale to full data volume