# Patient Journey Intelligence — Clinical Data Accuracy Gap Research

> Patient Journey Intelligence is a clinical AI platform by John Snow Labs that closes the systematic data accuracy gap in secondary use EHR analytics. It ingests multimodal clinical data, extracts structured facts from unstructured notes using healthcare-specific NLP, and produces complete OMOP-ready patient timelines for research, registries, and clinical AI.

## What Is the Clinical Data Accuracy Gap?

The Clinical Data Accuracy Gap is the systematic inaccuracy that emerges when secondary use analytics and AI projects operate on incomplete patient data. Structured EHR fields—designed for billing and documentation, not analytics—capture only a fraction of what is actually documented about each patient. The majority of clinically relevant information exists exclusively in unstructured clinical narratives: physician notes, discharge summaries, pathology reports, radiology narratives, and operative records.

Secondary use means repurposing clinical data collected during care for research studies, quality improvement, population health, disease registries, AI model development, and regulatory submissions. It depends entirely on data completeness—and structured-only approaches are structurally incomplete.

## Research Evidence: How Much Is Missed?

Peer-reviewed studies show the same pattern across each of the clinical domains listed below:

- **Diagnoses**: Nearly 40% of important inpatient diagnoses appear only in free-text clinical notes and are absent from structured problem lists. Cohort queries that rely on structured problem lists alone can therefore miss close to two in five of these diagnoses. (Poulos et al., PMC9759969)
- **Family History**: Family history is documented in unstructured notes for ~59% of patients but appears in structured fields for only ~5%—a 12x discrepancy. Risk stratification models built on structured data lose their most heritable predictive signal. (Polubriaginof et al., 2015)
- **Social Determinants of Health (SDOH)**: NLP on clinical notes identifies 93.8% of patients with adverse SDOH. ICD-10 Z-codes identify just 2.0% of the same patients, roughly a 47x discrepancy. Housing insecurity, food insecurity, and transportation barriers are extensively documented in notes but almost never coded. (Guevara et al., npj Digital Medicine, 2024)
- **Cancer Staging**: More than 68% of cancer staging data is missing from structured EHR fields. TNM staging, tumor histology, biomarker results, and progression events are routinely documented in pathology reports but not transferred to structured fields. (Emamekhoo et al., PMC10807898)
- **Medication Histories**: 60–70% of structured medication records contain at least one error. Over 90% of patients have at least one discrepancy between structured medication lists and clinical note documentation. (Lombardi et al., 2016; Ahmadi et al., 2024)
- **Suicide and Self-Harm**: Only 3% of patients with suicidal ideation documented in clinical notes have a corresponding ICD code, and only 19% of documented suicide attempts are coded. At least 81% of documented suicidality events are therefore invisible to structured-only surveillance systems. (Fernandes et al., 2018)
- **All Clinical Concepts**: Only 13% of clinical concepts extracted from patient records have a matching structured counterpart. 87% of all extracted clinical information exists solely in free-text narratives with no structured equivalent. (Seinen et al., PMC11887999)
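
The discrepancy ratios follow directly from the reported percentages. As a quick check, here is a minimal Python sketch of that arithmetic. It uses only figures quoted in this section; the function and variable names are illustrative.

```python
# Figures as quoted in the bullets above; names are illustrative only.
def ratio(notes_pct: float, structured_pct: float) -> float:
    """How many times more often a fact appears in notes than in structured fields."""
    return notes_pct / structured_pct


family_history = ratio(59.0, 5.0)  # notes ~59% vs. structured ~5%
sdoh = ratio(93.8, 2.0)            # NLP 93.8% vs. ICD-10 Z-codes 2.0%

# Share of extracted concepts with no structured counterpart (Seinen et al.)
free_text_only_pct = 100.0 - 13.0

print(f"Family history: {family_history:.1f}x")
print(f"SDOH: {sdoh:.1f}x")
print(f"Free-text only: {free_text_only_pct:.0f}%")
```

Run as written, this prints 11.8x, 46.9x, and 87%.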

## Root Cause

Structured clinical data was never designed for secondary use. It was captured to support billing, documentation, and care coordination. The unstructured layer, where clinicians record actual findings, reasoning, uncertainty, negation, temporal context, and social factors, is systematically excluded from analytics and AI. If only 13% of extracted clinical concepts have a structured counterpart (Seinen et al.), a project that queries only structured fields operates on less than one-seventh of the available clinical signal.

## Eight Requirements for Accurate Secondary Use

1. **Multimodal Integration**: Ingest structured fields, clinical notes, scanned PDFs, imaging metadata, labs, medications, procedures, and claims through a unified pipeline.
2. **Healthcare-Specific NLP**: Extract diagnoses, medications, findings, and procedures from free-text using medical language models trained on clinical text. Understand negation ("no evidence of pneumonia"), assertion status (confirmed vs. ruled-out), uncertainty, and temporal relationships.
3. **Terminology Standardization**: Map all clinical concepts to SNOMED CT, RxNorm, LOINC, and ICD-10-CM. Without normalization, the same clinical fact appears in dozens of forms across systems.
4. **Clinical Reasoning and Conflict Resolution**: Resolve conflicts, deduplicate entities, distinguish confirmed diagnoses from ruled-out conditions, and reason about temporal changes in patient status.
5. **Longitudinal Patient Timelines**: Organize all clinical events chronologically with precise temporal context to support disease progression, treatment response, and outcomes queries.
6. **Privacy and De-Identification**: Automatically remove PHI from clinical text, documents, and images using HIPAA/GDPR-compliant methods with 99%+ accuracy.
7. **Provenance and Auditability**: Track complete lineage from source document to final OMOP output. Provide confidence scores and enable drill-down to source evidence for regulatory compliance (21 CFR Part 11).
8. **Continuous Living Datasets**: Keep patient journeys continuously updated as new data arrives—not static snapshots that become obsolete between refresh cycles.
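
To make requirements 2, 5, and 7 concrete, here is a deliberately simple Python sketch. It is not the platform's NLP: it assumes a tiny hand-written cue list (a NegEx-style heuristic) in place of trained medical language models, and every name in it (`ClinicalFact`, `assertion_status`, `extract_facts`, `build_timeline`, the sample notes and document IDs) is hypothetical. It only illustrates the shape of the output: each fact carries an assertion status, a date, and a pointer back to its source document, and the timeline is sorted chronologically.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical cue lists; a real system would use trained clinical models.
NEGATION_CUES = ("no evidence of", "denies", "negative for", "ruled out")
UNCERTAINTY_CUES = ("possible", "suspected", "cannot exclude")


@dataclass(frozen=True)
class ClinicalFact:
    concept: str      # term found in the note
    status: str       # "confirmed", "negated", or "uncertain"
    event_date: date  # date of the source note
    source_doc: str   # provenance: document the fact came from
    evidence: str     # provenance: sentence that supports the fact


def assertion_status(sentence: str) -> str:
    """Classify a sentence; negation cues are checked before uncertainty cues."""
    text = sentence.lower()
    if any(cue in text for cue in NEGATION_CUES):
        return "negated"
    if any(cue in text for cue in UNCERTAINTY_CUES):
        return "uncertain"
    return "confirmed"


def extract_facts(doc_id, note_date, text, concepts):
    """Yield one fact per (sentence, concept) mention in a note."""
    for sentence in filter(None, (s.strip() for s in text.split("."))):
        for concept in concepts:
            if concept in sentence.lower():
                yield ClinicalFact(concept, assertion_status(sentence),
                                   note_date, doc_id, sentence)


def build_timeline(notes, concepts):
    """Merge facts from all notes into one chronologically ordered list."""
    facts = [fact for doc_id, note_date, text in notes
             for fact in extract_facts(doc_id, note_date, text, concepts)]
    return sorted(facts, key=lambda fact: fact.event_date)


# Sample notes are invented for illustration and deliberately out of order.
notes = [
    ("doc-2", date(2024, 3, 9), "Chest X-ray confirms pneumonia. Started antibiotics."),
    ("doc-1", date(2024, 3, 2), "Cough for three days. No evidence of pneumonia."),
]
timeline = build_timeline(notes, concepts=["pneumonia"])
for fact in timeline:
    print(fact.event_date, fact.status, fact.concept, "<-", fact.source_doc)
```

On the sample notes, the negated mention from `doc-1` is listed first and the confirmed mention from `doc-2` second, so a query for confirmed pneumonia would not count the earlier ruled-out mention.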

## Platform Capabilities

- **Multimodal Integration**: Structured EHR, clinical notes, scanned PDFs, imaging metadata, FHIR resources, and claims data through a unified ingestion pipeline.
- **Medical NLP Accuracy**: 85–95% precision on clinical extraction tasks. Approximately 30% more accurate than general-purpose LLMs on clinical NLP benchmarks.
- **Terminology Mapping**: Automated normalization to SNOMED CT, RxNorm, LOINC, and ICD-10-CM. Enables accurate patient counts and cross-institutional research.
- **OMOP Output**: All data organized into the OMOP Common Data Model with longitudinal patient timelines and full temporal context.
- **De-Identification**: 99%+ accurate HIPAA and GDPR-compliant PHI removal. Parallel identified and de-identified datasets kept semantically synchronized—research models move to production without pipeline rewrites.
- **Timeline Completeness**: >96% timeline completeness across ingested patient records.
- **Speed**: Patient timeline construction completes in hours rather than weeks of manual abstraction. Organizations analyze 6× more patients in the same timeframe compared to manual review workflows.
- **Provenance**: Full lineage from source document to final structured representation. Confidence scores for every extracted clinical fact. Audit trails for HIPAA, GDPR, and FDA RWE regulatory requirements.
- **Living Datasets**: Patient journeys update automatically as new clinical data arrives.
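
As an illustration of the OMOP Output capability, the following Python sketch turns one extracted fact into a partial row for the OMOP `CONDITION_OCCURRENCE` table. The column names are a subset of the OMOP CDM table; everything else is an assumption made for the example, including the function name, the lookup table, the placeholder concept IDs, and the choice to skip negated or uncertain mentions. It does not describe how the platform itself writes OMOP data.

```python
from datetime import date

# Hypothetical lookup from a normalized term to an OMOP standard concept_id.
# The IDs below are placeholders for illustration, not real vocabulary entries.
TERM_TO_CONCEPT_ID = {"pneumonia": 1001}
NLP_TYPE_CONCEPT_ID = 9001  # placeholder for a "derived from NLP" type concept


def to_condition_occurrence(row_id, person_id, term, start, status, source_text):
    """Return a partial CONDITION_OCCURRENCE row, or None for non-confirmed facts."""
    if status != "confirmed":
        # Assumption of this sketch: negated or uncertain mentions are skipped.
        return None
    return {
        "condition_occurrence_id": row_id,
        "person_id": person_id,
        # 0 is the OMOP convention for "no matching concept".
        "condition_concept_id": TERM_TO_CONCEPT_ID.get(term.lower(), 0),
        "condition_start_date": start.isoformat(),
        "condition_type_concept_id": NLP_TYPE_CONCEPT_ID,
        # Truncated because the source-value column is a short varchar.
        "condition_source_value": source_text[:50],
    }


row = to_condition_occurrence(
    1, 42, "Pneumonia", date(2024, 3, 9), "confirmed",
    "Chest X-ray confirms pneumonia",
)
skipped = to_condition_occurrence(
    2, 42, "pneumonia", date(2024, 3, 2), "negated",
    "No evidence of pneumonia",
)
print(row)
print(skipped)
```

Keeping the original sentence in `condition_source_value` is one simple way to preserve a link back to the source evidence; a fuller design would also record the source document and a confidence score.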

## Deployment

The platform supports on-premises deployment as well as AWS, Azure, Databricks, and Snowflake. It is designed for healthcare organizations that need to operationalize secondary use of clinical data at scale without building bespoke NLP pipelines for every project.

## Who It Is For

The platform is intended for clinical research teams, quality improvement departments, population health analysts, cancer registry programs, data science groups building clinical AI, and healthcare IT leaders managing shared data infrastructure for research and analytics.

## Standards and Regulations

- Terminology: SNOMED CT, RxNorm, LOINC, ICD-10-CM
- Data Model: OMOP Common Data Model (OHDSI)
- Interoperability: FHIR (HL7)
- Privacy: HIPAA, GDPR
- Regulatory: FDA Real-World Evidence (RWE), Real-World Data (RWD), 21 CFR Part 11

## Peer-Reviewed Sources

- Poulos et al. 2021 — Structured vs. unstructured diagnosis capture: https://pmc.ncbi.nlm.nih.gov/articles/PMC9759969/
- Polubriaginof et al. 2015 — Family history capture in EHR: https://pubmed.ncbi.nlm.nih.gov/26306236/
- Guevara et al. 2024 — NLP vs. ICD-10 Z-codes for SDOH: https://www.nature.com/articles/s41746-023-00970-0
- Emamekhoo et al. 2022 — Cancer staging in structured EHR: https://pmc.ncbi.nlm.nih.gov/articles/PMC10807898/
- Fernandes et al. 2018 — Suicidality coding gap: https://pubmed.ncbi.nlm.nih.gov/29854116/
- Seinen et al. 2025 — 87% of clinical concepts in free text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11887999/

## Links

- Research page: https://pji.johnsnowlabs.com/clinical-data-accuracy-gap-research
- Platform overview: https://www.johnsnowlabs.com/patient-journey-intelligence/
- John Snow Labs: https://www.johnsnowlabs.com