Secondary Use of Clinical Data: The Accuracy Challenge
Defining Secondary Use of Clinical Data
Secondary use means taking clinical data originally collected during patient care and repurposing it for something beyond direct treatment. This includes research studies, quality improvement programs, population health analytics, disease registries, AI model development, public health surveillance, and healthcare operations optimization.
Unlike operational systems designed to support real-time clinical workflows—prescribing medications, ordering labs, documenting visits—secondary use environments must work with years or decades of historically accumulated data. This data comes from multiple systems, uses inconsistent formats and terminologies, includes both structured and unstructured content, and often contains gaps, conflicts, and quality issues that went unnoticed during original data entry.
The promise of secondary use is compelling: unlock insights hidden in millions of patient encounters, identify at-risk populations before problems escalate, accelerate clinical trial recruitment, automate registry reporting, and train AI models that improve care. But realizing this promise depends entirely on whether the data accurately represents what actually happened to patients—and that's where most secondary use initiatives struggle.
The Accuracy Problem: Incomplete Patient Views Lead to Wrong Results
Beyond engineering cost, the most serious and pervasive problem with today's secondary use approaches is accuracy. When projects operate on an incomplete view of patient data, they don't merely return partial answers—they often return wrong or misleading results.
Incomplete Patient Data Creates Systematic Inaccuracy
Most healthcare organizations have spent decades accumulating clinical data across dozens of systems: electronic health records, laboratory information systems, radiology platforms, specialty registries, billing systems, scanned document repositories, and more. But integrating all of this data into a unified, longitudinal patient view is expensive and technically complex.
As a result, many secondary use initiatives cannot afford to reconstruct a complete patient journey across all available data modalities. The complexity and cost of integrating free-text notes, scanned documents, imaging metadata, specialty systems, and longitudinal context lead teams to limit scope to what is easiest to access—most commonly, structured EHR data alone.
⚠️ The Structured-Only Trap
Projects routinely underestimate how much accuracy is lost by relying on a single modality. Structured EHR fields capture only a subset of clinically relevant information and frequently omit context such as certainty, negation, temporality, disease progression, adverse events, social factors, and clinician reasoning.
As a result, analytics, registries, cohorts, and AI models built on structured-only data routinely miss critical signals, misclassify patients, and produce biased or incomplete outputs.
The Evidence: How Much Accuracy Are We Losing?
Multiple peer-reviewed studies quantify the magnitude of accuracy loss when unstructured and longitudinal data are excluded from secondary use projects:
Clinical Diagnoses: The 40% Gap
~40%
Nearly 40% of important inpatient diagnoses are mentioned only in free-text clinical notes and are absent from structured problem lists. If your cohort query searches only structured diagnosis codes, you may be systematically missing roughly two in five of the diagnoses relevant to the condition you're studying.
Family History: The 12× Discrepancy
12×
Family history is documented in unstructured notes for ~59% of patients, but appears in structured fields for only ~5%. That's a 12-fold difference between what clinicians document and what gets captured in structured data (Waudby-Smith et al., AMIA Annu Symp Proc. 2015). Any genetics study, risk model, or preventive care program relying on structured family history alone operates with systematically incomplete inputs.
Social Determinants of Health: Over 90% vs. a Small Minority
>90%
Models leveraging unstructured text identify over 90% of patients with adverse social determinants of health (housing insecurity, food insecurity, transportation barriers), while ICD-based approaches capture only a small minority (Guevara et al., npj Digital Medicine. 2024). Social factors are documented extensively in clinical notes but rarely coded—making them invisible to structured-only analytics.
Oncology: The Structured Data Gap
In oncology, the richest clinical information—tumor histology, staging details, biomarker results, treatment response, and progression events—exists predominantly in unstructured pathology reports and clinical notes rather than in structured EHR fields (Rubinstein et al., JCO Clin Cancer Inform. 2024). This creates systematic gaps in cancer registries, research cohorts, and quality reporting that rely on structured data alone.
>68%
Separately, cancer staging data is reported missing from structured EHR fields in more than 68% of encounters (Harris et al., JCO Clin Cancer Inform. 2022). TNM staging information is routinely documented in pathology reports and oncology notes but not transferred to structured problem lists or staging fields, which makes it invisible to automated cohort identification and registry reporting.
Up to 98%
Biomarker data has a median of 14% missing and a mean of 22% missing across molecular epidemiology studies, with some studies reporting up to 98% missing data for key biomarker variables (Greenland & Finkle, Cancer Epidemiol Biomarkers Prev. 2012). Critical biomarker and genomic results (ER/PR/HER2 status, PD-L1 expression, gene mutations) are documented in pathology narratives but frequently absent from structured fields, which undermines precision oncology initiatives and outcomes research.
Medication Reconciliation: Discrepancies Everywhere
60%–70%
Medication reconciliation studies consistently find that, on average, 60% to 70% of medication histories contain at least one error (Gómez-Cuervo et al., Eur J Cardiovasc Nurs. 2016), and that over 90% of patients have at least one discrepancy between structured medication lists and what is documented in clinical notes and discharge summaries (Rehman et al., Pharmacy. 2024). Structured medication tables alone therefore frequently misrepresent true medication exposure, which is critical for pharmacovigilance, drug-drug interaction detection, and outcomes research.
Suicidality and Self-Harm: The Coding Gap
>81%
Only 3% of patients with suicidal ideation and 19% with suicide attempts documented in clinical notes have corresponding ICD codes (Fernandes et al., Sci Rep. 2018). This means structured coding systems capture fewer than 1 in 5 documented suicide-related events. Psychiatric screening results, safety plans, and clinician assessments are documented extensively in notes but rarely coded—leaving structured-only surveillance systems blind to the vast majority of high-risk patients.
Clinical Prediction: The Unstructured Data Advantage
87%
Only 13% of clinical concepts extracted from patient records have matching structured counterparts (Dielissen et al., J Med Internet Res. 2025). In other words, 87% of extracted clinical information exists solely in free-text narratives—invisible to analytics that query only structured fields. This enormous gap explains why models trained on structured data alone routinely underperform.
Why Text Outperforms Codes: Structured diagnosis codes are designed for billing and administrative classification, not clinical nuance. A "diabetes" code tells you nothing about glycemic control, medication adherence, complications, or disease trajectory. A clinical note, by contrast, might record "HbA1c 11.2%, poorly controlled despite max dose metformin, recurrent DKA admissions, non-adherent to insulin." This narrative contains the actual predictive signals for outcomes, but it remains inaccessible if you query only structured fields.
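To make the contrast concrete, here is a minimal Python sketch that pulls a few signals out of the example note text. It is a toy under stated assumptions: the regular expression, the function names (extract_hba1c, extract_signals), and the keyword checks are illustrative inventions, not how a production clinical NLP system works.

```python
import re
from typing import List, Optional

# Note text copied from the example sentence above.
NOTE = (
    "HbA1c 11.2%, poorly controlled despite max dose metformin, "
    "recurrent DKA admissions, non-adherent to insulin."
)

# Illustrative pattern: matches forms such as "HbA1c 11.2%" or "A1c: 7.4 %".
HBA1C_PATTERN = re.compile(
    r"\b(?:Hb)?A1c\b[:\s]*(\d{1,2}(?:\.\d+)?)\s*%", re.IGNORECASE
)


def extract_hba1c(text: str) -> List[float]:
    """Return every HbA1c percentage mentioned in the text, in order."""
    return [float(m.group(1)) for m in HBA1C_PATTERN.finditer(text)]


def extract_signals(text: str) -> dict:
    """Collect a few keyword-level signals that a bare diagnosis code lacks."""
    lowered = text.lower()
    values = extract_hba1c(text)
    first_value: Optional[float] = values[0] if values else None
    return {
        "hba1c_percent": first_value,
        "dka_mentioned": "dka" in lowered,
        "non_adherence_mentioned": "non-adherent" in lowered,
    }


print(extract_signals(NOTE))
```

Plain keyword matching like this ignores negation and timing (for example, "no DKA" would still be flagged), which is one reason the requirements later in this document call for healthcare-specific NLP rather than pattern matching alone.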
The Root Cause: Incomplete Patient Representation
These findings point to a consistent pattern: the dominant failure mode of many secondary use projects is not tooling, but incomplete patient representation.
Without first constructing a complete, multimodal patient journey over time—integrating structured data, unstructured notes, scanned documents, imaging metadata, and longitudinal context—secondary use analytics and AI applications operate on distorted inputs. This limits accuracy, biases results, and undermines clinical and regulatory trust.
✅ The Path Forward
A modern secondary use data platform must do more than move data—it must fundamentally change how organizations prepare, trust, and reuse clinical data over time. The opportunity is to replace fragmented, per-project pipelines with a single, shared foundation that transforms raw multimodal data into AI-ready patient journeys and keeps them continuously up-to-date as new data arrives.
What Secondary Use Requires
To achieve accurate secondary use of clinical data, organizations need eight foundational capabilities that go far beyond traditional data warehousing:
1. Complete Multimodal Data Integration
Requirement: Ingest data across all modalities—structured fields, clinical notes, scanned documents, imaging metadata, labs, medications, procedures, claims.
Why it matters: Relying on any single modality systematically excludes critical information, leading to incomplete cohorts and biased models.
2. Healthcare-Specific NLP
Requirement: Extract structured facts from unstructured text using medical language models—entity recognition, relation extraction, assertion detection, temporal reasoning.
Why it matters: Healthcare-specific models achieve 85–95% precision/recall and are ~30% more accurate than general-purpose LLMs on clinical tasks.
3. Terminology Standardization
Requirement: Map all clinical concepts to standard medical vocabularies (SNOMED CT, RxNorm, LOINC, ICD-10-CM).
Why it matters: Without normalization, the same clinical fact appears in dozens of forms, making accurate patient counts and cross-institution research impossible.
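As a rough illustration of why normalization changes patient counts, the sketch below maps several surface forms of the same condition to one concept identifier before counting. The synonym table and function names are hypothetical; a real system would query a terminology service, and the SNOMED CT identifiers shown should be verified against the current release before any use.

```python
from collections import defaultdict

# Hypothetical synonym table for illustration only.
# Identifiers are SNOMED CT concept IDs; verify against the current release.
SYNONYM_TO_CONCEPT = {
    "type 2 diabetes": "44054006",
    "type ii diabetes mellitus": "44054006",
    "t2dm": "44054006",
    "niddm": "44054006",
    "hypertension": "38341003",
    "htn": "38341003",
    "high blood pressure": "38341003",
}


def normalize(term):
    """Map a raw term to a concept ID, or None if the term is unknown."""
    key = " ".join(term.lower().split())  # lowercase, collapse whitespace
    return SYNONYM_TO_CONCEPT.get(key)


def count_patients_by_concept(mentions):
    """Count distinct patients per concept from (patient_id, raw_term) pairs."""
    patients = defaultdict(set)
    for patient_id, raw_term in mentions:
        concept = normalize(raw_term)
        if concept is not None:
            patients[concept].add(patient_id)
    return {concept: len(ids) for concept, ids in patients.items()}


mentions = [
    ("p1", "T2DM"),
    ("p1", "Type 2 diabetes"),
    ("p2", "NIDDM"),
    ("p3", "HTN"),
    ("p3", "high  blood pressure"),
    ("p4", "unmapped local term"),
]
print(count_patients_by_concept(mentions))
```

Counting the raw strings would report five different "conditions" here. After normalization the same mentions collapse to two concepts, with two patients for diabetes and one for hypertension, and the unmapped term is left out rather than guessed.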
4. Clinical Reasoning & Conflict Resolution
Requirement: Resolve conflicts, deduplicate entities, distinguish assertion status (confirmed vs. ruled out), reason about temporal changes.
Why it matters: Real-world data is messy—without intelligent reasoning, systems either lose information or create duplicates and noise.
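Assertion status is the easiest part of this requirement to picture in code. The sketch below is a deliberately simplified, rule-based check in the spirit of trigger-phrase negation detection. It is not a medical language model, and the cue lists and function name are assumptions chosen for the example.

```python
import re

# Hypothetical cue lists; real trigger lexicons are far larger.
PRE_NEGATION_CUES = ("no evidence of", "negative for", "denies", "without", "no")
POST_NEGATION_CUES = ("ruled out", "excluded")


def _has_cue(text, cues):
    """True if any cue appears in the text as a whole word or phrase."""
    return any(re.search(rf"\b{re.escape(cue)}\b", text) for cue in cues)


def assertion_status(sentence, concept):
    """Classify a concept mention as 'affirmed', 'negated', or 'not_mentioned'.

    Simplification: the whole sentence before or after the mention is searched,
    with no limit on how far a cue may be from the concept.
    """
    text = sentence.lower()
    start = text.find(concept.lower())
    if start == -1:
        return "not_mentioned"
    before = text[:start]
    after = text[start + len(concept):]
    if _has_cue(before, PRE_NEGATION_CUES) or _has_cue(after, POST_NEGATION_CUES):
        return "negated"
    return "affirmed"


print(assertion_status("No evidence of pneumonia on chest X-ray.", "pneumonia"))
print(assertion_status("Pulmonary embolism was ruled out by CT.", "pulmonary embolism"))
print(assertion_status("Patient admitted with pneumonia.", "pneumonia"))
```

A cohort query that treats all three sentences as "mentions pneumonia or pulmonary embolism" would add two patients who do not have the condition. Uncertainty, family-member attribution, and historical mentions need the same kind of handling and are not covered by this sketch.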
5. Longitudinal Patient Timelines
Requirement: Organize all clinical events chronologically with precise temporal context to support queries about disease progression, treatment response, outcomes.
Why it matters: Most clinical questions involve time, so systems need to answer "when" questions, not just "does the patient have X?"
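A small sketch of what a "when" question looks like once events sit on a single timeline. The ClinicalEvent structure, field names, and sample events are hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ClinicalEvent:
    patient_id: str
    when: date
    kind: str      # e.g. "diagnosis", "medication", "lab"
    concept: str


def build_timeline(events, patient_id):
    """Return one patient's events in chronological order."""
    return sorted(
        (e for e in events if e.patient_id == patient_id),
        key=lambda e: e.when,
    )


def days_from_diagnosis_to_treatment(timeline, diagnosis, medication):
    """Days from the first diagnosis to the first matching medication on or after it."""
    first_dx = next(
        (e for e in timeline if e.kind == "diagnosis" and e.concept == diagnosis),
        None,
    )
    if first_dx is None:
        return None
    for event in timeline:
        if (event.kind == "medication" and event.concept == medication
                and event.when >= first_dx.when):
            return (event.when - first_dx.when).days
    return None


# Invented sample data, deliberately out of order as it might arrive from sources.
events = [
    ClinicalEvent("p1", date(2021, 4, 9), "medication", "metformin"),
    ClinicalEvent("p1", date(2021, 3, 1), "lab", "hba1c"),
    ClinicalEvent("p1", date(2021, 3, 10), "diagnosis", "type 2 diabetes"),
    ClinicalEvent("p2", date(2020, 1, 5), "diagnosis", "hypertension"),
]

timeline = build_timeline(events, "p1")
print([e.kind for e in timeline])
print(days_from_diagnosis_to_treatment(timeline, "type 2 diabetes", "metformin"))
```

The question "how long after diagnosis did treatment start?" cannot be answered from a flat list of codes. It only becomes a simple lookup after events from different sources have been merged and ordered per patient.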
6. Privacy & De-Identification
Requirement: Automatically remove PHI from clinical text, documents, and images using HIPAA-compliant methods with 99%+ accuracy.
Why it matters: Most secondary use requires de-identification—manual approaches are slow, expensive, and error-prone. Medical-specific patterns require specialized tools.
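The sketch below shows the simplest possible rule-based scrubber and, just as importantly, where it stops working. It is illustrative only and is not a HIPAA-compliant method. The patterns, replacement tags, and sample note (including the clinician name) are invented for this example.

```python
import re

# Hypothetical patterns covering three identifier formats only.
PHI_PATTERNS = [
    ("[MRN]", re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE)),
    ("[DATE]", re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
]


def scrub(text):
    """Replace each pattern match with its tag; anything unmatched passes through."""
    for tag, pattern in PHI_PATTERNS:
        text = pattern.sub(tag, text)
    return text


# Invented sample text.
NOTE = "Seen by Dr. Alvarez on 03/14/2022, MRN: 12345678, call 555-123-4567."
print(scrub(NOTE))
```

The date, record number, and phone number are masked, but the clinician name passes through untouched, as would a date written as "March 14th". Free-text identifiers like these are why the requirement above calls for specialized tooling rather than a handful of patterns.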
7. Provenance & Auditability
Requirement: Track complete lineage from source document to final output. Provide confidence scores. Enable drill-down to source evidence.
Why it matters: Regulatory compliance, research reproducibility, and clinical trust require transparency—not black boxes.
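One way to picture provenance is as a small record attached to every derived fact, holding enough information to jump back to the exact source text. The structures and field names below are assumptions made for illustration, not a description of any particular product's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    document_id: str   # which source document the fact came from
    start: int         # character offset where the evidence begins
    end: int           # character offset where the evidence ends (exclusive)
    extractor: str     # name and version of the component that produced the fact
    confidence: float  # extractor's confidence, from 0.0 to 1.0


@dataclass(frozen=True)
class DerivedFact:
    patient_id: str
    concept: str
    provenance: Provenance


def source_evidence(fact, documents):
    """Drill down from a derived fact to the exact span of source text."""
    text = documents[fact.provenance.document_id]
    return text[fact.provenance.start:fact.provenance.end]


# Invented sample document and fact.
documents = {"note-001": "Assessment: stage II breast cancer, ER positive."}
evidence = "stage II breast cancer"
start = documents["note-001"].index(evidence)

fact = DerivedFact(
    patient_id="p1",
    concept="breast cancer, stage II",
    provenance=Provenance("note-001", start, start + len(evidence),
                          "example-extractor-0.1", 0.93),
)
print(source_evidence(fact, documents))
```

Because the offsets and extractor version travel with the fact, a reviewer can confirm any value against its source sentence, and low-confidence facts can be filtered or routed for manual review.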
8. Continuous Updates & Living Datasets
Requirement: Keep patient journeys continuously up-to-date as new data arrives—not static snapshots that become stale.
Why it matters: Clinical data accumulates continuously. Quarterly refreshes miss opportunities and make results obsolete.
What You Gain: Accuracy, Speed, and Reuse
By deploying a platform that meets these requirements, organizations unlock transformative improvements in secondary use:
Speed to Insight
Patient timeline construction that previously took weeks of manual abstraction now completes in hours. Analyze 6× more patients in the same timeframe compared to manual review workflows.
Improved Completeness
One standard format for all data, with >96% timeline completeness, confidence scores and an audit trail, and >99% de-identification accuracy.
Reuse at Scale
One standardized dataset supports cohorts, registries, analytics, and AI agents. Build the data foundation once, innovate endlessly on top. No redundant pipeline development, consistent results across applications.
Embedded Governance
Full provenance, lineage, and auditability of every derived clinical fact. Confidence scores for extracted data. Transparent handling of conflicts and missing data. Audit trails for regulatory compliance (HIPAA, 21 CFR Part 11, IRB).
Regulatory Readiness
Parallel identified and de-identified datasets kept semantically synchronized. Research models move to production without pipeline rewrites. Train on de-identified data, deploy on identified data—using identical feature definitions.
Future-Proof AI
A durable data foundation supporting advanced analytics, ML, and agentic systems at enterprise scale. New AI applications introduced without re-engineering data preparation. Evolves with advances in NLP and clinical AI.
The Result: A Repeatable Operating Model for Secondary Use
The outcome is not just cleaner data, but a repeatable operating model for secondary use—where accuracy, trust, and reuse are built into the foundation rather than retrofitted for each new project.
Instead of spending 80% of effort on data wrangling for every new use case, teams spend 10% on data integration (once) and 90% on innovation (continuously). Instead of accepting 40% incompleteness and hoping it doesn't matter, organizations build on 96%+ complete patient representations and trust their results. Instead of isolated data pipelines that conflict and diverge, a single shared foundation ensures consistency across cohorts, registries, analytics, and AI applications.
💡 The Bottom Line
Secondary use of clinical data has enormous potential—but only if the data accurately represents what happened to patients. Patient Journey Intelligence provides the foundation: complete multimodal integration, healthcare-specific NLP, terminology normalization, clinical reasoning, longitudinal timelines, de-identification, provenance, and continuous updates.
Build the foundation once. Innovate on accurate, trustworthy, reusable patient data—forever.