
Information Extraction

Information Extraction uses AI-powered Natural Language Processing (NLP) to transform unstructured clinical narratives into structured, coded facts. The platform analyzes clinical notes, discharge summaries, pathology reports, and radiology findings to extract entities, relationships, and clinical context that would otherwise remain inaccessible to analytics and research queries.

Extract Structured Facts from Clinical Narratives

When a clinician writes "Patient presented with acute onset chest pain radiating to left arm. ECG shows ST elevation. Troponin elevated at 2.5. Started on aspirin 325mg and atorvastatin 80mg. No prior history of CAD.", the NLP pipeline extracts:

  • Entities: chest pain, left arm, ST elevation, troponin, aspirin, atorvastatin, CAD
  • Relationships: chest pain → radiates to → left arm
  • Assertions: chest pain (present), CAD (absent/negated), onset (acute)
  • Dosages: aspirin 325mg, atorvastatin 80mg
  • Lab results: troponin 2.5 (abnormal)

This structured representation enables queries such as "patients with ST elevation and elevated troponin who were started on aspirin and a statin". That information exists only in narrative text, not in diagnosis codes or structured fields.


Core NLP Components

Named Entity Recognition (NER)

What it does: Identifies and classifies clinical concepts in text into predefined categories such as conditions, medications, procedures, anatomy, or test results.

Example:

Input text:
"Patient has type 2 diabetes mellitus controlled on metformin 1000mg BID. HbA1c 6.8%."

Extracted entities:
  • PROBLEM: type 2 diabetes mellitus
  • DRUG: metformin
  • DOSAGE: 1000mg BID
  • TEST: HbA1c
  • TEST_RESULT: 6.8%

The NER model recognizes not just the disease name, but also the specific medication, dosage regimen, lab test, and quantitative result, all of which are critical for building accurate patient timelines and cohorts.
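To make the input/output shape of NER concrete, here is a minimal sketch that reproduces the example above with a lookup table and two regular expressions. It is only an illustration: the `LEXICON`, the patterns, and the `tag_entities` function are hypothetical names, and a production NER model learns these categories from annotated data rather than from hand-written rules.

```python
import re

# Hypothetical toy lexicon; a real NER model is trained, not hand-coded.
LEXICON = {
    "type 2 diabetes mellitus": "PROBLEM",
    "metformin": "DRUG",
    "hba1c": "TEST",
}
# Illustrative patterns for a dose with optional frequency, and a percentage result.
DOSAGE = re.compile(r"\b\d+(?:\.\d+)?\s?mg(?:\s(?:BID|TID|QD|QID))?\b")
RESULT = re.compile(r"\b\d+(?:\.\d+)?%")


def tag_entities(text):
    """Return (label, surface_text, start_offset) tuples sorted by position."""
    found = []
    lowered = text.lower()
    for term, label in LEXICON.items():
        for m in re.finditer(re.escape(term), lowered):
            found.append((label, text[m.start():m.end()], m.start()))
    for m in DOSAGE.finditer(text):
        found.append(("DOSAGE", m.group(), m.start()))
    for m in RESULT.finditer(text):
        found.append(("TEST_RESULT", m.group(), m.start()))
    return sorted(found, key=lambda e: e[2])


note = ("Patient has type 2 diabetes mellitus controlled on "
        "metformin 1000mg BID. HbA1c 6.8%.")
entities = tag_entities(note)
for label, surface, start in entities:
    print(f"{label}: {surface} (offset {start})")
```

Keeping the character offset with each entity is a common design choice, because later steps (assertion, relations, section context) need to know where in the note a mention occurred.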

Assertion Detection

What it does: Determines the clinical status of each extracted entity. Was the condition present or absent? Current or historical? Certain or suspected?

Example:

Input text:
"No evidence of pneumonia on chest X-ray. Patient denies shortness of breath. Family history of COPD. Possible early signs of bronchitis."

Assertion status:
  • pneumonia → Absent (negated by "no evidence")
  • shortness of breath → Absent (negated by "denies")
  • COPD → Someone_Else (family history, not patient)
  • bronchitis → Hypothetical (qualified by "possible")

Without assertion detection, a query for "patients with pneumonia" would incorrectly include this patient. Assertion models ensure extracted facts reflect true clinical status, not just mention in text.
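The idea can be sketched with a small cue-based classifier in the spirit of rule-based approaches such as NegEx. This is an assumption-laden illustration, not the platform's assertion model: the cue lists and the `assert_status` function are hypothetical, and the status labels are copied from the example above.

```python
import re

# Hypothetical cue lists, checked in order so the first matching cue wins.
CUES = [
    ("Someone_Else", re.compile(r"\bfamily history of\b", re.I)),
    ("Absent", re.compile(r"\b(no evidence of|denies|no)\b", re.I)),
    ("Hypothetical", re.compile(r"\b(possible|suspected|rule out)\b", re.I)),
]


def assert_status(sentence, entity):
    """Classify an entity by the cues that precede it in the same sentence."""
    idx = sentence.lower().find(entity.lower())
    if idx < 0:
        raise ValueError(f"{entity!r} not found in sentence")
    prefix = sentence[:idx]
    for status, pattern in CUES:
        if pattern.search(prefix):
            return status
    return "Present"  # default when no cue precedes the mention


cases = [
    ("No evidence of pneumonia on chest X-ray.", "pneumonia"),
    ("Patient denies shortness of breath.", "shortness of breath"),
    ("Family history of COPD.", "COPD"),
    ("Possible early signs of bronchitis.", "bronchitis"),
    ("Patient has pneumonia.", "pneumonia"),
]
statuses = [(entity, assert_status(sentence, entity)) for sentence, entity in cases]
for entity, status in statuses:
    print(f"{entity} -> {status}")
```

Only looking at text before the mention within one sentence is a deliberate simplification; cues that follow the entity ("pneumonia was ruled out") would need an additional rule.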

Relation Extraction (RE)

What it does: Identifies semantic relationships between entities. Which drug treats which condition? Which procedure was performed on which anatomical site? Which symptom is caused by which diagnosis?

Example:

Input text:
"Started lisinopril 10mg for hypertension. Patient reports headache possibly related to new medication."

Extracted relationships:
  • DRUG-PROBLEM: lisinopril → treats → hypertension
  • DRUG-ADE: lisinopril → possibly causes → headache (suspected adverse drug event)

Relation extraction enables queries like "patients who developed headaches as side effects of ACE inhibitors", which is critical for pharmacovigilance and adverse event surveillance.
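As a rough sketch of what a relation extractor produces for the example above, the following pattern-based code emits (subject, relation, object) triples. The two patterns, the relation names, and the rule that "new medication" refers to the most recently started drug are all assumptions made for illustration; real relation extraction models classify entity pairs rather than match fixed phrasings.

```python
import re

# Hypothetical patterns that cover only the phrasing of this one example.
TREATS = re.compile(r"Started (?P<drug>\w+)(?: \d+\s?mg)? for (?P<problem>[\w ]+?)\.")
ADE = re.compile(r"reports (?P<event>[\w ]+?) possibly related to new medication")


def extract_relations(text):
    """Return (subject, relation, object) triples found sentence by sentence."""
    relations = []
    last_drug = None  # assumption: "new medication" means the last drug started
    for sentence in re.split(r"(?<=\.)\s+", text):
        m = TREATS.search(sentence)
        if m:
            last_drug = m.group("drug")
            relations.append((last_drug, "treats", m.group("problem")))
        m = ADE.search(sentence)
        if m and last_drug:
            # Keep the hedge from the source text in the relation label.
            relations.append((last_drug, "possibly_causes", m.group("event")))
    return relations


note = ("Started lisinopril 10mg for hypertension. "
        "Patient reports headache possibly related to new medication.")
relations = extract_relations(note)
for subject, relation, obj in relations:
    print(f"{subject} -> {relation} -> {obj}")
```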

Temporal Extraction

What it does: Extracts time expressions and dates, and resolves relative references to absolute timestamps.

Example:

Input text (note dated 2024-03-15):
"Patient diagnosed with lupus 3 years ago. Started hydroxychloroquine last month. Labs drawn yesterday showed improvement."

Resolved timestamps:
  • lupus diagnosis → ~2021-03-15
  • hydroxychloroquine start → ~2024-02-15
  • lab results → 2024-03-14

Temporal resolution is essential for constructing accurate patient timelines and determining the sequence of clinical events.
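The resolution step in the example can be reproduced with standard-library date arithmetic. The sketch below is a simplified illustration, assuming the note date is known and handling only three kinds of expression; the `resolve` and `shift_months` helpers are hypothetical names, and the results for "3 years ago" and "last month" are approximations, as the "~" in the example indicates.

```python
import calendar
import re
from datetime import date, timedelta


def shift_months(d, months):
    """Move a date by whole months, clamping the day to the target month's length."""
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def resolve(expression, note_date):
    """Resolve a relative time expression against the date of the note."""
    expr = expression.lower().strip()
    if expr == "yesterday":
        return note_date - timedelta(days=1)
    if expr == "last month":
        return shift_months(note_date, -1)
    m = re.fullmatch(r"(\d+) (day|week|month|year)s? ago", expr)
    if m:
        n, unit = int(m.group(1)), m.group(2)
        if unit == "day":
            return note_date - timedelta(days=n)
        if unit == "week":
            return note_date - timedelta(weeks=n)
        if unit == "month":
            return shift_months(note_date, -n)
        return shift_months(note_date, -12 * n)
    raise ValueError(f"unsupported expression: {expression!r}")


note_date = date(2024, 3, 15)
for expression in ("3 years ago", "last month", "yesterday"):
    print(expression, "->", resolve(expression, note_date).isoformat())
```

Clamping the day in `shift_months` avoids invalid dates such as February 30 when the note is dated late in a month.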

Section Detection

What it does: Identifies document structure and section boundaries (History of Present Illness, Assessment, Plan, etc.). This provides context for interpreting entities.

Example:

Input text:

Chief Complaint: Chest pain
History: No prior cardiac history
Assessment: Rule out myocardial infarction
Plan: Admit to cardiology, serial troponins

Section context:
  • "Chest pain" in Chief Complaint → current presenting symptom
  • "No prior cardiac history" in History → negated past condition
  • "Rule out myocardial infarction" in Assessment → hypothetical diagnosis under investigation

Section detection helps distinguish between what the patient has (current diagnoses), what they don't have (negated findings), and what is being ruled out (differential diagnoses).
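For notes with explicit headers like the example above, section boundaries can be approximated with a header pattern. The sketch below assumes a fixed, hypothetical list of header names and the `split_sections` helper; real notes use far more varied headings, which is why trained section classifiers are typically used instead.

```python
import re

# Hypothetical header list covering only the example note.
HEADER = re.compile(r"^(Chief Complaint|History|Assessment|Plan):\s*(.*)$")


def split_sections(note):
    """Map each recognized section header to the text that follows it."""
    sections = {}
    current = None
    for raw_line in note.splitlines():
        line = raw_line.strip()
        m = HEADER.match(line)
        if m:
            current = m.group(1)
            sections[current] = m.group(2)
        elif current and line:
            # Continuation line: append to the section opened most recently.
            sections[current] = (sections[current] + " " + line).strip()
    return sections


note = """Chief Complaint: Chest pain
History: No prior cardiac history
Assessment: Rule out myocardial infarction
Plan: Admit to cardiology, serial troponins"""

sections = split_sections(note)
for name, body in sections.items():
    print(f"[{name}] {body}")
```

Downstream components can then attach the section name to every entity found in that span, so that a mention in Assessment is interpreted differently from the same mention in History.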


Specialized NLP Pipelines

Different document types require tailored extraction models. The platform provides pre-configured pipelines optimized for specific clinical domains:

Clinical Notes Pipeline

Use case: General inpatient/outpatient notes, progress notes, H&P documentation

Capabilities:

  • Broad medical entity recognition across all specialties
  • Temporal extraction for dates, durations, and frequencies
  • Negation and uncertainty detection
  • Medication dosage and route extraction

Example use case: Building a cohort of patients with heart failure requires extracting not just ICD codes, but narrative mentions like "patient has reduced EF of 35% and NYHA Class III symptoms" from cardiology notes.

Radiology Reports Pipeline

Use case: CT scans, MRIs, X-rays, ultrasounds

Capabilities:

  • Anatomical location extraction with high precision
  • Imaging findings and observations
  • Severity qualifiers (mild, moderate, severe)
  • Measurement extraction (tumor size, lesion dimensions)

Example:

Input text:
"3.2 cm mass in right upper lobe with irregular borders. Moderate mediastinal lymphadenopathy. No pleural effusion."

Extracted facts:
  • FINDING: mass (3.2 cm, right upper lobe, irregular borders)
  • FINDING: mediastinal lymphadenopathy (moderate severity)
  • FINDING: pleural effusion (absent/negated)

This enables queries like "patients with lung masses >3 cm" or "scans showing lymphadenopathy without effusion", both of which are essential for radiology research and surveillance.
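Once findings are structured, queries of this kind reduce to simple filters. The sketch below assumes a hypothetical in-memory representation of extracted findings (the field names `name`, `detail`, and `status`, and the second report, are invented for illustration) and shows a size threshold and a presence/absence check.

```python
import re

SIZE = re.compile(r"(\d+(?:\.\d+)?)\s*(cm|mm)\b")


def size_in_cm(text):
    """Return the first measurement in the text in centimetres, or None."""
    m = SIZE.search(text)
    if m is None:
        return None
    value = float(m.group(1))
    return value / 10 if m.group(2) == "mm" else value


# Hypothetical extracted facts; R1 mirrors the example report, R2 is invented.
reports = [
    {"id": "R1", "findings": [
        {"name": "mass", "detail": "3.2 cm, right upper lobe, irregular borders", "status": "present"},
        {"name": "mediastinal lymphadenopathy", "detail": "moderate", "status": "present"},
        {"name": "pleural effusion", "detail": "", "status": "absent"},
    ]},
    {"id": "R2", "findings": [
        {"name": "nodule", "detail": "8 mm, left lower lobe", "status": "present"},
    ]},
]


def has_large_lesion(report, min_cm=3.0):
    """True if any present finding carries a measurement above min_cm."""
    for finding in report["findings"]:
        size = size_in_cm(finding["detail"])
        if finding["status"] == "present" and size is not None and size > min_cm:
            return True
    return False


def lymphadenopathy_without_effusion(report):
    """True only when effusion is explicitly recorded as absent."""
    status = {f["name"]: f["status"] for f in report["findings"]}
    return (status.get("mediastinal lymphadenopathy") == "present"
            and status.get("pleural effusion") == "absent")


large = [r["id"] for r in reports if has_large_lesion(r)]
no_effusion = [r["id"] for r in reports if lymphadenopathy_without_effusion(r)]
print(large, no_effusion)
```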

Pathology Reports Pipeline

Use case: Surgical pathology, biopsies, cytology

Capabilities:

  • Tumor characteristics (histology, grade, margins)
  • Biomarker results (ER/PR/HER2 status, PD-L1 expression)
  • TNM staging components
  • Molecular findings (gene mutations, MSI status)

Example:

Input text:
"Invasive ductal carcinoma, grade 2. ER positive (90%), PR positive (70%), HER2 negative. Margins clear. 2 of 15 lymph nodes positive."

Extracted pathology facts:
  • HISTOLOGY: invasive ductal carcinoma
  • GRADE: 2
  • ER_STATUS: positive (90%)
  • PR_STATUS: positive (70%)
  • HER2_STATUS: negative
  • MARGIN_STATUS: clear
  • LYMPH_NODE_STATUS: 2/15 positive

These details are critical for precision oncology but are often missing from structured fields. As documented in the secondary use requirements, 68% of staging data and 14-98% of biomarker data exist only in pathology narratives.
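To show how receptor status and node counts in the example report could become typed fields, here is a minimal regex-based sketch. The patterns and the `parse_pathology` function are hypothetical and only handle the phrasing shown above; pathology reports vary widely, so this is not a substitute for a trained extraction model.

```python
import re

# Hypothetical patterns for receptor status and lymph node counts.
STATUS = re.compile(r"\b(ER|PR|HER2) (positive|negative)(?: \((\d+)%\))?")
NODES = re.compile(r"(\d+) of (\d+) lymph nodes positive")


def parse_pathology(text):
    """Return receptor statuses as (status, percent) and nodes as (positive, examined)."""
    facts = {}
    for marker, status, pct in STATUS.findall(text):
        facts[f"{marker}_STATUS"] = (status, int(pct) if pct else None)
    m = NODES.search(text)
    if m:
        facts["LYMPH_NODE_STATUS"] = (int(m.group(1)), int(m.group(2)))
    return facts


report = ("Invasive ductal carcinoma, grade 2. ER positive (90%), "
          "PR positive (70%), HER2 negative. Margins clear. "
          "2 of 15 lymph nodes positive.")
facts = parse_pathology(report)
for key, value in facts.items():
    print(key, value)
```

Storing the percentage as an integer rather than a string lets a later query express thresholds such as "ER at least 50%" directly.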

Oncology Pipeline

Use case: Oncology clinical notes, treatment summaries

Capabilities:

  • Cancer staging (TNM components)
  • Treatment regimen extraction (chemotherapy protocols)
  • Response assessment (RECIST criteria, partial response, stable disease)
  • Performance status (ECOG, Karnofsky)

Example:

Input text:
"Stage IIIb NSCLC (T3N2M0). Started carboplatin/pemetrexed cycle 1 today. ECOG 1. Restaging CT after cycle 4 showed partial response per RECIST 1.1."

Extracted oncology facts:
  • CANCER_STAGE: IIIb
  • TNM: T3N2M0
  • CANCER_TYPE: NSCLC
  • TREATMENT: carboplatin/pemetrexed
  • TREATMENT_CYCLE: 1
  • PERFORMANCE_STATUS: ECOG 1
  • RESPONSE: partial response (RECIST 1.1)

This enables longitudinal tracking of treatment response and outcomes analysis that would be impossible with diagnosis codes alone.
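The staging and performance-status fields in the example follow compact, regular notations, which makes them easy to illustrate. The sketch below, with hypothetical patterns and a hypothetical `parse_oncology` function, splits a TNM string into its components and picks up the stage group and ECOG score; it covers only simple forms such as T3N2M0 and ignores prefixes and subcategories (for example pT1c).

```python
import re

# Hypothetical patterns covering only the simple notations in the example.
STAGE = re.compile(r"\bStage\s+([IV]+[A-Ca-c]?)\b")
TNM = re.compile(r"\b(T[0-4X])(N[0-3X])(M[01X])\b")
ECOG = re.compile(r"\bECOG\s+([0-5])\b")


def parse_oncology(text):
    """Extract stage group, TNM components, and ECOG score when present."""
    facts = {}
    m = STAGE.search(text)
    if m:
        facts["CANCER_STAGE"] = m.group(1)
    m = TNM.search(text)
    if m:
        facts["TNM"] = {"T": m.group(1), "N": m.group(2), "M": m.group(3)}
    m = ECOG.search(text)
    if m:
        facts["PERFORMANCE_STATUS"] = int(m.group(1))
    return facts


note = ("Stage IIIb NSCLC (T3N2M0). Started carboplatin/pemetrexed cycle 1 today. "
        "ECOG 1. Restaging CT after cycle 4 showed partial response per RECIST 1.1.")
onc = parse_oncology(note)
print(onc)
```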


Available NLP Pipelines

The platform's extraction pipelines are organized by clinical domain:

Oncology

Extract cancer types, staging, biomarkers, and treatments from pathology reports and clinical notes.

Mental Health

Identify psychiatric diagnoses, substance use, and mental health indicators from clinical documentation.

Social Determinants

Extract housing, employment, education, and social support information from free-text notes.

Diagnoses & Procedures

Identify diagnoses, procedures, symptoms, and their assertion status (present, absent, uncertain) from clinical documents.

Radiology

Extract anatomical locations, imaging findings, and observations from radiology reports with assertion status.

Drugs & Adverse Events

Identify medications, dosages, adverse drug reactions, and drug-event relationships from clinical text.

Labs & Vitals

Detect laboratory results, vital signs, test names, and clinical measurements from medical records.

Risk Factors

Identify clinical risk factors including smoking status, alcohol use, obesity, hypertension, and other conditions for risk stratification and preventive care.

Genomics & Biomarkers

Extract gene names, variants, mutations, protein expressions, biomarkers, and molecular test results from pathology reports and genomic testing documentation.


Terminology Mapping

Extracted entities are automatically normalized to standard medical terminologies, enabling consistent representation and cross-institutional interoperability:

  1. Extract Raw Text
     "Started metformin for T2DM"

  2. Identify Entities
     DRUG: metformin   |   PROBLEM: T2DM

  3. Map to Standard Codes
     RxNorm: 6809 (Metformin)   |   SNOMED CT: 44054006 (Type 2 diabetes mellitus)

  4. Store in OMOP CDM
     Structured representation with standard vocabularies for analytics and research

Supported vocabularies:

  • SNOMED CT: Comprehensive clinical terminology for conditions, procedures, and findings
  • RxNorm: Standardized drug names and ingredients
  • LOINC: Laboratory tests and clinical observations
  • ICD-10-CM: Diagnosis codes for interoperability and regulatory reporting
  • CPT: Procedure codes for billing and quality measurement

This ensures that "Type 2 diabetes mellitus", "T2DM", "diabetes type 2", and "NIDDM" all map to the same SNOMED concept (44054006), enabling consistent querying across varied documentation styles.
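The normalization step can be pictured as a lookup from surface forms to a single concept. The sketch below is a deliberately small illustration: the synonym table and the `normalize` function are hypothetical, and only the two codes quoted above (SNOMED CT 44054006 and RxNorm 6809) are used. A real mapping relies on the full terminology releases and on context-aware entity resolution rather than exact string matches.

```python
# Concepts keyed by (vocabulary, code); codes are the ones quoted in the text above.
CONCEPTS = {
    ("SNOMED CT", "44054006"): "Type 2 diabetes mellitus",
    ("RxNorm", "6809"): "Metformin",
}

# Hypothetical synonym table: lowercased surface form -> (vocabulary, code).
SYNONYMS = {
    "type 2 diabetes mellitus": ("SNOMED CT", "44054006"),
    "t2dm": ("SNOMED CT", "44054006"),
    "diabetes type 2": ("SNOMED CT", "44054006"),
    "niddm": ("SNOMED CT", "44054006"),
    "metformin": ("RxNorm", "6809"),
}


def normalize(mention):
    """Return (vocabulary, code, preferred_name), or None if the mention is unknown."""
    key = SYNONYMS.get(" ".join(mention.lower().split()))
    if key is None:
        return None
    vocabulary, code = key
    return vocabulary, code, CONCEPTS[key]


mentions = ["Type 2 diabetes mellitus", "T2DM", "diabetes type 2", "NIDDM", "metformin"]
for mention in mentions:
    print(mention, "->", normalize(mention))
```

Returning `None` for unknown mentions, instead of guessing the closest match, keeps unmapped text visible for review rather than silently assigning a wrong code.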


Impact on Clinical Data Completeness

Clinical narratives contain the richest, most detailed clinical information, but that information remains inaccessible to analytics without NLP. Research shows:

  • 87% of clinical concepts extracted from patient records exist only in free text, with no structured counterparts (Dielissen et al., 2025)
  • 59% of family history is documented in notes but appears in structured fields for only 5% of patients (Waudby-Smith et al., 2015)
  • 68%+ of cancer staging data is missing from structured EHR fields (Harris et al., 2022)

Information extraction bridges this gap by transforming unstructured narratives into structured, queryable facts while preserving clinical nuance and context.