Data Curation Automation
Data Curation Automation provides a configurable, transparent, and scalable framework for transforming clinical text into structured, ontology-aligned outputs. By leveraging clinical ontologies, NLP pipelines, and validation workflows, the module supports a wide range of data curation use cases, from oncology registries and pathology abstraction to social determinants extraction and quality reporting.
Data Curation Workflow
[Video: walkthrough of the Data Curation Automation interface. The video demonstrates the interface visually, without audio narration.]
Core Functionalities
The module enables users to:
- Define extraction schemas using ontologies
- Select cohorts or documents for processing
- Execute automated NLP-based extraction
- Review, validate, and edit structured outputs
- Trace evidence across source documents
- Track curation versions for full auditability
The system supports any custom ontology, making it adaptable to diverse clinical, operational, or research domains.
Curation Overview Dashboard
The landing page summarizes the system's active and historical automation workflows.
Summary Metrics:
- Total Automations: Number of configured extraction runs
- Completed Runs: Successfully executed curation jobs
- In-Progress Runs: Actively running pipelines
- Queued Runs: Scheduled or pending executions
Curation List Table:
Displays all automation runs with key metadata:
- Automation Name
- Execution Status (Completed, In Progress, Failed)
- Associated Ontology
- Extraction Method (Patient-level or Document-level)
- Number of Patients / Documents Processed
- Execution Timestamp
- Results button to access curated data
This interface supports real-time monitoring of curation workflows and operational throughput.
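To illustrate how the summary metrics relate to the rows of the Curation List table, the following minimal sketch tallies runs by status. The `Run` record, the status strings, and the metric keys are hypothetical names chosen for this example; they are not the product's data model or API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical record for one row of the Curation List table."""
    name: str
    status: str  # assumed values: "Completed", "In Progress", "Queued", "Failed"


def summarize(runs: list[Run]) -> dict[str, int]:
    """Tally runs into the four counters shown on the dashboard."""
    by_status = Counter(run.status for run in runs)
    return {
        "total_automations": len(runs),
        "completed_runs": by_status["Completed"],
        "in_progress_runs": by_status["In Progress"],
        "queued_runs": by_status["Queued"],
    }


runs = [
    Run("Breast registry pilot", "Completed"),
    Run("SDOH extraction", "In Progress"),
    Run("Pathology abstraction", "Queued"),
    Run("Lung registry", "Completed"),
]
metrics = summarize(runs)
```

In this sketch, failed runs count toward the total but have no counter of their own, which mirrors the four metrics listed above.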
Review and Validation Workspace
Selecting a completed run opens the Curation Details interface.
Left Panel:
- Patient list with selection and filtering capabilities
Main Panel:
- Structured output aligned to the chosen ontology
- View modes: Overview, Staging, Raw Results, and Version History
The UI dynamically adapts to the underlying ontology schema.
Overview Tab
Presents structured, consolidated values per patient based on the ontology schema.
May include:
- Demographics
- Clinical Findings & Diagnoses
- Pathology / Histology
- Laboratory Metrics
- Medications & Procedures
- Domain-specific attributes (e.g., social history, cancer staging)
Expandable Sections
For fields populated from multiple sources, each section expands to show:
- Final consolidated value
- Source-level evidence
- Extraction logic notes
- Underlying document references
This tab provides a concise yet evidence-backed summary per patient.
Staging Tab
This tab is visible only for ontologies that include staging elements.
It supports review and adjustment of extracted cancer staging or disease severity attributes.
Features:
- Auto-populated values based on NLP extraction
- Underlying evidence display
- Manual override capability
- Domain-specific structure (e.g., TNM, AJCC)
Evidence Modal
Accessible from any field to view:
- Extracted Value
- Supporting Evidence (documents, snippets)
- Contradictory Evidence (alternative interpretations)
- Confidence Scores
- Highlighted text from original documentation
Additional capabilities:
- Document navigation
- Highlight alignment per field
- Comparison of multiple supporting sources
This modal enhances transparency and enables clinical-grade validation.
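One way to picture the information behind the modal is a small record per field, plus a helper that locates a snippet in the source text so it can be highlighted. This is a sketch under stated assumptions: `Snippet`, `FieldEvidence`, `highlight_span`, and `has_conflict` are hypothetical names, and exact substring matching stands in for whatever alignment the product actually performs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Snippet:
    """A passage quoted from a source document (hypothetical structure)."""
    document_id: str
    text: str


@dataclass
class FieldEvidence:
    """Everything the modal shows for one field (hypothetical structure)."""
    field_name: str
    extracted_value: str
    confidence: float  # assumed to lie in [0, 1]
    supporting: list[Snippet] = field(default_factory=list)
    contradictory: list[Snippet] = field(default_factory=list)


def highlight_span(document_text: str, snippet: Snippet) -> Optional[tuple[int, int]]:
    """Return (start, end) character offsets of the snippet, or None if absent."""
    start = document_text.find(snippet.text)
    if start == -1:
        return None
    return start, start + len(snippet.text)


def has_conflict(evidence: FieldEvidence) -> bool:
    """True when at least one snippet contradicts the extracted value."""
    return len(evidence.contradictory) > 0


note = "Biopsy shows invasive ductal carcinoma, grade 2. ER positive."
evidence = FieldEvidence(
    field_name="histology",
    extracted_value="invasive ductal carcinoma",
    confidence=0.93,
    supporting=[Snippet("doc-1", "invasive ductal carcinoma")],
)
span = highlight_span(note, evidence.supporting[0])
```

Slicing `note[span[0]:span[1]]` recovers the highlighted text, which is the property a reviewer relies on when checking a value against its source.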
Raw Results Tab
Displays the raw JSON output from the extraction engine.
Includes:
- Field-level extractions
- Consolidation metadata
- Document-level statistics
- NLP confidence scores
- Cross-references to source documents
Used primarily by data engineers and integration teams.
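For integration work, raw output of this kind is typically flattened into rows before loading it elsewhere. The sketch below assumes a payload shape invented for illustration; the key names (`patient_id`, `fields`, `confidence`, `source_documents`, `document_stats`) are hypothetical, and the actual schema produced by the extraction engine may differ.

```python
import json

# Hypothetical payload; the real schema of the extraction engine may differ.
raw = """
{
  "patient_id": "P001",
  "fields": [
    {"name": "primary_site", "value": "Left breast", "confidence": 0.97,
     "source_documents": ["doc-12", "doc-19"]},
    {"name": "histology", "value": "Invasive ductal carcinoma", "confidence": 0.64,
     "source_documents": ["doc-12"]}
  ],
  "document_stats": {"documents_processed": 2}
}
"""


def to_rows(result: dict) -> list[tuple]:
    """Flatten one result into (patient, field, value, confidence, n_sources) rows."""
    return [
        (
            result["patient_id"],
            item["name"],
            item["value"],
            item["confidence"],
            len(item["source_documents"]),
        )
        for item in result["fields"]
    ]


rows = to_rows(json.loads(raw))
```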
Version History Tab
Maintains a full audit trail of edits and overrides.
Features:
- View and compare historical versions
- Restore previous states
- Track user-level changes
This supports traceability, regulatory compliance, and reproducibility in abstraction workflows.
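The behaviour described here resembles an append-only log, in which restoring an earlier state creates a new version instead of deleting later ones. The sketch below is an assumption about how such a trail could work, not a description of the product's storage; `Version` and `VersionHistory` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Version:
    number: int
    user: str
    timestamp: datetime
    values: dict


class VersionHistory:
    """Append-only history of curated values for one patient (illustrative)."""

    def __init__(self) -> None:
        self._versions: list[Version] = []

    def commit(self, user: str, values: dict) -> Version:
        """Store a copy of the values as the next version."""
        version = Version(
            number=len(self._versions) + 1,
            user=user,
            timestamp=datetime.now(timezone.utc),
            values=dict(values),
        )
        self._versions.append(version)
        return version

    def diff(self, old: int, new: int) -> dict:
        """Map each changed field to its (old, new) pair of values."""
        a = self._versions[old - 1].values
        b = self._versions[new - 1].values
        return {
            key: (a.get(key), b.get(key))
            for key in sorted(a.keys() | b.keys())
            if a.get(key) != b.get(key)
        }

    def restore(self, number: int, user: str) -> Version:
        """Re-commit an earlier state; later versions remain in the trail."""
        return self.commit(user, self._versions[number - 1].values)


history = VersionHistory()
history.commit("nlp-pipeline", {"stage": "II", "grade": "2"})
history.commit("abstractor", {"stage": "III", "grade": "2"})
changes = history.diff(1, 2)
restored = history.restore(1, "reviewer")
```

Because `restore` appends instead of rewinding, every edit and override stays visible, which is what an audit trail requires.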
Creating a New Automation
A guided, multi-step configuration wizard enables users to define a new curation job.
1. Ontology Selection
Choose a pre-existing ontology schema.
The system displays:
- Ontology Name and Description
- Total Fields and Categories
- Required vs. Optional Fields
- Field Data Types
- Extraction Guidelines (if configured)
Optional:
- Include OMOP Records: Merges structured OMOP (Observational Medical Outcomes Partnership) records held in Patient Journey Intelligence with NLP-extracted values
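The source does not describe how structured and NLP-extracted values are reconciled, so the following is only a sketch of one plausible policy: keep a provenance tag per field, prefer one source when both are present, and flag disagreements for review. The function `merge_records`, the `prefer` parameter, and the field names are all hypothetical.

```python
from typing import Any


def merge_records(
    structured: dict[str, Any],
    extracted: dict[str, Any],
    prefer: str = "structured",
) -> dict[str, dict[str, Any]]:
    """Merge two field dictionaries, recording where each value came from."""
    if prefer not in ("structured", "nlp"):
        raise ValueError("prefer must be 'structured' or 'nlp'")
    merged = {}
    for name in sorted(structured.keys() | extracted.keys()):
        in_structured = name in structured
        in_extracted = name in extracted
        if in_structured and in_extracted:
            # Both sources have the field: apply the assumed precedence rule.
            source = prefer
            conflict = structured[name] != extracted[name]
        else:
            source = "structured" if in_structured else "nlp"
            conflict = False
        value = structured[name] if source == "structured" else extracted[name]
        merged[name] = {"value": value, "source": source, "conflict": conflict}
    return merged


structured = {"gender": "F", "birth_year": 1961}
extracted = {"gender": "F", "birth_year": 1962, "smoking_status": "former smoker"}
merged = merge_records(structured, extracted)
```

Fields found only in the notes, such as `smoking_status` here, pass through with an `nlp` tag, while the disagreement on `birth_year` is kept visible instead of being silently resolved.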
2. Cohort Selection
Select cohorts from which patients will be drawn.
Each cohort includes:
- Name and Description
- Patient and Document Counts
- Metadata Preview
Multiple cohorts can be merged to define the processing population.
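A simple way to think about merging cohorts is as a set union of patient identifiers, so that a patient who belongs to several cohorts is processed once. That deduplication behaviour is an assumption made for this sketch; `merge_cohorts` and the cohort names are hypothetical.

```python
def merge_cohorts(cohorts: dict[str, set[str]]) -> set[str]:
    """Union the patient IDs of all selected cohorts; duplicates collapse."""
    population: set[str] = set()
    for patient_ids in cohorts.values():
        population |= patient_ids
    return population


cohorts = {
    "breast_2023": {"P001", "P002", "P003"},
    "her2_positive": {"P002", "P004"},
}
population = merge_cohorts(cohorts)
```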
3. Patient or Document Selection
Refine selection based on patient attributes or clinical metadata.
Extraction Modes:
- Patient-Level: Produces one normalized result per patient
- Document-Level: Extracts and stores results per document
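The difference between the two modes can be sketched as a consolidation step: document-level output keeps every per-document row, while patient-level output reduces those rows to one value per field per patient. The rule used below (keep the highest-confidence value) is an assumption for illustration; the source does not state how the product consolidates. All names are hypothetical.

```python
from collections import defaultdict


def consolidate(doc_results: list[dict]) -> dict[str, dict[str, dict]]:
    """Reduce document-level rows to one value per (patient, field).

    Assumed rule: the row with the highest confidence wins.
    """
    best: dict[tuple[str, str], dict] = {}
    for row in doc_results:
        key = (row["patient_id"], row["field"])
        if key not in best or row["confidence"] > best[key]["confidence"]:
            best[key] = row

    patients: dict[str, dict[str, dict]] = defaultdict(dict)
    for (patient_id, field_name), row in best.items():
        patients[patient_id][field_name] = {
            "value": row["value"],
            "evidence_document": row["document_id"],
        }
    return dict(patients)


doc_results = [
    {"patient_id": "P001", "document_id": "doc-1", "field": "er_status",
     "value": "positive", "confidence": 0.71},
    {"patient_id": "P001", "document_id": "doc-2", "field": "er_status",
     "value": "positive", "confidence": 0.95},
    {"patient_id": "P001", "document_id": "doc-2", "field": "grade",
     "value": "2", "confidence": 0.88},
    {"patient_id": "P002", "document_id": "doc-3", "field": "er_status",
     "value": "negative", "confidence": 0.90},
]
patient_level = consolidate(doc_results)
```

Here `doc_results` plays the role of document-level output (four rows), and `patient_level` holds one entry per patient with a pointer back to the document that supplied each value.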
Filters:
- Document Type
- Date Range
- Age, Gender, Race, Ethnicity
- Diagnoses or clinical keywords
A real-time counter shows total patients and documents selected.
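The interaction between filters and the counter can be sketched as a filter over document metadata followed by a count of distinct patients and documents. The `Document` record and the `select` and `counts` helpers are hypothetical; only document type and date range are modelled here, and demographic filters would follow the same pattern.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Document:
    """Hypothetical metadata for one clinical document."""
    document_id: str
    patient_id: str
    doc_type: str
    doc_date: date


def select(
    documents: list[Document],
    doc_types: Optional[set[str]] = None,
    start: Optional[date] = None,
    end: Optional[date] = None,
) -> list[Document]:
    """Keep documents that satisfy every filter that is set (None means no filter)."""
    return [
        d for d in documents
        if (doc_types is None or d.doc_type in doc_types)
        and (start is None or d.doc_date >= start)
        and (end is None or d.doc_date <= end)
    ]


def counts(selected: list[Document]) -> dict[str, int]:
    """Values a selection counter would display."""
    return {
        "patients": len({d.patient_id for d in selected}),
        "documents": len(selected),
    }


documents = [
    Document("D1", "P001", "pathology", date(2023, 3, 1)),
    Document("D2", "P001", "radiology", date(2023, 4, 10)),
    Document("D3", "P002", "pathology", date(2022, 11, 20)),
    Document("D4", "P003", "pathology", date(2023, 6, 5)),
]
selected = select(documents, doc_types={"pathology"},
                  start=date(2023, 1, 1), end=date(2023, 12, 31))
totals = counts(selected)
```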
4. Automation Metadata
Provide configuration metadata:
- Registry Name (required)
- Description (optional)
This metadata is used for version control, tracking, and organizational reporting.
5. Review & Launch
Final confirmation includes:
- Selected cohorts and patients
- Extraction mode
- Ontology schema summary
- Estimated workload
Click Create Automation to initiate pipeline execution.
Ontologies Management Module
Ontologies define the field structure and logic used for NLP extraction.
Ontology List View
Displays all available schemas:
- Ontology Name
- Description
- Field Count
- Creation Date
- Actions (View, Edit, Delete)
View Ontology
Presents complete schema configuration:
General Info:
- Name, Description, Field Count
- Last Updated
Field Details:
- Name and Display Name
- Data Type and Category
- Required flag
- Extraction Instructions
- Example Values
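The field attributes listed above map naturally onto a small record type. The sketch below shows one hypothetical representation, together with a check that a curated record supplies every required field; `OntologyField`, `missing_required`, the data type strings, and the sample fields are assumptions for illustration, not the product's schema format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyField:
    """One field of an ontology schema (hypothetical representation)."""
    name: str
    display_name: str
    data_type: str  # assumed vocabulary, e.g. "string", "integer", "date"
    category: str
    required: bool = False
    extraction_instructions: str = ""
    example_values: tuple[str, ...] = ()


def missing_required(fields: list[OntologyField], record: dict) -> list[str]:
    """Names of required fields that are absent or empty in a curated record."""
    return [
        f.name for f in fields
        if f.required and record.get(f.name) in (None, "")
    ]


fields = [
    OntologyField(
        name="primary_site",
        display_name="Primary Site",
        data_type="string",
        category="Clinical Findings & Diagnoses",
        required=True,
        extraction_instructions="Record the anatomical site of the primary tumor.",
        example_values=("Left breast", "Right upper lobe of lung"),
    ),
    OntologyField("diagnosis_date", "Date of Diagnosis", "date",
                  "Clinical Findings & Diagnoses", required=True),
    OntologyField("smoking_status", "Smoking Status", "string", "Social History"),
]
record = {"primary_site": "Left breast", "smoking_status": "never smoker"}
missing = missing_required(fields, record)
```

A check like this is one way the Required flag could be enforced during review, before a record is treated as complete.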
Edit Ontology
Edit existing ontologies by:
- Updating metadata
- Adding, editing, or removing fields
- Adjusting field settings (type, required, instructions)
Create Ontology
Define new schemas for domain-specific needs.
The workflow includes:
- Enter the ontology name and description
- Add fields individually
- Define field attributes
- Save the ontology
This supports fully customized schema development for new curation pipelines.
Best Practices
To ensure accuracy, consistency, and performance:
- Design ontologies with precision: a clear schema improves NLP output quality
- Start with pilot batches to validate configuration
- Regularly audit evidence to monitor NLP performance
- Use version history for audit trails
- Leverage extraction instructions to guide accurate data capture
Summary
The Data Curation Automation module transforms unstructured clinical text into validated, structured datasets using configurable ontologies and advanced NLP techniques.
Core strengths include:
- Ontology-based schema design
- Automated NLP pipelines with human-in-the-loop validation
- Evidence transparency and provenance
- Cohort-driven document selection
- Version control and auditability
- Scalability for clinical registries and operational pipelines
By combining automation with expert oversight, Patient Journey Intelligence delivers high-integrity data curation at scale, enabling analytics, research, and decision-making across diverse clinical domains.