Data Curation Automation

Data Curation Automation provides a configurable, transparent, and scalable framework for transforming clinical text into structured, ontology-aligned outputs. By leveraging clinical ontologies, NLP pipelines, and validation workflows, the module supports a wide range of data curation use cases, from oncology registries and pathology abstraction to social determinants extraction and quality reporting.

Data Curation Workflow

This video demonstrates the interface visually without audio narration.

Core Functionalities

The module enables users to:

Define extraction schemas using ontologies
Select cohorts or documents for processing
Execute automated NLP-based extraction
Review, validate, and edit structured outputs
Trace evidence across source documents
Track curation versions for full auditability

The system supports any custom ontology, making it adaptable to diverse clinical, operational, or research domains.

Curation Overview Dashboard

The landing page summarizes the system's active and historical automation workflows.

Summary Metrics:

Total Automations: Number of configured extraction runs
Completed Runs: Successfully executed curation jobs
In-Progress Runs: Actively running pipelines
Queued Runs: Scheduled or pending executions
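
As a rough illustration of how these four metrics relate to the list of runs, the sketch below counts runs by status. The `AutomationRun` record and the status strings are assumptions made for this example; they are not the product's actual data model or API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AutomationRun:
    """Hypothetical record for one configured extraction run."""
    name: str
    status: str  # assumed values: "completed", "in_progress", "queued", "failed"


def summarize_runs(runs):
    """Return the four dashboard metrics for a list of runs."""
    by_status = Counter(run.status for run in runs)
    return {
        "total_automations": len(runs),
        "completed_runs": by_status["completed"],
        "in_progress_runs": by_status["in_progress"],
        "queued_runs": by_status["queued"],
    }


runs = [
    AutomationRun("Oncology registry", "completed"),
    AutomationRun("SDOH extraction", "in_progress"),
    AutomationRun("Pathology abstraction", "queued"),
    AutomationRun("Quality reporting", "failed"),
]
metrics = summarize_runs(runs)
```

Failed runs count toward the total but toward none of the other three metrics, which is why the total can exceed the sum of the status counts.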

Curation List Table:

Displays all automation runs with key metadata:

Automation Name
Execution Status (Completed, In Progress, Failed)
Associated Ontology
Extraction Method (Patient-level or Document-level)
Number of Patients / Documents Processed
Execution Timestamp
Results button to access curated data

This interface supports real-time monitoring of curation workflows and operational throughput.

Review and Validation Workspace

Selecting a completed run opens the Curation Details interface.

Left Panel:

Patient list with selection and filtering capabilities

Main Panel:

Structured output aligned to the chosen ontology
View modes: Overview, Staging, Raw Results, and Version History

The UI dynamically adapts to the underlying ontology schema.

Overview Tab

Presents structured, consolidated values per patient based on the ontology schema.

May include:

Demographics
Clinical Findings & Diagnoses
Pathology / Histology
Laboratory Metrics
Medications & Procedures
Domain-specific attributes (e.g., social history, cancer staging)

Expandable Sections

For multi-source fields:

Final consolidated value
Source-level evidence
Extraction logic notes
Underlying document references

This tab provides a concise yet evidence-backed summary per patient.

Staging Tab

Visible only for ontologies with staging elements.

Supports review and adjustment of extracted cancer staging or disease severity attributes.

Features:

Auto-populated values based on NLP extraction
Underlying evidence display
Manual override capability
Domain-specific structure (e.g., TNM, AJCC)

Evidence Modal

Accessible from any field to view:

Extracted Value
Supporting Evidence (documents, snippets)
Contradictory Evidence (alternative interpretations)
Confidence Scores
Highlighted text from original documentation

Additional capabilities:

Document navigation
Highlight alignment per field
Comparison of multiple supporting sources

This modal enhances transparency and enables clinical-grade validation.
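
One way to picture the split between supporting and contradictory evidence is sketched below. The `EvidenceSnippet` record, its field names, and the rule "a snippet supports the extracted value when it suggests the same value" are all assumptions for illustration; the product's actual logic is not documented here.

```python
from dataclasses import dataclass


@dataclass
class EvidenceSnippet:
    """Hypothetical evidence item shown for one extracted field."""
    document_id: str
    text: str
    value: str         # value this snippet suggests
    confidence: float  # assumed to lie in [0, 1]


def split_evidence(extracted_value, snippets):
    """Separate snippets that agree with the extracted value from those that do not."""
    supporting = [s for s in snippets if s.value == extracted_value]
    contradictory = [s for s in snippets if s.value != extracted_value]
    # Show the strongest evidence first in each group.
    supporting.sort(key=lambda s: s.confidence, reverse=True)
    contradictory.sort(key=lambda s: s.confidence, reverse=True)
    return supporting, contradictory


snippets = [
    EvidenceSnippet("D1", "pT2 N0 on final pathology", "T2", 0.91),
    EvidenceSnippet("D2", "clinical stage cT3", "T3", 0.55),
    EvidenceSnippet("D3", "tumor staged as T2", "T2", 0.97),
]
supporting, contradictory = split_evidence("T2", snippets)
```

Here `supporting` holds the snippets from D3 and D1 (highest confidence first) and `contradictory` holds the snippet from D2.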

Raw Results Tab

Displays the raw JSON output from the extraction engine.

Includes:

Field-level extractions
Consolidation metadata
Document-level statistics
NLP confidence scores
Cross-references to source documents

Used primarily by data engineers and integration teams.
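
For integration teams, a minimal sketch of consuming such output might look like the following. The JSON layout and key names (`patient_id`, `fields`, `confidence`, `source_documents`) are hypothetical; check the actual Raw Results output before relying on any of them.

```python
import json

# Hypothetical raw output; the real schema may differ.
raw = """
{
  "patient_id": "P001",
  "fields": [
    {"name": "histology", "value": "adenocarcinoma",
     "confidence": 0.94, "source_documents": ["D10", "D12"]},
    {"name": "smoking_status", "value": "former",
     "confidence": 0.61, "source_documents": ["D11"]}
  ]
}
"""


def low_confidence_fields(result, threshold=0.8):
    """Return the names of fields whose confidence is below the threshold."""
    return [f["name"] for f in result["fields"] if f["confidence"] < threshold]


result = json.loads(raw)
flagged = low_confidence_fields(result)
```

A filter like this could be used to route low-confidence fields to manual review; the 0.8 threshold is an arbitrary example value.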

Version History Tab

Maintains a full audit trail of edits and overrides.

Features:

View and compare historical versions
Restore previous states
Track user-level changes

This supports traceability, regulatory compliance, and reproducibility in abstraction workflows.
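
The idea of an audit trail with restore can be sketched as an append-only list of snapshots. This is an illustrative model only, with hypothetical names; it does not describe how the product stores versions.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class CurationRecord:
    """Hypothetical curated record with an append-only change history."""
    values: dict
    # Each entry is (user, snapshot of the values before that user's change).
    history: list = field(default_factory=list)

    def edit(self, user, field_name, new_value):
        """Record the current state, then apply the edit."""
        self.history.append((user, deepcopy(self.values)))
        self.values[field_name] = new_value

    def restore(self, user, version_index):
        """Return to a stored snapshot; the restore itself is also logged."""
        snapshot = deepcopy(self.history[version_index][1])
        self.history.append((user, deepcopy(self.values)))
        self.values = snapshot


record = CurationRecord({"t_stage": "T2"})
record.edit("alice", "t_stage", "T3")
record.edit("bob", "t_stage", "T4")
record.restore("carol", 1)  # back to the state before bob's edit
```

Because a restore appends to the history instead of truncating it, no earlier state is lost, which is the property an audit trail needs.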

Creating a New Automation

A guided, multi-step configuration wizard enables users to define a new curation job.

1. Ontology Selection

Choose a pre-existing ontology schema.

System displays:

Ontology Name and Description
Total Fields and Categories
Required vs Optional Fields
Field Data Types
Extraction Guidelines (if configured)

Optional:

Include OMOP Records: Merges structured Patient Journey Intelligence data with NLP-extracted values

2. Cohort Selection

Select cohorts from which patients will be drawn.

Each cohort includes:

Name and Description
Patient and Document Counts
Metadata Preview

Multiple cohorts can be merged to define the processing population.
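
Assuming that merging means taking the union of patients so that a patient present in several cohorts is processed once (the documentation does not state this explicitly), the operation can be sketched as:

```python
def merge_cohorts(*cohorts):
    """Union patient IDs across cohorts, keeping first-seen order without duplicates."""
    seen = set()
    merged = []
    for cohort in cohorts:
        for patient_id in cohort:
            if patient_id not in seen:
                seen.add(patient_id)
                merged.append(patient_id)
    return merged


# Hypothetical cohorts, each a list of patient IDs.
lung_cohort = ["P1", "P2", "P3"]
smoker_cohort = ["P3", "P4"]
population = merge_cohorts(lung_cohort, smoker_cohort)
```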

3. Patient or Document Selection

Refine selection based on patient attributes or clinical metadata.

Extraction Modes:

Patient-Level: Produces one normalized result per patient
Document-Level: Extracts and stores results per document
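
The difference between the two modes can be illustrated by collapsing document-level results into one result per patient. The consolidation rule used here (keep the highest-confidence value per field) and all key names are assumptions for the sake of the example; the product's normalization logic may differ.

```python
def consolidate_by_patient(document_results):
    """Collapse per-document extractions into one value per patient and field."""
    best = {}
    for result in document_results:
        key = (result["patient_id"], result["field"])
        # Assumed rule: the highest-confidence extraction wins.
        if key not in best or result["confidence"] > best[key]["confidence"]:
            best[key] = result

    patients = {}
    for (patient_id, field_name), result in best.items():
        patients.setdefault(patient_id, {})[field_name] = result["value"]
    return patients


# Hypothetical document-level output: one entry per document and field.
document_results = [
    {"patient_id": "P1", "document_id": "D1", "field": "histology",
     "value": "adenocarcinoma", "confidence": 0.7},
    {"patient_id": "P1", "document_id": "D2", "field": "histology",
     "value": "squamous cell carcinoma", "confidence": 0.9},
    {"patient_id": "P2", "document_id": "D3", "field": "histology",
     "value": "adenocarcinoma", "confidence": 0.8},
]
patient_results = consolidate_by_patient(document_results)
```

Document-level mode corresponds to keeping `document_results` as they are; patient-level mode corresponds to `patient_results`.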

Filters:

Document Type
Date Range
Age, Gender, Race, Ethnicity
Diagnoses or clinical keywords

A real-time counter shows total patients and documents selected.
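
A minimal sketch of filtering by document type and date range, with the resulting patient and document counts, is shown below. The `Document` record and the function names are hypothetical and only model the behavior described above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    """Hypothetical document metadata used for selection."""
    document_id: str
    patient_id: str
    doc_type: str
    doc_date: date


def select_documents(documents, doc_types=None, start=None, end=None):
    """Keep documents matching every filter that was provided."""
    selected = []
    for doc in documents:
        if doc_types is not None and doc.doc_type not in doc_types:
            continue
        if start is not None and doc.doc_date < start:
            continue
        if end is not None and doc.doc_date > end:
            continue
        selected.append(doc)
    return selected


def selection_counts(selected):
    """Count distinct patients and total documents in the selection."""
    return {
        "patients": len({doc.patient_id for doc in selected}),
        "documents": len(selected),
    }


documents = [
    Document("D1", "P1", "pathology_report", date(2023, 1, 15)),
    Document("D2", "P1", "discharge_summary", date(2023, 3, 2)),
    Document("D3", "P2", "pathology_report", date(2022, 11, 30)),
    Document("D4", "P3", "pathology_report", date(2023, 6, 1)),
]
selected = select_documents(
    documents,
    doc_types={"pathology_report"},
    start=date(2023, 1, 1),
    end=date(2023, 12, 31),
)
counts = selection_counts(selected)
```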

4. Automation Metadata

Provide configuration metadata:

Registry Name (required)
Optional Description

Used for version control, tracking, and organizational reporting.

5. Review & Launch

Final confirmation includes:

Selected cohorts and patients
Extraction mode
Ontology schema summary
Estimated workload

Click Create Automation to initiate pipeline execution.

Ontologies Management Module

Ontologies define the field structure and logic used for NLP extraction.

Ontology List View

Displays all available schemas:

Ontology Name
Description
Field Count
Creation Date
Actions (View, Edit, Delete)

View Ontology

Presents complete schema configuration:

General Info:

Name, Description, Field Count
Last Updated

Field Details:

Name and Display Name
Data Type and Category
Required flag
Extraction Instructions
Example Values
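
The field attributes listed above map naturally onto a small record type. The sketch below is one possible representation, with hypothetical class, attribute, and function names, plus a helper that reports required fields missing from an extraction result.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyField:
    """Hypothetical in-memory representation of one ontology field."""
    name: str
    display_name: str
    data_type: str
    category: str
    required: bool = False
    extraction_instructions: str = ""
    example_values: list = field(default_factory=list)


def missing_required(fields, extracted):
    """Return names of required fields that are absent or empty in the result."""
    return [
        f.name
        for f in fields
        if f.required and extracted.get(f.name) in (None, "")
    ]


fields = [
    OntologyField("primary_site", "Primary Site", "string", "Diagnosis",
                  required=True, example_values=["lung", "breast"]),
    OntologyField("histology", "Histology", "string", "Pathology", required=True),
    OntologyField("diagnosis_date", "Diagnosis Date", "date", "Diagnosis",
                  required=True),
    OntologyField("smoking_status", "Smoking Status", "string", "Social History"),
]
extracted = {"primary_site": "", "histology": "adenocarcinoma",
             "smoking_status": "former"}
missing = missing_required(fields, extracted)
```

In this example `primary_site` is empty and `diagnosis_date` is absent, so both are reported; the optional `smoking_status` is never reported.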

Edit Ontology

Edit existing ontologies by:

Updating metadata
Adding, editing, or removing fields
Adjusting field settings (type, required, instructions)

Create Ontology

Define new schemas for domain-specific needs.

Workflow includes:

Ontology Name and Description
Add fields individually
Define field attributes
Save ontology

This supports fully customized schema development for new curation pipelines.

Best Practices

To ensure accuracy, consistency, and performance:

Design ontologies with precision: a clear schema improves NLP output quality
Start with pilot batches to validate configuration
Regularly audit evidence to monitor NLP performance
Use version history for audit trails
Leverage extraction instructions to guide accurate data capture

The Data Curation Automation module transforms unstructured clinical text into validated, structured datasets using configurable ontologies and advanced NLP techniques.

Core strengths include:

Ontology-based schema design
Automated NLP pipelines with human-in-the-loop validation
Evidence transparency and provenance
Cohort-driven document selection
Version control and auditability
Scalability for clinical registries and operational pipelines

By combining automation with expert oversight, Patient Journey Intelligence delivers high-integrity data curation at scale, enabling analytics, research, and decision-making across diverse clinical domains.
