
Data Curation Automation

Data Curation Automation provides a configurable, transparent, and scalable framework for transforming clinical text into structured, ontology-aligned outputs. By leveraging clinical ontologies, NLP pipelines, and validation workflows, the module supports a wide range of data curation use cases, from oncology registries and pathology abstraction to social determinants extraction and quality reporting.

Data Curation Workflow

This video demonstrates the interface visually without audio narration.


Core Functionalities

The module enables users to:

  • Define extraction schemas using ontologies
  • Select cohorts or documents for processing
  • Execute automated NLP-based extraction
  • Review, validate, and edit structured outputs
  • Trace evidence across source documents
  • Track curation versions for full auditability

The system supports any custom ontology, making it adaptable to diverse clinical, operational, or research domains.


Curation Overview Dashboard

The landing page summarizes the system's active and historical automation workflows.

Summary Metrics:

  • Total Automations: Number of configured extraction runs
  • Completed Runs: Successfully executed curation jobs
  • In-Progress Runs: Actively running pipelines
  • Queued Runs: Scheduled or pending executions
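
As a rough illustration of how these counters relate to individual run statuses, the sketch below tallies a list of runs. The status strings and the `summarize_runs` helper are hypothetical and are not part of the product's API.

```python
from collections import Counter

def summarize_runs(statuses):
    """Tally run statuses into dashboard-style summary metrics."""
    counts = Counter(statuses)
    return {
        "total_automations": len(statuses),
        "completed_runs": counts["Completed"],
        "in_progress_runs": counts["In Progress"],
        "queued_runs": counts["Queued"],
    }

# Made-up statuses for five runs; labels are assumed, not taken from the product.
runs = ["Completed", "Completed", "In Progress", "Queued", "Failed"]
summary = summarize_runs(runs)
print(summary)
```

Failed runs count toward the total here but have no counter of their own, which mirrors the four metrics listed above.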

Curation List Table:

Displays all automation runs with key metadata:

  • Automation Name
  • Execution Status (Completed, In Progress, Queued, or Failed)
  • Associated Ontology
  • Extraction Method (Patient-level or Document-level)
  • Number of Patients / Documents Processed
  • Execution Timestamp
  • Results button to access curated data

This interface supports real-time monitoring of curation workflows and operational throughput.


Review and Validation Workspace

Selecting a completed run opens the Curation Details interface.

Left Panel:

  • Patient list with selection and filtering capabilities

Main Panel:

  • Structured output aligned to the chosen ontology
  • View modes: Overview, Staging, Raw Results, and Version History

The UI dynamically adapts to the underlying ontology schema.


Overview Tab

Presents structured, consolidated values per patient based on the ontology schema.

May include:

  • Demographics
  • Clinical Findings & Diagnoses
  • Pathology / Histology
  • Laboratory Metrics
  • Medications & Procedures
  • Domain-specific attributes (e.g., social history, cancer staging)

Expandable Sections

For multi-source fields:

  • Final consolidated value
  • Source-level evidence
  • Extraction logic notes
  • Underlying document references

This tab provides a concise yet evidence-backed summary per patient.


Staging Tab

Visible only for ontologies with staging elements.

Supports review and adjustment of extracted cancer staging or disease severity attributes.

Features:

  • Auto-populated values based on NLP extraction
  • Underlying evidence display
  • Manual override capability
  • Domain-specific structure (e.g., TNM, AJCC)

Evidence Modal

Accessible from any field to view:

  • Extracted Value
  • Supporting Evidence (documents, snippets)
  • Contradictory Evidence (alternative interpretations)
  • Confidence Scores
  • Highlighted text from original documentation

Additional capabilities:

  • Document navigation
  • Highlight alignment per field
  • Comparison of multiple supporting sources

This modal enhances transparency and enables clinical-grade validation.
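
One way to picture the supporting and contradictory evidence shown in the modal is as a partition of snippets around the extracted value. This is a minimal sketch under stated assumptions: the `EvidenceSnippet` type, its attribute names, and the sample snippets are invented for illustration and do not describe the product's internal data model.

```python
from dataclasses import dataclass

@dataclass
class EvidenceSnippet:
    document_id: str   # hypothetical identifier of the source document
    text: str          # highlighted span from the original documentation
    value: str         # value this snippet suggests for the field
    confidence: float  # assumed to lie in [0, 1]

def split_evidence(extracted_value, snippets):
    """Separate snippets that agree with the extracted value from those that do not."""
    supporting = [s for s in snippets if s.value == extracted_value]
    contradictory = [s for s in snippets if s.value != extracted_value]
    # Show the strongest evidence first within each group.
    supporting.sort(key=lambda s: s.confidence, reverse=True)
    contradictory.sort(key=lambda s: s.confidence, reverse=True)
    return supporting, contradictory

# Made-up example snippets for a tumor (T) stage field.
snippets = [
    EvidenceSnippet("doc-1", "pT2 N0 M0", "T2", 0.91),
    EvidenceSnippet("doc-2", "tumor staged as T3", "T3", 0.62),
    EvidenceSnippet("doc-3", "consistent with T2 disease", "T2", 0.84),
]
supporting, contradictory = split_evidence("T2", snippets)
```

A reviewer comparing multiple sources would then see `doc-1` and `doc-3` as supporting and `doc-2` as an alternative interpretation.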


Raw Results Tab

Displays the raw JSON output from the extraction engine.

Includes:

  • Field-level extractions
  • Consolidation metadata
  • Document-level statistics
  • NLP confidence scores
  • Cross-references to source documents

Used primarily by data engineers and integration teams.
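
For integration teams, a typical first step is to scan the raw output for low-confidence extractions. The payload below is a hypothetical shape, not the documented export format: every key name (`patient_id`, `fields`, `confidence`, `source_documents`) and the 0.7 threshold are assumptions made for this sketch.

```python
import json

# Hypothetical payload; the real export's keys and nesting may differ.
RAW = """
{
  "patient_id": "P001",
  "fields": [
    {"name": "histology", "value": "adenocarcinoma",
     "confidence": 0.93, "source_documents": ["doc-1"]},
    {"name": "smoking_status", "value": "former",
     "confidence": 0.58, "source_documents": ["doc-2", "doc-4"]}
  ]
}
"""

def low_confidence_fields(payload, threshold=0.7):
    """Return names of fields whose confidence falls below the threshold."""
    return [f["name"] for f in payload["fields"] if f["confidence"] < threshold]

payload = json.loads(RAW)
flagged = low_confidence_fields(payload)
print(flagged)
```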


Version History Tab

Maintains a full audit trail of edits and overrides.

Features:

  • View and compare historical versions
  • Restore previous states
  • Track user-level changes

This supports traceability, regulatory compliance, and reproducibility in abstraction workflows.
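
The view, compare, and restore features can be sketched as an append-only history, where a restore adds a new version rather than rewriting old ones. This is an illustrative model only; the `VersionedRecord` class and its methods are hypothetical and say nothing about how the product stores versions.

```python
import copy
from dataclasses import dataclass, field

@dataclass
class VersionedRecord:
    """Append-only history: every edit or restore adds a new version."""
    versions: list = field(default_factory=list)

    def save(self, values, user):
        # Deep-copy so later edits cannot mutate stored history.
        self.versions.append({"user": user, "values": copy.deepcopy(values)})
        return len(self.versions)  # 1-based version number

    def current(self):
        return self.versions[-1]["values"]

    def diff(self, a, b):
        """Fields whose values differ between versions a and b (1-based)."""
        old = self.versions[a - 1]["values"]
        new = self.versions[b - 1]["values"]
        return {k: (old.get(k), new.get(k))
                for k in sorted(set(old) | set(new))
                if old.get(k) != new.get(k)}

    def restore(self, number, user):
        # Restoring re-saves the old state, so the audit trail is never rewritten.
        return self.save(self.versions[number - 1]["values"], user)

record = VersionedRecord()
record.save({"stage": "II"}, user="nlp-pipeline")
record.save({"stage": "III"}, user="reviewer-a")
record.restore(1, user="reviewer-b")
```

After the restore, the record holds three versions, and each one keeps the name of the user who created it.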


Creating a New Automation

A guided, multi-step configuration wizard enables users to define a new curation job.


1. Ontology Selection

Choose a pre-existing ontology schema.

The system displays:

  • Ontology Name and Description
  • Total Fields and Categories
  • Required vs Optional Fields
  • Field Data Types
  • Extraction Guidelines (if configured)

Optional:

  • Include OMOP Records: Merges structured Patient Journey Intelligence data with NLP-extracted values
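
To make the merge concrete, the sketch below combines structured values with NLP-extracted values for a single patient. The precedence rule is an assumption: this page does not state which source wins when both supply a value, so the example lets non-empty structured values take priority. The `merge_records` helper and field names are hypothetical.

```python
def merge_records(structured, extracted):
    """Combine structured values with NLP-extracted values for one patient.

    Assumption: a non-empty structured value takes precedence and NLP
    output fills the gaps. The product's actual merge rule may differ.
    """
    merged = dict(extracted)
    for name, value in structured.items():
        if value is not None:
            merged[name] = value
    return merged

# Made-up values for illustration.
structured = {"gender": "F", "birth_year": 1961, "smoking_status": None}
extracted = {"smoking_status": "former", "histology": "adenocarcinoma", "gender": "female"}
merged = merge_records(structured, extracted)
print(merged)
```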

2. Cohort Selection

Select cohorts from which patients will be drawn.

Each cohort includes:

  • Name and Description
  • Patient and Document Counts
  • Metadata Preview

Multiple cohorts can be merged to define the processing population.
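
Conceptually, merging cohorts is a union of their patient lists. The sketch below assumes that a patient who appears in several cohorts is processed once; that deduplication rule and the `merge_cohorts` helper are assumptions for illustration.

```python
def merge_cohorts(*cohorts):
    """Union patient IDs across cohorts, keeping first-seen order."""
    seen = {}
    for cohort in cohorts:
        for patient_id in cohort:
            seen.setdefault(patient_id, None)
    return list(seen)

# Made-up cohorts sharing one patient.
lung = ["P001", "P002", "P003"]
smokers = ["P003", "P004"]
population = merge_cohorts(lung, smokers)
print(population)
```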


3. Patient or Document Selection

Refine selection based on patient attributes or clinical metadata.

Extraction Modes:

  • Patient-Level: Produces one normalized result per patient
  • Document-Level: Extracts and stores results per document

Filters:

  • Document Type
  • Date Range
  • Age, Gender, Race, Ethnicity
  • Diagnoses or clinical keywords

A real-time counter shows total patients and documents selected.
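
The interplay between filters, the selection counter, and the two extraction modes can be sketched as follows. The `Document` type, the mode strings, and the sample records are hypothetical; only the document-type and date-range filters are shown.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

@dataclass
class Document:
    doc_id: str
    patient_id: str
    doc_type: str
    created: date

def select_documents(docs, doc_types, start, end):
    """Keep documents of the given types dated within [start, end]."""
    return [d for d in docs if d.doc_type in doc_types and start <= d.created <= end]

def extraction_units(docs, mode):
    """Group documents into the units an extraction run would produce results for."""
    if mode == "document":
        return {d.doc_id: [d] for d in docs}   # one result per document
    if mode == "patient":
        units = defaultdict(list)
        for d in docs:
            units[d.patient_id].append(d)      # one result per patient
        return dict(units)
    raise ValueError(f"unknown mode: {mode}")

docs = [
    Document("d1", "P001", "pathology", date(2023, 3, 1)),
    Document("d2", "P001", "discharge", date(2023, 4, 9)),
    Document("d3", "P002", "pathology", date(2021, 1, 5)),
    Document("d4", "P001", "pathology", date(2023, 6, 2)),
]
selected = select_documents(docs, {"pathology"}, date(2023, 1, 1), date(2023, 12, 31))
counter = {"patients": len({d.patient_id for d in selected}), "documents": len(selected)}
```

With these filters, two pathology reports from one patient remain, so a patient-level run would yield one result and a document-level run would yield two.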


4. Automation Metadata

Provide configuration metadata:

  • Registry Name (required)
  • Description (optional)

This metadata is used for version control, tracking, and organizational reporting.


5. Review & Launch

Final confirmation includes:

  • Selected cohorts and patients
  • Extraction mode
  • Ontology schema summary
  • Estimated workload

Click Create Automation to initiate pipeline execution.


Ontologies Management Module

Ontologies define the field structure and logic used for NLP extraction.


Ontology List View

Displays all available schemas:

  • Ontology Name
  • Description
  • Field Count
  • Creation Date
  • Actions (View, Edit, Delete)

View Ontology

Presents complete schema configuration:

General Info:

  • Name, Description, Field Count
  • Last Updated

Field Details:

  • Name and Display Name
  • Data Type and Category
  • Required flag
  • Extraction Instructions
  • Example Values

Edit Ontology

Edit existing ontologies by:

  • Updating metadata
  • Adding, editing, or removing fields
  • Adjusting field settings (type, required, instructions)

Create Ontology

Define new schemas for domain-specific needs.

Workflow includes:

  1. Enter the ontology name and description
  2. Add fields individually
  3. Define each field's attributes
  4. Save the ontology

This supports fully customized schema development for new curation pipelines.
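
As a mental model, an ontology is a named list of fields, each with a type, a category, a required flag, and extraction instructions. The sketch below mirrors the field attributes listed under View Ontology; the class names, the `validate` helper, and the sample schema are hypothetical and are not the product's storage format.

```python
from dataclasses import dataclass, field

@dataclass
class OntologyField:
    name: str
    display_name: str
    data_type: type          # e.g. str, int, float
    category: str
    required: bool = False
    instructions: str = ""   # extraction instructions for this field

@dataclass
class Ontology:
    name: str
    description: str
    fields: list = field(default_factory=list)

    def validate(self, record):
        """Return a list of problems found in one extracted record."""
        problems = []
        for f in self.fields:
            value = record.get(f.name)
            if value is None:
                if f.required:
                    problems.append(f"missing required field: {f.name}")
            elif not isinstance(value, f.data_type):
                problems.append(f"wrong type for {f.name}")
        return problems

# Illustrative schema with one required and one optional field.
demo = Ontology(
    name="lung_cancer_demo",
    description="Illustrative schema only",
    fields=[
        OntologyField("histology", "Histology", str, "Pathology", required=True),
        OntologyField("tumor_size_mm", "Tumor Size (mm)", float, "Pathology"),
    ],
)
problems = demo.validate({"tumor_size_mm": "22"})
```

Here the record is flagged twice: the required histology value is missing, and the tumor size arrives as text rather than a number.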


Best Practices

To ensure accuracy, consistency, and performance:

  • Design ontologies with precision: a clear schema improves NLP output quality
  • Start with pilot batches to validate configuration
  • Regularly audit evidence to monitor NLP performance
  • Use version history for audit trails
  • Leverage extraction instructions to guide accurate data capture
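
For the pilot-batch practice, one simple approach is to draw a small, reproducible random sample of patients before processing the full cohort. The `pilot_batch` helper, the batch size, and the seed are illustrative choices, not product settings.

```python
import random

def pilot_batch(patient_ids, size, seed=0):
    """Draw a reproducible random sample to run before the full cohort."""
    rng = random.Random(seed)  # fixed seed so the same pilot can be re-run
    return rng.sample(patient_ids, min(size, len(patient_ids)))

# Made-up patient identifiers P001 through P050.
cohort = [f"P{n:03d}" for n in range(1, 51)]
pilot = pilot_batch(cohort, size=5)
```

Fixing the seed means the same pilot patients can be re-processed after a schema change, which makes before-and-after comparisons straightforward.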

The Data Curation Automation module transforms unstructured clinical text into validated, structured datasets using configurable ontologies and advanced NLP techniques.

Core strengths include:

  • Ontology-based schema design
  • Automated NLP pipelines with human-in-the-loop validation
  • Evidence transparency and provenance
  • Cohort-driven document selection
  • Version control and auditability
  • Scalability for clinical registries and operational pipelines

By combining automation with expert oversight, Patient Journey Intelligence delivers high-integrity data curation at scale, enabling analytics, research, and decision-making across diverse clinical domains.