Data Ingestion

The Data Ingestion module is your control center for bringing clinical data into the PJI platform. Whether you need to pull data from configured external sources or upload files from your computer, this module handles both scheduled automatic imports and on-demand manual uploads.

Behind the scenes, Data Ingestion transforms raw clinical documents into structured, standardized datasets ready for analysis: it extracts clinical facts, removes protected health information (PHI) when de-identification is enabled, and organizes the results into the OMOP common data model.

Data Ingestion Walkthrough

This video demonstrates the interface visually without audio narration.

Understanding the Dashboard

Track All Data Import Jobs

The ingestion dashboard gives you a complete view of all data import jobs, whether they ran automatically on a schedule or were started manually. Each job shows:

Job ID: A unique tracking number for this specific import
Source Name: Where the data came from (e.g., "Main EHR" or "Manual Upload")
Source Type: The connection method used (S3, SFTP, HealthLake, HTTP Plugin, Local Upload)
Created At / Created By: When the job started and who initiated it
Documents Processed: How many files were imported
Schedule Type: None (manual), Daily, Weekly, or Monthly
Pipeline Duration: Total time from start to finish
Current Status: In Progress, Completed, or Failed
Progress Bar: A visual indicator showing how far along the job is

At a glance, you can see which jobs are running, which have completed successfully, and which need attention.
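As a rough mental model, the dashboard columns above can be thought of as one record per job. The sketch below is illustrative only: the `IngestionJob` class, its field names, and the `needs_attention` helper are hypothetical and are not part of any PJI API described on this page.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class IngestionJob:
    """Hypothetical record mirroring the dashboard columns (not a PJI type)."""
    job_id: str
    source_name: str            # e.g. "Main EHR" or "Manual Upload"
    source_type: str            # "S3", "SFTP", "HealthLake", "HTTP Plugin", "Local Upload"
    created_at: datetime
    created_by: str
    documents_processed: int
    schedule_type: str          # "None", "Daily", "Weekly", or "Monthly"
    pipeline_duration: Optional[timedelta]  # assumed unknown (None) until the job ends
    status: str                 # "In Progress", "Completed", or "Failed"


def needs_attention(jobs: List[IngestionJob]) -> List[IngestionJob]:
    """Return the failed jobs, newest first."""
    failed = [job for job in jobs if job.status == "Failed"]
    return sorted(failed, key=lambda job: job.created_at, reverse=True)
```

Sorting failed jobs newest first is a design choice made for this sketch; the dashboard itself may order jobs differently.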

Creating an Ingestion Job

Starting a New Job

To import new data, click Add Ingestion and choose one of two methods:

Option 1: Pull from a Registered Data Source

Connect to an external system that's already been configured in the Data Sources module. This method allows you to:

Run scheduled imports automatically (daily, weekly, or monthly)
Manually trigger an import from a registered source whenever needed
Browse and select specific files from the source (if the connector supports file navigation)

Option 2: Upload Files from Your Computer

Upload documents directly from your local machine. Supported file types include:

Clinical documents: TXT, XML, HL7, PDF
Medical images: DICOM files
Other formats as configured in your PJI instance

This option is well suited to testing, one-time imports, or documents that aren't stored in a connected system.
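Before a large manual upload, it can help to check locally which files match the supported types. The following is a minimal sketch under stated assumptions: the extension set is inferred from the list above (with `.dcm` assumed as the DICOM extension), the `split_by_support` helper is hypothetical, and your PJI instance may be configured to accept other formats.

```python
from pathlib import Path
from typing import FrozenSet, Iterable, List, Tuple

# Assumed mapping of the documented file types to extensions.
# ".dcm" is a common DICOM convention, not something this page specifies.
DEFAULT_EXTENSIONS: FrozenSet[str] = frozenset({".txt", ".xml", ".hl7", ".pdf", ".dcm"})


def split_by_support(
    filenames: Iterable[str],
    allowed: FrozenSet[str] = DEFAULT_EXTENSIONS,
) -> Tuple[List[str], List[str]]:
    """Split filenames into (accepted, rejected) by case-insensitive extension."""
    accepted: List[str] = []
    rejected: List[str] = []
    for name in filenames:
        if Path(name).suffix.lower() in allowed:
            accepted.append(name)
        else:
            rejected.append(name)
    return accepted, rejected
```

Passing a different `allowed` set lets the check follow whatever formats your instance is configured for.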

Configuration Wizard

Creating a new ingestion job follows a simple step-by-step process:

Step 1: Choose Your Import Method

Select how you want to bring data in:

Data Source: Pull from a registered external system
Local Upload: Upload files from your computer

Step 2: Select Your Source (For External Systems)

If you chose to pull from a data source, you'll see all your configured connections organized by type:

HTTP Plugins
SFTP Endpoints
AWS HealthLake
Amazon S3 Buckets

Browse your available sources in either card view or table view, then select the one you want to use.

Step 3: Pick Your Files

For sources that support browsing, you'll see an interactive file explorer with:

Folder navigation to find your files
File type filters to show only certain types
Search to find files by name
Document preview to verify you're selecting the right files
De-identification toggle to automatically remove PHI during import

This step ensures you're importing exactly the files you need and can preview them before processing.

Step 4: Review and Start

Before the job starts, you'll see a summary of everything you've configured:

Import method (Data Source or Local Upload)
Selected source (which system or connection)
File list (what will be imported)
De-identification settings (whether PHI removal is enabled)

Click Start Ingestion to launch the job and begin processing.

Monitoring Job Execution

Processing Pipeline

Once started, each ingestion job runs through a series of automated steps. Click on any job to open the Job Status Modal and watch the progress in real time:

Document Retrieval: Securely downloads files from the source
De-identification: Automatically removes PHI (names, dates, IDs, etc.) if enabled
Information Extraction: Uses clinical NLP to extract diagnoses, medications, procedures, lab results, and imaging findings
Ontology & Graph Enrichment: Links extracted concepts to standard medical vocabularies (SNOMED CT, RxNorm, LOINC)
OMOP Conversion: Transforms clinical events into the OMOP common data model format
Merging & Deduplication: Identifies and consolidates duplicate records
Clinical Measure Recalculation: Updates quality metrics and analytics that depend on this data

For each step, you'll see:

Status indicator (running, completed, or failed)
How long it took to complete
Processing statistics (e.g., "250 documents processed, 1,847 clinical facts extracted")

The job is only marked Completed after all steps finish successfully.
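The completion rule above can be sketched as a small function over per-stage statuses. This is an illustration rather than PJI's implementation: the `job_status` function, the `"pending"` placeholder for stages that have not reported yet, and the assumption that the De-identification stage is skipped when the toggle is off are all hypothetical.

```python
from typing import Dict, List

# Stage names as listed on this page, in processing order.
PIPELINE_STAGES: List[str] = [
    "Document Retrieval",
    "De-identification",
    "Information Extraction",
    "Ontology & Graph Enrichment",
    "OMOP Conversion",
    "Merging & Deduplication",
    "Clinical Measure Recalculation",
]


def job_status(stage_status: Dict[str, str], deidentify: bool = True) -> str:
    """Derive the job status from stage statuses ("running", "completed", "failed").

    Assumption: when de-identification is disabled, that stage is not required.
    Stages missing from stage_status are treated as "pending" (not started).
    """
    required = [s for s in PIPELINE_STAGES if deidentify or s != "De-identification"]
    statuses = [stage_status.get(stage, "pending") for stage in required]
    if "failed" in statuses:
        return "Failed"
    if all(status == "completed" for status in statuses):
        return "Completed"
    return "In Progress"
```

A single failed stage marks the whole job as Failed, and a job with any unfinished stage stays In Progress, which matches the rule that Completed requires every step to succeed.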

Troubleshooting with Logs

The Logs tab provides detailed diagnostic information to help you understand what happened during processing:

Connection problems (authentication failures, network timeouts)
File format issues (unsupported formats, corrupted files)
NLP extraction errors (text that couldn't be processed)
Pipeline failures (which stage failed and why)
Document-level warnings (individual files that had issues)

Logs are essential for quality assurance and troubleshooting when jobs don't complete as expected.

Job Control Options

Two control options are available for a job:

Re-run Job: Start the same import again with identical settings (useful after fixing an issue)
Stop Job: Immediately cancel a running job (useful if you notice a configuration mistake)

These controls give you flexibility to recover from errors or iterate on your import configuration.

Automated Scheduling

Scheduled Imports

If you configured a schedule for a data source (in the Data Sources module), PJI will automatically run imports at the specified times without any manual intervention.

On the dashboard, scheduled sources show:

Frequency: How often they run (Daily, Weekly, Monthly)
Next Execution: When the next automatic import will start
Last Execution: When the most recent import completed

Scheduled imports ensure your data stays up-to-date without requiring manual action.
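To reason about the Next Execution value, a simple approximation is to add one period to the last run. The sketch below assumes exactly that; the `next_execution` function is hypothetical, and the actual PJI scheduler may anchor runs to a configured time of day or calendar date instead of the previous run.

```python
import calendar
from datetime import datetime, timedelta


def next_execution(last_run: datetime, frequency: str) -> datetime:
    """Estimate the next run time from the last run and the schedule frequency.

    Assumption: Monthly keeps the same day of the month, clamped to the
    last day of shorter months (e.g. Jan 31 -> Feb 28 or 29).
    """
    if frequency == "Daily":
        return last_run + timedelta(days=1)
    if frequency == "Weekly":
        return last_run + timedelta(weeks=1)
    if frequency == "Monthly":
        year = last_run.year + (last_run.month // 12)   # roll over after December
        month = last_run.month % 12 + 1
        last_day = calendar.monthrange(year, month)[1]  # number of days in target month
        return last_run.replace(year=year, month=month, day=min(last_run.day, last_day))
    raise ValueError(f"Unsupported frequency: {frequency!r}")
```

Manual jobs (Schedule Type "None") have no next execution, so the sketch raises an error for them rather than returning a date.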

Key Capabilities

The Data Ingestion module enables you to:

Import data manually or automatically from external systems
Process clinical documents end-to-end with full NLP extraction and standardization
Preview files before importing to ensure you're getting the right data
Monitor processing in real time with step-by-step status updates
Access detailed logs for quality control and troubleshooting
Re-run or cancel jobs as needed for operational flexibility
Maintain full visibility into how data enters your system

Best Practices

To get the most out of Data Ingestion:

Start small: Test with a few documents before importing thousands
Use de-identification: Enable PHI removal when working with patient data for research or testing
Check your source configuration: Verify connection settings before running large imports
Monitor actively: Watch the status indicators and review logs, especially for first-time imports
Leverage schedules: Set up automatic imports for stable, recurring data sources

Why It Matters

The Data Ingestion module is the foundation of your clinical data pipeline in PJI. It ensures:

Reliable Data Flow: Secure, consistent transfer from diverse clinical systems
High Data Quality: Standardized, structured data ready for analysis
Complete Transparency: Real-time visibility into every processing step
Enterprise Scalability: Handles everything from single documents to millions of records
Regulatory Compliance: Automatic PHI protection, encryption, and full audit trails

By automating and centralizing data imports, Data Ingestion transforms fragmented clinical documents into a unified, analytics-ready dataset that supports patient journeys, cohort analysis, quality measurement, and evidence-based insights across your organization.
