Data Ingestion
The Data Ingestion module is where you bring clinical data into the PJI platform. It pulls data from configured external sources, either on a schedule or on demand, and accepts files uploaded from your computer.
Behind the scenes, Data Ingestion transforms raw clinical documents into structured, standardized datasets ready for analysis: it extracts clinical facts, removes protected health information (PHI) when de-identification is enabled, and organizes the results into the OMOP common data model.
Data Ingestion Walkthrough
This video demonstrates the interface visually without audio narration.
Understanding the Dashboard
Track All Data Import Jobs
The ingestion dashboard gives you a complete view of all data import jobs, whether they ran automatically on a schedule or were started manually. Each job shows:
- Job ID: A unique tracking number for this specific import
- Source Name: Where the data came from (e.g., "Main EHR" or "Manual Upload")
- Source Type: The connection method used (S3, SFTP, HealthLake, HTTP Plugin, Local Upload)
- Created At / Created By: When the job started and who initiated it
- Documents Processed: How many files were imported
- Schedule Type: None (manual), Daily, Weekly, or Monthly
- Pipeline Duration: Total time from start to finish
- Current Status: In Progress, Completed, or Failed
- Progress Bar: A visual indicator showing how far along the job is
At a glance, you can see which jobs are running, which have completed successfully, and which need attention.
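As a rough illustration of the columns above, the sketch below models one dashboard row as a Python record and picks out the jobs that need attention. It is not the PJI API: the class, field names, and sample values are hypothetical, chosen only to mirror the dashboard columns.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class JobStatus(Enum):
    # The three statuses shown in the Current Status column.
    IN_PROGRESS = "In Progress"
    COMPLETED = "Completed"
    FAILED = "Failed"


@dataclass
class IngestionJob:
    # Hypothetical field names, one per dashboard column.
    job_id: str
    source_name: str
    source_type: str
    created_at: datetime
    created_by: str
    documents_processed: int
    schedule_type: str  # "None", "Daily", "Weekly", or "Monthly"
    pipeline_duration: timedelta
    status: JobStatus
    progress: float  # 0.0 to 1.0, as drawn by the progress bar


def needs_attention(jobs):
    """Return the jobs whose status is Failed."""
    return [job for job in jobs if job.status is JobStatus.FAILED]


# Made-up sample rows for illustration only.
jobs = [
    IngestionJob("job-001", "Main EHR", "SFTP", datetime(2024, 1, 5, 2, 0),
                 "scheduler", 250, "Daily", timedelta(minutes=12),
                 JobStatus.COMPLETED, 1.0),
    IngestionJob("job-002", "Manual Upload", "Local Upload",
                 datetime(2024, 1, 5, 9, 30), "analyst", 0, "None",
                 timedelta(minutes=1), JobStatus.FAILED, 0.2),
]

failed_ids = [job.job_id for job in needs_attention(jobs)]
print(failed_ids)
```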
Creating an Ingestion Job
Starting a New Job
To import new data, click Add Ingestion and choose one of two methods:
Option 1: Pull from a Registered Data Source
Connect to an external system that's already been configured in the Data Sources module. This method allows you to:
- Run scheduled imports automatically (daily, weekly, or monthly)
- Manually trigger an import from a registered source whenever needed
- Browse and select specific files from the source (if the connector supports file navigation)
Option 2: Upload Files from Your Computer
Upload documents directly from your local machine. Supported file types include:
- Clinical documents: TXT, XML, HL7, PDF
- Medical images: DICOM files
- Other formats as configured in your PJI instance
This option suits testing, one-time imports, and documents that aren't stored in a connected system.
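Before a large manual upload, it can help to check locally which files are likely to be accepted. The sketch below is a minimal example of such a check. The extension set is an assumption derived from the file types listed above (with ".dcm" assumed for DICOM files); your PJI instance may accept other formats, and the function name is hypothetical.

```python
from pathlib import Path

# Assumed extensions for the listed file types; adjust to match your instance.
SUPPORTED_EXTENSIONS = {".txt", ".xml", ".hl7", ".pdf", ".dcm"}


def split_by_support(filenames, supported=SUPPORTED_EXTENSIONS):
    """Split filenames into (accepted, rejected) by case-insensitive extension."""
    accepted, rejected = [], []
    for name in filenames:
        if Path(name).suffix.lower() in supported:
            accepted.append(name)
        else:
            rejected.append(name)
    return accepted, rejected


accepted, rejected = split_by_support(
    ["note_01.TXT", "ccd.xml", "scan.dcm", "budget.xlsx"]
)
print("accepted:", accepted)
print("rejected:", rejected)
```

A check like this only inspects file names. It does not detect corrupted files or content that fails later in processing.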
Configuration Wizard
Creating a new ingestion job follows a simple step-by-step process:
Step 1: Choose Your Import Method
Select how you want to bring data in:
- Data Source: Pull from a registered external system
- Local Upload: Upload files from your computer
Step 2: Select Your Source (For External Systems)
If you chose to pull from a data source, you'll see all your configured connections organized by type:
- HTTP Plugins
- SFTP Endpoints
- AWS HealthLake
- Amazon S3 Buckets
Browse your available sources in either card view or table view, then select the one you want to use.
Step 3: Pick Your Files
For sources that support browsing, you'll see an interactive file explorer with:
- Folder navigation to find your files
- File type filters to show only certain types
- Search to find files by name
- Document preview to verify you're selecting the right files
- De-identification toggle to automatically remove PHI during import
This step lets you confirm that you are importing exactly the files you need before processing begins.
Step 4: Review and Start
Before the job starts, you'll see a summary of everything you've configured:
- Import method (Data Source or Local Upload)
- Selected source (which system or connection)
- File list (what will be imported)
- De-identification settings (whether PHI removal is enabled)
Click Start Ingestion to launch the job and begin processing.
Monitoring Job Execution
Processing Pipeline
Once started, each ingestion job runs through a series of automated steps. Click any job to open the Job Status Modal and watch its progress in real time. The steps are:
- Document Retrieval: Securely downloads files from the source
- De-identification: Automatically removes PHI (names, dates, IDs, etc.) if enabled
- Information Extraction: Uses clinical NLP to extract diagnoses, medications, procedures, lab results, and imaging findings
- Ontology & Graph Enrichment: Links extracted concepts to standard medical vocabularies (SNOMED CT, RxNorm, LOINC)
- OMOP Conversion: Transforms clinical events into the OMOP common data model format
- Merging & Deduplication: Identifies and consolidates duplicate records
- Clinical Measure Recalculation: Updates quality metrics and analytics that depend on this data
For each step, you'll see:
- Status indicator (running, completed, or failed)
- How long it took to complete
- Processing statistics (e.g., "250 documents processed, 1,847 clinical facts extracted")
A job is marked Completed only after all steps finish successfully.
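The completion rule above can be sketched as a simple sequential runner. This is an illustrative model, not PJI's implementation: the stage names come from the list above, while the handler functions, the "skipped" status for disabled de-identification, and the stop-at-first-failure behavior are assumptions.

```python
import time
from dataclasses import dataclass

# Stage names as listed in the Job Status Modal.
STAGES = [
    "Document Retrieval",
    "De-identification",
    "Information Extraction",
    "Ontology & Graph Enrichment",
    "OMOP Conversion",
    "Merging & Deduplication",
    "Clinical Measure Recalculation",
]


@dataclass
class StageResult:
    name: str
    status: str  # "completed", "failed", or "skipped" (assumed labels)
    seconds: float


def run_pipeline(handlers, deidentify=True):
    """Run each stage in order and stop at the first failure.

    `handlers` maps a stage name to a zero-argument callable; these are
    hypothetical stand-ins for the real processing steps.
    """
    results = []
    for name in STAGES:
        if name == "De-identification" and not deidentify:
            results.append(StageResult(name, "skipped", 0.0))
            continue
        start = time.perf_counter()
        try:
            handlers[name]()
            status = "completed"
        except Exception:
            status = "failed"
        results.append(StageResult(name, status, time.perf_counter() - start))
        if status == "failed":
            break
    # The job counts as Completed only when no stage failed.
    failed = any(r.status == "failed" for r in results)
    return ("Failed" if failed else "Completed"), results


def ok():
    pass


def broken():
    raise ValueError("unsupported format")


handlers = {name: ok for name in STAGES}
status_ok, results_ok = run_pipeline(handlers)

handlers["OMOP Conversion"] = broken
status_bad, results_bad = run_pipeline(handlers)
print(status_ok, status_bad, results_bad[-1].name)
```

In the failing run, the result list ends at the stage that raised the error, which corresponds to what the Logs tab describes as identifying which stage failed.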
Troubleshooting with Logs
The Logs tab provides detailed diagnostic information to help you understand what happened during processing:
- Connection problems (authentication failures, network timeouts)
- File format issues (unsupported formats, corrupted files)
- NLP extraction errors (text that couldn't be processed)
- Pipeline failures (which stage failed and why)
- Document-level warnings (individual files that had issues)
Logs are essential for quality assurance and troubleshooting when jobs don't complete as expected.
Job Control Options
Each job offers two control options:
- Re-run Job: Start the same import again with identical settings (useful after fixing the issue that caused a failure)
- Stop Job: Immediately cancel a job that is still running (useful if you notice a configuration mistake)
These controls give you flexibility to recover from errors or iterate on your import configuration.
Automated Scheduling
Scheduled Imports
If you configured a schedule for a data source (in the Data Sources module), PJI will automatically run imports at the specified times without any manual intervention.
On the dashboard, scheduled sources show:
- Frequency: How often they run (Daily, Weekly, Monthly)
- Next Execution: When the next automatic import will start
- Last Execution: When the most recent import completed
Scheduled imports ensure your data stays up-to-date without requiring manual action.
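To make the relationship between Frequency, Last Execution, and Next Execution concrete, here is a small sketch that derives the next run time from the previous one. How PJI actually computes Next Execution is not documented here, so treat the function name and the month-end handling (clamping to the last day of a shorter month) as assumptions.

```python
import calendar
from datetime import datetime, timedelta


def next_execution(last_run, frequency):
    """Return the next run time after `last_run` for a given frequency."""
    if frequency == "Daily":
        return last_run + timedelta(days=1)
    if frequency == "Weekly":
        return last_run + timedelta(weeks=1)
    if frequency == "Monthly":
        # Move to the following month, rolling the year over after December.
        year = last_run.year + (last_run.month // 12)
        month = last_run.month % 12 + 1
        # Clamp the day so that, for example, Jan 31 maps to the end of February.
        day = min(last_run.day, calendar.monthrange(year, month)[1])
        return last_run.replace(year=year, month=month, day=day)
    raise ValueError(f"Unsupported frequency: {frequency}")


last = datetime(2024, 1, 31, 2, 0)
print(next_execution(last, "Daily"))
print(next_execution(last, "Weekly"))
print(next_execution(last, "Monthly"))
```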
Key Capabilities
The Data Ingestion module enables you to:
- Import data manually or automatically from external systems
- Process clinical documents end-to-end with full NLP extraction and standardization
- Preview files before importing to ensure you're getting the right data
- Monitor processing in real time with step-by-step status updates
- Access detailed logs for quality control and troubleshooting
- Re-run or cancel jobs as needed for operational flexibility
- Maintain full visibility into how data enters your system
Best Practices
To get the most out of Data Ingestion:
- Start small: Test with a few documents before importing thousands
- Use de-identification: Enable PHI removal when working with patient data for research or testing
- Check your source configuration: Verify connection settings before running large imports
- Monitor actively: Watch the status indicators and review logs, especially for first-time imports
- Leverage schedules: Set up automatic imports for stable, recurring data sources
Why It Matters
The Data Ingestion module is the foundation of your clinical data pipeline in PJI. It ensures:
- Reliable Data Flow: Secure, consistent transfer from diverse clinical systems
- High Data Quality: Standardized, structured data ready for analysis
- Complete Transparency: Real-time visibility into every processing step
- Enterprise Scalability: Support for everything from single documents to millions of records
- Regulatory Compliance: Automatic PHI protection, encryption, and full audit trails
By automating and centralizing data imports, Data Ingestion turns fragmented clinical documents into a unified, analytics-ready dataset that supports patient journeys, cohort analysis, quality measurement, and evidence-based insights across your organization.