Guidelines on Data Quality
Basic Concepts of Data Quality
Data quality describes how well a dataset serves its intended use: whether it is accurate, complete, consistent, and reliable. Poor data quality can lead to flawed decision-making, operational issues, and compliance risks.
1. Data Quality Issues – What errors affect data?
Value Errors
Value errors occur when data fields contain invalid, inappropriate, or wrongly formatted values.
Examples:
A name field containing numerical values.
Dates stored as plain text or in inconsistent formats.
Out-of-range numeric values, like a temperature reading of 2000°C.
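The three examples above can be sketched as simple per-record checks. This is a minimal illustration, not a definitive implementation: the field names (name, visit_date, temperature_c), the expected date format, and the temperature limits are all assumptions chosen for the example.

```python
import re
from datetime import datetime

# Hypothetical plausibility limits; real limits depend on the sensor and domain.
TEMP_MIN_C = -90.0
TEMP_MAX_C = 60.0


def find_value_errors(record):
    """Return a list of human-readable value errors found in one record."""
    errors = []

    # A name field should not contain digits.
    name = str(record.get("name") or "")
    if re.search(r"\d", name):
        errors.append("name contains digits")

    # This sketch assumes dates are stored as YYYY-MM-DD strings.
    try:
        datetime.strptime(str(record.get("visit_date") or ""), "%Y-%m-%d")
    except ValueError:
        errors.append("visit_date is not in YYYY-MM-DD format")

    # Numeric readings must be numbers inside the plausible range.
    temp = record.get("temperature_c")
    if not isinstance(temp, (int, float)) or not TEMP_MIN_C <= temp <= TEMP_MAX_C:
        errors.append("temperature_c is missing or out of range")

    return errors
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem in a record at once.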
Completeness Gaps
Completeness issues arise when required data is missing or outdated.
Examples:
Null or blank fields in patient records.
Critical fields left unpopulated during form submissions.
Outdated contact or demographic information.
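A completeness check can cover both kinds of gap described above: missing values and stale values. The sketch below assumes a hypothetical set of required fields and a hypothetical one-year staleness threshold; both would need to be set per dataset.

```python
from datetime import date, timedelta

# Hypothetical configuration for illustration only.
REQUIRED_FIELDS = ("patient_id", "name", "date_of_birth")
MAX_AGE = timedelta(days=365)


def completeness_gaps(record, today):
    """Return (missing_fields, is_stale) for one record.

    A field counts as missing when it is absent, None, or blank.
    A record counts as stale when last_updated is absent or older than MAX_AGE.
    """
    missing = [
        field
        for field in REQUIRED_FIELDS
        if record.get(field) is None or str(record[field]).strip() == ""
    ]
    last_updated = record.get("last_updated")
    is_stale = last_updated is None or (today - last_updated) > MAX_AGE
    return missing, is_stale
```

Passing today as an argument, rather than calling date.today() inside the function, keeps the check reproducible in tests and audits.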
Consistency Conflicts
Consistency conflicts are discrepancies in data that should be uniform across systems or records.
Examples:
A patient’s name spelled differently across two systems.
Duplicate records for the same individual.
Data formats varying between integrated systems (e.g., one system uses DD/MM/YYYY while another uses MM-DD-YYYY).
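One way to surface such conflicts is to convert each system's values to a common representation and then compare the records field by field. The sketch below is illustrative: the system names (system_a, system_b) and field names are assumptions, and the date formats are the two mentioned in the example above.

```python
from datetime import datetime

# The format used by each source must be known up front: a string such as
# "03/04/2024" cannot be disambiguated by inspection alone.
SOURCE_DATE_FORMATS = {
    "system_a": "%d/%m/%Y",  # DD/MM/YYYY
    "system_b": "%m-%d-%Y",  # MM-DD-YYYY
}


def to_iso_date(value, source):
    """Convert a source-specific date string to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(value, SOURCE_DATE_FORMATS[source]).date().isoformat()


def find_conflicts(record_a, record_b, fields):
    """Return the fields whose values differ between two records of one person."""
    return [f for f in fields if record_a.get(f) != record_b.get(f)]
```

For instance, once "05/03/1980" from system_a and "03-05-1980" from system_b are both converted with to_iso_date, they compare as equal, so a remaining conflict on the name field points to a genuine spelling discrepancy rather than a formatting difference.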
2. Issue Discovery – How to find data quality issues?
Quality issues must be detected and identified before corrective action can be taken.
Middleware Monitoring
Middleware tools and services can monitor data as it flows between systems and flag integrity problems in real time.
Approaches include:
Payload validation (ensuring data conforms to schema).
Logging and error detection.
Contract testing (ensuring expected fields and formats are present).
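Payload validation, logging, and a basic contract check can be combined in one small function. The following is a minimal standard-library sketch; the contract (PATIENT_CONTRACT), its field names, and the logger name are hypothetical, and a production system would more likely rely on a dedicated schema-validation library.

```python
import json
import logging

logger = logging.getLogger("middleware.validation")  # hypothetical logger name

# Hypothetical contract: field name -> expected Python type after JSON decoding.
PATIENT_CONTRACT = {"patient_id": str, "name": str, "age": int}


def validate_payload(raw_json, contract=PATIENT_CONTRACT):
    """Return a list of contract violations for one message (empty means valid)."""
    try:
        payload = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["payload is not a JSON object"]

    violations = []
    for field, expected_type in contract.items():
        if field not in payload:
            violations.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            violations.append(f"wrong type for {field}")

    # Log each violation so that rejected messages can be traced later.
    for message in violations:
        logger.warning("payload rejected: %s", message)
    return violations
```

A message such as {"patient_id": "P1", "age": "42"} would be reported as missing the name field and carrying the wrong type for age.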
User Feedback
End users often identify issues while interacting with the system.
Techniques:
Data profiling to analyze patterns and detect anomalies.
Encouraging users to report inaccuracies or inconsistencies.
Validations and alerts in the UI when irregular data is detected.
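Data profiling, the first technique listed above, can start with a very small summary per column. The sketch below is an assumption-laden illustration (the function name and the choice of statistics are not from any particular tool): it counts rows, nulls, and distinct values, and lists the most frequent values.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: row count, null count, distinct count, top values."""
    # Treat None and blank strings as null for profiling purposes.
    non_null = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "top": counts.most_common(3),
    }
```

Profiling a column that holds "F", "M", "F", None, "", and "f" reports three distinct values, which hints that "F" and "f" are inconsistent codes for the same category.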
Input Checks
Proactive validation at the point of data entry can prevent errors from being introduced.
Methods:
Database constraints (e.g., NOT NULL, UNIQUE).
API validations (schema enforcement, required fields).
Input masks or dropdowns to limit entry options.
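Database constraints can be demonstrated with an in-memory SQLite database from the Python standard library. The table and column names here are hypothetical; the CHECK constraint plays the role of a dropdown by restricting a column to a fixed set of values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE patients (
        patient_id TEXT PRIMARY KEY,
        email      TEXT NOT NULL UNIQUE,
        sex        TEXT CHECK (sex IN ('F', 'M', 'X'))
    )
    """
)


def try_insert(row):
    """Insert one row; return None on success or the constraint error message."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO patients VALUES (?, ?, ?)", row)
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)
```

Under these assumptions, try_insert(("P2", None, "F")) is rejected by NOT NULL, a second row reusing an existing email is rejected by UNIQUE, and a value outside the allowed set for sex is rejected by CHECK, so none of those rows reaches the table.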
3. Issue Correction – How to fix errors in data?
Once data issues are identified, the next step is correction through manual or automated means.
Manual Curation
Hands-on review and correction of records.
Examples:
Running data fix scripts to clean old entries.
Manual review by data stewards.
Adding rules in forms or workflows to enforce correct entry.
Automated Data Enrichment
Use of algorithms or external datasets to improve data quality.
Approaches:
Looking up missing values using reference datasets.
Auto-filling address or demographic details using external APIs.
Applying business rules to infer or correct values.
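Looking up missing values in a reference dataset can be sketched as follows. The reference table (POSTAL_CODE_TO_CITY), the field names, and the city_source provenance marker are hypothetical; in practice the reference data might come from an external API or a curated master list.

```python
# Hypothetical reference data: postal code -> city.
POSTAL_CODE_TO_CITY = {"10115": "Berlin", "75001": "Paris"}


def enrich_city(record, reference=POSTAL_CODE_TO_CITY):
    """Return a copy of the record with a missing city filled from reference data.

    Existing values are never overwritten, and filled values are marked so
    that inferred data can be told apart from data entered at the source.
    """
    enriched = dict(record)
    if not enriched.get("city"):
        city = reference.get(enriched.get("postal_code"))
        if city is not None:
            enriched["city"] = city
            enriched["city_source"] = "reference_lookup"
    return enriched
```

Recording where a filled value came from is a design choice worth keeping: it allows an enrichment to be reviewed or reversed later if the reference data turns out to be wrong.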
Data Cleansing
A systematic approach to cleaning and standardizing data.
Tools/Methods:
Master Data Management (MDM) systems.
De-duplication algorithms.
Regular audits and review cycles.
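A simple de-duplication approach is to normalize the identifying fields and keep one record per normalized key. The sketch below is one possible exact-match strategy, not a full MDM matching engine: the choice of name plus date of birth as the key and the field names (name, dob) are assumptions, and it will not catch misspellings such as "Jon" versus "John".

```python
import re
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = re.sub(r"[^a-z0-9 ]", " ", without_accents.lower())
    return " ".join(cleaned.split())


def deduplicate(records):
    """Keep the first record per (normalized name, date of birth) key.

    Returns (unique_records, duplicates) so duplicates can be reviewed
    by a data steward instead of being silently discarded.
    """
    seen = {}
    duplicates = []
    for record in records:
        key = (normalize_name(record["name"]), record["dob"])
        if key in seen:
            duplicates.append(record)
        else:
            seen[key] = record
    return list(seen.values()), duplicates
```

With this normalization, "José  García" and "jose garcia" with the same date of birth are treated as one individual, while a record with the same name but a different date of birth is kept as a separate person.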