Overview
Data Quality Profile Metrics are comprehensive measurements that assess the quality, integrity, and reliability of data across various dimensions. These metrics provide quantitative assessments of data attributes and characteristics, enabling organizations to establish baselines, identify issues, and monitor changes in data quality over time.How to Use Data Quality Profile Metrics
Data Discovery
Use profile metrics to gain insights into the structure, content, and characteristics of datasets without prior knowledge.
Quality Assessment
Evaluate data against defined quality dimensions such as completeness, accuracy, consistency, and timeliness.
Issue Identification
Detect anomalies, outliers, and patterns that may indicate data quality problems.
Data Preparation
Inform data cleaning and transformation processes before analysis or migration.
Monitoring
Track changes in data quality metrics over time to identify trends and potential degradation.
Governance
Support data governance initiatives by providing objective measurements of data quality.
Benefits of Data Quality Profile Metrics
Improved Decision Making
Improved Decision Making
Higher quality data leads to more accurate insights and better business decisions at every level of the organization.
Reduced Costs
Reduced Costs
Early identification of data issues prevents costly errors and rework downstream in analytics pipelines and operational systems.
Enhanced Efficiency
Enhanced Efficiency
Standardized metrics streamline data quality assessment processes, reducing manual investigation time and enabling automation.
Increased Trust
Increased Trust
Objective measures build confidence in data among analysts, business users, and executives who rely on that data.
Regulatory Compliance
Regulatory Compliance
Helps meet data quality requirements for regulatory frameworks including GDPR, CCPA, HIPAA, BCBS 239, and SOX.
Better Data Integration
Better Data Integration
Facilitates smoother data migrations and system integrations by identifying incompatibilities before they become blocking issues.
Operational Excellence
Operational Excellence
Supports continuous improvement in data management practices through measurable, repeatable quality assessments.
Distribution Metrics
Distribution metrics analyze how data values are spread across a dataset, helping identify completeness and uniqueness issues.Completeness Metrics
Completeness metrics measure the proportion of present versus absent data values, revealing gaps in data coverage.| Metric | Definition |
|---|---|
| Null Count | Total records where the field contains a NULL value. High null counts may indicate data collection failures, optional fields not populated, or ETL pipeline gaps. |
| Empty Count | Count of records containing an empty string ('') rather than NULL. Common in text fields and may indicate form submissions with no input. |
| Non-Empty Count | Count of records containing a meaningful, non-null, non-empty value. The primary completeness indicator. Non-Empty Count = Total Records - Null Count - Empty Count |
| Space Count | Count of records where the field contains only whitespace characters. Space-only values are neither null nor empty but contain no useful information. |
Completeness Rate FormulaTarget completeness rates above 95% for critical business fields. Fields below 80% completeness should trigger a data quality investigation.
Uniqueness Metrics
| Metric | Definition |
|---|---|
| Distinct Count | The number of unique values present in a field, excluding nulls. For a primary key field, distinct count should equal total record count. |
| Repeating Count | The number of values that appear more than once. Repeating Count = Total Count - Distinct Count. High repeating counts in key fields indicate referential integrity issues. |
Character Distribution Metrics
| Metric | Definition |
|---|---|
| Digits Count | Count of records containing numeric digit characters (0–9). Validates that numeric identifiers, phone numbers, or postal codes contain the expected digits. |
| Special Character Count | Count of records containing non-alphanumeric characters such as @, #, $, %, &. High counts in fields expected to contain only letters or numbers indicate data entry issues. |
| Alphabet Count | Count of records containing purely alphabetic characters (A–Z, a–z). Validates that text-only fields (e.g., names) do not contain unexpected numeric or special characters. |
| Alpha Numeric Count | Count of records containing both alphabetic and numeric characters. Essential for fields like product codes, license plates, or reference numbers. |
Space Distribution Metrics
| Metric | Definition |
|---|---|
| Leading Space | Count of records where the field value begins with one or more space characters. ' John' will not match 'John' in a database lookup. |
| Trailing Space | Count of records where the field value ends with one or more space characters. Particularly problematic in fixed-length fields exported from legacy systems. |
| Outer Space | Combined count of records with either leading or trailing spaces. An aggregated signal for edge whitespace issues across the field. |
| Inner Space | Count of records containing multiple consecutive internal spaces. Example: 'John Smith' (two spaces) vs 'John Smith' (one space). Indicates data merging errors. |
| Whitespace Count | Total count of all whitespace characters (spaces, tabs, carriage returns, line feeds) across all values. High counts relative to total characters may indicate padded or poorly formatted values. |
Numeric Distribution Metrics
| Metric | Definition |
|---|---|
| Zero Count | Count of records where the numeric field contains exactly 0. Distinguishing zeros from nulls is critical — a zero balance differs fundamentally from an unknown balance. |
| Positive Count | Count of records where the numeric field is greater than zero. |
| Negative Count | Count of records where the numeric field is less than zero. Unexpected negatives in fields like age, quantity, or price indicate data entry errors or sign convention mismatches. |
Frequency Metrics
Frequency metrics measure how often specific values, patterns, or lengths occur in your data.Value Frequency Metrics
| Metric | Definition |
|---|---|
| Min Value | The smallest value in the field for the given data type (numeric minimum, lexicographic minimum for strings, earliest date). Helps identify lower-bound outliers. |
| Max Value | The largest value in the field. Identifies upper-bound outliers and validates ceiling constraints. Example: a Max Value of 150 in an Age field indicates a data error. |
| Enum (Value Occurrence Counts) | An enumeration of all distinct values along with their occurrence frequency. Invaluable for discovering unexpected values, validating categorical fields, and identifying dominant or rare values. |
Length Frequency Metrics
| Metric | Definition |
|---|---|
| Min Length | The minimum character length observed across all non-null values. A value below the expected minimum indicates truncation or data entry errors. |
| Max Length | The maximum character length observed. Values exceeding defined field length constraints indicate potential data truncation or schema mismatches. |
| Length Distribution | A frequency distribution of record counts at each distinct character length. A consistent length indicates fixed-format codes; variable lengths indicate free-text fields or format inconsistencies. |
Pattern Frequency Metrics
| Metric | Definition |
|---|---|
| Regular Expressions | Counts of values matching predefined or auto-discovered regex patterns. Records not matching the expected pattern are format violation candidates. |
| Long Pattern | Auto-discovers patterns using character-type acronyms: A for alphabetic, N for numeric. Example: 'ABC-123' → 'AAA-NNN'. Reveals the full structural landscape of a field. |
| Short Pattern | A compressed variant of Long Pattern that collapses consecutive identical character types. Example: 'AAANNN' → 'AN'. Quickly identifies fundamentally different high-level structures. |
Range Frequency Metrics
| Metric | Definition |
|---|---|
| Value Range | Max Value - Min Value. A range much larger than expected suggests outliers or errors at the extremes of the distribution. |
| Length Range | Max Length - Min Length. A range of 0 indicates all values are the same length (fixed-format). A large range in an expected fixed-length field indicates format inconsistencies. |
Pattern Metrics
Pattern metrics identify and analyze recurring structures within data values.Regular Expression Pattern Matching
Validates that values conform to a precisely defined structural format. Used for fields with strict format requirements such as email addresses, phone numbers, postal codes, SSNs, and ISBNs.- Predefined Patterns — Standard regex patterns applied immediately for common field types (dates, currency, identifiers).
- Auto-discovered Patterns — The profiler derives regex patterns from observed data, capturing the dominant format when the expected format is unknown.
- Match vs. Non-Match Analysis — Reports the count and percentage of matching vs. non-matching records. Non-matching records are format violation candidates.
- Multiple Pattern Support — A single field may have multiple valid patterns (e.g., international phone numbers). Multiple rules can be applied simultaneously.
Long Pattern Analysis
Each character is replaced with its type code —A for alphabetic, N for numeric, special characters preserved as-is. The profiler collects all unique long patterns and their frequency counts.
Use cases: Initial data discovery when the expected format is unknown, detecting gradual format drift, comparing formats across sources before integration.
Short Pattern Analysis
Consecutive identical character type codes are collapsed into single tokens. Complements Long Pattern by identifying structural categories rather than specific format variants.Short and Long patterns should be used together, not as substitutes. Short patterns identify categories; Long patterns identify specific variants within those categories.
Statistical Metrics
Statistical metrics provide mathematical measurements of data characteristics and distribution properties. These apply primarily to numeric and date/time fields.Central Tendency
| Metric | Definition |
|---|---|
| Mean | Sum of all non-null values divided by non-null record count. Sensitive to outliers — use with caution in skewed distributions. Mean = Σ values / Count |
| Mode | The most frequently occurring value. Applicable to both numeric and categorical data. A field may be unimodal, multimodal, or have no mode. Resistant to outliers. |
| Median (Q2) | The middle value when all non-null values are sorted. For even counts, the average of the two middle values. Robust to outliers — preferred over mean in skewed distributions. |
Dispersion
| Metric | Definition |
|---|---|
| Range | Max - Min. Simple measure of total spread; highly sensitive to outliers since it depends only on the two extreme values. |
| Standard Deviation | Average distance of each value from the mean, expressed in the same units as the data. SD = √(Σ(x - mean)² / N). Values beyond ±3 SD are typically considered outliers. |
| Variance | The square of the standard deviation. Expressed in squared units — mathematically useful in statistical calculations but less interpretable for direct reporting. |
Distribution Shape
| Metric | Interpretation |
|---|---|
| Skewness | Measures asymmetry. 0 = symmetric. Positive = right tail (few high extremes). Negative = left tail. Values beyond ±1 indicate significant asymmetry; beyond ±2 indicate extreme asymmetry requiring investigation. |
| Kurtosis | Measures tailedness relative to a normal distribution. Excess Kurtosis = Kurtosis - 3. > 0 = heavier tails (more extreme values). < 0 = lighter tails. High excess kurtosis signals outlier concentration. |
Skewness & Kurtosis Quick Reference
| Range | Signal |
|---|---|
Skewness -1 to +1 | Approximately symmetric — acceptable |
Skewness -2 to -1 or +1 to +2 | Moderately skewed — investigate |
Skewness < -2 or > +2 | Highly skewed — likely outliers or errors |
Excess Kurtosis ~0 | Normal-like distribution |
Excess Kurtosis > 1 | Heavy tails — more extreme values than normal |
Excess Kurtosis < -1 | Light tails — fewer extreme values than normal |
Quantiles
| Quantile | Percentile | Description |
|---|---|---|
| Q0 | 0th | Minimum value. Absolute lower bound of the data. |
| Q1 | 25th | First quartile. 25% of values fall at or below this point. Used to define the IQR lower bound. |
| Q2 | 50th | Median. The midpoint of the distribution — robust to outliers. |
| Q3 | 75th | Third quartile. 75% of values fall at or below this point. Used to define the IQR upper bound. |
| Q4 | 100th | Maximum value. Absolute upper bound of the data. |
Other Statistical Measures
| Metric | Definition |
|---|---|
| Sum | Arithmetic total of all non-null numeric values. Validate against control totals from source systems to detect data loss during ingestion or transformation. |
| Margin of Error | Quantifies sampling uncertainty when profiling a dataset sample rather than the full population. MOE = Z × (SD / √n) where Z = 1.96 for 95% confidence. |
Implementation Guidance
Profiling Workflow
Define Scope
Identify the datasets, tables, and fields to be profiled. Prioritize based on criticality to business processes and downstream dependencies.
Select Metrics
Choose metric categories based on field data types. Apply distribution and completeness metrics universally; statistical metrics to numeric and date fields only.
Execute Profiling
Run the profiling engine against the target data. For large datasets, consider sampling strategies while accounting for margin of error.
Review Results
Analyze metric outputs against defined quality thresholds and business rules. Flag anomalies for investigation.
Document Findings
Record baseline metrics, identified issues, and recommended remediation actions in a data quality register.
Remediate Issues
Execute data cleaning and transformation activities based on profiling findings before promotion to downstream systems.
Thresholds by Field Type
| Field Type | Critical Metrics | Recommended Thresholds |
|---|---|---|
| Primary Key / Identifier | Null Count, Distinct Count, Repeating Count | Null Rate = 0%, Uniqueness = 100%, No Repeating Values |
| Required Text Field | Null Count, Empty Count, Space Count | Combined Null + Empty + Space Rate < 1% |
| Optional Text Field | Null Count, Empty Count, Pattern Frequency | Null Rate documented; Pattern Conformance > 95% |
| Numeric Measure | Null Count, Zero Count, Min/Max Value, Std Dev | Z-Scores within ±3 for expected measures |
| Date / Timestamp | Null Count, Min Value, Max Value, Pattern | No future dates beyond tolerance; no dates before system go-live |
| Categorical / Code Field | Distinct Count, Enum Frequency, Null Count | Distinct values match approved reference data; Null Rate < 5% |
| Free Text | Null Count, Length Distribution, Special Characters | Max Length within field constraints; no control characters |