Skip to main content

Overview

Data Quality Profile Metrics are comprehensive measurements that assess the quality, integrity, and reliability of data across various dimensions. These metrics provide quantitative assessments of data attributes and characteristics, enabling organizations to establish baselines, identify issues, and monitor changes in data quality over time.

How to Use Data Quality Profile Metrics

Data Discovery

Use profile metrics to gain insights into the structure, content, and characteristics of datasets without prior knowledge.

Quality Assessment

Evaluate data against defined quality dimensions such as completeness, accuracy, consistency, and timeliness.

Issue Identification

Detect anomalies, outliers, and patterns that may indicate data quality problems.

Data Preparation

Inform data cleaning and transformation processes before analysis or migration.

Monitoring

Track changes in data quality metrics over time to identify trends and potential degradation.

Governance

Support data governance initiatives by providing objective measurements of data quality.

Benefits of Data Quality Profile Metrics

Higher quality data leads to more accurate insights and better business decisions at every level of the organization.
Early identification of data issues prevents costly errors and rework downstream in analytics pipelines and operational systems.
Standardized metrics streamline data quality assessment processes, reducing manual investigation time and enabling automation.
Objective measures build confidence in data among analysts, business users, and executives who rely on that data.
Helps meet data quality requirements for regulatory frameworks including GDPR, CCPA, HIPAA, BCBS 239, and SOX.
Facilitates smoother data migrations and system integrations by identifying incompatibilities before they become blocking issues.
Supports continuous improvement in data management practices through measurable, repeatable quality assessments.

Distribution Metrics

Distribution metrics analyze how data values are spread across a dataset, helping identify completeness and uniqueness issues.

Completeness Metrics

Completeness metrics measure the proportion of present versus absent data values, revealing gaps in data coverage.
MetricDefinition
Null CountTotal records where the field contains a NULL value. High null counts may indicate data collection failures, optional fields not populated, or ETL pipeline gaps.
Empty CountCount of records containing an empty string ('') rather than NULL. Common in text fields and may indicate form submissions with no input.
Non-Empty CountCount of records containing a meaningful, non-null, non-empty value. The primary completeness indicator. Non-Empty Count = Total Records - Null Count - Empty Count
Space CountCount of records where the field contains only whitespace characters. Space-only values are neither null nor empty but contain no useful information.
Completeness Rate Formula
Completeness Rate (%) = (Non-Empty Count / Total Record Count) × 100
Target completeness rates above 95% for critical business fields. Fields below 80% completeness should trigger a data quality investigation.

Uniqueness Metrics

MetricDefinition
Distinct CountThe number of unique values present in a field, excluding nulls. For a primary key field, distinct count should equal total record count.
Repeating CountThe number of values that appear more than once. Repeating Count = Total Count - Distinct Count. High repeating counts in key fields indicate referential integrity issues.

Character Distribution Metrics

MetricDefinition
Digits CountCount of records containing numeric digit characters (0–9). Validates that numeric identifiers, phone numbers, or postal codes contain the expected digits.
Special Character CountCount of records containing non-alphanumeric characters such as @, #, $, %, &. High counts in fields expected to contain only letters or numbers indicate data entry issues.
Alphabet CountCount of records containing purely alphabetic characters (A–Z, a–z). Validates that text-only fields (e.g., names) do not contain unexpected numeric or special characters.
Alpha Numeric CountCount of records containing both alphabetic and numeric characters. Essential for fields like product codes, license plates, or reference numbers.

Space Distribution Metrics

Leading and trailing spaces are invisible in most UIs but cause exact-match comparisons and joins to fail silently. Always profile for space distribution before data integration work.
MetricDefinition
Leading SpaceCount of records where the field value begins with one or more space characters. ' John' will not match 'John' in a database lookup.
Trailing SpaceCount of records where the field value ends with one or more space characters. Particularly problematic in fixed-length fields exported from legacy systems.
Outer SpaceCombined count of records with either leading or trailing spaces. An aggregated signal for edge whitespace issues across the field.
Inner SpaceCount of records containing multiple consecutive internal spaces. Example: 'John Smith' (two spaces) vs 'John Smith' (one space). Indicates data merging errors.
Whitespace CountTotal count of all whitespace characters (spaces, tabs, carriage returns, line feeds) across all values. High counts relative to total characters may indicate padded or poorly formatted values.

Numeric Distribution Metrics

MetricDefinition
Zero CountCount of records where the numeric field contains exactly 0. Distinguishing zeros from nulls is critical — a zero balance differs fundamentally from an unknown balance.
Positive CountCount of records where the numeric field is greater than zero.
Negative CountCount of records where the numeric field is less than zero. Unexpected negatives in fields like age, quantity, or price indicate data entry errors or sign convention mismatches.
Numeric Sign Verification
Zero Count + Positive Count + Negative Count + Null Count = Total Record Count
Use this identity to validate metric calculations. Any discrepancy indicates a profiling error or data type mismatch.

Frequency Metrics

Frequency metrics measure how often specific values, patterns, or lengths occur in your data.

Value Frequency Metrics

MetricDefinition
Min ValueThe smallest value in the field for the given data type (numeric minimum, lexicographic minimum for strings, earliest date). Helps identify lower-bound outliers.
Max ValueThe largest value in the field. Identifies upper-bound outliers and validates ceiling constraints. Example: a Max Value of 150 in an Age field indicates a data error.
Enum (Value Occurrence Counts)An enumeration of all distinct values along with their occurrence frequency. Invaluable for discovering unexpected values, validating categorical fields, and identifying dominant or rare values.

Length Frequency Metrics

MetricDefinition
Min LengthThe minimum character length observed across all non-null values. A value below the expected minimum indicates truncation or data entry errors.
Max LengthThe maximum character length observed. Values exceeding defined field length constraints indicate potential data truncation or schema mismatches.
Length DistributionA frequency distribution of record counts at each distinct character length. A consistent length indicates fixed-format codes; variable lengths indicate free-text fields or format inconsistencies.

Pattern Frequency Metrics

MetricDefinition
Regular ExpressionsCounts of values matching predefined or auto-discovered regex patterns. Records not matching the expected pattern are format violation candidates.
Long PatternAuto-discovers patterns using character-type acronyms: A for alphabetic, N for numeric. Example: 'ABC-123''AAA-NNN'. Reveals the full structural landscape of a field.
Short PatternA compressed variant of Long Pattern that collapses consecutive identical character types. Example: 'AAANNN''AN'. Quickly identifies fundamentally different high-level structures.
Phone Number Field:
  'NNN-NNN-NNNN'   → 78% of records  (standard US format)
  '(NNN) NNN-NNNN' → 15% of records  (alternate format)
  'NNNNNNNNNN'     →  5% of records  (no separators)
  'AAAAAAAAAA'     →  2% of records  (text — data error!)

Range Frequency Metrics

MetricDefinition
Value RangeMax Value - Min Value. A range much larger than expected suggests outliers or errors at the extremes of the distribution.
Length RangeMax Length - Min Length. A range of 0 indicates all values are the same length (fixed-format). A large range in an expected fixed-length field indicates format inconsistencies.

Pattern Metrics

Pattern metrics identify and analyze recurring structures within data values.

Regular Expression Pattern Matching

Validates that values conform to a precisely defined structural format. Used for fields with strict format requirements such as email addresses, phone numbers, postal codes, SSNs, and ISBNs.
  • Predefined Patterns — Standard regex patterns applied immediately for common field types (dates, currency, identifiers).
  • Auto-discovered Patterns — The profiler derives regex patterns from observed data, capturing the dominant format when the expected format is unknown.
  • Match vs. Non-Match Analysis — Reports the count and percentage of matching vs. non-matching records. Non-matching records are format violation candidates.
  • Multiple Pattern Support — A single field may have multiple valid patterns (e.g., international phone numbers). Multiple rules can be applied simultaneously.

Long Pattern Analysis

Each character is replaced with its type code — A for alphabetic, N for numeric, special characters preserved as-is. The profiler collects all unique long patterns and their frequency counts. Use cases: Initial data discovery when the expected format is unknown, detecting gradual format drift, comparing formats across sources before integration.

Short Pattern Analysis

Consecutive identical character type codes are collapsed into single tokens. Complements Long Pattern by identifying structural categories rather than specific format variants.
Short and Long patterns should be used together, not as substitutes. Short patterns identify categories; Long patterns identify specific variants within those categories.

Statistical Metrics

Statistical metrics provide mathematical measurements of data characteristics and distribution properties. These apply primarily to numeric and date/time fields.

Central Tendency

MetricDefinition
MeanSum of all non-null values divided by non-null record count. Sensitive to outliers — use with caution in skewed distributions. Mean = Σ values / Count
ModeThe most frequently occurring value. Applicable to both numeric and categorical data. A field may be unimodal, multimodal, or have no mode. Resistant to outliers.
Median (Q2)The middle value when all non-null values are sorted. For even counts, the average of the two middle values. Robust to outliers — preferred over mean in skewed distributions.

Dispersion

MetricDefinition
RangeMax - Min. Simple measure of total spread; highly sensitive to outliers since it depends only on the two extreme values.
Standard DeviationAverage distance of each value from the mean, expressed in the same units as the data. SD = √(Σ(x - mean)² / N). Values beyond ±3 SD are typically considered outliers.
VarianceThe square of the standard deviation. Expressed in squared units — mathematically useful in statistical calculations but less interpretable for direct reporting.

Distribution Shape

MetricInterpretation
SkewnessMeasures asymmetry. 0 = symmetric. Positive = right tail (few high extremes). Negative = left tail. Values beyond ±1 indicate significant asymmetry; beyond ±2 indicate extreme asymmetry requiring investigation.
KurtosisMeasures tailedness relative to a normal distribution. Excess Kurtosis = Kurtosis - 3. > 0 = heavier tails (more extreme values). < 0 = lighter tails. High excess kurtosis signals outlier concentration.
Skewness & Kurtosis Quick Reference
RangeSignal
Skewness -1 to +1Approximately symmetric — acceptable
Skewness -2 to -1 or +1 to +2Moderately skewed — investigate
Skewness < -2 or > +2Highly skewed — likely outliers or errors
Excess Kurtosis ~0Normal-like distribution
Excess Kurtosis > 1Heavy tails — more extreme values than normal
Excess Kurtosis < -1Light tails — fewer extreme values than normal

Quantiles

QuantilePercentileDescription
Q00thMinimum value. Absolute lower bound of the data.
Q125thFirst quartile. 25% of values fall at or below this point. Used to define the IQR lower bound.
Q250thMedian. The midpoint of the distribution — robust to outliers.
Q375thThird quartile. 75% of values fall at or below this point. Used to define the IQR upper bound.
Q4100thMaximum value. Absolute upper bound of the data.
Outlier Detection — Tukey Fence Method
IQR          = Q3 - Q1
Lower Fence  = Q1 - 1.5 × IQR
Upper Fence  = Q3 + 1.5 × IQR
Values outside the fences are statistical outliers. Use 3.0 × IQR for extreme outlier detection only.

Other Statistical Measures

MetricDefinition
SumArithmetic total of all non-null numeric values. Validate against control totals from source systems to detect data loss during ingestion or transformation.
Margin of ErrorQuantifies sampling uncertainty when profiling a dataset sample rather than the full population. MOE = Z × (SD / √n) where Z = 1.96 for 95% confidence.

Implementation Guidance

Profiling Workflow

1

Define Scope

Identify the datasets, tables, and fields to be profiled. Prioritize based on criticality to business processes and downstream dependencies.
2

Select Metrics

Choose metric categories based on field data types. Apply distribution and completeness metrics universally; statistical metrics to numeric and date fields only.
3

Execute Profiling

Run the profiling engine against the target data. For large datasets, consider sampling strategies while accounting for margin of error.
4

Review Results

Analyze metric outputs against defined quality thresholds and business rules. Flag anomalies for investigation.
5

Document Findings

Record baseline metrics, identified issues, and recommended remediation actions in a data quality register.
6

Remediate Issues

Execute data cleaning and transformation activities based on profiling findings before promotion to downstream systems.
7

Establish Monitoring

Schedule periodic re-profiling to detect quality drift and validate that remediation actions remain effective over time.

Thresholds by Field Type

Field TypeCritical MetricsRecommended Thresholds
Primary Key / IdentifierNull Count, Distinct Count, Repeating CountNull Rate = 0%, Uniqueness = 100%, No Repeating Values
Required Text FieldNull Count, Empty Count, Space CountCombined Null + Empty + Space Rate < 1%
Optional Text FieldNull Count, Empty Count, Pattern FrequencyNull Rate documented; Pattern Conformance > 95%
Numeric MeasureNull Count, Zero Count, Min/Max Value, Std DevZ-Scores within ±3 for expected measures
Date / TimestampNull Count, Min Value, Max Value, PatternNo future dates beyond tolerance; no dates before system go-live
Categorical / Code FieldDistinct Count, Enum Frequency, Null CountDistinct values match approved reference data; Null Rate < 5%
Free TextNull Count, Length Distribution, Special CharactersMax Length within field constraints; no control characters