Data Profiling - DQLabs PRIZM

Overview

Data Quality Profile Metrics are comprehensive measurements that assess the quality, integrity, and reliability of data across various dimensions. These metrics provide quantitative assessments of data attributes and characteristics, enabling organizations to establish baselines, identify issues, and monitor changes in data quality over time.

How to Use Data Quality Profile Metrics

Data Discovery

Use profile metrics to gain insights into the structure, content, and characteristics of datasets without prior knowledge.

Quality Assessment

Evaluate data against defined quality dimensions such as completeness, accuracy, consistency, and timeliness.

Issue Identification

Detect anomalies, outliers, and patterns that may indicate data quality problems.

Data Preparation

Inform data cleaning and transformation processes before analysis or migration.

Monitoring

Track changes in data quality metrics over time to identify trends and potential degradation.

Governance

Support data governance initiatives by providing objective measurements of data quality.

Benefits of Data Quality Profile Metrics

Improved Decision Making

Higher quality data leads to more accurate insights and better business decisions at every level of the organization.

Reduced Costs

Early identification of data issues prevents costly errors and rework downstream in analytics pipelines and operational systems.

Enhanced Efficiency

Standardized metrics streamline data quality assessment processes, reducing manual investigation time and enabling automation.

Increased Trust

Objective measures build confidence in data among analysts, business users, and executives who rely on that data.

Regulatory Compliance

Helps meet data quality requirements for regulatory frameworks including GDPR, CCPA, HIPAA, BCBS 239, and SOX.

Better Data Integration

Facilitates smoother data migrations and system integrations by identifying incompatibilities before they become blocking issues.

Operational Excellence

Supports continuous improvement in data management practices through measurable, repeatable quality assessments.

Distribution Metrics

Distribution metrics analyze how data values are spread across a dataset, helping identify completeness and uniqueness issues.

Completeness Metrics

Completeness metrics measure the proportion of present versus absent data values, revealing gaps in data coverage.

Metric	Definition
Null Count	Total records where the field contains a `NULL` value. High null counts may indicate data collection failures, optional fields not populated, or ETL pipeline gaps.
Empty Count	Count of records containing an empty string `('')` rather than NULL. Common in text fields and may indicate form submissions with no input.
Non-Empty Count	Count of records containing a meaningful, non-null, non-empty value. The primary completeness indicator. `Non-Empty Count = Total Records - Null Count - Empty Count`
Space Count	Count of records where the field contains only whitespace characters. Space-only values are neither null nor empty but contain no useful information.

Completeness Rate Formula

Completeness Rate (%) = (Non-Empty Count / Total Record Count) × 100

Target completeness rates above 95% for critical business fields. Fields below 80% completeness should trigger a data quality investigation.

Uniqueness Metrics

Metric	Definition
Distinct Count	The number of unique values present in a field, excluding nulls. For a primary key field, distinct count should equal total record count.
Repeating Count	The number of values that appear more than once. `Repeating Count = Total Count - Distinct Count`. High repeating counts in key fields indicate referential integrity issues.

Character Distribution Metrics

Metric	Definition
Digits Count	Count of records containing numeric digit characters (0–9). Validates that numeric identifiers, phone numbers, or postal codes contain the expected digits.
Special Character Count	Count of records containing non-alphanumeric characters such as `@`, `#`, `$`, `%`, `&`. High counts in fields expected to contain only letters or numbers indicate data entry issues.
Alphabet Count	Count of records containing purely alphabetic characters (A–Z, a–z). Validates that text-only fields (e.g., names) do not contain unexpected numeric or special characters.
Alpha Numeric Count	Count of records containing both alphabetic and numeric characters. Essential for fields like product codes, license plates, or reference numbers.

Space Distribution Metrics

Leading and trailing spaces are invisible in most UIs but cause exact-match comparisons and joins to fail silently. Always profile for space distribution before data integration work.

Metric	Definition
Leading Space	Count of records where the field value begins with one or more space characters. `' John'` will not match `'John'` in a database lookup.
Trailing Space	Count of records where the field value ends with one or more space characters. Particularly problematic in fixed-length fields exported from legacy systems.
Outer Space	Combined count of records with either leading or trailing spaces. An aggregated signal for edge whitespace issues across the field.
Inner Space	Count of records containing multiple consecutive internal spaces. Example: `'John Smith'` (two spaces) vs `'John Smith'` (one space). Indicates data merging errors.
Whitespace Count	Total count of all whitespace characters (spaces, tabs, carriage returns, line feeds) across all values. High counts relative to total characters may indicate padded or poorly formatted values.

Numeric Distribution Metrics

Metric	Definition
Zero Count	Count of records where the numeric field contains exactly `0`. Distinguishing zeros from nulls is critical — a zero balance differs fundamentally from an unknown balance.
Positive Count	Count of records where the numeric field is greater than zero.
Negative Count	Count of records where the numeric field is less than zero. Unexpected negatives in fields like age, quantity, or price indicate data entry errors or sign convention mismatches.

Numeric Sign Verification

Zero Count + Positive Count + Negative Count + Null Count = Total Record Count

Use this identity to validate metric calculations. Any discrepancy indicates a profiling error or data type mismatch.

Frequency Metrics

Frequency metrics measure how often specific values, patterns, or lengths occur in your data.

Value Frequency Metrics

Metric	Definition
Min Value	The smallest value in the field for the given data type (numeric minimum, lexicographic minimum for strings, earliest date). Helps identify lower-bound outliers.
Max Value	The largest value in the field. Identifies upper-bound outliers and validates ceiling constraints. Example: a `Max Value` of `150` in an Age field indicates a data error.
Enum (Value Occurrence Counts)	An enumeration of all distinct values along with their occurrence frequency. Invaluable for discovering unexpected values, validating categorical fields, and identifying dominant or rare values.

Length Frequency Metrics

Metric	Definition
Min Length	The minimum character length observed across all non-null values. A value below the expected minimum indicates truncation or data entry errors.
Max Length	The maximum character length observed. Values exceeding defined field length constraints indicate potential data truncation or schema mismatches.
Length Distribution	A frequency distribution of record counts at each distinct character length. A consistent length indicates fixed-format codes; variable lengths indicate free-text fields or format inconsistencies.

Pattern Frequency Metrics

Metric	Definition
Regular Expressions	Counts of values matching predefined or auto-discovered regex patterns. Records not matching the expected pattern are format violation candidates.
Long Pattern	Auto-discovers patterns using character-type acronyms: `A` for alphabetic, `N` for numeric. Example: `'ABC-123'` → `'AAA-NNN'`. Reveals the full structural landscape of a field.
Short Pattern	A compressed variant of Long Pattern that collapses consecutive identical character types. Example: `'AAANNN'` → `'AN'`. Quickly identifies fundamentally different high-level structures.

Phone Number Field:
  'NNN-NNN-NNNN'   → 78% of records  (standard US format)
  '(NNN) NNN-NNNN' → 15% of records  (alternate format)
  'NNNNNNNNNN'     →  5% of records  (no separators)
  'AAAAAAAAAA'     →  2% of records  (text — data error!)

Range Frequency Metrics

Metric	Definition
Value Range	`Max Value - Min Value`. A range much larger than expected suggests outliers or errors at the extremes of the distribution.
Length Range	`Max Length - Min Length`. A range of `0` indicates all values are the same length (fixed-format). A large range in an expected fixed-length field indicates format inconsistencies.

Pattern Metrics

Pattern metrics identify and analyze recurring structures within data values.

Regular Expression Pattern Matching

Validates that values conform to a precisely defined structural format. Used for fields with strict format requirements such as email addresses, phone numbers, postal codes, SSNs, and ISBNs.

Predefined Patterns — Standard regex patterns applied immediately for common field types (dates, currency, identifiers).
Auto-discovered Patterns — The profiler derives regex patterns from observed data, capturing the dominant format when the expected format is unknown.
Match vs. Non-Match Analysis — Reports the count and percentage of matching vs. non-matching records. Non-matching records are format violation candidates.
Multiple Pattern Support — A single field may have multiple valid patterns (e.g., international phone numbers). Multiple rules can be applied simultaneously.

Long Pattern Analysis

Each character is replaced with its type code — A for alphabetic, N for numeric, special characters preserved as-is. The profiler collects all unique long patterns and their frequency counts. Use cases: Initial data discovery when the expected format is unknown, detecting gradual format drift, comparing formats across sources before integration.

Short Pattern Analysis

Consecutive identical character type codes are collapsed into single tokens. Complements Long Pattern by identifying structural categories rather than specific format variants.

Short and Long patterns should be used together, not as substitutes. Short patterns identify categories; Long patterns identify specific variants within those categories.

Statistical Metrics

Statistical metrics provide mathematical measurements of data characteristics and distribution properties. These apply primarily to numeric and date/time fields.

Central Tendency

Metric	Definition
Mean	Sum of all non-null values divided by non-null record count. Sensitive to outliers — use with caution in skewed distributions. `Mean = Σ values / Count`
Mode	The most frequently occurring value. Applicable to both numeric and categorical data. A field may be unimodal, multimodal, or have no mode. Resistant to outliers.
Median (Q2)	The middle value when all non-null values are sorted. For even counts, the average of the two middle values. Robust to outliers — preferred over mean in skewed distributions.

Dispersion

Metric	Definition
Range	`Max - Min`. Simple measure of total spread; highly sensitive to outliers since it depends only on the two extreme values.
Standard Deviation	Average distance of each value from the mean, expressed in the same units as the data. `SD = √(Σ(x - mean)² / N)`. Values beyond ±3 SD are typically considered outliers.
Variance	The square of the standard deviation. Expressed in squared units — mathematically useful in statistical calculations but less interpretable for direct reporting.

Distribution Shape

Metric	Interpretation
Skewness	Measures asymmetry. `0` = symmetric. Positive = right tail (few high extremes). Negative = left tail. Values beyond `±1` indicate significant asymmetry; beyond `±2` indicate extreme asymmetry requiring investigation.
Kurtosis	Measures tailedness relative to a normal distribution. `Excess Kurtosis = Kurtosis - 3`. `> 0` = heavier tails (more extreme values). `< 0` = lighter tails. High excess kurtosis signals outlier concentration.

Skewness & Kurtosis Quick Reference

Range	Signal
Skewness `-1` to `+1`	Approximately symmetric — acceptable
Skewness `-2` to `-1` or `+1` to `+2`	Moderately skewed — investigate
Skewness `< -2` or `> +2`	Highly skewed — likely outliers or errors
Excess Kurtosis `~0`	Normal-like distribution
Excess Kurtosis `> 1`	Heavy tails — more extreme values than normal
Excess Kurtosis `< -1`	Light tails — fewer extreme values than normal

Quantiles

Quantile	Percentile	Description
Q0	0th	Minimum value. Absolute lower bound of the data.
Q1	25th	First quartile. 25% of values fall at or below this point. Used to define the IQR lower bound.
Q2	50th	Median. The midpoint of the distribution — robust to outliers.
Q3	75th	Third quartile. 75% of values fall at or below this point. Used to define the IQR upper bound.
Q4	100th	Maximum value. Absolute upper bound of the data.

Outlier Detection — Tukey Fence Method

IQR          = Q3 - Q1
Lower Fence  = Q1 - 1.5 × IQR
Upper Fence  = Q3 + 1.5 × IQR

Values outside the fences are statistical outliers. Use 3.0 × IQR for extreme outlier detection only.

Other Statistical Measures

Metric	Definition
Sum	Arithmetic total of all non-null numeric values. Validate against control totals from source systems to detect data loss during ingestion or transformation.
Margin of Error	Quantifies sampling uncertainty when profiling a dataset sample rather than the full population. `MOE = Z × (SD / √n)` where `Z = 1.96` for 95% confidence.

Implementation Guidance

Profiling Workflow

Define Scope

Identify the datasets, tables, and fields to be profiled. Prioritize based on criticality to business processes and downstream dependencies.

Select Metrics

Choose metric categories based on field data types. Apply distribution and completeness metrics universally; statistical metrics to numeric and date fields only.

Execute Profiling

Run the profiling engine against the target data. For large datasets, consider sampling strategies while accounting for margin of error.

Review Results

Analyze metric outputs against defined quality thresholds and business rules. Flag anomalies for investigation.

Document Findings

Record baseline metrics, identified issues, and recommended remediation actions in a data quality register.

Remediate Issues

Execute data cleaning and transformation activities based on profiling findings before promotion to downstream systems.

Establish Monitoring

Schedule periodic re-profiling to detect quality drift and validate that remediation actions remain effective over time.

Thresholds by Field Type

Field Type	Critical Metrics	Recommended Thresholds
Primary Key / Identifier	Null Count, Distinct Count, Repeating Count	Null Rate = 0%, Uniqueness = 100%, No Repeating Values
Required Text Field	Null Count, Empty Count, Space Count	Combined Null + Empty + Space Rate < 1%
Optional Text Field	Null Count, Empty Count, Pattern Frequency	Null Rate documented; Pattern Conformance > 95%
Numeric Measure	Null Count, Zero Count, Min/Max Value, Std Dev	Z-Scores within ±3 for expected measures
Date / Timestamp	Null Count, Min Value, Max Value, Pattern	No future dates beyond tolerance; no dates before system go-live
Categorical / Code Field	Distinct Count, Enum Frequency, Null Count	Distinct values match approved reference data; Null Rate < 5%
Free Text	Null Count, Length Distribution, Special Characters	Max Length within field constraints; no control characters

​Overview

​How to Use Data Quality Profile Metrics

Data Discovery

Quality Assessment

Issue Identification

Data Preparation

Monitoring

Governance

​Benefits of Data Quality Profile Metrics

​Distribution Metrics

​Completeness Metrics

​Uniqueness Metrics

​Character Distribution Metrics

​Space Distribution Metrics

​Numeric Distribution Metrics

​Frequency Metrics

​Value Frequency Metrics

​Length Frequency Metrics

​Pattern Frequency Metrics

​Range Frequency Metrics

​Pattern Metrics

​Regular Expression Pattern Matching

​Long Pattern Analysis

​Short Pattern Analysis

​Statistical Metrics

​Central Tendency

​Dispersion

​Distribution Shape

​Quantiles

​Other Statistical Measures

​Implementation Guidance

​Profiling Workflow

​Thresholds by Field Type

Overview

How to Use Data Quality Profile Metrics

Benefits of Data Quality Profile Metrics

Distribution Metrics

Completeness Metrics

Uniqueness Metrics

Character Distribution Metrics

Space Distribution Metrics

Numeric Distribution Metrics

Frequency Metrics

Value Frequency Metrics

Length Frequency Metrics

Pattern Frequency Metrics

Range Frequency Metrics

Pattern Metrics

Regular Expression Pattern Matching

Long Pattern Analysis

Short Pattern Analysis

Statistical Metrics

Central Tendency

Dispersion

Distribution Shape

Quantiles

Other Statistical Measures

Implementation Guidance

Profiling Workflow

Thresholds by Field Type