Data QA Using Python¶
A data quality analysis (QA) problem over a table, solved with pandas without drowning in code. Let's walk through a systematic QA analysis, step by step.
You have a table with these columns:
| A | B | C | width | D | Region |
Each row represents a record. Region groups related records geographically.
A, B, C, width, and D are attributes to test for completeness, consistency, and validity.
Domain Knowledge¶
You need some domain knowledge so the analysis is not just a blueprint exercise driven by the method itself. In this example, for instance, the width cannot be too small, and grouping can be done by Region. Other rules can come from simple observations about the data, e.g. (a minimal sketch of turning such rules into checks follows this list):
A → sometimes empty
B → often empty
C → may have false positives
width → presumably numeric (complete?)
Region → categorical, useful for grouping
D (another attribute?) → complete in some Region groups
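As a minimal sketch, assuming the generic column names above (A, B, width, Region) and an arbitrary minimum-width threshold, such observations can be written down as explicit checks:
import pandas as pd
import numpy as np
# Hypothetical toy data in the generic schema above; values are made up.
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, "x"],
    "width": [4.2, 2.1, 3.9],
    "Region": ["North", "North", "South"],
})
MIN_WIDTH = 3.0  # assumed domain threshold: "width cannot be too small"
rule_report = {
    "A_missing_pct": toy["A"].isna().mean() * 100,
    "B_missing_pct": toy["B"].isna().mean() * 100,
    "width_too_small_pct": (toy["width"] < MIN_WIDTH).mean() * 100,
}
print(rule_report)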
Objective¶
The goal is to measure how good or reliable the data is according to some rules, and generally to get a sense of completeness, validity, uniqueness, and consistency:
Detect missing data (A, B, width)
Flag possible false positives in C
Check completeness of D per Region
Summarize QA metrics per region
All this is a form of data quality assessment, which is a component of data profiling, probably a topic for a future post.
Key Metrics¶
| Check | What it shows | Example |
|---|---|---|
| Count / Missing % | How much data is missing | df.isna().sum() |
| Distinct count | How varied the data is | df.nunique() |
| Range check | Are numeric values within expected bounds | e.g., width > 3.0 |
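As a quick, hedged sketch of those three checks on a tiny made-up frame (the 3.0 bound is just the example threshold from the table):
import pandas as pd
df_demo = pd.DataFrame({"width": [3.5, 2.8, None, 4.1], "Region": ["A", "A", "B", "B"]})
missing_pct = df_demo.isna().sum() / len(df_demo) * 100   # Count / Missing %
distinct_counts = df_demo.nunique()                        # Distinct count
in_range_pct = (df_demo["width"] > 3.0).mean() * 100       # Range check: width > 3.0
print(missing_pct, distinct_counts, in_range_pct, sep="\n")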
Designing the QA Analysis in Python (pandas)¶
from sklearn import datasets
import pandas as pd
import numpy as np
# load the iris dataset
iris = datasets.load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df_mock_qa_problem = iris_df.copy()  # keep an untouched copy of the full dataset for reference
# get only the first two features for simplicity
iris_df = iris_df.iloc[:, :2]
# randomly add null values to sepal width
np.random.seed(0) # for reproducibility
null_indices = np.random.choice(iris_df.index, size=11, replace=False)
print("Indices with null values:", null_indices)
iris_df.loc[null_indices, 'sepal width (cm)'] = np.nan
# add a bogus categorical Region column with random values
bogus_regions = ['Region_A', 'Region_B', 'Region_C']
iris_df['Region'] = np.random.choice(bogus_regions, size=len(iris_df))
print(iris_df.head(20))
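With the mock dataset in place, the checks below compute overall completeness, completeness per Region, and a couple of validity checks.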
import pandas as pd
import numpy as np
df = iris_df
# --- 1 Basic completeness checks ---
completeness = df.isnull().mean().to_frame("missing_pct") * 100
completeness["non_missing_pct"] = 100 - completeness["missing_pct"]
# --- 2 Groupwise completeness by Region ---
region_completeness = (
    df.groupby("Region")
    .apply(lambda g: g.isnull().mean() * 100)
    .rename_axis(index=None)
)
# --- 3 Validity checks ---
# Note: the mock has no real boolean column C, so the C-style check below is illustrative only.
validity = pd.Series(dtype=float)
validity["width_invalid"] = (df["sepal width (cm)"] <= 0).sum()
validity["width_null"] = df["sepal width (cm)"].isnull().sum()
validity["C_valid_pct"] = df["sepal length (cm)"].isin([True, False, 0, 1]).mean() * 100
"""
# --- 5 Optional: QA scoring per Region ---
region_qa = (
df.groupby("Region")
.apply(lambda g: 1 - (g[["A","B","width"]].isnull().sum().sum() / (len(g)*3)))
.reset_index(name="qa_score")
)
region_qa["qa_label"] = pd.cut(region_qa["qa_score"],
bins=[0,0.7,0.9,1],
labels=["Poor","Fair","Good"])
# --- 6 Optional: False positive analysis for C ---
false_pos_candidates = df[df["C"] == True]
"""
print(completeness)
print(region_completeness)
print(validity)
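The commented-out QA scoring above targets the generic A/B/width schema. A minimal sketch adapted to the iris mock, scoring only the column we actually nulled (the bin edges and labels are arbitrary choices), could look like this:
# Per-Region QA score based on completeness of the nulled column
region_qa = (
    df.groupby("Region")["sepal width (cm)"]
    .apply(lambda s: 1 - s.isnull().mean())
    .reset_index(name="qa_score")
)
region_qa["qa_label"] = pd.cut(region_qa["qa_score"],
                               bins=[0, 0.7, 0.9, 1],
                               labels=["Poor", "Fair", "Good"])
print(region_qa)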
Bonus: LLN-Theory Application in QA¶
I have no time to develop this further, but here is an interesting idea: use pandas to check whether the distribution of a categorical variable A is consistent across regions (B). This is a great application of the idea behind the Law of Large Numbers: if data generation is consistent, the frequency of categories in A should be roughly stable across large-enough subsets (regions).
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency
# Example: assume df has columns "A" (category) and "B" (region)
# df = pd.read_csv("your_data.csv")
# 1. Overall distribution of A
overall_dist = df['A'].value_counts(normalize=True)
# 2. Regional distribution of A
regional_dist = (
    df.groupby('B')['A']
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
# 3. Compare each region’s distribution to the overall one
# Compute absolute deviation for each category
deviation = (regional_dist - overall_dist).abs()
# 4. Flag regions with large deviation
threshold = 0.05 # e.g. more than 5 percentage points off
flagged_regions = deviation.max(axis=1)[deviation.max(axis=1) > threshold]
print("Regions with anomalous category distributions in A:")
print(flagged_regions)
# 5. Optional: run chi-square test for independence (A vs B)
contingency = pd.crosstab(df['B'], df['A'])
chi2, p, dof, expected = chi2_contingency(contingency)
print("\nChi-square test for independence:")
print(f"Chi2 statistic = {chi2:.2f}, p-value = {p:.4g}")