Data QA Using Python¶
A data quality analysis (QA) problem over a table, solved with pandas without drowning in code. Let's walk through a systematic QA analysis, step by step.
You have a table with these columns:
| A | B | C | width | D | Region |
Each row represents a record. Region groups related records geographically.
A, B, C, width, and D are attributes to test for completeness, consistency, and validity.
Domain Knowledge¶
You need some domain knowledge so the analysis is not just a blueprint exercise driven by the method itself. In this example, for instance, the width cannot be too small, and grouping can be done by Region. Other rules can come from simple observations about the data, e.g. (a minimal sketch of turning such rules into checks follows this list):
A → sometimes empty
B → often empty
C → may have false positives
width → presumably numeric (complete?)
Region → categorical, useful for grouping
D (another attribute?) → complete in some Region groups
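As a minimal sketch, assuming the generic column names above (A, B, width, Region) and an arbitrary minimum-width threshold, such observations can be written down as explicit checks:
import pandas as pd
import numpy as np
# Hypothetical toy data in the generic schema above; values are made up.
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, "x"],
    "width": [4.2, 2.1, 3.9],
    "Region": ["North", "North", "South"],
})
MIN_WIDTH = 3.0  # assumed domain threshold: "width cannot be too small"
rule_report = {
    "A_missing_pct": toy["A"].isna().mean() * 100,
    "B_missing_pct": toy["B"].isna().mean() * 100,
    "width_too_small_pct": (toy["width"] < MIN_WIDTH).mean() * 100,
}
print(rule_report)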
Objective¶
The goal is to measure how good or reliable the data is according to some rules, and generally to get a sense of completeness, validity, uniqueness, and consistency:
Detect missing data (A, B, width)
Flag possible false positives in C
Check completeness of D per Region
Summarize QA metrics per region
All this is a form of data quality assessment, which is a component of data profiling, probably a topic for a future post.
Key Metrics¶
| Check | What it shows | Example |
|---|---|---|
| Count / Missing % | How much data is missing | df.isna().sum() |
| Distinct count | How varied the data is | df.nunique() |
| Range check | Are numeric values within expected bounds | e.g., width > 3.0 |
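As a quick, hedged sketch of those three checks on a tiny made-up frame (the 3.0 bound is just the example threshold from the table):
import pandas as pd
df_demo = pd.DataFrame({"width": [3.5, 2.8, None, 4.1], "Region": ["A", "A", "B", "B"]})
missing_pct = df_demo.isna().sum() / len(df_demo) * 100   # Count / Missing %
distinct_counts = df_demo.nunique()                        # Distinct count
in_range_pct = (df_demo["width"] > 3.0).mean() * 100       # Range check: width > 3.0
print(missing_pct, distinct_counts, in_range_pct, sep="\n")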
Designing the QA Analysis in Python (pandas)¶
from sklearn import datasets
import pandas as pd
import numpy as np
# load the iris dataset
iris = datasets.load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df_mock_qa_problem = iris_df.copy()  # keep an untouched copy of the full dataset for reference
# get only the first two features for simplicity
iris_df = iris_df.iloc[:, :2]
# randomly add null values to sepal width
np.random.seed(0) # for reproducibility
null_indices = np.random.choice(iris_df.index, size=11, replace=False)
print("Indices with null values:", null_indices)
iris_df.loc[null_indices, 'sepal width (cm)'] = np.nan
# add a bogus categorical Region column with random values
bogus_regions = ['Region_A', 'Region_B', 'Region_C']
iris_df['Region'] = np.random.choice(bogus_regions, size=len(iris_df))
print(iris_df.head(20))
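With the mock dataset in place, the checks below compute overall completeness, completeness per Region, and a couple of validity checks.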
import pandas as pd
import numpy as np
df = iris_df
# --- 1 Basic completeness checks ---
completeness = df.isnull().mean().to_frame("missing_pct") * 100
completeness["non_missing_pct"] = 100 - completeness["missing_pct"]
# --- 2 Groupwise completeness by Region ---
region_completeness = (
    df.groupby("Region")
    .apply(lambda g: g.isnull().mean() * 100)
    .rename_axis(index=None)
)
# --- 3 Validity checks ---
# Note: the mock has no real boolean column C, so the C-style check below is illustrative only.
validity = pd.Series(dtype=float)
validity["width_invalid"] = (df["sepal width (cm)"] <= 0).sum()
validity["width_null"] = df["sepal width (cm)"].isnull().sum()
validity["C_valid_pct"] = df["sepal length (cm)"].isin([True, False, 0, 1]).mean() * 100
"""
# --- 5 Optional: QA scoring per Region ---
region_qa = (
df.groupby("Region")
.apply(lambda g: 1 - (g[["A","B","width"]].isnull().sum().sum() / (len(g)*3)))
.reset_index(name="qa_score")
)
region_qa["qa_label"] = pd.cut(region_qa["qa_score"],
bins=[0,0.7,0.9,1],
labels=["Poor","Fair","Good"])
# --- 6 Optional: False positive analysis for C ---
false_pos_candidates = df[df["C"] == True]
"""
print(completeness)
print(region_completeness)
print(validity)
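The commented-out QA scoring above targets the generic A/B/width schema. A minimal sketch adapted to the iris mock, scoring only the column we actually nulled (the bin edges and labels are arbitrary choices), could look like this:
# Per-Region QA score based on completeness of the nulled column
region_qa = (
    df.groupby("Region")["sepal width (cm)"]
    .apply(lambda s: 1 - s.isnull().mean())
    .reset_index(name="qa_score")
)
region_qa["qa_label"] = pd.cut(region_qa["qa_score"],
                               bins=[0, 0.7, 0.9, 1],
                               labels=["Poor", "Fair", "Good"])
print(region_qa)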
Bonus: LLN-Theory Application in QA¶
I have no time to develop this further, but here is an interesting idea: use pandas to check whether the distribution of a categorical variable A is consistent across regions (B). This is a great application of the idea behind the Law of Large Numbers: if data generation is consistent, the frequency of categories in A should be roughly stable across large-enough subsets (regions).
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency
# Example: assume df has columns "A" (category) and "B" (region)
# df = pd.read_csv("your_data.csv")
# 1. Overall distribution of A
overall_dist = df['A'].value_counts(normalize=True)
# 2. Regional distribution of A
regional_dist = (
    df.groupby('B')['A']
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
# 3. Compare each region’s distribution to the overall one
# Compute absolute deviation for each category
deviation = (regional_dist - overall_dist).abs()
# 4. Flag regions with large deviation
threshold = 0.05 # e.g. more than 5 percentage points off
flagged_regions = deviation.max(axis=1)[deviation.max(axis=1) > threshold]
print("Regions with anomalous category distributions in A:")
print(flagged_regions)
# 5. Optional: run chi-square test for independence (A vs B)
contingency = pd.crosstab(df['B'], df['A'])
chi2, p, dof, expected = chi2_contingency(contingency)
print("\nChi-square test for independence:")
print(f"Chi2 statistic = {chi2:.2f}, p-value = {p:.4g}")