2.2 Statistics#
Descriptive statistics summarize and describe the characteristics of a dataset, and they are a great way to give anyone approaching your work an overview of your data. You could calculate these statistics by hand, but pandas already offers methods to compute them directly from your df_c.
2.2.1 Measures of Central Tendency#
mean = df_c["addit1"].mean() # Arithmetic mean
median = df_c["addit1"].median() # Middle value
print(f"Mean: {mean}")
print(f"median: {median}")
Mean: 0.9530791788856305
median: 1.0
2.2.2 Measures of Dispersion#
std_dev = df_y["IQ_Vocab"].std() # Standard deviation
variance = df_y["IQ_Vocab"].var() # Variance
quantiles = df_y["IQ_Vocab"].quantile([0.25, 0.5, 0.75]) # Quartiles
print(f"Standard deviation: {std_dev}")
print(f"Variance: {variance}")
print("Quantiles: ")
print(quantiles)
Standard deviation: 8.125015262500927
Variance: 66.015873015873
Quantiles:
0.25 60.0
0.50 64.0
0.75 70.0
Name: IQ_Vocab, dtype: float64
2.2.3 The .describe() method#
Alternatively, pandas offers a convenient way to get a first statistical summary. The count row tells you how many values were used for the calculations in each column.
df_subset = df_c.iloc[:, :5] # subset only the first 5 columns
df_subset.describe()
| | Unnamed: 0 | addit1 | addit2 | addit3 | addit4 |
|---|---|---|---|---|---|
| count | 341.000000 | 341.000000 | 341.000000 | 341.000000 | 341.000000 |
| mean | 171.000000 | 0.953079 | 0.926686 | 0.879765 | 0.935484 |
| std | 98.582453 | 0.211780 | 0.261034 | 0.325714 | 0.246031 |
| min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 86.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 50% | 171.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 75% | 256.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| max | 341.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
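By default, `.describe()` summarizes only the numeric columns. If a DataFrame mixes numeric and categorical columns, passing `include="all"` adds categorical summaries such as `unique`, `top`, and `freq`. A minimal sketch with made-up data (not from df_c):

```python
import pandas as pd

# Hypothetical mixed-type DataFrame for illustration
df = pd.DataFrame({
    "score": [1.0, 2.0, 3.0],
    "group": ["a", "a", "b"],
})

# include="all" summarizes categorical columns alongside numeric ones;
# cells that do not apply (e.g., "mean" of a string column) show NaN
summary = df.describe(include="all")
print(summary)
```
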
2.2.4 Mode, correlation, contingency tables#
# MODE
mode_value = df_c["class"].mode() # Most frequent value(s)
print("Mode:", mode_value.tolist()) # We use .tolist() to prettify the output
print()
# CORRELATION MATRIX
correlation_matrix = df_c.iloc[:, :5].corr()
print("Correlation Matrix:")
print(correlation_matrix)
print()
# CONTINGENCY TABLE (CROSS-TABULATION)
contingency_table = pd.crosstab(df_y["Gender"], df_y["Handedness"])
print("Contingency Table:")
print(contingency_table)
Mode: ['second']
Correlation Matrix:
Unnamed: 0 addit1 addit2 addit3 addit4
Unnamed: 0 1.000000 0.096782 0.076920 0.300166 0.124174
addit1 0.096782 1.000000 -0.009205 0.045889 0.111074
addit2 0.076920 -0.009205 1.000000 0.207355 0.017728
addit3 0.300166 0.045889 0.207355 1.000000 0.196536
addit4 0.124174 0.111074 0.017728 0.196536 1.000000
Contingency Table:
Handedness Left Right
Gender
Female 4 29
Male 0 33
A contingency table summarizes the relationship between two categorical variables. It is useful for understanding frequency distributions across groups (e.g., how handedness varies by gender).
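Raw counts can be hard to compare when group sizes differ; `pd.crosstab` can also return proportions via its `normalize` parameter. A minimal sketch with made-up data (not df_y):

```python
import pandas as pd

# Made-up data mirroring the Gender/Handedness example
df = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Male"],
    "Handedness": ["Left", "Right", "Right", "Right", "Right"],
})

# normalize="index" gives proportions within each row (each gender)
proportions = pd.crosstab(df["Gender"], df["Handedness"], normalize="index")
print(proportions)
```
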
Other Useful Descriptive Statistics to Consider#
- Skewness and Kurtosis: insight into the shape of distributions.
- Range and IQR (Interquartile Range): for detecting outliers.
- Z-scores: standardized scores help identify outliers or compare variables on a common scale.
- Value counts: great for summarizing categorical variables.
# SKEWNESS AND KURTOSIS
skewness = df_c["addit1"].skew()
kurtosis = df_c["addit1"].kurt()
print(f"Skewness: {skewness}")
print(f"Kurtosis: {kurtosis}")
Skewness: -4.304014813618405
Kurtosis: 16.622000435026045
Skewness measures the asymmetry of the distribution. A skewness close to 0 indicates a symmetric distribution.
Kurtosis indicates the “tailedness” of the distribution—higher values mean more extreme outliers.
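To build intuition, compare a symmetric sample with one that has a long right tail (toy numbers, not from df_c):

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])       # evenly spread around the mean
right_skewed = pd.Series([1, 1, 1, 2, 10])   # one large value pulls the tail right

print(symmetric.skew())     # 0.0: no asymmetry
print(right_skewed.skew())  # positive: tail on the right
```
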
# RANGE AND IQR
range_value = df_y["IQ_Matrix"].max() - df_y["IQ_Matrix"].min()
iqr = df_y["IQ_Matrix"].quantile(0.75) - df_y["IQ_Matrix"].quantile(0.25)
print(f"Range: {range_value}")
print(f"IQR (Interquartile Range): {iqr}")
Range: 35.0
IQR (Interquartile Range): 7.5
Range shows the spread between the highest and lowest values.
IQR (Interquartile Range) captures the middle 50% of the data and is robust to outliers.
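A common use of the IQR is the 1.5 × IQR rule: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are flagged as potential outliers. A sketch on toy data (not df_y):

```python
import pandas as pd

s = pd.Series([60, 62, 64, 66, 68, 120])  # 120 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values outside the fences
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [120]
```
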
# VALUE COUNTS
value_counts = df_y["Gender"].value_counts()
print("Value Counts for 'Gender':")
print(value_counts)
Value Counts for 'Gender':
Gender
Male 39
Female 37
Name: count, dtype: int64
Value counts show the frequency of each category in a categorical column—ideal for understanding group sizes.
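`value_counts` also accepts `normalize=True` to report proportions instead of raw counts, which helps when comparing groups of different sizes. A sketch with made-up values:

```python
import pandas as pd

s = pd.Series(["Male", "Female", "Male", "Male"])

print(s.value_counts())                 # raw counts per category
print(s.value_counts(normalize=True))   # proportions summing to 1
```
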
# Z-SCORES
from scipy.stats import zscore
z_scores = zscore(df_c["addit1"].dropna())
print("Z-scores (first 5):", z_scores[:5])
Z-scores (first 5): [0.22188008 0.22188008 0.22188008 0.22188008 0.22188008]
Z-scores standardize values relative to the mean and standard deviation, helping identify how far a value is from the average.
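A common convention is to treat values with an absolute z-score above some threshold (often 2 or 3) as potential outliers. A sketch with made-up values (not df_c):

```python
import numpy as np
from scipy.stats import zscore

values = np.array([10, 11, 9, 10, 10, 10, 11, 9, 10, 30.0])

z = zscore(values)                 # (x - mean) / std for each value
flagged = values[np.abs(z) > 2]    # |z| > 2 chosen as the outlier threshold
print(flagged)  # [30.]
```
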