2.2 Statistics#

Descriptive statistics summarize the main characteristics of a dataset and are a great way to give anyone approaching your work a quick overview of your data. You could calculate these statistics by hand, but pandas already offers methods to compute them directly from your df_c.

2.2.1 Measures of Central Tendency#

mean = df_c["addit1"].mean()     # Arithmetic mean
median = df_c["addit1"].median() # Middle value

print(f"Mean: {mean}")
print(f"Median: {median}")
Mean: 0.9530791788856305
Median: 1.0
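The mean and the median can tell quite different stories when the data contain outliers. As a quick illustration (on an invented series, not df_c):

```python
import pandas as pd

# Invented data (not from df_c): one extreme value pulls the mean upward,
# while the median stays at the middle observation.
scores = pd.Series([1, 2, 3, 4, 100])

print(f"Mean: {scores.mean()}")      # 22.0 -- dragged up by the outlier
print(f"Median: {scores.median()}")  # 3.0  -- robust to the outlier
```

This is why the median is often preferred for skewed variables such as income or reaction times.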

2.2.2 Measures of Dispersion#

std_dev = df_y["IQ_Vocab"].std()   # Standard deviation
variance = df_y["IQ_Vocab"].var()  # Variance
quantiles = df_y["IQ_Vocab"].quantile([0.25, 0.5, 0.75])  # Quartiles

print(f"Standard deviation: {std_dev}")
print(f"Variance: {variance}")
print("Quantiles: ")
print(quantiles)
Standard deviation: 8.125015262500927
Variance: 66.015873015873
Quantiles: 
0.25    60.0
0.50    64.0
0.75    70.0
Name: IQ_Vocab, dtype: float64
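Note that the first two quantities are directly related: the variance is the square of the standard deviation (pandas uses the sample formulas with ddof=1 for both). A quick sanity check on an invented series:

```python
import pandas as pd

x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # invented values

# .std() and .var() both default to the sample formula (ddof=1),
# so squaring the standard deviation recovers the variance.
print(abs(x.std() ** 2 - x.var()) < 1e-12)  # True
```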

2.2.3 The .describe() method#

Alternatively, pandas offers a convenient way to get a first statistical summary. The count row tells you how many values were used for the calculations in each column.

df_subset = df_c.iloc[:, :5]   # subset only the first 5 columns
df_subset.describe()
       Unnamed: 0      addit1      addit2      addit3      addit4
count  341.000000  341.000000  341.000000  341.000000  341.000000
mean   171.000000    0.953079    0.926686    0.879765    0.935484
std     98.582453    0.211780    0.261034    0.325714    0.246031
min      1.000000    0.000000    0.000000    0.000000    0.000000
25%     86.000000    1.000000    1.000000    1.000000    1.000000
50%    171.000000    1.000000    1.000000    1.000000    1.000000
75%    256.000000    1.000000    1.000000    1.000000    1.000000
max    341.000000    1.000000    1.000000    1.000000    1.000000
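By default, .describe() only summarizes numeric columns. If your DataFrame also contains categorical columns, passing include="all" adds rows such as unique, top, and freq for them (shown here on a small invented frame, since the subset above is all numeric):

```python
import pandas as pd

# Small invented frame: include="all" makes .describe() also cover the
# non-numeric "group" column (unique / top / freq rows); numeric-only
# statistics show NaN for it, and vice versa.
df = pd.DataFrame({"score": [1, 2, 3], "group": ["a", "a", "b"]})
print(df.describe(include="all"))
```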

2.2.4 Mode, correlation, contingency tables#

# MODE
mode_value = df_c["class"].mode()    # Most frequent value(s)
print("Mode:", mode_value.tolist())  # We use .tolist() to prettify the output
print()

# CORRELATION MATRIX
correlation_matrix = df_c.iloc[:, :5].corr()
print("Correlation Matrix:")
print(correlation_matrix)
print()

# CONTINGENCY TABLE (CROSS-TABULATION)
contingency_table = pd.crosstab(df_y["Gender"], df_y["Handedness"])
print("Contingency Table:")
print(contingency_table)
Mode: ['second']

Correlation Matrix:
            Unnamed: 0    addit1    addit2    addit3    addit4
Unnamed: 0    1.000000  0.096782  0.076920  0.300166  0.124174
addit1        0.096782  1.000000 -0.009205  0.045889  0.111074
addit2        0.076920 -0.009205  1.000000  0.207355  0.017728
addit3        0.300166  0.045889  0.207355  1.000000  0.196536
addit4        0.124174  0.111074  0.017728  0.196536  1.000000

Contingency Table:
Handedness  Left  Right
Gender                 
Female         4     29
Male           0     33

A contingency table summarizes the relationship between two categorical variables. It is useful for understanding frequency distributions across groups (e.g., how handedness varies by gender).
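Raw counts can be hard to compare when group sizes differ. pd.crosstab also accepts a normalize argument; with normalize="index", each row sums to 1, giving within-group proportions (invented data below, standing in for df_y):

```python
import pandas as pd

# Invented stand-in for df_y: normalize="index" turns counts into
# row proportions, so each gender's handedness distribution sums to 1.
df = pd.DataFrame({
    "Gender":     ["Female", "Female", "Male", "Male", "Male"],
    "Handedness": ["Left", "Right", "Right", "Right", "Right"],
})
print(pd.crosstab(df["Gender"], df["Handedness"], normalize="index"))
```

normalize="columns" and normalize="all" are also available, depending on which proportions you want to report.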

Other Useful Descriptive Statistics to Consider#

  • Skewness and Kurtosis: Insight into the shape of distributions.

  • Range and IQR (Interquartile Range): For detecting outliers.

  • Z-scores: Standardized scores help identify outliers or compare variables on a common scale.

  • Value counts: Great for summarizing categorical variables.

# SKEWNESS AND KURTOSIS
skewness = df_c["addit1"].skew()
kurtosis = df_c["addit1"].kurt()
print(f"Skewness: {skewness}")
print(f"Kurtosis: {kurtosis}")
Skewness: -4.304014813618405
Kurtosis: 16.622000435026045

Skewness measures the asymmetry of the distribution. A skewness close to 0 indicates a symmetric distribution.
Kurtosis indicates the “tailedness” of the distribution—higher values mean more extreme outliers.
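To build intuition for the sign of skewness, compare a symmetric series with one that has a long right tail (both invented):

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])      # evenly spread around the mean
right_skewed = pd.Series([1, 1, 1, 2, 10])  # long tail to the right

print(f"Symmetric skew: {symmetric.skew()}")        # 0.0
print(f"Right-skewed skew: {right_skewed.skew()}")  # positive
```

The strongly negative skewness of addit1 above fits a variable where most values are 1 and a few are 0: the tail points left.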


# RANGE AND IQR
range_value = df_y["IQ_Matrix"].max() - df_y["IQ_Matrix"].min()
iqr = df_y["IQ_Matrix"].quantile(0.75) - df_y["IQ_Matrix"].quantile(0.25)
print(f"Range: {range_value}")
print(f"IQR (Interquartile Range): {iqr}")
Range: 35.0
IQR (Interquartile Range): 7.5

Range shows the spread between the highest and lowest values.
IQR (Interquartile Range) captures the middle 50% of the data and is robust to outliers.
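A common convention (not shown in the cells above) builds on the IQR to flag outliers: values more than 1.5 × IQR below the first quartile or above the third quartile are treated as suspicious. A sketch on invented data:

```python
import pandas as pd

# Invented series: flag values beyond the 1.5 * IQR "fences"
# (the 1.5 multiplier is a convention, not a law).
x = pd.Series([60, 62, 63, 64, 66, 68, 70, 120])

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < lower) | (x > upper)]
print(f"Outliers: {outliers.tolist()}")  # [120]
```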


# VALUE COUNTS
value_counts = df_y["Gender"].value_counts()
print("Value Counts for 'Gender':")
print(value_counts)
Value Counts for 'Gender':
Gender
Male      39
Female    37
Name: count, dtype: int64

Value counts show the frequency of each category in a categorical column—ideal for understanding group sizes.
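With normalize=True, .value_counts() returns proportions instead of raw counts, which is often easier to report (invented column below):

```python
import pandas as pd

# Invented column; normalize=True gives relative frequencies
# instead of raw counts.
gender = pd.Series(["Male", "Male", "Female", "Male"])
print(gender.value_counts(normalize=True))
```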


# Z-SCORES
from scipy.stats import zscore
z_scores = zscore(df_c["addit1"].dropna())
print("Z-scores (first 5):", z_scores[:5])
Z-scores (first 5): [0.22188008 0.22188008 0.22188008 0.22188008 0.22188008]

Z-scores standardize values relative to the mean and standard deviation, helping identify how far a value is from the average.
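One common use is flagging potential outliers: values whose absolute z-score exceeds some threshold (2 or 3 are typical conventions). A sketch on invented data:

```python
import numpy as np
from scipy.stats import zscore

# Invented sample; the threshold of 2 is a convention to tune per analysis.
values = np.array([10.0, 11.0, 9.0, 10.0, 12.0, 9.0, 11.0, 10.0, 10.0, 30.0])
z = zscore(values)  # ddof=0 by default, as in the cell above

print(values[np.abs(z) > 2])  # only the extreme value remains
```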