2.4 Exercises
Data Curation and Exploration
We will now solve practical data analysis tasks. Try to reason through each step before running the code.
1️⃣ Load and Inspect the Data
Instructions:
Download data here and store it in your Test_Theory_and_Costruction folder.
Load the child_num_skills.csv dataset
Load the Yeatman (subject.csv) dataset
Print the shape and column names in a clear format for both datasets
# Import packages
# Load datasets
# Inspect datasets
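One possible way to approach the inspection step is sketched below. The real files are not reproduced here, so two tiny in-memory CSV strings stand in for child_num_skills.csv and subject.csv; their column names and values are invented for illustration, and the `describe` helper is hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the two CSV files. With the real data, replace
# each io.StringIO(...) with the path to the file in your course folder.
child_csv = io.StringIO("id,addit1,subtr1\n1,1,0\n2,1,1\n3,0,1\n")
yeatman_csv = io.StringIO("subjectID,Age,Handedness\nsubj_01,20,Right\nsubj_02,31,\n")

df_c = pd.read_csv(child_csv)
df_y = pd.read_csv(yeatman_csv)


def describe(name, df):
    """Return a short, readable summary of a DataFrame's shape and columns."""
    n_rows, n_cols = df.shape
    return f"{name}: {n_rows} rows x {n_cols} columns\n  columns: {', '.join(df.columns)}"


print(describe("df_c", df_c))
print(describe("df_y", df_y))
```

Unpacking `df.shape` into rows and columns before printing tends to read more clearly than printing the raw tuple.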
2️⃣ Missing Data Detection with Logical Indexing
Instructions:
Create a loop that goes through each column in df_y
Store in an array whether each column has missing values, e.g. bool_arr = [True, False, ...]
Use this array to count how many columns have missing values
Replace missing Handedness values with 'Unknown'
Drop the remaining rows with missing values
How many participants did you have to exclude from your dataset?
# Loop through df_y
# Replace and drop
# Print the results
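The steps above could be sketched as follows. This is only one possible approach, run on a hypothetical miniature of df_y: every column except Handedness, and all of the values, are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df_y; the real dataset will differ.
df_y = pd.DataFrame({
    "subjectID": ["s1", "s2", "s3", "s4"],
    "Age": [20.0, np.nan, 31.0, 45.0],
    "Handedness": ["Right", None, "Left", None],
    "IQ": [100.0, 110.0, np.nan, 95.0],
})

# Loop over the columns and record whether each one has any missing value.
bool_arr = []
for col in df_y.columns:
    bool_arr.append(bool(df_y[col].isna().any()))
bool_arr = np.array(bool_arr)

# True counts as 1, so the sum is the number of columns with missing values.
n_cols_missing = int(bool_arr.sum())
# Logical indexing: keep only the column names where bool_arr is True.
cols_missing = df_y.columns[bool_arr]

# Fill Handedness first, then drop rows that still contain missing values.
n_before = len(df_y)
df_clean = df_y.copy()
df_clean["Handedness"] = df_clean["Handedness"].fillna("Unknown")
df_clean = df_clean.dropna()
n_excluded = n_before - len(df_clean)

print(f"Columns with missing values: {n_cols_missing} ({list(cols_missing)})")
print(f"Participants excluded: {n_excluded}")
```

The order matters: filling Handedness before calling `dropna()` keeps participants whose only missing value was their handedness.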
3️⃣ Summary Statistics and Interpretation
Instructions:
Calculate and interpret the mean, median, and standard deviation for addit1 and subtr1
Calculate the correlation between all addition items
Load the dataset once more, but this time shuffle its columns using the .sample() method
Once that is done, calculate the correlation between all subtraction items
This time do not explicitly specify df_c[["subit1", "subit2", ...]]
You can use loops, logical indexing, or a combination of both
# Calculate the statistics
# Print the statistics
# Inspect the results
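A minimal sketch of the summary statistics, assuming the items are scored 1 for correct and 0 for incorrect. That coding and the values below are assumptions made for illustration, so check the real df_c before interpreting anything.

```python
import pandas as pd

# Hypothetical item scores (1 = correct, 0 = incorrect); the real df_c may differ.
df_c = pd.DataFrame({
    "addit1": [1, 1, 0, 1, 1],
    "subtr1": [0, 1, 0, 1, 0],
})

# agg() with a list of function names returns one row per statistic.
stats = df_c[["addit1", "subtr1"]].agg(["mean", "median", "std"])

print(stats.round(2))
```

If the items really are binary, the mean can be read as the proportion of children who answered correctly. Note that pandas computes the sample standard deviation (`ddof=1`) by default.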
# Calculate the correlation between "addit" items
# Calculate the correlation between all subtraction items (after shuffle)
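One way to select items without listing them by hand is to match on the column names. The sketch below uses random 0/1 scores and assumes the subtraction items share the prefix "subtr" (as in subtr1); the number of items and the data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three addition and three subtraction items, random 0/1 scores.
rng = np.random.default_rng(0)
cols = ["addit1", "addit2", "addit3", "subtr1", "subtr2", "subtr3"]
df_c = pd.DataFrame(rng.integers(0, 2, size=(30, len(cols))), columns=cols)

# Shuffle the column order: frac=1 keeps every column, axis=1 samples columns.
df_shuffled = df_c.sample(frac=1, axis=1, random_state=42)

# Logical indexing on the column names, so column positions no longer matter.
is_subtr = df_shuffled.columns.str.startswith("subtr")
subtr_corr = df_shuffled.loc[:, is_subtr].corr()

# The same selection written as a loop over the column names.
subtr_cols = [col for col in df_shuffled.columns if col.startswith("subtr")]

print(subtr_corr.round(2))
```

Both selections return the same columns. The boolean mask is more compact, while the loop is easier to extend with extra conditions.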
4️⃣ Visualization
Instructions:
Use either matplotlib or seaborn to create a boxplot of IQ scores by gender
Include IQ, Matrix, and Vocab scores
Add two more rows to your subplots
ax[1]: same data, customize the plot to have a minimal style
ax[2]: same data, customize the plot to be as informative and refined as possible
You can use:
Matplotlib Cheat Sheets
Matplotlib Documentation
Seaborn Cheat Sheets
Seaborn Documentation
# Import necessary packages
# Create your fig and ax objects
# Plot the data
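The three-row layout could be sketched with plain matplotlib as shown below. The column names Gender, IQ, IQ_Matrix, and IQ_Vocab and all values are assumptions standing in for df_y, and the styling choices are only one of many reasonable options.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for df_y; column names and values are assumptions.
rng = np.random.default_rng(1)
n = 40
df_y = pd.DataFrame({
    "Gender": np.tile(["Female", "Male"], n // 2),
    "IQ": rng.normal(110, 12, size=n),
    "IQ_Matrix": rng.normal(55, 8, size=n),
    "IQ_Vocab": rng.normal(60, 9, size=n),
})
scores = ["IQ", "IQ_Matrix", "IQ_Vocab"]
genders = sorted(df_y["Gender"].unique())

# One array of values per (score, gender) pair, in plotting order.
data, labels = [], []
for score in scores:
    for gender in genders:
        data.append(df_y.loc[df_y["Gender"] == gender, score].to_numpy())
        labels.append(f"{score}\n{gender}")
positions = range(1, len(data) + 1)

fig, ax = plt.subplots(nrows=3, figsize=(8, 11))

# ax[0]: default boxplot.
ax[0].boxplot(data)

# ax[1]: minimal style, with outlier markers and two spines removed.
ax[1].boxplot(data, showfliers=False)
for side in ["top", "right"]:
    ax[1].spines[side].set_visible(False)

# ax[2]: more informative, with color by gender, mean markers, and labels.
bp = ax[2].boxplot(data, patch_artist=True, showmeans=True)
for i, box in enumerate(bp["boxes"]):
    box.set_facecolor("tab:orange" if i % 2 == 0 else "tab:blue")
ax[2].set_title("Scores by gender (synthetic data)")
ax[2].set_ylabel("Score")
ax[2].grid(axis="y", alpha=0.3)

# Shared tick labels for all three rows.
for a in ax:
    a.set_xticks(list(positions))
    a.set_xticklabels(labels)

fig.tight_layout()
```

With seaborn, reshaping the data to long format with `pd.melt` and passing the gender column as `hue` to `sns.boxplot` is a common alternative.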
Voluntary Exercise: Full Pipeline (Participant-Level Visualization with Time and Accuracy)
Instructions:
Use the df_c dataset (Child Numeracy Skills)
Plot each participant’s total time to complete the test as a scatter plot
Overlay three horizontal lines (in red) indicating:
The mean completion time
The first quartile (25th percentile)
The third quartile (75th percentile)
On the same plot, draw a second line (in a different color) that represents the total number of correct responses for each participant
Bonus: Try adding a legend to clarify the plot
Bonus: Try sorting the participants in the dataset by class. Can you detect any pattern in the plot?
Finally, explain what you see in the plot. What other kind of plot could you use for further investigation?
# Specify variables you are interested in
# Plot the data
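A possible skeleton for this plot is sketched below. The columns total_time and total_correct are hypothetical: in the real df_c you would probably need to compute them first, for example by summing the relevant item columns. A secondary y-axis is used here because time and accuracy are on different scales, which is a design choice rather than a requirement of the exercise.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical per-participant totals; names, units, and values are invented.
rng = np.random.default_rng(2)
n = 25
df_c = pd.DataFrame({
    "total_time": rng.normal(300, 40, size=n),
    "total_correct": rng.integers(10, 31, size=n),
})

participants = np.arange(len(df_c))
mean_time = df_c["total_time"].mean()
q1, q3 = df_c["total_time"].quantile([0.25, 0.75])

fig, ax = plt.subplots(figsize=(9, 4))
ax.scatter(participants, df_c["total_time"], color="black", label="Total time")

# Three red horizontal reference lines, told apart by line style.
reference_lines = [
    (mean_time, "-", "Mean"),
    (q1, "--", "25th percentile"),
    (q3, ":", "75th percentile"),
]
for value, style, name in reference_lines:
    ax.axhline(value, color="red", linestyle=style, label=name)
ax.set_xlabel("Participant")
ax.set_ylabel("Total time")

# Second y-axis for the number of correct responses.
ax2 = ax.twinx()
ax2.plot(participants, df_c["total_correct"], color="tab:blue", label="Total correct")
ax2.set_ylabel("Total correct")

# Merge the legend entries from both axes into a single legend.
handles1, labels1 = ax.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax.legend(handles1 + handles2, labels1 + labels2, loc="upper right")

fig.tight_layout()
```

For the sorting bonus, calling `sort_values` on the class column before plotting, followed by `reset_index(drop=True)`, would keep the x-axis positions in the sorted order.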