2.4 Exercises

Data Curation and Exploration

We will now solve practical data analysis tasks. Try to reason through each step before running the code.

1️⃣ Load and Inspect the Data

Instructions:

  • Download data here and store it in your Test_Theory_and_Costruction folder.

  • Load the child_num_skills.csv dataset

  • Load the Yeatman (subject.csv) dataset

  • Print the shape and column names in a clear format for both datasets

# Import packages


# Load datasets


# Inspect datasets
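
If you are unsure how to print the shape and column names "in a clear format", the following minimal sketch shows one possible approach. It deliberately uses a tiny in-memory CSV instead of the exercise files, so the column names (id, score_a, score_b) and the helper describe_frame are made up for illustration and are not part of the real datasets.

```python
import io

import pandas as pd

# Toy CSV standing in for a real file; these column names are hypothetical.
toy_csv = io.StringIO("id,score_a,score_b\n1,10,7\n2,12,9\n3,8,6\n")
df_toy = pd.read_csv(toy_csv)


def describe_frame(name, df):
    """Return a short, readable summary of a DataFrame's shape and columns."""
    n_rows, n_cols = df.shape
    lines = [f"{name}: {n_rows} rows x {n_cols} columns"]
    lines += [f"  - {col}" for col in df.columns]
    return "\n".join(lines)


summary = describe_frame("df_toy", df_toy)
print(summary)
```

For the exercise itself you would pass a file path to pd.read_csv instead of the StringIO object and call the helper once per dataset.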

2️⃣ Missing Data Detection with Logical Indexing

Instructions:

  • Create a loop that goes through each column in df_y

  • Store in an array whether each column has missing values or not:

    • Ex. bool_arr = [True, False, ...]

  • Use this array to count how many columns have missing values

  • Replace missing values in Handedness with 'Unknown'

  • Drop the remaining rows with missing values

  • How many participants did you have to exclude from your dataset?

# Loop through df_y


# Replace and drop


# Print the results
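
The steps above can be sketched on a small toy DataFrame. This is only an illustration under stated assumptions: the columns age, hand, and score are hypothetical stand-ins and do not come from df_y, so the counts will differ from your own results.

```python
import numpy as np
import pandas as pd

# Toy data; the column names are hypothetical, not those of df_y.
df_toy = pd.DataFrame({
    "age": [10.0, np.nan, 12.0, 11.0],
    "hand": ["R", None, "L", None],
    "score": [1, 2, 3, 4],
})

# One boolean per column: True if the column contains any missing value.
bool_arr = []
for col in df_toy.columns:
    bool_arr.append(bool(df_toy[col].isna().any()))

# True counts as 1, so summing the list counts the affected columns.
n_cols_missing = sum(bool_arr)

# Logical indexing: keep only the column names where the mask is True.
cols_with_missing = list(df_toy.columns[np.array(bool_arr)])

# Fill one column with a placeholder, then drop rows that are still incomplete.
n_before = len(df_toy)
df_clean = df_toy.copy()
df_clean["hand"] = df_clean["hand"].fillna("Unknown")
df_clean = df_clean.dropna()
n_excluded = n_before - len(df_clean)

print(f"Columns with missing values: {n_cols_missing} {cols_with_missing}")
print(f"Rows excluded: {n_excluded}")
```

Recording the row count before and after dropna is what lets you report how many participants were excluded.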

3️⃣ Summary Statistics and Interpretation

Instructions:

  • Calculate and interpret the mean, median, and standard deviation for addit1 and subtr1

  • Calculate the correlation between all addition items.

  • Load the dataset once more, but this time shuffle its columns using the .sample() method

  • Once that is done, calculate the correlation between all subtraction items

    • This time do not explicitly specify df_c[["subit1", "subit2", ...]]

    • You can use loops, logical indexing, or a combination of both

# Calculate the statistics


# Print the statistics


# Inspect the results
# Calculate the correlation between "addit" items
# Calculate the correlation between all subtraction items (after shuffle)
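
As a hint for selecting columns without listing them by hand, here is a minimal sketch on toy data. The column names (item_a1, item_b1, ...) are hypothetical and chosen only to mimic two families of items. The sketch shuffles the columns with .sample(), then uses a boolean mask on the column names to pick one family.

```python
import pandas as pd

# Toy data with two hypothetical item families, "item_a" and "item_b".
df_toy = pd.DataFrame({
    "item_a1": [1, 0, 1, 1, 0],
    "item_b1": [0, 0, 1, 1, 1],
    "item_a2": [1, 0, 1, 0, 0],
    "item_b2": [0, 1, 1, 1, 1],
})

# Mean, median, and standard deviation for one column.
# Note: pandas computes the sample standard deviation (ddof=1) by default.
stats = df_toy["item_a1"].agg(["mean", "median", "std"])

# Shuffle the column order: frac=1 keeps every column, axis=1 samples columns.
shuffled = df_toy.sample(frac=1, axis=1, random_state=0)

# Logical indexing on the column names: True where the name starts with "item_a".
mask = shuffled.columns.str.startswith("item_a")
corr_a = shuffled.loc[:, mask].corr()

print(stats)
print(corr_a)
```

Because the mask is built from the names themselves, the selection keeps working no matter how the columns were shuffled.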

4️⃣ Visualization

Instructions:

  • Use either matplotlib or seaborn to create a boxplot of IQ scores by gender

  • Include IQ, Matrix, and Vocab scores

  • Add two more rows to your subplots

  • ax[1]: same data, customize the plot to have a minimal style

  • ax[2]: same data, customize the plot to be as informative and refined as possible

  • You can use:

# Import necessary packages


# Create your fig and ax objects


# Plot the data
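
The three-row layout can be sketched as follows. This is a simplified illustration rather than a solution: it uses randomly generated toy data with hypothetical columns group and score, a single measure instead of three, and plain matplotlib. The styling choices in the second and third rows are just examples of what "minimal" and "informative" could mean.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data; "group" and "score" are hypothetical column names.
rng = np.random.default_rng(0)
df_toy = pd.DataFrame({
    "group": np.repeat(["A", "B"], 20),
    "score": np.concatenate([rng.normal(100, 10, 20), rng.normal(105, 12, 20)]),
})

# One array of scores per group, in a fixed label order.
labels = sorted(df_toy["group"].unique())
data = [df_toy.loc[df_toy["group"] == g, "score"].to_numpy() for g in labels]
positions = list(range(1, len(labels) + 1))  # default boxplot positions

fig, ax = plt.subplots(3, 1, figsize=(6, 10))

# ax[0]: default boxplot
ax[0].boxplot(data)

# ax[1]: minimal style, no outlier markers and no top/right frame lines
ax[1].boxplot(data, showfliers=False)
for side in ("top", "right"):
    ax[1].spines[side].set_visible(False)

# ax[2]: more informative, with means, axis labels, title, and a light grid
ax[2].boxplot(data, showmeans=True)
ax[2].set_title("Score by group (toy data)")
ax[2].set_xlabel("Group")
ax[2].set_ylabel("Score")
ax[2].grid(axis="y", alpha=0.3)

# Label the boxes with the group names on every row.
for a in ax:
    a.set_xticks(positions)
    a.set_xticklabels(labels)

fig.tight_layout()
```

With seaborn, a long-format DataFrame and sns.boxplot would achieve a similar result with less manual grouping.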

Voluntary Exercise: Full Pipeline - Participant-Level Visualization with Time and Accuracy

Instructions:

  • Use the df_c dataset (Child Numeracy Skills)

  • Plot each participant’s total time to complete the test as a scatter plot

  • Overlay three horizontal lines (in red) indicating:

    • The mean completion time

    • The first quartile (25th percentile)

    • The third quartile (75th percentile)

  • On the same plot, draw a second line (in a different color) that represents the total number of correct responses for each participant

  • Bonus: Try adding a legend to clarify the plot

  • Bonus: Try sorting the participants in the dataset by class. Can you detect any pattern in the plot?

  • Finally, explain what you see in the plot - what other kind of plot could you use for further investigation?

# Specify variables you are interested in


# Plot the data
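
One way to combine the scatter plot, the reference lines, and the second line is sketched below on randomly generated toy data. The columns total_time and n_correct are hypothetical, so in df_c you would first have to compute these totals yourself. Putting the correct-response line on a second y-axis with twinx() is an assumption made here because the two quantities are on different scales, not a requirement of the exercise.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data; "total_time" and "n_correct" are hypothetical column names.
rng = np.random.default_rng(1)
df_toy = pd.DataFrame({
    "total_time": rng.uniform(60, 300, 30),
    "n_correct": rng.integers(5, 21, 30),
})

x = np.arange(len(df_toy))  # participant index on the x-axis
mean_t = df_toy["total_time"].mean()
q1, q3 = df_toy["total_time"].quantile([0.25, 0.75])

fig, ax = plt.subplots(figsize=(8, 4))
ax.scatter(x, df_toy["total_time"], color="gray", label="Total time")

# Three red horizontal reference lines, told apart by line style.
ax.axhline(mean_t, color="red", linestyle="-", label="Mean")
ax.axhline(q1, color="red", linestyle="--", label="25th percentile")
ax.axhline(q3, color="red", linestyle=":", label="75th percentile")
ax.set_xlabel("Participant")
ax.set_ylabel("Total time")

# Second y-axis for the number of correct responses.
ax2 = ax.twinx()
ax2.plot(x, df_toy["n_correct"], color="tab:blue", label="Correct responses")
ax2.set_ylabel("Correct responses")

# Merge the legend entries of both axes into a single legend.
handles1, labels1 = ax.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax.legend(handles1 + handles2, labels1 + labels2, loc="upper right")

fig.tight_layout()
```

Sorting the DataFrame before plotting (for example with sort_values) changes only the order along the x-axis, which is what the class-sorting bonus relies on.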