2.4 Exercises

We will now solve practical data analysis tasks. Try to reason through each step before running the code.

Instructions:

# Import packages

# Load datasets

# Inspect datasets

Instructions:

Create a loop that goes through each column in df_y
Store in an array whether each column has missing values:
- Ex. bool_arr = [True, False, ...]
Use this array to count how many columns have missing values
Replace missing Handedness values with 'Unknown'
Drop the remaining rows with missing values
How many participants did you have to exclude from your dataset?
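If the steps above feel abstract, the following minimal sketch shows the same pattern on a tiny made-up DataFrame. It is not the solution for df_y: the toy data and every column name except Handedness are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_y; "ID" and "Age" are made-up column names
df_toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Handedness": ["R", None, "L", "R", None],
    "Age": [23.0, 31.0, np.nan, 27.0, 45.0],
})

# Record, column by column, whether any value is missing
bool_arr = []
for col in df_toy.columns:
    bool_arr.append(bool(df_toy[col].isna().any()))

# True counts as 1, so the sum is the number of columns with missing values
n_cols_missing = sum(bool_arr)

# Replace missing Handedness first, then drop rows that still have missing values
n_before = len(df_toy)
df_clean = df_toy.copy()
df_clean["Handedness"] = df_clean["Handedness"].fillna("Unknown")
df_clean = df_clean.dropna()
n_excluded = n_before - len(df_clean)

print(bool_arr)
print(n_cols_missing, n_excluded)
```

The order matters here: filling Handedness before calling dropna() keeps participants whose only missing value was their handedness.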

# Loop through df_y

# Replace and drop

# Print the results

Instructions:

Calculate and interpret the mean, median, and standard deviation for addit1 and subtr1
Calculate the correlation between all addition items.
Load the dataset once more, but this time shuffle its columns using the .sample() method
Once that is done, calculate the correlation between all subtraction items
- This time do not explicitly specify df_c[["subit1", "subit2", ...]]
- You can use loops, logical indexing, or a combination of both
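As a hedged illustration of the techniques named above, the sketch below uses a small random DataFrame in place of df_c. The column names follow the addit/subtr pattern mentioned in the exercise, but the number of items and the 0/1 coding are assumptions, so adapt the prefix and data to what you actually find in your dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_c; item count and 0/1 coding are assumptions
rng = np.random.default_rng(0)
df_toy = pd.DataFrame(
    rng.integers(0, 2, size=(50, 6)),
    columns=["addit1", "addit2", "addit3", "subtr1", "subtr2", "subtr3"],
)

# Mean, median, and standard deviation for two items in one call
stats = df_toy[["addit1", "subtr1"]].agg(["mean", "median", "std"])

# Shuffle the column order: sample all columns (frac=1) along axis=1
df_shuffled = df_toy.sample(frac=1, axis=1, random_state=1)

# Logical indexing on the column names instead of listing them by hand
is_subtr = df_shuffled.columns.str.startswith("subtr")
subtr_corr = df_shuffled.loc[:, is_subtr].corr()

print(stats)
print(subtr_corr)
```

Selecting by prefix works regardless of where the shuffled columns end up, which is the point of not hard-coding the column list.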

# Calculate the statistics

# Print the statistics

# Inspect the results

# Calculate the correlation between "addit" items

# Calculate the correlation between all subtraction items (after shuffle)

Instructions:

Use either matplotlib or seaborn to create a boxplot of IQ scores by gender
Include IQ, Matrix, and Vocab scores
Add two more rows to your subplots
ax[1]: same data, customize the plot to have a minimal style
ax[2]: same data, customize the plot to be as informative and refined as possible
You can use:
- Matplotlib Cheat Sheets
- Matplotlib Documentation
- Seaborn Cheat Sheets
- Seaborn Documentation
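One possible way to structure the three-row figure is sketched below with matplotlib only. The data are randomly generated, and the column names (Gender, IQ, Matrix, Vocab), the colors, and the helper draw_boxes are all hypothetical; treat the styling choices as one option among many rather than the expected answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import numpy as np
import pandas as pd

# Toy stand-in for the real data; all column names here are assumptions
rng = np.random.default_rng(0)
n = 80
df_toy = pd.DataFrame({
    "Gender": np.tile(["F", "M"], n // 2),
    "IQ": rng.normal(100, 15, size=n),
    "Matrix": rng.normal(100, 15, size=n),
    "Vocab": rng.normal(100, 15, size=n),
})

scores = ["IQ", "Matrix", "Vocab"]
colors = {"F": "tab:orange", "M": "tab:blue"}
width = 0.35


def draw_boxes(axis):
    """Draw one box per score and gender, side by side (hypothetical helper)."""
    for j, (gender, color) in enumerate(colors.items()):
        subset = df_toy[df_toy["Gender"] == gender]
        data = [subset[s].to_numpy() for s in scores]
        positions = np.arange(len(scores)) + (j - 0.5) * width
        boxes = axis.boxplot(data, positions=positions, widths=0.9 * width,
                             patch_artist=True, manage_ticks=False)
        for patch in boxes["boxes"]:
            patch.set_facecolor(color)
    axis.set_xticks(np.arange(len(scores)))
    axis.set_xticklabels(scores)


fig, ax = plt.subplots(3, 1, figsize=(6, 10))
for axis in ax:
    draw_boxes(axis)

# ax[0]: left at matplotlib defaults

# ax[1]: minimal style, hide spines and tick marks that carry no data
for side in ["top", "right", "left"]:
    ax[1].spines[side].set_visible(False)
ax[1].tick_params(left=False)

# ax[2]: informative style with title, axis labels, grid, and legend
ax[2].set_title("Scores by gender")
ax[2].set_xlabel("Measure")
ax[2].set_ylabel("Score")
ax[2].grid(axis="y", alpha=0.3)
ax[2].legend(handles=[Patch(facecolor=c, label=g) for g, c in colors.items()],
             title="Gender")

fig.tight_layout()
```

With seaborn, the same grouping can be expressed more compactly by reshaping the data to long format and passing the gender column as hue.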

# Import necessary packages

# Create your fig and ax objects

# Plot the data

Instructions:

Use the df_c dataset (Child Numeracy Skills)
Plot each participant’s total time to complete the test as a scatter plot
Overlay three horizontal lines (in red) indicating:
- The mean completion time
- The first quartile (25th percentile)
- The third quartile (75th percentile)
On the same plot, draw an additional line (in a different color) that represents the total number of correct responses for each participant
Bonus: Try adding a legend to clarify the plot
Bonus: Try sorting the participants in the dataset by class. Can you detect any pattern in the plot?
Finally, explain what you see in the plot - what other kind of plot could you use for further investigation?
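The layering described above (scatter, horizontal reference lines, and an extra line) can be sketched as follows. The data are random, and the column names total_time and total_correct are hypothetical placeholders, not the actual names in df_c.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for df_c; both column names are assumptions
rng = np.random.default_rng(0)
df_toy = pd.DataFrame({
    "total_time": rng.normal(300, 40, size=60),
    "total_correct": rng.integers(10, 40, size=60),
})

# Summary values for the three horizontal reference lines
mean_time = df_toy["total_time"].mean()
q1, q3 = df_toy["total_time"].quantile([0.25, 0.75])

fig, ax = plt.subplots(figsize=(8, 4))

# One point per participant
ax.scatter(df_toy.index, df_toy["total_time"], s=15, label="Completion time")

# Red horizontal lines, told apart by line style
ax.axhline(mean_time, color="red", linestyle="-", label="Mean")
ax.axhline(q1, color="red", linestyle="--", label="25th percentile")
ax.axhline(q3, color="red", linestyle=":", label="75th percentile")

# Correct responses per participant as a line in a different color
ax.plot(df_toy.index, df_toy["total_correct"], color="tab:green",
        label="Correct responses")

ax.set_xlabel("Participant")
ax.set_ylabel("Value")
ax.legend()
fig.tight_layout()
```

If the two variables are on very different scales in your data, a secondary y-axis created with ax.twinx() is one way to keep both readable while staying on the same plot.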

# Specify variables you are interested in

# Plot the data