2.1 Data Handling#

First, we load the dataset:

Loading Data#

import pandas as pd

df_c = pd.read_csv("./data/child_num_skills.csv", index_col=0)  
df_c.head()  
addit1 addit2 addit3 addit4 addit5 addit6 addit7 addit8 subtr1 subtr2 subtr3 subtr4 subtr5 subtr6 subtr7 subtr8 class time
1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 second 39
2 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 second 37
3 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 second 34
4 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 second 28
5 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 second 32

Exploring Data#

Once loaded, we can inspect the structure of our dataset using a variety of methods. For example, .info() gives a summary of each column, its data type, and its non-null count, while .columns lists the column names:

print(df_c.info())        # Summary including data types and non-null counts
print(df_c.columns)       # List of column names
<class 'pandas.core.frame.DataFrame'>
Index: 341 entries, 1 to 341
Data columns (total 18 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   addit1  341 non-null    int64 
 1   addit2  341 non-null    int64 
 2   addit3  341 non-null    int64 
 3   addit4  341 non-null    int64 
 4   addit5  341 non-null    int64 
 5   addit6  341 non-null    int64 
 6   addit7  341 non-null    int64 
 7   addit8  341 non-null    int64 
 8   subtr1  341 non-null    int64 
 9   subtr2  341 non-null    int64 
 10  subtr3  341 non-null    int64 
 11  subtr4  341 non-null    int64 
 12  subtr5  341 non-null    int64 
 13  subtr6  341 non-null    int64 
 14  subtr7  341 non-null    int64 
 15  subtr8  341 non-null    int64 
 16  class   341 non-null    object
 17  time    341 non-null    int64 
dtypes: int64(17), object(1)
memory usage: 50.6+ KB
None
Index(['addit1', 'addit2', 'addit3', 'addit4', 'addit5', 'addit6', 'addit7',
       'addit8', 'subtr1', 'subtr2', 'subtr3', 'subtr4', 'subtr5', 'subtr6',
       'subtr7', 'subtr8', 'class', 'time'],
      dtype='object')
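Beyond .info() and .columns, a few other methods are useful for a first look. Here is a minimal sketch on a small toy DataFrame (constructed inline for illustration, mimicking the binary item scores and time column of df_c):

```python
import pandas as pd

# Toy frame with the same kind of structure as df_c: binary item scores plus a time column
toy = pd.DataFrame({"addit1": [1, 0, 1], "subtr1": [1, 1, 0], "time": [39, 37, 34]})

print(toy.shape)       # (rows, columns)
print(toy.dtypes)      # Data type of each column
print(toy.describe())  # Summary statistics for the numeric columns
```

The same calls work unchanged on df_c itself.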

Handling Missing Data#

Real-world datasets often include missing values. The Child Numeracy Skills dataset does not contain any, so let us load the Yeatman dataset for this part:

df_y = pd.read_csv("https://yeatmanlab.github.io/AFQBrowser-demo/data/subjects.csv",
                      usecols=[1,2,3,4,5,6,7],
                      na_values="NaN",
                      index_col=0)
print(df_y.head())
             Age  Gender Handedness     IQ  IQ_Matrix  IQ_Vocab
subjectID                                                      
subject_000   20    Male        NaN  139.0       65.0      77.0
subject_001   31    Male        NaN  129.0       58.0      74.0
subject_002   18  Female        NaN  130.0       63.0      70.0
subject_003   28    Male      Right    NaN        NaN       NaN
subject_004   29    Male        NaN    NaN        NaN       NaN

We can count the number of missing values per column:

missing_values_count = df_y.isna().sum()    # Count of missing values per column

print("In our dataset the following columns have the specified number of missing values:")
print(missing_values_count)
In our dataset the following columns have the specified number of missing values:
Age            0
Gender         1
Handedness    11
IQ            14
IQ_Matrix     14
IQ_Vocab      14
dtype: int64
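Besides counting, it is often helpful to look at the affected rows directly. A minimal sketch, using a toy DataFrame constructed inline (with the same column names as df_y) rather than the real dataset:

```python
import pandas as pd
import numpy as np

# Toy frame with one incomplete row, mimicking df_y's columns
toy = pd.DataFrame({"Age": [20, 31, 18],
                    "IQ": [139.0, np.nan, 130.0]})

# Boolean mask marking rows that contain at least one NaN
has_missing = toy.isna().any(axis=1)
print(toy[has_missing])  # Only the rows with missing values
```

The same pattern, df_y[df_y.isna().any(axis=1)], shows every incomplete row in the Yeatman data.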

NaN values need to be handled, or many analyses will fail. We could fix our dataset using either of the following strategies:

  • Use pandas method .dropna().

  • Use pandas method .fillna().

df_y['Handedness'] = df_y['Handedness'].fillna(0)     # Replace missing values in a column
df_y = df_y.dropna()                                  # Drop all rows with any missing value

df_y.head()
             Age  Gender Handedness     IQ  IQ_Matrix  IQ_Vocab
subjectID
subject_000   20    Male          0  139.0       65.0      77.0
subject_001   31    Male          0  129.0       58.0      74.0
subject_002   18  Female          0  130.0       63.0      70.0
subject_006   39    Male          0  138.0       70.0      71.0
subject_007   34    Male      Right  115.0       66.0      51.0
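Dropping rows discards information; a common alternative for numeric columns is to impute missing values, for instance with the column mean. A minimal sketch on a toy IQ column (constructed inline for illustration):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"IQ": [139.0, np.nan, 130.0, np.nan]})

# Replace missing IQ scores with the column mean instead of dropping the rows
toy["IQ"] = toy["IQ"].fillna(toy["IQ"].mean())
print(toy["IQ"].tolist())  # [139.0, 134.5, 130.0, 134.5]
```

Whether imputation is appropriate depends on the analysis; it keeps the sample size but can bias variance estimates.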

Filtering and Subsetting#

Subset data through logical indexing or by selecting specific columns:

df_c_filtered = df_c[df_c["time"] > 40]         # Rows where total time taken is greater than 40
df_c_subset = df_c[["time", "addit2"]]          # Only keep selected columns

df_c_filtered.head()                             
addit1 addit2 addit3 addit4 addit5 addit6 addit7 addit8 subtr1 subtr2 subtr3 subtr4 subtr5 subtr6 subtr7 subtr8 class time
11 1 1 1 1 1 1 1 0 0 1 0 1 1 1 1 1 second 47
13 0 0 0 1 0 0 1 1 1 1 0 1 1 1 0 1 second 52
14 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 second 43
15 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 second 45
17 1 1 1 1 1 1 1 0 1 0 0 1 1 0 0 0 second 44
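Conditions can also be combined with & (and) and | (or); each condition must be wrapped in its own parentheses. A minimal sketch on a toy frame (constructed inline, with a time and an item column as in df_c):

```python
import pandas as pd

toy = pd.DataFrame({"time": [47, 52, 43, 28],
                    "addit1": [1, 0, 1, 1]})

# Rows where time > 40 AND the first addition item was solved correctly
slow_and_correct = toy[(toy["time"] > 40) & (toy["addit1"] == 1)]
print(slow_and_correct)
```

The same expression applied to df_c would select slow children who solved addit1.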

Alternatively, we can drop specific columns using the .drop() method.

df_only_items = df_c.drop(columns=['class', 'time'])

df_only_items.head()                        
addit1 addit2 addit3 addit4 addit5 addit6 addit7 addit8 subtr1 subtr2 subtr3 subtr4 subtr5 subtr6 subtr7 subtr8
1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1
2 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1
3 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1
4 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1
5 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1
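Because the item columns share common prefixes, we can also select them by name pattern with .filter() and then aggregate across items, e.g. to get a per-child total score. A minimal sketch on a toy frame constructed inline:

```python
import pandas as pd

toy = pd.DataFrame({"addit1": [1, 0], "addit2": [1, 1],
                    "subtr1": [0, 1], "time": [39, 37]})

# Select all addition items by column-name prefix
addition_items = toy.filter(like="addit")
print(addition_items.columns.tolist())      # ['addit1', 'addit2']

# Total addition score per child (sum across the selected item columns)
print(addition_items.sum(axis=1).tolist())  # [2, 1]
```

On df_c, df_c.filter(like="addit") and df_c.filter(like="subtr") split the items into the two skill sets.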

Full control#

Here we only saw a few applications. If you want to feel like you are in full control, I highly recommend checking the pandas documentation.