2.1 Data Handling#

First, we load the dataset:

Loading Data#

import pandas as pd

df_c = pd.read_csv("./data/child_num_skills.csv", index_col=0)  
df_c.head()  
addit1 addit2 addit3 addit4 addit5 addit6 addit7 addit8 subtr1 subtr2 subtr3 subtr4 subtr5 subtr6 subtr7 subtr8 class time
1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 second 39
2 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 second 37
3 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 second 34
4 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 second 28
5 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 second 32

Exploring Data#

Once loaded, we can inspect the structure of our dataset using a variety of methods. For example, .info() gives a summary of each column, its data type, and its non-null count, while .columns lists the column names:

print(df_c.info())        # Summary including data types and non-null counts
print(df_c.columns)       # List of column names
<class 'pandas.core.frame.DataFrame'>
Index: 341 entries, 1 to 341
Data columns (total 18 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   addit1  341 non-null    int64 
 1   addit2  341 non-null    int64 
 2   addit3  341 non-null    int64 
 3   addit4  341 non-null    int64 
 4   addit5  341 non-null    int64 
 5   addit6  341 non-null    int64 
 6   addit7  341 non-null    int64 
 7   addit8  341 non-null    int64 
 8   subtr1  341 non-null    int64 
 9   subtr2  341 non-null    int64 
 10  subtr3  341 non-null    int64 
 11  subtr4  341 non-null    int64 
 12  subtr5  341 non-null    int64 
 13  subtr6  341 non-null    int64 
 14  subtr7  341 non-null    int64 
 15  subtr8  341 non-null    int64 
 16  class   341 non-null    object
 17  time    341 non-null    int64 
dtypes: int64(17), object(1)
memory usage: 50.6+ KB
None
Index(['addit1', 'addit2', 'addit3', 'addit4', 'addit5', 'addit6', 'addit7',
       'addit8', 'subtr1', 'subtr2', 'subtr3', 'subtr4', 'subtr5', 'subtr6',
       'subtr7', 'subtr8', 'class', 'time'],
      dtype='object')
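Beyond .info() and .columns, a few other methods are useful for a first look. Here is a minimal sketch on a small toy DataFrame (constructed inline for illustration, mimicking the binary item scores and time column of df_c):

```python
import pandas as pd

# Toy frame with the same kind of structure as df_c: binary item scores plus a time column
toy = pd.DataFrame({"addit1": [1, 0, 1], "subtr1": [1, 1, 0], "time": [39, 37, 34]})

print(toy.shape)       # (rows, columns)
print(toy.dtypes)      # Data type of each column
print(toy.describe())  # Summary statistics for the numeric columns
```

The same calls work unchanged on df_c itself.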

Handling Missing Data#

Real-world datasets often include missing values. The Child Numeracy Skills dataset does not contain any, so let us load the Yeatman dataset for this part:

df_y = pd.read_csv("https://yeatmanlab.github.io/AFQBrowser-demo/data/subjects.csv",
                      usecols=[1,2,3,4,5,6,7],
                      na_values="NaN",
                      index_col=0)
print(df_y.head())
             Age  Gender Handedness     IQ  IQ_Matrix  IQ_Vocab
subjectID                                                      
subject_000   20    Male        NaN  139.0       65.0      77.0
subject_001   31    Male        NaN  129.0       58.0      74.0
subject_002   18  Female        NaN  130.0       63.0      70.0
subject_003   28    Male      Right    NaN        NaN       NaN
subject_004   29    Male        NaN    NaN        NaN       NaN

We can count the number of missing values per column:

missing_values_count = df_y.isna().sum()    # Count of missing values per column

print("In our dataset the following columns have the specified number of missing values:")
print(missing_values_count)
In our dataset the following columns have the specified number of missing values:
Age            0
Gender         1
Handedness    11
IQ            14
IQ_Matrix     14
IQ_Vocab      14
dtype: int64
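Besides counting, it is often helpful to look at the affected rows directly. A minimal sketch, using a toy DataFrame constructed inline (with the same column names as df_y) rather than the real dataset:

```python
import pandas as pd
import numpy as np

# Toy frame with one incomplete row, mimicking df_y's columns
toy = pd.DataFrame({"Age": [20, 31, 18],
                    "IQ": [139.0, np.nan, 130.0]})

# Boolean mask marking rows that contain at least one NaN
has_missing = toy.isna().any(axis=1)
print(toy[has_missing])  # Only the rows with missing values
```

The same pattern, df_y[df_y.isna().any(axis=1)], shows every incomplete row in the Yeatman data.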

NaN values need to be handled, or many analyses will fail. We could fix our dataset using either of the following strategies:

  • Use pandas method .dropna().

  • Use pandas method .fillna().

df_y['Handedness'] = df_y['Handedness'].fillna(0)     # Replace missing values in a column
df_y = df_y.dropna()                                  # Drop all rows with any missing value

df_y.head()
             Age  Gender Handedness     IQ  IQ_Matrix  IQ_Vocab
subjectID
subject_000   20    Male          0  139.0       65.0      77.0
subject_001   31    Male          0  129.0       58.0      74.0
subject_002   18  Female          0  130.0       63.0      70.0
subject_006   39    Male          0  138.0       70.0      71.0
subject_007   34    Male      Right  115.0       66.0      51.0
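Dropping rows discards information; a common alternative for numeric columns is to impute missing values, for instance with the column mean. A minimal sketch on a toy IQ column (constructed inline for illustration):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"IQ": [139.0, np.nan, 130.0, np.nan]})

# Replace missing IQ scores with the column mean instead of dropping the rows
toy["IQ"] = toy["IQ"].fillna(toy["IQ"].mean())
print(toy["IQ"].tolist())  # [139.0, 134.5, 130.0, 134.5]
```

Whether imputation is appropriate depends on the analysis; it keeps the sample size but can bias variance estimates.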

Filtering and Subsetting#

Subset data through logical indexing or by selecting specific columns:

df_c_filtered = df_c[df_c["time"] > 40]         # Rows where total time taken is greater than 40
df_c_subset = df_c[["time", "addit2"]]          # Only keep selected columns

df_c_filtered.head()                             
addit1 addit2 addit3 addit4 addit5 addit6 addit7 addit8 subtr1 subtr2 subtr3 subtr4 subtr5 subtr6 subtr7 subtr8 class time
11 1 1 1 1 1 1 1 0 0 1 0 1 1 1 1 1 second 47
13 0 0 0 1 0 0 1 1 1 1 0 1 1 1 0 1 second 52
14 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 second 43
15 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 second 45
17 1 1 1 1 1 1 1 0 1 0 0 1 1 0 0 0 second 44
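Conditions can also be combined with & (and) and | (or); each condition must be wrapped in its own parentheses. A minimal sketch on a toy frame (constructed inline, with a time and an item column as in df_c):

```python
import pandas as pd

toy = pd.DataFrame({"time": [47, 52, 43, 28],
                    "addit1": [1, 0, 1, 1]})

# Rows where time > 40 AND the first addition item was solved correctly
slow_and_correct = toy[(toy["time"] > 40) & (toy["addit1"] == 1)]
print(slow_and_correct)
```

The same expression applied to df_c would select slow children who solved addit1.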

Alternatively, we can drop specific columns using the .drop() method.

df_only_items = df_c.drop(columns=['class', 'time'])

df_only_items.head()                        
addit1 addit2 addit3 addit4 addit5 addit6 addit7 addit8 subtr1 subtr2 subtr3 subtr4 subtr5 subtr6 subtr7 subtr8
1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1
2 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1
3 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1
4 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1
5 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1
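Because the item columns share common prefixes, we can also select them by name pattern with .filter() and then aggregate across items, e.g. to get a per-child total score. A minimal sketch on a toy frame constructed inline:

```python
import pandas as pd

toy = pd.DataFrame({"addit1": [1, 0], "addit2": [1, 1],
                    "subtr1": [0, 1], "time": [39, 37]})

# Select all addition items by column-name prefix
addition_items = toy.filter(like="addit")
print(addition_items.columns.tolist())      # ['addit1', 'addit2']

# Total addition score per child (sum across the selected item columns)
print(addition_items.sum(axis=1).tolist())  # [2, 1]
```

On df_c, df_c.filter(like="addit") and df_c.filter(like="subtr") split the items into the two skill sets.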

Full control#

Here we only saw a few applications. If you want to feel like you are in full control, I highly recommend checking the pandas documentation.