2.1 Data Handling#
First, we load the dataset:
Loading Data#
import pandas as pd
df_c = pd.read_csv("./data/child_num_skills.csv", index_col=0)
df_c.head()
addit1 | addit2 | addit3 | addit4 | addit5 | addit6 | addit7 | addit8 | subtr1 | subtr2 | subtr3 | subtr4 | subtr5 | subtr6 | subtr7 | subtr8 | class | time | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | second | 39 |
2 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | second | 37 |
3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | second | 34 |
4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | second | 28 |
5 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | second | 32 |
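Since the child_num_skills.csv file may not be at hand, the effect of index_col=0 can be sketched on a small in-memory CSV (the values and column subset below are synthetic, not from the real dataset):

```python
import io

import pandas as pd

# A tiny stand-in for child_num_skills.csv with synthetic values.
csv_text = """id,addit1,addit2,class,time
1,1,1,second,39
2,1,0,second,37
"""

# index_col=0 promotes the first column ("id") to the row index,
# so it no longer counts as a data column.
df_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_demo)
```

Without index_col=0, pandas would instead create a default integer index starting at 0 and keep "id" as an ordinary column.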
Exploring Data#
Once loaded, we can inspect the structure of our dataset using a variety of methods. For example, .info() gives a summary of data types and non-null counts, and .columns lists the column names:
print(df_c.info()) # Summary including data types and non-null counts
print(df_c.columns) # List of column names
<class 'pandas.core.frame.DataFrame'>
Index: 341 entries, 1 to 341
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 addit1 341 non-null int64
1 addit2 341 non-null int64
2 addit3 341 non-null int64
3 addit4 341 non-null int64
4 addit5 341 non-null int64
5 addit6 341 non-null int64
6 addit7 341 non-null int64
7 addit8 341 non-null int64
8 subtr1 341 non-null int64
9 subtr2 341 non-null int64
10 subtr3 341 non-null int64
11 subtr4 341 non-null int64
12 subtr5 341 non-null int64
13 subtr6 341 non-null int64
14 subtr7 341 non-null int64
15 subtr8 341 non-null int64
16 class 341 non-null object
17 time 341 non-null int64
dtypes: int64(17), object(1)
memory usage: 50.6+ KB
None
Index(['addit1', 'addit2', 'addit3', 'addit4', 'addit5', 'addit6', 'addit7',
'addit8', 'subtr1', 'subtr2', 'subtr3', 'subtr4', 'subtr5', 'subtr6',
'subtr7', 'subtr8', 'class', 'time'],
dtype='object')
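Beyond .info() and .columns, two other quick checks are .shape and .describe(). A sketch on a small synthetic frame with the same kind of layout (binary item scores plus a time column):

```python
import pandas as pd

# Synthetic stand-in: binary item scores plus a completion-time column.
df = pd.DataFrame({"addit1": [1, 1, 0], "subtr1": [0, 1, 1], "time": [39, 37, 34]})

print(df.shape)       # (number of rows, number of columns)
print(df.describe())  # count, mean, std, min, quartiles, max per numeric column
```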
Handling Missing Data#
Real-world datasets often include missing values. The Child Numeracy Skills dataset does not contain any, so let us load the Yeatman dataset for this part:
df_y = pd.read_csv("https://yeatmanlab.github.io/AFQBrowser-demo/data/subjects.csv",
usecols=[1,2,3,4,5,6,7],
na_values="NaN",
index_col=0)
print(df_y.head())
Age Gender Handedness IQ IQ_Matrix IQ_Vocab
subjectID
subject_000 20 Male NaN 139.0 65.0 77.0
subject_001 31 Male NaN 129.0 58.0 74.0
subject_002 18 Female NaN 130.0 63.0 70.0
subject_003 28 Male Right NaN NaN NaN
subject_004 29 Male NaN NaN NaN NaN
We can count the number of missing values per column:
missing_values_count = df_y.isna().sum() # Count of missing values per column
print("In our dataset the following columns have the specified number of missing values:")
print(missing_values_count)
In our dataset the following columns have the specified number of missing values:
Age 0
Gender 1
Handedness 11
IQ 14
IQ_Matrix 14
IQ_Vocab 14
dtype: int64
NaN values need to be removed, or we won't be able to perform our analyses. We could fix our dataset using either one of the following strategies:

- Use the pandas method .dropna().
- Use the pandas method .fillna().
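The trade-off between the two strategies can be seen on a small synthetic frame: .dropna() shrinks the dataset by discarding incomplete rows, while .fillna() keeps every row and substitutes values of our choosing (here, the column mean for IQ and a placeholder label for Gender; both choices are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic frame with one missing IQ and one missing Gender.
df = pd.DataFrame({"IQ": [139.0, np.nan, 130.0],
                   "Gender": ["Male", "Male", None]})

dropped = df.dropna()  # discards every row containing at least one NaN
filled = df.fillna({"IQ": df["IQ"].mean(),   # replace missing IQ with the column mean
                    "Gender": "Unknown"})    # replace missing Gender with a label

print(len(df), len(dropped), len(filled))  # 3 1 3
```

Which strategy is appropriate depends on the analysis: dropping rows loses information, while filling in values introduces assumptions about the data.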
df_y['Handedness'] = df_y['Handedness'].fillna(0) # Replace missing values in a column
df_y = df_y.dropna() # Drop all rows with any missing value
df_y.head()
Age | Gender | Handedness | IQ | IQ_Matrix | IQ_Vocab | |
---|---|---|---|---|---|---|
subjectID | ||||||
subject_000 | 20 | Male | 0 | 139.0 | 65.0 | 77.0 |
subject_001 | 31 | Male | 0 | 129.0 | 58.0 | 74.0 |
subject_002 | 18 | Female | 0 | 130.0 | 63.0 | 70.0 |
subject_006 | 39 | Male | 0 | 138.0 | 70.0 | 71.0 |
subject_007 | 34 | Male | Right | 115.0 | 66.0 | 51.0 |
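After cleaning, it is worth confirming that no missing values remain. A sketch of the same fillna-then-dropna sequence on a two-row synthetic frame:

```python
import numpy as np
import pandas as pd

# Synthetic frame: one missing Handedness, one missing IQ.
df = pd.DataFrame({"Handedness": [None, "Right"], "IQ": [139.0, np.nan]})

df["Handedness"] = df["Handedness"].fillna(0)  # fill this column only
df = df.dropna()                               # drop rows still containing NaN

print(df.isna().sum().sum())  # total missing values left anywhere: 0
```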
Filtering and Subsetting#
Subset data through logical indexing or by selecting specific columns:
df_c_filtered = df_c[df_c["time"] > 40] # Rows where total time taken is greater than 40
df_c_subset = df_c[["time", "addit2"]] # Only keep selected columns
df_c_filtered.head()
addit1 | addit2 | addit3 | addit4 | addit5 | addit6 | addit7 | addit8 | subtr1 | subtr2 | subtr3 | subtr4 | subtr5 | subtr6 | subtr7 | subtr8 | class | time | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | second | 47 |
13 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | second | 52 |
14 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | second | 43 |
15 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | second | 45 |
17 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | second | 44 |
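Conditions can also be combined with & (and) and | (or); each condition needs its own parentheses because of Python's operator precedence. A sketch on synthetic data (column names mirror the dataset above):

```python
import pandas as pd

# Synthetic rows with a completion time and a class label.
df = pd.DataFrame({"time": [39, 47, 52, 28],
                   "class": ["second", "second", "third", "third"]})

# Rows where time > 40 AND class is "second"; note the parentheses.
slow_second = df[(df["time"] > 40) & (df["class"] == "second")]
print(slow_second)
```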
Alternatively, we can drop specific columns using the .drop() method.
df_only_items = df_c.drop(columns=['class', 'time'])
df_only_items.head()
addit1 | addit2 | addit3 | addit4 | addit5 | addit6 | addit7 | addit8 | subtr1 | subtr2 | subtr3 | subtr4 | subtr5 | subtr6 | subtr7 | subtr8 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
2 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 |
5 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
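When many columns share a prefix, as the addit* and subtr* items do here, .filter() can select them without listing every name. A sketch on a synthetic frame with the same naming pattern:

```python
import pandas as pd

# Synthetic frame mimicking the item-column naming of the dataset.
df = pd.DataFrame({"addit1": [1], "addit2": [0], "subtr1": [1], "time": [39]})

# Keep only columns whose names start with "addit".
addit_items = df.filter(regex=r"^addit")
print(addit_items.columns.tolist())  # ['addit1', 'addit2']
```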
Full Control#
Here we only saw a few applications. If you want to feel like you are in full control, I highly recommend checking the pandas documentation.