Data curation#
Data curation is the process of creating, organizing, and maintaining data sets so that people seeking information can access and use them. Curation involves collecting, structuring, indexing, and cataloging data for users within an organization, a group, or the general public. In this session, we will practice the scientific computing skills needed to clean, explore, summarize, and visualize data efficiently, using the tools introduced last semester: the pandas, numpy, and matplotlib libraries.
Datasets#
In this session, we will use the Child Numeracy Skills and the Yeatman datasets:
Child Numeracy Skills#
This dataset contains responses from a classroom-based arithmetic assessment. It captures performance across multiple addition and subtraction tasks, allowing analysis of numerical ability in children.
Variables#
addit1 to addit8: Scores for eight addition problems, assessing mental arithmetic and computation speed.
subtr1 to subtr8: Scores for eight subtraction problems, similarly targeting subtraction skills.
class: Indicates the class group the child belongs to (useful for group comparisons or multilevel modeling).
time: Represents the total time taken (in seconds) to complete the task, serving as a measure of processing speed or efficiency.
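The item-level columns described above can be combined into per-child totals. The following is a minimal sketch on made-up data: the values are randomly generated, the 0/1 (incorrect/correct) scoring is an assumption rather than something the dataset description states, and the names toy, addition_total, and subtraction_total are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: random values that only mimic the column layout described above.
# Assumption: each item is scored 0 (incorrect) or 1 (correct).
rng = np.random.default_rng(seed=0)
n_children = 6

addition_cols = [f"addit{i}" for i in range(1, 9)]      # addit1 ... addit8
subtraction_cols = [f"subtr{i}" for i in range(1, 9)]   # subtr1 ... subtr8

toy = pd.DataFrame(
    rng.integers(0, 2, size=(n_children, 16)),
    columns=addition_cols + subtraction_cols,
)
toy["class"] = [1, 1, 2, 2, 3, 3]
toy["time"] = rng.uniform(60, 300, size=n_children).round(1)  # seconds

# Sum across the eight items of each type to get one total per child.
toy["addition_total"] = toy[addition_cols].sum(axis=1)
toy["subtraction_total"] = toy[subtraction_cols].sum(axis=1)

print(toy[["class", "time", "addition_total", "subtraction_total"]])
```

Building the column names with a list comprehension avoids typing all sixteen names by hand and makes it easy to select each block of items.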
Usage#
This dataset is well-suited for:
Exploratory data analysis
Item-level psychometric modeling
Group comparisons (e.g., by class)
Linking accuracy with processing time
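Two of the uses listed above, group comparisons and linking accuracy with processing time, might look like the following sketch. The data frame is hand-written toy data, and the column name total_correct and the class labels "A", "B", and "C" are hypothetical.

```python
import pandas as pd

# Toy data: hand-written values, not taken from the real dataset.
scores = pd.DataFrame({
    "class": ["A", "A", "B", "B", "C", "C"],
    "total_correct": [12, 14, 9, 11, 15, 13],   # hypothetical summed score
    "time": [210.0, 180.0, 260.0, 240.0, 150.0, 175.0],  # seconds
})

# Group comparison: mean accuracy and mean completion time per class.
by_class = scores.groupby("class")[["total_correct", "time"]].mean()
print(by_class)

# Accuracy versus processing time: Pearson correlation across children.
accuracy_time_r = scores["total_correct"].corr(scores["time"])
print(f"Correlation between accuracy and time: {accuracy_time_r:.2f}")
```

In this toy example the correlation is negative, meaning children with higher scores tended to finish faster. Whether the real data show the same pattern is something to check during exploration.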
Yeatman dataset#
This dataset contains demographic and cognitive profile data for 77 participants involved in a diffusion MRI study. Each row represents an individual subject and includes attributes relevant for analysis of structural brain connectivity.
Columns#
subjectID: Unique identifier for each participant.
Age: Participant’s age in years.
Gender: Categorical variable indicating biological sex (Male/Female).
Handedness: Preferred hand for tasks (e.g., ‘Right’, ‘Left’). Some entries are missing.
IQ: Composite intelligence quotient score.
IQ_Matrix: Score on a matrix reasoning task (non-verbal IQ component).
IQ_Vocab: Score on a vocabulary test (verbal IQ component).
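As a rough illustration of how a table with these columns could be organized in pandas, the sketch below builds a four-row toy frame, uses subjectID as the index, and stores Gender as a categorical variable. All values, including the subject identifiers, are invented for the example and do not come from the real data.

```python
import numpy as np
import pandas as pd

# Toy data: invented values that only follow the column layout described above.
subjects = pd.DataFrame({
    "subjectID": ["s1", "s2", "s3", "s4"],
    "Age": [20, 31, 18, 28],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Handedness": [np.nan, "Right", np.nan, "Left"],
    "IQ": [110.0, 129.0, np.nan, 123.0],
    "IQ_Matrix": [55.0, 58.0, np.nan, 62.0],
    "IQ_Vocab": [60.0, 71.0, np.nan, 65.0],
})

# One row per participant, so the identifier is a natural index.
subjects = subjects.set_index("subjectID")

# Store the categorical variable with an explicit categorical dtype.
subjects["Gender"] = subjects["Gender"].astype("category")

# Quick summaries: column types and mean age per group.
print(subjects.dtypes)
mean_age_by_gender = subjects.groupby("Gender", observed=True)["Age"].mean()
print(mean_age_by_gender)
```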
Notes#
Some variables contain missing data, especially Handedness, IQ, IQ_Matrix, and IQ_Vocab.
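Missing values like these are usually counted before deciding how to handle them. The sketch below shows one way to do that with pandas on a small toy frame. The values are invented, and dropping incomplete rows is only one possible strategy, chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: invented values with gaps in the columns named above.
subjects = pd.DataFrame({
    "subjectID": ["s1", "s2", "s3", "s4"],
    "Age": [20, 31, 18, 28],
    "Handedness": [None, "Right", None, "Left"],
    "IQ": [np.nan, 129.0, np.nan, 123.0],
    "IQ_Matrix": [np.nan, 58.0, np.nan, 62.0],
    "IQ_Vocab": [np.nan, 71.0, 60.0, 65.0],
})

# Count missing entries in each column.
missing_per_column = subjects.isna().sum()
print(missing_per_column)

# Keep only the rows where all three IQ measures are present.
complete_cases = subjects.dropna(subset=["IQ", "IQ_Matrix", "IQ_Vocab"])
print(f"{len(complete_cases)} of {len(subjects)} rows have complete IQ data")
```

Passing subset to dropna restricts the check to the listed columns, so a row with a missing Handedness value is still kept as long as its IQ scores are complete.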