Commit f11d6c2

Update NSD project briefs
1 parent 92c0100 commit f11d6c2

6 files changed

Lines changed: 458 additions & 81 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -6,6 +6,6 @@ Website: <https://processingcomplexdata.github.io/>

 Lecturer: [Javier Garcia-Bernardo](https://javier.science/), Assistant Professor of Social Data Science, Department of Methodology & Statistics, Utrecht University.

-The course site is built with Jekyll and hosted on GitHub Pages. The landing page is [index.md](index.md). The course manual is [course_manual.md](course_manual.md), project guidelines are in [project_guidelines.md](project_guidelines.md), and the six group projects live in [projects/](projects/).
+The course site is built with Jekyll and hosted on GitHub Pages. The landing page is [index.md](index.md). The course manual is [course_manual.md](course_manual.md), project guidelines are in [project_guidelines.md](project_guidelines.md), and the group project briefs live in [projects/](projects/).

 All materials are licensed [CC-BY-4.0](LICENSE).

index.md

Lines changed: 3 additions & 2 deletions
@@ -4,7 +4,7 @@

 ## About the course

-Contrary to what most introductory data science and statistics courses teach, real-world scientific data come in an enormous variety of formats, sizes, structures, and procedures — from simple tables to spatiotemporal arrays, normalized relational schemas, nested API responses, raw scraped web pages, networks, and domain-specific scientific standards. This course gives students hands-on experience with handling, processing, and modelling six families of complex data, in a hackathon-style format where each group goes deep on one data type and teaches the rest of the class.
+Contrary to what most introductory data science and statistics courses teach, real-world scientific data come in an enormous variety of formats, sizes, structures, and procedures — from simple tables to spatiotemporal arrays, normalized relational schemas, nested API responses, raw scraped web pages, networks, sampled time series, and domain-specific scientific standards. This course gives students hands-on experience with handling, processing, and modelling several families of complex data, in a hackathon-style format where each group goes deep on one data type and teaches the rest of the class.

 The narrative spine of the course is *from raw traces to defensible claims*. Each group works through a single pipeline: raw source → operationalized clean object → baseline model with one sensitivity check → presentation.

@@ -32,5 +32,6 @@ The narrative spine of the course is *from raw traces to defensible claims*. Eac
 | Networks | [projects/networks.md](projects/networks.md) | What is the relationship between gender and cross-program relations in high school? |
 | Messy web text | [projects/messy_web_text.md](projects/messy_web_text.md) | Do company sustainability pages differ linguistically from public-interest climate information pages? |
 | Relational database | [projects/relational_database.md](projects/relational_database.md) | Which driver, constructor, grid, circuit, and season characteristics are associated with F1 finishing points? |
-| Time series | [projects/time_series.md](projects/time_series.md) | How does an fMRI signal change across NSD scan sessions? |
+| Scientific programming | [projects/scientific_programming.md](projects/scientific_programming.md) | In one NSD session, are mean beta responses different in V1 and hV4? |
+| Time series | [projects/time_series.md](projects/time_series.md) | Is gaze movement lower while an NSD target image is on screen than nearby periods when it is not? |
 | API data | [projects/api_data.md](projects/api_data.md) | Which study attributes are associated with completed versus ongoing clinical trials? |

projects/scientific_programming.md

Lines changed: 241 additions & 0 deletions
@@ -0,0 +1,241 @@
# Scientific Programming Project: Neuroimaging Data Standards

- Project name: `scientific_programming_neuroimaging`
- Research question: __In NSD subject 1, session 1, is the mean beta response different in V1 and hV4?__
- Optional extension: __If the core pipeline works, repeat the same ROI comparison for a few additional sessions or revisit the V1 session-drift question.__
- Programming language: `R` suggested for the class version (`RNifti`, `dplyr`, `tidyr`, `ggplot2`). Python remains optional for students who already know `nibabel` or `nilearn`.
- Expert contact: TBD, Ben Harvey?

> **Canonical course conventions live in [project_guidelines.md](../project_guidelines.md).** That file is the source of truth for the four required workflow files (`week1_explore.qmd`, `week2_operationalize_clean.qmd`, `week3_model.qmd`, `week4_storytelling.qmd`), the `data/model_data.rds` -> `data/model_results.rds` pipeline, the raw-data policy, quality-check requirements, decision logs, and contribution tracking. Read it before starting and treat anything below as project-specific guidance on top of those conventions.

![Posterior view of NSD visual ROI mask](/assets/img/projects/nsd_visual_roi_posterior.png)

*Posterior view of an NSD visual ROI mask: grey points are valid brain voxels outside this visual ROI set; colored points show V1, V2, V3, and hV4 subdivisions. The practical task is to connect voxel coordinates, ROI labels, trials, and beta values into one small analysis table.*

## Tutorial framing

This project is about scientific data standards and binary scientific arrays. The
raw object is not one tidy table. It is a small set of files that only make sense
together: BIDS metadata, JSON sidecars, event TSV files, NIfTI beta maps, ROI
masks, and a dataset manual.

The project should stay deliberately small. Students should not try to do a full
fMRI study, fit a complex neural encoding model, download all NSD sessions, or
solve image-category labeling. The class version uses one subject, one session,
one single-trial beta file, and one visual ROI mask.

Students should learn three things:

1. How a scientific repository uses standards and metadata to make a large
   dataset understandable.
2. How a NIfTI beta file and a NIfTI ROI mask encode different but aligned
   arrays.
3. How to reduce voxelwise trial data to a small trial-by-ROI table that answers
   one simple question.

The core research question is:

> Are mean trial-level beta responses different in V1 and hV4 for one NSD
> subject-session?

This is not meant to be a new neuroscience contribution. It is a defensible
mini-question that forces students to work with real neuroimaging files without
drowning in the full NSD dataset.

## Peer-teaching checklist

| Dimension | This project teaches |
|---|---|
| Data structure | Participant/session folders, event metadata, 4D beta images, 3D ROI masks, voxel coordinates, trial index, and ROI labels. |
| Storage system | Scientific repository on AWS organized through BIDS-style and NSD-specific conventions. |
| File formats | NIfTI `.nii.gz`, TSV, JSON, CSV, MATLAB design files if needed, and RDS/CSV outputs created by students. |
| Encoding | Text metadata, JSON sidecars, tabular event files, binary scientific image arrays, and integer-coded ROI masks. |
| Model | A paired trial-level ROI comparison, such as a one-sample t-test or a bootstrap interval on the per-trial `hV4 - V1` differences. |
| Key aspects to explain | Data standards, provenance, voxel-to-ROI mapping, NIfTI dimensions, trial indexing, ROI aggregation, file sizes, and what is lost when voxel maps become ROI means. |

## Resources
### Data source

The practical uses the **Natural Scenes Dataset (NSD)**. European fMRI datasets
are difficult to share publicly because detailed brain images are often treated
as individually identifiable under the GDPR. NSD is an American dataset that can
be shared through AWS Open Data after signing the NSD data access agreement.

- Dataset and documentation: https://naturalscenesdataset.org/
- Main reference paper: Allen et al. (2022), Nature Neuroscience. https://doi.org/10.1038/s41593-021-00962-x
- Optional extension reference for session drift: https://doi.org/10.1038/s41467-023-40144-w

Minimum NSD files for the class version:

- BIDS metadata:
  `nsddata_rawdata/dataset_description.json`,
  `nsddata_rawdata/participants.tsv`, and
  `nsddata_rawdata/task-nsdcore_bold.json`.
- One example event file:
  `nsddata_rawdata/sub-01/ses-nsd01/func/sub-01_ses-nsd01_task-nsdcore_run-01_events.tsv`.
- Subject 1 visual ROI mask:
  `nsddata/ppdata/subj01/func1pt8mm/roi/prf-visualrois.nii.gz`.
- One single-trial beta file:
  `nsddata_betas/ppdata/subj01/func1pt8mm/betas_fithrf_GLMdenoise_RR/betas_session01.nii.gz`.
- Stimulus metadata only for inspection, not for the core model:
  `nsddata/experiments/nsd/nsd_stim_info_merged.csv`.
- Optional image examples:
  a few files from `nsddata_stimuli/stimuli/nsd/shared1000/`.

Students should not download all raw BOLD volumes, all subjects, all sessions of
single-trial betas, or the full 37 GB `nsd_stimuli.hdf5` file. If laptop storage
or memory is a problem, the instructor can provide a pre-cropped subset or let
students use `meanbeta_session01.nii.gz` for Week 1 exploration only. The main
Week 2 table should still be built from a documented NIfTI beta file plus the ROI
mask.

### ROI codes

For the core question, combine:

- V1 = ROI codes `1` and `2` (`V1v`, `V1d`)
- hV4 = ROI code `7`

Students do not need to analyze every ROI. They should understand that the ROI
mask is an integer-coded spatial map whose dimensions must align with the beta
file's spatial dimensions.
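As a minimal sketch of the recoding described above, the block below maps integer ROI codes to the two project ROIs in base R. The helper name `recode_roi` and the small `roi_mask` array are made up for illustration; with real data the mask would come from reading `prf-visualrois.nii.gz`.

```r
# Hypothetical helper: map prf-visualrois integer codes to the two ROIs
# used in this project. Codes 1 and 2 (V1v, V1d) are merged into "V1",
# code 7 becomes "hV4", and every other code becomes NA.
recode_roi <- function(codes) {
  out <- rep(NA_character_, length(codes))
  out[codes %in% c(1, 2)] <- "V1"
  out[codes %in% 7] <- "hV4"
  dim(out) <- dim(codes)  # keep the spatial shape of the mask
  out
}

# Synthetic 3 x 2 x 2 mask standing in for the real NIfTI ROI mask.
roi_mask <- array(c(0, 1, 2, 7, 3, 0, 7, 1, 5, 2, 0, 7), dim = c(3, 2, 2))
roi_label <- recode_roi(roi_mask)

# Count voxels per label, including voxels outside V1 and hV4.
table(roi_label, useNA = "ifany")
```

Keeping the recoded labels in the same shape as the mask means the same positions can later be used to select voxels from each trial of the beta array.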

### Knowledge sources

- BIDS documentation for neuroimaging data organization and metadata.
- Basic introductions to NIfTI, JSON sidecars, events files, and participant
  metadata.
- NSD documentation, the main paper, and the dataset manual.
- R packages: `RNifti`, `dplyr`, `tidyr`, `ggplot2`, `readr`.
- Optional Python equivalents: `nibabel`, `numpy`, `pandas`, `nilearn`.

## Week-by-week
### Week 1

Start from the raw scientific repository, identify the files that belong
together, and make a written manifest before downloading large files.

Week 1 exact data checklist:

- Read and accept the NSD data terms.
- Download only the BIDS metadata files listed above.
- Download one event TSV for `sub-01`, `ses-nsd01`, `run-01`.
- Download the subject-1 visual ROI mask.
- Download one selected beta file, preferably `betas_session01.nii.gz`.
- Optionally download a few `shared1000` images so the group can see what kind
  of stimuli were used.
- Save a local manifest with file paths, file sizes, and the reason each file is
  needed.
- In R, load only the headers first and write down the dimensions of the ROI mask
  and beta file.
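The manifest step in the checklist above could look roughly like this in base R. The helper `make_manifest`, the column names, and the two temporary demo files are assumptions for illustration, not a required course format.

```r
# Hypothetical helper: build a manifest for a set of local files.
# `paths` are file paths; `reasons` explains why each file is needed.
make_manifest <- function(paths, reasons) {
  stopifnot(length(paths) == length(reasons))
  data.frame(
    path = paths,
    exists = file.exists(paths),
    size_bytes = file.size(paths),  # NA for files that are missing
    reason = reasons,
    stringsAsFactors = FALSE
  )
}

# Demonstration with two small temporary files standing in for NSD files.
demo_dir <- tempfile("nsd_demo_")
dir.create(demo_dir)
demo_files <- file.path(
  demo_dir,
  c("participants.tsv", "dataset_description.json")
)
writeLines("participant_id\nsub-01", demo_files[1])
writeLines("{\"Name\": \"demo\"}", demo_files[2])

manifest <- make_manifest(
  demo_files,
  reasons = c("participant metadata", "dataset-level BIDS metadata")
)
print(manifest)

# The manifest could then be saved next to the data, for example:
# write.csv(manifest, "data/manifest.csv", row.names = FALSE)
```

Recording `exists` alongside the size makes it easy to spot a download that silently failed before any analysis starts.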

Skip in Week 1:

- all raw BOLD fMRI volumes;
- all subjects;
- all beta sessions;
- the full `nsd_stimuli.hdf5`;
- animate/inanimate labeling;
- session-drift modeling.

Week 1 questions:

- What is BIDS, what is NIfTI, and what is a JSON sidecar?
- What is a voxel?
- What does each dimension of the beta file represent?
- What is an ROI mask, and why are most voxels outside this visual ROI set?
- Which files are raw measurements, which are metadata, and which are derived
  analysis files?
- Why is the data-use agreement part of the scientific data structure?

Prepare for roundtable in Week 2:

- Explain why scientific data standards exist.
- Explain the difference between an event TSV, a beta NIfTI file, and an ROI
  mask.
- Explain why a binary array is not self-explanatory without metadata.
- Explain what can go wrong if the beta file and ROI mask dimensions do not
  match.
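One concrete way a dimension mismatch can go wrong in R is sketched below with toy arrays. All object names are illustrative. The point is that logical indexing recycles a too-short mask without raising an error, so an explicit dimension check is worth having.

```r
# Toy beta array: x, y, z, trial.
betas <- array(seq_len(2 * 3 * 4 * 2), dim = c(2, 3, 4, 2))
trial1 <- betas[, , , 1]

# A mask with the right shape and a mask with the wrong shape (2D, not 3D).
good_mask <- array(rep(c(TRUE, FALSE), 12), dim = c(2, 3, 4))
bad_mask <- array(c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE), dim = c(2, 3))

# Correct shape: exactly the flagged voxels are selected.
n_good <- length(trial1[good_mask])

# Wrong shape: the 6-element index is recycled over all 24 voxels, so more
# voxels are selected than the mask ever flagged, and no error is raised.
n_bad <- length(trial1[bad_mask])
c(flagged_in_bad_mask = sum(bad_mask), selected_by_bad_mask = n_bad)

# An explicit guard turns the silent problem into a visible one.
same_grid <- identical(dim(trial1), dim(bad_mask))
same_grid
```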

### Week 2

Build the smallest useful analysis table from the raw files.

- Load the ROI mask with `RNifti::readNifti()`.
- Load the beta file with `RNifti::readNifti()`.
- Confirm that the first three beta dimensions match the ROI mask dimensions.
- Create a voxel table only for V1 and hV4 voxels.
- For each trial, compute the mean beta in V1 and the mean beta in hV4.
- Save `data/model_data.rds` with columns such as:
  `subject`, `session`, `trial`, `roi`, `n_voxels`, and `mean_beta`.

The Week 2 output should be small. Students should not save a copy of the full
NIfTI array as an RDS file.
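The Week 2 steps can be sketched end to end on synthetic arrays in base R. This is an illustration under stated assumptions, not the course solution: `roi_mask` and `betas` are simulated stand-ins for the arrays that would be loaded from the two NIfTI files, and the subject and session values are placeholders.

```r
set.seed(1)

# Synthetic stand-ins: a 4 x 4 x 3 ROI mask and a 4D beta array with 5 trials.
roi_mask <- array(rep(c(0, 1, 2, 7), length.out = 48), dim = c(4, 4, 3))
betas <- array(rnorm(48 * 5), dim = c(4, 4, 3, 5))

# 1. The first three beta dimensions must match the mask dimensions.
stopifnot(identical(dim(betas)[1:3], dim(roi_mask)))

# 2. Flatten space: one row per voxel, one column per trial.
n_trials <- dim(betas)[4]
beta_mat <- matrix(betas, ncol = n_trials)

# 3. Logical voxel selectors for the two ROIs (V1 = codes 1 and 2, hV4 = 7).
roi_index <- list(
  V1 = as.vector(roi_mask %in% c(1, 2)),
  hV4 = as.vector(roi_mask == 7)
)

# 4. One row per trial and ROI, holding the mean beta over that ROI's voxels.
model_data <- do.call(rbind, lapply(names(roi_index), function(roi) {
  keep <- roi_index[[roi]]
  data.frame(
    subject = "subj01",
    session = 1L,
    trial = seq_len(n_trials),
    roi = roi,
    n_voxels = sum(keep),
    mean_beta = colMeans(beta_mat[keep, , drop = FALSE]),
    stringsAsFactors = FALSE
  )
}))

str(model_data)
# saveRDS(model_data, "data/model_data.rds")
```

Reshaping the 4D array into a voxel-by-trial matrix works because R stores arrays with the first dimensions varying fastest, so each column holds one full trial volume in the same voxel order as the flattened mask.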

Prepare for roundtable in Week 3:

- Explain how voxelwise maps became a trial-by-ROI table.
- Explain what was gained and lost by averaging over voxels.
- Explain why V1 was built from `V1v` and `V1d`.
- Explain one quality check: dimensions match, non-empty ROIs, plausible number
  of trials, or no all-missing beta summaries.
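The quality checks listed above could be collected in one small function. The sketch below is one possible shape for it. The function name, its arguments, and the toy table are hypothetical, and the expected trial count is something the group would take from the NSD documentation rather than from this example.

```r
# Hypothetical helper: run the quality checks named above on a trial-by-ROI
# table, given the dimensions of the beta array and the ROI mask.
check_model_data <- function(model_data, beta_dim, mask_dim,
                             expected_trials = NULL) {
  checks <- c(
    dims_match = identical(as.integer(beta_dim[1:3]), as.integer(mask_dim)),
    rois_non_empty = all(model_data$n_voxels > 0),
    no_all_missing = all(tapply(
      model_data$mean_beta, model_data$roi,
      function(x) !all(is.na(x))
    ))
  )
  if (!is.null(expected_trials)) {
    trials_per_roi <- tapply(
      model_data$trial, model_data$roi,
      function(x) length(unique(x))
    )
    checks["trial_count_ok"] <- all(trials_per_roi == expected_trials)
  }
  checks
}

# Toy table with three trials per ROI and one missing summary.
toy <- data.frame(
  trial = rep(1:3, times = 2),
  roi = rep(c("V1", "hV4"), each = 3),
  n_voxels = rep(c(24L, 12L), each = 3),
  mean_beta = c(0.4, 0.1, 0.3, 0.2, NA, 0.5)
)

result <- check_model_data(
  toy,
  beta_dim = c(4, 4, 3, 3),
  mask_dim = c(4, 4, 3),
  expected_trials = 3
)
print(result)
```

Returning a named logical vector rather than stopping at the first failure lets the group paste the full check result into their decision log.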
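
### Week 3

Fit a very small model on the Week 2 table.

Recommended analysis:

```r
# Load the Week 2 table (one row per trial and ROI).
model_data <- readRDS("data/model_data.rds")

# Use only the trial identifiers as id columns. n_voxels differs between ROIs,
# so leaving it in would keep V1 and hV4 on separate rows filled with NA.
wide <- tidyr::pivot_wider(
  model_data,
  id_cols = c(subject, session, trial),
  names_from = roi,
  values_from = mean_beta
)

wide$hV4_minus_V1 <- wide$hV4 - wide$V1
t.test(wide$hV4_minus_V1)
```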

Equivalent model:

```r
lm(hV4_minus_V1 ~ 1, data = wide)
```

The intercept is the average trial-level difference between hV4 and V1. This is
simple enough to explain and still depends on the real scientific-programming
work: students had to read NIfTI files, decode the ROI mask, align dimensions,
and aggregate a 4D array into a table.
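To see why the two formulations are equivalent, the sketch below runs both on simulated differences. The numbers are generated for illustration and say nothing about NSD. The intercept, its t statistic, and its confidence interval should match the one-sample t-test.

```r
set.seed(42)

# Simulated per-trial differences standing in for wide$hV4_minus_V1.
wide <- data.frame(hV4_minus_V1 = rnorm(50, mean = 0.2, sd = 1))

tt <- t.test(wide$hV4_minus_V1)
fit <- lm(hV4_minus_V1 ~ 1, data = wide)
coefs <- summary(fit)$coefficients

# Same estimate: the sample mean equals the intercept.
c(mean = unname(tt$estimate), intercept = unname(coef(fit)))

# Same t statistic and p-value.
c(t_test = unname(tt$statistic), lm = coefs["(Intercept)", "t value"])
c(t_test = tt$p.value, lm = coefs["(Intercept)", "Pr(>|t|)"])

# Same 95% confidence interval.
rbind(t_test = as.numeric(tt$conf.int), lm = as.numeric(confint(fit)))
```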

Sensitivity check:

- Repeat the comparison with `V1v` and `V1d` separately, or
- repeat after removing trials with extreme beta values, or
- repeat with `meanbeta_session01.nii.gz` as a descriptive check only.
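As one possible version of the second option, the sketch below drops trials whose ROI mean is extreme in either ROI and reruns the test. The data are simulated with two injected outliers, and the cutoff of three MAD-based robust standard deviations is an assumption chosen for illustration. The brief does not prescribe a rule.

```r
set.seed(7)

# Simulated wide table: one row per trial, one column per ROI mean.
wide <- data.frame(
  trial = 1:60,
  V1 = rnorm(60, mean = 1.0, sd = 0.5),
  hV4 = rnorm(60, mean = 1.2, sd = 0.5)
)
wide$V1[5] <- 9     # injected outlier
wide$hV4[17] <- -7  # injected outlier

# Assumed rule: a value is extreme if it lies more than k robust SDs
# (median absolute deviation) away from the median.
is_extreme <- function(x, k = 3) abs(x - median(x)) > k * mad(x)
keep <- !(is_extreme(wide$V1) | is_extreme(wide$hV4))

full <- t.test(wide$hV4 - wide$V1)
trimmed <- t.test(wide$hV4[keep] - wide$V1[keep])

sensitivity <- data.frame(
  analysis = c("all trials", "extreme trials removed"),
  n = c(nrow(wide), sum(keep)),
  estimate = unname(c(full$estimate, trimmed$estimate)),
  ci_low = c(full$conf.int[1], trimmed$conf.int[1]),
  ci_high = c(full$conf.int[2], trimmed$conf.int[2])
)
print(sensitivity)
```

Reporting both rows side by side, together with the rule and the number of dropped trials, keeps the check transparent whichever way it comes out.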

Prepare for roundtable in Week 4:

- Explain which parameter answers the research question.
- Explain why the model is small but the data processing was not trivial.
- Explain why trial-level values are not the same as raw BOLD time series.
- Explain why session drift and image-category questions are extensions, not the
  core project.

### Week 4

Visualize and tell a story about the raw-data-to-table pipeline.

- Show the ROI mask image or 3D ROI plot.
- Show the distribution of trial-level mean beta values for V1 and hV4.
- Show paired trial differences or a confidence interval for `hV4 - V1`.
- Show the local file manifest and the final `model_data.rds` structure.
- Make the limitations explicit: one subject, one session, two ROIs, ROI means
  rather than voxelwise modeling, and no claim about all visual cortex or all
  people.
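For the confidence interval in the list above, a percentile bootstrap is one simple option. The sketch below uses simulated differences in place of the real per-trial `hV4 - V1` values. The number of resamples and the 95% level are illustrative choices.

```r
set.seed(123)

# Simulated per-trial differences standing in for hV4 - V1.
diffs <- rnorm(80, mean = 0.3, sd = 1)

# Resample trials with replacement and recompute the mean each time.
n_boot <- 2000
boot_means <- replicate(n_boot, mean(sample(diffs, replace = TRUE)))

# Percentile bootstrap interval for the mean difference.
ci <- quantile(boot_means, probs = c(0.025, 0.975))

c(estimate = mean(diffs), ci_low = unname(ci[1]), ci_high = unname(ci[2]))

# A quick base-R picture of the bootstrap distribution could be:
# hist(boot_means, main = "Bootstrap means of hV4 - V1", xlab = "Mean difference")
# abline(v = ci, lty = 2)
```

Resampling whole trials keeps each trial's V1 and hV4 values paired, which matches the paired comparison used in Week 3.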

The final story should make a course-level argument:

> Scientific programming is not just fitting a model. It is knowing how raw
> domain files, metadata, binary arrays, ROI labels, and data-use agreements
> become a defensible analysis table.
