An article by Reilly et al. (2019) raised some important questions/issues for the field of pupillometry. One issue that has been bothering me as of late (hence the post) pertains to objectively defining a time window to look at. Although some commonalities exist across the literature, there is no uniform standard for isolating a peak range or delineating the width of this window. In this blog post I present a method we can use to address this issue. To do so, I will be using the example data set from the gazeR package (Geller et al. (n.d.)). This data comes from a lexical decison experiment where participants judged the lexicality of cursive and type-print stimuli.

# Common Approaches

In the pupillometry literature, I have seen a few different analytic approaches to the above issue. Some will preform seperate statistical tests at each time point (via *t* tests, LMM, ANOVA, regression, what have you), or they just ignore time altogether and aggregate data into seperate bins. I do not like these approaches for several reasons.

time bin analyses creates a trade-off between power and temporal resolution (smaller time bins) and introduces experimenter bias in selection of time bins/windows. There is also the issues of multiple comparisons and autocorrelation

Statistical thresholding (i.e.,

*p*< 0.05 is significant but*p*> 0.05 is not) creates false discretization of continuous processes.Does not take into account individual differences

While my go to would be growth curve modeling (GCA) or generalzied additive mixed modeling (GAMMs), I wanted to highlight a method I believe is a good alternative: cluster-based permutation tests (henceforth CPT)

# CPT

Let’s first look at a graph of the data.

```
library(gazer)
library(purrr)
library(Rmisc)
library(magrittr)
library(ggstatsplot)
library(knitr)
library(tidyverse)
agg_subject<-read_csv("blog_data.csv")
```

`## Warning: Missing column names filled in: 'X1' [1]`

```
agg_subject$subject<-as.factor(agg_subject$subject)
agg_subject$script<-as.factor(agg_subject$script)
agg_subject<-select(agg_subject, -X1)
cur1 <- subset(agg_subject, timebins <= 3500)
#runningSE <- cur1 %>%
# split(.$timebins) %>%
# map(~Rmisc::summarySEwithin(data = ., measurevar = "aggbaseline", withinvars = "script", idvar="subject"))
#WSCI <- map_df(runningSE, extract) %>%
# mutate(Time = rep(unique(cur1$timebins), each = 2))
#Note, you'll have to change 2 to match the number of conditions
WSCI.plot <- ggplot(agg_subject,aes(x=timebins, y=aggbaseline, linetype=script, color=script), size=3) + stat_summary(fun.y = "mean", geom = "line",size = 1,aes(colour = NULL))
print(WSCI.plot)
```

From this, we might be tempted to declare a significant difference arisisng ~1500-2500 ms. However, we cannot rely on the graphical analysis and we need a statistical test to be able to affirm that this observation is caused by diffierntial effort between the two scripts.

Permuation cluster analysis is a technique that is becoming increasingly popular, especially in the cognitvie neuropsychology domain to analyze MEG-EEG (Maris and Oostenveld 2007). While there exists a number of cluster analysis functions in MNE-Python (Gramfort et al. 2014) and Matlab’s FieldTrip (Oostenveld et al. 2011), what I want to show you is how you can do this analysis in R. The implementation of CPT has not been widely used in pupillometry research. I will hopefully show you that it can be a very useful tool to have in your aresenal.

Before I show you how to apply this method, I want to briefly go over the CPT method. The CPT is a data-driven method which increases power and controls for Type I errors across multiple comparisons (exatcly what we need when looking at pupil changes across the time course!). The clustering method involves conducting dependent-sample t-tests for every data point (condition by time). In the first step, adjacent data points that cross the mass-univariate significance threshold (*p* < .05) are combined to create a cluster. The sum of the *t*-statistic (or *F*) are calculated and form the basis for the cluster level statistic. In the second step, a surrogate null-distribution is created by first randomly assigning one of the two conditions within subjects (this is done n times) and retaining the cluster statistic for each randomization. In the third and final step, the cluster level statistic of the real comparison is compared against the null distribution, with clusters falling in the highest or lowest 2.5% considered to be significant.

While writing this blog post, I saw that Dale Barr presented on CPT in the context of eye-tracking. I figured I would take this opportunity to test out the packages he created.

`devtools::install_github("dalejbarr/exchangr")`

```
## Skipping install of 'exchangr' from a github remote, the SHA1 (685f4b55) has not changed since last install.
## Use `force = TRUE` to force installation
```

`devtools::install_github("dalejbarr/clusterperm")`

```
## Skipping install of 'clusterperm' from a github remote, the SHA1 (b05c87f6) has not changed since last install.
## Use `force = TRUE` to force installation
```

```
library(exchangr)
library(clusterperm)
```

The ‘clusterperm’ package has a pretty nifty `aov_by_bin`

function. If we applied a multlipe comparison correction (e.g., holm) to this data.

```
cur2 <- aov_by_bin(agg_subject, timebins, # clusterperm package
aggbaseline ~ script + Error(subject))
```

`## Warning: `.key` is deprecated`

```
## Warning: `cols` is now required.
## Please use `cols = c(result)`
```

```
cur2$p_adjuct<-p.adjust(cur2$p, method="holm")
cur2_p=subset(cur2, cur2$p_adjuct <= .05)
knitr::kable(cur2_p)
```

timebins | effect | stat | p | p_adjuct |
---|---|---|---|---|

2000 | script | 12.97351 | 0.0008456 | 0.0202952 |

2100 | script | 17.08613 | 0.0001720 | 0.0044723 |

2200 | script | 14.51092 | 0.0004587 | 0.0114670 |

2300 | script | 12.96859 | 0.0008473 | 0.0202952 |

How does it compare to CPT?

```
orig <- detect_clusters_by_effect(cur2, effect, timebins, stat, p)
knitr::kable(orig)
```

effect | b0 | b1 | sign | cms |
---|---|---|---|---|

script | 0 | 100 | 1 | 10.121215 |

script | 900 | 1000 | -1 | 8.756152 |

script | 1900 | 2500 | 1 | 82.198279 |

It looks like the multiple comparison correction is less sensitive.

Wait! we do not have *p*-values!!! Where are the damn *p*-values!? What is life? How am I suppose to make an informed decison? What is cool about Dale’s R package is that it has a permutation function that allows you to build the null model and obtain those coveted *p*-values.

```
dat_prec <- nest(agg_subject, -subject, -script)
nhds_prec <- cluster_nhds(
100, dat_prec, timebins,
aggbaseline ~ script + Error(subject),
shuffle_each, script, subject) # only use 100 permutations. Should use 1000-2000.
## get p-values
results_prec <- pvalues(orig, nhds_prec)
knitr::kable(results_prec)
```

effect | b0 | b1 | sign | cms | p |
---|---|---|---|---|---|

script | 0 | 100 | 1 | 10.121215 | 0.2277228 |

script | 900 | 1000 | -1 | 8.756152 | 0.2673267 |

script | 1900 | 2500 | 1 | 82.198279 | 0.0099010 |

Whew! I feel much better now!

From this graph we can see that there is one signifcant cluster: from 1900-2500 ms. If you were using this as a more explortatory approach (which I think we should), we know have a time period to look at where we can perform more common tests (e.g., max or mean during that time period).

There are some misunderstandings users should be made aware of when it comes to CPT (see Sassenhagen and Draschkow 2019) for a nice discussion). If we want to ask more specific questions (i.e., at what exact time points effects arise), CPT is not the method to use. We can only be certain that a difference exists; we cannot know the exact nature of that difference.

Overall, I think CPT can be a very useful tool in determing an appropiate time range to look at, and can even lead to more powerful inferences. Further, the use of CPT obviates the need for multiple comparisons and autocorrelation corrections. However, it does have issues. For instance, it cannot take into account subject and item variability. In addition, it is not able to tell you what the difference is, only that there is a difference.

In my next blog, I will discuss applying generalized additive mixed modeling to your pupillometry data. I think it might be the perfect soultion!

# Cited

Geller, Jason, Matthew Winn, Tristan Mahr, and Daniel Mirman. n.d. “GazeR: A Package for Processing Gaze Position and Pupil Size Data.” PsyArXiv. doi:10.31234/OSF.IO/GVCXB.

Gramfort, Alexandre, Martin Luessi, Eric Larson, Denis A. Engemann, Daniel Strohmeier, Christian Brodbeck, Lauri Parkkonen, and Matti S. Hämäläinen. 2014. “MNE software for processing MEG and EEG data.” *NeuroImage* 86 (February): 446–60. doi:10.1016/j.neuroimage.2013.10.027.

Maris, Eric, and Robert Oostenveld. 2007. “Nonparametric statistical testing of EEG- and MEG-data.” *Journal of Neuroscience Methods* 164 (1): 177–90. doi:10.1016/j.jneumeth.2007.03.024.

Oostenveld, Robert, Pascal Fries, Eric Maris, and Jan-Mathijs Schoffelen. 2011. “FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data.” *Computational Intelligence and Neuroscience* 2011 (December). Hindawi: 156869. doi:10.1155/2011/156869.

Reilly, Jamie, Alexandra Kelly, Seung Hwan Kim, Savannah Jett, and Bonnie Zuckerman. 2019. “The human task-evoked pupillary response function is linear: Implications for baseline response scaling in pupillometry.” *Behavior Research Methods* 51 (2): 865–78. doi:10.3758/s13428-018-1134-4.

Sassenhagen, Jona, and Dejan Draschkow. 2019. “Cluster‐based permutation tests of MEG/EEG data do not establish significance of effect latency or location.” *Psychophysiology* 56 (6). John Wiley & Sons, Ltd (10.1111): e13335. doi:10.1111/psyp.13335.