Research Projects

Current work

Causal discovery across multiple domains

Discovering causal relationships is a central goal of science. Increasingly, many different datasets are collected from the same underlying system. For example, one might have access to sequencing data and experimental CRISPR perturbations across different labs. Each lab is considered a separate domain, where distributions may shift due to changes in lab protocols. Another example is neural recordings in human patients. Although structural connections among different parts of the brain may be consistent across patients, each patient can be considered a different domain in which the distribution of neural activity differs. How to learn causal structure from such heterogeneous distributions is not well characterized. We provide a theoretical characterization and a learning algorithm for causal discovery from observational and experimental data stemming from multiple domains.

Let us provide another hypothetical example. Consider the medial temporal lobe (MT), premotor cortex (PM), somatosensory cortex (S1) and the motor cortex (M1). These regions have been hypothesized to work together to generate movement. Our overarching goal is to understand how these brain regions interact to facilitate movement in humans. In parallel, neuroscientists may also be interested in which portions of this computation differ in bonobos. Recording technology allows us to record electrophysiological activity in these brain regions for both bonobos and humans. We can also design an experiment in which subjects move a cursor in response to visual stimuli. Finally, we can perform stimulation or lesion experiments in bonobos that we could not perform in humans. By systematically combining these observational and experimental distributions from both humans and bonobos, one can learn the following causal structure, where we not only get the cause-and-effect relationships among relevant brain regions, but also see where the computation may differ between humans and bonobos. This difference is indicated by the black square, which we call an S-node. The learned relationships depicted by the graph seem plausible: the MT and S1 send their signals via the PM to the M1. The underlying system is likely more complex, but the recovered graph can serve as a starting point for further scientific inquiry.

Modern random forests

Although random forest classifiers are extremely successful on tabular data, they are not state of the art on structured data. We develop a random forest algorithm better suited to structured data such as images and time series by using structured projections of features that take the data geometry into account. This enables state-of-the-art performance in biomedical settings with low sample sizes.
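
As a rough illustration of what a structured projection could look like for image data, the sketch below builds candidate features by summing pixels over randomly placed contiguous patches, so that each feature respects the 2D layout of the image. The patch-sum form, the function names (sample_patch, patch_feature, project_dataset) and the parameters are assumptions made for illustration, not a description of the actual algorithm.

```python
import random


def sample_patch(height, width, max_patch, rng):
    """Draw a random axis-aligned patch (top, left, patch_h, patch_w) inside the image."""
    patch_h = rng.randint(1, min(max_patch, height))
    patch_w = rng.randint(1, min(max_patch, width))
    top = rng.randint(0, height - patch_h)
    left = rng.randint(0, width - patch_w)
    return top, left, patch_h, patch_w


def patch_feature(image, patch):
    """Project an image onto a patch by summing the pixels the patch covers."""
    top, left, patch_h, patch_w = patch
    return sum(
        image[r][c]
        for r in range(top, top + patch_h)
        for c in range(left, left + patch_w)
    )


def project_dataset(images, n_features, max_patch, seed=0):
    """Map each image (a list of rows) to n_features patch-sum features."""
    rng = random.Random(seed)
    height, width = len(images[0]), len(images[0][0])
    patches = [sample_patch(height, width, max_patch, rng) for _ in range(n_features)]
    features = [[patch_feature(img, p) for p in patches] for img in images]
    return features, patches


# Hypothetical toy data: an all-ones image and an all-zeros image, both 4x4.
ones = [[1] * 4 for _ in range(4)]
zeros = [[0] * 4 for _ in range(4)]
features, patches = project_dataset([ones, zeros], n_features=5, max_patch=3)
```

In a forest, candidate features of this kind would be scored at each tree node in place of single pixels, so that splits are made on spatially coherent regions rather than on isolated coordinates.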

Estimating mutual information in high dimensions

Posterior probabilities from machine learning classifiers are typically overconfident. We leverage “honesty,” a fundamental property that can be imposed on random forests, to prove that (conditional) entropy, and therefore (conditional) mutual information, can be estimated. We show the robustness of our information-theoretic estimators in high dimensions on the OpenML-CC18 datasets, fMRI data and EEG data.
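
To make the idea concrete, here is a minimal sketch of honest posterior estimation followed by a plug-in estimate of mutual information, I(X; Y) = H(Y) - H(Y | X). It assumes that honesty means using one subsample to choose the partition and a disjoint subsample to estimate class posteriors within each cell. A single depth-one tree on simulated one-dimensional data stands in for a forest, and all function names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def label_entropy(labels):
    """Plug-in entropy H(Y) from empirical label frequencies."""
    n = len(labels)
    return entropy([count / n for count in Counter(labels).values()])


def fit_stump(xs, ys):
    """Structure step: pick the threshold minimizing weighted label entropy."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs))[:-1]:  # skip the max so the right leaf is never empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * label_entropy(left)
                 + len(right) * label_entropy(right)) / len(xs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t


def honest_posteriors(threshold, xs, ys, classes):
    """Estimation step: leaf posteriors from samples NOT used to choose the split."""
    leaves = {0: Counter(), 1: Counter()}
    for x, y in zip(xs, ys):
        leaves[int(x > threshold)][y] += 1
    posteriors = {}
    for leaf, counts in leaves.items():
        total = sum(counts.values())
        # Fall back to a uniform posterior if a leaf receives no estimation samples.
        posteriors[leaf] = [
            counts[c] / total if total else 1 / len(classes) for c in classes
        ]
    return posteriors


def conditional_entropy(threshold, posteriors, xs):
    """Estimate H(Y | X) as the average posterior entropy over evaluation points."""
    return sum(entropy(posteriors[int(x > threshold)]) for x in xs) / len(xs)


# Toy data (an assumption): Y is a fair coin flip and X ~ Normal(2Y - 1, 1).
rng = random.Random(0)
ys = [rng.randint(0, 1) for _ in range(900)]
xs = [rng.gauss(2 * y - 1, 1.0) for y in ys]

# Three disjoint subsamples: structure, posterior estimation, evaluation.
threshold = fit_stump(xs[:300], ys[:300])
posteriors = honest_posteriors(threshold, xs[300:600], ys[300:600], classes=[0, 1])
h_y = label_entropy(ys[300:600])
h_y_given_x = conditional_entropy(threshold, posteriors, xs[600:])
mutual_information = h_y - h_y_given_x
print(f"H(Y)={h_y:.3f}  H(Y|X)={h_y_given_x:.3f}  I(X;Y)={mutual_information:.3f}")
```

Because the labels used to estimate the posteriors played no role in choosing the split, the leaf frequencies are not biased toward the purity the split was optimized for, which is the intuition behind why honesty counteracts overconfident posteriors.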

Prior work

Localizing the epileptogenic zone in drug-resistant epilepsy patients

My PhD thesis work focused on developing algorithms for seizure localization in drug-resistant epilepsy patients. I worked with multivariate time series, such as intracranial EEG and scalp EEG data. I also did 2D and 3D image analysis with T1 MRI and CT scans of patients to understand the spatial relationships of epileptic networks. Specifically, I analyzed EEG data using dynamical systems, control theory, machine learning and statistics. I combined biological understanding of the brain with mathematical models of dynamical systems. I also used machine learning and statistical models to answer hypothesis-driven questions, with a focus on learning manifold structure from a small number of samples. Moreover, I developed a suite of open-source software tools, contributing to packages such as the Brain Imaging Data Structure (BIDS), MNE, The Virtual Brain (TVB), scikit-learn and other scientific packages.

Diagnosing epilepsy patients

While scalp EEG is important for diagnosing epilepsy, a single routine EEG is limited in its diagnostic value. Only a small percentage of routine EEGs show interictal epileptiform discharges (IEDs), and overall misdiagnosis rates of epilepsy are 20-30%. We aim to demonstrate how analyzing network properties in EEG recordings can improve the speed and accuracy of epilepsy diagnosis, even in the absence of IEDs.

Open source

Another core aspect of my research is the development of fundamental scientific software, which impacts hundreds to thousands of researchers. My contributions range from software for analyzing neural data to general machine learning and scientific computing with sparse arrays.