Overview

  • We work to develop statistical models to answer biological questions, balancing biological interpretability, theoretical guarantees, and computational tractability
  • We deal with modern big data which is highly interconnected through graphical structures: phylogenetic networks to study evolution, interaction networks to study microbial communities in soil and plants, neural networks to predict phenotypes from genotypes
  • Our work produces a collection of new statistical methods with solid theoretical guarantees and efficient computational implementations that can be adapted to the complex characteristics of modern big biological data
  • We do not live in a statistical bubble! We always welcome new collaborations that can help our research to be relevant and applied to real-life data

Interested in joining the lab?

More information is available in Opportunities.

Funding

The lab is currently funded by the following awards.


Statistical phylogenomics

The Tree of Life is the graphical structure that represents the evolutionary process from single-cell organisms at the origin of life to the vast biodiversity we see today. Reconstructing this tree from genomic sequences is challenging because the variety of biological forces that shape the signal in the data constantly pushes the boundaries of statistical models. In addition, the big data reality can make inference methods obsolete due to their lack of scalability.

In this lab, we work to produce novel statistical models and methods to reconstruct the Tree of Life that are theoretically sound yet computationally efficient and scalable to meet the ever-growing needs of biological big data. We strive to accompany our theoretical work with open-source publicly available software.

Examples of our current research involve:

  • extension of phylogenetic network inference methods to broader classes of networks
  • robustness of phylogenetic inference to microbial datasets like bacteria and viruses
  • an alternative sampling scheme to MCMC that is expected to produce more efficient Bayesian phylogenetic estimation
  • statistical properties of BHV space and possible extension to networks

Our work is not purely methodological. Among our current collaborations, we can highlight:

  • studying the ancestral protein sequences of Potyvirus and Picornavirus
  • reconstructing phylogenetic trees and networks of grapes and carrots and studying the evolution of traits related to climate resistance via comparative methods
  • reconstructing phylogenetic trees and networks of Escovopsis and studying the evolution of vesicular shapes via comparative methods

Want to learn more about phylogenetics (especially networks)? See this list of resources, which starts with introductory videos and continues with a small subset of relevant papers in the field.


Statistics in genomics and microbiome

Microbial communities are among the main driving forces in the biosphere. Many critical biological processes inside and outside the human body are governed by microbes. Understanding the composition of microbial communities and which environmental factors shape this composition is crucial to comprehend processes connected to human, plant and soil health, as well as to predict microbial responses to environmental perturbations such as climate change at the macroscopic planetary scale or diet at the microscopic human scale.

In this lab, we work to produce tools to better represent microbial communities (via networks) and use the microbial communities as potential predictors of plant, soil or human health phenotypes. New models are centered on high-dimensional statistical models like penalized regression and post selection inference that can simultaneously model all microbes across the microbiome.
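
To make the idea of penalized regression concrete, here is a minimal sketch of the lasso fitted by coordinate descent. It is a textbook illustration, not the lab's own models or software: the function names, the toy abundance matrix, and the penalty value are all hypothetical, and the objective assumed is (1/(2n))·||y − Xβ||² + penalty·||β||₁.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return exactly 0.0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=200):
    """Minimize (1/(2n)) * ||y - X beta||^2 + penalty * ||beta||_1.

    X is a list of rows (samples) and y a list of responses.
    Each sweep updates one coefficient at a time, holding the others fixed.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # covariance of column j with the partial residual
            z = 0.0    # squared norm of column j
            for i in range(n):
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, penalty) / (z / n) if z > 0 else 0.0
    return beta


# Hypothetical toy data: 4 samples, 2 microbes (presence/absence), one phenotype.
X_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y_toy = [2.0, 0.1, 2.0, 0.1]
beta_toy = lasso_coordinate_descent(X_toy, y_toy, penalty=0.1)
print(beta_toy)
```

In this toy example the second microbe has only a weak association with the phenotype, so its coefficient is shrunk exactly to zero while the first stays nonzero. This is the variable-selection behavior that makes such penalties attractive when there are many more microbes than samples.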

Examples of our current research involve:

  • estimation of microbial interactions accounting for spatial correlations via Ising models
  • network regression framework to understand the effects of microbial communities on a response
  • post selection inference and penalized regression models applied to human or plant disease
  • high-dimensional models for the integration of different omics data types applied to human microbiome research and plant microbiome research

Our work is not purely methodological. Among our current collaborations, we can highlight:

  • studying how root microbial communities affect potato health and response to environmental changes
  • studying the effect of lung microbiome on health outcomes in cystic fibrosis patients

Statistical view in deep learning

Over the last decades, deep learning has enabled unprecedented prediction potential in a plethora of applications. In particular, neural networks (NNs) are already successfully used in computer vision, astrophysics, and even cancer histology. The main reason for their state-of-the-art prediction accuracy is their flexibility: their architecture can be adapted to fit almost any type of data and any type of model. Yet, poor generalization outside the training data, the lack of statistical guarantees of confidence, and the notion that they are “black box” models have hampered their development in translational fields like personalized medicine, where inaccurate predictions might result in grave consequences. Furthermore, NN methods are known for being “data-hungry”, meaning that huge amounts of data are required for training and validation. This requirement prohibits their use in fields with comparatively smaller sample sizes such as human health, where restrictions on data sharing and privacy limit researchers’ ability to acquire large enough datasets for NNs.

In this lab, we work to explore the potential of NNs in biomedical areas. On one side, we work on data related to human health like precision medicine, or the emergence of antibiotic resistance in infectious diseases. On the other side, we work on data related to soil and plant health like the use of biocontrol mycoviruses to fight against the emergence of fungicide-resistant crop pathogens.

Examples of our current research involve:

  • robustness of NN models to predict microbial phenotypes from genomic sequences: antibiotic resistance in Staphylococcus aureus and Pseudomonas aeruginosa and hypovirulence potential of mycoviruses on Sclerotinia
  • connections between statistical concepts of uncertainty (confidence intervals or hypothesis testing) and NN models

Publications

Preprints under revision

Bjorner, M., Molloy, E., Dewey, C., Solís-Lemus, C. (2022). Detectability of Varied Hybridization Scenarios using Genome-Scale Hybrid Detection Methods. arXiv (2022): 2211.00712.

Justison, J., Solís-Lemus, C., Heath, T. (2022). SiPhyNetwork: An R package for Simulating Phylogenetic Networks. bioRxiv (2022): 2022.10.26.513953.

Rattray, JB, Walden, R., Marquez-Zacarias, P., Molotkova, E., Perron, G., Solís-Lemus, C., Pimentel-Alarcon, D., Brown, S. (2022). Machine learning identification of Pseudomonas aeruginosa strains from colony image data. bioRxiv (2022): 2022.09.02.506375v1.

Ozminkowski, S., Wu, Y., Yang, L., Xu, Z., Selberg, L., Huang, C., Solís-Lemus, C. (2022). BioKlustering: a web app for semi-supervised learning of maximally imbalanced genomic data, arXiv (2022): 2209.11730, github, bioklustering.wid.wisc.edu

Ozminkowski, S., Solís-Lemus, C. (2022). Identifying microbial drivers in biological phenotypes with a Bayesian Network Regression model, arXiv (2022): 2208.05600, github

Shen, Y., Solís-Lemus, C., S. K. Deshpande. (2022). Sparse Gaussian chain graphs with the spike-and-slab LASSO: Algorithms and asymptotics, arXiv (2022): 2207.07020

Solís-Lemus, C., S. Yang, L. Zepeda-Nunez (2022). Accurate Phylogenetic Inference with a Symmetry-preserving Neural Network Model, arXiv (2022): 2201.04663, github

Solís-Lemus, C., A. M. Holleman, A. Todor, B. Bradley, K. J. Ressler, D. Ghosh, M. P. Epstein. (2021). A Kernel Method for Dissecting Genetic Signals in Tests of High-Dimensional Phenotypes, bioRxiv 2021.07.29.454336

Shen, Y., Solís-Lemus, C. (2021). CARlasso: An R package for the estimation of sparse microbial networks with predictors, arXiv (2021): 2107.13763, github

Shen, Y., Solís-Lemus, C. (2021). The Effect of the Prior and the Experimental Design on the Inference of the Precision Matrix in Gaussian Chain Graph Models, arXiv (2021): 2107.01306

Tiley, George P., Andrew A. Crowl, Paul S. Manos, Emily B. Sessa, Solís-Lemus, C., Anne D. Yoder, and J. Gordon Burleigh (2021). Phasing Alleles Improves Network Inference with Allopolyploids, bioRxiv (2021): 10.1101/2021.05.04.442457

Shen, Y., Solís-Lemus, C. (2020). Bayesian Conditional Auto-Regressive LASSO Models to Learn Sparse Networks with Predictors, arXiv (2020): 2012.08397, github

Solís-Lemus, C., A. Coen and Cecile Ané (2020). On the identifiability of phylogenetic networks under a pseudolikelihood model, arXiv (2020): 2010.01758, github



2022   Sun, Y., T.M. Maeda, Solís-Lemus, C., D. Pimentel-Alarcon, Z. Burivalova  
    Classification of animal sounds in a hyperdiverse rainforest using Convolutional Neural Networks  
    DOI: 10.1016/j.ecolind.2022.109621  

2022   Liu, Y., Solís-Lemus, C.  
    WI Fast Stats: a collection of web apps for the visualization and analysis of WI Fast Plants data  
    DOI: 10.21105/jose.00159  

2022   Zhang, Z., Cheng, S., Solís-Lemus, C.
    Towards a robust out-of-the-box neural network model for genomic data
    DOI: 10.1186/s12859-022-04660-8

2022   G. A. Satten, S. W. Curtis, Solís-Lemus, C., E. J. Leslie, M. P. Epstein
    Efficient Estimation of Indirect Effects in Case-Control Studies Using a Unified Likelihood Framework
    DOI: 10.1002/sim.9390

2021   Su M, Davis MH, Peterson J, Solís-Lemus, C., Satola SW, Read TD.
    Effect of genetic background on the evolution of Vancomycin-Intermediate Staphylococcus aureus (VISA)
    DOI: 10.7717/peerj.11764

2021   Moller A., Winston K., Ji S., Wang J., Hargita Davis M., Solís-Lemus, C., Read T.
    Genes Influencing Phage Host Range in Staphylococcus aureus on a Species-Wide Scale
    DOI: 10.1128/mSphere.01263-20

2020   Guerrero, V. and Solís-Lemus, C.
    A generalized measure of relative dispersion
    DOI: 10.1016/j.spl.2020.108806

2020   Solís-Lemus, C., S. T. Fischer, A. Todor, C. Liu, E. J. Leslie, D. J. Cutler, D. Ghosh and M. P. Epstein
    Leveraging Family History in Case-Control Analyses of Rare Variation
    DOI: 10.1534/genetics.119.302846

2020   M. Su, J. Lyles, R. A. Petit III, J. M. Peterson, M. Hargita, H. Tang, Solís-Lemus, C., C. Quave, T. D. Read
    Genomic analysis of variability in delta-toxin levels between Staphylococcus aureus strains
    DOI: 10.7717/peerj.8717

2019   Solís-Lemus, C., Ma, X., Hostetter II, M., Kundu, S., Qiu, P., Pimentel-Alarcón D.
    Prediction of functional markers of mass cytometry data via deep learning
    DOI: 10.1007/978-3-030-33416-1_5

2018   Solís-Lemus, C., Pimentel-Alarcón D.
    Breaking the Limits of Subspace Inference
    56th Annual Allerton Conference on Communication, Control, and Computing

2018   Spooner, D.M., Ruess, H., Arbizu, C.I., Rodriguez, F., Solís-Lemus, C.
    Greatly reduced phylogenetic structure in the cultivated potato clade (Solanum section Petota pro parte)
    DOI: 10.1002/ajb2.1008

2018   Bastide, P., Solís-Lemus, C., Kriebel, R., Sparks, K.W., Ané, C.
    Phylogenetic Comparative Methods on Phylogenetic Networks with Reticulations
    DOI: 10.1093/sysbio/syy033

2017   Solís-Lemus, C., Bastide, P., Ané, C.
    PhyloNetworks: a package for phylogenetic networks
    DOI: 10.1093/molbev/msx235

2017   Ané, C., Bastide, P., Mariadassou, M., Robin, S., Solís-Lemus, C.
    Processus d’évolution réticulée: tests de signal phylogénétique
    Journées de Statistique

2017   Pimentel-Alarcón D., Biswas A., Solís-Lemus, C.
    Adversarial Principal Component Analysis
    IEEE International Symposium on Information Theory (ISIT)

2016   Solís-Lemus, C., Ané, C.
    Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting
    DOI: 10.1371/journal.pgen.1005896

2016   Solís-Lemus, C., Yang, M., Ané, C.
    Inconsistency of species-tree methods under gene flow
    DOI: 10.1093/sysbio/syw030

2016   Baum, D., Ané, C., Larget, B., Solís-Lemus, C., Ho, L.S.T, Boone, P., Drummond, C., Bontrager, M., Hunter, S., Saucier, B.
    Statistical evidence for common ancestry: application to primates
    DOI: 10.1111/evo.12934

2016   Pimentel-Alarcón D., Solís-Lemus, C.
    Crime detection via crowdsourcing
    8th Mexican Conference on Pattern Recognition, Springer International

2015   Solís-Lemus, C., L.L. Knowles and C. Ané
    Bayesian species delimitation combining multiple genes and traits in a unified framework
    DOI: 10.1111/evo.12582

2015   Solís-Lemus, C.
    Statistical methods to infer population structure with coalescence and gene flow.
    PhD dissertation, Department of Statistics, University of Wisconsin-Madison

Awards

NSF CAREER

DEB Award 2144367
Title: CAREER: Towards Scalable and Robust Inference of Phylogenetic Networks
Dates: February 1, 2022 to January 31, 2027
Personnel:
- PI: Claudia Solis-Lemus

Project summary

Scientists worldwide are engaged in efforts to understand how all planetary biodiversity evolved. This diversification process is represented through the Tree of Life. Achieving the goal of a complete estimate of the Tree of Life would allow us to fully understand the development and evolution of important biological traits in nature, for example, those related to resilience to extinction when exposed to environmental threats such as climate change. It would also provide information about the emergence and evolution of novel human pathogens that pose severe threats to human health. Thus, the development of statistical and computational tools to reconstruct the Tree of Life is paramount in evolutionary biology, systematics, conservation efforts, and human health research. Existing tree reconstruction methods, however, are limited because they do not account for important biological processes such as species hybridization, introgression or horizontal gene transfer, and thus, recent years have seen an explosion of methods to reconstruct phylogenetic networks rather than trees. Existing network reconstruction methods lack statistical guarantees ensuring the detection of reticulate signals in data, are not scalable enough for big data, and are tailored to reconstruct simple networks. Thus, they are not sufficient to tackle the complexity of reticulate evolution in fungi, prokaryotes, or viruses. This project will develop novel network inference methods with strong statistical guarantees that are robust enough to infer complex networks and scalable enough to accommodate big data. The methods will allow the integration of all organisms into the Tree of Life and thus help to complete a broader picture of evolution across all domains of life. The project will produce open source software and data science modules for K-16 outreach, and includes a strong focus on training underrepresented groups in STEM.

Apply! New positions funded by NSF CAREER

Publications supported by the award

DOE: Computational Tool Development for Integrative Systems Biology Data Analysis

DE-FOA-0002217
Title: Harnessing the power of big omics data: Novel statistical tools to study the role of microbial communities in fundamental biological processes
Dates: September 14, 2020 to September 14, 2022
Personnel:
- PI: Claudia Solis-Lemus
- Sam Ozminkowski (MS student in Statistics)
- Marianne Bjorner (MS student in CS)
- Rosa Aghdam (postdoc)

Project summary

Microbial communities are among the main driving forces of biogeochemical processes in the biosphere. In particular, many critical soil processes such as mineral weathering and soil cycling of mineral-sorbed organic matter are governed by mineral-associated microbes. Understanding the composition of microbial communities and which environmental factors shape this composition is crucial to comprehend soil biological processes and to predict microbial responses to environmental changes. In order to identify the driving factors in soil biological processes, we need robust statistical tools that are able to connect a set of predictors with a specific phenotype. Yet, the innovation in the statistical theory for biochemical and biophysical processes has not matched the increasing complexity of soil data. Indeed, existing statistical techniques have four main drawbacks: 1) they perform poorly on high-dimensional highly sparse data, such as soil metagenomics; 2) they ignore spatial correlation structure, which can be a key component in soil-related data; 3) they do not provide valid p-values under high-dimensional settings, making them unable to detect significant factors driving the phenotype of interest, and 4) they tend to focus on abundance matrices to represent microbial compositions, which can be flawed because their compositional nature (sum-to-1 restriction) affects how proportions behave in different experimental settings (e.g. changes in proportions in the microbial composition do not necessarily reflect actual biological changes in the interactions). The overall objective of this proposal is to pioneer the development of the next generation of statistical theory (accompanied by open-source publicly available software) for soil omics data.
Our novel statistical methods will overcome existing challenges in standard approaches in three ways: 1) they will inherently account for high-dimensional highly interconnected data through the development of novel mixed-effects sparse learning models; 2) they will produce valid adaptive p-values through post selection inference, and 3) they will be implemented in open-source publicly available software that will serve the broader scientific community.
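
The compositional issue in drawback 4 can be seen in a small worked example. The sketch below uses made-up absolute counts for three hypothetical taxa (none of these numbers come from the project): only one taxon changes in absolute abundance, yet every relative abundance changes because the proportions must sum to 1.

```python
def to_proportions(counts):
    """Convert absolute counts per taxon into relative abundances that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical absolute abundances in two conditions.
before = {"taxon_a": 100, "taxon_b": 100, "taxon_c": 200}
after = dict(before, taxon_c=600)  # only taxon_c grows; taxon_a and taxon_b are unchanged

p_before = to_proportions(before)
p_after = to_proportions(after)

for taxon in before:
    print(taxon, before[taxon], after[taxon], p_before[taxon], p_after[taxon])
```

Here taxon_a and taxon_b keep the same absolute abundance, but their proportions halve from 0.25 to 0.125. An analysis based only on the abundance matrix could read this as a decline in both taxa.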

Publications supported by the award


USDA-NIFA: hatch project 1023699

Title: Novel interaction and network statistical models for microbiome data
Dates: October 1, 2020 to September 30, 2022
Personnel:
- PI: Claudia Solis-Lemus
- Yunyi Shen (MS student in Statistics)
- Sam Ozminkowski (MS student in Statistics)

Project summary

The growing food demand can only be sustained through rigorous and consistent support of plant and soil health worldwide. Recognizing the microbial, environmental and agricultural factors that drive plant and soil phenotypes is crucial to comprehend processes connected to plant and soil health, to identify global practices of sustainable agriculture, as well as to predict plant and soil responses to environmental perturbations such as climate change. In order to identify the driving factors in plant and soil health, we need robust statistical tools that are able to connect a set of predictors with a specific phenotype. Yet, the innovation in the methodological data science tools for agricultural practices has not matched the increasing complexity of soil and plant data. The overall objective of this project is to develop the next generation of statistical theory (accompanied by open-source publicly available software) for soil and plant data by exploiting the high-dimensional highly interconnected data through the development of novel microbiome interaction models. By harnessing the power of big data through new statistical theory in sparse learning and network regression models, our work will produce tools that can better identify the drivers of soil and plant health to aid in the adoption of global practices of sustainable agriculture, which are vital to meet the ever-increasing need for food availability in the 21st century.

Publications supported by the award


Wisconsin Potato and Vegetable Growers Association, Inc.

Title: Development of bioinformatic tools to leverage certification data for enhanced seed potato production
Dates: March 15, 2021 to June 30, 2022
Personnel:
- PI: Claudia Solis-Lemus
- co-PI: Renee Rioux
- Haoming Chen (undergraduate student in CS)
- Elaine Wu (undergraduate student in CS)

Project summary

The overarching objective of this proposal is to initiate development of a virtual tool for analyzing and visualizing field data collected each year by the Wisconsin Seed Potato Certification Program (WSPCP) for use on the plant health certificate. Specific objectives include: 1) Creating an enhanced cloud-based database to house seed certification program data, 2) Developing visualization tools for interacting with seed potato certification program data, and 3) Generating data analytics capability to extrapolate from trends in the available data.

Software supported by the award

  • Potato Dashboard (only available for WI seed certification staff at the moment)