research
Overview
- We work to develop statistical models to answer biological questions, balancing biological interpretability, theoretical guarantees, and computational tractability
- We deal with modern big data which is highly interconnected through graphical structures: phylogenetic networks to study evolution, interaction networks to study microbial communities in soil and plants, neural networks to predict phenotypes from genotypes
- Our work produces a collection of new statistical methods with solid theoretical guarantees and efficient computational implementations that are adaptable to analyze the complex characteristics of modern big biological data
- We do not live in a statistical bubble! We always welcome new collaborations that can help our research to be relevant and applied to real-life data
Interested in joining the lab?
More information is available in Opportunities.
Funding
The lab is currently funded by the following awards.
Statistical phylogenomics
In this lab, we work to produce novel statistical models and methods to reconstruct the Tree of Life that are theoretically sound yet computationally efficient and scalable to meet the ever-growing needs of biological big data. We strive to accompany our theoretical work with open-source publicly available software.
Examples of our current research involve:
- extension of phylogenetic network inference methods to broader classes of networks
- robustness of phylogenetic inference on microbial datasets such as bacteria and viruses
- alternative sampling schemes to MCMC that are expected to produce more efficient Bayesian phylogenetic estimation
- statistical properties of BHV space and possible extension to networks
Our work is not purely methodological. Among our current collaborations, we can highlight:
- studying the ancestral protein sequences of Potyvirus and Picornavirus
- reconstructing phylogenetic trees and networks of grapes and carrots and studying the evolution of traits related to climate resistance via comparative methods
- reconstructing phylogenetic trees and networks of Escovopsis and studying the evolution of vesicular shapes via comparative methods
Want to learn more about phylogenetics (especially networks)? See this list of resources that starts with introductory videos and then a small subset of relevant papers in the field.
Statistics in genomics and microbiome
In this lab, we work to produce tools to better represent microbial communities (via networks) and use the microbial communities as potential predictors of plant, soil or human health phenotypes. New models are centered on high-dimensional statistical models like penalized regression and post selection inference that can simultaneously model all microbes across the microbiome.
Examples of our current research involve:
- estimation of microbial interactions accounting for spatial correlations via Ising models
- network regression framework to understand the effects of microbial communities on a response
- post selection inference and penalized regression models applied to human or plant disease
- high dimensional models for the integration of different omics data types applied to human microbiome research and plant microbiome research
Our work is not purely methodological. Among our current collaborations, we can highlight:
- studying how root microbial communities affect potato health and response to environmental changes
- studying the effect of lung microbiome on health outcomes in cystic fibrosis patients
Statistical view in deep learning
Over the last few decades, deep learning has enabled unprecedented predictive power in a wide range of applications. In particular, neural networks (NNs) are already successfully used in computer vision, astrophysics, and even cancer histology. The main reason for their state-of-the-art prediction accuracy is their flexibility: their architecture can be adapted to fit almost any type of data and any type of model. Yet poor generalization outside the training data, the lack of statistical guarantees of confidence, and the perception that they are “black box” models have hampered their adoption in translational fields like personalized medicine, where inaccurate predictions might have grave consequences. Furthermore, NN methods are known for being “data-hungry”, meaning that huge amounts of data are required for training and validation. This requirement prohibits their use in fields with comparatively small sample sizes, such as human health, where restrictions on data sharing and privacy limit researchers’ ability to acquire large enough datasets for NNs.
In this lab, we work to explore the potential of NNs in biomedical areas. On one side, we work on data related to human health, such as precision medicine or the emergence of antibiotic resistance in infectious diseases. On the other side, we work on data related to soil and plant health, such as the use of biocontrol mycoviruses to fight the emergence of fungicide-resistant crop pathogens.
Examples of our current research involve:
- robustness of NN models to predict microbial phenotypes from genomic sequences: antibiotic resistance in Staphylococcus aureus and Pseudomonas aeruginosa and hypovirulence potential of mycoviruses in Sclerotinia
- connections between statistical concepts of uncertainty (confidence intervals or hypothesis testing) and NN models
Publications
Preprints under revision
Bjorner, M., Molloy, E., Dewey, C., Solís-Lemus, C. (2022). Detectability of Varied Hybridization Scenarios using Genome-Scale Hybrid Detection Methods. arxiv (2022): 2211.00712.
Justison, J., Solís-Lemus, C., Heath, T. (2022). SiPhyNetwork: An R package for Simulating Phylogenetic Networks. biorxiv (2022): 2022.10.26.513953.
Rattray, JB, Walden, R., Marquez-Zacarias, P., Molotkova, E., Perron, G., Solís-Lemus, C., Pimentel-Alarcon, D., Brown, S. (2022). Machine learning identification of Pseudomonas aeruginosa strains from colony image data. biorxiv (2022): 2022.09.02.506375v1.
Ozminkowski, S., Wu, Y., Yang, L., Xu, Z., Selberg, L., Huang, C., Solís-Lemus, C. (2022). BioKlustering: a web app for semi-supervised learning of maximally imbalanced genomic data, arxiv (2022): 2209.11730, github, bioklustering.wid.wisc.edu
Ozminkowski, S., Solís-Lemus, C. (2022). Identifying microbial drivers in biological phenotypes with a Bayesian Network Regression model, arXiv (2022): 2208.05600, github
Shen, Y., Solís-Lemus, C., S. K. Deshpande. (2022). Sparse Gaussian chain graphs with the spike-and-slab LASSO: Algorithms and asymptotics, arXiv (2022): 2207.07020
Solís-Lemus, C., S. Yang, L. Zepeda-Nunez (2022). Accurate Phylogenetic Inference with a Symmetry-preserving Neural Network Model, arXiv (2022): 2201.04663, github
Solís-Lemus, C., A. M. Holleman, A. Todor, B. Bradley, K. J. Ressler, D. Ghosh, M. P. Epstein. (2021). A Kernel Method for Dissecting Genetic Signals in Tests of High-Dimensional Phenotypes, bioRxiv 2021.07.29.454336
Shen, Y., Solís-Lemus, C. (2021). CARlasso: An R package for the estimation of sparse microbial networks with predictors, arXiv (2021): 2107.13763, github
Shen, Y., Solís-Lemus, C. (2021). The Effect of the Prior and the Experimental Design on the Inference of the Precision Matrix in Gaussian Chain Graph Models, arXiv (2021): 2107.01306
Tiley, George P., Andrew A. Crowl, Paul S. Manos, Emily B. Sessa, Solís-Lemus, C., Anne D. Yoder, and J. Gordon Burleigh (2021) Phasing Alleles Improves Network Inference with Allopolyploids, bioRxiv (2021): 10.1101/2021.05.04.442457
Shen, Y., Solís-Lemus, C. (2020). Bayesian Conditional Auto-Regressive LASSO Models to Learn Sparse Networks with Predictors, arXiv (2020): 2012.08397, github
Solís-Lemus, C., A. Coen and Cecile Ané (2020). On the identifiability of phylogenetic networks under a pseudolikelihood model, arxiv (2020): 2010.01758, github
Sun, Y., T.M. Maeda, Solís-Lemus, C., D. Pimentel-Alarcon, Z. Burivalova (2022). Classification of animal sounds in a hyperdiverse rainforest using Convolutional Neural Networks. DOI: 10.1016/j.ecolind.2022.109621
Liu, Y., Solís-Lemus, C. (2022). WI Fast Stats: a collection of web apps for the visualization and analysis of WI Fast Plants data. DOI: 10.21105/jose.00159
Zhang, Z., Cheng, S., Solís-Lemus, C. (2022). Towards a robust out-of-the-box neural network model for genomic data. DOI: 10.1186/s12859-022-04660-8
G. A. Satten, S. W. Curtis, Solís-Lemus, C., E. J. Leslie, M. P. Epstein (2022). Efficient Estimation of Indirect Effects in Case-Control Studies Using a Unified Likelihood Framework. DOI: 10.1002/sim.9390
Su M, Davis MH, Peterson J, Solís-Lemus, C., Satola SW, Read TD. (2021). Effect of genetic background on the evolution of Vancomycin-Intermediate Staphylococcus aureus (VISA). DOI: 10.7717/peerj.11764
Moller A., Winston K., Ji S., Wang J., Hargita Davis M., Solís-Lemus, C., Read T. (2021). Genes Influencing Phage Host Range in Staphylococcus aureus on a Species-Wide Scale. DOI: 10.1128/mSphere.01263-20
Guerrero, V. and Solís-Lemus, C. (2020). A generalized measure of relative dispersion. DOI: 10.1016/j.spl.2020.108806
Solís-Lemus, C., S. T. Fischer, A. Todor, C. Liu, E. J. Leslie, D. J. Cutler, D. Ghosh and M. P. Epstein (2020). Leveraging Family History in Case-Control Analyses of Rare Variation. DOI: 10.1534/genetics.119.302846
M. Su, J. Lyles, R. A. Petit III, J. M. Peterson, M. Hargita, H. Tang, Solís-Lemus, C., C. Quave, T. D. Read (2020). Genomic analysis of variability in delta-toxin levels between Staphylococcus aureus strains. DOI: 10.7717/peerj.8717
Solís-Lemus, C., Ma, X., Hostetter II, M., Kundu, S., Qiu, P., Pimentel-Alarcón D. (2019). Prediction of functional markers of mass cytometry data via deep learning. DOI: 10.1007/978-3-030-33416-1_5
Solís-Lemus, C., Pimentel-Alarcón D. (2018). Breaking the Limits of Subspace Inference. 56th Annual Allerton Conference on Communication, Control, and Computing
Spooner, D.M., Ruess, H., Arbizu, C.I., Rodriguez, F., Solís-Lemus, C. (2018). Greatly reduced phylogenetic structure in the cultivated potato clade (Solanum section Petota pro parte). DOI: 10.1002/ajb2.1008
Bastide, P., Solís-Lemus, C., Kriebel, R., Sparks, K.W., Ané, C. (2018). Phylogenetic Comparative Methods on Phylogenetic Networks with Reticulations. DOI: 10.1093/sysbio/syy033
Solís-Lemus, C., Bastide, P., Ané, C. (2017). PhyloNetworks: a package for phylogenetic networks. DOI: 10.1093/molbev/msx235
Ané, C., Bastide, P., Mariadassou, M., Robin, S., Solís-Lemus, C. (2017). Processus d’évolution réticulée: tests de signal phylogénétique. Journées de Statistique
Pimentel-Alarcón D., Biswas A., Solís-Lemus, C. (2017). Adversarial Principal Component Analysis. IEEE International Symposium on Information Theory (ISIT)
Solís-Lemus, C., Ané, C. (2016). Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. DOI: 10.1371/journal.pgen.1005896
Solís-Lemus, C., Yang, M., Ané, C. (2016). Inconsistency of species-tree methods under gene flow. DOI: 10.1093/sysbio/syw030
Baum, D., Ané, C., Larget, B., Solís-Lemus, C., Ho, L.S.T, Boone, P., Drummond, C., Bontrager, M., Hunter, S., Saucier, B. (2016). Statistical evidence for common ancestry: application to primates. DOI: 10.1111/evo.12934
Pimentel-Alarcón D., Solís-Lemus, C. (2016). Crime detection via crowdsourcing. 8th Mexican Conference on Pattern Recognition, Springer International
Solís-Lemus, C., L.L. Knowles and C. Ané (2015). Bayesian species delimitation combining multiple genes and traits in a unified framework. DOI: 10.1111/evo.12582
Solís-Lemus, C. (2015). Statistical methods to infer population structure with coalescence and gene flow. PhD dissertation, Department of Statistics, University of Wisconsin-Madison
Awards
NSF CAREER
Title: CAREER: Towards Scalable and Robust Inference of Phylogenetic Networks
Dates: February 1, 2022 to January 31, 2027
Personnel:
- PI: Claudia Solis-Lemus
Project summary
Scientists worldwide are engaged in efforts to understand how all planetary biodiversity evolved. This diversification process is represented through the Tree of Life. Achieving the goal of a complete estimate of the Tree of Life would allow us to fully understand the development and evolution of important biological traits in nature, for example, those related to resilience to extinction when exposed to environmental threats such as climate change. It would also provide information about the emergence and evolution of novel human pathogens that pose severe threats to human health. Thus, the development of statistical and computational tools to reconstruct the Tree of Life is paramount in evolutionary biology, systematics, conservation efforts, and human health research. Existing tree reconstruction methods, however, are limited because they do not account for important biological processes such as species hybridization, introgression or horizontal gene transfer, and thus, recent years have seen an explosion of methods to reconstruct phylogenetic networks rather than trees. Existing network reconstruction methods lack statistical guarantees ensuring the detection of reticulate signals in data, are not scalable enough for big data, and are tailored to reconstruct simple networks. Thus, they are not sufficient to tackle the complexity of reticulate evolution in fungi, prokaryotes, or viruses. This project will develop novel network inference methods with strong statistical guarantees that are robust enough to infer complex networks and scalable enough to accommodate big data. The methods will allow the integration of all organisms into the Tree of Life and thus help to complete a broader picture of evolution across all domains of life. The project will produce open source software and data science modules for K-16 outreach, and includes a strong focus on training underrepresented groups in STEM.
Apply! New positions funded by NSF CAREER
- Postdoctoral researcher in the inference of phylogenetic networks
- Postdoctoral position in statistical education
- Project assistantship in Julia package development and maintenance
Publications supported by the award
- Justison et al (2022). biorxiv (2022): 2022.10.26.513953
- Bjorner et al (2022). arxiv (2022): 2211.00712
DOE: Computational Tool Development for Integrative Systems Biology Data Analysis
Title: Harnessing the power of big omics data: Novel statistical tools to study the role of microbial communities in fundamental biological processes
Dates: September 14, 2020 to September 14, 2022
Personnel:
- PI: Claudia Solis-Lemus
- Sam Ozminkowski (MS student in Statistics)
- Marianne Bjorner (MS student in CS)
- Rosa Aghdam (postdoc)
Project summary
Microbial communities are among the main driving forces of biogeochemical processes in the biosphere. In particular, many critical soil processes such as mineral weathering and soil cycling of mineral-sorbed organic matter are governed by mineral-associated microbes. Understanding the composition of microbial communities and what environmental factors play a role in shaping this composition is crucial to comprehend soil biological processes and to predict microbial responses to environmental changes. In order to identify the driving factors in soil biological processes, we need robust statistical tools that are able to connect a set of predictors with a specific phenotype. Yet, the innovation in the statistical theory for biochemical and biophysical processes has not matched the increasing complexity of soil data. Indeed, existing statistical techniques have four main drawbacks: 1) they perform poorly on high-dimensional highly sparse data, such as soil metagenomics; 2) they ignore spatial correlation structure, which can be a key component in soil-related data; 3) they do not provide valid p-values under high-dimensional settings, making them unable to detect significant factors driving the phenotype of interest; and 4) they tend to focus on abundance matrices to represent microbial compositions, which can be misleading because their compositional nature (sum-to-1 restriction) affects how proportions behave in different experimental settings (e.g. changes in proportions in the microbial composition do not necessarily reflect actual biological changes in the interactions). The overall objective of this proposal is to pioneer the development of the next generation of statistical theory (accompanied by open-source publicly available software) for soil omics data.
Our novel statistical methods will overcome existing challenges in standard approaches in three ways: 1) they will inherently account for high-dimensional highly interconnected data through the development of novel mixed-effects sparse learning models; 2) they will produce valid adaptive p-values through post selection inference, and 3) they will be implemented in open-source publicly available software that will serve the broader scientific community.
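The compositional drawback mentioned in the summary (proportions are forced to sum to 1) can be illustrated with a minimal sketch. The taxon names and abundance values below are hypothetical and chosen only for illustration; they are not data from the project, and the helper `to_proportions` is not part of any lab software.

```python
def to_proportions(counts):
    """Convert absolute abundances to relative abundances that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical absolute abundances for three taxa in two samples.
# Only taxon_A changes; taxon_B and taxon_C are identical in both samples.
sample_1 = {"taxon_A": 100, "taxon_B": 100, "taxon_C": 100}
sample_2 = {"taxon_A": 400, "taxon_B": 100, "taxon_C": 100}

props_1 = to_proportions(sample_1)
props_2 = to_proportions(sample_2)

# Print the relative abundance of each taxon before and after the change.
for taxon in sample_1:
    print(f"{taxon}: {props_1[taxon]:.3f} -> {props_2[taxon]:.3f}")
```

In this toy case the relative abundance of taxon_B and taxon_C falls from one third to one sixth even though their absolute abundances are unchanged, which is the kind of artifact that an analysis based only on proportions could mistake for a biological change.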
Publications supported by the award
- Shen and Solis-Lemus (2020) arXiv:2012.08397
- Zhang et al (2022) BMC Bioinformatics DOI:10.1186/s12859-022-04660-8
- Liu and Solis-Lemus (2022) JOSE DOI:10.21105/jose.00159
- Shen and Solis-Lemus (2021) arXiv:2107.01306
- Shen and Solis-Lemus (2021) arXiv:2107.13763
- Ozminkowski and Solis-Lemus (2022) arXiv (2022): 2208.05600
- Ozminkowski et al (2022) arxiv (2022): 2209.11730
- Shen et al (2022) arXiv (2022): 2207.07020
- Bjorner et al (2022) arxiv (2022): 2211.00712
USDA-NIFA: hatch project 1023699
Project summary
The growing food demand can only be sustained through rigorous and consistent support of plant and soil health worldwide. Recognizing the microbial, environmental and agricultural factors that drive plant and soil phenotypes is crucial to comprehend processes connected to plant and soil health, to identify global practices of sustainable agriculture, as well as to predict plant and soil responses to environmental perturbations such as climate change. In order to identify the driving factors in plant and soil health, we need robust statistical tools that are able to connect a set of predictors with a specific phenotype. Yet, the innovation in the methodological data science tools for agricultural practices has not matched the increasing complexity of soil and plant data. The overall objective of this project is to develop a next generation of statistical theory (accompanied by open-source publicly available software) for soil and plant data by exploiting the high-dimensional highly interconnected data through the development of novel microbiome interaction models. By harnessing the power of big data through new statistical theory in sparse learning and network regression models, our work will produce tools that can better identify the drivers of soil and plant health to aid in the adoption of global practices of sustainable agriculture, which are vital to meet the ever-increasing need for food availability in the 21st century.
Publications supported by the award
- Shen and Solis-Lemus (2020) arXiv:2012.08397
- Shen and Solis-Lemus (2021) arXiv:2107.01306
- Ozminkowski and Solis-Lemus (2022) arXiv (2022): 2208.05600
- Ozminkowski et al (2022) arxiv (2022): 2209.11730
- Shen et al (2022) arXiv (2022): 2207.07020
Wisconsin Potato and Vegetable Growers Association, Inc.
Project summary
The overarching objective of this proposal is to initiate development of a virtual tool for analyzing and visualizing field data collected each year by the Wisconsin Seed Potato Certification Program (WSPCP) for use on the plant health certificate. Specific objectives include: 1) Creating an enhanced cloud-based database to house seed certification program data, 2) Developing visualization tools for interacting with seed potato certification program data, and 3) Generating data analytics capability to extrapolate from trends in the available data.
Software supported by the award
- Potato Dashboard (only available for WI seed certification staff at the moment)