Anna Yannakopoulos, Graduate Student (PhD), Computational Mathematics, Science & Engineering, Michigan State University
The DNA sequence stored in every cell in a human body is virtually the same, inherited through millions of rounds of cell division that began with a single zygote after the fusion of sperm and egg. However, given the same set of DNA instructions, cells with distinct structures and functions emerge because only certain sets of genes coded within the DNA are active in a given cell. Single-cell RNA-sequencing allows us to record which genes are active in individual cells. After sequencing thousands of cells in a tissue, we can cluster cells with similar profiles of active genes; these clusters represent distinct cell types. While clustering the cells is relatively straightforward, assigning each cluster the correct cell-type label in an unbiased manner is challenging. Here, we use natural language processing (NLP) to associate genes appearing in PubMed abstracts with cell types listed in the UBERON Cell Ontology. We combine this information with gene expression profiles from single-cell RNA-sequencing experiments to form a cell-by-cell-type matrix of continuous values that capture the correspondence between the genes active in a cell and any cell type present in the ontology. We then build a one-vs-the-rest regularized regression model for each cell cluster, which yields a ranked list of beta coefficients and their corresponding cell-type terms, where non-zero coefficients represent potential cluster annotations. By further propagating beta values through the ontology graph, we are able to find the correct cell-type label among the top-ranked terms for many of the cell types tested. This work provides a proof of principle that NLP can be used to create unbiased lists of genes activated in specific cell types, and that unsupervised models can use this information to annotate anonymous clusters of cells with correct cell-type labels.
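As a rough illustration of the cell-by-cell-type matrix described above, the sketch below multiplies a toy expression matrix by a toy gene-to-cell-type association matrix. The abstract does not say how the two sources are combined, so the weighted sum is an assumption, and every gene weight, expression value, and function name here is hypothetical. The regression and ontology-propagation steps are not shown.

```python
# Toy data only: the weights and expression values below are invented
# for illustration and do not come from the study.
genes = ["CD3E", "CD19", "ALB"]
cell_types = ["T cell", "B cell", "hepatocyte"]

# expression[cell][gene]: normalized expression of each gene in each cell.
expression = [
    [5.0, 0.0, 0.1],  # cell 0
    [0.2, 4.0, 0.0],  # cell 1
    [0.0, 0.1, 6.0],  # cell 2
]

# association[gene][cell_type]: hypothetical literature-derived weight
# linking each gene to each cell-type term.
association = [
    [0.9, 0.1, 0.0],  # CD3E
    [0.1, 0.8, 0.0],  # CD19
    [0.0, 0.0, 1.0],  # ALB
]


def cell_by_cell_type(expression, association):
    """Return a cells x cell-types score matrix.

    Assumption: each score is the expression-weighted sum of the gene
    association weights (a plain matrix product).
    """
    n_types = len(association[0])
    return [
        [
            sum(value * association[g][t] for g, value in enumerate(cell))
            for t in range(n_types)
        ]
        for cell in expression
    ]


scores = cell_by_cell_type(expression, association)

# Highest-scoring cell-type term for each cell.
best = [
    cell_types[max(range(len(row)), key=row.__getitem__)]
    for row in scores
]
print(best)
```

In this toy setup each cell scores highest for the cell type whose associated gene it expresses most strongly; a real analysis would feed such a matrix into the per-cluster regression step rather than taking the maximum directly.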
Kayla Johnson, Graduate Student (PhD), Biochemistry & Molecular Biology, Computational Mathematics, Science and Engineering, Michigan State University
As the cost of RNA sequencing has continued to fall, the amount of publicly available RNA-seq data has continued to grow; currently, there are over 80,000 publicly available human RNA-seq samples. A predominant method for studying gene function in specific biological contexts is to construct gene co-expression networks using transcriptomes from conditions of interest. Although many studies have focused on the best preprocessing procedures for RNA-seq data in differential expression analysis, little attention has been given to determining the best processing methods for gene co-expression. Constructing robust co-expression networks depends on accurately quantifying expression from read counts, which are affected by experimental and technical artifacts that introduce non-biological variation into the data. In this research, we leverage thousands of uniformly aligned RNA-seq samples from various experiments that span diverse tissues to compare methods. We construct gene co-expression networks using different within-sample normalizations, between-sample normalizations, and network transformations, and we evaluate the resulting networks based on their ability to recover documented tissue-naive and tissue-specific gene functional relationships. Our results show that a select few normalizations are clearly superior to all others, and that a few others should be used only in datasets with particular characteristics. This comprehensive benchmarking provides guidance for best practices in deriving robust gene co-expression networks from RNA-seq data.
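To make the pipeline above concrete, here is a minimal sketch of one way to go from read counts to a co-expression network. The abstract does not name the normalizations compared, so the choices here (counts per million as the within-sample normalization, a log2 transform, and Pearson correlation as the edge weight) are assumptions for illustration only, as are the function names and the toy counts. No between-sample normalization or network transformation is applied.

```python
import math


def cpm(counts):
    """Within-sample normalization: counts per million.

    `counts` is a genes x samples matrix of raw read counts.
    """
    n_samples = len(counts[0])
    totals = [sum(row[s] for row in counts) for s in range(n_samples)]
    return [
        [row[s] / totals[s] * 1e6 for s in range(n_samples)]
        for row in counts
    ]


def log_transform(matrix):
    """log2(x + 1) to compress the dynamic range of expression values."""
    return [[math.log2(value + 1.0) for value in row] for row in matrix]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # constant gene: correlation is undefined, report 0
    return cov / (sd_x * sd_y)


def coexpression_network(counts):
    """Return a genes x genes matrix of pairwise correlations."""
    expr = log_transform(cpm(counts))
    n_genes = len(expr)
    return [
        [pearson(expr[i], expr[j]) for j in range(n_genes)]
        for i in range(n_genes)
    ]


# Hypothetical counts: 3 genes (rows) x 4 samples (columns).
toy_counts = [
    [10, 20, 30, 40],
    [12, 18, 33, 41],
    [50, 40, 20, 10],
]
network = coexpression_network(toy_counts)
for row in network:
    print([round(value, 3) for value in row])
```

Swapping `cpm` or `log_transform` for another normalization and rebuilding the network is the kind of comparison the benchmarking describes; scoring each network against known gene functional relationships is not shown here.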
Stephanie L. Hickey, Graduate Student (PhD), Biochemistry & Molecular Biology, Computational Mathematics, Science and Engineering, Michigan State University
The embryonic cell lineages that give rise to the fetus, placenta, and yolk sac must be properly specified early in development in order to support a healthy pregnancy. During mouse embryogenesis, this specification takes place in a two-step process. The first step occurs at embryonic day (E) 3.0, when the trophectoderm (TE), which later becomes the placenta, separates from the inner cell mass (ICM). Between E3.5 and E3.75, the ICM is segregated into the epiblast (EPI) and primitive endoderm (PE), which develop into the fetus and yolk sac, respectively. Fgf/Mapk signaling is known to be important in the switch from unspecified ICM to committed EPI and PE cells, but the mechanisms that ensure exit from the progenitor state and entry into a differentiative state are not wholly clear. For example, various contradictory roles for BMP signaling molecules in the blastocyst have been described, but no consensus has been reached. Here, we reanalyze publicly available single-cell transcriptomic data collected from mouse embryos at distinct developmental stages to assess the cell-type-specific expression patterns of BMP ligands and receptors. We identified E4.5 EPI-specific expression of BMP4, and further experiments provided evidence of BMP signal transduction in PE cells at the same developmental stage. These data suggest that BMP4 secreted from EPI cells may influence the identity of PE cells.