July 5, 2019 at 2:03 pm

Research in Hua Lab Makes Breakthrough on Genomics Studies and Teaching

Dr. Zhihua Hua

As DNA sequencing technologies have improved, more and more genome sequencing data have become available to the public. Understanding these genome sequences is not easy, however, because they appear to be written randomly and repetitively in four letters, A, T, C, and G, which stand for nucleotides, the building blocks of DNA.

Converting these genome sequencing data into meaningful biological information is one of the most challenging research topics in biology. A recently published study from Dr. Zhihua Hua’s laboratory in the Environmental & Plant Biology Department and the Molecular & Cellular Biology program at Ohio University tackles this question using bioinformatics, an interdisciplinary approach combining mathematical modeling, computer programming, and evolutionary biology analysis.

Hua’s lab developed an approach using the evolutionary history of a gene superfamily and a tool capable of crunching through data from long-dead siblings as well as cousins in other plants. The results can help uncover previously unknown members of a gene family and help in analyzing the functions of genome sequences.

Starting with the Impossible

Hua developed this interest when he worked with Dr. Richard D. Vierstra at the University of Wisconsin-Madison. Vierstra challenged him to understand the biological functions of 700 unknown genes in Arabidopsis thaliana, a weed in the mustard family that has been widely used to study fundamental biological questions. However, it is not feasible for one biologist to characterize the functions of 700 unknown genes. Characterizing even one unknown gene can take a biologist's entire career.

This challenge pushed Hua to look for solutions outside the box. He developed the idea of letting evolutionary history predict the winners among these 700 unknown genes, based on the theory that their activities in the current Arabidopsis genome have been selected through millions of years of evolution. Unfortunately, it was not easy to trace the evolutionary history of these genes because they were born in the ancestors of Arabidopsis at different times. Some of them had many siblings that died out before Arabidopsis emerged; some are orphans that are present only in Arabidopsis; and some have cousins in many other species.

The key to finding these relationships is discovering the siblings and cousins of these unknown genes in many other genomes, which are also written in A, T, C, and G. The rapid accumulation of sequenced genomes benefited his approach.

Tracking a Gene Superfamily

In biology, a group of genes that share similar sequence features is called a gene superfamily. The 700 unknown genes that Hua was challenged to study belong to one particular gene superfamily, called F-box, whose protein products recognize abnormal and/or unwanted proteins in all eukaryotic cells and target them for degradation. Many of its known members have been shown to play important roles in a wide range of life processes in plants as well as in humans, including the cell cycle, tumorigenesis, and circadian rhythms.

When Hua started to look for siblings and cousins of his 700 unknown genes, he recognized that many unreported F-box genes were hidden in sequenced genomes. Finding these hidden genes became another challenge. After much practice, Hua developed a mathematical algorithm called Closing-Target-Trimming. The algorithm iteratively trims away the F-box genes that have already been reported so that the unreported ones are exposed in the genome and can then be carefully annotated. This algorithm was first published in his postdoctoral work in 2011, and in 2017 the journal PLoS ONE, on its 10th anniversary, recognized that paper as being among the top 10 percent of its most-cited works.
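The trimming idea described above can be illustrated with a toy sketch in Python. This is not the published Closing-Target-Trimming code, which the article says was written in Perl. The function names, the literal motif, and the short example sequence are all hypothetical, and an exact regular-expression match stands in for whatever sequence-similarity search the real tool performs.

```python
import re


def mask_intervals(seq, intervals):
    """Replace each [start, end) interval with 'N' so it cannot match again."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(start, end):
            chars[i] = "N"
    return "".join(chars)


def find_hidden_loci(genome, known, motif):
    """Toy illustration: trim away known loci, then repeatedly search what
    is left for the motif, trimming each new hit before searching again.

    `known` is a list of (start, end) positions already annotated.
    `motif` is a regular expression standing in for a family signature
    (an assumption made for this sketch only).
    """
    masked = mask_intervals(genome, known)  # trim reported members first
    found = []
    while True:
        match = re.search(motif, masked)
        if match is None:  # nothing left to expose
            break
        hit = (match.start(), match.end())
        found.append(hit)
        masked = mask_intervals(masked, [hit])  # trim the new hit as well
    return found


# Hypothetical example: one annotated locus at positions 2-7, two unreported.
genome = "AAATGCATTTATGCACCCATGCAGG"
hidden = find_hidden_loci(genome, known=[(2, 7)], motif="ATGCA")
print(hidden)
```

Masking with `N` rather than deleting the trimmed region keeps the coordinates of the remaining sequence unchanged, so each reported hit can be mapped straight back to the original genome.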

Building the Tool to Find New Superfamily Members

Applying the Closing-Target-Trimming algorithm to find unreported superfamily members was also challenging. The best way to read through all the nucleotides in a genome is to use a computer, and this requires solid skill in computer programming.

With very little prior programming experience, Hua pushed himself to master Perl, a programming language that is very efficient for text mining, and he developed a Perl package that deploys the Closing-Target-Trimming algorithm to find unreported superfamily members in any genome. His algorithm and program turned out to be very effective in genomic studies of gene superfamilies.

In his own lab at Ohio University, he and his group members have applied this approach to expand the research into other gene superfamilies. To date, the Hua lab has published six articles in this field in several prestigious journals, including Cell, Plant Journal, PeerJ, International Journal of Molecular Sciences, and PLoS ONE.

To develop vigorous research and teaching programs, Hua has been striving to pass on his research skills to students in his lab and in his classes, and also to the science community. In the past five years as a faculty member at Ohio University, he developed an upper-level course, PBIO 4280/5280 Laboratory in Genomics Techniques (also cross-listed as MCB 5280), to teach genomics techniques. He published a textbook for this course in early 2019 to introduce programming and genome evolutionary analysis skills to biology students across the campus.

To further help biologists with little programming experience study the genomics of gene superfamilies, Hua and his undergraduate student, Matthew Early, recently published two programming packages, CTT and CTTdocker, that employ the Closing-Target-Trimming algorithm to discover hidden superfamily members in any sequenced genome. The study is remarkable in showing that current genome annotations leave, on average, 15 percent of the genes undiscovered, and that these programs can recover them. The study also demonstrated the effectiveness of the programs. For example, the CTT program was able to go through 40.9 billion nucleotides and find new genes within four days. This work is equivalent to reading through 15 different volumes of the American Heritage Dictionary of the English Language to find sentences that share a similar pattern within one hour.
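To put the four-day figure in perspective, the average scanning rate can be worked out with a few lines of Python. This is only back-of-the-envelope arithmetic on the numbers quoted above, assuming that "four days" means 96 hours of continuous running time; the function name is made up for this illustration.

```python
def nucleotides_per_second(total_nucleotides, days):
    """Average throughput, assuming the run was continuous for `days` days."""
    seconds = days * 24 * 60 * 60
    return total_nucleotides / seconds


# 40.9 billion nucleotides in four days, as reported in the article.
rate = nucleotides_per_second(40.9e9, 4)
print(f"{rate:,.0f} nucleotides per second")
```

Under that assumption the program averages roughly 118,000 nucleotides per second.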

This work was recently published in PLoS ONE in an article titled “Closing target trimming and CTTdocker programs for discovering hidden superfamily loci in genomes.”

These research projects involved members of Dr. Hua’s lab: two graduate students, Paymon Doordian and Peifeng Yu; two PACE undergraduate students, William Vu (Chemistry & Biochemistry major) and Early (Computer Science major); and one visiting scholar, Dr. Zhenyu Gao. The research was funded by a number of sponsors, including an Ohio University startup fund, PACE funds, an OURC grant, a Baker grant, and a National Science Foundation CAREER Award to Hua.
