N - Datasets
Mitchell, R. A. C. 2024. universal_grass_peps. Rothamsted Research. https://doi.org/10.23637/rothamsted.98ywz
|Mitchell, R. A. C.
This dataset relates to a bioinformatics pipeline developed by Rowan Mitchell during 2018-2023 that seeks to identify all universal protein-coding genes in grasses and to estimate how specific they are to grasses. The dataset has 6 components: (1) universal_grass_peps.xlsx contains summary information on all the universal groups of peptide sequences (peps) identified. (2) Files in scripts/* are the Perl source code and bash scripts that were used to generate the data. (3) Files in genBlastG/* are genome annotation files for each novel gene model generated by genBlastG in the pipeline. (4) Files hmms/*.msa.fa are the multiple sequence alignments in FASTA format, one for each group. (5) Files hmms/final_db.hmms* form the profile database, which can be searched with query sequences using the HMMER package. (6) Files in lookup/* allow users to find which groups a query grass pep ID belongs to, or is associated with, for 16 different grass species.
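Component (5) is meant to be searched with HMMER. As a minimal sketch, assuming a search has already been run with something like `hmmscan --tblout hits.tbl hmms/final_db.hmms query.fa`, the tabular output could be summarised in Python as below. The database base name, the file names `hits.tbl` and `query.fa`, the function names, and the sample rows (including the group and pep IDs) are illustrative assumptions, not part of the dataset.

```python
from typing import Dict, List


def parse_tblout(text: str) -> List[Dict[str, object]]:
    """Parse HMMER --tblout text into a list of per-hit records.

    Only the leading whitespace-separated columns are read:
    target name, target accession, query name, query accession,
    full-sequence E-value and full-sequence bit score.
    """
    hits = []
    for line in text.splitlines():
        # Comment and blank lines carry no hit data.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        hits.append({
            "profile": fields[0],        # HMM profile (group) that matched
            "query": fields[2],          # query sequence ID
            "evalue": float(fields[4]),  # full-sequence E-value
            "score": float(fields[5]),   # full-sequence bit score
        })
    return hits


def best_hit_per_query(hits: List[Dict[str, object]]) -> Dict[str, Dict[str, object]]:
    """Keep the highest-scoring profile hit for each query sequence."""
    best: Dict[str, Dict[str, object]] = {}
    for hit in hits:
        current = best.get(hit["query"])
        if current is None or hit["score"] > current["score"]:
            best[hit["query"]] = hit
    return best


# Made-up rows in --tblout layout; group and pep IDs are hypothetical.
SAMPLE = """\
# target name  accession  query name  accession  E-value  score  bias ...
group_0001     -          pepA        -          1.2e-80  265.3  0.1  ...
group_0002     -          pepA        -          3.0e-10   35.0  0.0  ...
group_0002     -          pepB        -          5.5e-42  140.7  0.2  ...
"""

if __name__ == "__main__":
    for query, hit in best_hit_per_query(parse_tblout(SAMPLE)).items():
        print(query, hit["profile"], hit["score"])
```

In practice the text passed to `parse_tblout` would be read from the `--tblout` file rather than from an inline string; the best-scoring group per query is only one possible way to summarise the hits.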
|Year of Publication
2024
|Digital Object Identifier (DOI)
https://doi.org/10.23637/rothamsted.98ywz
|Biotechnology and Biological Sciences Research Council
|Funder project or code
|Xylan arabinosyl transferases: identification and characterisation of their role in determining properties of grass cell walls
|Designing Future Wheat (DFW) [ISPG]
CC BY 4.0
File Access Level
|Data collection method
The genomic sequences that were used in the pipeline were taken from Ensembl Plants release 56 (February 2023).
|Data preparation and processing activities
A bioinformatics pipeline was developed to identify highly conserved universal grass genes using 16 full grass genomes in Ensembl Plants release 56. The first steps used existing gene models to generate groups of grass orthologs of rice and maize genes that are present in most grass species, and refined the membership of these groups so as to optimise the Hidden Markov Model (HMM) profile score from the HMMER package. These groups were then supplemented with new gene models found in grass genomes using the genBlastG tool; this step more than doubled the number of universal groups, giving 12,609 highly conserved, universal groups. Specificity of these groups was assessed using the closest matching gene models from non-monocot species. Possible cut-off values were tested with sets of genes expected to have either a function common to all plants or a commelinid- or grass-specific function. A specificity metric based on the HMM score from grass group profiles performed better than % identity at discriminating between the specific and common function sets. Using an appropriate cut-off for this metric, 5,973 of the groups were identified as specific to monocots, of which 66% appeared to be grass specific.
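The cut-off testing described above can be illustrated with a small sketch. The record does not define the specificity metric, so the code below assumes a hypothetical per-group value where higher means more specific, and it ranks candidate cut-offs by Youden's J (sensitivity + specificity - 1), which is a swapped-in criterion and not necessarily the one used in the pipeline. All numbers are made up.

```python
from typing import Iterable, Sequence, Tuple


def evaluate_cutoff(specific: Sequence[float],
                    common: Sequence[float],
                    cutoff: float) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for one cut-off.

    sensitivity: fraction of the expected-specific set at or above the cut-off
    specificity: fraction of the expected-common set below the cut-off
    """
    sensitivity = sum(v >= cutoff for v in specific) / len(specific)
    specificity = sum(v < cutoff for v in common) / len(common)
    return sensitivity, specificity


def best_cutoff(specific: Sequence[float],
                common: Sequence[float],
                candidates: Iterable[float]) -> float:
    """Pick the candidate cut-off with the highest Youden's J."""
    def youden_j(cutoff: float) -> float:
        sensitivity, specificity = evaluate_cutoff(specific, common, cutoff)
        return sensitivity + specificity - 1.0
    return max(candidates, key=youden_j)


# Hypothetical metric values for the two reference gene sets.
SPECIFIC_SET = [0.9, 0.8, 0.7, 0.4]  # expected commelinid- or grass-specific
COMMON_SET = [0.1, 0.2, 0.5, 0.3]    # expected common to all plants

if __name__ == "__main__":
    for c in (0.3, 0.45, 0.6):
        print(c, evaluate_cutoff(SPECIFIC_SET, COMMON_SET, c))
    print("chosen:", best_cutoff(SPECIFIC_SET, COMMON_SET, (0.3, 0.45, 0.6)))
```

The same two functions could be applied to a % identity column to compare how well each measure separates the two reference sets.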
Permalink - https://repository.rothamsted.ac.uk/item/98ywz/universal-grass-peps