universal_grass_peps

N - Datasets

Mitchell, R. A. C. 2024. universal_grass_peps. Rothamsted Research. https://doi.org/10.23637/rothamsted.98ywz

Authors: Mitchell, R. A. C.
Abstract

This dataset relates to a bioinformatics pipeline, developed by Rowan Mitchell during 2018-2023, that seeks to identify all universal protein-coding genes in grasses and to estimate how specific they are to grasses. The dataset has 6 components:
(1) universal_grass_peps.xlsx contains summary information on all the universal groups of peptide sequences (peps) identified.
(2) Files in scripts/* are the Perl source code and bash scripts that were used to generate the data.
(3) Files in genBlastG/* are genome annotation files for each novel gene model generated by genBlastG in the pipeline.
(4) Files hmms/*.msa.fa are the multiple sequence alignment FASTA files, one for each group.
(5) Files hmms/final_db.hmms* are for searching the database with query sequences using the HMMER package.
(6) Files in lookup/* allow users to find which groups a grass query pep ID is a member of, or associated with, for 16 different grass species.
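As an illustration of component (5), the following is a minimal Python sketch of how a query FASTA file might be searched against the profile database with HMMER's hmmscan and how the resulting table could be read. It assumes that HMMER 3 is installed, that hmms/final_db.hmms is the database path expected by hmmscan (with the accompanying final_db.hmms* files being its pressed index files), and that the file names query.fa and hits.tbl and the group names in the example are hypothetical. It is not taken from the scripts in the dataset.

```python
def hmmscan_command(query_fasta, hmm_db="hmms/final_db.hmms", tblout="hits.tbl"):
    """Build an hmmscan argument list (HMMER 3).

    The default paths are assumptions for illustration only.
    """
    return ["hmmscan", "--tblout", tblout, hmm_db, query_fasta]


def parse_tblout(lines):
    """Yield (profile, query, evalue, score) from hmmscan --tblout lines.

    Columns used: 0 = target (profile) name, 2 = query name,
    4 = full-sequence E-value, 5 = full-sequence bit score.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        yield fields[0], fields[2], float(fields[4]), float(fields[5])


def best_hit(lines):
    """Return the hit with the highest full-sequence score, or None."""
    hits = list(parse_tblout(lines))
    return max(hits, key=lambda h: h[3]) if hits else None


# Hypothetical tblout content; group and query names are invented.
example = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "grpA - query1 - 1e-50 170.2 0.1 2e-50 169.0 0.1 1.0 1 0 0 1 1 1 1 -",
    "grpB - query1 - 3e-10 40.5 0.0 5e-10 39.8 0.0 1.0 1 0 0 1 1 1 1 -",
]
print(best_hit(example))
```

To run the search itself, the argument list can be passed to subprocess.run, for example subprocess.run(hmmscan_command("query.fa"), check=True), and the resulting table file can then be read line by line with parse_tblout.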

Year of Publication: 2024
Publisher: Rothamsted Research
Digital Object Identifier (DOI): https://doi.org/10.23637/rothamsted.98ywz
Keywords: genomic features; data analysis; networks
Funder: Biotechnology and Biological Sciences Research Council
Funder project or code: Xylan arabinosyl transferases: identification and characterisation of their role in determining properties of grass cell walls; Designing Future Wheat (DFW) [ISPG]
Data files
Copyright license: CC BY 4.0
Data type: Archive
Contents: Data
File Access Level: Open

Data files
Copyright license: CC BY 4.0
Data type: Text
Contents: Additional metadata
File Access Level: Open

Data collection method

The genomic sequences that were used in the pipeline were taken from Ensembl Plants release 56 (February 2023).
https://feb2023-plants.ensembl.org/index.html

Data preparation and processing activities

A bioinformatics pipeline was developed to identify highly conserved, universal grass genes using 16 complete grass genomes in Ensembl Plants release 56. The first steps used existing gene models to generate groups of grass orthologs of rice and maize genes that are present in most grass species, and refined the membership of these groups so as to optimise the Hidden Markov Model (HMM) profile score from the HMMER package. These groups were then supplemented with new gene models found in grass genomes using the genBlastG tool; this step increased the number of universal groups more than 2-fold, giving 12,609 highly conserved, universal groups. The specificity of these groups was assessed using the closest-matching gene models from non-monocot species. Possible cut-off values were tested with sets of genes expected to have either a function common to all plants or a commelinid- or grass-specific function. A specificity metric based on the HMM score from grass group profiles performed better than % identity at discriminating between the specific-function and common-function sets. Using an appropriate cut-off for this metric, 5,973 of the groups were identified as specific to monocots, of which 66% appeared to be grass-specific.
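The counts quoted above can be put in proportion with a little arithmetic. The Python sketch below is only a reading aid: the variable names are illustrative, and because the 66% figure is rounded, the derived number of grass-specific groups is approximate rather than a reported result.

```python
total_groups = 12_609            # universal groups after the genBlastG step
monocot_specific = 5_973         # groups classed as specific to monocots
grass_specific_fraction = 0.66   # rounded share reported as grass-specific

# Share of all universal groups classed as monocot-specific (about 47%).
monocot_share = monocot_specific / total_groups

# Approximate number of grass-specific groups implied by the rounded 66%.
approx_grass_specific = round(monocot_specific * grass_specific_fraction)

# A more than 2-fold increase implies fewer than half of the final total
# before supplementation with new gene models.
upper_bound_before = total_groups / 2

print(f"{monocot_share:.1%} of groups are monocot-specific")
print(f"about {approx_grass_specific} groups appear grass-specific")
print(f"fewer than {upper_bound_before:.0f} groups before supplementation")
```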

Permalink - https://repository.rothamsted.ac.uk/item/98ywz/universal-grass-peps
