A Near-Chromosome Level Genome Assembly of the European Hovery, Sphaerophoria Rueppellii (Diptera: Syrphidae), and a Comparative Analysis of Insecticide Resistance-Related Gene Families in Hemipteran Crop Pests and Pollinators

Background: Sphaerophoria rueppellii, a European species of hovery, is a highly effective benecial predator of crop pests including aphids, thrips and coleopteran/lepidopteran larvae in integrated pest management (IPM) programmes. It is also a key pollinator of a wide variety of important agricultural crops. No genomic information is currently available for S. rueppellii. Without genomic information for such benecial predator species, we are unable to perform comparative analyses of insecticide target-sites and genes encoding metabolic enzymes potentially responsible for insecticide resistance, between crop pests and their predators. These metabolic mechanisms include several gene families cytochrome P450 monooxygenases (P450s), ATP binding cassette transporters (ABCs), glutathione-S-transferases (GSTs), UDP-glycosyltransferases (UGTs) and carboxyl/choline esterases (CCEs). Methods and ndings: In this study, a high-quality near-chromosome level de novo genome assembly (as well as a mitochondrial genome assembly) for S. rueppellii has been generated using a hybrid approach with PacBio long-read and Illumina short-read data, followed by super scaffolding using Hi-C data. The nal assembly achieved a scaffold N50 of 87Mb, a total genome size of 537.6Mb and a level of completeness of 96% using a set of 1,658 core insect genes present as full-length genes. The assembly was annotated with 14,249 protein-coding genes. Comparative analysis revealed gene expansions of CYP6Zx P450s, epsilon-class GSTs, dietary CCEs and multiple UGT families (UGT37/302/308/430/431). Conversely, ABCs, delta-class GSTs and non-CYP6Zx P450s showed limited expansion. Differences were seen in the distributions of resistance-associated gene families at the subfamily levels between S. rueppellii and some crop pests. Conclusion and signicance: This assembly is the rst published genome for a predatory member of the Syrphidae family and will serve as a useful resource for further research into selectivity and potential tolerance of insecticides by benecial predators. Furthermore, the expansion of some gene families often linked to insecticide resistance and selectivity may be an indicator of the capacity of this predator to detoxify IPM selective insecticides. These ndings could be exploited by targeted insecticide screens and functional studies to increase effectiveness of IPM strategies, which aim to increase crop yields by sustainably and effectively controlling pests without impacting benecial predator populations.


Introduction
Loss of crops to insect pests can account for more than 10% of potential yield, as a result of both direct feeding damage and the transfer of plant viruses via insect feeding [1]. Methods of controlling insect pests are therefore critical to ensure that crop yields are maximised to sustain the growing world population. Insecticides play a key role in pest control strategies. Many modern insecticides are known to be selective for pests without harming bene cials. However, some insecticides such as pyrethroids tend to be non-speci c, and as a result are often toxic to both their target pest species and bene cial predators. Applications of such non-speci c insecticides can reduce predator populations so that they are unable to act as an effective natural control. This can lead to pest populations surging, with instances of higher populations than pre-pesticide application [2][3][4].
Bene cial predators, such as those in the Syrphid family, are effective in the biological control of crop pests. Syrphid adults typically feed on nectar and pollen, however, the larvae of roughly one-third of syrphid species feed on crop pests such as aphids, thrips and coleopteran and lepidopteran larvae [5][6][7][8][9][10][11]. Predatory Syrphidae are able to feed on up tõ 500 aphids during their larval stage, which is a higher daily feeding rate than other aphid predators [12]. For example, S. rueppellii were able to reduce aphid (Myzus persicae) populations by 84% in a eld experiment [13]. Specialised adaptations present within adult female Syrphidae allow them to detect aphid pheromones and increase their e cacy as biological control agents -for example, adult females often lay their eggs in close proximity to aphid colonies to ensure a plentiful food supply for emerging larvae [14]. Syrphid adults also avoid laying their eggs close to parasitised aphids [15] S. rueppellii larvae were obtained from 'biopestgroup.com'. CO 2 was used for anaesthesia to allow the insects to be sorted from the substrate. The larvae were then ash frozen with liquid N 2 and stored at -80°C. The whole process was completed within 48 hours of arrival.
For transcriptome sequencing, RNA extractions were carried out in-house at Rothamsted Research using the Bioline Isolate II RNA Mini Kit. 30µg of RNA was obtained from ~5 individuals. The library was constructed with an insert size of 150bp and PolyA selection for rRNA removal. Sequencing was performed by Genewiz (New Jersey, US) using Illumina HiSeq 4000 with a 2x150bp paired-end con guration.
For short-read genomic sequencing, DNA extractions were performed in-house at Rothamsted Research using the commercial DNAzol reagent. Short reads were sequenced using 1.1µg of DNA obtained from ~5 individuals and a library with an insert size of 200bp. Sequencing was performed by Genewiz (New Jersey, US) using Illumina HiSeq 4000 with a 2x150bp paired-end con guration. K-mer counting of the raw Illumina DNA data was performed using Jelly sh 2.2.6 [44]. Canonical (-C) 21-mers (-m 21) were counted and a histogram of k-mer frequencies outputted. GenomeScope 2.0 [45] was used to process this histogram with ploidy set to 2 and maximum k-mer coverage cut-off set to 10,000.
For long-read genomic sequencing, whole insects were sent directly to Georgia Genomics (University of Georgia, US) who performed the DNA extractions using ~15 individuals. To obtain long-read PacBio data, a 15-30Kb SMRTbell library was produced with an insert size of 24,000bp and a 15 hour sequencing run was carried out using PacBio Sequel II.
For Hi-C sequencing, whole insects were sent directly to Arima Genomics (San Diego, US) who carried out the DNA extractions using 10 individuals. Arima-QC and library preparation were also performed in-house. Sequencing was performed using Illumina HiSeq X with a 2 x 150bp paired-end con guration.

Genome quality assessment
Basic metrics from the genome assembly were calculated using a script developed for the 'Assemblathon' [46]. These metrics include scaffold/contig N50, longest and shortest scaffold length, number of scaffolds exceeding a range of lengths and number of gaps/N's in the assembly.
The completeness of the genome assembly and annotation for S. rueppellii was assessed using the Benchmarking Universal Single-Copy Orthologs (BUSCO) [47] of the insect gene set (insecta odb9). 'Genome' mode was used to assess the assembly, and 'protein' mode to assess the annotation. 'Fly' was used as the training species for Augustus gene prediction. BUSCO assessments were then run with default parameters. QuickMerge v0.3 [53] was used to merge the Flye and Platanus-Allee assemblies, with Flye as the reference assembly.
BUSCO outputs were compared between the merged assembly and the standalone assemblies to identify core insect genes which had been lost during the merging process. Full-length contigs containing these missing genes were extracted from the standalone assemblies and added to the merged assembly, based on the assumption that these contigs would also contain other missing genes (i.e. those not included in BUSCO's list of 1,658 core insect genes).
Purge Haplotigs v1.0.0 [54] was next used to perform redundant contig removal from the merged assembly. Parameters '-l 5 -m 30 -h 190' were chosen from the coverage histogram outputted in the rst step of the pipeline. The percent cutoff for identifying a contig as a haplotig was set to '-a 40', (the default value is 70, however a lower cutoff was chosen due to a very high level of duplication). This tool takes read depth coverage into consideration to reduce over-purging of repetitive regions and paralogous contigs, whilst still coping well with highly heterozygous assemblies.
The Hi-C data was processed using Juicer v1.5 [55] and used as input to the 3D-DNA de novo genome assembly pipeline (version 180922) [56] alongside the draft assembly to produce a candidate chromosome-length genome assembly. Contact matrices were generated by aligning the Hi-C dataset to the genome assembly after Hi-C scaffolding, and were then visualised using JuiceBox Assembly Tools v1.11.08 [57]. The parameters used were as follows: '--mode haploid -build-gapped-map --sort-output'. Additional nishing on the scaffolds was performed in JuiceBox to correct mis-joins.
Multiple rounds of Pilon [58] error polishing were performed, using the Illumina short read data, until no further improvement in BUSCO score was seen. A nal round of Purge Haplotigs was then performed to reduce duplication further. Parameters '-l 10 -m 50 -h 150' were chosen from the coverage histogram outputted in the rst step of the pipeline. The percent cutoff for identifying a contig as a haplotig was set to '-a 80'.

Mitochondrial genome assembly
The mitochondrial genome was found and extracted by running a BLAST search of the S. rueppellii genome against the Syrphus ribesii mitochondrial genome, which is publicly available at NCBI, GenBank accession number: MW091497.1.

Annotation
Gene prediction was performed using the MAKER v2.31.8 pipeline [59] through the incorporation of both transcriptome evidence and ab initio gene prediction as well as a custom repeat library (see below). MAKER was run using Augustus v3. 3 A de novo species speci c repeat library was constructed using RepeatModeller v1.0.7 [64] to identify repeat models.
These models were searched against the GenBank non-redundant (nr) protein database for Arthropoda (e value <10 −3 ) using Blastx to remove any potential protein-coding genes. This was combined with transposon data to create a custom library. Transposons were identi ed from the transcriptome assembly by running HMMER: hmmscan [65] against the Pfam database [66] and ltering the resultant Pfam descriptions for those containing "transposon". A search for transposons was also performed on transcripts produced from MAKER and these transposons were then added to the custom repeat library which was used for a second round of MAKER. RepeatMasker v4.0.7 [67] was used to mask repeats in the genome assembly using these repeat libraries, as well as to estimate the abundances of all predicted repeats.
Evidence from assembled transcripts was transferred to the genome assembly via MAKER. The output from this was then used to produce a high con dence level gene model training set. Overlapping and redundant gene models were removed.
Augustus and GeneMark were trained using this training set prior to being used for ab initio gene predictions. FGeneSH was run based on the Drosophila melanogaster genome.
The best transcripts (classi ed by reasonable transcript size and homology to other species) from both the ab initio gene prediction annotation and the transcriptome-based annotation were selected using Evigene and combined to create the nal annotation.

Comparative genomics and phylogenetic analysis
To produce the species tree, orthogroup gene trees were produced using OrthoFinder [75] and the tree was inferred from these using the STAG method [76].
In order to identify genes potentially involved in insecticide resistance, the PFAM domains assigned to gene models during annotation (as described in the 'Genome Annotation' methods section) were used as follows: CCEs (PF00135/IPR002018), GSTs (IPR004045/PF02798), (IPR004046/PF00043), P450s (IPR001128/PF00067), ABCs (IPR003439/PF00005) and UGTs (IPR002213/PF00201). Proteins from UniProtKB for the classes of interest, from hemipteran species, were used for BLAST queries against S. rueppellii to identify any missed genes and to assist with subfamily assignment within these classes. Subfamily assignment for S. rueppellii gene families was nalised using phylogenetic trees which were produced using MAFFT alignments [77,78] and RaxML v8.2.11 [79]. The GAMMA LG protein model [80] was used and a bootstrap consensus tree was inferred from 100 replicates.
Manual checks and curation were performed for genes potentially involved in insecticide resistance. Increased copy numbers of genes linked to insecticide resistance often led to adjacent tandem duplications being incorrectly annotated by MAKER as one gene model; therefore curation was important to prevent incorrect gene numbers being reported in later analyses. The exon/intron boundaries and start/stop codons of the genes were con rmed through visualization in IGV

Results And Discussion
Sequencing 30 individuals of S. rueppellii were required to produce su cient DNA and RNA for sequencing. Since they were obtained commercially, the level of inbreeding of the culture was not known. However, all individuals were obtained from a single colony within the rearing facility. A high heterozygosity level was therefore a possibility and this was kept in mind during the assembly process.

Raw data
The DNA sequencing generated 6,748,327 PacBio reads with a total length of 83.2 Gbp (277x) and a polymerase read length N50 of 63,285bp.
A total of 125.3Gb was produced from the Illumina HiSeq platform for whole genome sequencing, as well as 36.9 Gb for transcriptome sequencing. Quality trimming of Illumina reads using Trimmomatic to remove adapters and any reads <36bp resulted in a 2.9% loss of reads for whole genome sequencing and a 5.18% loss of reads for transcriptome sequencing (Table 1). A total of 21.6Gb of sequencing data was produced from Arima-HiC. Analysis of proximal ligation gave a library QC metric of 30% (a high-quality Arima-HiC library is >15%).

Genome metrics evaluation based on raw reads
The raw read k-mer analysis with GenomeScope 2.0 (see gure 1) estimated a haploid genome size of ~400Mb (Table 2), which is an underestimate of the nal assembly size of 537Mb. However, such discrepancies are often seen when using kmer frequency to estimate genome size in genomes with high repeat content and high heterozygosity [84]. Genome repeat length was 170Mb, 42% of the total estimated genome size. The heterozygosity rate ranged from 3.24-3.36%. This indicates a fairly high level of heterozygosity, which was taken into consideration in the assembly strategy.
Assembly Flye and Platanus-Allee were used to produce 2 separate assemblies. Flye had the best assembly statistics in terms of scaffold N50 (100,207bp with 18 scaffolds >1 million bp) and BUSCO completeness score (99.2%), however, duplication was very high (48.3%) for this assembly, even after subsetting the longest reads to get 150x coverage (duplication was 63.8% prior to subsetting). The total number of scaffolds was 50,164. Platanus-Allee had a lower scaffold N50 (42,845bp with 0 scaffolds >1 million bp) and a slightly lower BUSCO completeness score (97.6%), however, duplication was much lower (3.6%). The total number of scaffolds was 67,142.
The Flye and Platanus-Allee assemblies were merged using QuickMerge, and some manual curation was performed to bring back falsely removed contigs. This resulted in an assembly with a completeness score of 96.5%, duplication of 15.5%, a scaffold N50 of 67,653bp and a total of 59,284 scaffolds, 16 of which were >1 million bp.
A subsequent round of Purge Haplotigs brought the duplication score down to 4.6% whilst still maintaining a completeness of 95.6%. Scaffold N50 increased to 126,450bp and the total number of scaffolds was reduced to 15,009.
This draft assembly was next used for scaffolding with Hi-C data using the 3D-DNA de novo genome assembly pipeline. This increased the scaffold N50 to 87,361,475 bp, with 5 scaffolds > 10 million bp. The total number of scaffolds was reduced to 11,549, with 6 chromosomal-level scaffolds, numbered by sequence length ( gure 2). The BUSCO completeness score was reduced to 94.6%, however, a round of Pilon error polishing brought this back up to 96.4% (subsequent rounds of Pilon worsened the BUSCO score). A nal run with Purge Haplotigs reduced duplication from 4-3%. Statistics of the nal assembly are shown in Table 3. The nal assembly size of 537.6Mb was slightly larger than the assembled genome size for E. pertinax (482Mb) [85], but closely matches the genome size for Episyrphus balteatus (530Mb) from the Syrphidae family, which was calculated using ow cytometry and can therefore be considered a more accurate estimate [86].
From the Infernal tool inference of RNA alignments, a total of 2,292 non-coding RNA elements were found in the genome (Table 4).

Repeat annotation
Transposable and repetitive elements made up 30% of the S. rueppellii genome (Table 5). This is consistent with the reported repeat content of genomes of Diptera species, which ranges widely from 7% (Drosophila simulans) to 55% (Aedes aegypti) [87]. 16.15% of the S. rueppellii genome (77,619,601bp) was masked for annotation -some repeats are annotated but not masked, such as those less than 10bp in length. The majority of these were LINES (9.97%) and interspersed repeats (14.35%). Details of transposable and repetitive elements are shown in Table 5. Comparative genomics UDP-glycosyltransferases UDP-glycosyltransferases (UGTs) are phase II detoxi cation enzymes which are involved in insecticide metabolism. The mechanisms of UGT-mediated resistance are for example based on the conjugation of P450-functionalized substrates, however, their upregulation has been shown in resistant strains of P. xylostella [37], and they have been linked to diamide resistance in Chilo suppressalis [90] and neonicotinoid resistance in Diaphorina citri [91] and they also contribute to insecticide detoxi cation via the elimination of oxidative stress in Apis cerana [92]. There are 46 UGTs in the S. rueppellii genome (Table 6) ; and in three mosquito species (Anopheles sinensis, Anopheles gambiae, Aedes aegypti) expansion is only seen in UGT308 [94]. S. rueppellii also has a much higher number of UGT genes compared to other pollinator species.
Hemiptera crop pest species tend to have higher numbers of UGT genes than Diptera, as shown in Table 6. This tends to be the result of substantial gene expansion concentrated within a single UGT family. For example: a UGT352 expansion in Bemisia tabaci accounted for 36 of its 76 UGTs; the UGT344 family accounted for 35 of Acyrthosiphon pisum's 72 UGTs and the UGT201 family accounted for 33 of Tetranychus urticae's 81 UGTs. These lineage speci c expansions have previously been linked to increased detoxi cation of plant secondary metabolites [100] and therefore the increased number of UGTs in Hemiptera compared to Diptera may be linked to differences in diet. Host plant adaptation alone has been shown to usually be insu cient to confer insecticide resistance, and therefore higher numbers of UGTs in Hemiptera cannot be assumed to correspond to increased insecticide tolerance/resistance [101]. However, upregulation of UGTs from 7 different UGT families, including 6 UGT344 members, has been associated with thiamethoxam resistance in Aphis gossypii [102]. It is therefore possible that expansion in UGT families may be associated with both host plant adaptation and insecticide resistance. Further study into the role of individual UGTs would be needed to clarify whether differences in total numbers of UGTs are associated with differences in insecticide tolerance/resistance between Hemiptera and Diptera.
9 of the S. rueppellii UGT genes belong to the UGT302 and UGT308 families, which are suggested to be the ones most associated with resistance to pyrethroid insecticides [94]. This suggests that expansion of these families in S. rueppellii could be a response to pyrethroid exposure. Expansion of these gene families has been reported in A. sinensis -14 of its 30 UGT genes belonged to the UGT302/308 families and 7 of these were considered strong candidates for pyrethroid resistance [94].
The most signi cant expansion for S. rueppellii is seen in the UGT431 family. This family is unique to S. rueppellii, but is closest in sequence similarity to the UGT37 and UGT430 families which also exhibited some expansion. The UGT37 family has been shown to be upregulated during organophosphorus pesticide exposure in Caenorhabditis elegans [103]. The UGT37 family exhibits lineage speci c expansion in D. melanogaster and is its largest UGT gene family with members spread across ve different genome locations [93]. This differs from the S. rueppellii genome, where the majority (12/14) of the UGT37 and UGT431 families are located in a cluster of adjacent genes on chromosome 2 within 0.17Mb of genomic space. This could suggest the UGT37 family may have expanded more recently in S. rueppellii, however, the percentage identity within this cluster ranges from 33-70%, which indicates that at least part of the cluster can be considered "old". Since these genes have not been fully dispersed in the genome, there may be a selective advantage for preserving the cluster on chromosome 2 as a heritable unit, i.e. UGT37/431 members may be required for the same mechanism. Based on the links of UGT37 to pesticide resistance, the expansion of the UGT37/431 families and preservation of the gene cluster could be an adaptational response to pesticide exposure.

Glutathione S-transferases
The glutathione S-transferases (GSTs) family is large and functionally diverse, and has been shown to confer resistance to all main insecticide classes. For example, the delta and epsilon classes have been linked to pyrethroid resistance in A. aegypti and N. lugens [104,105]. GST-mediated detoxi cation of insecticides takes place via several different mechanisms, including protecting against oxidative stress, binding and sequestration of the insecticide and by catalysing the conjugation of glutathione to insecticides and their metabolites to reduce their toxicity and facilitate excretion, respectively [39].
S. rueppellii has 23 GSTs (Table 7), which are located on proposed chromosomes 1-3, with members of the same family located on the same chromosome. (Chr1: Theta and Omega, Chr2: Epsilon, Chr3: Sigma, Delta and Zeta.) A phylogenetic tree of these GSTs, with likely tandem duplications indicated, is shown in gure 6. The total number of GSTs is slightly lower in S. rueppellii compared to other Diptera species, although higher than other pollinators.
Sigma-GSTs are associated with detoxi cation of oxidants produced during pollen and nectar metabolism in bees [106]. However, S. rueppellii has a reduced number of sigma-GSTs compared to other pollinators. This suggests S. rueppellii may use different detoxi cation enzymes to cope with these oxidants, or perhaps a different pathway for pollen and nectar metabolism.  [109], Culex pipiens quinquefasciatus [110], Apis mellifera, Bombus impatiens, Bombus huntii [111], Thrips palmi [99], Myzus persicae, Acyrthosiphon pisum, Trialeurodes vaporariorum and Bemisia tabaci [112] and their distribution across classes. Within the Diptera species the majority of GSTs are present within the epsilon and delta class, however, for S. rueppellii whilst the numbers of epsilon GSTs are comparable to other Diptera species, the numbers of delta class GSTs are notably lower.

S. rueppellii + close relatives Pollinators
The epsilon class is the largest class in S. rueppellii, as a result of substantial class-speci c expansion. 7 epsilon members are clustered within 31kb, with a percentage identity ranging from 35-81%, this indicates that whilst some members of the cluster are the result of recent tandem duplications, others are the result of far older duplications. Clusters of epsilon GSTs are common across Diptera species, with clusters of 8 epsilon genes seen in A. aegypti and A. gambiae and a cluster of 11 epsilon genes in D. melanogaster. The preservation of these clusters suggests that maintaining epsilon genes as a heritable cluster confers a selective advantage, likely in terms of conferring increased insecticide resistance. This cluster and class speci c expansion may therefore imply an increased degree of GST delta-linked pyrethroid tolerance/resistance in S. rueppellii compared to Hemiptera crop pests, which have at most 1 epsilon gene.
In contrast to the epsilon class, S. rueppellii's delta class is far smaller, as a result of minimal recent class-speci c expansion. Only 2 of the genes are directly adjacent, and were likely a recent tandem duplication based on their 88% sequence identity, whilst the other two members are dispersed through the genome (across 7.8Mb of genomic space). This follows the pattern seen in some other Diptera species, which also have delta genes more widely dispersed than epsilon. For example, 3 separate clusters are seen in both A. aegypti and A. gambiae, (although in D. melanogaster a single cluster of 11 delta genes is present) [113]. This reduced number of delta GSTs could imply a reduced degree of GST delta-linked pyrethroid resistance in S. rueppellii compared to Hemiptera crop pests, although this may be counteracted by the signi cant expansion within the epsilon class. The lack of preservation of delta clusters may also suggest that they confer a less signi cant selective advantage than do the epsilon GSTs.
The sigma class of GSTs has been associated with the detoxi cation of organophosphorus insecticides [114]. All Diptera species included in analysis had only 1 sigma gene, and this was also the case for S. rueppellii. All crop pest species had larger sigma classes. This may imply a reduced level of GST sigma-linked organophosphorus resistance compared to Hemiptera crop pests.
The distribution of S. rueppelli's ABC genes across subfamilies is similar to other species, except for the ABCC and ABCG subfamilies, which are smaller in S. rueppellii than all other Diptera species and the majority of Hemiptera crop pests. These are two of the families most associated with insecticide resistance, and so their reduced size suggests that ABCmediated tolerance/resistance to insecticides could be lower in S. rueppellii compared to these other species.
The ABCA subfamily is expanded in Diptera, whilst the ABCH subfamily is expanded in Hemiptera. However these subfamilies do not have strong links to insecticide resistance, and so these differences would likely not contribute to any variation in tolerance/resistance levels.
The percentage identity of ABC genes within S. rueppellii ranges from 0%-71%, with the exception of one pair of genes with an identity of 89%. This suggests that there has been little recent lineage speci c expansion within the S. rueppellii ABC transporter family, and this is supported by the numbers of the genes in the ABC subfamilies, which are either similar to or lower than other Diptera species. Any lineage-speci c expansion seen in S. rueppellii is minimal, demonstrated by the small size of gene clusters. Species-speci c and lineage-speci c ABC expansions on a much larger scale have been reported in a variety of arthropods such as Tribolium castaneum and Tetranychus urticae, although whether these expansions contribute directly to increased insecticide resistance is not yet known [129] Cytochrome P450 monooxygenases Cytochrome P450 monooxygenases (P450s) are a diverse superfamily capable of metabolizing a huge variety of endogenous and exogenous substrates. In insects they are associated with growth and development, metabolism of pesticides and plant toxins as well as the production and metabolism of insect hormones and pheromones [138,139]. P450s are associated with the resistance to insecticides from a variety of classes, including pyrethroids, carbamates and neonicotinoids and many examples of resistance are linked to upregulated P450s [140][141][142][143]. They are also linked to the activation of organophosphates and other pro-insecticides (a form of insecticide which is metabolized into an active form inside the host) [144] often as a result of downregulation [145,146]. *Values in brackets represent total gene numbers including partial and fragment genes. For other species partial and fragment P450 genes were excluded in cases where they were listed as such -some may remain in the counts if o cial naming and curation had not taken place.
A total of 69 full-length P450 genes were identi ed in the S. rueppellii genome, as well as 4 P450 fragment genes (Table  10). These genes were named by Dr David Nelson using his in-house pipeline [82]. The total number of P450s varies widely between insect species, ranging from 44 for Bombus huntii to 196 for C. pipiens. S. rueppellii falls at the lower end of this range, however when compared to other dipteran species, this is mainly due to the reduced size of the CYP4 clan.
34 of the P450 genes have 55-97% identity to another sequenced P450, 38 have 40-55% identity, and 1 gene has <40% identity. 9 genes (CYP18A1, CYP301-304A1, CYP307A2, CYP314A1, CYP315A1 and CYP49A1) were classi ed as orthologs to P450s from Lucilia cuprina, Ceratitis capitata and Musca domestica. These genes are involved in a conserved pathway, found in all insects, for the essential growth hormone 20-hydroxyecdysone [151]. Orthologs were not present for other genes, likely because other P450s are involved in detoxi cation, and therefore vary during evolution based on the organism's environment and adaptation.
The CYPome diversity value was 52%, based on the presence of 38 CYP subfamilies and 73 genes. The CYPome follows the pattern of other arthropods, with most CYP families having few genes, whilst only a few CYP families have many genes. [149] The majority of S. rueppellii P450s (34) belong to the CYP3 clan (Table 10), which is the one most associated with insecticide resistance, notably the CYP6 and CYP9 families [139], both of which were present in S. rueppellii. CYP3 is also the largest clan in other pollinators and in several other diptera species and hemipteran crop pest species.
The largest sub-family speci c expansion is in clan 3, within the CYP6Zx family, with 16 members: CYP6ZQ1-11, CYP6ZR1-4 and CYPZS1 ( gure 9). CYP6ZQ1-11 (excluding Q7) are located contiguously within a 0.2Mb region of chromosome 3 ( gure 10). Within this cluster there is no consistent relationship or pattern between the proximity of the CYP6Zx genes or their gene structure with their percent identity, which ranged from 33-90%. The lower end of the percent identity within the cluster indicates that at least part of the cluster can be considered "old", and therefore, since these genes have not been fully dispersed in the genome, there may be a selective advantage for preserving the cluster on chromosome 3 as a heritable unit.
Whether the large CYP6Zx expansion may confer an increased degree of tolerance to xenobiotics in S. rueppellii remains to be investigated. Overall, numbers of the resistance-associated CYP3 clan are similar or lower than Hemiptera crop pests, suggesting that P450-mediated insecticide tolerance/resistance mechanisms may not be as prevalent as for other species.
The CYP4 clan is vastly expanded in many arthropods [152], and whilst the CYP4 clan is not as strongly associated with insecticide resistance as CYP3, studies have shown upregulation of some CYP4 genes in response to insecticide exposure [141,[153][154][155]. S. rueppellii has a lower number of CYP4 genes compared to many other dipteran species and crop pests, however, compared to other pollinators the CYP4 subfamily is relatively large. A reduced number of CYP4 genes is common within pollinators [95,156], but the reasons behind this are not yet known.
Pollinators use P450s for the detoxi cation of pollen avonoids, notably the CYP6AS subfamily which is often expanded in honey bees; however, this subfamily is absent in S. rueppellii [157,158]. It is likely that another subfamily is responsible for avonoid detoxi cation in S. rueppellii (possibly the expanded CYP6Zx subfamily) and future studies assessing P450 upregulation in response to avonoids could help identify this.

Point Mutations
Point mutations in genes encoding insecticide targets which are known to confer insensitivity to insecticides were searched for in the S. rueppellii genes. This includes those in the sodium channel para gene, which can confer resistance to pyrethroids; the GABA-gated ion channel RDL which can lead to multiple insecticide resistance; the acetylcholinesterase (ace-2) enzyme which is associated with organophosphate and carbamate resistance; the Ryanodine receptor which is linked to diamide resistance and acetyl CoA carboxylase which is linked to keto-enol resistance. Despite mutations in these genes having been observed across Diptera species including house ies and mosquitoes as well as crop pests such as white ies, aphids and diamondback moths, none were found in this S. rueppellii genome [159][160][161][162][163][164][165][166][167][168][169][170][171][172][173].
Overall, target site mediated tolerance/resistance is not seen in S. rueppellii. Although it is important to note that the S. rueppellii genome assembly was a consensus of ~30 individuals, therefore mutations would likely only be apparent if they were present in the majority of the population.

Conclusions
Here we present the rst high quality genome of S. rueppellii including the mitochondrial genome enabled by PacBio longread technology combined with low error-rate short-read Illumina sequencing. Hi-C data permitted further scaffolding of this genome to a near-chromosome level assembly. The genome completeness is of excellent quality for comparative and functional genomics analyses and provides a useful rst reference for predatory Syrphidae.
Comparative analyses of S. rueppellii with crop pests showed evidence that S. rueppellii has a detoxi cation gene inventory comparable to selected crop pests, with a few notable differences: lineage-speci c expansions were seen within detoxi cation gene families such as UGTs and P450, whereas the ABC transporter family lacks such expansions compared to some crop pests. No mutations were found in common insecticide target-sites, suggesting a lack of selectivity of insecticides at the protein/receptor binding level.
Comparative analyses of S. rueppellii with pollinators showed that S. rueppellii has an increased number of genes in all detoxi cation families, in particular: UGTs, non-sigma class GSTs and CYP4 P450s. This could be in part due to S. rueppellii needing more detoxi cation genes for its diet: hover ies lack the eusocial behavioural mechanisms seen in bees, such as processing nectar into honey and converting pollen into 'beebread', which result in a dilution of toxins and hence reduce the need for detoxi cation enzymes in bees [156]. Additionally, the considerably longer migratory distance covered by hover ies compared to bees [21] may have resulted in hover ies being exposed to a wider variety of xenobiotics, and could perhaps have resulted in expansion of associated detoxi cation genes.
Despite the reduced number of detoxi cation genes in pollinators such as A. mellifera, they appear to be no more sensitive to insecticides than other insects [156,174]. Insects with a pollen-based diet have been found to have an increased degree of insecticide tolerance, with many of the same genes being upregulated in response to both pollen and to certain insecticides [175]. This suggests that the unique set of detoxi cation genes required by pollinators for their diet, could perhaps impart an increased degree of insecticide tolerance without the need for the extent of gene expansion seen in other insect species. This may mean that despite S. rueppellii having fewer detoxi cation genes than some crop pests, this might not necessarily be indicative of reduced insecticide tolerance. However, this is not to say that insecticides are not a major problem for S. rueppellii, with clear evidence that the same neonicotinoids (imidacloprid and thiamethoxam) which are toxic to honey bees are also toxic to S. rueppellii [176,177].
This study provides a good basis for beginning to identify differences in genes encoding potential tolerance/resistance mechanisms between crop pests and S. rueppellii which could be exploited when selecting targeted insecticides for use in IPM strategies. Evidence of gene expansions in resistance-associated gene families implies that S. rueppellii is certainly capable of developing resistance to a variety of insecticides, which could be used to our advantage through the selective breeding and selection of resistant strains of S. rueppellii for use in IPM.
An interesting future comparison could be to look at the differences in olfactory genes between S. rueppellii and E. pertinax (the non-predatory European hover y), as this may give some indication of the genes involved in detecting aphid pheromones and avoidance of parasitised aphids.

Declarations
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
The genome and transcriptome assemblies generated in this study (as well as the raw sequencing data used to produce them) are available under BioProject: PRJEB48036. The manually curated Sphaerophoria rueppellii genes used for comparative analyses are included in additional le 1.

Figure 2
The Sphaerophoria rueppellii genome visualised in JuiceBox, with Hi-C contacts shown in red. Blue edges = superscaffolds/potential chromosomes. Black circles = likely centromeres. Grey boxes = centromere -centromere interchromosomal interactions. (Potential chromosome 3 had no obvious centromere, which may have been due to a misassembly. The 7th scaffold was mostly repeat regions -evidenced by the lack of interactions with the rest of the genome.) Page 30/37 The mitochondrial genome for Sphaerophoria rueppellii, visualised using Geneious and annotation track obtained using MITOS2.

Figure 5
Phylogenetic tree of S. rueppellii UDP-glycosyltransferases. Amino acid sequences were aligned using MAFFT and analyzed using RAxML (the GAMMA LG protein model was used). The bootstrap consensus tree was inferred from 100 replicates. Coloured nodes indicate groups of likely recent tandem duplications, based on genes within the cluster having >70% similarity using Blosum45 with threshold 0, and being located adjacently in the genome.

Figure 6
Phylogenetic tree of the Sphaerophoria rueppellii glutathione S-transferases. Amino acid sequences were aligned using MAFFT and analyzed using RAxML (the GAMMA LG protein model was used). The bootstrap consensus tree was inferred from 100 replicates. Coloured nodes indicate groups of likely recent tandem duplications, based on genes within the cluster having >70% similarity using Blosum45 with threshold 0, and being located adjacently in the genome.

Figure 7
Phylogenetic tree of the Sphaerophoria rueppellii carboxyl/cholinesterases. Amino acid sequences were aligned using MAFFT and analyzed using RAxML (the GAMMA LG protein model was used). The bootstrap consensus tree was inferred from 100 replicates. Coloured nodes indicate groups of likely recent tandem duplications, based on genes within the cluster having >70% similarity using Blosum45 with threshold 0, and being located adjacently in the genome.
Maker_SR0001396-RA was a gene fragment, and was not included in the nal gene count or analysis, all others are fulllength genes.

Figure 8
Phylogenetic tree of the Sphaerophoria rueppellii ABC transporters. Amino acid sequences were aligned using MAFFT and analyzed using RAxML (the GAMMA LG protein model was used). The bootstrap consensus tree was inferred from 100 replicates. Coloured nodes indicate groups of likely recent tandem duplications, based on genes within the cluster having >70% similarity using Blosum45 with threshold 0, and being located adjacently in the genome.

Figure 9
Phylogenetic tree of the Sphaerophoria rueppellii Cytochrome P450s. Amino acid sequences were aligned using MAFFT and analyzed using RAxML (the GAMMA LG protein model was used). The bootstrap consensus tree was inferred from 100 replicates. Coloured nodes indicate groups of likely recent tandem duplications, based on genes within the cluster having >70% similarity using Blosum45 with threshold 0, and being located adjacently in the genome. The CYP6Zx family is part of clan 3.