Open Access Journal of Microbiology & Biotechnology, Research Article

In Silico Prediction of Small Open Reading Frames from Intergenic Regions of Cucumis sativus L. Var. Hardwickii

Chieng GSW, Tan BC and Teo CH*
* Corresponding author
ISSN: 2576-7771  10.23880/oajmb-16000242  Received: November 07, 2022  Published: November 23, 2022
Keywords
Cucumis sativus var. Hardwickii; Small Open Reading Frame; Transcribed sORF; Cucumber; Coding sORF
Abstract

Small open reading frames (sORFs) play important roles in growth and development regulation in plant species. However, their sequences and functions remain poorly understood in many plant species, including Cucumis sativus, which is Asia's fourth most important vegetable. The breeding of climate-resilient cucumbers is of great importance to ensure their sustainability under extreme climate conditions. In this study, we aimed to predict the intergenic sORFs from C. sativus var. hardwickii and determine their expression profiles in transcriptome datasets. We identified a total of 50,191 coding sORFs from the var. hardwickii genome. In addition, 1,311 transcribed sORFs were detected in RNA-seq datasets of var. hardwickii and shared homology to sequences deposited in the cucumber EST database, and among these, 91 transcribed sORFs with translation potential were detected. The findings of this study provide insight into the sequence diversity and expression patterns of sORFs in C. sativus, which could help in developing climate-resilient cucumbers.

Introduction

Cucurbitaceae used to be documented as a monophyletic family without any close relatives [1], but with the addition of more mitochondrial and chloroplast genome sequences of old and new plant materials, a few of the closest relatives were discovered in this family [2]. Within Cucurbitaceae, there are roughly 66 species in the genus Cucumis, and cucumber (Cucumis sativus) is the only one having 2n = 2x = 14 chromosomes [3]. Among the many varieties of Cucumis sativus, wild cucumber (C. sativus var. hardwickii), semi-wild Xishuangbanna cucumber (C. sativus var. xishuangbannesis), the Sikkim cucumber (C. sativus var. sikkimensis) and the cultivated cucumber (C. sativus var. sativus) are cross-compatible [4]. Today, China ranks as one of the top producers and largest domesticators of cucumbers. Based on the statistics from FAOSTAT [5], the total cucumber production in China in 2018 was 56.24 million tons from 1.044 million hectares. The cucumber production and the area utilised for cucumber production in China stand at 74.8% and 52.7% of the corresponding world totals, respectively. As for the yield per unit area, China exceeded the world average by 42% with 53.86 t/ha. Apart from China, India is also a competitive player in terms of its annual cucumber production. In a study by Sanjeev, et al. [6], the annual production of cucumber in India was 0.698 million tons from 45,000 ha with a productivity of 15.5 t/ha. For both countries, the main concerns are low productivity and challenging climatic diversity [6, 7].

A small open reading frame (sORF), as its name suggests, is a shorter version of the canonical ORF. The size of an sORF ranges from 30 bp to 300 bp [8]. Their minuscule size has caused them to be excluded from most gene prediction methods [9]. The length cut-off filter used in most gene prediction methods is 300 bp, and any sequence below this cut-off is considered non-functional [10]. Another factor causing sORFs to be overlooked is that short sequences normally have low evolutionary conservation scores, an indicator of the functionality of a gene [11]. To date, there is still no standard classification for sORFs and the small peptides derived from them, but researchers have attempted to classify them into different categories, namely upstream ORF (uORF), intergenic sORF, long noncoding ORF (lncORF), short CDS, short isoform, downstream ORF (dORF), CDS-sORF, interlaced-sORF, miREP, microprotein, hormone-like peptide and defensin [9, 12, 13, 14, 15, 16].

sORFs have been reported to be translated into small peptides that have functional roles in plants [17, 18, 19]. Ong, et al. [16] reviewed the roles of sORF-encoded proteins (SEPs) in several biological processes, including cell signalling, abiotic stress responses, morphogenesis, and growth regulation. In addition, studies have reported that short proteins also function as secreted peptides and hormones. Quio, et al. [19] reported that systemin, an 18 aa plant polypeptide hormone, engages in plant defence mechanisms, and that secreted phytosulfokine pentapeptides (PSK; 100 aa) function in regulating plant growth and stress responses.

In this study, we used the reference genome of C. sativus var. hardwickii PI183967 for in silico sORF prediction and characterisation. The main objective of this study was to identify and characterise intergenic sORFs from the C. sativus var. hardwickii PI183967 genome. To achieve this objective, we first identified the coding sORFs from the genome sequences of C. sativus var. hardwickii PI183967 using sORFfinder. We then classified the coding sORFs into coding sORFs with transcription potential (transcribed sORFs) and coding sORFs with both transcription and translation potential (translated sORFs) based on the outcomes of the transcript (RNA-seq and EST) and protein (SWISS-PROT) sequence homology searches. We determined the sORF expression profiles in RNA-seq datasets of C. sativus var. hardwickii using gene expression tools. Finally, the potential biological functions of the translated sORFs were annotated using Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pipeline. The findings from this study will set an important foundation for the development of climate-resilient cucumbers and other crop species.

Material and Methods

Data Retrieval of Cucumis sativus Reference Genome and Transcriptomes

The reference genome sequences and annotation file of C. sativus var. hardwickii PI183967 were retrieved from CuGenDBv2 (http://cucurbitgenomics.org/). The transcriptome datasets of C. sativus var. hardwickii (Project ID: PRJNA624798) were retrieved from NCBI SRA database (https://www.ncbi.nlm.nih.gov/sra).

In silico Prediction of Small Open Reading Frame

The CDS, exon, intron, and intergenic regions of C. sativus var. hardwickii PI183967 were extracted from its reference genome using gff2sequence [20]. sORFfinder [8] was used to predict the coding sORFs from the C. sativus var. hardwickii PI183967 genome. RepeatMasker (http://www.repeatmasker.org) was used to mask the repeat sequences in the coding sORFs and the masked sequences were removed using an in-house script. Sequence clustering of the coding sORFs was performed using CD-HIT [21] with a clustering threshold of 95% to cluster the redundant sequences into sequence clusters.
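
To make the notion of an sORF candidate concrete, the following is a minimal Python sketch that scans both strands of an intergenic sequence for ATG-initiated frames of 30 to 300 bp, the size range quoted in the Introduction. It is not the sORFfinder algorithm, which is what predicts coding sORFs in the workflow above. The function names, and the choice to count the stop codon in the length, are assumptions made for illustration only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_strand(seq: str, min_len: int = 30, max_len: int = 300):
    """Yield (start, end, orf) for ATG..stop frames within the length range.

    Coordinates are 0-based and end-exclusive, relative to the strand scanned.
    Length is counted from the start codon through the stop codon (assumption).
    """
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                length = pos + 3 - start
                if min_len <= length <= max_len:
                    yield start, pos + 3, seq[start:pos + 3]
                break  # the first in-frame stop closes this frame


def find_sorfs(seq: str, min_len: int = 30, max_len: int = 300):
    """Collect candidate sORFs from both strands of one sequence."""
    seq = seq.upper()
    hits = [("+", s, e, orf) for s, e, orf in scan_strand(seq, min_len, max_len)]
    rc = reverse_complement(seq)
    hits += [("-", s, e, orf) for s, e, orf in scan_strand(rc, min_len, max_len)]
    return hits
```

In this sketch, in-frame ATGs that share a stop codon are reported as separate candidates; a real pipeline would normally resolve such overlaps and then assess coding potential before clustering.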

Characterisation of Small Open Reading Frame

To identify transcribed sORFs, the coding sORFs were blast searched against Cucumber EST collection version 3 (http://cucurbitgenomics.org/est/cucumber) using the blastn algorithm with homology search parameters “-evalue 1e-5 -per_identity 97 -qcov_hsp_perc 100”. The nucleotide sequences of transcribed sORFs were translated to amino acid sequences using gotranseq (https://github.com/feliixx/gotranseq). The amino acid sequences of transcribed sORFs were blast searched locally against the high quality manually annotated and non-redundant protein sequences retrieved from the SWISS-PROT database (https://www.uniprot.org/help/downloads) using the blastp algorithm with homology search parameters “-evalue 1e-5 -per_identity 97 -qcov_hsp_perc 100”. The transcribed sORFs that shared high homology to SWISS-PROT protein sequences were designated as transcribed sORFs with translation potential (translated sORFs). The amino acid sequences of translated sORFs were then blast searched against the plant sORF database, PsORF (http://psorf.whu.edu.cn/) [22].
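
The three cut-offs quoted above (E-value of at most 1e-5, at least 97% identity, and 100% query coverage per HSP) can also be read as a filter over a table of hits. The Python sketch below applies them to tab-separated BLAST output. The column order (qseqid, sseqid, pident, evalue, qcovhsp) is an assumed layout chosen for this example, since the output format used in the study is not stated, and the sequence IDs in the toy table are invented.

```python
import csv
import io

# Assumed column layout of the tab-separated hit table (not stated in the study).
COLUMNS = ["qseqid", "sseqid", "pident", "evalue", "qcovhsp"]


def filter_hits(handle, max_evalue=1e-5, min_identity=97.0, min_qcov=100.0):
    """Return the set of query IDs with at least one hit passing all cut-offs."""
    kept = set()
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        if (float(row["evalue"]) <= max_evalue
                and float(row["pident"]) >= min_identity
                and float(row["qcovhsp"]) >= min_qcov):
            kept.add(row["qseqid"])
    return kept


# Toy table with invented IDs: only the first row satisfies all three cut-offs.
example = (
    "sORF_1\tEST_9\t98.5\t2e-20\t100\n"
    "sORF_2\tEST_3\t95.0\t1e-30\t100\n"
    "sORF_3\tEST_4\t100.0\t1e-3\t100\n"
    "sORF_4\tEST_7\t99.0\t1e-8\t80\n"
)
transcribed = filter_hits(io.StringIO(example))
```

Requiring full query coverage means an sORF is retained only when its whole length aligns to a transcript or protein, which is a strict criterion for sequences this short.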

Transcriptome Analysis of Small Open Reading Frames

The coding sORF (csORF) sequences of C. sativus var. hardwickii PI183967 were mapped to its reference genome using blat (https://github.com/djhshih/blat) and the outputs were converted to a gene annotation file using blat2gtf.pl (https://github.com/IGBIllinois/HOMER/blob/master/bin/blat2gtf.pl). The sORF annotation file was combined with the reference genome annotation file using agat_sp_merge_annotations.pl from the AGAT package (https://github.com/NBISweden/AGAT). The RNA-seq reads were aligned to the C. sativus var. hardwickii PI183967 genome using HISAT2 (http://daehwankimlab.github.io/hisat2/) and the expression profiles of sORFs in the RNA-seq datasets were determined using StringTie (https://ccb.jhu.edu/software/stringtie/#install) together with the updated gene annotation file. The sORF sequence IDs were retrieved from the StringTie output file using an in-house script and the sORF sequences were then extracted from the coding sORF file using the subseq function in the seqtk package (https://github.com/lh3/seqtk). The sORF sequences identified by the HISAT2/StringTie pipeline were combined with the sORF sequences that shared homology to cucumber ESTs to form a final set of transcribed sORFs.
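
The in-house script for retrieving sORF IDs is not shown, so the Python sketch below is only one plausible reading of that step. It assumes that sORF records can be recognised by a hypothetical "sORF" prefix in the reference_id attribute of the StringTie GTF output, and that a transcript counts as expressed when its TPM is above zero. Both are assumptions for illustration rather than criteria stated in the study.

```python
import re

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field: str) -> dict:
    """Turn the ninth GTF column into a dictionary of attribute values."""
    return dict(ATTR_RE.findall(field))


def expressed_sorf_ids(gtf_lines, prefix="sORF", min_tpm=0.0):
    """Collect reference IDs of sORF transcripts with TPM above min_tpm.

    The ID prefix and the TPM threshold are hypothetical choices.
    """
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue  # skip exon rows and malformed lines
        attrs = parse_attributes(cols[8])
        ref_id = attrs.get("reference_id", "")
        if ref_id.startswith(prefix) and float(attrs.get("TPM", 0)) > min_tpm:
            ids.add(ref_id)
    return ids


def merge_transcribed(rnaseq_ids, est_ids):
    """Union of sORFs supported by RNA-seq and sORFs with EST homology."""
    return set(rnaseq_ids) | set(est_ids)
```

The resulting ID list could then be written to a file, one ID per line, and passed to seqtk subseq to pull the matching sequences from the coding sORF FASTA file.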

Functional Annotation of Small Open Reading Frame

To assign biological functions to the transcribed sORFs, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses were performed using the clusterProfiler R package (https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html). The transcribed sORF amino acid sequences were first blast searched against the SWISS-PROT database using the blastp algorithm. The SWISS-PROT protein IDs were extracted from the blast output using an in-house script and converted to Gene IDs using the “Retrieve/ID Mapping” function on the UniProt website (https://www.uniprot.org/id-mapping). The GO category analysis was conducted using the groupGO function in the clusterProfiler R package. The enrichment analyses of GO and KEGG were performed using the enrichGO and enrichKEGG functions in the clusterProfiler R package.
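
For readers unfamiliar with over-representation analysis, the test that underlies this kind of GO and KEGG enrichment is a hypergeometric upper-tail probability. The Python sketch below shows that calculation using the standard library only. It is a conceptual illustration and not a reimplementation of clusterProfiler, it omits multiple-testing correction, and the argument names and example numbers are arbitrary.

```python
from math import comb


def enrichment_pvalue(k: int, n: int, M: int, N: int) -> float:
    """Upper-tail hypergeometric probability P(X >= k).

    N: genes in the background
    M: background genes annotated with the term
    n: genes in the query list
    k: query genes annotated with the term
    """
    upper = min(n, M)
    total = comb(N, n)
    favourable = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return favourable / total


def gene_ratio(k: int, n: int) -> float:
    """Fraction of the query list annotated with the term."""
    return k / n


# Arbitrary illustrative numbers: 2 of 4 query genes carry a term
# that annotates 3 of 10 background genes.
p_value = enrichment_pvalue(k=2, n=4, M=3, N=10)
```

A small p-value means the term appears in the query list more often than expected by chance given the background, which is what the significance colouring in an enrichment bubble plot encodes.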

Results and Discussion

Identification of Small Open Reading Frame in C. sativus var. Hardwickii

The total number of sORFs varies from species to species. In Arabidopsis thaliana, 33,809 sORFs were predicted from the intergenic regions using sORFfinder, of which 7,159 were coding sORFs and 2,996 coding sORFs were likely expressed in at least one experimental condition of the tiling array data [23]. Using the same sORF detection pipeline, a total of 850,540 sORFs were predicted from the genome of C. sativus var. hardwickii PI183967 and 50,584 coding sORFs were detected (Table 1).

Description of sORF                                       Number of sORF
sORF                                                      850,540
Coding sORF                                               50,584
Unique coding sORF                                        50,191
Transcribed sORF with homolog in cucumber EST database    693
Transcribed sORF in leaf transcriptome                    1,311
Transcribed sORF with homolog in SWISS-PROT               91
Transcribed sORF with homolog in PsORF                    82

Table 1: Summary of intergenic small open reading frames found in C. sativus var. hardwickii.

RepeatMasker and CD-HIT were used to remove the csORFs with repeat sequence homology and to cluster redundant csORFs into unique coding sORFs. The final set of 50,191 csORFs was blast searched against the Cucumber EST collection, yielding 693 transcribed sORFs that shared high homology to cucumber EST sequences. Out of the 693 transcribed sORFs, 91 showed high homology to protein sequences deposited in the SWISS-PROT database. Using the leaf transcriptome datasets of C. sativus var. hardwickii, we identified 1,311 transcribed sORFs (Table 1).

Transcribed sORF with Translational Potential in Cucumis sativus

Using a protein sequence homology search approach, 91 transcribed sORFs with translational potential were detected for C. sativus var. hardwickii (Table 1). Besides SWISS-PROT, we also blast searched the transcribed sORFs against the sORFs with translational potential deposited in the PsORF database. The PsORF database is a collection of plant sORFs from 35 different plant species [22]. The authors collected multi-omics datasets including genome, transcriptome, Ribo-seq, and mass spectrometry data from public databases and built a bioinformatics pipeline to detect sORFs in these datasets. Results from the blast search against the PsORF database showed that 90.11% (82 of 91) of the C. sativus var. hardwickii transcribed sORFs with homologs in SWISS-PROT also have homologs in the PsORF database (Figure 1). This indicates that these transcribed sORFs might have translational potential. Using a proteomic approach, Castellana, et al. [17] identified ∼5,000 small peptides in Arabidopsis, and some of these small peptides were novel and/or identified by Hanada, et al. [23].

Figure 1: Transcribed sORFs with translational potential in C. sativus var. hardwickii.

Functional Classification of Cucumis sativus var. hardwickii sORF

From the 91 transcribed sORFs with translational potential in C. sativus var. hardwickii, 2,005 unigenes with Entrez IDs were retrieved from the UniProt database for Gene Ontology (GO) analysis using clusterProfiler (Figure 2). Among the three GO categories, molecular function was the most represented functional group for sORF functional annotation (Figure 2). The top three GO terms for biological process (BP) are nitrogen compound metabolic process, reproductive process, and immune response. For molecular function (MF), the top three GO terms are protein binding, hydrolase activity, and oxidoreductase activity, whereas for cellular component (CC), the top three GO terms are cytoplasm, intracellular organelle, and myosin complex (Figure 2).

Figure 2: Gene ontology (GO) classification of C. sativus var. hardwickii transcribed sORFs.

We also performed the GO enrichment analysis of transcribed sORFs of C. sativus var. hardwickii (Figure 3). The transcribed sORFs were enriched in GO terms of BP and CC. No enrichment was detected for MF. In C. sativus var. hardwickii, most of the transcribed sORFs showed significant CC enrichment in the endosome, cytoskeleton, and actin cytoskeleton and, to a lesser extent, in the Golgi membrane and trans-Golgi network membrane. The cytoskeleton mainly functions as a structure for cell shape and internal organization and is made up of three elements, namely microtubules, intermediate filaments, and actin [24]. As shown in the GO enrichment plot (Figure 3), transcribed sORFs with enriched GO terms related to actin were detected in C. sativus var. hardwickii. Actin mainly functions in cellular physiological processes in plants, including cell growth, cytokinesis, cell division, and several intracellular trafficking events [25]. This suggests that the transcribed sORFs of C. sativus might play a role in the regulation of growth and development. Hanada, et al. [18] reported that overexpression of sORFs caused varying morphological changes in transgenic A. thaliana and was associated with a higher growth rate.

Figure 3: Bubble plot showing the enriched GO terms. The x-axis indicates the gene ratio, while the y-axis indicates the enriched BP and CC terms. The size of each circle is positively correlated with the number of genes in each subgroup, while the colour of each circle indicates its significance level.

For the enriched KEGG pathways, the transcribed sORFs in C. sativus var. hardwickii were enriched in KEGG Orthology (KO) terms of plant hormone signal transduction (Figure 4). Plant sORFs have been demonstrated to play important roles in cell signalling, abiotic stress response, morphogenesis, and growth regulation [16, 18, 26, 27, 28, 29, 30, 31, 32, 33, 34]. Apart from that, some of the transcribed sORFs identified in C. sativus var. hardwickii are significantly associated with KO terms of various metabolic pathways. This suggests that C. sativus transcribed sORFs may play a significant role in plant growth and development and in environmental stress responses.

Figure 4: Bubble plot of enriched KEGG pathways. The x-axis indicates the gene ratio and the y-axis indicates the enriched pathways.

Conclusion

In this study, we have established a bioinformatics pipeline for the identification of small open reading frames (sORFs) in Cucumis sativus. Using the pipeline, different types of sORFs were identified from the genome and transcriptome datasets of Cucumis sativus var. hardwickii. GO and KEGG terms related to growth and development and to stress response were found to be enriched for the transcribed sORFs with translational potential. Further classification of sORFs is needed to minimise conflicting sORF annotations and ease their categorisation. Having a complete database of transcribed and translated sORFs in Cucumis sativus would help us understand the roles of plant SEPs, especially in biotic and abiotic stress responses. Such knowledge could in turn help cucumber producers improve crop viability, especially in harsh weather conditions, and produce cucumbers with higher market value.


Cite this article

Chieng GSW, Tan BC and Teo CH (2022). In Silico Prediction of Small Open Reading Frames from Intergenic Regions of Cucumis sativus L. Var. Hardwickii. Open Access Journal of Microbiology & Biotechnology, 7(4). https://doi.org/10.23880/oajmb-16000242