Different types of transcripts create the RNA world
RNA analysis has become central to contemporary transcriptome research, to the assessment of the molecular pathomechanisms of many diseases, including cancer, and to evolutionary and developmental studies. This focus is justified because RNA is the main source of information about dynamic changes in gene expression in response to different stimuli, whereas DNA provides only a static picture of cell genomics [1–3]. The RNA world hypothesis renewed the importance of RNA research by underlining the ability of RNA to store information about past eras. The next decade is likely to witness an enormous rise in interest in new RNA types, their functions, and their possible applications in innovative therapies and diagnostics. Although a complete picture of the RNA world is now a realistic prospect, the characteristics of some RNAs have not yet been studied in depth. The aim of this work is to explain the likely advantages of RNA analysis and to extend current knowledge to include the technologies and tools used in this field [1–3].
Although the Human Genome Project (HGP) was completed in 2003, its impact can still be felt today. This groundbreaking genome sequencing programme established that only 20,000–25,000 genes are protein-coding, accounting for less than 2% of the total genomic sequence [4, 5]. The fact that over 90% of the human genome is transcribed significantly influenced the present understanding of molecular genetics, confirming that the RNA world is far more complex than was assumed in the past century [5–7]. RNA molecules are classified according to whether they are protein-coding or non-coding and according to their length [8].
Even though non-coding RNAs (ncRNAs) such as transfer RNA (tRNA), ribosomal RNA (rRNA), and small nuclear RNAs (snRNAs) have been known for over 60 years, many new functional ncRNAs, along with their new features, are still being described [8, 9]. The majority of RNAs in a cell are ncRNAs [9]. They play a pivotal role as structural, catalytic, or regulatory RNAs [10–13]. ncRNAs are divided into two categories depending on their length: small non-coding RNAs (sncRNAs), which are shorter than 200 nucleotides, and long non-coding RNAs (lncRNAs), which are longer than 200 nucleotides [9]. rRNAs and tRNAs are usually excluded from this classification [14]. The features of RNA molecules are presented in Figure 1. The rapid development of RNA isolation techniques, sequencing capacity, and computational analysis has expanded knowledge of the RNA world. More and more types of RNA and their functions are being discovered and tested for suitability in medicine as diagnostic or therapeutic tools. These RNAs are obtained from different sources of biological material [15–19]. The uses of RNA-seq data in research are presented in Figure 2.
How to find the RNA maps?
The first step in research based on RNA-seq data, and at the same time the most problematic one, is finding a proper data source. In fact, some databases are of little use or are effectively inaccessible because their data cannot readily be analyzed. Fortunately, in recent years more and more portals have become “user-friendly”, offering tools for quick analysis, mostly by the “one click” method. It should be underlined that most of them are based on TCGA data, and in our opinion these are also the most developed and comprehensive ones.
The first RNA map: cBioPortal Platform
cBioPortal is an open platform for the search, visualization, and multifaceted analysis of cancer genomics data (mainly mutations and gene expression) and clinical data. The cancer genomics data in the public database are based on the TCGA project. The public version of cBioPortal contains over 200 studies, comprising TCGA projects and published results from the literature. The available data are grouped by the organ affected by cancer, and it is also possible to search by exact tumor type. The program's convenient interface provides many useful analytical functions, including a graphical summary of gene-level data across multiple platforms, analysis of the correlation between genes, survival analysis, and visualization of individual patient characteristics [20]. The aim of cBioPortal is to significantly lower the barriers between complex genomic data and cancer researchers by providing fast, intuitive, and high-quality access to molecular profiles and clinical features from large cancer genomics projects, enabling researchers to translate these intricate data sets into biological knowledge and medical applications [21]. The portal converts molecular profiling data from cancer tissues and cell lines into easily understandable genetic, epigenetic, and proteomic data. Owing to the easy-to-use interface, it is possible to customize data storage options and to interactively study genetic changes across samples, genes, and pathways, and even to link them to clinical results. The site provides graphical summaries of results from multiple platforms, web-based visualization and analysis, survival analysis, and software access. Complex cancer genomic profiles are thus available to researchers and clinicians without any need for bioinformatics knowledge, which is a great advantage [22].
The platform stores non-synonymous mutations, DNA copy number data (putative discrete values for each gene), messenger RNA (mRNA) and microRNA (miRNA) expression data, protein and phosphoprotein level data (based on reverse-phase protein arrays, RPPA, or mass spectrometry), DNA methylation data, and de-identified clinical data. In the Data Sets section, the user can obtain a breakdown of the data types available for each cancer study [21].
In addition to an easy-to-use interface and multiple data visualization tools such as OncoPrinter and MutationMapper, a great advantage of cBioPortal is that it incorporates data from large consortia, such as TCGA and Therapeutically Applicable Research to Generate Effective Treatments (TARGET), which supports their reliability. The platform also allows each user to customize data storage. Searching for information does not require bioinformatics skills, which is a great benefit over similar platforms. Public data are available to all users and can serve as an academic tool, while private data are available only to their owner or to authorized users. All data are frequently updated. Most come from the literature, but more accurate data are sometimes collected directly from researchers. The platform also has a group comparison option, a set of analytic tools that lets a user compare the clinical and genomic characteristics of user-defined groups of samples. For mRNA and miRNA expression studies, the program uses a Z-score, which indicates how many standard deviations a value lies from the mean expression in the reference population. This parameter is useful for determining whether a gene is up- or down-regulated relative to normal samples or to all other tumor samples. RNASeqV2 results from TCGA are processed and normalized using RSEM, a software package for estimating gene and isoform expression levels from RNA-seq data. To query miRNA data in the portal, either mature or precursor miRNA IDs can be entered. The portal creates one internal ID for each pairing of a precursor ID with a mature ID, and this ID is also displayed in OncoPrint. Commercial support for the open-source software is provided by an external company, The Hyve, which helps with implementation, data loading, development, advice, and training [21].
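The Z-score described above can be illustrated with a minimal Python sketch. It is not cBioPortal's own code: the function name, the use of the population standard deviation, and the expression values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def expression_z_scores(sample_values, reference_values):
    """Express each sample value as a number of standard deviations
    from the mean of a reference population.

    Assumption: the population standard deviation is used; a portal may
    use a different estimator or a different reference set.
    """
    mu = mean(reference_values)
    sigma = pstdev(reference_values)
    if sigma == 0:
        raise ValueError("reference population has zero variance")
    return [(value - mu) / sigma for value in sample_values]


# Hypothetical expression values for one gene (arbitrary units).
reference_population = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
tumor_samples = [5.0, 9.0, 1.0]

z_scores = expression_z_scores(tumor_samples, reference_population)
for value, z in zip(tumor_samples, z_scores):
    direction = "up" if z > 0 else "down" if z < 0 else "unchanged"
    print(f"expression {value}: z = {z:+.2f} ({direction})")
```

A positive score suggests expression above the reference mean and a negative score below it; the threshold at which a gene is called up- or down-regulated is a separate choice made by the analyst.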
Like any similar portal, cBioPortal has several disadvantages. Firstly, installing a local version requires system administration skills, including the installation and configuration of Tomcat and MySQL, for which the site provides no technical support. Secondly, for many studies only somatic mutation data and limited clinical data are available, which may not provide all the information needed. The types of data available for TCGA studies differ from those available for other studies. Germline mutations are supported by cBioPortal but, with a few exceptions, are not publicly available. Additionally, the portal does not store raw or probe-level data; it contains only gene-level data, and the data for the different isoforms of a given gene are combined. Raw and probe-level data are available through the Gene Expression Omnibus (GEO) and the database of Genotypes and Phenotypes (dbGaP) of the National Center for Biotechnology Information (NCBI), or through the Genomic Data Commons (GDC) of the National Cancer Institute (NCI). Another disadvantage is that cBioPortal does not store any expression data from normal tissue samples, which precludes some useful comparisons. Finally, the database holds miRNA data for only a few studies, and these data are not up to date. For more recent miRNA data, cBioPortal refers users to Broad Firehose or the GDC [21–23].
How to read the RNA maps?
The growing volume of RNA-seq data has not been matched by a growing number of specialists able to analyze and interpret the results and to present them graphically. Despite the popularity of the R programming language, frequently used with tools from Bioconductor [24, 25], and of the Python programming language with the SciPy library [26], data analysis remains problematic. Most scientists learn these languages on their own, mostly by trial and error. In the majority of cases, however, R and Python programs are complicated for biologists, and many beginners quickly give up. Because of these difficulties and the constant increase in the volume of transcriptomic data, on-demand analysis portals have emerged. Most of them are completely free of charge and present the results of statistical analyses in graphical form. Usually, these portals use data derived from the TCGA project, but in some cases users can upload their own data sets. The advantages and disadvantages of databases with RNA expression data are summarized in Table 1.
Database | Advantages | Disadvantages | Link to database |
The cBio Cancer Genomics Portal | The portal complements existing tools (ICGC and TCGA, IGV) by offering a unique focus on analyzing discrete genomic events across integrated data types; the database contains a tutorial and allows users to download or visualize data | Does not currently support synonymous mutations; does not include information on normal tissues | |
UALCAN | Included data is helpful in identifying candidate biomarkers; obtained information (box plots, KM-plots, and heatmaps) can be printed directly; the database contains a tutorial and allows users to download or visualize data; the database contains information on normal tissues | A lot of information has to be collected from external links | |
ENCORI | The database contains miRNA-pseudogene interaction networks and miRNA-lncRNA interactions; drafts interaction maps between miRNAs and circRNAs; ENCORI contains a tutorial and allows users to download or visualize data; the database contains information on normal tissues | The majority of CLIP-Seq and Degradome-Seq clusters could not be clearly predicted to be miRNA targets; availability is restricted to selected hours | |
The Kaplan-Meier Plotter | The database determines whether a given gene is a potential marker, oncogene, or tumor suppressor; it contains a tutorial and allows users to download or visualize data; the database contains information on normal tissues | Microarray and RNA-seq datasets are not combined | |
Oncomine | A premium account allows expanded analyses, multi-gene search, and custom concept upload; the database contains a tutorial and allows users to download or visualize data; it contains information on normal tissues | A user account is required to perform the analyses | |
Cancer-specific circRNAs database | The database allows users to determine whether circRNAs are cancer-specific; users can explore the potential functional regulation and translation of circRNAs by predicting the MRE, RBP, and ORF; the website contains a tutorial and allows users to download or visualize data; the database contains information on normal tissues | The database does not include any primary tumor samples | |
YM500 | The website provides novel findings including isomiRs, novel miRNAs, or miRNA expression; the database contains an extensive number of smRNA-seq datasets of various tissue/cell types and information on arm-switching; the portal contains a tutorial and allows users to download or visualize data; it contains information on normal tissues | Some of the comparisons depicted on charts lack p-values and a description of the statistical tests employed | |
OncoMir Cancer Database | The database allows users to analyze miRNA-derived survival outcome signatures for many cancer types or to analyze miRNA clusters; a signature predicting overall cancer survival is presented as a Kaplan-Meier survival curve; the website allows users to download or visualize data; the database contains information on normal tissues | Some recently identified miRNAs may not be available; the database does not contain a tutorial | |
BioXpress | Unification by a central cancer disease ontology; the database allows users to download or visualize data; it contains information on normal tissues | The database does not contain a tutorial | |
UALCAN
UALCAN is an interactive web portal for quick and easy evaluation of publicly available cancer transcriptome sequencing data to identify therapeutic targets and cancer biomarkers. Additionally, UALCAN helps to visualize the data in a simple, downloadable format. The portal uses TCGA level 3 RNA-seq and clinical data from 31 cancer types and enables researchers to analyze the relative expression of a gene of interest across tumor and normal samples. Moreover, it allows cancer subgroups to be browsed by pathological stage, patient race, tumor grade, and other clinicopathologic features. This information can be crucial for correlating gene expression with patient survival.
Data analysis: The UALCAN analysis page contains three panels. The left side panel shows a list of cancer types, which are hyperlinked to a web page showing heatmaps of the top differentially expressed genes. The upper-right side panel allows a user to search by official gene symbol and cancer type of interest, while the bottom-right side panel guides the user by gene classes. The clinical patient data and gene expression data were collected from the TCGA and processed to generate three major types of graphical output: 1) a box and whisker plot showing the gene expression level in different cancers and their subtypes/sub-stages. Primary tumor samples were categorized according to clinical patient data, and box plots of the expression level of each gene were generated across the various subgroups. The box plot categories are: individual cancer stage, tumor grade, patient age, race, gender, body weight, smoking status, drinking habit, and molecular signature; 2) a heatmap showing differential gene expression between normal and tumor samples; and 3) a Kaplan–Meier survival plot [27].
ENCORI
ENCORI is an open-source platform and an updated version of starBase, which contains information about miRNA-ncRNA, ncRNA-RNA, miRNA-mRNA, RNA-binding protein-ncRNA (RBP-ncRNA), RNA-RNA, and RBP-mRNA interactions from CLIP-seq, degradome-seq, and RNA-RNA interactome data. The platform contains data on 32 types of cancer derived from 10,882 RNA-seq and 10,546 miRNA-seq samples. Moreover, it facilitates the discovery of miRNA–target interaction maps from CLIP-seq and degradome-seq data from six organisms: human, mouse, Caenorhabditis elegans, Arabidopsis thaliana, Oryza sativa, and Vitis vinifera.
Data analysis: In the upper part of the page, users can choose from the many interfaces for displaying RNA interactions. They can then either enter a gene symbol or select a gene from the list box. To refine a search, users can apply filters such as interaction number, genome, or clade. Connected links launch external resources such as the NCBI, University of California Santa Cruz (UCSC), and The Arabidopsis Information Resource (TAIR) databases, from which more comprehensive information can be obtained. Users can also download, analyze, and visualize large-scale data sets [28, 29].
The Kaplan-Meier Plotter (KM)
The Kaplan-Meier Plotter is a database that provides detailed expression comparisons of genes of interest in cancer and normal tissue and helps the researcher to assess the correlation between the expression of 54,000 genes and survival in 21 cancer types. The largest datasets include breast, lung, ovarian, and gastric cancer. The miRNA subsystems contain 11,000 samples from 20 different cancer types. Gene expression data and overall survival information are downloaded from the European Genome-phenome Archive (EGA), GEO, and TCGA. To analyze the prognostic value of a chosen gene, patient samples are split into two groups according to various quantiles of expression of the proposed biomarker. The two patient cohorts are compared using a Kaplan-Meier survival plot, and the hazard ratio with 95% confidence intervals and the log-rank p-value are calculated.
Data analysis: The web page consists of four panels: ‘protein’, ‘miRNA’, ‘RNA-seq’, and ‘mRNA gene chips’. To perform a survival analysis, users simply choose one of them and type in the symbol of the gene they are interested in. They can also set analysis parameters to refine the analysis. The gene expression levels in the different groups of patients and cell lines are shown according to clinical or pathological status. The methodology behind the generation of the survival curves is rather unclear. The “How does it work?” section of the website states that a log-rank p-value is provided, which suggests that it is calculated using the log-rank test. On the other hand, the reference cited as the main source of information about the tool describes the Cox regression model, also called Cox's proportional hazards model, which could be used as the equivalent of a log-rank test. Both of these methods assume proportional hazards. Interestingly, an alternative, the Gehan-Breslow-Wilcoxon test, gives greater weight to deaths at early time points. The results of the KM-plotter analysis tool include the Kaplan–Meier plot with the hazard ratio and p-value, but without explicit information about the statistical test employed [30–32].
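Because the statistical procedure is not fully documented, it may help to see what a cohort split, a Kaplan-Meier estimate, and a two-group log-rank test involve. The following Python sketch is a generic textbook implementation, not the KM-plotter's own code; the function names, the median cutoff, and all patient data are assumptions made for illustration. The hazard ratio, which requires a Cox model, is not computed here.

```python
import math
from statistics import median


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.
    events[i] is 1 for a death at times[i] and 0 for a censored patient."""
    data = sorted(zip(times, events))
    at_risk, survival, curve, i = len(data), 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = leaving = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def log_rank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    pooled = zip(list(times_a) + list(times_b), list(events_a) + list(events_b))
    event_times = sorted({t for t, e in pooled if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))
    if variance == 0:
        raise ValueError("log-rank statistic is undefined for these data")
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with one degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def split_by_cutoff(expression, times, events, cutoff):
    """Split patients into 'high' and 'low' expression cohorts."""
    groups = {"high": ([], []), "low": ([], [])}
    for x, t, e in zip(expression, times, events):
        key = "high" if x > cutoff else "low"
        groups[key][0].append(t)
        groups[key][1].append(e)
    return groups


# Hypothetical cohort: expression of one gene, follow-up in months, death flag.
expression = [9.1, 8.7, 8.2, 7.9, 7.5, 3.1, 2.8, 2.2, 1.9, 1.5]
months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
died = [1] * 10

cohorts = split_by_cutoff(expression, months, died, median(expression))
chi2, p_value = log_rank_test(*cohorts["high"], *cohorts["low"])
print(kaplan_meier(*cohorts["high"]))
print(f"log-rank chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

Replacing the median with another quantile changes the cohorts and therefore the p-value, which is one reason the cutoff and the test used should be reported alongside any plot.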
Oncomine
Oncomine is a cancer-profiling database that integrates microarray data analysis with other resources, including gene ontology annotations, to facilitate rapid interpretation of a gene's potential role in cancer. The website currently hosts gene expression and sample data from 500 cancer types and a wide range of cancer-related cell lines. It is intended not only for individual researchers but also for multinational companies, and it enables them to visualize and assess the differential expression of a selected gene across all available datasets. After choosing a gene of interest, users receive a list of the differential expression analyses in which the gene was included. For the chosen analyses, the statistical results are provided and linked to graphical representations of the microarray data.
Data analysis: Oncomine data can be searched by gene, by multiple genes, or by a number of cancer-related terms. The visualizations are then classified by sample attributes, such as result, Gleason grade, molecular change, stage, treatment, survival, and relapse, to identify genes that are deregulated in particular cancer subtypes or that are related to specific clinical or pathological parameters. This makes the database very effective for studying genes that are overexpressed in many cancer datasets and for verifying the relationship between transcription and disease. Many cancer microarray studies focus on identifying potential therapeutic targets or diagnostic markers. To support this, genes are presented with their ontological descriptors. Using information from the Gene Ontology (GO) consortium, three ontological categories have been created: 1) kinases that could be inhibited by small-molecule kinase inhibitors; 2) membrane-bound molecules that could be targeted by antibody therapies; and 3) secreted molecules that could serve as serum biomarkers.
Finally, it should be mentioned that Oncomine integrates information from other bioinformatic resources, including Swiss-Prot, UniGene, and LocusLink, and provides direct links to the Human Protein Reference Database (HPRD) and to the pathway resource Kyoto Encyclopedia of Genes and Genomes (KEGG) [33, 34].
Cancer-specific circRNAs database (CSCD)
The Cancer-specific circular RNAs (circRNAs) database (CSCD) was created to collect and describe circRNAs that have been associated with particular malignancies. CSCD gathered available RNA-seq datasets (total RNA with rRNA depleted or poly A- enriched) from 87 cancer cell line samples. The updated version, CSCD 2.0, is built from over 1,000 samples, comprising 825 tissues and 288 cell lines. An integrated analysis was performed using four popular algorithms: circRNA identifier (CIRI), circRNA_finder, find_circ, and circexplorer. In CSCD, 272,152 cancer-specific circRNAs were deposited. Currently, CSCD 2.0 holds 1,013,461 circRNAs from cancer samples, 1,533,704 circRNAs from normal samples, and 354,422 circRNAs found in both types of sample. The database provides the following components, which could make a significant contribution to circRNA research in cancer: 1) miRNA target sites in cancer-specific circRNAs (CS-circRNAs); 2) CS-circRNAs; 3) RBP binding sites in CS-circRNAs; 4) cancer-specific alternative splicing associated with CS-circRNAs; 5) potential open reading frames (ORFs) in CS-circRNAs.
Data analysis: The main page of CSCD is composed of three panels: 1) the Query Panel, in which users can search for circRNAs by selecting the sample name, sample type, gene symbol, and cellular localization, or by entering a circRNA ID in the search box. The output list displays all the information for each cancer-specific circRNA, including genomic coordinates, host gene, location, circBase ID, annotation, spliced exon, ratios (circRNA/linear gene), prediction algorithm, and abundance. By selecting a gene symbol, users are redirected to the Gene Panel, which shows all circRNAs in the different samples; 2) the Gene Panel, used to view all circRNAs for a selected gene. The image displays gene structures with differently colored rectangles for exons and black lines for introns, while circRNAs are shown as colored curves. A high-resolution, zoomed view of the image is available from the upper right corner of the panel. Users can also view detailed information about a cancer-specific circRNA, including algorithm, junction reads, location, and sample date, by clicking “CircRNA”. Other information can be found in the transcript, gene, circRNA, and splicing tabs; 3) the CircRNA Panel, which visualizes a specific circRNA by connecting its exons in a colored circle. It enables users to view the position and number of RBP, microRNA response element (MRE), and open reading frame (ORF) elements. Detailed information can then be checked in the RBP, circRNA, MRE, and ORF tabs [35–37].
YM500 portal
YM500 is a database that includes more than 8000 cancer-related small RNA (smRNA) datasets, arm-switching discovery, integrated pipelines for novel miRNA prediction, miRNA quantification from smRNA-seq, isomiR identification, and functional sncRNAs in smRNA-seq datasets. Researchers can obtain miRNA-related information from 141 mouse and 468 human smRNA-seq datasets through a user-friendly web interface.
Data analysis: YM500 provides interactive query interfaces, such as Expression, isomiRs, Novel miRNAs, arm-switching, Meta-analysis, Survival, Cancer, Result, Download, and Documentation. The analysis of data is divided into main blocks: 1) expression: this feature helps users search for miRNAs or sncRNAs according to ID lists or miRNA cluster definitions. Several statistical charts help researchers depict the expression profile of a given sncRNA across distinct cancer types. They can explore the Tissue Expression, Histogram, or Cancer Expression views in the Expression profile or browse box plots for individual cancers; 2) survival: this section has two features, ‘All Cancer Types’ and ‘Specific Sample Group’. The first presents the survival analysis of a specific sncRNA (or miRNA) in all types of cancer, including a table summarizing all cancers and a Kaplan-Meier plot for each individual cancer type. ‘Specific Sample Group’ helps researchers to identify a subset of samples in one type of cancer, such as triple-negative breast cancer, in order to perform survival analysis according to dozens of clinical characteristics; 3) cancer: this section contains the calculated outcome of differential expression analyses, miRNA–gene interactions, and cancer miRNA-related pathways for a specific type of cancer for which smRNA-seq and RNA-seq data of tumor and normal tissues from the same individuals are available. It also provides information on “individual miRNA-gene-specific pairs” to help scientists study the interactions between miRNAs and genes according to user-defined criteria. To illustrate the relationships between multiple miRNA-gene interactions, users can apply the Cytoscape Web tool for interactive web visualization.
The correlations for each miRNA-gene pair among the differentially expressed miRNAs and genes are calculated and classified into three groups: “Predicated”, “Validated” and “Without any evidence”; 4) meta-analysis: this new function allows researchers to identify differentially expressed miRNAs and arm-switching events between two user-defined, specific sets of samples. Users can pick one or more datasets and define the sample type they would like to study, such as “Primary Solid Tumor” or “Solid Tissue Normal”. This subpage also includes a list of clinical criteria, such as tumor stage, distant metastases, and lymph node status, to help scientists select a subset of well-defined cancer samples according to one or more clinical parameters; 5) novel miRNAs: these can be searched by the exact sequence of a mature miRNA or by genomic location. Users obtain information about the mature and precursor miRNA, including predicted target genes, prediction algorithms, RNA secondary structures, expression profiles, and hyperlinks to three commonly used genome browsers; 6) isomiRs: this section enables researchers to find the isomiRs of the known miRNAs in miRBase. To that end, users define the isomiR criteria: the number of read counts, the number of mismatches, the number of expressed samples, and the isomiR type (addition or trimming at the 5'/3' end); 7) arm switching: this option provides two ways to study the process. First, YM500 allows users to select a specific miRNA precursor and profile the expression of its two arms across samples and tissues, so that arm-switching events in a specific miRNA species can be reviewed quickly. Second, users can select two groups of samples, as annotated in the database, and YM500 identifies the miRNA precursors whose dominant expression switches from one arm to the other between the two groups [38–41].
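The second arm-switching query described above can be sketched in a few lines of Python. This is only an illustration of the idea, not YM500's algorithm: the dominance rule (the arm with more total reads), the data layout, and the precursor names mir-X and mir-Y are hypothetical.

```python
def dominant_arm(counts_5p, counts_3p):
    """Return '5p' or '3p' for the arm with more total reads,
    or None when the two arms are tied."""
    total_5p, total_3p = sum(counts_5p), sum(counts_3p)
    if total_5p == total_3p:
        return None
    return "5p" if total_5p > total_3p else "3p"


def find_arm_switching(group_a, group_b):
    """List precursors whose dominant arm differs between two sample groups.

    Each group maps a precursor name to {"5p": [...], "3p": [...]},
    where the lists hold read counts per sample.
    """
    switched = []
    for precursor in sorted(group_a.keys() & group_b.keys()):
        arm_a = dominant_arm(group_a[precursor]["5p"], group_a[precursor]["3p"])
        arm_b = dominant_arm(group_b[precursor]["5p"], group_b[precursor]["3p"])
        if arm_a and arm_b and arm_a != arm_b:
            switched.append((precursor, arm_a, arm_b))
    return switched


# Hypothetical read counts for two precursors in two sample groups.
normal_group = {
    "mir-X": {"5p": [100, 120], "3p": [10, 20]},
    "mir-Y": {"5p": [50, 60], "3p": [5, 5]},
}
tumor_group = {
    "mir-X": {"5p": [10, 5], "3p": [90, 110]},
    "mir-Y": {"5p": [70, 40], "3p": [8, 2]},
}

switching_events = find_arm_switching(normal_group, tumor_group)
print(switching_events)
```

A real analysis would normalize read counts between samples and apply a statistical criterion before calling an arm-switching event; the raw-sum rule here is kept deliberately simple.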
OncoMir Cancer Database (OMCD)
The OncoMir Cancer Database allows a simple, systematic, and comparative analysis of miRNA expression sequencing data derived from more than 10,000 cancer patients. It also provides the clinical information and organ-specific controls present in the TCGA database. Control vs. control, tumor vs. control, and tumor vs. tumor statistical analyses are included in the OncoMir Cancer Database to generate clear and defensible hypotheses. The OMCD database contains miRNA profiling information from samples representing 33 different tumor types and, where available, the corresponding normal tissue.
Data analysis: Users can search for miRNAs by name or by type of cancer to get a list of important relationships between miRNAs and malignancy. In addition, the search interface can be used to retrieve RNA-seq data for one or more miRNAs across multiple types of cancer to gain a detailed view of cancer-dependent miRNA expression profiles. Users can also search the miRNA statistical results by user-defined thresholds or by specific miRNA [42–44].
BioXpress
BioXpress is a database for gene expression and cancer associations, where expression levels are mapped to genes using RNA sequence data from the ICGC, TCGA, Expression Atlas and publications. It includes expression data from 6,361 patients, 64 cancer types and 17,469 genes with 9,513 of the genes displaying differential expression between tumor and normal samples. All types of cancer are mapped to Disease Ontology terms to facilitate a uniform pan-cancer analysis.
Data analysis: Users can search BioXpress using a UniProtKB or RefSeq accession or a HUGO Gene Nomenclature Committee gene symbol. Moreover, the differentially expressed genes for a specific cancer type can also be retrieved. Query submission redirects to the results page, which is organized into chart and table components. The default chart for a BioXpress transcript search lists the frequencies of patients corresponding to each expression trend (under- or overexpression) for the queried transcript across all relevant cancer types. Here, patients with log2 fold change values higher than zero are considered to follow an overexpression trend, while those with values lower than zero follow an under-expression trend. The second chart illustrates the percentage of patients whose individual expression trend matches the significant trend reported for the queried transcript across different cancer types, with two colored series denoting different thresholds. The third chart shows a box plot of tumor sample expression, covering both the samples with matched normal data and all unpaired tumor samples from the TCGA. In addition to expression values and various statistics, the hit table contains a variety of identifiers and hyperlinks to relevant BioMuta entries and other external resources. As an alternative, the advanced cancer type search enables users to combine multiple search terms, including transcript and cancer types, in one query while simultaneously filtering hits by trend and significance threshold. The default view for the cancer type search displays the significance of patient expression for the top 20 transcripts matching the queried cancer type and meeting the other search criteria. All data in BioXpress can be downloaded, including a list of genes that differ significantly in two or more types of cancer [45–47].
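The trend rule described above (log2 fold change above or below zero) can be illustrated with a short Python sketch. It is not BioXpress code: the function name, the pseudocount of 1 added to avoid division by zero, and the paired expression values are assumptions.

```python
import math
from collections import Counter


def expression_trend_frequencies(tumor, normal, pseudocount=1.0):
    """Classify each patient by the sign of log2(tumor / normal) and
    return the fraction of patients following each trend.

    Assumption: a pseudocount is added to both values so that zero
    expression does not break the ratio.
    """
    trends = Counter()
    for tumor_value, normal_value in zip(tumor, normal):
        log2_fc = math.log2((tumor_value + pseudocount) / (normal_value + pseudocount))
        if log2_fc > 0:
            trends["over"] += 1
        elif log2_fc < 0:
            trends["under"] += 1
        else:
            trends["unchanged"] += 1
    total = sum(trends.values())
    return {trend: count / total for trend, count in trends.items()}


# Hypothetical paired expression values for one transcript in four patients.
tumor_expression = [8, 2, 4, 1]
normal_expression = [2, 8, 4, 3]

frequencies = expression_trend_frequencies(tumor_expression, normal_expression)
print(frequencies)
```

The sign of the fold change only describes a per-patient trend; whether a transcript is significantly differentially expressed in a cancer type is a separate statistical question.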
Conclusions and future directions
In this review we presented and discussed commonly used data storage and analysis portals, such as cBioPortal, UALCAN, ENCORI, and others. In our opinion, biomarkers based on various RNA molecules, both protein-coding and non-coding, will contribute to the personalization of oncology. However, these transcriptome profiles still need to be described and linked with the clinicopathological data of patients. A major obstacle to this goal is that high-throughput data analysis skills are rarely acquired during university classes, and most transcriptomics researchers have no bioinformatics training. For them, the fastest option is to use websites that allow quick data analysis. However, these databases also have many disadvantages. The first is that portals based on TCGA data are not connected with each other: the same samples cannot be analyzed in the context of, for example, miRNA:lncRNA or miRNA:mRNA. Moreover, samples cannot be selected by their ID number, nor can specified patients be selected or excluded on the basis of their clinicopathological parameters. In addition, some databases or datasets provide no expression levels for healthy tissue, and adjacent matched healthy and cancer samples from the same patient cannot be analyzed. This problem can be solved only by downloading data from the Xena Platform [48].
Inconsistent gene nomenclature is the second issue. Unfortunately, some authors still use outdated synonyms or improper nomenclature. Unifying the gene names used in databases with those approved by the HUGO Gene Nomenclature Committee (HGNC) (https://www.genenames.org/) should be good practice. This portal sets the standards for human gene nomenclature and is organized by HUGO [49]. The HGNC portal provides tools such as: i) the BioMart server, which filters and selects the relevant data from a large dataset; ii) the HCOP tool for searching orthology predictions; and iii) the Multi-symbol checker for comparing gene symbols. It should be noted that HGNC also enables grouping genes according to gene type or locus [49]. In our opinion, HUGO should oblige the authors of portals to standardize gene names in accordance with its guidelines, and the authors of newly created portals should follow them. This would allow the data to be organized and mistakes and misunderstandings to be avoided.
The next issue is that different databases apply different normalization methods [50]. Moreover, the TCGA-based websites offer no possibility of changing the normalization, so the only solution for users is to download raw RNA-seq data from the Xena Platform and normalize it themselves.
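As an illustration of what normalizing downloaded counts oneself can involve, the following minimal Python sketch converts raw read counts for one sample into counts per million (CPM) and applies a log2 transform with a pseudocount. CPM is chosen here only as a simple, widely used example and is not the method of any particular portal; the gene names, count values, and function names are hypothetical.

```python
import math


def counts_per_million(raw_counts: dict[str, int]) -> dict[str, float]:
    """Scale raw read counts of one sample to counts per million (CPM)."""
    library_size = sum(raw_counts.values())
    if library_size == 0:
        raise ValueError("the sample contains no reads")
    return {gene: count * 1_000_000 / library_size for gene, count in raw_counts.items()}


def log2_cpm(raw_counts: dict[str, int], pseudocount: float = 1.0) -> dict[str, float]:
    """Return log2(CPM + pseudocount); the pseudocount avoids log2(0)."""
    cpm = counts_per_million(raw_counts)
    return {gene: math.log2(value + pseudocount) for gene, value in cpm.items()}


# Hypothetical raw counts for a single sample (placeholder gene names).
sample_counts = {"GENE_A": 500, "GENE_B": 1500, "GENE_C": 0}
sample_cpm = counts_per_million(sample_counts)
sample_log2_cpm = log2_cpm(sample_counts)
```

CPM corrects only for sequencing depth. Comparisons between genes of different lengths, or between samples with very different transcriptome composition, would call for other methods, which is one reason why values from differently normalized databases are not directly comparable.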
It should also be emphasized that it is not always clear which statistical test was used. In most cases, this information is not given on the database webpage, the reference is not clearly explained, or there is only a note about the script used, which is not understandable for users without programming knowledge. Moreover, the statistical analysis is often not provided as an exportable table, and the figures often cannot be graphically modified, as is the case with, for example, the UALCAN database [27].
The last issue is that some databases, such as ENCORI, have limited access because of the large number of users and server overload, while others, such as Oncomine, require payment for full access. Despite these inconveniences, it is possible to perform preliminary data analysis, visualize the results, download the data from the Xena Platform, and analyze it using other tools.
Many different types of RNA databases have been created, and choosing a proper one is one of the keys to success. For this reason, we recommend visiting the Database Commons (DC) portal. This large catalog of biological databases has been updated since 2015 and provides extensive information about databases, making it possible to assess their reliability and usefulness. For example, the DC portal describes 16 active databases connected with the TCGA project [51]. For testing resistance to therapeutics, databases have been created and are being developed to determine the effectiveness of a given molecule and to identify individual molecular pathways involved in the metabolism of, or resistance to, a given drug. However, there are no databases similar to those discussed above in the context of chemotherapy drugs [52–54]. Moreover, to our knowledge, there are no RNA-seq databases enabling rapid analysis of genetic changes in a cell exposed to ionizing radiation. Such a database would expand knowledge about radiogenomics and accelerate the introduction of biomarkers in radiotherapy-based treatment [55–57].
Finally, we can conclude that, despite the described imperfections and inconveniences, RNA databases enable crucial preliminary analyses for testing hypotheses, both in grant applications and in cancer research.
Conflict of interests
The authors declare that there is no conflict of interest regarding the publication of this paper. The use of the data does not violate the rights of any person or any institution.
Funding
This work was supported by the Greater Poland Cancer Centre, grant no.: 12/2021 (248), 10/11/2021/PGN/WCO/03 and grant no.: 1/2023 (263), 13/01/2023/PGN/WCO/001. While writing this manuscript, Klaudia Dudek received a “The best in nature 2.0. The integrated program of the Poznań University of Life Sciences” scholarship from the University of Life Sciences in Poznań. Joanna Kozłowska-Masłoń received a PhD program scholarship from the Adam Mickiewicz University, Poznań, while writing this manuscript; Kacper Guglas received a scholarship from the European Union POWER PhD program and from the Medical University of Warsaw while writing this manuscript.
Acknowledgments
Many thanks to the Greater Poland Cancer Centre for supporting our work.
Authors’ contributions
Authors’ individual contributions: conceptualization: T.K., A.T.; writing — original draft preparation: M.S., J.L., J.O., P.P., M.R., P.P., J.K.M., K.G., K.D., N.G., K.R., A.F., U.K., T.K.; writing — review and editing: T.K., U.K., A.T., K.R.; visualization: T.K., M.S., J.L., P.P.; supervision: T.K., A.T.; funding acquisition: A.T., K.G., K.L. All authors read and approved the final manuscript.
Availability of data and materials
The datasets used during the current study are available online from the websites described and from PubMed.
Ethics approval
All data are available online, and access is unrestricted. The use of the data does not violate the rights of any person or any institution.
Consent for publication
All authors read and approved the final manuscript.