DEV Community: MennahTullah Mabrouk

Analyzing RNA-seq data with DESeq2 (Part 2)

MennahTullah Mabrouk — Sat, 29 Jul 2023 11:36:09 +0000

As I continue learning, I encountered several issues while using the previous dataset, RNA-seq of human multiple myeloma patients myeloid-derived suppressor cells (M-MDSC). Unfortunately, I was unable to prepare it for DESeq2 analysis due to unexpected errors, which resulted in wasted time and frustration. Despite my strong desire to utilize the dataset, I couldn’t achieve the desired results.
During my search for solutions, I came across an excellent Differential expression with DEseq2 Tutorial, which provided valuable insights into the DESeq2 analysis. However, since I still didn’t know how to properly prepare the (M-MDSC) dataset, I decided to work with the dataset used in the tutorial. This way, I could apply the knowledge I had gained from the tutorial to a dataset with known steps and expectations.
We will cover data preprocessing, differential expression analysis, data visualization with histograms and heatmaps, and the use of Principal Component Analysis (PCA) for sample relationship visualization. For those interested in the entire workflow, I have documented it on my GitHub. Feel free to explore.

Starting with

This step loads the data and prepares it in the format required for differential expression analysis with the DESeq2 package. The raw counts are converted into a numeric matrix with gene IDs as row names.
“Basically, here we upload our data and prepare rawcounts”

library(DESeq2)
library(ggplot2)
library(pheatmap)

# Read in the raw read counts
rawCounts <- read.delim("http://genomedata.org/gen-viz-workshop/intro_to_deseq2/tutorial/E-GEOD-50760-raw-counts.tsv")
head(rawCounts)

# Read in the sample mappings
sampleData <- read.delim("http://genomedata.org/gen-viz-workshop/intro_to_deseq2/tutorial/E-GEOD-50760-experiment-design.tsv")
head(sampleData)

#For rawCounts
# Convert count data to a matrix of appropriate form that DEseq2 can read
geneID <- rawCounts$Gene.ID
sampleIndex <- grepl("SRR\\d+", colnames(rawCounts))
rawCounts <- as.matrix(rawCounts[, sampleIndex])
rownames(rawCounts) <- geneID
head(rawCounts)

Console Snip

> head(rawCounts)
                SRR975551 SRR975552 SRR975553 SRR975554 SRR975555 SRR975556
ENSG00000000003      6617      1352      1492      3390      1464      1251
ENSG00000000005        69         1        20        23        12         4
ENSG00000000419      2798       714       510      1140      1667       322
ENSG00000000457       486       629       398       239       383       290
ENSG00000000460       466       342        73       227       193        35
ENSG00000000938        75        95       158       107       135        75
                SRR975557 SRR975558 SRR975559 SRR975560 SRR975561 SRR975562
ENSG00000000003       207      1333      2126      1799      1362      3435
ENSG00000000005        20         2         3         6        10        15
ENSG00000000419       273       621      1031       677       480      1194
ENSG00000000457       164       452       172       229       264       297
ENSG00000000460        38       184       174        68        46       173
ENSG00000000938       236       254       121       107        94        90
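
The column selection above can be tried on a small table first. The following is a minimal sketch in base R, assuming a made-up data frame (toy, with invented gene IDs and counts) in place of the downloaded file. It keeps only the columns that look like run accessions and checks that the result is a clean numeric matrix.

```r
# Toy stand-in for the downloaded table (all values are invented)
toy <- data.frame(
  Gene.ID   = c("ENSG01", "ENSG02", "ENSG03"),
  Gene.Name = c("A", "B", "C"),
  SRR000001 = c(10L, 0L, 5L),
  SRR000002 = c(12L, 3L, 7L)
)

# Keep only the columns whose names look like run accessions (SRR + digits)
is_sample <- grepl("^SRR\\d+$", colnames(toy))
toy_counts <- as.matrix(toy[, is_sample])
rownames(toy_counts) <- toy$Gene.ID

# Sanity checks before handing a matrix like this to DESeq2
stopifnot(is.numeric(toy_counts), !anyNA(toy_counts), all(toy_counts >= 0))
dim(toy_counts)
```

Anchoring the pattern with ^ and $ is a small design choice: it avoids picking up annotation columns that merely contain "SRR" somewhere in their name.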

Continue Data Preprocessing and Preparation

Now we select specific columns, rename them, and convert the “individualID” column to a factor, ensuring that the data is in a format compatible with DESeq2’s requirements.
“Basically, here we prepare sampleData”

# Convert sample variable mappings to an appropriate form that DESeq2 can read
head(sampleData)
rownames(sampleData) <- sampleData$Run
keep <- c("Sample.Characteristic.biopsy.site.", "Sample.Characteristic.individual.")
sampleData <- sampleData[, keep]
colnames(sampleData) <- c("tissueType", "individualID")
sampleData$individualID <- factor(sampleData$individualID)
head(sampleData)

Watch Difference in Console Snip

> # Convert sample variable mappings to an appropriate form that DESeq2 can read
> head(sampleData)
        Run Sample.Characteristic.biopsy.site.
1 SRR975551                      primary tumor
2 SRR975552                      primary tumor

  Sample.Characteristic.Ontology.Term.biopsy.site. Sample.Characteristic.disease.
1             <http://www.ebi.ac.uk/efo/EFO_0000616>              colorectal cancer
2             <http://www.ebi.ac.uk/efo/EFO_0000616>              colorectal cancer

  Sample.Characteristic.Ontology.Term.disease.
1         <http://www.ebi.ac.uk/efo/EFO_0005842>
2         <http://www.ebi.ac.uk/efo/EFO_0005842>

  Sample.Characteristic.disease.staging.
1             Stage IV Colorectal Cancer
2             Stage IV Colorectal Cancer

  Sample.Characteristic.Ontology.Term.disease.staging.
1                                                   NA
2                                                   NA

  Sample.Characteristic.individual. Sample.Characteristic.Ontology.Term.individual.
1                             AMC_2                                              NA
2                             AMC_3                                              NA

  Sample.Characteristic.organism. Sample.Characteristic.Ontology.Term.organism.
1                    Homo sapiens <http://purl.obolibrary.org/obo/NCBITaxon_9606>
2                    Homo sapiens <http://purl.obolibrary.org/obo/NCBITaxon_9606>

  Sample.Characteristic.organism.part.
1                                colon
2                                colon

  Sample.Characteristic.Ontology.Term.organism.part. Factor.Value.biopsy.site.
1      <http://purl.obolibrary.org/obo/UBERON_0001155>             primary tumor
2      <http://purl.obolibrary.org/obo/UBERON_0001155>             primary tumor

  Factor.Value.Ontology.Term.biopsy.site. Analysed
1    <http://www.ebi.ac.uk/efo/EFO_0000616>      Yes
2    <http://www.ebi.ac.uk/efo/EFO_0000616>      Yes


> rownames(sampleData) <- sampleData$Run
> keep <- c("Sample.Characteristic.biopsy.site.", "Sample.Characteristic.individual.")
> sampleData <- sampleData[, keep]
> colnames(sampleData) <- c("tissueType", "individualID")
> sampleData$individualID <- factor(sampleData$individualID)


> head(sampleData)
             tissueType individualID
SRR975551 primary tumor        AMC_2
SRR975552 primary tumor        AMC_3
SRR975553 primary tumor        AMC_5

Here is one of the things I had learned

This step prepares the data for unsupervised clustering analysis by reordering columns, renaming tissue types, converting variables to factors, and creating a DESeqDataSet object. It then performs a variance stabilizing transformation and calculates a correlation matrix for hierarchical clustering with correlation heatmaps.

# Put the columns of the count data in the same order as row names of the sample mapping, then make sure it worked
rawCounts <- rawCounts[, unique(rownames(sampleData))]
all(colnames(rawCounts) == rownames(sampleData))

# Rename the tissue types
rename_tissues <- function(x) {
  x <- switch(as.character(x), "normal" = "normal-looking surrounding colonic epithelium", "primary tumor" = "primary colorectal cancer", "colorectal cancer metastatic in the liver" = "metastatic colorectal cancer to the liver")
  return(x)
}
sampleData$tissueType <- unlist(lapply(sampleData$tissueType, rename_tissues))

# Order the tissue types so that it is sensible and make sure the control sample is first: normal sample -> primary tumor -> metastatic tumor
sampleData$tissueType <- factor(sampleData$tissueType, levels = c("normal-looking surrounding colonic epithelium", "primary colorectal cancer", "metastatic colorectal cancer to the liver"))

# Modify factor levels to comply with safe naming conventions
levels(sampleData$individualID) <- gsub("[^A-Za-z0-9_.]", "_", levels(sampleData$individualID))
levels(sampleData$tissueType) <- gsub("[^A-Za-z0-9_.]", "_", levels(sampleData$tissueType))

# Create the DESeqDataSet object
deseq2Data <- DESeqDataSetFromMatrix(countData = rawCounts, colData = sampleData, design = ~ individualID + tissueType)

# Estimate size factors
dds_wt <- estimateSizeFactors(deseq2Data)

# Unsupervised clustering analysis: log transformation using vst
vsd_wt <- vst(dds_wt, blind = TRUE)

# Hierarchical clustering with correlation heatmaps
vsd_mat_wt <- assay(vsd_wt)
vsd_cor_wt <- cor(vsd_mat_wt)

# Save vsd_cor_wt as a TSV file (Optional only to see an overview)
write.table(vsd_cor_wt, file = "vsd_cor_wt.tsv", sep = "\t", row.names = TRUE, col.names = TRUE)
Let’s Break Our Code a Little Bit:
Reordering the columns of the count data rawCounts to match the row names of the sample metadata sampleData (the run accessions, such as SRR975551) ensures that the samples are in the same order in both objects. This step is crucial because DESeq2 does not reorder samples for you: the columns of the count matrix and the rows of the metadata must already be in a consistent order for the statistical analysis to be valid.

rawCounts <- rawCounts[, unique(rownames(sampleData))]
all(colnames(rawCounts) == rownames(sampleData))
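
The same reorder-and-check idea can be sketched on toy objects. This example assumes invented sample names (S1 to S3) and adds one extra guard that is not in the original code: it stops early if a sample is missing on either side, instead of failing later with a subscript error.

```r
# Toy count matrix with samples in a different order than the metadata
counts <- matrix(1:6, nrow = 2,
                 dimnames = list(c("gene1", "gene2"), c("S3", "S1", "S2")))
meta <- data.frame(tissueType = c("normal", "tumor", "tumor"),
                   row.names = c("S1", "S2", "S3"))

# Fail early if any sample is present in one object but not the other
stopifnot(setequal(colnames(counts), rownames(meta)))

# Reorder the columns by name, then confirm the alignment
counts <- counts[, rownames(meta), drop = FALSE]
identical(colnames(counts), rownames(meta))
```
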
Renaming tissue types to more informative names makes the data easier to interpret and understand.
Providing clearer labels for the different sample groups will particularly be useful for visualizations and result interpretation.
rename_tissues <- function(x) {
  x <- switch(as.character(x), "normal" = "normal-looking surrounding colonic epithelium", "primary tumor" = "primary colorectal cancer", "colorectal cancer metastatic in the liver" = "metastatic colorectal cancer to the liver")
  return(x)
}

sampleData$tissueType <- unlist(lapply(sampleData$tissueType, rename_tissues))

Ordering the tissue types sets the reference level of the factor: the first level (the normal tissue) becomes the baseline against which the other tissue types are compared. This step is crucial for performing meaningful differential expression analysis between specific tissue types.
Note: The gsub() function replaces every character that is not a letter, digit, underscore, or dot with an underscore. This keeps the factor levels compliant with safe naming conventions and avoids potential issues in subsequent analyses.

sampleData$tissueType <- factor(sampleData$tissueType, levels = c("normal-looking surrounding colonic epithelium", "primary colorectal cancer", "metastatic colorectal cancer to the liver"))

levels(sampleData$individualID) <- gsub("[^A-Za-z0-9_.]", "_", levels(sampleData$individualID))
levels(sampleData$tissueType) <- gsub("[^A-Za-z0-9_.]", "_", levels(sampleData$tissueType))
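
As an alternative to the switch() helper, the renaming can be written as a lookup in a named vector. This is only a sketch of the same mapping in base R, using a hypothetical three-element tissue vector. It also shows what the gsub() pattern does to the new labels.

```r
tissue <- c("normal", "primary tumor", "colorectal cancer metastatic in the liver")

# Named lookup vector: old label -> new label (same mapping as the switch() above)
new_labels <- c(
  "normal" = "normal-looking surrounding colonic epithelium",
  "primary tumor" = "primary colorectal cancer",
  "colorectal cancer metastatic in the liver" = "metastatic colorectal cancer to the liver"
)
renamed <- unname(new_labels[tissue])

# Replace anything that is not a letter, digit, underscore, or dot
safe <- gsub("[^A-Za-z0-9_.]", "_", renamed)
safe
```

A lookup vector is vectorized, so no lapply() or unlist() is needed, and an unknown label shows up as NA rather than as a silently dropped element.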

Creating a DESeqDataSet object (deseq2Data) is necessary to organize the count data and sample metadata together. DESeq2 requires data to be in this specific object format for differential expression analysis.

deseq2Data <- DESeqDataSetFromMatrix(countData = rawCounts, colData = sampleData, design = ~ individualID + tissueType)

Estimating size factors is a critical step in the normalization process for RNA-seq data. It helps to ensure that the count data is appropriately scaled, making it suitable for meaningful comparisons between samples. Size factors are used to normalize the count data, adjusting for differences in sequencing depth and read coverage, so that genes can be compared more accurately across samples.

dds_wt <- estimateSizeFactors(deseq2Data)
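
To make the idea of a size factor concrete, here is a minimal base R sketch of the median-of-ratios calculation on an invented 3 x 3 count matrix. It is a simplified illustration of the idea behind estimateSizeFactors(), not a replacement for it. In the toy data, sample S2 was "sequenced" twice as deeply as S1 and S3 three times as deeply.

```r
# Invented counts: each column is a sample, each row a gene
counts <- matrix(c(10, 20, 30,
                   20, 40, 60,
                   30, 60, 90),
                 nrow = 3,
                 dimnames = list(paste0("gene", 1:3), c("S1", "S2", "S3")))

# Pseudo-reference: log of the geometric mean of each gene across samples
log_geo_means <- rowMeans(log(counts))
keep <- is.finite(log_geo_means)   # drops genes with a zero count in any sample

# Size factor = median ratio of a sample's counts to the pseudo-reference
size_factors <- apply(counts[keep, , drop = FALSE], 2, function(col) {
  exp(median(log(col) - log_geo_means[keep]))
})

# Dividing each column by its size factor puts the samples on a common scale
normalized <- sweep(counts, 2, size_factors, "/")
size_factors
normalized
```

After normalization the three columns are identical, because the only difference between the toy samples was sequencing depth.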

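The Variance Stabilizing Transformation (VST) transforms the normalized counts so that the variance is roughly constant across the range of mean expression values, instead of growing with the mean as it does for raw counts. It is based on the dispersion-mean trend fitted under the negative binomial model, which is often used for count data. Setting blind = TRUE makes the transformation ignore the design formula, which suits unsupervised checks such as clustering.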

vsd_wt <- vst(dds_wt, blind = TRUE)

The Correlation Matrix for Hierarchical Clustering helps to visualize the relationships and similarities between the samples. Hierarchical clustering with correlation heatmaps allows researchers to identify potential clusters or patterns in the gene expression data, which can be valuable for understanding the underlying biological relationships and differences between the samples.

vsd_mat_wt <- assay(vsd_wt)
vsd_cor_wt <- cor(vsd_mat_wt)
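
What a heatmap function does with such a correlation matrix can be sketched in base R. The example below uses simulated data (two groups of three hypothetical samples, A1 to A3 and B1 to B3) instead of the real VST matrix, converts correlation into a distance, and clusters the samples. The choice of 1 - correlation as the distance and of average linkage is an assumption made for this illustration.

```r
set.seed(1)
n_genes <- 100

# Two simulated expression profiles; each sample is a noisy copy of one of them
base_a <- rnorm(n_genes, mean = 8)
base_b <- rnorm(n_genes, mean = 8)
noisy <- function(profile) profile + rnorm(length(profile), sd = 0.2)
expr <- cbind(A1 = noisy(base_a), A2 = noisy(base_a), A3 = noisy(base_a),
              B1 = noisy(base_b), B2 = noisy(base_b), B3 = noisy(base_b))

# Sample-to-sample correlation, as in cor(vsd_mat_wt)
sample_cor <- cor(expr)

# Turn similarity into distance and cluster the samples
sample_dist <- as.dist(1 - sample_cor)
tree <- hclust(sample_dist, method = "average")
groups <- cutree(tree, k = 2)
groups
```

Calling plot(tree) draws the dendrogram that a clustered heatmap shows along its margins.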

These steps collectively prepare the data for robust differential expression analysis. They ensure that the data is correctly organized, normalized, and transformed to enable reliable detection of differentially expressed genes and meaningful insights into the biological differences between the sample groups under study.

Console Snip

> # Put the columns of the count data in the same order as row names of the sample mapping, then make sure it worked
> rawCounts <- rawCounts[, unique(rownames(sampleData))]
> all(colnames(rawCounts) == rownames(sampleData))
[1] TRUE

Some Visualization

The x-axis tick values will be displayed in a non-scientific notation format, making the plot more readable and user-friendly. The y-axis represents the number of genes falling into each bin of the histogram.

# Histogram of raw counts for the first sample, with plain (non-scientific) x-axis labels
ggplot(data.frame(wt_normal1 = rawCounts[, 1])) +
  geom_histogram(aes(x = wt_normal1), stat = "bin", bins = 200) +
  xlab("Raw expression counts") +
  ylab("Number of genes") +
  scale_x_continuous(labels = function(x) format(x, scientific = FALSE))

Plot the heatmap using pheatmap

# Prepare data for pheatmap
data_for_heatmap <- as.matrix(vsd_cor_wt)

# Convert tissueType to a character vector
annotation_row <- as.character(sampleData$tissueType)

# Pad each label with spaces on both sides
annotation_row_with_spaces <- paste(" ", annotation_row, " ")

# Plot the heatmap using pheatmap with manual row labels
# (row_names_side and annotation_colors = "black" were dropped: the first is not
#  a pheatmap argument, and annotation_colors expects a list)
pheatmap(data_for_heatmap,
         cluster_rows = TRUE,
         cluster_cols = TRUE,
         color = colorRampPalette(c("blue", "white", "red"))(50),
         show_rownames = FALSE,   # set to TRUE to display labels_row
         show_colnames = TRUE,
         annotation_names_row = FALSE,
         labels_row = annotation_row_with_spaces,
         fontsize_row = 8,     # Adjust the font size of row labels
         fontsize_col = 12,    # Adjust the font size of column labels
         angle_col = 45)       # Set the angle of column labels to 45 degrees

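Finally, the code below builds two summary scores per gene from the VST-transformed values, which can be used for data exploration. Note that this is not a true PCA: it relabels four sample columns (columns 2 to 5) and combines them with hand-picked weights, so the resulting PC1 and PC2 columns are illustrative weighted sums rather than principal components. A genuine PCA of the samples can be obtained with plotPCA(vsd_wt, intgroup = "tissueType") from DESeq2, or with prcomp() on the transposed matrix.

# Compute illustrative weighted scores (hand-picked weights, not a real PCA)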
sample_scores <- as.data.frame(assay(vsd_wt))
sample_scores$Sample <- rownames(sample_scores)
column_names <- colnames(vsd_wt)

colnames(sample_scores)[2:5] <- c("normal", "fibrosis", "tumor", "metastasis")
sample_scores$PC1 <- sample_scores$normal * -2 + sample_scores$fibrosis * -10 + sample_scores$tumor * 8 + sample_scores$metastasis * 1
sample_scores$PC2 <- sample_scores$normal * 0.5 + sample_scores$fibrosis * 1 + sample_scores$tumor * -5 + sample_scores$metastasis * 6

# Print the PCA scores
print(sample_scores)

Console Snip

> # Print the PCA scores
> print(sample_scores)
                SRR975551    normal  fibrosis     tumor metastasis SRR975556
ENSG00000000003 11.923929  9.968500 10.588833 11.791039  10.058138 10.737563
ENSG00000000005  5.625681  2.799376  4.866074  5.031436   4.065530  3.717037
ENSG00000000419 10.686895  9.059366  9.056915 10.226302  10.243933  8.802204
ENSG00000000457  8.199791  8.879902  8.706264  8.013961   8.158745  8.654508
ENSG00000000460  8.141111  8.024470  6.392902  7.942364   7.214920  5.820591
ENSG00000000938  5.725709  6.301239  7.421013  6.916156   6.736584  6.794418
ENSG00000000971  8.906142  8.952251 10.967832  8.930424   8.158745 10.365427
ENSG00000001036 10.656848 10.452489 10.439085 11.081124  10.571350 10.396267
ENSG00000001084 10.073191 11.118101  9.535102 10.560985   9.788243  9.247774
ENSG00000001167  9.748116 10.326186  9.100712 10.235065   9.469240  9.013992
ENSG00000001460  6.899011  7.258717  8.048669  6.852399   7.791251  7.558024
ENSG00000001461  9.527680  9.811171  9.584137  9.060077  10.353852  9.868912
ENSG00000001497 10.308639  9.804080  9.171732 10.270811   9.505511  9.161493
ENSG00000001561  8.509119  9.179311  9.338776  8.707479   9.155344  8.998760
ENSG00000001617  7.001645  9.451539  8.135472  8.244756   9.540893  8.610090
ENSG00000001626 10.998738 10.733206 11.781040 10.338590  12.181192 12.134195
ENSG00000001629 10.228487 10.303738  9.847778  9.725462  10.016518  9.868912

                     Sample        PC1       PC2
ENSG00000000003 ENSG00000000003 -21.438886 16.966720
ENSG00000000005 ENSG00000000005  -9.942478  5.501761
ENSG00000000419 ENSG00000000419 -16.633528 23.918686
ENSG00000000457 ENSG00000000457 -32.552004 22.028879
ENSG00000000460 ENSG00000000460  -9.224133 13.982838
ENSG00000000938 ENSG00000000938 -24.746778 16.410357
ENSG00000000971 ENSG00000000971 -47.980679 19.744307
ENSG00000001036 ENSG00000001036 -26.075487 23.687810
ENSG00000001084 ENSG00000001084 -23.311101 21.018685
ENSG00000001167 ENSG00000001167 -20.309732 19.903922
ENSG00000001460 ENSG00000001460 -32.393675 24.163534
ENSG00000001461 ENSG00000001461 -32.629246 31.312451
ENSG00000001497 ENSG00000001497 -19.653479 19.752784
ENSG00000001561 ENSG00000001561 -32.931205 25.323103
ENSG00000001617 ENSG00000001617 -24.758855 28.882818
ENSG00000001626 ENSG00000001626 -44.386905 38.541846
ENSG00000001629 ENSG00000001629 -31.265045 26.471447
 [ reached 'max' / getOption("max.print") -- omitted 65200 rows ]
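
For comparison, a principal component analysis of samples is usually computed on the transposed matrix, so that each sample gets one score per component. The sketch below uses base R's prcomp() on simulated data (six hypothetical samples, N1 to N3 and T1 to T3, with 50 genes shifted in the second group). With the real data, the input would be assay(vsd_wt).

```r
set.seed(42)
n_genes <- 200

# Simulated transformed expression: genes in rows, samples in columns
expr <- matrix(rnorm(n_genes * 6, mean = 8), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               c("N1", "N2", "N3", "T1", "T2", "T3")))
expr[1:50, 4:6] <- expr[1:50, 4:6] + 4   # 50 genes are higher in the "T" samples

# prcomp() expects observations in rows, so transpose to samples x genes
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# One row per sample, with its coordinates on the first two components
scores <- as.data.frame(pca$x[, 1:2])
scores$group <- rep(c("normal", "tumor"), each = 3)

# Share of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
scores
round(var_explained[1:2], 2)
```

Here the score table has six rows (one per sample) rather than one row per gene, which is what makes it suitable for a PC1 versus PC2 scatter plot of sample relationships.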

In conclusion, this article demonstrates the use of DESeq2 and R packages like ggplot2 and pheatmap to analyze gene expression data. It covers data preprocessing, unsupervised clustering, heatmap visualization, and principal component analysis (PCA). By providing detailed code snippets and explanations, it enables readers to perform similar analyses and gain valuable insights into biological processes.

Intro: Analyzing RNA-seq data with DESeq2

MennahTullah Mabrouk — Sun, 25 Jun 2023 10:32:03 +0000

“Other Bioconductor packages with similar aims are edgeR, limma, DSS, EBSeq, and baySeq.”

DESeq2

DESeq2 helps identify differentially expressed genes (DEGs) between experimental conditions. The package uses a negative binomial distribution model to account for the inherent variability in RNA-seq count data. DESeq2 stores its data in a specific object class called DESeqDataSet. This class extends another class called RangedSummarizedExperiment, which allows association of the count data with genomic ranges.

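Install

DESeq2 is distributed through Bioconductor rather than CRAN, so it is installed with BiocManager:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")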

Start

library(DESeq2)

DESeq2 expects un-normalized count data as input. The values in the count matrix represent the number of reads or fragments that can be assigned to a specific gene in a particular sample.

RNA-Seq questions

What genes are differentially expressed between sample groups?
Are there any specific genes that are significantly upregulated or downregulated between the sample groups?
Can you identify the top differentially expressed genes based on fold change or statistical significance?
Are there any specific gene expression patterns that emerge over time or across different conditions?
Can you identify clusters or groups of genes that exhibit similar expression patterns over time or across conditions?
Which biological processes or molecular pathways are significantly enriched among the differentially expressed genes?
Can you provide a comprehensive summary or visualization of the gene expression data, highlighting the most relevant genes and pathways related to the condition of interest?

RNA Workflow

Biological Samples: The experiment begins with the collection and preparation of biological samples, such as tissues or cells, from the organism of interest. (Lab)
Library Preparation: RNA molecules are extracted from the samples and converted into cDNA libraries suitable for sequencing.

Sequence Read: The prepared libraries are subjected to high-throughput sequencing, where the DNA fragments are sequenced to generate short reads or longer reads. This step results in a vast amount of raw sequencing data.

Quality Control: The raw sequencing data undergoes quality control to assess the read quality. This step involves checking for sequence errors, adapter contamination, and other artifacts. Low-quality or problematic reads are filtered or trimmed to improve the accuracy of downstream analyses.

Splice-Aware Mapping to the Genome: The high-quality reads are aligned or mapped to a reference genome using splice-aware alignment algorithms.

Counting Reads Associated with the Genome: Once the reads are aligned, the next step is to count the number of reads associated with each genomic feature, such as genes or exons. This step quantifies the expression levels of genes in terms of read counts.

Statistical Analysis for Differential Expression: The read counts are then used for statistical analysis to identify differentially expressed genes between different conditions or sample groups.

To Start

Data preprocessing: Before using DESeq2, it is important to perform some preprocessing steps on the raw RNA-seq data. These steps may include quality control, read alignment, and read counting. It’s worth mentioning that DESeq2 expects raw count data as input, but often, additional preprocessing steps are needed to obtain the count matrix.

Normalization: DESeq2 performs its own normalization internally using “size factors”, estimated with the median-of-ratios method. Each sample’s counts are compared gene by gene with a pseudo-reference (the geometric mean across samples), and the median ratio becomes that sample’s size factor. Dividing by it corrects for differences in library size between samples, which is why the input must be raw rather than pre-normalized counts.

Experimental design: DESeq2 requires information about the experimental design, such as the experimental conditions and the replicates for each condition. This information is passed as a design formula and is used to fit the statistical model, so the design should be planned carefully, with enough biological replicates per condition, before sequencing.

Statistical analysis: DESeq2 models the counts for each gene with a negative binomial distribution, estimates a dispersion parameter for each gene, and then tests the null hypothesis that the log fold change between conditions is zero (by default with a Wald test).

Interpretation of results: After running DESeq2, it is essential to interpret the results correctly. This includes understanding the meaning of different statistical metrics, such as log-fold change and adjusted p-values, and their significance in identifying significant gene expression changes.
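
The two metrics mentioned above can be illustrated with a few lines of base R. This is a simplified sketch on invented numbers (five hypothetical genes with made-up group means and raw p-values). DESeq2 estimates fold changes from its fitted model rather than from a plain ratio of means, but its adjusted p-values use the same Benjamini-Hochberg procedure by default.

```r
# Hypothetical mean normalized counts and raw p-values for five genes
res <- data.frame(
  gene = paste0("gene", 1:5),
  mean_control = c(100, 50, 200, 10, 80),
  mean_treated = c(400, 50, 50, 40, 85),
  pvalue = c(0.001, 0.80, 0.004, 0.04, 0.60)
)

# log2 fold change: +1 means doubled, -1 means halved, 0 means unchanged
res$log2FoldChange <- log2(res$mean_treated / res$mean_control)

# Benjamini-Hochberg adjustment controls the false discovery rate
res$padj <- p.adjust(res$pvalue, method = "BH")

# A common (but arbitrary) filter: padj below 0.05 and at least a 2-fold change
significant <- subset(res, padj < 0.05 & abs(log2FoldChange) >= 1)
significant
```

Note that gene4 has a 4-fold change and a raw p-value below 0.05, yet it does not pass the filter once the p-values are adjusted for multiple testing.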

To Be Continued

Dataset : RNA-seq of human multiple myeloma patients myeloid-derived suppressor cells (M-MDSC)

# Load the libraries
library(DESeq2)
library(ggplot2)

# Read in the raw read counts
rawCounts <- read.delim("F:\\E-MTAB-9767-raw-counts.tsv")

# Read in the sample mappings
sampleData <- read.delim("F:\\E-MTAB-9767-experiment-design.tsv")

# Also save a copy for later
sampleData_v2 <- sampleData

# Plot the histogram of raw expression counts
ggplot(rawCounts) +
  geom_histogram(aes(x = ERR4843201), stat = "bin", bins = 200) +
  xlab("Raw expression counts") +
  ylab("Number of genes")

Multiple myeloma is a cancer that affects plasma cells, which are a type of white blood cell responsible for producing antibodies. Common symptoms include bone pain, fatigue, recurrent infections, anemia, kidney problems, and weakened bones leading to fractures.

Also to Read

What is Bioconductor in R ?
https://dev.to/mennahtullahmabrouk/what-is-bioconductor-in-r--4501

What is Bioconductor in R ?

MennahTullah Mabrouk — Sun, 18 Jun 2023 14:24:49 +0000

Bioconductor is an open-source and open-development software project. It provides tools, packages, and resources for the analysis and comprehension of genomic data.
It focuses on the statistical analysis and interpretation of high-throughput biological data.
Its packages cover preprocessing, quality control, normalization, differential expression analysis, pathway analysis, genomic annotation, visualization, and machine learning.

Bioconductor promotes collaboration and community contribution, with researchers and developers actively participating in the development and maintenance of packages.

It emphasizes reproducible research by providing a platform for sharing and distributing analysis workflows, datasets, and methods.
It incorporates important biological metadata and supports scalable software development.
The project facilitates the exploration and interpretation of complex genomic datasets, enabling researchers to extract meaningful insights from their data.

How to Install Bioconductor ?

1) Install R, the programming language used for Bioconductor
2) To install core packages, type the following in an R command window

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()

3) Check Bioconductor version

BiocManager::version()

4) Use BiocManager::install() to install specific packages, e.g.

BiocManager::install("limma")
BiocManager::install(c("GenomicFeatures", "AnnotationDbi"))

5) Load Libraries using library()

library(limma)
library(GenomicFeatures)
library(AnnotationDbi)

6) Display Information about the current R session

sessionInfo()

When performing data analysis, it is important to document the versions of the software and packages used to ensure that the analysis can be reproduced in the future. By including sessionInfo() in your code, you can easily retrieve information about the versions of R and packages used at the time of analysis.

7) Check for package updates

BiocManager::valid()

Bioconductor VS Bioperl

Programming Language:

Bioconductor is primarily based on the R programming language. It provides a collection of R packages specifically designed for the analysis and comprehension of genomic data.
Bioperl, on the other hand, is written in Perl, a general-purpose scripting language. It offers a comprehensive set of Perl modules for bioinformatics tasks. (Decreased in Usage)

Scope and Focus:

Bioconductor is focused on the analysis of high-throughput genomic data, such as gene expression, DNA sequencing, and microarray data. It provides a wide range of packages for statistical analysis, visualization, and annotation of genomic data.
Bioperl covers a broader range of bioinformatics tasks, including sequence analysis, molecular biology, and computational biology. It provides modules for parsing, manipulating, and analyzing biological sequence data, as well as tools for database access and integration.

Ease of Use:

Bioconductor is known for its user-friendly interface and extensive documentation, making it accessible to both bioinformaticians and biologists with limited programming experience. The packages are designed to work well together, allowing users to easily combine multiple analyses.
Bioperl is a powerful toolkit but can be more challenging for beginners due to its Perl programming language syntax. It requires a certain level of programming proficiency to effectively utilize and extend the toolkit. (Can be Harder)

Integration with Other Tools:

Bioconductor is tightly integrated with R and leverages its extensive ecosystem of statistical and data manipulation packages. It also integrates well with other bioinformatics tools and resources, such as the NCBI databases and popular genome browsers.
Bioperl provides interfaces to various external tools and databases, allowing users to seamlessly interact with external resources. It has built-in support for common file formats and can be easily integrated into bioinformatics pipelines.

Major Bioconductor Packages

DESeq2: A tool for RNA-Seq data analysis, DESeq2 uses a negative binomial model to account for variability and identifies significant expression changes between conditions.
limma: A package for microarray data analysis, limma employs linear modeling and empirical Bayes methods to detect genes with significant expression differences between groups.
ShortRead: Designed for short-read sequencing data, ShortRead offers functions for reading FASTQ files, quality assessment, and filtering and trimming reads within the Bioconductor framework. Alignment, read counting, and variant calling are handled by other packages.
edgeR: Another package for RNA-Seq analysis, edgeR utilizes a negative binomial distribution approach for differential gene expression analysis, including normalization, dispersion estimation, and identification of differentially expressed genes.
GenomicRanges: This package efficiently manipulates, annotates, and analyzes genomic intervals, providing operations like overlap detection, subsetting, merging, and visualization of genomic regions.
GenomicFeatures: Designed for genomic annotation data, GenomicFeatures facilitates the extraction, manipulation, and visualization of genomic features such as genes, transcripts, exons, and promoters.

Example Using Zika Virus Dataset

Zika_Virus_Dataset_from_Datacamp

This simple example uses the Biostrings package, which is part of the Bioconductor project.

# Load the Biostrings package (install it first with BiocManager::install("Biostrings"))

library(Biostrings)

# Provide the path to the file
file_path <- "F:\\zika.txt"

# Read the file
zika_sequence <- readDNAStringSet(file_path)

# Check the length of the sequence
sequence_length <- width(zika_sequence)
cat("Sequence Length:", sequence_length, "\n")

#--- output --- : Sequence Length: 10794

# Retrieve the first 50 characters of the sequence (if available)
# Assumes the file holds a single sequence, so sequence_length is one number
if (sequence_length >= 50) {
  # substr() cuts characters; [1:50] would index sequences, not characters
  first_50_chars <- substr(as.character(zika_sequence), 1, 50)
  cat("First 50 Characters:", first_50_chars, "\n")
} else if (sequence_length > 0) {
  first_50_chars <- as.character(zika_sequence)
  cat("First", sequence_length, "Characters:", first_50_chars, "\n")
} else {
  cat("No sequence data available.\n")
}

#--- output --- : First 50 Characters:AGTTGTTGATCTGTGTGAGTCAGACTGCGACA----

# Count the number of occurrences of a specific subsequence
subsequence <- DNAString("AGTT")
subsequence_count <- vcountPattern(subsequence, zika_sequence)
cat("Subsequence Count:", subsequence_count, "\n")

#--- output --- : Subsequence Count: 34

# DNA single string
dna_seq <- DNAString("ATGATCTCGTAA")
print("DNA sequence:")
print(dna_seq)

#--- output --- :
# DNA sequence:
# 12-letter DNAString object
# seq: ATGATCTCGTAA

# Transcription DNA to RNA string
rna_seq <- RNAString(dna_seq)
print("RNA sequence:")
print(rna_seq)

#--- output --- :
# RNA sequence:
# 12-letter RNAString object
# seq: AUGAUCUCGUAA

# Translation RNA to amino acids
print("Translation RNA to amino acids:")
aa_seq <- translate(rna_seq)
print(aa_seq)

#--- output --- :
# Translation RNA to amino acids:
# 4-letter AAString object
# seq: MIS*

# Shortcut translate DNA to amino acids
print("Shortcut translate DNA to amino acids:")
aa_seq_shortcut <- translate(dna_seq)
print(aa_seq_shortcut)

#--- output --- :
# Shortcut translate DNA to amino acids:
# 4-letter AAString object
# seq: MIS*

# Read the dataset from the file
dataset <- readLines(file_path)
# Combine the lines into a single string
dataset <- paste(dataset, collapse = "")
# Define the pattern
pattern <- "GGG"

# Calculate the frequency of the pattern within the dataset
pattern_count <- sum(gregexpr(pattern, dataset, fixed = TRUE)[[1]] > 0)
# Print the pattern count
print(pattern_count)

#--- output --- : 171
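
One detail worth knowing about the count above: gregexpr() reports non-overlapping matches, so a run such as "GGGG" counts as one "GGG", not two. The sketch below compares both ways of counting on a short made-up string. The helper names count_fixed and count_overlapping are hypothetical, and the lookahead version assumes the pattern contains no regex metacharacters.

```r
# Non-overlapping count, as in the code above
count_fixed <- function(pattern, text) {
  hits <- gregexpr(pattern, text, fixed = TRUE)[[1]]
  sum(hits > 0)   # gregexpr returns -1 when nothing matches
}

# Overlapping count: a zero-width lookahead matches at every start position
count_overlapping <- function(pattern, text) {
  hits <- gregexpr(paste0("(?=", pattern, ")"), text, perl = TRUE)[[1]]
  sum(hits > 0)
}

toy <- "AGGGGTGGGA"
count_fixed("GGG", toy)        # 2
count_overlapping("GGG", toy)  # 3
```

Which count is appropriate depends on the question being asked, so it helps to state explicitly which one a reported number refers to.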

In conclusion, Bioconductor is a powerful and widely used software project in R that provides a comprehensive collection of packages and resources for analyzing genomic data. It offers a range of tools and algorithms for tasks such as quality control, preprocessing, differential expression analysis, pathway analysis, and visualization. Bioconductor stands out for its extensive package ecosystem, with specialized functionality covering various areas of genomics.