Author: Gabriele Sotto

Supervisors: Stefano Monti & Riccardo Bellazzi

Introduction

In recent years there has been a great increase in the amount of genomic data, due to advances in various high-throughput biotechnologies such as RNA gene expression microarrays and NGS technologies. These large genomic data sets are information-rich: such an enormous data volume enables new types of analyses, but also makes it difficult to answer research questions using traditional methods. One of the challenges in this field is the high-dimensional nature of biological data. In genomic data analysis, many gene targets are investigated simultaneously, yielding dramatically sparse data points in the corresponding high-dimensional data space. It is well known that mathematical and computational approaches often fail to capture such high-dimensional phenomena accurately. We have a “small n and large p” problem: “n”, the number of independent observations and subjects, is much smaller than “p”, the number of candidate prediction parameters and targets. These data are also noisy, because biological information and signals of interest are often observed together with many other random or confounding factors. For these reasons it is important to apply a variety of methods to explore the data and to perform feature selection in order to decrease the number of prediction parameters.

In this work we analyse two datasets: a lymphoma dataset from Affymetrix gene expression microarrays, as described in Lenz et al., NEJM 2008; and the TCGA Head and Neck Squamous Carcinoma (HNSC) dataset of RNA-seq expression profiles partially described in Lawrence et al., Nature 2015.

1. Analysis of a Diffuse Large B-Cell Lymphoma (DLBCL) dataset

These data were obtained from 414 patients with newly diagnosed diffuse large-B-cell lymphoma who were treated at 10 institutions in North America and Europe. Although diffuse large-B-cell lymphoma is curable with anthracycline-based chemotherapy regimens such as a combination of cyclophosphamide, doxorubicin, vincristine, and prednisone (CHOP), the addition of rituximab immunotherapy (R-CHOP) has improved overall survival among patients with diffuse large-B-cell lymphoma by 10 to 15%. Diffuse large-B-cell lymphoma is a molecularly heterogeneous disease, and it is unclear whether rituximab preferentially improves the outcome in certain subgroups of patients. Among these patients, a CHOP training group consisted of 181 patients, previously described, who were treated with anthracycline-based combinations, most often CHOP or similar regimens. The other 233 patients constituted an R-CHOP cohort that received similar chemotherapy plus rituximab.

Cell-of-Origin (COO) categorization defines two classes, Germinal Center B-Cell (GCB) and Activated B-Cell (ABC), and we want to build a classifier able to predict the class of a sample starting from the expression matrix.

We start with installing the required packages:

#source("http://bioconductor.org/biocLite.R")
#biocLite(c("GEOquery", "frma", "ArrayExpress", "biomaRt"))
#install.packages(c("pROC","caret"),repos="http://cran.r-project.org")

library(GEOquery)
library(ArrayExpress)
library(frma)
library(biomaRt)
library(reshape2)
require(Biobase)
require(CBMRtools)
require(caret)
require(limma)
library(VennDiagram)
library(rgl)
require(affy)

PATH <- "~/Desktop/Advanced_data_mining/Lezioni_Monti/data/"

After this step we load the dataset and RMA-process the data. At this early stage it is important to annotate the probesets (the rows of the dataset) with gene symbols. We used biomaRt for the annotation, and then merged the multiple probesets mapping to the same gene with the function “collapseByMedian”.

DATA<-readRDS(paste(PATH,"lymphoma_lenz.frma.ensg.RDS",sep=""))
DATA
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 19081 features, 414 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM274895 GSM274896 ... GSM275308 (414 total)
##   varLabels: Sample Other ... PR_vs_Cure (15 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: AFFX-BioB-3 AFFX-BioB-5 ... ENSG00000259758 (19081
##     total)
##   fvarLabels: ensembl_gene_id entrezgene ... description (9 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: hgu133plus2hsensg
#collapse eset by specified rowid (column name in fData)
#if duplicate rows appear, use the median among rows for each column

collapseByMedian<-function(eset, rowid){
    #library(reshape2)

    #remove unmapped probe sets
    genes<-fData(eset)[, rowid]
    rows.mapped<-!is.na(genes) & genes != ""
    eset<-eset[rows.mapped,]
    genes<-fData(eset)[, rowid]

    #collapse by median value among duplicate probes
    df<-data.frame(exprs(eset), genes = genes)
    df.melt<-melt(df, id.vars = "genes")
    df.median.collapsed<-dcast(df.melt, genes ~ variable, median, fill=NaN)

    #reassemble collapsed eset
    fdat<-fData(eset)
    fdat.collapsed<-fdat[!duplicated(fdat[, rowid]), ] #keep first rows of duplicated fData entries
    row.order<-match(fdat.collapsed[, rowid], df.median.collapsed$genes)
    df.median.collapsed <-df.median.collapsed[row.order,]
    exprs(eset)<-as.matrix(df.median.collapsed[, colnames(eset)])
    fData(eset)<-fdat.collapsed
    return(eset)
}

setwd("~/Desktop/Advanced_data_mining/Lezioni_Monti/data/GSE10846_RAW")
data <- justRMA() 

mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

mapping<-getBM(attributes=c("affy_hg_u133_plus_2","hgnc_symbol","description","ensembl_gene_id"),
               filters="affy_hg_u133_plus_2", 
               values=featureNames(data), 
               mart=mart)
head(mapping)
##   affy_hg_u133_plus_2 hgnc_symbol
## 1        1553551_s_at      MT-ND1
## 2        1553551_s_at       MT-TI
## 3        1553551_s_at       MT-TM
## 4        1553551_s_at      MT-ND2
## 5          1553569_at      MT-CO1
## 6        1553538_s_at      MT-CO1
##                                                                                                description
## 1 mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 1 [Source:HGNC Symbol;Acc:HGNC:7455]
## 2                               mitochondrially encoded tRNA isoleucine [Source:HGNC Symbol;Acc:HGNC:7488]
## 3                               mitochondrially encoded tRNA methionine [Source:HGNC Symbol;Acc:HGNC:7492]
## 4 mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 2 [Source:HGNC Symbol;Acc:HGNC:7456]
## 5                        mitochondrially encoded cytochrome c oxidase I [Source:HGNC Symbol;Acc:HGNC:7419]
## 6                        mitochondrially encoded cytochrome c oxidase I [Source:HGNC Symbol;Acc:HGNC:7419]
##   ensembl_gene_id
## 1 ENSG00000198888
## 2 ENSG00000210100
## 3 ENSG00000210112
## 4 ENSG00000198763
## 5 ENSG00000198804
## 6 ENSG00000198804
rows.match<-match(featureNames(data), mapping$affy_hg_u133_plus_2)
fdat.new<-cbind(fData(data), mapping[rows.match,])
fData(data)<-fdat.new

#collapsing duplicate gene symbols by symbol
data.collapsed<-collapseByMedian(data, "hgnc_symbol")
fData(DATA)<-fData(data.collapsed)
exprs(DATA)<-exprs(data.collapsed)

After this step, before starting the analysis, we modify the dataset by removing the samples that are not classified:

ind <- which(DATA$COO != "Unclassified")

# exclude "unclassified" samples
DATA <- DATA[,ind]
DATA <- DATA[,DATA$COO %in% c("ABC","GCB")]
DATA$COO <- factor(DATA$COO)     # drop the unused "Unclassified" level

saveRDS(DATA, paste(PATH, "/", "GSE10846_lymphoma_lenz.RDS", sep = ""))

#DATA<-readRDS(paste(PATH,"GSE10846_lymphoma_lenz.RDS",sep=""))

The structure of the dataset can be summarized in the following picture:

Now we can start with the first analysis. The R-CHOP cohort will be used as the discovery set, and the CHOP cohort as the validation cohort.
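A quick sanity check of the two cohort sizes can be done with a one-liner; a minimal sketch based on the Treatment_Regimen column used throughout the analyses below:

## check cohort sizes before splitting into discovery/validation
table(DATA$Treatment_Regimen)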

1.1 Principal Component Analysis (PCA)

We start by performing a Principal Component Analysis (PCA) and plotting the data onto the first two PCs, and onto the 1st and 3rd PCs, coloring the data points by treatment, by gender and by COO.

require(rgl)
require(ggplot2)

# run PCA
data.pca <- prcomp(t(exprs(DATA)))  
data.pca1.2 <- data.pca$x[,1:2] # take the first 2 components
data.pca1.3 <- data.pca$x[,1:3] # take the first component and the third

(summary(data.pca)$importance)[,1:50]
##                             PC1      PC2      PC3      PC4      PC5
## Standard deviation     37.10523 22.21172 17.70494 15.44411 14.36297
## Proportion of Variance  0.22137  0.07933  0.05040  0.03835  0.03317
## Cumulative Proportion   0.22137  0.30070  0.35110  0.38945  0.42262
##                             PC6      PC7      PC8      PC9     PC10
## Standard deviation     11.99174 11.04657 9.933379 9.282565 8.749251
## Proportion of Variance  0.02312  0.01962 0.015870 0.013850 0.012310
## Cumulative Proportion   0.44574  0.46536 0.481230 0.495080 0.507390
##                            PC11    PC12    PC13    PC14     PC15     PC16
## Standard deviation     8.404919 8.04952 7.91658 7.62578 7.076607 6.965413
## Proportion of Variance 0.011360 0.01042 0.01008 0.00935 0.008050 0.007800
## Cumulative Proportion  0.518750 0.52917 0.53925 0.54860 0.556650 0.564450
##                            PC17     PC18     PC19     PC20     PC21
## Standard deviation     6.841613 6.332118 6.252187 6.159741 5.992491
## Proportion of Variance 0.007530 0.006450 0.006290 0.006100 0.005770
## Cumulative Proportion  0.571970 0.578420 0.584710 0.590810 0.596580
##                            PC22     PC23     PC24     PC25     PC26
## Standard deviation     5.854245 5.760582 5.702694 5.587967 5.533095
## Proportion of Variance 0.005510 0.005340 0.005230 0.005020 0.004920
## Cumulative Proportion  0.602090 0.607430 0.612660 0.617680 0.622600
##                           PC27     PC28     PC29     PC30     PC31
## Standard deviation     5.50653 5.287855 5.200637 5.095048 5.087345
## Proportion of Variance 0.00488 0.004500 0.004350 0.004170 0.004160
## Cumulative Proportion  0.62748 0.631970 0.636320 0.640490 0.644660
##                            PC32     PC33     PC34     PC35     PC36
## Standard deviation     5.061748 4.916047 4.823742 4.721467 4.706865
## Proportion of Variance 0.004120 0.003890 0.003740 0.003580 0.003560
## Cumulative Proportion  0.648770 0.652660 0.656400 0.659990 0.663550
##                            PC37     PC38     PC39     PC40     PC41
## Standard deviation     4.657087 4.617239 4.532037 4.490648 4.431887
## Proportion of Variance 0.003490 0.003430 0.003300 0.003240 0.003160
## Cumulative Proportion  0.667040 0.670460 0.673770 0.677010 0.680170
##                            PC42     PC43     PC44     PC45     PC46
## Standard deviation     4.398638 4.349249 4.336676 4.307485 4.251713
## Proportion of Variance 0.003110 0.003040 0.003020 0.002980 0.002910
## Cumulative Proportion  0.683280 0.686320 0.689340 0.692330 0.695230
##                            PC47     PC48     PC49     PC50
## Standard deviation     4.219854 4.194948 4.135302 4.096967
## Proportion of Variance 0.002860 0.002830 0.002750 0.002700
## Cumulative Proportion  0.698100 0.700930 0.703680 0.706370
## extract label
col.palette <- c("green","red","gray")

# GENDER
colors <- col.palette[as.numeric(as.factor(DATA$Gender))]
## basic 2D plot
plot(x=data.pca1.2[,1],y=data.pca1.2[,2],pch=20,xlab="1st PCA",ylab="2nd PCA",col=colors) # plot the first 2 components
legend("topleft",pch=20,legend=levels(as.factor(DATA$Gender)), col=col.palette)

plot(x=data.pca1.3[,1],y=data.pca1.3[,3],pch=20,xlab="1st PCA",ylab="3rd PCA",col=colors) # plot the first component and the third
legend("topleft",pch=20,legend=levels(as.factor(DATA$Gender)), col=col.palette)

#TREATMENT
colors <- col.palette[as.numeric(as.factor(DATA$Treatment_Regimen))]
## basic 2D plot
plot(x=data.pca1.2[,1],y=data.pca1.2[,2],pch=20,xlab="1st PCA",ylab="2nd PCA",col=colors)
legend("topleft",pch=20,legend=levels(as.factor(DATA$Treatment_Regimen)), col=col.palette)

plot(x=data.pca1.3[,1],y=data.pca1.3[,3],pch=20,xlab="1st PCA",ylab="3rd PCA",col=colors)
legend("topleft",pch=20,legend=levels(as.factor(DATA$Treatment_Regimen)), col=col.palette)

#COO
colors <- col.palette[as.numeric(as.factor(DATA$COO))]
## basic 2D plot
plot(x=data.pca1.2[,1],y=data.pca1.2[,2],pch=20,xlab="1st PCA",ylab="2nd PCA",col=colors)
legend("topleft",pch=20,legend=levels(as.factor(DATA$COO)), col=col.palette)

plot(x=data.pca1.3[,1],y=data.pca1.3[,3],pch=20,xlab="1st PCA",ylab="3rd PCA",col=colors)
legend("topleft",pch=20,legend=levels(as.factor(DATA$COO)), col=col.palette)

As we can see by plotting the first three principal components, the data separate naturally when colored by treatment. However, the spatial distributions of these two populations are very similar, and it is possible that there is only a scale factor between these clusters. In the following analyses we treat the two populations subjected to different treatments as two different datasets, one for training and one for the validation step. For the other features we note that no natural clusters stand out at first sight, except for the COO class: if we consider the first and the third PCs, we can easily identify two clusters that match the class labels. On the basis of these results, it seems plausible that a classifier can be built based on the genes’ expression in the different samples.
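One quick way to probe the scale-factor hypothesis is to compare per-sample summary statistics between the two cohorts; a minimal sketch, under the assumption that a global scale/offset effect would show up as a shift in per-sample means and standard deviations:

## compare per-sample location and spread across the two treatment cohorts;
## a systematic shift with a similar shape is consistent with a scale factor
boxplot(colMeans(exprs(DATA)) ~ DATA$Treatment_Regimen,
        xlab="treatment", ylab="per-sample mean expression")
boxplot(apply(exprs(DATA), 2, sd) ~ DATA$Treatment_Regimen,
        xlab="treatment", ylab="per-sample SD of expression")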

1.2 Hierarchical clustering

In this section we report the results obtained by performing hierarchical clustering of both genes and samples. In particular, we generated a corresponding heatmap of all the data, with sample color-coding indicating COO (GCB, ABC), gender and treatment (CHOP vs. R-CHOP). To do this, we applied a drastic variation filter to reduce the number of genes: this clustering is mostly for exploratory purposes and to visually inspect the data. The heatmaps are built using two different packages.

require(heatmap.plus)

## Variation filtering: reducing the expression dataset to the 3000 genes with the highest MAD

DATAF <- variationFilter(DATA,ngenes=3000,do.plot=TRUE)
## Variation filtering based on mad .. done.
## Selecting top 3000 by mad .. done, 3000 genes selected.
## Creating scatter plot ..

## done.
## first define a simple function to create a color gradient

colGradient <- function( cols, length, cmax=255 )
{
  ramp <- colorRamp(cols)
  rgb( ramp(seq(0,1,length=length)), max=cmax )
}

## color coding of the samples

CSC <- cbind(COO=c("green","pink")[match(DATAF$COO,c("GCB","ABC"))],
             GEND=c("red","blue","black")[match(DATAF$Gender,c("female","male","N/A"))],
             TREAT=c("magenta","lightblue")[match(DATAF$Treatment_Regimen,c("CHOP-Like","R-CHOP-Like"))])


## color gradient for the expression levels (blue=down-regulated; white=neutral; red=up-regulated)
bwrPalette <- colGradient(c("blue","white","red"),length=13)

## cluster rows (genes) and columns (samples)
hc.row <- hclust(as.dist(1-cor(t(exprs(DATAF)))),method="ward.D2") # genes by correlation
hc.col <- hclust(dist(t(exprs(DATAF))),method="ward.D2")           # samples by euclidean distance (default)

## draw the heatmap
heatmap.plus(exprs(DATAF),Rowv=as.dendrogram(hc.row),Colv=as.dendrogram(hc.col),col=bwrPalette,ColSideColors=CSC,labCol=NA,labRow=NA)

## since this is a step always worth performing, let's define a simple function that implements the necessary steps

hcopt <- function(d, HC=NULL, method = "ward.D", members = NULL)
{
  if ( is.null(HC) ) { 
    HC <- hclust(d,method=method,members=members)
  }
  ORD <- order.optimal(d,merge=HC$merge)
  HC$merge <- ORD$merge
  HC$order <- ORD$order
  HC
}

hc.row <- hcopt(as.dist(1-cor(t(exprs(DATAF)))),method="ward.D2") # genes by correlation
hc.col <- hcopt(dist(t(exprs(DATAF))),method="ward.D2")            # samples by euclidean distance (default)
heatmap.plus(exprs(DATAF),Rowv=as.dendrogram(hc.row),Colv=as.dendrogram(hc.col),col=bwrPalette,ColSideColors=CSC,labCol=NA,labRow=NA)

##### Using the package gplots

library(ConsensusClusterPlus)
library(gplots)

distmatrix.col<-dist(t(exprs(DATAF)))
## using ward clustering 
hc01.col <- hclust(distmatrix.col,method="ward.D") 
hc01.col 
## 
## Call:
## hclust(d = distmatrix.col, method = "ward.D")
## 
## Cluster method   : ward.D 
## Distance         : Euclidean 
## Number of objects: 350
## calculate distance matrix for rows: use correlation for rows/genes
distmatrix.row<-as.dist(1-cor(t(exprs(DATAF))))
## using ward clustering
hc01.row <- hclust(distmatrix.row,method="ward.D") 

## making heatmap
levels(as.factor(DATAF$COO))
## [1] "ABC" "GCB"
col.palette <- c("green", "purple")
colors <- col.palette[as.numeric(as.factor(DATAF$COO))]
heatmaptitle <- paste(" ", "top MAD-filtered 3k genes", sep = "")
heatmap.2(exprs(DATAF),
          trace='none',
          main=heatmaptitle,
          col=bluered(64),
          dendrogram='both',
          scale='row',
          ColSideColors=colors,
          Colv=as.dendrogram(hc01.col),
          Rowv=as.dendrogram(hc01.row),
          margins=c(6,6))
legend("topright", c('ABC','GCB'),
       pch=15,pt.cex=2,col=col.palette,ncol=1,title.adj=0,cex=1,bty='n')

## making heatmap for gender 

levels(as.factor(DATAF$Gender))
## [1] "female" "male"   "N/A"
col.palette <- c("green", "purple","red")
colors <- col.palette[as.numeric(as.factor(DATAF$Gender))]
heatmaptitle <- paste(" ", "top MAD-filtered 3k genes", sep = "")
heatmap.2(exprs(DATAF),
          trace='none',
          main=heatmaptitle,
          col=bluered(64),
          dendrogram='both',
          scale='row',
          ColSideColors=colors,
          Colv=as.dendrogram(hc01.col),
          Rowv=as.dendrogram(hc01.row),
          margins=c(6,6))
legend("topright", c('female','male','N/A'),
       pch=15,pt.cex=2,col=col.palette,ncol=1,title.adj=0,cex=1,bty='n')

## making heatmap for treatment
levels(as.factor(DATAF$Treatment_Regimen))
## [1] "CHOP-Like"   "R-CHOP-Like"
col.palette <- c("red", "purple")
colors <- col.palette[as.numeric(as.factor(DATAF$Treatment_Regimen))]
heatmaptitle <- paste(" ", "top MAD-filtered 3k genes", sep = "")
heatmap.2(exprs(DATAF),
          trace='none',
          main=heatmaptitle,
          col=bluered(64),
          dendrogram='both',
          scale='row',
          ColSideColors=colors,
          Colv=as.dendrogram(hc01.col),
          Rowv=as.dendrogram(hc01.row),
          margins=c(6,6))
legend("topright", c('CHOP','R-CHOP'),
       pch=15,pt.cex=2,col=col.palette,ncol=1,title.adj=0,cex=1,bty='n')

As we can see, we are able to identify some clusters in these data. In particular, we note that the data split almost exactly into the two cohorts, R-CHOP and CHOP (this is the strongest phenotype).
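To quantify this impression, we can cut the sample dendrogram into two clusters and cross-tabulate the assignment against the treatment labels; a minimal sketch reusing the hc01.col object computed above:

## cut the sample dendrogram into 2 clusters and compare with treatment
cl2 <- cutree(hc01.col, k=2)
table(cluster=cl2, treatment=DATAF$Treatment_Regimen)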

1.3 EM-clustering

Now we use an even more drastic variation filter (1000 genes) and perform EM-clustering with the mclust package. Then we plot the data onto the first two PCs and color the data points by cluster. With the mclust package we can choose the best model on the basis of the BIC (Bayesian Information Criterion): the best model is the one that maximizes the BIC.

library(mclust)

## Variation filtering: reducing the expression dataset to the 1000 genes with the highest MAD

DATAF <- variationFilter(DATA,ngenes=1000,do.plot=TRUE)
## Variation filtering based on mad .. done.
## Selecting top 1000 by mad .. done, 1000 genes selected.
## Creating scatter plot ..

## done.
# selection of the best model
BIC <- mclustBIC(t(exprs(DATAF)), G=1:15)
summary(BIC)
## Best BIC values:
##              VEI,9       VEI,11       VEI,12
## BIC      -928761.9 -929122.9818 -929271.2943
## BIC diff       0.0    -361.1222    -509.4347
BIC
## Bayesian Information Criterion (BIC):
##         EII      VII        EEI        VEI        EVI        VVI
## 1  -1157354 -1157354 -1125226.6 -1125226.6 -1125226.6 -1125226.6
## 2  -1067180 -1066601  -974112.4  -972823.3  -970935.7  -969217.2
## 3  -1054308 -1053385  -960083.0  -959043.0  -957982.1  -957423.2
## 4  -1044427 -1042028  -948269.9  -944892.1  -950481.0  -948869.7
## 5  -1035630 -1034407  -940002.8  -938265.1  -946299.1  -945955.0
## 6  -1033095 -1032229  -938440.9  -936637.1  -947233.0  -947267.6
## 7  -1029110 -1028440  -935681.7  -933326.7  -949498.2  -946979.1
## 8  -1026367 -1024822  -933061.8  -930006.2  -951269.1  -948037.1
## 9  -1027703 -1025966  -930568.3  -928761.9  -956109.7  -953051.0
## 10 -1026664 -1024960  -931833.6  -929752.5  -959556.1  -957118.2
## 11 -1026791 -1024461  -931785.1  -929123.0  -965045.1  -960906.2
## 12 -1027896 -1025881  -931918.6  -929271.3  -969679.4  -966286.6
## 13 -1029070 -1026760  -932785.8  -929928.8  -975934.6  -971877.4
## 14 -1030610 -1028387  -932784.9  -931452.4  -980342.5  -977135.1
## 15 -1033953 -1031537  -934447.2  -932032.2  -986767.7  -982368.8
## 
## Top 3 models based on the BIC criterion:
##     VEI,9    VEI,11    VEI,12 
## -928761.9 -929123.0 -929271.3
plot(BIC) 

# now we can extract the cluster
MC = Mclust(t(exprs(DATAF)), x = BIC, G=1:15)
summary(MC)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust VEI (diagonal, equal shape) model with 9 components:
## 
##  log.likelihood   n    df       BIC       ICL
##       -435044.4 350 10016 -928761.9 -928761.9
## 
## Clustering table:
##  1  2  3  4  5  6  7  8  9 
## 31 38 42 39 48 36 39 32 45
# perform a PCA
data.pca <- prcomp(t(exprs(DATAF)))
data.pca1.2 <- data.pca$x[,1:2] # take the first 2 components
data.pca1.3 <- data.pca$x[,1:3] # take the first component and the third

## extract label
col.palette <- c("green","red","gray","blue","black","yellow","orange","purple","magenta")

# MC-CLUSTER
colors <- col.palette[as.numeric(as.factor(MC$classification))]
# plot onto the first two PCs
plot(x=data.pca1.2[,1],y=data.pca1.2[,2],pch=20,xlab="1st PCA",ylab="2nd PCA",col=colors)
legend("topleft",pch=20,legend=levels(as.factor(MC$classification)), col=col.palette)

With this method we obtain 9 clusters. This is a rather high number, and the plot shows that the extracted clusters are not clearly separated. A possible reason is that the method overfits the data, or that the probabilistic model we used (the Gaussian mixture) is not appropriate for these data. However, these conclusions are not strong, because we only consider two components; a more accurate analysis would consider more components and inspect the clusters in a higher-dimensional space.
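As a first step in that direction we can at least inspect more than two PCs at once; a minimal sketch using a scatterplot matrix of the first four PCs (the choice of four is arbitrary):

## scatterplot matrix of the first four PCs, colored by mclust cluster
pairs(data.pca$x[,1:4], labels=paste0("PC",1:4), col=colors, pch=20)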

1.4 Differential analysis

Now we perform differential analysis with respect to the “ABC vs. GCB” distinction, using limma and a t-test. For this step we first perform variation filtering, and then select the discovery cohort of 200 classified R-CHOP-treated samples, in order to reduce the number of hypotheses to test. In particular, we choose the standard deviation as the variation measure and keep 6000 genes. With this analysis we investigate whether gene expression differs significantly between the two classes.

#variation filter
DATAF <- variationFilter(DATA,score="sd",ngenes=6000,do.plot=TRUE)
## Variation filtering based on sd .. done.
## Selecting top 6000 by sd .. done, 6000 genes selected.
## Creating scatter plot ..

## done.
# start by selecting the R-CHOP cohort
DATA2F <- DATAF[,DATAF$Treatment_Regimen %in% c("R-CHOP-Like")]

samplesSet <- DATA2F[,order(DATA2F$COO)]
pheno <- samplesSet$COO

## split the data into the two groups
group1 <- exprs(samplesSet)[, pheno =="ABC"]
group2 <- exprs(samplesSet)[, pheno =="GCB"]
dim(group1)  # show the size of group1
## [1] 6000   93
dim(group2)  # show the size of group2
## [1] 6000  107
## use gene symbols to index the rows (rather than the entrez IDs)
rownames(group1) <- fData(samplesSet)$hgnc_symbol
rownames(group2) <- fData(samplesSet)$hgnc_symbol

1.4.1 Differential expression analysis using the limma package

We start with limma, a library for the analysis of gene expression microarray data, particularly the use of linear models for analysing designed experiments and for the assessment of differential expression. Empirical Bayes methods are used to provide stable results even when the number of arrays is small. The model that we use in this section includes only the class variable, COO.

# using limma
design <- model.matrix(~0 + factor(pheno))
colnames(design) <- c("ABC", "GCB")

contrast.matrix <- makeContrasts(ABC-GCB, levels = design)
fit <- lmFit(samplesSet, design)
fit <- contrasts.fit(fit,contrast.matrix)
fit <- eBayes(fit)
head(fit$coefficients)
##        Contrasts
##          ABC - GCB
##   7210   0.7074712
##   5774  -0.3331985
##   14462  0.1100977
##   10416  0.2197934
##   4902   0.2968539
##   3819   0.4303390
#get full differential expression output table, sorted by p-value
limmaRes <- topTable(fit, adjust.method = "BH", n = Inf, sort.by = "P")

Now we have to select the most significant genes. To do this we choose an adjusted p-value cutoff; in particular, we apply a cutoff of 0.001.

#subset to genes with adjusted p-value cutoff
adjpcutoff <- 0.001

limmaRes.sig <- subset(limmaRes, adj.P.Val < adjpcutoff)
nrow(limmaRes.sig)
## [1] 1434
head(limmaRes.sig)
##       affy_hg_u133_plus_2 hgnc_symbol
## 10808           213906_at       MYBL1
## 1110            218862_at       ASB13
## 1437            205965_at        BATF
## 17400           207641_at   TNFRSF13B
## 3959          220230_s_at      CYB5R2
## 13635           218699_at       RAB29
##                                                                                       description
## 10808 v-myb avian myeloblastosis viral oncogene homolog-like 1 [Source:HGNC Symbol;Acc:HGNC:7547]
## 1110                ankyrin repeat and SOCS box containing 13 [Source:HGNC Symbol;Acc:HGNC:19765]
## 1437        basic leucine zipper transcription factor, ATF-like [Source:HGNC Symbol;Acc:HGNC:958]
## 17400   tumor necrosis factor receptor superfamily member 13B [Source:HGNC Symbol;Acc:HGNC:18153]
## 3959                                cytochrome b5 reductase 2 [Source:HGNC Symbol;Acc:HGNC:24376]
## 13635                        RAB29, member RAS oncogene family [Source:HGNC Symbol;Acc:HGNC:9789]
##       ensembl_gene_id     logFC  AveExpr         t      P.Value
## 10808 ENSG00000185697 -1.664738 6.228461 -15.41713 4.790664e-36
## 1110  ENSG00000196372 -1.453119 8.415138 -15.02899 7.683418e-35
## 1437  ENSG00000156127  1.582747 9.654307  14.65825 1.092633e-33
## 17400 ENSG00000240505  1.178199 7.205777  14.11348 5.423305e-32
## 3959  ENSG00000166394  2.620342 6.793673  13.87734 2.947465e-31
## 13635 ENSG00000117280  1.563642 7.822934  13.78843 5.574521e-31
##          adj.P.Val        B
## 10808 2.874398e-32 71.16989
## 1110  2.305025e-31 68.44045
## 1437  2.185266e-30 65.82872
## 17400 8.134958e-29 61.98629
## 3959  3.536958e-28 60.32018
## 13635 5.574521e-28 59.69293
topGenes <- limmaRes.sig$hgnc_symbol
topGenes.up <- subset(limmaRes.sig, t<0)$hgnc_symbol
topGenes.down <- subset(limmaRes.sig, t>=0)$hgnc_symbol

In the following picture we report the volcano plot: -log10 of the q-values (y-axis) against the corresponding log2 fold changes (x-axis).

require(calibrate)

# Volcano plot for limma
subset_limma <- limmaRes[, c("hgnc_symbol","logFC","adj.P.Val")]
colnames(subset_limma) <- c("gene","log2FoldChange","q_value")
with(subset_limma, plot(log2FoldChange, -log10(q_value), pch=20, main="Limma Volcano plot"))

# Add colored points: red if q<0.0001, orange if |log2FC|>1.5, green if both
with(subset(subset_limma, q_value<.0001 ), points(log2FoldChange, -log10(q_value), pch=20, col="red"))
with(subset(subset_limma, abs(log2FoldChange)>1.5), points(log2FoldChange, -log10(q_value), pch=20,col="orange"))
with(subset(subset_limma, q_value<.0001 & abs(log2FoldChange)>1.5), points(log2FoldChange, -log10(q_value), pch=20,col="green"))
with(subset(subset_limma, q_value<.0001 & abs(log2FoldChange)>1.5), textxy(log2FoldChange, -log10(q_value), labs=gene, cex=.8))

1.4.2 Differential expression analysis using t-test and lm

Now we repeat the analysis with two other methods for the differential analysis: a t-test, and lm, which fits linear models.

## t-test
## apply the t.test to each gene and save the output in a data.frame

ttestRes <- data.frame(t(sapply(1:nrow(group1), 
     function(i){
          res <- t.test(x = group1[i, ], y = group2[i,], alternative ="two.sided")
          res.list <-  c(t.score=res$statistic, 
                         t.pvalue = res$p.value)
          return(res.list)
      })))

## application to all genes 
ttestRes1 <- as.data.frame(t(apply(exprs(samplesSet),1,
                                   function(y) {
                                       out <- t.test(y~pheno,var.equal=TRUE)
                                       c(t.score=out$statistic,t.pvalue=out$p.value)
                                   })))

## use the gene names to index the rows (for interpretability)
rownames(ttestRes1) <- rownames(group1)

## let us add to the output data.frame an extra column reporting the FDR
## .. (i.e., the MHT-corrected p-value)
ttestRes1$t.fdr <- p.adjust(ttestRes1$t.pvalue, method = "BH")

#subset to genes with adjusted p-value cutoff
adjpcutoff <- 0.001
ttestRes1.sig <- subset(ttestRes1, t.fdr < adjpcutoff)

## let us sort the output by t-score
ttestOrd <- order(ttestRes1[,'t.score.t'],decreasing=TRUE)
head(ttestRes1[ttestOrd,])
##           t.score.t     t.pvalue        t.fdr
## BATF       14.58713 3.280848e-33 6.561695e-30
## TNFRSF13B  14.13247 8.156046e-32 1.223407e-28
## CYB5R2     13.72558 1.448477e-30 1.619497e-27
## RAB29      13.70979 1.619497e-30 1.619497e-27
## PIM2       12.80745 9.471725e-28 5.683035e-25
## BCL2L10    12.59437 4.246406e-27 2.123203e-24

For p-values below the cutoff we reject the null hypothesis and accept the hypothesis that the expression levels of these genes differ between the “ABC” and “GCB” classes. We now show how to visualize the top markers for each class by means of the heatmap.plus function. We use our own my.heatmap function, which is a simple wrapper adding a few extra features (including a change of the default color palette).

source(paste(PATH,"../code/heatmap.R",sep=""))

## let us visualize the top 50 and bottom 50 genes
hiIdx <- ttestOrd[1:50]
loIdx <- ttestOrd[nrow(ttestRes1):(nrow(ttestRes1)-49)]
datOut <- exprs(samplesSet)[c(hiIdx,loIdx),]

## create a color bar to show sample labels (green=ABC, orange=GCB)
CSC <- rep(c("green","orange"),times=c(ncol(group1),ncol(group2)))
CSC <- cbind(CSC,CSC) # ColSideColors needs to be a matrix 
my.heatmap(datOut,Colv=NA,Rowv=NA,ColSideColors=CSC)

From the heatmap we can see that there are genes expressed in a significantly different way between the two classes. Now we try to use lm, which fits linear models and can be used to carry out regression, single-stratum analysis of variance and analysis of covariance. We regress the expression of each gene on the phenotype variable, apply this to all genes, and show that the test result is the same as for the t-test with equal variance.

## application to all genes
lmRes <- as.data.frame(t(apply(exprs(samplesSet),1,
                                   function(y) {
                                       out <- summary(lm(y~pheno))$coefficients
                                       c(t.score=out[2,"t value"],t.pvalue=out[2,"Pr(>|t|)"])
                                   })))

lmRes$t.fdr <- p.adjust(lmRes$t.pvalue, method = "BH")

# the scores are the same (up to a sign flip)
all.equal(-ttestRes1$t.score.t,lmRes$t.score) 
## [1] TRUE

1.4.3 Comparing t-test and limma results

Now we compare scores and MHT-corrected q-values of the two approaches, limma and t-test, and quantify the overlap of significant genes for a given significance threshold. Then we visualize the overlap as a Venn diagram.

#comparing t-test results to limma results
combinedRes <- cbind(limmaRes, ttestRes1[match(limmaRes$hgnc_symbol, rownames(ttestRes1)),])

plot(combinedRes$t, combinedRes$t.score.t, xlab = "limma t-statistic", ylab = "t-test t-statistic", pch = 20, cex = 0.5)

plot(combinedRes$P.Value, combinedRes$t.pvalue, xlab = "limma p-value", ylab = "t-test pvalue", pch = 20, cex = 0.5)

plot(combinedRes$adj.P.Val, combinedRes$t.fdr, xlab = "limma fdr", ylab = "t-test fdr", pch = 20, cex =0.5)

combinedRes <- cbind(limmaRes.sig, ttestRes1.sig[match(limmaRes.sig$hgnc_symbol, rownames(ttestRes1)),])

#what is the overlap between t-test and limma derived significant genes? 
top.ttest <- rownames(combinedRes)[order(combinedRes$t.fdr, decreasing = FALSE)[1:nrow(ttestRes1.sig)]]
top.limma <- rownames(combinedRes)[order(combinedRes$adj.P.Val, decreasing = FALSE)[1:nrow(limmaRes.sig)]]
top <- list(top.ttest = top.ttest, top.limma = top.limma)


fill <- c("light blue","light green")
p <- venn.diagram(x = top, filename = NULL, fill=fill)
grid.newpage()
grid.draw(p)

We note that there are no significant differences between the two methods.
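The agreement can also be quantified directly from the two significant-gene lists; a minimal sketch using the top.ttest and top.limma vectors defined above:

## size of the overlap and Jaccard index between the two significant lists
ov <- length(intersect(top.ttest, top.limma))
c(overlap = ov, jaccard = ov/length(union(top.ttest, top.limma)))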

1.4.4 Differential expression analysis using limma with control for confounders

Now we repeat the differential analysis, using only limma, but this time controlling for gender and age effects. We also compare these results to those obtained without controlling for confounders. As we can see, there are some notable differences between the previous results and these. We decided to keep the results of this more complex model for the subsequent analyses.

phenoGender <- samplesSet$Gender
phenoAge <- samplesSet$Age

# using limma

design <- model.matrix(~0 + factor(pheno) + factor(phenoGender)+phenoAge)
colnames(design) <- c("ABC", "GCB", "Gender","Age")

contrast.matrix <- makeContrasts(ABC-GCB,Gender,Age, levels = design)
fit <- lmFit(samplesSet, design)
fit <- contrasts.fit(fit,contrast.matrix)
fit <- eBayes(fit)
head(fit$coefficients)
##        Contrasts
##           ABC - GCB      Gender           Age
##   7210   0.69296345  0.30136134  0.0016996311
##   5774  -0.33010896  0.09107204 -0.0008435825
##   14462  0.01864653  4.57684223  0.0024082914
##   10416  0.34787667 -0.16815752 -0.0227379131
##   4902   0.17173729  4.57393975  0.0085308589
##   3819   0.39638521  0.36989280  0.0050183912
#get full differential expression output table
#if we have more than 2 coeff. we use topTableF, using the F-statistic
limmaResConfounders <- topTableF(fit, adjust.method = "BH", n = Inf, sort.by = "F") 

#subset to genes with adjusted p-value cutoff
adjpcutoff <- 0.001  

limmaResConfounders.sig <- subset(limmaResConfounders, adj.P.Val < adjpcutoff)
nrow(limmaResConfounders.sig)
## [1] 1262
head(limmaResConfounders.sig)
##       affy_hg_u133_plus_2 hgnc_symbol
## 14462           201909_at      RPS4Y1
## 18699           224588_at        XIST
## 4902          204409_s_at      EIF1AY
## 4208           1570359_at       DDX3Y
## 8266          206700_s_at       KDM5D
## 18333           206624_at       USP9Y
##                                                                                    description
## 14462                     ribosomal protein S4, Y-linked 1 [Source:HGNC Symbol;Acc:HGNC:10425]
## 18699  X inactive specific transcript (non-protein coding) [Source:HGNC Symbol;Acc:HGNC:12810]
## 4902  eukaryotic translation initiation factor 1A, Y-linked [Source:HGNC Symbol;Acc:HGNC:3252]
## 4208        DEAD (Asp-Glu-Ala-Asp) box helicase 3, Y-linked [Source:HGNC Symbol;Acc:HGNC:2699]
## 8266                    lysine (K)-specific demethylase 5D [Source:HGNC Symbol;Acc:HGNC:11115]
## 18333             ubiquitin specific peptidase 9, Y-linked [Source:HGNC Symbol;Acc:HGNC:12633]
##       ensembl_gene_id   ABC...GCB    Gender          Age  AveExpr        F
## 14462 ENSG00000129824  0.01864653  4.576842 0.0024082914 9.042906 416.4330
## 18699 ENSG00000229807  0.04114763 -3.981781 0.0001703606 5.829036 329.2754
## 4902  ENSG00000198692  0.17173729  4.573940 0.0085308589 5.642781 273.9151
## 4208  ENSG00000067048  0.03034468  2.101216 0.0017984663 4.683126 263.5944
## 8266  ENSG00000012817 -0.05504334  2.901498 0.0015608391 5.836277 194.4596
## 18333 ENSG00000114374  0.01484562  2.111785 0.0039586826 4.228574 153.9509
##            P.Value    adj.P.Val
## 14462 4.057170e-86 2.434302e-82
## 18699 2.003383e-77 6.010150e-74
## 4902  7.585485e-71 1.517097e-67
## 4208  1.673501e-69 2.510252e-66
## 8266  2.983563e-59 3.580275e-56
## 18333 6.692629e-52 6.692629e-49
topGenes2 <- limmaResConfounders.sig$hgnc_symbol

out.dir <- paste(PATH,"results",sep="")
saveRDS(topGenes2, paste(out.dir, "/", "topGenesProject2.RDS", sep = ""))


all.equal(limmaRes$adj.P.Val,limmaResConfounders$adj.P.Val)
## [1] "Mean relative difference: 0.07803283"
## the scores are not the same 
plot(limmaRes$adj.P.Val,limmaResConfounders$adj.P.Val,pch=20)

top.limmaRes <- rownames(limmaRes.sig)[order(limmaRes$adj.P.Val, decreasing = FALSE)[1:nrow(limmaRes.sig)]]
top.limmaResConfounders <- rownames(limmaResConfounders.sig)[order(limmaResConfounders.sig$adj.P.Val, decreasing = FALSE)[1:nrow(limmaResConfounders.sig)]]

top <- list(top.limmaRes = top.limmaRes, top.limmaResConfounders = top.limmaResConfounders)

fill <- c("light blue","light green")
p <- venn.diagram(x = top, filename = NULL, fill = fill)
grid.newpage()
grid.draw(p)

1.5 Gene Set Enrichment

Gene set enrichment is a method to identify classes of genes that are over-represented in a large set of genes and that may have an association with disease phenotypes. In this section we want to attach a biological meaning to the gene signature. To do this we can either start from the signatures identified in the previous analysis and use a method based on the hypergeometric distribution, or consider all genes and use a method based on the Kolmogorov-Smirnov test.
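For reference, the hypergeometric tail probability underlying this kind of enrichment test can be computed directly with phyper; a minimal sketch with counts taken from the first row of the enrichment table reported below (note that calcHyper, defined below, uses a two-sided fisher.test, whose one-sided “greater” variant coincides with this tail probability):

## P(overlap >= a) for a signature of n genes drawn from a background of N,
## of which K belong to the gene set (numbers from NABA_CORE_MATRISOME below)
a <- 60; K <- 275; n <- 1262; N <- 19565
phyper(a - 1, K, N - K, n, lower.tail = FALSE)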

## For the hyperenrichment test we need to download some gene sets from MSigDB, available here:
## http://www.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/4.0/c2.cp.v4.0.symbols.gmt

gsets <- readLines(paste(PATH,'/c2.cp.v5.0.symbols.gmt',sep=''))
gsets <- strsplit(gsets,'\t')
names(gsets) <- sapply(gsets,function(x)x[1])
gsets <- sapply(gsets,function(x)x[-(1:2)])

1.5.1 Hyperenrichment analysis

Now we perform pathway enrichment analysis of the identified signatures. In particular, we use a hyperenrichment-type analysis, based on a custom function, calcHyper, and a geneset/pathway compendium (MSigDB).

allGenes <- nrow(DATA)

## define a function to do a Fisher's exact test

calcHyper <- function(geneSet,sig,allGenes){
   a <- length(intersect(sig,geneSet))
   b <- length(geneSet)-a
   c <- length(sig)-a
   d <- allGenes-a-b-c
   cont <- matrix(c(a,b,c,d),ncol=2)
   pVal <- fisher.test(cont)
   return(c(overlap=a,geneset=a+b,sig=c+a,background=allGenes,p.value=pVal$p.value))
}
enrichment <- t(sapply(gsets,calcHyper,topGenes2,allGenes))
enrichment <- cbind(enrichment,adjusted.p=p.adjust(enrichment[,'p.value'],method='BH'))
enrichment <- enrichment[order(enrichment[,'adjusted.p'], decreasing = FALSE),]
enrichment[1:10,]
##                                            overlap geneset  sig
## NABA_CORE_MATRISOME                             60     275 1262
## KEGG_OLFACTORY_TRANSDUCTION                      0     389 1262
## NABA_MATRISOME                                 120    1028 1262
## REACTOME_OLFACTORY_SIGNALING_PATHWAY             0     328 1262
## PID_AVB3_INTEGRIN_PATHWAY                       21      75 1262
## NABA_COLLAGENS                                  16      44 1262
## PID_SYNDECAN_1_PATHWAY                          16      46 1262
## REACTOME_COLLAGEN_FORMATION                     18      58 1262
## REACTOME_EXTRACELLULAR_MATRIX_ORGANIZATION      22      87 1262
## NABA_ECM_GLYCOPROTEINS                          35     196 1262
##                                            background.Features
## NABA_CORE_MATRISOME                                      19565
## KEGG_OLFACTORY_TRANSDUCTION                              19565
## NABA_MATRISOME                                           19565
## REACTOME_OLFACTORY_SIGNALING_PATHWAY                     19565
## PID_AVB3_INTEGRIN_PATHWAY                                19565
## NABA_COLLAGENS                                           19565
## PID_SYNDECAN_1_PATHWAY                                   19565
## REACTOME_COLLAGEN_FORMATION                              19565
## REACTOME_EXTRACELLULAR_MATRIX_ORGANIZATION               19565
## NABA_ECM_GLYCOPROTEINS                                   19565
##                                                 p.value   adjusted.p
## NABA_CORE_MATRISOME                        3.589375e-17 4.773869e-14
## KEGG_OLFACTORY_TRANSDUCTION                7.858870e-12 5.226148e-09
## NABA_MATRISOME                             1.560619e-10 6.918745e-08
## REACTOME_OLFACTORY_SIGNALING_PATHWAY       4.399347e-10 1.462783e-07
## PID_AVB3_INTEGRIN_PATHWAY                  6.218853e-09 1.378512e-06
## NABA_COLLAGENS                             6.081350e-09 1.378512e-06
## PID_SYNDECAN_1_PATHWAY                     1.279457e-08 2.127098e-06
## REACTOME_COLLAGEN_FORMATION                1.255953e-08 2.127098e-06
## REACTOME_EXTRACELLULAR_MATRIX_ORGANIZATION 2.139715e-08 3.162024e-06
## NABA_ECM_GLYCOPROTEINS                     3.554249e-08 4.727152e-06

From this analysis we note some interesting things. The set of genes encoding the “matrisome” (the ensemble of extracellular matrix and ECM-associated proteins) is the gene set with the largest overlap. The extracellular matrix (ECM) is a complex meshwork of cross-linked proteins providing both biophysical and biochemical cues that are important regulators of cell proliferation, survival, differentiation, and migration. Both tumor cells and stromal cells contribute to the production of the tumor matrix, and tumors of differing metastatic potential differ in both the tumor- and the stroma-derived ECM components.

1.5.2 Gene Set Enrichment Analysis (GSEA)

Another way to look for up- or down-regulated gene sets/pathways is GSEA, implemented in GSEAlm. Gene set enrichment analysis (GSEA) correlates a large database of co-regulated gene sets with a microarray or RNA-seq data set. GSEA is a hybrid approach: it is competitive, in that different sets are pitted against one another, but significance is evaluated by permutation of sample labels.

library(GSEAlm)

#make incident matrix of genes
makeSigMatrix <- function(sigs,samplesSet, rowid = "hgnc_symbol"){
   all.genes <- fData(samplesSet)[, rowid]
   sigMat <- t(sapply(1:length(sigs),function(i){
        y <- rep(0,length(all.genes))
        names(y) <- all.genes
        x <- sigs[[i]]
        x.id <- match(x, names(y))
        x.id <- x.id[!is.na(x.id)]
        y[x.id] <- 1
        return(y)}))   
   return(sigMat) 
}

#pathway membership matrix for each gene
sigMat <- makeSigMatrix(gsets, samplesSet)

#perform gsealm
res <- gsealmPerm(samplesSet, ~COO,sigMat,nperm=1000)
colnames(res) <- levels(pData(samplesSet)$COO)
rownames(res) <- names(gsets)
res <- data.frame(res)

#histogram of p-values for pathway enrichment
hist(res[,2],10,main="GSEAlm p-values",xlab="p-values for GCB minus ABC",xlim=c(0,1))

hist(res[,1],10,main="GSEAlm p-values",xlab="p-values for ABC minus GCB",xlim=c(0,1))

res.sig.up <- res[order(res[,2], decreasing = FALSE, na.last = TRUE),]
res.sig.down <- res[order(res[,1], decreasing = FALSE, na.last = TRUE),]

#top enriched pathways in GCB (vs. ABC)
head(rownames(res.sig.up))
## [1] "KEGG_PRIMARY_BILE_ACID_BIOSYNTHESIS"                
## [2] "KEGG_HISTIDINE_METABOLISM"                          
## [3] "KEGG_TAURINE_AND_HYPOTAURINE_METABOLISM"            
## [4] "KEGG_GLYCOSAMINOGLYCAN_BIOSYNTHESIS_HEPARAN_SULFATE"
## [5] "KEGG_GLYCEROLIPID_METABOLISM"                       
## [6] "KEGG_PANTOTHENATE_AND_COA_BIOSYNTHESIS"
#top enriched pathways in ABC (vs. GCB)
head(rownames(res.sig.down))
## [1] "KEGG_PURINE_METABOLISM"                             
## [2] "KEGG_PYRIMIDINE_METABOLISM"                         
## [3] "KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM"    
## [4] "KEGG_BETA_ALANINE_METABOLISM"                       
## [5] "KEGG_N_GLYCAN_BIOSYNTHESIS"                         
## [6] "KEGG_GLYCOSAMINOGLYCAN_BIOSYNTHESIS_KERATAN_SULFATE"

Now we want to compare the two methods used for enrichment analysis; in particular, we are interested in investigating whether the list of pathways found by this second method is similar to the one found by the previous one.

hyper.pvalue <- as.numeric(enrichment[, "adjusted.p"]) #adjusted pvalues
gsealm.pvalue <- res[rownames(enrichment),1] #permutation based pvalues

plot(hyper.pvalue, gsealm.pvalue)

sig.hyper <- hyper.pvalue < 0.05
sig.gsealm <- gsealm.pvalue < 0.05

table(sig.hyper, sig.gsealm)
##          sig.gsealm
## sig.hyper FALSE TRUE
##     FALSE   892  395
##     TRUE     25    4

We note many differences between the two methods. The list of pathways found by this method is not similar to the one found by the previous method, although some pathways do coincide.

1.6 Classification

In this last section we use different classifiers to try to predict the class, starting from different sets of significant genes. In particular, we train the classifiers on two different feature sets: in one case we consider all genes in the expression set after the application of the variation filter based on the standard deviation (6000 genes); the other feature set consists only of the top genes identified by the differential analysis using limma with control for confounders. The classifiers are trained using the R-CHOP cohort as training set. We also perform a cross-validation to compare the performance of the various classifiers and find which of them may be best suited to a classification problem of this type.
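Since we will compare confusion matrices across folds and classifiers, a small helper that turns a confusion table into an accuracy figure is convenient; a minimal sketch (this accuracy helper is our own addition, not part of the sourced scripts):

## proportion of correctly classified samples from a confusion table
accuracy <- function(tab) sum(diag(tab)) / sum(tab)
## hypothetical usage on one fold, e.g. for the random forest below:
## accuracy(table(observed = tstSet1_1$COO, predicted = RF1_1.pred))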

## source some custom scripts (all included in the sub-directory code/)
source(paste(PATH,"../code/xvalSelect.R",sep=""))
source(paste(PATH,"../code/knn.R",sep=""))


# training set with the best genes from the diffanal
trainingSet1 <- subset(samplesSet,fData(samplesSet)$hgnc_symbol %in% limmaResConfounders.sig$hgnc_symbol)
# training set with all genes after filtering (6000)
trainingSet2 <- samplesSet

## generate a vector of fold assignments (in this case, 3 folds) for the first feature set
set.seed(987) # for reproducible results
split <- xvalSelect(sample.size=ncol(trainingSet1),nfolds=3,cls=pData(trainingSet1)[,"COO"])

trnSet1_1 <- trainingSet1[,split!=1]
trnSet1_2 <- trainingSet1[,split!=2]
trnSet1_3 <- trainingSet1[,split!=3]
tstSet1_1 <- trainingSet1[,split==1]
tstSet1_2 <- trainingSet1[,split==2]
tstSet1_3 <- trainingSet1[,split==3]

table(pData(trnSet1_1)[,"COO"])
## 
## ABC GCB 
##  62  71
table(pData(tstSet1_1)[,"COO"])
## 
## ABC GCB 
##  31  36
rownames(exprs(trnSet1_1))<-fData(trnSet1_1)$hgnc_symbol
rownames(exprs(trnSet1_2))<-fData(trnSet1_2)$hgnc_symbol
rownames(exprs(trnSet1_3))<-fData(trnSet1_3)$hgnc_symbol
rownames(exprs(tstSet1_1))<-fData(tstSet1_1)$hgnc_symbol
rownames(exprs(tstSet1_2))<-fData(tstSet1_2)$hgnc_symbol
rownames(exprs(tstSet1_3))<-fData(tstSet1_3)$hgnc_symbol

#now the same with the second training set
set.seed(987) # for reproducible results
split <- xvalSelect(sample.size=ncol(trainingSet2),nfolds=3,cls=pData(trainingSet2)[,"COO"])

trnSet2_1 <- trainingSet2[,split!=1]
trnSet2_2 <- trainingSet2[,split!=2]
trnSet2_3 <- trainingSet2[,split!=3]
tstSet2_1 <- trainingSet2[,split==1]
tstSet2_2 <- trainingSet2[,split==2]
tstSet2_3 <- trainingSet2[,split==3]

table(pData(trnSet2_1)[,"COO"])
## 
## ABC GCB 
##  62  71
table(pData(tstSet2_1)[,"COO"])
## 
## ABC GCB 
##  31  36
rownames(exprs(trnSet2_1))<-fData(trnSet2_1)$hgnc_symbol
rownames(exprs(trnSet2_2))<-fData(trnSet2_2)$hgnc_symbol
rownames(exprs(trnSet2_3))<-fData(trnSet2_3)$hgnc_symbol
rownames(exprs(tstSet2_1))<-fData(tstSet2_1)$hgnc_symbol
rownames(exprs(tstSet2_2))<-fData(tstSet2_2)$hgnc_symbol
rownames(exprs(tstSet2_3))<-fData(tstSet2_3)$hgnc_symbol

1.6.1 Random forest

Random forest is a method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes of the individual trees. Random decision forests correct for decision trees’ habit of overfitting their training set.

require(randomForest)

# we start with the first feature set

RF1_1 <- randomForest(x=t(exprs(trnSet1_1)),y=factor(trnSet1_1$COO),ntree=1000,importance=TRUE)
print(RF1_1)
## 
## Call:
##  randomForest(x = t(exprs(trnSet1_1)), y = factor(trnSet1_1$COO),      ntree = 1000, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 35
## 
##         OOB estimate of  error rate: 1.5%
## Confusion matrix:
##     ABC GCB class.error
## ABC  60   2  0.03225806
## GCB   0  71  0.00000000
#prediction on the first Test-fold
RF1_1.pred <- predict(RF1_1, t(exprs(tstSet1_1)))
table(observed = tstSet1_1$COO, predicted = RF1_1.pred)
##         predicted
## observed ABC GCB
##      ABC  31   0
##      GCB   0  36
RF1_2 <- randomForest(x=t(exprs(trnSet1_2)),y=factor(trnSet1_2$COO),ntree=1000,importance=TRUE)
print(RF1_2)
## 
## Call:
##  randomForest(x = t(exprs(trnSet1_2)), y = factor(trnSet1_2$COO),      ntree = 1000, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 35
## 
##         OOB estimate of  error rate: 0%
## Confusion matrix:
##     ABC GCB class.error
## ABC  62   0           0
## GCB   0  71           0
#prediction on the second Test-fold
RF1_2.pred <- predict(RF1_2, t(exprs(tstSet1_2)))
table(observed = tstSet1_2$COO, predicted = RF1_2.pred)
##         predicted
## observed ABC GCB
##      ABC  28   3
##      GCB   0  36
RF1_3 <- randomForest(x=t(exprs(trnSet1_3)),y=factor(trnSet1_3$COO),ntree=1000,importance=TRUE)
print(RF1_3)
## 
## Call:
##  randomForest(x = t(exprs(trnSet1_3)), y = factor(trnSet1_3$COO),      ntree = 1000, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 35
## 
##         OOB estimate of  error rate: 0%
## Confusion matrix:
##     ABC GCB class.error
## ABC  62   0           0
## GCB   0  72           0
#prediction on the third Test-fold
RF1_3.pred <- predict(RF1_3, t(exprs(tstSet1_3)))
table(observed = tstSet1_3$COO, predicted = RF1_3.pred)
##         predicted
## observed ABC GCB
##      ABC  31   0
##      GCB   0  35
# second feature set with all genes

RF2_1 <- randomForest(x=t(exprs(trnSet2_1)),y=factor(trnSet2_1$COO),ntree=1000,importance=TRUE)
print(RF2_1)
## 
## Call:
##  randomForest(x = t(exprs(trnSet2_1)), y = factor(trnSet2_1$COO),      ntree = 1000, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 77
## 
##         OOB estimate of  error rate: 1.5%
## Confusion matrix:
##     ABC GCB class.error
## ABC  60   2  0.03225806
## GCB   0  71  0.00000000
#prediction on the first Test-fold
RF2_1.pred <- predict(RF2_1, t(exprs(tstSet2_1)))
table(observed = tstSet2_1$COO, predicted = RF2_1.pred)
##         predicted
## observed ABC GCB
##      ABC  31   0
##      GCB   0  36
RF2_2 <- randomForest(x=t(exprs(trnSet2_2)),y=factor(trnSet2_2$COO),ntree=1000,importance=TRUE)
print(RF2_2)
## 
## Call:
##  randomForest(x = t(exprs(trnSet2_2)), y = factor(trnSet2_2$COO),      ntree = 1000, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 77
## 
##         OOB estimate of  error rate: 0%
## Confusion matrix:
##     ABC GCB class.error
## ABC  62   0           0
## GCB   0  71           0
#prediction on the second Test-fold
RF2_2.pred <- predict(RF2_2, t(exprs(tstSet2_2)))
table(observed = tstSet2_2$COO, predicted = RF2_2.pred)
##         predicted
## observed ABC GCB
##      ABC  28   3
##      GCB   0  36
RF2_3 <- randomForest(x=t(exprs(trnSet2_3)),y=factor(trnSet2_3$COO),ntree=1000,importance=TRUE)
print(RF2_3)
## 
## Call:
##  randomForest(x = t(exprs(trnSet2_3)), y = factor(trnSet2_3$COO),      ntree = 1000, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 77
## 
##         OOB estimate of  error rate: 0.75%
## Confusion matrix:
##     ABC GCB class.error
## ABC  61   1  0.01612903
## GCB   0  72  0.00000000
#prediction on the third Test-fold
RF2_3.pred <- predict(RF2_3, t(exprs(tstSet2_3)))
table(observed = tstSet2_3$COO, predicted = RF2_3.pred)
##         predicted
## observed ABC GCB
##      ABC  31   0
##      GCB   0  35

1.6.2 Adaboost

This is a machine-learning meta-algorithm that can be used in conjunction with many other learning algorithms to improve their performance. It builds a “strong” binary classifier as a linear combination of “weak” classifiers; the final classification is based on a weighted vote of the individual classifiers. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favour of the instances misclassified by previous classifiers, and it is sensitive to noisy data. In some problems it can be less susceptible to overfitting than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing (for example, an error rate smaller than 0.5 for binary classification), the final model can be proven to converge to a strong learner. The R function that we use implements an algorithm that uses classification trees as the individual weak classifiers.

library("adabag")

# we start with the first feature set

#fold 1
COO<-trnSet1_1$COO
adaboost1_1<-boosting(COO~.,data = cbind(as.data.frame(t(exprs(trnSet1_1))), COO))
COO<-tstSet1_1$COO
adaboost1_1.pred<-predict.boosting(adaboost1_1, newdata = cbind(as.data.frame(t(exprs(tstSet1_1))), COO))
adaboost1_1.pred$confusion
##                Observed Class
## Predicted Class ABC GCB
##             ABC  31   1
##             GCB   0  35
#fold 2
COO<-trnSet1_2$COO
adaboost1_2<-boosting(COO~.,data = cbind(as.data.frame(t(exprs(trnSet1_2))), COO))
COO<-tstSet1_2$COO
adaboost1_2.pred<-predict.boosting(adaboost1_2, newdata = cbind(as.data.frame(t(exprs(tstSet1_2))), COO))
adaboost1_2.pred$confusion
##                Observed Class
## Predicted Class ABC GCB
##             ABC  31   0
##             GCB   0  36
#fold 3
COO<-trnSet1_3$COO
adaboost1_3<-boosting(COO~.,data = cbind(as.data.frame(t(exprs(trnSet1_3))), COO))
COO<-tstSet1_3$COO
adaboost1_3.pred<-predict.boosting(adaboost1_3, newdata = cbind(as.data.frame(t(exprs(tstSet1_3))), COO))
adaboost1_3.pred$confusion
##                Observed Class
## Predicted Class ABC GCB
##             ABC  29   0
##             GCB   2  35
# second feature set (6k genes)

#fold 1
COO<-trnSet2_1$COO
adaboost2_1<-boosting(COO~.,data = cbind(as.data.frame(t(exprs(trnSet2_1))), COO))
COO<-tstSet2_1$COO
adaboost2_1.pred<-predict.boosting(adaboost2_1, newdata = cbind(as.data.frame(t(exprs(tstSet2_1))), COO))
adaboost2_1.pred$confusion
##                Observed Class
## Predicted Class ABC GCB
##             ABC  30   1
##             GCB   1  35
#fold 2
COO<-trnSet2_2$COO
adaboost2_2<-boosting(COO~.,data = cbind(as.data.frame(t(exprs(trnSet2_2))), COO))
COO<-tstSet2_2$COO
adaboost2_2.pred<-predict.boosting(adaboost2_2, newdata = cbind(as.data.frame(t(exprs(tstSet2_2))), COO))
adaboost2_2.pred$confusion
##                Observed Class
## Predicted Class ABC GCB
##             ABC  31   1
##             GCB   0  35
#fold 3
COO<-trnSet2_3$COO
adaboost2_3<-boosting(COO~.,data = cbind(as.data.frame(t(exprs(trnSet2_3))), COO))
COO<-tstSet2_3$COO
adaboost2_3.pred<-predict.boosting(adaboost2_3, newdata = cbind(as.data.frame(t(exprs(tstSet2_3))), COO))
adaboost2_3.pred$confusion
##                Observed Class
## Predicted Class ABC GCB
##             ABC  30   0
##             GCB   1  35

1.6.3 Logistic regression with LASSO

The LASSO is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and the interpretability of the statistical model it produces. In particular, we use a function that fits a logistic regression model with a lasso penalty.

require(glmnet)

#alpha=1 lasso, =0.5 elastic net

## first training set
#fold 1
regression1_1 <- glmnet(x=t(exprs(trnSet1_1)),y=factor(trnSet1_1$COO), family = "binomial", alpha=1)
cv.regression1_1 <- cv.glmnet(x=t(exprs(trnSet1_1)),y=factor(trnSet1_1$COO),alpha=1, family = "binomial" )

best_lambda <- cv.regression1_1$lambda.min
#prediction on Test Data
regression1_1.pred <- predict(regression1_1, newx = t(exprs(tstSet1_1)),type = "class", s=best_lambda)
#confusion matrix
table(observed = tstSet1_1$COO, predicted = regression1_1.pred)
##         predicted
## observed ABC GCB
##      ABC  30   1
##      GCB   0  36
# fold 2
regression1_2 <- glmnet(x=t(exprs(trnSet1_2)),y=factor(trnSet1_2$COO), family = "binomial", alpha=1)
cv.regression1_2 <- cv.glmnet(x=t(exprs(trnSet1_2)),y=factor(trnSet1_2$COO),alpha=1, family = "binomial" )

best_lambda <- cv.regression1_2$lambda.min
#prediction on Test Data
regression1_2.pred <- predict(regression1_2, newx = t(exprs(tstSet1_2)),type = "class", s=best_lambda)
#confusion matrix
table(observed = tstSet1_2$COO, predicted = regression1_2.pred)
##         predicted
## observed ABC GCB
##      ABC  30   1
##      GCB   1  35
# fold 3
regression1_3 <- glmnet(x=t(exprs(trnSet1_3)),y=factor(trnSet1_3$COO), family = "binomial", alpha=1)
cv.regression1_3 <- cv.glmnet(x=t(exprs(trnSet1_3)),y=factor(trnSet1_3$COO),alpha=1, family = "binomial" )

best_lambda <- cv.regression1_3$lambda.min
#prediction on Test Data
regression1_3.pred <- predict(regression1_3, newx = t(exprs(tstSet1_3)),type = "class", s=best_lambda)
#confusion matrix
table(observed = tstSet1_3$COO, predicted = regression1_3.pred)
##         predicted
## observed ABC GCB
##      ABC  30   1
##      GCB   1  34
## second training set 

#fold 1
regression2_1 <- glmnet(x=t(exprs(trnSet2_1)),y=factor(trnSet2_1$COO), family = "binomial", alpha=1)
cv.regression2_1 <- cv.glmnet(x=t(exprs(trnSet2_1)),y=factor(trnSet2_1$COO),alpha=1, family = "binomial" )

best_lambda <- cv.regression2_1$lambda.min
#prediction on Test Data
regression2_1.pred <- predict(regression2_1, newx = t(exprs(tstSet2_1)),type = "class", s=best_lambda)
#confusion matrix
table(observed = tstSet2_1$COO, predicted = regression2_1.pred)
##         predicted
## observed ABC GCB
##      ABC  30   1
##      GCB   0  36
# fold 2
regression2_2 <- glmnet(x=t(exprs(trnSet2_2)),y=factor(trnSet2_2$COO), family = "binomial", alpha=1)
cv.regression2_2 <- cv.glmnet(x=t(exprs(trnSet2_2)),y=factor(trnSet2_2$COO),alpha=1, family = "binomial" )

best_lambda <- cv.regression2_2$lambda.min
#prediction on Test Data
regression2_2.pred <- predict(regression2_2, newx = t(exprs(tstSet2_2)),type = "class", s=best_lambda)
#confusion matrix
table(observed = tstSet2_2$COO, predicted = regression2_2.pred)
##         predicted
## observed ABC GCB
##      ABC  30   1
##      GCB   1  35
# fold 3
regression2_3 <- glmnet(x=t(exprs(trnSet2_3)),y=factor(trnSet2_3$COO), family = "binomial", alpha=1)
cv.regression2_3 <- cv.glmnet(x=t(exprs(trnSet2_3)),y=factor(trnSet2_3$COO),alpha=1, family = "binomial" )

best_lambda <- cv.regression2_3$lambda.min
#prediction on Test Data
regression2_3.pred <- predict(regression2_3, newx = t(exprs(tstSet2_3)),type = "class", s=best_lambda)
#confusion matrix
table(observed = tstSet2_3$COO, predicted = regression2_3.pred)
##         predicted
## observed ABC GCB
##      ABC  30   1
##      GCB   1  34
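Since the lasso drives most coefficients exactly to zero, we can also inspect which genes a fitted model actually retains at the chosen lambda. A minimal sketch for the first fold of the first feature set, using the objects defined above:

## genes with non-zero lasso coefficients at lambda.min (first fold)
coefs    <- coef(regression1_1, s = cv.regression1_1$lambda.min)
selected <- setdiff(rownames(coefs)[as.vector(coefs != 0)], "(Intercept)")
length(selected)   # number of genes retained by the model
head(selected)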

1.6.4 KNN

KNN is a simple model that requires no particular assumption about the distribution of the data and can handle any shape of decision boundary. The model computes the similarity between the example to be classified and the examples of the training set, retrieves the k closest ones, and assigns the class by a vote over this neighborhood, either unweighted or weighted by distance. The classifier therefore requires a distance metric and a value of k, the number of examples to include in the neighborhood. It is a "lazy" classifier, in the sense that no model is actually learned from the training data; all computation is deferred to prediction time.
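The knn.estimate() and knn.predict() functions used below are provided by helper code loaded earlier; for reference, an equivalent unweighted 3-nearest-neighbour vote can be obtained with the standard class package (a sketch; tie-breaking may differ slightly):

library(class)  # standard R implementation of k-nearest neighbours

## illustrative unweighted 3-NN classification on the first fold
knn_sketch <- knn(train = t(exprs(trnSet1_1)),
                  test  = t(exprs(tstSet1_1)),
                  cl    = pData(trnSet1_1)[,"COO"],
                  k = 3, prob = TRUE)
table(observed = tstSet1_1$COO, predicted = knn_sketch)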

#best genes

knn1_1 <- knn.estimate(dat=t(exprs(trnSet1_1)),cls=pData(trnSet1_1)[,"COO"],weight="none",k=3)
knn1_1.pred <- knn.predict(dat=t(exprs(tstSet1_1)),model=knn1_1)
knn.summary(knn1_1.pred,cls=pData(tstSet1_1)[,"COO"])
##                  P(C=ABC) P(C=GCB) Pred
## GSM275088.cel.gz        1        0  ABC
## GSM275090.cel.gz        1        0  ABC
## GSM275093.cel.gz     0.33     0.67  GCB
## GSM275097.cel.gz     0.33     0.67  GCB
## GSM275112.cel.gz        1        0  ABC
## GSM275119.cel.gz        1        0  ABC
## GSM275123.cel.gz        1        0  ABC
## GSM275136.cel.gz        1        0  ABC
## GSM275139.cel.gz        1        0  ABC
## GSM275140.cel.gz        1        0  ABC
## GSM275141.cel.gz        1        0  ABC
## GSM275143.cel.gz        1        0  ABC
## GSM275160.cel.gz        1        0  ABC
## GSM275161.cel.gz     0.67     0.33  ABC
## GSM275162.cel.gz     0.67     0.33  ABC
## GSM275179.cel.gz        1        0  ABC
## GSM275181.cel.gz        1        0  ABC
## GSM275189.cel.gz     0.67     0.33  ABC
## GSM275195.cel.gz     0.67     0.33  ABC
## GSM275199.cel.gz        1        0  ABC
## GSM275216.cel.gz        1        0  ABC
## GSM275229.cel.gz        1        0  ABC
## GSM275247.cel.gz        1        0  ABC
## GSM275261.cel.gz        1        0  ABC
## GSM275278.cel.gz        1        0  ABC
## GSM275279.cel.gz     0.67     0.33  ABC
## GSM275290.cel.gz        1        0  ABC
## GSM275291.cel.gz        1        0  ABC
## GSM275299.cel.gz        1        0  ABC
## GSM275305.cel.gz        1        0  ABC
## GSM275307.cel.gz        1        0  ABC
## GSM275077.cel.gz        0        1  GCB
## GSM275083.cel.gz        0        1  GCB
## GSM275085.cel.gz        0        1  GCB
## GSM275096.cel.gz        0        1  GCB
## GSM275104.cel.gz        0        1  GCB
## GSM275106.cel.gz        0        1  GCB
## GSM275113.cel.gz        0        1  GCB
## GSM275117.cel.gz        0        1  GCB
## GSM275118.cel.gz        0        1  GCB
## GSM275126.cel.gz        0        1  GCB
## GSM275138.cel.gz        0        1  GCB
## GSM275144.cel.gz        0        1  GCB
## GSM275148.cel.gz        0        1  GCB
## GSM275156.cel.gz        0        1  GCB
## GSM275164.cel.gz        0        1  GCB
## GSM275166.cel.gz        0        1  GCB
## GSM275167.cel.gz     0.33     0.67  GCB
## GSM275170.cel.gz     0.67     0.33  ABC
## GSM275180.cel.gz        0        1  GCB
## GSM275186.cel.gz        0        1  GCB
## GSM275201.cel.gz        0        1  GCB
## GSM275212.cel.gz        0        1  GCB
## GSM275213.cel.gz        0        1  GCB
## GSM275218.cel.gz        0        1  GCB
## GSM275232.cel.gz        0        1  GCB
## GSM275255.cel.gz        0        1  GCB
## GSM275258.cel.gz        0        1  GCB
## GSM275259.cel.gz        0        1  GCB
## GSM275263.cel.gz        0        1  GCB
## GSM275264.cel.gz        0        1  GCB
## GSM275266.cel.gz        0        1  GCB
## GSM275280.cel.gz        0        1  GCB
## GSM275285.cel.gz        0        1  GCB
## GSM275294.cel.gz        0        1  GCB
## GSM275295.cel.gz        0        1  GCB
## GSM275306.cel.gz        0        1  GCB
ftable(levels(pData(tstSet1_1)[,"COO"])[apply(knn1_1.pred,1,which.max)], # predictions
       pData(tstSet1_1)[,"COO"])                                  # actual
##      ABC GCB
##             
## ABC   29   1
## GCB    2  35
knn1_2 <- knn.estimate(dat=t(exprs(trnSet1_2)),cls=pData(trnSet1_2)[,"COO"],weight="none",k=3)
knn1_2.pred <- knn.predict(dat=t(exprs(tstSet1_2)),model=knn1_2)

ftable(levels(pData(tstSet1_2)[,"COO"])[apply(knn1_2.pred,1,which.max)], # predictions
       pData(tstSet1_2)[,"COO"])                                  # actual
##      ABC GCB
##             
## ABC   28   1
## GCB    3  35
knn1_3 <- knn.estimate(dat=t(exprs(trnSet1_3)),cls=pData(trnSet1_3)[,"COO"],weight="none",k=3)
knn1_3.pred <- knn.predict(dat=t(exprs(tstSet1_3)),model=knn1_3)

ftable(levels(pData(tstSet1_3)[,"COO"])[apply(knn1_3.pred,1,which.max)], # predictions
       pData(tstSet1_3)[,"COO"])                                  # actual
##      ABC GCB
##             
## ABC   31   3
## GCB    0  32
#features set with 6k genes

knn2_1 <- knn.estimate(dat=t(exprs(trnSet2_1)),cls=pData(trnSet2_1)[,"COO"],weight="none",k=3)
knn2_1.pred <- knn.predict(dat=t(exprs(tstSet2_1)),model=knn2_1)

ftable(levels(pData(tstSet2_1)[,"COO"])[apply(knn2_1.pred,1,which.max)], # predictions
       pData(tstSet2_1)[,"COO"])                                  # actual
##      ABC GCB
##             
## ABC   27   4
## GCB    4  32
knn2_2 <- knn.estimate(dat=t(exprs(trnSet2_2)),cls=pData(trnSet2_2)[,"COO"],weight="none",k=3)
knn2_2.pred <- knn.predict(dat=t(exprs(tstSet2_2)),model=knn2_2)

ftable(levels(pData(tstSet2_2)[,"COO"])[apply(knn2_2.pred,1,which.max)], # predictions
       pData(tstSet2_2)[,"COO"])                                  # actual
##      ABC GCB
##             
## ABC   28   4
## GCB    3  32
knn2_3 <- knn.estimate(dat=t(exprs(trnSet2_3)),cls=pData(trnSet2_3)[,"COO"],weight="none",k=3)
knn2_3.pred <- knn.predict(dat=t(exprs(tstSet2_3)),model=knn2_3)

ftable(levels(pData(tstSet2_3)[,"COO"])[apply(knn2_3.pred,1,which.max)], # predictions
       pData(tstSet2_3)[,"COO"])                                  # actual
##      ABC GCB
##             
## ABC   31   8
## GCB    0  27

1.6.5 Evaluation of classifiers

Now that we have the performance on all folds, we want to compare the classifiers. We computed the mean sensitivity, specificity and balanced accuracy across the three folds for each classifier we implemented, and we report these indices for both of the feature sets we used.
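Since the same three indices are computed for each classifier, a small helper could compute them from lists of per-fold predictions and labels. A sketch, assuming caret's sensitivity() and specificity() on factor inputs as used throughout (we keep the explicit per-classifier code below for transparency):

## mean sensitivity/specificity/balanced accuracy over folds (sketch)
foldMetrics <- function(preds, truths) {
  sens <- mean(mapply(function(p, o) sensitivity(factor(p), factor(o)), preds, truths))
  spec <- mean(mapply(function(p, o) specificity(factor(p), factor(o)), preds, truths))
  c(sensitivity = sens, specificity = spec, bal.accuracy = (sens + spec) / 2)
}
## e.g. foldMetrics(list(RF2_1.pred, RF2_2.pred, RF2_3.pred),
##                  list(tstSet2_1$COO, tstSet2_2$COO, tstSet2_3$COO))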

require(pROC)

# at first we consider the performance on the entire features set

#RANDOM FOREST
#calculate sensitivity
RF_sens_mean2<- mean( c(sensitivity(factor(RF2_1.pred), factor(tstSet2_1$COO)),
sensitivity(factor(RF2_2.pred), factor(tstSet2_2$COO)),
sensitivity(factor(RF2_3.pred), factor(tstSet2_3$COO))))

#calculate specificity
RF_spec_mean2<- mean( c(specificity(factor(RF2_1.pred), factor(tstSet2_1$COO)),
specificity(factor(RF2_2.pred), factor(tstSet2_2$COO)),
specificity(factor(RF2_3.pred), factor(tstSet2_3$COO))))

#calculate balanced accuracy
RF_bal_acc2 <- (RF_sens_mean2+RF_spec_mean2)/2

#KNN
#calculate sensitivity
knn_sens_mean2<- mean( c(sensitivity(factor(apply(knn2_1.pred,1,which.max)),factor(as.numeric(tstSet2_1$COO))),sensitivity(factor(apply(knn2_2.pred,1,which.max)), factor(as.numeric(tstSet2_2$COO))),
sensitivity(factor(apply(knn2_3.pred,1,which.max)), factor(as.numeric(tstSet2_3$COO)))))

#calculate specificity
knn_spec_mean2<- mean( c(specificity(factor(apply(knn2_1.pred,1,which.max)),factor(as.numeric(tstSet2_1$COO))),specificity(factor(apply(knn2_2.pred,1,which.max)), factor(as.numeric(tstSet2_2$COO))),
specificity(factor(apply(knn2_3.pred,1,which.max)), factor(as.numeric(tstSet2_3$COO)))))

#calculate balanced accuracy
knn_bal_acc2 <- (knn_sens_mean2+knn_spec_mean2)/2

#ADABOOST
#calculate sensitivity
adaboost_sens_mean2<- mean( c(sensitivity(factor(adaboost2_1.pred$class), factor(tstSet2_1$COO)),
sensitivity(factor(adaboost2_2.pred$class), factor(tstSet2_2$COO)),
sensitivity(factor(adaboost2_3.pred$class), factor(tstSet2_3$COO))))

#calculate specificity
adaboost_spec_mean2<- mean( c(specificity(factor(adaboost2_1.pred$class), factor(tstSet2_1$COO)),
specificity(factor(adaboost2_2.pred$class), factor(tstSet2_2$COO)),
specificity(factor(adaboost2_3.pred$class), factor(tstSet2_3$COO))))

#calculate balanced accuracy
adaboost_bal_acc2 <- (adaboost_sens_mean2+adaboost_spec_mean2)/2

#LASSO REGRESSION
#calculate sensitivity
lasso_sens_mean2<- mean( c(sensitivity(factor(regression2_1.pred), factor(tstSet2_1$COO)),
sensitivity(factor(regression2_2.pred), factor(tstSet2_2$COO)),
sensitivity(factor(regression2_3.pred), factor(tstSet2_3$COO))))

#calculate specificity
lasso_spec_mean2<- mean( c(specificity(factor(regression2_1.pred), factor(tstSet2_1$COO)),
specificity(factor(regression2_2.pred), factor(tstSet2_2$COO)),
specificity(factor(regression2_3.pred), factor(tstSet2_3$COO))))

#calculate balanced accuracy
lasso_bal_acc2 <- (lasso_sens_mean2+lasso_spec_mean2)/2

#final results

result2 <- cbind(c(RF_sens_mean2, knn_sens_mean2,adaboost_sens_mean2,lasso_sens_mean2),c(RF_spec_mean2,knn_spec_mean2,adaboost_spec_mean2,lasso_spec_mean2),c(RF_bal_acc2,knn_bal_acc2,adaboost_bal_acc2,lasso_bal_acc2))
rownames(result2)<-c("RF","KNN","AdABoost", "LASSO")
colnames(result2)<-c("sensitivity","specificity","Balanced accuracy")
result2
##          sensitivity specificity Balanced accuracy
## RF         0.9677419   1.0000000         0.9838710
## KNN        0.9247312   0.8497354         0.8872333
## AdABoost   0.9784946   0.9814815         0.9799881
## LASSO      0.9677419   0.9812169         0.9744794
# now we consider the features set with only the top genes

#RANDOM FOREST
#calculate sensitivity
RF_sens_mean1<- mean( c(sensitivity(factor(RF1_1.pred), factor(tstSet1_1$COO)),
sensitivity(factor(RF1_2.pred), factor(tstSet1_2$COO)),
sensitivity(factor(RF1_3.pred), factor(tstSet1_3$COO))))

#calculate specificity
RF_spec_mean1<- mean( c(specificity(factor(RF1_1.pred), factor(tstSet1_1$COO)),
specificity(factor(RF1_2.pred), factor(tstSet1_2$COO)),
specificity(factor(RF1_3.pred), factor(tstSet1_3$COO))))

#calculate balanced accuracy
RF_bal_acc1 <- (RF_sens_mean1+RF_spec_mean1)/2


#KNN
#calculate sensitivity
knn_sens_mean1<- mean( c(sensitivity(factor(apply(knn1_1.pred,1,which.max)),factor(as.numeric(tstSet1_1$COO))),sensitivity(factor(apply(knn1_2.pred,1,which.max)), factor(as.numeric(tstSet1_2$COO))),
sensitivity(factor(apply(knn1_3.pred,1,which.max)), factor(as.numeric(tstSet1_3$COO)))))

#calculate specificity
knn_spec_mean1<- mean( c(specificity(factor(apply(knn1_1.pred,1,which.max)),factor(as.numeric(tstSet1_1$COO))),specificity(factor(apply(knn1_2.pred,1,which.max)), factor(as.numeric(tstSet1_2$COO))),
specificity(factor(apply(knn1_3.pred,1,which.max)), factor(as.numeric(tstSet1_3$COO)))))

#calculate balanced accuracy
knn_bal_acc1 <- (knn_sens_mean1+knn_spec_mean1)/2


#ADABOOST
#calculate sensitivity
adaboost_sens_mean1<- mean( c(sensitivity(factor(adaboost1_1.pred$class), factor(tstSet1_1$COO)),
sensitivity(factor(adaboost1_2.pred$class), factor(tstSet1_2$COO)),
sensitivity(factor(adaboost1_3.pred$class), factor(tstSet1_3$COO))))

#calculate specificity
adaboost_spec_mean1<- mean( c(specificity(factor(adaboost1_1.pred$class), factor(tstSet1_1$COO)),
specificity(factor(adaboost1_2.pred$class), factor(tstSet1_2$COO)),
specificity(factor(adaboost1_3.pred$class), factor(tstSet1_3$COO))))

#calculate balanced accuracy
adaboost_bal_acc1 <- (adaboost_sens_mean1+adaboost_spec_mean1)/2

#LASSO REGRESSION
#calculate sensitivity
lasso_sens_mean1<- mean( c(sensitivity(factor(regression1_1.pred), factor(tstSet1_1$COO)),
sensitivity(factor(regression1_2.pred), factor(tstSet1_2$COO)),
sensitivity(factor(regression1_3.pred), factor(tstSet1_3$COO))))

#calculate specificity
lasso_spec_mean1<- mean( c(specificity(factor(regression1_1.pred), factor(tstSet1_1$COO)),
specificity(factor(regression1_2.pred), factor(tstSet1_2$COO)),
specificity(factor(regression1_3.pred), factor(tstSet1_3$COO))))

#calculate balanced accuracy
lasso_bal_acc1 <- (lasso_sens_mean1+lasso_spec_mean1)/2

#final results

result1 <- cbind(c(RF_sens_mean1, knn_sens_mean1,adaboost_sens_mean1,lasso_sens_mean1),c(RF_spec_mean1,knn_spec_mean1,adaboost_spec_mean1,lasso_spec_mean1),c(RF_bal_acc1,knn_bal_acc1,adaboost_bal_acc1,lasso_bal_acc1))
rownames(result1)<-c("RF","KNN", "AdABoost","LASSO")
colnames(result1)<-c("sensitivity","specificity","Balanced accuracy")
result1
##          sensitivity specificity Balanced accuracy
## RF         0.9677419   1.0000000         0.9838710
## KNN        0.9462366   0.9529101         0.9495733
## AdABoost   0.9784946   0.9907407         0.9846177
## LASSO      0.9677419   0.9812169         0.9744794

As we can see, we achieve excellent performance with every classifier we implemented. In particular, the performance of Random Forest, AdaBoost and LASSO regression is high and remains almost the same whether we use the full gene set or only the most significant signatures. This is because these methods perform a feature selection by construction, so they work well in both situations. The same is not true of KNN, whose performance improves significantly when it is given the smaller feature set.

The performances are very similar and very high, and the variation between folds is very small. We therefore saw no point in applying statistical tests to assess whether there are significant differences between the classifiers (as the tables show, there are none of note) or in computing confidence intervals for the performance indices.

1.6.6 Final Results

Finally, we apply one of our classifiers to the cohort of 181 patients treated with CHOP. We chose the Random Forest model with the reduced feature set. We trained the model on the entire training set and then predicted the class of each sample in the test set. In this section we report the confusion matrix, sensitivity, specificity, ROC curve and balanced accuracy.

RF <- randomForest(x=t(exprs(trainingSet1)),y=factor(trainingSet1$COO),ntree=1000,importance=TRUE)
print(RF)
## 
## Call:
##  randomForest(x = t(exprs(trainingSet1)), y = factor(trainingSet1$COO),      ntree = 1000, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 35
## 
##         OOB estimate of  error rate: 0.5%
## Confusion matrix:
##     ABC GCB class.error
## ABC  92   1  0.01075269
## GCB   0 107  0.00000000
# start by selecting the CHOP cohort 
testSet <- DATAF[,DATAF$Treatment_Regimen %in% c("CHOP-Like")]

## let's order samples by the phenotype
testSet <- testSet[,order(testSet$COO)]
RF.pred <- predict(RF, t(exprs(testSet)))
table(observed = testSet$COO, predicted = RF.pred)
##         predicted
## observed ABC GCB
##      ABC  74   0
##      GCB   5  71
#calculate sensitivity
sens <- sensitivity(factor(RF.pred), factor(testSet$COO))
sens
## [1] 1
#calculate specificity
spec <- specificity(factor(RF.pred), factor(testSet$COO))
spec
## [1] 0.9342105
RF.pred2 <- predict(RF, t(exprs(testSet)), "prob")
roc(testSet$COO, RF.pred2[,1], plot = TRUE)

## 
## Call:
## roc.default(response = testSet$COO, predictor = RF.pred2[, 1],     plot = TRUE)
## 
## Data: RF.pred2[, 1] in 74 controls (testSet$COO ABC) > 76 cases (testSet$COO GCB).
## Area under the curve: 0.9993
bal_acc <- (sens+spec)/2
bal_acc
## [1] 0.9671053

As we can see, with this approach we are able to build a highly accurate predictor starting from the gene expression matrix obtained from microarray data.

2. Analysis of an RNA-seq HNSC Dataset

For many applications we are interested in measuring the absolute or relative expression of each mRNA in the cell, so in this second problem we analyse another type of data: RNA-seq count data. In recent years a sequencing-based methodology, RNA-seq, has been replacing microarrays for the study of gene expression. In this method millions of short sequences, called reads, are sequenced from random positions of the input RNAs. These reads can then be computationally mapped onto a reference genome to reveal a transcriptional map, where the number of reads aligned to each gene, called counts, gives a measure of its expression level. In this section we analyse the TCGA head and neck squamous carcinoma (HNSC) dataset, which can be downloaded as raw counts. We start by removing all mitochondrial genes, as well as non-coding RNAs and Ensembl IDs without an associated gene symbol; this considerably reduces the number of rows of the data. We also 'merge' multiple Ensembl IDs associated to the same gene symbol.

library(DESeq2)
library(edgeR)
library(BiocParallel)

HNSC_rawCounts<-readRDS(paste(PATH,"HNSC_htseq_raw_counts.anonymized.RDS",sep=""))
HNSC_rawCounts
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 56885 features, 504 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: sample001 sample002 ... sample504 (504 total)
##   varLabels: patient.age_at_initial_pathologic_diagnosis
##     patient.anatomic_neoplasm_subdivision ...
##     patient.year_of_tobacco_smoking_onset (25 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: ENSG00000210049 ENSG00000211459 ...
##     ENSG00000207477 (56885 total)
##   fvarLabels: ensembl_gene_id entrezgene ... description (9 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: RNASeq raw counts
#remove the MT genes

HNSC_rawCounts<-subset(HNSC_rawCounts,fData(HNSC_rawCounts)$chromosome_name != "MT")

#remove the genes with hgnc_symbol = ""

HNSC_rawCounts<-subset(HNSC_rawCounts,fData(HNSC_rawCounts)$hgnc_symbol != "")
HNSC_rawCounts.collapsed<-collapseByMedian(HNSC_rawCounts, "hgnc_symbol")

#this is the list of non coding
hnsc <- subset(HNSC_rawCounts.collapsed, regexpr("non-protein",strsplit(fData(HNSC_rawCounts.collapsed)$description, " "))!=-1) 

HNSC <- subset(HNSC_rawCounts.collapsed,!(fData(HNSC_rawCounts.collapsed)$hgnc_symbol %in% fData(hnsc)$hgnc_symbol))

exprs(HNSC)<-round(exprs(HNSC))

saveRDS(HNSC, paste(PATH, "/", "HNSC.RDS", sep = ""))

Now we need to filter out genes with mostly zero counts: a large number of rows contain only zeros, and we want to eliminate them. We also grouped the histologic grades (g1, g2, g3 and g4) to obtain a binary class problem: the first class is "g1" (grouping g1 and g2) and the second is "g4" (grouping g3 and g4). The stage was likewise grouped into "stage i" (stages i, ii and iii) and "stage iv". After this step it is also useful to apply a log-transform: since count values for a gene can be zero in some conditions (and non-zero in others), pseudocounts are commonly used. The log-transformed data are useful for clustering, PCA and classification; differential-expression methods, on the other hand, require raw count data, so we keep two separate data sets.
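For intuition, the log-CPM transform applied below with edgeR's cpm(..., log = TRUE) amounts to a counts-per-million scaling followed by a log2 with a pseudocount; a hand-rolled sketch of the idea (edgeR's actual implementation uses a slightly different, library-size-adjusted prior count):

## hand-rolled log-CPM with a pseudocount of 1 (illustration only)
manual_logcpm <- function(counts) {
  log2(sweep(counts, 2, colSums(counts) / 1e6, "/") + 1)
}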

#HNSC<-readRDS(paste(PATH,"HNSC.RDS",sep=""))

# now we remove the genes with mostly zero counts
HNSCF <- HNSC[rowSums(exprs(HNSC))>100,]

# we replace g2 with g1 and g3 with g4. We obtain 2 class g1 (g1-2) and g4 (g3-4)
ind2 <- which(pData(HNSCF)$patient.neoplasm_histologic_grade == "g2")
pData(HNSCF)$patient.neoplasm_histologic_grade<-replace(pData(HNSCF)$patient.neoplasm_histologic_grade,ind2,"g1")

ind3 <- which(pData(HNSCF)$patient.neoplasm_histologic_grade == "g3")
pData(HNSCF)$patient.neoplasm_histologic_grade<-replace(pData(HNSCF)$patient.neoplasm_histologic_grade,ind3,"g4")

HNSCF <- HNSCF[,pData(HNSCF)$patient.neoplasm_histologic_grade %in% c("g1","g4")]
pData(HNSCF)$patient.neoplasm_histologic_grade <- factor(pData(HNSCF)$patient.neoplasm_histologic_grade)

#we replace stage ii and iii with stage i

ind4 <- which(pData(HNSCF)$patient.stage_event.clinical_stage == "stage ii")
pData(HNSCF)$patient.stage_event.clinical_stage<-replace(pData(HNSCF)$patient.stage_event.clinical_stage,ind4,"stage i")

ind5 <- which(pData(HNSCF)$patient.stage_event.clinical_stage == "stage iii")
pData(HNSCF)$patient.stage_event.clinical_stage<-replace(pData(HNSCF)$patient.stage_event.clinical_stage,ind5,"stage i")

HNSCF <- HNSCF[,pData(HNSCF)$patient.stage_event.clinical_stage %in% c("stage i","stage iv")]
pData(HNSCF)$patient.stage_event.clinical_stage <- factor(pData(HNSCF)$patient.stage_event.clinical_stage)

HNSCF_log <-HNSCF

#exprs(HNSCF_log) <- log2(exprs(HNSCF)+1) 

exprs(HNSCF_log) <- cpm(exprs(HNSCF_log),log = TRUE)

2.1 Principal Component Analysis (PCA)

As in the previous problem, we start by performing a Principal Component Analysis and plotting the data onto the first two PCs, and onto the 1st and 3rd PCs, coloring the data points by grade and by stage.

require(rgl)
require(ggplot2)

# run PCA
data.pca <- prcomp(t(exprs(HNSCF_log)))  
data.pca1.2 <- data.pca$x[,1:2] # take the first 2 components
data.pca1.3 <- data.pca$x[,1:3] # take the first 3 components (for the 1st vs 3rd PC plot)

(summary(data.pca)$importance)[,1:50]
##                             PC1      PC2      PC3      PC4      PC5
## Standard deviation     87.33942 62.06484 57.43471 49.32885 42.84617
## Proportion of Variance  0.11914  0.06016  0.05152  0.03801  0.02867
## Cumulative Proportion   0.11914  0.17931  0.23083  0.26884  0.29751
##                             PC6      PC7      PC8      PC9     PC10
## Standard deviation     38.84199 36.90942 32.68699 29.10243 27.91820
## Proportion of Variance  0.02356  0.02128  0.01669  0.01323  0.01217
## Cumulative Proportion   0.32107  0.34235  0.35904  0.37227  0.38444
##                            PC11     PC12     PC13     PC14     PC15
## Standard deviation     27.20058 26.26271 24.68169 23.40034 22.42204
## Proportion of Variance  0.01156  0.01077  0.00951  0.00855  0.00785
## Cumulative Proportion   0.39600  0.40677  0.41628  0.42484  0.43269
##                            PC16     PC17     PC18     PC19     PC20
## Standard deviation     21.04075 20.64683 20.07221 19.58994 18.55403
## Proportion of Variance  0.00691  0.00666  0.00629  0.00599  0.00538
## Cumulative Proportion   0.43960  0.44626  0.45255  0.45855  0.46392
##                            PC21     PC22     PC23     PC24     PC25
## Standard deviation     17.98850 17.73216 17.46801 17.17894 16.79517
## Proportion of Variance  0.00505  0.00491  0.00477  0.00461  0.00441
## Cumulative Proportion   0.46898  0.47389  0.47866  0.48327  0.48767
##                            PC26     PC27     PC28     PC29     PC30
## Standard deviation     16.54065 16.06808 15.89667 15.65266 15.41173
## Proportion of Variance  0.00427  0.00403  0.00395  0.00383  0.00371
## Cumulative Proportion   0.49194  0.49598  0.49992  0.50375  0.50746
##                            PC31     PC32     PC33     PC34     PC35
## Standard deviation     15.04511 14.94810 14.82930 14.70877 14.35990
## Proportion of Variance  0.00354  0.00349  0.00343  0.00338  0.00322
## Cumulative Proportion   0.51100  0.51449  0.51792  0.52130  0.52452
##                            PC36     PC37     PC38     PC39     PC40
## Standard deviation     14.21256 14.10586 14.05190 13.86897 13.62545
## Proportion of Variance  0.00315  0.00311  0.00308  0.00300  0.00290
## Cumulative Proportion   0.52767  0.53078  0.53387  0.53687  0.53977
##                            PC41     PC42     PC43     PC44     PC45
## Standard deviation     13.58755 13.50413 13.40065 13.30690 13.14465
## Proportion of Variance  0.00288  0.00285  0.00280  0.00277  0.00270
## Cumulative Proportion   0.54265  0.54550  0.54831  0.55107  0.55377
##                            PC46     PC47     PC48     PC49     PC50
## Standard deviation     12.97888 12.95321 12.89655 12.80880 12.71780
## Proportion of Variance  0.00263  0.00262  0.00260  0.00256  0.00253
## Cumulative Proportion   0.55640  0.55902  0.56162  0.56418  0.56671
## extract label
col.palette <- c("green","red","blue","black","magenta")

#HISTOLOGIC GRADE
colors <- col.palette[as.numeric(as.factor(HNSCF_log$patient.neoplasm_histologic_grade))]
## basic 2D plot
plot(x=data.pca1.2[,1],y=data.pca1.2[,2],pch=20,xlab="1st PCA",ylab="2nd PCA",col=colors)
legend("topleft",pch=20,legend=levels(factor(HNSCF_log$patient.neoplasm_histologic_grade)), col=col.palette)

plot(x=data.pca1.3[,1],y=data.pca1.3[,3],pch=20,xlab="1st PCA",ylab="3rd PCA",col=colors)
legend("topleft",pch=20,legend=levels(factor(HNSCF_log$patient.neoplasm_histologic_grade)), col=col.palette)

#STAGE EVENT
colors <- col.palette[as.numeric(as.factor(HNSCF_log$patient.stage_event.clinical_stage))]
## basic 2D plot
plot(x=data.pca1.2[,1],y=data.pca1.2[,2],pch=20,xlab="1st PCA",ylab="2nd PCA",col=colors)
legend("topleft",pch=20,legend=levels(as.factor(HNSCF_log$patient.stage_event.clinical_stage)), col=col.palette)

plot(x=data.pca1.3[,1],y=data.pca1.3[,3],pch=20,xlab="1st PCA",ylab="3rd PCA",col=colors)
legend("topleft",pch=20,legend=levels(as.factor(HNSCF_log$patient.stage_event.clinical_stage)), col=col.palette)

In this case we are not able to identify natural clusters in the data, and there is a lot of noise. However, we are looking at only two components at a time, and since the proportion of variance they explain is very low, a more accurate analysis would have to consider more components. In the next section we try hierarchical clustering to look for clusters in the data.
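To gauge how many components would be needed, we can plot the cumulative proportion of variance explained, using the data.pca object computed above:

## cumulative variance explained by the first 50 PCs
cumvar <- summary(data.pca)$importance["Cumulative Proportion", 1:50]
plot(cumvar, type = "b", pch = 20,
     xlab = "number of components",
     ylab = "cumulative proportion of variance explained")
abline(h = 0.5, lty = 2)  # about 30 PCs are needed just to reach 50%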

2.2 Hierarchical clustering

HNSCF2_log <- variationFilter(HNSCF_log,ngenes=3000, score="sd",do.plot=TRUE)
## Variation filtering based on sd .. done.
## Selecting top 3000 by sd .. done, 3000 genes selected.
## Creating scatter plot ..

## done.
distmatrix.col<-dist(t(exprs(HNSCF2_log)))
## using ward clustering 
hc01.col <- hclust(distmatrix.col,method="ward.D") 
hc01.col 
## 
## Call:
## hclust(d = distmatrix.col, method = "ward.D")
## 
## Cluster method   : ward.D 
## Distance         : Euclidean 
## Number of objects: 464
## calculate distance matrix for rows: use correlation for rows/genes
distmatrix.row<-as.dist(1-cor(t(exprs(HNSCF2_log))))
## using ward clustering
hc01.row <- hclust(distmatrix.row,method="ward.D") 

## making heatmap for grade
levels(as.factor(HNSCF2_log$patient.neoplasm_histologic_grade))
## [1] "g1" "g4"
col.palette <- c("green", "purple","red")
colors <- col.palette[as.numeric(as.factor(HNSCF2_log$patient.neoplasm_histologic_grade))]
heatmaptitle <- paste(" ", "top sd-filtered 3k genes", sep = "")
heatmap.2(exprs(HNSCF2_log),
          trace='none',
          main=heatmaptitle,
          col=bluered(64),
          dendrogram='both',
          scale='row',
          ColSideColors=colors,
          Colv=as.dendrogram(hc01.col),
          Rowv=as.dendrogram(hc01.row),
          margins=c(6,6))
legend("topright", c('g1','g2'),
       pch=15,pt.cex=2,col=col.palette,ncol=1,title.adj=0,cex=1,bty='n')

## making heatmap for stage
levels(as.factor(HNSCF2_log$patient.stage_event.clinical_stage))
## [1] "stage i"  "stage iv"
col.palette <- c("green", "purple","red","lightblue", "pink")
colors <- col.palette[as.numeric(as.factor(HNSCF2_log$patient.stage_event.clinical_stage))]
heatmaptitle <- paste(" ", "top sd-filtered 3k genes", sep = "")
heatmap.2(exprs(HNSCF2_log),
          trace='none',
          main=heatmaptitle,
          col=bluered(64),
          dendrogram='both',
          scale='row',
          ColSideColors=colors,
          Colv=as.dendrogram(hc01.col),
          Rowv=as.dendrogram(hc01.row),
          margins=c(6,6))
legend("topright", levels(as.factor(HNSCF_log$patient.stage_event.clinical_stage)),
       pch=15,pt.cex=2,col=col.palette,ncol=1,title.adj=0,cex=1,bty='n')

As in the PCA above, hierarchical clustering does not reveal any natural clusters in these data, and there is a lot of noise. This case is therefore different from the first one, and building a highly accurate classifier will probably be difficult.

2.3 RNA-seq differential expression analysis with DESeq2 and edgeR

Just as we did in the previous problem with microarray data, we now want to perform a differential expression analysis. A gene is declared differentially expressed if an observed difference in read counts between two experimental conditions is statistically significant, i.e. greater than what would be expected from random variation alone. In particular, we want to perform differential analysis with respect to grade (g1 vs. g4) and stage (stage i-iii vs. stage iv) using edgeR and DESeq2, and compare the results. We also want to use the differential analysis as a feature-selection method, so before starting we split the dataset into a 60/40 stratified pair of datasets, using the 60% portion as the discovery set and the 40% portion as the validation set. Since this is a 'supervised' feature selection, to ensure that the results of our classifiers are not biased we perform the differential analysis only on the discovery set, as in the previous problem.

Let’s start by splitting the dataset and writing wrapper functions for each tool:

set.seed(3456) # for reproducible results
trainIndex <- createDataPartition(HNSCF2_log$patient.neoplasm_histologic_grade, p = 0.6, list = FALSE, times = 1)

trainingSet <- HNSCF2_log[,trainIndex]
testSet <- HNSCF2_log[,-trainIndex]

HNSCF.allSamples <- HNSCF
HNSCF<- HNSCF[,trainIndex]

# we consider only the 3k best genes, but not log-transformed
HNSCF <- subset(HNSCF,fData(HNSCF)$hgnc_symbol %in% fData(HNSCF2_log)$hgnc_symbol)

## wrapper for running DESeq2
##
run_deseq <- function(eset, class_id, group1, group2){
 
  group1_inds <- which(pData(eset)[, class_id] %in% group1)
  group2_inds <- which(pData(eset)[, class_id] %in% group2)
  eset.group1 <- eset[, group1_inds]
  eset.group2 <- eset[, group2_inds]
  eset.compare <- eset[, c(group1_inds, group2_inds)]

  #make deseq2 compliant dataset
  colData <- data.frame(condition=as.character(pData(eset.compare)[, class_id]))
  dds <- DESeqDataSetFromMatrix(exprs(eset.compare), colData, formula( ~ condition))

  #set reference to control, otherwise default is alphabetical order
  dds$condition <- factor(dds$condition, levels=c(group1,group2))

  #run deseq2
  #3 steps:
  #1.) estimate size factors
  #2.) estimate dispersion
  #3.) negative binomial GLM fitting and wald test
  dds_res <- DESeq(dds)
  res <- results(dds_res)
  return(res)
}
## wrapper for running edgeR
##
run_edgeR <- function(eset, class_id, group1, group2){
  

  group1_inds <- which(pData(eset)[, class_id] %in% group1)
  group2_inds <- which(pData(eset)[, class_id] %in% group2)

  eset.group1 <- eset[, group1_inds]
  eset.group2 <- eset[, group2_inds]
  eset.compare <- eset[, c(group1_inds, group2_inds)]
  condition <- as.character(pData(eset.compare)[, class_id])

  y <- DGEList(counts=exprs(eset.compare), group = condition)
  y <- calcNormFactors(y)
  y <- estimateGLMCommonDisp(y)
  y <- estimateGLMTrendedDisp(y)
  y <- estimateGLMTagwiseDisp(y)
  et <- exactTest(y)
  res <- topTags(et, n = nrow(eset.compare),  sort.by = "none")
  return(res)
}

## optional: reattach empirical measurements / gene annotation to DE results
##
summarize_results <- function(res, eset, class_id, group1, group2){
  group1_inds <- which(pData(eset)[, class_id] %in% group1)
  group2_inds <- which(pData(eset)[, class_id] %in% group2)
  eset.group1 <- eset[, group1_inds]
  eset.group2 <- eset[, group2_inds]

  eset.ordered <- match(rownames(res), rownames(eset))
  res <- cbind(res, fData(eset)[eset.ordered,])
  res$rowmeans.group1 <- rowMeans(exprs(eset.group1)[eset.ordered,])
  res$rowmeans.group2 <- rowMeans(exprs(eset.group2)[eset.ordered,])

  res$log2fc <- log2(res$rowmeans.group2/res$rowmeans.group1)
  index <- (res$rowmeans.group2 <= res$rowmeans.group1)
  res$log2fc[index] <- -abs(res$log2fc[index])
  return(res)
}
Now the first package that we want to use is DESeq2. As input, the DESeq2 package expects count data in the form of a matrix of integer values. The method is based on negative binomial generalized linear models; the estimates of dispersion and logarithmic fold changes incorporate data-driven prior distributions. The DESeq() function performs a default analysis in three steps: estimation of size factors, estimation of dispersions, and negative binomial GLM fitting with a Wald test.

#grade
res_deseq2 <- run_deseq(eset = HNSCF, class_id = "patient.neoplasm_histologic_grade",
group1 = "g1", group2 = "g4")

#stage
res_deseq2_stage <- run_deseq(eset = HNSCF, class_id = "patient.stage_event.clinical_stage",
group1 = "stage i", group2 = "stage iv")

The second method is edgeR. The package implements statistical methods based on generalized linear models; its GLM functions can test for differential expression using either likelihood ratio tests or quasi-likelihood F-tests (a sketch of the latter pipeline is given below). A particular feature of edgeR is its empirical Bayes methods, which permit the estimation of gene-specific biological variation even for experiments with minimal levels of biological replication.
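For reference, this is what the quasi-likelihood pipeline mentioned above would look like (a sketch, assuming a count matrix counts and a two-level factor condition as constructed inside run_edgeR; our run_edgeR wrapper uses the exact test instead):

## edgeR quasi-likelihood F-test pipeline (sketch)
y      <- DGEList(counts = counts, group = condition)
y      <- calcNormFactors(y)
design <- model.matrix(~ condition)
y      <- estimateDisp(y, design)
fit    <- glmQLFit(y, design)
qlf    <- glmQLFTest(fit, coef = 2)   # test the condition coefficient
res_ql <- topTags(qlf, n = nrow(y))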

## grade
res_edgeR <- run_edgeR(eset = HNSCF, class_id = "patient.neoplasm_histologic_grade",
 group1 = "g1", group2 = "g4")

#stage
res_edgeR_stage <- run_edgeR(eset = HNSCF, class_id = "patient.stage_event.clinical_stage",
 group1 = "stage i", group2 = "stage iv")

2.3.1 Comparing results for grade

Now we want to compare the grade results from DESeq2 and edgeR and visualize the overlap as a Venn diagram.

## cumulative number of sig genes below mean expression
res_summary <- data.frame(deseq2_padj=res_deseq2$padj, 
  edgeR_padj = res_edgeR$table$FDR,
  deseq2_logfc=res_deseq2$log2FoldChange, 
  edgeR_logfc = res_edgeR$table$logFC,
  mean_exprs = rowMeans(log10(exprs(HNSCF)+1)))

## for plotting purposes
exprs_breaks <- seq(0, log10(max(as.numeric(exprs(HNSCF)+1))), 0.01)

## cumulative significant genes vs. log10 mean expression
csg_DESeq2 <- sapply(exprs_breaks, 
  function(x) nrow(subset(res_summary, mean_exprs < x & deseq2_padj < 0.1)))

csg_edgeR <- sapply(exprs_breaks, 
  function(x) nrow(subset(res_summary, mean_exprs < x & edgeR_padj < 0.1)))

plot(exprs_breaks, csg_DESeq2, 
  type = "l",
  col = "red",
  xaxt="n",
  xlab = "log10 expression", 
  ylab = "cumulative significant genes (fdr 0.05)",
  main = "Cumulative number of significant genes by log10 count")
lines(exprs_breaks, csg_edgeR, type = "l", col = "blue")

labels <- sapply(1:5,function(i) as.expression(bquote(10^ .(i))))
axis(1,at=1:5,labels=labels)
legend("right", c("DEseq2", "edgeR"), 
  col=c("red", "blue"), lwd=10)

## normalization to proportion of significant genes instead of absolute number
##
plot(exprs_breaks, csg_DESeq2/max(csg_DESeq2), 
  type = "l",
  col = "red",
  xaxt="n",
  xlab = "log10 expression", 
  ylab = "cumulative proportion of significant genes (fdr 0.05)",
  main = "Cumulative proportion of significant genes by log10 count")
lines(exprs_breaks, csg_edgeR/max(csg_edgeR), type = "l", col = "blue")
labels <- sapply(1:5,function(i) as.expression(bquote(10^ .(i))))
axis(1,at=1:5,labels=labels)
legend("right", c("DEseq2", "edgeR"), 
  col=c("red", "blue"), lwd=10)

## histogram of proportion of sig genes by mean expression
##
hist(subset(res_summary, deseq2_padj <0.1)[, "mean_exprs"], freq=FALSE,
  col=rgb(1,0,0,1/4), xlab = "log10 expression", 
  ylab = "number of sig. genes (fdr 0.05)",
  main = "Proportion of number of significant genes by log10 count", xaxt = "n")
hist(subset(res_summary, edgeR_padj <0.1)[, "mean_exprs"],freq=FALSE,
  col=rgb(0,0,1,1/4), add = T)
labels <- sapply(1:5,function(i)
            as.expression(bquote(10^ .(i)))
          )
axis(1,at=1:5,labels=labels)
box()
legend("topright", c("DEseq2", "edgeR"), 
  col=c(rgb(1,0,0,1/4), rgb(0,0,1,1/4), rgb(1,1,0,1/4)), lwd=10)

#expected results: DESeq claims edgeR is anti-conservative for lowly expressed genes, and 
#more conservative for strongly expressed genes (weakly expressed genes overrepresented)

#Comparing MA-plots: Expression level vs. log Fold Change

plot(res_summary$mean_exprs, res_summary$deseq2_logfc, 
  pch = ".",
  col = factor(res_summary$deseq2_padj<0.1),
  xaxt="n",
  xlab = "expression", 
  ylab = "log fold change",
  main = "MA-plot Deseq2", ylim = c(-2, 2))
abline(0,0, col = "red")
labels <- sapply(1:5,function(i) as.expression(bquote(10^ .(i))))
axis(1,at=1:5,labels=labels)

plot(res_summary$mean_exprs, res_summary$edgeR_logfc, 
  pch = ".",
  col = factor(res_summary$deseq2_padj<0.1),
  xaxt="n",
  xlab = "expression", 
  ylab = "log fold change",
  main = "MA-plot edgeR", ylim = c(-2, 2))
abline(0,0, col = "red")
labels <- sapply(1:5,function(i) as.expression(bquote(10^ .(i))))
axis(1,at=1:5,labels=labels)

#Venn Diagrams of Gene Signatures

diff_list <- list(DESeq2 = rownames(subset(res_summary, deseq2_padj < 0.1)),
  edgeR = rownames(subset(res_summary, edgeR_padj < 0.1)))

fill <- c("light blue", "pink")
size  <- rep(0.5,2)
venn <- venn.diagram(x = diff_list, 
        filename = NULL,
      height = 2000,
      width = 2000, fill = fill,
      cat.default.pos = "text", 
      cat.cex = size,
      main = "Overlap of Signficant genes (FDR 0.1)")
grid.newpage()
grid.draw(venn)

In these plots we note that there are some differences between the two methods. In particular we note, just as we expected, that edgeR is much more restrictive and selects fewer genes as significant.

2.3.2 Comparing results for stage

Now we compare the results for the stage.

## cumulative number of sig genes below mean expression
res_summary_stage <- data.frame(deseq2_padj_stage=res_deseq2_stage$padj, 
  edgeR_padj_stage = res_edgeR_stage$table$FDR,
  deseq2_logfc_stage=res_deseq2_stage$log2FoldChange, 
  edgeR_logfc_stage = res_edgeR_stage$table$logFC,
  mean_exprs = rowMeans(log10(exprs(HNSCF)+1)))

## for plotting purposes
exprs_breaks <- seq(0, log10(max(as.numeric(exprs(HNSCF)+1))), 0.01)

## cumulative significant genes vs. log10 mean expression
csg_DESeq2_stage <- sapply(exprs_breaks, 
  function(x) nrow(subset(res_summary_stage, mean_exprs < x & deseq2_padj_stage < 0.1)))

csg_edgeR_stage <- sapply(exprs_breaks, 
  function(x) nrow(subset(res_summary_stage, mean_exprs < x & edgeR_padj_stage < 0.1)))

plot(exprs_breaks, csg_DESeq2_stage, 
  type = "l",
  col = "red",
  xaxt="n",
  xlab = "log10 expression", 
  ylab = "cumulative significant genes (fdr 0.05)",
  main = "Cumulative number of significant genes by log10 count")
lines(exprs_breaks, csg_edgeR_stage, type = "l", col = "blue")

labels <- sapply(1:5,function(i) as.expression(bquote(10^ .(i))))
axis(1,at=1:5,labels=labels)
legend("right", c("DEseq2", "edgeR"), 
  col=c("red", "blue"), lwd=10)

## normalization to proportion of significant genes instead of absolute number
##
plot(exprs_breaks, csg_DESeq2_stage/max(csg_DESeq2_stage), 
  type = "l",
  col = "red",
  xaxt="n",
  xlab = "log10 expression", 
  ylab = "cumulative proportion of significant genes (fdr 0.05)",
  main = "Cumulative proportion of significant genes by log10 count")
lines(exprs_breaks, csg_edgeR_stage/max(csg_edgeR_stage), type = "l", col = "blue")
labels <- sapply(1:5,function(i) as.expression(bquote(10^ .(i))))
axis(1,at=1:5,labels=labels)
legend("right", c("DEseq2", "edgeR"), 
  col=c("red", "blue"), lwd=10)

## histogram of proportion of sig genes by mean expression
##
hist(subset(res_summary_stage, deseq2_padj_stage <0.1)[, "mean_exprs"], freq=FALSE,
  col=rgb(1,0,0,1/4), xlab = "log10 expression", 
  ylab = "number of sig. genes (fdr 0.05)",
  main = "Proportion of number of significant genes by log10 count", xaxt = "n")
hist(subset(res_summary_stage, edgeR_padj_stage <0.1)[, "mean_exprs"],freq=FALSE,
  col=rgb(0,0,1,1/4), add = T)
labels <- sapply(1:5,function(i)
            as.expression(bquote(10^ .(i)))
          )
axis(1,at=1:5,labels=labels)
box()
legend("topright", c("DEseq2", "edgeR"), 
  col=c(rgb(1,0,0,1/4), rgb(0,0,1,1/4), rgb(1,1,0,1/4)), lwd=10)

#Comparing MA-plots: Expression level vs. log Fold Change

plot(res_summary_stage$mean_exprs, res_summary_stage$deseq2_logfc_stage, 
  pch = ".",
  col = factor(res_summary_stage$deseq2_padj_stage<0.1),
  xaxt="n",
  xlab = "expression", 
  ylab = "log fold change",
  main = "MA-plot Deseq2", ylim = c(-2, 2))
abline(0,0, col = "red")
labels <- sapply(1:5,function(i) as.expression(bquote(10^ .(i))))
axis(1,at=1:5,labels=labels)

plot(res_summary_stage$mean_exprs, res_summary_stage$edgeR_logfc_stage, 
  pch = ".",
  col = factor(res_summary_stage$deseq2_padj_stage<0.1),
  xaxt="n",
  xlab = "expression", 
  ylab = "log fold change",
  main = "MA-plot edgeR", ylim = c(-2, 2))
abline(0,0, col = "red")
labels <- sapply(1:5,function(i) as.expression(bquote(10^ .(i))))
axis(1,at=1:5,labels=labels)

#Venn Diagrams of Gene Signatures

diff_list_stage <- list(DESeq2_stage = rownames(subset(res_summary_stage, deseq2_padj_stage < 0.1)),
  edgeR_stage = rownames(subset(res_summary_stage, edgeR_padj_stage < 0.1)))

fill <- c("light blue", "pink")
size  <- rep(0.5,2)
venn <- venn.diagram(x = diff_list_stage, 
        filename = NULL,
      height = 2000,
      width = 2000, fill = fill,
      cat.default.pos = "text", 
      cat.cex = size,
      main = "Overlap of Signficant genes (FDR 0.1)")
grid.newpage()
grid.draw(venn)

Even in this case there are some differences between the two methods, but here the numbers of significant genes selected by the two methods are similar.

2.4 Differential analysis using JAGS

Another way to perform a differential analysis is to use JAGS with a GLM model. Just Another Gibbs Sampler (JAGS) is a program for simulation from Bayesian hierarchical models using Markov chain Monte Carlo (MCMC). In particular, we used GLMs based on the Poisson distribution and on the Negative Binomial distribution.

Counts are discrete measurements, so the statistical models usually applied to continuous expression data are not applicable to RNA-seq data, and new models are needed. The two of greatest interest are the Poisson model and the Negative Binomial model. The first describes the distribution of the data through a single parameter, which represents both the mean and the variance. It is a simple model requiring the estimation of a single parameter, but this simplicity is also its main limitation: RNA-seq data usually show a dispersion that is not captured by the variance-mean relationship implied by the Poisson model. When the variance is greater than the mean, the data are said to be overdispersed, and the use of the Poisson distribution is inappropriate. The Negative Binomial model extends the description to overdispersed data through the estimation of a second parameter, which represents precisely the dispersion coefficient.
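A quick empirical check of overdispersion is to compare the per-gene mean and variance of the raw counts: under a Poisson model the two should be roughly equal. A sketch on the filtered count data HNSCF used above:

## mean-variance check: Poisson predicts variance close to the mean (the diagonal)
gene_mean <- rowMeans(exprs(HNSCF))
gene_var  <- apply(exprs(HNSCF), 1, var)
plot(gene_mean, gene_var, log = "xy", pch = ".",
     xlab = "per-gene mean count", ylab = "per-gene variance")
abline(0, 1, col = "red")  # points far above the line indicate overdispersion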

These methods have a very high computational cost, so we apply a drastic variation filter to reduce the dataset to 100 genes. For each model we report the 95% and 99% Credible Intervals (CIs) of the fold change for the 100 genes, and we identify those whose CIs do not include 1. For this analysis we used the entire dataset.

HNSCF3_log <- variationFilter(HNSCF_log,ngenes=100, score = "sd", do.plot=TRUE)
## Variation filtering based on sd .. done.
## Selecting top 100 by sd .. done, 100 genes selected.
## Creating scatter plot ..

## done.
HNSCF2 <- subset(HNSCF.allSamples,fData(HNSCF.allSamples)$hgnc_symbol %in% fData(HNSCF3_log)$hgnc_symbol)

require(Biobase)
require(rjags)

dataList <- list(x = as.numeric(pData(HNSCF2)$patient.neoplasm_histologic_grade),y = exprs(HNSCF2)[1,], nSample = ncol(HNSCF2))

2.4.1 JAGS with Poisson model

jagFile <- paste(PATH,"../jags/models/GLMpois2.jags",sep="")
#see content of jags file
writeLines(readLines(jagFile))
## model {
## 
##  for (i in 1:nSample){
##  
##      y[i] ~ dpois(mu[i])
##      log(mu[i])<- alpha + beta*x[i]
##      
##  }
##  
##  alpha ~ dnorm(0, 0.0001)
##  beta ~ dnorm(0, 0.0001)
##  
##  fold.change <- exp(beta)
## 
## }
model <- jags.model(jagFile, 
                    data = dataList,
                    n.chains = 3, 
                    n.adapt = 1000,
                    quiet = TRUE)
update(model,n.iter=1000)

## run JAGS
vars <- c("fold.change")
samples <- coda.samples(model,
                        variable.names = vars, 
                        n.iter = 1000, 
                        thin = 1)
plot(samples)

gelman.diag(samples)
## Potential scale reduction factors:
## 
##             Point est. Upper C.I.
## fold.change          1       1.01
summary(samples)
## 
## Iterations = 2001:3000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 1000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean             SD       Naive SE Time-series SE 
##      1.2521681      0.0121730      0.0002222      0.0010080 
## 
## 2. Quantiles for each variable:
## 
##  2.5%   25%   50%   75% 97.5% 
## 1.228 1.244 1.253 1.261 1.275
jagsDiffanal <- function(datI,jagFile,n.adapt=1000,n.update=1000,n.iter=1000,quiet=TRUE)
{
  progress.bar <- if (quiet) "none" else "text"
  if (!quiet) cat("loading model\n")
  model <- jags.model(jagFile, 
                      data = c( list(gene.data=datI), dataList ),
                      n.chains = 1, 
                      n.adapt = n.adapt,
                      quiet = quiet)
  ## burnin
  if (!quiet) cat("burnin\n")
  update(model,n.iter=n.update,progress.bar=progress.bar)

  ## variables to monitor
  vars <- c("fold.change")
    
  if ( !quiet ) cat( "sampling ..")
  samples <- coda.samples(model,
                          variable.names = vars, 
                          n.iter = n.iter, 
                          thin = 1,
                          quiet=quiet,
                          progress.bar=progress.bar)
  tmp <- summary(samples, quantiles = c(0.005,0.995,0.025,0.975))
  c(tmp$statistics,tmp$quantiles)
}
OUT_pois <- t(apply(exprs(HNSCF2), 1, jagsDiffanal, jagFile=jagFile, quiet=TRUE))

## check which genes have a 95% CI of the fold-change not intersecting with 1
sig <- ((OUT_pois[,"2.5%"] < 1) & (OUT_pois[,"97.5%"] < 1)) | ((OUT_pois[,"2.5%"] > 1) & (OUT_pois[,"97.5%"] > 1))

## augment the data output with that information
OUT1 <- cbind(OUT_pois,significant=sig)    

## check which genes have a 99% CI of the fold-change not intersecting with 1
sig2 <- ((OUT_pois[,"0.5%"] < 1) & (OUT_pois[,"99.5%"] < 1)) | ((OUT_pois[,"0.5%"] > 1) & (OUT_pois[,"99.5%"] > 1))

OUT2 <- cbind(OUT_pois,significant=sig2)  

2.4.2 JAGS with negative binomial model

jagFile <- paste(PATH,"../jags/models/GLMbineg2.jags",sep="")
#see content of jags file
writeLines(readLines(jagFile))
## model
## {
##  
##      for (i in 1:nSample){
##      
##      y[i] ~ dnegbin(p[i],r)
##      
##      p[i]<- r/(mu[i]+r)
##      log(mu[i])<- alpha + beta*x[i]
##  }
##  
##  r ~ dcat(pi [])
##  for (i in 1:1000){
##      pi[i]<- 1/1000
##  }
##  
##  alpha ~ dnorm(0, 0.0001)
##  beta ~ dnorm(0, 0.0001)
##  
##  fold.change <- exp(beta)
## 
## }
model <- jags.model(jagFile, 
                    data = dataList ,
                    n.chains = 3, 
                    n.adapt = 1000,
                    quiet = FALSE)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 464
##    Unobserved stochastic nodes: 3
##    Total graph size: 2949
## 
## Initializing model
#burnin
update(model,n.iter=1000)

## run JAGS
vars <- c("fold.change")
samples <- coda.samples(model,
                        variable.names = vars, 
                        n.iter = 1000, 
                        thin = 1)
plot(samples)

gelman.diag(samples)
## Potential scale reduction factors:
## 
##             Point est. Upper C.I.
## fold.change       1.05       1.16
summary(samples)
## 
## Iterations = 2001:3000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 1000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean             SD       Naive SE Time-series SE 
##       1.273774       0.139414       0.002545       0.013229 
## 
## 2. Quantiles for each variable:
## 
##  2.5%   25%   50%   75% 97.5% 
## 1.025 1.182 1.260 1.359 1.587
## we reuse the jagsDiffanal() wrapper defined in the previous section
OUT <- t(apply(exprs(HNSCF2), 1, jagsDiffanal, jagFile=jagFile, quiet=TRUE))

## check which genes have a 95% CI of the fold-change not intersecting with 1
sig_bineg <- ((OUT[,"2.5%"] < 1) & (OUT[,"97.5%"] < 1)) | ((OUT[,"2.5%"] > 1) & (OUT[,"97.5%"] > 1))

## augment the data output with that information
OUT1_bineg <- cbind(OUT,significant=sig_bineg)    

## check which genes have a 99% CI of the fold-change not intersecting with 1
sig2_bineg <- ((OUT[,"0.5%"] < 1) & (OUT[,"99.5%"] < 1)) | ((OUT[,"0.5%"] > 1) & (OUT[,"99.5%"] > 1))

OUT2_bineg <- cbind(OUT,significant=sig2_bineg)  

2.4.3 Comparing results

Now we want to compare the results of the two models and visualize the overlap as Venn diagrams.

#Venn diagram for genes whose 95% CI of the fold change does not include 1
top.poisRes <- rownames(subset(OUT1, OUT1[,"significant"]==1))
top.binegRes <- rownames(subset(OUT1_bineg, OUT1_bineg[,"significant"]==1))

top <- list(top.poisRes = top.poisRes, top.binegRes = top.binegRes)

fill <- c("light blue","light green")
p <- venn.diagram(x = top, filename = NULL, fill = fill, main = "Overlap for 95% CI")
grid.newpage()
grid.draw(p)

# Venn diagram for genes whose 99% CI of the fold change does not include 1
top.poisRes <- rownames(subset(OUT2, OUT2[,"significant"]==1))
top.binegRes <- rownames(subset(OUT2_bineg, OUT2_bineg[,"significant"]==1))

top <- list(top.poisRes = top.poisRes, top.binegRes = top.binegRes)

fill <- c("light blue","light green")
p <- venn.diagram(x = top, filename = NULL, fill = fill, main = "Overlap for 99% CI")
grid.newpage()
grid.draw(p)

In these diagrams we note some differences between the two models: in particular, the number of significant genes under the negative binomial model is smaller than under the Poisson model. The limitation is that the number of genes considered is very low; to obtain more meaningful results we would have to work with more genes, but the computational cost is too high (running this analysis with 500 genes on a MacBook Pro (Retina, 15-inch, Late 2013) with a 2 GHz Intel Core i7 and 8 GB of RAM took about 15 hours).
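Since the per-gene fits are independent, one way to mitigate the cost is to distribute them across cores with the base parallel package; a sketch (mclapply relies on forking and is not available on Windows; mc.cores is machine-dependent):

require(parallel)

## run the per-gene JAGS fits in parallel (sketch)
expr_mat <- exprs(HNSCF2)
OUT_list <- mclapply(seq_len(nrow(expr_mat)),
                     function(i) jagsDiffanal(expr_mat[i,], jagFile=jagFile, quiet=TRUE),
                     mc.cores = 4)
OUT_par  <- do.call(rbind, OUT_list)
rownames(OUT_par) <- rownames(expr_mat)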

2.5 Classification

Let us apply the same methodology used in the previous exercise to build classifiers of grade (g1 vs. g4). As before, we use two feature sets, one with 3000 genes and one with only the top genes identified by the differential analysis with edgeR.

source(paste(PATH,"../code/xvalSelect.R",sep=""))
source(paste(PATH,"../code/knn.R",sep=""))

# training set with the best genes from the diffanal
trainingSet1 <- subset(trainingSet, rownames(exprs(trainingSet)) %in% diff_list$edgeR)
# training set with all genes after filtering (3000)
trainingSet2 <- trainingSet

## generate a vector of fold assignments (in this case, 3 folds) for the first features set
set.seed(987) # for reproducible results
split <- xvalSelect(sample.size=ncol(trainingSet1),nfolds=3,cls=pData(trainingSet1)[,"patient.neoplasm_histologic_grade"])

trnSet1_1 <- trainingSet1[,split!=1]
trnSet1_2 <- trainingSet1[,split!=2]
trnSet1_3 <- trainingSet1[,split!=3]
tstSet1_1 <- trainingSet1[,split==1]
tstSet1_2 <- trainingSet1[,split==2]
tstSet1_3 <- trainingSet1[,split==3]

table(pData(trnSet1_1)[,"patient.neoplasm_histologic_grade"])
## 
##  g1  g4 
## 140  46
table(pData(tstSet1_1)[,"patient.neoplasm_histologic_grade"])
## 
## g1 g4 
## 69 24
rownames(exprs(trnSet1_1))<-fData(trnSet1_1)$hgnc_symbol
rownames(exprs(trnSet1_2))<-fData(trnSet1_2)$hgnc_symbol
rownames(exprs(trnSet1_3))<-fData(trnSet1_3)$hgnc_symbol
rownames(exprs(tstSet1_1))<-fData(tstSet1_1)$hgnc_symbol
rownames(exprs(tstSet1_2))<-fData(tstSet1_2)$hgnc_symbol
rownames(exprs(tstSet1_3))<-fData(tstSet1_3)$hgnc_symbol

#now the same with the second training set
set.seed(987) # for reproducible results
split <- xvalSelect(sample.size=ncol(trainingSet2),nfolds=3,cls=pData(trainingSet2)[,"patient.neoplasm_histologic_grade"])

trnSet2_1 <- trainingSet2[,split!=1]
trnSet2_2 <- trainingSet2[,split!=2]
trnSet2_3 <- trainingSet2[,split!=3]
tstSet2_1 <- trainingSet2[,split==1]
tstSet2_2 <- trainingSet2[,split==2]
tstSet2_3 <- trainingSet2[,split==3]

table(pData(trnSet2_1)[,"patient.neoplasm_histologic_grade"])
## 
##  g1  g4 
## 140  46
table(pData(tstSet2_1)[,"patient.neoplasm_histologic_grade"])
## 
## g1 g4 
## 69 24
rownames(exprs(trnSet2_1))<-fData(trnSet2_1)$hgnc_symbol
rownames(exprs(trnSet2_2))<-fData(trnSet2_2)$hgnc_symbol
rownames(exprs(trnSet2_3))<-fData(trnSet2_3)$hgnc_symbol
rownames(exprs(tstSet2_1))<-fData(tstSet2_1)$hgnc_symbol
rownames(exprs(tstSet2_2))<-fData(tstSet2_2)$hgnc_symbol
rownames(exprs(tstSet2_3))<-fData(tstSet2_3)$hgnc_symbol
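
The six subsets per features set are built out explicitly above; an equivalent, more compact construction (a sketch using the split vector just computed, with trnSets2/tstSets2 as illustrative names) keeps the folds in lists:

# compact alternative for the second features set
trnSets2 <- lapply(1:3, function(k) trainingSet2[, split != k])
tstSets2 <- lapply(1:3, function(k) trainingSet2[, split == k])
# e.g. trnSets2[[1]] and tstSets2[[1]] correspond to trnSet2_1 and tstSet2_1 above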

2.5.1 Random forest

require(randomForest)

# we start with the first features set

RF1_1 <- randomForest(x=t(exprs(trnSet1_1)),y=factor(trnSet1_1$patient.neoplasm_histologic_grade),ntree=1000,importance=TRUE)
print(RF1_1)
## 
## Call:
##  randomForest(x = t(exprs(trnSet1_1)), y = factor(trnSet1_1$patient.neoplasm_histologic_grade),      ntree = 1000, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 40
## 
##         OOB estimate of  error rate: 19.35%
## Confusion matrix:
##     g1 g4 class.error
## g1 134  6  0.04285714
## g4  30 16  0.65217391
#prediction on the first Test-fold
RF1_1.pred <- predict(RF1_1, t(exprs(tstSet1_1)))
table(observed = tstSet1_1$patient.neoplasm_histologic_grade, predicted = RF1_1.pred)
##         predicted
## observed g1 g4
##       g1 65  4
##       g4 14 10
RF1_2 <- randomForest(x=t(exprs(trnSet1_2)),y=factor(trnSet1_2$patient.neoplasm_histologic_grade),ntree=1000,importance=TRUE)
print(RF1_2)
## 
## Call:
##  randomForest(x = t(exprs(trnSet1_2)), y = factor(trnSet1_2$patient.neoplasm_histologic_grade),      ntree = 1000, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 40
## 
##         OOB estimate of  error rate: 20.43%
## Confusion matrix:
##     g1 g4 class.error
## g1 130  9   0.0647482
## g4  29 18   0.6170213
#prediction on the second Test-fold
RF1_2.pred <- predict(RF1_2, t(exprs(tstSet1_2)))
table(observed = tstSet1_2$patient.neoplasm_histologic_grade, predicted = RF1_2.pred)
##         predicted
## observed g1 g4
##       g1 69  1
##       g4 16  7
RF1_3 <- randomForest(x=t(exprs(trnSet1_3)),y=factor(trnSet1_3$patient.neoplasm_histologic_grade),ntree=1000,importance=TRUE)
print(RF1_3)
## 
## Call:
##  randomForest(x = t(exprs(trnSet1_3)), y = factor(trnSet1_3$patient.neoplasm_histologic_grade),      ntree = 1000, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 40
## 
##         OOB estimate of  error rate: 21.51%
## Confusion matrix:
##     g1 g4 class.error
## g1 133  6  0.04316547
## g4  34 13  0.72340426
#prediction on the third Test-fold
RF1_3.pred <- predict(RF1_3, t(exprs(tstSet1_3)))
table(observed = tstSet1_3$patient.neoplasm_histologic_grade, predicted = RF1_3.pred)
##         predicted
## observed g1 g4
##       g1 68  2
##       g4 15  8
# second features set with all genes

RF2_1 <- randomForest(x=t(exprs(trnSet2_1)),y=factor(trnSet2_1$patient.neoplasm_histologic_grade),ntree=1000,importance=TRUE)
print(RF2_1)
## 
## Call:
##  randomForest(x = t(exprs(trnSet2_1)), y = factor(trnSet2_1$patient.neoplasm_histologic_grade),      ntree = 1000, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 54
## 
##         OOB estimate of  error rate: 19.89%
## Confusion matrix:
##     g1 g4 class.error
## g1 134  6  0.04285714
## g4  31 15  0.67391304
#prediction on the first Test-fold
RF2_1.pred <- predict(RF2_1, t(exprs(tstSet2_1)))
table(observed = tstSet2_1$patient.neoplasm_histologic_grade, predicted = RF2_1.pred)
##         predicted
## observed g1 g4
##       g1 64  5
##       g4 17  7
RF2_2 <- randomForest(x=t(exprs(trnSet2_2)),y=factor(trnSet2_2$patient.neoplasm_histologic_grade),ntree=1000,importance=TRUE)
print(RF2_2)
## 
## Call:
##  randomForest(x = t(exprs(trnSet2_2)), y = factor(trnSet2_2$patient.neoplasm_histologic_grade),      ntree = 1000, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 54
## 
##         OOB estimate of  error rate: 21.51%
## Confusion matrix:
##     g1 g4 class.error
## g1 132  7  0.05035971
## g4  33 14  0.70212766
#prediction on the second Test-fold
RF2_2.pred <- predict(RF2_2, t(exprs(tstSet2_2)))
table(observed = tstSet2_2$patient.neoplasm_histologic_grade, predicted = RF2_2.pred)
##         predicted
## observed g1 g4
##       g1 69  1
##       g4 17  6
RF2_3 <- randomForest(x=t(exprs(trnSet2_3)),y=factor(trnSet2_3$patient.neoplasm_histologic_grade),ntree=1000,importance=TRUE)
print(RF2_3)
## 
## Call:
##  randomForest(x = t(exprs(trnSet2_3)), y = factor(trnSet2_3$patient.neoplasm_histologic_grade),      ntree = 1000, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 54
## 
##         OOB estimate of  error rate: 23.66%
## Confusion matrix:
##     g1 g4 class.error
## g1 134  5  0.03597122
## g4  39  8  0.82978723
#prediction on the third Test-fold
RF2_3.pred <- predict(RF2_3, t(exprs(tstSet2_3)))
table(observed = tstSet2_3$patient.neoplasm_histologic_grade, predicted = RF2_3.pred)
##         predicted
## observed g1 g4
##       g1 68  2
##       g4 15  8
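
Since each forest is grown with importance=TRUE, we can also inspect which genes drive the classification; a sketch for the first fold of the first features set, using the importance() and varImpPlot() functions of randomForest:

# top 10 genes by mean decrease in accuracy (first fold, first features set)
imp <- importance(RF1_1, type = 1)
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 10)
varImpPlot(RF1_1, n.var = 10)  # graphical summary of the same ranking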

2.5.2 Adaboost

library("adabag")

# we start with the first features set

#fold 1
patient.neoplasm_histologic_grade<-trnSet1_1$patient.neoplasm_histologic_grade

adaboost1_1<-boosting(patient.neoplasm_histologic_grade~.,data = cbind(as.data.frame(t(exprs(trnSet1_1))), patient.neoplasm_histologic_grade))

patient.neoplasm_histologic_grade<-tstSet1_1$patient.neoplasm_histologic_grade

adaboost1_1.pred<-predict.boosting(adaboost1_1, newdata = cbind(as.data.frame(t(exprs(tstSet1_1))), patient.neoplasm_histologic_grade))

adaboost1_1.pred$confusion
##                Observed Class
## Predicted Class g1 g4
##              g1 59 12
##              g4 10 12
#fold 2
patient.neoplasm_histologic_grade<-trnSet1_2$patient.neoplasm_histologic_grade

adaboost1_2<-boosting(patient.neoplasm_histologic_grade~.,data = cbind(as.data.frame(t(exprs(trnSet1_2))), patient.neoplasm_histologic_grade))

patient.neoplasm_histologic_grade<-tstSet1_2$patient.neoplasm_histologic_grade

adaboost1_2.pred<-predict.boosting(adaboost1_2, newdata = cbind(as.data.frame(t(exprs(tstSet1_2))), patient.neoplasm_histologic_grade))

adaboost1_2.pred$confusion
##                Observed Class
## Predicted Class g1 g4
##              g1 65 14
##              g4  5  9
#fold 3
patient.neoplasm_histologic_grade<-trnSet1_3$patient.neoplasm_histologic_grade

adaboost1_3<-boosting(patient.neoplasm_histologic_grade~.,data = cbind(as.data.frame(t(exprs(trnSet1_3))), patient.neoplasm_histologic_grade))

patient.neoplasm_histologic_grade<-tstSet1_3$patient.neoplasm_histologic_grade

adaboost1_3.pred<-predict.boosting(adaboost1_3, newdata = cbind(as.data.frame(t(exprs(tstSet1_3))), patient.neoplasm_histologic_grade))

adaboost1_3.pred$confusion
##                Observed Class
## Predicted Class g1 g4
##              g1 67 13
##              g4  3 10
# second features set (3000 genes)

#fold 1
patient.neoplasm_histologic_grade<-trnSet2_1$patient.neoplasm_histologic_grade

adaboost2_1<-boosting(patient.neoplasm_histologic_grade~.,data = cbind(as.data.frame(t(exprs(trnSet2_1))), patient.neoplasm_histologic_grade))

patient.neoplasm_histologic_grade<-tstSet2_1$patient.neoplasm_histologic_grade
adaboost2_1.pred<-predict.boosting(adaboost2_1, newdata = cbind(as.data.frame(t(exprs(tstSet2_1))), patient.neoplasm_histologic_grade))

adaboost2_1.pred$confusion
##                Observed Class
## Predicted Class g1 g4
##              g1 65 15
##              g4  4  9
#fold 2
patient.neoplasm_histologic_grade<-trnSet2_2$patient.neoplasm_histologic_grade

adaboost2_2<-boosting(patient.neoplasm_histologic_grade~.,data = cbind(as.data.frame(t(exprs(trnSet2_2))), patient.neoplasm_histologic_grade))

patient.neoplasm_histologic_grade<-tstSet2_2$patient.neoplasm_histologic_grade

adaboost2_2.pred<-predict.boosting(adaboost2_2, newdata = cbind(as.data.frame(t(exprs(tstSet2_2))), patient.neoplasm_histologic_grade))

adaboost2_2.pred$confusion
##                Observed Class
## Predicted Class g1 g4
##              g1 68 18
##              g4  2  5
#fold 3
patient.neoplasm_histologic_grade<-trnSet2_3$patient.neoplasm_histologic_grade

adaboost2_3<-boosting(patient.neoplasm_histologic_grade~.,data = cbind(as.data.frame(t(exprs(trnSet2_3))), patient.neoplasm_histologic_grade))

patient.neoplasm_histologic_grade<-tstSet2_3$patient.neoplasm_histologic_grade

adaboost2_3.pred<-predict.boosting(adaboost2_3, newdata = cbind(as.data.frame(t(exprs(tstSet2_3))), patient.neoplasm_histologic_grade))

adaboost2_3.pred$confusion
##                Observed Class
## Predicted Class g1 g4
##              g1 67 16
##              g4  3  7
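
adabag also offers built-in v-fold cross-validation through boosting.cv, which would avoid the manual fold bookkeeping above; a sketch on the first features set (adaboost1.cv is an illustrative name):

# built-in 3-fold cross-validation over the whole training set (first features set)
patient.neoplasm_histologic_grade <- trainingSet1$patient.neoplasm_histologic_grade
adaboost1.cv <- boosting.cv(patient.neoplasm_histologic_grade~.,
                            data = cbind(as.data.frame(t(exprs(trainingSet1))), patient.neoplasm_histologic_grade),
                            v = 3)
adaboost1.cv$confusion  # pooled confusion matrix over the three folds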

2.5.3 Logistic regression with LASSO

require(glmnet)

# alpha = 1 gives the LASSO; alpha = 0.5 would give an elastic net

## first training set
#fold 1
regression1_1 <- glmnet(x=t(exprs(trnSet1_1)),y=factor(trnSet1_1$patient.neoplasm_histologic_grade), family = "binomial", alpha=1)

cv.regression1_1 <- cv.glmnet(x=t(exprs(trnSet1_1)),y=factor(trnSet1_1$patient.neoplasm_histologic_grade),alpha=1, family = "binomial" )

best_lambda <- cv.regression1_1$lambda.min
#prediction on Test Data
regression1_1.pred <- predict(regression1_1, newx = t(exprs(tstSet1_1)),type = "class", s=best_lambda)
#confusion matrix
table(observed = tstSet1_1$patient.neoplasm_histologic_grade, predicted = regression1_1.pred)
##         predicted
## observed g1 g4
##       g1 62  7
##       g4 17  7
# fold 2
regression1_2 <- glmnet(x=t(exprs(trnSet1_2)),y=factor(trnSet1_2$patient.neoplasm_histologic_grade), family = "binomial", alpha=1)

cv.regression1_2 <- cv.glmnet(x=t(exprs(trnSet1_2)),y=factor(trnSet1_2$patient.neoplasm_histologic_grade),alpha=1, family = "binomial" )

best_lambda <- cv.regression1_2$lambda.min
#prediction on Test Data
regression1_2.pred <- predict(regression1_2, newx = t(exprs(tstSet1_2)),type = "class", s=best_lambda)
#confusion matrix
table(observed = tstSet1_2$patient.neoplasm_histologic_grade, predicted = regression1_2.pred)
##         predicted
## observed g1 g4
##       g1 67  3
##       g4 19  4
# fold 3
regression1_3 <- glmnet(x=t(exprs(trnSet1_3)),y=factor(trnSet1_3$patient.neoplasm_histologic_grade), family = "binomial", alpha=1)

cv.regression1_3 <- cv.glmnet(x=t(exprs(trnSet1_3)),y=factor(trnSet1_3$patient.neoplasm_histologic_grade),alpha=1, family = "binomial" )

best_lambda <- cv.regression1_3$lambda.min
#prediction on Test Data
regression1_3.pred <- predict(regression1_3, newx = t(exprs(tstSet1_3)),type = "class", s=best_lambda)
#confusion matrix
table(observed = tstSet1_3$patient.neoplasm_histologic_grade, predicted = regression1_3.pred)
##         predicted
## observed g1 g4
##       g1 67  3
##       g4 16  7
## second training set 
#fold 1
regression2_1 <- glmnet(x=t(exprs(trnSet2_1)),y=factor(trnSet2_1$patient.neoplasm_histologic_grade), family = "binomial", alpha=1)

cv.regression2_1 <- cv.glmnet(x=t(exprs(trnSet2_1)),y=factor(trnSet2_1$patient.neoplasm_histologic_grade),alpha=1, family = "binomial" )

best_lambda <- cv.regression2_1$lambda.min
#prediction on Test Data
regression2_1.pred <- predict(regression2_1, newx = t(exprs(tstSet2_1)),type = "class", s=best_lambda)
#confusion matrix
table(observed = tstSet2_1$patient.neoplasm_histologic_grade, predicted = regression2_1.pred)
##         predicted
## observed g1 g4
##       g1 63  6
##       g4 17  7
# fold 2
regression2_2 <- glmnet(x=t(exprs(trnSet2_2)),y=factor(trnSet2_2$patient.neoplasm_histologic_grade), family = "binomial", alpha=1)

cv.regression2_2 <- cv.glmnet(x=t(exprs(trnSet2_2)),y=factor(trnSet2_2$patient.neoplasm_histologic_grade),alpha=1, family = "binomial" )

best_lambda <- cv.regression2_2$lambda.min
#prediction on Test Data
regression2_2.pred <- predict(regression2_2, newx = t(exprs(tstSet2_2)),type = "class", s=best_lambda)
#confusion matrix
table(observed = tstSet2_2$patient.neoplasm_histologic_grade, predicted = regression2_2.pred)
##         predicted
## observed g1 g4
##       g1 67  3
##       g4 20  3
# fold 3
regression2_3 <- glmnet(x=t(exprs(trnSet2_3)),y=factor(trnSet2_3$patient.neoplasm_histologic_grade), family = "binomial", alpha=1)

cv.regression2_3 <- cv.glmnet(x=t(exprs(trnSet2_3)),y=factor(trnSet2_3$patient.neoplasm_histologic_grade),alpha=1, family = "binomial" )

best_lambda <- cv.regression2_3$lambda.min
#prediction on Test Data
regression2_3.pred <- predict(regression2_3, newx = t(exprs(tstSet2_3)),type = "class", s=best_lambda)
#confusion matrix
table(observed = tstSet2_3$patient.neoplasm_histologic_grade, predicted = regression2_3.pred)
##         predicted
## observed g1 g4
##       g1 67  3
##       g4 16  7
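
A useful by-product of the LASSO is the list of genes retained at the selected lambda; a sketch for the first fold of the first training set:

# genes with non-zero coefficients at lambda.min (first fold, first training set)
coefs <- coef(cv.regression1_1, s = "lambda.min")
selected <- rownames(coefs)[which(coefs[, 1] != 0)]
setdiff(selected, "(Intercept)")  # the retained genes, intercept excluded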

2.5.4 KNN

# first features set with the top genes

knn1_1 <- knn.estimate(dat=t(exprs(trnSet1_1)),cls=pData(trnSet1_1)[,"patient.neoplasm_histologic_grade"],weight="none",k=3)
knn1_1.pred <- knn.predict(dat=t(exprs(tstSet1_1)),model=knn1_1)

ftable(levels(pData(tstSet1_1)[,"patient.neoplasm_histologic_grade"])[apply(knn1_1.pred,1,which.max)], # predictions
       pData(tstSet1_1)[,"patient.neoplasm_histologic_grade"])                                  # actual
##     g1 g4
##          
## g1  65 13
## g4   4 11
knn1_2 <- knn.estimate(dat=t(exprs(trnSet1_2)),cls=pData(trnSet1_2)[,"patient.neoplasm_histologic_grade"],weight="none",k=3)
knn1_2.pred <- knn.predict(dat=t(exprs(tstSet1_2)),model=knn1_2)

ftable(levels(pData(tstSet1_2)[,"patient.neoplasm_histologic_grade"])[apply(knn1_2.pred,1,which.max)], # predictions
       pData(tstSet1_2)[,"patient.neoplasm_histologic_grade"])                                  # actual
##     g1 g4
##          
## g1  70 18
## g4   0  5
knn1_3 <- knn.estimate(dat=t(exprs(trnSet1_3)),cls=pData(trnSet1_3)[,"patient.neoplasm_histologic_grade"],weight="none",k=3)
knn1_3.pred <- knn.predict(dat=t(exprs(tstSet1_3)),model=knn1_3)

ftable(levels(pData(tstSet1_3)[,"patient.neoplasm_histologic_grade"])[apply(knn1_3.pred,1,which.max)], # predictions
       pData(tstSet1_3)[,"patient.neoplasm_histologic_grade"])                                  # actual
##     g1 g4
##          
## g1  67 17
## g4   3  6
# second features set with 3000 genes

knn2_1 <- knn.estimate(dat=t(exprs(trnSet2_1)),cls=pData(trnSet2_1)[,"patient.neoplasm_histologic_grade"],weight="none",k=3)
knn2_1.pred <- knn.predict(dat=t(exprs(tstSet2_1)),model=knn2_1)

ftable(levels(pData(tstSet2_1)[,"patient.neoplasm_histologic_grade"])[apply(knn2_1.pred,1,which.max)], # predictions
       pData(tstSet2_1)[,"patient.neoplasm_histologic_grade"])                                  # actual
##     g1 g4
##          
## g1  66 15
## g4   3  9
knn2_2 <- knn.estimate(dat=t(exprs(trnSet2_2)),cls=pData(trnSet2_2)[,"patient.neoplasm_histologic_grade"],weight="none",k=3)
knn2_2.pred <- knn.predict(dat=t(exprs(tstSet2_2)),model=knn2_2)

ftable(levels(pData(tstSet2_2)[,"patient.neoplasm_histologic_grade"])[apply(knn2_2.pred,1,which.max)], # predictions
       pData(tstSet2_2)[,"patient.neoplasm_histologic_grade"])                                  # actual
##     g1 g4
##          
## g1  69 19
## g4   1  4
knn2_3 <- knn.estimate(dat=t(exprs(trnSet2_3)),cls=pData(trnSet2_3)[,"patient.neoplasm_histologic_grade"],weight="none",k=3)
knn2_3.pred <- knn.predict(dat=t(exprs(tstSet2_3)),model=knn2_3)

ftable(levels(pData(tstSet2_3)[,"patient.neoplasm_histologic_grade"])[apply(knn2_3.pred,1,which.max)], # predictions
       pData(tstSet2_3)[,"patient.neoplasm_histologic_grade"])                                  # actual
##     g1 g4
##          
## g1  67 18
## g4   3  5
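
The repeated ftable calls can be wrapped in a small helper; a sketch (knn.confusion is an illustrative name) that reproduces the confusion tables above:

# helper: confusion table of predicted vs. observed grade for a KNN prediction
knn.confusion <- function(pred, tstSet) {
  cls <- pData(tstSet)[, "patient.neoplasm_histologic_grade"]
  ftable(levels(cls)[apply(pred, 1, which.max)],  # predicted labels
         cls)                                     # observed labels
}
knn.confusion(knn1_1.pred, tstSet1_1)  # same table as the first one above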

2.5.5 Evaluation of classifiers

require(pROC)

# first we consider the performance on the full features set (3000 genes)

#RANDOM FOREST
#calculate sensitivity
RF_sens_mean2<- mean( c(sensitivity(factor(RF2_1.pred), factor(tstSet2_1$patient.neoplasm_histologic_grade)),
sensitivity(factor(RF2_2.pred), factor(tstSet2_2$patient.neoplasm_histologic_grade)),
sensitivity(factor(RF2_3.pred), factor(tstSet2_3$patient.neoplasm_histologic_grade))))

#calculate specificity
RF_spec_mean2<- mean( c(specificity(factor(RF2_1.pred), factor(tstSet2_1$patient.neoplasm_histologic_grade)),
specificity(factor(RF2_2.pred), factor(tstSet2_2$patient.neoplasm_histologic_grade)),
specificity(factor(RF2_3.pred), factor(tstSet2_3$patient.neoplasm_histologic_grade))))

#calculate balanced accuracy
RF_bal_acc2 <- (RF_sens_mean2+RF_spec_mean2)/2

#KNN
#calculate sensitivity
knn_sens_mean2<- mean( c(sensitivity(factor(apply(knn2_1.pred,1,which.max)),factor(as.numeric(tstSet2_1$patient.neoplasm_histologic_grade))),sensitivity(factor(apply(knn2_2.pred,1,which.max)), factor(as.numeric(tstSet2_2$patient.neoplasm_histologic_grade))),
sensitivity(factor(apply(knn2_3.pred,1,which.max)), factor(as.numeric(tstSet2_3$patient.neoplasm_histologic_grade)))))

#calculate specificity
knn_spec_mean2<- mean( c(specificity(factor(apply(knn2_1.pred,1,which.max)),factor(as.numeric(tstSet2_1$patient.neoplasm_histologic_grade))),specificity(factor(apply(knn2_2.pred,1,which.max)), factor(as.numeric(tstSet2_2$patient.neoplasm_histologic_grade))),
specificity(factor(apply(knn2_3.pred,1,which.max)), factor(as.numeric(tstSet2_3$patient.neoplasm_histologic_grade)))))

#calculate balanced accuracy
knn_bal_acc2 <- (knn_sens_mean2+knn_spec_mean2)/2

#ADABOOST
#calculate sensitivity
adaboost_sens_mean2<- mean( c(sensitivity(factor(adaboost2_1.pred$class), factor(tstSet2_1$patient.neoplasm_histologic_grade)),
sensitivity(factor(adaboost2_2.pred$class), factor(tstSet2_2$patient.neoplasm_histologic_grade)),
sensitivity(factor(adaboost2_3.pred$class), factor(tstSet2_3$patient.neoplasm_histologic_grade))))

#calculate specificity
adaboost_spec_mean2<- mean( c(specificity(factor(adaboost2_1.pred$class), factor(tstSet2_1$patient.neoplasm_histologic_grade)),
specificity(factor(adaboost2_2.pred$class), factor(tstSet2_2$patient.neoplasm_histologic_grade)),
specificity(factor(adaboost2_3.pred$class), factor(tstSet2_3$patient.neoplasm_histologic_grade))))

#calculate balanced accuracy
adaboost_bal_acc2 <- (adaboost_sens_mean2+adaboost_spec_mean2)/2

#LASSO REGRESSION
#calculate sensitivity
lasso_sens_mean2<- mean( c(sensitivity(factor(regression2_1.pred), factor(tstSet2_1$patient.neoplasm_histologic_grade)),
sensitivity(factor(regression2_2.pred), factor(tstSet2_2$patient.neoplasm_histologic_grade)),
sensitivity(factor(regression2_3.pred), factor(tstSet2_3$patient.neoplasm_histologic_grade))))

#calculate specificity
lasso_spec_mean2<- mean( c(specificity(factor(regression2_1.pred), factor(tstSet2_1$patient.neoplasm_histologic_grade)),
specificity(factor(regression2_2.pred), factor(tstSet2_2$patient.neoplasm_histologic_grade)),
specificity(factor(regression2_3.pred), factor(tstSet2_3$patient.neoplasm_histologic_grade))))

#calculate balanced accuracy
lasso_bal_acc2 <- (lasso_sens_mean2+lasso_spec_mean2)/2


result2 <- cbind(c(RF_sens_mean2, knn_sens_mean2,adaboost_sens_mean2,lasso_sens_mean2),c(RF_spec_mean2,knn_spec_mean2,adaboost_spec_mean2,lasso_spec_mean2),c(RF_bal_acc2,knn_bal_acc2,adaboost_bal_acc2,lasso_bal_acc2))
rownames(result2)<-c("RF","KNN","AdABoost", "LASSO")
colnames(result2)<-c("sensitivity","specificity","Balanced accuracy")
result2
##          sensitivity specificity Balanced accuracy
## RF         0.9615597   0.3001208         0.6308402
## KNN        0.9664596   0.2554348         0.6109472
## AdABoost   0.9568668   0.2989130         0.6278899
## LASSO      0.9424431   0.2421498         0.5922964
# now we consider the features set with only the top genes

#RANDOM FOREST
#calculate sensitivity
RF_sens_mean1<- mean( c(sensitivity(factor(RF1_1.pred), factor(tstSet1_1$patient.neoplasm_histologic_grade)),
sensitivity(factor(RF1_2.pred), factor(tstSet1_2$patient.neoplasm_histologic_grade)),
sensitivity(factor(RF1_3.pred), factor(tstSet1_3$patient.neoplasm_histologic_grade))))

#calculate specificity
RF_spec_mean1<- mean( c(specificity(factor(RF1_1.pred), factor(tstSet1_1$patient.neoplasm_histologic_grade)),
specificity(factor(RF1_2.pred), factor(tstSet1_2$patient.neoplasm_histologic_grade)),
specificity(factor(RF1_3.pred), factor(tstSet1_3$patient.neoplasm_histologic_grade))))

#calculate balanced accuracy
RF_bal_acc1 <- (RF_sens_mean1+RF_spec_mean1)/2


#KNN
#calculate sensitivity
knn_sens_mean1<- mean( c(sensitivity(factor(apply(knn1_1.pred,1,which.max)),factor(as.numeric(tstSet1_1$patient.neoplasm_histologic_grade))),sensitivity(factor(apply(knn1_2.pred,1,which.max)), factor(as.numeric(tstSet1_2$patient.neoplasm_histologic_grade))),
sensitivity(factor(apply(knn1_3.pred,1,which.max)), factor(as.numeric(tstSet1_3$patient.neoplasm_histologic_grade)))))

#calculate specificity
knn_spec_mean1<- mean( c(specificity(factor(apply(knn1_1.pred,1,which.max)),factor(as.numeric(tstSet1_1$patient.neoplasm_histologic_grade))),specificity(factor(apply(knn1_2.pred,1,which.max)), factor(as.numeric(tstSet1_2$patient.neoplasm_histologic_grade))),
specificity(factor(apply(knn1_3.pred,1,which.max)), factor(as.numeric(tstSet1_3$patient.neoplasm_histologic_grade)))))

#calculate balanced accuracy
knn_bal_acc1 <- (knn_sens_mean1+knn_spec_mean1)/2


#ADABOOST
#calculate sensitivity
adaboost_sens_mean1<- mean( c(sensitivity(factor(adaboost1_1.pred$class), factor(tstSet1_1$patient.neoplasm_histologic_grade)),
sensitivity(factor(adaboost1_2.pred$class), factor(tstSet1_2$patient.neoplasm_histologic_grade)),
sensitivity(factor(adaboost1_3.pred$class), factor(tstSet1_3$patient.neoplasm_histologic_grade))))

#calculate specificity
adaboost_spec_mean1<- mean( c(specificity(factor(adaboost1_1.pred$class), factor(tstSet1_1$patient.neoplasm_histologic_grade)),
specificity(factor(adaboost1_2.pred$class), factor(tstSet1_2$patient.neoplasm_histologic_grade)),
specificity(factor(adaboost1_3.pred$class), factor(tstSet1_3$patient.neoplasm_histologic_grade))))

#calculate balanced accuracy
adaboost_bal_acc1 <- (adaboost_sens_mean1+adaboost_spec_mean1)/2

#LASSO REGRESSION
#calculate sensitivity
lasso_sens_mean1<- mean( c(sensitivity(factor(regression1_1.pred), factor(tstSet1_1$patient.neoplasm_histologic_grade)),
sensitivity(factor(regression1_2.pred), factor(tstSet1_2$patient.neoplasm_histologic_grade)),
sensitivity(factor(regression1_3.pred), factor(tstSet1_3$patient.neoplasm_histologic_grade))))

#calculate specificity
lasso_spec_mean1<- mean( c(specificity(factor(regression1_1.pred), factor(tstSet1_1$patient.neoplasm_histologic_grade)),
specificity(factor(regression1_2.pred), factor(tstSet1_2$patient.neoplasm_histologic_grade)),
specificity(factor(regression1_3.pred), factor(tstSet1_3$patient.neoplasm_histologic_grade))))

#calculate balanced accuracy
lasso_bal_acc1 <- (lasso_sens_mean1+lasso_spec_mean1)/2


result1 <- cbind(c(RF_sens_mean1, knn_sens_mean1,adaboost_sens_mean1,lasso_sens_mean1),c(RF_spec_mean1,knn_spec_mean1,adaboost_spec_mean1,lasso_spec_mean1),c(RF_bal_acc1,knn_bal_acc1,adaboost_bal_acc1,lasso_bal_acc1))
rownames(result1)<-c("RF","KNN", "AdABoost","LASSO")
colnames(result1)<-c("sensitivity","specificity","Balanced accuracy")
result1
##          sensitivity specificity Balanced accuracy
## RF         0.9663906   0.3562802         0.6613354
## KNN        0.9663906   0.3121981         0.6392943
## AdABoost   0.9135956   0.4420290         0.6778123
## LASSO      0.9376121   0.2566425         0.5971273

We note that in both cases the performance of our classifiers is poor: the majority class (g1) is detected well, but the specificity for g4 is low. Probably the data are too noisy to build a high-performing classifier.
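
As an aside, the per-classifier blocks above repeat the same three computations; a small helper built on the same caret calls (a sketch, with xval.metrics as an illustrative name) would make them less error-prone:

# helper: fold-averaged sensitivity, specificity and balanced accuracy
xval.metrics <- function(preds, truths) {
  sens <- mean(mapply(function(p, y) sensitivity(factor(p), factor(y)), preds, truths))
  spec <- mean(mapply(function(p, y) specificity(factor(p), factor(y)), preds, truths))
  c(sensitivity = sens, specificity = spec, balanced.accuracy = (sens + spec)/2)
}
# e.g. the random forest row of result1:
xval.metrics(list(RF1_1.pred, RF1_2.pred, RF1_3.pred),
             list(tstSet1_1$patient.neoplasm_histologic_grade,
                  tstSet1_2$patient.neoplasm_histologic_grade,
                  tstSet1_3$patient.neoplasm_histologic_grade))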

2.5.6 Final Results

Now we train Adaboost with the reduced set of genes on the whole training set and apply it to the held-out test set.

patient.neoplasm_histologic_grade<-trainingSet1$patient.neoplasm_histologic_grade

adaboost<-boosting(patient.neoplasm_histologic_grade~.,data = cbind(as.data.frame(t(exprs(trainingSet1))), patient.neoplasm_histologic_grade))

patient.neoplasm_histologic_grade<-testSet$patient.neoplasm_histologic_grade

adaboost.pred<-predict.boosting(adaboost, newdata = cbind(as.data.frame(t(exprs(testSet))), patient.neoplasm_histologic_grade))

adaboost.pred$confusion
##                Observed Class
## Predicted Class  g1  g4
##              g1 121  31
##              g4  18  15
#calculate sensitivity
sens <- sensitivity(factor(adaboost.pred$class), factor(testSet$patient.neoplasm_histologic_grade))
sens
## [1] 0.8705036
#calculate specificity
spec <- specificity(factor(adaboost.pred$class), factor(testSet$patient.neoplasm_histologic_grade))
spec
## [1] 0.326087
roc(testSet$patient.neoplasm_histologic_grade, adaboost.pred$votes[,1], plot = TRUE)

## 
## Call:
## roc.default(response = testSet$patient.neoplasm_histologic_grade,     predictor = adaboost.pred$votes[, 1], plot = TRUE)
## 
## Data: adaboost.pred$votes[, 1] in 139 controls (testSet$patient.neoplasm_histologic_grade g1) > 46 cases (testSet$patient.neoplasm_histologic_grade g4).
## Area under the curve: 0.6905
bal_acc <- (sens+spec)/2
bal_acc
## [1] 0.5982953

We note that the performance is not good, so we are not able to predict the class with high accuracy. Moreover, the dataset is unbalanced; indeed, the specificity is very low. So we try a different approach.

2.5.7 Support Vector Machines with MLSeq

The MLSeq package provides several algorithms, including support vector machines (SVM), for classifying sequencing data. It expects a count table that contains the number of reads mapped to each transcript in each sample, together with the class labels of the samples, in an S4 DESeqDataSet object.

As we saw previously, RNA-Seq data are overdispersed. Several studies have shown that overdispersion has a significant effect on classification accuracy and should be taken into account before model building. When the data are overdispersed, increasing the number of samples does not change classifier performance, whereas increasing the number of features significantly increases overall model accuracy in most scenarios.

We transform our training and test data into DESeqDataSet instances, the main data structure of the MLSeq package, using the DESeqDataSetFromMatrix function of the DESeq2 package. To improve the performance of most classifiers, MLSeq provides two normalization methods; we used "deseq" normalization, which estimates the size factors by dividing each sample by the geometric means of the transcript counts. After normalization, it is useful to transform the data for the classification analysis. Two transformation methods are available in MLSeq: the "voom" transformation, which applies a logarithmic transformation to the normalized count data and computes gene weights from the mean-dispersion relationship, and the "vst" transformation, which uses error modeling and the concept of variance-stabilizing transformations to estimate the mean-dispersion relationship of the data. We trained the SVM on a training set with 3000 genes, choosing "deseq" as the normalization method and "vst" as the transformation method, with 3-fold cross-validation (cv = 3) repeated three times (rpt = 3) for model validation. Finally, we apply the classifier to the test set. SVMs are capable of nonlinear classification and can deal with high-dimensional data; for this reason they have been applied in many fields, including computational biology.

require("MLSeq")

trainingSet3 <- HNSCF            # note: the full HNSCF object, not HNSCF[, trainIndex]
testSet3 <- HNSCF[,-trainIndex]  # so the test samples below are also part of the training data


colData <- data.frame(condition=as.character(pData(trainingSet3)[, "patient.neoplasm_histologic_grade"]))

data.trainS4 = DESeqDataSetFromMatrix(countData = exprs(trainingSet3), colData = colData,
    formula(~condition))
data.trainS4 = DESeq(data.trainS4, fitType = "local")

svm <- classify(data = data.trainS4, method = "svm",  ref = "g1" , deseqTransform = "vst", cv = 3, rpt = 3)
svm
## 
##   An object of class  MLSeq 
## 
##             Method  :  svm 
## 
##        Accuracy(%)  :  89.61 
##     Sensitivity(%)  :  100 
##     Specificity(%)  :  58.57 
## 
##   Reference Class   :  g1
colData <- data.frame(condition=as.character(pData(testSet3)[, "patient.neoplasm_histologic_grade"]))
data.testS4 = DESeqDataSetFromMatrix(countData = exprs(testSet3), colData = colData,
    formula(~condition))
data.testS4 = DESeq(data.testS4, fitType = "local")

pred.svm = predictClassify(svm, data.testS4)
table(observed = testSet3$patient.neoplasm_histologic_grade, predicted = pred.svm)
##         predicted
## observed g1 g4
##       g1 84  0
##       g4  8 18
#calculate sensitivity
sens_svm <- sensitivity(factor(pred.svm), factor(testSet3$patient.neoplasm_histologic_grade))
sens_svm
## [1] 1
#calculate specificity
spec_svm <- specificity(factor(pred.svm), factor(testSet3$patient.neoplasm_histologic_grade))
spec_svm
## [1] 0.6923077
bal_acc_svm <- (sens_svm+spec_svm)/2
bal_acc_svm
## [1] 0.8461538

As we can see, this method performs much better and gives a large improvement over the previous classifiers, although the estimate may be optimistic: as noted above, the training set here is the full HNSCF object, so the test samples were also seen during training. Using this package we could also try bagSVM (SVM with bagging) to push the performance further, but this method has a high computational cost; a sketch of the call follows.
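
A sketch of that call, assuming the same classify interface and the method name "bagsvm" used by this version of MLSeq:

# bagged SVM via the same MLSeq interface (computationally much heavier)
bagsvm <- classify(data = data.trainS4, method = "bagsvm", ref = "g1", deseqTransform = "vst", cv = 3, rpt = 3)
pred.bagsvm <- predictClassify(bagsvm, data.testS4)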

References

  • Lenz G, Wright G, Dave SS, Xiao W, Powell J, Zhao H, et al. Stromal gene signatures in large-B-cell lymphomas. N Engl J Med. 2008;359(22):2313–23.

  • Cancer Genome Atlas Network. (2015). Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature, 517(7536), 576-582.

  • Lee, J. K., Williams, P. D., & Cheon, S. (2008). Data Mining in Genomics. Clinics in Laboratory Medicine, 28(1), 145–viii.

  • Naba, A., Clauser, K. R., Hoersch, S., Liu, H., Carr, S. A., & Hynes, R. O. (2012). The Matrisome: In Silico Definition and In Vivo Characterization by Proteomics of Normal and Tumor Extracellular Matrices. Molecular & Cellular Proteomics: MCP, 11(4), M111.014647.

  • M. I. Love, W. Huber, S. Anders. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550.

  • Robinson, MD, and Smyth, GK (2008). Small sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9, 321–332.

  • Robinson, MD, and Smyth, GK (2007). Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 23, 2881–2887.

  • Robinson, MD, McCarthy, DJ, Smyth, GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140.

  • David Lunn, Christopher Jackson, Nicky Best, Andrew Thomas and David Spiegelhalter. (2012). The BUGS Book - A Practical Introduction to Bayesian Analysis. CRC Press / Chapman and Hall.

  • Yanming Di, Daniel W Schafer, Jason S Cumbie, and Jeff H Chang. (2011). The NBP negative binomial model for assessing differential gene expression from RNA-seq. Statistical Applications in Genetics and Molecular Biology, 10(1):24.

  • Ann L Oberg, Brian M Bot, Diane E Grill, Gregory A Poland, and Terry M Therneau. (2012). Technical and biological variance structure in mRNA-Seq data: life in the real world. BMC genomics, 13(1):304.

  • Gokmen Zararsiz, Dincer Goksuluk, Selcuk Korkmaz, Vahap Eldem, Izzet Parug Duru, Turgay Unver and Ahmet Ozturk (2016). MLSeq: Machine learning interface for RNA-Seq data. R package version 1.8.1.

  • Anders S, Huber W (2010). Differential expression analysis for sequence count data. Genome Biology, 11(10):R106.

  • Law CW, Chen Y, Shi W, Smyth GK (2014). voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15:R29, doi:10.1186/gb-2014-15-2-r29.