PluriPred

PluriPred : A web server for predicting pluripotent proteins

PluriPred is a web server that predicts, from a protein's amino acid sequence, whether the protein plays an important role in pluripotency.

Datasets

Positive dataset : PluriNetwork is a manually curated pluripotency protein-protein interaction network containing 274 mouse genes/proteins with direct evidence of a role in pluripotency. The IDs of 270 of these genes could be matched, and these 270 genes were used as the positive training dataset for the SVM model and to build the database for the BLAST search.

Negative dataset : About 2785 mouse genes were randomly selected from UniProt, restricted to genes not annotated with any of the Gene Ontology terms growth (GO:0040007, level 1), developmental process (GO:0032502, level 1), cell proliferation (GO:0008283, level 1) or cell differentiation (GO:0030154, level 4). CD-HIT was used to remove redundant genes with similar amino acid sequences. We set the sequence identity cut-off to 0.3, which is the similarity threshold used to cluster the FASTA sequences of the proteins/genes, and kept one representative per cluster. We then manually removed a few closely related genes. The final negative dataset contains 932 genes.

Swiss-Prot genes (mouse and human) : Amino acid sequences in FASTA format were collected from UniProt for 16232 mouse genes/proteins and 20193 human genes/proteins that have a UniProtKB/Swiss-Prot entry. The optimal model was then run on these sequences to predict novel pluripotent genes/proteins.

Downloads : Positive training dataset, Negative training dataset, Blind set from ESCAPE database, Blind set from PluriNet

Methods

The Support Vector Machine (SVM) model gives higher sensitivity but a low positive predictive value (PPV), whereas the BLAST search gives lower sensitivity but a high PPV. We therefore propose a hybrid of the two that takes advantage of both. The proposed hybrid model is shown in Flow chart 1.

Flow chart 1 : Flow chart of the proposed hybrid model.

Feature vectors

For training our models we used different combinations of the following features:

6 genomic features (GC%, transcription count, maximum CDS length, maximum 5' UTR length, maximum 3' UTR length and dN/dS)
Amino acid percentage
Dipeptide percentage
Triad percentage
Topological (degree of a protein : the number of distinct proteins that interact with that protein)
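
The sequence-derived composition features can be illustrated with a minimal Python sketch. It assumes the 20 standard amino acids, overlapping dipeptides, and that non-standard residues are simply skipped. The function names are hypothetical and this is not necessarily how the server computes its features. The combined vector has 20 + 400 = 420 entries, which agrees with the feature count listed for the amino acid and dipeptide model in Table 1.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def _clean(seq):
    # Assumption: residues outside the 20 standard letters are dropped.
    return "".join(aa for aa in seq.upper() if aa in AMINO_ACIDS)


def amino_acid_percentages(seq):
    """Percentage of each standard amino acid in the sequence (20 values)."""
    seq = _clean(seq)
    if not seq:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(seq)
    return [100.0 * counts[aa] / len(seq) for aa in AMINO_ACIDS]


def dipeptide_percentages(seq):
    """Percentage of each overlapping dipeptide in the sequence (400 values)."""
    seq = _clean(seq)
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not pairs:
        return [0.0] * len(DIPEPTIDES)
    counts = Counter(pairs)
    return [100.0 * counts[dp] / len(pairs) for dp in DIPEPTIDES]


def feature_vector(seq):
    """Amino acid percentages followed by dipeptide percentages (420 values)."""
    return amino_acid_percentages(seq) + dipeptide_percentages(seq)


vec = feature_vector("ACACAC")  # toy sequence, for illustration only
```

Triad percentages are left out of the sketch because they depend on a grouping of amino acids into classes that is not described on this page.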

Confidence measure for the prediction

Confidence measurement of prediction based on SVM score

Suppose we plot a histogram of a population of total size N using a bandwidth (bin width) W, and let F_i be the frequency within bin i. The density in that bin is then D_i = F_i/(W*N). Multiplying the density by the bandwidth gives the relative frequency F_i/N of the population in that bin, and the relative frequencies sum to 1. The area under the relative frequency density curve is therefore 1, so the curve represents the probability density function of the population.

$$\sum_i D_i W = \sum_i \frac{F_i}{W N}\, W = \sum_i \frac{F_i}{N} = 1, \qquad \text{since } \sum_i F_i = N$$
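
The normalisation can be checked numerically with a short Python sketch. The scores and the bandwidth below are made-up values for illustration, and `histogram_density` is a hypothetical helper, not part of the server.

```python
import math
from collections import Counter


def histogram_density(values, width):
    """Map each bin index to its density D_i = F_i / (W * N)."""
    n = len(values)
    # Bin i covers the interval [i * width, (i + 1) * width).
    freq = Counter(math.floor(v / width) for v in values)
    return {b: f / (width * n) for b, f in freq.items()}


scores = [-1.2, -0.4, 0.1, 0.3, 0.35, 0.9, 1.4, 1.45]  # hypothetical SVM scores
W = 0.5  # hypothetical bandwidth
density = histogram_density(scores, W)

# Summing density * bandwidth over all bins gives the total area under the curve.
area = sum(d * W for d in density.values())
```

Here `area` equals 1 up to floating-point rounding, whatever bandwidth is chosen.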

In our model, the relative frequency density plots of the SVM scores of the positive and the negative training datasets give the probability density functions of the positive and the negative training dataset, respectively. From these we can estimate, from a protein's SVM score, the chance that the protein is or is not important for pluripotency. We express these chances as positiveness and negativeness.

Positiveness : The positiveness of a protein is the chance that the protein is important in pluripotency. We calculate it as the probability P(X<S), where S is the SVM score of the protein and X is a random variable distributed according to the probability density function of the positive training dataset. P(X<S) is the percentage of proteins in the positive training dataset whose SVM score is lower than that of the protein. Figure 1 shows an example of the probability density function of the positive training dataset. The red shaded area is the percentage of proteins with a lower SVM score than the protein in question, which is the positiveness of the protein.

Negativeness : The negativeness of a protein is the chance that the protein is not important in pluripotency. We calculate it as the probability P(X>S), where S is the SVM score of the protein and X is a random variable distributed according to the probability density function of the negative training dataset. P(X>S) is the percentage of proteins in the negative training dataset whose SVM score is higher than that of the protein. Figure 2 shows an example of the probability density function of the negative training dataset. The blue shaded area is the percentage of proteins with a higher SVM score than the protein in question, which is the negativeness of the protein.


Figure 1 : Probability density function of the positive training data set.

Figure 2 : Probability density function of the negative training data set.
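
Positiveness and negativeness can be sketched in Python as empirical fractions of the training scores. This assumes the probabilities are taken directly from the sorted training scores rather than from a smoothed density curve. The function names and the example scores are hypothetical.

```python
from bisect import bisect_left, bisect_right


def positiveness(score, positive_scores):
    """P(X < S): fraction of positive-training scores strictly below `score`."""
    ordered = sorted(positive_scores)
    return bisect_left(ordered, score) / len(ordered)


def negativeness(score, negative_scores):
    """P(X > S): fraction of negative-training scores strictly above `score`."""
    ordered = sorted(negative_scores)
    return (len(ordered) - bisect_right(ordered, score)) / len(ordered)


# Hypothetical SVM scores of the two training datasets.
pos_train = [0.2, 0.5, 0.9, 1.3]
neg_train = [-1.5, -0.8, -0.1, 0.4]

pos_of_query = positiveness(0.6, pos_train)   # two of four positive scores are lower
neg_of_query = negativeness(-0.5, neg_train)  # two of four negative scores are higher
```

A high-scoring query gets a positiveness close to 1, and a low-scoring query gets a negativeness close to 1. Multiply by 100 to express either value as a percentage.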

Confidence measurement of prediction based on E-value of BLAST search

The confidence that a protein is pluripotent is also expressed as a p-value derived from the BLAST search. If E is the minimum E-value of the protein against the database built from the positive training dataset, the p-value is calculated as P = 1 - e^(-E).
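
As an illustration, the standard BLAST relation between an E-value and a p-value, P = 1 - e^(-E), can be written as a small Python function. The name `evalue_to_pvalue` is hypothetical. The sketch uses `math.expm1` so that very small E-values do not lose precision.

```python
import math


def evalue_to_pvalue(e_value):
    """Convert a BLAST E-value to a p-value using P = 1 - exp(-E)."""
    if e_value < 0:
        raise ValueError("E-value must be non-negative")
    # expm1(x) computes exp(x) - 1 accurately for small x, so -expm1(-E) = 1 - exp(-E).
    return -math.expm1(-e_value)


p_small = evalue_to_pvalue(1e-10)  # for small E, P is approximately E
p_one = evalue_to_pvalue(1.0)      # 1 - 1/e
```

For small E-values, P is nearly equal to E, so a strong BLAST hit against the positive training dataset yields a p-value close to 0.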

Results

5-fold cross-validation results : The results of the different models are given in Table 1.

Table 1 : Performance comparison of different models.

Models	No. of features	Sensitivity (%)	Specificity (%)	Accuracy (%)	PPV (%)	NPV (%)	MCC	AUC (ROC)
SVM(Amino acid and dipeptide)	420	71.85	70.38	70.71	41.41	89.62	0.36	0.78
SVM(Amino acid and triad)	363	72.96	71.67	71.96	43.09	90.15	0.39	0.77
BLAST	NA	60.74	91.85	84.86	68.33	90.87	0.55	0.76
SVM(Amino acid and dipeptide)+BLAST	NA	77.41	79.72	79.20	52.51	92.41	0.51	0.82
SVM(Amino acid and triad)+BLAST	NA	78.52	77.89	78.04	50.72	92.60	0.49	0.82
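
For reference, the measures in Table 1 other than the AUC are standard functions of the confusion matrix. The Python sketch below shows their usual definitions. The counts in the example call are hypothetical and are not taken from the cross-validation runs. The function returns fractions, whereas the table reports sensitivity, specificity, accuracy, PPV and NPV as percentages.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),            # true positive rate
        "specificity": tn / (tn + fp),            # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),                    # positive predictive value
        "npv": tn / (tn + fn),                    # negative predictive value
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


# Hypothetical counts, for illustration only.
m = classification_metrics(tp=80, fp=50, tn=150, fn=20)
```

With an imbalanced dataset such as this one (270 positive and 932 negative proteins), PPV and MCC are more informative than accuracy alone, which is why they are reported alongside it.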

Performance on the unknown set of all Swiss-Prot proteins (mouse and human) : We evaluated all mouse and human Swiss-Prot proteins with our proposed prediction model. For mouse, we obtained 233 novel core pluripotent proteins and 323 novel extended pluripotent proteins. For human, we obtained 167 novel core pluripotent proteins and 385 extended pluripotent proteins.

Figure 3 : Protein-protein interactions of pluripotent proteins that were predicted from all Swiss-Prot mouse proteins and are associated with the core pluripotent transcription factors Sox2, Pou5f1 and Nanog. Blue nodes are newly predicted pluripotent proteins that are not in the positive training dataset. The other nodes are predicted pluripotent proteins that are also in the positive training dataset.

Figure 4 : Protein-protein interactions of pluripotent proteins that were predicted from all Swiss-Prot human proteins and are associated with the core pluripotent transcription factors SOX2, POU5F1 and NANOG. Blue nodes are newly predicted pluripotent proteins that are not in the positive training dataset. The other nodes are predicted pluripotent proteins that are also in the positive training dataset.

Conclusion :

PluriPred is an organism-independent prediction server that predicts pluripotent proteins/genes and reports a confidence level for each prediction. It can therefore support the biological community not only in pluripotent stem cell research, but also in developmental biology and cancer research.