LMDIPred

Linear Motif Domain Interaction Prediction, abbreviated as "LMDIPred", is a web server that scans user-provided amino-acid sequence(s) for peptides conforming to the linear motifs that mediate Protein-Protein Interactions (PPIs) with SH3, WW and PDZ domains (Sarkar et al., PLoS One, 2018. doi: 10.1371/journal.pone.0200430).

The following figure compares the total number of SwissProt proteins known to contain these three domains across all organisms with the number in humans only (as of July 10, 2017):

Datasets:

Positive dataset. A non-redundant dataset consisting of 115 SH3-domain binding 6-mer peptides, 140 WW-domain binding 6-mer peptides and 165 PDZ-domain binding 4-mer peptides was created from the LMPID database, to be used as positive training examples for the respective classes of peptides.

Download LMDIPred Positive datasets:

Negative dataset. A set of 3960 FASTA-formatted protein sequences [3192 from Oryza sativa subsp. japonica (short-grained Asian rice), 400 from Solanum tuberosum (potato), and 368 from Triticum aestivum (common wheat)] was downloaded from UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase. Perl scripts were used to extract 6-residue (for SH3 & WW) and 4-residue (for PDZ) peptides from random positions within these sequences, and a set of 120 such random peptides was used as negative training examples for each class of peptide ligands.
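The random-peptide extraction described above can be sketched as follows. This is a minimal Python illustration, not the Perl scripts used for LMDIPred; the function names `parse_fasta` and `sample_random_peptides`, the fixed seed, and the choice to allow duplicate peptides are all assumptions made for the example.

```python
import random


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def sample_random_peptides(sequences, length, count, seed=0):
    """Draw `count` peptides of `length` residues from random positions.

    Assumption: sampling is with replacement, so duplicates are possible.
    """
    rng = random.Random(seed)
    eligible = [s for s in sequences if len(s) >= length]
    if not eligible:
        raise ValueError("no sequence is long enough for the requested length")
    peptides = []
    for _ in range(count):
        seq = rng.choice(eligible)
        start = rng.randrange(len(seq) - length + 1)
        peptides.append(seq[start:start + length])
    return peptides


# Hypothetical toy input; real input would be the downloaded Swiss-Prot FASTA file.
toy_fasta = """>seq1 toy protein
MKTAYIAKQRQISFVKSHFSRQ
>seq2 toy protein
LEERLGLIEVQAPILSRVGDGT
"""
toy_sequences = list(parse_fasta(toy_fasta).values())
six_mers = sample_random_peptides(toy_sequences, length=6, count=5)   # SH3 / WW
four_mers = sample_random_peptides(toy_sequences, length=4, count=5)  # PDZ
```

Fixing the seed makes the sampled negative set reproducible between runs.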

Download Negative dataset: Random peptide Instances

Independent dataset. The independent dataset was composed of 62 experimentally validated PDZ-binding 10-mer mouse peptides from Stiffler et al. [PubMed ID: 17641200], and 25 experimentally validated SH3-binding yeast peptides of variable length from Tonikian et al. [PubMed ID: 19841731].

Download Independent datasets:

Table 1: Overview of the datasets for each class of ligand motifs :

Domain	Positive Training Data	Negative Training Data	Approx Ratio (Positive:Negative)
SH3	115	425	~1:4
WW	140	400	~1:3
PDZ	165	375	~1:2

Table 2: Performance analysis of SVM models for each class of ligand motifs using different input features :

[The area under the Receiver Operating Characteristic (ROC) curve, or "AUC" ("Area Under Curve"), summarizes how well the prediction method separates the two classes, and can be interpreted as the probability that the classifier will assign a higher score to a randomly chosen positive example than to a randomly chosen negative example. An AUC of 90-100% denotes excellent prediction; performance falls as the AUC decreases, and an AUC of 50% or less denotes prediction that is no better than random.]
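The probabilistic reading of the AUC can be checked directly by comparing every positive score with every negative score. The sketch below is only an illustration of that definition; the function name and the example scores are hypothetical, and ties are counted as half a win, which is the usual convention.

```python
def auc_by_pairwise_ranking(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Ties contribute 0.5, so a classifier giving identical scores gets 0.5.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical decision scores, not LMDIPred output.
perfect = auc_by_pairwise_ranking([0.9, 0.8], [0.2, 0.1])   # every positive ranks higher
partial = auc_by_pairwise_ranking([0.9, 0.4], [0.5, 0.1])   # 3 of 4 pairs ranked correctly
```

Multiplying the returned fraction by 100 gives the percentage scale used in Table 2.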

Input Features	SH3 AUC (%)	WW AUC (%)	PDZ AUC (%)
Amino Acid Composition (AAC)	88.05	93.54	92.31
Dipeptide Composition (DPC)	86.79	96.33	93.65
Tripeptide Composition (TPC)	94.72	96.11	92.44
AAC + DPC	94.63	97.77	93.98
AAC + TPC	95.56	97.86	97.69
DPC + TPC	95.34	97.58	94.89
AAC + DPC + TPC	97.45	98.35	90.49
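For readers unfamiliar with the composition features in Table 2, the sketch below computes AAC and DPC vectors for a single peptide. It assumes the common convention of normalizing counts to fractions over the 20 standard amino acids; the exact normalization used by LMDIPred is not stated here, and the function names and example peptide are hypothetical. TPC follows the same pattern with 8000 tripeptides.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(peptide):
    """Return a 20-element vector of residue fractions (AAC)."""
    if not peptide:
        raise ValueError("peptide must not be empty")
    return [peptide.count(aa) / len(peptide) for aa in AMINO_ACIDS]


def dipeptide_composition(peptide):
    """Return a 400-element vector of overlapping dipeptide fractions (DPC)."""
    total = len(peptide) - 1
    if total < 1:
        raise ValueError("peptide must have at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = peptide[i:i + 2]
        if pair in counts:  # pairs containing non-standard residues are skipped
            counts[pair] += 1
    return [counts[pair] / total for pair in DIPEPTIDES]


# Hypothetical 6-mer used only to show the shapes of the vectors.
example = "PPLPPR"
aac = amino_acid_composition(example)
dpc = dipeptide_composition(example)
combined = aac + dpc  # the "AAC + DPC" feature set is a simple concatenation here
```

Concatenating the vectors is one straightforward way to build the combined feature sets listed in the table.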

Table 3A: Performance of different prediction methods in 5-fold cross-validation for the SH3 domain binding peptides :

Method	Threshold	Sensitivity	Specificity	Accuracy	MCC
SVM Prediction	-0.25	0.94	0.95	0.95	0.85
PSSM Scanning	1.00	0.70	0.93	0.88	0.62
Motif Instance Matching	NA	0.17	1.00	0.83	0.36
Regular Expression Scanning	NA	0.81	0.91	0.89	0.67

Table 3B: Performance of different prediction methods in 5-fold cross-validation for the WW domain binding peptides :

Method	Threshold	Sensitivity	Specificity	Accuracy	MCC
SVM Prediction	-0.05	0.96	0.96	0.96	0.90
PSSM Scanning	0.50	0.88	0.84	0.85	0.66
Motif Instance Matching	NA	0.13	1.00	0.78	0.29
Regular Expression Scanning	NA	0.89	0.99	0.97	0.91

Table 3C: Performance of different prediction methods in 5-fold cross-validation for the PDZ domain binding peptides :

Method	Threshold	Sensitivity	Specificity	Accuracy	MCC
SVM Prediction	-0.10	0.92	0.93	0.92	0.83
PSSM Scanning	0.60	0.69	0.95	0.86	0.68
Motif Instance Matching	NA	0.30	1.00	0.78	0.47
Regular Expression Scanning	NA	0.77	0.87	0.83	0.63
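The four measures reported in Tables 3A to 3C have standard definitions in terms of confusion-matrix counts. The sketch below computes them from true/false positive and negative counts; the function name and the example counts are hypothetical and are not taken from the LMDIPred evaluation.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, accuracy and MCC from raw counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    sensitivity = tp / (tp + fn)          # fraction of positives recovered
    specificity = tn / (tn + fp)          # fraction of negatives rejected
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fp=20, tn=80, fn=10)
```

Because the training sets are imbalanced (see Table 1), the MCC is a useful complement to accuracy: a method such as Motif Instance Matching can reach high accuracy through specificity alone while its MCC stays low.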

Table 3: ROC plots of different prediction methods for different datasets: [Green (SVM Prediction), Blue (PSSM Scanning), Black (Motif Instance Matching, MIM) and Red (Regular Expression Scanning, RES). The ROC plots for the MIM and RES methods appear as smooth flat lines compared with those for SVM and PSSM, because the SVM and PSSM outputs consist of continuous scores, while MIM and RES produce discrete outcomes, one or zero (either “match” or “mismatch”).]

(A) SH3 domain		(B) WW domain		(C) PDZ domain
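To illustrate why a pattern-based method yields only discrete outcomes, the sketch below slides a fixed-width window along a sequence and records 1 or 0 depending on whether a regular expression matches. The pattern `P..P` (the widely cited PxxP core of many SH3 ligands) and the example sequence are assumptions chosen for illustration; they are not the expressions used by LMDIPred.

```python
import re


def scan_sequence(sequence, pattern, width):
    """Return (1-based start, window, outcome) for every window of `width` residues.

    The outcome is 1 if the pattern occurs anywhere in the window, else 0.
    """
    regex = re.compile(pattern)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        outcome = 1 if regex.search(window) else 0
        hits.append((start + 1, window, outcome))
    return hits


# Hypothetical query sequence and illustrative SH3-like pattern.
results = scan_sequence("MAPPLPPRKS", pattern=r"P..P", width=6)
outcomes = [outcome for _, _, outcome in results]
```

With only two possible output values there is a single operating point to plot, whereas a continuous SVM or PSSM score can be thresholded at many values to trace a full curve.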

For queries and suggestions please contact Dr. Sudipto Saha (ssaha4@jcbose.ac.in, ssaha4@gmail.com)