ProtDDG-Bench: Benchmarking resource for predictors of protein stability change

ProtDDG-Bench

Benchmarking resource for predictors of protein stability change

All the datasets for testing the performance of the predictors of DDG upon mutation are available at the protddg-bench repository. The protddg-bench repository includes the following datasets:

VB1432: 1432 variants from 79 protein structures corresponding to 65 clusters. 9 mutations have double experimental data. 1 mutation is not mapping to the strcuture. The 1LRP structure was replace with 1LMB. For structure 1WQ5 was considered the chain B. Data from PMID:29597263.
S2648: 2648 variants from 132 protein strcutures corresponding to 113 clusters. Experimental DDGs of the same variants are avereged. Data from PMID:21569468.
Ssym: 634 variants from 357 structures corresponding to 13 clusters. Dataset composed by 342 mutations and their reverse. From the original publication, the data of few mutants have been corrected. Original data from PMID:29718106.
Broom: Dataset composed by 605 mutations from 58 structures corresponding to 50 clusters. This dataset contains 53 mutations from non-native proteins and 59 mutations referring to fragment of the protein. Data from PMID:28710274.
Myoglobin: 134 variants from myoglobin from structure 1BZ6. Data from PMID:26054434.
P53: 42 variants from P53 structure 2OCJ. Data from PMID:24281696.
KORPM: 2,371 mutations from 129 protein families with sequence identity <25% and testing set of 461 variants. Data from PMID:36629451.
PTMUL: 914 multiple site variants from 91 protein structures and 77 clusters. Data from PMID:31266447.

Datasets files

The directory S2648 and VB1432 contains 10 files for 10-folds cross-validation tests. Furthermore, the cross-validation subset of S2648 and VB1432 are consistent. This means that the following predictions can be performed:

	Training: not SET_i vb1432-10fold-split-j.tsv -> Test: SET_i s2648-10fold-split-j.tsv
        Training: not SET_i s2648-10fold-split-j.tsv  -> Test: SET_i vb1432-10fold-split-j.tsv

The directory BROOM contains a 5-fold split of the BROOM dataset. Given the numeber of mutations mutations form the same cluster the set has been diveded in 5 subsets. The test on this dataset can be performed as follow:

        Training: not SET_i train-vb1432-test-broom.tsv -> Test: SET_i broom-5fold.tsv
        Training: not SET_i train-s2648-test-broom.tsv  -> Test: SET_i broom-5fold.tsv

The directory SSYM contains a 5-fold split of the Ssym dataset. Given the large number of mutations form the same cluster the set has been diveded in 5 subsets. The test on this dataset can be performed as follow:

         Training: not SET_i train-vb1432-test-ssym.tsv -> Test: SET_i ssym-5fold.tsv
         Training: not SET_i train-s2648-test-ssym.tsv  -> Test: SET_i ssym-5fold.tsv

The directory MYOGLOBIN test contains the testing dataset myoglogin.tsv with the best subsets of VB1432 and S2648 to be used as possible training. The following prediction can be performed:

        Training: train-vb1432-test-myoglobin.tsv (1399) -> Test: myoglobin.tsv
        Training: train-s2648-test-myoglobin.tsv  (2607) -> Test: myoglobin.tsv

The directory P53 test contains the testing dataset p53.tsv with the best subsets of VB1432 and S2648 to be used as possible training. The following prediction can be performed:

        Training: train-vb1432-test-p53.tsv       (1427) -> Test: p53.tsv
        Training: train-s2648-test-p53.tsv        (2643) -> Test: p53.tsv

The directory KORPM contains 10 files for 10-folds cross-validation tests. Furtermore it contains 2 training and 2 testing files. The testing files are Ssym and S461. The tests on this dataset can be performed as follow:

         Training: not SET_i korpm-10fold-split-j.tsv -> Test: SET_i korpm-10fold-split-j.tsv
         Training: not Ssym  train-korpm-nossym.tsv  (1,807) -> Test: ssym-korpm.tsv
         Training: not S461  train-korpm-nos461.tsv  (2,224) -> Test: s461-korpm.tsv

The directory PTMUL contains files for testing predictions on multiple site mutations starting from a training on a set single point mutations. The directory also includes a 5-fold split of the PTMUL dataset. Given the number of mutations mutations form the same cluster the set has been diveded in 5 subsets. The test on this dataset can be performed as follow:

         Training: not SET_i train-vb1432-test-ptmul.tsv -> Test: SET_i ptmul-5fold.tsv
         Training: not SET_i train-s2648-test-ptmul.tsv  -> Test: SET_i ptmul-5fold.tsv

Clustering

The file cluster-545-pdbchains.txt contains 132 clusters of 545 PDB chains. The clustering is obtained using blastclust with the options -S 25 -L 0.5 -b F.
The file cluster-129-korpm-pdbchain.txt contains 129 groups of proteins from the korpm dataset. The clustering is obtained using MMseq with 25% sequence identity cutoff.

Testing

To test your method you need to:

1. replace the file predict-ddg-value.py with your own script that runs taking in input only the testing and training files and returning in standard output the experimental and the predicted DDGs respectively The program runs as follows:

        ./scripts/predict-ddg-value.py test_file.txt train_file.txt

2. Generate an inputfile containing a two columns representing the PDB chain identifier and the mutation followed by all the inputfeatures. The full list of mutations are reported in the file data/unique-mutations-input.txt and example of input file with two input features is ifeatures-KYTJ820101-BASU050101.txt. Finally run scripts/test.py input_feature_file.txt to score the performace of your method. For example runs:

	./scripts/test.py data/ifeatures-KYTJ820101-BASU050101.txt