Benchmark Data Sets for Graph Kernels

This page contains collected benchmark data sets for the evaluation of graph kernels. The data sets were collected by Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann with partial support of the German Science Foundation (DFG) within the Collaborative Research Center SFB 876 “Providing Information by Resource-Constrained Data Analysis”, project A6 “Resource-efficient Graph Mining”.

  • 11.05.2017: Added twelve new data sets from [24].
  • 17.06.2016: Added Synthie data set from [21].
  • 10.05.2016: Added eight new data sets from [16].
  • 19.04.2016: Added FRANKENSTEIN data set from [15].
  • 13.04.2016: Added SYNTHETICnew data set from [3,10].
  • 08.04.2016: Added six new data sets from [14].
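
| Name | Source | Num. of Graphs | Num. of Classes | Avg. Number of Nodes | Avg. Number of Edges | Node Labels | Edge Labels | Node Attr. (Dim.) | Edge Attr. (Dim.) | Download (ZIP) |
|---|---|---|---|---|---|---|---|---|---|---|
| AIDS | [16,17] | 2000 | 2 | 15.69 | 16.20 | + | + | + (4) | - | AIDS |
| BZR | [7] | 405 | 2 | 35.75 | 38.36 | + | - | + (3) | - | BZR |
| BZR_MD | [7,23] | 306 | 2 | 21.30 | 225.06 | + | + | - | + (1) | BZR_MD |
| COIL-DEL | [16,18] | 3900 | 100 | 21.54 | 54.24 | - | + | + (2) | - | COIL-DEL |
| COIL-RAG | [16,18] | 3900 | 100 | 3.01 | 3.02 | - | - | + (64) | + (1) | COIL-RAG |
| COLLAB | [14] | 5000 | 3 | 74.49 | 2457.78 | - | - | - | - | COLLAB |
| COX2 | [7] | 467 | 2 | 41.22 | 43.45 | + | - | + (3) | - | COX2 |
| COX2_MD | [7,23] | 303 | 2 | 26.28 | 335.12 | + | + | - | + (1) | COX2_MD |
| DHFR | [7] | 467 | 2 | 42.43 | 44.54 | + | - | + (3) | - | DHFR |
| DHFR_MD | [7,23] | 393 | 2 | 23.87 | 283.01 | + | + | - | + (1) | DHFR_MD |
| ER_MD | [7,23] | 446 | 2 | 21.33 | 234.85 | + | + | - | + (1) | ER_MD |
| DD | [6,22] | 1178 | 2 | 284.32 | 715.66 | + | - | - | - | DD |
| ENZYMES | [4,5] | 600 | 6 | 32.63 | 62.14 | + | - | + (18) | - | ENZYMES |
| Fingerprint | [16,19] | 2800 | 4 | 5.42 | 4.42 | - | - | + (2) | + (2) | Fingerprint |
| FIRSTMM_DB | [11,12,13] | 41 | 11 | 1377.27 | 3074.10 | + | - | + (1) | + (2) | FIRSTMM_DB |
| FRANKENSTEIN | [15] | 4337 | 2 | 16.90 | 17.88 | - | - | + (780) | - | FRANKENSTEIN |
| IMDB-BINARY | [14] | 1000 | 2 | 19.77 | 96.53 | - | - | - | - | IMDB-BINARY |
| IMDB-MULTI | [14] | 1500 | 3 | 13.00 | 65.94 | - | - | - | - | IMDB-MULTI |
| Letter-high | [16] | 2250 | 15 | 4.67 | 4.50 | - | - | + (2) | - | Letter-high |
| Letter-low | [16] | 2250 | 15 | 4.68 | 3.13 | - | - | + (2) | - | Letter-low |
| Letter-med | [16] | 2250 | 15 | 4.67 | 4.50 | - | - | + (2) | - | Letter-med |
| Mutagenicity | [16,20] | 4337 | 2 | 30.32 | 30.77 | + | + | - | - | Mutagenicity |
| MSRC_9 | [13] | 221 | 8 | 40.58 | 97.94 | + | - | - | - | MSRC_9 |
| MSRC_21 | [13] | 563 | 20 | 77.52 | 198.32 | + | - | - | - | MSRC_21 |
| MSRC_21C | [13] | 209 | 20 | 40.28 | 96.60 | + | - | - | - | MSRC_21C |
| MUTAG | [1,23] | 188 | 2 | 17.93 | 19.79 | + | + | - | - | MUTAG |
| NCI1 | [8,9,22] | 4110 | 2 | 29.87 | 32.30 | + | - | - | - | NCI1 |
| NCI109 | [8,9,22] | 4127 | 2 | 29.68 | 32.13 | + | - | - | - | NCI109 |
| PTC_FM | [2,23] | 349 | 2 | 14.11 | 14.48 | + | + | - | - | PTC_FM |
| PTC_FR | [2,23] | 351 | 2 | 14.56 | 15.00 | + | + | - | - | PTC_FR |
| PTC_MM | [2,23] | 336 | 2 | 13.97 | 14.32 | + | + | - | - | PTC_MM |
| PTC_MR | [2,23] | 344 | 2 | 14.29 | 14.69 | + | + | - | - | PTC_MR |
| PROTEINS | [4,6] | 1113 | 2 | 39.06 | 72.82 | + | - | + (1) | - | PROTEINS |
| PROTEINS_full | [4,6] | 1113 | 2 | 39.06 | 72.82 | + | - | + (29) | - | PROTEINS_full |
| REDDIT-BINARY | [14] | 2000 | 2 | 429.63 | 497.75 | - | - | - | - | REDDIT-BINARY |
| REDDIT-MULTI-5k | [14] | 4999 | 5 | 508.52 | 594.87 | - | - | - | - | REDDIT-MULTI-5k |
| REDDIT-MULTI-12k | [14] | 11929 | 11 | 391.41 | 456.89 | - | - | - | - | REDDIT-MULTI-12k |
| SYNTHETIC | [3] | 300 | 2 | 100.00 | 196.00 | - | - | + (1) | - | SYNTHETIC |
| SYNTHETICnew | [3,10] | 300 | 2 | 100.00 | 196.25 | - | - | + (1) | - | SYNTHETICnew |
| Synthie | [21] | 400 | 4 | 95.00 | 172.93 | - | - | + (15) | - | Synthie |
| Tox21_AHR | [24] | 8169 | 2 | 18.09 | 18.50 | + | + | - | - | Tox21_AHR |
| Tox21_AR | [24] | 9362 | 2 | 18.39 | 18.84 | + | + | - | - | Tox21_AR |
| Tox21_AR-LBD | [24] | 8599 | 2 | 17.77 | 18.16 | + | + | - | - | Tox21_AR-LBD |
| Tox21_ARE | [24] | 7167 | 2 | 16.28 | 16.52 | + | + | - | - | Tox21_ARE |
| Tox21_aromatase | [24] | 7226 | 2 | 17.50 | 17.79 | + | + | - | - | Tox21_aromatase |
| Tox21_ATAD5 | [24] | 9091 | 2 | 17.89 | 18.30 | + | + | - | - | Tox21_ATAD5 |
| Tox21_ER | [24] | 7697 | 2 | 17.58 | 17.94 | + | + | - | - | Tox21_ER |
| Tox21_ER_LBD | [24] | 8753 | 2 | 18.06 | 18.47 | + | + | - | - | Tox21_ER_LBD |
| Tox21_HSE | [24] | 8150 | 2 | 16.72 | 17.04 | + | + | - | - | Tox21_HSE |
| Tox21_MMP | [24] | 7320 | 2 | 17.49 | 17.83 | + | + | - | - | Tox21_MMP |
| Tox21_p53 | [24] | 8634 | 2 | 17.79 | 18.19 | + | + | - | - | Tox21_p53 |
| Tox21_PPAR-gamma | [24] | 8184 | 2 | 17.23 | 17.55 | + | + | - | - | Tox21_PPAR-gamma |

In the label and attribute columns, + marks information that is available and - marks information that is not; the number in parentheses is the attribute dimension.

All Data Sets: DS_all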

File Format

The data sets have the following format (replace DS by the name of the data set):

Let

  • n = total number of nodes
  • m = total number of edges
  • N = number of graphs
  1. DS_A.txt (m lines): sparse (block diagonal) adjacency matrix for all graphs; each line corresponds to one entry (row, col), i.e. a pair (node_id, node_id). All graphs are undirected, so DS_A.txt contains two entries for each edge.
  2. DS_graph_indicator.txt (n lines): column vector of graph identifiers for all nodes of all graphs, the value in the i-th line is the graph_id of the node with node_id i
  3. DS_graph_labels.txt (N lines): class labels for all graphs in the data set, the value in the i-th line is the class label of the graph with graph_id i
  4. DS_node_labels.txt (n lines): column vector of node labels, the value in the i-th line corresponds to the node with node_id i
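
As a rough illustration of how the required files fit together, the following Python sketch groups the entries of DS_A.txt by graph. It is not part of the data set distribution. The function name `group_edges_by_graph` is hypothetical, and the sketch assumes that the two node ids on each line of DS_A.txt are separated by a comma and that node ids and graph ids are 1-based line numbers, as the "i-th line" wording above suggests.

```python
from collections import defaultdict


def _non_blank(lines):
    """Strip each line and drop empty ones (e.g. a trailing blank line)."""
    return [line.strip() for line in lines if line.strip()]


def group_edges_by_graph(adjacency_lines, indicator_lines, label_lines):
    """Split the block-diagonal edge list into one edge list per graph.

    Each argument is an iterable of text lines, for example an open file
    handle for DS_A.txt, DS_graph_indicator.txt and DS_graph_labels.txt.
    Assumption: ids are 1-based line numbers and DS_A.txt is comma separated.
    """
    # node_id -> graph_id, where node_id is the line number in the indicator file
    graph_of = {
        node_id: int(value)
        for node_id, value in enumerate(_non_blank(indicator_lines), start=1)
    }
    # graph_id -> class label, kept as text so no label encoding is assumed
    label_of = dict(enumerate(_non_blank(label_lines), start=1))

    edges = defaultdict(list)
    for line in _non_blank(adjacency_lines):
        row, col = (int(part) for part in line.split(","))
        # Both endpoints belong to the same graph, so looking up one suffices.
        edges[graph_of[row]].append((row, col))

    return {
        graph_id: {"label": label, "edges": edges[graph_id]}
        for graph_id, label in label_of.items()
    }
```

With real files, the three arguments could be file handles opened on DS_A.txt, DS_graph_indicator.txt and DS_graph_labels.txt, with DS replaced by the data set name. Because every undirected edge is stored twice, each per-graph list contains both (u, v) and (v, u).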

There are optional files if the respective information is available:

  • DS_edge_labels.txt (m lines; same size as DS_A.txt): labels for the edges in DS_A.txt
  • DS_edge_attributes.txt (m lines; same size as DS_A.txt): attributes for the edges in DS_A.txt
  • DS_node_attributes.txt (n lines): matrix of node attributes, the comma-separated values in the i-th line are the attribute vector of the node with node_id i
  • DS_graph_attributes.txt (N lines): regression values for all graphs in the data set, the value in the i-th line is the attribute of the graph with graph_id i
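
The optional files line up with the required ones by line number, which the following Python sketch illustrates. The helper names `parse_node_attributes` and `pair_edges_with_labels` are hypothetical. The sketch assumes that the lines of DS_A.txt are comma separated, like the node attributes, and that edge labels can be kept as plain text.

```python
def parse_node_attributes(lines):
    """Return one list of floats per node.

    Index 0 of the result holds the attribute vector of node_id 1,
    because the i-th line of DS_node_attributes.txt describes node i.
    """
    return [
        [float(value) for value in line.split(",")]
        for line in lines
        if line.strip()
    ]


def pair_edges_with_labels(adjacency_lines, edge_label_lines):
    """Map each (row, col) entry of DS_A.txt to the label on the same line.

    Assumption: DS_A.txt is comma separated and has as many non-blank
    lines as DS_edge_labels.txt.
    """
    entries = [
        tuple(int(part) for part in line.split(","))
        for line in adjacency_lines
        if line.strip()
    ]
    labels = [line.strip() for line in edge_label_lines if line.strip()]
    if len(entries) != len(labels):
        raise ValueError("edge label file and adjacency file differ in length")
    return dict(zip(entries, labels))
```

The length check is a cheap safeguard: a mismatch between the two files would otherwise attach labels to the wrong edges without any error.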

Citing this Website

We encourage you to refer to our website at http://graphkernels.cs.tu-dortmund.de if you have used the data sets for your publication. Please use the following BibTeX citation:

@misc{KKMMN2016,
  title  = {Benchmark Data Sets for Graph Kernels},
  author = {Kristian Kersting and Nils M. Kriege and Christopher Morris and Petra Mutzel and Marion Neumann},
  year   = {2016},
  url    = {http://graphkernels.cs.tu-dortmund.de}
}

If your bibliography style does not support the url field, you may use this alternative:

@misc{KKMMN2016,
  title  = {Benchmark Data Sets for Graph Kernels},
  author = {Kristian Kersting and Nils M. Kriege and Christopher Morris and Petra Mutzel and Marion Neumann},
  year   = {2016},
  note   = {\url{http://graphkernels.cs.tu-dortmund.de}}
}

Bibliography

[1] Debnath, A.K., Lopez de Compadre, R.L., Debnath, G., Shusterman, A.J., and Hansch, C. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J. Med. Chem. 34(2):786-797 (1991).

[2] Helma, C., King, R. D., Kramer, S., and Srinivasan, A. The Predictive Toxicology Challenge 2000–2001. Bioinformatics, 2001, 17, 107-108. URL: www.predictive-toxicology.org/ptc

[3] Feragen, A., Kasenburg, N., Petersen, J., de Bruijne, M., Borgwardt, K.M.: Scalable kernels for graphs with continuous attributes. In: C.J.C. Burges, L. Bottou, Z. Ghahramani, K.Q. Weinberger (eds.) NIPS, pp. 216-224 (2013).

[4] K. M. Borgwardt, C. S. Ong, S. Schoenauer, S. V. N. Vishwanathan, A. J. Smola, and H. P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(Suppl 1):i47–i56, Jun 2005.

[5] I. Schomburg, A. Chang, C. Ebeling, M. Gremse, C. Heldt, G. Huhn, and D. Schomburg. Brenda, the enzyme database: updates and major new developments. Nucleic Acids Research, 32D:431–433, 2004.

[6] P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. J. Mol. Biol., 330(4):771–783, Jul 2003.

[7] Sutherland, J. J.; O'Brien, L. A. & Weaver, D. F. Spline-fitting with a genetic algorithm: a method for developing classification structure-activity relationships. J. Chem. Inf. Comput. Sci., 2003, 43, 1906-1915.

[8] N. Wale and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. In Proc. of ICDM, pages 678–689, Hong Kong, 2006.

[9] http://pubchem.ncbi.nlm.nih.gov

[10] http://image.diku.dk/aasa/papers/graphkernels_nips_erratum.pdf

[11] M. Neumann, P. Moreno, L. Antanas, R. Garnett, K. Kersting. Graph Kernels for Object Category Prediction in Task-Dependent Robot Grasping. Eleventh Workshop on Mining and Learning with Graphs (MLG-13), Chicago, Illinois, USA, 2013.

[12] http://www.first-mm.eu/data.html

[13] M. Neumann, R. Garnett, C. Bauckhage, and K. Kersting. Propagation kernels: efficient graph kernels from propagated information. Machine Learning, 102(2):209–245, 2016.

[14] Pinar Yanardag and S.V.N. Vishwanathan. 2015. Deep Graph Kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 1365-1374.

[15] Francesco Orsini, Paolo Frasconi, and Luc De Raedt. 2015. Graph invariant kernels. In Proceedings of the 24th International Conference on Artificial Intelligence (IJCAI'15), Qiang Yang and Michael Wooldridge (Eds.). AAAI Press, 3756-3762.

[16] Riesen, K. and Bunke, H.: IAM Graph Database Repository for Graph Based Pattern Recognition and Machine Learning. In: da Vitoria Lobo, N. et al. (Eds.), SSPR&SPR 2008, LNCS, vol. 5342, pp. 287-297, 2008.

[17] AIDS Antiviral Screen Data (2004)

[18] S. A. Nene, S. K. Nayar and H. Murase. Columbia Object Image Library (COIL-100), Technical Report, Department of Computer Science, Columbia University CUCS-006-96, Feb. 1996.

[19] NIST Special Database 4

[20] Jeroen Kazius, Ross McGuire, and Roberta Bursi. Derivation and Validation of Toxicophores for Mutagenicity Prediction. Journal of Medicinal Chemistry, 2005, 48(1), 312-320.

[21] Christopher Morris, Nils M. Kriege, Kristian Kersting, Petra Mutzel. Faster Kernels for Graphs with Continuous Attributes via Hashing. IEEE International Conference on Data Mining (ICDM), 2016.

[22] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. 2011. Weisfeiler-Lehman Graph Kernels. J. Mach. Learn. Res. 12 (November 2011), 2539-2561.

[23] Nils Kriege, Petra Mutzel. 2012. Subgraph Matching Kernels for Attributed Graphs. International Conference on Machine Learning 2012.

[24] Tox21 Data Challenge 2014

Contact

If you have any questions regarding the data sets or are interested in adding your graph data, please write an email to christopher.morris@tu-dortmund.de.

 
Last modified: 2017-05-11 17:52 by Christopher Morris