Benchmark Data Sets for Graph Kernels

This page contains collected benchmark data sets for the evaluation of graph kernels. The data sets were collected by Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann with partial support of the German Science Foundation (DFG) within the Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Data Analysis", project A6 "Resource-efficient Graph Mining".

  • 02.03.2020: Added three new data sets from [29].
  • 14.01.2020: Added twenty-four new data sets from [24].
  • 28.08.2019: Added twenty-two new data sets from [28].
  • 09.07.2019: Added two new data sets from [27].
  • 23.10.2018: Added five new data sets from [26].
  • 13.02.2018: Added Cuneiform data set from [25].
  • 11.05.2017: Added twelve new data sets from [24].
  • 17.06.2016: Added Synthie data set from [21].
  • 10.05.2016: Added eight new data sets from [16].
  • 19.04.2016: Added FRANKENSTEIN data set from [15].
  • 13.04.2016: Added SYNTHETICnew data set from [3,10].
  • 08.04.2016: Added six new data sets from [14].
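| Name | Source | Num. of Graphs | Num. of Classes | Avg. Number of Nodes | Avg. Number of Edges | Node Labels | Edge Labels | Node Attr. (Dim.) | Edge Attr. (Dim.) | Download (ZIP) |
|---|---|---|---|---|---|---|---|---|---|---|
| AIDS | [16,17] | 2000 | 2 | 15.69 | 16.20 | + | + | + (4) | - | AIDS |
| alchemy_dev | [29] | 99776 | R (12) | 9.71 | 10.02 | + | + | - | - | alchemy_dev |
| alchemy_test | [29] | 15760 | - | 11.25 | 11.76 | + | + | - | - | alchemy_test |
| alchemy_valid | [29] | 3951 | R (12) | 11.25 | 11.77 | + | + | - | - | alchemy_valid |
| BZR | [7] | 405 | 2 | 35.75 | 38.36 | + | - | + (3) | - | BZR |
| BZR_MD | [7,23] | 306 | 2 | 21.30 | 225.06 | + | + | - | + (1) | BZR_MD |
| COIL-DEL | [16,18] | 3900 | 100 | 21.54 | 54.24 | - | + | + (2) | - | COIL-DEL |
| COIL-RAG | [16,18] | 3900 | 100 | 3.01 | 3.02 | - | - | + (64) | + (1) | COIL-RAG |
| COLLAB | [14] | 5000 | 3 | 74.49 | 2457.78 | - | - | - | - | COLLAB |
| COLORS-3 | [27] | 10500 | 11 | 61.31 | 91.03 | - | - | + (4) | - | COLORS-3 |
| COX2 | [7] | 467 | 2 | 41.22 | 43.45 | + | - | + (3) | - | COX2 |
| COX2_MD | [7,23] | 303 | 2 | 26.28 | 335.12 | + | + | - | + (1) | COX2_MD |
| Cuneiform | [25] | 267 | 30 | 21.27 | 44.80 | + | + | + (3) | + (2) | Cuneiform |
| DBLP_v1 | [26] | 19456 | 2 | 10.48 | 19.65 | + | + | - | - | DBLP_v1 |
| DHFR | [7] | 467 | 2 | 42.43 | 44.54 | + | - | + (3) | - | DHFR |
| DHFR_MD | [7,23] | 393 | 2 | 23.87 | 283.01 | + | + | - | + (1) | DHFR_MD |
| ER_MD | [7,23] | 446 | 2 | 21.33 | 234.85 | + | + | - | + (1) | ER_MD |
| DD | [6,22] | 1178 | 2 | 284.32 | 715.66 | + | - | - | - | DD |
| ENZYMES | [4,5] | 600 | 6 | 32.63 | 62.14 | + | - | + (18) | - | ENZYMES |
| Fingerprint | [16,19] | 2800 | 4 | 5.42 | 4.42 | - | - | + (2) | + (2) | Fingerprint |
| FIRSTMM_DB | [11,12,13] | 41 | 11 | 1377.27 | 3074.10 | + | - | + (1) | + (2) | FIRSTMM_DB |
| FRANKENSTEIN | [15] | 4337 | 2 | 16.90 | 17.88 | - | - | + (780) | - | FRANKENSTEIN |
| IMDB-BINARY | [14] | 1000 | 2 | 19.77 | 96.53 | - | - | - | - | IMDB-BINARY |
| IMDB-MULTI | [14] | 1500 | 3 | 13.00 | 65.94 | - | - | - | - | IMDB-MULTI |
| KKI | [26] | 83 | 2 | 26.96 | 48.42 | + | - | - | - | KKI |
| Letter-high | [16] | 2250 | 15 | 4.67 | 4.50 | - | - | + (2) | - | Letter-high |
| Letter-low | [16] | 2250 | 15 | 4.68 | 3.13 | - | - | + (2) | - | Letter-low |
| Letter-med | [16] | 2250 | 15 | 4.67 | 4.50 | - | - | + (2) | - | Letter-med |
| MCF-7 | [28] | 27770 | 2 | 26.39 | 28.52 | + | + | - | - | MCF-7 |
| MCF-7H | [28] | 27770 | 2 | 47.30 | 49.43 | + | + | - | - | MCF-7H |
| MOLT-4 | [28] | 39765 | 2 | 26.09 | 28.13 | + | + | - | - | MOLT-4 |
| MOLT-4H | [28] | 39765 | 2 | 46.70 | 48.73 | + | + | - | - | MOLT-4H |
| Mutagenicity | [16,20] | 4337 | 2 | 30.32 | 30.77 | + | + | - | - | Mutagenicity |
| MSRC_9 | [13] | 221 | 8 | 40.58 | 97.94 | + | - | - | - | MSRC_9 |
| MSRC_21 | [13] | 563 | 20 | 77.52 | 198.32 | + | - | - | - | MSRC_21 |
| MSRC_21C | [13] | 209 | 20 | 40.28 | 96.60 | + | - | - | - | MSRC_21C |
| MUTAG | [1,23] | 188 | 2 | 17.93 | 19.79 | + | + | - | - | MUTAG |
| NCI1 | [8,9,22] | 4110 | 2 | 29.87 | 32.30 | + | - | - | - | NCI1 |
| NCI109 | [8,9,22] | 4127 | 2 | 29.68 | 32.13 | + | - | - | - | NCI109 |
| NCI-H23 | [28] | 40353 | 2 | 26.07 | 28.10 | + | + | - | - | NCI-H23 |
| NCI-H23H | [28] | 40353 | 2 | 46.67 | 48.69 | + | + | - | - | NCI-H23H |
| OHSU | [26] | 79 | 2 | 82.01 | 199.66 | + | - | - | - | OHSU |
| OVCAR-8 | [28] | 40516 | 2 | 26.07 | 28.10 | + | + | - | - | OVCAR-8 |
| OVCAR-8H | [28] | 40516 | 2 | 46.67 | 48.70 | + | + | - | - | OVCAR-8H |
| P388 | [28] | 41472 | 2 | 22.11 | 23.55 | + | + | - | - | P388 |
| P388H | [28] | 41472 | 2 | 40.44 | 41.88 | + | + | - | - | P388H |
| PC-3 | [28] | 27509 | 2 | 26.35 | 28.49 | + | + | - | - | PC-3 |
| PC-3H | [28] | 27509 | 2 | 47.19 | 49.32 | + | + | - | - | PC-3H |
| Peking_1 | [26] | 85 | 2 | 39.31 | 77.35 | + | - | - | - | Peking_1 |
| PTC_FM | [2,23] | 349 | 2 | 14.11 | 14.48 | + | + | - | - | PTC_FM |
| PTC_FR | [2,23] | 351 | 2 | 14.56 | 15.00 | + | + | - | - | PTC_FR |
| PTC_MM | [2,23] | 336 | 2 | 13.97 | 14.32 | + | + | - | - | PTC_MM |
| PTC_MR | [2,23] | 344 | 2 | 14.29 | 14.69 | + | + | - | - | PTC_MR |
| PROTEINS | [4,6] | 1113 | 2 | 39.06 | 72.82 | + | - | + (1) | - | PROTEINS |
| PROTEINS_full | [4,6] | 1113 | 2 | 39.06 | 72.82 | + | - | + (29) | - | PROTEINS_full |
| REDDIT-BINARY | [14] | 2000 | 2 | 429.63 | 497.75 | - | - | - | - | REDDIT-BINARY |
| REDDIT-MULTI-5K | [14] | 4999 | 5 | 508.52 | 594.87 | - | - | - | - | REDDIT-MULTI-5K |
| REDDIT-MULTI-12K | [14] | 11929 | 11 | 391.41 | 456.89 | - | - | - | - | REDDIT-MULTI-12K |
| SF-295 | [28] | 40271 | 2 | 26.06 | 28.08 | + | + | - | - | SF-295 |
| SF-295H | [28] | 40271 | 2 | 46.65 | 48.68 | + | + | - | - | SF-295H |
| SN12C | [28] | 40004 | 2 | 26.08 | 28.11 | + | + | - | - | SN12C |
| SN12CH | [28] | 40004 | 2 | 46.69 | 48.71 | + | + | - | - | SN12CH |
| SW-620 | [28] | 40532 | 2 | 26.05 | 28.08 | + | + | - | - | SW-620 |
| SW-620H | [28] | 40532 | 2 | 46.62 | 48.65 | + | + | - | - | SW-620H |
| SYNTHETIC | [3] | 300 | 2 | 100.00 | 196.00 | - | - | + (1) | - | SYNTHETIC |
| SYNTHETICnew | [3,10] | 300 | 2 | 100.00 | 196.25 | - | - | + (1) | - | SYNTHETICnew |
| Synthie | [21] | 400 | 4 | 95.00 | 172.93 | - | - | + (15) | - | Synthie |
| Tox21_AhR_training | [24] | 8169 | 2 | 18.09 | 18.50 | + | + | - | - | Tox21_AhR_training |
| Tox21_AhR_testing | [24] | 272 | 2 | 22.13 | 23.05 | + | + | - | - | Tox21_AhR_testing |
| Tox21_AhR_evaluation | [24] | 607 | 2 | 17.64 | 18.06 | + | + | - | - | Tox21_AhR_evaluation |
| Tox21_AR_training | [24] | 9362 | 2 | 18.39 | 18.84 | + | + | - | - | Tox21_AR_training |
| Tox21_AR_testing | [24] | 292 | 2 | 22.35 | 23.32 | + | + | - | - | Tox21_AR_testing |
| Tox21_AR_evaluation | [24] | 585 | 2 | 17.99 | 18.45 | + | + | - | - | Tox21_AR_evaluation |
| Tox21_AR-LBD_training | [24] | 8599 | 2 | 17.77 | 18.16 | + | + | - | - | Tox21_AR-LBD_training |
| Tox21_AR-LBD_testing | [24] | 253 | 2 | 21.85 | 22.73 | + | + | - | - | Tox21_AR-LBD_testing |
| Tox21_AR-LBD_evaluation | [24] | 580 | 2 | 17.09 | 17.42 | + | + | - | - | Tox21_AR-LBD_evaluation |
| Tox21_ARE_training | [24] | 7167 | 2 | 16.28 | 16.52 | + | + | - | - | Tox21_ARE_training |
| Tox21_ARE_testing | [24] | 234 | 2 | 21.99 | 22.91 | + | + | - | - | Tox21_ARE_testing |
| Tox21_ARE_evaluation | [24] | 552 | 2 | 17.01 | 17.33 | + | + | - | - | Tox21_ARE_evaluation |
| Tox21_aromatase_training | [24] | 7226 | 2 | 17.50 | 17.79 | + | + | - | - | Tox21_aromatase_training |
| Tox21_aromatase_testing | [24] | 214 | 2 | 21.65 | 22.36 | + | + | - | - | Tox21_aromatase_testing |
| Tox21_aromatase_evaluation | [24] | 528 | 2 | 16.74 | 16.99 | + | + | - | - | Tox21_aromatase_evaluation |
| Tox21_ATAD5_training | [24] | 9091 | 2 | 17.89 | 18.30 | + | + | - | - | Tox21_ATAD5_training |
| Tox21_ATAD5_testing | [24] | 272 | 2 | 21.99 | 22.89 | + | + | - | - | Tox21_ATAD5_testing |
| Tox21_ATAD5_evaluation | [24] | 619 | 2 | 17.68 | 18.11 | + | + | - | - | Tox21_ATAD5_evaluation |
| Tox21_ER_training | [24] | 7697 | 2 | 17.58 | 17.94 | + | + | - | - | Tox21_ER_training |
| Tox21_ER_testing | [24] | 265 | 2 | 22.16 | 23.13 | + | + | - | - | Tox21_ER_testing |
| Tox21_ER_evaluation | [24] | 515 | 2 | 17.66 | 18.10 | + | + | - | - | Tox21_ER_evaluation |
| Tox21_ER-LBD_training | [24] | 8753 | 2 | 18.06 | 18.47 | + | + | - | - | Tox21_ER-LBD_training |
| Tox21_ER-LBD_testing | [24] | 287 | 2 | 22.28 | 23.23 | + | + | - | - | Tox21_ER-LBD_testing |
| Tox21_ER-LBD_evaluation | [24] | 599 | 2 | 17.75 | 18.17 | + | + | - | - | Tox21_ER-LBD_evaluation |
| Tox21_HSE_training | [24] | 8150 | 2 | 16.72 | 17.04 | + | + | - | - | Tox21_HSE_training |
| Tox21_HSE_testing | [24] | 267 | 2 | 22.07 | 23.00 | + | + | - | - | Tox21_HSE_testing |
| Tox21_HSE_evaluation | [24] | 607 | 2 | 17.61 | 18.01 | + | + | - | - | Tox21_HSE_evaluation |
| Tox21_MMP_training | [24] | 7320 | 2 | 17.49 | 17.83 | + | + | - | - | Tox21_MMP_training |
| Tox21_MMP_testing | [24] | 238 | 2 | 21.68 | 22.55 | + | + | - | - | Tox21_MMP_testing |
| Tox21_MMP_evaluation | [24] | 541 | 2 | 16.67 | 16.88 | + | + | - | - | Tox21_MMP_evaluation |
| Tox21_p53_training | [24] | 8634 | 2 | 17.79 | 18.19 | + | + | - | - | Tox21_p53_training |
| Tox21_p53_testing | [24] | 269 | 2 | 22.14 | 23.04 | + | + | - | - | Tox21_p53_testing |
| Tox21_p53_evaluation | [24] | 613 | 2 | 17.34 | 17.72 | + | + | - | - | Tox21_p53_evaluation |
| Tox21_PPAR-gamma_training | [24] | 8184 | 2 | 17.23 | 17.55 | + | + | - | - | Tox21_PPAR-gamma_training |
| Tox21_PPAR-gamma_testing | [24] | 267 | 2 | 22.04 | 22.93 | + | + | - | - | Tox21_PPAR-gamma_testing |
| Tox21_PPAR-gamma_evaluation | [24] | 602 | 2 | 17.38 | 17.77 | + | + | - | - | Tox21_PPAR-gamma_evaluation |
| TRIANGLES | [27] | 45000 | 10 | 20.85 | 32.74 | - | - | - | - | TRIANGLES |
| TWITTER-Real-Graph-Partial | [26] | 144033 | 2 | 4.03 | 4.98 | + | - | - | + (1) | TWITTER-Real-Graph-Partial |
| UACC257 | [28] | 39988 | 2 | 26.09 | 28.12 | + | + | - | - | UACC257 |
| UACC257H | [28] | 39988 | 2 | 46.68 | 48.71 | + | + | - | - | UACC257H |
| Yeast | [28] | 79601 | 2 | 21.54 | 22.84 | + | + | - | - | Yeast |
| YeastH | [28] | 79601 | 2 | 39.44 | 40.74 | + | + | - | - | YeastH |
| All Data Sets | - | - | - | - | - | - | - | - | - | DS_all |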

R (N) in the class column marks a regression data set with N tasks per graph.

File Format

The data sets have the following format (replace DS by the name of the data set):

Let

  • n = total number of nodes
  • m = total number of edges
  • N = number of graphs
  1. DS_A.txt (m lines): sparse (block diagonal) adjacency matrix for all graphs; each line corresponds to one entry (row, col), i.e. a pair (node_id, node_id). All graphs are undirected, so DS_A.txt contains two entries for each edge.
  2. DS_graph_indicator.txt (n lines): column vector of graph identifiers for all nodes of all graphs, the value in the i-th line is the graph_id of the node with node_id i
  3. DS_graph_labels.txt (N lines): class labels for all graphs in the data set, the value in the i-th line is the class label of the graph with graph_id i
  4. DS_node_labels.txt (n lines): column vector of node labels, the value in the i-th line corresponds to the node with node_id i
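
As an illustration of how these files fit together, here is a minimal Python sketch, not an official loader. The helper names parse_graphs, load_dataset, and average_sizes are hypothetical. The sketch assumes that DS_A.txt holds comma-separated node pairs, that class labels are integers, and that node and graph ids start at 1. It also assumes that an average edge count should count each undirected edge once.

```python
from pathlib import Path


def _non_empty(lines):
    """Strip whitespace and drop blank lines."""
    return [line.strip() for line in lines if line.strip()]


def parse_graphs(adjacency_lines, indicator_lines, label_lines):
    """Group the flat node and edge lists into one record per graph.

    Assumes 1-based ids: line i of the indicator file describes node i,
    and line i of the label file describes graph i.
    """
    indicator = [int(value) for value in _non_empty(indicator_lines)]
    labels = [int(value) for value in _non_empty(label_lines)]

    graphs = {
        graph_id: {"label": label, "nodes": [], "edges": set()}
        for graph_id, label in enumerate(labels, start=1)
    }
    for node_id, graph_id in enumerate(indicator, start=1):
        graphs[graph_id]["nodes"].append(node_id)

    for line in _non_empty(adjacency_lines):
        u, v = (int(part) for part in line.split(","))
        # Each undirected edge appears twice, as (u, v) and (v, u);
        # storing the sorted pair in a set keeps a single copy.
        graphs[indicator[u - 1]]["edges"].add((min(u, v), max(u, v)))
    return graphs


def load_dataset(directory, name):
    """Read DS_A.txt, DS_graph_indicator.txt and DS_graph_labels.txt for DS = name."""
    def read(suffix):
        return (Path(directory) / f"{name}_{suffix}.txt").read_text().splitlines()

    return parse_graphs(read("A"), read("graph_indicator"), read("graph_labels"))


def average_sizes(graphs):
    """Return (average number of nodes, average number of undirected edges)."""
    count = len(graphs)
    nodes = sum(len(g["nodes"]) for g in graphs.values())
    edges = sum(len(g["edges"]) for g in graphs.values())
    return nodes / count, edges / count


# Toy data: a path on nodes 1-2-3 (graph 1) and a single edge 4-5 (graph 2).
toy = parse_graphs(
    adjacency_lines=["1, 2", "2, 1", "2, 3", "3, 2", "4, 5", "5, 4"],
    indicator_lines=["1", "1", "1", "2", "2"],
    label_lines=["1", "-1"],
)
print(toy[1]["nodes"], sorted(toy[1]["edges"]), toy[1]["label"])
print(average_sizes(toy))
```

For an unpacked data set, a call such as load_dataset("path/to/MUTAG", "MUTAG") would read the three files from that directory; the path here is only a placeholder.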

There are optional files if the respective information is available:

  • DS_edge_labels.txt (m lines; same size as DS_A.txt): labels for the edges in DS_A.txt
  • DS_edge_attributes.txt (m lines; same size as DS_A.txt): attributes for the edges in DS_A.txt
  • DS_node_attributes.txt (n lines): matrix of node attributes, the comma-separated values in the i-th line are the attribute vector of the node with node_id i
  • DS_graph_attributes.txt (N lines): regression values for all graphs in the data set, the value in the i-th line is the attribute of the graph with graph_id i
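
The optional files line up with the mandatory ones line by line. The following Python sketch shows one way to read node attributes and edge labels under that reading. The function names parse_node_attributes and parse_edge_labels are hypothetical, edge labels are assumed to be integers, and the dictionary returned for edge labels assumes that no (row, col) pair occurs more than once in DS_A.txt.

```python
def parse_node_attributes(lines):
    """Return one attribute vector per node; entry i - 1 belongs to node_id i."""
    return [
        [float(value) for value in line.split(",")]
        for line in lines
        if line.strip()
    ]


def parse_edge_labels(adjacency_lines, edge_label_lines):
    """Pair line k of DS_A.txt with line k of DS_edge_labels.txt."""
    pairs = [
        tuple(int(part) for part in line.split(","))
        for line in adjacency_lines
        if line.strip()
    ]
    labels = [int(line) for line in edge_label_lines if line.strip()]
    if len(pairs) != len(labels):
        raise ValueError("edge label file must have one line per line of DS_A.txt")
    # Both directions of an undirected edge get their own entry.
    return dict(zip(pairs, labels))


# Toy data: two nodes with two attributes each, joined by one labelled edge.
attributes = parse_node_attributes(["0.5, 1.0", "2.0, -3.5"])
edge_labels = parse_edge_labels(["1, 2", "2, 1"], ["0", "0"])
print(attributes)
print(edge_labels)
```

Edge attributes could be read the same way by combining the two helpers: one float vector per line, matched to the corresponding line of DS_A.txt.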

Deep Learning Libraries

The datasets can also be accessed using PyTorch Geometric and the Deep Graph Library.

Citing this Website

We encourage you to refer to our website at http://graphkernels.cs.tu-dortmund.de if you have used the data sets for your publication. Please use the following BibTeX citation:

@misc{KKMMN2016,
  title  = {Benchmark Data Sets for Graph Kernels},
  author = {Kristian Kersting and Nils M. Kriege and Christopher Morris and Petra Mutzel and Marion Neumann},
  year   = {2016},
  url    = {http://graphkernels.cs.tu-dortmund.de}
}

If your bibliography style does not support the url field, you may use this alternative:

@misc{KKMMN2016,
  title  = {Benchmark Data Sets for Graph Kernels},
  author = {Kristian Kersting and Nils M. Kriege and Christopher Morris and Petra Mutzel and Marion Neumann},
  year   = {2016},
  note   = {\url{http://graphkernels.cs.tu-dortmund.de}}
}

Bibliography

[1] Debnath, A.K., Lopez de Compadre, R.L., Debnath, G., Shusterman, A.J., and Hansch, C. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J. Med. Chem. 34(2):786-797 (1991).

[2] Helma, C., King, R. D., Kramer, S., and Srinivasan, A. The Predictive Toxicology Challenge 2000–2001. Bioinformatics, 2001, 17, 107-108. URL: www.predictive-toxicology.org/ptc

[3] Feragen, A., Kasenburg, N., Petersen, J., de Bruijne, M., Borgwardt, K.M.: Scalable kernels for graphs with continuous attributes. In: C.J.C. Burges, L. Bottou, Z. Ghahramani, K.Q. Weinberger (eds.) NIPS, pp. 216-224 (2013).

[4] K. M. Borgwardt, C. S. Ong, S. Schoenauer, S. V. N. Vishwanathan, A. J. Smola, and H. P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(Suppl 1):i47–i56, Jun 2005.

[5] I. Schomburg, A. Chang, C. Ebeling, M. Gremse, C. Heldt, G. Huhn, and D. Schomburg. Brenda, the enzyme database: updates and major new developments. Nucleic Acids Research, 32D:431–433, 2004.

[6] P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. J. Mol. Biol., 330(4):771–783, Jul 2003.

[7] Sutherland, J. J.; O'Brien, L. A. & Weaver, D. F. Spline-fitting with a genetic algorithm: a method for developing classification structure-activity relationships. J. Chem. Inf. Comput. Sci., 2003, 43, 1906-1915.

[8] N. Wale and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. In Proc. of ICDM, pages 678–689, Hong Kong, 2006.

[9] http://pubchem.ncbi.nlm.nih.gov

[10] http://image.diku.dk/aasa/papers/graphkernels_nips_erratum.pdf

[11] M. Neumann, P. Moreno, L. Antanas, R. Garnett, K. Kersting. Graph Kernels for Object Category Prediction in Task-Dependent Robot Grasping. Eleventh Workshop on Mining and Learning with Graphs (MLG-13), Chicago, Illinois, USA, 2013.

[12] http://www.first-mm.eu/data.html

[13] M. Neumann, R. Garnett, C. Bauckhage, and K. Kersting. Propagation kernels: efficient graph kernels from propagated information. Machine Learning, 102(2):209–245, 2016.

[14] Pinar Yanardag and S.V.N. Vishwanathan. 2015. Deep Graph Kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 1365-1374.

[15] Francesco Orsini, Paolo Frasconi, and Luc De Raedt. 2015. Graph invariant kernels. In Proceedings of the 24th International Conference on Artificial Intelligence (IJCAI'15), Qiang Yang and Michael Wooldridge (Eds.). AAAI Press, 3756-3762.

[16] Riesen, K. and Bunke, H.: IAM Graph Database Repository for Graph Based Pattern Recognition and Machine Learning. In: da Vitoria Lobo, N. et al. (Eds.), SSPR&SPR 2008, LNCS, vol. 5342, pp. 287-297, 2008.

[17] AIDS Antiviral Screen Data (2004)

[18] S. A. Nene, S. K. Nayar and H. Murase. Columbia Object Image Library (COIL-100), Technical Report, Department of Computer Science, Columbia University CUCS-006-96, Feb. 1996.

[19] NIST Special Database 4

[20] Jeroen Kazius, Ross McGuire, and Roberta Bursi. Derivation and Validation of Toxicophores for Mutagenicity Prediction. Journal of Medicinal Chemistry, 2005, 48(1), 312-320.

[21] Christopher Morris, Nils M. Kriege, Kristian Kersting, Petra Mutzel. Faster Kernels for Graphs with Continuous Attributes via Hashing, IEEE International Conference on Data Mining (ICDM) 2016

[22] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. 2011. Weisfeiler-Lehman Graph Kernels. J. Mach. Learn. Res. 12 (November 2011), 2539-2561.

[23] Nils Kriege, Petra Mutzel. 2012. Subgraph Matching Kernels for Attributed Graphs. International Conference on Machine Learning 2012.

[24] Tox21 Data Challenge 2014

[25] Nils M. Kriege, Matthias Fey, Denis Fisseler, Petra Mutzel, Frank Weichert. Recognizing Cuneiform Signs Using Graph Based Methods. International Workshop on Cost-Sensitive Learning (COST), SIAM International Conference on Data Mining (SDM) 2018, 31-44, arXiv:1802.05908.

[26] A Repository of Benchmark Graph Datasets for Graph Classification

[27] Boris Knyazev, Graham W. Taylor, Mohamed R. Amer. Understanding Attention and Generalization in Graph Neural Networks

[28] Chemical DataSets

[29] Alchemy: A Quantum Chemistry Dataset for Benchmarking AI Models

Contact

If you have any questions regarding the data sets or are interested in adding your graph data, please write an email to christopher.morris@tu-dortmund.de.

 
Last modified: 2020-03-03 16:53 (external edit)