## Benchmark Data Sets for Graph Kernels

This page contains collected benchmark data sets for the evaluation of graph kernels. The data sets were collected by Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann with partial support of the
German Science Foundation (DFG) within the Collaborative Research Center
SFB 876 “*Providing Information by Resource-Constrained Data Analysis*”, project A6 “*Resource-efficient Graph Mining*”.

- **09.07.2019:** Added two new data sets from [27].
- **23.10.2018:** Added five new data sets from [26].
- **13.02.2018:** Added Cuneiform data set from [25].
- **11.05.2017:** Added twelve new data sets from [24].
- **17.06.2016:** Added Synthie data set from [21].
- **10.05.2016:** Added eight new data sets from [16].
- **19.04.2016:** Added FRANKENSTEIN data set from [15].
- **13.04.2016:** Added SYNTHETICnew data set from [3,10].
- **08.04.2016:** Added six new data sets from [14].

The four columns from Num. of Graphs to Avg. Number of Edges are statistics; the next four columns list the available labels and attributes (+ available, – not available; attribute dimension in parentheses).

Name | Source | Num. of Graphs | Num. of Classes | Avg. Number of Nodes | Avg. Number of Edges | Node Labels | Edge Labels | Node Attr. (Dim.) | Edge Attr. (Dim.) | Download (ZIP) |
---|---|---|---|---|---|---|---|---|---|---|
AIDS | [16,17] | 2000 | 2 | 15.69 | 16.20 | + | + | + (4) | – | AIDS |
BZR | [7] | 405 | 2 | 35.75 | 38.36 | + | – | + (3) | – | BZR |
BZR_MD | [7,23] | 306 | 2 | 21.30 | 225.06 | + | + | – | + (1) | BZR_MD |
COIL-DEL | [16,18] | 3900 | 100 | 21.54 | 54.24 | – | + | + (2) | – | COIL-DEL |
COIL-RAG | [16,18] | 3900 | 100 | 3.01 | 3.02 | – | – | + (64) | + (1) | COIL-RAG |
COLLAB | [14] | 5000 | 3 | 74.49 | 2457.78 | – | – | – | – | COLLAB |
COLORS-3 | [27] | 10500 | 11 | 61.31 | 91.03 | – | – | + (4) | – | COLORS-3 |
COX2 | [7] | 467 | 2 | 41.22 | 43.45 | + | – | + (3) | – | COX2 |
COX2_MD | [7,23] | 303 | 2 | 26.28 | 335.12 | + | + | – | + (1) | COX2_MD |
Cuneiform | [25] | 267 | 30 | 21.27 | 44.80 | + | + | + (3) | + (2) | Cuneiform |
DBLP_v1 | [26] | 19456 | 2 | 10.48 | 19.65 | + | + | – | – | DBLP_v1 |
DHFR | [7] | 467 | 2 | 42.43 | 44.54 | + | – | + (3) | – | DHFR |
DHFR_MD | [7,23] | 393 | 2 | 23.87 | 283.01 | + | + | – | + (1) | DHFR_MD |
ER_MD | [7,23] | 446 | 2 | 21.33 | 234.85 | + | + | – | + (1) | ER_MD |
DD | [6,22] | 1178 | 2 | 284.32 | 715.66 | + | – | – | – | DD |
ENZYMES | [4,5] | 600 | 6 | 32.63 | 62.14 | + | – | + (18) | – | ENZYMES |
Fingerprint | [16,19] | 2800 | 4 | 5.42 | 4.42 | – | – | + (2) | + (2) | Fingerprint |
FIRSTMM_DB | [11,12,13] | 41 | 11 | 1377.27 | 3074.10 | + | – | + (1) | + (2) | FIRSTMM_DB |
FRANKENSTEIN | [15] | 4337 | 2 | 16.90 | 17.88 | – | – | + (780) | – | FRANKENSTEIN |
IMDB-BINARY | [14] | 1000 | 2 | 19.77 | 96.53 | – | – | – | – | IMDB-BINARY |
IMDB-MULTI | [14] | 1500 | 3 | 13.00 | 65.94 | – | – | – | – | IMDB-MULTI |
KKI | [26] | 83 | 2 | 26.96 | 48.42 | + | – | – | – | KKI |
Letter-high | [16] | 2250 | 15 | 4.67 | 4.50 | – | – | + (2) | – | Letter-high |
Letter-low | [16] | 2250 | 15 | 4.68 | 3.13 | – | – | + (2) | – | Letter-low |
Letter-med | [16] | 2250 | 15 | 4.67 | 4.50 | – | – | + (2) | – | Letter-med |
Mutagenicity | [16,20] | 4337 | 2 | 30.32 | 30.77 | + | + | – | – | Mutagenicity |
MSRC_9 | [13] | 221 | 8 | 40.58 | 97.94 | + | – | – | – | MSRC_9 |
MSRC_21 | [13] | 563 | 20 | 77.52 | 198.32 | + | – | – | – | MSRC_21 |
MSRC_21C | [13] | 209 | 20 | 40.28 | 96.60 | + | – | – | – | MSRC_21C |
MUTAG | [1,23] | 188 | 2 | 17.93 | 19.79 | + | + | – | – | MUTAG |
NCI1 | [8,9,22] | 4110 | 2 | 29.87 | 32.30 | + | – | – | – | NCI1 |
NCI109 | [8,9,22] | 4127 | 2 | 29.68 | 32.13 | + | – | – | – | NCI109 |
OHSU | [26] | 79 | 2 | 82.01 | 199.66 | + | – | – | – | OHSU |
Peking_1 | [26] | 85 | 2 | 39.31 | 77.35 | + | – | – | – | Peking_1 |
PTC_FM | [2,23] | 349 | 2 | 14.11 | 14.48 | + | + | – | – | PTC_FM |
PTC_FR | [2,23] | 351 | 2 | 14.56 | 15.00 | + | + | – | – | PTC_FR |
PTC_MM | [2,23] | 336 | 2 | 13.97 | 14.32 | + | + | – | – | PTC_MM |
PTC_MR | [2,23] | 344 | 2 | 14.29 | 14.69 | + | + | – | – | PTC_MR |
PROTEINS | [4,6] | 1113 | 2 | 39.06 | 72.82 | + | – | + (1) | – | PROTEINS |
PROTEINS_full | [4,6] | 1113 | 2 | 39.06 | 72.82 | + | – | + (29) | – | PROTEINS_full |
REDDIT-BINARY | [14] | 2000 | 2 | 429.63 | 497.75 | – | – | – | – | REDDIT-BINARY |
REDDIT-MULTI-5K | [14] | 4999 | 5 | 508.52 | 594.87 | – | – | – | – | REDDIT-MULTI-5K |
REDDIT-MULTI-12K | [14] | 11929 | 11 | 391.41 | 456.89 | – | – | – | – | REDDIT-MULTI-12K |
SYNTHETIC | [3] | 300 | 2 | 100.00 | 196.00 | – | – | + (1) | – | SYNTHETIC |
SYNTHETICnew | [3,10] | 300 | 2 | 100.00 | 196.25 | – | – | + (1) | – | SYNTHETICnew |
Synthie | [21] | 400 | 4 | 95.00 | 172.93 | – | – | + (15) | – | Synthie |
Tox21_AHR | [24] | 8169 | 2 | 18.09 | 18.50 | + | + | – | – | Tox21_AHR |
Tox21_AR | [24] | 9362 | 2 | 18.39 | 18.84 | + | + | – | – | Tox21_AR |
Tox21_AR-LBD | [24] | 8599 | 2 | 17.77 | 18.16 | + | + | – | – | Tox21_AR-LBD |
Tox21_ARE | [24] | 7167 | 2 | 16.28 | 16.52 | + | + | – | – | Tox21_ARE |
Tox21_aromatase | [24] | 7226 | 2 | 17.50 | 17.79 | + | + | – | – | Tox21_aromatase |
Tox21_ATAD5 | [24] | 9091 | 2 | 17.89 | 18.30 | + | + | – | – | Tox21_ATAD5 |
Tox21_ER | [24] | 7697 | 2 | 17.58 | 17.94 | + | + | – | – | Tox21_ER |
Tox21_ER_LBD | [24] | 8753 | 2 | 18.06 | 18.47 | + | + | – | – | Tox21_ER_LBD |
Tox21_HSE | [24] | 8150 | 2 | 16.72 | 17.04 | + | + | – | – | Tox21_HSE |
Tox21_MMP | [24] | 7320 | 2 | 17.49 | 17.83 | + | + | – | – | Tox21_MMP |
Tox21_p53 | [24] | 8634 | 2 | 17.79 | 18.19 | + | + | – | – | Tox21_p53 |
Tox21_PPAR-gamma | [24] | 8184 | 2 | 17.23 | 17.55 | + | + | – | – | Tox21_PPAR-gamma |
TRIANGLES | [27] | 45000 | 10 | 20.85 | 32.74 | – | – | – | – | TRIANGLES |
TWITTER-Real-Graph-Partial | [26] | 144033 | 2 | 4.03 | 4.98 | + | – | – | + (1) | TWITTER-Real-Graph-Partial |

All data sets: DS_all
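
The statistics columns can be recomputed from the files described under File Format below. The following is a minimal sketch, assuming the averages are n/N for nodes and m/(2N) for edges (DS_A.txt lists every undirected edge twice). The function name `dataset_statistics` and its arguments are hypothetical and not part of the data sets.

```python
from pathlib import Path


def count_lines(path):
    """Count the non-empty lines of a text file."""
    with open(path) as fh:
        return sum(1 for line in fh if line.strip())


def dataset_statistics(folder, ds):
    """Recompute the statistics columns for data set `ds` stored in `folder`."""
    folder = Path(folder)
    n = count_lines(folder / f"{ds}_graph_indicator.txt")  # total number of nodes
    m = count_lines(folder / f"{ds}_A.txt")                # two lines per edge
    with open(folder / f"{ds}_graph_labels.txt") as fh:
        labels = [line.strip() for line in fh if line.strip()]
    N = len(labels)                                        # number of graphs
    return {
        "num_graphs": N,
        "num_classes": len(set(labels)),
        "avg_nodes": n / N,
        # Assumption: each undirected edge is counted once in the average.
        "avg_edges": m / (2 * N),
    }
```

For example, `dataset_statistics("data/MUTAG", "MUTAG")` would read the files from a local folder `data/MUTAG`; that path is only an illustration.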

### File Format

The data sets have the following *format* (replace **DS** by the name of the data set):

Let

- n = total number of nodes
- m = total number of edges
- N = number of graphs

- **DS_A.txt (m lines):** sparse (block diagonal) adjacency matrix for all graphs; each line corresponds to (row, col), i.e. (node_id, node_id). *All graphs are undirected. Hence, DS_A.txt contains two entries for each edge.*
- **DS_graph_indicator.txt (n lines):** column vector of graph identifiers for all nodes of all graphs; the value in the i-th line is the graph_id of the node with node_id i.
- **DS_graph_labels.txt (N lines):** class labels for all graphs in the data set; the value in the i-th line is the class label of the graph with graph_id i.
- **DS_node_labels.txt (n lines):** column vector of node labels; the value in the i-th line corresponds to the node with node_id i.
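
The files above can be combined into one record per graph. The sketch below is one possible reader, not an official loader. It assumes that node and graph ids start at 1 and are consecutive, that each line of DS_A.txt holds two integers separated by a comma, and that labels are integers. The names `read_lines` and `load_dataset` are hypothetical. Because the table lists data sets without node labels, the reader treats DS_node_labels.txt as possibly absent.

```python
from pathlib import Path


def read_lines(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    return [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]


def load_dataset(folder, ds):
    """Group the nodes and edges of data set `ds` by graph_id.

    Returns a dict: graph_id -> {"label", "nodes", "node_labels", "edges"}.
    """
    folder = Path(folder)
    # Line i holds the graph_id of the node with node_id i (assumed 1-based).
    indicator = [int(x) for x in read_lines(folder / f"{ds}_graph_indicator.txt")]
    graph_labels = [int(x) for x in read_lines(folder / f"{ds}_graph_labels.txt")]

    labels_path = folder / f"{ds}_node_labels.txt"
    node_labels = None
    if labels_path.exists():
        node_labels = [int(x) for x in read_lines(labels_path)]

    graphs = {
        gid: {"label": label, "nodes": [], "node_labels": {}, "edges": set()}
        for gid, label in enumerate(graph_labels, start=1)
    }
    for node_id, gid in enumerate(indicator, start=1):
        graphs[gid]["nodes"].append(node_id)
        if node_labels is not None:
            graphs[gid]["node_labels"][node_id] = node_labels[node_id - 1]

    for line in read_lines(folder / f"{ds}_A.txt"):
        u, v = (int(x) for x in line.replace(",", " ").split())
        # Every undirected edge appears twice; store it once as a sorted pair.
        graphs[indicator[u - 1]]["edges"].add((min(u, v), max(u, v)))
    return graphs
```

Storing each edge as a sorted pair in a set collapses the two directed entries into one undirected edge. A reader that needs the original line order, for example to align edge labels, should keep the raw list instead.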

There are *optional files* if the respective information is available:

- **DS_edge_labels.txt (m lines; same size as DS_A.txt):** labels for the edges in DS_A.txt.
- **DS_edge_attributes.txt (m lines; same size as DS_A.txt):** attributes for the edges in DS_A.txt.
- **DS_node_attributes.txt (n lines):** matrix of node attributes; the comma-separated values in the i-th line are the attribute vector of the node with node_id i.
- **DS_graph_attributes.txt (N lines):** regression values for all graphs in the data set; the value in the i-th line is the attribute of the graph with graph_id i.
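
Since these files exist only for some data sets, a reader has to check for each one before parsing it. The following sketch does so under a few assumptions: edge labels are integers, edge attributes are comma-separated like node attributes, and each line of DS_graph_attributes.txt holds a single number. The function name `load_optional_files` and the dictionary keys are hypothetical.

```python
from pathlib import Path


def _rows(path):
    """Return the non-empty lines of a text file."""
    return [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]


def _vector(line):
    """Parse one comma-separated line into a list of floats."""
    return [float(x) for x in line.split(",")]


def load_optional_files(folder, ds):
    """Load whichever optional files of data set `ds` are present in `folder`."""
    folder = Path(folder)
    parsers = {
        "edge_labels": int,          # line j belongs to line j of DS_A.txt
        "edge_attributes": _vector,  # assumed comma-separated, like node attributes
        "node_attributes": _vector,  # line i belongs to node_id i
        "graph_attributes": float,   # line i belongs to graph_id i
    }
    data = {}
    for kind, parse in parsers.items():
        path = folder / f"{ds}_{kind}.txt"
        if path.exists():
            data[kind] = [parse(row) for row in _rows(path)]
    return data
```

Keys for missing files are simply absent from the result, so `"node_attributes" in data` tells whether a data set provides node attributes.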

### Citing this Website

We encourage you to refer to our website at http://graphkernels.cs.tu-dortmund.de if you have used the data sets for your publication. Please use the following BibTeX citation:

    @misc{KKMMN2016,
      title  = {Benchmark Data Sets for Graph Kernels},
      author = {Kristian Kersting and Nils M. Kriege and Christopher Morris and Petra Mutzel and Marion Neumann},
      year   = {2016},
      url    = {http://graphkernels.cs.tu-dortmund.de}
    }

If your bibliography style does not support the url field, you may use this alternative:

    @misc{KKMMN2016,
      title  = {Benchmark Data Sets for Graph Kernels},
      author = {Kristian Kersting and Nils M. Kriege and Christopher Morris and Petra Mutzel and Marion Neumann},
      year   = {2016},
      note   = {\url{http://graphkernels.cs.tu-dortmund.de}}
    }

### Bibliography

[1] Debnath, A.K., Lopez de Compadre, R.L., Debnath, G., Shusterman, A.J., and Hansch, C. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J. Med. Chem. 34(2):786-797 (1991).

[2] Helma, C., King, R. D., Kramer, S., and Srinivasan, A. The Predictive Toxicology Challenge 2000–2001. Bioinformatics, 2001, 17, 107-108. URL: www.predictive-toxicology.org/ptc

[3] Feragen, A., Kasenburg, N., Petersen, J., de Bruijne, M., Borgwardt, K.M.: Scalable kernels for graphs with continuous attributes. In: C.J.C. Burges, L. Bottou, Z. Ghahramani, K.Q. Weinberger (eds.) NIPS, pp. 216-224 (2013).

[4] K. M. Borgwardt, C. S. Ong, S. Schoenauer, S. V. N. Vishwanathan, A. J. Smola, and H. P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(Suppl 1):i47–i56, Jun 2005.

[5] I. Schomburg, A. Chang, C. Ebeling, M. Gremse, C. Heldt, G. Huhn, and D. Schomburg. Brenda, the enzyme database: updates and major new developments. Nucleic Acids Research, 32D:431–433, 2004.

[6] P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. J. Mol. Biol., 330(4):771–783, Jul 2003.

[7] Sutherland, J. J.; O'Brien, L. A. & Weaver, D. F. Spline-fitting with a genetic algorithm: a method for developing classification structure-activity relationships. J. Chem. Inf. Comput. Sci., 2003, 43, 1906-1915.

[8] N. Wale and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. In Proc. of ICDM, pages 678–689, Hong Kong, 2006.

[9] http://pubchem.ncbi.nlm.nih.gov

[10] http://image.diku.dk/aasa/papers/graphkernels_nips_erratum.pdf

[11] M. Neumann, P. Moreno, L. Antanas, R. Garnett, K. Kersting. Graph Kernels for Object Category Prediction in Task-Dependent Robot Grasping. Eleventh Workshop on Mining and Learning with Graphs (MLG-13), Chicago, Illinois, USA, 2013.

[12] http://www.first-mm.eu/data.html

[13] M. Neumann, R. Garnett, C. Bauckhage, and K. Kersting. Propagation kernels: efficient graph kernels from propagated information. Machine Learning, 102(2):209–245, 2016.

[14] Pinar Yanardag and S.V.N. Vishwanathan. 2015. Deep Graph Kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 1365-1374.

[15] Francesco Orsini, Paolo Frasconi, and Luc De Raedt. 2015. Graph invariant kernels. In Proceedings of the 24th International Conference on Artificial Intelligence (IJCAI'15), Qiang Yang and Michael Wooldridge (Eds.). AAAI Press, 3756-3762.

[16] Riesen, K. and Bunke, H.: IAM Graph Database Repository for Graph Based Pattern Recognition and Machine Learning. In: da Vitoria Lobo, N. et al. (Eds.), SSPR&SPR 2008, LNCS, vol. 5342, pp. 287-297, 2008.

[17] AIDS Antiviral Screen Data (2004)

[18] S. A. Nene, S. K. Nayar and H. Murase. Columbia Object Image Library (COIL-100), Technical Report, Department of Computer Science, Columbia University CUCS-006-96, Feb. 1996.

[20] Jeroen Kazius, Ross McGuire, and Roberta Bursi. Derivation and Validation of Toxicophores for Mutagenicity Prediction. Journal of Medicinal Chemistry 2005, 48 (1), 312-320.

[21] Christopher Morris, Nils M. Kriege, Kristian Kersting, Petra Mutzel. Faster Kernels for Graphs with Continuous Attributes via Hashing, IEEE International Conference on Data Mining (ICDM) 2016

[22] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. 2011. Weisfeiler-Lehman Graph Kernels. J. Mach. Learn. Res. 12 (November 2011), 2539-2561.

[23] Nils Kriege, Petra Mutzel. 2012. Subgraph Matching Kernels for Attributed Graphs. International Conference on Machine Learning 2012.

[24] Tox21 Data Challenge 2014

[25] Nils M. Kriege, Matthias Fey, Denis Fisseler, Petra Mutzel, Frank Weichert. Recognizing Cuneiform Signs Using Graph Based Methods. International Workshop on Cost-Sensitive Learning (COST), SIAM International Conference on Data Mining (SDM) 2018, 31-44, `arXiv:1802.05908`.

[26] A Repository of Benchmark Graph Datasets for Graph Classification

[27] Boris Knyazev, Graham W. Taylor, Mohamed R. Amer. Understanding Attention and Generalization in Graph Neural Networks