Benchmark Data Sets for Graph Kernels

This page contains collected benchmark data sets for the evaluation of graph kernels. The data sets were collected by Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann with partial support of the German Science Foundation (DFG) within the Collaborative Research Center SFB 876 “Providing Information by Resource-Constrained Data Analysis”, project A6 “Resource-efficient Graph Mining”.

02.03.2020: Added three new data sets from [29].
14.01.2020: Added twenty-four new data sets from [24].
28.08.2019: Added twenty-two new data sets from [28].
09.07.2019: Added two new data sets from [27].
23.10.2018: Added five new data sets from [26].
13.02.2018: Added Cuneiform data set from [25].
11.05.2017: Added twelve new data sets from [24].
17.06.2016: Added Synthie data set from [21].
10.05.2016: Added eight new data sets from [16].
19.04.2016: Added FRANKENSTEIN data set from [15].
13.04.2016: Added SYNTHETICnew data set from [3,10].
08.04.2016: Added six new data sets from [14].

Name	Source	Num. of Graphs	Num. of Classes	Avg. Num. of Nodes	Avg. Num. of Edges	Node Labels	Edge Labels	Node Attr. (Dim.)	Edge Attr. (Dim.)	Download (ZIP)
AIDS	[16,17]	2000	2	15.69	16.20	+	+	+ (4)	–	AIDS
alchemy_dev	[29]	99776	R (12)	9.71	10.02	+	+	–	–	alchemy_dev
alchemy_test	[29]	15760	–	11.25	11.76	+	+	–	–	alchemy_test
alchemy_valid	[29]	3951	R (12)	11.25	11.77	+	+	–	–	alchemy_valid
BZR	[7]	405	2	35.75	38.36	+	–	+ (3)	–	BZR
BZR_MD	[7,23]	306	2	21.30	225.06	+	+	–	+ (1)	BZR_MD
COIL-DEL	[16,18]	3900	100	21.54	54.24	–	+	+ (2)	–	COIL-DEL
COIL-RAG	[16,18]	3900	100	3.01	3.02	–	–	+ (64)	+ (1)	COIL-RAG
COLLAB	[14]	5000	3	74.49	2457.78	–	–	–	–	COLLAB
COLORS-3	[27]	10500	11	61.31	91.03	–	–	+ (4)	–	COLORS-3
COX2	[7]	467	2	41.22	43.45	+	–	+ (3)	–	COX2
COX2_MD	[7,23]	303	2	26.28	335.12	+	+	–	+ (1)	COX2_MD
Cuneiform	[25]	267	30	21.27	44.80	+	+	+ (3)	+ (2)	Cuneiform
DBLP_v1	[26]	19456	2	10.48	19.65	+	+	–	–	DBLP_v1
DHFR	[7]	467	2	42.43	44.54	+	–	+ (3)	–	DHFR
DHFR_MD	[7,23]	393	2	23.87	283.01	+	+	–	+ (1)	DHFR_MD
ER_MD	[7,23]	446	2	21.33	234.85	+	+	–	+ (1)	ER_MD
DD	[6,22]	1178	2	284.32	715.66	+	–	–	–	DD
ENZYMES	[4,5]	600	6	32.63	62.14	+	–	+ (18)	–	ENZYMES
Fingerprint	[16,19]	2800	4	5.42	4.42	–	–	+ (2)	+ (2)	Fingerprint
FIRSTMM_DB	[11,12,13]	41	11	1377.27	3074.10	+	–	+ (1)	+ (2)	FIRSTMM_DB
FRANKENSTEIN	[15]	4337	2	16.90	17.88	–	–	+ (780)	–	FRANKENSTEIN
IMDB-BINARY	[14]	1000	2	19.77	96.53	–	–	–	–	IMDB-BINARY
IMDB-MULTI	[14]	1500	3	13.00	65.94	–	–	–	–	IMDB-MULTI
KKI	[26]	83	2	26.96	48.42	+	–	–	–	KKI
Letter-high	[16]	2250	15	4.67	4.50	–	–	+ (2)	–	Letter-high
Letter-low	[16]	2250	15	4.68	3.13	–	–	+ (2)	–	Letter-low
Letter-med	[16]	2250	15	4.67	4.50	–	–	+ (2)	–	Letter-med
MCF-7	[28]	27770	2	26.39	28.52	+	+	–	–	MCF-7
MCF-7H	[28]	27770	2	47.30	49.43	+	+	–	–	MCF-7H
MOLT-4	[28]	39765	2	26.09	28.13	+	+	–	–	MOLT-4
MOLT-4H	[28]	39765	2	46.70	48.73	+	+	–	–	MOLT-4H
Mutagenicity	[16,20]	4337	2	30.32	30.77	+	+	–	–	Mutagenicity
MSRC_9	[13]	221	8	40.58	97.94	+	–	–	–	MSRC_9
MSRC_21	[13]	563	20	77.52	198.32	+	–	–	–	MSRC_21
MSRC_21C	[13]	209	20	40.28	96.60	+	–	–	–	MSRC_21C
MUTAG	[1,23]	188	2	17.93	19.79	+	+	–	–	MUTAG
NCI1	[8,9,22]	4110	2	29.87	32.30	+	–	–	–	NCI1
NCI109	[8,9,22]	4127	2	29.68	32.13	+	–	–	–	NCI109
NCI-H23	[28]	40353	2	26.07	28.10	+	+	–	–	NCI-H23
NCI-H23H	[28]	40353	2	46.67	48.69	+	+	–	–	NCI-H23H
OHSU	[26]	79	2	82.01	199.66	+	–	–	–	OHSU
OVCAR-8	[28]	40516	2	26.07	28.10	+	+	–	–	OVCAR-8
OVCAR-8H	[28]	40516	2	46.67	48.70	+	+	–	–	OVCAR-8H
P388	[28]	41472	2	22.11	23.55	+	+	–	–	P388
P388H	[28]	41472	2	40.44	41.88	+	+	–	–	P388H
PC-3	[28]	27509	2	26.35	28.49	+	+	–	–	PC-3
PC-3H	[28]	27509	2	47.19	49.32	+	+	–	–	PC-3H
Peking_1	[26]	85	2	39.31	77.35	+	–	–	–	Peking_1
PTC_FM	[2,23]	349	2	14.11	14.48	+	+	–	–	PTC_FM
PTC_FR	[2,23]	351	2	14.56	15.00	+	+	–	–	PTC_FR
PTC_MM	[2,23]	336	2	13.97	14.32	+	+	–	–	PTC_MM
PTC_MR	[2,23]	344	2	14.29	14.69	+	+	–	–	PTC_MR
PROTEINS	[4,6]	1113	2	39.06	72.82	+	–	+ (1)	–	PROTEINS
PROTEINS_full	[4,6]	1113	2	39.06	72.82	+	–	+ (29)	–	PROTEINS_full
REDDIT-BINARY	[14]	2000	2	429.63	497.75	–	–	–	–	REDDIT-BINARY
REDDIT-MULTI-5K	[14]	4999	5	508.52	594.87	–	–	–	–	REDDIT-MULTI-5K
REDDIT-MULTI-12K	[14]	11929	11	391.41	456.89	–	–	–	–	REDDIT-MULTI-12K
SF-295	[28]	40271	2	26.06	28.08	+	+	–	–	SF-295
SF-295H	[28]	40271	2	46.65	48.68	+	+	–	–	SF-295H
SN12C	[28]	40004	2	26.08	28.11	+	+	–	–	SN12C
SN12CH	[28]	40004	2	46.69	48.71	+	+	–	–	SN12CH
SW-620	[28]	40532	2	26.05	28.08	+	+	–	–	SW-620
SW-620H	[28]	40532	2	46.62	48.65	+	+	–	–	SW-620H
SYNTHETIC	[3]	300	2	100.00	196.00	–	–	+ (1)	–	SYNTHETIC
SYNTHETICnew	[3,10]	300	2	100.00	196.25	–	–	+ (1)	–	SYNTHETICnew
Synthie	[21]	400	4	95.00	172.93	–	–	+ (15)	–	Synthie
Tox21_AhR_training	[24]	8169	2	18.09	18.50	+	+	–	–	Tox21_AhR_training
Tox21_AhR_testing	[24]	272	2	22.13	23.05	+	+	–	–	Tox21_AhR_testing
Tox21_AhR_evaluation	[24]	607	2	17.64	18.06	+	+	–	–	Tox21_AhR_evaluation
Tox21_AR_training	[24]	9362	2	18.39	18.84	+	+	–	–	Tox21_AR_training
Tox21_AR_testing	[24]	292	2	22.35	23.32	+	+	–	–	Tox21_AR_testing
Tox21_AR_evaluation	[24]	585	2	17.99	18.45	+	+	–	–	Tox21_AR_evaluation
Tox21_AR-LBD_training	[24]	8599	2	17.77	18.16	+	+	–	–	Tox21_AR-LBD_training
Tox21_AR-LBD_testing	[24]	253	2	21.85	22.73	+	+	–	–	Tox21_AR-LBD_testing
Tox21_AR-LBD_evaluation	[24]	580	2	17.09	17.42	+	+	–	–	Tox21_AR-LBD_evaluation
Tox21_ARE_training	[24]	7167	2	16.28	16.52	+	+	–	–	Tox21_ARE_training
Tox21_ARE_testing	[24]	234	2	21.99	22.91	+	+	–	–	Tox21_ARE_testing
Tox21_ARE_evaluation	[24]	552	2	17.01	17.33	+	+	–	–	Tox21_ARE_evaluation
Tox21_aromatase_training	[24]	7226	2	17.50	17.79	+	+	–	–	Tox21_aromatase_training
Tox21_aromatase_testing	[24]	214	2	21.65	22.36	+	+	–	–	Tox21_aromatase_testing
Tox21_aromatase_evaluation	[24]	528	2	16.74	16.99	+	+	–	–	Tox21_aromatase_evaluation
Tox21_ATAD5_training	[24]	9091	2	17.89	18.30	+	+	–	–	Tox21_ATAD5_training
Tox21_ATAD5_testing	[24]	272	2	21.99	22.89	+	+	–	–	Tox21_ATAD5_testing
Tox21_ATAD5_evaluation	[24]	619	2	17.68	18.11	+	+	–	–	Tox21_ATAD5_evaluation
Tox21_ER_training	[24]	7697	2	17.58	17.94	+	+	–	–	Tox21_ER_training
Tox21_ER_testing	[24]	265	2	22.16	23.13	+	+	–	–	Tox21_ER_testing
Tox21_ER_evaluation	[24]	515	2	17.66	18.10	+	+	–	–	Tox21_ER_evaluation
Tox21_ER-LBD_training	[24]	8753	2	18.06	18.47	+	+	–	–	Tox21_ER-LBD_training
Tox21_ER-LBD_testing	[24]	287	2	22.28	23.23	+	+	–	–	Tox21_ER-LBD_testing
Tox21_ER-LBD_evaluation	[24]	599	2	17.75	18.17	+	+	–	–	Tox21_ER-LBD_evaluation
Tox21_HSE_training	[24]	8150	2	16.72	17.04	+	+	–	–	Tox21_HSE_training
Tox21_HSE_testing	[24]	267	2	22.07	23.00	+	+	–	–	Tox21_HSE_testing
Tox21_HSE_evaluation	[24]	607	2	17.61	18.01	+	+	–	–	Tox21_HSE_evaluation
Tox21_MMP_training	[24]	7320	2	17.49	17.83	+	+	–	–	Tox21_MMP_training
Tox21_MMP_testing	[24]	238	2	21.68	22.55	+	+	–	–	Tox21_MMP_testing
Tox21_MMP_evaluation	[24]	541	2	16.67	16.88	+	+	–	–	Tox21_MMP_evaluation
Tox21_p53_training	[24]	8634	2	17.79	18.19	+	+	–	–	Tox21_p53_training
Tox21_p53_testing	[24]	269	2	22.14	23.04	+	+	–	–	Tox21_p53_testing
Tox21_p53_evaluation	[24]	613	2	17.34	17.72	+	+	–	–	Tox21_p53_evaluation
Tox21_PPAR-gamma_training	[24]	8184	2	17.23	17.55	+	+	–	–	Tox21_PPAR-gamma_training
Tox21_PPAR-gamma_testing	[24]	267	2	22.04	22.93	+	+	–	–	Tox21_PPAR-gamma_testing
Tox21_PPAR-gamma_evaluation	[24]	602	2	17.38	17.77	+	+	–	–	Tox21_PPAR-gamma_evaluation
TRIANGLES	[27]	45000	10	20.85	32.74	–	–	–	–	TRIANGLES
TWITTER-Real-Graph-Partial	[26]	144033	2	4.03	4.98	+	–	–	+ (1)	TWITTER-Real-Graph-Partial
UACC257	[28]	39988	2	26.09	28.12	+	+	–	–	UACC257
UACC257H	[28]	39988	2	46.68	48.71	+	+	–	–	UACC257H
Yeast	[28]	79601	2	21.54	22.84	+	+	–	–	Yeast
YeastH	[28]	79601	2	39.44	40.74	+	+	–	–	YeastH

All Data Sets										DS_all

R (N) in the "Num. of Classes" column denotes a regression data set with N tasks per graph.

File Format

The data sets have the following format (replace DS by the name of the data set):

Let

n = total number of nodes
m = total number of edges
N = number of graphs

DS_A.txt (m lines): sparse (block diagonal) adjacency matrix for all graphs; each line holds one entry as a (row, col) pair, that is, a pair of node ids (node_id, node_id). All graphs are undirected. Hence, DS_A.txt contains two entries for each edge.
DS_graph_indicator.txt (n lines): column vector of graph identifiers for all nodes of all graphs, the value in the i-th line is the graph_id of the node with node_id i
DS_graph_labels.txt (N lines): class labels for all graphs in the data set, the value in the i-th line is the class label of the graph with graph_id i
DS_node_labels.txt (n lines): column vector of node labels, the value in the i-th line corresponds to the node with node_id i
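
The layout above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the helper names (parse_graphs, load_dataset, dataset_statistics) are hypothetical, and it assumes that ids are 1-based, that each line of DS_A.txt is a comma-separated pair, that label files hold one integer per line, and that DS_node_labels.txt may be missing for data sets without node labels. The toy data is made up for illustration.

```python
from pathlib import Path


def parse_graphs(a_lines, indicator_lines, graph_label_lines, node_label_lines=None):
    """Split the flat DS_* files into one record per graph_id."""
    graph_of_node = [int(s) for s in indicator_lines if s.strip()]
    graph_labels = [int(s) for s in graph_label_lines if s.strip()]
    graphs = {
        gid: {"label": label, "nodes": [], "node_labels": [], "edges": set()}
        for gid, label in enumerate(graph_labels, start=1)
    }
    node_labels = None
    if node_label_lines is not None:
        node_labels = [int(s) for s in node_label_lines if s.strip()]
    # Assumption: node_id is the 1-based line number in DS_graph_indicator.txt.
    for node_id, gid in enumerate(graph_of_node, start=1):
        graphs[gid]["nodes"].append(node_id)
        if node_labels is not None:
            graphs[gid]["node_labels"].append(node_labels[node_id - 1])
    # Each undirected edge appears twice, as (u, v) and (v, u); keep one copy.
    for line in a_lines:
        if not line.strip():
            continue
        u, v = (int(part) for part in line.split(","))
        gid = graph_of_node[u - 1]
        graphs[gid]["edges"].add((min(u, v), max(u, v)))
    return graphs


def load_dataset(folder, name):
    """Read the files of data set `name` from `folder` (hypothetical helper)."""
    folder = Path(folder)

    def lines(suffix):
        path = folder / f"{name}_{suffix}.txt"
        return path.read_text().splitlines() if path.exists() else None

    return parse_graphs(lines("A"), lines("graph_indicator"),
                        lines("graph_labels"), lines("node_labels"))


def dataset_statistics(graphs):
    """Return (number of graphs, avg. nodes per graph, avg. edges per graph)."""
    count = len(graphs)
    avg_nodes = sum(len(g["nodes"]) for g in graphs.values()) / count
    avg_edges = sum(len(g["edges"]) for g in graphs.values()) / count
    return count, avg_nodes, avg_edges


# Toy data set: graph 1 is a triangle on nodes 1-3, graph 2 is one edge on nodes 4-5.
toy = parse_graphs(
    a_lines=["1, 2", "2, 1", "2, 3", "3, 2", "1, 3", "3, 1", "4, 5", "5, 4"],
    indicator_lines=["1", "1", "1", "2", "2"],
    graph_label_lines=["1", "-1"],
    node_label_lines=["0", "0", "1", "2", "2"],
)
toy_stats = dataset_statistics(toy)
```

With an extracted archive on disk, load_dataset("path/to/MUTAG", "MUTAG") would be the corresponding call (the path is a placeholder). Storing each edge once as a sorted pair is a design choice of this sketch; it makes the per-graph edge count directly comparable to an average number of undirected edges.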

There are optional files if the respective information is available:

DS_edge_labels.txt (m lines; same size as DS_A.txt): labels for the edges in DS_A.txt
DS_edge_attributes.txt (m lines; same size as DS_A.txt): attributes for the edges in DS_A.txt
DS_node_attributes.txt (n lines): matrix of node attributes, the comma-separated values in the i-th line form the attribute vector of the node with node_id i
DS_graph_attributes.txt (N lines): regression values for all graphs in the data set, the value in the i-th line is the attribute of the graph with graph_id i
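
The optional files can be handled in the same line-by-line fashion. The sketch below is illustrative only: parse_rows and attach_to_edges are hypothetical helper names, the toy values are made up, and it assumes that every optional file holds comma-separated numbers and that a regression data set with several tasks stores one comma-separated row per graph.

```python
def parse_rows(lines, convert=float):
    """Turn comma-separated lines into lists of converted values."""
    return [[convert(value) for value in line.split(",")]
            for line in lines if line.strip()]


def attach_to_edges(a_lines, per_edge_rows):
    """Pair line i of DS_A.txt with row i of an edge label or attribute file."""
    entries = [tuple(pair) for pair in parse_rows(a_lines, int)]
    if len(entries) != len(per_edge_rows):
        raise ValueError("edge file must have as many lines as DS_A.txt")
    return dict(zip(entries, per_edge_rows))


# Toy contents (made up): 3 nodes, 2 undirected edges, 1 graph.
a_lines = ["1, 2", "2, 1", "2, 3", "3, 2"]
node_attributes = parse_rows(["0.5, 1.0", "2.0, -3.5", "0.0, 4.25"])
edge_labels = attach_to_edges(a_lines, parse_rows(["1", "1", "2", "2"], int))
# One graph with three regression targets, as in an R (3) data set.
graph_targets = parse_rows(["0.1, 0.2, 0.3"])
```

Because the edge files have the same number of lines as DS_A.txt, both directions of an undirected edge carry their own row; the sketch keeps both entries rather than assuming they are identical.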

Deep Learning Libraries

The data sets can also be accessed using PyTorch Geometric and the Deep Graph Library.

Citing this Website

We encourage you to refer to our website at http://graphkernels.cs.tu-dortmund.de if you have used the data sets for your publication. Please use the following BibTeX citation:

@misc{KKMMN2016,
  title  = {Benchmark Data Sets for Graph Kernels},
  author = {Kristian Kersting and Nils M. Kriege and Christopher Morris and Petra Mutzel and Marion Neumann},
  year   = {2016},
  url    = {http://graphkernels.cs.tu-dortmund.de}
}

If your bibliography style does not support the url field, you may use this alternative:

@misc{KKMMN2016,
  title  = {Benchmark Data Sets for Graph Kernels},
  author = {Kristian Kersting and Nils M. Kriege and Christopher Morris and Petra Mutzel and Marion Neumann},
  year   = {2016},
  note   = {\url{http://graphkernels.cs.tu-dortmund.de}}
}

Bibliography

[1] Debnath, A.K., Lopez de Compadre, R.L., Debnath, G., Shusterman, A.J., and Hansch, C. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J. Med. Chem. 34(2):786-797 (1991).

[2] Helma, C., King, R. D., Kramer, S., and Srinivasan, A. The Predictive Toxicology Challenge 2000–2001. Bioinformatics, 2001, 17, 107-108. URL: www.predictive-toxicology.org/ptc

[3] Feragen, A., Kasenburg, N., Petersen, J., de Bruijne, M., Borgwardt, K.M.: Scalable kernels for graphs with continuous attributes. In: C.J.C. Burges, L. Bottou, Z. Ghahramani, K.Q. Weinberger (eds.) NIPS, pp. 216-224 (2013).

[4] K. M. Borgwardt, C. S. Ong, S. Schoenauer, S. V. N. Vishwanathan, A. J. Smola, and H. P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(Suppl 1):i47–i56, Jun 2005.

[5] I. Schomburg, A. Chang, C. Ebeling, M. Gremse, C. Heldt, G. Huhn, and D. Schomburg. Brenda, the enzyme database: updates and major new developments. Nucleic Acids Research, 32D:431–433, 2004.

[6] P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. J. Mol. Biol., 330(4):771–783, Jul 2003.

[7] Sutherland, J. J.; O'Brien, L. A. & Weaver, D. F. Spline-fitting with a genetic algorithm: a method for developing classification structure-activity relationships. J. Chem. Inf. Comput. Sci., 2003, 43, 1906-1915.

[8] N. Wale and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. In Proc. of ICDM, pages 678–689, Hong Kong, 2006.

[9] http://pubchem.ncbi.nlm.nih.gov

[10] http://image.diku.dk/aasa/papers/graphkernels_nips_erratum.pdf

[11] M. Neumann, P. Moreno, L. Antanas, R. Garnett, K. Kersting. Graph Kernels for Object Category Prediction in Task-Dependent Robot Grasping. Eleventh Workshop on Mining and Learning with Graphs (MLG-13), Chicago, Illinois, USA, 2013.

[12] http://www.first-mm.eu/data.html

[13] M. Neumann, R. Garnett, C. Bauckhage, and K. Kersting. Propagation kernels: efficient graph kernels from propagated information. Machine Learning, 102(2):209–245, 2016.

[14] Pinar Yanardag and S.V.N. Vishwanathan. 2015. Deep Graph Kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 1365-1374.

[15] Francesco Orsini, Paolo Frasconi, and Luc De Raedt. 2015. Graph invariant kernels. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI'15), Qiang Yang and Michael Wooldridge (Eds.). AAAI Press, 3756-3762.

[16] Riesen, K. and Bunke, H.: IAM Graph Database Repository for Graph Based Pattern Recognition and Machine Learning. In: da Vitoria Lobo, N. et al. (Eds.), SSPR&SPR 2008, LNCS, vol. 5342, pp. 287-297, 2008.

[17] AIDS Antiviral Screen Data (2004)

[18] S. A. Nene, S. K. Nayar and H. Murase. Columbia Object Image Library (COIL-100), Technical Report, Department of Computer Science, Columbia University CUCS-006-96, Feb. 1996.

[19] NIST Special Database 4

[20] Jeroen Kazius, Ross McGuire, and Roberta Bursi. Derivation and Validation of Toxicophores for Mutagenicity Prediction. Journal of Medicinal Chemistry, 2005, 48(1), 312-320.

[21] Christopher Morris, Nils M. Kriege, Kristian Kersting, Petra Mutzel. Faster Kernels for Graphs with Continuous Attributes via Hashing, IEEE International Conference on Data Mining (ICDM) 2016

[22] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. 2011. Weisfeiler-Lehman Graph Kernels. J. Mach. Learn. Res. 12 (November 2011), 2539-2561.

[23] Nils Kriege, Petra Mutzel. 2012. Subgraph Matching Kernels for Attributed Graphs. International Conference on Machine Learning 2012.

[24] Tox21 Data Challenge 2014

[25] Nils M. Kriege, Matthias Fey, Denis Fisseler, Petra Mutzel, Frank Weichert. Recognizing Cuneiform Signs Using Graph Based Methods. International Workshop on Cost-Sensitive Learning (COST), SIAM International Conference on Data Mining (SDM) 2018, 31-44, arXiv:1802.05908.

[26] A Repository of Benchmark Graph Datasets for Graph Classification

[27] Boris Knyazev, Graham W. Taylor, Mohamed R. Amer. Understanding Attention and Generalization in Graph Neural Networks

[28] Chemical DataSets

[29] Alchemy: A Quantum Chemistry Dataset for Benchmarking AI Models

Contact

If you have any questions regarding the data sets or are interested in adding your graph data, please write an email to christopher.morris@tu-dortmund.de.