Using Scaffold Hunter for the Visual Analysis of Chemical Datasets
Nils Kriege and Till Schäfer, Hands-on Workshops, OpenTox 2013
Introduction
Scaffold Hunter is a tool for the visual analysis and interactive exploration of chemical space. The software supports the integration of chemical datasets from various sources and provides several complementary modules for analysis and visualization.
In this tutorial we will:
- Import data into the internal database.
- Add properties from other sources to the existing dataset.
- Enrich the stored data by calculating molecular descriptors and a scaffold-based classification scheme.
- Use the different modules of Scaffold Hunter to analyze the data in different views, e.g., by means of the scaffold tree or hierarchical clustering algorithms. This includes the mapping of chemical properties to visual attributes of the individual views.
- Demonstrate the global linkage, selection and annotation concept across views.
- Use filtering and selection to refine the dataset and demonstrate the built-in subset management.
Scaffold Hunter Homepage: http://scaffoldhunter.sourceforge.net
Requirements
Software
- JRE: A Java VM supporting Java SE 6 code is required to run the program. Please refer to http://www.java.com for installation instructions concerning Java.
- Database: Scaffold Hunter comes with HSQLDB and MySQL support. HSQLDB does not need any additional software to be installed, but is only sufficient to store small personal datasets. For production use a MySQL server is recommended. Please refer to http://www.mysql.com for installation instructions concerning MySQL.
Hardware
- CPU: A CPU with at least 2 GHz is recommended to run the program.
- RAM: At least 1 GB must be available to the Java Virtual Machine running the program. 4 GB are recommended for full functionality.
- Hard Disk: 40 MB of free space is sufficient to store the program and configuration files. Additionally, some space for the database is needed if it is installed on the client computer. The amount depends on how much data you wish to store and which database system you use (see below).
Tutorial
Getting Started
Download a Molecular Dataset
Download the sample dataset and extract it to some location on your hard drive. We will need this dataset later during the import process.
Installation & Program Start
Scaffold Hunter is written completely in Java, therefore no installation is required to run the application. Simply download the latest version of Scaffold Hunter and extract it somewhere on your computer (e.g., in case you have admin rights, C:\Program Files\ScaffoldHunter on Windows or /opt/scaffoldhunter on Unix/Linux/Mac, or alternatively in any local directory). To launch Scaffold Hunter, simply run the supplied start-script (run.cmd on Windows or run.sh on Unix/Linux/Mac). Now you should see the Start Dialog:
Setting Up a Database Back-end
Before Scaffold Hunter is usable, a database back-end needs to be configured. The database is used to store the imported molecules, scaffolds, attributes, presets, and any configuration you make inside the program. It is not a connection to a publicly available database from which you might want to import data. Scaffold Hunter comes with HSQLDB and MySQL support. HSQLDB does not need any additional software to be installed, and the database can simply be saved somewhere in your file system. Please note that HSQLDB is only sufficient to store small personal datasets. For production use and large datasets a MySQL server is recommended. Also, the collaborative features of Scaffold Hunter are only available if you are sharing a common database (which is only possible using MySQL).
You can select already configured database connections using the drop-down box in the Start Dialog. To add a new database to this list, click Databases. This opens the Databases Dialog:
On the left-hand side of the window, the existing database connections are shown. With the plus and minus buttons you can create a new connection or remove an existing one. On the right-hand side the settings for the selected database are shown. Select a name for your database connection and select the database type. In this tutorial we will only explain the use of HSQLDB. If you want to use the MySQL database, please refer to section 4.1 Setting up a Database Connection in the manual. After selecting the HSQLDB database type, the location for your database in the file system needs to be configured. Choose any suitable location (e.g., /home/user/scaffoldhunter/dbname or C:\Users\username\scaffoldhunter\dbname) and click OK.
Creating a User Profile
A user profile is used to save your personal settings and sessions (we will come to that later) in the database. As mentioned before, multiple users can connect to the same database and use collaborative features. You can also connect to the same database from different PCs, so you do not need to think about database synchronization between your different clients.
To create a new user, click Create User in the Start Dialog. This opens the User Creation Dialog.
Select the previously created database connection, choose a username and password and click OK.
Log In
To log in simply choose the previously created database connection in the Start Dialog, type in your username and password and click Log In. You should now see the session dialog:
Manage Datasets
The session dialog should be empty at this point. Before we can actually use Scaffold Hunter for analyzing a dataset, we have to import it into the internal database. To open the Dataset Management Dialog, click Manage Datasets.
The Dataset Management Dialog gives you the possibility to import, modify or delete the molecular data stored in Scaffold Hunter’s internal database. In Scaffold Hunter you can manage your projects by the use of datasets. Each dataset consists of a set of molecules and a collection of properties for these molecules. For each dataset you can generate an arbitrary number of scaffold trees using different generation rules. You will later use the scaffold tree of your choice, to navigate through the molecules in the dataset.
Create a New Dataset
Select the Input Sources
Click New Dataset to open the Create Import Job Dialog.
Using different import plugins, Scaffold Hunter can import data from different sources. Currently there are plugins to import data from CSV files, SDF files and SQL databases bundled with the program. You can also write your own plugins to support arbitrary sources (see section 5.2.1.2: Import Plugins in the manual for details).
Please enter a title and description of the dataset at the top of the dialog now. Select the SDF import plugin and provide the path to the downloaded and extracted SDF file from the beginning of this tutorial. After this you can give the import job a name (do not mix this up with the dataset title) and click Create New Import Job. You will see your import jobs on the left side below the import plugins.
It is possible to import and merge multiple sources by creating more than one import job. These jobs will run sequentially and the data imported by those jobs will be merged according to rules which can be selected during the import process. Scaffold Hunter considers two structures equal if they have the same canonical SMILES string. If a structure is present in more than one source the property entries for the structure will be merged where applicable.
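The merge described above can be pictured with a minimal sketch. It is not Scaffold Hunter's implementation: the function name merge_sources and the sample records are hypothetical, and the input strings are assumed to be canonical SMILES already (no cheminformatics toolkit is used here).

```python
from typing import Dict, Iterable, Tuple

# A record is (canonical SMILES, property dictionary). The SMILES strings
# are ASSUMED to be canonical already; this sketch does not canonicalize.
Record = Tuple[str, Dict[str, object]]


def merge_sources(*sources: Iterable[Record]) -> Dict[str, Dict[str, object]]:
    """Merge records from several sources, keyed by canonical SMILES."""
    merged: Dict[str, Dict[str, object]] = {}
    for source in sources:  # processed one after another, like import jobs
        for smiles, props in source:
            entry = merged.setdefault(smiles, {})
            for name, value in props.items():
                # Simplification: keep the first value seen for a property.
                # A real import would apply the selected merge strategy.
                entry.setdefault(name, value)
    return merged


# Hypothetical sample data from two import jobs.
sdf_job = [("CCO", {"name": "ethanol"}), ("c1ccccc1", {"name": "benzene"})]
csv_job = [("CCO", {"IC50": 12.5})]
dataset = merge_sources(sdf_job, csv_job)
```

Here the structure CCO occurs in both sources, so its entry ends up with the properties of both, while benzene keeps only the properties of the first source.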
For simplicity we will only import a single data source. Please click Start Import.
Property Mapping
The Property Mapping Dialog will appear after you have started the import.
Every import job has the same GUI elements. Above each import job there is a line with its name.
In the Structure name property drop-down box you select one of the properties for the name field of the structures. The name field is used in several places in the program to present a name together with a structure; however, it is not used internally, so there are no special requirements for this property.
In the Name merge strategy and Molecular structure merge strategy drop-down boxes you can specify the merge strategy to apply if a structure is contained multiple times in the same input source or has already been imported by a previous input source. You can choose between do not overwrite, overwrite and concatenate.
Molecular structures often come together with arbitrary properties (e.g., IC50 values, molecular mass, and others). These properties are displayed as a table with their source names in the first column. To import these values into Scaffold Hunter, you need to map the properties of the source file to properties of the internal database. This is especially useful if you import data from different sources and the same property is named differently in these sources. In this case you can simply select a common internal property and the values of the two sources are merged. There are two ways to specify mappings. If you want to import all properties and create an internal property for each, you can simply click Map all Unmapped Properties and Scaffold Hunter will do all the work for you. If you have a more complicated setup, then you can map them individually by clicking in the Mapped to column and selecting an existing property or creating a new one. A property has a name, a description and a type (numerical or string). It is important to set the type correctly, as some operations are different for different types. For example, you can filter a dataset by a numerical property that is larger than some specified value only if the property is considered numerical. Scaffold Hunter tries to auto-detect whether a property is numerical or not. If you use the automated mapping, check that the type is set correctly!
You can also set a transform function for numerical properties and a merge strategy for each property. If you are interested in these features, refer to section 5.2.1. Importing Data in the manual.
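As a rough illustration of what the three merge strategies mean for a single value, consider the following sketch. The function merge_value and the separator used for concatenation are assumptions made for this example; the tutorial does not specify how Scaffold Hunter joins concatenated values.

```python
def merge_value(existing, incoming, strategy, separator="; "):
    """Combine an already stored value with a newly imported one.

    The strategy names follow the dialog; the separator is an assumption.
    """
    if existing is None:
        return incoming  # nothing stored yet, so there is no conflict
    if strategy == "do not overwrite":
        return existing
    if strategy == "overwrite":
        return incoming
    if strategy == "concatenate":
        return f"{existing}{separator}{incoming}"
    raise ValueError(f"unknown merge strategy: {strategy!r}")
```

For example, merging the names "Aspirin" and "Acetylsalicylic acid" keeps the first with do not overwrite, keeps the second with overwrite, and keeps both with concatenate.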
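To see why the auto-detected type is worth checking, here is one plausible detection heuristic as a minimal sketch. It is an assumption about how such a check could work, not the rule Scaffold Hunter actually uses, and the function name looks_numerical is hypothetical.

```python
def looks_numerical(values):
    """Return True if every non-empty value parses as a floating point number."""
    seen_value = False
    for raw in values:
        text = str(raw).strip()
        if not text:
            continue  # missing entries do not decide the type
        try:
            float(text)
        except ValueError:
            return False  # a single non-numeric entry makes it a string property
        seen_value = True
    return seen_value  # a column with no values at all is not treated as numerical
```

With a heuristic like this, a column of identifiers such as "00123" would be classified as numerical even though it is better treated as a string, which is the kind of case to look for after automated mapping.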
For the later workflow in this tutorial, please map all properties using the button at the top right corner. After you are done click OK.
Import
The import will now start and display some information about the import process. If the import process is finished, click Close.
Create a Scaffold Tree
The scaffold tree is a hierarchical classification scheme for molecules. It must be calculated in advance and at least one tree for each dataset must exist to use Scaffold Hunter.
To calculate a scaffold tree select the imported dataset in the Dataset Management Dialog and click New Tree.
The New Tree Dialog allows you to select a title and a description of the tree. Furthermore, it is possible to edit the rules for scaffold pruning. Refer to section 5.2.3.1. Using Custom Rules in the manual if you would like to do so. In this tutorial we will just use the default rules and click Generate Tree. After the calculation has finished, click Close.
Calculate Additional Properties
Since we have imported a dataset and calculated a scaffold tree, we are now able to start the visual analysis. However, if we want to use cluster analysis, fingerprints will help us to cluster on the structure of the molecules. Therefore, we need to create these fingerprints, as they were not included in the source files. In general, it is possible to calculate arbitrary properties with calculation plugins, and the use of these properties is not limited to clustering. For further details and information on how to write your own plugins, refer to section 5.2.2. Calculation of Properties of the manual.
Select the dataset in the Dataset Management Dialog and click Calculate Properties.
Please select the DaylightBitFingerprinter plugin on the left side. You will now see some options for the generation of the fingerprint. After the configuration is finished (we do not need to change anything in this tutorial), click Create new Calc Job. Repeat these steps for the EStateBitFingerprinter and the EStateNumericalFingerprinter. You will now have three calculation jobs listed on the left. Click Start Calc, wait until the calculation is finished and click Close.
Visual Analysis with Scaffold Hunter
Close the Dataset Management Dialog.
Sessions
After the dataset import, tree generation and property calculation have been performed, we are now ready to create a session for visual analysis of the dataset. Think of a session as a representation of the complete state of the workspace. It represents an optionally filtered dataset and a specific tree plus the visual settings such as the opened views and their configuration on the dataset.
To create a new session, choose a title, a dataset and a scaffold tree. Then click Create. The Filter Dialog pops up.
The Filter Dialog allows you to filter the dataset before actually displaying it. This is helpful for very large datasets that Scaffold Hunter is unable to display completely. Please note that further filtering is possible later, during visual inspection.
The filters are stored as presets in Scaffold Hunter. This is useful as one may want to apply the same filter settings to different datasets or subsets of a dataset. To create such a filter preset, click the new button at the bottom left corner of the Filter Dialog. You can now choose a title for the filter preset and select properties plus filter criteria for them. During configuration of a filter the remaining number of molecules is displayed at the bottom of the Filter Dialog. Once you have configured the filter, click Save to save it and then click OK to apply it. You do not need to filter the sample dataset for this tutorial.
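Conceptually, a filter preset is a reusable list of property criteria that can be applied to any set of molecules. The following sketch illustrates that idea under stated assumptions: the class Criterion, the function apply_preset, the property names and the sample molecules are all hypothetical and do not reflect Scaffold Hunter's internal data model.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Criterion:
    """One filter rule: compare a numerical property against a threshold."""
    prop: str
    compare: Callable[[float, float], bool]
    threshold: float


def apply_preset(molecules: List[Dict[str, float]],
                 preset: List[Criterion]) -> List[Dict[str, float]]:
    """Keep the molecules that satisfy every criterion of the preset."""
    return [
        mol for mol in molecules
        # Molecules lacking the property are dropped in this sketch.
        if all(c.prop in mol and c.compare(mol[c.prop], c.threshold)
               for c in preset)
    ]


# Hypothetical preset and data; the same preset can be reused on other datasets.
preset = [Criterion("mass", operator.lt, 500.0),
          Criterion("IC50", operator.le, 10.0)]
molecules = [{"mass": 320.4, "IC50": 2.1},
             {"mass": 612.7, "IC50": 0.5},
             {"mass": 410.0, "IC50": 55.0}]
remaining = apply_preset(molecules, preset)
```

The length of the result corresponds to the remaining number of molecules that the dialog reports while a filter is being configured.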
After filtering you will see the new Session in the sessions list.
The Main Window
Select the new session and click Open. The Main Window will open.
The figure shows parts dedicated to different functionalities highlighted in different colors. The region in blue contains controls for the global subset and selection management, the red pane shows the currently open views organized in tabs and the green area contains view-specific controls and widgets arranged in a tool- and sidebar, respectively. Both are dynamically adjusted whenever a different tab is activated. Note that there are four tabs already open displaying the same data in different views. Feel free to switch between the tabs, the individual views will be explained in the following.
Views
Table View
The table view lists the imported molecules in the familiar spreadsheet form. You can adjust the height of columns using the buttons in the toolbar. Furthermore, you can reorder and resize columns or make them “sticky” to keep them fixed at the left-hand side when scrolling. Just click the title of a column to sort the data according to a specific property. You can select molecules by clicking on a row. Note that the selected molecules are shown at the bottom right corner and you can browse the selection by clicking the arrows below the picture.
Scaffold Tree View
Switch to the scaffold tree view. You may zoom and pan by using the mouse wheel and dragging the mouse while pressing the left mouse button, respectively. When you zoom in, the depiction of scaffolds is adjusted and shows more details. Note that initially only the first two levels of the tree are displayed. You may expand individual branches by clicking the +-symbol or using the context menu invoked by a right click on a scaffold, or expand all nodes using the button in the toolbar. To locate the previously selected molecules you may click the magnifying lens button below the selected molecule at the bottom right corner. The camera will focus on the scaffold the molecule belongs to. Note that this feature is supported by all views to easily locate selected molecules.
There are several ways to map molecular properties to visual attributes of the scaffold tree view: Using the left side bar you can sort the scaffolds according to properties. Click the third button from the right at the toolbar to open the property mapping dialog. For each visual attribute you can select a property and how it is mapped to different colors.
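Mapping a numerical property to a color usually amounts to normalizing the value to a range and interpolating between two colors. The sketch below shows this idea only; the function map_to_color, the linear interpolation and the default colors are assumptions for illustration, not the mapping options offered by the property mapping dialog.

```python
def map_to_color(value, lowest, highest,
                 start=(255, 255, 255), end=(255, 0, 0)):
    """Linearly interpolate an RGB color for a value within [lowest, highest]."""
    if highest == lowest:
        fraction = 0.0  # degenerate range: always use the start color
    else:
        fraction = (value - lowest) / (highest - lowest)
        fraction = min(1.0, max(0.0, fraction))  # clamp values outside the range
    return tuple(round(s + fraction * (e - s)) for s, e in zip(start, end))
```

With the defaults, the smallest property value is drawn in white and the largest in red, with intermediate values shading between the two.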
Dendrogram View
The dendrogram view allows SAHN clustering of molecules and supports various distance measures and linkage schemes. To cluster the sample dataset based on the precomputed Daylight fingerprints, select the distance Tanimoto Bit. On the right-hand side all available bit fingerprints will be displayed. Select DaylightBitFingerprint and choose the desired linkage method, e.g., Ward linkage.
After clicking the Start Clustering button in the left side bar, the cluster analysis is performed and the result is displayed as a dendrogram. The view is zoomable and molecules will be depicted when zooming in. The red cluster selection bar can be dragged using the left mouse button to interactively cut the dendrogram at a specific height to define a clustering with variable granularity. It is possible to add or remove clusters and arbitrary subtrees of the dendrogram to or from the selection by clicking the associated horizontal line in the dendrogram. Furthermore, it is possible to display an embedded table view which highlights the cluster members by color. You can activate the table by clicking the third button from the left in the toolbar.
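For readers unfamiliar with the Tanimoto measure on bit fingerprints, a minimal sketch follows. Fingerprints are represented as sets of the indices of bits that are set, and the distance is taken as one minus the Tanimoto coefficient; both choices are assumptions for this illustration and the function name tanimoto_distance is hypothetical.

```python
def tanimoto_distance(bits_a, bits_b):
    """Distance between two bit fingerprints given as sets of set-bit indices.

    Tanimoto coefficient = |A and B| / |A or B|; distance = 1 - coefficient.
    """
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical here
    return 1.0 - len(bits_a & bits_b) / union
```

Two fingerprints sharing two of four distinct set bits, such as {1, 2, 3} and {2, 3, 4}, have a coefficient of 0.5 and therefore a distance of 0.5; identical fingerprints have distance 0 and fingerprints without common bits have distance 1.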
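Cutting a dendrogram at a given height can be sketched as follows. The representation is an assumption chosen for this example: leaves are numbered from 0, each merge is a tuple (cluster_a, cluster_b, height), and the i-th merge creates a new cluster with the id n_leaves + i. Merge heights are assumed to be non-decreasing, and the function name cut_dendrogram is hypothetical.

```python
def cut_dendrogram(n_leaves, merges, cut_height):
    """Return the clusters (lists of leaf ids) obtained by cutting at cut_height."""
    # parent[x] == x means x is currently the top of its cluster.
    parent = list(range(n_leaves + len(merges)))

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # shorten the path while walking up
            node = parent[node]
        return node

    for index, (a, b, height) in enumerate(merges):
        if height <= cut_height:  # only merges below the cut line are applied
            new_id = n_leaves + index
            parent[find(a)] = new_id
            parent[find(b)] = new_id

    clusters = {}
    for leaf in range(n_leaves):
        clusters.setdefault(find(leaf), []).append(leaf)
    return sorted(clusters.values())


# Hypothetical dendrogram over four molecules.
merges = [(0, 1, 0.2), (2, 3, 0.3), (4, 5, 0.9)]
```

Moving the cut from a low to a high value corresponds to dragging the selection bar: a cut at 0.1 leaves four singleton clusters, a cut at 0.5 yields two clusters, and a cut at 1.0 yields a single cluster containing all molecules.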
Linkage & Coordination of Views
Scaffold Hunter uses a global selection mechanism over all views. Whenever a molecule is selected, it is highlighted in all views simultaneously. The selection browser provides quick access to selected molecules and allows you to locate them in the active view. The current selection can be stored as a subset by clicking the Make Subset button in the right side bar. An entry will be added to the subset tree. New subsets can also be created by right-clicking on a subset and selecting Filter… from the context menu. The same filter dialog as explained earlier allows you to refine the subset. Each subset can be shown in the different views, and for all active views the underlying dataset can be changed while preserving the current settings and property mappings.
The sample dataset contains inhibitors and activators. To create two subsets for these classes you may use the table view: Sort the table by the property Activity Direction, clear the current selection by using the button Clear at the bottom right corner, select the range of molecules with Activity Direction “increasing” and click “Make Subset” to obtain a subset for the activators. You can create another subset for the inhibitors using the same technique. Alternatively, you can select the activator subset, right-click on the root subset and choose Make Difference from the context menu to obtain a new subset containing all molecules of the root subset that are not contained in the activators subset.
Switch to the scaffold tree view. Currently the view shows the root subset as indicated by the bold entry in the subset tree. Right-click on one of the two subsets and select Replace Selection. The molecules of the subset are highlighted in the current view. Double-click on one of the subsets to change the dataset the view is based on. Note that the active property mapping is preserved as shown in the figure:
Changing subsets is possible for all views, and you can also open new views showing individual subsets by right-clicking on a subset and selecting Show in New View. You can also split the tab pane horizontally or vertically to inspect two views simultaneously by selecting Windows → Split Horizontally or Split Vertically, respectively.
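A global selection shared by several views is commonly realized with a listener (observer) pattern. The sketch below illustrates that general idea only; the classes Selection and View are hypothetical and are not taken from the Scaffold Hunter source code.

```python
class Selection:
    """A single shared selection that notifies every registered listener."""

    def __init__(self):
        self._molecules = frozenset()
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def replace(self, molecules):
        self._molecules = frozenset(molecules)
        for callback in self._listeners:
            callback(self._molecules)  # every view learns about the change


class View:
    """A view that highlights whatever the shared selection contains."""

    def __init__(self, name, selection):
        self.name = name
        self.highlighted = frozenset()
        selection.subscribe(self._on_selection_changed)

    def _on_selection_changed(self, molecules):
        self.highlighted = molecules


selection = Selection()
table = View("table", selection)
tree = View("scaffold tree", selection)
selection.replace({"mol1", "mol7"})  # both views now highlight the same molecules
```

Because every view listens to the same selection object, selecting molecules in one place updates the highlighting everywhere without the views knowing about each other.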
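In set terms, the two routes to the inhibitor subset can be sketched as follows. The molecule identifiers and the value "decreasing" are hypothetical; only the value "increasing" for activators is taken from the tutorial.

```python
# Hypothetical excerpt of the dataset: molecule id -> Activity Direction.
activity_direction = {
    "mol1": "increasing",
    "mol2": "decreasing",
    "mol3": "increasing",
    "mol4": "decreasing",
}

root = set(activity_direction)

# Route 1: select by property value and make a subset from the selection.
activators = {m for m, d in activity_direction.items() if d == "increasing"}

# Route 2 (Make Difference): everything in the root subset that is
# not contained in the activators subset.
inhibitors = root - activators
```

The difference route does not depend on how the remaining molecules are labeled, which is convenient when a property has missing or inconsistent values.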
Conclusion
Thank you for taking the time to do the Scaffold Hunter tutorial! You have now learned how to start the program, import data and calculate additional properties. The individual views and basic concepts of Scaffold Hunter have been demonstrated. Feel free to further explore the dataset on your own. We hope you had fun using Scaffold Hunter!