Molecular Descriptors For Cheminformatics Pdf To Excel

The familiar Chemical Spreadsheet paradigm is an extremely useful way of presenting structural information together with calculated or measured structural properties. Indeed, most software which handles or stores chemical data will make available a tabular view implementing at least some of the more common spreadsheet functionality such as sorting by columns. Many excellent chemical spreadsheet tools are commercially available and there are also notable freeware/open source examples [1]. Most such software is self-contained which, of course, gives the developers maximum freedom of implementation. This approach has certain potential disadvantages however, particularly considered in the context of a corporate environment:

An interested user needs to buy/download and install the software. This of course is trivial in the case of a 'home' or independent user but may pose almost insurmountable challenges in a 'locked-down' corporate environment
The user must get to grips with an entirely new piece of software overcoming a potentially steep learning curve
It is extremely difficult to provide spreadsheet features (powerful calculated columns, visualisation, macro language, etc) which begin to rival those of the industry standard, Microsoft Excel - a program already very familiar to target users.

The last point suggests a different approach in which the chemistry engine is built on top of Excel. This tactic appears extremely attractive, partly because the potential developer can concentrate on implementing chemical functionality but also because of the ubiquity and power of Excel. Two well-known realisations of this approach are Isis for Excel [2] and Accord for Excel [3].

Solutions of this type are typically implemented as Excel add-ins, using Visual Basic for Applications (VBA) to interface with chemistry engines. Structures are usually stored on the spreadsheets as some kind of object (including structure-layout or image data) which may be interpreted by the chemistry engine for visualisation and calculation purposes. To ensure that structure objects display and sort properly, it is usually necessary to intercept several of Excel's fundamental calls (such as the main calculation routine). This necessity, together with the size of the stored objects, can lead to rapid degradation of performance for spreadsheets containing large numbers of structures.

Bearing the foregoing in mind, LICSS was designed to appeal particularly to corporate users of Excel for Windows. Because of one of the authors' experience of corporate locked-down environments and because LICSS was to be a 'hobby' project, initially with just one spare-time developer, some rather specific design criteria were developed:

LICSS should require no installation beyond file copying. Users should be able to share spreadsheets with fully automatic installation (if necessary)
LICSS would implement chemistry functionality by interfacing with the excellent CDK Java library [4, 5] (and the corresponding rendering package, JChemPaint [6])
Structures should be stored purely as Smiles strings in cells; structure rendering would be on-the-fly
LICSS spreadsheets would not intercept Excel's calculation calls
An Excel add-in would not be used (they normally need user installation and can require admin privileges). Any necessary VBA would exist on each chemically-enabled spreadsheet.

User Implementation and Features

From a user's point of view, LICSS is implemented as a single Excel for Windows workbook with just one routine which allows chemical enabling of any suitable spreadsheets (containing Smiles strings) and associated charts (Figure 1). Once enabled, the spreadsheets are entirely standalone, requiring no add-ins or any customisation of Excel [7]. If shared with other users, or moved to a workstation without LICSS installation, the enabled sheets install LICSS seamlessly (if available in some shared area) or, if necessary, prompt the user to allow automatic file install from the LICSS project site on Google projects [8].

LICSS-enabled sheets use JChemPaint to render Smiles strings in a pop-up window (Figure 2). This is activated by clicking directly on the Smiles string, choosing a shortcut key to show the first structure on a row, or by mouse hover over scatter chart data points. If desired, users can also choose to display structures for all visible cells (Figure 3). The routine which achieves this calculates only which cells are currently visible to the user and renders the structures for them on-the-fly. This method ensures that even very large sheets (> 100,000 compounds) may be visualised without running out of memory.
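
The visible-cells strategy described above can be illustrated with a minimal Python sketch. It is not the LICSS VBA code: the function name, the 0-based row indices and the render callback are all assumptions made for illustration. The point is only that the rendering cost depends on the size of the visible window, not on the size of the sheet.

```python
from typing import Callable, Dict, Sequence


def render_visible_rows(
    smiles_column: Sequence[str],
    first_visible_row: int,
    visible_row_count: int,
    render: Callable[[str], object],
) -> Dict[int, object]:
    """Render only the rows inside the visible window (0-based row indices).

    `render` stands in for whatever turns a Smiles string into a picture;
    it is a hypothetical callback, not a LICSS or JChemPaint function.
    """
    start = max(first_visible_row, 0)
    stop = min(start + max(visible_row_count, 0), len(smiles_column))
    rendered = {}
    for row in range(start, stop):
        smiles = smiles_column[row]
        if smiles:  # skip empty cells
            rendered[row] = render(smiles)
    return rendered
```

With a 100,000-row column and a 30-row window, the callback is invoked at most 30 times, and images for rows that scroll out of view can simply be discarded.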

Clicking on the 'LICSS Programs' worksheet tab gives access to a single menu making all other LICSS functionality available (Figure 4).

Routines are currently available for substructure and similarity searching, fingerprint generation (for faster substructure searching), R Group table generation, Jarvis-Patrick clustering, Sammon map coordinate generation (see Figure 5 for a scatter plot created from LICSS-generated Sammon map coordinates), diverse compound picking, molecular descriptor calculation and conversion of IUPAC names to Smiles (using the OPSIN Java library [9]). New Excel formulas are also available - for calculating molecular descriptors, molecular weight or molecular formula and for determining whether one Smiles string is a substructure of, or is similar to, another Smiles string (within a defined threshold). Table 1 gives some indicative data for the performance a user can expect from LICSS functionality.

Table 1

Timings for common cheminformatics tasks using LICSS.

Dataset	Operation	Timing (m:s)	Hits
[1]	SSS (Sub Structure Search) with n1cnccc1 (Smarts matching)	0:13	76
[1]	SSS with pyrimidine (sketcher)	0:05	76
[1]	SSS with n1cnccc1 (Smarts matching/fingerprint pre-search)	0:04	76
[1]	SSS with pyrimidine (sketcher/fingerprint pre-search)	0:03	76
[1]	Fingerprint generation	0:13
[1]	RGroupTable generation with Pyrimidine as core (sketcher)	0:06 (batch) 0:05 (formula)
[1]	Jarvis Patrick clustering (generating 737 clusters)	0:19
[1]	Sammon Map coordinate calculation	0:28
[1]	Descriptor calculation (XLogP)	0:08
[2]	SSS with Cc1cncnc1 (Smarts matching)	5:32	349
[2]	SSS with 5-MePyrimidine (sketcher)	1:55	486 (includes cc1cncnc1 as well as Cc1cncnc1)
[2]	SSS with Cc1cncnc1 (Smarts matching/fingerprint pre-search)	0:21	349
[2]	SSS with 5-MePyrimidine (sketcher/fingerprint pre-search)	0:13	486
[2]	Fingerprint generation	5:01
[2]	RGroupTable generation (on Pyrimidine subset with Pyrimidine as core; sketcher)	0:28
[2]	Descriptor calculation (XLogP)	4:37 (batch) 4:22 (formula)

Times refer to a 2.13 GHz laptop with 4 GB of memory running Vista/Microsoft Excel 2007.

Two datasets were used: [1]: a set containing ~1.6 k pesticidal compounds, [2]: a set containing ~27 k anti-malarial compounds.
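
The speed-up from the fingerprint pre-search in Table 1 follows from the usual screening argument: if a query is a substructure of a compound, every fingerprint bit set for the query should also be set for the compound, so compounds that fail this cheap bit test never reach the expensive atom-by-atom match. The sketch below illustrates that idea only. Representing fingerprints as Python integers, and the names `passes_prescreen` and `screened_search`, are assumptions; the source does not describe how LICSS stores or compares its fingerprints.

```python
from typing import Callable, Iterable, List, Tuple


def passes_prescreen(query_fp: int, candidate_fp: int) -> bool:
    """True if every bit set in the query is also set in the candidate."""
    return (query_fp & candidate_fp) == query_fp


def screened_search(
    query_fp: int,
    candidates: Iterable[Tuple[str, int]],
    exact_match: Callable[[str], bool],
) -> List[str]:
    """Run the costly exact match only on candidates that pass the bit screen.

    `exact_match` is a hypothetical stand-in for a full substructure test.
    """
    hits = []
    for smiles, fingerprint in candidates:
        if not passes_prescreen(query_fp, fingerprint):
            continue  # cannot contain the query, skip the expensive step
        if exact_match(smiles):
            hits.append(smiles)
    return hits
```

The screen can produce false positives (which the exact match then rejects) but, for a well-behaved fingerprint, no false negatives, so the hit list is unchanged while most of the exact matches are avoided.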

Technical Implementation

The main enabling program is contained in an Excel for Windows workbook (Excel 97-2003 format), EnableChemicalSpreadsheetV2.1.xls. It is written in VBA using the VBA Extensibility library, which allows the program to copy code to, and create code in, the workbook being enabled. Most code is simply copied from EnableChemicalSpreadsheetV2.1.xls, but some event handling routines are created specifically for the workbook being enabled; this makes possible features such as the structure pop-up on mouse hover over chart data points.

The CDK and OPSIN Java libraries are accessed in one of two ways. For batch processes (such as Substructure and Similarity searching) the relevant compounds are first written to file in Smiles (SMI) file format (after an in-sheet fingerprint search if necessary). Then an executable JAR file, CDKSSWin.jar is synchronously executed. This contains a number of routines corresponding to each of the available LICSS programs and taking appropriate input/output file and other control parameters. Each of these routines creates an output file and terminates, whereupon the calling VBA processes the output file appropriately. The synchronous Jar file execution is done without a command line window through Javaw.exe and CDKSSWin.jar starts by creating a pop-up Swing progress window. In this way, the routines appear to run as part of Excel.
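
The write-file, run-synchronously, read-file pattern described above can be sketched in Python as follows. This is an illustration of the pattern, not the LICSS VBA code: the routine name, the argument order passed to the JAR, and the file names `input.smi` and `output.txt` are all hypothetical, since the source does not document the command line of CDKSSWin.jar.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def write_smi(path: Path, smiles: Sequence[str]) -> None:
    """Write one 'SMILES<tab>identifier' record per line (SMI file format)."""
    lines = [f"{smi}\t{index}" for index, smi in enumerate(smiles, start=1)]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def build_command(jar: Path, routine: str, infile: Path, outfile: Path) -> List[str]:
    """Assemble the javaw call; the argument order is an assumption."""
    return ["javaw", "-jar", str(jar), routine, str(infile), str(outfile)]


def run_batch(jar: Path, routine: str, smiles: Sequence[str], workdir: Path) -> List[str]:
    """Write the input, block until the JVM exits, then read the results."""
    infile = workdir / "input.smi"    # hypothetical file name
    outfile = workdir / "output.txt"  # hypothetical file name
    write_smi(infile, smiles)
    # subprocess.run waits for the process to finish (synchronous execution)
    subprocess.run(build_command(jar, routine, infile, outfile), check=True)
    return outfile.read_text(encoding="utf-8").splitlines()
```

Because the caller blocks until the external process has written its output file, no inter-process messaging is needed; the files are the only interface between the two sides.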

CDK classes are widely used within CDKSSWin.jar to provide cheminformatics methods (fingerprint generation, substructure searching etc). Where available, existing open source code was adapted to use the CDK, minimising the need to rewrite algorithms (eg for Jarvis Patrick clustering and Sammon projection; see acknowledgments). Algorithms for R-Group table generation, similarity searching and diverse compound picking were written in-house.

Calls to JChemPaint, to display structure editing or structure display windows, are handled quite differently. Originally (version 1.x), the JChemPaint applet was used inside a WebBrowser control within VBA. However, this approach was not suitable for the rapid display of several structures (eg for displaying all worksheet structures). From version 2.0 onwards, a JVM is run within the Excel process space so calls to Java can be made directly, without per-action initialisation or context-switching overheads. Calls to Java of this type are made possible by creating C++ proxies for each Java method (contained within a single CDecl dll file, CDKInterfaceDll.dll) using JNI via the open-source Jace project technology [10]. The C++ proxy functions may then be declared and called directly from VBA.

In practice, after one-off Java initialisation, this approach enables extremely rapid access to Java routines directly from VBA in Excel. Thus, for example, a user can render a screen's worth of structures from Smiles in < 1 second. The same method has been used for all the new Excel formulas. For example, on a 2.13 GHz laptop with 4 GB of memory running Vista, a formula entry such as '=GetCDKDescriptor(C2,"XLogP",1)' will calculate the XLogP descriptor for > 100 compounds per second when copied down a column of Smiles strings (see also Table 1).
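
The throughput figure can be checked against the XLogP rows of Table 1 with a few lines of arithmetic. The sketch below assumes the approximate dataset sizes given under the table (~1.6 k and ~27 k compounds), so the results are indicative only; the helper name is hypothetical.

```python
def compounds_per_second(compound_count: int, timing: str) -> float:
    """Convert a Table 1 style 'm:s' timing into compounds per second."""
    minutes, seconds = timing.split(":")
    elapsed_seconds = int(minutes) * 60 + int(seconds)
    return compound_count / elapsed_seconds


# XLogP by formula on the ~27 k set (4:22) and on the ~1.6 k set (0:08)
print(round(compounds_per_second(27_000, "4:22")))  # 103
print(round(compounds_per_second(1_600, "0:08")))   # 200
```

Both values are consistent with the stated rate of more than 100 compounds per second.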

Molecular descriptors play a fundamental role in chemistry, pharmaceutical sciences, environmental protection policy, and health research, as well as in quality control. They are the means by which molecules, thought of as real bodies, are transformed into numbers, allowing some mathematical treatment of the chemical information contained in the molecule. Todeschini and Consonni defined the molecular descriptor as follows:

'The molecular descriptor is the final result of a logic and mathematical procedure which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment.' [1]

By this definition, the molecular descriptors are divided into two main categories: experimental measurements, such as log P, molar refractivity, dipole moment, polarizability, and, in general, additive physico-chemical properties, and theoretical molecular descriptors, which are derived from a symbolic representation of the molecule and can be further classified according to the different types of molecular representation.

The main classes of theoretical molecular descriptors are: 1) 0D-descriptors (i.e. constitutional descriptors, count descriptors); 2) 1D-descriptors (i.e. list of structural fragments, fingerprints); 3) 2D-descriptors (i.e. graph invariants); 4) 3D-descriptors (such as, for example, 3D-MoRSE descriptors, WHIM descriptors, GETAWAY descriptors, quantum-chemical descriptors, size, steric, surface and volume descriptors); 5) 4D-descriptors (such as those derived from GRID or CoMFA methods, Volsurf).
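
As a concrete illustration of the simplest class, the sketch below computes two 0D-descriptors, atom counts and molecular weight, from nothing more than a molecular formula. It is a minimal example under stated assumptions: only flat formulas such as C2H6O are handled (no brackets, charges or isotopes), the atomic weight table is deliberately tiny, and the function names are hypothetical.

```python
import re
from typing import Dict

# Rounded standard atomic weights for a few common elements only.
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

# An element symbol (capital plus optional lowercase) and an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def atom_counts(formula: str) -> Dict[str, int]:
    """Count descriptor: number of atoms of each element in a flat formula."""
    counts: Dict[str, int] = {}
    for symbol, number in _TOKEN.findall(formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def molecular_weight(formula: str) -> float:
    """Constitutional descriptor: sum of atomic weights (KeyError if unknown)."""
    return sum(ATOMIC_WEIGHTS[symbol] * n for symbol, n in atom_counts(formula).items())
```

For example, `atom_counts("C2H6O")` gives two carbons, six hydrogens and one oxygen, and `molecular_weight("C2H6O")` is about 46.07. Neither value depends on how the atoms are connected, which is what makes these descriptors zero-dimensional.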

Invariance properties of molecular descriptors

The invariance properties of molecular descriptors can be defined as the ability of the algorithm for their calculation to give a descriptor value that is independent of the particular characteristics of the molecular representation, such as atom numbering or labeling, spatial reference frame, molecular conformations, etc. Invariance to molecular numbering or labeling is assumed as a minimal basic requirement for any descriptor.

Two other important invariance properties, translational invariance and rotational invariance, are the invariance of a descriptor value to any translation or rotation of the molecules in the chosen reference frame. These last invariance properties are required for the 3D-descriptors.
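
These invariance properties can be checked numerically. The sketch below uses the unweighted radius of gyration of the atomic coordinates as an example 3D quantity; it was chosen here purely for illustration and is not taken from the source. Its value is unchanged by translation, rotation and atom renumbering, whereas the x coordinate of the first atom, a deliberately poor "descriptor", is not.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def radius_of_gyration(coords: Sequence[Point]) -> float:
    """Root-mean-square distance of the atoms from their centroid."""
    n = len(coords)
    cx = sum(x for x, _, _ in coords) / n
    cy = sum(y for _, y, _ in coords) / n
    cz = sum(z for _, _, z in coords) / n
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords) / n
    return math.sqrt(mean_sq)


def first_atom_x(coords: Sequence[Point]) -> float:
    """Counter-example: depends on both the reference frame and atom order."""
    return coords[0][0]


def translate(coords: Sequence[Point], shift: Point) -> List[Point]:
    """Shift every atom by the same vector."""
    return [(x + shift[0], y + shift[1], z + shift[2]) for x, y, z in coords]


def rotate_about_z(coords: Sequence[Point], angle: float) -> List[Point]:
    """Rotate every atom about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]
```

Comparing `radius_of_gyration(coords)` with the value obtained after `translate`, `rotate_about_z` or `list(reversed(coords))` gives the same number up to floating-point rounding, while `first_atom_x` changes in each case.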

Degeneracy of molecular descriptors

This property refers to the ability of a descriptor to avoid equal values for different molecules. In this sense, descriptors can show no degeneracy at all, low, intermediate, or high degeneracy. For example, the number of atoms in a molecule and the molecular weight are high-degeneracy descriptors, while 3D-descriptors usually show low or no degeneracy.
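
Degeneracy is easy to measure on a concrete set of molecules: group the molecules by descriptor value and see how many share a value with at least one other molecule. The sketch below does this for the atom count of four small molecules. The function names and the "fraction of molecules in a shared group" measure are illustrative assumptions, not a standard definition from the source.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def degenerate_groups(values: Dict[str, Hashable]) -> List[List[str]]:
    """Return groups of molecule names that share the same descriptor value."""
    by_value = defaultdict(list)
    for name, value in values.items():
        by_value[value].append(name)
    return [sorted(names) for names in by_value.values() if len(names) > 1]


def degeneracy_fraction(values: Dict[str, Hashable]) -> float:
    """Fraction of molecules whose value is shared with another molecule."""
    if not values:
        return 0.0
    shared = sum(len(group) for group in degenerate_groups(values))
    return shared / len(values)


# Total atom counts: ethanol and dimethyl ether are both C2H6O (9 atoms),
# propane is C3H8 (11 atoms), benzene is C6H6 (12 atoms).
atom_count = {"ethanol": 9, "dimethyl ether": 9, "propane": 11, "benzene": 12}
print(degenerate_groups(atom_count))    # [['dimethyl ether', 'ethanol']]
print(degeneracy_fraction(atom_count))  # 0.5
```

Any pair of isomers collapses to the same value for formula-based descriptors such as atom count or molecular weight, which is why these descriptors are described as highly degenerate.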

Basic requirements for optimal descriptors

Should have structural interpretation
Should have good correlation with at least one property
Should preferably discriminate among isomers
Should be possible to apply to local structure
Should be possible to generalize to 'higher' descriptors
Should be simple
Should not be based on experimental properties
Should not be trivially related to other descriptors
Should be possible to construct efficiently
Should use familiar structural concepts
Should change gradually with gradual change in structures
Should have the correct size dependence, if related to the molecule size

References