A Systematic Property Based Approach for Molecular Synthesis Using Higher
Order Molecular Groups and Molecular Descriptors
by
Nishanth Gopalakrishnan Chemmangattuvalappil
A dissertation submitted to the Graduate Faculty of
Auburn University
in partial fulfillment of the
requirements for the Degree of
Doctor of Philosophy
Auburn, Alabama
December 13, 2010
Key Words: Molecular Design, Topological Indices, Molecular Signatures
Copyright 2010 by Nishanth Gopalakrishnan Chemmangattuvalappil
Approved by
Mario R. Eden, Associate Professor of Chemical Engineering, Auburn University, Chair
Christopher B. Roberts, Professor of Chemical Engineering, Auburn University
Ram B. Gupta, Professor of Chemical Engineering, Auburn University
Jin Wang, Assistant Professor of Chemical Engineering, Auburn University
Mahmoud M. El-Halwagi, Professor of Chemical Engineering, Texas A&M University
ii
Abstract
In this work, algorithms have been developed for the design of molecules
corresponding to the optimum performance of a process. The concept of property
clustering has been extended into molecular design based on second and third order
group contribution methods. An algebraic approach has been developed utilizing higher
order molecular groups built from first order groups. The significant aspect of the
aforementioned method is that both the application range and reliability of the molecular
property clustering technique are considerably increased by incorporating second and
third order estimation. A methodology has been developed for incorporating the property
contribution predicted using combined group contribution and connectivity indices into
the design framework in case the property contributions of any of the molecular groups of
interest are not available in literature. For the design of simple mono-functional
molecules, a modified visual approach has been used whereas for the design of more
complicated structures and/or for treating more than three properties at a time, an
algebraic method has been developed.
Until now, most reverse property prediction algorithms are based on group
contribution methods. However, a variety of properties can be predicted using
Quantitative Structure Activity/Property Relationships (QSAR/QSPR) models. QSAR
models make use of topological indices to predict physical properties and biological
activities. In this dissertation, a new algorithm has been developed to include topological
iii
index based property models into the reverse problem formulation framework. This
algorithm makes use of the concept of molecular signature descriptors to incorporate a
variety of different topological indices on a common platform. A large number of
environmental, safety and health related constraints can be now investigated as a part of
the integrated process and molecular design. An algorithm for the enumeration of the
molecular structures has been developed with very low degeneracy. In the last part, a
general framework has been proposed to simultaneously integrate process and product
design problems with flowsheet design. This methodology will identify the best candidate
molecules that provide the optimum process performance with minimum energy
utilization. The dissertation concludes with a list of potential areas where more study can
be conducted based on the developed algorithms.
iv
Acknowledgments
I would like to express my profound gratitude to Dr. Mario R. Eden for his
constant support, encouragement and assistance. Special recognition is given for his
guidance and direction. He has always been a great source of information and
inspiration. I would like to thank my research committee members, Dr. Mahmoud El-
Halwagi at Texas A&M University, Dr. Ram Gupta, Dr. Christopher Roberts and Dr. Jin
Wang for their valuable comments and suggestions. My sincere thanks to my
collaborators Charles Solvason, Dr. Fadwa Eljack and Susilpa Bommareddy for their
brilliant ideas and feedback. Thanks are also due to my friends and co-workers, Dr.
Norman Sammons, Dr. Jeffrey Seay, Wei Yuan, Subin Hada and Gregory Vaughan at
Auburn University. Special gratitude and appreciation is given to my parents,
Gopalakrishnan C.C. and Vijayakumari T.V., my wife Suchithra, and my sister Dhanya
for all their support and encouragement. Finally, I would like to thank the faculty and
staff in the Department of Chemical Engineering at Auburn University for making my
graduate research experience at Auburn a memorable and rewarding one.
v
Table of Contents
Abstract............................................................................................................................... ii
Acknowledgments.............................................................................................................. iv
List of Figures.................................................................................................................... ix
List of Tables ..................................................................................................................... xi
Nomenclature................................................................................................................... xiv
1. Introduction......................................................................................................................1
2. Theoretical Background...................................................................................................9
2.1. Chemical Product Design .......................................................................................9
2.2. Mathematical Formulation of Chemical Product Design .. ..................................14
2.3. Types of Properties and Estimation Techniques...................................................17
2.4. Mixture Design .....................................................................................................19
2.4.1. Design of Experiments.................................................................................19
2.4.2. Mixture Design of Experiments...................................................................21
2.5. Group Contribution Methods................................................................................23
2.5.1. Initial Efforts................................................................................................23
2.5.2. Group Contribution Models with Higher Levels.........................................25
2.6. Topological Indices and Property Prediction........................................................29
2.6.1. Connectivity Indices ....................................................................................32
vi
2.6.2. Edge Adjacency Index .................................................................................37
2.6.3. Shape Indices ...............................................................................................39
2.6.4. Wiener Indices .............................................................................................40
2.6.5. The Hoyosa Topolological Index.................................................................40
2.7. Connectivity Indices and GC+ Method .................................................................41
2.8. Molecular Signature Descriptors ..........................................................................43
2.8.1. Current Status In Inverse Design .................................................................43
2.8.2. Development of Molecular Signature..........................................................49
2.8.3. Application of Signature Descriptors in Property Prediction ......................55
2.9. Flowsheet Property Model....................................................................................56
2.10. Summary.............................................................................................................59
3. Basics of Computer Aided Molecular Design ...............................................................62
3.1. Computer Aided Molecular Design Framework...................................................62
3.2. Computer Aided Molecular Design Techniques...................................................66
3.3. Property Models....................................................................................................67
3.4. Reverse Problem Formulation ..............................................................................69
3.5. Reverse Problem Formulation Methodology........................................................70
3.6. Summary...............................................................................................................71
4. Integrated Process and Molecular Design......................................................................73
4.1. Property Clustering Techniques............................................................................73
4.1.1. Property Operator and Cluster Formulation ................................................73
4.1.2. Conservation Rules ......................................................................................76
4.1.3. Visualization Techniques.............................................................................78
vii
4.1.4. Identification of Feasibility Region .............................................................80
4.2. Molecular Property Operators and Clusters..........................................................82
4.3. Visual Solution of Molecular Design Problem.....................................................85
4.4. Limitations of the Visual Approach in Molecular Design....................................90
4.5. An Algebraic Approach for Molecular Synthesis with Higher Order Groups .....92
4.5.1. General Problem Statement .........................................................................95
4.5.2. Algebraic Approach for Solving the Molecular Design Problem................95
4.5.3. Algebraic Molecular Design Algorithm ....................................................102
4.6. Introduction of GC+ Models into the Cluster Space ..........................................104
4.6.1. GC+ Algorithm for Visual Solution of a Molecular Design problem........106
4.6.2. GC+ Algorithm for Algebraic Solution of a Molecular Design Problem ..109
4.7. Molecular Signatures in Reverse Problem Formulations ...................................110
4.7.1. First Order Connectivity Index ..................................................................111
4.7.2. Kier-Hall Shape Index of Order 1..............................................................112
4.7.3. Reverse Problem Formulation using Molecular Signatures ......................114
4.7.4. Signature Based Algorithm for Molecular Design ....................................115
4.7.5. Expression of Group Contribution Models with Signatures......................126
4.7.6. Property Models with Different Signature Heights ...................................129
4.7.7. Enumeration of Molecular Structures from Signatures .............................131
4.7.8. Stepwise Procedure for Solving a Molecular Design Problem..................136
4.8. General Framework for Integrated Flowsheet and Molecular Design................137
4.9. Summary.............................................................................................................139
5. Case Studies.................................................................................................................141
viii
5.1. Design of Blanket Wash Solvent ........................................................................141
5.2. Metal Degreasing Solvent Design ......................................................................151
5.2.1. Visual Solution...........................................................................................154
5.2.2. Algebraic Solution .....................................................................................158
5.3. Design of Alkyl Substituent for the Fungicide DD ............................................162
5.3.1. Problem Statement.....................................................................................162
5.3.2. Solution of Design Problem with Two Types of Property Models ...........163
5.3.3. Solution of Design Problem with Different Topological Indices ..............173
5.4. Acid Gas Removal ..............................................................................................175
5.4.1. Problem Statement.....................................................................................175
5.4.2. Process Design...........................................................................................177
5.4.3. Molecular Design.......................................................................................178
5.4.4. Proof of Concept for Integrated Flowsheet and Molecular Design ...........187
6. Conclusions and Future Work .....................................................................................191
6.1. Major Achievements...........................................................................................191
6.2. Future Work........................................................................................................195
6.2.1. Statistical Tools for Product Design ..........................................................195
6.2.2. Integrated Process and Product design ......................................................196
6.2.3. Inclusion of More Sophisticated Descriptors.............................................198
6.2.4. Exploration of Biochemical Reaction Pathways........................................199
6.2.5. Design of Reactions ...................................................................................201
References........................................................................................................................203
Appendix: Group Contribution and Connectivity Indices Data ......................................218
ix
List of Figures
Figure 2.1 General Chemical Product Design ...................................................................12
Figure 2.2 Product Design Steps .......................................................................................13
Figure 2.3 Property Estimation Models.............................................................................19
Figure 2.4 Response Surface of the Second Order Model.................................................21
Figure 2.5 Mixture Design Plots........................................................................................22
Figure 2.6 Multilevel Approach for Property Estimation using GC Method ....................26
Figure 2.7 Example of Hydrogen Suppressed Graphs.......................................................31
Figure 2.8 Molecular Skeleton of 3,3 Dimethyl Pentane...................................................34
Figure 2.9 Atomic Signatures up to Height 3 ...................................................................52
Figure 2.10 Molecular Signature Tree...............................................................................54
Figure 3.1 Iterative Molecular Design...............................................................................62
Figure 3.2 Multilevel Approach for Product Design .........................................................65
Figure 3.3 Simultaneous Consideration of Process and Product Design...........................70
Figure 3.4 Reverse Problem Formulation..........................................................................71
Figure 4.1 Visualization of Intra-stream Conservation of Clusters ...................................77
Figure 4.2 Visualization of Inter-stream Conservation of Clusters ...................................77
Figure 4.3 Mixing of Streams ............................................................................................79
Figure 4.4 Feasibility Region on a Ternary Diagram ........................................................82
Figure 4.5 Mixing of Molecular Groups............................................................................88
x
Figure 4.6 Simultaneous Process and Product Design Framework ...................................90
Figure 4.7 Second Order Group Formation .......................................................................94
Figure 4.8 Mixing of CI Group with GC Group..............................................................108
Figure 4.9 Coloring of Atomic Signature ........................................................................120
Figure 4.10 Illustration of Connectivity Principles..........................................................123
Figure 4.11 Integrated Process-Product-Flowsheet Design Framework .........................138
Figure 4.12 Flowchart of the Integrated Process-Product-Flowsheet Design ................139
Figure 5.1 Cluster Diagram for Degreaser Design ..........................................................157
Figure 5.2 Visual Solution of Molecular Design Problem ..............................................158
Figure 5.3 Acid Gas Removal Flowsheet ........................................................................176
Figure 5.4 Best Five Solutions to Acid Gas Removal Problem.......................................186
xi
List of Tables
Table 2.1 Types of Products ..............................................................................................10
Table 2.2 Group Contribution Models ..............................................................................27
Table 2.3 Adjustable Parameters in Group Contribution Models .....................................28
Table 2.4 Values of KC-X Parameters ................................................................................39
Table 4.1 Visual Molecular Design Algorithm .................................................................85
Table 4.2 Algebraic Approach Algorithm ......................................................................103
Table 4.3 GC+ Model Algorithm ....................................................................................109
Table 4.4 CI Calculation using Signatures ......................................................................111
Table 4.5 Signature Equivalent of Topological Indices ..................................................113
Table 4.6 Consistency of Signatures ..............................................................................124
Table 5.1 Property Constraints for Blanket Wash Solvent .............................................143
Table 5.2 Property Operators and Reference Values ......................................................143
Table 5.3 Adjustable Parameters ....................................................................................143
Table 5.4 Normalized Molecular Property Operator Values ..........................................144
Table 5.5 Property Data of Selected Molecular Fragments.............................................144
Table 5.6 Second Order Groups and Their Contributions ..............................................147
Table 5.7 Valid Formulations and Their Properties ........................................................147
Table 5.8 Groups for Ring Compounds ..........................................................................148
Table 5.9 Second Order Groups and Their Contributions for Cyclic Structures.............149
xii
Table 5.10 Valid Cyclic Compounds and Their Properties ............................................150
Table 5.11 Property Constraints for the Degreaser Problem ..........................................152
Table 5.12 Property Operators and Reference Values for Degreaser Design..................154
Table 5.13 Normalized Molecular Property Operator Values ........................................154
Table 5.14 Atom and Bond Indices ...............................................................................155
Table 5.15 First Order Connectivity Indices ...................................................................156
Table 5.16 CI Property Contributions of CI Groups .......................................................156
Table 5.17 Possible Higher Order Groups and Their Property Contributions ................159
Table 5.18 Final Solution Set for Degreaser Design .......................................................160
Table 5.19 Upper and Lower Bounds for Fungicide Properties .....................................163
Table 5.20 Signatures of Height Two for Alkanes ..........................................................166
Table 5.21 Solution in Terms of Signatures ....................................................................171
Table 5.22 Possible Alkyl Substituents ...........................................................................172
Table 5.23 New Solution for Alkyl Substituent ..............................................................175
Table 5.24 Property and Flowrate Data for Acid Gas Removal Problem ......................177
Table 5.25 Property Targets for Molecular design .........................................................178
Table 5.26 Property Operators and Targets ....................................................................181
Table 5.27 New Property Targets for Acid Gas Absorbent Design ...............................187
Table 5.28 Driving Force Data .......................................................................................189
Table 5.29 Energy Index Values .....................................................................................189
Table A.1 First Order Group Contribution Property Data...............................................219
Table A.2 Second Order Group Contribution Property Data ..........................................223
Table A.3 Third Order Group Contribution Property Data .............................................226
xiii
Table A.4 First Order GC Data for Acentric Factors and Liquid Molar Volume ...........228
Table A.5 Second Order GC Data for Acentric Factors and Liquid Molar Volume .......230
Table A.6 Regressed Parameters for Different Atom Types in the CI Method...............232
xiv
Nomenclature
AUP Augmented property index
Cj Property cluster for property j
D Number of degrees
deg Degree
FBN Free bond number
FBNg Free bond number associated with group g
GCM Group contribution method
G(x) Molecular subgraph of atom x
h Height of signature
Hfus Heat of fusion
hfus0 Adjustment parameter used in the estimation of heat of fusion
hfus1 Contribution of first order group for the estimation of heat of fusion
hfus2 Contribution of second order group for the estimation of heat of fusion
?Hv Standard heat of vaporization at 298 K
hv0 Adjustment parameter used in the estimation of heat of vaporization
hv1 First order group contribution for estimation of heat of vaporization
hv2 Second order group contribution for estimation of heat of vaporization
xv
M Number of edges
Ng Total number of first order molecular groups
Ns Total number of second order molecular groups
Nr Number of rings in molecular structure
nf Number of functional groups in the main chain
ng Molecular group
ngr Groups forming the ring
no Number of carbon atoms in the main chain
Pjg Contribution to property j from group g
R Number of rings
Tb Boiling point
tbo Adjustment parameter used in the estimation of boiling point
tb1 Contribution of first order group for estimation of boiling point
tb2 Contribution of second order group for estimation of boiling point
TI Topological index
V Vertex
Tm Melting point
tmo Adjustment parameter used in the estimation of melting point
tm1 Contribution of first order group for estimation of melting point
tm2 Contribution of second order group for estimation of melting point
xs Fractional contribution
Greek symbols
?j (Pj) Molecular property operator of the jth property
xvi
)( jirefj P? Reference operator for jth property
?j Normalized property operator for property j
? Number of occurrences of one first order group in a second order group
? Molecular signature
? Number of each signature
? Property function
? Shape index
? Connectivity index
1
1. Introduction
The selection of products/product mixtures that give the optimum performance of
a process is a critical issue for a design engineer. The process performance is usually
understood in terms of physical properties and on many occasions, the physical properties
of the product rather than their chemical structure determine the suitability of a specific
product as the input to the process. For example, in the design of a blanket wash solvent,
the primary focus of designer are the solubility parameter, flammability, vapor pressure
etc. of the solvent. Molecular design algorithms generally require target properties to
design the molecules. At the same time, to identify the target properties that give the
optimum process performance, the process parameters are to be considered as well.
Therefore, for obtaining the optimal solution for this type of problem, it is necessary to
have a methodology to represent the product performance in terms of measurable
physical properties and identify the molecule/mixture that gives the property targets
corresponding to the optimum process performance.
In spite of the relationship between the process design and product design
problems, they have been traditionally considered as two separate problems because the
product design part is generally considered as outside the scope of chemical engineering
design. Therefore, the product identified by chemists (without considering the process
design aspects) may not provide the best solution corresponding to the optimum process
performance and this makes the process of identifying the suitable molecule/mixture an
iterative process. Therefore, on most occasions, the chemists try to design the product
2
based more on expert knowledge, trial and error, heuristics and experimentation based on
intuition rather than any specific scientific reasoning. The attempts to develop a
procedure to pass the information between the process and product designer brought
about the development of reverse problem formulations (Eden et al., 2004).
The conventional way to treat a combined process-product design problem was
through mathematical modeling. This approach considers the molecular design as an
optimization problem. The objective in such an optimization problem is to minimize the
error between the sought values and the values attained by the current design. The
advantage with this approach is in most of the cases, it is possible to represent the
problem in terms of known mathematical expressions. The system of equations formed
consists of balance equations, constraint expressions and constitutive equations (Russel et
al., 2000). The non-linear nature of the constitutive equations, which is often the case
with most structure-property relationships, makes it difficult to obtain convergence
during the computational stages. In such cases, by following decomposition techniques,
the values of a subset of intensive variables are determined that match the required
property targets. This can be considered as the reverse of a property estimation problem.
The target properties are estimated by solving a reverse simulation problem, where for
the given values of design and input variables, the desired range of property values can be
obtained (Gani & Pistikopoulos, 2002).
The reverse problem formulation decouples the complicated property models
from the system of equations and the conventional forward problem can be divided into
two reverse problems. The first reverse problem solves the balance and constraint
equations in terms of properties to provide the design targets. The second reverse
3
problem then solves the constitutive equations to identify the operating
conditions/products to match the property targets set from the first reverse problem.
Decoupling the constitutive expressions from the system of equations makes the system
linear and achieving convergence is easier. In this way, the reverse problem formulation
(RPF) lowers the complexity of the problem without compromising the accuracy (Eden et
al., 2004).
The RPF provides a property-based platform to link process and product design
problems since the process performance can be represented in terms of the properties, and
the properties form the input to an molecular design problem. However, such an
algorithm can be followed only if there is a way to track the properties. The concept of
property clustering provides the necessary tools to track properties. Property clustering
techniques have recently been extended into molecular design to develop molecular
clusters that allow the molecular groups to be combined to match a set of target
properties (Eljack et al., 2007). The property models available in group contribution
methods formed the link connecting the molecular structure and properties in the above-
mentioned algorithm.
The purpose of this work is to expand the range of applications of the integrated
process and product synthesis approach. The product design part of this approach, even
though it conceptually bridged the gap between process and product design is limited in
terms of its practical applications in its present form. Three major limitations of the
present method are being addressed in this dissertation. The first one is the limitations of
the property models that can be applied in that algorithm. The algorithm by Eljack et al.
(2007) is based on first order group contribution methods. This method provides a good
4
starting tool in several product design problems. Nevertheless, its applicability is limited
to the design of simple molecular structures because the first order group contribution
method has limited accuracy especially when dealing with polyfunctional molecules and
cyclic molecules. In addition, first order groups cannot capture proximity effects or
differentiate between isomers. A variety of newer group contribution methods tried to
address this issue by considering the effects of the combinations of certain molecular
groups in the chemical structure (Constantinou & Gani, 1994; Marrero & Gani, 2001;
Conte et al., 2008). These combinations of molecular groups are known as higher order
groups. In order to increase the range and applicability of the design, higher levels of
group contribution methods also have to be represented in the cluster domain and
considered in generating molecular structures. However, unlike first order groups, the
second and third order groups are not linearly represented in a molecule. In addition,
since the higher order molecular groups are based on the effects due to the combination
of the constituent molecular building blocks, their effects cannot be predicted before
knowing the complete molecular structure. Therefore, a new algorithm must be
developed for their systematic inclusion into the cluster domain.
In the visual approach developed by Eljack et al. (2007), the number of property
targets has to be three for the simultaneous consideration of process and product
requirements. In their approach, different molecular groups are mixed/combined
according to a set of rules developed to obtain different candidate solutions. However, the
number of properties of interest may be more than three on many occasions. Similarly,
the generation of a complete potential candidate set is very significant for product design
problems because, as mentioned in section three (figure 3.2), there are further stages in a
5
product design algorithm after the elimination based on the group contribution based
techniques. Therefore, it is very important to make sure that none of the potential
candidates are overlooked in the initial stages of product design because the possibility
that a molecule not considered could be the optimum candidate based on parameters
other than group contribution method based tools cannot be ignored. Therefore, there
should be a generalized procedure to generate all potential solutions of a problem to
match the given set of property targets.
The motivation to produce more sustainable and environmental friendly
chemicals to meet the consumer needs has increased considerably over the last decade
(Kokossis & Yang, 2009). Therefore, it is important to have a systematic methodology to
design chemicals that possess both the consumer specified attributes and environmentally
acceptable characteristics. Most biological and environmental properties are structure
dependent and group contribution techniques are not available or reliable for the
determination of these properties. However, a lot of work has already been done to
categorize atoms or molecules systematically based on their structure and to relate these
assignments to their biological activities and properties. These relationships are termed as
Quantitative Structure-Activity Relationships (QSAR) and Quantitative Structure-
Property Relationships (QSPR) (Kier & Hall, 1986). QSAR and QSPR can provide viable
tools for the determination of many properties from molecular structure information.
However, most QSAR and QSPR techniques are very property specific and thousands of
molecular structural descriptors are now available in literature corresponding to different
properties (Randic & Basak, 2001). In spite of this, very few attempts have been made to
make use of the available QSAR/QSPR relationships to solve inverse design problems.
6
This is because, compared to the group contribution models, the QSAR/QSPR
relationships have very complicated formulations due to the highly non-linear nature of
most of the topological indices used in developing such relationships. In addition, unlike
molecular groups, there is no one to one mapping possible from the solution of an
optimization problem to the final molecular structures (Visco et al., 2002; Faulon et al.,
2003a). The recently introduced concept of molecular signature description (Visco et al.,
2002; Faulon et al., 2003b) is a potential tool for the reverse problem formulation
techniques explained in chapter 3 of this dissertation. The motivation behind developing
algorithms to introduce signature description into the reverse problem formulation
techniques is that, it is an already proven fact that many molecular descriptors can be
written in terms of their signatures and many such relationships are linear in nature.
Therefore, one single algorithm will have the potential to handle different molecular
descriptors (Visco et al., 2002; Faulon et al., 2003b). In addition, the enumeration of
molecular structures from the solution of an inverse problem is challenging. Even though,
a few stochastic techniques are available to solve this problem (Venkatasubramanian et
al., 1994; Sheridan & Kearsley, 1995; Venkatasubramanian et al., 1995), there have been
very few attempts to solve this problem using a deterministic approach. Therefore, an
algorithm that can solve this type of problems using a deterministic approach would
widen the applicability of reverse problem formulations to a different domain of
problems.
The dissertation has been distributed in six chapters. Chapter 2 covers most of the
background information including details of the nature of process design problems, some
current product design techniques, the basics of group contribution methods, topological
7
indices and their applications in QSAR/QSPR expressions and a brief description of
molecular signature descriptors and their application to property prediction. The chapter 3
covers the current state of the art in the field of computer aided molecular design, the role
of property models, the concept of reverse problem formulations and the integrated
process and product design framework. Chapter 4 starts with the basics of property
clustering and first order molecular property clusters. Then, it covers the systematic
development of second and third order molecular property operators. The systematic
procedure for the development of an algebraic algorithm for the application of these
operators in the solution of integrated process and product design is also presented. The
next section describes the procedure followed for the introduction of connectivity index
based models into the clustering framework and the steps followed in the development of
an algorithm for their application in solving integrated process and product design. The
developed visual and algebraic approaches are described. In the next section, the
application of molecular signature descriptors in solving molecular design problems is
described. Finally, the general framework to integrate flowsheet design techniques to
process and molecular design problems is presented. Chapter 5 provides four application
examples for the algorithms developed in chapter 4. The first example is a blanket wash
solvent design problem to illustrate the development of higher order molecular groups
and the algorithm for their introduction into an algebraic solution framework. The second
example illustrates the application of the algorithm developed for the introduction of
combined connectivity index/group contribution models into the reverse problem
algorithm. The third example is a simple molecular design problem that shows the
applications of molecular signature descriptors. The fourth example is an integrated
8
flowsheet and molecular design problem. The last chapter covers the major achievements
and conclusions from this project and highlights some of the future works that can be
done based on the techniques developed in this dissertation. Most of the group
contribution data and connectivity index data are provided in appendix.
9
2. Theoretical Background
2.1. Chemical Product Design
Chemical product design is an emerging branch of chemical engineering. In the
past, the development of new chemical products has always been left to chemists and the
chemical engineering community generally focused on the process design aspects and
ignored all product related issues other than purity. Due to this, the product design has
always been considered separately from the process design with no feedback between
each other. This approach often leads to the generation of sub-optimal solutions. In
addition, most chemical products currently in use have been developed after scientific
experimentation based on knowledge of existing products that has been largely based on
heuristics and expert knowledge. This approach often limits the scope of the output
solutions because of the inability to produce non-intuitive solutions. The innumerable
options available for such a methodology ultimately make this technique researcher
specific and many times based on intuition. Therefore, there is a need for a
comprehensive and systematic methodology for solving product design problems.
Cussler has classified all the chemical products into three broad classes (Cussler
et al., 2010): commodities, molecular products and performance chemicals. Commodities
are bulk chemicals and the focus of chemical engineers while producing these chemicals
are traditionally on designing the processes to produce them economically. The second
class of chemicals are molecules with specific applications like pharmaceutical products,
and the key to market them depends on the speed of the discovery and the ability to
10
introduce them into the market immediately after the discovery. In the third class of
products, the value will be added depends on the specific microstructure. The key to the
marketability will be the function and the benefits that they provide like the flavor a
chemical provides to the ice cream or the shine a certain polishing material can provide to
the shoe. A summary of the types and focus involved in the different classes of products
are given in table 2.1 (Cussler et al., 2010):
Table 2.1: Types of Products
Hill (2009) identified the need for a new mindset along with new chemical
engineering approaches for the solution of product design problems and termed the
emergence of this field as a new paradigm. This is because, the chemical engineering
community generally ignored product related issues with the exception of purity. The
process related issues were the only areas of interest for chemical engineers. The mindset
behind solving the different classes of process design problems follows the same
approach. Here, the focus of design will be on obtaining the process with minimum cost.
The design of a new product however, requires a different understanding of the profit,
which may not be readily converted into a set of mathematical expressions. In order to
successfully solve a product design problem, the designer has to identify both the process
requirements and product specifications and to systematically generate a finite number of
Commodities Molecules Performance
Key Cost Speed Function
Basis Unit operations Chemistry Microstructure
Risk Feedstock Discovery Science
11
potential candidate solutions to satisfy the problem requirements. The experimentation
can now be limited to the systematically obtained candidates because it is not possible to
conduct experiments with all available options.
A chemical product design problem can be stated like this: Identify a chemical
product (molecule or mixture) that satisfies a set of desired needs. So, a product design
problem can be considered as an inverse property prediction problem where the attributes
are represented in terms of physical properties (Gani & O'Connell, 2001).
The purpose of product engineering is not to substitute the traditional
experimental techniques and/or heuristics followed to design a molecule. Instead, the
product design engineers aim to systematically eliminate the numerous unsuitable options
and reduce the search space to a finite set of options. Therefore, the chemical product
design can be considered as a phase in the overall product development operation that
should precede a well defined experimental program, design and analysis (Hill, 2009).
Cussler and Moggridge (2001) have identified the principal steps involved in a
product design process. Specific to each problem, solution strategies are to be developed
in each step:
1. Identify customer needs
2. Generate ideas to meet the needs
3. Select among ideas
4. Manufacture product
12
This framework is a simplified yet generalized representation of what product
design is all about. Each step must be defined more elaborately for different specific
classes of problems. However, the framework applies to all kinds of problems.
The first step is traditionally considered as a topic handled by marketing experts
and chemical engineer?s tasks usually start from step 2. However, a recent work has
incorporated the first step as a part of an optimization framework as shown in figure 2.1
(Smith & Ierapetritou, 2009). The motivation behind forming such a customer integrated
approach for product design is the realization that, the driving force for a product-
centered industry is the consumer needs (Stephanopoulos, 2003). Smith and Ierapetritou
(2009) developed a product design methodology for which the inputs are the consumer
inputs and economic criteria. The design problem has been formulated as a biobjective
optimization problem that ensures the consumer influence in design trade-off
considerations.
Figure 2.1: General Chemical Product Design (Smith & Ierapetritou, 2009)
In addition, chemical engineers are in a better position to refine the understanding
of customer needs because of their ability to analyze what is physically possible (Hill,
13
2009). Similarly, once the product to be manufactured is finalized, the actual production
of it is more specific to the compound and needs to be treated accordingly. The heart of
the product design problem is steps 2 and 3 and so, the focus of the work in this area has
to be to generate and select among ideas.
Step two has been traditionally performed by chemists according to the
specifications provided by process engineers and are typically based on heuristics and
expert knowledge. The compounds they identify in this step are analyzed by process
design engineers to decide the suitability of the supplied options in step three. If none of
the options are practical, their conclusions will be send back to the chemists for more
options. Therefore, this is an iterative process as described in figure 2.2.
Figure 2.2: Product Design Steps (Gani, 2004a)
14
The computational complexity involved in identifying a suitable molecule with
desirable properties can never be underestimated. Without a systematic approach to track
the specific solutions, the number of potential solutions will be prohibitively high even
with very restrictive criteria. For instance, if the search space is limited to all alkane
molecules up to C22, there will be 38 million different isomers (Davidson, 2002). As the
number of atoms under consideration increases, the number of candidates increases and
results in combinatorial explosion. For example, even if the search restricted to acyclic
molecules with three double bonds, the number of structural isomers for C4H10N2O3S2
will be 16 million (Contreras et al., 1994)
The traditional approach above points out that, since the process and product
design steps are done separately, the product identified may not be corresponding to the
optimum process performance since the product design has been done without the
knowledge of property targets corresponding to the optimum process performance. A
recently developed integrated process and product design approach ( Eljack et al., 2007b;
Eljack & Eden, 2008) provides an alternate way to solve this problem. This approach and
the various tools used will be discussed later in this dissertation.
2.2. Mathematical Formulation of Chemical Product Design
If the focus of the product is on the macroscopic properties, the product design
can be considered as a combination of molecular design and mixture design (Achenie et
al., 2003). All different types of product design problems can be represented using the
following set of generalized mathematical expressions (Gani, 2004a).
15
{ }
(2.7)
(2.6) ),(
(2.5) )(
(2.4) 0),(
(2.3) 0)(
(2.2) 0)(
(2.1) )(max
33
222
111
3
2
1
uCxByl
uyxgl
uxgl
yxh
xh
xh
xfyCF TObj
?+?
??
??
=
=
=
+=
In the above expressions, x is the vector of continuous variables like fraction in a mixture,
flowrates etc., y is the vector represents the presence or absence of a group, compound,
operation, etc., h1 (x) is a set of equality constraints corresponding to process design
specifications, h2(x) is a set of equality constraints corresponding to process model
equations, h3(x, y) is a set of equality constraints related to molecular structure
generation, mixing rules for properties, etc, g1 (x) is a set of inequality constraints related
to process design specifications, g2 (x, y) is a set of inequality expressions corresponding
to specific problems related to the product design and f(x) is the vector of objective
functions. For process design problems, f(x) will typically be a non-linear function and
for an integrated process and product design problem, f(x) typically represent more than
one non-linear expression. B and C represent the matrix containing fixed data in
constraint defined by eq. (2.7).
16
Depending on the specific nature of the problem, some or all of the
equations/constraints listed above may be used. A few different types of product design
problems are as follows (Gani, 2004a):
1) Satisfy only the constraint in eq. (2.6): A product design problem based on a
database search. Here, the objective would be to identify from the dataset, the
compounds matching the property constraints. Here, the molecular structure
generation or the application of property models is not necessary.
2) Satisfy constraints in eqs. (2.4) and (2.6): Here, the molecular structures are
generated on the basis of property model in eq. (2.4) subject to the constraints in
eq. (2.6).
3) Satisfy the objective function and constraints in eqs. (2.4) and (2.6): In this
problem, the optimum molecular structure has been identified according to the
objective function given in eq. (2.1) subjected to the product design constraints in
eq. (2.6). The property model given in eq.(2.4) is used to generate the molecular
structure. However, there is no guarantee that the solution obtained is an optimal
solution because of the non-linear equations that constitute the property models in
eq. (2.4).
4) Satisfy all the constraints: This type of problems identifies the set of products
corresponding to the process requirements. Therefore, this is a simultaneous
process-product design problem. This will generate feasible, but not necessarily
optimal solutions because of the non-linear nature of the property models in both
process and product design specifications.
17
5) Solve all the equations: This is an integrated process-product design problem. The
non-linear nature of the objective function and the process model equations will
make this a complex mixed integer non-linear programming (MINLP) problem.
In all these kinds of approaches, the properties either need to be provided or
predicted through property models (Achenie et al., 2003). Therefore, except for the first
type of product design problems, the application range of any developed product design
methodology is limited to the accuracy of the involved property models. Therefore, the
success of product design methodologies will depend on the included property constraints
and the process/product model.
2.3. Types of Properties and Estimation Techniques
Gani and Constantinou (1996) proposed a three tier classification for different
properties as primary, secondary and functional.
Primary properties Properties that can be estimated from the molecular structure
variables. Examples include normal boiling point, normal melting point, heat of
vaporization at 298 K etc.
Secondary properties Pure component properties which are dependent on other
properties. Examples are solubility parameter, density at a given temperature, etc.
Functional properties Pure component properties which are dependent on temperature
and/or pressure. Examples are density, vapor pressure, enthalpy, etc
Apart from these general classifications, there are a number of high level
performance characteristics, which are difficult to estimate. These kinds of properties
18
involve the taste of food products, aroma of fragrances, various mechanical properties,
etc. Since many of these properties are dynamic and the design objectives can only be
specified in terms of a time-evolution profile, the modeling process will be very
challenging. Different types of hybrid approaches can be applied for solving such
problems such as combining molecular modeling with kinetic phenomena to obtain
prediction accuracy (Ghosh et al., 2000).
Depending on the types of target properties and expected range of accuracy,
different types of property models can be used for product design. Gani and Constantinou
(1996) have proposed the classification of property models shown in figure 2.3.
However, most of the types of models shown in figure 2.3 are suitable only for forward
problems because of the computational complexities involved in the quantum mechanics
calculations. The semi-empirical and empirical models also posses many computational
difficulties in a traditional mathematical programming based approach. However,
different tools have been developed recently to incorporate many of such methods into
inverse design formulations (Venkatasubramanian et al., 1995; Camarda & Maranas,
1999; Sahinidis et al., 2003; Papadopoulos & Linke, 2006; Eljack & Eden, 2008;
Solvason et al., 2009).
19
Classification of Estimation Methods
Reference Approximate
Mechanical
models
Semi-empirical
models
Empirical
models
Quantum mechanics
Molecular Mechanics
Molecular Simulation
Corresponding States Theory
Topology/Geometry
Group/Atom/Bond additivity
Chemometrics
Pattern matching
Factor analysis
QSAR/QSPR
Figure 2.3: Property Estimation Models (Achenie et al., 2003)
2.4. Mixture Design
2.4.1. Design of Experiments
Design of experiments (DOE) is a statistical method to plan and execute
experiments in a systematic manner so that maximum information can be gained from the
experiments. DOE is a potential tool in the field of product design because, as mentioned
previously, the product design in actual practice depends heavily on results obtained
through experiments. To develop a model, the first step is to identify the factors that
affect the variable of interest. In the next step, a model is postulated to represent the
effect of the factors on the response of interest of the variable. The objective is to
optimize the response. In the next step, the experimental points are placed to which the
model can be fitted. In the final step, the model adequacy is tested. There may be many
iterations until the experimenter decides the accuracy is good enough (Cornell, 2002).
20
The accuracy of this method depends on the adequacy of the model equation and the
location of the design points. The polynomial model is the most commonly selected
model to represent a response surface since it can be expanded through a Taylor series to
improve accuracy (Cornell, 2002). Generally, first or second degree models will be
adequate to represent the surface (Montgomery, 2005). Figure 2.4 shows one example of
a response surface plot for a second degree model where the response y corresponding to
the factors x1 and x2 are plotted. The fitted first order and second order equations have the
following general form:
??? ++= ?
=
u
i
iio xy
1
(2.8)
???? +++= ???
? ?=
u
ji
u
ij
jiij
u
i
iio xxxy
1
(2.9)
Here, y is the response, xi and xj are the factors affecting the response. The ?
values are the regression coefficients and ? is the error observed in the response. Note
that, in the second order equation, there is a term corresponding to the interaction effects
of the factors involved.
21
x2
-4
-2
0
2
4
x1
-4. 2 -2. 1 0. 0 2. 1 4. 2
Figure 2.4: Response Surface of the Second Order Model
2.4.2 Mixture Design of Experiments
Mixture design of experiments (MDOE) is an extension of DOE in which the
factors are the chemical constituents. Therefore, the constituent fractions will sum to one
and every constituent fraction must have a value between zero and one. Figure 2.5 shows
one example of a mixture design plot where the density of a mixture made from the
components x1, x2 and x3 is plotted. However, this relationship violates the condition that
the factors are independent and random and thereby imposes a colinearity effect.
Therefore, even though the model can still be used, it will affect the interpretation of the
regression coefficients. However, because of this condition, it is possible to represent the
mixture data on a simplex. Scheffe developed the first simplex-lattice designs, which
many researchers consider to be the foundation of mixture design (Cornell, 2002).
22
According to Scheffe models, the response surface can be represented in terms of only
the pure component and interaction terms. However, due to the colinearity, the regressors
will not provide the true interpretation of the pure component or interaction effects.
Figure 2.5: Mixture Design Plots
In order to address the limitations of Scheffe models, the Cox model (Cox, 1971)
was introduced. Even though the Cox model removes the colinearities introduced by the
relation between the mixture constituents, it leaves the secondary colinearities introduced
by the constraints in the constituent ranges (Solvason et al., 2008). One solution to this
issue may be the use of decomposition techniques like Principal Component Regression
(PCR) and Partial Least Squares (PLS) (Kettanah-Wold, 1992). Applying the property
clustering technique is another way of treating this issue (Solvason et al., 2008).
X1 X2
X3
23
2.5. Group Contribution Methods
2.5.1. Initial Efforts
Computer Aided Molecular Design (CAMD) techniques have become a
significant part of process and product design because of their ability to predict and
design molecules with a given set of properties. In all CAMD algorithms, it is necessary
to have a systematic method to evaluate whether the designed structures satisfy the
property constraints set by the process from well-defined molecular building blocks.
Therefore, almost all CAMD techniques use group contribution methods (GCM) to verify
whether the generated molecules exhibit the specified set of desirable properties (Harper
et al., 1999). In additive group contribution methods ( Benson, 1968; Ambrose, 1978,
1980; Joback & Reid, 1987; Horvath, 1992), a molecule is considered as a collection of
various simple groups. The property function of a molecule has been estimated as a
summation of the property contributions of all the molecular groups present in the
molecular structure.
i
i
iCNXf ?=)( (2.10)
Here, f(X) is a function of the actual property X, Ci is the contribution of the
molecular group i that occurs Ni times
These contributions are estimated through regression of large amounts of
experimental data. Group contribution methods are indispensable tools for property
prediction of molecules from their structures especially when the experimental values for
the properties are not available. They are simple and yet provide reasonably accurate
24
results for many properties. These methods can provide quick estimates of properties
without much computational complexity and errors (Constantinou et al., 1993). In the
case of simple compounds, GCM can provide accurate trends due to the addition of new
functional groups to the existing structure. During molecular synthesis, this will help to
generate molecular structures that meet a specific property in a systematic way from
basic molecular groups (Joback & Stephanopoulos, 1989).
However, as the complexity of the molecule increases, the accuracy of first order
GCM becomes less reliable. They generally cannot capture proximity effects or
differentiate between isomers (Kehiaian, 1983; Wu & Sandler, 1989, 1991). The
simplified representation of molecular structure in any of the above-mentioned methods
ignores many of the concepts in organic chemistry and quantum mechanics like
resonance, conjugation and various interactions among groups (Mavrovouniotis, 1990).
So, several attempts have been made to make the GCM more general and reliable
(Fedors, 1982; Reid et al., 1987; Constantinou et al., 1993).
The ABC method introduced by Constantinou et al. (1993) is of particular
importance. The ABC method is based on the contributions of atoms and bonds in the
properties of different conjugate forms of a molecular structure. Here, the property of a
molecule has been estimated as the linear combination of contributions from all the
conjugate forms of the molecule. However, the generation and enumeration of the
conjugate forms is computationally challenging. Nevertheless, this method provided the
basis for future methods, which did not require such computational effort (Constantinou
& Gani, 1994).
25
2.5.2. Group Contribution Models with Higher Levels
A new GC approach was put forward by Constantinou and Gani (1994) in which
the property estimation is done in two stages. In this approach, two types of molecular
building blocks are defined. The basic level is known as first order groups and the next
higher level is called second order groups. The second order groups have first order
groups as their building blocks. They essentially represent different types of interactions
among the first order groups and the effects of certain molecular group combinations to
the property of the final molecule. The second order groups can provide a better
description of compounds with many functional groups and differentiate among isomers.
However, even the second order groups may not be able to provide a good representation
of poly-ring compounds and open-chain polyfunctional compounds with more than four
carbon atoms in the main chain (Marrero & Gani, 2001). Therefore, a further level of
molecular groups have been identified and their property contributions have been
regressed (Marrero & Gani, 2001) for use in group contribution methods. The formation
of third order groups is analogous to the second order groups, but the focus is on a
different class of molecular groups. The third order groups focus on multi-ring
compounds, fused ring compounds and compounds with many functional groups in the
structure. Similar to the second order groups, third order groups also have first order
groups as their building blocks. The property estimation model developed in this
approach has the following form:
k
k
kj
j
ji
i
i EOzDMwCNXf ??? ++=)( (2.11)
26
Here, f(X) is a function of the actual property X, Ci is the contribution of first
order group i that occurs Ni times, Dj the contribution of second order group j that occurs
Mj times and Ek the contribution of third order group k that occurs Ok times in the
molecule. The constants w and z can have values zero or unity depending on how many
levels of estimation are of interest.
The pictorial representation of the property estimation technique using higher
order group contribution techniques has been shown in figure 2.6 (Conte et al., 2008).
The properties estimated using this technique and the corresponding property functions
are listed in table 2.2. The universal constants for each property function are given in
table 2.3. Four properties, for which property functions and group contributions are
estimated in two different articles (Constantinou et al., 1995; Conte et al., 2008) up to
second order level, are also included in the table 2.2.
Figure 2.6: Multilevel Approach for Property Estimation using GC Method
First level
Groups representing
simple and monofunctional
compounds
Second level
Groups representing conjugates, employed
to distinguish between isomers
Third level
Groups representing large, complex
and multifunctional compounds
27
Table 2.2: Group Contribution Models
Property Property Function Group Contribution Terms
Normal melting point, Tm ( )0exp mm TT km
k
kjm
j
jim
i
i TOTMTN 321 ??? ++
Normal boiling point, Tb ( )0exp bb TT kb
k
kjb
j
jib
i
i TOTMTN 321 ??? ++
Critical temperature, Tc ( )0exp cc TT kc
k
kjc
j
jic
i
i TOTMTN 321 ??? ++
Viscosity, ? ln(?) 21 j
j
ji
i
i MN ?? ?? +
Critical volume, Vc 0cc VV ? kc
k
kjc
j
jic
i
i VOVMVN 321 ??? ++
Standard Gibbs energy, Gf 0ff GG ? kf
k
kjf
j
jif
i
i GOGMGN 321 ??? ++
Critical pressure, Pc ( ) 25.01 ccc PPP ?? ? kc
k
kjc
j
jic
i
i POPMPN 321 ??? ++
Standard enthalpy of
formation, Hf
0ff HH ? kf
k
kjf
j
jif
i
i HOHMHN 321 ??? ++
Standard enthalpy of
vaporization, Hv
0vv HH ? jv
j
jiv
i
i HMHN 21 ?? +
Standard enthalpy of
fusion, Hfus
0fusfus HH ? kfus
k
kjfus
j
jifus
i
i HOHMHN 321 ??? ++
Acentric factor, w ( ) Caw b ?/exp 21 j
j
ji
i
i wMwN ?? +
Liquid molar volume, Vm dVm ? 21 mj
j
jmi
i
i VMVN ?? +
Surface Tension, ? ? 21 j
j
ji
i
i MN ?? ?? +
28
Table 2.3: Adjustable Parameters in Group Contribution Models
Adjustable Parameter Value
Tm0 147.45 K
Tb0 222.543 K
Tc 231.239 K
Pc1 5.9827 bar
Pc2 0.108998 bar-0.5
Vc0 7.95 cm3/mol
Gf0 -34.967 kJ/mol
Hf0 5.549 kJ/mol
Hv0 11.733 kJ/mol
Hfus0 -2.806 kJ/mol
a 0.4085
b 0.505
c 1.1507
d 0.01211
The application of group contribution based CAMD techniques rely on the
availability of molecular groups and the estimated property contributions corresponding
to each group. The properties that can be predicted based on the molecular structure alone
are called primary properties and the properties that can be estimated as a function of
primary properties and molecular structural data are called secondary properties
(Constantinou & Gani, 1994).
29
2.6. Topological Indices and Property Prediction
The chemical structure of a molecule can provide a lot of information about the
properties that it possesses. The information available from a structural formula includes
(1) total number of atoms (2) number of each type of atoms and (3) the bonding between
the atoms. These sets of information enable the structure to be represented in a graphical
form (Kier & Hall, 1986). The representation of a molecule in the form of a graph is the
first step in the development of topological indices. This would allow the conversion of
the structural formula into indices and a potential opportunity to relate the structure to
properties (Kier & Hall, 1986). Once the chemical structure is represented in the form of
a graph, a number of graph theoretical matrices can be formed from the chemical
structure. The most commonly used matrices in the formation of topological indices are
adjacency matrix and distance matrix (Trinajstic, 1992). The vertex adjacency matrix can
be represented as shown in eq. (2.12):
( )
?
?
?
???
?
=
otherwise 0
weighted-k is
) vand (v edge andadjacent are vand v verticesifk
adjacent are vand v verticesif 1
jiji
ji
ijA (2.12)
Here, A(G) is the vertex adjacency matrix of the connected molecular graph G.
This matrix is an N x N symmetrical matrix, where N is the number of vertices.
The topological indices can be calculated from the graph theoretical matrices by
performing different operations over the matrix. The topological index of a molecular
graph is a single number that can be used to characterize that graph. Therefore, this
30
number must have the same value regardless of the way in which the graph is labeled
(Trinajstic, 1992). A topological index thus is a convenient way for representing the
chemical constitution in the form of a number. The challenge in developing topological
indices is that the descriptor should be able to form Quantitative Structure Property
Relationships (QSPR) or Quantitative Structure Activity Relationships (QSAR). Randic
and Basak (2001) suggested that, in order for a topological index (TI) to be of practical
significance, it should possess a set of desirable attributes. The important qualities that
make a meaningful TI include direct structural interpretation, correlation with at least one
property, linearly independent, non-triviality, a basis on structural concepts, etc.
The topological indices are defined based on the topology of a molecule. These
are developed based on the principles in chemical graph theory. Graph theory is a branch
of mathematics that deals with objects that are connected (Wilson, 1986). The objects in
the graph are called vertices, the lines used to connect the objects are called edges, and
the diagram thus obtained is called a graph. The relationships developed pertaining to
graphs have been extended to different disciplines. The principles in graph theory applied
to analyze the consequences of connectivity in a chemical graph are termed as chemical
graph theory (Trinajstic, 1992). Here, the sites may be atoms, molecules, molecular
groups, etc. and the connection between those sites may be bonds, interactions etc. The
branch of chemical graphs that represent the constitution of molecules are called
molecular graphs. More details on applications of molecular graphs in molecular
synthesis are given in section 2.8.
The molecular graphs are generally represented as hydrogen-suppressed graphs
where only the molecular skeletons without hydrogen atoms (except for heteroatoms) are
31
used. Double and triple bonds are also not shown in the hydrogen-suppressed graph. The
presence of hydrogen atoms and multiple bonds are handled in the general formulation of
molecular indices. The difference between normal molecular structures and hydrogen-
suppressed graphs is shown in figure 2.7. The first figure is the molecular structure of
acetone and its hydrogen suppressed graph representation is shown in the second figure.
In the hydrogen-suppressed graph, the numbers 1, 2 and 4 represent the carbon
atoms without hydrogen atoms on it and 3 represents the oxygen atom. The edges a, b
and c represent the bonds connecting these atoms. Even though the double bond and
single bonds are represented identically, the definition of bond indices is defined in such
a way to take care of that difference.
CH3 C
O
CH3
Figure 2.7: Example of Hydrogen Suppressed Graphs
The objective in developing quantitative structure activity relationships (QSAR)
and quantitative structure property relationships (QSPR) is to develop practical tools to
relate the properties to chemical structure. The property of a molecule can only be
explained in terms of the three dimensional aspects known as molecular topography such
as shape, volume and surface area. However, the topographical characteristics are indeed
related to the nature of individual atoms and the bonds between them. The effect of the
bonding pattern is such that it controls the topography and thereby the properties.
32
Therefore, the properties also must be related to the topology of the molecule (Kier &
Hall, 1986).
An exhaustive study has been conducted by Katritzky and Gordeeva (1993) with a
variety of classical topological indices and geometric/electric descriptors for their ability
to provide meaningful QSAR/QSPR relationships. It has been confirmed that, the
classical topological indices give the best correlations for the determination of physical
properties whereas a combination of topological indices and geometrical descriptors give
the best quality regression expressions even though the relationships with topological
indices alone also did not perform too bad. However, for most of the topological indices,
no unambiguous criterion has been followed for their selection and verification (Gutman
& Polansky, 1986). Therefore, many of them may contain the same kind of structural
information with the difference being only in the scaling factor (Trinajstic, 1992).
Most topological indices are developed either from the adjacency matrix or from
the distance matrix of the molecular graph (Trinajstic, 1992). However, since the distance
matrix can be developed from an adjacency matrix (Bondy & Murty, 2008), the
topological indices can be developed from the adjacency matrix alone. Some of the
common topological indices are described in the following sections.
2.6.1. Connectivity Indices
The basic assumption employed in the connectivity index method is that, the
structural formula of a molecule has enough information for relating it to its properties.
Therefore, the efforts to obtain non-empirical expressions for such index values are
logical. In one of the initial works, Randic (1975) found that in the alkane skeletons, the
33
number of adjacent carbon atoms to one specific atom can give a description of the
branching of the molecule. In that work, he proposed to use a molecular descriptor called
the delta value. The delta value is the count of formally bonded carbon atoms. The delta
value is obtained from the individual atomic valencies of the atoms that form the bond.
The product of the atomic valencies is raised to the power of -0.5 to obtain the delta
value.
The sum of all delta values in the molecule provide an index associated with that
molecule. The larger the branching in a structure, the lower will be the branching index
corresponding to that structure because of the inverse relationship.
The following example illustrates the calculation of the branching index value for
a 3,3 dimethyl pentane molecule:
CH3 C CH2 CH2 CH3
CH3
CH3
The structure is described with a molecular skeleton with the count of all bonded
carbon atoms in figure 2.8:
34
a
b
2
d
e
f
1
4
1 1
2
1
c
Figure 2.8: Molecular Skeleton of 3,3 Dimethyl Pentane
Here, a, b, c, d, e, f represents different bonds. According to the definition above,
the delta value corresponding to each bond can be calculated as follows:
( )
( )
( )
( )
( )
( ) 0.707 12: Bond
5.0 22: Bond
0.3535 42: Bond
0.5 41: Bond
.50 41: Bond
0.5 41: Bond
5.0
5.0
5.0
5.0
5.0
5.0
=?
=?
=?
=?
=?
=?
?
?
?
?
?
?
f
e
d
c
b
a
Branching Index of 3,3 dimethyl pentane molecule = 0.5 + 0.5 + 0.5 + 0.3535 + 0.5 + 0.7
= 3.0605
The branching number was found to correlate with properties like boiling point,
Kovats constants and a calculated surface area (Kier & Hall, 1986).
However, additional descriptors developed based on delta values have the
limitation that they would not differentiate among saturated and unsaturated bonds.
35
Therefore, a new concept was developed by Kier and Hall (1976). The new structural
descriptor is termed as the valence delta (?v) which is based on the explicit counting of
each bond to a nearby atom and it is estimated as follows:
hZ Vv ?=? (2.13)
Here, ZV of an atom is the count of all adjacent bonded atoms and all pi and lone
pair electrons and h is the number of hydrogen atoms bonded to that atom.
While dealing with high atomic weight atoms, the effect of non-valence core
electrons to the atomic size and properties must be considered because such core
electrons also have a significant role in influencing the properties. For that, the valence
delta is redefined for an atom of atomic weight Z as follows:
( )
( )1??
?=
v
v
v
ZZ
hZ? (2.14)
The valence delta values for heteroatoms at higher oxidation states will be
different from the values given by the above formula. Some empirical values for a few
atoms at different oxidation levels have been estimated and are available in literature
(Kier & Hall, 1986).
If i and j are the atoms involved in the bond, then, bond indices, ?k are defined
through the pair of valence delta values (Kier & Hall, 1986).
v
j
v
i
k ??? = (2.15)
36
The zero?th order connectivity index of an atom has been estimated from the
individual valence deltas and the first order CI is formed from the possible bonds present
in the molecule as follows (Kier & Hall, 1986):
vi
v
??
10 = (2.16)
k
v
??
11 = (2.17)
Zero order and first order connectivity indices are the most commonly used
molecular descriptors. However, higher orders of connectivity indices can be calculated
by following the same methodology. The nth order connectivity index is calculated using
eq. (2.18):
( )nvivnv ??? ....
1= (2.18)
The zero order connectivity indices (CI) for a molecule are obtained by
summation of the CI values of each atom:
( ) ( )?= n nvmoleculev
1
00 ?? (2.19)
37
The first order CI for a molecule is obtained as the sum of CI?s of all edges:
( ) ( )?=
n
n
v
molecule
v
1
11 ?? (2.20)
It is to be noted that, if the first order connectivity indices for the different groups
are written separately, their sum will not give the CI value of the molecule. This is
because, the contribution to first order CI by the bonds between two separate groups is
not represented in the expression. To account for that, an additional term for the bond
between different groups has to be included in the expression (Gani et al., 2005).
( ) ? ? ??
?
?
?
?
?
?
+??
?
?
???
?= k
m
group
v
1
m
groups ofout bonds
k
bonds internal
1 5.01
??? (2.21)
Here, k is the number of bonds inside the group for which the expression is
written and m is the number of free bonds the group has.
2.6.2. Edge Adjacency Index
The edge adjacency index (?) is a topological index developed by Estrada
(1995a). The reported correlation coefficient (R>0.99) of ? in QSPRs to estimate molar
volume is significantly higher than any of the available group contribution methods. The
calculation of ? index is described below.
The development of the vertex adjacency matrix has been explained in section
2.6. Similar to that, an edge adjacency matrix can also be developed. Two edges are
38
adjacent if one vertex in a chemical graph is incident (having one common vertex) to
both the edges. If there are m edges in a graph and gij are the elements in that graph, the
edge adjacency matrix, E= [gij]mxm can be defined as shown in eq. (2.22):
?
?
?=
Otherwise 0
adjacent are and If 1 ji
ij
eeg (2.22)
The edge degree ?(ek) has been defined in eq. (2.23):
( ) ?=
i
ikk ge? (2.23)
This can be calculated as the sum of elements of kth row in the matrix E. Now, edge
adjacency index can be calculated using eq. (2.24):
( ) ( )[ ] 21?? ?=
ll
ji kee ??? (2.24)
where the sum is over all l adjacent edges in the graph. Here, k is a constant,
which is defined in eq. (2.25):
?
?
?=
Otherwise 0
adjacent are and if 1 ji eek
(2.25)
39
If the graph contains heteroatoms, it is necessary to account for the differences in
bonds formed between heteroatoms and carbon from the other types of bonds. In that
case, the values of gik are replaced with KC-X , which are the values corresponding to the
bonds between carbon and the heteroatom (Estrada, 1995b). The KC-X parameters are
related to the resonance integral associated with the bond between the heteroatoms and
the carbon atom (R. Daudel et al., 1959). Different values have been reported in the
literature for KC-X. The values reported by Ortiz and Perez (1982) are shown in table 2.4:
Table 2.4: Values of KC-X Parameters
2.6.3. Shape Indices
The shape index is a molecular descriptor used for the quantification of the
molecular shape. Depending on different aspects of the shape, shape indices of different
orders have been developed. Shape index of order one is defined as shown in eq. (2.26):
C-X Bond KC-X
C-N 0.9
C-O 0.8
C=O 1.6
C-S 0.7
C-F 0.7
C-Cl 0.4
C-Br 0.3
40
( )
( )21
2
1 1
P
nn ?=? (2.26)
1P is the number of paths of length 1. The other shape indices can also be
calculated in the similar manner.
2.6.4. Wiener Indices
The Wiener number is introduced as the path number, which is the number of
bonds between all pairs of atoms in an acyclic molecule. The main significance of this
index is that, this was the first time the significance of paths in a molecular skeleton had
been recognized (Wiener, 1947). The Wiener number, W is defined as one half of the off-
diagonal elements of the molecular distance matrix:
??
= =
=
N
k
N
l
klDW
1 1
5.0 (2.27)
where Dkl is the off-diagonal elements of the distance matrix, D.
Together with other molecular descriptors, the Weiner number can make
predictions on alkane properties like boiling points, heats of formation, heats of
vaporization, molar refractions and molar volume.
2.6.5. The Hosoya Topological Index
The Hosoya topological index is defined by the count of all possible patterns of
considering k disjoint bonds in a molecule (Hosoya, 1971):
41
KZZZZ ++++= ....1 21 (2.28)
Here, Z1 is the number of bonds in a graph, Z2 is the number of pairs of disjoint
bonds, Z3 is the number of triples of disjoint bonds and so on. The disjoint bonds are any
two or more bonds in the structure for which there is no incident vertex.
A variety of other connectivity indices is available in many published works. A
good review and the classification and applications of different topological indices can be
found in Trinajstic (1992) and Balaban (2001).
2.7. Connectivity Indices and GC+ Method
The GCM can provide fairly accurate estimates of the properties of the molecules
if the group contribution values of all the building blocks are known. However, if a
group, whose property contribution is not available, makes up at least one part of the
molecule the property estimation cannot be completed. In practice, this is a very common
frustration encountered while using GCM for property estimation or in reverse problems
because, property contributions of many common molecular groups are not available in
literature. So, recent works on the correlation between connectivity indices and some
physical properties can be used in such situations (Gani et al., 2005).
In a recent work to correlate the CI?s to physical properties, the following pure
component property model was proposed by Gani et al (2005):
( ) ( ) ( ) dcbAaYf vv
i
ii +++= ?
10 2)( ?? (2.29)
42
Here, Y is the sought property, Ai is the number of atom i, ai is the estimated
contribution of atom i while b, c and d are adjustable parameters. The property functions
are defined the same way as in the GCM. The values of the constants are given in the
appendix (Table A.6).
It is to be noted that the constants in the above equation are regressed using the
same pure-component property data used by Marrero and Gani (2001) in their group
contribution model development. Here, as the derived equation is on the atomic level, the
expected accuracy of prediction is less than the group contribution model. So, the CI
model is used only for deriving the property contributions of groups not available in
existing group contribution methods.
Since the property functions are defined the same way in both GCM and CI
methods, the formulation of a combined approach is straightforward. The combined GC-
CI model, known as the GC+ model, has been written as follows (Gani et al., 2005):
( ) ( ) ( )mvmv
i
imimm cbAaYf
10
,, 2)( ?? ++= ? (2.30)
dYfnYf
m
mm +??
??
?
?= ? )()( * (2.31)
?
?
??
?
?+?
?
??
?
?++?
?
??
?
?= ???
t
tt
s
ss
i
ii CNCNYfCNYf )()(
* (2.32)
43
where m is the number of different missing groups and nm is the number of times
the missing group is present in the molecule. f (Y*) is the property function of the missing
group.
2.8. Molecular Signature Descriptors
2.8.1. Current Status in Inverse Design
The inverse design of identifying the molecules from property constraints is a
relatively new problem from a chemical engineering perspective. In this section, the
different methodologies developed for designing molecules with a set of target properties
will be reviewed. It should be noted that, the current methodologies are very specific to
certain classes of property models and even in such restricted situations, the accuracy and
computational expenses requires a lot of improvement. Most of the existing methods
used group contribution based approaches for solving the inverse design problem
(Achenie et al., 2003). A number of recent publications have used the group contribution
based techniques to solve for different classes of molecular design problems ( Sahinidis
et al., 2003; Achenie & Sinha, 2004; Eljack & Eden, 2008; Chemmangattuvalappil et al.,
2009). However, the suitability of group contribution methods for molecular design is
limited because of the following reasons:
1. It is not always possible to find a suitable correlation between the molecular
groups and properties
2. Not all possible atomic arrangements are represented in GCM
3. Many group contribution models have limited ranges of accuracy
As mentioned before, the representation of a molecule in the form of a graph can
provide lot of information about its properties through the use of molecular descriptors.
44
Molecular descriptors are operators developed from the molecular graph to characterize
the properties of the molecule. The numerical value obtained after performing the
operations suggested by the descriptor on the molecular graph can generally be used to
correlate and predict physical properties and biological activities (Faulon et al., 2003b).
There are thousands of molecular descriptors available and that makes it difficult
to select the appropriate one(s) for a specific problem. A lot of work has been done in
molecular design using topological indices as structural descriptors (Baskin et al., 1990;
Gordeeva et al., 1990; Kvasnicka & Pospichal, 1990; Kier et al., 1993a; Skvortsova et
al., 1993). In all these approaches, the descriptors' structural features have been used to
generate the feasible molecular structures. However, the inverse relationships between
the topological indices generally do not provide a unique molecular graph (Trinajstic,
1992). Therefore, the degeneracy in these approaches is very large. In addition, most
topological indices exhibit highly non-linear functional dependence on the elements of
the vertex-adjacency matrix (Raman & Maranas, 1998). Because of that, obtaining a
global solution when employing mathematical programming techniques is difficult.
The techniques developed recently for obtaining unique molecular structures from
the molecular descriptors use stochastic approaches. In the algorithms developed by
Venkatasubramanian et al. (1994, 1995), and Sheridan and Kearsley (1995), a genetic
algorithm based approach has been introduced for large scale molecular design. Genetic
algorithms are stochastic optimization methods based on the Darwinian model of
evolution. In summary, genetic algorithms identify the population of best candidates from
an earlier population following the rules of crossover and mutation and thus producing
the best offspring for the next generation (survival of the fittest). A fitness function based
45
on the target properties has been used to evaluate the fitness of the candidate solutions.
This approach is expected to identify better offspring after each generation and eventually
end up with the optimal solution. These approaches were the first efficient methodology
for solving large-scale molecular design problems. They have the ability to locate optimal
designs for those problems with multiple target specifications. However, because of the
heuristic nature of the developed algorithms, there is no guarantee that a solution will be
obtained after running the algorithm. In addition, even though these algorithms can obtain
a near optimal solution very fast, the efficiency is very limited in obtaining the final
solution.
Algorithms based on Monte Carlo simulations (Faulon, 1996; Kvasnicka &
Pospichal, 1996) have also been published. A simulated annealing based algorithm has
been published by the group of Kokossis (Marcoulaki & Kokossis, 1998) for the design
of refrigerants and liquid-liquid extraction solvents. One advantage of simulated
annealing over other stochastic optimization methods is that, it can provide probabilistic
guarantee on the quality of the final solutions. These algorithms can theoretically provide
the solution in polynomial time. A variety of other papers have been published later using
stochastic techniques, however, the reconstruction of molecular structures using
deterministic techniques has rarely been attempted.
The notable contributions that use deterministic techniques for the exhaustive
generation of molecular graphs corresponding to the predefined molecular descriptors
have come from Kier?s group (Hall et al., 1993a; Hall et al., 1993b; Kier et al., 1993b)
and from Skvortsova?s group (Skvortsova et al., 1993). The works from the former group
compute the possible degree sequences that match the paths of the target descriptors up to
46
the length of two that can track degree sequences up to length three. These contributions
applied the developed techniques to chi-indices. Once the different paths of length two
are obtained, the degree sequences corresponding to all the molecular structures will be
generated using an isomer generator. Only those structures that match the path of length 3
(which is the maximum path length that can be tracked using this technique) are accepted
as the final solutions. In the contribution from Skvortsova?s group, apart from the degree
sequence, an edge sequence also has been generated, which simultaneously formed the
input to the isomer generator to build the exhaustive list of corresponding structures. The
edge sequence can estimate the number of edges between each combination of atoms.
This can decrease the degeneracy of the solutions significantly. However, the
descriptional features in these methods do not always produce feasible or unique
structures.
In more recent works (Raman & Maranas, 1998; Camarda & Maranas, 1999),
methodologies have been developed to incorporate topological indices within an
optimization framework. In the work by Raman and Maranas (1998), many hydrocarbon
properties are correlated with connectivity indices and shape indices of different orders.
In this work, the molecular structure has been represented in the form of a vertex
adjacency matrix that can completely explain the molecular interconnectivity. Even
though the actual topological indices used in this work are non-linear, the matrix
representation has been used to systematically transform them into linear form. A mixed
integer linear program (MILP) formulation has been formed which ensures that a global
optimum solution can be achieved. However, its application has been limited to the
design of alkyl structures. In the work by Camarda and Maranas (1999), nonlinearities
47
due to the expressions for connectivity indices led to MINLP formulations, which make
the solution methodology computationally expensive and susceptible to local optima
traps.
Another important inverse problem solution technique had been developed by
forming a target scaffold (Garg & Achenie, 2001) for drug design. This is the first
attempt to apply mathematical programming techniques in the area of drug design. In this
work, a target scaffold based on a drug molecule has been used to generate a QSAR and
the inverse problem has been solved for the best values of selectivity by changing the
substituents on the scaffold. The limitations of this approach was that it was not able to
provide nonintuitive solutions because the scaffold limits the type of molecules obtained
as solutions. However, this approach was effective in controlling the combinatorial
explosion.
In a later contribution, a fitness function was directly incorporated into an
optimization framework (Siddhaye et al., 2004). However, in many formulations, a
globally optimal solution cannot be guaranteed in this approach. In another work, second
order connectivity indices had been used for the design of value added soybean oil
products (Camarda & Sunderesan, 2005). The interesting aspect of this work is that, the
highly non linear second order connectivity indices have been represented with the exact
linear equivalent expressions using Glover transformation. Glover transformation is a
technique used in non-linear integer programming to represent non-linear expressions in
terms of linear equations that captures the essential non-linearities of the original
problem. By applying Glover transformation, the non-convex terms have been converted
into products of binary variables. Even though this approach eliminates the possibility of
48
local minima traps by avoiding non-convex terms, the computational requirements are
relatively high. To decrease the computational complexity, an approach known as
?templating? has been applied in this work. In templating, a portion of the vertex
adjacency matrix has been predefined to control the generation of a large number of
structures. However, because of templating, the generation of non-intuitive solutions may
be eliminated.
In some recent works, heuristic methods have been used in order to handle the
non-linear constraints. (Lin et al., 2005; Eslick et al., 2009; McLeese et al., 2010). These
methods cannot ensure a global optimum solution. However, in order to ensure that the
solution is near optimum, Tabu search methods have been employed while solving the
problem. In Tabu search, a library will be generated that keeps track of the recently
generated local optima solutions that will prevent the generation of the same local
minima solution as the search proceeds (Lin et al., 2005). Even though this technique
ensures better quality of the solutions, the optimum solution can never be guaranteed
because of the non-linear constraints.
Apart from these limitations, the above-mentioned techniques can be used only
when the QSAR/QSPRs are based on one topological index. For many properties, the
QSAR/QSPRs are formed based on more than one topological index. In addition, the
topological indices required to form the QSAR/QSPRs may be different for different
properties of interest. For instance, the QSPR for one property target may be based on
connectivity indices and the second property target may be based on shape indices. Since
different topological indices are formulated using different mathematical expressions,
there is no standard way to combine everything on a common platform and solve it all
49
simultaneously. Therefore, the current techniques that employ QSAR/QSPR models in
reverse problem formulations can handle those property models with one topological
index. Different topological indices have to be formulated using different mathematical
transformations. In addition, the degeneracy of the solutions in all these methodologies is
very high. That means, for a specific solution, there could be many possible molecular
structures. The degeneracy increases with the size of the molecules in the solution. Here,
we are looking for a computationally efficient algorithm that can simultaneously
incorporate different topological indices based on QSAR/QSPRs. The recent works on
molecular signature descriptors (Visco et al., 2002; Faulon et al., 2003a; Faulon et al.,
2003b; Weis et al., 2005) provides a convenient way to represent a variety of TIs as
linear combinations of molecular signatures. Therefore, this approach has the potential to
develop an efficient methodology for the design of molecules with QSARs/QSPRs
related to diverse TIs.
2.8.2. Development of Molecular Signature
The concept of molecular signature ( Visco et al., 2002; Faulon et al., 2003b) is
significant for the reverse problem formulation framework because, it forms a finite set of
not highly correlated descriptors based on the molecular structure from which all other
TI?s can be calculated.
The molecular signature is a systematic way of representing the atoms in a
molecule using the extended valancies to a pre-defined height. The systematic procedure
for the identification of the signature of a molecule developed by Visco et al., (2002) is
explained in the next section.
50
One of the characteristics that make the molecular signature descriptors unique
among other molecular descriptors is that the building blocks of the molecular signatures
complement each other. If G is a molecular graph and x is an atom of G, the atomic
signature of height h of x is a canonical representation of the subgraph of G containing all
atoms that are at a distance h from x. This canonical representation can be obtained by the
following systematic procedure:
1. All atoms (vertices) in the graph are labeled in a canonical order starting with
atom x
2. For the atom x for which the atomic signature is to be constructed, all atoms and
bonds will be shown up to the height h in the sub graph )(xGh .
3. Construct the tree that spans over all the edges in the sub graph. The root of the
tree is the atom x itself. The tree is constructed one layer at a time up to level h.
The first layer of the tree are the nearest neighbors of atom x, the second layer of
the tree consists of all the neighbors of the vertices in layer 1 except the atom x. In
general, when the tree has been constructed up to level h-n, then, layer h-n+1 will
be constructed considering each vertex of layer h-n. All vertices in the tree are
labeled and colored with the necessary coloring function. The vertex color of a
graph is the assignment of a unique description provided to its vertices in order to
distinguish among different groups of atoms. The coloring function will be
selected based on the type of the molecule. Typical coloring functions include the
type of atoms, type of bonds and valency. It is possible to have one vertex more
than once in the graph. However, no edge should be repeated in the same graph.
51
4. The signature can be written by reading the tree from the atom x. The child level
vertices will be enclosed in parenthesis at each level. The vertex color must be
written along with the vertices in each level. After each level, the next level must
be followed until the signature reaches the required height. While writing the
signatures, all the neighbors including the root atom must be written.
One example of the construction of atomic signature is shown in figure 2.9. Here,
the stepwise procedure for obtaining the atomic signature of atom N (nitrogen) up to
height 3 in a molecule is illustrated.
In the first step, the atoms are labeled to distinguish between them when writing
the molecular signatures in the later stages.
In the second step, all the atoms at height 3 from atom N are extracted. In other
words, the neighbors of height three includes the atoms bonded to N (say y), the atoms
bonded to all atoms in y (say z) and all the atoms bonded to all atoms in z.
In the third step, the molecular groups are replaced with vertices. All the vertices
are colored with the atom type, valency and the type of bond. Here the atoms types are C
and N. The different carbon types are distinguished with their valancies and the type of
bond. Note that the bond type has been retained in the canonical representation in step 2.
In the final step, the signatures of different heights have been formed by reading the tree
starting from the root N atom. The atomic signature of height zero is the root N atom
itself. The atomic signature of height one is the root atom followed by its nearest
neighbors (in this case, two carbon atoms) enclosed in parenthesis. In signature of height
two, the neighbors of the carbon atoms (including the root N atom) are listed in the
52
parenthesis. Finally, signature of height three is obtained by adding the neighbors of the
atoms in the previous layer. While writing the signatures, the vertices at the different
levels have been color coded for clarification of the levels.
CH2
NH
C
CH2
CHCH2
CH2CH2
CH3
CH3
CH3
C2
N2
C4
C2 C3C2
CC C
CH2
NH
C
CH2
CHCH2
CH2CH2
CH3
. CH3
CH2
2
NH
1 C
3
CH2
5
CH
6CH24 CH2
9CH27
CH3
10
CH3
8
CH3
11
0?(x) = N Height 0
1?(x) = N2 (CC) Height 1
Molecular
structure
Step 1
Step 2
Step 3
Step 4
53
2?(x) = N2 (C2 (NC) C4 (=CCN)) Height 2
3?(x) = N2 (C2 (N2(CC) C1(C)) C4 (=C3(=CC) C1(C) N2(CC))) Height 3
Figure 2.9: Atomic Signatures up to Height 3
It is clear that the set of atomic signatures up to a given height is of finite size. So,
any molecule can be represented by its coordinates in a vectorial space where the base
vectors are its atomic signatures. Thus, the signature of a molecule is defined as the linear
combination of atomic signatures (Visco et al., 2002; Faulon et al., 2003b). If
)( ihGh X? is a base vector, ih? is the number of atoms having the signature of the base
vector and Gh K is the number of base vectors, then the molecular signature )(Gh? is
represented as:
)()()(
1
i
h
G
K
i
h
i
h
G
Vx
hh XxG G
h
G
???? ??
=?
== (2.33)
A simple example is presented to show molecular signatures of different heights
using eq. (2.33). In the first step, the subgraphs have been generated for all the atoms that
produce signatures up to the required height. Then, individual atomic signatures for all
the atoms have been generated. Finally, eq. (2.33) is applied to generate the molecular
signature.
CH3CH2CH(CH3)CH2NH2
54
CH3
1
CH2
2
CH
3
CH3
4 CH
25 NH26
CH3 CH2 CH
CH3
CH2 CH3
CH2
CH3
CH
CH3
CH2 NH2
CH
CH2
CH3
CH3
CH2 NH2
CH3 CH
CH2
CH3
CH2 NH2
CH2 CH
CH3
CH2 CH3
NH2
NH2 CH2 CH
CH3
CH2 CH3
0? = 5C + N Height 0
1? = 2C1(C) + C2(CC) + C3(CCC) + C2(NC) + N1(C) Height 1
2? = C1(C2(CC)) + C2(C1(C)C3(CCC)) + C3(C1(C)C2(CC)C2(NC)) +
C1(C3(CCC)) + C2(C3(CCC)N1(C)) + N1(C2(NC)) Height 2
3? = C1(C2(C1(C)C3(CCC))) +
55
C2(C1(C2(CC)) C3(C1(C)C2(CC)C2(NC))) +
C3(C1(C3(CCC))C2(C3(CCC)N1(C))C2(C1(C)C3(CCC))) +
C1(C3(C1(C)C2(CC)C2(NC))) +
C2(C3(C1(C)C2(CC)C2(NC)) N1(C2(NC))) +
N1(C2(C3(CCC)N1(C))) Height 3
Figure 2.10: Molecular Signature Tree
2.8.3. Application of Signature Descriptors in Property Prediction
The atomic signature concept is useful in QSAR/QSPR studies because of its
applicability in defining many topological indices. Consider a molecule G with known
atomic signature up to height h as defined in section 2.6.1. Suppose the number of
vertices in an intermediate layer k of graph G is )(xVk . Note that k