Last edited: 21/11/2011
By: VLG
This page describes the main algorithms available in SA2 to perform diverse subset extraction.
Diversity is an important topic in library design. The main idea is that a general-purpose (and sometimes even a target-specific) screening library should contain a wide range of chemotypes in order to maximize the chance of getting hits out of a screening campaign. Selecting for diversity also ensures that most (ideally all) parts of the known chemical space are represented in the library.
SA2 provides the possibility of extracting diverse subsets of molecules using a scaffold-based algorithm. The diversity selection can be performed on the whole database or on an existing library. This way, one can restrict the search to a carefully selected set of molecules.
The base algorithm has been designed to ensure the presence of one molecule per scaffold (or framework, up to you!), when possible. It starts by retrieving all scaffolds within the database (or the selected library). These scaffolds are either randomly shuffled or ordered by decreasing number of associated molecules. The first molecule added to the library is the one most similar to the average fingerprint computed over all the molecules that belong to the first selected scaffold. The similarity between two molecules is given by any similarity coefficient available in SA2 (e.g. Tanimoto), applied to the selected fingerprint. Next, for each remaining scaffold, the molecule having the lowest similarity to the molecules selected so far is added to the library.
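The base procedure can be sketched in a few lines of Python. This is only an illustration, not SA2's actual code: the function names and the data layout (a dictionary mapping each scaffold to its molecules and their 0/1 fingerprint vectors) are made up for the example, and "lowest similarity to the selected molecules" is assumed here to mean the lowest maximum similarity to any of them.

```python
import random
from typing import Dict, List, Sequence

Fingerprint = Sequence[float]  # 0/1 bit vector; the average fingerprint holds fractions


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Continuous Tanimoto coefficient; equals the usual binary form for 0/1 vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0


def average_fingerprint(fps: List[Fingerprint]) -> List[float]:
    """Per-bit mean over a non-empty list of fingerprints."""
    return [sum(bits) / len(fps) for bits in zip(*fps)]


def select_diverse(scaffolds: Dict[str, Dict[str, Fingerprint]],
                   shuffle: bool = False, seed: int = 0) -> List[str]:
    """Pick one molecule per scaffold (hypothetical helper, not an SA2 function).

    scaffolds maps scaffold id -> {molecule id: fingerprint}.
    """
    order = list(scaffolds)
    if not order:
        return []
    if shuffle:
        random.Random(seed).shuffle(order)
    else:
        # Largest scaffolds first.
        order.sort(key=lambda s: len(scaffolds[s]), reverse=True)

    # First pick: the molecule closest to the average fingerprint of the first scaffold.
    first = scaffolds[order[0]]
    centroid = average_fingerprint(list(first.values()))
    first_id = max(first, key=lambda m: tanimoto(first[m], centroid))
    selected = {first_id: first[first_id]}

    # Remaining scaffolds: take the molecule least similar to what is already selected
    # (assumption: "least similar" = lowest maximum similarity to any selected molecule).
    for scaffold in order[1:]:
        candidates = scaffolds[scaffold]
        best = min(candidates,
                   key=lambda m: max(tanimoto(candidates[m], fp)
                                     for fp in selected.values()))
        selected[best] = candidates[best]
    return list(selected)


# Toy data: two scaffolds with made-up 4-bit fingerprints.
demo = {
    "A": {"a1": [1, 1, 0, 0], "a2": [1, 1, 1, 0], "a3": [1, 0, 0, 0]},
    "B": {"b1": [1, 1, 0, 0], "b2": [0, 0, 0, 1]},
}
picked = select_diverse(demo)
```

In this toy example, `a1` is closest to the average fingerprint of scaffold A, and `b2` is chosen for scaffold B because `b1` is identical to `a1`.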
A maximum similarity cutoff can also be defined. For a particular scaffold, any candidate molecule whose similarity to the already selected molecules is greater than this cutoff is rejected, thereby ensuring that similar scaffolds are not over-represented in the library. The counterpart is a higher computational cost if the similarity cutoff is set too low, since more selection runs are then needed.
Once all the scaffolds have been processed, the final number of molecules may still be lower than the desired size of the library, $ N $. Two reasons can lead to this situation: (1) the number of scaffolds in the database is lower than the required number of molecules, and (2) the similarity cutoff used is too low. In both cases, the entire selection process is simply repeated. In the second case, the cutoff is also automatically increased for each new run. The selection process finally stops when $ N $ molecules have been selected.
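The cutoff and the repeated runs can be sketched as follows. Again, this is a simplified illustration under stated assumptions rather than SA2's implementation: fingerprints are represented as Python sets of on-bits, the increment `step=0.1` is an arbitrary value chosen for the example (the actual increment used by SA2 is not stated here), the cutoff is raised only when a run had to reject at least one candidate, and the special handling of the very first molecule (average fingerprint) is left out for brevity.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Binary Tanimoto coefficient on sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_with_cutoff(scaffolds: Dict[str, Dict[str, Set[int]]],
                       n_wanted: int,
                       cutoff: float = 0.6,
                       step: float = 0.1) -> Tuple[List[str], float]:
    """Repeat the per-scaffold selection until n_wanted molecules are selected.

    Hypothetical helper: scaffolds maps scaffold id -> {molecule id: on-bits}.
    Returns the selected molecule ids and the cutoff in effect at the end.
    """
    total = sum(len(mols) for mols in scaffolds.values())
    n_wanted = min(n_wanted, total)  # cannot select more than what exists
    selected: Dict[str, Set[int]] = {}

    while len(selected) < n_wanted:
        rejected = False
        for molecules in scaffolds.values():
            if len(selected) >= n_wanted:
                break
            best_id, best_sim = None, None
            for mol_id, fp in molecules.items():
                if mol_id in selected:
                    continue
                # Highest similarity to anything already in the library.
                sim = max((tanimoto(fp, s) for s in selected.values()), default=0.0)
                if sim > cutoff:
                    rejected = True  # too close to the current selection
                elif best_sim is None or sim < best_sim:
                    best_id, best_sim = mol_id, sim
            if best_id is not None:
                selected[best_id] = molecules[best_id]
        # Case (2) in the text: the cutoff was limiting, so relax it for the next run.
        if len(selected) < n_wanted and rejected:
            cutoff = min(1.0, cutoff + step)
    return list(selected), cutoff


# Toy data: asking for 3 molecules out of 2 scaffolds forces a second run.
demo = {
    "A": {"a1": {1, 2}, "a2": {1, 2, 3}},
    "B": {"b1": {1, 2}, "b2": {7}},
}
picked, final_cutoff = select_with_cutoff(demo, n_wanted=3, cutoff=0.6)
```

Here the first run selects `a1` and `b2` and rejects `b1` (identical to `a1`). The library is still too small, so the cutoff is relaxed and a second run adds `a2`.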
Let's now illustrate the creation of a diverse subset using SA2. You need the demo database (either follow the quickstart guide, or create it directly) to run this simple example.
The library should now be created and visible in the "List of libraries" window.
There are many ways of evaluating the diversity of a library. As you may know, diversity is not an absolute concept, and it is advisable to analyse your library in several different ways. Here, we will only describe one of them, using the Similarity report.
Other ways of evaluating diversity using SA2 include plotting the library in a reduced chemical space, comparing the distribution of various descriptors with the database or other libraries, and computing a scaffold / framework report. Soon (hopefully), some DRCS-specific indices will be integrated to provide further ways of evaluating this diversity.
Let's generate a similarity report then:
The report should be generated very quickly as we only have 100 molecules in our diverse library. As the report computes all pairwise similarities within the library, it might be much slower for medium or large libraries.
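To get a feel for the cost, note that a library of $ n $ molecules contains $ n(n-1)/2 $ distinct pairs:

$$ \frac{100 \times 99}{2} = 4950, \qquad \frac{10000 \times 9999}{2} = 49995000 $$

so a library 100 times larger needs roughly 10,000 times more similarity computations (assuming each pair is computed exactly once; how SA2 organises the computation internally is not described here).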
The first thing you will see is the distribution of the average pairwise similarity.
It gives you a good idea of how similar, on average, each molecule is to the rest of the library, which is a good start: if the histogram is biased toward high values, your library is certainly not diverse (or your fingerprint is not discriminative enough for the molecules in your database!), and you can drop us an email to tell us that our diversity algorithm is crappy! :)
On the second tab, you will find similar information, but this time as the distribution of the nearest neighbor similarity. Here, you can see that most of our molecules have a nearest neighbor similarity around 0.7. On its own, this information is of little interest, except when the histogram is completely biased one way or the other. Now if you compute the same report on the whole database, you will end up with the following chart:
Hopefully, you can see the difference, and conclude that the diversity selection was fairly successful...
Finally, this information can be summarised by numerical values, which you can see in the third tab.
As you can see, it was possible to create a subset of 100 molecules with no pair of molecules having a Tanimoto similarity above 0.8 using the JOELib fingerprint (which is small and therefore quite generic). Remember that the cutoff was set to 0.6 at first, which means that the algorithm had to perform two iterations to obtain the diverse subset.
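For readers who want to reproduce the two distributions outside SA2, the quantities behind the report can be sketched as below. This is an illustrative sketch only: the function name and the set-of-on-bits fingerprint representation are assumptions made for the example, and the library must contain at least two molecules.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Binary Tanimoto coefficient on sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_report(library: Dict[str, Set[int]]
                      ) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (average pairwise similarity, nearest neighbor similarity) per molecule.

    Hypothetical helper; library maps molecule id -> on-bits, with at least 2 entries.
    """
    sims = {mol_id: [] for mol_id in library}
    # Each unordered pair is computed once and recorded for both molecules.
    for i, j in combinations(library, 2):
        s = tanimoto(library[i], library[j])
        sims[i].append(s)
        sims[j].append(s)
    average = {m: mean(values) for m, values in sims.items()}
    nearest = {m: max(values) for m, values in sims.items()}
    return average, nearest


# Toy library with made-up fingerprints.
demo = {"m1": {1, 2, 3}, "m2": {1, 2}, "m3": {9}}
average, nearest = similarity_report(demo)
highest_pairwise = max(nearest.values())  # no pair in the library is more similar than this
```

Plotting histograms of `average` and `nearest` gives the kind of distributions shown in the first two tabs, and `highest_pairwise` is the sort of single-number summary discussed above.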