Last edited: 06/11/2011
By: VLG
This page gathers useful information on various aspects of SA2. The aim is to highlight important, or simply useful, information that helps you get the best from SA2, be aware of its main limitations, and avoid common mistakes. As with any GUI-based scientific application, some things may not be designed perfectly (it's all relative!), and some algorithms should be used in certain ways. This page is intended to collect such information.
Depending on the handler you select when creating a new database, SA2 will behave differently in several respects. Let us detail the most important ones:
If you have already dealt with chemistry programming toolkits, or even with large general-purpose computational chemistry applications (such as MOE, Sybyl or Schrödinger), you may know that there is no absolute definition of what a molecule is, and thus of how it is represented in your computer and perceived by the toolkit. Therefore, each toolkit defines its own set of atom types and associates a predefined type, e.g. SP2 carbon, with each atom present in your molecule.
Now what if the toolkit cannot assign atom types to a particular molecule? Well, the toolkit will probably not be able to compute any descriptor on this molecule. In such a case, all basic descriptors / flags will be set to null.
How do you recognize these molecules? A specific flag called "exotic" has been introduced in SA2. The handlers are supposed to set this flag to 1 when such an event happens, so that you are informed of the error. Note however that unknown atom types can sometimes result from an error made when the structure was drawn (e.g. a wrong valence for a particular atom).
The descriptors computed on each molecule are also likely to differ from one handler to another. Typically, the definitions of hydrogen bond acceptors / donors are not the same.
Performance (the time required to import new molecules into the database) can also be affected by the handler. In particular, many SMARTS flags need to be matched on each molecule freshly imported into an SA2 database. If the SMARTS matching algorithm is slow, so is the import process, and this is typically the case with the CDK handler (see below).
We have faced an incredible number of issues using MySQL with really huge databases. Actually, our limited experience with this database engine was probably more responsible for these problems than the engine itself (although MySQL is still known to scale poorly; integrating PostgreSQL would be great!).
Anyway, let's share our experience by providing a set of options that can be configured in MySQL to get better performance (and fewer crashes!!) with very large databases. If you have, say, 100 000 molecules in your database, you will probably not experience any such issue. But it's still worth knowing!
Everything starts with a configuration file named my.cnf (or my.ini on Windows). This file contains a set of options that define, for example, the amount of memory allocated to MySQL for various processes. Under Linux, it is usually located in /etc/my.cnf. On Windows, it can be located in various places, including C:/my.cnf or C:/my.ini. More information on this file here:
http://dev.mysql.com/doc/refman/5.1/en/option-files.html
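As a minimal sketch, the following Python helper checks the locations mentioned above and returns the first option file it finds. The function name `find_option_file` and the candidate list are assumptions made for illustration; your installation may keep the file elsewhere.

```python
from pathlib import Path

# Candidate locations taken from the text above; other installations
# may use different paths (this list is an assumption, not exhaustive).
CANDIDATES = ["/etc/my.cnf", "C:/my.cnf", "C:/my.ini"]


def find_option_file(candidates=CANDIDATES):
    """Return the first existing option file as a Path, or None."""
    for name in candidates:
        path = Path(name)
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    found = find_option_file()
    print(found if found else "No option file found in the usual places")
```

If nothing is found, the MySQL documentation linked above lists the other places where the server looks for option files.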
By default, MySQL is usually not configured for large databases. To overcome this, MySQL ships with various configuration templates that fit different situations, including large databases. More information on this can be found here. Based on these pre-configured files, we ended up with the following options, which should be added (or modified if already there) to deal with large databases. If you experience any trouble with SA2, you may want to try setting these options and running your operation again to see if it fixes the issue.
# Some parameters that should come up after the
# [mysqld] tag
key_buffer = 384M
max_allowed_packet = 16M
table_cache = 512
sort_buffer_size = 2M
read_buffer_size = 2M
read_rnd_buffer_size = 8M
myisam_sort_buffer_size = 64M
thread_cache_size = 8
query_cache_size = 32M
thread_stack = 256K
# Try number of CPU's*2 for thread_concurrency
thread_concurrency = 8
# VERY IMPORTANT: this variable is usually
# dramatically downsized by default in MySQL;
# You can set innodb_buffer_pool_size up to 50 - 80 %
# of RAM but beware of setting memory usage too high
innodb_buffer_pool_size = 1024M
innodb_additional_mem_pool_size = 20M
# Set log_file_size to 20 % of buffer pool size
innodb_log_file_size = 200M
innodb_log_buffer_size = 8M
innodb_lock_wait_timeout = 50
For the curious, or for those who want to know more about MySQL configuration, it is a good idea to have a look at the full list of parameters in MySQL.
Note that once you have changed the configuration file, you will need to restart the MySQL server (for example with /etc/init.d/mysqld restart or service mysqld restart on Linux)!
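The rules of thumb given in the comments of the configuration above (buffer pool at up to 50 - 80 % of RAM, log file size at 20 % of the buffer pool, thread concurrency at twice the number of CPUs) can be sketched as a small Python helper. The function name `suggest_innodb_settings` and its parameters are hypothetical and only illustrate the arithmetic; treat the results as starting points and not as tuned values.

```python
def suggest_innodb_settings(ram_mb, cpu_count, pool_fraction=0.5):
    """Apply the rules of thumb from the configuration comments.

    ram_mb        -- physical memory of the server, in megabytes
    cpu_count     -- number of CPUs
    pool_fraction -- share of RAM given to the InnoDB buffer pool
                     (the comments advise not going beyond 0.5 to 0.8)
    """
    if not 0 < pool_fraction <= 0.8:
        raise ValueError("pool_fraction should be above 0 and at most 0.8")
    pool_mb = int(ram_mb * pool_fraction)
    return {
        "innodb_buffer_pool_size": f"{pool_mb}M",
        # Log file size: 20 % of the buffer pool
        "innodb_log_file_size": f"{pool_mb * 20 // 100}M",
        # Number of CPUs * 2
        "thread_concurrency": str(cpu_count * 2),
    }


if __name__ == "__main__":
    settings = suggest_innodb_settings(ram_mb=2048, cpu_count=4)
    for key, value in settings.items():
        print(f"{key} = {value}")
```

With 2048 MB of RAM and 4 CPUs, this gives values close to those listed in the configuration above.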
There are a lot of CDK descriptors / fingerprints, and some of them require significant time to be calculated. As a consequence, the import process will be slower if you use the CDK worker.
Also, some descriptors will never be calculated for various reasons (known bug, too long to compute...), although they can still be stored in the dedicated table. Thus, you should not use these descriptors in e.g. DRCS analysis or XY plots. Learn more about this here.
The CDK handler has another notable limitation: its SMARTS substructure matching seems relatively slow. Importing new molecules will therefore require much more time (compared to other handlers) if you use the CDK handler and compute the various HTS flags available in SA2 (in particular the PAINSl15 flag, which requires matching more than 400 substructure patterns).
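To get a feeling for why the matching speed matters so much, here is a rough cost model in Python, assuming every pattern is matched against every imported molecule. The function name and the per-match timings are hypothetical, chosen for illustration only; they were not measured on SA2 or on any toolkit.

```python
def estimate_import_seconds(n_molecules, n_patterns, seconds_per_match):
    """Rough cost model: every pattern is matched against every molecule."""
    return n_molecules * n_patterns * seconds_per_match


if __name__ == "__main__":
    # Hypothetical timings, for illustration only (not measured on SA2).
    scenarios = [("faster matcher", 0.0001), ("slower matcher", 0.001)]
    for label, per_match in scenarios:
        seconds = estimate_import_seconds(100_000, 400, per_match)
        print(f"{label}: about {seconds / 3600:.1f} h")
```

Since the cost grows with the product of molecules and patterns, a matcher that is a few times slower per pattern makes the whole import a few times longer.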
Memory problems sometimes happen in Java... Besides, if you are plotting a database containing hundreds of thousands of molecules, you will need memory anyway.
To increase (or decrease...) the memory allocated to SA2 at startup, do the following:
NOTE: This section is no longer useful with the current stable version of SA2, which you should use. If for some reason you are still using the first beta version (1.0.*b), then it may be of interest to you.
As you have probably already experienced, the position of all windows in SA2 is entirely flexible. You can move each window of SA2 pretty much anywhere you want, undock windows, group windows (using tabs)... In addition, when you close the application, the position that you assigned to each window is saved in your user directory and restored upon restart.
We just said that your windows will be restored just as they were in the last run of SA2.
Consider the following situation: when you run SA2 and get to the connection window,
all your windows will be closed as you will not be connected to any database. What
if you now close the application BEFORE opening any database? Well... the platform will
remember that all your windows were closed the last time you used SA2, and thus you
will have to recreate the layout of your dreams from scratch. So much fun in prospect!
We will try to improve this as soon as possible.
When running SA2 for the first time, a default window organization (layout) will be set up. In the current beta version of SA2, this default layout is not really handy. Until we improve this, we provide a small tip that will set up a better layout.