A Four Thieves Vinegar Collective endeavour…

4tv_logo.jpeg

… to build an open-source network of chemical reactions for organic chemistry.


Contribute

Any expertise you can bring to any stage is welcome. If you are already a member of the collective and would like to add anything to this wiki, please see the Contact tab above to request moderator status. If you are not a member of the collective but have expertise in any of the steps involved in this endeavour, please contact the collective.

Overview

We are working on this project in stages. The diagram below shows a high-level overview of each stage, labeled I through V.

Workflow.gif
  Stage I

Here we gather the data to build the network. We currently cannot obtain an extensive set of chemical reactions, because no such set is (yet) open-source. Instead, we are working with what we can find online: we acquire CAS numbers from a chemical wholesaler's website and feed them into a NIST database of chemical reactions. This method, although 'watered-down', is free and provides on the order of 10K reactions.
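Scraped CAS numbers can contain typos or truncated entries, so it may be worth checking them before they are sent to any database. The sketch below validates a CAS Registry Number using its standard check digit (the last digit equals the weighted sum of the other digits, counted from the right, modulo 10). The function name `is_valid_cas` is our own and is not part of any existing tool in this project.

```python
import re

# A CAS Registry Number has the form NNNNNNN-NN-N:
# 2 to 7 digits, then 2 digits, then a single check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Return True if `cas` is well-formed and its check digit matches."""
    match = CAS_PATTERN.match(cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit
    # of the body (the check digit itself is excluded).
    total = sum(pos * int(d) for pos, d in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


if __name__ == "__main__":
    for candidate in ["7732-18-5", "64-17-5", "7732-18-4", "not-a-cas"]:
        print(candidate, is_valid_cas(candidate))
```

A check like this only catches malformed identifiers. It says nothing about whether the NIST database holds any reactions for a given compound.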

Ideally, we would like to obtain as complete a set of chemical reactions as possible, on the order of 1-10 million reactions, like that of the Chematica software. We have some ideas for how to freely and openly obtain a more complete set of reactions. If you have any ideas or would like to help, please reach out to us.

  Stage II

Once reactions are obtained, we can build the network. We will proceed with this step even with the smaller number of reactions. This involves some data wrangling to turn the files of reactions into a single data file in a format easily digestible by Pajek, the software we will use to visualize the network.
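As a rough illustration of that data-wrangling step, the sketch below writes a list of reactions in Pajek's plain-text .net layout (a `*Vertices` section followed by an `*Arcs` section). It assumes that molecules are vertices and that every reactant-product pair in a reaction becomes one directed arc. The input structure, a list of `(reactants, products)` tuples, and the function name `reactions_to_pajek` are hypothetical, and the real reaction files will need their own parser.

```python
def reactions_to_pajek(reactions):
    """Build Pajek .net text from (reactants, products) pairs.

    `reactions` is an iterable of 2-tuples, each holding two lists of
    molecule labels. This input layout is an assumption for illustration.
    """
    index = {}    # molecule label -> 1-based vertex id (Pajek ids start at 1)
    arcs = set()  # a set, so repeated reactant-product pairs are stored once
    for reactants, products in reactions:
        for label in list(reactants) + list(products):
            index.setdefault(label, len(index) + 1)
        for r in reactants:
            for p in products:
                arcs.add((index[r], index[p]))

    lines = [f"*Vertices {len(index)}"]
    lines += [f'{vid} "{label}"' for label, vid in index.items()]
    lines.append("*Arcs")
    lines += [f"{src} {dst}" for src, dst in sorted(arcs)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = [
        (["ethanol"], ["acetaldehyde"]),
        (["acetaldehyde"], ["acetic acid"]),
    ]
    print(reactions_to_pajek(demo), end="")
```

Collapsing duplicate arcs is a design choice made here for simplicity. If the number of reactions linking two molecules matters for later analysis, the arcs could instead carry a weight.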

  Stage III

After the network is constructed, the next step is to validate it extensively against the network of organic chemistry published in one of the original works of the Chematica collaboration, Fial2005. That paper describes in detail the topological properties of the network they constructed from ~6 million reactions. One of the key conclusions of their work, and the one that makes their results applicable to our smaller network, is that the network is scale-free. This means that any subset of the network, regardless of size, should have the same topological characteristics as the full network. The specific topological characteristics we can measure from our network to compare with theirs are, at the node level, the distributions of node connectivity, and at the global level, the average connectivity of the entire network (which, strangely enough, was shown to be very similar to that of the World Wide Web (WWW)).

In the follow-up publication, Bish05, they delve into structural patterns of the network and demonstrate that it has a set of hub molecules dubbed the 'core', surrounded by a periphery of molecules which interact frequently with the core, all surrounded by islands of molecules which interact infrequently with the core and periphery, but relatively frequently with themselves. They lay out quantifiable characteristics of these three levels of structure, which gives us another way to compare our network to theirs.

If the scale-free conclusion is in fact true, and if the reactions obtained from the NIST site represent a random sample of all chemical reactions, then we would expect our results to match theirs for whichever metrics we measure. Each metric we intend to test is laid out in the Project Details section and also in the literature reviews found in the References tab.
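The node-level and global metrics mentioned above can be computed directly from a list of directed arcs. The sketch below tallies in- and out-degree histograms and the mean connectivity, and adds a rough power-law exponent estimate using the standard maximum-likelihood approximation for discrete data, alpha ≈ 1 + n / Σ ln(k / (k_min − 0.5)). The function names and the choice of `k_min` are our assumptions, and this is not necessarily the fitting procedure used in Fial2005.

```python
import math
from collections import Counter


def degree_statistics(arcs):
    """Return degree histograms and mean degree for directed (src, dst) arcs."""
    arcs = list(arcs)
    in_deg, out_deg = Counter(), Counter()
    nodes = set()
    for src, dst in arcs:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))
    n = len(nodes)
    return {
        "nodes": n,
        "arcs": len(arcs),
        # Every arc adds one in-degree and one out-degree, so the mean
        # in-degree and mean out-degree both equal arcs / nodes.
        "mean_degree": len(arcs) / n if n else 0.0,
        # Histograms map degree k -> number of nodes with that degree.
        "in_hist": Counter(in_deg[v] for v in nodes),
        "out_hist": Counter(out_deg[v] for v in nodes),
    }


def power_law_exponent(degrees, k_min=1):
    """Approximate MLE of alpha for P(k) ~ k^(-alpha), using k >= k_min."""
    tail = [k for k in degrees if k >= k_min]
    if not tail:
        raise ValueError("no degrees at or above k_min")
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum


if __name__ == "__main__":
    stats = degree_statistics([("a", "b"), ("a", "c"), ("b", "c")])
    print(stats["mean_degree"], dict(stats["in_hist"]), dict(stats["out_hist"]))
```

With only ~10K reactions the tail of the distribution will be sparse, so any fitted exponent should be read as a rough comparison point rather than a precise measurement.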

  Stage IV

Once the network is validated, we can use it to identify short and efficient reaction pathways. We will attempt to implement the algorithm published by the Chematica collaboration in their 2012 publication and its supplemental information, Goth2012. There is one snag: they did not publish an important portion of their algorithm, namely a matrix of interaction rules for the few hundred most common functional groups in organic chemistry. This matrix is essential for finding the optimal reaction path for one-pot reactions, where one wants to ensure that certain chemicals are nonreactive with one another to prevent unwanted reactions. Building it is a task that requires the expertise of an organic chemist. If you are an organic chemist and would like to help with this step, we would be happy to hear from you. If we are unable to recreate the non-public portions of the algorithm, we will attempt to implement whatever is available to us in a consistent fashion. That's why Stage IV.a in the diagram above is not shown as a required precursor in the workflow.
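To make the idea of a "short" pathway concrete, the sketch below finds the route with the fewest reaction steps between two molecules using a plain breadth-first search over the directed network. This is a stand-in technique, not the algorithm from Goth2012: it ignores cost, yield, and functional-group compatibility, and it treats each arc as a single-reactant step. The function name `shortest_route` is hypothetical.

```python
from collections import deque


def shortest_route(arcs, start, target):
    """Return the fewest-step path from start to target, or None.

    `arcs` is an iterable of directed (reactant, product) pairs.
    """
    successors = {}
    for src, dst in arcs:
        successors.setdefault(src, []).append(dst)

    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the start to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in successors.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


if __name__ == "__main__":
    network = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "E"), ("E", "C")]
    print(shortest_route(network, "A", "C"))
```

A more faithful implementation would likely replace the step count with a cost function and filter candidate routes with the functional-group interaction rules once those are available.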

  Stage V

The final step will be to make this model portable and user-friendly. This will require the assistance of a developer. The overall intention is to make this software open-source and fully maintained. If you are a software developer and would like to help with this step, we would be happy to hear from you.

For more detail on each stage, see the Project Details tab above.