Workflow.gif

Stage I. Obtain a set of chemical reactions

For now this is done in a 'watered-down' way, until a more complete set of reactions can be obtained. The present approach has two steps: first obtain CAS numbers by scraping the website of a chemical distributor, then use those CAS numbers to retrieve a set of reactions from an open-source NIST database. Both steps are described in detail below.

Stage I.a Obtain CAS Numbers

CAS numbers were scraped from the Chemnet website. The code for this was written in Python and is available at the 4TV GitHub page; all scraping was done using the file scrape.py found there. For each entry, the CAS number, the chemical name, and the molecular formula were scraped. Some entries lack a name or formula because Chemnet does not always list them. The table below shows the number of CAS numbers listed on Chemnet and the number actually downloaded, grouped by the first digit of the CAS number.

| CAS Number Range | Number on Chemnet | Number Scraped |
|---|---|---|
| 100-00-5 to 1954032-35-4 | ~425500 | 425324 |
| 21-19-2 to 299974-85-9 | ~120100 | 120067 |
| 32-07-5 to 399580-63-3 | ~121000 | 120902 |
| 44-30-4 to 499999-99-4 | ~110400 | 110354 |
| 50-00-0 to 599932-28-2 | ~146100 | 145995 |
| 60-00-4 to 6241454-34-6 | ~287700 | 287619 |
| 70-00-8 to 799841-56-8 | ~126200 | 126112 |
| 80-00-2 to 899900-53-9 | ~315100 | 315005 |
| 90-00-6 to 999999-99-4 | ~127700 | 127672 |

The end result was 1779050 CAS numbers. An example of the output is shown below.

cas_screenshot.png
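As a quick check on scraped output like this, CAS numbers can be validated with the standard CAS check-digit rule (the last digit equals the weighted sum of the preceding digits, counted from the right, modulo 10) and then tallied by first digit as in the table. The following is a minimal sketch and is not taken from scrape.py; the function names `is_valid_cas` and `count_by_first_digit` are hypothetical.

```python
import re
from collections import Counter

# A CAS number has 2-7 digits, then 2 digits, then a single check digit.
CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_valid_cas(cas: str) -> bool:
    """Return True if the string is well formed and its check digit matches."""
    match = CAS_PATTERN.fullmatch(cas)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(i * int(d) for i, d in enumerate(reversed(body), start=1))
    return total % 10 == check


def count_by_first_digit(cas_numbers):
    """Tally valid CAS numbers by their first digit (hypothetical helper)."""
    return Counter(cas[0] for cas in cas_numbers if is_valid_cas(cas))


if __name__ == "__main__":
    sample = ["50-00-0", "60-00-4", "100-00-5", "50-00-1"]
    print(count_by_first_digit(sample))
```

Entries that fail the check (such as the last one in the sample) are simply skipped in the tally, which makes it easy to spot scraping or transcription errors.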

Stage I.b Obtain Reactions

Once the CAS numbers were scraped, they were fed into the search form on the NIST Solution Kinetics Database website. A screenshot of the search form is shown below:

nist_screenshot.png

Since each CAS number may be either a reactant or a product in a reaction, the entire list of CAS numbers had to be fed into the search form twice: once in the 'Reactants' field and once in the 'Products' field. The 'Solvents' field was not used. Only a small fraction of the CAS numbers returned a reaction. This was not a surprise, as the NIST database reports containing only ~20K reactions. A screenshot of a portion of the results is shown below.

reactions_results.png

Overall, 4021 reactions were found from the 'Reactant' search, and 3380 reactions from the 'Product' search, yielding a total of 7401 reactions obtained from the NIST database.

The code for this was written in Python and is available at the 4TV GitHub page. All scraping was done using the file crawl.py found there, which relies on the modules in nist_modules.py.
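The two-pass search described above can be sketched as a loop over both roles. This is only an illustration of the control flow, not the contents of crawl.py: the `search` callable is an assumed stand-in for whatever submits a CAS number to the search form, and `fake_search` with its toy data is entirely hypothetical so that the example runs without network access.

```python
from typing import Callable, Dict, Iterable, List


def collect_reactions(
    cas_numbers: Iterable[str],
    search: Callable[[str, str], List[str]],
) -> Dict[str, List[str]]:
    """Query every CAS number once as a reactant and once as a product."""
    cas_list = list(cas_numbers)  # allow two passes over the same input
    results: Dict[str, List[str]] = {"reactant": [], "product": []}
    for role in results:
        for cas in cas_list:
            # Most CAS numbers are expected to return an empty list.
            results[role].extend(search(cas, role))
    return results


def fake_search(cas: str, role: str) -> List[str]:
    """Hypothetical stand-in for a database query, using made-up toy data."""
    toy_db = {
        ("50-00-0", "reactant"): ["reaction A"],
        ("50-00-0", "product"): ["reaction B", "reaction C"],
    }
    return toy_db.get((cas, role), [])


if __name__ == "__main__":
    found = collect_reactions(["50-00-0", "60-00-4"], fake_search)
    print({role: len(rows) for role, rows in found.items()})
```

Keeping the reactant and product results in separate lists mirrors the separate totals reported for the two searches.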

Stage II. Construct Network

This step is quite brief: it simply entails concatenating the product and reactant files obtained in the previous step and assigning each reaction a unique ID. The overall structure of the file is the same as in the results screenshot shown above. The structure of the ID is meaningful: it encodes the first digit of the CAS number that was searched, whether that CAS number was searched as a product or a reactant (in the NIST database), and the reaction's overall count in the total number of reactions.
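A minimal sketch of this concatenate-and-label step follows. The ID layout used here (first CAS digit, then "R" or "P", then a zero-padded running count) is an assumption chosen for illustration; the text above says which pieces of information the ID encodes but not its exact format, and the function name `assign_ids` is hypothetical.

```python
def assign_ids(reactant_rows, product_rows):
    """Concatenate both result sets and give each reaction a unique ID.

    Each row is a (searched_cas, reaction) pair. The ID format below is
    an assumed layout, not necessarily the one used in the project.
    """
    network = []
    for role_tag, rows in (("R", reactant_rows), ("P", product_rows)):
        for searched_cas, reaction in rows:
            count = len(network) + 1  # overall count across both files
            reaction_id = f"{searched_cas[0]}{role_tag}{count:05d}"
            network.append((reaction_id, reaction))
    return network


if __name__ == "__main__":
    reactants = [("50-00-0", "a -> b")]
    products = [("100-00-5", "c -> d")]
    for reaction_id, reaction in assign_ids(reactants, products):
        print(reaction_id, reaction)
```

Because the count runs over the concatenated list, every ID is unique even when the same reaction was found through both searches.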

Stage III. Sanity Checks

Now that the network is constructed, it can be tested against a well-known network of chemical reactions. We compare it to the network described in the Chematica literature. For an overview of the relevant literature, please see the References tab above, which provides in-depth descriptions of the network by various researchers as well as summaries of the papers relevant to this project. For completeness, the following table lists all of the quantities and figures we aim to reproduce with our network. The goal is to test whether the topological properties of our network match those of the much larger network obtained by the Chematica collaboration, which benefited from the very extensive Reaxys database.

1. write brief overviews of paper 1 and 2 and post in references tab.
2. make table of plots and results to check from 1 i.e.

  • gamma values (paper 1)
  • overall <k> values (paper 1)
  • islands (paper 2)
  • anything else?

think about starting the section on the algorithm.
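As a starting point for the checks listed above, here is a rough sketch of how the three quantities could be computed from an edge list. Several assumptions are made: the network is represented as directed (reactant, product) pairs, <k> is taken as edges per node, "islands" are counted as weakly connected components, and gamma is estimated with the approximate maximum-likelihood formula of Clauset, Shalizi and Newman for discrete power laws. That estimator is swapped in for illustration and is not necessarily the method used in the Chematica papers; all function names are hypothetical.

```python
import math
from collections import Counter


def mean_degree(edges):
    """Average number of directed edges per node (assumed meaning of <k>)."""
    nodes = {node for edge in edges for node in edge}
    return len(edges) / len(nodes) if nodes else 0.0


def out_degrees(edges):
    """Out-degree of every node that has at least one outgoing edge."""
    return list(Counter(src for src, _ in edges).values())


def count_islands(edges):
    """Count weakly connected components with a simple union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(node) for node in list(parent)})


def estimate_gamma(degrees, k_min=1):
    """Approximate MLE of a power-law exponent for degrees >= k_min."""
    ks = [k for k in degrees if k >= k_min]
    if not ks:
        raise ValueError("no degrees at or above k_min")
    return 1.0 + len(ks) / sum(math.log(k / (k_min - 0.5)) for k in ks)


if __name__ == "__main__":
    toy_edges = [("A", "B"), ("B", "C"), ("D", "E")]  # made-up toy network
    print(mean_degree(toy_edges), count_islands(toy_edges))
    print(estimate_gamma(out_degrees(toy_edges)))
```

On a real network the choice of `k_min` matters a great deal for the gamma estimate, so it should be reported alongside any value compared with the literature.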