Downloadable Structure Files of NCI Open Database Compounds

Note to Windows users: While downloading with Netscape on Unix platforms usually works flawlessly, we've received reports (and have confirmed in very limited tests) that in Windows, using Netscape with the "Save Link As..." option may produce corrupted binary files. In these cases, you may want to try Internet Explorer with the option "Save Target As..." for downloading. Because of the file name extension used (.sdz), you may have to either rename the downloaded binary file or open it manually with a program such as WinZip.
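One way to check whether a downloaded file arrived intact is to decompress it completely and see whether gzip reports an error. The following is a minimal sketch, not part of the original service; the function name is_intact_gzip is our own, and it relies only on the fact (stated in the notes on this page) that the .sdz files are ordinary gzip files under a different extension.

```python
import gzip
import zlib


def is_intact_gzip(path, chunk_size=1 << 20):
    """Return True if the file at `path` decompresses cleanly as gzip."""
    try:
        with gzip.open(path, "rb") as stream:
            # Reading to the end makes gzip check its CRC-32 and length trailer.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # Not a gzip file, truncated download, or corrupted data.
        return False
    return True
```

A result of False suggests the download should be repeated with a different browser or option.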
 

Release 2 Files

August 2000 2D File

The "raw" structure data that were used to build Release 2 of this service. These are 250,251 2D structures calculated with CACTVS. Attention: Stereochemistry assigned by CACTVS according to default rules due to lack of stereochemical information in the original NCI data. The SMILES string and the CAS RN (where available) are also included for each structure.

August 2000 SMILES Strings

A SMILES version of the 250,251 August 2000 structures. These are Unique SMILES (USMILES) strings, calculated according to Daylight's original (1989) canonicalization rules. (These rules have been changed in the meantime, but are not published.)

New Structures Only

These are 1,170 structures that were not in the previous (October 1999) release. This file may be most interesting for those who have already downloaded the previous structure file(s) and only need the difference set. It contains 3D coordinates calculated by the program CORINA. Please note the same warning regarding stereochemistry as for the large 3D file (see below).


 

"0D"

The "raw" structure data that were used to build this service, plus about 2,900 new structures. These are 249,081 "0D" structures (i.e. all coordinates set to 0.0) as of October 1999 in SDF format, in one file compressed with the widely available program gzip.

SMILES

A SMILES version of the structures (i.e. the above "0D" dataset) that were used to build this service, plus about 2,900 new structures. These are 249,081 structures as of October 1999 in SMILES format, in one file compressed with the widely available program gzip. SMILES strings were generated with the help of CACTVS. (This is a newly generated dataset and therefore not guaranteed to contain SMILES strings identical, for each compound, with those in previous SMILES string files, such as downloadable data from DTP.)

2D

2D version of NCI Open Database compounds as of October 1999. 2D coordinates (essentially structure drawings) calculated with CACTVS. Attention: Stereochemistry assigned by CACTVS according to default rules due to lack of stereochemical information in the original NCI data. (See also the 3D section.)

2D + Biological Data

2D versions of NCI Open Database compounds as of October 1999, with biological test data added. These data are publicly available from the DTP Human Tumor Cell Line Screen and/or the DTP AIDS Antiviral Screen. 2D coordinates (essentially structure drawings) calculated with CACTVS. Attention: Stereochemistry assigned by CACTVS according to default rules due to lack of stereochemical information in the original NCI data. (See also the 3D section.)

3D

A 3D version of the 0D file, containing 249,071 structures as of October 1999. The program CORINA v. 1.7 was used to generate the 3D coordinates. Please note that, just as with the 3D results provided by the Enhanced NCI Database Browser, stereochemistry of chiral compounds is not guaranteed to be correct due to the lack of stereochemical information in the original data. This is not a shortcoming of CORINA. Please also note that, as of now, the 3D structures in this bulk file were not generated with the same version of CORINA as is used in the Browser, the latter being somewhat newer. This file is the result of a one-time conversion; no efforts have been undertaken to compare the conformations in it with those you obtain from the Browser (although we don't necessarily expect huge differences).

 

Notes:


All these files are based on the publicly and freely available data from NCI's Developmental Therapeutics Program (DTP). We collected the structures and biological data from DTP, combined them where applicable, and generated SMILES and MDL SD files from this information.

These files were compressed with the program gzip. This program is available for many platforms, and comes preloaded on most recent versions of many major varieties of Unix. To keep web browsers from trying to uncompress a file with the extension ".gz" on the fly and display it on your screen (!), the downloadable files were renamed to NCInDA99.sdz (n = 0, 2, 3 for the 0D, 2D, 3D file, respectively; "A99" stands for October 1999 [with hexadecimal notation for the month]), CAN2DA99.sdz, AID2DA99.sdz etc.
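The naming scheme can be decoded mechanically. The sketch below is only an illustration: the pattern is inferred from the names quoted on this page (three-letter prefix, dimensionality digit, "D", one hexadecimal month digit, two-digit year), the function name parse_sdz_name is hypothetical, and the rule for expanding two-digit years is our assumption.

```python
import re

# Inferred from names such as NCI3DA99.sdz, CAN2DA99.sdz, NCI2D397.sdz.
_NAME = re.compile(r"^([A-Z]{3})([0-9])D([1-9A-C])([0-9]{2})\.sdz$", re.IGNORECASE)


def parse_sdz_name(filename):
    """Split a download file name into prefix, dimensionality, month, year."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError("unrecognized file name: %r" % filename)
    prefix, dim, month_hex, yy = match.groups()
    month = int(month_hex, 16)  # "A" -> 10 (October), "3" -> 3 (March)
    # Assumption: two-digit years 90-99 mean 19xx, everything else 20xx.
    year = 1900 + int(yy) if int(yy) >= 90 else 2000 + int(yy)
    return {
        "prefix": prefix.upper(),
        "dimension": int(dim),
        "month": month,
        "year": year,
    }
```

For example, parse_sdz_name("NCI3DA99.sdz") gives dimension 3, month 10 and year 1999.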

You may have to rename them to NCInDA99.sdf.gz etc. before gunzip'ing them. If you (have to) rename them to NCInDA99.gz, gunzip will uncompress them to a file named NCInDA99, unless you use the gunzip option "-N", which will restore the name NCInDA99.sdf. (These file names were chosen to conform to the 8.3 file name convention for those users that may download, e.g., to DOS-type FAT16 file systems. This practice may be discontinued in the future.)
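If gunzip is not at hand, or renaming is inconvenient, the files can also be unpacked without renaming. The following is a minimal sketch using Python's standard gzip module; the function name unpack_sdz is our own, and the choice of writing NAME.sdf next to NAME.sdz is an assumption made for illustration.

```python
import gzip
import shutil
from pathlib import Path


def unpack_sdz(sdz_path):
    """Decompress NAME.sdz into NAME.sdf in the same directory."""
    source = Path(sdz_path)
    target = source.with_suffix(".sdf")
    # Stream the data so that large files need not fit into memory.
    with gzip.open(source, "rb") as compressed, open(target, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return target
```

Calling unpack_sdz("NCI3DA99.sdz") would then leave the uncompressed data in NCI3DA99.sdf.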

All files (after decompression) are in MDL's SDFile format with two identification fields:

In the 2D files with biological data, you'll find the following additional fields (not necessarily present in all files for all compounds): For more explanation on these data, in particular the meaning of the column headings, please see the Web pages of the DTP Human Tumor Cell Line Screen and/or the DTP AIDS Antiviral Screen.
Please also note that no editing of the biological test data has been performed. This means that all DTP results for which the chemical structure is available have been included. This includes data from "non-production" cell lines, i.e. cell lines that were used only a short time during test phases, as well as data from those ten cell lines that were replaced by a new block of ten around 1992. It is up to the user to do their own evaluation, statistics, and, if necessary, (pre-)processing, of these data before using them for any purpose.
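Because not every field is present for every compound, it can help to look at the data fields record by record before any processing. The sketch below reads the data items of an SD file, relying only on the general SDFile layout (a "> <NAME>" header line, value lines, a blank line, and "$$$$" closing each record). The function name iter_sd_fields is hypothetical, and the field names in any real file have to be taken from the file itself.

```python
import re

# Data header lines look like "> <NAME>" (possibly with extra text).
_HEADER = re.compile(r"^>.*<([^<>]+)>")


def iter_sd_fields(lines):
    """Yield a {field_name: value} dict for each record of an SD file."""
    fields, name, values = {}, None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip() == "$$$$":  # end of the current record
            if name is not None:
                fields[name] = "\n".join(values)
            yield fields
            fields, name, values = {}, None, []
            continue
        header = _HEADER.match(line)
        if header:
            if name is not None:
                fields[name] = "\n".join(values)
            name, values = header.group(1), []
        elif name is not None:
            if line.strip():
                values.append(line)
            else:  # a blank line closes the data item
                fields[name] = "\n".join(values)
                name = None
```

Passing an open text file to iter_sd_fields and counting the keys of each yielded dict shows how often each field actually occurs.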

In the 3D file, hydrogens were added by CORINA, whereas they are not present in the 0D and 2D files. In the 2D files, the stereochemistry shown is in fact meaningless, since it was decided upon at random. This cannot easily be changed.

In previous versions of this page, the 0D information was called "2D". This has been changed to avoid confusion with the new 2D information added. The file that was previously called NCI2D397.sdz is therefore mostly identical with the new file NCI0DA99.sdz with the exception of the newly added compounds.
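Whether a given record carries 0D, 2D or 3D coordinates can be checked from the atom block itself. The following sketch assumes the fixed-column V2000 molfile layout (atom count in the first three columns of the fourth line; x, y, z in three 10-character columns of each atom line); the function name coordinate_dimension is our own. Note that a genuinely 3D but perfectly planar structure lying in the z = 0 plane would be reported as 2D.

```python
def coordinate_dimension(molblock):
    """Classify a V2000 mol block: 0 (all zero), 2 (all z zero) or 3."""
    lines = molblock.splitlines()
    n_atoms = int(lines[3][0:3])  # counts line: first field is the atom count
    coords = [
        (float(line[0:10]), float(line[10:20]), float(line[20:30]))
        for line in lines[4 : 4 + n_atoms]
    ]
    if all(x == 0.0 and y == 0.0 and z == 0.0 for x, y, z in coords):
        return 0
    if all(z == 0.0 for _, _, z in coords):
        return 2
    return 3
```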

The sizes listed for the uncompressed files are in "real" MB, i.e. 1024 x 1024 bytes.
 

Our 249,081 structure set is a combination of three sets:

1) the March 1997 set, still downloadable here as NCI3D397.sdz
2) 689 supplemental structures selected from the DTP Human Tumor Cell Line Screen 3D SD files as of August 1999
3) 2,212 supplemental structures selected from DTP AIDS Antiviral Screen 3D SD files as of October 1999.
Our 2D files with AIDS data contain 2 more structures than the file available at the DTP AIDS Antiviral Screen.
The DTP Human Tumor Cell Line Screen biological data file contains cancer screen data for 370 more entries for which we don't have the structure (these structures are not available on the DTP site).
For 10 out of the 249,081 structures, the 3D generation process failed.
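The counts quoted on this page can be cross-checked with a few lines of arithmetic. In the sketch below, all input numbers are taken from this page; the size of the March 1997 set is not stated here and is derived under the assumption that the three sets do not overlap.

```python
total_oct_1999 = 249_081     # structures in the October 1999 set
cancer_supplement = 689      # from the Human Tumor Cell Line Screen files
aids_supplement = 2_212      # from the AIDS Antiviral Screen files
failed_3d = 10               # structures for which 3D generation failed
total_aug_2000 = 250_251     # structures in the August 2000 release

# 689 + 2,212 = 2,901, i.e. the "about 2,900 new structures".
new_in_oct_1999 = cancer_supplement + aids_supplement
# Derived value (assumes no overlap between the three sets).
march_1997_size = total_oct_1999 - new_in_oct_1999
# Size of the 3D file after the 10 failures.
structures_in_3d_file = total_oct_1999 - failed_3d
# Size of the "New Structures Only" difference set.
new_in_aug_2000 = total_aug_2000 - total_oct_1999
```

The results agree with the 249,071 structures given for the 3D file and the 1,170 structures given for the difference set.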

Acknowledgments

All the SD files were prepared with the help of the SDF_toolkit. Thanks to Bruno Bienfait for both the toolkit and this work.

We gratefully acknowledge Prof. Gasteiger's group at the Computer Chemistry Center (CCC), Institute of Organic Chemistry, University of Erlangen-Nuremberg, Germany, for providing us with their program CORINA, and help with the database conversion.



Last change: M. C. Nicklaus, 2001-07-06