Downloadable Structure Files of NCI Open Database Compounds
Note to Windows users: While downloading with Netscape on Unix platforms usually
works flawlessly, we've received reports (and have confirmed in very limited tests)
that in Windows, using Netscape with the "Save Link As..." option may produce
corrupted binary files. In these cases, you may want to try Internet Explorer
with the option "Save Target As..." for downloading. Because of the file name/extension
used (.sdz), you may have to either rename the downloaded binary file or open
it manually with a program such as WinZip or similar.
Release 2 Files
August 2000 2D File
The "raw" structure data that were used to build Release 2 of this service.
These are 250,251 2D structures calculated with
CACTVS.
Attention:
Stereochemistry assigned by CACTVS according to default rules due to lack of stereochemical
information in the original NCI data. The SMILES string and the CAS RN (where
available) are also included for each structure.
250,251 2D structures in SDF
format. WARNING: This is a 90 MB file that uncompresses to about 982
MB! Use the "Save Link As..." (Netscape) or "Save Target As..." (IE)
option of your web browser to download the file. It has the name NCI_aug00_2D.sdz.
To uncompress, rename the file to something like "NCI_aug00_2D.sdf.gz" and gunzip it.
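As an alternative to renaming the file and running gunzip, the download can be uncompressed from a script. The following is a minimal sketch in Python, assuming only that the .sdz file is an ordinary gzip stream as described on this page. The function name gunzip_sdz and the output file name are illustrative choices, not part of this service. The data are copied in chunks, so the roughly 982 MB of output never has to fit in memory.

```python
import gzip
import shutil


def gunzip_sdz(src_path, dst_path, chunk_size=1024 * 1024):
    """Uncompress a gzip-compressed .sdz download to dst_path.

    Python's gzip module reads the stream regardless of the file
    extension, so the .sdz file does not need to be renamed first.
    """
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Stream in fixed-size chunks instead of reading everything at once.
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path


# Example call (file names as used on this page):
# gunzip_sdz("NCI_aug00_2D.sdz", "NCI_aug00_2D.sdf")
```

The same function should work for the other .sdz files listed here, since they are all described as gzip-compressed.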
August 2000 SMILES Strings
A SMILES version of the 250,251 August 2000 structures. These are Unique SMILES
(USMILES) strings, calculated according to Daylight's original (1989) canonicalization
rules. (These rules have been changed in the meantime, but are not published.)
250,251 structures in USMILES format.
Caution: This is a 2.9 MB file that uncompresses to about 13 MB.
Use the "Save Link As..." (Netscape) or "Save Target As..." (IE)
option of your web browser to download the file. It has the name NCI_aug00_SMI.sdz.
To uncompress, rename the file to something like "NCI_aug00_SMI.gz" and gunzip it.
New Structures Only
These are 1,170 structures that were not in the previous (October 1999) release.
This file may be most interesting for those who have already downloaded the previous
structure file(s) and only need the difference set.
It contains 3D coordinates calculated by the program
CORINA.
Please note the same warning regarding stereochemistry as for the large
3D file (see below).
1,170 new 3D structures in SDF
format. Note: This is a 0.8 MB file that uncompresses to about 5.7
MB. Use the "Save Link As..." (Netscape) or "Save Target As..." (IE)
option of your web browser to download the file. It has the name NCI_new_oct99-aug00_3D.sdz.
To uncompress, rename the file to something like "NCI_new_oct99-aug00_3D.sdf.gz" and gunzip it.
"0D"
The "raw" structure data that were used to build this service,
plus about 2,900 new structures. These
are 249,081 "0D" structures (i.e. all coordinates set to 0.0) as of October
1999 in SDF format, in one file compressed with the widely available program
gzip.
249,081 0D structures in SDF format. Caution:
This is a 16.5 MB file that uncompresses to about 380 MB! Use the "Save
Link As..." (Netscape) or "Save Target As..." (IE) option of your web browser
to download the file.
SMILES
A SMILES version of the structures (i.e. the above "0D" dataset)
that were used to build this service, plus about 2,900 new structures. These
are 249,081 structures as of October 1999 in
SMILES format,
in one file compressed with the widely available program gzip.
SMILES strings were generated with the help of
CACTVS.
(This is a newly generated dataset, so its SMILES strings are not guaranteed
to be identical, compound by compound, to those in previous
SMILES string files, such as the downloadable data from
DTP.)
249,081 structures in SMILES format. Caution:
This is a 3.2 MB file that uncompresses to about 18.5 MB. Use the "Save
Link As..." (Netscape) or "Save Target As..." (IE) option of your web browser
to download the file.
2D
2D version of NCI Open Database compounds as of October 1999. 2D coordinates
(essentially structure drawings) calculated with
CACTVS.
Attention:
Stereochemistry assigned by CACTVS according to default rules due to lack of stereochemical
information in the original NCI data. (See also the 3D section.)
249,081 2D structures in SDF
format. WARNING: This is a 40 MB file that uncompresses to about 527
MB! Use the "Save Link As..." (Netscape) or "Save Target As..." (IE)
option of your web browser to download the file.
2D + Biological Data
2D versions of NCI Open Database compounds as of October 1999, with
biological test data added. These data are publicly available from
the DTP Human
Tumor Cell Line Screen and/or the DTP
AIDS Antiviral Screen. 2D coordinates (essentially structure
drawings) calculated with CACTVS.
Attention: Stereochemistry assigned by CACTVS according to default rules due to lack
of stereochemical information in the original NCI data. (See also the 3D
section.)
3D
A 3D version of the 0D file, containing 249,071
structures as of October 1999. The program
CORINA
v. 1.7 was used to generate the 3D coordinates. Please note that, just
as with the 3D results provided by the Enhanced NCI Database Browser, stereochemistry
of chiral compounds is not guaranteed to be correct due to the lack of
stereochemical information in the original data. This is not a shortcoming
of CORINA. Please also note that, as of now, the 3D structures in this
bulk file were not generated with the same version of CORINA as is used
in the Browser, the latter being somewhat newer. This file is the result
of a one-time conversion; no efforts have been undertaken to compare the
conformations in it with those you obtain from the Browser (although we
don't necessarily expect huge differences).
249,071 3D structures in SDF format.
WARNING:
This is a 127 MB file that uncompresses to about 574 MB! Use the "Save
Link As..." (Netscape) or "Save Target As..." (IE) option of your web browser
to download the file.
Notes:
All these files are based on the publicly and freely
available data from NCI's Developmental
Therapeutics Program (DTP). We collected the structures and biological
data from DTP, combined them where applicable, and generated SMILES and
MDL SD files from this information.
These files were compressed with the program gzip. This program
is available for many platforms and comes preinstalled on most recent
versions of the major varieties of Unix. Some web browsers try to uncompress
a file with the extension ".gz" on the fly and display it on your screen (!).
To prevent this, the downloadable files were named NCInDA99.sdz (n = 0, 2, 3 for the 0D, 2D, 3D file,
respectively; "A99" stands for October 1999 [with hexadecimal notation
for the month]), CAN2DA99.sdz, AID2DA99.sdz etc.
You may have to rename them to NCInDA99.sdf.gz etc. before gunzip'ing
them. If you (have to) rename them to NCInDA99.gz, gunzip will uncompress
them to a file named NCInDA99, unless you use the gunzip option "-N", which
will restore the name NCInDA99.sdf. (These file names were chosen to conform
to the 8.3 file name convention for those users who may download, e.g.,
to DOS-type FAT16 file systems. This practice may be discontinued in the future.)
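The naming convention just described (dimensionality digit, then "D", then a hexadecimal month and a two-digit year) can be decoded mechanically. The sketch below is an illustration only: the function name parse_release_name and the returned dictionary keys are made up for this example, and the three-letter prefix pattern is inferred from the names NCI, CAN and AID mentioned above.

```python
import re

# Names such as NCI0DA99, CAN2DA99 or NCI3D397:
# 3-letter prefix, dimensionality digit, "D", hexadecimal month (1-C), 2-digit year.
_NAME_RE = re.compile(r"^([A-Z]{3})([0-9])D([1-9A-C])([0-9]{2})$")


def parse_release_name(filename):
    """Decode an 8.3-style release file name into its components."""
    stem = filename.split(".", 1)[0].upper()
    match = _NAME_RE.match(stem)
    if match is None:
        raise ValueError("not an NCInDmyy-style name: %r" % filename)
    prefix, dim, month_hex, yy = match.groups()
    return {
        "prefix": prefix,
        "dimension": int(dim),        # 0, 2 or 3 for the 0D, 2D, 3D files
        "month": int(month_hex, 16),  # "A" -> 10, i.e. October
        "year": 1900 + int(yy),       # assumption: these names all date from the 1900s
    }


# parse_release_name("NCI0DA99.sdz")
# gives prefix "NCI", dimension 0, month 10, year 1999
```

Newer files such as NCI_aug00_2D.sdz do not follow this pattern, and the function raises ValueError for them.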
All files (after decompression) are in MDL's SDFile format with two
identification fields:
- NSC - the NCI's internal identification number of the database entry
- CAS_RN - the CAS Registry Number. Present with a value other than 999-99-9
(dummy value) only for those compounds for which it was entered in the
NCI database. (A CAS_RN of 999-99-9 does not necessarily mean that the
compound has no CAS Registry Number; it just was not entered in the NCI database.)
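To illustrate how the two identification fields might be read, here is a minimal sketch in plain Python with no chemistry toolkit. It assumes the usual SDFile layout, in which a data item starts with a "> <NAME>" header line, ends with a blank line, and each record ends with "$$$$". The function names are hypothetical, and the code does not validate the structure block itself.

```python
import re

DUMMY_CAS_RN = "999-99-9"  # placeholder value described above


def iter_sdf_records(lines):
    """Yield one list of lines per SDF record; records end with '$$$$'."""
    record = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)


def read_data_fields(record):
    """Return {field name: value} for the '> <NAME>' data items of a record.

    Simplification: any line starting with '>' is treated as a data header.
    """
    fields = {}
    name = None
    for line in record:
        if line.startswith(">"):
            match = re.search(r"<([^>]+)>", line)
            name = match.group(1) if match else None
            if name is not None:
                fields[name] = []
        elif name is not None:
            if line == "":
                name = None  # a blank line closes the current data item
            else:
                fields[name].append(line)
    return {key: "\n".join(value) for key, value in fields.items()}


def get_ids(record):
    """Return (NSC, CAS_RN), with CAS_RN set to None for the dummy value."""
    fields = read_data_fields(record)
    nsc = fields.get("NSC")
    cas = fields.get("CAS_RN")
    if cas is not None and cas.strip() == DUMMY_CAS_RN:
        cas = None
    return nsc, cas
```

Because iter_sdf_records accepts any iterable of lines, it can be fed an open text file directly, which avoids loading a file of several hundred MB into memory.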
In the 2D files with biological data, you'll find the following additional
fields (not necessarily present in all files for all compounds):
- NLOGGI50 - Log GI50 data, comprising the following columns:
CONCUNIT, LCONC, PANEL, CELL, PANELNBR, CELLNBR, NLOGGI50, INDN, TOTN
- NLOGTGI - Log TGI data, comprising the following columns:
CONCUNIT, LCONC, PANEL, CELL, PANELNBR, CELLNBR, NLOGTGI, INDN, TOTN
- NLOGLC50 - Log LC50 data, comprising the following columns:
CONCUNIT, LCONC, PANEL, CELL, PANELNBR, CELLNBR, NLOGLC50, INDN, TOTN
- NCI_AIDS_Antiviral_Screen_Conclusion - AIDS screening result (CI = Confirmed Inactive,
CM = Confirmed Moderate[ly active], CA = Confirmed Active)
- NCI_AIDS_Antiviral_Screen_EC50 - AIDS EC50 result, comprising the following columns:
HiConc, ConcUnit, Flag, EC50, NumExp.
Note that for some compounds, the EC50 has been measured more than once.
- NCI_AIDS_Antiviral_Screen_IC50 - AIDS IC50 result, comprising the following columns:
HiConc, ConcUnit, Flag, IC50, NumExp.
Note that for some compounds, the IC50 has been measured more than once.
For more explanation on these data, in particular the meaning of the column
headings, please see the Web pages of the DTP
Human Tumor Cell Line Screen and/or the DTP
AIDS Antiviral Screen.
Please also note that no editing of the biological test data has been
performed. This means that all DTP results for which the chemical
structure is available have been included. This includes data from "non-production"
cell lines, i.e. cell lines that were used only a short time during test
phases, as well as data from those ten cell lines that were replaced by a new
block of ten around 1992. It is up to the user to do their own evaluation,
statistics, and, if necessary, (pre-)processing, of these data before using them for any purpose.
In the 3D file, hydrogens were added by CORINA, whereas they are not
present in the 0D and 2D files. In the 2D files, the stereochemistry
shown is in fact meaningless, since it was decided upon at random. This is not
easily changeable.
In previous versions of this page, the 0D information was called "2D".
This has been changed to avoid confusion with the new 2D information added.
The file that was previously called NCI2D397.sdz is therefore mostly identical
with the new file NCI0DA99.sdz with the exception of the newly added compounds.
The sizes listed for the uncompressed files are in "real" MB, i.e. 1024 x 1024 bytes.
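Since the uncompressed sizes are given in units of 1024 x 1024 bytes, the byte count needed on disk is easy to work out before downloading. The sketch below shows the conversion and a simple free-space check. The function names are illustrative, and the check relies on Python's standard shutil.disk_usage.

```python
import shutil

BYTES_PER_MB = 1024 * 1024  # the "real" MB used on this page


def mb_to_bytes(size_mb):
    """Convert a size in MB (1024 x 1024 bytes) to bytes."""
    return int(round(size_mb * BYTES_PER_MB))


def has_room_for(size_mb, directory="."):
    """Return True if the file system holding `directory` has enough free space."""
    return shutil.disk_usage(directory).free >= mb_to_bytes(size_mb)


# The August 2000 2D file uncompresses to about 982 MB:
# mb_to_bytes(982) == 1029701632
```

Keep in mind that the listed sizes are approximate ("about 982 MB"), so leaving some margin is advisable.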
Our 249,081 structure set is a combination of three sets:
1) the March 1997 set, still downloadable here as
NCI3D397.sdz
2) 689 supplemental structures selected from the
DTP
Human Tumor Cell Line Screen 3D SD files as of August 1999
3) 2,212 supplemental structures selected from
DTP
AIDS Antiviral Screen 3D SD files as of October 1999.
Our 2D files with AIDS data contain 2 more structures than the file available at
the DTP
AIDS Antiviral Screen.
The DTP
Human Tumor Cell Line Screen biological data file contains cancer screen
data for 370 more entries for which we don't have the structure (these structures
are not available on the DTP site).
For 10 out of the 249,081 structures, the 3D generation process failed.
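The structure counts quoted on this page can be cross-checked with simple arithmetic. The short sketch below uses only numbers stated on this page. The function and variable names are illustrative.

```python
def check_structure_counts():
    """Cross-check the structure counts quoted on this page."""
    total_oct99 = 249_081   # October 1999 0D/2D/SMILES set
    total_aug00 = 250_251   # August 2000 (Release 2) set
    supplemental = 689 + 2_212  # cancer screen + AIDS screen supplements
    failed_3d = 10          # structures for which 3D generation failed

    return {
        # "about 2,900 new structures" in the October 1999 set
        "supplemental": supplemental,
        # number of structures in the 3D file
        "structures_3d": total_oct99 - failed_3d,
        # "New Structures Only" file of the August 2000 release
        "new_in_aug00": total_aug00 - total_oct99,
    }


# check_structure_counts() gives
# {"supplemental": 2901, "structures_3d": 249071, "new_in_aug00": 1170}
```

The three results match the figures of about 2,900, 249,071 and 1,170 given in the corresponding sections.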
Acknowledgments
All the SD files were prepared with the help of the
SDF_toolkit. Thanks to Bruno Bienfait
for both the toolkit and this work.
We gratefully acknowledge Prof.
Gasteiger's group at the Computer
Chemistry Center (CCC), Institute of Organic Chemistry, University
of Erlangen-Nuremberg, Germany, for providing us with their program CORINA,
and help with the database conversion.
Last change: M. C. Nicklaus,
2001-07-06