Data Sources

In addition to what is described in this section, to fully understand the data served by the CIP-API please refer to the following sources of information:

These documents not only describe the methodology but also the sources used.

Data sources

The following table details the data sources and versions used by the Tiering Interpretation Service when populating Interpreted Genomes in the CIP-API.

Tiering Interpretation Service (Cancer and Rare Disease)

| Data Source Name | Data Source Reference | Resource Type | Analysis Type (used in: cancer / rare disease / both) | Resource Version |
|---|---|---|---|---|
| ENSEMBL_gene (data loaded in CellBase) | Collection of GTF, FASTA and GFF files for Ensembl: ftp://ftp.ensembl.org/pub/release-90/gtf/homo_sapiens/*90.gtf.gz ; ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/pep/.pep.all.fa.gz ; ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/cdna/.cdna.all.fa.gz ; ftp://ftp.ensembl.org/pub/release-90/regulation/homo_sapiens/MotifFeatures.gff.gz | Genomic entities, e.g. genes, transcripts | Both | 90 |
| ClinVar (data loaded in CellBase) | CellBase ClinVar release data: ftp://ftp.ebi.ac.uk/pub/databases/eva/ClinVar/2015/ClinVar_Traits_EFO_Names_260615.csv ; ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz ; ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variation_allele.txt.gz | Clinically relevant variants | Both | 2019-06 |
| COSMIC (data loaded in CellBase) | https://cancer.sanger.ac.uk/cosmic | Clinically relevant variants | Cancer | v89 |
| 1000 genomes project (data loaded in CellBase) | https://www.internationalgenome.org/about | Variant population frequencies (aka allele frequencies) | Both | Phase3 (original data in assembly GRCh37, lifted over to GRCh38) |
| DiscovEHR (data loaded in CellBase) | http://www.discovehrshare.com/ | Variant population frequencies (aka allele frequencies) | Both | GHS Freeze 50 (original data in assembly GRCh37, lifted over to GRCh38) |
| GoNL (data loaded in CellBase) | http://www.nlgenome.nl/ | Variant population frequencies (aka allele frequencies) | Both | Release 5 (original data in assembly GRCh37, lifted over to GRCh38) |
| gnomAD (data loaded in CellBase) | https://gnomad.broadinstitute.org/ | Variant population frequencies (aka allele frequencies) | Both | 2.0.1 (original data in assembly GRCh37, lifted over to GRCh38) |
| UK10K project (data loaded in CellBase) | https://www.uk10k.org/ | Variant population frequencies (aka allele frequencies) | Both | N/A; data obtained on 2016-02-15 (original data in assembly GRCh37, lifted over to GRCh38) |
| Cancer Analysis Resources | A set of gene lists used in cancer analysis. They are described and versioned at: https://www.genomicsengland.co.uk/about-genomics-england/the-100000-genomes-project/information-for-gmc-staff/cancer-programme/cancer-genome-analysis/ | Additional resources | Cancer | v1.11 |
| CellBase database version | The CellBase database version. Used internally to control the overall version of our data sources | - | Both | 2.4.0 |
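
To make the "Analysis Type" column easier to work with, the table above can be written out as plain records and filtered by analysis type. This is only an illustrative sketch: the `DataSource` class, the `SOURCES` list and the `sources_for` helper are hypothetical names introduced here and are not part of the CIP-API or CellBase. The names, analysis types and versions are copied from the table; the liftover notes are left out for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataSource:
    """One row of the data source table (hypothetical structure, for illustration)."""
    name: str
    resource_type: str
    analysis_type: str  # "Both" or "Cancer", as written in the table
    version: str


FREQ = "Variant Population Frequencies"

# Values transcribed from the table; this list is not served by any API.
SOURCES: List[DataSource] = [
    DataSource("ENSEMBL_gene", "Genomic Entities", "Both", "90"),
    DataSource("ClinVar", "Clinically Relevant Variants", "Both", "2019-06"),
    DataSource("COSMIC", "Clinically Relevant Variants", "Cancer", "v89"),
    DataSource("1000 genomes project", FREQ, "Both", "Phase3"),
    DataSource("DiscovEHR", FREQ, "Both", "GHS Freeze 50"),
    DataSource("GoNL", FREQ, "Both", "Release 5"),
    DataSource("gnomAD", FREQ, "Both", "2.0.1"),
    DataSource("UK10K project", FREQ, "Both", "N/A"),
    DataSource("Cancer Analysis Resources", "Additional Resources", "Cancer", "v1.11"),
    DataSource("CellBase database version", "-", "Both", "2.4.0"),
]


def sources_for(analysis: str) -> List[DataSource]:
    """Return the sources used for 'cancer' or 'rare disease' analysis."""
    wanted = analysis.strip().lower()
    if wanted not in {"cancer", "rare disease"}:
        raise ValueError("analysis must be 'cancer' or 'rare disease'")
    # A source applies if it is marked for this analysis type or for both.
    return [s for s in SOURCES if s.analysis_type.lower() in {"both", wanted}]


if __name__ == "__main__":
    for source in sources_for("rare disease"):
        print(f"{source.name}: {source.version}")
```

Under this reading of the table, rare disease analysis draws on every source marked "Both", while cancer analysis additionally uses COSMIC and the Cancer Analysis Resources.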

Last update: 2023-03-01