What is the easiest way to find transcription start site data?
The problem
You need to find the transcription start site (TSS) of a gene, perhaps for many genes.
The solution
To identify the TSS for many genes (dozens, say), you have two options:
1. Parsing method: You can parse NCBI genome annotation files for the information. As part of the genome annotation process, tab-delimited files are created that give the position of key features in both contig (RefSeq accessions of the format NW_ or NT_) and chromosome coordinates, if applicable.
Process
- Go to ftp://ftp.ncbi.nih.gov/genomes using your browser or FTP tool.
- Find the genome-specific directories of interest.
- Within each directory, click on "maps", then "mapview", then the folder for the current build. In that directory you will find the file "seq_gene.md". The first line in the file names the columns: chrStart, chrEnd, and orientation refer to positions on the chromosome; cnt_start, cnt_stop, and cnt_orient refer to positions on the contigs. Note that both sets of positions are 1-based (i.e., they start at 1, not 0).
- The "gene" lines in this file give the ranges for the gene on the chromosome (as applicable), as well as contig coordinates.
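- Scan the file using, e.g., the UNIX commands gzcat and egrep. Example: gzcat seq_gene.md.gz | egrep "GENE.*reference". This extracts the 'GENE' lines for the reference assembly. (gzcat reads the gzip-compressed download; if your copy is already uncompressed, run egrep on seq_gene.md directly.)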
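The scan can also be scripted. The following is a minimal Python sketch, not a definitive implementation. It assumes the column names given above (chrStart, chrEnd, orientation) plus hypothetical feature_name, feature_type, and group_label columns; check all of them against the header line of your own file. The sample rows are made up for illustration. It takes the 5' end of the gene range as the TSS: the start coordinate for genes on the + strand and the end coordinate for genes on the - strand. For genes with several transcripts this is only an approximation.

```python
import re

# Assumed column names, taken from the description above plus a hypothetical
# name column; adjust them to the header line of your own seq_gene.md.
START_COL, END_COL, STRAND_COL, NAME_COL = (
    "chrStart", "chrEnd", "orientation", "feature_name")

# Made-up rows for illustration only (not real coordinates).
SAMPLE_LINES = [
    "#tax_id\tchromosome\tchrStart\tchrEnd\torientation"
    "\tfeature_name\tfeature_type\tgroup_label",
    "9606\t1\t1000\t5000\t+\tgeneA\tGENE\treference",
    "9606\t1\t8000\t9500\t-\tgeneB\tGENE\treference",
    "9606\t1\t1000\t5000\t+\tNM_000000.0\tRNA\treference",
    "9606\t1\t1200\t5100\t+\tgeneA\tGENE\talternate",
]


def gene_tss(lines):
    """Return {gene name: 1-based TSS} for GENE lines on the reference assembly."""
    lines = iter(lines)
    # The first line names the columns; map each name to its position.
    header = next(lines).lstrip("#").rstrip("\n").split("\t")
    col = {name: i for i, name in enumerate(header)}
    tss = {}
    for line in lines:
        # Same filter as: egrep "GENE.*reference"
        if not re.search(r"GENE.*reference", line):
            continue
        fields = line.rstrip("\n").split("\t")
        start = int(fields[col[START_COL]])
        end = int(fields[col[END_COL]])
        # The 5' end of the gene depends on the strand it lies on.
        if fields[col[STRAND_COL]] == "+":
            tss[fields[col[NAME_COL]]] = start
        else:
            tss[fields[col[NAME_COL]]] = end
    return tss


result = gene_tss(SAMPLE_LINES)
print(result)
```

To run this on a downloaded file, pass an open file handle instead of the sample list, for example gene_tss(gzip.open("seq_gene.md.gz", "rt")) after importing gzip. The coordinates stay 1-based, as in the file.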
2. Direct SQL querying method. You can query the Ensembl databases directly using SQL to retrieve this kind of data.
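As a rough illustration of the SQL approach, the sketch below runs a strand-aware query against a tiny in-memory SQLite stand-in. The table and column names (gene, seq_region, seq_region_start, seq_region_end, seq_region_strand, stable_id) are modeled on the Ensembl core schema as an assumption; verify them against the schema documentation for the release you use. The rows and gene identifiers are made up. The real Ensembl databases are MySQL, so you would connect with a MySQL client and its placeholder syntax instead.

```python
import sqlite3

# Toy stand-in for two Ensembl-style tables; all values are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE seq_region (seq_region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE gene (
    gene_id INTEGER PRIMARY KEY,
    stable_id TEXT,
    seq_region_id INTEGER,
    seq_region_start INTEGER,
    seq_region_end INTEGER,
    seq_region_strand INTEGER   -- 1 = forward strand, -1 = reverse strand
);
INSERT INTO seq_region VALUES (1, '1');
INSERT INTO gene VALUES (1, 'GENE0001', 1, 1000, 5000, 1);
INSERT INTO gene VALUES (2, 'GENE0002', 1, 8000, 9500, -1);
""")

# The TSS is the start coordinate on the forward strand and the end
# coordinate on the reverse strand.
TSS_QUERY = """
SELECT g.stable_id,
       sr.name AS chromosome,
       g.seq_region_strand AS strand,
       CASE WHEN g.seq_region_strand = 1
            THEN g.seq_region_start
            ELSE g.seq_region_end
       END AS tss
FROM gene AS g
JOIN seq_region AS sr ON sr.seq_region_id = g.seq_region_id
WHERE g.stable_id IN (?, ?)
ORDER BY g.stable_id
"""

rows = conn.execute(TSS_QUERY, ("GENE0001", "GENE0002")).fetchall()
for row in rows:
    print(row)
```

Querying the gene table gives one TSS per gene. If you need one TSS per transcript, the same CASE expression could be applied to a transcript-level table, assuming it carries the same coordinate columns.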
Other approaches suitable for a limited number of genes
- Access the NCBI Gene database to visualize your gene. This database does eventually let you see the nucleotide number for the TSS of a gene (see example for TP53). However, you have to navigate several links to get to the Download Sequence Region link, which is time-consuming. There is no way to query either the Gene or the Nucleotide databases programmatically for that datum.
- Search SOURCE with your gene ID and click on the TRASER tool to retrieve the 5' genome region.
- The GeneCards database provides an explicit TSS number upfront. Caveat: It is not clear which assembly the coordinates refer to, so you may have to convert them to the assembly you are using. Example for TP53.
- 5'SAGE: SAGE-based TSS identification.
- Eukaryotic Promoter Database. It doesn't provide the TSS explicitly, but the TSS can be inferred. Similar information, with higher quality and coverage, is available from the TRANSFAC Pro database of transcription factor data.
- DBTSS: database of 5'-end sequences from the sequencing of full-length cDNAs.
- TRED: database of mammalian cis- and trans- transcriptional regulatory elements.
Key references
- eBook: Learning the UNIX Operating System (2001), 5th edition.
- Quick reference eBook: UNIX in a Nutshell (2005), 4th edition.
- All UNIX resources at the Lane Library
- All SQL resources at the Lane Library
Source
Lane Librarian