Mapping scientific and common names

Gavin Huttley edited this page Oct 17, 2019 · 1 revision

ensembldb3 maps scientific ("species") names to common names to simplify creating Genome and Compara instances. The following script is one way of exporting this mapping from the Ensembl database.

The "common" names this script produces are not always what you want. For instance, there are multiple members of the Drosophila genus, making the common name "fruitfly" ambiguous. Accordingly, the species.tsv file distributed with ensembldb3 is an edited version of this output.

import os
from collections import defaultdict
import sqlalchemy as sql
from ensembldb3 import HostAccount, Compara

# ENSEMBL_ACCOUNT holds the whitespace-separated arguments for HostAccount
account = HostAccount(*os.environ['ENSEMBL_ACCOUNT'].split())

compara = Compara(['human', 'mouse', 'dog', 'platypus'], release=85,
                  account=account)

# join each genome to its NCBI taxonomy names via the shared taxon_id
gen_db = compara.ComparaDb.get_table('genome_db')
ncbi_db = compara.ComparaDb.get_table('ncbi_taxa_name')
joined = gen_db.outerjoin(ncbi_db, gen_db.c.taxon_id==ncbi_db.c.taxon_id)

# keep only the Ensembl alias (common name) and scientific name records
mapping = defaultdict(dict)
query = sql.select([joined], use_labels=True,
                   whereclause=sql.or_(ncbi_db.c.name_class=='ensembl alias name',
                                       ncbi_db.c.name_class=='scientific name'))
recs = query.execute().fetchall()

# collect {name_class: name} for each genome database name
for r in recs:
    names = {r['ncbi_taxa_name_name_class']: r['ncbi_taxa_name_name']}
    mapping[r['genome_db_name']].update(names)

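rows = []
for db in mapping:
    # turn a database name like 'homo_sapiens' into 'Homo sapiens'
    sci = db.split('_')
    sci[0] = sci[0].capitalize()
    sci = ' '.join(sci)
    db_sci = mapping[db]['scientific name']
    # record the NCBI scientific name as a synonym only when it differs
    syn = '' if db_sci.lower() == sci.lower() else db_sci
    row = [sci, mapping[db]['ensembl alias name'], syn]
    rows.append(row)

# write one tab-delimited line per species, sorted by scientific name
rows = sorted(rows)
rows = ['\t'.join(r) for r in rows]
with open('species.tsv', 'wt') as out:
    out.write('\n'.join(rows))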
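
Because the exported common names can be ambiguous, it may help to list which ones are shared by more than one species before editing the file by hand. The following is a minimal sketch, not part of ensembldb3: the helper names `parse_species_tsv` and `ambiguous_common_names` are hypothetical, and the example rows are made up for illustration rather than taken from an actual export. It assumes the three-column, tab-delimited layout written by the script above (scientific name, common name, synonym).

```python
from collections import defaultdict


def parse_species_tsv(text):
    """Split tab-delimited text into [scientific, common, synonym] rows."""
    return [line.split("\t") for line in text.splitlines() if line.strip()]


def ambiguous_common_names(rows):
    """Return {common name: [scientific names]} for names used more than once."""
    by_common = defaultdict(list)
    for sci, common, _syn in rows:
        # compare case-insensitively so 'Fruitfly' and 'fruitfly' collide
        by_common[common.lower()].append(sci)
    return {c: sorted(s) for c, s in by_common.items() if len(s) > 1}


# Illustrative rows only (an assumption, not real exported data)
example = "\n".join([
    "Drosophila melanogaster\tfruitfly\t",
    "Drosophila simulans\tfruitfly\t",
    "Homo sapiens\thuman\t",
])

clashes = ambiguous_common_names(parse_species_tsv(example))
for common, species in clashes.items():
    print(common, "->", ", ".join(species))
```

To check a real export, pass the contents of species.tsv to `parse_species_tsv` instead of the example string; any common name reported is a candidate for manual renaming.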