Epiviz File Server - Query & Transform Data from Indexed Genomic Files in Python

The Epiviz File Server is a scalable data query and compute system for indexed genomic files. In addition to querying data, users can compute transformations, summarizations, and aggregations using NumPy functions directly on the data queried from files.

Since the genomic files are indexed, the library requests and parses only the bytes necessary to process a request, without loading the entire file into memory. A cache efficiently manages bytes of a file that have already been accessed, and dask parallelizes query and transformation requests. This allows the system to scale to large data repositories.

This blog post (Jupyter notebook) describes various features of the file server library using genomic files hosted from the NIH Roadmap Epigenomics project.

The library provides various modules to (a quick end-to-end sketch follows this list):
  • Parser: read various genomic file formats
  • Query: access only the necessary bytes of a file for a given genomic location
  • Compute: apply transformations on the data
  • Server: instantly convert the datasets into a REST API
  • Visualization: interactively explore data using Epiviz (uses the Server module above)
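
A minimal end-to-end sketch tying these modules together (the roadmap.json configuration file is the one described in the tutorial below; the query is shown commented out because get_data is a coroutine and needs an async context such as a Jupyter notebook):

import os
from epivizfileserver import MeasurementManager, create_fileHandler, setup_app

mMgr = MeasurementManager()      # manages measurements from files and databases
mHandler = create_fileHandler()  # dask-backed handler for parallel file access

# import measurements listed in a configuration file (see the tutorial below)
fmeasurements = mMgr.import_files(os.getcwd() + "/roadmap.json", mHandler)

# query a measurement for a genomic region (inside an async context):
# result, err = await fmeasurements[1].get_data("chr11", 10550488, 11554489)

# expose everything as a REST API
app = setup_app(mMgr)
app.run(port=8000)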

Note

  • The Epiviz File Server is an open-source project on GitHub
  • Let us know what you think! Feedback and feature requests to improve the library are welcome.


Installation

Using PyPI

To install the package from PyPI:

pip install epivizfileserver

Development Version

To install the development version from GitHub using pip:

pip install git+https://github.com/epiviz/epivizFileParser.git

You can also clone the repository and install from the local directory using pip.

Note

If you don’t have sudo rights to install the package, you can install it to the user directory using

pip install --user epivizfileserver

Tutorial

This blog post (Jupyter notebook) describes various features of the file server library using genomic files hosted from the NIH Roadmap Epigenomics project.

Note

This post is a general walkthrough of the features of the file server. More use cases will be posted soon!

Import Measurements from File

Since large data repositories contain hundreds of files, manually adding files would be cumbersome. To make this process easier, we create a configuration file that lists all files with their locations. An example configuration file is described below.

Configuration file

The following is a configuration file for data hosted on the Roadmap FTP server. It contains data for ChIP-seq experiments for the H3K27me3 marker in Esophagus and Sigmoid Colon tissues. Most fields in the configuration file are self-explanatory.

[
    {
        "url": "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E079-H3K27me3.fc.signal.bigwig",
        "file_type": "bigwig",
        "datatype": "bp",
        "name": "E079-H3K27me3",
        "id": "E079-H3K27me3",
        "annotation": {
            "group": "digestive",
            "tissue": "Esophagus",
            "marker": "H3K27me3"
        }
    },
    {
        "url": "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E106-H3K27me3.fc.signal.bigwig",
        "file_type": "bigwig",
        "datatype": "bp",
        "name": "E106-H3K27me3",
        "id": "E106-H3K27me3",
        "annotation": {
            "group": "digestive",
            "tissue": "Sigmoid Colon",
            "marker": "H3K27me3"
        }
    }
]

Once the configuration file is generated, we can import these measurements into the file server. We first create a MeasurementManager object, which manages measurements from files and databases, and a file handler that enables parallel processing of requests. We can then use the import_files helper function to import all measurements from this configuration file.

import os
from epivizfileserver import MeasurementManager, create_fileHandler
mMgr = MeasurementManager()
mHandler = create_fileHandler()
fmeasurements = mMgr.import_files(os.getcwd() + "/roadmap.json", mHandler)
fmeasurements

Query for a genomic location

After loading the measurements, we can query the object for data in a particular genomic region using the get_data function.

result, err = await fmeasurements[1].get_data("chr11", 10550488, 11554489)
result.head()

The response is a tuple: a DataFrame that contains the results, and an error if there is any.
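
A minimal sketch of guarding against errors, assuming the error slot is None when the query succeeds:

result, err = await fmeasurements[1].get_data("chr11", 10550488, 11554489)
if err is None:
    print(result.head())
else:
    print("query failed:", err)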

Compute a Function over files

We can define new measurements that are computed by applying a NumPy function over the files loaded in the previous step.

Note

You can also write a custom statistical function that applies to every row in the DataFrame. It must follow the same syntax as any NumPy row-apply function (see the example after the snippet below).

As an example, we can calculate the average ChIP-seq expression for the H3K27me3 marker.

import numpy

computed_measurement = mMgr.add_computed_measurement("computed", "avg_ChIP_seq", "Average ChIP-seq expression",
                                        measurements=fmeasurements, computeFunc=numpy.mean)
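
For instance, a hypothetical custom summary (a log-scaled mean, named log_mean here purely for illustration) can be written with the same call convention as a NumPy reduction and passed in place of numpy.mean:

import numpy

def log_mean(values, axis=None):
    # same signature style as numpy.mean / numpy.sum,
    # so it can be applied row-wise like any NumPy reduction
    return numpy.log1p(numpy.mean(values, axis=axis))

log_measurement = mMgr.add_computed_measurement("computed", "log_avg_ChIP_seq",
                                        "Log average ChIP-seq expression",
                                        measurements=fmeasurements, computeFunc=log_mean)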

After defining a computed measurement, we can query this measurement for a genomic location.

result, err = await computed_measurement.get_data("chr11", 10550488, 11554489)
result.head()

Setup a REST API

Developers often want to include data from genomic files in a web application for visualization, or in their workflows. We can quickly set up a REST API web server from the measurements we loaded:

from epivizfileserver import setup_app

app = setup_app(mMgr)
app.run(port=8000)

The REST API is an asynchronous web server built on top of Sanic.

Query Files from AnnotationHub

We can also use the Bioconductor’s AnnotationHub to search for files and setup the file server. We are working on simplifying this process.

The AnnotationHub API is hosted at https://annotationhub.bioconductor.org/.

We first download the AnnotationHub SQLite database of available data resources.

wget http://annotationhub.bioconductor.org/metadata/annotationhub.sqlite3

After downloading the resource database from AnnotationHub, we can load the SQLite database into Python and query for datasets.

import pandas
import os
import sqlite3

conn = sqlite3.connect("annotationhub.sqlite3")
cur = conn.cursor()
cur.execute("select * from resources r JOIN input_sources inp_src ON r.id = inp_src.resource_id;")
results = cur.fetchall()
resources = pandas.DataFrame(results, columns = ["id", "ah_id", "title", "dataprovider", "species", "taxonomyid", "genome",
                                        "description", "coordinate_1_based", "maintainer", "status_id",
                                        "location_prefix_id", "recipe_id", "rdatadateadded", "rdatadateremoved",
                                        "record_id", "preparerclass", "id", "sourcesize", "sourceurl", "sourceversion",
                                        "sourcemd5", "sourcelastmodifieddate", "resource_id", "source_type"])
resources.head()

For the purposes of this tutorial, we will filter for the Sigmoid Colon (“E106”) and Esophagus (“E079”) tissues, and for the ChIP-seq data for the “H3K27me3” histone marker from the Roadmap Epigenomics project.

roadmap = resources.query('dataprovider=="BroadInstitute" and genome=="hg19"')
roadmap = roadmap.query('title.str.contains("H3K27me3") and (title.str.contains("E106") or title.str.contains("E079"))', engine="python")
# only use fold change (fc) signal files
roadmap = roadmap.query('title.str.contains("fc")', engine="python")
roadmap

After filtering for the resources we are interested in, we can load them into the file server using the import_ahub helper function.

mMgr = MeasurementManager()
ahub_measurements = mMgr.import_ahub(roadmap)
ahub_measurements

The rest of the process is the same as described at the beginning of this tutorial.
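
For example, querying one of the imported AnnotationHub measurements uses the same get_data call as before:

result, err = await ahub_measurements[0].get_data("chr11", 10550488, 11554489)
result.head()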

Workspaces and Use Cases Set Up Using the File Server

The following workspaces have been set up using the Epiviz File Server:

  1. BICCN/NEMO Miniatlas Mouse MOp Dataset
  2. BICCN Cross Species Dataset

Deployment

Using the built-in Sanic server (for development)

Sanic provides a default asynchronous web server to run the API. For a working example, check out the Roadmap project from the use cases section above.

app = setup_app(mMgr)
app.run("0.0.0.0", port=8000)

Deploy using gunicorn + supervisor (for production)

Set up the virtualenv and API

This process assumes the root API directory is /var/www/epiviz-api.

Set up a virtualenv, either through pip or conda:

cd /var/www/epiviz-api
virtualenv env
source env/bin/activate
pip install epivizfileserver

A generic version of the API script looks something like this (add it to /var/www/epiviz-api/epiviz.py):

from epivizfileserver import setup_app, create_fileHandler, MeasurementManager
from epivizfileserver.trackhub import TrackHub

# create measurements to load multiple trackhubs or configuration files
mMgr = MeasurementManager()

# create file handler, enables parallel processing of multiple requests
mHandler = create_fileHandler()

# add genomes - for supported genomes,
# check https://obj.umiacs.umd.edu/genomes/index.html
genome = mMgr.add_genome("mm10")
genome = mMgr.add_genome("hg19")

# load measurements/files through config or TrackHub
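# for example, import measurements from a configuration file
# (hypothetical path - replace with your own):
# fmeasurements = mMgr.import_files("/var/www/epiviz-api/roadmap.json", mHandler)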

# setup the app from the measurements manager
# and run the app
app = setup_app(mMgr)

# only if this file is run directly!
if __name__ == "__main__":
    app.run(host="127.0.0.1", port=8000)

Install dependencies

  1. Supervisor (system wide) - http://supervisord.org/
  2. Gunicorn (to the virtual environment) - https://gunicorn.org/

# if using Ubuntu
sudo apt install supervisor

# activate virtualenv that runs the API
source /var/www/epiviz-api/env/bin/activate
pip install gunicorn

Configure supervisor

Add this configuration to /etc/supervisor/conf.d/epiviz.conf

This snippet also assumes the epiviz-api repository is in /var/www/epiviz-api.

[program:gunicorn]
directory=/var/www/epiviz-api
environment=PYTHONPATH=/var/www/epiviz-api/bin/python
command=/var/www/epiviz-api/env/bin/gunicorn epiviz:app --log-level debug --bind 0.0.0.0:8000 --worker-class sanic.worker.GunicornWorker
autostart=true
autorestart=true
stderr_logfile=/var/log/gunicorn/gunicorn.err.log
stdout_logfile=/var/log/gunicorn/gunicorn.out.log

Enable Supervisor configuration

sudo supervisorctl reread
sudo supervisorctl update

service supervisor restart

Note

Check the status of supervisor to make sure there are no errors.

Add a ProxyPass to nginx/Apache

(The port number here should match the bind port from the supervisor configuration.)

For Apache:

sudo a2enmod proxy
sudo a2enmod proxy_http

# add this to the apache site config
ProxyPreserveHost On
<Location "/api">
    ProxyPass "http://127.0.0.1:8000/"
    ProxyPassReverse "http://127.0.0.1:8000/"
</Location>

For nginx:

# add this to nginx site config

upstream epiviz_api_server {
    server 127.0.0.1:8000 fail_timeout=0;
}

location /api/ {
    proxy_pass http://epiviz_api_server/;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_redirect off;
}

License

The MIT License (MIT)

Copyright (c) 2019 Jayaram Kancherla

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Contributors

Changelog

Version 0.1.2

  • Package pushed to PyPI
  • Updated readme and documentation

Version 0.1

  • First release!

epivizfileserver

epivizfileserver package

Subpackages
epivizfileserver.client package
Submodules
epivizfileserver.client.EpivizClient module
class epivizfileserver.client.EpivizClient.EpivizClient(server)[source]

Bases: object

Client implementation of the epiviz server

Parameters:server – endpoint where the API is running
get_data(measurement, chr, start, end)[source]

Get data for a genomic region from the API

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
Returns:

a json with results

get_measurements()[source]
get_seq_info()[source]
version = 5
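
A minimal usage sketch (assuming a file server is running locally on port 8000, and that the measurement passed to get_data is one of the entries returned by get_measurements()):

from epivizfileserver.client.EpivizClient import EpivizClient

client = EpivizClient("http://localhost:8000")
measurements = client.get_measurements()
result = client.get_data(measurements[0], "chr11", 10550488, 11554489)
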
Module contents
epivizfileserver.handler package
Submodules
epivizfileserver.handler.HandlerNoActor module
class epivizfileserver.handler.HandlerNoActor.FileHandlerProcess(fileTime, MAXWORKER, client=None)[source]

Bases: object

Class to manage query, transformation and cache using dask distributed

Parameters:
  • fileTime (int) – time to keep file objects in memory
  • MAXWORKER (int) – maximum workers that can be used
records

a dictionary of all file objects

client

asynchronous dask server client

binFileData(fileName, data, chr, start, end, bins, columns, metadata)[source]

submit tasks to the dask client

cleanFileOBJ()[source]

automated task to pickle all file objects to disk

getRecord(name)[source]

get file object from records by name

Parameters:name (str) – file name
Returns:file object
handleFile(fileName, fileType, chr, start, end, bins=2000)[source]

submit tasks to the dask client

Parameters:
  • fileName – file location
  • fileType – file type
  • chr – chromosome
  • start – genomic start
  • end – genomic end
  • points – number of base pairs to group per bin
handleSearch(fileName, fileType, query, maxResults)[source]

submit tasks to the dask client

Parameters:
  • fileName – file location
  • fileType – file type
  • query – query string to search for
  • maxResults – maximum number of results to return
pickleFileObject(fileName)[source]

automated task to load a pickled file object

Parameters:fileName – file name to load
setRecord(name, fileObj, fileType)[source]

add or update records with new file object

Parameters:
  • name (str) – file name
  • fileObj – file object
  • fileType – file type
epivizfileserver.handler.handler module
class epivizfileserver.handler.handler.FileHandlerProcess(fileTime, MAXWORKER, client=None)[source]

Bases: object

Class to manage query, transformation and cache using dask distributed

Parameters:
  • fileTime (int) – time to keep file objects in memory
  • MAXWORKER (int) – maximum workers that can be used
records

a dictionary of all file objects

client

asynchronous dask server client

binFileData(fileName, fileType, data, chr, start, end, bins, columns, metadata)[source]

submit tasks to the dask client

check_who_has_obj(obj)[source]
cleanFileOBJ()[source]

automated task to pickle all file objects to disk

getRecord(name)[source]

get file object from records by name

Parameters:name (str) – file name
Returns:file object
get_dask_actor(fileClass, fileName)[source]
get_file_object(fileName, fileType)[source]
handleFile(fileName, fileType, chr, start, end, bins=2000)[source]

submit tasks to the dask client

Parameters:
  • fileName – file location
  • fileType – file type
  • chr – chromosome
  • start – genomic start
  • end – genomic end
  • points – number of base pairs to group per bin
handleSearch(fileName, fileType, query, maxResults)[source]

submit tasks to the dask client

Parameters:
  • fileName – file location
  • fileType – file type
  • query – query string to search for
  • maxResults – maximum number of results to return
pickleFileObject(fileName)[source]

automated task to load a pickled file object

Parameters:fileName – file name to load
setRecord(name, fileObj, fileType)[source]

add or update records with new file object

Parameters:
  • name (str) – file name
  • fileObj – file object
  • fileType – file type
epivizfileserver.handler.handler.bin_rows(data, chr, start, end, columns=None, metadata=None, bins=400)[source]
epivizfileserver.handler.utils module
epivizfileserver.handler.utils.create_parser_object(format, source)[source]

Create appropriate File class based on file format

Parameters:
  • format – type of file
  • source – location of the file
Returns:

An instance of parser class

Module contents
epivizfileserver.measurements package
Submodules
epivizfileserver.measurements.measurementClass module
class epivizfileserver.measurements.measurementClass.ComputedMeasurement(mtype, mid, name, measurements, source='computed', computeFunc=None, datasource='computed', genome=None, annotation={'group': 'computed'}, metadata=None, isComputed=True, isGenes=False, fileHandler=None, columns=None, computeAxis=1)[source]

Bases: epivizfileserver.measurements.measurementClass.Measurement

Class for representing computed measurements

In addition to params on base Measurement class -

Parameters:
  • computeFunc – a NumPy function to apply on our dataframe
  • source – defaults to ‘computed’
  • datasource – defaults to ‘computed’
computeWrapper(computeFunc, columns)[source]

a wrapper for the ‘computeFunc’ function

Parameters:
  • computeFunc – a NumPy compute function
  • columns – columns from file to apply
Returns:

a dataframe with results

get_columns()[source]

get columns from file

get_data(chr, start, end, bins, dropna=True)[source]

Get data for a genomic region from files and apply the computeFunc function

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • dropna (bool) – True to drop rows with missing values from a measurement, since any computation would fail on those rows
Returns:

a dataframe with results

class epivizfileserver.measurements.measurementClass.DbMeasurement(mtype, mid, name, source, datasource, dbConn, genome=None, annotation=None, metadata=None, isComputed=False, isGenes=False, minValue=None, maxValue=None, columns=None)[source]

Bases: epivizfileserver.measurements.measurementClass.Measurement

Class representing a database measurement

In addition to params from the base measurement class -

Parameters:dbConn – a database connection object
connection

a database connection object

get_data(chr, start, end, bin=False)[source]

Get data for a genomic region from database

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • bin (bool) – True to bin the results, defaults to False
Returns:

a dataframe with results

query(obj, params)[source]

Query from db/source

Parameters:
  • obj – the query string
  • query_params – query parameters to search
Returns:

a dataframe of results from the database

class epivizfileserver.measurements.measurementClass.FileMeasurement(mtype, mid, name, source, datasource='files', genome=None, annotation=None, metadata=None, isComputed=False, isGenes=False, minValue=None, maxValue=None, fileHandler=None, columns=None)[source]

Bases: epivizfileserver.measurements.measurementClass.Measurement

Class for file based measurement

In addition to params from the base Measurement class

Parameters:fileHandler – an optional file handler object to process query requests (uses dask)
create_parser_object(type, name, columns=None)[source]

Create appropriate File class based on file format

Parameters:
  • type (str) – format of file
  • name (str) – location of file
  • columns ([str]) – list of columns from file
Returns:

A file object

get_data(chr, start, end, bins, bin=True)[source]

Get data for a genomic region from file

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • bin (bool) – True to bin the results, defaults to True
Returns:

a dataframe with results

get_status()[source]

Get status of this measurement (most pertinent for files)

search_gene(query, maxResults)[source]

Search the file for genes matching a query

Parameters:
  • query (str) – gene name to search for
  • maxResults (int) – maximum number of results to return
Returns:

an array of matched genes

class epivizfileserver.measurements.measurementClass.Measurement(mtype, mid, name, source, datasource, genome=None, annotation=None, metadata=None, isComputed=False, isGenes=False, minValue=None, maxValue=None, columns=None)[source]

Bases: object

Base class for managing measurements from files

Parameters:
  • mtype – Measurement type, either ‘file’ or ‘db’
  • mid – unique id to use for this measurement
  • name – name of the measurement
  • source – location of the measurement; the table name if mtype is ‘db’, or the file location if mtype is ‘file’
  • datasource – the database name if mtype is ‘db’, else ‘files’
  • annotation – annotation for this measurement, defaults to None
  • metadata – metadata for this measurement, defaults to None
  • isComputed – True if this measurement is Computed from other measurements, defaults to False
  • isGenes – True if this measurement is an annotation (for example: reference genome hg19), defaults to False
  • minValue – min value of all values, defaults to None
  • maxValue – max value of all values, defaults to None
  • columns – column names for the file
bin_rows(data, chr, start, end, columns=None, metadata=None, bins=400)[source]
bin_rows_legacy(data, chr, start, end, bins=2000)[source]

Bin genome by bin length and summarize the bin

Parameters:
  • data – DataFrame from the file
  • chr – chromosome
  • start – genomic start
  • end – genomic end
  • bins – max rows to summarize the data frame into
Returns:

a binned data frame with at most bins rows

get_columns()[source]

get columns from file

get_data(chr, start, end)[source]

Get Data for this measurement

Parameters:
  • chr – chromosome
  • start – genomic start
  • end – genomic end
get_measurement_annotation()[source]

Get measurement annotation

get_measurement_genome()[source]

Get measurement genome

get_measurement_id()[source]

Get measurement id

get_measurement_max()[source]

Get measurement max value

get_measurement_metadata()[source]

Get measurement metadata

get_measurement_min()[source]

Get measurement min value

get_measurement_name()[source]

Get measurement name

get_measurement_source()[source]

Get source

get_measurement_type()[source]

Get measurement type

get_status()[source]

Get status of this measurement (most pertinent for files)

is_computed()[source]

Is the measurement computed?

is_file()[source]

Is the measurement a file?

is_gene()[source]

Is the file a genome annotation?

query(obj, query_params)[source]

Query from db/source

Parameters:
  • obj – db obj
  • query_params – query parameters to search
class epivizfileserver.measurements.measurementClass.WebServerMeasurement(mtype, mid, name, source, datasource, datasourceGroup, annotation=None, metadata=None, isComputed=False, isGenes=False, minValue=None, maxValue=None)[source]

Bases: epivizfileserver.measurements.measurementClass.Measurement

Class representing a web server measurement

In addition to params from the base measurement class, source is now the server API endpoint

get_data(chr, start, end, bin=False, requestId=662)[source]

Get data for a genomic region from the API

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • bin (bool) – True to bin the results, defaults to False
Returns:

a dataframe with results

epivizfileserver.measurements.measurementManager module
class epivizfileserver.measurements.measurementManager.EMDMeasurementMap(url, fileHandler)[source]

Bases: object

Manage mapping between measurements in EFS and the metadata service

add_new_collections(new_collection_ids)[source]
add_new_measurements(new_ms_ids)[source]
init()[source]
init_collections()[source]
init_measurements()[source]
process_emd_record(rec)[source]
sync(current_ms)[source]
sync_collections()[source]
sync_measurements(current_ms)[source]
class epivizfileserver.measurements.measurementManager.MeasurementManager[source]

Bases: object

Measurement manager class

measurements

list of all measurements managed by the system

add_computed_measurement(mtype, mid, name, measurements, computeFunc, genome=None, annotation=None, metadata=None, computeAxis=1)[source]

Add a Computed Measurement

Parameters:
  • mtype – measurement type, defaults to ‘computed’
  • mid – measurement id
  • name – name for this measurement
  • measurements – list of measurement to use
  • computeFunc – NumPy function to apply
Returns:

a ComputedMeasurement object

add_genome(genome, url='http://obj.umiacs.umd.edu/genomes/', type=None, fileHandler=None)[source]

Add a genome to the list of measurements. The genome has to be tabix-indexed for the file server to make remote queries. Our tabix-indexed files are available at https://obj.umiacs.umd.edu/genomes/index.html

Parameters:
  • genome – for example: hg19 if type = “tabix”, or the full location of a gtf file if type = “gtf”
  • genome_id – required if type = “gtf”
  • url – url to the genome file
format_ms(rec)[source]
get_from_emd(url=None)[source]

Make a GET request to a metadata api

Parameters:url – the url of the epiviz-md api. If None, the url in self.emd_endpoint is used if available (defaults to None)
get_genomes()[source]

Get all available genomes

get_measurement(ms_id)[source]

Get a specific measurement

get_measurements()[source]

Get all available measurements

get_ms_from_emd(mid)[source]

Get the measurement from the epiviz-md service by id

import_ahub(ahub, handler=None)[source]

Import measurements from annotationHub objects.

Parameters:
  • ahub – list of file records from annotationHub
  • handler – an optional filehandler to use
import_dbm(dbConn)[source]

Import measurements from a database. The database needs to have a measurements_index table with information about the files imported into the database.

Parameters:dbConn – a database connection
import_emd(url, fileHandler=None, listen=True)[source]

Import measurements from an epiviz-md metadata service api.

Parameters:
  • url – the url of the epiviz-md api
  • handler – an optional filehandler to use
  • listen – activate ‘updateCollections’ endpoint to add measurements from the service upon request
import_files(fileSource, fileHandler=None, genome=None)[source]

Import measurements from a file.

Parameters:
  • fileSource – location of the configuration file to load
  • fileHandler – an optional filehandler to use
import_records(records, fileHandler=None, genome=None, skip=False)[source]

Import measurements from a list of records (usually from a decoded json string)

Parameters:
  • records – list of measurement records to load (usually from a decoded json string)
  • fileHandler – an optional filehandler to use
  • genome – genome to use if its missing from measurement
  • skip – skips adding measurement to mgr
import_trackhub(hub, handler=None)[source]

Import measurements from a TrackHub.

Parameters:
  • hub – the TrackHub to import
  • handler – an optional filehandler to use
use_emd(url, fileHandler=None)[source]

Delegate all getMeasurement calls to an epiviz-md metadata service api

Parameters:
  • url – the url of the epiviz-md api
  • fileHandler – an optional filehandler to use
using_emd()[source]
class epivizfileserver.measurements.measurementManager.MeasurementSet[source]

Bases: object

append(ms)[source]
get(key)[source]
get_measurements()[source]
get_mids()[source]
Module contents
epivizfileserver.parser package
Submodules
epivizfileserver.parser.BamFile module
class epivizfileserver.parser.BamFile.BamFile(file, columns=None)[source]

Bases: epivizfileserver.parser.SamFile.SamFile

Bam File Class to parse bam files

Parameters:
  • file (str) – file location can be local (full path) or hosted publicly
  • columns ([str]) – column names for various columns in file
file

a pysam file object

fileSrc

location of the file

cacheData

cache of accessed data in memory

columns

column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame with matched regions from the input genomic location if respType is DataFrame else result is an array

error

if there was any error during the process

get_bin(x)[source]
get_col_names(result)[source]
to_DF(result)[source]
to_msgpack(result)[source]
epivizfileserver.parser.BaseFile module

Genomics file classes

class epivizfileserver.parser.BaseFile.BaseFile(file)[source]

Bases: object

Base file class for parser module

This class provides various useful functions

Parameters:file – file location
local

if file is local or hosted on a public server

endian

check for endianness

HEADER_STRUCT = <Struct object>
SUMMARY_STRUCT = <Struct object>
bin_rows(data, chr, start, end, columns=None, metadata=None, bins=400)[source]

Bin genome by bin length and summarize the bin

decompress_binary(bin_block)[source]

decompress a binary string

Parameters:bin_block – binary string
Returns:a zlib decompressed binary string
formatAsJSON(data)[source]

Encode a data object as JSON

Parameters:data – any data object to encode
Returns:data encoded as JSON
get_bytes(offset, size)[source]

Get bytes within a given range

Parameters:
  • offset (int) – byte start position in file
  • size (int) – size of bytes to access from offset
Returns:

binary string from offset to (offset + size)

get_bytes_http(offset, size)[source]
get_data(chr, start, end)[source]
get_status()[source]
is_local(file)[source]

Checks if file is local or hosted publicly

Parameters:file – location of file
parse_header()[source]
parse_url(furl=None)[source]
parse_url_http(furl=None)[source]
simplified_bin_rows(data, chr, start, end, columns=None, metadata=None, bins=400)[source]
epivizfileserver.parser.BigBed module
class epivizfileserver.parser.BigBed.BigBed(file, columns=None)[source]

Bases: epivizfileserver.parser.BigWig.BigWig

Bed file parser

Parameters:file (str) – bigbed file location
get_autosql()[source]

parse autosql stored in file

Returns:an array of columns in file parsed from autosql
magic = '0x8789F2EB'
parseLeafDataNode(chrmId, start, end, zoomlvl, rStartChromIx, rStartBase, rEndChromIx, rEndBase, rdataOffset, rDataSize)[source]

Parse leaf node

epivizfileserver.parser.BigWig module
class epivizfileserver.parser.BigWig.BigWig(file, columns=None)[source]

Bases: epivizfileserver.parser.BaseFile.BaseFile

BigWig file parser

Parameters:file (str) – bigwig file location
tree

chromosome tree parsed from file

columns

column names

cacheData

locally cached data for this file

daskWrapper(fileObj, chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='JSON')[source]

Dask Wrapper

getHeader()[source]

get header byte region in file

getId(chrmzone)[source]

Get mapping of chromosome to id stored in file

Parameters:chrmzone (str) – chromosome
Returns:id in file for the given chromosome
getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', treedisk=None)[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame with matched regions from the input genomic location if respType is DataFrame else result is an array

error

if there was any error during the process

getTree(zoomlvl)[source]

Get chromosome tree for a given zoom level

Parameters:zoomlvl (int) – zoomlvl to get
Returns:Tree binary bytes
getTreeBytes(zoomlvl, start, size)[source]
getValues(chr, start, end, zoomlvl)[source]

Get data for a region

Note: Do not use this directly, use getRange

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
Returns:

data for the region

getZoom(zoomlvl, binSize)[source]

Get Zoom record for the given bin size

Parameters:
  • zoomlvl (int) – zoomlvl to get
  • binSize (int) – bin data by bin size
Returns:

zoom level

getZoomHeader(data)[source]
get_autosql()[source]

parse autosql in file

Returns:an array of columns in file parsed from autosql
get_cache()[source]
locateTree(chrmId, start, end, zoomlvl, offset)[source]

Locate tree for the given region

Parameters:
  • chrmId (int) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • zoomlvl (int) – zoom level
  • offset (int) – offset position in the file
Returns:

nodes in the stored R-tree

magic = '0x888FFC26'
parseLeafDataNode(chrmId, start, end, zoomlvl, rStartChromIx, rStartBase, rEndChromIx, rEndBase, rdataOffset, rDataSize)[source]

Parse an Rtree leaf node

parse_header(data=None)[source]

parse header in file

Returns:attributes stored in the header
readRtreeHeaderNode(zoomlvl)[source]

Parse an Rtree Header node

Parameters:zoomlvl (int) – zoom level
Returns:header node Rtree object
readRtreeNode(zoomlvl, offset)[source]

Parse an Rtree node

Parameters:
  • zoomlvl (int) – zoom level
  • offset (int) – offset in the file
Returns:

node Rtree object

set_cache(cache)[source]
traverseRtreeNodes(node, zoomlvl, chrmId, start, end, result=[])[source]

Traverse an Rtree to get nodes in the given range

epivizfileserver.parser.GWASBigBedPIP module
class epivizfileserver.parser.GWASBigBedPIP.GWASBigBedPIP(file, columns=None)[source]

Bases: epivizfileserver.parser.BigBed.BigBed

Bed file parser

Parameters:file (str) – GWASBigBedPIP file location
getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', treedisk=None)[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame with matched regions from the input genomic location if respType is DataFrame else result is an array

error

if there was any error during the process

magic = '0x8789F2EB'
epivizfileserver.parser.GWASBigBedPval module
class epivizfileserver.parser.GWASBigBedPval.GWASBigBedPval(file, columns=None)[source]

Bases: epivizfileserver.parser.BigBed.BigBed

Bed file parser

Parameters:file (str) – GWASBigBedPval file location
getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', treedisk=None)[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame with matched regions from the input genomic location if respType is DataFrame else result is an array

error

if there was any error during the process

magic = '0x8789F2EB'
epivizfileserver.parser.GtfFile module
class epivizfileserver.parser.GtfFile.GtfFile(file, columns=['chr', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'group'])[source]

Bases: object

GTF File Class to parse gtf/gff files

Parameters:
  • file (str) – file location can be local (full path) or hosted publicly
  • columns ([str]) – column names for various columns in file
file

a pysam file object

fileSrc

location of the file

cacheData

cache of accessed data in memory

columns

column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame with matched regions from the input genomic location if respType is DataFrame else result is an array

error

if there was any error during the process

get_col_names()[source]
get_data(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]
parse_attribute(item, key)[source]
searchGene(query, maxResults=5)[source]
search_gene(query, maxResults=5)[source]
epivizfileserver.parser.GtfParsedFile module
class epivizfileserver.parser.GtfParsedFile.GtfParsedFile(file, columns=['chr', 'start', 'end', 'width', 'strand', 'geneid', 'exon_starts', 'exon_ends', 'gene'])[source]

Bases: object

GTF File Class to parse gtf/gff files

Parameters:
  • file (str) – file location can be local (full path) or hosted publicly
  • columns ([str]) – column names for various columns in file
file

a pysam file object

fileSrc

location of the file

cacheData

cache of accessed data in memory

columns

column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame with matched regions from the input genomic location if respType is DataFrame else result is an array

error

if there was any error during the process

get_col_names()[source]
get_data(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]
parse_attribute(item, key)[source]
searchGene(query, maxResults=5)[source]
search_gene(query, maxResults=5)[source]
epivizfileserver.parser.GtfTabixFile module
class epivizfileserver.parser.GtfTabixFile.GtfTabixFile(file, columns=None)[source]

Bases: epivizfileserver.parser.SamFile.SamFile

GTF File Class to parse gtf/gff files

Parameters:
  • file (str) – file location can be local (full path) or hosted publicly
  • columns ([str]) – column names for various columns in file
file

a pysam file object

fileSrc

location of the file

cacheData

cache of accessed data in memory

columns

column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', ensembl=True)[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame with matched regions from the input genomic location if respType is DataFrame else result is an array

error

if there was any error during the process

get_bin(x)[source]
get_col_names(result)[source]
toDF(result)[source]
epivizfileserver.parser.HDF5File module
class epivizfileserver.parser.HDF5File.HDF5File(file)[source]

Bases: object

HDF5 File Class to parse only local hdf5 files

Parameters:
  • file (str) – file location can be local (full path) or hosted publicly
  • columns ([str]) – column names for various columns in file
file

a pysam file object

fileSrc

location of the file

cacheData

cache of accessed data in memory

columns

column names to use

getRange(chr, start=None, end=None, row_names=None)[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame with matched regions from the input genomic location if respType is DataFrame else result is an array

error

if there was any error during the process

read_10x_hdf5(chr, query_names)[source]

read a 10xGenomics hdf5 file

Parameters:
  • chr (str) – chromosome
  • query_names ([str]) – genes to filter
Returns:

result

a DataFrame with matched regions from the input genomic location if respType is DataFrame else result is an array

error

if there was any error during the process

epivizfileserver.parser.Helper module
epivizfileserver.parser.Helper.get_range_helper(toDF, get_bin, get_col_names, chr, start, end, file_iter, columns, respType)[source]
epivizfileserver.parser.InteractionBigBed module
class epivizfileserver.parser.InteractionBigBed.InteractionBigBed(file, columns=['chr', 'start', 'end', 'name', 'score', 'value', 'exp', 'color', 'region1chr', 'region1start', 'region1end', 'region1name', 'region1strand', 'region2chr', 'region2start', 'region2end', 'region2name', 'region2strand'])[source]

Bases: epivizfileserver.parser.BigBed.BigBed

BigBed file parser for chromosome interaction Data

Columns in the bed file are

(chr, start, end, name, score, value (strength of interaction, same as value), exp, color, region1chr, region1start, region1end, region1name, region1strand, region2chr, region2start, region2end, region2name, region2strand)
Parameters:file (str) – InteractionBigBed file location
getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', treedisk=None)[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame with matched regions from the input genomic location if respType is DataFrame else result is an array

error

if there was any error during the process

magic = '0x8789F2EB'
epivizfileserver.parser.SamFile module
class epivizfileserver.parser.SamFile.SamFile(file, columns=None)[source]

Bases: object

SAM File Class to parse sam files

Parameters:
  • file (str) – file location can be local (full path) or hosted publicly
  • columns ([str]) – column names for various columns in file
file

a pysam file object

fileSrc

location of the file

cacheData

cache of accessed data in memory

columns

column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame with matched regions from the input genomic location if respType is DataFrame else result is an array

error

if there was any error during the process

get_bin(x)[source]
get_cache()[source]
get_col_names(result)[source]
set_cache(cache)[source]
toDF(result)[source]
epivizfileserver.parser.TbxFile module
class epivizfileserver.parser.TbxFile.TbxFile(file, columns=['chr', 'start', 'end', 'width', 'strand', 'geneid', 'exon_starts', 'exon_ends', 'gene'])[source]

Bases: epivizfileserver.parser.SamFile.SamFile

TBX File Class to parse tbx files

Parameters:
  • file (str) – file location can be local (full path) or hosted publicly
  • columns ([str]) – column names for various columns in file
file

a pysam file object

fileSrc

location of the file

cacheData

cache of accessed data in memory

columns

column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame with matched regions from the input genomic location if respType is DataFrame else result is an array

error

if there was any error during the process

get_bin(x)[source]
get_col_names(result)[source]
get_data(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]
searchGene(query, maxResults=5)[source]
toDF(result)[source]
epivizfileserver.parser.TileDB module
class epivizfileserver.parser.TileDB.TileDB(path)[source]

Bases: object

TileDB Class to parse only local tiledb files

Parameters:
  • path (str) – local full path to a dataset tiledb_folder. This folder should contain data.tiledb, rows and cols files. See below for more detail.
  • columns ([str]) – column names for various columns in file
Detail:
The tiledb_folder should contain:

‘data.tiledb’ directory - corresponds to the uri of a tiledb array. The tiledb array must have a ‘vals’ attribute from which values are read. The array should have as many rows as the number of lines in the ‘rows’ file, and as many columns as the number of lines in the ‘cols’ file.

‘rows’ file - this is a tab-separated value file describing the rows of the tiledb array; it must have as many lines as rows in the tiledb file. There should be no index column in this file (i.e., it is read with pandas.read_csv(…, sep="\t", index_col=False)). It must have columns ‘chr’, ‘start’ and ‘end’. We index the rows file using Tabix so we are not loading the entire file into memory. This file contains the columns as annotated in the .json file.

‘cols’ file - this is a tab-separated value file describing the columns of the tiledb array. It must have as many lines as columns in the tiledb file. Column names for the tiledb array will be obtained from the first column in this file (i.e., it is read with pandas.read_csv(…, sep="\t", index_col=0)).
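
A sketch of the expected folder layout described above (tiledb_folder is whatever dataset path is passed to TileDB):

tiledb_folder/
    data.tiledb/    # tiledb array with a 'vals' attribute
    rows            # tab-separated, indexed with Tabix; includes 'chr', 'start', 'end' columns
    cols            # tab-separated; its first column provides the array's column names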

getRange(chr, start=None, end=None, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', treedisk=None)[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame with matched regions from the input genomic location if respType is DataFrame else result is an array

error

if there was any error during the process

epivizfileserver.parser.TileDBTbxFile module
class epivizfileserver.parser.TileDBTbxFile.TileDBTbxFile(file, columns=['chr', 'start', 'end', 'rownumber', 'gene'])[source]

Bases: epivizfileserver.parser.SamFile.SamFile

Tiledb specific TBX File Class to parse row files

Parameters:
  • file (str) – file location can be local (full path) or hosted publicly
  • columns ([str]) – column names for various columns in file
file

a pysam file object

fileSrc

location of the file

cacheData

cache of accessed data in memory

columns

column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame with matched regions from the input genomic location if respType is DataFrame else result is an array

error

if there was any error during the process

get_bin(x)[source]
get_col_names(result)[source]
toDF(result)[source]
epivizfileserver.parser.TranscriptTbxFile module
class epivizfileserver.parser.TranscriptTbxFile.TranscriptTbxFile(file, columns=['chr', 'start', 'end', 'strand', 'transcript_id', 'exon_starts', 'exon_ends', 'gene'])[source]

Bases: epivizfileserver.parser.TbxFile.TbxFile

Class for tabix indexed transcript files

Parameters:
  • file (str) – file location can be local (full path) or hosted publicly
  • columns ([str]) – column names for various columns in file
file

a pysam file object

fileSrc

location of the file

cacheData

cache of accessed data in memory

columns

column names to use

epivizfileserver.parser.utils module
epivizfileserver.parser.utils.create_parser_object(format, source, columns=None)[source]

Create appropriate File class based on file format

Parameters:
  • format (str) – format of file
  • source (str) – location of file
Returns:

An instance of parser class
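
A brief sketch of creating a parser directly. The format string below mirrors the file_type value used in the tutorial's configuration file and the URL is the Roadmap file from the tutorial; check this function for the exact format strings it accepts:

from epivizfileserver.parser.utils import create_parser_object

# assumes "bigwig" is an accepted format string (it matches the file_type used in configuration files)
bw = create_parser_object("bigwig", "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E079-H3K27me3.fc.signal.bigwig")
# the returned parser exposes getRange(chr, start, end) for region queries (see the parser classes above)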

epivizfileserver.parser.utils.toDataFrame(records, header=None)[source]
Module contents
epivizfileserver.server package
Submodules
epivizfileserver.server.request module
class epivizfileserver.server.request.DataRequest(request)[source]

Bases: epivizfileserver.server.request.EpivizRequest

Data requests class

get_data(mMgr, handler=None)[source]

Get Data for this request type

Returns:JSON response for this request; error: HTTP error code
Return type:result
validate_params(request)[source]

Validate parameters for requests

Parameters:request – dict of params from request
class epivizfileserver.server.request.EpivizRequest(request)[source]

Bases: object

Base class to process requests

get_data(mMgr, handler=None)[source]

Get Data for this request type

Returns:JSON response for this request; error: HTTP error code
Return type:result
validate_params(request)[source]

Validate parameters for requests

Parameters:request – dict of params from request
class epivizfileserver.server.request.MeasurementRequest(request)[source]

Bases: epivizfileserver.server.request.EpivizRequest

Measurement requests class

get_data(mMgr, handler=None)[source]

Get Data for this request type

Returns:JSON response for this request; error: HTTP error code
Return type:result
validate_params(request)[source]

Validate parameters for requests

Parameters:request – dict of params from request
class epivizfileserver.server.request.SearchRequest(request)[source]

Bases: epivizfileserver.server.request.EpivizRequest

Search requests class

get_data(mMgr, handler=None)[source]

Get Data for this request type

Returns:JSON response for this request; error: HTTP error code
Return type:result
validate_params(request)[source]

Validate parameters for requests

Parameters:request – dict of params from request
class epivizfileserver.server.request.SeqInfoRequest(request)[source]

Bases: epivizfileserver.server.request.EpivizRequest

SeqInfo requests class

get_data(mMgr, handler=None)[source]

Get Data for this request type

Returns:JSON response for this request; error: HTTP error code
Return type:result
validate_params(request)[source]

Validate parameters for requests

Parameters:request – dict of params from request
class epivizfileserver.server.request.StatusRequest(request, datasource)[source]

Bases: epivizfileserver.server.request.EpivizRequest

get_status(mMgr)[source]
epivizfileserver.server.request.create_request(action, request)[source]

Create appropriate request class based on action

Parameters:
  • action – Type of request
  • request – Other request parameters
Returns:

An instance of EpivizRequest class

epivizfileserver.server.utils module
epivizfileserver.server.utils.bin_rows(input, max_rows=2000)[source]

Helper function to bin rows to resolution

Parameters:
  • input – dataframe to bin
  • max_rows – resolution to scale rows
Returns:

data frame with scaled rows

epivizfileserver.server.utils.create_parser_object(format, source)[source]

Create appropriate File class based on file format

Parameters:
  • format – type of file
  • source – location of the file
Returns:

An instance of parser class

epivizfileserver.server.utils.format_result(input, params, offset=True)[source]

Format result to an epiviz-compatible format

Parameters:
  • input – input dataframe
  • params – request parameters
  • offset – defaults to True
Returns:

formatted JSON response

Module contents
epivizfileserver.server.MAXWORKER = 10

The server module allows users to instantly create a REST API from a list of measurements. The API can then be used for interactive exploration of data or to build various applications.

epivizfileserver.server.clean_up(app, loop)[source]
epivizfileserver.server.create_fileHandler()[source]

create a dask file handler if one doesn’t exist

epivizfileserver.server.schedulePickle()[source]

Sanic task to regularly pickle file objects from memory

epivizfileserver.server.setup_after_connection(app, loop)[source]
epivizfileserver.server.setup_app(measurementsManager, dask_scheduler=None)[source]

Setup the Sanic Rest API

Parameters:measurementsManager – a measurements manager object
Returns:a sanic app object
epivizfileserver.server.setup_connection(app, loop)[source]

Sanic callback for app setup before the server starts

epivizfileserver.trackhub package
Submodules
epivizfileserver.trackhub.TrackHub module
class epivizfileserver.trackhub.TrackHub.TrackHub(file)[source]

Bases: object

Base class for managing trackhub files. TrackHub documentation is available at https://genome.ucsc.edu/goldenPath/help/hgTrackHubHelp.html

Parameters:file – location of trackhub directory
parse_genome()[source]
parse_genomeTracks()[source]
parse_hub()[source]
parse_trackDb(track_loc)[source]
Module contents
Submodules
epivizfileserver.cli module
Module contents
