Epiviz File Server - Query & Transform Data from Indexed Genomic Files in Python¶
The Epiviz File Server is a scalable data query and compute system for indexed genomic files. In addition to querying data, users can apply transformations, summarizations and aggregations using NumPy functions directly on data queried from files.
Since the genomic files are indexed, the library requests and parses only the bytes necessary to process a request, without loading the entire file into memory. A cache system efficiently manages already-accessed bytes of a file, and dask is used to parallelize query and transformation requests. Together, these allow the system to scale to large data repositories.
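The byte-range-plus-cache idea can be sketched in a few lines. This is a toy illustration, not the library's actual implementation (the real entry point is `BaseFile.get_bytes` in the API reference below): seek to the requested offset, read only that slice, and cache the result so overlapping queries do not re-read the source.

```python
import io

class RangeReader:
    """Toy sketch: fetch only the requested byte range and cache it."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.cache = {}  # (offset, size) -> bytes

    def get_bytes(self, offset, size):
        key = (offset, size)
        if key not in self.cache:
            # read only the bytes needed for this request
            self.fileobj.seek(offset)
            self.cache[key] = self.fileobj.read(size)
        return self.cache[key]

# stand-in for a remote indexed file
reader = RangeReader(io.BytesIO(b"0123456789" * 100))
chunk = reader.get_bytes(10, 5)  # reads 5 bytes at offset 10, nothing more
```

A second call with the same range is served from the cache instead of touching the file again.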
This blog post (Jupyter notebook) describes various features of the file server library using genomic files hosted from the NIH Roadmap Epigenomics project.
- The library provides modules to:
- Parser: read various genomic file formats
- Query: access only the necessary bytes of a file for a given genomic location
- Compute: apply transformations on the data
- Server: instantly convert the datasets into a REST API
- Visualization: interactively explore data using Epiviz (uses the Server module above)
Note
- The Epiviz File Server is an open source project on GitHub
- Let us know what you think, and send any feedback or feature requests to improve the library!
Contents¶
Installation¶
Development Version¶
To install the development version from GitHub using pip:
pip install git+https://github.com/epiviz/epivizFileParser.git
You can also clone the repository and install from the local directory using pip.
Note
If you don’t have sudo rights to install the package, you can install it to the user directory using
pip install --user epivizfileserver
Tutorial¶
This blog post (Jupyter notebook) describes various features of the file server library using genomic files hosted from the NIH Roadmap Epigenomics project.
Note
This post is a general walkthrough of the features of the file server. More use cases will be posted soon!
Import Measurements from File¶
Since large data repositories contain hundreds of files, adding them manually would be cumbersome. To make this process easier, we create a configuration file that lists all files with their locations. An example configuration file is shown below.
Configuration file¶
The following is a configuration file for data hosted on the Roadmap FTP server. It describes ChIP-seq experiments for the H3K27me3 marker in Esophagus and Sigmoid Colon tissues. Most fields in the configuration file are self-explanatory.
[
  {
    "url": "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E079-H3K27me3.fc.signal.bigwig",
    "file_type": "bigwig",
    "datatype": "bp",
    "name": "E079-H3K27me3",
    "id": "E079-H3K27me3",
    "annotation": {
      "group": "digestive",
      "tissue": "Esophagus",
      "marker": "H3K27me3"
    }
  },
  {
    "url": "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E106-H3K27me3.fc.signal.bigwig",
    "file_type": "bigwig",
    "datatype": "bp",
    "name": "E106-H3K27me3",
    "id": "E106-H3K27me3",
    "annotation": {
      "group": "digestive",
      "tissue": "Sigmoid Colon",
      "marker": "H3K27me3"
    }
  }
]
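If you prefer to build the configuration programmatically, a sketch like the following writes one entry to the `roadmap.json` file that the tutorial passes to `import_files` (only the first measurement is shown; the second follows the same shape):

```python
import json

# Build one entry of the configuration shown above and write it to
# "roadmap.json", the file loaded in the next step of the tutorial.
config = [
    {
        "url": ("https://egg2.wustl.edu/roadmap/data/byFileType/signal/"
                "consolidated/macs2signal/foldChange/"
                "E079-H3K27me3.fc.signal.bigwig"),
        "file_type": "bigwig",
        "datatype": "bp",
        "name": "E079-H3K27me3",
        "id": "E079-H3K27me3",
        "annotation": {"group": "digestive", "tissue": "Esophagus",
                       "marker": "H3K27me3"},
    }
]

with open("roadmap.json", "w") as fh:
    json.dump(config, fh, indent=2)
```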
Once the configuration file is generated, we can import these measurements into the file server. We first create a MeasurementManager object which handles measurements from files and databases. We can then use the helper function import_files to import all measurements from this configuration file.
import os
from epivizfileserver import MeasurementManager, create_fileHandler

mMgr = MeasurementManager()
# the file handler enables parallel processing of requests (uses dask)
mHandler = create_fileHandler()
fmeasurements = mMgr.import_files(os.getcwd() + "/roadmap.json", mHandler)
fmeasurements
Query for a genomic location¶
After loading the measurements, we can query the object for data in a particular genomic region using the get_data function.
result, err = await fmeasurements[1].get_data("chr11", 10550488, 11554489)
result.head()
The response is a tuple: a DataFrame that contains the results, and an error if one occurred.
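Note that `get_data` is a coroutine: a Jupyter notebook allows top-level `await`, but in a plain script you need to drive it with an event loop. The sketch below uses a hypothetical `fake_get_data` coroutine standing in for `fmeasurements[1].get_data(...)` (which needs network access); the `(result, error)` tuple shape matches the library's response.

```python
import asyncio
import pandas

# Stand-in for a measurement's get_data coroutine. The real call would be:
#   result, err = await fmeasurements[1].get_data("chr11", 10550488, 11554489)
async def fake_get_data(chr, start, end):
    df = pandas.DataFrame({"chr": [chr], "start": [start], "end": [end]})
    return df, None

# Outside a notebook, wrap the coroutine with asyncio.run
result, err = asyncio.run(fake_get_data("chr11", 10550488, 11554489))
```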
Compute a Function over files¶
We can define new measurements that are computed by applying a NumPy function over the files loaded in the previous step.
Note
You can also write a custom statistical function that is applied to every row of the DataFrame. It must follow the same signature as any NumPy row-apply function.
As an example, we can calculate the average ChIP-seq expression for the H3K27me3 marker.
computed_measurement = mMgr.add_computed_measurement("computed", "avg_ChIP_seq", "Average ChIP seq expression",
measurements=fmeasurements, computeFunc=numpy.mean)
After defining a computed measurement, we can query this measurement for a genomic location.
result, err = await computed_measurement.get_data("chr11", 10550488, 11554489)
result.head()
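What `computeFunc` does can be sketched on toy data: the function is applied row-wise across the measurement columns, the way `ComputedMeasurement` applies `numpy.mean` above. The DataFrame below is invented stand-in data (not values queried from the bigwig files), and `log_mean` is a hypothetical custom statistic shown only to illustrate the row-apply signature.

```python
import numpy
import pandas

# Toy stand-in for data queried from the two bigwig measurements
df = pandas.DataFrame({
    "E079-H3K27me3": [1.0, 2.0, 3.0],
    "E106-H3K27me3": [3.0, 4.0, 5.0],
})

# equivalent of computeFunc=numpy.mean: one value per row (genomic bin)
avg = df.apply(numpy.mean, axis=1)

def log_mean(row):
    # hypothetical custom statistic: mean on a log scale
    return numpy.log1p(row).mean()

custom = df.apply(log_mean, axis=1)
```

Any function with the same array-in, scalar-out shape can be passed as `computeFunc`.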
Set up a REST API¶
Developers often want to include data from genomic files in a web application for visualization, or in their workflows. We can quickly set up a REST API web server from the measurements we loaded:
from epivizfileserver import setup_app
app = setup_app(mMgr)
app.run(port=8000)
The REST API is an asynchronous web server built on top of Sanic.
Query Files from AnnotationHub¶
We can also use Bioconductor's AnnotationHub to search for files and set up the file server. We are working on simplifying this process.
The AnnotationHub API is hosted at https://annotationhub.bioconductor.org/.
We first download the AnnotationHub SQLite database of available data resources.
wget http://annotationhub.bioconductor.org/metadata/annotationhub.sqlite3
After downloading the resource database from AnnotationHub, we can load the SQLite database into Python and query it for datasets.
import pandas
import os
import sqlite3
conn = sqlite3.connect("annotationhub.sqlite3")
cur = conn.cursor()
cur.execute("select * from resources r JOIN input_sources inp_src ON r.id = inp_src.resource_id;")
results = cur.fetchall()
resources = pandas.DataFrame(results, columns = ["id", "ah_id", "title", "dataprovider", "species", "taxonomyid", "genome",
                                                 "description", "coordinate_1_based", "maintainer", "status_id",
                                                 "location_prefix_id", "recipe_id", "rdatadateadded", "rdatadateremoved",
                                                 "record_id", "preparerclass", "id", "sourcesize", "sourceurl", "sourceversion",
                                                 "sourcemd5", "sourcelastmodifieddate", "resource_id", "source_type"])
resources.head()
For the purpose of the tutorial, we will filter for the Sigmoid Colon ("E106") and Esophagus ("E079") tissues, and the ChIP-seq data files for the "H3K27me3" histone marker from the Roadmap Epigenomics project.
roadmap = resources.query('dataprovider=="BroadInstitute" and genome=="hg19"')
roadmap = roadmap.query('title.str.contains("H3K27me3") and (title.str.contains("E106") or title.str.contains("E079"))', engine="python")
# only use fold-change (fc) signal files
roadmap = roadmap.query('title.str.contains("fc")', engine="python")
roadmap
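The filtering pattern can be tried on a toy stand-in for the AnnotationHub resource table (the titles below are invented for the example). One detail worth knowing: string methods like `.str.contains` inside `query()` require `engine="python"` when the numexpr engine is installed.

```python
import pandas

# Invented stand-in for the AnnotationHub resource table
resources = pandas.DataFrame({
    "title":        ["E079-H3K27me3.fc.signal.bigwig",
                     "E106-H3K27me3.fc.signal.bigwig",
                     "E003-H3K4me1.pval.signal.bigwig"],
    "dataprovider": ["BroadInstitute"] * 3,
    "genome":       ["hg19"] * 3,
})

# same chained filters as the tutorial, on the toy table
roadmap = resources.query('dataprovider=="BroadInstitute" and genome=="hg19"')
roadmap = roadmap.query('title.str.contains("H3K27me3")', engine="python")
roadmap = roadmap.query('title.str.contains("fc")', engine="python")
```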
After filtering for resources we are interested in, we can load them into the file server using the import_ahub helper function.
mMgr = MeasurementManager()
ahub_measurements = mMgr.import_ahub(roadmap)
ahub_measurements
The rest of the process is the same as described at the beginning of this tutorial.
Workspaces and Use Cases Built with the File Server¶
The following workspaces have been set up using the Epiviz File Server.
Deployment¶
Using the built-in Sanic server (for development)¶
Sanic provides a default asynchronous web server to run the API. For a working example, check out the Roadmap project from the Use Cases section.
app = setup_app(mMgr)
app.run("0.0.0.0", port=8000)
Deploy using gunicorn + supervisor (for production)¶
Set up the virtualenv and API¶
This process assumes the root API directory is /var/www/epiviz-api
Set up a virtualenv using either pip or conda
cd /var/www/epiviz-api
virtualenv env
source env/bin/activate
pip install epivizfileserver
A generic version of the API script would look something like this (add this to /var/www/epiviz-api/epiviz.py)
from epivizfileserver import setup_app, create_fileHandler, MeasurementManager
from epivizfileserver.trackhub import TrackHub
# create measurements to load multiple trackhubs or configuration files
mMgr = MeasurementManager()
# create file handler, enables parallel processing of multiple requests
mHandler = create_fileHandler()
# add genome. - for supported genomes
# check https://obj.umiacs.umd.edu/genomes/index.html
genome = mMgr.add_genome("mm10")
genome = mMgr.add_genome("hg19")
# load measurements/files through config or TrackHub
# setup the app from the measurements manager
# and run the app
app = setup_app(mMgr)
# only if this file is run directly!
if __name__ == "__main__":
app.run(host="127.0.0.1", port=8000)
Install dependencies¶
- Supervisor (system wide) - http://supervisord.org/
- Gunicorn (to the virtual environment) - https://gunicorn.org/
# if using ubuntu
sudo apt install supervisor
# activate virtualenv that runs the API
source /var/www/epiviz-api/env/bin/activate
pip install gunicorn
Configure supervisor¶
Add this configuration to /etc/supervisor/conf.d/epiviz.conf
This snippet also assumes epiviz-api repo is in /var/www/epiviz-api
[program:gunicorn]
directory=/var/www/epiviz-api
environment=PYTHONPATH=/var/www/epiviz-api
command=/var/www/epiviz-api/env/bin/gunicorn epiviz:app --log-level debug --bind 0.0.0.0:8000 --worker-class sanic.worker.GunicornWorker
autostart=true
autorestart=true
stderr_logfile=/var/log/gunicorn/gunicorn.err.log
stdout_logfile=/var/log/gunicorn/gunicorn.out.log
Enable Supervisor configuration
sudo supervisorctl reread
sudo supervisorctl update
service supervisor restart
Note
Check the status of supervisor to make sure there are no errors.
Add Proxypass to nginx/Apache¶
The port number here should match the bind port from the supervisor configuration.
For Apache:
sudo a2enmod proxy
sudo a2enmod proxy_http
# add this to the apache site config
ProxyPreserveHost On
<Location "/api">
ProxyPass "http://127.0.0.1:8000/"
ProxyPassReverse "http://127.0.0.1:8000/"
</Location>
For nginx:
# add this to nginx site config
upstream epiviz_api_server {
server 127.0.0.1:8000 fail_timeout=0;
}
location /api/ {
proxy_pass http://epiviz_api_server/;
proxy_set_header Host $host;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_redirect off;
}
License¶
The MIT License (MIT)
Copyright (c) 2019 Jayaram Kancherla
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Contributors¶
- Jayaram Kancherla <jayaram.kancherla@gmail.com>
- Yifan Yang <yang7832@umd.edu>
- Hector Corrada Bravo <hcorrada@gmail.com>
epivizfileserver¶
epivizfileserver package¶
Subpackages¶
epivizfileserver.client package¶
class epivizfileserver.client.EpivizClient.EpivizClient(server)
    Bases: object
    Client implementation of the epiviz server.
    Parameters: server – endpoint where the API is running

    get_data(measurement, chr, start, end)
        Get data for a genomic region from the API.
        Returns: a JSON with results

    version = 5
epivizfileserver.handler package¶
class epivizfileserver.handler.HandlerNoActor.FileHandlerProcess(fileTime, MAXWORKER, client=None)
    Bases: object
    Class to manage query, transformation and cache using dask distributed.

    records – a dictionary of all file objects
    client – asynchronous dask server client

    binFileData(fileName, data, chr, start, end, bins, columns, metadata)
        submit tasks to the dask client

    getRecord(name)
        get a file object from records by name
        Parameters: name (str) – file name
        Returns: file object

    handleFile(fileName, fileType, chr, start, end, bins=2000)
        submit tasks to the dask client
        Parameters: fileName – file location; fileType – file type; chr – chromosome;
        start – genomic start; end – genomic end; bins – number of base pairs to group per bin

    handleSearch(fileName, fileType, query, maxResults)
        submit tasks to the dask client
        Parameters: fileName – file location; fileType – file type; query – search query;
        maxResults – maximum number of results

class epivizfileserver.handler.handler.FileHandlerProcess(fileTime, MAXWORKER, client=None)
    Bases: object
    Class to manage query, transformation and cache using dask distributed.

    records – a dictionary of all file objects
    client – asynchronous dask server client

    binFileData(fileName, fileType, data, chr, start, end, bins, columns, metadata)
        submit tasks to the dask client

    getRecord(name)
        get a file object from records by name
        Parameters: name (str) – file name
        Returns: file object

    handleFile(fileName, fileType, chr, start, end, bins=2000)
        submit tasks to the dask client
        Parameters: fileName – file location; fileType – file type; chr – chromosome;
        start – genomic start; end – genomic end; bins – number of base pairs to group per bin

    handleSearch(fileName, fileType, query, maxResults)
        submit tasks to the dask client
        Parameters: fileName – file location; fileType – file type; query – search query;
        maxResults – maximum number of results
epivizfileserver.measurements package¶
class epivizfileserver.measurements.measurementClass.ComputedMeasurement(mtype, mid, name, measurements, source='computed', computeFunc=None, datasource='computed', genome=None, annotation={'group': 'computed'}, metadata=None, isComputed=True, isGenes=False, fileHandler=None, columns=None, computeAxis=1)
    Bases: epivizfileserver.measurements.measurementClass.Measurement
    Class for representing computed measurements.
    In addition to params on the base Measurement class:
    Parameters: computeFunc – a NumPy function to apply on the dataframe;
    source – defaults to 'computed'; datasource – defaults to 'computed'

    computeWrapper(computeFunc, columns)
        a wrapper for the computeFunc function
        Parameters: computeFunc – a NumPy compute function; columns – columns from the file to apply it to
        Returns: a dataframe with results

class epivizfileserver.measurements.measurementClass.DbMeasurement(mtype, mid, name, source, datasource, dbConn, genome=None, annotation=None, metadata=None, isComputed=False, isGenes=False, minValue=None, maxValue=None, columns=None)
    Bases: epivizfileserver.measurements.measurementClass.Measurement
    Class representing a database measurement.
    In addition to params from the base Measurement class:
    Parameters: dbConn – a database connection object

    connection – a database connection object

class epivizfileserver.measurements.measurementClass.FileMeasurement(mtype, mid, name, source, datasource='files', genome=None, annotation=None, metadata=None, isComputed=False, isGenes=False, minValue=None, maxValue=None, fileHandler=None, columns=None)
    Bases: epivizfileserver.measurements.measurementClass.Measurement
    Class for file-based measurements.
    In addition to params from the base Measurement class:
    Parameters: fileHandler – an optional file handler object to process query requests (uses dask)

    create_parser_object(type, name, columns=None)
        Create the appropriate file class based on file format.
        Returns: a file object

class epivizfileserver.measurements.measurementClass.Measurement(mtype, mid, name, source, datasource, genome=None, annotation=None, metadata=None, isComputed=False, isGenes=False, minValue=None, maxValue=None, columns=None)
    Bases: object
    Base class for managing measurements from files.
    Parameters:
    - mtype – measurement type, either 'file' or 'db'
    - mid – unique id to use for this measurement
    - name – name of the measurement
    - source – location of the measurement: the table name if mtype is 'db', else the file location
    - datasource – the database name if mtype is 'db', else 'files'
    - annotation – annotation for this measurement, defaults to None
    - metadata – metadata for this measurement, defaults to None
    - isComputed – True if this measurement is computed from other measurements, defaults to False
    - isGenes – True if this measurement is an annotation (for example, the reference genome hg19), defaults to False
    - minValue – min of all values, defaults to None
    - maxValue – max of all values, defaults to None
    - columns – column names for the file

    bin_rows_legacy(data, chr, start, end, bins=2000)
        Bin the genome by bin length and summarize each bin.
        Parameters: data – DataFrame from the file; chr – chromosome; start – genomic start;
        end – genomic end; bins – max rows to summarize the data frame into
        Returns: a binned data frame with at most bins rows

class epivizfileserver.measurements.measurementClass.WebServerMeasurement(mtype, mid, name, source, datasource, datasourceGroup, annotation=None, metadata=None, isComputed=False, isGenes=False, minValue=None, maxValue=None)
    Bases: epivizfileserver.measurements.measurementClass.Measurement
    Class representing a web server measurement.
    In addition to params from the base Measurement class, source is now the server API endpoint.

class epivizfileserver.measurements.measurementManager.EMDMeasurementMap(url, fileHandler)
    Bases: object
    Manage the mapping between measurements in EFS and the metadata service.

class epivizfileserver.measurements.measurementManager.MeasurementManager
    Bases: object
    Measurement manager class.

    measurements – list of all measurements managed by the system

    add_computed_measurement(mtype, mid, name, measurements, computeFunc, genome=None, annotation=None, metadata=None, computeAxis=1)
        Add a computed measurement.
        Parameters: mtype – measurement type, defaults to 'computed'; mid – measurement id;
        name – name for this measurement; measurements – list of measurements to use;
        computeFunc – NumPy function to apply
        Returns: a ComputedMeasurement object

    add_genome(genome, url='http://obj.umiacs.umd.edu/genomes/', type=None, fileHandler=None)
        Add a genome to the list of measurements. The genome has to be tabix-indexed for the
        file server to make remote queries. Our tabix-indexed files are available at
        https://obj.umiacs.umd.edu/genomes/index.html
        Parameters: genome – for example hg19 if type = "tabix", or the full location of the
        gtf file if type = "gtf"; genome_id – required if type = "gtf"; url – url to the genome file

    get_from_emd(url=None)
        Make a GET request to a metadata API.
        Parameters: url – the url of the epiviz-md api; if None, self.emd_endpoint is used if available

    import_ahub(ahub, handler=None)
        Import measurements from AnnotationHub objects.
        Parameters: ahub – list of file records from AnnotationHub; handler – an optional file handler to use

    import_dbm(dbConn)
        Import measurements from a database. The database needs a measurements_index table with
        information about the files imported into the database.
        Parameters: dbConn – a database connection

    import_emd(url, fileHandler=None, listen=True)
        Import measurements from an epiviz-md metadata service API.
        Parameters: url – the url of the epiviz-md api; handler – an optional file handler to use;
        listen – activate the 'updateCollections' endpoint to add measurements from the service upon request

    import_files(fileSource, fileHandler=None, genome=None)
        Import measurements from a configuration file.
        Parameters: fileSource – location of the configuration file to load;
        fileHandler – an optional file handler to use

    import_records(records, fileHandler=None, genome=None, skip=False)
        Import measurements from a list of records (usually from a decoded JSON string).
        Parameters: records – list of records to load; fileHandler – an optional file handler to use;
        genome – genome to use if it is missing from a measurement; skip – skip adding measurements to the manager

    import_trackhub(hub, handler=None)
        Import measurements from a trackhub.
        Parameters: hub – a trackhub to load; handler – an optional file handler to use
epivizfileserver.parser package¶
Genomics file classes.

class epivizfileserver.parser.BamFile.BamFile(file, columns=None)
    Bases: epivizfileserver.parser.SamFile.SamFile
    BAM file class to parse bam files.

    file – a pysam file object
    fileSrc – location of the file
    cacheData – cache of accessed data in memory
    columns – column names to use

    getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')
        Get data for a given genomic location.
        Returns: (result, error) – a DataFrame with matched regions from the input genomic
        location (an array if respType is not 'DataFrame'), and an error if any occurred

class epivizfileserver.parser.BaseFile.BaseFile(file)
    Bases: object
    Base file class for the parser module. This class provides various useful functions.
    Parameters: file – file location

    local – whether the file is local or hosted on a public server
    endian – endianness check
    HEADER_STRUCT = <Struct object>
    SUMMARY_STRUCT = <Struct object>

    bin_rows(data, chr, start, end, columns=None, metadata=None, bins=400)
        Bin the genome by bin length and summarize each bin.

    decompress_binary(bin_block)
        decompress a binary string
        Parameters: bin_block – binary string
        Returns: a zlib-decompressed binary string

    formatAsJSON(data)
        Encode a data object as JSON.
        Parameters: data – any data object to encode
        Returns: data encoded as JSON

    get_bytes(offset, size)
        Get bytes within a given range.
        Returns: binary string from offset to (offset + size)

class epivizfileserver.parser.BigBed.BigBed(file, columns=None)
    Bases: epivizfileserver.parser.BigWig.BigWig
    Bed file parser.
    Parameters: file (str) – bigbed file location

    get_autosql()
        parse the autosql stored in the file
        Returns: an array of columns in the file parsed from autosql

    magic = '0x8789F2EB'

class epivizfileserver.parser.BigWig.BigWig(file, columns=None)
    Bases: epivizfileserver.parser.BaseFile.BaseFile
    BigWig file parser.
    Parameters: file (str) – bigwig file location

    tree – chromosome tree parsed from the file
    columns – column names
    cacheData – locally cached data for this file

    daskWrapper(fileObj, chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='JSON')
        dask wrapper

    getId(chrmzone)
        Get the mapping of a chromosome to the id stored in the file.
        Parameters: chrmzone (str) – chromosome
        Returns: id in the file for the given chromosome

    getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', treedisk=None)
        Get data for a given genomic location.
        Returns: (result, error) – a DataFrame with matched regions from the input genomic
        location (an array if respType is not 'DataFrame'), and an error if any occurred

    getTree(zoomlvl)
        Get the chromosome tree for a given zoom level.
        Parameters: zoomlvl (int) – zoom level to get
        Returns: tree binary bytes

    getValues(chr, start, end, zoomlvl)
        Get data for a region. Note: do not use this directly, use getRange.
        Returns: data for the region

    getZoom(zoomlvl, binSize)
        Get the zoom record for the given bin size.
        Returns: zoom level

    get_autosql()
        parse the autosql in the file
        Returns: an array of columns in the file parsed from autosql

    locateTree(chrmId, start, end, zoomlvl, offset)
        Locate the tree for the given region.
        Returns: nodes in the stored R-tree

    magic = '0x888FFC26'

    parseLeafDataNode(chrmId, start, end, zoomlvl, rStartChromIx, rStartBase, rEndChromIx, rEndBase, rdataOffset, rDataSize)
        Parse an R-tree leaf node.

    readRtreeHeaderNode(zoomlvl)
        Parse an R-tree header node.
        Parameters: zoomlvl (int) – zoom level
        Returns: header node R-tree object

class epivizfileserver.parser.GWASBigBedPIP.GWASBigBedPIP(file, columns=None)
    Bases: epivizfileserver.parser.BigBed.BigBed
    Bed file parser.
    Parameters: file (str) – GWASBigBedPIP file location

    getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', treedisk=None)
        Get data for a given genomic location.
        Returns: (result, error) – a DataFrame with matched regions (an array if respType is
        not 'DataFrame'), and an error if any occurred

    magic = '0x8789F2EB'

class epivizfileserver.parser.GWASBigBedPval.GWASBigBedPval(file, columns=None)
    Bases: epivizfileserver.parser.BigBed.BigBed
    Bed file parser.
    Parameters: file (str) – GWASBigBedPval file location

    getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', treedisk=None)
        Get data for a given genomic location.
        Returns: (result, error) – a DataFrame with matched regions (an array if respType is
        not 'DataFrame'), and an error if any occurred

    magic = '0x8789F2EB'

class epivizfileserver.parser.GtfFile.GtfFile(file, columns=['chr', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'group'])
    Bases: object
    GTF file class to parse gtf/gff files.

    file – a pysam file object
    fileSrc – location of the file
    cacheData – cache of accessed data in memory
    columns – column names to use

    getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')
        Get data for a given genomic location.
        Returns: (result, error) – a DataFrame with matched regions (an array if respType is
        not 'DataFrame'), and an error if any occurred

class epivizfileserver.parser.GtfParsedFile.GtfParsedFile(file, columns=['chr', 'start', 'end', 'width', 'strand', 'geneid', 'exon_starts', 'exon_ends', 'gene'])
    Bases: object
    GTF file class to parse gtf/gff files.

    file – a pysam file object
    fileSrc – location of the file
    cacheData – cache of accessed data in memory
    columns – column names to use

    getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')
        Get data for a given genomic location.
        Returns: (result, error) – a DataFrame with matched regions (an array if respType is
        not 'DataFrame'), and an error if any occurred

class epivizfileserver.parser.GtfTabixFile.GtfTabixFile(file, columns=None)
    Bases: epivizfileserver.parser.SamFile.SamFile
    GTF file class to parse tabix-indexed gtf/gff files.

    file – a pysam file object
    fileSrc – location of the file
    cacheData – cache of accessed data in memory
    columns – column names to use

    getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', ensembl=True)
        Get data for a given genomic location.
        Returns: (result, error) – a DataFrame with matched regions (an array if respType is
        not 'DataFrame'), and an error if any occurred

class epivizfileserver.parser.HDF5File.HDF5File(file)
    Bases: object
    HDF5 file class to parse local hdf5 files only.

    file – a file object
    fileSrc – location of the file
    cacheData – cache of accessed data in memory
    columns – column names to use

    getRange(chr, start=None, end=None, row_names=None)
        Get data for a given genomic location.
        Returns: (result, error) – a DataFrame with matched regions, and an error if any occurred

class epivizfileserver.parser.InteractionBigBed.InteractionBigBed(file, columns=['chr', 'start', 'end', 'name', 'score', 'value', 'exp', 'color', 'region1chr', 'region1start', 'region1end', 'region1name', 'region1strand', 'region2chr', 'region2start', 'region2end', 'region2name', 'region2strand'])
    Bases: epivizfileserver.parser.BigBed.BigBed
    BigBed file parser for chromosome interaction data.
    Columns in the bed file are: chr, start, end, name, score, value (strength of interaction),
    exp, color, region1chr, region1start, region1end, region1name, region1strand, region2chr,
    region2start, region2end, region2name, region2strand.
    Parameters: file (str) – InteractionBigBed file location

    getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', treedisk=None)
        Get data for a given genomic location.
        Returns: (result, error) – a DataFrame with matched regions (an array if respType is
        not 'DataFrame'), and an error if any occurred

    magic = '0x8789F2EB'

class epivizfileserver.parser.SamFile.SamFile(file, columns=None)
    Bases: object
    SAM file class to parse sam files.

    file – a pysam file object
    fileSrc – location of the file
    cacheData – cache of accessed data in memory
    columns – column names to use

    getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')
        Get data for a given genomic location.
        Returns: (result, error) – a DataFrame with matched regions (an array if respType is
        not 'DataFrame'), and an error if any occurred

class epivizfileserver.parser.TbxFile.TbxFile(file, columns=['chr', 'start', 'end', 'width', 'strand', 'geneid', 'exon_starts', 'exon_ends', 'gene'])
    Bases: epivizfileserver.parser.SamFile.SamFile
    TBX file class to parse tabix-indexed files.

    file – a pysam file object
    fileSrc – location of the file
    cacheData – cache of accessed data in memory
    columns – column names to use

    getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')
        Get data for a given genomic location.
        Returns: (result, error) – a DataFrame with matched regions (an array if respType is
        not 'DataFrame'), and an error if any occurred

class epivizfileserver.parser.TileDB.TileDB(path)
    Bases: object
    TileDB class to parse local tiledb files only.
    Detail: the tiledb folder should contain:
    - a 'data.tiledb' directory – corresponds to the uri of a tiledb array. The tiledb array
      must have a 'vals' attribute from which values are read. The array should have as many
      rows as the number of lines in the 'rows' file, and as many columns as the number of
      lines in the 'cols' file.
    - a 'rows' file – a tab-separated value file describing the rows of the tiledb array. It
      must have as many lines as rows in the tiledb array, with no index column (i.e., it is
      read with pandas.read_csv(..., sep='\t', index_col=False)), and must have columns 'chr',
      'start' and 'end'. The rows file is indexed using tabix so the entire file is not loaded
      into memory. This file contains columns as annotated in the .json file.
    - a 'cols' file – a tab-separated value file describing the columns of the tiledb array.
      It must have as many lines as columns in the tiledb array. Column names for the tiledb
      array are taken from the first column of this file (i.e., it is read with
      pandas.read_csv(..., sep='\t', index_col=0)).

    getRange(chr, start=None, end=None, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', treedisk=None)
        Get data for a given genomic location.
        Returns: (result, error) – a DataFrame with matched regions (an array if respType is
        not 'DataFrame'), and an error if any occurred

class epivizfileserver.parser.TileDBTbxFile.TileDBTbxFile(file, columns=['chr', 'start', 'end', 'rownumber', 'gene'])
    Bases: epivizfileserver.parser.SamFile.SamFile
    TileDB-specific TBX file class to parse row files.

    file – a pysam file object
    fileSrc – location of the file
    cacheData – cache of accessed data in memory
    columns – column names to use

    getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')
        Get data for a given genomic location.
        Returns: (result, error) – a DataFrame with matched regions (an array if respType is
        not 'DataFrame'), and an error if any occurred

class epivizfileserver.parser.TranscriptTbxFile.TranscriptTbxFile(file, columns=['chr', 'start', 'end', 'strand', 'transcript_id', 'exon_starts', 'exon_ends', 'gene'])
    Bases: epivizfileserver.parser.TbxFile.TbxFile
    Class for tabix-indexed transcript files.

    file – a pysam file object
    fileSrc – location of the file
    cacheData – cache of accessed data in memory
    columns – column names to use
epivizfileserver.server package¶
The server module allows users to instantly create a REST API from a list of measurements. The API can then be used for interactive exploration of data or to build various applications.

class epivizfileserver.server.request.DataRequest(request)
    Bases: epivizfileserver.server.request.EpivizRequest
    Data request class.

class epivizfileserver.server.request.EpivizRequest(request)
    Bases: object
    Base class to process requests.

class epivizfileserver.server.request.MeasurementRequest(request)
    Bases: epivizfileserver.server.request.EpivizRequest
    Measurement request class.

class epivizfileserver.server.request.SearchRequest(request)
    Bases: epivizfileserver.server.request.EpivizRequest
    Search request class.

class epivizfileserver.server.request.SeqInfoRequest(request)
    Bases: epivizfileserver.server.request.EpivizRequest
    SeqInfo request class.

epivizfileserver.server.utils.bin_rows(input, max_rows=2000)
    Helper function to bin rows to a resolution.
    Parameters: input – dataframe to bin; max_rows – resolution to scale the rows to
    Returns: data frame with scaled rows

epivizfileserver.server.MAXWORKER = 10

epivizfileserver.server.create_fileHandler()
    create a dask file handler if one doesn't exist

epivizfileserver.server.schedulePickle()
    Sanic task to regularly pickle file objects from memory
epivizfileserver.trackhub package¶
class epivizfileserver.trackhub.TrackHub.TrackHub(file)
    Bases: object
    Base class for managing trackhub files. TrackHub documentation is available at
    https://genome.ucsc.edu/goldenPath/help/hgTrackHubHelp.html
    Parameters: file – location of the trackhub directory