epivizfileserver.parser package

Submodules

epivizfileserver.parser.BamFile module

class epivizfileserver.parser.BamFile.BamFile(file, columns=None)[source]

Bases: epivizfileserver.parser.SamFile.SamFile

Bam File Class to parse bam files

Parameters:
  • file (str) – file location can be local (full path) or hosted publicly
  • columns ([str]) – column names for various columns in file
file

a pysam file object

fileSrc

location of the file

cacheData

cache of accessed data in memory

columns

column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array

error

the error, if any occurred during the process

get_bin(x)[source]
get_col_names(result)[source]
to_DF(result)[source]
to_msgpack(result)[source]

epivizfileserver.parser.BaseFile module

Genomics file classes

class epivizfileserver.parser.BaseFile.BaseFile(file)[source]

Bases: object

Base file class for parser module

This class provides helpers shared by the file parsers: reading byte ranges from local or remote files, decompressing blocks, and binning rows.

Parameters:file – file location
local

whether the file is local or hosted on a public server

endian

endianness (byte order) of the file

HEADER_STRUCT = <Struct object>
SUMMARY_STRUCT = <Struct object>
bin_rows(data, chr, start, end, columns=None, metadata=None, bins=400)[source]

Bin genome by bin length and summarize the bin
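As a rough illustration of what binning and summarizing a region can look like, the sketch below splits a region into equal-width bins and averages the values of the rows overlapping each bin. The function name bin_and_average, the (start, end, value) row layout, and the averaging rule are assumptions made for this example; they are not taken from bin_rows itself.

```python
def bin_and_average(rows, start, end, bins=400):
    """Average the values of rows overlapping each equal-width bin.

    rows: iterable of (row_start, row_end, value) tuples (hypothetical layout).
    Returns a list of (bin_start, bin_end, mean_or_None) tuples.
    """
    # Integer bin width; when the region length is not a multiple of `bins`,
    # the loop below can emit one extra, shorter bin at the end.
    width = max(1, (end - start) // bins)
    result = []
    for bin_start in range(start, end, width):
        bin_end = min(bin_start + width, end)
        # Half-open interval overlap test between the row and the bin.
        values = [v for (s, e, v) in rows if s < bin_end and e > bin_start]
        mean = sum(values) / len(values) if values else None
        result.append((bin_start, bin_end, mean))
    return result


summary = bin_and_average([(0, 50, 2.0), (40, 100, 4.0)], 0, 100, bins=2)
```

Bins with no overlapping rows are reported with None rather than 0 so that missing data stays distinguishable from a true zero.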

decompress_binary(bin_block)[source]

Decompress a zlib-compressed binary string

Parameters:bin_block – binary string
Returns:a zlib decompressed binary string
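Since the return value is described as zlib-decompressed, a minimal round trip with Python's standard zlib module shows the idea. The helper name decompress_block and the sample payload are made up for this sketch.

```python
import zlib


def decompress_block(bin_block: bytes) -> bytes:
    """Return the zlib-decompressed bytes of bin_block."""
    return zlib.decompress(bin_block)


# Round trip on a small, made-up payload.
original = b"chr1\t100\t200\t0.5\n" * 4
compressed = zlib.compress(original)
restored = decompress_block(compressed)
```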
formatAsJSON(data)[source]

Encode a data object as JSON

Parameters:data – any data object to encode
Returns:data encoded as JSON
get_bytes(offset, size)[source]

Get bytes within a given range

Parameters:
  • offset (int) – byte start position in file
  • size (int) – size of bytes to access from offset
Returns:

binary string from offset to (offset + size)
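Reading a byte window from a local file can be sketched with seek and read. The names read_bytes and http_range_header are hypothetical. For a remotely hosted file, an HTTP Range request is one common way to fetch the same window; whether get_bytes_http works this way is not stated here, so treat the second helper as an assumption.

```python
import os
import tempfile


def read_bytes(path, offset, size):
    """Read `size` bytes starting at byte `offset` of a local file."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(size)


def http_range_header(offset, size):
    """Build a Range header for the same window (HTTP byte ranges are inclusive)."""
    return {"Range": f"bytes={offset}-{offset + size - 1}"}


# Demonstrate on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789abcdef")
    tmp_path = tmp.name

chunk = read_bytes(tmp_path, offset=4, size=6)
os.remove(tmp_path)
```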

get_bytes_http(offset, size)[source]
get_data(chr, start, end)[source]
get_status()[source]
is_local(file)[source]

Checks if file is local or hosted publicly

Parameters:file – location of file
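One plausible way to tell a local path from a publicly hosted file is to look at the URL scheme. The sketch below is an assumption about how such a check could be written, not the rule is_local applies; the function name looks_local and the list of remote schemes are illustrative.

```python
from urllib.parse import urlparse

# Schemes treated as remote in this sketch (assumed list).
REMOTE_SCHEMES = ("http", "https", "ftp")


def looks_local(file):
    """Return True unless `file` starts with a known remote URL scheme."""
    scheme = urlparse(file).scheme.lower()
    return scheme not in REMOTE_SCHEMES
```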
parse_header()[source]
parse_url(furl=None)[source]
parse_url_http(furl=None)[source]
simplified_bin_rows(data, chr, start, end, columns=None, metadata=None, bins=400)[source]

epivizfileserver.parser.BigBed module

class epivizfileserver.parser.BigBed.BigBed(file, columns=None)[source]

Bases: epivizfileserver.parser.BigWig.BigWig

BigBed file parser

Parameters:file (str) – bigbed file location
get_autosql()[source]

Parse the autosql stored in the file

Returns:an array of columns in file parsed from autosql
magic = '0x8789F2EB'
parseLeafDataNode(chrmId, start, end, zoomlvl, rStartChromIx, rStartBase, rEndChromIx, rEndBase, rdataOffset, rDataSize)[source]

Parse leaf node

epivizfileserver.parser.BigWig module

class epivizfileserver.parser.BigWig.BigWig(file, columns=None)[source]

Bases: epivizfileserver.parser.BaseFile.BaseFile

BigWig file parser

Parameters:file (str) – bigwig file location
tree

chromosome tree parsed from file

columns

column names

cacheData

locally cached data for this file

daskWrapper(fileObj, chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='JSON')[source]

Dask Wrapper

getHeader()[source]

Get the header byte region of the file

getId(chrmzone)[source]

Get mapping of chromosome to id stored in file

Parameters:chrmzone (str) – chromosome
Returns:id in file for the given chromosome
getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', treedisk=None)[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array

error

the error, if any occurred during the process

getTree(zoomlvl)[source]

Get chromosome tree for a given zoom level

Parameters:zoomlvl (int) – zoomlvl to get
Returns:Tree binary bytes
getTreeBytes(zoomlvl, start, size)[source]
getValues(chr, start, end, zoomlvl)[source]

Get data for a region

Note: Do not use this directly, use getRange

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
Returns:

data for the region

getZoom(zoomlvl, binSize)[source]

Get Zoom record for the given bin size

Parameters:
  • zoomlvl (int) – zoomlvl to get
  • binSize (int) – bin data by bin size
Returns:

zoom level

getZoomHeader(data)[source]
get_autosql()[source]

Parse the autosql stored in the file

Returns:an array of columns in file parsed from autosql
get_cache()[source]
locateTree(chrmId, start, end, zoomlvl, offset)[source]

Locate tree for the given region

Parameters:
  • chrmId (int) – id of the chromosome in the file
  • start (int) – genomic start
  • end (int) – genomic end
  • zoomlvl (int) – zoom level
  • offset (int) – offset position in the file
Returns:

nodes in the stored R-tree

magic = '0x888FFC26'
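A magic number like this can also be used to work out a file's byte order: unpack the first four bytes under each byte order and see which one yields the magic. The sketch below illustrates that general technique with Python's struct module; whether the parser performs the check this way is an assumption, and detect_byte_order is a hypothetical name.

```python
import struct

# The magic is listed as a hex string above; convert it to an integer.
BIGWIG_MAGIC = int("0x888FFC26", 16)


def detect_byte_order(first_four_bytes, magic=BIGWIG_MAGIC):
    """Return '<' (little) or '>' (big) if the bytes decode to `magic`, else None."""
    for order in ("<", ">"):
        (value,) = struct.unpack(order + "I", first_four_bytes)
        if value == magic:
            return order
    return None


# Example inputs: the magic written in each byte order.
little = struct.pack("<I", BIGWIG_MAGIC)
big = struct.pack(">I", BIGWIG_MAGIC)
```

The returned character can be used directly as the byte-order prefix of later struct format strings.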
parseLeafDataNode(chrmId, start, end, zoomlvl, rStartChromIx, rStartBase, rEndChromIx, rEndBase, rdataOffset, rDataSize)[source]

Parse an Rtree leaf node

parse_header(data=None)[source]

Parse the header of the file

Returns:attributes stored in the header
readRtreeHeaderNode(zoomlvl)[source]

Parse an Rtree Header node

Parameters:zoomlvl (int) – zoom level
Returns:header node Rtree object
readRtreeNode(zoomlvl, offset)[source]

Parse an Rtree node

Parameters:
  • zoomlvl (int) – zoom level
  • offset (int) – offset in the file
Returns:

node Rtree object

set_cache(cache)[source]
traverseRtreeNodes(node, zoomlvl, chrmId, start, end, result=[])[source]

Traverse an Rtree to get nodes in the given range
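The traversal idea can be sketched as a recursive descent that prunes any subtree whose bounding range misses the query. The dictionary-based node layout below (one chromosome per node, a "children" list on inner nodes) is a simplification invented for this example, not the on-disk R-tree structure. Note that the signature above shows result=[]; in Python a mutable default is shared between calls, so the sketch uses None and creates a fresh list instead.

```python
def overlaps(node, chrm_id, start, end):
    """True if the node's range intersects [start, end) on chrm_id."""
    return (node["chrm_id"] == chrm_id
            and node["start"] < end
            and node["end"] > start)


def traverse(node, chrm_id, start, end, result=None):
    """Collect leaf nodes whose range overlaps the query region."""
    if result is None:  # fresh list for each top-level call
        result = []
    if not overlaps(node, chrm_id, start, end):
        return result  # prune this subtree
    children = node.get("children")
    if not children:
        result.append(node)  # leaf node
    else:
        for child in children:
            traverse(child, chrm_id, start, end, result)
    return result


# Tiny hypothetical tree: one root with two leaves.
tree = {
    "chrm_id": 0, "start": 0, "end": 1000,
    "children": [
        {"chrm_id": 0, "start": 0, "end": 500, "name": "a"},
        {"chrm_id": 0, "start": 500, "end": 1000, "name": "b"},
    ],
}
hits = traverse(tree, chrm_id=0, start=450, end=480)
```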

epivizfileserver.parser.GWASBigBed module

class epivizfileserver.parser.GWASBigBed.GWASBigBed(file, columns=None)[source]

Bases: epivizfileserver.parser.BigBed.BigBed

BigBed file parser for GWAS data

Parameters:file (str) – GWASBigBed file location
getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', treedisk=None)[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array

error

the error, if any occurred during the process

magic = '0x8789F2EB'

epivizfileserver.parser.GtfFile module

class epivizfileserver.parser.GtfFile.GtfFile(file, columns=['chr', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'group'])[source]

Bases: object

GTF File Class to parse gtf/gff files

Parameters:
  • file (str) – file location can be local (full path) or hosted publicly
  • columns ([str]) – column names for various columns in file
file

a pysam file object

fileSrc

location of the file

cacheData

cache of accessed data in memory

columns

column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array

error

the error, if any occurred during the process

get_col_names()[source]
get_data(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]
parse_attribute(item, key)[source]
searchGene(query, maxResults=5)[source]
search_gene(query, maxResults=5)[source]
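parse_attribute(item, key) has no description here. In GTF files the last column holds semicolon-separated key "value" pairs, so a lookup of one key could be sketched as below. The function name attribute_value, the regular expression, and the sample attribute string are assumptions for illustration and may differ from what parse_attribute does.

```python
import re


def attribute_value(item, key):
    """Return the quoted value for `key` in a GTF attribute string, or None."""
    # Match the key at the start of the string or right after a ';',
    # followed by whitespace and a double-quoted value.
    pattern = r'(?:^|;)\s*' + re.escape(key) + r'\s+"([^"]*)"'
    match = re.search(pattern, item)
    return match.group(1) if match else None


attrs = 'gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_biotype "pseudogene";'
```

Escaping the key and requiring whitespace after it keeps a short key such as gene from matching gene_id.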

epivizfileserver.parser.GtfParsedFile module

class epivizfileserver.parser.GtfParsedFile.GtfParsedFile(file, columns=['chr', 'start', 'end', 'width', 'strand', 'geneid', 'exon_starts', 'exon_ends', 'gene'])[source]

Bases: object

GTF File Class to parse gtf/gff files

Parameters:
  • file (str) – file location can be local (full path) or hosted publicly
  • columns ([str]) – column names for various columns in file
file

a pysam file object

fileSrc

location of the file

cacheData

cache of accessed data in memory

columns

column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array

error

the error, if any occurred during the process

get_col_names()[source]
get_data(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]
parse_attribute(item, key)[source]
searchGene(query, maxResults=5)[source]
search_gene(query, maxResults=5)[source]

epivizfileserver.parser.GtfTabixFile module

class epivizfileserver.parser.GtfTabixFile.GtfTabixFile(file, columns=None)[source]

Bases: epivizfileserver.parser.SamFile.SamFile

GTF File Class to parse gtf/gff files

Parameters:
  • file (str) – file location can be local (full path) or hosted publicly
  • columns ([str]) – column names for various columns in file
file

a pysam file object

fileSrc

location of the file

cacheData

cache of accessed data in memory

columns

column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', ensembl=True)[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array

error

the error, if any occurred during the process

get_bin(x)[source]
get_col_names(result)[source]
toDF(result)[source]

epivizfileserver.parser.HDF5File module

class epivizfileserver.parser.HDF5File.HDF5File(file)[source]

Bases: object

HDF5 File Class to parse only local hdf5 files

Parameters:
  • file (str) – file location can be local (full path) or hosted publicly
  • columns ([str]) – column names for various columns in file
file

a pysam file object

fileSrc

location of the file

cacheData

cache of accessed data in memory

columns

column names to use

getRange(chr, start=None, end=None, row_names=None)[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array

error

the error, if any occurred during the process

read_10x_hdf5(chr, query_names)[source]

Read a 10xGenomics hdf5 file

Parameters:
  • chr (str) – chromosome
  • query_names ([str]) – genes to filter
Returns:

result

a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array

error

the error, if any occurred during the process

epivizfileserver.parser.Helper module

epivizfileserver.parser.Helper.get_range_helper(toDF, get_bin, get_col_names, chr, start, end, file_iter, columns, respType)[source]

epivizfileserver.parser.InteractionBigBed module

class epivizfileserver.parser.InteractionBigBed.InteractionBigBed(file, columns=['chr', 'start', 'end', 'name', 'score', 'value', 'exp', 'color', 'region1chr', 'region1start', 'region1end', 'region1name', 'region1strand', 'region2chr', 'region2start', 'region2end', 'region2name', 'region2strand'])[source]

Bases: epivizfileserver.parser.BigBed.BigBed

BigBed file parser for chromosome interaction Data

Columns in the bed file are

(chr, start, end, name, score, value (strength of interaction, same as value), exp, color, region1chr, region1start, region1end, region1name, region1strand, region2chr, region2start, region2end, region2name, region2strand)
Parameters:file (str) – InteractionBigBed file location
getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', treedisk=None)[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array

error

the error, if any occurred during the process

magic = '0x8789F2EB'

epivizfileserver.parser.SamFile module

class epivizfileserver.parser.SamFile.SamFile(file, columns=None)[source]

Bases: object

SAM File Class to parse sam files

Parameters:
  • file (str) – file location can be local (full path) or hosted publicly
  • columns ([str]) – column names for various columns in file
file

a pysam file object

fileSrc

location of the file

cacheData

cache of accessed data in memory

columns

column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array

error

the error, if any occurred during the process

get_bin(x)[source]
get_cache()[source]
get_col_names(result)[source]
set_cache(cache)[source]
toDF(result)[source]

epivizfileserver.parser.TbxFile module

class epivizfileserver.parser.TbxFile.TbxFile(file, columns=['chr', 'start', 'end', 'width', 'strand', 'geneid', 'exon_starts', 'exon_ends', 'gene'])[source]

Bases: epivizfileserver.parser.SamFile.SamFile

TBX File Class to parse tbx files

Parameters:
  • file (str) – file location can be local (full path) or hosted publicly
  • columns ([str]) – column names for various columns in file
file

a pysam file object

fileSrc

location of the file

cacheData

cache of accessed data in memory

columns

column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array

error

the error, if any occurred during the process

get_bin(x)[source]
get_col_names(result)[source]
get_data(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]
searchGene(query, maxResults=5)[source]
toDF(result)[source]

epivizfileserver.parser.TileDB module

class epivizfileserver.parser.TileDB.TileDB(path)[source]

Bases: object

TileDB Class to parse only local tiledb files

Parameters:
  • path (str) – local full path to a dataset tiledb_folder. This folder should contain data.tiledb, rows and cols files. See below for more detail.
  • columns ([str]) – column names for various columns in file
Detail:
The tiledb_folder should contain:

‘data.tiledb’ directory - corresponds to the uri of a tiledb array. The tiledb array must have a ‘vals’ attribute from which values are read. The array should have as many rows as the number of lines in the ‘rows’ file, and as many columns as the number of lines in the ‘cols’ file.

‘rows’ file - a tab-separated value file describing the rows of the tiledb array. It must have as many lines as there are rows in the tiledb array. There should be no index column in this file (i.e., it is read with pandas.read_csv(…, sep='\t', index_col=False)). It must have columns ‘chr’, ‘start’ and ‘end’.

‘cols’ file - a tab-separated value file describing the columns of the tiledb array. It must have as many lines as there are columns in the tiledb array. Column names for the tiledb array are obtained from the first column in this file (i.e., it is read with pandas.read_csv(…, sep='\t', index_col=0)).
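Before pointing the parser at a dataset folder, the layout described above can be checked with a few lines of standard-library Python. This is only a sketch: check_tiledb_folder is a hypothetical helper, and it verifies that the three entries exist without inspecting their contents.

```python
import tempfile
from pathlib import Path


def check_tiledb_folder(path):
    """Return the names of expected entries missing from a dataset folder."""
    folder = Path(path)
    missing = []
    # 'data.tiledb' is described as a directory holding the tiledb array.
    if not (folder / "data.tiledb").is_dir():
        missing.append("data.tiledb")
    # 'rows' and 'cols' are described as tab-separated files.
    for name in ("rows", "cols"):
        if not (folder / name).is_file():
            missing.append(name)
    return missing


# Build an incomplete folder on purpose: no 'cols' file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data.tiledb").mkdir()
    (Path(tmp) / "rows").write_text("chr\tstart\tend\nchr1\t0\t100\n")
    missing = check_tiledb_folder(tmp)
```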

getRange(chr, start=None, end=None, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', treedisk=None)[source]

Get data for a given genomic location

Parameters:
  • chr (str) – chromosome
  • start (int) – genomic start
  • end (int) – genomic end
  • respType (str) – result format type, default is “DataFrame”
Returns:

result

a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array

error

the error, if any occurred during the process

epivizfileserver.parser.TranscriptTbxFile module

class epivizfileserver.parser.TranscriptTbxFile.TranscriptTbxFile(file, columns=['chr', 'start', 'end', 'strand', 'transcript_id', 'exon_starts', 'exon_ends', 'gene'])[source]

Bases: epivizfileserver.parser.TbxFile.TbxFile

Class for tabix indexed transcript files

Parameters:
  • file (str) – file location can be local (full path) or hosted publicly
  • columns ([str]) – column names for various columns in file
file

a pysam file object

fileSrc

location of the file

cacheData

cache of accessed data in memory

columns

column names to use

epivizfileserver.parser.utils module

epivizfileserver.parser.utils.create_parser_object(format, source, columns=None)[source]

Create appropriate File class based on file format

Parameters:
  • format (str) – format of file
  • source (str) – location of file
Returns:

An instance of parser class
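Choosing a parser class from a format string is commonly done with a lookup table. The sketch below shows that pattern with stand-in classes; the format names, the stub classes, and the make_parser function are all hypothetical, since the formats accepted by create_parser_object are not listed here.

```python
class BigWigStub:
    """Stand-in for a parser class; it only records its arguments."""

    def __init__(self, source, columns=None):
        self.source = source
        self.columns = columns


class BigBedStub(BigWigStub):
    """Second stand-in, to show dispatch between classes."""


# Hypothetical format names mapped to parser classes.
PARSERS = {"bigwig": BigWigStub, "bigbed": BigBedStub}


def make_parser(format, source, columns=None):
    """Instantiate the parser registered for `format` (case-insensitive)."""
    try:
        parser_class = PARSERS[format.lower()]
    except KeyError:
        raise ValueError(f"unsupported format: {format!r}") from None
    return parser_class(source, columns)


parser = make_parser("BigWig", "/data/signal.bw")
```

Raising ValueError for an unknown format makes a misconfigured file entry fail early instead of surfacing later as an attribute error.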

epivizfileserver.parser.utils.toDataFrame(records, header=None)[source]

Module contents