epivizfileserver.parser package¶

Submodules¶

epivizfileserver.parser.BamFile module¶

class epivizfileserver.parser.BamFile.BamFile(file, columns=None)[source]¶

Bases: epivizfileserver.parser.SamFile.SamFile

Bam File Class to parse bam files

Parameters:

file (str) – file location; can be local (full path) or hosted publicly
columns ([str]) – column names for various columns in file

file¶: a pysam file object

fileSrc¶: location of the file

cacheData¶: cache of accessed data in memory

columns¶: column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]¶

Get data for a given genomic location

Parameters:

chr (str) – chromosome
start (int) – genomic start
end (int) – genomic end
respType (str) – result format type, default is “DataFrame”

Returns:

result: a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array
error: if there was any error during the process

get_bin(x)[source]¶

get_col_names(result)[source]¶

to_DF(result)[source]¶

to_msgpack(result)[source]¶

epivizfileserver.parser.BaseFile module¶

Genomics file classes

class epivizfileserver.parser.BaseFile.BaseFile(file)[source]¶

Bases: object

Base file class for parser module

This class provides various useful functions

Parameters:	file – file location

local¶: if file is local or hosted on a public server

endian¶: check for endianness

HEADER_STRUCT = <Struct object>¶

SUMMARY_STRUCT = <Struct object>¶

bin_rows(data, chr, start, end, columns=None, metadata=None, bins=400)[source]¶: Bin genome by bin length and summarize the bin
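
As an illustration of the idea behind bin_rows, the sketch below splits a region into equal-width bins and averages the values that overlap each bin. The function name bin_region, the (start, end, value) record layout, and the overlap-weighted mean are assumptions made for this example; the library's own summary logic may differ.

```python
def bin_region(records, start, end, bins=400):
    """Summarize (start, end, value) records into equal-width bins.

    Each bin reports the overlap-weighted mean of the values covering it,
    or None when no record overlaps the bin.
    """
    width = (end - start) / bins
    out = []
    for i in range(bins):
        b_start = start + i * width
        b_end = b_start + width
        total = 0.0
        covered = 0.0
        for r_start, r_end, value in records:
            # Length of the intersection between the record and the bin.
            overlap = min(b_end, r_end) - max(b_start, r_start)
            if overlap > 0:
                total += value * overlap
                covered += overlap
        out.append((b_start, b_end, total / covered if covered else None))
    return out


# Hypothetical input: two adjacent intervals with different values.
rows = [(0, 50, 2.0), (50, 100, 4.0)]
summary = bin_region(rows, 0, 100, bins=4)
```

Requesting fewer bins than there are records trades resolution for a smaller response, which is the usual reason to bin before sending data to a browser.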

decompress_binary(bin_block)[source]¶

Decompress a binary string

Parameters:	bin_block – binary string
Returns:	a zlib decompressed binary string
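
A minimal sketch of zlib decompression as described above, using only the standard library. The helper name decompress_block and the sample payload are made up for illustration.

```python
import zlib


def decompress_block(bin_block: bytes) -> bytes:
    """Return the zlib-decompressed contents of a binary block."""
    return zlib.decompress(bin_block)


# Round trip on a made-up payload to show the expected behaviour.
payload = b"chr1\t100\t200\t0.5\n" * 3
compressed = zlib.compress(payload)
restored = decompress_block(compressed)
```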

formatAsJSON(data)[source]¶

Encode a data object as JSON

Parameters:	data – any data object to encode
Returns:	data encoded as JSON

get_bytes(offset, size)[source]¶

Get bytes within a given range

Parameters:

offset (int) – byte start position in file
size (int) – number of bytes to access from offset

Returns:	binary string from offset to (offset + size)
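
The byte-range access described above can be sketched for a local file with seek and read. The helper names read_bytes and range_header are hypothetical. The second helper shows the standard HTTP Range header covering the same bytes; whether get_bytes_http builds its request this way is an assumption.

```python
import os
import tempfile


def read_bytes(path, offset, size):
    """Read `size` bytes starting at `offset` from a local file."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(size)


def range_header(offset, size):
    """HTTP Range header for the same span (both ends are inclusive)."""
    return {"Range": f"bytes={offset}-{offset + size - 1}"}


# Demonstrate on a temporary file with known contents.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    tmp_path = tmp.name

chunk = read_bytes(tmp_path, 3, 4)
os.remove(tmp_path)
```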

get_bytes_http(offset, size)[source]¶

get_data(chr, start, end)[source]¶

get_status()[source]¶

is_local(file)[source]¶

Checks if file is local or hosted publicly

Parameters:	file – location of file
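
One way to tell a local path from a hosted file is to inspect the URL scheme. The sketch below is an assumption about how such a check could work, not the library's implementation; looks_local and REMOTE_SCHEMES are hypothetical names.

```python
from urllib.parse import urlparse

# Schemes treated as remote in this sketch (an assumption).
REMOTE_SCHEMES = {"http", "https", "ftp"}


def looks_local(location: str) -> bool:
    """Return True unless the location uses a known remote URL scheme."""
    return urlparse(location).scheme.lower() not in REMOTE_SCHEMES
```

For example, looks_local("/data/sample.bw") is True, while a location starting with https:// is reported as remote.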

parse_header()[source]¶

parse_url(furl=None)[source]¶

parse_url_http(furl=None)[source]¶

simplified_bin_rows(data, chr, start, end, columns=None, metadata=None, bins=400)[source]¶

epivizfileserver.parser.BigBed module¶

class epivizfileserver.parser.BigBed.BigBed(file, columns=None)[source]¶

Bases: epivizfileserver.parser.BigWig.BigWig

Bed file parser

Parameters:	file (str) – bigbed file location

get_autosql()[source]¶

Parse autosql stored in file

Returns:	an array of columns in file parsed from autosql
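
An autoSql declaration lists one field per line as a type, a name, and a quoted comment. The sketch below extracts the field names with a regular expression. The sample declaration and the helper name autosql_columns are made up for illustration, and the pattern only covers simple field lines.

```python
import re

# Made-up autoSql declaration used only for this example.
AUTOSQL = '''table example
"Example item"
    (
    string chrom;       "Chromosome"
    uint chromStart;    "Start position"
    uint chromEnd;      "End position"
    string name;        "Name of item"
    )
'''

# A field line is "<type> <name>;"; the type may carry a size such as char[1].
FIELD = re.compile(r'^\s*\w+(?:\[\w+\])?\s+(\w+)\s*;', re.MULTILINE)


def autosql_columns(text):
    """Return the field names declared in an autoSql string, in order."""
    return FIELD.findall(text)


columns = autosql_columns(AUTOSQL)
```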

magic = '0x8789F2EB'¶

parseLeafDataNode(chrmId, start, end, zoomlvl, rStartChromIx, rStartBase, rEndChromIx, rEndBase, rdataOffset, rDataSize)[source]¶: Parse leaf node

epivizfileserver.parser.BigWig module¶

class epivizfileserver.parser.BigWig.BigWig(file, columns=None)[source]¶

Bases: epivizfileserver.parser.BaseFile.BaseFile

BigWig file parser

Parameters:	file (str) – bigwig file location

tree¶: chromosome tree parsed from file

columns¶: column names

cacheData¶: locally cached data for this file

daskWrapper(fileObj, chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='JSON')[source]¶: Dask Wrapper

getHeader()[source]¶: get header byte region in file

getId(chrmzone)[source]¶

Get mapping of chromosome to id stored in file

Parameters:	chrmzone (str) – chromosome
Returns:	id in file for the given chromosome

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', treedisk=None)[source]¶

Get data for a given genomic location

Parameters:

chr (str) – chromosome
start (int) – genomic start
end (int) – genomic end
respType (str) – result format type, default is “DataFrame”

Returns:

result: a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array
error: if there was any error during the process

getTree(zoomlvl)[source]¶

Get chromosome tree for a given zoom level

Parameters:	zoomlvl (int) – zoomlvl to get
Returns:	Tree binary bytes

getTreeBytes(zoomlvl, start, size)[source]¶

getValues(chr, start, end, zoomlvl)[source]¶

Get data for a region

Note: Do not use this directly, use getRange

Parameters:

chr (str) – chromosome
start (int) – genomic start
end (int) – genomic end

Returns:	data for the region

getZoom(zoomlvl, binSize)[source]¶

Get Zoom record for the given bin size

Parameters:

zoomlvl (int) – zoomlvl to get
binSize (int) – bin data by bin size

Returns:	zoom level
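
A common way to choose a zoom level is to take the coarsest level whose reduction (bases per summary record) still fits within the requested bin size. The sketch below shows that rule. It is an assumption for illustration, not necessarily the rule getZoom applies; pick_zoom and the use of -1 to mean "raw data" are hypothetical.

```python
def pick_zoom(reduction_levels, bin_size):
    """Index of the coarsest level whose reduction fits in bin_size.

    Returns -1 when every level is coarser than bin_size, meaning the
    unsummarized data should be read instead.
    """
    best = -1
    best_reduction = 0
    for index, reduction in enumerate(reduction_levels):
        if best_reduction < reduction <= bin_size:
            best, best_reduction = index, reduction
    return best


# Made-up reduction levels, one per zoom level stored in a file.
levels = [10, 40, 160, 640]
```

With these levels, a bin size of 200 selects index 2 (reduction 160), and a bin size of 5 falls back to -1.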

getZoomHeader(data)[source]¶

get_autosql()[source]¶

Parse autosql in file

Returns:	an array of columns in file parsed from autosql

get_cache()[source]¶

locateTree(chrmId, start, end, zoomlvl, offset)[source]¶

Locate tree for the given region

Parameters:

chrmId (int) – chromosome
start (int) – genomic start
end (int) – genomic end
zoomlvl (int) – zoom level
offset (int) – offset position in the file

Returns:	nodes in the stored R-tree

magic = '0x888FFC26'¶
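
The magic value is the signature held in the first four bytes of a bigWig file (0x888FFC26) or a bigBed file (0x8789F2EB). Reading those bytes in both byte orders shows which one the file uses. The sketch below illustrates the check; detect_endian is a hypothetical helper, and how the library combines magic with its endian attribute is an assumption.

```python
import struct

BIGWIG_MAGIC = 0x888FFC26
BIGBED_MAGIC = 0x8789F2EB


def detect_endian(first_four: bytes, magic: int):
    """Return '<' or '>' if the bytes encode `magic`, else None."""
    for prefix in ("<", ">"):
        (value,) = struct.unpack(prefix + "I", first_four)
        if value == magic:
            return prefix
    return None


# Simulated file prefixes in each byte order.
little = struct.pack("<I", BIGWIG_MAGIC)
big = struct.pack(">I", BIGWIG_MAGIC)
```

The returned prefix can be reused as the byte-order character in later struct format strings when parsing the rest of the header.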

parseLeafDataNode(chrmId, start, end, zoomlvl, rStartChromIx, rStartBase, rEndChromIx, rEndBase, rdataOffset, rDataSize)[source]¶: Parse an Rtree leaf node

parse_header(data=None)[source]¶

Parse header in file

Returns:	attributes stored in the header

readRtreeHeaderNode(zoomlvl)[source]¶

Parse an Rtree Header node

Parameters:	zoomlvl (int) – zoom level
Returns:	header node Rtree object

readRtreeNode(zoomlvl, offset)[source]¶

Parse an Rtree node

Parameters:

zoomlvl (int) – zoom level
offset (int) – offset in the file

Returns:	node Rtree object

set_cache(cache)[source]¶

traverseRtreeNodes(node, zoomlvl, chrmId, start, end, result=[])[source]¶: Traverse an Rtree to get nodes in the given range
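
The traversal can be pictured as a depth-first walk that descends only into nodes whose bounding span overlaps the query. The sketch below uses plain dictionaries keyed by the span names from the parseLeafDataNode signature; the dictionary layout, the "children" key, and the helper names are assumptions. It also uses result=None rather than a mutable default, so separate calls never share one list.

```python
def overlaps(node, chrm_id, start, end):
    """True when the node's bounding span intersects chrm_id:start-end."""
    node_start = (node["rStartChromIx"], node["rStartBase"])
    node_end = (node["rEndChromIx"], node["rEndBase"])
    return node_start < (chrm_id, end) and node_end > (chrm_id, start)


def collect_leaves(node, chrm_id, start, end, result=None):
    """Depth-first walk that gathers overlapping leaf nodes."""
    if result is None:  # fresh list per top-level call
        result = []
    if not overlaps(node, chrm_id, start, end):
        return result
    children = node.get("children")
    if not children:
        result.append(node)
        return result
    for child in children:
        collect_leaves(child, chrm_id, start, end, result)
    return result


def span(chrom, start, end, children=None):
    """Build a toy node covering chrom:start-end."""
    return {"rStartChromIx": chrom, "rStartBase": start,
            "rEndChromIx": chrom, "rEndBase": end,
            "children": children or []}


root = span(0, 0, 1000, [span(0, 0, 500), span(0, 500, 1000)])
hits = collect_leaves(root, 0, 100, 200)
```

Comparing (chromosome index, base) tuples lets one overlap test cover nodes that span several chromosomes.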

epivizfileserver.parser.GWASBigBed module¶

class epivizfileserver.parser.GWASBigBed.GWASBigBed(file, columns=None)[source]¶

Bases: epivizfileserver.parser.BigBed.BigBed

Bed file parser

Parameters:	file (str) – GWASBigBed file location

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', treedisk=None)[source]¶

Get data for a given genomic location

Parameters:

chr (str) – chromosome
start (int) – genomic start
end (int) – genomic end
respType (str) – result format type, default is “DataFrame”

Returns:

result: a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array
error: if there was any error during the process

magic = '0x8789F2EB'¶

epivizfileserver.parser.GtfFile module¶

class epivizfileserver.parser.GtfFile.GtfFile(file, columns=['chr', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'group'])[source]¶

Bases: object

GTF File Class to parse gtf/gff files

Parameters:

file (str) – file location; can be local (full path) or hosted publicly
columns ([str]) – column names for various columns in file

file¶: a pysam file object

fileSrc¶: location of the file

cacheData¶: cache of accessed data in memory

columns¶: column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]¶

Get data for a given genomic location

Parameters:

chr (str) – chromosome
start (int) – genomic start
end (int) – genomic end
respType (str) – result format type, default is “DataFrame”

Returns:

result: a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array
error: if there was any error during the process

get_col_names()[source]¶

get_data(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]¶

parse_attribute(item, key)[source]¶
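
In GTF files the last column (named group in the default columns above) holds key "value"; pairs. A lookup by key could be sketched as follows. The helper name attribute_value, the sample string, and returning None for a missing key are assumptions; GFF3-style key=value attributes are not handled here.

```python
import re


def attribute_value(item, key):
    """Return the quoted value for `key` in a GTF attribute string."""
    pattern = r'(?:^|;)\s*' + re.escape(key) + r'\s+"([^"]*)"'
    match = re.search(pattern, item)
    return match.group(1) if match else None


# Made-up attribute string in GTF style.
group = 'gene_id "G1"; gene_name "ABC"; gene_biotype "pseudogene";'
```

Here attribute_value(group, "gene_name") yields "ABC", and an absent key yields None.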

searchGene(query, maxResults=5)[source]¶

search_gene(query, maxResults=5)[source]¶

epivizfileserver.parser.GtfParsedFile module¶

class epivizfileserver.parser.GtfParsedFile.GtfParsedFile(file, columns=['chr', 'start', 'end', 'width', 'strand', 'geneid', 'exon_starts', 'exon_ends', 'gene'])[source]¶

Bases: object

GTF File Class to parse gtf/gff files

Parameters:

file (str) – file location; can be local (full path) or hosted publicly
columns ([str]) – column names for various columns in file

file¶: a pysam file object

fileSrc¶: location of the file

cacheData¶: cache of accessed data in memory

columns¶: column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]¶

Get data for a given genomic location

Parameters:

chr (str) – chromosome
start (int) – genomic start
end (int) – genomic end
respType (str) – result format type, default is “DataFrame”

Returns:

result: a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array
error: if there was any error during the process

get_col_names()[source]¶

get_data(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]¶

parse_attribute(item, key)[source]¶

searchGene(query, maxResults=5)[source]¶

search_gene(query, maxResults=5)[source]¶

epivizfileserver.parser.GtfTabixFile module¶

class epivizfileserver.parser.GtfTabixFile.GtfTabixFile(file, columns=None)[source]¶

Bases: epivizfileserver.parser.SamFile.SamFile

GTF File Class to parse gtf/gff files

Parameters:

file (str) – file location; can be local (full path) or hosted publicly
columns ([str]) – column names for various columns in file

file¶: a pysam file object

fileSrc¶: location of the file

cacheData¶: cache of accessed data in memory

columns¶: column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', ensembl=True)[source]¶

Get data for a given genomic location

Parameters:

chr (str) – chromosome
start (int) – genomic start
end (int) – genomic end
respType (str) – result format type, default is “DataFrame”

Returns:

result: a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array
error: if there was any error during the process

get_bin(x)[source]¶

get_col_names(result)[source]¶

toDF(result)[source]¶

epivizfileserver.parser.HDF5File module¶

class epivizfileserver.parser.HDF5File.HDF5File(file)[source]¶

Bases: object

HDF5 File Class to parse only local hdf5 files

Parameters:

file (str) – file location; can be local (full path) or hosted publicly
columns ([str]) – column names for various columns in file

file¶: a pysam file object

fileSrc¶: location of the file

cacheData¶: cache of accessed data in memory

columns¶: column names to use

getRange(chr, start=None, end=None, row_names=None)[source]¶

Get data for a given genomic location

Parameters:

chr (str) – chromosome
start (int) – genomic start
end (int) – genomic end
respType (str) – result format type, default is “DataFrame”

Returns:

result: a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array
error: if there was any error during the process

read_10x_hdf5(chr, query_names)[source]¶

Read a 10x Genomics HDF5 file

Parameters:

chr (str) – chromosome
query_names ([str]) – genes to filter

Returns:

result: a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array
error: if there was any error during the process

epivizfileserver.parser.Helper module¶

epivizfileserver.parser.Helper.get_range_helper(toDF, get_bin, get_col_names, chr, start, end, file_iter, columns, respType)[source]¶

epivizfileserver.parser.InteractionBigBed module¶

class epivizfileserver.parser.InteractionBigBed.InteractionBigBed(file, columns=['chr', 'start', 'end', 'name', 'score', 'value', 'exp', 'color', 'region1chr', 'region1start', 'region1end', 'region1name', 'region1strand', 'region2chr', 'region2start', 'region2end', 'region2name', 'region2strand'])[source]¶

Bases: epivizfileserver.parser.BigBed.BigBed

BigBed file parser for chromosome interaction Data

Columns in the bed file are

(chr, start, end, name, score, value (strength of interaction, same as value), exp, color, region1chr, region1start, region1end, region1name, region1strand, region2chr, region2start, region2end, region2name, region2strand)

Parameters:	file (str) – InteractionBigBed file location

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', treedisk=None)[source]¶

Get data for a given genomic location

Parameters:

chr (str) – chromosome
start (int) – genomic start
end (int) – genomic end
respType (str) – result format type, default is “DataFrame”

Returns:

result: a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array
error: if there was any error during the process

magic = '0x8789F2EB'¶

epivizfileserver.parser.SamFile module¶

class epivizfileserver.parser.SamFile.SamFile(file, columns=None)[source]¶

Bases: object

SAM File Class to parse sam files

Parameters:

file (str) – file location; can be local (full path) or hosted publicly
columns ([str]) – column names for various columns in file

file¶: a pysam file object

fileSrc¶: location of the file

cacheData¶: cache of accessed data in memory

columns¶: column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]¶

Get data for a given genomic location

Parameters:

chr (str) – chromosome
start (int) – genomic start
end (int) – genomic end
respType (str) – result format type, default is “DataFrame”

Returns:

result: a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array
error: if there was any error during the process

get_bin(x)[source]¶

get_cache()[source]¶

get_col_names(result)[source]¶

set_cache(cache)[source]¶

toDF(result)[source]¶

epivizfileserver.parser.TbxFile module¶

class epivizfileserver.parser.TbxFile.TbxFile(file, columns=['chr', 'start', 'end', 'width', 'strand', 'geneid', 'exon_starts', 'exon_ends', 'gene'])[source]¶

Bases: epivizfileserver.parser.SamFile.SamFile

TBX File Class to parse tbx files

Parameters:

file (str) – file location; can be local (full path) or hosted publicly
columns ([str]) – column names for various columns in file

file¶: a pysam file object

fileSrc¶: location of the file

cacheData¶: cache of accessed data in memory

columns¶: column names to use

getRange(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]¶

Get data for a given genomic location

Parameters:

chr (str) – chromosome
start (int) – genomic start
end (int) – genomic end
respType (str) – result format type, default is “DataFrame”

Returns:

result: a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array
error: if there was any error during the process

get_bin(x)[source]¶

get_col_names(result)[source]¶

get_data(chr, start, end, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame')[source]¶

searchGene(query, maxResults=5)[source]¶

toDF(result)[source]¶

epivizfileserver.parser.TileDB module¶

class epivizfileserver.parser.TileDB.TileDB(path)[source]¶

Bases: object

TileDB Class to parse only local tiledb files

Parameters:

path (str) – local full path to a dataset tiledb_folder. This folder should contain data.tiledb, rows and cols files. See below for more detail.
columns ([str]) – column names for various columns in file

Detail:

The tiledb_folder should contain:

‘data.tiledb’ directory - corresponds to the uri of a tiledb array. The tiledb array must have a ‘vals’ attribute from which values are read. The array should have as many rows as the number of lines in the ‘rows’ file, and as many columns as the number of lines in the ‘cols’ file.

‘rows’ file - a tab-separated value file describing the rows of the tiledb array. It must have as many lines as rows in the tiledb file. There should be no index column in this file (i.e., it is read with pandas.read_csv(…, sep='\t', index_col=False)). It must have columns ‘chr’, ‘start’ and ‘end’.

‘cols’ file - a tab-separated value file describing the columns of the tiledb array. It must have as many lines as columns in the tiledb file. Column names for the tiledb array are taken from the first column in this file (i.e., it is read with pandas.read_csv(…, sep='\t', index_col=0)).
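
The requirements on the ‘rows’ and ‘cols’ files can be checked before loading a dataset. The sketch below does so with the standard csv module. The helper name check_layout is hypothetical, the sketch assumes both files begin with a header line (as pandas.read_csv expects by default), and it does not open the data.tiledb array itself.

```python
import csv
import os
import tempfile

REQUIRED_ROW_COLUMNS = {"chr", "start", "end"}


def check_layout(folder):
    """Return (n_rows, column_names) after basic checks on 'rows' and 'cols'."""
    with open(os.path.join(folder, "rows"), newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        missing = REQUIRED_ROW_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"'rows' is missing columns: {sorted(missing)}")
        n_rows = sum(1 for _ in reader)
    with open(os.path.join(folder, "cols"), newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header line
        # The first field of each line names one column of the array.
        names = [line[0] for line in reader if line]
    return n_rows, names


# Build a tiny made-up folder to exercise the check.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "rows"), "w") as out:
        out.write("chr\tstart\tend\nchr1\t0\t100\nchr1\t100\t200\n")
    with open(os.path.join(folder, "cols"), "w") as out:
        out.write("sample\ttissue\ns1\tliver\ns2\tbrain\n")
    n_rows, names = check_layout(folder)
```

The returned counts can then be compared with the shape of the tiledb array, which should have n_rows rows and len(names) columns.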

getRange(chr, start=None, end=None, bins=2000, zoomlvl=-1, metric='AVG', respType='DataFrame', treedisk=None)[source]¶

Get data for a given genomic location

Parameters:

chr (str) – chromosome
start (int) – genomic start
end (int) – genomic end
respType (str) – result format type, default is “DataFrame”

Returns:

result: a DataFrame of regions matching the input genomic location if respType is “DataFrame”; otherwise an array
error: if there was any error during the process

epivizfileserver.parser.TranscriptTbxFile module¶

class epivizfileserver.parser.TranscriptTbxFile.TranscriptTbxFile(file, columns=['chr', 'start', 'end', 'strand', 'transcript_id', 'exon_starts', 'exon_ends', 'gene'])[source]¶

Bases: epivizfileserver.parser.TbxFile.TbxFile

Class for tabix indexed transcript files

Parameters:

file (str) – file location; can be local (full path) or hosted publicly
columns ([str]) – column names for various columns in file

file¶: a pysam file object

fileSrc¶: location of the file

cacheData¶: cache of accessed data in memory

columns¶: column names to use

epivizfileserver.parser.utils module¶

epivizfileserver.parser.utils.create_parser_object(format, source, columns=None)[source]¶

Create appropriate File class based on file format

Parameters:

format (str) – format of file
source (str) – location of file

Returns:	An instance of parser class
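
Choosing a class from a format string is commonly done with a lookup table. The sketch below shows that pattern with stand-in classes. The format keys, the stub classes, and make_parser are all hypothetical; the format names the library accepts are not listed here.

```python
class StubBigWig:
    """Stand-in for a parser class; stores its constructor arguments."""

    def __init__(self, source, columns=None):
        self.source = source
        self.columns = columns


class StubBigBed(StubBigWig):
    """Stand-in for a second parser class."""


# Hypothetical mapping from format name to parser class.
PARSERS = {"bigwig": StubBigWig, "bigbed": StubBigBed}


def make_parser(fmt, source, columns=None):
    """Instantiate the class registered for `fmt` (case-insensitive)."""
    try:
        cls = PARSERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt!r}") from None
    return cls(source, columns=columns)


parser = make_parser("BigWig", "/data/sample.bw")
```

Raising on an unknown format keeps a misspelled format name from silently producing the wrong parser.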

epivizfileserver.parser.utils.toDataFrame(records, header=None)[source]¶
