neuralee.dataset package¶

This module is modified from scVI.

class neuralee.dataset.CortexDataset(save_path='data/', genes_to_keep=[], genes_fish=[], additional_genes=None)[source]¶

Bases: neuralee.dataset.dataset.GeneExpressionDataset

Loads cortex dataset.

The Mouse Cortex Cells dataset contains 3005 mouse cortex cells and gold-standard labels for seven distinct cell types. Each cell type corresponds to a cluster to recover. We retain the top 558 genes ordered by variance.

Parameters:	save_path – Save path of raw data file.

Examples:

gene_dataset = CortexDataset()

preprocess()[source]¶

static reorder_genes(x, genes, first_genes)[source]¶: In case the order of the genes needs to be changed: puts the genes listed in first_genes first, conserving their order.
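The reordering described above can be sketched as follows. This is an illustration only, not the library's implementation: the helper name `reorder_genes_sketch` is hypothetical, and it assumes `x` is a cells-by-genes array whose columns are named by `genes`.

```python
import numpy as np

def reorder_genes_sketch(x, genes, first_genes):
    """Move the columns named in first_genes to the front; the rest keep their order."""
    genes = np.asarray(genes)
    position = {g: i for i, g in enumerate(genes)}
    # Column index of each requested gene, in the order given by first_genes.
    first_idx = [position[g] for g in first_genes if g in position]
    chosen = set(first_idx)
    rest_idx = [i for i in range(len(genes)) if i not in chosen]
    order = first_idx + rest_idx
    return x[:, order], genes[order]

x = np.array([[1, 2, 3], [4, 5, 6]])
genes = np.array(["A", "B", "C"])
x_new, genes_new = reorder_genes_sketch(x, genes, ["C", "A"])
```

Genes absent from `genes` are silently skipped here; that is a design choice of the sketch, not documented behaviour.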

class neuralee.dataset.BrainLargeDataset(subsample_size=None, save_path='data/', nb_genes_kept=720, max_cells=None)[source]¶

Bases: neuralee.dataset.dataset.GeneExpressionDataset

Loads brain-large dataset.

This dataset contains 1.3 million brain cells from 10x Genomics. We randomly shuffle the data to get a 1M subset of cells, then order genes by variance to retain the first 10,000 and then 720 sampled variable genes.

Parameters:	save_path – Save path of raw data file.

Examples:

gene_dataset = BrainLargeDataset()

preprocess()[source]¶

class neuralee.dataset.RetinaDataset(save_path='data/')[source]¶

Bases: neuralee.dataset.loom.LoomDataset

Loads retina dataset.

After the authors’ original filtering pipeline, the dataset of bipolar cells contains 27,499 cells and 13,166 genes coming from two batches. We use the authors’ cluster annotation of 15 cell types.

Parameters:	save_path – Save path of raw data file.

Examples:

gene_dataset = RetinaDataset()

class neuralee.dataset.GeneExpressionDataset(Y, batch_indices=None, labels=None, gene_names=None, cell_types=None)[source]¶

Bases: object

Gene Expression dataset.

Parameters:

Y (numpy.ndarray or numpy.matrix) – gene expression matrix.
batch_indices – batch indices. If None, set to np.zeros.
labels – labels. If None, set to np.zeros.
gene_names – gene names.
cell_types – cell types.

affinity(aff='ea', perplexity=30.0, neighbors=None)[source]¶

Affinity calculation.

Parameters:

aff ({'ea', 'x2p'}) – affinity used to calculate attractive weights.
perplexity – perplexity defined in elastic embedding function.
neighbors (int) – the number of nearest neighbors.

affinity_split(N_small=None, aff='ea', perplexity=30.0, verbose=False, neighbors=None)[source]¶

Affinity calculation on each batch.

Preparation for NeuralEE with mini-batch trick.

Parameters:

N_small (int or percentage) – size of each batch.
aff ({'ea', 'x2p'}) – affinity used to calculate attractive weights.
perplexity – perplexity defined in elastic embedding function.
verbose (bool) – whether to show the progress of affinity calculation.
neighbors (int) – the number of nearest neighbors.
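How a batch size given as "int or percentage" might translate into per-batch index ranges can be sketched as below. The helper `batch_slices` is hypothetical, and reading a float as a fraction of all cells is an assumption of this sketch; the source does not state how the percentage is interpreted or whether cells are shuffled first.

```python
def batch_slices(n_cells, n_small):
    """Return (start, stop) index pairs covering n_cells in chunks of n_small."""
    if isinstance(n_small, float):
        # Assumption: a float in (0, 1] is read as a fraction of all cells.
        n_small = max(1, round(n_small * n_cells))
    return [(start, min(start + n_small, n_cells))
            for start in range(0, n_cells, n_small)]

slices_from_fraction = batch_slices(10, 0.3)
slices_from_count = batch_slices(10, 4)
```

Each (start, stop) pair would then delimit the cells on which one affinity matrix is computed.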

static concat_datasets(*gene_datasets, on='gene_names', shared_labels=True, shared_batches=False)[source]¶: Combines multiple unlabelled gene_datasets based on the intersection of their gene names. Datasets should all have gene_dataset.n_labels=0. Batch indices are generated in the same order as datasets are given.

Parameters:	gene_datasets – a sequence of gene_datasets objects.

Returns:	a GeneExpressionDataset instance of the concatenated datasets.
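The idea of concatenating on shared gene names can be illustrated with plain arrays. This is a sketch under stated assumptions, not the method's code: `concat_on_shared_genes` is a hypothetical helper, each matrix is cells-by-genes, and the shared genes are sorted alphabetically only to make the output reproducible.

```python
import numpy as np

def concat_on_shared_genes(matrices, gene_name_lists):
    """Stack cells from several datasets, keeping only genes present in all of them."""
    shared = set(gene_name_lists[0])
    for names in gene_name_lists[1:]:
        shared &= set(names)
    shared = sorted(shared)  # fixed, reproducible gene order (sketch's choice)
    blocks, batch_indices = [], []
    for batch, (X, names) in enumerate(zip(matrices, gene_name_lists)):
        col = {g: j for j, g in enumerate(names)}
        blocks.append(X[:, [col[g] for g in shared]])
        # Batch index follows the order in which the datasets were given.
        batch_indices.append(np.full(X.shape[0], batch))
    return np.vstack(blocks), np.concatenate(batch_indices), shared

X1 = np.array([[1, 2, 3]])        # genes A, B, C
X2 = np.array([[4, 5], [6, 7]])   # genes C, A
X, batches, shared_genes = concat_on_shared_genes(
    [X1, X2], [["A", "B", "C"], ["C", "A"]])
```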

download()[source]¶: download dataset.

download_and_preprocess()[source]¶: download and preprocess dataset.

filter_cell_types(cell_types)[source]¶

update data by given cell types.

Parameters:	cell_types (numpy.ndarray) – indices(np.int) or cell-types names(np.str).

filter_genes(gene_names_ref, on='gene_names')[source]¶

update dataset by given subset of genes’ names.

Parameters:	gene_names_ref – subset of genes’ names.

static get_attributes_from_list(Xs, list_batches=None, list_labels=None)[source]¶: acquire dataset from lists.

static get_attributes_from_matrix(X, batch_indices=0, labels=None)[source]¶: acquire dataset from matrix.

log_shift()[source]¶: lambda: x -> log(1+x)
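As a small numerical illustration of the transform x -> log(1+x), NumPy's `np.log1p` computes the same quantity; whether the method modifies the dataset in place is not stated in the source, so the sketch works on a standalone array.

```python
import numpy as np

counts = np.array([[0.0, 1.0], [9.0, 99.0]])
# log1p(x) equals log(1 + x) and maps zero counts to exactly zero.
shifted = np.log1p(counts)
```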

map_cell_types(cell_types_dict)[source]¶

A map for the cell types to keep, and optionally merge together under a new name (value in the dict).

Parameters:	cell_types_dict – a dictionary with tuples (str or int) as input and value (str or int) as output
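The dictionary format described above (tuples of old cell types as keys, one new name as value) can be sketched on a plain label array. `map_cell_types_sketch` is a hypothetical helper written for illustration; it assumes that cells whose type appears in no key are dropped, which matches "cell types to keep" but is not confirmed by the source.

```python
import numpy as np

def map_cell_types_sketch(labels, cell_types_dict):
    """Rename or merge cell types; return the new labels and the mask of kept cells."""
    labels = np.asarray(labels)
    new_labels = labels.astype(object)
    keep = np.zeros(len(labels), dtype=bool)
    for old_names, new_name in cell_types_dict.items():
        mask = np.isin(labels, list(old_names))
        new_labels[mask] = new_name
        keep |= mask
    return new_labels[keep], keep

labels = np.array(["B", "T", "NK", "B", "mono"])
mapping = {("B",): "B cells", ("T", "NK"): "lymphoid"}
new_labels, keep = map_cell_types_sketch(labels, mapping)
```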

merge_cell_types(cell_types, new_cell_type_name)[source]¶

Merge some cell types into a new one, and change the labels accordingly.

Parameters:

cell_types (numpy.ndarray) – indices(np.int) or cell-types names(np.str).
new_cell_type_name (numpy.ndarray) – indices(np.int) or cell-types names(np.str).

remove_zero_sample()[source]¶: remove zero expression samples.
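One way to picture removing zero-expression samples, assuming a cells-by-genes matrix of non-negative counts (an assumption of this sketch, not a statement about the method's internals):

```python
import numpy as np

X = np.array([[0, 0, 0],
              [1, 0, 2],
              [0, 0, 0],
              [0, 3, 0]])
# A cell is kept if it has at least one nonzero count.
nonzero_cells = X.sum(axis=1) > 0
X_kept = X[nonzero_cells]
```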

standardscale()[source]¶: standard scaling across gene.
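A minimal sketch of standard scaling, assuming "across gene" means that each gene (column) is centred to zero mean and scaled to unit variance over cells. The helper `standard_scale_genes` and the `eps` guard for constant genes are additions of this sketch.

```python
import numpy as np

def standard_scale_genes(X, eps=1e-12):
    """Centre each gene column to mean 0 and scale it to unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # eps avoids division by zero for genes with constant expression.
    return (X - mean) / np.maximum(std, eps)

scaled = standard_scale_genes([[1, 10], [3, 30]])
```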

subsample_cells(size=1.0)[source]¶

update dataset by filtering cells according to variance.

Parameters:	size (int or percentage) – subsample size.

subsample_genes(new_n_genes=None, subset_genes=None)[source]¶

update dataset by filtering genes according to variance.

Parameters:

new_n_genes – number of genes to retain; used if subset_genes is not provided.
subset_genes – subset of genes’ indexes.
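Selecting genes by variance can be sketched as below. The helper `top_variance_genes` is hypothetical and assumes a cells-by-genes matrix; it returns column indexes ordered from highest to lowest variance, which could then be passed on as a gene subset.

```python
import numpy as np

def top_variance_genes(X, new_n_genes):
    """Return the indexes of the new_n_genes columns with the largest variance."""
    variances = np.var(X, axis=0)
    # argsort is ascending, so reverse it and keep the first new_n_genes entries.
    return np.argsort(variances)[::-1][:new_n_genes]

X = np.array([[1, 0, 5],
              [1, 10, 6],
              [1, 20, 7]])
top = top_variance_genes(X, 2)
X_subset = X[:, top]
```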

update_cells(subset_cells)[source]¶

update dataset by given subset of cells’ indexes.

Parameters:	subset_cells – subset of cells’ indexes.

update_genes(subset_genes)[source]¶

update dataset by given subset of genes’ indexes.

Parameters:	subset_genes – subset of genes’ indexes.

class neuralee.dataset.CiteSeqDataset(name='cbmc', save_path='data/citeSeq/')[source]¶

Bases: neuralee.dataset.dataset.GeneExpressionDataset

preprocess()[source]¶

class neuralee.dataset.BrainSmallDataset(save_path='data/')[source]¶

Bases: neuralee.dataset.dataset10X.Dataset10X

This dataset consists of 9,128 mouse brain cells profiled using 10x Genomics. It is used as a complement to PBMC for our study of the correlation of zero abundance and quality control metrics with our generative posterior parameters. We derived quality control metrics using the cellrangerRkit R package (v.1.1.0). Quality metrics were extracted from CellRanger through the molecule-specific information file. We kept the top 3000 genes by variance. We used the clusters provided by CellRanger for the correlation analysis of zero probabilities.

Parameters:	save_path – Save path of raw data file.

Examples:

gene_dataset = BrainSmallDataset()

class neuralee.dataset.HematoDataset(save_path='data/HEMATO/')[source]¶

Bases: neuralee.dataset.dataset.GeneExpressionDataset

Loads hemato dataset.

This dataset with continuous gene expression variations from hematopoietic progenitor cells contains 4,016 cells and 7,397 genes. We removed the library basal-bm1, which was of poor quality, based on the authors’ recommendation. We use their population balance analysis result as a potential function for differentiation.

Parameters:	save_path – Save path of raw data file.

Examples:

gene_dataset = HematoDataset()

preprocess()[source]¶

class neuralee.dataset.CbmcDataset(save_path='data/citeSeq/')[source]¶

Bases: neuralee.dataset.cite_seq.CiteSeqDataset

Loads cbmc dataset.

This dataset includes 8,617 cord blood mononuclear cells profiled using 10x, along with 13 well-characterized mononuclear antibodies for each cell. We kept the top 600 genes by variance.

Parameters:	save_path – Save path of raw data file.

Examples:

gene_dataset = CbmcDataset()

class neuralee.dataset.PbmcDataset(save_path='data/')[source]¶

Bases: neuralee.dataset.dataset.GeneExpressionDataset

Loads pbmc dataset.

We considered scRNA-seq data from two batches of peripheral blood mononuclear cells (PBMCs) from a healthy donor (4K PBMCs and 8K PBMCs). We derived quality control metrics using the cellrangerRkit R package (v.1.1.0). Quality metrics were extracted from CellRanger through the molecule-specific information file. After filtering, we extract 12,039 cells with 10,310 sampled genes and get biologically meaningful clusters with the software Seurat. We then filter out genes that we could not match with the bulk data used for differential expression, leaving g = 3346 genes.

Parameters:	save_path – Save path of raw data file.

Examples:

gene_dataset = PbmcDataset()

class neuralee.dataset.LoomDataset(filename, save_path='data/', url=None)[source]¶

Bases: neuralee.dataset.dataset.GeneExpressionDataset

Loads a .loom file.

Parameters:

filename – Name of the .loom file.
save_path – Save path of the dataset.
url – Url of the remote dataset.

Examples:

from neuralee.dataset import LoomDataset

# Loading a remote dataset
remote_loom_dataset = LoomDataset(
    "osmFISH_SScortex_mouse_all_cell.loom", save_path='data/',
    url='http://linnarssonlab.org/osmFISH/'
        'osmFISH_SScortex_mouse_all_cells.loom')
# Loading a local dataset
local_loom_dataset = LoomDataset(
    "osmFISH_SScortex_mouse_all_cell.loom", save_path='data/')

preprocess()[source]¶

class neuralee.dataset.AnnDataset(filename, save_path='data/', url=None, new_n_genes=False, subset_genes=None)[source]¶

Bases: neuralee.dataset.dataset.GeneExpressionDataset

Loads a .h5ad file.

The AnnDataset class supports loading an AnnData object.

Parameters:

filename – Name of the .h5ad file.
save_path – Save path of the dataset.
url – Url of the remote dataset.
new_n_genes – Number of subsampled genes.
subset_genes – List of genes for subsampling.

Examples:

from neuralee.dataset import AnnDataset

# Loading a local dataset
local_ann_dataset = AnnDataset(
    "TM_droplet_mat.h5ad", save_path='data/')

preprocess()[source]¶

class neuralee.dataset.CsvDataset(filename, save_path='data/', url=None, new_n_genes=600, subset_genes=None, compression=None, sep=', ', gene_by_cell=True, labels_file=None, batch_ids_file=None)[source]¶

Bases: neuralee.dataset.dataset.GeneExpressionDataset

Loads a .csv file.

Parameters:

filename – Name of the .csv file.
save_path – Save path of the dataset.
url – Url of the remote dataset.
new_n_genes – Number of subsampled genes.
subset_genes – List of genes for subsampling.
compression – For on-the-fly decompression of on-disk data. If ‘infer’ and filepath_or_buffer is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’ (otherwise no decompression). If using ‘zip’, the ZIP file must contain only one data file to be read in.
batch_ids_file – Name of the .csv file with batch indices. The file contains two columns: the first holds gene names and the second holds batch indices (type int). The first row of the file is a header.
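The extension rule described for compression='infer' can be sketched with the standard library alone. The helper `infer_compression` and its lookup table are hypothetical names written for illustration; they restate the rule from the parameter description and are not the code path that CsvDataset itself uses.

```python
import os

# Extensions named in the parameter description, mapped to compression names.
_EXT_TO_COMPRESSION = {".gz": "gzip", ".bz2": "bz2", ".zip": "zip", ".xz": "xz"}

def infer_compression(path):
    """Return the compression implied by the file extension, or None if there is none."""
    ext = os.path.splitext(path)[1].lower()
    return _EXT_TO_COMPRESSION.get(ext)

inferred = infer_compression("GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz")
plain = infer_compression("expression.csv")
```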

Examples:

from neuralee.dataset import CsvDataset

# Loading a remote dataset
remote_url = (
    "https://www.ncbi.nlm.nih.gov/geo/download/"
    "?acc=GSE100866&format=file&file="
    "GSE100866%5FCBMC%5F8K%5F13AB%5F10X%2DRNA%5Fumi%2Ecsv%2Egz")
remote_csv_dataset = CsvDataset(
    "GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz", save_path='data/',
    compression='gzip', url=remote_url)
# Loading a local dataset
local_csv_dataset = CsvDataset(
    "GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz",
    save_path='data/', compression='gzip')

preprocess()[source]¶

class neuralee.dataset.Dataset10X(filename, save_path='data/', type='filtered', dense=False, remote=True, genecol=0)[source]¶

Bases: neuralee.dataset.dataset.GeneExpressionDataset

Loads a file from 10x website.

Parameters:

filename – Name of the dataset file.
save_path – Save path of the dataset.
type – Either filtered data or raw data.
subset_genes – List of genes for subsampling.
dense – Whether to load as dense or sparse.
remote – Whether the 10X dataset is to be downloaded from the website or is a local dataset. If remote is False, then os.path.join(save_path, filename) must be the path to the directory that contains the matrix.mtx and genes.tsv files.
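Before loading a local 10X dataset with remote=False, the directory layout described above can be checked up front. The helper `check_local_10x_dir` is a hypothetical convenience written for this sketch and is not part of the package; the demonstration builds a temporary directory that deliberately lacks genes.tsv.

```python
import os
import tempfile

def check_local_10x_dir(save_path, filename):
    """Return the expected data directory and the required files missing from it."""
    path = os.path.join(save_path, filename)
    required = ("matrix.mtx", "genes.tsv")
    missing = [f for f in required if not os.path.isfile(os.path.join(path, f))]
    return path, missing

# Demonstration on a throwaway directory that contains only matrix.mtx.
with tempfile.TemporaryDirectory() as save_path:
    data_dir = os.path.join(save_path, "neuron_9k")
    os.makedirs(data_dir)
    open(os.path.join(data_dir, "matrix.mtx"), "w").close()
    path, missing = check_local_10x_dir(save_path, "neuron_9k")
```

An empty `missing` list would indicate that the directory has the two files the parameter description requires.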

Examples:

tenX_dataset = Dataset10X("neuron_9k")

preprocess()[source]¶

class neuralee.dataset.SeqfishDataset(save_path='data/')[source]¶

Bases: neuralee.dataset.dataset.GeneExpressionDataset

preprocess()[source]¶

class neuralee.dataset.SmfishDataset(save_path='data/', cell_type_level='major')[source]¶

Bases: neuralee.dataset.dataset.GeneExpressionDataset

preprocess()[source]¶

class neuralee.dataset.BreastCancerDataset(save_path='data/')[source]¶: Bases: neuralee.dataset.csv.CsvDataset

class neuralee.dataset.MouseOBDataset(save_path='data/')[source]¶: Bases: neuralee.dataset.csv.CsvDataset

class neuralee.dataset.PurifiedPBMCDataset(save_path='data/', filter_cell_types=None)[source]¶

Bases: neuralee.dataset.dataset.GeneExpressionDataset

The purified PBMC dataset from: “Massively parallel digital transcriptional profiling of single cells”.

Parameters:	save_path – Save path of raw data file.

Examples:

gene_dataset = PurifiedPBMCDataset()