Create Signatures from Single-cell RNA-seq Data
This page guides you through the process of creating signature matrices (X
and stdX
) from your own single-cell RNA-seq data. Additionally, instructions for using a Snakemake pipeline and optionally scheduling it with SLURM on the HPC are provided.
Accepted Input Formats
-
Anndata
- Add some information about the file types
-
Have data in some other format ?
- Convert it using our script
- .rds files
- counts data and metadata separately (.mtx, .csv....)
- Convert it using our script
Steps to Create Signatures
1. Preprocess Your scRNA-seq Data
- Quality Control: Filter out low-quality cells and genes.
2. Annotate Cell Types
- Assign cell type labels to each cell in your dataset.
- Ensure that the metadata includes a column for cell type annotations (e.g.,
cell_type_major
).- For adata objects should be a column in adata.obs
- For .rds and metadata should have an appropriate column with the cell type annotations for the cells
3. Calculate Average Expression Profiles
-
Compute
X
: Calculate the average expression of each gene for each cell type.- Use log-transformed data for consistency.
-
Compute
stdX
: Calculate the standard deviation of gene expression for each gene within each cell type.
4. Use the Snakemake Pipeline
To streamline the process, you can use the Snakemake pipeline provided in the example above. It handles data preprocessing, identifying highly variable genes, running AutoGeneS, and generating the signature matrices.
Using the Snakemake Pipeline
Prerequisites
-
Install Snakemake:
- Use
conda
:conda install -c bioconda -c conda-forge snakemake
- Use
-
Prepare Configuration:
- Edit the
config.yaml
file to match your dataset:- Set
use_existing_h5ad
toTrue
orFalse
. - Provide paths to raw or existing
.h5ad
data. - Adjust parameters like
Ngenes
and relevant column names.
- Set
- Edit the
-
Directory Structure:
- Ensure your data is organized as follows:
data/
├── raw/
│ ├── RNA_rawcounts_matrix.rds
│ ├── metadata.csv
├── processed/
results/
scripts/
- Ensure your data is organized as follows:
-
Activate Environment:
- Use the appropriate Conda environment for each rule (e.g.,
env_r.yaml
,env_preprocess.yaml
).
- Use the appropriate Conda environment for each rule (e.g.,
Running the Pipeline
To run the pipeline locally:
snakemake --cores 4