Build a single-cell RNA-seq analysis pipeline with Scanpy to perform PBMC clustering, annotation, and trajectory discovery

The tutorial demonstrates a workflow for performing advanced single-cell analyses using Scanpy The PBMC-3k dataset is used as a benchmark. Start by loading and inspecting the dataset. Apply quality control checks on gene counts and total counts. Also, evaluate mitochondrial contents and ribosomal signals. After filtering low-quality cell and gene counts, we detect possible doublets using Scrublet. We normalize data, perform log transformation and then identify genes with high variability for further analysis. PCA, UMAP and tSNE are used to reduce the dimensionality of data. We can also use the Leiden algorithm to cluster cells, find marker genes, annotate populations with PBMC markers and explore trajectory structure using PAGA.

Copy CodeCopiesUse a different Browser

!pip install -q scanpy leidenalg python-igraph scrublet


Import scanpy (sc)
Numpy can be imported as np
import pandas as pd
Import matplotlib.pyplot into plt
import warnings
warnings.filterwarnings("ignore")


sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor="white", figsize=(5, 5))
sc.logging.print_header()


adata = datasets.pbmc3k()
adata.var_names_make_unique()
print(adata)


adata.var["mt"]   = adata.var_names.str.startswith("MT-")
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
sc.pp.calculate_qc_metrics(
   adata, qc_vars=["mt", "ribo"], percent_top=None, log1p=False, inplace=True
)


sc.pl.violin(
   adata,
   ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
   jitter=0.4, multi_panel=True,
)
sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt")
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")

Installation of the Single-Cell Analysis Libraries. Importing Scanpy NumPy Pandas Matplotlib Warning Controls. The PBMC-3k dataset is loaded, gene names are made unique and the AnnData object structures are inspected. Next, we compute quality control metrics and visualise count-level quality patterns with violin and scatterplots.

Copy codeCopiesUse a different Browser

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
Adata[adata.obs.n_genes_by_counts < 2500, :].copy()
Adata[adata.obs.pct_counts_mt < 5, :].copy()


sc.pp.scrublet(adata)
print("Predicted doublets:", int(adata.obs["predicted_doublet"].sum()))
Adata[~adata.obs["predicted_doublet"], :].copy()


adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)


sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
adata.raw = adata
Adata[:, adata.var.highly_variable].copy()

To improve the quality of the data, we filter out cells with low quality and genes that are rarely detected. Scrublet via Scanpy is used to find predicted doublets, and we remove them from the dataset before further analysis. Then, we preserve the raw counts and normalize expression, log transform, select genes with high variability, and only keep those features that are most informative.

Copy codeCopiesUse a different Browser

s_genes = ["MCM5","PCNA","TYMS","FEN1","MCM2","MCM4","RRM1","UNG","GINS2",
          "MCM6","CDCA7","DTL","PRIM1","UHRF1","HELLS","RFC2","NASP",
          "RAD51AP1","GMNN","WDR76","SLBP","CCNE2","UBR7","POLD3","MSH2",
          "ATAD2","RAD51","RRM2","CDC45","CDC6","EXO1","TIPIN","DSCC1",
          "BLM","CASP8AP2","USP1","CLSPN","POLA1","CHAF1B","E2F8"]
g2m_genes = ["HMGB2","CDK1","NUSAP1","UBE2C","BIRC5","TPX2","TOP2A","NDC80",
            "CKS2","NUF2","CKS1B","MKI67","TMPO","CENPF","TACC3","SMC4",
            "CCNB2","CKAP2L","CKAP2","AURKB","BUB1","KIF11","ANP32E",
            "TUBB4B","GTSE1","KIF20B","HJURP","CDCA3","CDC20","TTK",
            "CDC25C","KIF2C","RANGAP1","NCAPD2","DLGAP5","CDCA2","CDCA8",
            "ECT2","KIF23","HMMR","AURKA","PSRC1","ANLN","LBR","CKAP5",
            "CENPE","NEK2","G2E3","CBX5","CENPA"]
s_genes   = [g for g in s_genes   if g in adata.var_names]
g2m_genes = [g for g in g2m_genes if g in adata.var_names]
sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)


sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)


sc.tl.pca(adata, svd_solver="arpack")
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)


sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)
sc.tl.tsne(adata, n_pcs=40)

The S-phase, G2/M and marker genes are defined and only the ones present in the data set will be retained. Score each cell according to its cell-cycle stage, remove unwanted variance from the total counts and mitochondrial proportion, then scale data for downstream modelling. Next, we perform PCA to inspect variance explained, create the neighborhood graph and produce UMAP or t-SNE.

Copy CodeCopiesUse a different Browser

sc.tl.leiden(adata, resolution=0.5, flavor="igraph", n_iterations=2, directed=False)
sc.pl.umap(adata, color="leiden", legend_loc="on data", title="Leiden clusters")
sc.pl.tsne(adata, color="leiden", legend_loc="on data", title="t-SNE clusters")


sc.tl.rank_genes_groups(adata, "leiden", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)


result = uns["rank_genes_groups"]
Groups = result["names"].dtype.names
top_df   = pd.DataFrame({Top_df = DataFrame (g: result)["names"][g][:10] for g in groups})
print("nTop 10 markers per cluster:n", top_df)


marker_genes = {
   "B-cell":           ["CD79A", "MS4A1"],
   "CD8 T-cell":       ["CD8A", "CD8B"],
   "CD4 T-cell":       ["IL7R", "CD4"],
   "NK":               ["GNLY", "NKG7"],
   "CD14 Monocyte":    ["CD14", "LYZ"],
   "FCGR3A Monocyte":  ["FCGR3A", "MS4A7"],
   "Dendritic":        ["FCER1A", "CST3"],
   "Megakaryocyte":    ["PPBP"],
}
sc.pl.dotplot(adata, marker_genes, groupby="leiden", standard_scale="var")
sc.pl.stacked_violin(adata, marker_genes, groupby="leiden", swap_axes=True)

The clusters are visualized on UMAP plots and tSNE plots using Leiden Clustering. We use the Wilcoxon analysis to do differential expression and identify top markers genes in each cluster. We use canonical PBMC markers to provide cell type annotation using dot plots, stacked violins plots, and stacked PBMC gene plots.

Copy CodeCopiesUse a different Browser

sc.tl.paga(adata, groups="leiden")
sc.pl.paga(adata, color="leiden", threshold=0.1)


sc.tl.umap(adata, init_pos="paga")
sc.pl.umap(adata, color="leiden", legend_loc="on data")


sc.tl.diffmap(adata)
sc.pp.neighbors(adata, n_neighbors=10, use_rep="X_diffmap")
adata.uns["iroot"] = np.flatnonzero(adata.obs["leiden"] == adata.obs["leiden"].cat.categories[0])[0]
sc.tl.dpt(adata)
sc.pl.umap(adata, color=["leiden", "dpt_pseudotime"], legend_loc="on data")


ifn_genes = ["ISG15", "IFI6", "IFIT1", "IFIT3", "MX1", "OAS1", "STAT1", "IRF7"]
ifn_genes = [g for g in ifn_genes if g in adata.raw.var_names]
sc.tl.score_genes(adata, gene_list=ifn_genes, score_name="IFN_score")
sc.pl.umap(adata, color="IFN_score", cmap="viridis")


adata.write("pbmc3k_analyzed.h5ad")
print("n Analysis complete — saved to pbmc3k_analyzed.h5ad")
print(adata)

PAGA models connectivity between Leiden Clusters. We then reinitialize UMAP with the PAGA graph in order to achieve a better trajectory structure. To explore the possible patterns of progression across different cell states, we compute diffusion maps. In addition, we compute an interferon gene-set response score. We visualize this on UMAP and then save it as an H5ad.

In summary, we developed an end-toend Scanpy-pipeline for single-cell RNA sequencing analysis. This pipeline transforms raw PBMC data to interpretable biological insight. Preprocessing the data, removing doublets and noisy cells, selecting informative genes and generating meaningful embeddings for cellular structure was our goal. Leiden-clustering and differential analysis were used to find marker genes, and to connect clusters and immune cell types. Addition of PAGA and diffusion pseudotime as well as custom gene-set scores extended the workflow and demonstrated how Scanpy supported deeper biological interpretation. Finally, the.h5ad saved object contained all of our data and annotations. It also included scores, clusters and analysis visual results.

Check out the Full Codes with Notebook here. Also, feel free to follow us on Twitter Don’t forget about our 150k+ ML SubReddit Subscribe now our Newsletter. Wait! What? now you can join us on telegram as well.

Want to promote your GitHub repo, Hugging Face page, Product release or Webinar?? Connect with us

This post is about How to Build a Single-Cell RNA-seq Analysis Pipeline with Scanpy for PBMC Clustering, Annotation, and Trajectory Discovery The first time that appeared on MarkTechPost.

Build a single-cell RNA-seq analysis pipeline with Scanpy to perform PBMC clustering, annotation, and trajectory discovery

OpenAI’s AI Agent can now access LinkedIn, Salesforce Gmail and internal tools via sign-in sessions.

Natural Language Automatencoders by Anthropic Convert Claude’s internal activations directly into human-readable text explanations

OpenAI Releases Three Realtime Audio Models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper in the Realtime API

LightSeek Foundation releases TokenSpeed, a open-source LLM inference engine targeting TensorRT-LLM level performance for agentic workloads

Google Shakes Up Its Agent Team After OpenClaw Craze

Why are Therapists so Expensive? ChatGPT: Why are thousands of women spilling their deepest secrets?

OpenAI is poised to become the most valuable startup ever. What Should It Be?

Vibe Coding is Shoot-and-Overlook Coding

Data Centers: Are you against them? Working Families Party Wants you to Run For Office

Top Insights

Use MLflow to establish a robust versioning system that allows for rapid and rigorous testing of large language models

The Body Betraying Privacy

Latest News

Build a single-cell RNA-seq analysis pipeline with Scanpy to perform PBMC clustering, annotation, and trajectory discovery

OpenAI’s AI Agent can now access LinkedIn, Salesforce Gmail and internal tools via sign-in sessions.

Build a single-cell RNA-seq analysis pipeline with Scanpy to perform PBMC clustering, annotation, and trajectory discovery

Related Posts