The tutorial demonstrates a workflow for performing advanced single-cell analyses using Scanpy The PBMC-3k dataset is used as a benchmark. Start by loading and inspecting the dataset. Apply quality control checks on gene counts and total counts. Also, evaluate mitochondrial contents and ribosomal signals. After filtering low-quality cell and gene counts, we detect possible doublets using Scrublet. We normalize data, perform log transformation and then identify genes with high variability for further analysis. PCA, UMAP and tSNE are used to reduce the dimensionality of data. We can also use the Leiden algorithm to cluster cells, find marker genes, annotate populations with PBMC markers and explore trajectory structure using PAGA.
!pip install -q scanpy leidenalg python-igraph scrublet
Import scanpy (sc)
Numpy can be imported as np
import pandas as pd
Import matplotlib.pyplot into plt
import warnings
warnings.filterwarnings("ignore")
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor="white", figsize=(5, 5))
sc.logging.print_header()
adata = datasets.pbmc3k()
adata.var_names_make_unique()
print(adata)
adata.var["mt"] = adata.var_names.str.startswith("MT-")
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
sc.pp.calculate_qc_metrics(
adata, qc_vars=["mt", "ribo"], percent_top=None, log1p=False, inplace=True
)
sc.pl.violin(
adata,
["n_genes_by_counts", "total_counts", "pct_counts_mt"],
jitter=0.4, multi_panel=True,
)
sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt")
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")
Installation of the Single-Cell Analysis Libraries. Importing Scanpy NumPy Pandas Matplotlib Warning Controls. The PBMC-3k dataset is loaded, gene names are made unique and the AnnData object structures are inspected. Next, we compute quality control metrics and visualise count-level quality patterns with violin and scatterplots.
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
Adata[adata.obs.n_genes_by_counts < 2500, :].copy()
Adata[adata.obs.pct_counts_mt < 5, :].copy()
sc.pp.scrublet(adata)
print("Predicted doublets:", int(adata.obs["predicted_doublet"].sum()))
Adata[~adata.obs["predicted_doublet"], :].copy()
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
adata.raw = adata
Adata[:, adata.var.highly_variable].copy()
To improve the quality of the data, we filter out cells with low quality and genes that are rarely detected. Scrublet via Scanpy is used to find predicted doublets, and we remove them from the dataset before further analysis. Then, we preserve the raw counts and normalize expression, log transform, select genes with high variability, and only keep those features that are most informative.
s_genes = ["MCM5","PCNA","TYMS","FEN1","MCM2","MCM4","RRM1","UNG","GINS2",
"MCM6","CDCA7","DTL","PRIM1","UHRF1","HELLS","RFC2","NASP",
"RAD51AP1","GMNN","WDR76","SLBP","CCNE2","UBR7","POLD3","MSH2",
"ATAD2","RAD51","RRM2","CDC45","CDC6","EXO1","TIPIN","DSCC1",
"BLM","CASP8AP2","USP1","CLSPN","POLA1","CHAF1B","E2F8"]
g2m_genes = ["HMGB2","CDK1","NUSAP1","UBE2C","BIRC5","TPX2","TOP2A","NDC80",
"CKS2","NUF2","CKS1B","MKI67","TMPO","CENPF","TACC3","SMC4",
"CCNB2","CKAP2L","CKAP2","AURKB","BUB1","KIF11","ANP32E",
"TUBB4B","GTSE1","KIF20B","HJURP","CDCA3","CDC20","TTK",
"CDC25C","KIF2C","RANGAP1","NCAPD2","DLGAP5","CDCA2","CDCA8",
"ECT2","KIF23","HMMR","AURKA","PSRC1","ANLN","LBR","CKAP5",
"CENPE","NEK2","G2E3","CBX5","CENPA"]
s_genes = [g for g in s_genes if g in adata.var_names]
g2m_genes = [g for g in g2m_genes if g in adata.var_names]
sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver="arpack")
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)
sc.tl.tsne(adata, n_pcs=40)
The S-phase, G2/M and marker genes are defined and only the ones present in the data set will be retained. Score each cell according to its cell-cycle stage, remove unwanted variance from the total counts and mitochondrial proportion, then scale data for downstream modelling. Next, we perform PCA to inspect variance explained, create the neighborhood graph and produce UMAP or t-SNE.
sc.tl.leiden(adata, resolution=0.5, flavor="igraph", n_iterations=2, directed=False)
sc.pl.umap(adata, color="leiden", legend_loc="on data", title="Leiden clusters")
sc.pl.tsne(adata, color="leiden", legend_loc="on data", title="t-SNE clusters")
sc.tl.rank_genes_groups(adata, "leiden", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)
result = uns["rank_genes_groups"]
Groups = result["names"].dtype.names
top_df = pd.DataFrame({Top_df = DataFrame (g: result)["names"][g][:10] for g in groups})
print("nTop 10 markers per cluster:n", top_df)
marker_genes = {
"B-cell": ["CD79A", "MS4A1"],
"CD8 T-cell": ["CD8A", "CD8B"],
"CD4 T-cell": ["IL7R", "CD4"],
"NK": ["GNLY", "NKG7"],
"CD14 Monocyte": ["CD14", "LYZ"],
"FCGR3A Monocyte": ["FCGR3A", "MS4A7"],
"Dendritic": ["FCER1A", "CST3"],
"Megakaryocyte": ["PPBP"],
}
sc.pl.dotplot(adata, marker_genes, groupby="leiden", standard_scale="var")
sc.pl.stacked_violin(adata, marker_genes, groupby="leiden", swap_axes=True)
The clusters are visualized on UMAP plots and tSNE plots using Leiden Clustering. We use the Wilcoxon analysis to do differential expression and identify top markers genes in each cluster. We use canonical PBMC markers to provide cell type annotation using dot plots, stacked violins plots, and stacked PBMC gene plots.
sc.tl.paga(adata, groups="leiden")
sc.pl.paga(adata, color="leiden", threshold=0.1)
sc.tl.umap(adata, init_pos="paga")
sc.pl.umap(adata, color="leiden", legend_loc="on data")
sc.tl.diffmap(adata)
sc.pp.neighbors(adata, n_neighbors=10, use_rep="X_diffmap")
adata.uns["iroot"] = np.flatnonzero(adata.obs["leiden"] == adata.obs["leiden"].cat.categories[0])[0]
sc.tl.dpt(adata)
sc.pl.umap(adata, color=["leiden", "dpt_pseudotime"], legend_loc="on data")
ifn_genes = ["ISG15", "IFI6", "IFIT1", "IFIT3", "MX1", "OAS1", "STAT1", "IRF7"]
ifn_genes = [g for g in ifn_genes if g in adata.raw.var_names]
sc.tl.score_genes(adata, gene_list=ifn_genes, score_name="IFN_score")
sc.pl.umap(adata, color="IFN_score", cmap="viridis")
adata.write("pbmc3k_analyzed.h5ad")
print("n Analysis complete — saved to pbmc3k_analyzed.h5ad")
print(adata)
PAGA models connectivity between Leiden Clusters. We then reinitialize UMAP with the PAGA graph in order to achieve a better trajectory structure. To explore the possible patterns of progression across different cell states, we compute diffusion maps. In addition, we compute an interferon gene-set response score. We visualize this on UMAP and then save it as an H5ad.
In summary, we developed an end-toend Scanpy-pipeline for single-cell RNA sequencing analysis. This pipeline transforms raw PBMC data to interpretable biological insight. Preprocessing the data, removing doublets and noisy cells, selecting informative genes and generating meaningful embeddings for cellular structure was our goal. Leiden-clustering and differential analysis were used to find marker genes, and to connect clusters and immune cell types. Addition of PAGA and diffusion pseudotime as well as custom gene-set scores extended the workflow and demonstrated how Scanpy supported deeper biological interpretation. Finally, the.h5ad saved object contained all of our data and annotations. It also included scores, clusters and analysis visual results.
Check out the Full Codes with Notebook here. Also, feel free to follow us on Twitter Don’t forget about our 150k+ ML SubReddit Subscribe now our Newsletter. Wait! What? now you can join us on telegram as well.
Want to promote your GitHub repo, Hugging Face page, Product release or Webinar?? Connect with us
This post is about How to Build a Single-Cell RNA-seq Analysis Pipeline with Scanpy for PBMC Clustering, Annotation, and Trajectory Discovery The first time that appeared on MarkTechPost.

