Note

In order to run this notebook in Google Colab run the following cells. For best performance make sure that you run the notebook on a GPU instance, i.e. choose from the menu bar Runtime > Change Runtime > Hardware accelerator > GPU.

[3]:

# Install scPCA + dependencies
!pip install --quiet scpca scikit-misc

[2]:

# Download the Kang et. al. dataset
!wget https://www.huber.embl.de/users/harald/scpca/angelidis.h5ad /content/angelidis.h5ad

Analysing the brain cells in light stimulated mice#

In this notebook, we analyse the dataset from Angelidis et. al. in which the authors investigated lung cells of aging mice.

[54]:

import scpca as scp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scanpy as sc
import seaborn as sns

Data preprocessing#

We begin by importing the dataset with scanpy.

[57]:

data_path = '/content/hrvatin.h5ad'
adata = sc.read_h5ad(data_path)

[58]:

adata

[58]:

AnnData object with n_obs × n_vars = 14745 × 20908
    obs: 'barcode', 'mouse_id', 'age', 'cell_type', 'cell_id'
    uns: 'X_name', 'celltype_column', 'condition_column', 'main_assay'

Examining the obs field of the adata object reveals that the dataset includes two conditions (age column), corresponding to mice aged 3 and 24 months.

[4]:

adata.obs

[4]:

	barcode	mouse_id	age	cell_type	cell_id
muc3838:GCACTTTAGAAT	GCACTTTAGAAT	muc3838	24m	Ciliated_cells	muc3838:GCACTTTAGAAT
muc3838:TCCTGCTCCCTT	TCCTGCTCCCTT	muc3838	24m	Ciliated_cells	muc3838:TCCTGCTCCCTT
muc3838:TCCTGCTCCCTG	TCCTGCTCCCTG	muc3838	24m	Ciliated_cells	muc3838:TCCTGCTCCCTG
muc3838:TGCGAGTCTTCT	TGCGAGTCTTCT	muc3838	24m	Ciliated_cells	muc3838:TGCGAGTCTTCT
muc3838:TCCTGCTCCCTC	TCCTGCTCCCTC	muc3838	24m	Ciliated_cells	muc3838:TCCTGCTCCCTC
...	...	...	...	...	...
muc4657:ATCAAGACAGTG	ATCAAGACAGTG	muc4657	3m	Type_2_pneumocytes	muc4657:ATCAAGACAGTG
muc4657:GACTGCGCATGG	GACTGCGCATGG	muc4657	3m	Ciliated_cells	muc4657:GACTGCGCATGG
muc4657:ATGACCGAATGT	ATGACCGAATGT	muc4657	3m	Type_2_pneumocytes	muc4657:ATGACCGAATGT
muc4657:GTGTTTGGACCG	GTGTTTGGACCG	muc4657	3m	Cd4+_T_cells	muc4657:GTGTTTGGACCG
muc4657:TGGCCGTGTAGC	TGGCCGTGTAGC	muc4657	3m	Goblet_cells	muc4657:TGGCCGTGTAGC

14745 rows × 5 columns

Before proceeding with the selection of highly variable genes, we filter out low-quality cells and red blood cells, as the latter are not representative of lung cell types.

[5]:

adata = adata[~adata.obs['cell_type'].isin(['low_quality_cells', 'red_blood_cells'])]

[6]:

adata.layers["counts"] = adata.X.copy()

/tmp/ipykernel_657/1517723426.py:1: ImplicitModificationWarning: Setting element `.layers['counts']` of view, initializing view as actual.
  adata.layers["counts"] = adata.X.copy()

[7]:

sc.pp.highly_variable_genes(
    adata,
    n_top_genes=2000,
    flavor="seurat_v3",
    subset=True,
    layer="counts"
)

Factor analysis#

Next, we model the data using scPCA. One strategy is to treat the 3-month condition as the reference and capture gene expression shifts in old mice as deviations from this baseline. This can be specified using the design formula ~ age, similar to how one would define a model in a linear regression framework.

[8]:

m0 = scp.scPCA(
    adata,
    num_factors=10,
    layers_key='counts',
    loadings_formula='age',
    intercept_formula='1',
    seed=8999074
)

[9]:

m0.fit()

  0%|          | 0/5000 [00:00<?, ?it/s]/g/huber/users/voehring/conda/scpca_dev/lib/python3.12/site-packages/scpca/models/scpca.py:68: UserWarning: The use of `x.T` on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor. (Triggered internally at /pytorch/aten/src/ATen/native/TensorShape.cpp:4413.)
  α_rna = deterministic("α_rna", (1 / α_rna_inv).T)
Epoch: 4990 | lr: 1.00E-02 | ELBO: 2913242 | Δ_10: 43149.50 | Best: 2838556: 100%|██████████| 5000/5000 [02:12<00:00, 37.65it/s]

[10]:

m0.fit(lr=0.001, num_epochs=10000)

Epoch: 14990 | lr: 1.00E-03 | ELBO: 2876858 | Δ_10: 37865.50 | Best: 2809824: 100%|██████████| 10000/10000 [04:08<00:00, 40.29it/s]

Let’s store the result from the factorisation in the adata object.

[11]:

m0.mean_to_anndata(model_key='m0', num_samples=10, num_split=512)

Predicting z for obs 13824-14065.: 100%|██████████| 28/28 [00:02<00:00, 13.78it/s]

If we now take a close look at the adata object we will find that the factors and loadings were added to the obsm and varm fields respectively (X_m0 and W_m0). Meta data related to the model field are stored under adata.uns[{model_key}].

[12]:

adata

[12]:

AnnData object with n_obs × n_vars = 14066 × 2000
    obs: 'barcode', 'mouse_id', 'age', 'cell_type', 'cell_id'
    var: 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'X_name', 'celltype_column', 'condition_column', 'main_assay', 'hvg', 'm0'
    obsm: 'X_m0'
    varm: 'W_m0'
    layers: 'counts'

Let’s check our factor representation by getting a visual overview using UMAP. We use the UMAP helper function of scpca to accomplish that.

[13]:

scp.tl.umap(adata, 'X_m0')

[14]:

sc.pl.embedding(adata, basis='X_m0_umap', color=['cell_type', 'age'], ncols=1)

We now try to assess the main factors of the decomposition.

[15]:

scp.pl.factor_embedding(adata, model_key='m0', factor=[0, 1, 2, 3], basis='X_m0_umap', ncols=2)
plt.tight_layout()

Factor 0#

We begin by examining factor 0 more closely. This factor displays notably negative weights in myeloid cells, including alveolar macrophages and monocytes, and positive weights in club and goblet cells.

[26]:

scp.pl.factor_strip(adata, model_key='m0', factor=[0], cluster_key='cell_type', highlight=['Alveolar_macrophage', 'Club_cells'], s=2)

[26]:

<Axes: title={'center': 'Factor 0'}, xlabel='cluster', ylabel='Factor weight'>

Next we take a close look at the negative loading weights of factor 0 to learn more about the expression patterns in the myeloid cells of this dataset. We can extract those from the adata object using the state_loadings function. Since the factor weights are negative for the myeloid cells, we use the argument lowest=20 and highest=0 to extract the genes with the largest negative loading weights.

[34]:

top_weights_fac0 = scp.tl.state_loadings(adata, 'm0', 'Intercept', 0, highest=0, lowest=20)

[35]:

top_weights_fac0

[35]:

	gene	magnitude	weight	type	index
0	Abcg1	8.844831	-8.844831	lowest	73
1	Bcl2a1a	8.854088	-8.854088	lowest	162
2	Mpeg1	9.212790	-9.212790	lowest	1053
3	Marco	9.213015	-9.213015	lowest	1020
4	1100001G20Rik	9.315811	-9.315811	lowest	0
5	Atp6v0d2	9.373060	-9.373060	lowest	144
6	Ltc4s	9.418015	-9.418015	lowest	994
7	Fabp1	9.497688	-9.497688	lowest	542
8	Clec4n	9.505602	-9.505602	lowest	368
9	Mrc1	9.583654	-9.583654	lowest	1055
10	Ear1	9.637006	-9.637006	lowest	496
11	Bcl2a1d	9.889424	-9.889424	lowest	164
12	Clec4a3	10.041142	-10.041142	lowest	363
13	Fcer1g	10.077384	-10.077384	lowest	573
14	Tyrobp	10.141203	-10.141203	lowest	1513
15	Cybb	10.273161	-10.273161	lowest	435
16	Ccl6	10.514906	-10.514906	lowest	252
17	Chi3l3	10.604345	-10.604345	lowest	340
18	Ear2	10.699730	-10.699730	lowest	498
19	Ctss	10.881914	-10.881914	lowest	420

As expected the genes are largely associated with myeloid cells, particularly macrophages, and more specifically, alveolar macrophages or monocyte-derived macrophages:

Strong indicators of macrophages / myeloid lineage: Mpeg1, Clec4a3, Clec4n, Fcer1g, Tyrobp, Cybb, Ctss – general myeloid/macrophage markers.
Marco, Mrc1 (CD206) – alveolar macrophage markers; Marco is specific to tissue-resident macrophages like alveolar macrophages.
Chi3l3 (Ym1), Ear1, Ear2, Ccl6 – often upregulated in alternatively activated (M2-like) macrophages in mice.
Bcl2a1a, Bcl2a1d – anti-apoptotic genes, enriched in immune cells including myeloid cells.
Atp6v0d2 – involved in osteoclast/macrophage function.
Ltc4s – leukotriene pathway, expressed in myeloid-derived cells.
Abcg1 – involved in lipid homeostasis, enriched in alveolar macrophages.
Tyrobp – adapter protein for many immune receptors, common in microglia/macrophages.

We next ask the question how the factor 0 changed across conditions. To this end we can compute the loading differences between factor 0 loading weigths between the 24m and 3m condition.

[41]:

_ = scp.pl.loadings_state(adata, 'm0', ['Intercept', 'age[T.24m]'], 0, sign=-1, highest=10, lowest=10, width=4, height=3.5)
plt.tight_layout()

Interestingly we can see here see the downregulation of Ltc4s, Ear1, Ear2 – These are typically expressed in tissue-resident lung macrophages. Their downregulation suggests a shift in macrophage state (e.g., toward a more pro-inflammatory phenotype or loss of identity).

The set of upregulated genes appears to reflect a shift in macrophage state—potentially toward a more inflammatory, monocyte-derived, or activated phenotype, possibly related to aging or stress in the lung environment. This is especially supported by the upregulation of:

Nr4a1: Transcription factor critical for the development of non-classical monocytes; also involved in anti-inflammatory responses and macrophage self-renewal. Upregulation may indicate a monocyte-to-macrophage transition or stress adaptation.
Apoe, Apoc2: Apolipoproteins associated with lipid metabolism and foam-cell–like macrophage states; commonly upregulated in aging, inflammation, or lipid-rich environments (e.g., alveolar space). These are known markers of activated macrophages in chronic inflammation and neurodegeneration, too.

Factor 1#

Next we look at factor 1. Factor 1 exhibits high factor values in Ciliated Cells. Let’s also have a look factor 1s corresponding loading weights.

[31]:

top_weights_fac1 = scp.tl.state_loadings(adata, 'm0', 'Intercept', 1, highest=20)

[33]:

top_weights_fac1.head()

[33]:

	gene	magnitude	weight	type	factor	index
0	Fam183b	8.030411	8.030411	highest	1	554
1	1700016K19Rik	7.806339	7.806339	highest	1	15
2	Mlf1	7.700116	7.700116	highest	1	1041
3	Tmem212	7.574149	7.574149	highest	1	1443
4	1700024G13Rik	7.569454	7.569454	highest	1	17

[35]:

ax = sns.barplot(data=top_weights_fac1, x='gene', y='weight')
ax.tick_params(axis='x', rotation=90)

These genes—like CCDC39, Drc1, Dynlrb2, plus several coiled‑coil domain proteins (CCDC113, CCDC153, etc.)—are classic markers of motile cilia structures: they play key roles in axonemal assembly, dynein arm formation, and ciliary motility. For example, CCDC39 is essential for the dynein regulatory complex and inner dynein arms, and DRC1 also has a similar role in ciliary beat regulation.

We next ask the question how the factor 1 changed across conditions. To this end we can compute the loading differences between factor 1 loading weigths from the 24m and 3m condition.

[44]:

_ = scp.pl.loadings_state(adata, 'm0', ['Intercept', 'age[T.24m]'], 1, highest=10, width=4, height=3.5)
plt.tight_layout()

[50]:

top_loading_diff_1 = scp.tl.state_diff(adata, 'm0',  ['Intercept', 'age[T.24m]'], 1, highest=20)

These genes, when considered collectively, point to a signature of metabolically active, detoxifying epithelial cells—possibly from liver or lung tissue. They combine:

Detox enzymes and transporters (P450s, Sult1d1, Slc16a11)
Epithelial structural/barrier proteins (Krt8, Cldn3, WFDC2)
Stress response and regulatory factors (Nupr1, Tiam1, Wif1, Dkkl1)

Together these results indicate that scPCAs inferred loading weight differences indicate that ciliated cells shift toward a more metabollically active / stress respond state in old mice.

[ ]:

Analysing the brain cells in light stimulated mice

Contents

Analysing the brain cells in light stimulated mice#

Data preprocessing#

Factor analysis#

Factor 0#

Factor 1#