1. Topic matching tutorial#
This notebook demonstrates how to use the topic matching functionality in atac_mapper to infer topic distributions for query data using topic loadings from reference data. Topic modelling of scATAC-seq data can be performed using cisTopic.
import pandas as pd
from scipy.io import mmread
from atac_mapper.topic_matching import TopicMatch
Load Data#
We need two main pieces of data:
Region-topic matrix from the reference dataset (from a cisTopic analysis). Normally you can find it in cistopic_obj.selected_model.region_topic.
Query fragment matrix (regions x cells)
The region-topic matrix should contain the topic loadings for each genomic region, typically saved as a TSV file from cisTopic. The query fragment matrix contains accessibility counts for each region in your query cells.
💡 Tip: Make sure the regions (rows) in both matrices correspond to the same genomic coordinates. You can use tools like pyranges or Feature Matrix to match region coordinates of query and reference data if needed.
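The tip above can be sketched for the simplest case, in which query and reference regions share exactly the same names (for example "chr1:100-200") but differ in order or coverage. This is only an illustration: `align_query_to_reference` and the toy region names are hypothetical and not part of atac_mapper, and matching regions by coordinate overlap (for example with pyranges) is a separate problem that this sketch does not address.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def align_query_to_reference(query, query_regions, reference_regions):
    """Reorder the rows of `query` (regions x cells) to follow `reference_regions`.

    Hypothetical helper: regions missing from the query become all-zero rows,
    and query regions absent from the reference are dropped. Matching is by
    exact region name, and query region names must be unique.
    """
    query = sparse.csr_matrix(query)
    # Position of each reference region in the query (-1 if absent)
    positions = pd.Index(query_regions).get_indexer(reference_regions)
    found = positions >= 0
    # Selection matrix S (n_reference x n_query) with S[i, positions[i]] = 1
    selector = sparse.csr_matrix(
        (np.ones(found.sum()), (np.flatnonzero(found), positions[found])),
        shape=(len(reference_regions), query.shape[0]),
    )
    return (selector @ query).tocsr()


# Toy data: three reference regions, three query regions in a different order
reference_regions = ["chr1:100-200", "chr1:300-400", "chr2:50-150"]
query_regions = ["chr2:50-150", "chr1:100-200", "chr3:10-20"]
query = np.array([[1, 0], [2, 3], [4, 4]])  # regions x cells

aligned = align_query_to_reference(query, query_regions, reference_regions)
print(aligned.toarray())
```

Filling missing regions with zeros keeps the row count equal to the reference, but it treats those regions as closed in every query cell, so it is worth checking how many regions were actually matched before relying on the result.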
# Load reference topic distributions
region_topic_df = pd.read_csv("../../../../test_data_atac_mapper/cistopic_loading_mannens.tsv", sep="\t", index_col=0)
# Load query fragment matrix (regions x cells)
query_matrix = mmread("../../../../test_data_atac_mapper/FM_atlas_test_20_subset.mtx").tocsr()
print(f"Reference shape (regions x topics): {region_topic_df.shape}")
print(f"Query shape (regions x cells): {query_matrix.shape}")
Reference shape (regions x topics): (410863, 175)
Query shape (regions x cells): (410863, 20)
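Both shapes above report the same number of regions, which is what the inference step relies on. A small check along these lines could catch a mismatch early; `check_inputs` is a hypothetical helper written for this tutorial, not an atac_mapper function, and it is demonstrated on toy data.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def check_inputs(region_topic_df, query_matrix):
    """Raise ValueError if reference loadings and query counts look inconsistent."""
    n_ref, n_query = region_topic_df.shape[0], query_matrix.shape[0]
    if n_ref != n_query:
        raise ValueError(
            f"Region count mismatch: reference has {n_ref}, query has {n_query}"
        )
    if not np.isfinite(region_topic_df.to_numpy(dtype=float)).all():
        raise ValueError("Region-topic matrix contains NaN or infinite values")
    # For sparse input only the stored entries need to be inspected
    values = query_matrix.data if sparse.issparse(query_matrix) else np.asarray(query_matrix)
    if (values < 0).any():
        raise ValueError("Query matrix contains negative counts")


# Toy inputs: 4 regions, 2 topics, 3 cells
toy_loadings = pd.DataFrame(np.full((4, 2), 0.25), columns=["Topic1", "Topic2"])
toy_query = sparse.csr_matrix(np.array([[1, 0, 2], [0, 0, 1], [3, 1, 0], [0, 2, 0]]))

check_inputs(toy_loadings, toy_query)  # passes silently
```

Note that equal row counts do not prove that the rows refer to the same genomic coordinates; this check only rules out the most obvious mismatch.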
Initialize TopicMatch and Run Inference#
Now we can use the TopicMatch class to infer topic distributions for our query cells. The topic inference implementation was inspired by the lda package. We highly recommend allocating as many cores as possible so that inference can run in parallel, especially for larger datasets.
# Initialize topic matcher
topic_matcher = TopicMatch(region_topic_df)
# Run inference
topic_distributions = topic_matcher.infer_topics(
    query=query_matrix,
    njobs=-1,  # Use all available cores
    n_iterations=100,
    tol=1e-4,
)
# Convert to DataFrame for easier inspection
results_df = pd.DataFrame(
    topic_distributions,
    columns=region_topic_df.columns,
)
print("\nFirst few cells and their topic distributions:")
print(results_df.head())
Converged after 83 iterations (delta=9.88e-05)
Converged after 99 iterations (delta=9.95e-05)
Converged after 94 iterations (delta=9.80e-05)
Converged after 92 iterations (delta=9.89e-05)
Converged after 100 iterations (delta=9.99e-05)
First few cells and their topic distributions:
Topic1 Topic2 Topic3 Topic4 Topic5 \
0 2.176386e-08 1.664294e-11 1.746968e-05 1.384006e-13 0.010057
1 7.951094e-03 2.679191e-03 1.049767e-02 5.712014e-03 0.001593
2 2.590506e-12 2.809658e-07 1.422678e-03 1.744847e-10 0.010116
3 3.265713e-02 4.445763e-04 1.816033e-02 6.666232e-05 0.003741
4 2.483622e-04 6.483961e-10 1.280752e-07 6.682240e-03 0.013579
Topic6 Topic7 Topic8 Topic9 Topic10 ... \
0 4.184981e-03 7.211357e-07 6.976538e-04 0.021367 1.185928e-18 ...
1 4.078344e-04 6.189188e-03 3.997446e-06 0.012007 1.544286e-04 ...
2 2.267622e-05 1.827246e-10 8.665125e-09 0.004133 6.726061e-08 ...
3 8.255901e-04 2.127160e-02 4.903567e-05 0.000139 2.697414e-04 ...
4 9.846099e-08 1.929619e-03 7.959138e-09 0.011825 1.525845e-05 ...
Topic166 Topic167 Topic168 Topic169 Topic170 \
0 1.068586e-04 3.056275e-09 2.789160e-07 2.505092e-03 2.192374e-02
1 1.524659e-06 4.767883e-05 1.258490e-04 2.688697e-03 7.104237e-03
2 2.442049e-09 5.490281e-08 5.611078e-03 1.078140e-05 1.511663e-07
3 1.269284e-04 3.616528e-04 1.962602e-05 4.454316e-03 1.635973e-04
4 1.375858e-03 5.869555e-07 1.208770e-10 2.622162e-07 4.487152e-02
Topic171 Topic172 Topic173 Topic174 Topic175
0 7.841314e-03 0.000003 1.534054e-02 5.845776e-04 2.849854e-10
1 1.539699e-03 0.010355 3.317070e-03 3.018989e-04 1.180509e-03
2 9.212135e-03 0.005072 1.745747e-03 5.427419e-04 5.958500e-06
3 5.033489e-07 0.000102 2.129651e-03 2.848389e-04 4.410671e-08
4 5.865168e-05 0.000008 3.527414e-10 9.215071e-11 2.997555e-05
[5 rows x 175 columns]
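To give an intuition for what inference with fixed topic loadings involves, here is a minimal sketch of an EM-style fold-in for a single cell, with an iteration limit and a tolerance analogous to n_iterations and tol above. It is not the atac_mapper or lda implementation: `infer_topic_mixture` is a hypothetical function, and it assumes that each column of the loadings matrix is a probability distribution over regions and that the cell has at least one non-zero count.

```python
import numpy as np


def infer_topic_mixture(counts, region_given_topic, n_iterations=100, tol=1e-4):
    """Estimate one cell's topic proportions by EM, keeping topic loadings fixed.

    counts: (n_regions,) non-negative counts for one cell.
    region_given_topic: (n_regions, n_topics), each column sums to 1.
    Assumes every observed region has non-zero probability under some topic.
    """
    n_topics = region_given_topic.shape[1]
    theta = np.full(n_topics, 1.0 / n_topics)  # start from a uniform mixture
    observed = counts > 0  # regions with zero counts do not affect the update
    counts = counts[observed]
    loadings = region_given_topic[observed]
    for _ in range(n_iterations):
        # E-step: responsibility of each topic for each observed region
        weighted = loadings * theta
        weighted /= weighted.sum(axis=1, keepdims=True)
        # M-step: count-weighted average of the responsibilities
        theta_new = counts @ weighted / counts.sum()
        delta = np.abs(theta_new - theta).sum()
        theta = theta_new
        if delta < tol:
            break
    return theta


# Toy example: two topics with disjoint regions, so the answer is simply
# the fraction of counts falling in each topic's regions.
toy_loadings = np.array([[0.5, 0.0], [0.5, 0.0], [0.0, 0.5], [0.0, 0.5]])
toy_counts = np.array([3.0, 3.0, 0.0, 2.0])
theta = infer_topic_mixture(toy_counts, toy_loadings)
print(theta)
```

Because each cell is processed independently of the others in a scheme like this, the work parallelizes naturally across cells, which is consistent with the recommendation to allocate many cores.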
Analyze Results#
Let’s look at some basic statistics of the inferred topic distributions.
# Basic statistics
print("Average topic probability per topic:")
print(results_df.mean())
print("\nMost common dominant topic for cells:")
dominant_topics = results_df.idxmax(axis=1).value_counts()
print(dominant_topics)
Average topic probability per topic:
topic_0 0.005722
topic_1 0.001145
topic_2 0.004587
topic_3 0.000797
topic_4 0.004191
...
topic_170 0.003773
topic_171 0.002717
topic_172 0.008476
topic_173 0.003091
topic_174 0.005285
Length: 175, dtype: float64
Most common dominant topic for cells:
topic_141 2
topic_164 2
topic_83 2
topic_34 2
topic_132 2
topic_46 1
topic_125 1
topic_111 1
topic_37 1
topic_144 1
topic_9 1
topic_7 1
topic_107 1
topic_63 1
topic_159 1
Name: count, dtype: int64
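The dominant topic alone can hide cells whose probability mass is spread over several topics. As a possible follow-up, the sketch below lists the k highest-probability topics per cell; `top_topics` is a hypothetical helper shown on a toy table, and it could be applied to results_df in the same way.

```python
import numpy as np
import pandas as pd


def top_topics(results_df, k=3):
    """Return a long-format table of the k highest-probability topics per cell."""
    values = results_df.to_numpy()
    # Column indices sorted by descending probability, truncated to k per row
    order = np.argsort(-values, axis=1)[:, :k]
    n_cells, n_kept = order.shape
    rows = np.repeat(np.arange(n_cells), n_kept)
    cols = order.ravel()
    return pd.DataFrame(
        {
            "cell": results_df.index.to_numpy()[rows],
            "rank": np.tile(np.arange(1, n_kept + 1), n_cells),
            "topic": results_df.columns.to_numpy()[cols],
            "probability": values[rows, cols],
        }
    )


# Toy cells x topics table
toy = pd.DataFrame(
    [[0.1, 0.7, 0.2], [0.5, 0.2, 0.3]],
    columns=["Topic1", "Topic2", "Topic3"],
)
top = top_topics(toy, k=2)
print(top)
```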
# Save results to a file
results_df.to_csv("topic_inference_results.tsv", index=False, sep="\t")
To continue with reference matching, save this matrix as a layer in the query adata.