1. Topic matching tutorial

This notebook demonstrates how to use the topic matching functionality in atac_mapper to infer topic distributions for query data using topic loadings from reference data. Topic modelling of scATAC-seq data can be performed with cisTopic.

import pandas as pd
from scipy.io import mmread
from atac_mapper.topic_matching import TopicMatch

Load Data

We need two main pieces of data:

  1. Region-topic matrix from the reference dataset (from a cisTopic analysis). It is normally available as cistopic_obj.selected_model.region_topic.

  2. Query fragment matrix (regions x cells)

The region-topic matrix should contain the topic loadings for each genomic region, typically saved as a TSV file from cisTopic. The query fragment matrix contains the accessibility counts for each region in your query cells.

💡 Tip: Make sure the regions (rows) in both matrices correspond to the same genomic coordinates. You can use tools like pyranges or Feature Matrix to match region coordinates of query and reference data if needed.
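The row alignment described in the tip can be sketched as follows. This is a minimal illustration, not part of atac_mapper: the helper `align_query_to_reference` and the toy region names are hypothetical. It assumes that regions are identified by exactly matching name strings, that reference region names are unique, and that you have the list of query region names available separately, since a .mtx file stores no row labels. Regions that overlap but have different coordinates are not handled here; a tool such as pyranges would be needed for that.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def align_query_to_reference(query, query_regions, reference_regions):
    """Reorder query rows (regions) to follow the reference region order.

    Hypothetical helper for illustration. Reference regions absent from the
    query become all-zero rows; query regions absent from the reference are
    dropped.
    """
    query = sparse.csr_matrix(query)
    ref_index = pd.Index(reference_regions)
    # Row position of each query region in the reference (-1 if not present)
    target_rows = ref_index.get_indexer(query_regions)
    keep = target_rows >= 0
    # Selection matrix (n_reference x n_query) with a 1 wherever a query row
    # maps onto a reference row
    selector = sparse.csr_matrix(
        (np.ones(int(keep.sum())), (target_rows[keep], np.flatnonzero(keep))),
        shape=(len(ref_index), query.shape[0]),
    )
    return (selector @ query).tocsr()


# Toy data: 3 reference regions, 3 query regions (one not in the reference), 2 cells
reference_regions = ["chr1:100-600", "chr1:900-1400", "chr2:50-550"]
query_regions = ["chr2:50-550", "chr1:100-600", "chr3:10-510"]
query = np.array([[1, 0], [0, 2], [5, 5]])

aligned = align_query_to_reference(query, query_regions, reference_regions)
print(aligned.toarray())
```

After alignment, the number of rows equals the number of reference regions, so the shape check printed later in this notebook should show matching first dimensions.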

# Load reference topic distributions
region_topic_df = pd.read_csv("../../../../test_data_atac_mapper/cistopic_loading_mannens.tsv", sep="\t", index_col=0)

# Load query fragment matrix (regions x cells)
query_matrix = mmread("../../../../test_data_atac_mapper/FM_atlas_test_20_subset.mtx").tocsr()

print(f"Reference shape (regions x topics): {region_topic_df.shape}")
print(f"Query shape (regions x cells): {query_matrix.shape}")
Reference shape (regions x topics): (410863, 175)
Query shape (regions x cells): (410863, 20)

Initialize TopicMatch and Run Inference

Now we can use the TopicMatch class to infer topic distributions for our query cells. The topic inference implementation was inspired by the lda package. We highly recommend allocating as many cores as possible so that inference can run in parallel, especially for larger datasets.

# Initialize topic matcher
topic_matcher = TopicMatch(region_topic_df)

# Run inference
topic_distributions = topic_matcher.infer_topics(
    query=query_matrix,
    njobs=-1,  # Use all available cores
    n_iterations=100,
    tol=1e-4,
)

# Convert to DataFrame for easier inspection
results_df = pd.DataFrame(
    topic_distributions,
    columns=region_topic_df.columns,
)

print("\nFirst few cells and their topic distributions:")
print(results_df.head())
Converged after 83 iterations (delta=9.88e-05)
Converged after 99 iterations (delta=9.95e-05)
Converged after 94 iterations (delta=9.80e-05)
Converged after 92 iterations (delta=9.89e-05)
Converged after 100 iterations (delta=9.99e-05)

First few cells and their topic distributions:
         Topic1        Topic2        Topic3        Topic4    Topic5  \
0  2.176386e-08  1.664294e-11  1.746968e-05  1.384006e-13  0.010057   
1  7.951094e-03  2.679191e-03  1.049767e-02  5.712014e-03  0.001593   
2  2.590506e-12  2.809658e-07  1.422678e-03  1.744847e-10  0.010116   
3  3.265713e-02  4.445763e-04  1.816033e-02  6.666232e-05  0.003741   
4  2.483622e-04  6.483961e-10  1.280752e-07  6.682240e-03  0.013579   

         Topic6        Topic7        Topic8    Topic9       Topic10  ...  \
0  4.184981e-03  7.211357e-07  6.976538e-04  0.021367  1.185928e-18  ...   
1  4.078344e-04  6.189188e-03  3.997446e-06  0.012007  1.544286e-04  ...   
2  2.267622e-05  1.827246e-10  8.665125e-09  0.004133  6.726061e-08  ...   
3  8.255901e-04  2.127160e-02  4.903567e-05  0.000139  2.697414e-04  ...   
4  9.846099e-08  1.929619e-03  7.959138e-09  0.011825  1.525845e-05  ...   

       Topic166      Topic167      Topic168      Topic169      Topic170  \
0  1.068586e-04  3.056275e-09  2.789160e-07  2.505092e-03  2.192374e-02   
1  1.524659e-06  4.767883e-05  1.258490e-04  2.688697e-03  7.104237e-03   
2  2.442049e-09  5.490281e-08  5.611078e-03  1.078140e-05  1.511663e-07   
3  1.269284e-04  3.616528e-04  1.962602e-05  4.454316e-03  1.635973e-04   
4  1.375858e-03  5.869555e-07  1.208770e-10  2.622162e-07  4.487152e-02   

       Topic171  Topic172      Topic173      Topic174      Topic175  
0  7.841314e-03  0.000003  1.534054e-02  5.845776e-04  2.849854e-10  
1  1.539699e-03  0.010355  3.317070e-03  3.018989e-04  1.180509e-03  
2  9.212135e-03  0.005072  1.745747e-03  5.427419e-04  5.958500e-06  
3  5.033489e-07  0.000102  2.129651e-03  2.848389e-04  4.410671e-08  
4  5.865168e-05  0.000008  3.527414e-10  9.215071e-11  2.997555e-05  

[5 rows x 175 columns]
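To give a feel for what inference with fixed topic loadings involves, here is a minimal sketch loosely modelled on the iterated pseudo-count update that the lda package uses when transforming new documents. It is not the code of TopicMatch; the function `infer_cell_topics`, the prior `alpha`, and the toy numbers are assumptions made for illustration. It also assumes the topic loadings have been arranged as a topics x regions matrix whose rows are probability distributions, and that every observed region has a non-zero probability in at least one topic.

```python
import numpy as np


def infer_cell_topics(counts, topic_region, alpha=0.1, n_iterations=100, tol=1e-4):
    """Estimate one cell's topic distribution with topic-region probabilities fixed.

    Illustrative sketch only.
    counts: 1-D integer array of region counts for a single cell.
    topic_region: (n_topics x n_regions) array, each row summing to 1.
    """
    counts = np.asarray(counts, dtype=int)
    n_topics = topic_region.shape[0]
    # One entry per observed count, holding the index of its region
    tokens = np.repeat(np.arange(counts.size), counts)
    if tokens.size == 0:
        # No observations: fall back to a uniform distribution
        return np.full(n_topics, 1.0 / n_topics)
    token_probs = topic_region[:, tokens].T  # (n_tokens x n_topics)
    resp = np.zeros_like(token_probs)  # per-token topic responsibilities
    for _ in range(n_iterations):
        # Weight each token's topic probabilities by the topic mass assigned
        # to all other tokens of the cell, plus the prior alpha
        new_resp = token_probs * (resp.sum(axis=0) - resp + alpha)
        new_resp /= new_resp.sum(axis=1, keepdims=True)
        delta = np.abs(new_resp - resp).sum()
        resp = new_resp
        if delta < tol:
            break
    return resp.sum(axis=0) / resp.sum()


# Toy example: 2 topics over 3 regions, one cell with counts mostly in region 0
topic_region = np.array([[0.7, 0.2, 0.1],
                         [0.1, 0.2, 0.7]])
theta = infer_cell_topics(np.array([5, 1, 0]), topic_region)
print(theta)
```

Because each cell is processed independently in a scheme like this, the work parallelizes naturally across cells, which is consistent with the recommendation to allocate many cores.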

Analyze Results

Let’s look at some basic statistics of the inferred topic distributions.

# Basic statistics
print("Average topic probability per topic:")
print(results_df.mean())

print("\nMost common dominant topic for cells:")
dominant_topics = results_df.idxmax(axis=1).value_counts()
print(dominant_topics)
Average topic probability per topic:
topic_0      0.005722
topic_1      0.001145
topic_2      0.004587
topic_3      0.000797
topic_4      0.004191
               ...   
topic_170    0.003773
topic_171    0.002717
topic_172    0.008476
topic_173    0.003091
topic_174    0.005285
Length: 175, dtype: float64

Most common dominant topic for cells:
topic_141    2
topic_164    2
topic_83     2
topic_34     2
topic_132    2
topic_46     1
topic_125    1
topic_111    1
topic_37     1
topic_144    1
topic_9      1
topic_7      1
topic_107    1
topic_63     1
topic_159    1
Name: count, dtype: int64
# Save results to a file
results_df.to_csv("topic_inference_results.tsv", index=False, sep="\t")

Save this matrix as a layer in the query adata to continue with reference matching.