Sharing is an attitude

## brief introduction

PHATE [1] (potential of heat diffusion for affinity based transition embedding) is a tool developed by Krishnaswamy Laboratory for visualizing high-dimensional single cell data with natural processes or trajectories. PHATE uses a novel conceptual framework to learn and visualize manifolds inherent in biological systems. Among them, smooth transition marks the progress of cells from one state to another. PHATE generates a low dimensional embedding, which uses the geometric distance information between data points (i.e. the distance between sample expression profiles) to capture the local and global nonlinear structures in the data set.

At present, the article is published in the top journal of Nature Biotechnology: Visualizing Structure and Transitions in High-Dimensional Biological Data. 2019\. Nature Biotechnology.[2]

The following is its schematic diagram:

## Data introduction

In this tutorial, we will demonstrate how to use PHATE to analyze the data of 31000 single cells differentiated by embryonic body (EB) at different time points of 27 days. Running this tutorial takes approximately 15 minutes (excluding t-SNE comparisons), or 25 minutes (including comparisons).

### Differentiation process of human embryo at different time points

Firstly, the low generation H1 hESCs cells were cultured in matrix glue coated Petri dishes, and FGF2 was supplemented in DMEM/F12-N2B27 medium. For the formation of EB, the cells were treated with Dispase, dissociated into small masses and laid in non adherent plates. The medium was supplemented with 20% FBS. During the 27 day differentiation process, samples were collected every 3 days, including undifferentiated hESC samples. The expression of key germ marker genes in these EB cultures was verified by qPCR. For the construction of single-cell library, EB culture was isolated, FACS was sorted to remove diploid cells and dead cells, and cDNA library was generated using 10x Genomics instrument, and then sequenced.

## Install PHATE and its dependent packages

- Install directly using pip

pip install --user phate #Installing MAGIC and scprep dependency packages pip install --user magic-impute pip install --user scprep

- Download the source code for installation

git clone --recursive git://github.com/KrishnaswamyLab/PHATE.git cd PHATE/Python python setup.py install --user

## Download 10x single cell sample data

The EB sample dataset scRNAseq.zip can be downloaded from( https://data.mendeley.com/datasets/v6n743h5ng/ )Download and use Mendelay Datasets. The scrna SEQ folder contains five subdirectories. Each subdirectory contains three files generated by CellRanger processing: barcodes.tsv, genes.tsv, and matrix.mtx.

# Import the required python package import os import scprep download_path = os.path.expanduser("~") print(download_path) >>> /home/user # Download data if not os.path.isdir(os.path.join(download_path, "scRNAseq", "T0_1A")): # need to download the data scprep.io.download.download_and_extract_zip( "https://md-datasets-public-files-prod.s3.eu-west-1.amazonaws.com/" "5739738f-d4dd-49f7-b8d1-5841abdbeb1e", download_path)

This is the structure of the directory:

## Use scprep package to import data and perform data quality control pretreatment

We use scprep package to import data through scprep.io.load_ The 10x function will automatically load the 10x scRNAseq dataset into the DataFrame type of Pandas. Of course, we can also use scprep.io.load_ The TSV () function directly reads the TSV file of the gene expression matrix.

Note: by default, scprep.io.load_ The 10x function uses panda's SparseDataFrame format (see panda's documentation) [3] to load scrna SEQ data to maximize memory efficiency. However, this will be slower than loading the deny matrix. If you want to load the deny matrix, you can pass the spark = false parameter to load_10X function. At the same time, we also set gene_labels = 'both' parameter, so that the information of gene symbol and gene ID can be retained at the same time.

# Import the required python package import pandas as pd import numpy as np import phate import scprep # matplotlib settings for Jupyter notebooks only %matplotlib inline # Use load_ The 10x function imports each sample file separately sparse=True T1 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T0_1A"), sparse=sparse, gene_labels='both') T2 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T2_3B"), sparse=sparse, gene_labels='both') T3 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T4_5C"), sparse=sparse, gene_labels='both') T4 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T6_7D"), sparse=sparse, gene_labels='both') T5 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T8_9E"), sparse=sparse, gene_labels='both') # View data information T1.head()

### Library data quality control

First, we will filter out cells with very large or very small library size. For this data set, the size of the library has a certain correlation with the samples, so we will filter each sample separately. Here, we use filter_ library_ The size () function filters out 20% of the cells at the top and bottom of each sample.

# View library size distribution scprep.plot.plot_library_size(T1, percentile=20) plt.show()

# Filter each library separately filtered_batches = [] for batch in [T1, T2, T3, T4, T5]: batch = scprep.filter.filter_library_size(batch, percentile=20, keep_cells='above') batch = scprep.filter.filter_library_size(batch, percentile=75, keep_cells='below') filtered_batches.append(batch) del T1, T2, T3, T4, T5 # removes objects from memory

### Merge data of all samples

Next, we will use combine_ The batches () function combines the data of all samples and constructs a vector representing the time point of each sample.

EBT_counts, sample_labels = scprep.utils.combine_batches( filtered_batches, ["Day 00-03", "Day 06-09", "Day 12-15", "Day 18-21", "Day 24-27"], append_to_cell_names=True ) del filtered_batches # removes objects from memory EBT_counts.head()

### Remove rare genes

Next, we will remove genes that are expressed in 10 or fewer cells.

EBT_counts = scprep.filter.filter_rare_genes(EBT_counts, min_cells=10)

### data normalization

To correct for differences in Library sizes, we use library_ size_ The normalize() function divides each cell by its library size, and then rescales by the median library size.

EBT_counts = scprep.normalize.library_size_normalize(EBT_counts)

### Remove dead cells with high mitochondrial content

Generally, the expression level of mitochondrial RNA in dead cells may be higher than that in normal living cells. Therefore, we removed suspicious dead cells by eliminating the cells with the highest average mitochondrial RNA expression level.

First, let's look at the distribution of mitochondrial genes.

mito_genes = scprep.select.get_gene_set(EBT_counts, starts_with="MT-") # Get all mitochondrial genes. There are 14, FYI. scprep.plot.plot_gene_set_expression(EBT_counts, genes=mito_genes, percentile=90) plt.show()

Here, we see a sharp increase in the expression of mitochondrial RNA in the first 90%. Therefore, we will delete these cells in subsequent analysis.

### data conversion

Here, we will use the scprep.transform.sqrt() function for square root transformation.

EBT_counts = scprep.transform.sqrt(EBT_counts)

## Visualization of low dimensional embedding and dimensionality reduction using PHATE

Firstly, we instantiate a PHATE estimator object to embed and fit PHATE low dimension into a given data set. Next, use fit and fit_ The transform function generates a low dimensional embedding. For more information, please check the PHATE reading document [4].

Here, we will use the default parameters, but we can adjust the following parameters (read our document on phase. Readthedocs. IO [5] for more information):

- knn: number of nearest neighbors (default: 5). If the resulting PHATE low dimensional embedding looks very incoherent, we can increase this value (for example, to 20). If your dataset is very large (e.g. > 100k cells), you should also consider increasing this value.
- Decay: Alpha falloff value (default: 15). Decreasing the decay value can increase the connectivity of knn graphs, and increasing the decay value will reduce the connectivity of graphs. In general, this value rarely needs to be adjusted.
- t: Number of times to power the operator (default: 'auto'). This is equivalent to the amount of smoothing the data. It is automatically selected by default, but if your low dimensional embedding lacks structure, you can consider increasing it, or reducing it if the structure looks too tight.
- Gamma: Informational distance constant (default: 1). gamma=1 gives the PHATE log potential, but other informational distances can be interesting. If most of the points seem concentrated in one section of the plot, you can try gamma=0.

This is the easiest way to apply PHATE:

# Instantiate phase estimator object phate_operator = phate.PHATE(n_jobs=-2) # Using fit_ Low dimensional embedding fitting with transform function Y_phate = phate_operator.fit_transform(EBT_counts) Calculating PHATE... Running PHATE on 16821 cells and 17845 genes. Calculating graph and diffusion operator... Calculating PCA... Calculated PCA in 32.16 seconds. Calculating KNN search... Calculated KNN search in 14.15 seconds. Calculating affinities... Calculated affinities in 0.26 seconds. Calculated graph and diffusion operator in 52.76 seconds. Calculating landmark operator... Calculating SVD... Calculated SVD in 2.15 seconds. Calculating KMeans... Calculated KMeans in 36.05 seconds. Calculated landmark operator in 40.53 seconds. Calculating optimal t... Automatically selected t = 21 Calculated optimal t in 2.98 seconds. Calculating diffusion potential... Calculated diffusion potential in 2.12 seconds. Calculating metric MDS... Calculated metric MDS in 13.67 seconds. Calculated PHATE in 112.09 seconds.

Next, we can use the scprep.plot.scatter2d() function to visually display the results of PHATE low-dimensional embedding.

scprep.plot.scatter2d(Y_phate, c=sample_labels, figsize=(12,8), cmap="Spectral", ticks=False, label_prefix="PHATE") plt.show()

Because we want to find the subtle structure of EB in the differentiation process at different time points, and we expect some differentiation trajectories to be sparse. Therefore, we can reduce the knn value from the default of 5 and the t value from the automatic detection of 21 (in the above output results). For single-cell RNA SEQ data, if you want to find subtle structural changes in the data set, you can try to reduce the knn value to 3 or 4. If you have hundreds of thousands of cells, you can try to increase the value to 30 or 40. Here, we also reduce the alpha value to 15.

phate_operator.set_params(knn=4, decay=15, t=12) # We could also create a new operator: # phate_operator = phate.PHATE(knn=4, decay=15, t=12, n_jobs=-2) Y_phate = phate_operator.fit_transform(EBT_counts) Calculating PHATE... Running PHATE on 16821 cells and 17845 genes. Calculating graph and diffusion operator... Calculating PCA... Calculated PCA in 30.84 seconds. Calculating KNN search... Calculated KNN search in 17.06 seconds. Calculating affinities... Calculated affinities in 8.93 seconds. Calculated graph and diffusion operator in 62.44 seconds. Calculating landmark operator... Calculating SVD... Calculated SVD in 10.38 seconds. Calculating KMeans... Calculated KMeans in 35.15 seconds. Calculated landmark operator in 47.79 seconds. Calculating diffusion potential... Calculated diffusion potential in 1.45 seconds. Calculating metric MDS... Calculated metric MDS in 7.24 seconds. Calculated PHATE in 118.93 seconds.

scprep.plot.scatter2d(Y_phate, c=sample_labels, figsize=(12,8), cmap="Spectral", ticks=False, label_prefix="PHATE") plt.show()

Of course, we can also use the scprep.plot.scatter3d() function for 3D visualization.

phate_operator.set_params(n_components=3) Y_phate_3d = phate_operator.transform() scprep.plot.scatter3d(Y_phate_3d, c=sample_labels, figsize=(8,6), cmap="Spectral", ticks=False, label_prefix="PHATE") plt.show()

We can even generate a dynamic gif graph of 3D rotation.

scprep.plot.rotate_scatter3d(Y_phate_3d, c=sample_labels, figsize=(8,6), cmap="Spectral", ticks=False, label_prefix="PHATE") # to save as a gif: # >>> scprep.plot.rotate_scatter3d(Y_phate_3d, c=sample_labels, # ... figsize=(8,6), cmap="Spectral", # ... ticks=False, label_prefix="PHATE", filename="phate.gif") # to save as an mp4: # >>> scprep.plot.rotate_scatter3d(Y_phate_3d, c=sample_labels, # ... figsize=(8,6), cmap="Spectral", # ... ticks=False, label_prefix="PHATE", filename="phate.mp4")

## Comparison with other dimensionality reduction visualization tools

Next, we will compare the results of PCA and t-SNE dimensionality reduction visualization.

import sklearn.decomposition # PCA import sklearn.manifold # t-SNE import time start = time.time() # PCA dimensionality reduction visualization pca_operator = sklearn.decomposition.PCA(n_components=2) Y_pca = pca_operator.fit_transform(np.array(EBT_counts)) end = time.time() print("Embedded PCA in {:.2f} seconds.".format(end-start)) start = time.time() pca_operator = sklearn.decomposition.PCA(n_components=100) # tSNE dimensionality reduction visualization tsne_operator = sklearn.manifold.TSNE(n_components=2) Y_tsne = tsne_operator.fit_transform(pca_operator.fit_transform(np.array(EBT_counts))) end = time.time() print("Embedded t-SNE in {:.2f} seconds.".format(end-start))

# plot everything import matplotlib.pyplot as plt fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(16, 5)) #plotting PCA scprep.plot.scatter2d(Y_pca, label_prefix="PC", title="PCA", c=sample_labels, ticks=False, cmap='Spectral', ax=ax1) #plotting tSNE scprep.plot.scatter2d(Y_tsne, label_prefix="t-SNE", title="t-SNE", legend=False, c=sample_labels, ticks=False, cmap='Spectral', ax=ax2) #plotting PHATE scprep.plot.scatter2d(Y_phate, label_prefix="PHATE", title="PHATE", legend=False, c=sample_labels, ticks=False, cmap='Spectral', ax=ax3) plt.tight_layout() plt.show()

It can be seen that PHATE can well show the continuous change state of EB at different differentiation time points, while retaining their global overall structure.

reference material

[1]

PHATE: https://github.com/KrishnaswamyLab/PHATE

[2]

Visualizing Structure and Transitions in High-Dimensional Biological Data. 2019. Nature Biotechnology.: https://www.nature.com/articles/s41587-019-0336-3

[3]

Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/sparse.html

[4]

PHATE reading documents: https://phate.readthedocs.io/

[5]

phate.readthedocs.io: https://phate.readthedocs.io/

[6]

reference resources: https://nbviewer.org/github/KrishnaswamyLab/PHATE/blob/master/Python/tutorial/EmbryoidBody.ipynb