CROP-seq (CRISPR droplet sequencing)

CROP-seq is a technique that combines pooled CRISPR screening with droplet-based single-cell RNA sequencing, and can directly measure guide RNA expression as part of the transcriptome readout from thousands of cells.

Study design

In our study, we try to use CROP-seq to confirm the regulatory functionality and putative cis-target genes for a set of genomic loci of interest.

We used human induced pluripotent stem cell (iPSC)-derived neurons as a neurodevelopmental model. First, we conducted ATAC-seq on each cell line and obtained a list of allele-specific open-chromatin (ASoC) SNPs in each cell type. Next, we assessed the utility of these ASoC SNPs in inferring functional noncoding risk variants for neuropsychiatric disorders. Our enrichment analysis results suggest that among schizophrenia associated variants, the ones showing ASoC are more likely to be causal functional variants.

SNPs of interest

We thus intersected the ASoC SNPs with the schizophrenia GWAS risk SNPs, and for the top 21 ASoC SNPs, we assessed their functionality and cis-regulated genes using CROP-seq.

We focused on iPSC-derived neural progenitor cells (NPC). For CRISPRi, we designed 3 guide RNA sequences inhibiting the transcriptional activity near the target locus for each of the 21 SNP sites of interest. We also included 5 negative control gRNAs targeting green fluorescent protein-GFP.

library(ggplot2)
library(knitr)
library(kableExtra)
library(gridExtra)
options(stringsAsFactors = F)
snp.lst <- read.table('~/Downloads/ASoC/data/scRNA_seq_SNP_list.txt',sep = '\t',
                      header = F,col.names = c('rsID','gene locus'))
kable(snp.lst,caption = 'SNP list') %>% kable_styling() %>% scroll_box(width = "90%", height = "400px")
SNP list
rsID gene.locus
rs7148456 BAG5
rs12895055 BCL11B
rs7170068 CHRNA3
rs520843 GALNT10
rs12716973 KCTD13
rs2192932 KMT2E
rs17200916 LOC105376975
rs1198588 miR137
rs324017 NAB2
rs4151680 PCDHA1
rs301791 RERE
rs324015 STAT6
rs9882911 TRANK1
rs11633075 UBE2Q2P1
rs2027349 VPS45
rs186132169 NGEF
rs9661794 DPYD
rs7936858 LOC100507431
rs3861678 LINC00637
rs10933 PBRM1
rs6071578 PPP1R16B

Data quality

Single-cell gRNA profiles

1. Number of gRNAs each cell contains:

wkdir <- '~/Downloads/ASoC/singlecell/'
# load(paste0(wkdir,'data/cropseq_design.Rdata'))
load(paste0(wkdir,'data/cropseq_expression.Rd'))
nlocus_gRNA <- colSums(exp.per.gRNA>0)
qplot(nlocus_gRNA,xlab = '# of gRNA types per cell',ylab = 'count') +
  theme_classic() + theme(axis.text = element_text(size = 14),
                          axis.title = element_text(size = 14,face = 'bold'))

Out of the 4144 cells sequenced, 2142 cells contain only one type of gRNA.

2. Number of cells targeted for each of the 25 loci:

ncell_enh <- rowSums(exp.per.enhancer>0)
tmp <- c(ncell_enh,ncell_enh[c(17,13)])
ncell_enh <- tmp[-c(13,17)]
ncell_enh <- data.frame(count=ncell_enh, locus=names(ncell_enh))
ncell_enh$locus <- factor(ncell_enh$locus, levels = ncell_enh$locus)
ggplot(data = ncell_enh,aes(x=locus,y=count)) + geom_bar(stat="identity",width = 0.6) +
    theme_classic() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 11), 
          axis.text.y = element_text(size = 14), 
          axis.title.y = element_text(size = 14,face = 'bold'),
          axis.title.x = element_blank()) 

3. Number of cells uniquely targeted by each of the 76 gRNAs:

ncell_uniq <- rep(NA,nrow(exp.per.gRNA))
for (i in 1:nrow(exp.per.gRNA)){
  glocus <- row.names(exp.per.gRNA)[i]
  ncell_uniq[i]=sum(exp.per.gRNA[glocus,]>0 & nlocus_gRNA==1)
}
tmp <- sapply(strsplit(row.names(exp.per.gRNA),split = '_'), 
              function(x){paste(x[1],x[2],sep = '_')})
names(ncell_uniq) <- tmp
# print(ncell_uniq)
tmp.mtx <- matrix(ncell_uniq[1:69],nrow = 3)
tmp.name <- names(ncell_uniq)[seq(1,69,3)]
tmp.name <- sapply(strsplit(tmp.name,split = '_'), function(x){x[1]})
ncell_gRNA <- as.data.frame(t(tmp.mtx), row.names = tmp.name)
names(ncell_gRNA) <- c('gRNA 1','gRNA 2','gRNA 3')
ncell_gRNA['pos_SNAP91',] <- c(3,26,NA)
ncell_gRNA['neg_EGFP',] <- c(5,28,35)
ncell_gRNA['neg_CTRL',] <- c(17,25,NA)
kable(ncell_gRNA) %>% kable_styling() %>% scroll_box(width = "90%", height = "400px")
gRNA 1 gRNA 2 gRNA 3
BAG5 33 32 29
BCL11B 13 33 13
CHRNA3 3 23 13
GALNT10 37 36 50
KCTD13 29 32 14
KMT2E 11 19 18
LOC105376975 34 47 6
miR137 44 30 54
NAB2 36 24 25
PCDHA123 53 63 61
RERE 45 43 23
SETD1A 22 27 4
STAT6 24 38 53
TRANK1 3 18 64
UBE2Q2P1 11 8 11
VPS45 13 78 9
NGEF 11 2 0
DPYD 37 68 28
LOC100507431 15 28 48
LINC00637 42 31 51
PBRM1 4 12 4
PPP1R16B 44 31 35
TCF4-ITF2 48 44 11
pos_SNAP91 3 26 NA
neg_EGFP 5 28 35
neg_CTRL 17 25 NA

Single-cell transcriptome profiles

Dimension of the gene x cell UMI count matrix:

20435 4144

1. Distribution of cell sequencing depth / library size:

lib.size <- colSums(gene.exp)
lib.size <- data.frame(lib = lib.size)
ggplot(lib.size, aes(x=lib)) + geom_histogram(bins = 40,color="lightcyan3", fill="lightcyan4") + 
  xlab(label = 'total UMI count per cell') + ylab(label = 'count') +
  theme_classic() + theme(axis.text = element_text(size = 14),
                          axis.title = element_text(size = 14,face = 'bold'))

2. Distribution of gene detection rate:

gene.stats <- data.frame(detection = rowMeans(gene.exp>0),
                         mean_exp = rowMeans(gene.exp),
                         var_exp = apply(gene.exp, 1, var))
ggplot(gene.stats,aes(x=detection)) + geom_histogram(color="lightcyan3", fill="lightcyan4") +
    xlab(label = '% of cells a gene is detected in') + 
    theme_classic() + theme(axis.text = element_text(size = 14),
                          axis.title = element_text(size = 14,face = 'bold'))

Library size/sequencing depth of cells ~ gene expression??

3. Library size ~ genes detected

lib.size$n_genes <- colSums(gene.exp>0)
ggplot(lib.size, aes(log10(lib), n_genes)) + geom_point(alpha=0.3, shape=16) + 
  geom_density_2d(size = 0.3) + xlab(label = 'log10(total UMI count)') + 
  ylab(label = '# of genes detected') + theme_bw() + theme(axis.text = element_text(size = 14),
                          axis.title = element_text(size = 14,face = 'bold'))

4. Gene expression mean ~ variance:

ggplot(gene.stats, aes(log10(mean_exp), log10(var_exp))) + 
  geom_point(alpha=0.3, shape=16) + geom_density_2d(size = 0.3) +
  geom_abline(intercept = 0, slope = 1, color='red') + 
  xlab(label = 'log10(mean of expression)') + 
  ylab(label = 'log10(variance of expression)') + theme_bw() +
  theme(axis.text = element_text(size = 14),axis.title = element_text(size = 14,face = 'bold'))

We can see that up to a mean UMI count of ~1, the mean-variance follows the red 1:1 line, i.e. variance and mean are roughly equal (as expected under a Poisson model). However, genes with a higher average UMI count show overdispersion compared to Poisson.

5. Gene mean expression ~ detection rate:

x = seq(from = -3, to = 2, length.out = 1000)
poisson_model <- data.frame(log_mean = x, detection_rate = 1 - dpois(0, lambda = 10^x))
ggplot(gene.stats,aes(x=log10(mean_exp),y=detection)) + geom_point(alpha=0.3, shape=16) +
  geom_line(data=poisson_model, aes(x=log_mean, y=detection_rate), color='red') +
    xlab(label = 'log10(mean expression)') + 
    ylab(label = 'detection rate') +
    theme_bw() + theme(axis.text = element_text(size = 14),
                          axis.title = element_text(size = 14,face = 'bold'))

Red curve: the expected detection rate under Poisson model.
We see a lower than expected detection rate in the medium expression range. However, for the highly expressed genes, the rate is very close to 1.0, suggesting that there is no zero-inflation in the counts for those genes and that zero-inflation is a result of overdispersion, rather than an independent systematic bias.