CRISPR-Seq Workflow Documentation¶

Sequencing the predicted cut sites of CRISPR/Cas9 experiments is an effective way to validate that the CRISPR/Cas9 system is creating loss-of-function (LOF) mutations. Abundant LOF allele fractions indicate sufficient Cas9 activity, guide efficiency, and tolerance to LOF mutation. Sequence analysis of predicted cut sites also facilitates studying the complex population dynamics of CRISPR/Cas9-edited cells under positive or negative selection.

The CRISPR-Seq analysis workflow takes single-end targeted sequencing reads that span predicted CRISPR/Cas9 cut sites as input and outputs an analysis of LOF allele fractions along with detailed indel descriptions. The CRISPR-Seq algorithm is more accurate than traditional indel callers at detecting large indels (>20bp) because it uses the cut sites predicted from the gRNA sequences, information that is unique to CRISPR/Cas9 experiments. Convenient options for running the analysis pipeline exist for both computational and laboratory scientists.

Why use CRISPR-Seq?¶

- High accuracy
  - Improved detection of large indels (>20bp) using predicted cut sites
- Run with FireCloud
  - Easy to use web interface for experimentalists
  - Cheap (2 GB FASTQ file costs approximately $0.40 for computation and $0.33 per month for storage)
  - Billing is managed by Google Cloud services
- Simple inputs
  - single-end reads in FASTQ form
  - barcode annotation (multiplex only)
  - gRNA annotation
  - negative controls
- Comprehensive output
  - aligned bam files per sample
  - characterization of all indels that overlap a predicted cutsite
  - quantification of indel reads versus total reads for each sample/target pair
  - statistical significance of indel allele fractions
  - QC of indel size detection accuracy per target
  - indel distribution plots per target
  - sunburst plots to investigate population dynamics
- Open source
  - Docker image with all source code
  - Option to run workflow with Snakemake

Algorithm Description¶

The algorithm is fine-tuned for detecting indels in 300nt single-end reads where the predicted gRNA binding site is near the center of the read.

The alignment is a two-step process. First, a basic Smith-Waterman alignment identifies wild-type reads and small indels. Second, reads with large indels are identified by searching for reads containing fragments that map with high quality to the reference either before or after the 50bp region around the predicted cut site.
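
The second step can be pictured with a toy sketch. This is not the CRISPR-Seq implementation: exact string matching stands in for high-quality alignment, and the function name, the 20 nt anchor length, and the minimum reported size are assumptions made purely for illustration.

```python
import random


def infer_large_deletion(read, ref, cut, window=50, anchor=20, min_size=20):
    """Toy stand-in for step two: return the deletion length, or None.

    The first and last `anchor` bases of the read must match the reference
    exactly, entirely before and entirely after the `window`-bp region
    centred on `cut` (a 0-based index into `ref`).
    """
    left_limit = cut - window // 2    # the prefix anchor must end before this index
    right_limit = cut + window // 2   # the suffix anchor must start at or after it
    prefix, suffix = read[:anchor], read[-anchor:]
    p = ref.find(prefix, 0, left_limit)
    s = ref.find(suffix, right_limit)
    if p == -1 or s == -1:
        return None
    # Reference bases spanned by the two anchors, minus the read length,
    # is the number of reference bases missing from the read.
    deleted = (s + anchor) - p - len(read)
    return deleted if deleted > min_size else None


# Synthetic 300 nt reference with the predicted cut site at its centre.
rng = random.Random(0)
ref = "".join(rng.choice("ACGT") for _ in range(300))
cut = 150

wild_type = ref[40:260]
large_del = ref[40:130] + ref[170:260]   # 40 bp removed around the cut site

print(infer_large_deletion(wild_type, ref, cut))
print(infer_large_deletion(large_del, ref, cut))
```

A real implementation would tolerate sequencing errors in the anchors and would also have to handle large insertions, which this sketch ignores.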

Running CRISPR-Seq¶

FireCloud is the recommended method of running CRISPR-Seq for all users who don’t need to modify the workflow. For users who need to tweak the execution of tasks or who prefer a different computation environment, a Docker image is available with all source code.

FireCloud Workflow¶

FireCloud is a cloud-based genomics analysis platform developed at the Broad Institute. Billing is handled by the Google Cloud Platform, which makes FireCloud accessible to users external to the Broad Institute.

Register a FireCloud Account¶

First, register for FireCloud using the FireCloud portal. This step only requires a Google account, such as Gmail.

Second, create a Google Billing Account and attach it to the FireCloud account following these instructions. Google offers promotional credit for new accounts so new users can try CRISPR-Seq for free.

Create a Workspace¶

From the FireCloud portal, create a new workspace by clicking on the Create New Workspace button in the upper right corner of the main page. Workspace names within the same billing project must be unique, so if you share the billing account with others, you might want to include your username in the workspace name, e.g. CRISPRseq_<username>.

After creating the workspace, the summary tab will be displayed. In order to run the CRISPR-Seq analysis, the method definition must be imported into the workspace. Move to the Method Configurations tab and select Import Configuration....

Enter “crisprseq” into the search field and select Configs Only. Select the most recent snapshot of cpds/crisprseq and import it with the default namespace.

When returning to FireCloud at a later time, use the filter on the main page to search for the workspace name and return to the summary tab.

Using Google Bucket¶

From the summary page of the FireCloud workspace, a link to the Google bucket associated with the workspace is displayed on the right side of the page. User files required for the analysis must be uploaded to this bucket.

Clicking on the bucket ID on the FireCloud workspace summary opens the Google Cloud Platform web interface where files can be uploaded directly.

Upload FASTQ Files¶

If the FASTQ files are on a local machine, it might be easiest to upload the files directly to the bucket using the web interface. However, if the files are not local, the gsutil tool from the Google Cloud SDK (https://cloud.google.com/sdk/) can copy files directly to the bucket from the command line:
gsutil cp *.fastq gs://<bucket>
Broad Institute users have the .google-cloud-sdk DOTKIT available for use on shared Broad servers that includes the gsutil function. One could write a simple script to transfer all fastq files in a directory to a Google bucket:
#!/bin/bash

#$ -cwd
#$ -q long
#$ -N googlebucket
#$ -l m_mem_free=2g

source /broad/software/scripts/useuse
reuse .google-cloud-sdk

gsutil cp *.fastq gs://<bucket-ID>/
Assuming the script is saved as bucket.sh, Broad users could then submit to UGER to execute:
$ use UGER
$ qsub bucket.sh

Upload Multiplex Barcodes¶

The barcode annotation is a two-column, comma-separated text file with the sample name in the first column and the barcode in the second column. Sample names must be unique and contain only alphanumeric characters, underscores, and hyphens. Here is an example of a barcode annotation table with 10 samples:

BC_1   AAGAACTA
BC_2   AACTTGTA
BC_3   CCAGTGAT
BC_4   TTGATGCG
BC_5   GGTCGTGC
BC_6   GGAGTGTA
BC_7   TTAGACCG
BC_8   CCGAACAT
BC_9   GGTCCACG
BC_10  GGCTCAAT

Save the barcode annotation table as a .csv and upload it to the workspace Google Bucket.
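
Before uploading, the rules above (two columns, unique sample names, restricted character set) can be checked locally. The following is a minimal sketch and not part of CRISPR-Seq; the function name is hypothetical, and the check that barcodes contain only A, C, G, and T is an added assumption.

```python
import csv
import io
import re

# Alphanumeric characters, underscore, and hyphen only.
NAME_RE = re.compile(r"[A-Za-z0-9_-]+")


def check_barcode_annotation(text):
    """Return a list of problems found in barcode annotation CSV text."""
    problems = []
    seen = set()
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 columns, found {len(row)}")
            continue
        name, barcode = row[0].strip(), row[1].strip()
        if not NAME_RE.fullmatch(name):
            problems.append(f"line {line_no}: invalid sample name {name!r}")
        if name in seen:
            problems.append(f"line {line_no}: duplicate sample name {name!r}")
        seen.add(name)
        # Assumption: barcodes are plain DNA sequences.
        if not re.fullmatch(r"[ACGT]+", barcode):
            problems.append(f"line {line_no}: invalid barcode {barcode!r}")
    return problems


good = "BC_1,AAGAACTA\nBC_2,AACTTGTA\n"
bad = "BC_1,AAGAACTA\nBC 1,AACTTGTA\nBC_1,CCAGTGAT\n"
good_problems = check_barcode_annotation(good)
bad_problems = check_barcode_annotation(bad)
print(good_problems)
print(bad_problems)
```

An empty list means the text passed every check; in the second example the space in a sample name and the repeated BC_1 are both reported.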

Upload gRNA Annotation¶

The gRNAs used should be listed in a comma-separated text file (.csv) with four columns: gene, strand, cut, and amplicon. Below is an example table for an experiment targeting four genes with one guide per gene. The column definitions are as follows:

gene: Any unique gene symbol identifier. If the same gene is targeted with multiple guides, say STAG2 is targeted with two gRNAs, the names should be something like STAG2_1 and STAG2_2.

strand: Indicates whether the gene is on the forward or reverse strand using + or - respectively.

cut: The single-base location of the predicted cut site between the gRNA and the PAM, in hg19 coordinates.

amplicon: Range from start to end of sequencing amplicon using hg19 coordinates.

gene   strand  cut           amplicon
Gene1  +       10:112341797  10:112341673-112341888
Gene2  -       4:106155180   4:106155115-106155320
Gene3  -       20:30956834   20:30956741-30956945
Gene4  -       17:29422368   17:29422233-29422455

Save the gRNA annotation table as a .csv and upload it to the workspace Google Bucket.
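
The column definitions above lend themselves to a quick local sanity check. The sketch below is illustrative only: the function name is hypothetical, it assumes the CSV starts with the header row gene,strand,cut,amplicon, and the requirement that the cut site fall inside the amplicon on the same chromosome is an assumption rather than a documented rule.

```python
import csv
import io
import re

CUT_RE = re.compile(r"(\w+):(\d+)")             # e.g. 10:112341797
AMPLICON_RE = re.compile(r"(\w+):(\d+)-(\d+)")  # e.g. 10:112341673-112341888


def check_grna_annotation(text):
    """Return a list of problems found in gRNA annotation CSV text."""
    problems = []
    genes = set()
    for row in csv.DictReader(io.StringIO(text)):
        gene = (row.get("gene") or "").strip()
        if gene in genes:
            problems.append(f"{gene}: gene identifier is not unique")
        genes.add(gene)
        if (row.get("strand") or "").strip() not in ("+", "-"):
            problems.append(f"{gene}: strand must be + or -")
        cut = CUT_RE.fullmatch((row.get("cut") or "").strip())
        amp = AMPLICON_RE.fullmatch((row.get("amplicon") or "").strip())
        if not cut or not amp:
            problems.append(f"{gene}: malformed cut or amplicon coordinate")
            continue
        same_chrom = cut.group(1) == amp.group(1)
        inside = int(amp.group(2)) <= int(cut.group(2)) <= int(amp.group(3))
        if not (same_chrom and inside):
            problems.append(f"{gene}: cut site lies outside the amplicon")
    return problems


example = (
    "gene,strand,cut,amplicon\n"
    "Gene1,+,10:112341797,10:112341673-112341888\n"
    "Gene2,-,4:106155180,4:106155115-106155320\n"
)
broken = (
    "gene,strand,cut,amplicon\n"
    "Gene1,*,10:112341797,10:112341673-112341888\n"
    "Gene2,-,4:106155000,4:106155115-106155320\n"
)
example_problems = check_grna_annotation(example)
broken_problems = check_grna_annotation(broken)
print(example_problems)
print(broken_problems)
```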

Upload negative control annotation¶

Negative controls can be annotated in two forms: a list of samples, or a sample-by-gene matrix. If each negative control sample is a negative control for all target genes, the negative control samples can be listed one per line:

sample

Sample1

Sample2

Sample3

Sample4

If negative control samples only serve as negative controls for particular gene targets, a binary sample by gene matrix can be used to indicate which sample/gene pairs are negative controls.

         Gene1  Gene2  Gene3  Gene4
Sample1  1      1      0      0
Sample2  1      1      0      0
Sample3  0      0      1      1
Sample4  0      0      1      1
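
One way to see how the two forms relate is to expand both into explicit sample/gene pairs. This is a hedged sketch and not CRISPR-Seq code: the function name is hypothetical, and it assumes both forms are saved as CSV, that the list form has a single header cell, and that the matrix form leaves the first header cell for the row labels.

```python
import csv
import io


def control_pairs(text, genes):
    """Return the set of (sample, gene) negative control pairs."""
    rows = [r for r in csv.reader(io.StringIO(text)) if r]
    header, body = rows[0], rows[1:]
    if len(header) == 1:
        # List form: every listed sample is a control for every target gene.
        return {(r[0], g) for r in body for g in genes}
    # Matrix form: the first header cell belongs to the row-label column.
    matrix_genes = header[1:]
    return {
        (r[0], g)
        for r in body
        for g, flag in zip(matrix_genes, r[1:])
        if flag.strip() == "1"
    }


list_form = "sample\nSample1\nSample2\n"
matrix_form = ",Gene1,Gene2,Gene3,Gene4\nSample1,1,1,0,0\nSample3,0,0,1,1\n"

list_pairs = control_pairs(list_form, ["Gene1", "Gene2"])
matrix_pairs = control_pairs(matrix_form, ["Gene1", "Gene2", "Gene3", "Gene4"])
print(sorted(list_pairs))
print(sorted(matrix_pairs))
```

The list form is equivalent to a matrix filled entirely with 1s for the listed samples.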

Add Data Entity to Workspace¶

The final required configuration file is simply a list of files that were uploaded to the workspace’s Google bucket. This includes the fastq and annotation files (barcode, gRNA, and negative control). Create a table with a single row and the following column headers:

entity:participant_id barcodes_list barcodes_fastq reads_fastq gRNAs controls ref_idxs ref_fasta

USER_VARIABLE USER_FILE USER_FILE USER_FILE USER_FILE USER_FILE gs://seq-references/ensembl/hg19/seq/hg19_files.txt gs://seq-references/ensembl/hg19/seq/Homo_sapiens_assembly19.fasta

The fields marked with USER are specific to the experiment. The ref_idxs and ref_fasta fields are provided and constant for all experiments using the hg19 reference.

entity:participant_id: Unique experiment ID to differentiate workflow results within the workspace

barcodes_list: Link to the multiplex barcode annotation CSV file.

barcodes_fastq: Link to fastq file containing read barcodes.

reads_fastq: Link to fastq file containing reads.

gRNAs: Link to gRNA annotation CSV file.

controls: Link to negative control annotation CSV file.

ref_idxs: gs://seq-references/ensembl/hg19/seq/hg19_files.txt

ref_fasta: gs://seq-references/ensembl/hg19/seq/Homo_sapiens_assembly19.fasta

The Google bucket format for links to files is gs://bucketID/filename, where the bucketID is listed on the workspace summary page and the filename is user defined.

For example, given a barcode annotation named AU6R0_barcodes.csv and a workspace bucket with the ID fc-ae7d8f79-257d-4763-9128-27edfc148e42, the link would be gs://fc-ae7d8f79-257d-4763-9128-27edfc148e42/AU6R0_barcodes.csv.

Once the table is complete, save it as a tab-delimited text file and import it as a Data entity into the workspace using the Data tab within the workspace.
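
The table can also be generated programmatically. The sketch below builds the header and the single row as tab-delimited text; the column names, reference paths, bucket ID, and barcode file name come from this page, while the function name, the participant ID, and the remaining file names are hypothetical placeholders.

```python
import csv
import io

COLUMNS = [
    "entity:participant_id", "barcodes_list", "barcodes_fastq",
    "reads_fastq", "gRNAs", "controls", "ref_idxs", "ref_fasta",
]
REF_IDXS = "gs://seq-references/ensembl/hg19/seq/hg19_files.txt"
REF_FASTA = "gs://seq-references/ensembl/hg19/seq/Homo_sapiens_assembly19.fasta"


def build_entity_table(participant_id, bucket_id, files):
    """Return the tab-delimited table; `files` maps user columns to file names."""
    row = {
        "entity:participant_id": participant_id,
        "ref_idxs": REF_IDXS,
        "ref_fasta": REF_FASTA,
    }
    for column, filename in files.items():
        row[column] = f"gs://{bucket_id}/{filename}"
    missing = [c for c in COLUMNS if c not in row]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerow(row)  # raises ValueError on unknown column names
    return out.getvalue()


table = build_entity_table(
    "AU6R0",  # hypothetical experiment ID
    "fc-ae7d8f79-257d-4763-9128-27edfc148e42",
    {
        "barcodes_list": "AU6R0_barcodes.csv",
        "barcodes_fastq": "AU6R0_barcodes.fastq",  # hypothetical file name
        "reads_fastq": "AU6R0_reads.fastq",        # hypothetical file name
        "gRNAs": "AU6R0_gRNAs.csv",                # hypothetical file name
        "controls": "AU6R0_controls.csv",          # hypothetical file name
    },
)
print(table)
```

Writing the returned text to a file gives a table ready to import through the Data tab.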

Launch Analysis¶

To run the workflow, navigate to the Method Configurations tab of the workspace and select the crisprseq method.

From the method configuration view, select Launch Analysis....

Select the Data entity to run the workflow on, and launch the analysis.

Monitor Analysis¶

Refresh the Monitor tab of the workspace to make sure the analysis is running. A typical 2GB FASTQ file takes less than 2 hours to analyze. If an analysis of that size runs for more than 4 hours, it is recommended to abort it to avoid excess billing.

View Results¶

When the analysis is finished, new columns will be added to the Data Entity. Clicking on a link in the table will take you to the Google Bucket with the output files. Descriptions of the outputs can be found here (ref).

Citing¶

Tothova Z, Krill-Burger JM, Popova KD, Landers CC, Sievers QL, Yudovich D, Belizaire R, Aster JC, Morgan EA, Tsherniak A, Ebert BL. Multiplex CRISPR/Cas9-Based Genome Editing in Human Hematopoietic Stem Cells Models Clonal Hematopoiesis and Myeloid Neoplasia. Cell Stem Cell. 2017. 21(4): 547-555. PMID:28985529

Help¶

Please contact mburger@broadinstitute.org with any questions.