Extract gene-count matrices from plate-based SMART-Seq2 data
Run SMART-Seq2 Workflow
Follow the steps below to extract gene-count matrices from SMART-Seq2 data on Terra. This WDL aligns reads using Bowtie 2 and estimates expression levels using RSEM.
Copy your sequencing output to your workspace bucket using gsutil in your Unix terminal.
You can find your bucket URL in the dashboard tab of your Terra workspace, under the information panel.
Note: Broad users need to be on a UGER node (not a login node) in order to use the -m flag.
Request a UGER node:
reuse UGER
qrsh -q interactive -l h_vmem=4g -pe smp 8 -binding linear:8 -P regevlab
The above command requests an interactive node with 8 threads and 4G of memory per thread. Feel free to change the memory, thread, and project parameters.
Once you’re connected to a UGER node, you can make gsutil available by running:
reuse Google-Cloud-SDK
Use gsutil cp [OPTION]... src_url dst_url to copy data to your workspace bucket. For example, the following command copies the directory at /foo/bar/nextseq/Data/VK18WBC6Z4 to a Google bucket:
gsutil -m cp -r /foo/bar/nextseq/Data/VK18WBC6Z4 gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4
-m means copy in parallel; -r means copy the directory recursively.
Create a sample sheet.
Please note that the columns in the CSV can be in any order, but that the column names must match the recognized headings.
The sample sheet provides metadata for each cell:
Column | Description |
---|---|
Cell | Cell name. |
Plate | Plate name. Cells with the same plate name are from the same plate. |
Read1 | Location of the FASTQ file for read 1 in the cloud (gsurl). |
Read2 | Location of the FASTQ file for read 2 in the cloud (gsurl). |
Example:
Cell,Plate,Read1,Read2
cell-1,plate-1,gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-1_L001_R1_001.fastq.gz,gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-1_L001_R2_001.fastq.gz
cell-2,plate-1,gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-2_L001_R1_001.fastq.gz,gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-2_L001_R2_001.fastq.gz
cell-3,plate-2,gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-3_L001_R1_001.fastq.gz,gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-3_L001_R2_001.fastq.gz
cell-4,plate-2,gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-4_L001_R1_001.fastq.gz,gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-4_L001_R2_001.fastq.gz
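For plates with many cells, writing the sample sheet by hand is error-prone. The following is a minimal Python sketch of one way to generate it. The cell-to-plate mapping and the bucket prefix are made-up illustration values, and the FASTQ naming pattern simply mirrors the example above; adjust all three to match your own data.

```python
import csv
import io

# Hypothetical values for illustration only; replace with your own.
BUCKET = "gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2"
CELLS = {
    "cell-1": "plate-1",
    "cell-2": "plate-1",
    "cell-3": "plate-2",
    "cell-4": "plate-2",
}


def build_sample_sheet(cells, bucket):
    """Return the sample sheet as CSV text, one row per cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["Cell", "Plate", "Read1", "Read2"],
        lineterminator="\n",
    )
    writer.writeheader()
    for cell, plate in cells.items():
        writer.writerow({
            "Cell": cell,
            "Plate": plate,
            # Assumed file naming pattern, copied from the example sheet.
            "Read1": f"{bucket}/{cell}_L001_R1_001.fastq.gz",
            "Read2": f"{bucket}/{cell}_L001_R2_001.fastq.gz",
        })
    return buffer.getvalue()


sheet = build_sample_sheet(CELLS, BUCKET)
print(sheet)
```

Writing the returned text to a file such as sample_sheet.csv gives a sheet that can then be uploaded to the workspace bucket.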
Upload your sample sheet to the workspace bucket.
Example:
gsutil cp /foo/bar/projects/sample_sheet.csv gs://fc-e0000000-0000-0000-0000-000000000000/
Import the smartseq2 workflow to your workspace.
See the Terra documentation for adding a workflow. The smartseq2 workflow is in the Broad Methods Repository under the name “cumulus/smartseq2”.
On the workflow page, click the Export to Workspace... button and select the workspace to which you want to export the smartseq2 workflow from the drop-down menu.
In your workspace, open smartseq2 in the WORKFLOWS tab. Select Process single workflow from files and click the SAVE button.
Inputs:
Please see the description of inputs below. Note that required inputs are shown in bold.
Name | Description | Example | Default |
---|---|---|---|
**input_csv_file** | Sample sheet (contains Cell, Plate, Read1, Read2) | “gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv” | |
**output_directory** | Output directory | “gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2_output” | |
**reference** | Reference transcriptome to align reads to. Accepts a reference name or the Google bucket URL of a custom reference | “GRCh38”, or “gs://fc-e0000000-0000-0000-0000-000000000000/rsem_ref.tar.gz” | |
smartseq2_version | SMART-Seq2 version to use. Versions available: 1.0.0. | “1.0.0” | “1.0.0” |
docker_registry | Docker registry to use | “cumulusprod/” | “cumulusprod/” |
zones | Google Cloud zones | “us-east1-d us-west1-a us-west1-b” | “us-east1-d us-west1-a us-west1-b” |
num_cpu | Number of CPUs to request for one node | 4 | 4 |
memory | Memory size string | “3.60G” | “3.60G” |
disk_space | Disk space in GB | 10 | 10 |
preemptible | Number of preemptible tries | 2 | 2 |
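If you prefer to prepare the inputs as a JSON file rather than typing them into the form, the sketch below shows one way to assemble them in Python. It assumes the usual WDL convention of prefixing each input with the workflow name, and it assumes that name is smartseq2; check the key names against the workflow you imported before relying on them.

```python
import json

# Assumption: the WDL workflow is named "smartseq2", so inputs are keyed
# as "smartseq2.<input name>". Verify this against the imported workflow.
WORKFLOW = "smartseq2"


def make_inputs(input_csv_file, output_directory, reference, **optional):
    """Build an inputs dictionary from the required and optional inputs."""
    inputs = {
        f"{WORKFLOW}.input_csv_file": input_csv_file,
        f"{WORKFLOW}.output_directory": output_directory,
        f"{WORKFLOW}.reference": reference,
    }
    # Optional inputs override the defaults listed in the table above.
    for name, value in optional.items():
        inputs[f"{WORKFLOW}.{name}"] = value
    return inputs


inputs = make_inputs(
    "gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv",
    "gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2_output",
    "GRCh38",
    num_cpu=4,
    memory="3.60G",
)
print(json.dumps(inputs, indent=2))
```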
Outputs:
See the table below for important outputs.
Name | Type | Description |
---|---|---|
output_count_matrix | Array[String] | A list of Google bucket URLs containing gene-count matrices, one per plate. Each gene-count matrix file has the suffix .dge.txt.gz. |
This WDL generates one gene-count matrix per SMART-Seq2 plate. The gene-count matrix uses Drop-Seq format:
- The first line starts with "Gene" and then gives cell barcodes separated by tabs.
- Starting from the second line, each line describes one gene. The first item in the line is the gene name and the remaining items are the TPM-normalized count values of this gene for each cell.
The gene-count matrices can be fed directly into cumulus for downstream analysis.
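To inspect a matrix outside of cumulus, the format described above can be read with the Python standard library. This is a minimal sketch, not the loader cumulus itself uses; the function name read_dge and the tiny in-memory matrix are made up for illustration.

```python
import csv
import gzip
import io


def read_dge(handle):
    """Parse a tab-separated gene-count matrix from an open text handle.

    Returns (cells, counts): cells is the list of cell barcodes from the
    header line, and counts maps each gene name to a list of floats in
    the same order as cells.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    cells = header[1:]  # header[0] is the "Gene" label
    counts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        counts[row[0]] = [float(value) for value in row[1:]]
    return cells, counts


# Made-up matrix, gzip-compressed in memory to mimic a .dge.txt.gz file.
example = "Gene\tcell-1\tcell-2\nGENE_A\t10.5\t0\nGENE_B\t3\t7.25\n"
compressed = gzip.compress(example.encode("utf-8"))

with gzip.open(io.BytesIO(compressed), mode="rt", newline="") as handle:
    cells, counts = read_dge(handle)

print(cells)
print(counts["GENE_A"])
```

For a real file, pass its local path to gzip.open in place of the in-memory buffer.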
TPM-normalized counts are calculated as follows:
- Estimate the gene expression levels in TPM using RSEM.
- Suppose c reads are obtained for one cell. Then calculate the TPM-normalized count for gene i as TPM_i / 1e6 * c.
TPM-normalized counts reflect both the relative expression levels and the cell sequencing depth.
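As a worked illustration of the formula TPM_i / 1e6 * c, the sketch below scales the TPM values of one cell by that cell's read count. The gene names, TPM values, and read count are made up; in the workflow the TPM values come from RSEM.

```python
def tpm_normalized_counts(tpm, total_reads):
    """Scale per-gene TPM values to TPM-normalized counts for one cell.

    tpm maps gene name -> TPM; total_reads is c, the number of reads
    obtained for the cell.
    """
    return {gene: value / 1e6 * total_reads for gene, value in tpm.items()}


# Made-up TPM values for a two-gene cell; TPM values sum to one million.
tpm = {"GENE_A": 250000.0, "GENE_B": 750000.0}
normalized = tpm_normalized_counts(tpm, total_reads=200000)
print(normalized)
```

Because TPM values sum to one million within a cell, the TPM-normalized counts of a cell sum to c, which is how sequencing depth is carried into the matrix.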
Custom Genome
We also provide a way to generate custom genome references for the SMART-Seq2 workflow.
Import the smartseq2_create_reference workflow to your workspace.
See the Terra documentation for adding a workflow. The smartseq2_create_reference workflow is in the Broad Methods Repository under the name “cumulus/smartseq2_create_reference”.
On the workflow page, click the Export to Workspace... button and select the workspace to which you want to export smartseq2_create_reference from the drop-down menu.
In your workspace, open smartseq2_create_reference in the WORKFLOWS tab. Select Process single workflow from files and click the SAVE button.
Inputs:
Please see the description of inputs below. Note that required inputs are shown in bold.
Name | Description | Type or Example | Default |
---|---|---|---|
**fasta** | Genome FASTA file | File. For example, “gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.dna.primary_assembly.fa” | |
**gtf** | GTF gene annotation file (e.g. Homo_sapiens.GRCh38.83.gtf) | File. For example, “gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.83.gtf” | |
smartseq2_version | SMART-Seq2 version to use. Versions available: 1.0.0. | String | “1.0.0” |
docker_registry | Docker registry to use | String | “cumulusprod/” |
zones | Google cloud zones | String | “us-east1-b us-east1-c us-east1-d” |
cpu | Number of CPUs | Integer | 8 |
memory | Memory size string | String | “7.2G” |
extra_disk_space | Extra disk space in GB | Integer | 15 |
preemptible | Number of preemptible tries | Integer | 2 |
Outputs:
Name | Type | Description |
---|---|---|
reference | File | The generated custom genome reference. Its default file name is rsem_ref.tar.gz. |