Running evSeq

Post Installation

Running example data

To confirm that evSeq has been installed and works on your machine, we have provided test datasets that work with the example refseqs. You have three options for running this:

- via Jupyter Notebook

A demo notebook can be found in examples/8-full_demo.ipynb which effectively runs evSeq via the command line. This is also rendered as documentation here. The end of the notebook uses the outputs of this run and compares them to the expected output given the provided example data.

- via Command Line

General info on running evSeq from the command line can be found below.

First, you will need to activate your conda environment (if using it) with

conda activate evSeq

To run the example files, first navigate to the examples folder of the evSeq repository:

cd path/to/evSeq/examples

From here, run evSeq as follows:

evSeq refseqs/DefaultRefSeq.csv ../data/multisite_runs

This should start the run and create an evSeqOutput folder in the current working directory (evSeq/examples) with a timestamped results folder. Once the run has finished, confirm it has done so without errors (they will be sent to the standard output and the log file) and compare the results to the expected results in evSeq/data/multisite_runs/evSeqOutput/expected. You can find example code for comparing these in the demo documentation in the above section, using the function compare_to_expected from evSeq.util.

- via GUI

General info on running evSeq from the GUI can be found below.

First, double click the evSeq shortcut on the Desktop (if you have not moved it to another location). This will open the GUI; it may take a minute, especially the first time opening it.

In the GUI, click on the refseq selector and navigate to the examples folder (path/to/evSeq/examples) and select DefaultRefSeq.csv.

For folder, navigate to the data folder (path/to/evSeq/data) and select multisite_runs.

Then click Start.

This should start the run and create a new timestamped results folder in the evSeqOutput directory already present in the multisite_runs folder (the same location as the folder argument). Once the run has finished, confirm that it completed without errors (any errors are sent to the GUI console and the log file) and compare the results to the expected results in evSeq/data/multisite_runs/evSeqOutput/expected. You can find code for comparing these in the demo documentation in the above section, or do so visually by comparing the Platemaps.html files.

Troubleshooting

For common problems encountered when using evSeq, please reference Troubleshooting.

Known Limitations

evSeq expects no insertions or deletions relative to the reference sequence provided. Indeed, any read with a detected insertion or deletion is automatically discarded during QC. This works well for speeding up analysis of returned reads, but can lead to problems if (1) you expect insertions and deletions or (2) the best-scoring alignment of a read to the reference is one that opens a gap. There are currently no workarounds for problem 1. Problem 2 can be addressed by tuning the alignment parameters.

Alignment parameters are given as optional arguments (see here); the parameter gap_open_penalty can be raised further to decrease the probability of problem 2 (i.e., score alignments such that those with gaps are scored poorly). Note that we have stress-tested the code with default alignment parameters against ~40,000 random DNA sequences with random mutation positions and found <10 instances of problem 2 (<0.025% of instances; most cases occurred when multiple mutations were placed next to (or near) one another at the end of the evSeq reads). The default evSeq alignment parameters are thus highly robust, but there are situations where they might need to be tuned.

We strongly recommend that users evaluate the alignments returned by evSeq to make sure there are no unexpected insertions or deletions. Poor alignments will result in false sequencing negatives. This can be particularly problematic if there are multiple variants in a well and not all of them are recognized by the alignment (i.e., evSeq fails to recognize that there is a mixed well). As noted, such a situation would be exceedingly rare, but is worth being aware of.

In many cases, alignment issues can be easily detected by reviewing both the decoupled and coupled files. To tell if the aligner has included insertions or deletions, (1) look for mutations present in the decoupled file that are not found in the coupled file and (2) look for “#DEAD#” wells that have more reads than the variable_count argument.
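
To make the effect of gap_open_penalty concrete, the sketch below scores two hand-written candidate alignments of the same hypothetical read. It is not evSeq's aligner: the function score_alignment, the toy sequences, and the convention that the first position of a gap costs gap_open_penalty while each further position costs gap_extension_penalty are all assumptions made for illustration. The default values are the ones listed under Optional Arguments.

```python
def score_alignment(ref_aln, read_aln, match_score=1, mismatch_penalty=0,
                    gap_open_penalty=3, gap_extension_penalty=1):
    """Score two pre-aligned, equal-length strings ('-' marks a gap).

    Assumed convention (illustrative only): the first position of a gap
    costs gap_open_penalty, each additional position costs
    gap_extension_penalty.
    """
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have equal length")
    score = 0
    prev_gap = None  # which sequence carried a gap in the previous column
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == "-" or read_base == "-":
            gap_in = "ref" if ref_base == "-" else "read"
            if prev_gap == gap_in:
                score -= gap_extension_penalty
            else:
                score -= gap_open_penalty
            prev_gap = gap_in
        else:
            score += match_score if ref_base == read_base else -mismatch_penalty
            prev_gap = None
    return score


# Two ways to align the same toy read against the same toy reference.
ungapped = ("ACGTACGTAC", "ACGTTACGTA")   # 4 matches, 6 mismatches
gapped = ("ACGT-ACGTAC", "ACGTTACGTA-")   # 9 matches, 2 gap openings

print(score_alignment(*ungapped))                    # 4
print(score_alignment(*gapped))                      # 3 (ungapped wins)
print(score_alignment(*gapped, gap_open_penalty=2))  # 5 (gapped would win)
```

With the default penalty of 3 the ungapped alignment scores higher (4 versus 3), whereas a penalty of 2 would let the gapped alignment win (5 versus 4). Raising gap_open_penalty widens the margin in favor of gap-free alignments.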

Using evSeq from the command line or GUI

Command line

Thanks to setuptools entry_points, evSeq can be accessed from the command line after installation as if it were added to PATH by running:

evSeq refseq folder --OPTIONAL_ARGS ARG_VALUE --FLAGS

where refseq is the .csv file containing information about the experiment described above, and folder is the directory that contains the raw .fastq files (.gz or unzipped) for the experiment.

For information on optional arguments and flags, run

evSeq -h

or see below.

Note: You must be in the environment in which evSeq was installed or this will not be accessible. If you installed the evSeq environment, run

conda activate evSeq

to activate it before running.
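
If you want to launch runs from a script rather than typing the command by hand, the command line above can be assembled programmatically. The following is a minimal sketch: build_evseq_command is a hypothetical helper (not part of evSeq), and it assumes that optional arguments are spelled --name value and flags --name, as suggested by the usage pattern above. Confirm the exact spellings with evSeq -h.

```python
import shlex


def build_evseq_command(refseq, folder, **options):
    """Assemble an evSeq command line as a list of arguments.

    Hypothetical helper: option names are assumed to match the names in
    the Optional Arguments table, prefixed with '--'.
    """
    cmd = ["evSeq", str(refseq), str(folder)]
    for name, value in options.items():
        if value is True:
            # Flags take no value.
            cmd.append(f"--{name}")
        elif value is not False and value is not None:
            cmd.extend([f"--{name}", str(value)])
    return cmd


cmd = build_evseq_command(
    "refseqs/DefaultRefSeq.csv",
    "../data/multisite_runs",
    gap_open_penalty=4,
    return_alignments=True,
)
print(shlex.join(cmd))
# evSeq refseqs/DefaultRefSeq.csv ../data/multisite_runs --gap_open_penalty 4 --return_alignments

# To actually run it (from the environment in which evSeq is installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.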

GUI

Upon installation, evSeq automatically installs a shortcut onto your Desktop that will launch the evSeq GUI with a double-click. If evSeq was installed in the evSeq environment, the GUI will always run from that environment without you needing to activate it.

The GUI is designed for use by non-programming experts. If you are comfortable with a command line interface, that is the recommended way to use evSeq. If using the GUI, make sure you check the log file after each run for any warnings or errors encountered. See details on the log file here.

Once opened, you should see a window that looks like this:

[Figure: screenshot of the evSeq GUI interface]

You will see two required arguments, refseq and folder, at the top of the GUI. Details on the refseq argument are given below, and the GUI should provide a description of what the folder contains. For more advanced use, other arguments can be accessed by scrolling down (these additional arguments are detailed in Optional Arguments). You will typically not need these arguments, however, and the standard evSeq run can be started by clicking Start once refseq and folder are populated. Once started, the progress of the program will be printed to the GUI along with any encountered warnings and errors.

Required Arguments

The refseq file

The primary required user inputs are contained in the refseq file, which tells the evSeq software how to process each well. From the information contained in this file, evSeq will construct reference sequences for each plate (or well, if using a Detailed refseq file) and analyze the NGS data accordingly.

Default refseq

An example Default refseq format is given in the evSeq GitHub repository here.

This form of the file assumes the same reference sequence in each well of the analyzed plates and requires eight columns: PlateName, IndexPlate, FPrimer, RPrimer, VariableRegion, FrameDistance, BpIndStart, and AaIndStart. These columns are detailed below:

| Column | Type | Description |
|---|---|---|
| PlateName | str | This is a nickname given to the plate. For instance, if you performed evSeq on a plate that you called “Plate1”, you would put “Plate1” in this column. |
| IndexPlate | DI0X, X=[1,8] | This is the evSeq index plate used for library preparation corresponding to the plate in PlateName. For instance, if “Plate1” were prepared using index plate 2, IndexPlate would be DI02. Allowed barcode names are DI01 through DI08, as given in the index map file. |
| FPrimer | str, DNA Sequence | This is the forward inner primer you used to create the amplicon for attaching evSeq barcodes, including the evSeq adapter regions. It should be input exactly as ordered from your oligo supplier, 5’ - 3’. |
| RPrimer | str, DNA Sequence | This is the reverse inner primer you used to create the amplicon for attaching evSeq barcodes, including the evSeq adapter regions. It should be input exactly as ordered from your oligo supplier, 5’ - 3’. Do not use the reverse complement. |
| VariableRegion | str, DNA Sequence | This is the entire region between the 3’ ends of each primer used to generate the evSeq amplicon, 5’ - 3’. This is called the “variable region” as it is the sequenced region that can vary meaningfully, since the primers should be invariant. If a specific codon has been specifically mutagenized, this codon may be replaced by “NNN”. See below for more details. |
| FrameDistance | int, [0,1,2] | Distance (in base pairs) from the 3’ end of FPrimer to the first in-frame codon in your VariableRegion. This is required for accurate translation of sequences. For a given evSeq run, 2 in 3 times the sequence used will be out of reading frame with the full amplicon, so this is important to check. As an example, if the 3’ end of your FPrimer ends on the last base of a codon, your VariableRegion is in-frame and this argument should be 0. If FPrimer ends on the second base of a codon (i.e., is shifted back 1 bp), then your first in-frame base is 1 base away and this argument should be 1. Note that if “NNN” is used in the VariableRegion, evSeq will double check that you correctly defined this argument; with no “NNN” it will assume you are correct. |
| BpIndStart | int, [0,inf] | This argument tells the program what index the first base in the variable region belongs to. This is useful for formatting the outputs, as any variation identified in evSeq can be output at the index corresponding to the full gene, rather than the amplicon. |
| AaIndStart | int, [0,inf] | This argument tells the program what index the first in-frame amino acid in the variable region belongs to. This means that if your FrameDistance argument is not 0, AaIndStart should not be the position of the codon your variable region starts in but rather the next one, since the first codon is not in frame. |

Example Sequence Construction

   -------FPrimer------>
5'-AAAAAAAAAAGGGGGGGGGGG-3'
             |||||||||||--------VariableRegion------->
5'-CCCCCCCCCCGGGGGGGGGGGTNNNTTTTTTTT...TTTTTTTTTTTTTTTGGGGGGGGGGGGCCCCCCCCCC-3'
   FrameDistance = 1 -> x                             ||||||||||||
                                                   3'-CCCCCCCCCCCCAAAAAAAAAA-5'
                                                      <-------RPrimer-------

In this simple example, the FPrimer sequence is AAAAAAAAAAGGGGGGGGGGG and the RPrimer sequence is AAAAAAAAAACCCCCCCCCCCC, with the A regions corresponding to the evSeq adapter regions and the G/C regions corresponding to the directional sequence-specific regions. The VariableRegion is the area between them, TNNNTTTTTTTT...TTTTTTTTTTTTTTT. The NNN sequence specifies the correct reading frame, and the 3’ base of FPrimer is 1 base away from this in-frame codon (indicated with the x), therefore FrameDistance is 1. If the VariableRegion started at base 35 in the gene, BpIndStart would be 35 and AaIndStart would correspond to the position of the NNN codon, which would be position 12 (the first base of this codon is base 36, and 36 / 3 = 12 when bases and codons are both counted from 0).

(Note that if any of your sequences look anything like this example, your library preparation will not work. This is to be interpreted as a simple example only.)
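
The arithmetic in this example can be checked with a few lines of Python. This is a sketch under two stated assumptions that are not confirmed by evSeq itself: bases and codons are both counted from 0, and the gene's reading frame starts at base 0. The function name frame_and_aa_start is hypothetical.

```python
def frame_and_aa_start(bp_ind_start):
    """Derive FrameDistance and AaIndStart from BpIndStart.

    Assumes zero-based indexing and a reading frame that starts at base 0
    of the gene (illustrative assumptions, not evSeq code).
    """
    # Number of bases to skip to reach the next codon boundary.
    frame_distance = (-bp_ind_start) % 3
    # Index of the codon that begins at that boundary.
    aa_ind_start = (bp_ind_start + frame_distance) // 3
    return frame_distance, aa_ind_start


print(frame_and_aa_start(35))  # (1, 12), matching the example above
print(frame_and_aa_start(36))  # (0, 12), variable region already in frame
```

If your gene numbering starts at 1 instead, shift the input accordingly before using a calculation like this, and always confirm the result against your own sequence.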

VariableRegion containing “NNN”

(OPTIONAL) You may replace the bases at the known mutagenized positions with “NNN” as the codon. Doing so forces evSeq to return the sequence identified at these positions (e.g., from a site-saturation mutagenesis library), whether or not it matches the parent. If you know where your mutations will occur, this is the recommended way to use evSeq; any off-target mutations not given by “NNN” will still be identified and reported.
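
Because every “NNN” marks a codon, its position in the VariableRegion must agree with FrameDistance. The sketch below shows one way such a consistency check could be written before submitting a refseq file; it is not evSeq's own check, and the function name nnn_codons_in_frame is hypothetical.

```python
import re


def nnn_codons_in_frame(variable_region, frame_distance):
    """Return True if every 'NNN' starts on a codon boundary.

    A codon boundary is any position p in the variable region with
    (p - frame_distance) divisible by 3. With no 'NNN' present there is
    nothing to check, so the function returns True.
    """
    starts = [m.start() for m in re.finditer("NNN", variable_region.upper())]
    return all((start - frame_distance) % 3 == 0 for start in starts)


# Start of the variable region from the example above: one base, then NNN.
print(nnn_codons_in_frame("TNNNTTTTTTTT", frame_distance=1))  # True
print(nnn_codons_in_frame("TNNNTTTTTTTT", frame_distance=0))  # False
```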

evSeq Index Plates

As currently deployed, up to 8 plates (DI01 through DI08) can be input in a single evSeq run. No more than 8 rows should thus ever be filled in this form of refseq file.

Detailed refseq

An example Detailed refseq format is given in the evSeq GitHub repository here.

This form of the file allows for a different reference sequence in each well of the analyzed plates, rather than the same reference sequence in every well of a given plate. In addition to the column headers given in Default refseq, this form of the file has a required Well column, enabling specification of a different FPrimer, RPrimer, and VariableRegion for each well in the input plates. As currently deployed, up to 8 plates (DI01 through DI08) can be input in a single evSeq run, so no more than 768 rows should ever be filled in this form of refseq file.

When using this form of refseq, the detailed_refseq flag must be set (see next sections for details).
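
The column and row-count requirements described for both refseq formats can be checked before starting a run. The following is a minimal sketch, not part of evSeq: check_refseq is a hypothetical helper that only tests the constraints stated in this section (required columns, allowed IndexPlate names, FrameDistance values, and the 8-row or 768-row limits), and the example row is illustrative.

```python
import csv
import io

REQUIRED_COLUMNS = ["PlateName", "IndexPlate", "FPrimer", "RPrimer",
                    "VariableRegion", "FrameDistance", "BpIndStart",
                    "AaIndStart"]
ALLOWED_INDEX_PLATES = {f"DI0{i}" for i in range(1, 9)}  # DI01 through DI08


def check_refseq(handle, detailed=False):
    """Return a list of problems found in a refseq CSV (empty if none)."""
    reader = csv.DictReader(handle)
    required = REQUIRED_COLUMNS + (["Well"] if detailed else [])
    problems = []
    missing = [c for c in required if c not in (reader.fieldnames or [])]
    if missing:
        problems.append(f"missing columns: {missing}")
    rows = list(reader)
    max_rows = 768 if detailed else 8
    if len(rows) > max_rows:
        problems.append(f"{len(rows)} rows, expected at most {max_rows}")
    for number, row in enumerate(rows, start=1):
        if row.get("IndexPlate") not in ALLOWED_INDEX_PLATES:
            problems.append(f"row {number}: IndexPlate must be DI01 through DI08")
        if row.get("FrameDistance") not in {"0", "1", "2"}:
            problems.append(f"row {number}: FrameDistance must be 0, 1, or 2")
    return problems


example = io.StringIO(
    "PlateName,IndexPlate,FPrimer,RPrimer,VariableRegion,"
    "FrameDistance,BpIndStart,AaIndStart\n"
    "Plate1,DI02,AAAAAAAAAAGGGGGGGGGGG,AAAAAAAAAACCCCCCCCCCCC,"
    "TNNNTTTTTTTT,1,35,12\n"
)
print(check_refseq(example))  # []
```

For a file on disk, pass an open file handle, for example open("refseq.csv", newline=""), and set detailed=True when checking a Detailed refseq file.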

folder

This is the folder containing the fastq or fastq.gz files generated during next-gen sequencing. Once activated, evSeq will…

  1. Look in this folder to find all filenames containing _R1_ or _R2_.
  2. Match forward and reverse files by the name preceding the identified _R1_ or _R2_. For instance, the files CHL1_S193_L001_R1_001.fastq.gz and CHL1_S193_L001_R2_001.fastq.gz would be matched because the text preceding the _R1_ and _R2_, CHL1_S193_L001, matches for both files. The file with the _R1_ is designated the forward read file and the file with the _R2_ is designated the reverse read file.

Note that both files without a _R1_ or _R2_ in their name and files for which no matching partner is identified will be ignored; all ignored files are recorded in the log file. If multiple forward-reverse file pairs are identified, evSeq will raise an error.
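
The matching rules above can be sketched in a few lines. This is an illustration of the described behavior, not evSeq's implementation; the function name pair_fastq_files and the extra filenames in the example are hypothetical.

```python
def pair_fastq_files(filenames):
    """Match forward (_R1_) and reverse (_R2_) files by their shared prefix.

    Returns ((forward, reverse) or None, ignored_files). Raises ValueError
    if more than one forward-reverse pair is found.
    """
    forward, reverse, ignored = {}, {}, []
    for name in filenames:
        if "_R1_" in name:
            forward[name.split("_R1_", 1)[0]] = name
        elif "_R2_" in name:
            reverse[name.split("_R2_", 1)[0]] = name
        else:
            ignored.append(name)  # no _R1_ or _R2_ in the name
    shared = sorted(forward.keys() & reverse.keys())
    pairs = [(forward[prefix], reverse[prefix]) for prefix in shared]
    # Files without a matching partner are also ignored.
    ignored += [n for prefix, n in forward.items() if prefix not in reverse]
    ignored += [n for prefix, n in reverse.items() if prefix not in forward]
    if len(pairs) > 1:
        raise ValueError(f"multiple forward-reverse pairs found: {pairs}")
    return (pairs[0] if pairs else None), ignored


files = [
    "CHL1_S193_L001_R1_001.fastq.gz",
    "CHL1_S193_L001_R2_001.fastq.gz",
    "notes.txt",                       # hypothetical unrelated file
    "Other_S1_L001_R1_001.fastq.gz",   # hypothetical file with no partner
]
pair, ignored = pair_fastq_files(files)
print(pair)     # ('CHL1_S193_L001_R1_001.fastq.gz', 'CHL1_S193_L001_R2_001.fastq.gz')
print(ignored)  # ['notes.txt', 'Other_S1_L001_R1_001.fastq.gz']
```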

In special cases the forward read file can be passed in as the folder argument and the reverse read file can be passed in as the optional argument fastq_r. See the entry on fastq_r in the Optional Arguments section for more detail.

Optional Arguments

There are a number of flags and optional arguments that can be passed for evSeq, all detailed in the table below:

| Argument | Type | Description |
|---|---|---|
| Input/Output | | |
| fastq_r | Argument | This argument is only available for command line use. If a case arises where, for whatever reason, evSeq cannot auto-identify the forward and reverse read files, this option acts as a failsafe. Instead of passing the folder containing the forward and reverse files in to the folder required argument, pass in the forward read file as the folder argument and the reverse read file as this optional argument. |
| output | Argument | By default, evSeq will save to the current working directory (command line) or the evSeq Git repository folder (GUI). The default save location can be overwritten with this argument. |
| detailed_refseq | Flag | Set this flag (check the box in the GUI) when passing in a detailed reference sequence file. See Detailed refseq for more information. |
| analysis_only | Flag | Set this flag (check the box in the GUI) to only perform Q-score analysis on the input fastq files. The only output in this case will be the quality score histograms. |
| only_parse_fastqs | Flag | Set this flag to stop evSeq after generation of parsed, well-filtered fastq files. Counts, platemaps, and alignments will not be returned in this case. Used in case the well-specific fastq sequences are desired but not the entire evSeq analysis. |
| keep_parsed_fastqs | Flag | Set this flag to save parsed, well-filtered fastq files as in only_parse_fastqs but to also finish the regular evSeq run. |
| return_alignments | Flag | Set this flag to return alignments along with the evSeq output. Note that this flag is ignored if either analysis_only or only_parse_fastqs is used. |
| Read Analysis | | |
| average_q_cutoff | Argument | During initial sequencing QC, evSeq will discard any sequence with an average quality score below this value. The default value is 25. |
| bp_q_cutoff | Argument | Bases with a q-score below this value are ignored when counting the number of sequences aligned at each position. For the coupled outputs (see below), counts are only returned if all bases in the combination pass. The default value is 30. |
| length_cutoff | Argument | During initial sequencing QC, evSeq will discard any read with a total length below length_cutoff * read_length. The default value is 0.9. |
| match_score | Argument | When making an alignment, matching bases add this value to the score. The default value is 1. |
| mismatch_penalty | Argument | When making an alignment, mismatching bases subtract this value from the score. The default value is 0. |
| gap_open_penalty | Argument | When making an alignment, opening a gap subtracts this value from the score. The default value is 3. |
| gap_extension_penalty | Argument | When making an alignment, extending a gap subtracts this value from the score. The default value is 1. |
| Position Identification | | |
| variable_thresh | Argument | This argument sets the threshold that determines whether or not a position is variable. In other words, if a position contains a non-reference sequence at a fraction greater than variable_thresh, then it is a variable position. The default is 0.2. Setting this value lower makes evSeq more sensitive to variation, while setting it higher makes it less sensitive. A value of 1, for instance, would find no variable positions. |
| variable_count | Argument | This sets the count threshold for identifying “dead” wells. If a well has fewer sequences that pass QC than this value, then it is considered “dead”. The default value is 10 (meaning only wells with fewer than 10 sequences are dead). |
| Advanced | | |
| jobs | Argument | This is the number of processors used by evSeq for data processing. By default, evSeq uses one fewer processor than is available on your computer. As with all multiprocessing programs, it is typically not recommended to use all available processors unless you are okay devoting all computer resources to the task (e.g., you don’t want to be concurrently checking email, playing music, running another program, etc.). The number of jobs can be lowered to reduce the memory demands of evSeq. |
| read_length | Argument | By default, evSeq will attempt to determine the read length from the fastq files. If this process is failing (e.g., due to heavy primer-dimer contamination), the read length can be manually set using this argument. |
| fancy_progress_bar | Flag | Launches a tqdm.gui instance for intensive evSeq processes to give you a better estimate for performance/time remaining on your run. While tqdm.gui is still in experimental/alpha stages (which it will warn you about), we have not found any problems with this as of yet. |
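
To illustrate how the read analysis and position identification thresholds interact, the sketch below re-implements the stated rules for average_q_cutoff, length_cutoff, variable_thresh, and variable_count using their documented defaults. It is not evSeq's code: the function names are hypothetical, and the quality strings are assumed to use the common Phred+33 fastq encoding.

```python
def mean_phred(quality_string, offset=33):
    """Average quality score of a fastq quality string (Phred+33 assumed)."""
    return sum(ord(char) - offset for char in quality_string) / len(quality_string)


def passes_read_qc(sequence, quality_string, read_length,
                   average_q_cutoff=25, length_cutoff=0.9):
    """Apply the length and average-quality filters described above."""
    if len(sequence) < length_cutoff * read_length:
        return False  # read is too short
    return mean_phred(quality_string) >= average_q_cutoff


def is_variable_position(non_reference_count, total_count, variable_thresh=0.2):
    """A position is variable if the non-reference fraction exceeds the threshold."""
    return total_count > 0 and non_reference_count / total_count > variable_thresh


def is_dead_well(passing_reads, variable_count=10):
    """A well is 'dead' if fewer reads pass QC than variable_count."""
    return passing_reads < variable_count


print(passes_read_qc("A" * 150, "I" * 150, read_length=150))  # True  (Q40 throughout)
print(passes_read_qc("A" * 150, "5" * 150, read_length=150))  # False (Q20 < 25)
print(passes_read_qc("A" * 100, "I" * 100, read_length=150))  # False (100 < 135)
print(is_variable_position(3, 10))                            # True  (0.3 > 0.2)
print(is_dead_well(9), is_dead_well(10))                      # True False
```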
