from IPython.display import IFrame
Understanding the Outputs
The output location of `evSeq` is controlled with the `--output` optional argument (see here). If `--output` is not set, then `evSeq` saves to the current working directory (command line) or to the same location as the `folder` argument (GUI). If the save location has not previously been used, then `evSeq` creates a folder titled `evSeqOutput` in the output location, which contains a folder named for the date-time of the run initialization (in `yyyymmdd-hhmmss` format). If the save location has been used previously, then `evSeq` adds another date-time folder within the previously generated `evSeqOutput` folder. All `evSeq` outputs of a specific run are contained in the associated date-time folder. The sections below detail the folders found within the date-time folder.
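The folder layout described above can be sketched in a few lines of Python. This is only an illustration of the naming scheme, not `evSeq`'s own code: the function names `run_output_dir` and `latest_run_dir` and the `results` path are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def run_output_dir(output_root: Path, started: datetime) -> Path:
    """Build <output_root>/evSeqOutput/<yyyymmdd-hhmmss> for a run start time."""
    stamp = started.strftime("%Y%m%d-%H%M%S")
    return output_root / "evSeqOutput" / stamp


def latest_run_dir(output_root: Path):
    """Return the most recent date-time folder under evSeqOutput, or None."""
    parent = output_root / "evSeqOutput"
    if not parent.is_dir():
        return None
    # yyyymmdd-hhmmss names sort chronologically when sorted as text.
    runs = sorted(p for p in parent.iterdir() if p.is_dir())
    return runs[-1] if runs else None


# Hypothetical output location and start time, for illustration only.
example = run_output_dir(Path("results"), datetime(2021, 7, 4, 13, 5, 9))
print(example.as_posix())  # results/evSeqOutput/20210704-130509
```

Because the date-time format puts the largest unit first, a plain text sort is enough to find the newest run in a location that has been reused.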
# Good example
IFrame('assets/qualplot_good.html', width=810, height=320)
The example above comes from a good run. As a heuristic, you typically want most reads to have Q-scores above 30 in both the forward and reverse directions (though the reverse reads are generally a bit worse). Checking this file is critical, as it gives you insight into how confident you can be in your sequencing results. An example of a bad quality score histogram (specifically for the reverse reads) is below:
# Bad example
IFrame('assets/qualplot_bad.html', width=810, height=320)
Note that most of the reverse reads have Q-scores below 30. If you have a histogram like this, it's highly likely that something went wrong at some stage of `evSeq` library prep/sequencing. See the troubleshooting page for more details.
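If you want a number to go with the "most reads above 30" heuristic, a rough check might look like the sketch below. It assumes standard Phred+33 FASTQ quality strings and summarizes each read by its mean Q-score; whether the histogram uses the same per-read summary is an assumption, and the quality strings here are made up.

```python
def mean_q(quality: str, offset: int = 33) -> float:
    """Mean Phred score of one read, decoded from its FASTQ quality string."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def fraction_above(qualities, threshold: float = 30.0) -> float:
    """Fraction of reads whose mean Q-score is strictly above the threshold."""
    scores = [mean_q(q) for q in qualities if q]
    if not scores:
        return 0.0
    return sum(s > threshold for s in scores) / len(scores)


# Made-up quality strings: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
forward = ["IIII", "IIII", "III5"]
reverse = ["5555", "III5", "####"]

print(fraction_above(forward))  # all three forward reads are above Q30
print(fraction_above(reverse))  # only one of three reverse reads is
```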
`OutputCounts`
The `OutputCounts` folder contains most of the tabular information needed for downstream processing after `evSeq` is run. For each run, 8 files are generated and stored within the `OutputCounts` folder. The files all follow the general format `[AminoAcids/Bases]_[Decoupled/Coupled]_[All/Max].csv` and contain information on all variants identified in the run. Any `AminoAcids` file contains information on the mutant amino acids identified, while a `Bases` file contains information on the mutant bases identified. Note that, while the amino acid outputs have been thoroughly validated, the `Bases` outputs should be considered experimental. They are generated using almost-identical code to the amino acid outputs, so while we expect that they are fine, without complete validation it is impossible to know with absolute certainty. All this to say: if something looks odd with those files, report it.
`Decoupled` files are the result of counting bases independent of reads (i.e., they do not capture information about how frequently two mutations occur together when considering paired-end sequencing), while `Coupled` files contain the results of counting bases considering paired reads. `All` files contain information on all non-parent variants identified regardless of frequency, while `Max` files contain information only on the single most frequent non-parent variant found in each well. For the purpose of constructing sequence-function pairs, the most useful files are `AminoAcids_Decoupled_Max` and `AminoAcids_Coupled_Max`. As necessary, the other files (e.g., `AminoAcids_Decoupled_All`) can provide information on mixed populations or other imperfections.
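The 8 file names follow directly from the three two-way choices in the naming format. The snippet below simply enumerates them; the variable names are arbitrary.

```python
from itertools import product

levels = ("AminoAcids", "Bases")     # what is counted
counting = ("Decoupled", "Coupled")  # whether paired reads are considered together
scope = ("All", "Max")               # every non-parent variant, or only the most frequent

output_count_files = [
    f"{level}_{mode}_{which}.csv"
    for level, mode, which in product(levels, counting, scope)
]

for name in output_count_files:
    print(name)
```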
Each `OutputCounts` file holds a table with the following information:
Header | Information Contained |
---|---|
`IndexPlate` | The index plate used (e.g., `DI01`) |
`Plate` | User-specified plate name |
`Well` | Source plate/index plate well |
`AlignmentFrequency` | The fraction of reads corresponding to the combination or individual mutant, depending on the specific file |
`WellSeqDepth` | The total number of reads in a well that passed QC |
`Flag` | Contains any non-standard information about the variant. A particularly useful flag is `Unexpected Variation`, which is returned for any variant/mutant identified that was not expected according to the provided reference sequence OR in cases where a mixed well is possible. |
In addition to the above information, the `Coupled` files contain the columns below:
Header | Information Contained |
---|---|
`VariantCombo` | The identity of any variant identified. Each variant is given in the format `[original character][position in sequence][new character]`, and variants are separated by underscores. |
`SimpleCombo` | The same information as `VariantCombo`, but only the new character is given. This is a useful shorthand when mutation sites are known. |
`VariantsFound` | The number of variants identified in the given combination. |
`VariantSequence` | The `VariableRegion` sequence for the well updated to reflect the identified variant. |
The `Decoupled` files instead contain the columns below:
Header | Information Contained |
---|---|
`[Aa/Bp]Position` | The position where a variant amino acid or base was found. |
`[Aa/Bp]` | The identity of the variant amino acid or base found. |
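For downstream work it can help to break a `VariantCombo` string back into its parts. The sketch below follows the `[original character][position in sequence][new character]` format with underscore separators described above; the `Mutation` type, the function names, and the example string `A12G_T45S` are hypothetical. Special values such as `#PARENT#` and `#DEAD#` do not match the pattern and raise a `ValueError`, so they should be handled before parsing.

```python
import re
from typing import List, NamedTuple


class Mutation(NamedTuple):
    original: str
    position: int
    new: str


# One token: non-digit character(s), then digits, then non-digit character(s).
_TOKEN = re.compile(r"^([^\d_]+)(\d+)([^\d_]+)$")


def parse_variant_combo(combo: str) -> List[Mutation]:
    """Split an underscore-separated VariantCombo string into Mutation records."""
    mutations = []
    for token in combo.split("_"):
        match = _TOKEN.match(token)
        if match is None:
            raise ValueError(f"Unrecognized variant token: {token!r}")
        original, position, new = match.groups()
        mutations.append(Mutation(original, int(position), new))
    return mutations


def new_characters(mutations: List[Mutation]) -> List[str]:
    """Only the new characters, i.e. the information SimpleCombo keeps."""
    return [m.new for m in mutations]


parsed = parse_variant_combo("A12G_T45S")  # made-up example value
print(parsed)
print(new_characters(parsed))
```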
Note that `evSeq` handles identified parent and dead wells differently from others. Some notes on these "special" outputs:
- When a parent well is identified (i.e., a well with no variation compared to the reference sequence), the returned values for a number of columns will be `#PARENT#`. Note the flanking use of "#" to highlight that this is not an amino acid sequence. The returned `AlignmentFrequency` will be given as the average frequency across all positions sequenced, and the returned `WellSeqDepth` will be the average count over all positions aligned to the reference.
- A "dead" well is one that has fewer unpaired usable reads passing QC than that given by `variable_count` (in the case of decoupled results), fewer paired usable reads passing QC than that given by `variable_count`, or at least one position for which no counts were observed. If not enough paired reads are present pre-QC, then the number of paired reads identified is returned for `WellSeqDepth`; if not enough reads are present after QC, then the number of reads remaining after QC is returned. The `WellSeqDepth` should be less than `variable_count` in all cases except when one position has no reads (we have never actually seen this single-zero-position situation arise except during stress testing of the software; it is included as a filter to handle as many eventualities as possible). For dead wells, the `AlignmentFrequency` and `VariantsFound` columns will be given as `0`. Any sequence-related output will be reported as `#DEAD#`, where "#" is again used to avoid confusion with an amino acid sequence.
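When reading these tables programmatically, the two special values are easy to separate from real variants. The sketch below uses only the standard library and a tiny made-up table; it assumes `VariantCombo` is one of the columns that carries `#PARENT#` and `#DEAD#`, which you should confirm against your own files. `classify_well` is a hypothetical helper.

```python
import csv
import io

PARENT = "#PARENT#"
DEAD = "#DEAD#"


def classify_well(row: dict, column: str = "VariantCombo") -> str:
    """Label a table row as 'parent', 'dead', or 'variant' from one column."""
    value = row.get(column, "")
    if value == PARENT:
        return "parent"
    if value == DEAD:
        return "dead"
    return "variant"


# Made-up rows for illustration; real files contain more columns.
sample = """Well,VariantCombo,AlignmentFrequency,WellSeqDepth
A01,#PARENT#,0.98,412
A02,A12G,0.91,388
A03,#DEAD#,0,4
"""

rows = list(csv.DictReader(io.StringIO(sample)))
labels = {row["Well"]: classify_well(row) for row in rows}
print(labels)
```

Filtering on the label first means any later numeric or sequence handling never sees a sentinel string.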
`Platemaps`
For each plate passed in via the `refseq` file, an interactive platemap plot will be generated. These platemaps are stored in an html file found in the `Platemaps` folder, which can be opened with any browser to render the following:
# Interactive Platemap
IFrame('assets/Platemaps.html', width=1010, height=810)
As can be seen from the plot above, these are interactive and contain toggles to choose between plates (when multiple plates are being analyzed by `evSeq`) and show extra information about each well when hovering over it.
The text within each well is the combination of amino acids (in 5' -> 3' order, as passed in via the `refseq` file) with the highest alignment frequency for that well. The fill color of the well is the log sequencing depth, while the well border color is the alignment frequency of the well. Note that the border color is binned rather than existing on a continuous scale. Also note that, because the position information is not given, the output csv files in the previous section should be used for downstream processing; these images are simply a nice way to quickly analyze your data.
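To make the two color encodings concrete, here is a small sketch of how a well's values could be turned into a continuous fill value and a binned border category. The log base (10), the bin edges, and the bin labels are all assumptions chosen for illustration; the platemap's actual bins are not listed on this page.

```python
import bisect
import math
from typing import Optional

# Hypothetical bin edges and labels for alignment frequency.
BIN_EDGES = (0.5, 0.8, 0.9)
BIN_LABELS = ("<0.5", "0.5-0.8", "0.8-0.9", ">=0.9")


def fill_value(well_seq_depth: int) -> Optional[float]:
    """Continuous value for the fill color: log10 of depth (None if no reads)."""
    if well_seq_depth <= 0:
        return None
    return math.log10(well_seq_depth)


def border_bin(alignment_frequency: float) -> str:
    """Discrete category for the border color."""
    return BIN_LABELS[bisect.bisect_right(BIN_EDGES, alignment_frequency)]


print(fill_value(1000), border_bin(0.95))
print(fill_value(12), border_bin(0.42))
```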
`evSeqLog` files
`evSeq` keeps a log of every run. A single log is output for each `evSeq` run as `RunSpecificLog.txt` in that run's `evSeqOutput` folder. However, a continuous log is also stored within the local `evSeq` install repository and can be found here: `evSeqLog.log`. Information captured by the log file includes:
- The start time of the `evSeq` run, given as `yyyymmdd-hhmmss` followed by a series of underscores. This is the first line of each log block.
- The values of all parameters input to `evSeq`. Note that if parameters are unspecified, the log records the default parameters.
- Information on files used for processing, including
  - The forward and reverse read file pairs identified in the `folder` argument
  - Any files within `folder` that were not matched.
- Any warnings encountered during the run. These warnings will also be printed to the console during the run.
- Fatal errors. If the program completed successfully, the last line in the log entry will read "Run completed. Log may contain warnings."
The amount of information stored in the log file is small (bytes per run), but it will build up with continued use of `evSeq`. If the file gets too large (this will take a long time...), you can delete `evSeqLog.log`; on the next run a fresh `evSeqLog.log` file will be created.
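Since each log block starts with a date-time line and a successful run ends with a fixed message, the continuous log can be scanned for runs that did not finish. The sketch below is built only on those two facts and assumes the underscores sit on the same line as the date-time. The sample text is a placeholder rather than real `evSeq` log output, and the function names are hypothetical.

```python
import re

# First line of a block: yyyymmdd-hhmmss followed by underscores (assumed same line).
BLOCK_START = re.compile(r"^\d{8}-\d{6}_+\s*$")
SUCCESS_LINE = "Run completed. Log may contain warnings."


def split_log_blocks(text: str):
    """Group log lines into blocks, each starting at a date-time line."""
    blocks = []
    for line in text.splitlines():
        if BLOCK_START.match(line):
            blocks.append([line])
        elif blocks:
            blocks[-1].append(line)
    return blocks


def run_succeeded(block) -> bool:
    """True if the last non-empty line of the block is the success message."""
    non_empty = [line.strip() for line in block if line.strip()]
    return bool(non_empty) and non_empty[-1].endswith(SUCCESS_LINE)


# Placeholder text: only the first and last lines follow the documented format.
sample_log = """20210704-130509__________
<parameters, files, warnings>
Run completed. Log may contain warnings.
20210705-091500__________
<parameters, files, warnings>
<fatal error message>
"""

for block in split_log_blocks(sample_log):
    start = block[0].rstrip("_ ")
    print(start, "ok" if run_succeeded(block) else "did not complete")
```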
`ParsedFilteredFastqs`
Optional; requires the `--keep_parsed_fastqs` or `--only_parse_fastqs` flag to be passed.
For each well identified, fastq files containing all forward and reverse reads that passed initial sequencing QC (i.e., their average Q-score is above `average_q_cutoff` and the length of the read is greater than `length_filter`) are generated. For all sequences returned, barcodes and adapter sequences are stripped from the returned reads, meaning that they represent only the sequencing region that covered the amplicon. These files can be used for further downstream processing by software other than `evSeq`. Note that only paired reads are returned (i.e., if one partner in a forward-reverse pair failed initial QC, neither is returned in these fastq files).
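The pairing rule can be illustrated with a short sketch: each read is checked against both thresholds, and a pair is kept only if both partners pass. This is not `evSeq`'s implementation. It assumes Phred+33 quality strings and treats `length_filter` as a minimum length in bases, which is an assumption of this sketch; the reads and threshold values are made up.

```python
def passes_initial_qc(sequence: str, quality: str,
                      average_q_cutoff: float, length_filter: float,
                      offset: int = 33) -> bool:
    """True if the read is long enough and its mean Q-score is high enough."""
    if not quality or len(sequence) <= length_filter:
        return False
    mean_q = sum(ord(ch) - offset for ch in quality) / len(quality)
    return mean_q > average_q_cutoff


def keep_pair(forward, reverse, average_q_cutoff: float, length_filter: float) -> bool:
    """A pair is kept only when BOTH reads pass; one failure drops both."""
    return (passes_initial_qc(*forward, average_q_cutoff, length_filter)
            and passes_initial_qc(*reverse, average_q_cutoff, length_filter))


# Made-up (sequence, quality) tuples: 'I' encodes Q40 and '#' encodes Q2.
good = ("ACGTACGT", "IIIIIIII")
poor = ("ACGTACGT", "########")

print(keep_pair(good, good, average_q_cutoff=25, length_filter=5))
print(keep_pair(good, poor, average_q_cutoff=25, length_filter=5))
```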
`Alignments`
Optional; requires the `--return_alignments` argument to be passed.
For each well in the run, a text file is generated containing every alignment of sequences that passed initial QC. Alignments for sequences that did not pass QC (either because their average Q-score was below `average_q_cutoff` or the length of the read fell below `length_filter`) are not included.
The alignment file is ordered in blocks of paired forward and reverse reads. Each block begins with "Alignment #:", followed by a forward alignment and/or a reverse alignment. Note that if a sequence did not pass QC, its alignment is not included in the block; if both sequences in a pair did not pass QC, then no alignments are reported.
Note that the presence of an alignment in these files does not mean that it was used for analysis: sequences that pass initial QC will not necessarily pass alignment QC. In particular, any returned sequence that shows an insertion or deletion is automatically discarded and not used for analysis. The alignment files can be used to identify sequences that likely have insertions or deletions present.
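As a rough idea of what checking for insertions or deletions might involve, the sketch below looks for gap characters in a pair of aligned strings. It is not a parser for the alignment files, whose exact layout is not described here; it assumes you have already extracted the aligned reference row and read row as equal-length strings that use `-` for gaps. `has_indel` is a hypothetical helper.

```python
def has_indel(aligned_reference: str, aligned_read: str, gap: str = "-") -> bool:
    """True if either row has a gap inside the region the read covers.

    Gaps in the read before its first base or after its last base are treated
    as uncovered reference, not as deletions.
    """
    if len(aligned_reference) != len(aligned_read):
        raise ValueError("Aligned rows must have the same length")
    covered = [i for i, ch in enumerate(aligned_read) if ch != gap]
    if not covered:
        return False
    start, end = covered[0], covered[-1] + 1
    return gap in aligned_reference[start:end] or gap in aligned_read[start:end]


# Made-up aligned rows for illustration.
print(has_indel("ACGTACGT", "--GTAC--"))    # read covers the middle only: no indel
print(has_indel("ACGTACGT", "ACG-ACGT"))    # gap in the read: deletion
print(has_indel("ACG-TACGT", "ACGGTACGT"))  # gap in the reference: insertion
```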
Next page: Using `evSeq` data.
Back to the main page.