After MARSeG is started, it asks for the following information:
Figure 1: MARSeG input command line dialogue in Windows
Figure 2: MARSeG input command line dialogue in Ubuntu
”Name of motif list (motifs.txt): ”
Asks for the name of a file containing the motifs to avoid when generating sequences. If no input is given, the default ”motifs.txt” is used. The file should be a plain text (.txt) file located in the current directory and contain the DNA sequences of the motifs, separated by a single white space (see motifs.txt for an example). Choose the motifs and their number with care: the more motifs are excluded, the higher the homology of the generated sequences. A larger number of motifs also increases the probability that the program reaches a point where it cannot remove a given motif from the degenerated template sequence, which causes the program to fail. See Background & Reference for details.
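The expected file layout can be illustrated with a short Python sketch. It is not MARSeG's own code: the function name parse_motifs is hypothetical, and both upper-casing the motifs and accepting any run of whitespace as a separator are assumptions made for this illustration.

```python
def parse_motifs(text):
    """Split the content of a motif list into individual motifs.

    Assumptions (not taken from MARSeG itself): motifs are upper-cased,
    and any run of whitespace is accepted as a separator.
    """
    return [motif.upper() for motif in text.split()]


def read_motif_file(path="motifs.txt"):
    """Read a motif list from a plain text file in the current directory."""
    with open(path, encoding="ascii") as handle:
        return parse_motifs(handle.read())
```

For example, a file containing the two motifs GAATTC and GGATCC separated by a space would yield a list with exactly those two entries.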
”Pattern for main sequence”
Asks for a template for the generated, motif-free sequences. The template needs to contain degenerated nucleotides (see table 1 below for a list of all accepted degenerated nucleotides) and must not contain any of the motifs provided in the motif list. Defined bases are allowed as parts of the template as long as they do not constitute a direct occurrence of one of the motifs. Repetitions of a single character can be written in a short notation: N(150) is handled as a sequence of 150 N’s, and AGGCGV(5)S(3)N(6) as AGGCGVVVVVSSSNNNNNN. This input is case insensitive.
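The short notation can be illustrated with a minimal Python sketch. The function name expand_pattern is hypothetical, and the behaviour on malformed input (raising an error) is an assumption; how MARSeG itself parses the pattern may differ.

```python
import re

# One letter, optionally followed by a repeat count in parentheses.
_TOKEN = re.compile(r"([A-Za-z])(?:\((\d+)\))?")


def expand_pattern(pattern):
    """Expand the short notation, e.g. 'V(5)' becomes 'VVVVV'.

    Input is treated as case insensitive and returned in upper case.
    Raising ValueError on unexpected characters is an assumption.
    """
    pattern = pattern.strip().upper()
    parts = []
    position = 0
    for match in _TOKEN.finditer(pattern):
        if match.start() != position:
            raise ValueError(f"unexpected character at position {position}")
        letter, count = match.group(1), match.group(2)
        parts.append(letter * (int(count) if count else 1))
        position = match.end()
    if position != len(pattern):
        raise ValueError(f"unexpected character at position {position}")
    return "".join(parts)
```

With this sketch, expand_pattern("AGGCGV(5)S(3)N(6)") returns AGGCGVVVVVSSSNNNNNN, and expand_pattern("n(150)") returns a string of 150 N’s.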
”Prefix sequence” & ”Suffix sequence”
Asks for two sequences that are added before (prefix) and after (suffix) the main sequence. Unlike the main sequence, they may contain motifs that are on the motif list. This input is case insensitive.
Figure 3: Structure of MARSeG generated sequences
MARSeG creates a subfolder of the current directory called ”output”, in which it places two output files. Both file names contain the date and time of their creation so that different MARSeG runs can be distinguished. In every run, MARSeG creates 150 degenerated sequences and generates 100 defined sequences from each of them. The defined sequences are used to calculate the GC content and a homology value for every degenerated sequence (through pairwise comparison of the defined sequences). Furthermore, MARSeG calculates a ”possibilities per base” value by multiplying the numbers of possible outcomes of each letter in the DNA sequence and dividing the product by the sequence length.
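The following Python sketch illustrates how a defined sequence, the GC content and the ”possibilities per base” value could be derived from a degenerated sequence. It is not MARSeG's implementation: the function names are hypothetical, the standard IUPAC nucleotide codes are assumed to match the accepted letters, bases are assumed to be drawn uniformly at random, and the formula follows the description above literally. The homology calculation is left out because its exact method is not described here.

```python
import math
import random

# Standard IUPAC codes (assumed to match the letters MARSeG accepts).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def define_sequence(degenerated, rng=random):
    """Pick one concrete base for every degenerated letter (uniform choice assumed)."""
    return "".join(rng.choice(IUPAC[base]) for base in degenerated.upper())


def gc_content(sequence):
    """Fraction of G and C bases in a defined sequence."""
    sequence = sequence.upper()
    return sum(base in "GC" for base in sequence) / len(sequence)


def possibilities_per_base(degenerated):
    """Product of the outcomes per letter, divided by the sequence length.

    Very long sequences can exceed the floating point range.
    """
    product = math.prod(len(IUPAC[base]) for base in degenerated.upper())
    return product / len(degenerated)
```

For the degenerated sequence ANS, the letters allow 1, 4 and 2 outcomes, so this reading of the formula gives (1 × 4 × 2) / 3 ≈ 2.67.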
The file suggested_sequence_*.txt contains the sequence with the lowest homology value. The defined sequences generated from this degenerated sequence show the highest diversity, which makes it the best choice for most approaches. In addition to the sequence itself, this file contains information about the input that was used and the date of creation (figure 4).
If more than one sequence is needed, or if homology should not be the deciding factor in choosing a sequence, all_sequences_*.txt contains all the degenerated sequences that were generated by MARSeG. These sequences are written in FASTA format, and the description line of each contains the sequence’s homology, GC content and possibilities per base values.
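For further processing, the FASTA records can be read with a few lines of Python. This is a generic sketch with a hypothetical function name; since the exact layout of the description line is not specified here, it is returned unchanged as a string rather than split into its values.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            # A new record starts; store the previous one first.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is not None:
            chunks.append(line)  # sequences may span several lines
    if description is not None:
        records.append((description, "".join(chunks)))
    return records
```

The text passed to parse_fasta would be the content of an all_sequences_*.txt file from the ”output” folder.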
Examples of both files are provided in the folder examples: all_sequences_example.txt and suggested_sequence_example.txt.
Figure 4: Content of the output file suggested_sequence_*.txt