How the MARSeG algorithm works
Starting from the (degenerated) input sequence and the prefix and suffix provided by the user, MARSeG creates n motif-free, or motif-avoiding, (degenerated) output sequences. The motifs to be avoided are also provided by the user in a motif list. The user should select the motifs, and the number of motifs, with care: the more motifs are to be avoided, the higher the homology of the generated sequences (figure 1). A larger number of motifs also increases the probability that the program reaches a point where it cannot remove a given motif from the degenerated template sequence, which causes the program to fail.
Figure 1: A higher number of motifs increases the homology of the generated sequences
For each of the n output sequences and for every motif, the program loops through all positions of the sequence and checks whether the motif could occur at that position. If so, the program reduces the number of possibilities at one position (base pair) of the sequence in such a way that the motif can no longer occur there.
Figure 2: Workflow of MARSeG
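The loop described above can be sketched as follows. This is a minimal illustration, not the MARSeG source code: it assumes that each position is stored as a 4-bit set of allowed bases with A=8, C=4, G=2, T=1 (an assumed bit order), that motifs consist of defined bases only, and the function names motif_possible and avoid_motif_list are hypothetical.

```python
import random

# Assumed bit layout (not confirmed by the source): A=1000, C=0100, G=0010, T=0001.
BITS = {"A": 8, "C": 4, "G": 2, "T": 1}


def motif_possible(seq, motif, start):
    """Return True if every base of the motif is still allowed at its position."""
    return all(seq[start + i] & BITS[base] for i, base in enumerate(motif))


def avoid_motif_list(seq, motifs, rng):
    """Return a copy of seq (list of 4-bit codes) in which no motif can occur."""
    seq = list(seq)
    for motif in motifs:
        for start in range(len(seq) - len(motif) + 1):
            if not motif_possible(seq, motif, start):
                continue
            # Only positions that keep at least one allowed base after the
            # motif base is removed may be changed.
            candidates = [
                i for i, base in enumerate(motif)
                if seq[start + i] & ~BITS[base]
            ]
            if not candidates:
                raise ValueError(f"cannot remove motif {motif} at position {start}")
            i = rng.choice(candidates)           # random choice of the position
            seq[start + i] &= ~BITS[motif[i]]    # drop the motif base there
    return seq


if __name__ == "__main__":
    template = [15] * 8                          # eight fully degenerated positions (N)
    print(avoid_motif_list(template, ["GATC", "AAT"], random.Random(0)))
```

Because possibilities are only ever removed in this sketch, a window that has been cleared of a motif cannot become a match again, so a single pass per motif is enough. The ValueError corresponds to the failure case mentioned earlier, in which a motif cannot be removed from the template.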
Internally, nucleotides, nucleobases or positions are represented using a 4-bit coding format, in which each bit represents the possibility of one of the "real" nucleotides adenine, cytosine, guanine and thymine appearing at this position (table 1). Representing nucleotides as sets of possible defined nucleotides makes it easy to calculate new nucleotides with bit operations, e.g. 15 NAND 1 = 14 (i.e. N NAND T = V, or 1111 NAND 0001 = 1110 when the result is restricted to four bits).
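The bit arithmetic can be checked with a few lines of Python. The sketch below assumes the bit order A=1000, C=0100, G=0010, T=0001, which is inferred from T = 0001 and V = 1110 and may differ from table 1. The names IUPAC, SYMBOL, nand4 and remove are hypothetical helpers, not part of MARSeG.

```python
# Assumed bit order: A=1000, C=0100, G=0010, T=0001 (inferred, not confirmed).
IUPAC = {
    "A": 0b1000, "C": 0b0100, "G": 0b0010, "T": 0b0001,
    "M": 0b1100, "R": 0b1010, "W": 0b1001,
    "S": 0b0110, "Y": 0b0101, "K": 0b0011,
    "V": 0b1110, "H": 0b1101, "D": 0b1011, "B": 0b0111,
    "N": 0b1111,
}
SYMBOL = {code: symbol for symbol, code in IUPAC.items()}


def nand4(x, y):
    """NAND restricted to four bits."""
    return ~(x & y) & 0b1111


def remove(symbol, base):
    """Remove the possibilities of `base` from `symbol`; returns the new symbol."""
    return SYMBOL[IUPAC[symbol] & ~IUPAC[base]]


if __name__ == "__main__":
    print(nand4(IUPAC["N"], IUPAC["T"]))   # N NAND T as a number
    print(remove("N", "T"))                # the same operation as a symbol
    print(remove("R", "A"))                # removal from a partly degenerated base
```

Note that NAND gives the expected result only when the starting position is N. For a position that is already restricted, such as R (A or G), clearing the bit with AND NOT, as in remove, is the general form of the operation; this is a design choice of the sketch rather than a statement about the MARSeG code.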
Because the decision which of the bases in the sequence is changed is largely random (except for bases that cannot be changed for the given nucleotide of the motif), the generated sequences differ from one another and need further evaluation. After the motif-avoiding, randomized sequences have been generated, each one is evaluated by calculating the mean homology of the defined sequences that can be created from the degenerated sequence. The defined sequences are generated at random, and the program creates 100 of them per MARSeG sequence. These 100 sequences are then compared pairwise for homology, and the resulting values are stored.
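The evaluation step can be illustrated with the sketch below. The text does not specify how homology is computed, so the fraction of identical positions between two equally long sequences is used here as a stand-in; this measure, the bit order A=8, C=4, G=2, T=1, and the function names sample_defined, identity and mean_homology are all assumptions made for illustration.

```python
import itertools
import random

# Assumed bit order: A=8, C=4, G=2, T=1.
BASE_BITS = (("A", 8), ("C", 4), ("G", 2), ("T", 1))


def sample_defined(seq, rng):
    """Draw one defined sequence by picking an allowed base at every position."""
    return "".join(
        rng.choice([base for base, bit in BASE_BITS if code & bit])
        for code in seq
    )


def identity(a, b):
    """Stand-in homology measure: fraction of positions with identical bases."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_homology(seq, rng, samples=100):
    """Mean pairwise identity of `samples` random defined sequences."""
    defined = [sample_defined(seq, rng) for _ in range(samples)]
    scores = [identity(a, b) for a, b in itertools.combinations(defined, 2)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    rng = random.Random(0)
    degenerated = [15, 14, 7, 15, 10, 5, 15, 15]   # example codes, e.g. N V B N R Y N N
    print(mean_homology(degenerated, rng))
```

With 100 defined sequences there are 4950 pairwise comparisons per MARSeG sequence. Under this stand-in measure, a fully defined sequence scores 1.0, and a sequence with more remaining possibilities per position scores lower.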
The output of the program consists of a full list of all generated sequences and one text file that shows the sequence with the lowest homology value (see Input & Output).
For a reference, please see "DNA replication in engineered Escherichia coli genomes with extra replication origins", D. Schindler, S. Milbredt, T. Sperlea, T. Waldminghaus, 2016.