Documentation for SeqShuffle.jl

SeqShuffle

SeqShuffle.data_2_dummy - Method
data_2_dummy(dna_strings::Vector{String}; F=FloatType)

Dummy-encode a vector of DNA sequences (for example, those read from a fasta file). All sequences are assumed to have the same length.

Input:

  • dna_strings: A vector of DNA strings, all of the same length

Output: A matrix of type FloatType (e.g., Float32) in which each column is the dummy encoding of one input sequence.
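
The encoding described above can be sketched as follows. This is a minimal illustration, not SeqShuffle's implementation: the helper names nuc_index and dummy_encode_sketch are hypothetical, and the layout (four rows per position in the order A, C, G, T, so a sequence of length L becomes a column of length 4L) and the all-zero treatment of other symbols are assumptions.

```julia
# Hypothetical helper: map a nucleotide to its row offset (0 for anything else).
nuc_index(c::Char) = c == 'A' ? 1 : c == 'C' ? 2 : c == 'G' ? 3 : c == 'T' ? 4 : 0

# Sketch of a dummy (one-hot) encoder; assumed layout, not the package's code.
function dummy_encode_sketch(dna_strings::Vector{String}; F=Float32)
    isempty(dna_strings) && return zeros(F, 0, 0)
    L = length(dna_strings[1])
    all(s -> length(s) == L, dna_strings) ||
        throw(ArgumentError("all sequences must have the same length"))
    X = zeros(F, 4L, length(dna_strings))   # one column per sequence
    for (j, s) in enumerate(dna_strings)
        for (i, c) in enumerate(s)
            idx = nuc_index(c)
            idx == 0 && continue            # assumption: unknown symbols stay all-zero
            X[4 * (i - 1) + idx, j] = one(F)
        end
    end
    return X
end

X = dummy_encode_sketch(["ACGT", "TTAA"])
size(X)  # (16, 2)
```

Each column contains exactly one nonzero entry per position, so the column sum equals the sequence length when every symbol is A, C, G, or T.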

SeqShuffle.est_1st_order_markov_bg - Method
est_1st_order_markov_bg(vec_str::Vector{String}; laplace=1, F=FloatType)

Estimate the first-order Markov transition matrix of the (shuffled) background sequences, e.g.,
        A   C   G   T
    A  0.1 0.3 0.4 0.2
    C  0.5 0.1 0.1 0.3
    G  0.2 0.2 0.2 0.4
    T  0.7 0.1 0.1 0.1
The rows and the columns are ordered A, C, G, T; in this example each row sums to 1.

Also estimate the initial distribution, e.g.,
    A   C   G   T
   0.2 0.2 0.3 0.3
The columns are ordered A, C, G, T.

Input:

  • vec_str: A vector of strings
  • laplace: Pseudocounts to add (optional, defaults to 1)
  • F: Datatype of the estimates (optional, defaults to Float32)

Output: The estimated 4x4 Markov matrix and the 4x1 initial distribution.
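
One way such estimates could be computed is sketched below. The function markov_bg_sketch is hypothetical and is not the package's code. It assumes that the pseudocount is added to every transition count and every initial count, that the initial distribution is estimated from the first symbol of each sequence, and that transitions involving symbols other than A, C, G, T are skipped.

```julia
# Sketch of a first-order Markov background estimate (assumed conventions).
function markov_bg_sketch(vec_str::Vector{String}; laplace=1, F=Float32)
    idx = Dict('A' => 1, 'C' => 2, 'G' => 3, 'T' => 4)
    trans = fill(F(laplace), 4, 4)   # trans[i, j]: count of symbol i followed by symbol j
    init = fill(F(laplace), 4)       # init[i]: count of sequences starting with symbol i
    for s in vec_str
        prev = 0
        for (i, c) in enumerate(s)
            cur = get(idx, c, 0)     # 0 marks a symbol outside A, C, G, T
            if i == 1 && cur != 0
                init[cur] += 1
            end
            if prev != 0 && cur != 0
                trans[prev, cur] += 1
            end
            prev = cur
        end
    end
    trans ./= sum(trans, dims=2)     # normalize each row to sum to 1
    init ./= sum(init)
    return trans, init
end

P, p0 = markov_bg_sketch(["ACGT", "AACC"])
```

With laplace greater than 0, no entry of the estimated matrix is exactly zero, even for transitions that never occur in the input.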

SeqShuffle.read_fasta - Method
read_fasta(filepath; max_entries=1000000)

Read a fasta file into a vector of strings.

Input:

  • filepath: The absolute path of the input fasta file, as a string.
  • max_entries: The maximum number of entries to take from the fasta file (entries 1 through max_entries).

Output: A vector of strings that corresponds to the strings in the fasta file. All strings are in uppercase.
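
The reading behavior described above can be approximated with the sketch below. The function read_fasta_sketch is hypothetical and is not the package's parser. It takes an IO stream rather than a file path so the example is self-contained, and it assumes that header lines start with ">" and that a sequence may span several lines.

```julia
# Sketch of a fasta reader: returns uppercase sequences, at most max_entries of them.
function read_fasta_sketch(io::IO; max_entries=1_000_000)
    seqs = String[]
    buf = IOBuffer()
    in_record = false
    for line in eachline(io)
        line = strip(line)
        isempty(line) && continue
        if startswith(line, ">")
            # A new header closes the previous record, if any.
            in_record && push!(seqs, uppercase(String(take!(buf))))
            length(seqs) >= max_entries && return seqs
            in_record = true
        elseif in_record
            print(buf, line)         # concatenate wrapped sequence lines
        end
    end
    in_record && push!(seqs, uppercase(String(take!(buf))))
    return seqs
end

seqs = read_fasta_sketch(IOBuffer(">s1\nacgt\nAC\n>s2\nttaa\n"))
first_only = read_fasta_sketch(IOBuffer(">s1\nacgt\nAC\n>s2\nttaa\n"); max_entries=1)
```

For a file on disk, the same sketch could be applied with open(read_fasta_sketch, filepath).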

SeqShuffle.seq_shuffle - Method
seq_shuffle(seq::String; k=2, seed=nothing)

Shuffle the input string so that the frequency of its k-mers is preserved.

Input:

  • seq: A string
  • k: Integer; the k of the k-mers whose frequencies are preserved (defaults to 2)
  • seed: Integer seed for random number generation (optional, defaults to nothing)

Output: A shuffled version of the input string seq
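
To make the preserved property concrete, the sketch below counts k-mers so that two strings can be compared. The helper kmer_counts is hypothetical and is not part of the package. It assumes ASCII input and reads "preserves the frequency of k-mers" literally, as equal k-mer counts before and after shuffling. The strings are hand-picked examples, not output of seq_shuffle.

```julia
# Count every overlapping k-mer in an ASCII string.
function kmer_counts(seq::String, k::Integer)
    counts = Dict{String,Int}()
    for i in 1:(length(seq) - k + 1)
        kmer = seq[i:i+k-1]
        counts[kmer] = get(counts, kmer, 0) + 1
    end
    return counts
end

original   = "ACATAGA"
rearranged = "AGATACA"   # same 2-mers: AC, CA, AT, TA, AG, GA
scrambled  = "AAAACTG"   # same letters, different 2-mers

same_2mers = kmer_counts(original, 2) == kmer_counts(rearranged, 2)
same_1mers = kmer_counts(original, 1) == kmer_counts(scrambled, 1)
diff_2mers = kmer_counts(original, 2) != kmer_counts(scrambled, 2)
```

A check of this kind, applied to the input and output of a shuffle, distinguishes a k-mer-preserving rearrangement from a plain permutation of the letters.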

SeqShuffle.shuffle_fasta - Method
shuffle_fasta(input_fasta_location::String, 
          fasta_output_location::String;
          k::Int=2, seed=nothing,
          max_entries=1000000)

Shuffle each sequence in the input fasta file such that it preserves the frequency of the k-mers in each sequence.

Input:

  • input_fasta_location: The absolute file path of the input fasta file.
  • fasta_output_location: The absolute file path of the output fasta file.
  • k: The k of the k-mers whose frequencies are preserved (defaults to 2).
  • seed: The seed for the random number generator (defaults to nothing).
  • max_entries: The maximum number of entries to take from the fasta file (entries 1 through max_entries).

Output: A fasta file that contains the shuffled version of each sequence in the input fasta file.
