Documentation for SeqShuffle.jl
SeqShuffle
Documentation for SeqShuffle.
SeqShuffle.data_2_dummy
SeqShuffle.est_1st_order_markov_bg
SeqShuffle.read_fasta
SeqShuffle.seq_shuffle
SeqShuffle.shuffle_fasta
SeqShuffle.data_2_dummy — Method
data_2_dummy(dna_strings::Vector{String}; F=FloatType)
Convert a vector of DNA sequences (e.g., read from a FASTA file) into a dummy-encoded array. Assumes all sequences have the same length.
Input:
dna_strings: A vector of strings.
Output: A matrix of type FloatType (e.g., Float32) in which each column is the dummy encoding of one input sequence.
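The encoding described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the helper name `onehot_columns` is hypothetical, and the layout (four entries per base in A, C, G, T order, stacked down each column) is an assumption that may differ from what `data_2_dummy` actually produces.

```julia
# Hypothetical helper, not part of SeqShuffle: one-hot ("dummy") encode
# equal-length DNA strings, one sequence per column.
# Assumed layout: rows 4(i-1)+1 .. 4i hold position i, ordered A, C, G, T.
function onehot_columns(dna_strings::Vector{String}; F=Float32)
    alphabet = "ACGT"
    L = length(dna_strings[1])
    @assert all(length(s) == L for s in dna_strings) "sequences must share one length"
    X = zeros(F, 4 * L, length(dna_strings))
    for (j, s) in enumerate(dna_strings)
        for (i, c) in enumerate(uppercase(s))
            idx = findfirst(==(c), alphabet)   # nothing for non-ACGT characters
            idx === nothing && continue        # leave e.g. N as all zeros
            X[4 * (i - 1) + idx, j] = one(F)
        end
    end
    return X
end

X = onehot_columns(["ACGT", "TTAA"])
```

With two sequences of length 4, `X` has 16 rows and 2 columns, and each column contains exactly one nonzero entry per position.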
SeqShuffle.est_1st_order_markov_bg — Method
est_1st_order_markov_bg(vec_str::Vector{String}; laplace=1, F=FloatType)
Estimate the first-order Markov transition matrix of the (shuffled) background sequence, e.g.,

       A    C    G    T
  A   0.1  0.3  0.4  0.2
  C   0.5  0.1  0.1  0.3
  G   0.2  0.2  0.2  0.4
  T   0.7  0.1  0.1  0.1

The rows and the columns are ordered A, C, G, T.
Also estimate the initial distribution, e.g.,

   A    C    G    T
  0.2  0.2  0.3  0.3

The columns are ordered A, C, G, T.
Input:
vec_str: A vector of strings.
laplace: Pseudocounts to add (optional, defaults to 1).
F: Data type of the estimates (optional, defaults to Float32).
Output: The estimated 4x4 Markov matrix and the 4x1 initial distribution.
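The estimate described above can be sketched by counting adjacent base pairs, adding the pseudocount, and normalizing each row. The helper `markov_bg_sketch` is hypothetical and is not the package's code. In particular, taking the initial distribution from the first base of each sequence is an assumption; the package may define it differently.

```julia
# Hypothetical helper, not part of SeqShuffle: first-order Markov estimate
# with Laplace pseudocounts. Rows = current base, columns = next base (A, C, G, T).
function markov_bg_sketch(vec_str::Vector{String}; laplace=1, F=Float32)
    index = Dict('A' => 1, 'C' => 2, 'G' => 3, 'T' => 4)
    trans = fill(F(laplace), 4, 4)   # transition counts, seeded with pseudocounts
    init = fill(F(laplace), 4)       # first-base counts (assumed definition)
    for s in vec_str
        chars = collect(uppercase(s))
        isempty(chars) && continue
        haskey(index, chars[1]) && (init[index[chars[1]]] += 1)
        for t in 1:length(chars)-1
            a, b = chars[t], chars[t+1]
            (haskey(index, a) && haskey(index, b)) || continue  # skip non-ACGT pairs
            trans[index[a], index[b]] += 1
        end
    end
    trans ./= sum(trans, dims=2)     # normalize each row to sum to 1
    init ./= sum(init)
    return trans, init
end

P, p0 = markov_bg_sketch(["ACGT", "AACC"])
```

A positive pseudocount keeps every row normalizable even when a base never appears; with `laplace=0` in this sketch, an unseen base would produce a row of NaN values.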
SeqShuffle.read_fasta — Method
read_fasta(filepath; max_entries=1000000)
Read a FASTA file into a vector of strings.
Input:
filepath: A string giving the absolute path of the input FASTA file.
max_entries: The maximum number of entries to take from the FASTA file (entries 1 through max_entries).
Output: A vector of strings corresponding to the sequences in the FASTA file. All strings are in uppercase.
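The behavior described above can be sketched with a small reader that joins multi-line records, uppercases them, and stops after `max_entries` records. The helper `read_fasta_sketch` is hypothetical and only illustrates the idea; how the package handles blank lines or malformed records is not specified here.

```julia
# Hypothetical helper, not part of SeqShuffle: read up to `max_entries`
# sequences from a FASTA file, returned as uppercase strings.
function read_fasta_sketch(filepath::AbstractString; max_entries=1_000_000)
    seqs = String[]
    buf = IOBuffer()
    in_record = false
    for line in eachline(filepath)
        line = strip(line)
        if startswith(line, ">")
            # A new header closes the previous record, if any.
            in_record && push!(seqs, uppercase(String(take!(buf))))
            length(seqs) >= max_entries && return seqs
            in_record = true
        elseif in_record
            print(buf, line)          # sequence lines of one record are concatenated
        end
    end
    in_record && push!(seqs, uppercase(String(take!(buf))))
    return seqs
end

# Usage on a temporary file.
path, io = mktemp()
write(io, ">seq1\nacgt\nAC\n>seq2\nttga\n")
close(io)
seqs = read_fasta_sketch(path)
```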
SeqShuffle.seq_shuffle — Method
seq_shuffle(seq::String; k=2, seed=nothing)
Shuffle the input string so that the frequency of its k-mers is preserved.
Input:
seq: A string.
k: Integer; the k of the k-mer frequency to preserve.
seed: Integer seed for random number generation.
Output: A shuffled version of the input string seq.
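One way to check the property stated above is to compare overlapping k-mer counts before and after shuffling. The helper `kmer_counts` below is hypothetical and not part of the package, and the rearranged string is hand-built for illustration rather than produced by `seq_shuffle`.

```julia
# Hypothetical helper, not part of SeqShuffle: count overlapping k-mers.
function kmer_counts(seq::AbstractString, k::Integer)
    counts = Dict{String,Int}()
    chars = collect(seq)
    for i in 1:length(chars)-k+1
        kmer = String(chars[i:i+k-1])
        counts[kmer] = get(counts, kmer, 0) + 1
    end
    return counts
end

original   = "AACGTTAACG"
rearranged = "ACGTTAAACG"   # hand-built: a different order with the same 2-mer counts

same_2mers = kmer_counts(original, 2) == kmer_counts(rearranged, 2)

# With the package loaded, the same check could be applied to real output,
# based on the signature documented above:
# shuffled = seq_shuffle(original; k=2, seed=1)
# kmer_counts(original, 2) == kmer_counts(shuffled, 2)
```

Preserving 2-mer counts is a stricter condition than preserving base composition, so a plain random permutation of the characters will generally fail this check.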
SeqShuffle.shuffle_fasta — Method
shuffle_fasta(input_fasta_location::String,
              fasta_output_location::String;
              k::Int=2, seed=nothing,
              max_entries=1000000)
Shuffle each sequence in the input fasta file such that it preserves the frequency of the k-mers in each sequence.
Input:
input_fasta_location: The absolute file path of the input FASTA file.
fasta_output_location: The absolute file path of the output FASTA file.
k: k for k-mer frequency.
seed: The seed for the random number generator.
max_entries: The maximum number of entries to take from the FASTA file (entries 1 through max_entries).
Output: A FASTA file that contains the shuffled version of each sequence in the input FASTA file.
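The file-level workflow described above (read records, shuffle each sequence, write a new FASTA file) can be sketched as below. Everything here is hypothetical: `shuffle_fasta_sketch` is not the package's implementation, keeping the original header lines is an assumption, and the stand-in `plain_shuffle` is an ordinary random permutation that preserves only base composition, not k-mer frequencies for k > 1.

```julia
using Random

# Hypothetical sketch, not SeqShuffle's implementation: rewrite a FASTA file
# with every sequence passed through `shuffle_fn`. Headers are kept as-is (assumption).
function shuffle_fasta_sketch(input_path, output_path; shuffle_fn, max_entries=1_000_000)
    headers, seqs = String[], String[]
    for line in eachline(input_path)
        line = strip(line)
        if startswith(line, ">")
            length(headers) >= max_entries && break
            push!(headers, line)
            push!(seqs, "")
        elseif !isempty(headers)
            seqs[end] *= uppercase(line)   # join multi-line records
        end
    end
    open(output_path, "w") do io
        for (h, s) in zip(headers, seqs)
            println(io, h)
            println(io, shuffle_fn(s))
        end
    end
    return length(seqs)
end

# Stand-in shuffler: a seeded plain permutation (preserves 1-mer counts only).
rng = MersenneTwister(42)
plain_shuffle(s) = String(shuffle(rng, collect(s)))

in_path, io = mktemp()
write(io, ">seq1\nACGTACGT\n>seq2\nTTGACA\n")
close(io)
out_path = tempname()
n = shuffle_fasta_sketch(in_path, out_path; shuffle_fn=plain_shuffle)
```

Passing the shuffler as an argument keeps the file handling separate from the shuffling rule, so a k-mer-preserving function can be substituted without changing the rest of the sketch.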