Generating alleles file locally

To reduce the amount of data to be uploaded to the MGT database some of the MGT pipeline processing can be performed locally.

These steps include:

  1. Species, serovar checking using kraken and SISTR (for salmonella)

  2. Genome assembly using shovill and skesa

  3. Genome QC

  4. Extraction of alleles from genome using known allele fasta file

  5. Assignment of 7 gene MLST sequence type

The resulting file is often several orders of magnitude smaller than the raw reads, facilitating rapid upload and analysis.


This pipeline has many dependencies so conda is the best way to handle them all. So the included .yaml file can be used to create the required environment that will need to be activated before running the script

  1. Clone the repo

  2. Download latest miniKraken Database:

    From the kraken website - (warning 2.9GB!)



  3. unzip archive

  4. Add database folder variable with:

    export KRAKEN_DEFAULT_DB="/home/user/minikraken_db_folder"

  5. Install conda environment:**

    install miniconda3 ->

    conda create -f /path/to/fq_to_allele.yaml -n deployable_fq_to_genome

    This may take a while

    conda activate deployable_fq_to_genome


python read_1.fastq.gz,read_2.fastq.gz ref_alleles.fasta output.fasta

The above script will run with all other settings including species and serovar as default (see below).


Reads files

Paired end fastq files (gzipped or not) in format strain_name_1.fastq(.gz) and strain_name_2.fastq(.gz)

Reference alleles

Fasta file provided with script containing intact alleles for each locus (may be initial “1” alleles only or include other intact alleles)


An Alleles file in fasta format: strainID_alleles.fasta. 4 different types of “allele” are recorded.

  1. A header stating the 7 gene MLST type predicted by mlst

  2. A header in the format the locus:0_reason_for_failed_call to denote loci with uncallable alleles

  3. A header in the format locus:allele to describe exact matches to alleles in the reference alleles file

  4. A header in the format locus:new with sequence to describe new intact alleles or alleles with missing data

This allele file can be submitted (optionally with metadata) to the MGT database for full MGT assignment


usage: [-h] -i INPUTREADS –refalleles REFALLELES -o OUTPATH [optional args]

required arguments:

Input paired fastq(.gz) files, comma separated (i.e. name_1.fastq,name_2.fastq ) (default: None)

--refalleles REFALLELES

File path to MGT reference allele file. By default sistr results will be used to determine which subfolder within the default folder (default: /species_specific_files/)

-o OUTPATH, --outpath OUTPATH

Path to ouput file name,required=True (default: None)

optional arguments:

-h, --help

show this help message and exit

-s SPECIES, --species SPECIES

String to find in kraken species confirmation test (default: Salmonella enterica)


Do not run Serotyping of Salmonella using SISTR (ON by default) (default: None)

-y SEROTYPE, --serotype SEROTYPE

Serotype to match in SISTR, semicolon separated (default: Typhimurium;I 4,[5],12:i:-)

-t THREADS, --threads THREADS

number of computing threads (default: 4)

-m MEMORY, --memory MEMORY

memory available in GB (default: 8)

-f, --force

overwrite output files with same strain name? (default: False)

--min_largest_contig MIN_LARGEST_CONTIG

Assembly quality filter: minimum allowable length of the largest contig in the assembly in bp (default: 60000)

--max_contig_no MAX_CONTIG_NO

Assembly quality filter: maximum allowable number of contigs allowed for assembly (default: 700)

--genome_min GENOME_MIN

Assembly quality filter: minimum allowable total assembly length in bp (default: 4500000)

--genome_max GENOME_MAX

Assembly quality filter: maximum allowable total assembly length in bp (default: 5500000)

--n50_min N50_MIN

Assembly quality filter: minimum allowable n50 value in bp (default: 20000)

--kraken_db KRAKEN_DB

path for kraken db (if KRAKEN_DEFAULT_DB variable has already been set then ignore) (default: )



running strain 1234 against salmonella typhimurium MGT with 8 cores and 30gb RAM

python /path/to/ 1234_1.fastq.gz,1234_2.fastq.gz MGT_alleles_file locus_position_file output_file_name –serotype “Typhimurium;I 4,[5],12:i:-” –species “Salmonella enterica” -t 8 -m 30


running strain abcd against vibrio cholerae MGT with 4 cores and 50gb RAM (serotyping is currently only for Salmonella)

python /path/to/ abcd_1.fastq.gz,abcd_2.fastq.gz MGT_alleles_file locus_position_file output_file_name –no_serotyping –species “Vibrio cholerae” -t 4 -m 50