One of the biggest public health stories right now is the "new" E. coli bacterium, which has so far killed 22 people in Germany.
The Genoogle project is always looking for "real world" tasks, so we decided to compare the E. coli O104 genome assembly against the E. coli genome provided by NCBI.
For that, the E. coli genome, approximately 4.5 Mb, was formatted using the following parameters:
mask="111010010100110111" sub-sequence-length="11" low-complexity-filter="5"
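The mask above has 18 positions, 11 of which are 1, which matches the sub-sequence-length of 11. As a minimal sketch of how a spaced-seed mask of this kind is commonly applied, assuming that 1 means "keep this base" and 0 means "ignore it": the function names apply_mask and masked_seeds are hypothetical and this is not Genoogle's actual code.

```python
MASK = "111010010100110111"  # mask from the formatting parameters above


def apply_mask(window: str, mask: str = MASK) -> str:
    """Keep only the bases at positions where the mask is '1' (assumed meaning)."""
    if len(window) != len(mask):
        raise ValueError("window and mask must have the same length")
    return "".join(base for base, bit in zip(window, mask) if bit == "1")


def masked_seeds(sequence: str, mask: str = MASK):
    """Yield (start, seed) for every window of len(mask) bases in the sequence."""
    span = len(mask)
    for start in range(len(sequence) - span + 1):
        yield start, apply_mask(sequence[start:start + span], mask)


if __name__ == "__main__":
    # Two 18-base windows that differ only at an ignored position (index 3)
    # produce the same 11-base seed.
    print(apply_mask("ACGTACGTACGTACGTAC"))
    print(apply_mask("ACGGACGTACGTACGTAC"))
```

Under this assumption, a mismatch that falls on a 0 position does not prevent two windows from sharing a seed, which is the usual reason for using a mask instead of 11 contiguous bases.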
A search was then run using the Escherichia_coli_TY-2482.contig.fa sequence (about 5 Mb) as the query, with the following parameters:
max-sub-sequence-distance value="11" min-hsp-length="11" extend-dropoff="5"
max-hits-results="3" max-threads-index-search="4" max-threads-extend-align="16"
The full search of all 1217 contigs took 7.5 seconds, and the results can be seen at http://pih.bio.br/genoogle/Escherichia_coli_TY-2482_X_ecoli.xml.
This shows that Genoogle is really fast, and the results reveal some interesting facts. For example, even very similar regions still carry small mutations: at iteration 8, the search found a long match against AE000437, Escherichia coli K-12 MG1655 section 327 of 400 of the complete genome, and that alignment still contains small differences.
From the Genoogle development point of view, two things stood out:
- The results need to display the input query sequence, so that each result has some context. This will be done by showing the input sequence header in the iteration field.
- Low-scoring, high e-value alignments need to be filtered from the output. The output currently includes alignments of 4 or 5 base pairs with e-values higher than one, which are meaningless.
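The second point can be sketched as a simple post-processing filter. This is only an illustration under stated assumptions: the Hsp type, its field names, and the default thresholds (e-value at most 1.0, length at least 11 to mirror min-hsp-length) are hypothetical and do not reflect Genoogle's code or its XML output format.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Hsp:
    """Hypothetical, simplified record for one reported alignment."""
    align_length: int  # alignment length in base pairs
    e_value: float     # expected number of chance alignments with this score


def filter_hsps(hsps: Iterable[Hsp],
                max_e_value: float = 1.0,
                min_length: int = 11) -> List[Hsp]:
    """Drop alignments that are too short or whose e-value is too high."""
    return [
        hsp for hsp in hsps
        if hsp.e_value <= max_e_value and hsp.align_length >= min_length
    ]


if __name__ == "__main__":
    hits = [Hsp(4, 3.2), Hsp(5, 1.7), Hsp(950, 1e-120)]
    for hsp in filter_hsps(hits):
        print(hsp)
```

With these assumed thresholds, the 4 and 5 base pair alignments would be dropped and only the long, significant alignment would remain.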
Both tasks are being developed now. As soon as they are done, a new version will be released and a new search between E. coli genomes will be run.