It's like you have an intact 6th edition of a textbook, plus several copies of the 7th edition with the pages shuffled and the page numbers removed. Programs like BLAST (or short-read aligners such as BWA and Bowtie) build an index from the contents of the 6th edition, then compare each page of the 7th against that index. For a given page of the 7th, you learn that it aligns best at character 123456 of the 6th, or whatever.
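Here's a rough sketch of the index-and-lookup idea, assuming a toy k-mer index with seed voting. Real aligners use far more compact indexes and proper scoring; `build_index` and `best_position` are names made up for this example, not anything from BLAST.

```python
from collections import Counter, defaultdict

def build_index(reference, k):
    """Map every k-character substring of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def best_position(read, index, k):
    """Let each k-mer of the read vote for the reference offset it implies."""
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            votes[pos - offset] += 1
    if not votes:
        return None  # no k-mer of the read occurs in the reference
    return votes.most_common(1)[0][0]

reference = "the quick brown fox jumps over the lazy dog"
index = build_index(reference, 4)
# The read has a typo ("fax"), but enough k-mers still agree on one position.
position = best_position("brown fax jumps", index, 4)
print(position)
```

The typo knocks out the few k-mers that span it, and the remaining ones still agree on where the page belongs.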
Do that for each page in your pile and you get a chart where the X axis is the character index into the 6th edition and the Y axis is the number of 7th-edition pages that aligned there. The peaks and valleys in that graph tell you how much to trust your assumption that a given read is aligned correctly to the reference genome (you also score each alignment on mismatches, insertions, and gaps).
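That chart is just a per-position count. A minimal sketch, assuming each alignment has already been reduced to a hypothetical `(start, length)` pair:

```python
def coverage(ref_len, alignments):
    """Count how many aligned reads cover each reference position."""
    depth = [0] * ref_len
    for start, length in alignments:
        # Clip to the reference so reads hanging off either end don't crash.
        for pos in range(max(start, 0), min(start + length, ref_len)):
            depth[pos] += 1
    return depth

depth = coverage(10, [(0, 4), (2, 4), (3, 5)])
print(depth)  # [1, 1, 2, 3, 2, 2, 1, 1, 0, 0]
```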
So if many pages land on the same locus and they all differ from the reference in the same way, you have reason to trust that there's an authentic difference between your sample and the reference at that location.
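As a sketch of that reasoning, assuming reads are already aligned and only single-character substitutions matter (no insertions or gaps). The `min_depth` and `min_fraction` thresholds are arbitrary illustrative values, not defaults from any real variant caller:

```python
from collections import Counter

def call_differences(reference, alignments, min_depth=3, min_fraction=0.8):
    """Report positions where enough reads agree on a non-reference character."""
    piles = [Counter() for _ in reference]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(reference):
                piles[pos][base] += 1
    calls = []
    for pos, pile in enumerate(piles):
        depth = sum(pile.values())
        if depth < min_depth:
            continue  # too few reads here to trust anything
        base, count = pile.most_common(1)[0]
        if base != reference[pos] and count / depth >= min_fraction:
            calls.append((pos, reference[pos], base, depth))
    return calls

reference = "ACGTACGTAC"
alignments = [(0, "ACGTTCGT"), (2, "GTTCGTAC"), (3, "TTCGTA")]
calls = call_differences(reference, alignments)
print(calls)  # [(4, 'A', 'T', 3)]
```

A single read disagreeing with the reference would be ignored here, since one mismatch is more likely a read error than a real difference.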
There are a lot of chemical tricks you can use to induce meaningful non-uniformity in this graph. See ChIP-Seq, for instance, where peaks mark the places where a chosen protein (or a histone carrying a chosen modification, such as a methyl mark) was bound to the DNA. Depending on the mark, that can indicate a gene that was enabled for transcription when the sample was taken.
If you don't have a reference genome, you can run the sample on a gel to separate the sequences by length, which groups them by chromosome. From there you've got a much more computationally challenging problem, but as long as you can ensure the DNA is cut at random locations before reads are taken, you can use overlaps to figure out the sequence. Unlike the textbook example, the page boundaries are not gonna line up from one copy to the next (but the chromosome ends are):
Mary had a little
was white as snow
lamb whose fleece was
Mary had
had a little lamb
a little lamb
was white
white as snow
So you can find the start and end based on where no overlaps occur (nothing ever comes before Mary or after snow), and then you can build the rest of the sequence from the overlaps.
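To make that concrete, here's a toy greedy overlap assembler run on the reads above. It's a sketch under heavy assumptions: no read errors, no repeated phrases, and a made-up `min_len` cutoff. Real assemblers use overlap or de Bruijn graphs, not this brute-force loop.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that is also a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def assemble(reads, min_len=3):
    """Greedily merge the pair with the largest overlap until none remain."""
    unique = list(dict.fromkeys(reads))
    # Reads fully contained in another read add no new information.
    contigs = [r for r in unique if not any(r != o and r in o for o in unique)]
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)

reads = [
    "Mary had a little", "was white as snow", "lamb whose fleece was",
    "Mary had", "had a little lamb", "a little lamb", "was white",
    "white as snow",
]
contigs = assemble(reads)
print(contigs)
```

With these reads everything collapses into a single contig. If some stretch had no overlapping reads, you'd get several contigs back instead.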
If you're working with circular chromosomes (bacteria and some viruses), you can't reason from the ends, but as long as you have enough data there's still gonna be just one way to make a loop out of your reads. (Imagine the above example, but with the song that never ends. You could still manage to build a loop out of it despite not having an end to work from.)
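One way to picture closing the loop: assemble the reads into a linear contig as usual, then notice that its tail repeats its own head and trim the repeat. A minimal sketch, assuming an error-free contig that wraps past its starting point; `circularize` and `min_len` are hypothetical names for this example:

```python
def circularize(contig, min_len=3):
    """Trim the longest tail that repeats the contig's own head.

    Returns one rotation of the loop, or None if the ends don't overlap.
    """
    for n in range(len(contig) - 1, min_len - 1, -1):
        if contig.endswith(contig[:n]):
            return contig[:-n]
    return None

# A contig read off a looping song: it runs past where it started.
contig = "song that never ends, this is the song that"
loop = circularize(contig)
print(repr(loop))
```

The result is the loop starting at an arbitrary point, which is all you can ask for when there's no end to anchor on.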