-
Notifications
You must be signed in to change notification settings - Fork 8
Hamming distance #15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Hamming distance #15
Changes from all commits
Commits
Show all changes
12 commits
Select commit
Hold shift + click to select a range
d62948c
outline of hamming distance tutorial
danielle-pinto 7e2b404
slight text edits
danielle-pinto 4f3c4e6
make edits according to kevin's suggestions
danielle-pinto 5fa4fc0
add idiomatic solution
danielle-pinto 3a35cc2
add distances.jl solution
danielle-pinto 033e6b5
Update docs/src/rosalind/06-hamm.md
danielle-pinto eb58985
Update docs/src/rosalind/06-hamm.md
danielle-pinto 82d71c8
add semantic line breaks and benchmarking
danielle-pinto 4e5ae14
fix typos
danielle-pinto 9bfbfd8
escape $ due to deployment warning
danielle-pinto 7944497
pin mathjax version to 0.2.7
danielle-pinto 45814fb
add documenter vitepress version pin under compat category
danielle-pinto File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,189 @@ | ||
| # Counting Point Mutations | ||
|
|
||
| 🤔 [Problem link](https://rosalind.info/problems/hamm/) | ||
|
|
||
| !!! warning "The Problem" | ||
|
|
||
| Given two strings s and t of equal length, | ||
| the Hamming distance between s and t, denoted dH(s,t), | ||
| is the number of corresponding symbols that differ in s and t. | ||
|
|
||
|
|
||
| Given: Two DNA strings s and t of equal length (not exceeding 1 kbp). | ||
|
|
||
| Return: The Hamming distance dH(s,t). | ||
|
|
||
| ***Sample Dataset*** | ||
|
|
||
| ``` | ||
| GAGCCTACTAACGGGAT | ||
| CATCGTAATGACGGCCT | ||
| ``` | ||
|
|
||
| ***Sample Output*** | ||
|
|
||
| ``` | ||
| 7 | ||
| ``` | ||
|
|
||
|
|
||
| To calculate the Hamming Distance between two strings/sequences, | ||
| the two strings/DNA sequences must be the same length. | ||
|
|
||
| The simplest way to solve this problem is to compare the corresponding values in each string for each index and then sum the mismatches. | ||
| This is the fastest and most idiomatic Julia solution, as it leverages vector math. | ||
|
|
||
| Let's give this a try! | ||
|
|
||
| ```julia | ||
| ex_seq_a = "GAGCCTACTAACGGGAT" | ||
| ex_seq_b = "CATCGTAATGACGGCCT" | ||
|
|
||
| count(i-> ex_seq_a[i] != ex_seq_b[i], eachindex(ex_seq_a)) | ||
| ``` | ||
|
|
||
| ### For Loop | ||
|
|
||
| Another way we can approach this would be to use the for-loop. | ||
| For loops are traditionally slower and clunkier (especially in Python). | ||
| However, Julia can often optimize for-loops like this, | ||
| which is one of the things that makes it so powerful. | ||
| It has multiple processing units that can run the same task in parallel. | ||
|
|
||
| We can calculate the Hamming Distance by looping over the characters in one of the strings | ||
| and checking if the corresponding character at the same index in the other string matches. | ||
|
|
||
| Each mismatch will cause 1 to be added to a `counter` variable. | ||
| At the end of the loop, we can return the total value of the `counter` variable. | ||
|
|
||
|
|
||
|
|
||
| ```julia | ||
| ex_seq_a = "GAGCCTACTAACGGGAT" | ||
| ex_seq_b = "CATCGTAATGACGGCCT" | ||
|
|
||
| function hamming(seq_a, seq_b) | ||
| # check if the strings are empty | ||
| if isempty(seq_a) | ||
| throw(ErrorException("empty sequences")) | ||
| end | ||
|
|
||
| # check if the strings are different lengths | ||
| if length(seq_a) != length(seq_b) | ||
| throw(ErrorException(" sequences have different lengths")) | ||
| end | ||
|
|
||
| mismatches = 0 | ||
| for i in 1:length(seq_a) | ||
| if seq_a[i] != seq_b[i] | ||
| mismatches += 1 | ||
| end | ||
| end | ||
| return mismatches | ||
| end | ||
|
|
||
| hamming(ex_seq_a, ex_seq_b) | ||
|
|
||
| ``` | ||
|
|
||
|
|
||
| ## BioAlignments method | ||
|
|
||
| Instead of writing your own function, an alternative would be to use the readily-available Hamming Distance [function](https://github.com/BioJulia/BioAlignments.jl/blob/0f3cc5e1ac8b34fdde23cb3dca7afb9eb480322f/src/pairwise/algorithms/hamming_distance.jl#L4) in the `BioAlignments.jl` package. | ||
|
|
||
| ```julia | ||
| using BioAlignments | ||
|
|
||
| ex_seq_a = "GAGCCTACTAACGGGAT" | ||
| ex_seq_b = "CATCGTAATGACGGCCT" | ||
|
|
||
| bio_hamming = BioAlignments.hamming_distance(Int64, ex_seq_a, ex_seq_b) | ||
|
|
||
| bio_hamming[1] | ||
|
|
||
| ``` | ||
|
|
||
| ```julia | ||
| # Double check that we got the same values from both ouputs | ||
| @assert hamming(ex_seq_a, ex_seq_b) == bio_hamming[1] | ||
| ``` | ||
|
|
||
|
|
||
| The BioAlignments `hamming_distance` function requires three input variables -- | ||
| the first of which allows the user to control the `type` of the returned hamming distance value. | ||
|
|
||
| In the above example, `Int64` is provided as the first input variable, | ||
| but `Float64` or `Int8` are also acceptable inputs. | ||
| The second two input variables are the two sequences that are being compared. | ||
|
|
||
| There are two outputs of this function: | ||
| the actual Hamming Distance value and the Alignment Anchor. | ||
| The Alignment Anchor is a one-dimensional array (vector) that is the same length as the length of the input strings. | ||
|
|
||
| Each value in the vector is also an AlignmentAnchor with three fields: | ||
| sequence position, reference position, and an operation code | ||
| ('0' for start, '=' for match, 'X' for mismatch). | ||
|
|
||
| The Alignment Anchor for the above example is: | ||
| ``` | ||
| AlignmentAnchor[ | ||
| AlignmentAnchor(0, 0, '0'), | ||
| AlignmentAnchor(1, 1, 'X'), | ||
| AlignmentAnchor(2, 2, '='), | ||
| AlignmentAnchor(3, 3, 'X'), | ||
| AlignmentAnchor(4, 4, '='), | ||
| AlignmentAnchor(5, 5, 'X'), | ||
| AlignmentAnchor(7, 7, '='), | ||
| AlignmentAnchor(8, 8, 'X'), | ||
| AlignmentAnchor(9, 9, '='), | ||
| AlignmentAnchor(10, 10, 'X'), | ||
| AlignmentAnchor(14, 14, '='), | ||
| AlignmentAnchor(16, 16, 'X'), | ||
| AlignmentAnchor(17, 17, '=')] | ||
| ``` | ||
|
|
||
| ### Distances.jl method | ||
|
|
||
| Another package that calculates the Hamming distance is the [Distances package](https://github.com/JuliaStats/Distances.jl). | ||
| We can call its `hamming` function on our two test sequences: | ||
|
|
||
|
|
||
|
|
||
| ```julia | ||
| using Distances | ||
|
|
||
| ex_seq_a = "GAGCCTACTAACGGGAT" | ||
| ex_seq_b = "CATCGTAATGACGGCCT" | ||
|
|
||
| Distances.hamming(ex_seq_a, ex_seq_b) | ||
| ``` | ||
|
|
||
| ## Benchmarking | ||
|
|
||
| Let's test to see which method is the most efficient! | ||
| Did the for-loop slow us down? | ||
|
|
||
| ```julia | ||
| using BenchmarkTools | ||
|
|
||
| testseq1 = string(randdnaseq(100_000)) # this is defined in BioSequences | ||
|
|
||
| testseq2 = string(randdnaseq(100_000)) | ||
|
|
||
|
|
||
| @benchmark hamming(\$testseq1, \$testseq2) | ||
|
|
||
| @benchmark BioAlignments.hamming_distance(Int64, \$testseq1, \$testseq2) | ||
|
|
||
| @benchmark Distances.hamming(\$testseq1, \$testseq2) | ||
| ``` | ||
|
|
||
| The BioAlignments method takes up a much larger amount of memory, | ||
| and nearly three times as long to run. | ||
| However, it also generates an `AlignmentAnchor` data structure each time the function is called, | ||
| so this is not a fair comparison. | ||
| The `Distances` package is the winner here, which makes sense, | ||
| as it uses a vectorized approach. | ||
|
|
||
|
|
||
|
|
||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.