-
Notifications
You must be signed in to change notification settings - Fork 8
2026 03 04 cookbook sequences #28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
danielle-pinto
wants to merge
4
commits into
main
Choose a base branch
from
2026-03-04-cookbook-sequences
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,32 @@ | ||
| >NG_047945.1 Staphylococcus aureus TN/CN/1/12 mecA gene for ceftaroline-resistant PBP2a family peptidoglycan transpeptidase MecA, complete CDS | ||
| ATGAAAAAGATAAAAATTGTTCCACTTATTTTAATAGTTGTAGTTGTCGGGTTTGGTATATATTTTTATG | ||
| CTTCAAAAGATAAAGAAATTAATAATACTATTGATGCAATTGAAGATAAAAATTTCAAACAAGTTTATAA | ||
| AGATAGCAGTTATATTTCTAAAAGCGATAATGGTGAAGTAGAAATGACTGAACGTCCGATAAAAATATAT | ||
| AATAGTTTAGGCGTTAAAGATATAAACATTCAGGATCGTAAAATAAAAAAAGTATCTAAAAATAAAAAAC | ||
| GAGTAGATGCTCAATATAAAATTAAAACAAACTACGGTAACATTGATCGCAACGTTCAATTTAATTTTGT | ||
| TAAAGAAGATGGTATGTGGAAGTTAGATTGGGATCATAGCGTCATTATTCCAGGAATGCAGAAAGACCAA | ||
| AGCATACATATTGAAAATTTAAAATCAGAACGTGGTAAAATTTTAGACCGAAACAATGTGGAATTGGCCA | ||
| ATACAGGAACAGCATATGAGATAGGCATCGTTCCAAAGAATGTATCTAAAAAAGATTATAAAGCAATCGC | ||
| TAAAGAACTAAGTATTTCTGAAGACTATATCAAACAACAAATGGATCAAAATTGGGTACAAGATGATACC | ||
| TTCGTTCCACTTAAAACCGTTAAAAAAATGGATGAATATTTAAGTGATTTCGCAAAAAAATTTCATCTTA | ||
| CAACTAATGAAACAAAAAGTCGTAACTATCCTCTAGAAAAAGCGACTTCACATCTATTAGGTTATGTTGG | ||
| TCCCATTAACTCTGAAGAATTAAAACAAAAAGAATATAAAGGCTATAAAGATGATGCAGTTATTGGTAAA | ||
| AAGGGACTCGAAAAACTTTACGATAAAAAGCTCCAACATGAAGATGGCTATCGTGTCACAATCGTTGACG | ||
| ATAATAGCAATACAATCGCACATACATTAATAGAGAAAAAGAAAAAAGATGGCAAAGATATTCAACTAAC | ||
| TATTGATGCTAAAGTTCAAAAGAGTATTTATAACAACATGAAAAATGATTATGGCTCAGGTACTGCTATC | ||
| CACCCTCAAACAGGTGAATTATTAGCACTTGTAAGCACACCTTCATATGACGTCTATCCATTTATGTATG | ||
| GCATGAGTAACGAAGAATATAATAAATTAACCGAAGATAAAAAAGAACCTCTGCTCAACAAGTTCCAGAT | ||
| TACAACTTCACCAGGTTCAACTCAAAAAATATTAACAGCAATGATTGGGTTAAATAACAAAACATTAGAC | ||
| GATAAAACAAGTTATAAAATCGATGGTAAAGGTTGGCAAAAAGATAAATCTTGGGGTGGTTACAACGTTA | ||
| CAAGATATGAAGTGGTAAATGGTAATATCGACTTAAAACAAGCAATAGAATCATCAGATAACATTTTCTT | ||
| TGCTAGAGTAGCACTCGAATTAGGCAGTAAGAAATTTGAAAAAGGCATGAAAAAACTAGGTGTTGGTGAA | ||
| GATATACCAAGTGATTATCCATTTTATAATGCTCAAATTTCAAACAAAAATTTAGATAATGAAATATTAT | ||
| TAGCTGATTCAGGTTACGGACAAGGTGAAATACTGATTAACCCAGTACAGATCCTTTCAATCTATAGCGC | ||
| ATTAGAAAATAATGGCAATATTAACGCACCTCACTTATTAAAAGACACGAAAAACAAAGTTTGGAAGAAA | ||
| AATATTATTTCCAAAGAAAATATCAATCTATTAACTGATGGTATGCAACAAGTCGTAAATAAAACACATA | ||
| AAGAAGATATTTATAGATCTTATGCAAACTTAATTGGCAAATCCGGTACTGCAGAACTCAAAATGAAACA | ||
| AGGAGAAACTGGCAGACAAATTGGGTGGTTTATATCATATGATAAAGATAATCCAAACATGATGATGGCT | ||
| ATTAATGTTAAAGATGTACAAGATAAAGGAATGGCTAGCTACAATGCCAAAATCTCAGGTAAAGTGTATG | ||
| ATGAGCTATATGAGAACGGTAATAAAAAATACGATATAGATGAATAACAAAACAGTGAAGCAATCCGTAA | ||
| CGATGGTTGCTTCACTGTTTTATTATGAATTATTAATAAGTGCTGTTACTTCTCCCTTAAATACAATTTC | ||
| TTCATTT |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| +++ | ||
| using Dates | ||
| date = Date("2026-03-04") | ||
| title = "BioJulia Cookbook" | ||
| rss_descr = "Recipes for performing basic bioinformatics in julia" | ||
| +++ | ||
|
|
||
| # BioJulia Cookbook | ||
|
|
||
| This cookbook will provide a series of "recipes" that will help get started quickly with BioJulia so you can doing some bioinformatics! | ||
|
|
||
| We have tutorials for reading in files, performing alignments, and using tools such as BLAST, | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We will have, no
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Though another option would be to bring in FormatSpecimens.jl... maybe not for the very first one. |
||
| as well as links to more documentation about specific BioJulia packages. | ||
|
|
||
| {{list_dir cookbook}} | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,238 @@ | ||
| +++ | ||
| title = "Sequence Input/Output" | ||
| rss_descr = "Reading in FASTA, FASTQ, and FAI files using FASTX.jl" | ||
| +++ | ||
|
|
||
| # Sequence Input/Output | ||
|
|
||
| In this chapter, we'll talk about how to read in sequence files using the `FASTX.jl` module. | ||
| More information about the `FASTX.jl` package can be found at https://biojulia.dev/FASTX.jl/stable/ | ||
| and with the built-in documentation you can access directly within the Julia REPL. | ||
|
|
||
| ```julia | ||
| julia> using FASTX | ||
| julia> ?FASTX | ||
| ``` | ||
|
|
||
| If `FASTX` is not already in your environment, | ||
| it can be easily added from the Julia Registry. | ||
|
|
||
| To demonstrate how to use this package, | ||
| let's try to read in some real-world data! | ||
|
|
||
| The `FASTX` package can read in three file types: `fasta`, `fastq`, and `fai`. | ||
|
|
||
| ### FASTA files | ||
| FASTA files are text files containing biological sequence data. | ||
| They have three parts: name, description, and sequence. | ||
|
|
||
| The template of a sequence record is: | ||
| ``` | ||
| >{description} | ||
| {sequence} | ||
| ``` | ||
|
|
||
| The identifier is the first part of the description until the first whitespace. | ||
| If there is no white space, the name and description are the same. | ||
|
|
||
| Here is an example fasta: | ||
| ``` | ||
| >chrI chromosome 1 | ||
| CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC | ||
| CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTA | ||
| ``` | ||
|
|
||
| ### FASTQ files | ||
| FASTQ files are also text-based files that contain sequences, along with a name and description. | ||
| However, they also store sequence quality information (the Q is for quality!). | ||
|
|
||
| The template of a sequence record is: | ||
| ``` | ||
| @{description} | ||
| {sequence} | ||
| +{description}? | ||
| {qualities} | ||
| ``` | ||
|
|
||
| Here is an example record: | ||
| ``` | ||
| @FSRRS4401BE7HA | ||
| tcagTTAAGATGGGAT | ||
| + | ||
| ###EEEEEEEEE##E# | ||
| ``` | ||
|
|
||
| ### FAI files | ||
|
|
||
| FAI (FASTA index) files are used in conjunction with FASTA/FASTQ files. | ||
| They are text files with TAB-delimited columns. | ||
| They allow the user to access specific regions of the reference FASTA/FASTQ without reading in the entire sequence into memory. | ||
| More information about fai index files can be found [here](https://www.htslib.org/doc/faidx.html). | ||
|
|
||
| ``` | ||
| NAME Name of this reference sequence | ||
| LENGTH Total length of this reference sequence, in bases | ||
| OFFSET Offset in the FASTA/FASTQ file of this sequence's first base | ||
| LINEBASES The number of bases on each line | ||
| LINEWIDTH The number of bytes in each line, including the newline | ||
| QUALOFFSET Offset of sequence's first quality within the FASTQ file | ||
|
|
||
| ``` | ||
|
|
||
| We will read in a FASTA file containing the _mecA_ gene. | ||
| _mecA_ is an antibiotic resistance gene commonly found in Methicillin-resistant _Staphylococcus aureus_ (MRSA). | ||
| It helps bacteria to break down beta-lactam antibiotics like methicillin. | ||
| It is typically 2.1 kB long. | ||
| This specific reference fasta was downloaded from NCBI [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1?report=fasta). More information about this reference sequence can be found [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1). | ||
|
|
||
| First we'll open the file. | ||
| Then we'll iterate over every record in the file and | ||
| print out the sequence identifier, the sequence description and then the corresponding sequence. | ||
|
|
||
| ```julia | ||
| julia> FASTAReader(open("assets/mecA.fasta")) do reader | ||
| for record in reader | ||
| println(identifier(record)) | ||
| println(description(record)) | ||
| println(sequence(record)) | ||
| println(length(sequence(record))) | ||
| end | ||
| end | ||
|
|
||
| NG_047945.1 | ||
| NG_047945.1 Staphylococcus aureus TN/CN/1/12 mecA gene for ceftaroline-resistant PBP2a family peptidoglycan transpeptidase MecA, complete CDS | ||
| ATGAAAAAGATAAAAATTGTTCCACTTATTTTAATAGTTGTAGTTGTCGGGTTTGGTATATATTTTTATGCTTCAAAAGATAAAGAAATTAATAATACTATTGATGCAATTGAAGATAAAAATTTCAAACAAGTTTATAAAGATAGCAGTTATATTTCTAAAAGCGATAATGGTGAAGTAGAAATGACTGAACGTCCGATAAAAATATATAATAGTTTAGGCGTTAAAGATATAAACATTCAGGATCGTAAAATAAAAAAAGTATCTAAAAATAAAAAACGAGTAGATGCTCAATATAAAATTAAAACAAACTACGGTAACATTGATCGCAACGTTCAATTTAATTTTGTTAAAGAAGATGGTATGTGGAAGTTAGATTGGGATCATAGCGTCATTATTCCAGGAATGCAGAAAGACCAAAGCATACATATTGAAAATTTAAAATCAGAACGTGGTAAAATTTTAGACCGAAACAATGTGGAATTGGCCAATACAGGAACAGCATATGAGATAGGCATCGTTCCAAAGAATGTATCTAAAAAAGATTATAAAGCAATCGCTAAAGAACTAAGTATTTCTGAAGACTATATCAAACAACAAATGGATCAAAATTGGGTACAAGATGATACCTTCGTTCCACTTAAAACCGTTAAAAAAATGGATGAATATTTAAGTGATTTCGCAAAAAAATTTCATCTTACAACTAATGAAACAAAAAGTCGTAACTATCCTCTAGAAAAAGCGACTTCACATCTATTAGGTTATGTTGGTCCCATTAACTCTGAAGAATTAAAACAAAAAGAATATAAAGGCTATAAAGATGATGCAGTTATTGGTAAAAAGGGACTCGAAAAACTTTACGATAAAAAGCTCCAACATGAAGATGGCTATCGTGTCACAATCGTTGACGATAATAGCAATACAATCGCACATACATTAATAGAGAAAAAGAAAAAAGATGGCAAAGATATTCAACTAACTATTGATGCTAAAGTTCAAAAGAGTATTTATAACAACATGAAAAATGATTATGGCTCAGGTACTGCTATCCACCCTCAAACAGGTGAATTATTAGCACTTGTAAGCACACCTTCATATGACGTCTATCCATTTATGTATGGCATGAGTAACGAAGAATATAATAAATTAACCGAAGATAAAAAAGAACCTCTGCTCAACAAGTTCCAGATTACAACTTCACCAGGTTCAACTCAAAAAATATTAACAGCAATGATTGGGTTAAATAACAAAACATTAGACGATAAAACAAGTTATAAAATCGATGGTAAAGGTTGGCAAAAAGATAAATCTTGGGGTGGTTACAACGTTACAAGATATGAAGTGGTAAATGGTAATATCGACTTAAAACAAGCAATAGAATCATCAGATAACATTTTCTTTGCTAGAGTAGCACTCGAATTAGGCAGTAAGAAATTTGAAAAAGGCATGAAAAAACTAGGTGTTGGTGAAGATATACCAAGTGATTATCCATTTTATAATGCTCAAATTTCAAACAAAAATTTAGATAATGAAATATTATTAGCTGATTCAGGTTACGGACAAGGTGAAATACTGATTAACCCAGTACAGATCCTTTCAATCTATAGCGCATTAGAAAATAATGGCAATATTAACGCACCTCACTTATTAAAAGACACGAAAAACAAAGTTTGGAAGAAAAATATTATTTCCAAAGAAAATATCAATCTATTAACTGATGGTATGCAACAAGTCGTAAATAAAACACATAAAGAAGATATTTATAGATCTTATGCAAACTTAATTGGCAAATCCGGTACTGCAGAACTCAAAATGAAACAAGGAGAAACTGGCAGACAAATTGGGTGGTTTATATCATATGATAAAGATAATCCAAACATGATGATGGCTATTAATGTTAAAGATGTACAAGATAAAGGAATGGCTAGCTACAATGCCAAAATCTCAGGTAAAGTGTATGATGAGCTATATGAGAACGGTAATAAAAAATACGATATAGATGAATAACAAAACAGTGAAGCAATCCGTAACGATGGTTGCTTCACTGTTTTATTATGAATTATTAATAAGTGCTGTTACTTCTCCCTTAAATACAATTTCTTCATTT | ||
| 2107 | ||
| ``` | ||
| We confirmed that the length of the gene matches what we'd expect for _mecA_. | ||
| In this case, there is only one sequence that spans the entire length of the gene. | ||
| After this gene was sequenced, all of the reads were assembled together into a single consensus sequence. | ||
| _mecA_ is a well-characterized gene, so there are no ambiguous regions, and there is no need for there to be multiple contigs | ||
| (AKA for the gene to be broken up into multiple pieces, as we know exactly how the reads should be assembled together.). | ||
|
|
||
|
|
||
| Let's try reading in a larger FASTQ file. | ||
|
|
||
| The raw reads for a _Staphylococcus aureus_ isolate were sequenced with Illumina and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625). | ||
| The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625). | ||
|
|
||
| The BioSample ID for this sample is `SAMN02360768`. | ||
| This ID refers to the physical bacterial isolate. | ||
| The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS515580`. | ||
| Both values correspond to one another and are helpful identifiers. | ||
|
|
||
| The SRR (sample run accession number) is the unique identifier within SRA | ||
| and corresponds to the specific sequencing run. | ||
| There are two sequencing runs and accession numbers for this sample, | ||
| but we'll select one to download: `SRX392511`. | ||
|
|
||
| This file can be downloaded on the command line using `curl`. | ||
| > One of the nice things about julia is that it is | ||
| > super easy to toggle between the julia REPL and | ||
| > bash shell by typing `;` to access shell mode | ||
| > from julia. | ||
| > You can use the `backspace`/`delete` key to | ||
| > quickly toggle back to the julia REPL. | ||
|
|
||
| To download using the command line, type: | ||
| ``` | ||
| curl -L --retry 5 --retry-delay 2 \ | ||
| "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRX392511" \ | ||
| | gzip -c > SRX392511.fastq.gz | ||
| ``` | ||
| Alternatively, command line code can be executed from within julia: | ||
|
|
||
| ```julia | ||
| run(pipeline( | ||
| `curl -L --retry 5 --retry-delay 2 "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRX392511"`, | ||
| `gzip -c`, | ||
| "SRX392511.fastq.gz" | ||
| ) | ||
| ) | ||
| ``` | ||
| This file is gzipped, so we'll need to account for that as we are reading it in. | ||
|
|
||
| Instead of printing out every record (this isn't super practical because it's a big file), let's save all the records into a vector. | ||
|
|
||
| ```julia | ||
| using CodecZlib | ||
|
|
||
| records = [] | ||
|
|
||
| FASTQReader(GzipDecompressorStream(open("assets/SRX392511.fastq.gz"))) do reader | ||
| for record in reader | ||
| push!(records, record) | ||
| end | ||
| end | ||
| ``` | ||
|
|
||
| We can see how many reads there are by looking at the length of `records`. | ||
|
|
||
| ```julia | ||
| julia> length(records) | ||
| 9852716 | ||
| ``` | ||
|
|
||
| Let's take a look at what the first 10 reads look like: | ||
|
|
||
| ``` | ||
| julia> records[1:10] | ||
| 10-element Vector{Any}: | ||
| FASTX.FASTQ.Record("SRX392511.1 1 length=101", "AGGATTTTGTTATATTGTA…", "$$$$$$$$$$$$$$$$$$$…") | ||
| FASTX.FASTQ.Record("SRX392511.1 1 length=101", "AGCATAAATTTAAAAAAAA…", "$$$$$$$$$$$$$$$$$$$…") | ||
| FASTX.FASTQ.Record("SRX392511.2 2 length=101", "TAGATTAAAATTCTCGTAT…", "???????????????????…") | ||
| FASTX.FASTQ.Record("SRX392511.2 2 length=101", "TAGATTAAAATTCTCGTAG…", "???????????????????…") | ||
| FASTX.FASTQ.Record("SRX392511.3 3 length=101", "GAGCAGTAGTATAAAATGA…", "???????????????????…") | ||
| FASTX.FASTQ.Record("SRX392511.3 3 length=101", "CCGACACAATTACAAGCCA…", "???????????????????…") | ||
| FASTX.FASTQ.Record("SRX392511.4 4 length=101", "GATTATCTAGTCATAATTC…", "???????????????????…") | ||
| FASTX.FASTQ.Record("SRX392511.4 4 length=101", "TTCGCCCCCCCAAAAGGCT…", "???????????????????…") | ||
| FASTX.FASTQ.Record("SRX392511.5 5 length=101", "ATAAAATGAACTTGCGTTA…", "???????????????????…") | ||
| FASTX.FASTQ.Record("SRX392511.5 5 length=101", "TAACTTGTGGATAATTATT…", "???????????????????…") | ||
| ``` | ||
|
|
||
| All of the nucleotides in these two reads have a quality score of `$`, which corresponds to a probabilty of error of 0.50119. | ||
| This is not concerning, as it is typical for the first couple of reads to be low quality. | ||
|
|
||
| Let's calculate the average quality across all reads. | ||
| We'll do this by converting each PHRED score to its probability error, | ||
| and summing these values across all sequences in all reads. | ||
| Then, we can divide this sum by the total number of bases | ||
| to get the average probability error for the entire sequence. | ||
| It's important that we first convert the PHRED scores to the probability errors before averaging, | ||
| because PHRED is a logarithmic transform of error probability. | ||
| Therefore, simply averaging PHRED and then converting to error probability is not the same as converting PHRED to error probability and averaging that sum. | ||
|
|
||
| ```julia | ||
| function phred_to_prob(Q) | ||
| # The formula is Pe = 10^(-Q/10) | ||
| return 10^(-Q / 10.0) | ||
| end | ||
|
|
||
| # get sum of all error probabilities | ||
| total_error_prob = sum( | ||
| sum(phred_to_prob(q) for q in FASTQ.quality_scores(record)) | ||
| for record in records | ||
| ) | ||
| 4.5100356107423276e7 | ||
| # get total number of base pairs | ||
| total_bases = sum(length(FASTQ.quality_scores(record)) for record in records) | ||
| 995124316 | ||
|
|
||
| # get average error probability | ||
| total_error_prob/total_bases | ||
| 0.045321328584059316 | ||
| ``` | ||
|
|
||
| The average error probability is just 4.53%, | ||
| meaning the majority of reads are high quality. | ||
|
|
||
| More information about how to convert ASCII values/PHRED scores to quality scores [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html). | ||
|
|
||
|
|
||
|
|
||
|
|
||
| Now that we've learned how to read files in and manipulate them a bit, | ||
| let's see if we can align the _mecA_ gene to the _Staphylococcus aureus_ genome. | ||
| This will tell us if this _S. aureus_ is MRSA. | ||
|
|
||
|
|
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.