taffish/bioawk


bioawk

TAFFISH wrapper for lh3/bioawk, a BWK awk-derived command-line tool with built-in parsing support for common biological text formats.

Package

  • name: bioawk
  • command: taf-bioawk
  • upstream: lh3/bioawk
  • packaged upstream version: v1.0
  • TAFFISH package version: 1.0-r1
  • platforms: linux/amd64, linux/arm64
  • wrapper license: Apache-2.0
  • upstream license: Lucent permissive notice in the bundled BWK awk source; the upstream repository does not provide a separate SPDX-style LICENSE file

What Is Included

This app builds bioawk from the upstream v1.0 source archive and installs the single bioawk executable together with upstream README/FIXES/awk manual materials and the Lucent notice extracted from the bundled BWK awk source.

Bioawk extends BWK awk with:

  • -c fastx for FASTA/FASTQ records with $name, $seq, $qual, and $comment;
  • -c sam, -c vcf, -c bed, and -c gff with named biological fields;
  • -c hdr / -c header for table columns named by the first input line;
  • -t, which sets both the input and output field delimiters to tab;
  • biological helper functions such as revcomp();
  • gzip input support when -c mode is used.

There is no external database, reference bundle, model, service, or interpreter required at runtime.

Usage

Show TAFFISH wrapper help:

taf-bioawk --help

Show the packaged upstream tag through TAFFISH package metadata:

taf-bioawk --version

Run the upstream executable explicitly:

taf-bioawk bioawk --version
taf-bioawk bioawk -c help

For awk programs, use an awk program file with upstream -f. This preserves the program text exactly through the current TAFFISH generated-shell boundary:

printf '{print $name "\t" length($seq)}\n' > lengths.awk
taf-bioawk bioawk -c fastx -f lengths.awk reads.fa > lengths.tsv
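
For programs longer than one line, a quoted here-document is an alternative to printf that needs no escaping at all. The following is a minimal sketch: the file name lengths-filtered.awk, the 100-base threshold, and the input reads.fa are illustrative placeholders, and the final call is guarded so the snippet does nothing harmful where taf-bioawk or the input file is absent.

```shell
# Write a multi-line awk program verbatim. The quoted 'EOF' delimiter stops
# the calling shell from expanding $name, $seq, or backslash sequences.
cat > lengths-filtered.awk <<'EOF'
# Print name and length for sequences of at least 100 bases.
length($seq) >= 100 {
    print $name "\t" length($seq)
}
EOF

# Run only where the wrapper and the (placeholder) input file both exist.
if command -v taf-bioawk >/dev/null 2>&1 && [ -f reads.fa ]; then
    taf-bioawk bioawk -c fastx -f lengths-filtered.awk reads.fa > lengths-filtered.tsv
fi
```

Because the program text lives in a file, only the file name crosses the wrapper boundary, and the program reaches bioawk byte for byte.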

Because this tool app enables TAFFISH command mode, name the upstream command explicitly (taf-bioawk bioawk ...) for ordinary bioawk invocations:

taf-bioawk bioawk -c fastx -f lengths.awk reads.fq

Option-leading calls also work with the wrapper -- separator:

taf-bioawk -- -c fastx -f lengths.awk reads.fq

Important wrapper boundary: current TAFFISH joins the raw wrapper arguments into a generated shell command. Inline awk one-liners such as '{print $name, length($seq)}' are therefore not reliable through taf-bioawk, because spaces, $, braces, parentheses, and quotes may be reinterpreted by the generated shell before bioawk sees them. This is a wrapper argument-boundary issue, not an upstream bioawk limitation. Put non-trivial awk programs in a file and pass that file with -f.
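
The general failure mode can be reproduced without TAFFISH. The sketch below uses a hypothetical join_and_reparse function as a stand-in for any wrapper that joins its arguments into a string and hands that string to a second shell. It is not TAFFISH's actual code, only an illustration of why re-parsed program text cannot be trusted.

```shell
# Hypothetical stand-in for a wrapper that joins its arguments into one string
# and lets a new shell parse that string again. NOT TAFFISH's real code.
join_and_reparse() {
    sh -c "printf '%s\n' $*"
}

# Reference behaviour: every argument is forwarded untouched.
pass_through() {
    printf '%s\n' "$@"
}

program='{print $name, length($seq)}'

intact=$(pass_through "$program")
# The second shell sees $name, $seq, and the parentheses as shell syntax.
mangled=$(join_and_reparse "$program" 2>&1) || mangled="re-parse failed: $mangled"

printf 'direct:    %s\n' "$intact"
printf 're-parsed: %s\n' "$mangled"
```

The direct path prints the program unchanged, while the re-parsed path either fails or yields altered text, depending on the shell. Passing a file name with -f sidesteps the problem because a plain file name contains nothing for the second shell to reinterpret.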

The upstream bioawk --version output reports the underlying BWK awk core version (awk version 20110810) rather than the bioawk release tag. The TAFFISH package version and /opt/bioawk/share/doc/bioawk/source.txt bind this container to upstream tag v1.0 and source checksum 5cbef3f39b085daba45510ff450afcf943cfdfdd483a546c8a509d3075ff51b5.

Examples

FASTA sequence lengths:

printf '{print $name "\t" length($seq)}\n' > lengths.awk
taf-bioawk bioawk -c fastx -f lengths.awk genome.fa > lengths.tsv

Reverse complement FASTA:

printf '{print ">"$name; print revcomp($seq)}\n' > revcomp.awk
taf-bioawk bioawk -c fastx -f revcomp.awk input.fa > rc.fa

Read compressed FASTA/Q directly:

taf-bioawk bioawk -c fastx -f lengths.awk reads.fq.gz

Extract unmapped SAM records:

printf 'and($flag, 4)\n' > unmapped.awk
taf-bioawk bioawk -c sam -f unmapped.awk alignments.sam > unmapped.sam

Use VCF header-derived sample fields:

printf '{print $_CHROM, $POS, $sampleA}\n' > sample-column.awk
grep -v '^##' variants.vcf | taf-bioawk bioawk -tc hdr -f sample-column.awk
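
The VCF example relies on grep -v '^##' leaving the #CHROM line as the first input line, so that -c hdr can name columns from it. That is presumably why the program refers to $_CHROM. The sketch below builds a tiny made-up VCF (demo.vcf, with an invented sample named sampleA) to make the preprocessing step visible. The bioawk call is guarded and runs only where taf-bioawk is installed.

```shell
# Tiny illustrative VCF: one ## meta line, the #CHROM header, one record.
printf '##fileformat=VCFv4.2\n' > demo.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\n' >> demo.vcf
printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\n' >> demo.vcf

# Drop the ## meta lines so the #CHROM line becomes the first input line.
grep -v '^##' demo.vcf > demo.body.tsv
head -n 1 demo.body.tsv

# Same program and flags as the example above, run only where the wrapper exists.
if command -v taf-bioawk >/dev/null 2>&1; then
    printf '{print $_CHROM, $POS, $sampleA}\n' > sample-column.awk
    taf-bioawk bioawk -tc hdr -f sample-column.awk demo.body.tsv
fi
```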

Boundaries

This package is intentionally a thin bioawk runtime. It does not bundle samtools, bcftools, bedtools, reference genomes, annotation databases, workflow scripts, or large example datasets. If a pipeline needs to transform BAM to SAM or pre-filter VCF/BED/GFF inputs, use the relevant TAFFISH tool app upstream of bioawk.

Bioawk can read gzip-compressed biological inputs only when -c mode is active, matching the upstream README. Plain awk mode falls back to the original BWK awk line reader.
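
For compressed input in plain awk mode, one workaround is to decompress upstream of bioawk and pipe the result in. The sketch below uses illustrative data and file names (table.tsv.gz, long-rows.awk). It falls back to the system awk when taf-bioawk is not installed, which is an assumption made only so the snippet runs outside the container.

```shell
# Build a small gzip-compressed, tab-separated table (illustrative data).
printf 'geneA\t120\ngeneB\t45\n' | gzip -c > table.tsv.gz

# Plain awk program; FS and OFS are set inside the file, so no -t or -F
# has to cross the wrapper boundary.
cat > long-rows.awk <<'EOF'
BEGIN { FS = OFS = "\t" }
$2 >= 100 { print $1, $2 }
EOF

# Decompress first, then feed the plain text to bioawk on standard input.
if command -v taf-bioawk >/dev/null 2>&1; then
    gzip -dc table.tsv.gz | taf-bioawk bioawk -f long-rows.awk > long-rows.tsv
else
    # Fallback for environments without the wrapper: ordinary awk syntax.
    gzip -dc table.tsv.gz | awk -f long-rows.awk > long-rows.tsv
fi

cat long-rows.tsv
```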

Verification

The container build and TAFFISH smoke checks verify:

  • source archive checksum and upstream tag binding;
  • bioawk --version and bioawk -c help;
  • dynamic library completeness with ldd;
  • FASTA/FASTQ parsing and revcomp();
  • gzip FASTA input in -c fastx mode;
  • SAM named fields;
  • VCF named fields and header-derived table fields;
  • wrapper-level use of upstream -f program files, which is the recommended TAFFISH invocation style for awk programs.

These smoke checks validate packaging and basic runtime behavior. They do not replace scientific validation of a full downstream workflow.

Upstream

lh3/bioawk, packaged at upstream tag v1.0.

License

The TAFFISH app packaging files are licensed under Apache-2.0.

The packaged upstream bioawk source is derived from BWK awk and carries the Lucent Technologies permissive notice in the source tree. The upstream repository does not provide a separate SPDX-style LICENSE file; the extracted notice is preserved in the image under /opt/bioawk/share/licenses/bioawk/LICENSE.Lucent.
