bioawk

TAFFISH wrapper for lh3/bioawk, a BWK awk-derived command-line tool with built-in parsing support for common biological text formats.

Package

name: bioawk
command: taf-bioawk
upstream: lh3/bioawk
packaged upstream version: v1.0
TAFFISH package version: 1.0-r1
platforms: linux/amd64, linux/arm64
wrapper license: Apache-2.0
upstream license: Lucent permissive notice in the bundled BWK awk source; the upstream repository does not provide a separate SPDX-style LICENSE file

What Is Included

This app builds bioawk from the upstream v1.0 source archive and installs the single bioawk executable together with upstream README/FIXES/awk manual materials and the Lucent notice extracted from the bundled BWK awk source.

Bioawk extends BWK awk with:

-c fastx for FASTA/FASTQ records with $name, $seq, $qual, and $comment;
-c sam, -c vcf, -c bed, and -c gff with named biological fields;
-c hdr / -c header for table columns named by the first input line;
-t, equivalent to tab input/output delimiters;
biological helper functions such as revcomp();
gzip input support when -c mode is used.

There is no external database, reference bundle, model, service, or interpreter required at runtime.

Usage

Show TAFFISH wrapper help:

taf-bioawk --help

Show the packaged upstream tag through TAFFISH package metadata:

taf-bioawk --version

Run the upstream executable explicitly:

taf-bioawk bioawk --version
taf-bioawk bioawk -c help

For awk programs, use an awk program file with upstream -f. This preserves the program text exactly through the current TAFFISH generated-shell boundary:

printf '{print $name "\t" length($seq)}\n' > lengths.awk
taf-bioawk bioawk -c fastx -f lengths.awk reads.fa > lengths.tsv

Because this tool app enables TAFFISH command mode, use the explicit upstream command for ordinary bioawk invocations:

taf-bioawk bioawk -c fastx -f lengths.awk reads.fq

Option-leading calls also work with the wrapper -- separator:

taf-bioawk -- -c fastx -f lengths.awk reads.fq

Important wrapper boundary: current TAFFISH joins raw wrapper arguments into generated shell. Inline awk one-liners such as '{print $name, length($seq)}' are therefore not reliable through taf-bioawk, because spaces, $, braces, parentheses, and quotes may be reinterpreted by the generated shell before bioawk sees them. This is a wrapper argument-boundary issue, not an upstream bioawk limitation. Put non-trivial awk programs in a file and pass that file with -f.

The upstream bioawk --version output reports the underlying BWK awk core version (awk version 20110810) rather than the bioawk release tag. The TAFFISH package version and /opt/bioawk/share/doc/bioawk/source.txt bind this container to upstream tag v1.0 and source checksum 5cbef3f39b085daba45510ff450afcf943cfdfdd483a546c8a509d3075ff51b5.

Examples

FASTA sequence lengths:

printf '{print $name "\t" length($seq)}\n' > lengths.awk
taf-bioawk bioawk -c fastx -f lengths.awk genome.fa > lengths.tsv

Reverse complement FASTA:

printf '{print ">"$name; print revcomp($seq)}\n' > revcomp.awk
taf-bioawk bioawk -c fastx -f revcomp.awk input.fa > rc.fa

Read compressed FASTA/Q directly:

taf-bioawk bioawk -c fastx -f lengths.awk reads.fq.gz

Extract unmapped SAM records:

printf 'and($flag, 4)\n' > unmapped.awk
taf-bioawk bioawk -c sam -f unmapped.awk alignments.sam > unmapped.sam

Use VCF header-derived sample fields:

printf '{print $_CHROM, $POS, $sampleA}\n' > sample-column.awk
grep -v '^##' variants.vcf | taf-bioawk bioawk -tc hdr -f sample-column.awk

Boundaries

This package is intentionally a thin bioawk runtime. It does not bundle samtools, bcftools, bedtools, reference genomes, annotation databases, workflow scripts, or large example datasets. If a pipeline needs to transform BAM to SAM or pre-filter VCF/BED/GFF inputs, use the relevant TAFFISH tool app upstream of bioawk.

Bioawk can read gzip-compressed biological inputs only when -c mode is active, matching the upstream README. Plain awk mode falls back to the original BWK awk line reader.

Verification

The container build and TAFFISH smoke checks verify:

source archive checksum and upstream tag binding;
bioawk --version and bioawk -c help;
dynamic library completeness with ldd;
FASTA/FASTQ parsing and revcomp();
gzip FASTA input in -c fastx mode;
SAM named fields;
VCF named fields and header-derived table fields;
wrapper-level use of upstream -f program files, which is the recommended TAFFISH invocation style for awk programs.

These smoke checks validate packaging and basic runtime behavior. They do not replace scientific validation of a full downstream workflow.

Upstream

homepage: https://github.com/lh3/bioawk
release: https://github.com/lh3/bioawk/releases/tag/v1.0
README/manual: https://github.com/lh3/bioawk#readme

License

The TAFFISH app packaging files are licensed under Apache-2.0.

The packaged upstream bioawk source is derived from BWK awk and carries the Lucent Technologies permissive notice in the source tree. The upstream repository does not provide a separate SPDX-style LICENSE file; the extracted notice is preserved in the image under /opt/bioawk/share/licenses/bioawk/LICENSE.Lucent.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github/workflows		.github/workflows
docker		docker
docs		docs
src		src
target		target
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
taffish.toml		taffish.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

bioawk

Package

What Is Included

Usage

Examples

Boundaries

Verification

Upstream

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

bioawk

Package

What Is Included

Usage

Examples

Boundaries

Verification

Upstream

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages