
Konflux integration #145

Open
AdamSaleh wants to merge 57 commits into main from konflux-integration


@AdamSaleh AdamSaleh commented May 25, 2026

Collaborator

Summary

Comprehensive integration test infrastructure for the GitOps operator in Konflux CI. The pipeline provisions an ephemeral HyperShift (EaaS) cluster on every run, installs the operator under test via OLM from the FBC catalog image, executes Ginkgo test suites from a QA fork, and pushes structured results to Quay and a results dashboard repo.

What this branch introduces

Pipeline structure

  • Ephemeral HyperShift cluster (ARM64, configurable OCP version) provisioned per run via EaaS
  • Three-layer test image: heavy base (tools + Go) → pre-compiled Ginkgo binaries → scripts/config rebuilt on every push — no need to rebuild the base for script changes
  • Gate labels on PRs control which expensive scenarios run (rc-sanity-check, rc-operator-check, rc-ui-check, etc.)
  • All scenarios are optional; pipeline-wrapup always runs regardless of test outcome

Test suites

| Suite | Script | Notes |
|---|---|---|
| Sanity / smoke | run-sanity-tests.sh | Fast subset, triggered on every labeled PR |
| Sequential shard 1 | run-sequential-tests-shard1.sh | ~23 test files |
| Sequential shard 2 | run-sequential-tests-shard2.sh | ~20 test files |
| Parallel | run-parallel-tests.sh | Full parallel suite |
| Argo Rollouts | run-rollouts-tests.sh | |
| UI e2e | run-ui-e2e-tests.sh | Playwright against OCP console GitOps plugin |
| ArgoCD upstream e2e | run-argocd-e2e-tests.sh | Standalone ArgoCD, not the operator |
| DAST | RapiDAST/ZAP | Security scan of ArgoCD REST API |

QA fork and downstream branches

Tests run from rh-gitops-release-qa/gitops-operator — a fork carrying downstream-specific patches (relaxed image assertions for registry.redhat.io images, OCP guided tour dismissal in Playwright). One branch per channel:

| Channel | Branch |
|---|---|
| latest | konflux-integration-latest |
| gitops-1.21 | konflux-integration-1.21 |
| gitops-1.20 | konflux-integration-1.20 |
| gitops-1.19 | konflux-integration-1.19 |

Scenarios

  • gitops-operator-tests.yaml / gitops-sanity-tests.yaml / gitops-ui-tests.yaml — latest channel
  • gitops-channel-tests-1-{21,20,19}.yaml — one file per supported channel, each containing sanity, upgrade-sanity, sequential (2 shards), parallel, rollouts, UI e2e, ArgoCD e2e, and DAST scenarios
  • gitops-argocd-tests.yaml / gitops-dast.yaml — standalone upstream ArgoCD and DAST

Log storage (Quay / ORAS)

Each task uploads logs incrementally as OCI artifacts during the run:

quay.io/devtools_gitops/test_image:<pipelinerun>-<task>-logs

pipeline-wrapup pulls all per-task artifacts, merges with cluster-level logs, and pushes a combined bundle:

quay.io/devtools_gitops/test_image:<pipelinerun>-logs

Artifacts expire after 7 days.

Results dashboard

publish-results.sh + render-results.py write structured JUnit summaries to rh-gitops-midstream/catalog-results. Each run appends a JSONL record under gitops-operator/<version>/ocp-<ver>/results.jsonl and the README table is re-rendered.
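The append-only JSONL layout described above can be sketched as follows. This is a minimal illustration and not the code in publish-results.sh or render-results.py: the function names and the record fields are assumptions, and only the `gitops-operator/<version>/ocp-<ver>/results.jsonl` path shape comes from the description.

```python
import json
from pathlib import Path


def append_result(root: Path, version: str, ocp: str, record: dict) -> Path:
    """Append one run record as a single JSON line and return the file path."""
    # Path shape taken from the PR text; everything else here is hypothetical.
    path = root / "gitops-operator" / version / f"ocp-{ocp}" / "results.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return path


def load_results(path: Path) -> list:
    """Read every record back, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because each run only appends a line, earlier records are never rewritten, and a renderer can rebuild its tables from the full history on every run.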

Code quality fixes (applied in this branch)

The following issues from a thorough code review were resolved before this PR:

  • Shell safety: all test runner scripts now use set -exo pipefail; git fetch scoped to the target branch with --depth=1
  • Shell injection: send-slack-message.py rewritten to use shell=False with explicit argument lists; collect-build-metadata.sh heredoc switched to single-quoted 'EOF' with env var passing
  • Gate label auth: check-gate-labels.yaml now accepts a github-token param to avoid silent bypass when GitHub API rate-limits unauthenticated shared-IP calls
  • Process leak: install-operator.sh background SA patch loop now has trap ... EXIT cleanup
  • Publish retry: publish-results.sh retry loop rewritten to exit 1 if all 3 attempts fail
  • Docker Hub images: resolve-openshift-version.yaml and extract-image-content-sources.yaml moved from docker.io/python:3-alpine to registry.access.redhat.com/ubi9/python-311
  • Render safety: render-results.py wraps the delete+render cycle in try/except with git checkout -- . rollback on failure
  • Typo / missing test: run-sequential-tests-shard1.sh focus list corrected (valiate → validate, missing _test.go suffixes)
  • Pull secret verification: the | head -3 filter in install-operator.sh was removed so all injected registries are verified, not just the first three
  • Shallow clone: extract-image-content-sources.yaml uses --depth=1; handles both SHAs and branch names
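The shell-injection item in the list above follows a general pattern: pass an explicit argument list with shell=False so that untrusted text is never parsed by a shell. The sketch below illustrates that pattern only; it is not the code from send-slack-message.py, and the function name and the child command are assumptions chosen to keep the example self-contained.

```python
import subprocess
import sys


def echo_via_subprocess(message: str) -> str:
    """Pass untrusted text as one argv element; no shell ever interprets it."""
    result = subprocess.run(
        # Explicit argument list: the message is a single argument, whatever it contains.
        [sys.executable, "-c", "import sys; print(sys.argv[1])", message],
        shell=False,  # the default, stated here for emphasis
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.rstrip("\n")


if __name__ == "__main__":
    # With shell=True and string interpolation, "$(id)" would have been executed.
    print(echo_via_subprocess("build failed; $(id) `whoami`"))
```

The metacharacters reach the child process unchanged, which is the property the rewrite relies on.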

Documentation

See .tekton/integration-tests/README.md for a full description of the pipeline structure, image layers, when to rebuild the base image, all scripts, Quay log storage, and the catalog-results repo.

@AdamSaleh

Collaborator Author

/retest


@AdamSaleh AdamSaleh force-pushed the konflux-integration branch from 2e84399 to acd01fe on June 5, 2026 12:56
@AdamSaleh

Collaborator Author

/retest


There are currently four test-suites being run:
- gitops-operator's e2e ginkgo test-suite, sharded into 3 scripts
- the rollouts e2e tests
- gitops operator's ui test verifying login (more tests to come)
- the argocd tests in a separate pipeline

There is a simple parameterized pipeline, where you can choose:
- the openshift version
- size of cluster nodes
- the channel to be used in the catalog
- the test-script to run

A second, separate pipeline installs standalone argocd and runs the e2e tests.

All the tests are run from a precompiled docker image;
the pipeline checks at the start and rebuilds the images if they have
changed. The test and utility scripts always get copied.

The logs get uploaded to quay.
At the end of the pipeline, a message is sent to the
gitops-test-notification channel on Slack.

The code is mostly authored by prompting claude and tested
against the v1.20 branch of the catalog repo.

Assisted-by: Claude <usersafety@anthropic.com>
Signed-off-by: Adam Saleh <adam@asaleh.net>
AdamSaleh and others added 2 commits June 23, 2026 16:17
ZAP failures land in failedTests as "dast.high/[HIGH] SQL Injection
(alertRef=40018)". The old fallback split on `: /` and returned
"(alertRef=40018)" — useless. Add a DAST-specific branch that strips
the classname prefix and alertRef suffix, leaving "[HIGH] SQL Injection".
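The DAST-specific branch described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual parser code: the function name and the regular expression are hypothetical, and the input shape is taken from the single example in this commit message.

```python
import re

# Trailing " (alertRef=...)" suffix; the pattern is inferred from the one sample above.
_ALERT_REF = re.compile(r"\s*\(alertRef=[^)]*\)\s*$")


def short_dast_name(failed_test: str) -> str:
    """Strip the classname prefix and the alertRef suffix from a failedTests entry."""
    # Everything up to the first "/" is treated as the classname (e.g. "dast.high").
    _, sep, name = failed_test.partition("/")
    if not sep:
        name = failed_test  # no classname prefix present
    return _ALERT_REF.sub("", name).strip()


if __name__ == "__main__":
    print(short_dast_name("dast.high/[HIGH] SQL Injection (alertRef=40018)"))
```

For the sample entry this yields "[HIGH] SQL Injection", the label the commit says should remain.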

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The sidecar referenced /usr/local/bin/collect-logs-sidecar.sh which
was not present in the overlay image, causing an immediate failure.
DAST doesn't need live cluster-pod-log snapshots — rapidast writes
its own output and collect-results handles the final artifact upload.
Also remove the now-unused namespace param.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@AdamSaleh

Collaborator Author

/retest

ArgoCD's /api/v1/stream/* endpoints are Server-Sent Events that keep
connections open indefinitely. ZAP's openapi job times out fetching
them, marks the plan as failed, and skips the active scan and report
generation entirely.

Fix: download swagger.json from ArgoCD in run-dast, strip /stream/
paths with a one-liner Python filter, write to swagger-filtered.json,
and pass it via apiFile instead of apiUrl.
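A filter of the kind described above might look like the sketch below. It is an expanded, readable form for illustration rather than the one-liner used in run-dast; the function name and the sample document are assumptions.

```python
import json


def strip_stream_paths(spec: dict) -> dict:
    """Return a copy of a Swagger/OpenAPI document without any '/stream/' paths."""
    filtered = dict(spec)  # shallow copy; the input dict is left untouched
    filtered["paths"] = {
        path: item
        for path, item in spec.get("paths", {}).items()
        if "/stream/" not in path
    }
    return filtered


if __name__ == "__main__":
    # Sample document for illustration only.
    sample = {
        "swagger": "2.0",
        "paths": {
            "/api/v1/applications": {"get": {}},
            "/api/v1/stream/applications": {"get": {}},
        },
    }
    print(json.dumps(strip_stream_paths(sample), indent=2))
```

Writing the result to a file and pointing the scanner at that file keeps the long-lived streaming endpoints out of the import step entirely.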

Also fix parse-dast-results.py find_zap_json: RapiDAST writes reports
to {results}/{shortName}/DAST-{date}-RapiDAST-{shortName}/zap/zap-report.json
(two levels deep), but the old patterns only searched one level deep.
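One way to make the lookup independent of nesting depth is a recursive search, sketched below. This is an assumption about a possible approach, not the actual find_zap_json body; the tie-break for multiple reports is also hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_zap_json(results_dir: Path) -> Optional[Path]:
    """Search recursively so zap-report.json is found at any nesting depth."""
    matches = sorted(results_dir.rglob("zap-report.json"))
    # Assumption: if several reports exist, take the last one in sorted order.
    return matches[-1] if matches else None
```

A recursive search trades some precision for robustness: it keeps working if the directory layout gains another level, at the cost of possibly matching a stale report left in the results directory.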

Verified with a full local scan: 103 URLs, active scan completes,
all reports extracted, parser finds the JSON at the correct path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@red-hat-konflux

Contributor

Caution

There are some errors in your PipelineRun template.

PipelineRun Error
tasks/test-dast.yaml yaml validation error: line 180: could not find expected ':'

…error

Multi-line Python at column 0 inside script: | terminates the YAML literal
block scalar, causing "could not find expected ':'". Collapse to a single
python3 -c line to stay within the block's indentation boundary.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@AdamSaleh

Collaborator Author

/retest

AdamSaleh and others added 10 commits June 29, 2026 13:24
… surfacing

render-results.py:
- Add testScript as grouping dimension so each test type (sanity,
  sequential-s1/s2, parallel, rollouts, ui, dast) gets its own leaf
  README and matrix column
- OCP-level README shows variant × test-type matrix; columns derived
  from all historical runs for the product+OCP so gaps show as "—"
- Each OCP matrix cell links to its Konflux UI pipelinerun via logUrl
- Version-level README shows per-test-type breakdown with p/f/s counts;
  each line links directly to its leaf README
- Product-level README cells link to the OCP-level README
- Summary levels (product, version, top) collapse across test types
  showing worst status per variant
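The "worst status per variant" collapse in the last item can be sketched as below. The severity ordering, the record fields, and the function name are all assumptions made for illustration; render-results.py may rank statuses differently.

```python
# Assumed ordering: a failure outranks a skip, which outranks a pass.
SEVERITY = {"passed": 0, "skipped": 1, "failed": 2}


def worst_status_per_variant(runs):
    """Collapse per-test-type records to a single status for each variant."""
    worst = {}
    for run in runs:
        variant, status = run["variant"], run["status"]
        current = worst.get(variant)
        if current is None or SEVERITY[status] > SEVERITY[current]:
            worst[variant] = status
    return worst


if __name__ == "__main__":
    runs = [
        {"variant": "default", "testScript": "sanity", "status": "passed"},
        {"variant": "default", "testScript": "dast", "status": "failed"},
        {"variant": "fips", "testScript": "sanity", "status": "passed"},
    ]
    print(worst_status_per_variant(runs))
```

With this rule a summary cell turns red as soon as any test type for that variant fails, so a passing sanity run cannot mask a failing DAST run.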

run-ui-e2e-tests.sh:
- Synthesize a minimal JUnit failure when Playwright exits non-zero but
  writes no usable test output (e.g. global setup crash before any tests)
- Copy JUnit to ${SHARED_DIR}/results/ so wrapup task can find it even
  if the ORAS artifact pull fails
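Synthesizing a minimal JUnit failure, as described for run-ui-e2e-tests.sh above, could look like this sketch. The element and attribute names follow the common JUnit XML shape; the function name, the test case name, and the message format are assumptions rather than what the script actually emits.

```python
import xml.etree.ElementTree as ET


def synthesize_junit_failure(suite: str, exit_code: int, reason: str) -> str:
    """Build a one-test JUnit document that records a runner-level failure."""
    testsuites = ET.Element("testsuites")
    testsuite = ET.SubElement(
        testsuites, "testsuite",
        name=suite, tests="1", failures="1", errors="0", skipped="0",
    )
    # "runner-exit" is a hypothetical test case name for the synthetic entry.
    testcase = ET.SubElement(testsuite, "testcase", classname=suite, name="runner-exit")
    failure = ET.SubElement(testcase, "failure", message=f"exit code {exit_code}")
    failure.text = reason
    return ET.tostring(testsuites, encoding="unicode")


if __name__ == "__main__":
    print(synthesize_junit_failure("ui-e2e", 1, "global setup crashed before any test ran"))
```

A synthetic entry like this lets downstream parsers count the run as failed instead of silently treating missing output as "no tests".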

collect-and-upload-logs.sh:
- Add ${SHARED_DIR}/results as fallback JUnit search path so UI auth
  failures are parsed and surfaced in the results summary

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
With set -euo pipefail, if grep finds no match in a command
substitution (OC_TOKEN=$(curl...|grep...)), the pipe failure exits
the script before the emptiness check can print a useful error.

- Add || true to all grep token-extraction pipelines so the
  if [[ -z ... ]] checks actually fire with a clear message
- Add --max-time 30 to all curl calls so a hung OAuth/API endpoint
  fails in 30s rather than hanging the step indefinitely
- Split base64 -d onto its own line to keep each extraction step
  independently checkable

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the kubeadmin OAuth token request fails, log the first 20 lines
of the response (status line + headers) so the root cause is visible
in the pipeline logs — whether it is a 401 wrong credentials, a 302
redirect without the expected token fragment, a connection error, or
a timeout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The get-cluster-info step was reading `.data.kubeadmin` from the
kube-system/kubeadmin secret, which contains the htpasswd hash, not
the plain-text password. This caused OAuth authentication to fail on
ephemeral clusters.

Now the get-kubeconfig step fetches the admin password from the CTI's
`.status.adminPassword.name` secret (matching what the
eaas-get-ephemeral-cluster-credentials StepAction does), and
get-cluster-info reads it from /credentials/*password — with a
fallback to `.data.password` (the correct key) from kube-system.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ephemeral clusters may not have their OAuth router fully available
immediately after provisioning. Add a retry loop (5 attempts, 30s
wait) that retries on HTTP 502/503 responses before failing.
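The retry policy described above (5 attempts, 30s wait, retry only on 502/503) can be sketched as follows. The real loop lives in a shell step; this Python version is only an illustration, and the `fetch` callable and the injectable `sleep` are assumptions that keep the example testable without a network.

```python
import time
from typing import Callable, Tuple

RETRYABLE = {502, 503}


def fetch_with_retry(
    fetch: Callable[[], Tuple[int, str]],
    attempts: int = 5,
    wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call fetch() until it returns a non-retryable status or attempts run out."""
    status, body = fetch()
    for _ in range(attempts - 1):
        if status not in RETRYABLE:
            break
        sleep(wait_seconds)
        status, body = fetch()
    # The caller decides what to do if the final status is still 502/503.
    return status, body
```

Other statuses, such as 401, are returned immediately, since waiting does not fix wrong credentials.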

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The OAuth router on EaaS clusters returns 503 when the ingress stack
isn't ready, which may persist beyond any reasonable retry window.

The OAuth token was only used to call the Kubernetes API to read the
ArgoCD admin password. Move that `oc get secret` call into the
get-cluster-info step (which already has oc + cert-based kubeconfig)
and pass ARGO_PWD through cluster-info.env. The run-dast step now
goes directly to the ArgoCD session API with no OAuth dependency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ZAP's JVM heap is 2048m but total RSS during active scan (native
memory, thread stacks, site tree) can exceed 4GB, hitting the
namespace LimitRange default and getting SIGKILL (-9).

Set explicit 6Gi limit / 4Gi request so the step gets enough memory
to complete the active scan.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Tekton v1 API uses computeResources for step-level CPU/memory
limits; the resources field is not recognized and causes task
validation to fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ArgoCD v2.14.1 CLI emits JSON log format ({"level":"fatal",...}) but
the test assertions check for logrus text format (level=fatal). All
three tests fail consistently across runs; the underlying CLI behavior
is correct. Skip until upstream test is updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the duplicated inline check-gate taskSpec in all three
pipelines with a shared StepAction (check-gate-labels.yaml).

The GATE_LABEL param now accepts a comma-separated list of labels
(e.g. "rc-sanity-check,channel-1.19") — ALL listed labels must be
present on the PR for the pipeline to proceed. A single label value
is still accepted unchanged for backward compatibility.

Push events (no PR found) always proceed regardless of labels.
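The gate rule above reduces to a subset check, sketched here. The StepAction itself is a shell script; this Python version is an illustration only, and the function name, the parameters, and the handling of an empty gate list are assumptions.

```python
def should_proceed(gate_label: str, pr_labels, is_pull_request: bool = True) -> bool:
    """Return True when every label in the comma-separated gate list is on the PR."""
    if not is_pull_request:
        # Push events have no PR to inspect, so they always proceed.
        return True
    required = {label.strip() for label in gate_label.split(",") if label.strip()}
    # Assumption: an empty gate list imposes no requirement.
    return required.issubset(set(pr_labels))


if __name__ == "__main__":
    labels = ["rc-sanity-check", "channel-1.19"]
    print(should_proceed("rc-sanity-check,channel-1.19", labels))
    print(should_proceed("rc-sanity-check,channel-1.19", ["rc-sanity-check"]))
```

A single label is just a one-element list under this rule, which is why existing single-value gates keep working unchanged.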

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AdamSaleh and others added 2 commits July 1, 2026 23:29
…19, 1.20, 1.21

Creates 15 scenarios per channel (45 total), gated on channel-X.XX label
in addition to the existing test-type gate label. Each set covers: sanity,
sanity-fips, sanity-upgrade, sanity-upgrade-fips, sequential-s1,
sequential-s1-upgrade, sequential-s2, parallel, parallel-fips,
parallel-upgrade, rollouts, ui-e2e, argocd-e2e, argocd-e2e-fips, dast.

Channel versions:
- 1.19: OPERATOR_CHANNEL=gitops-1.19, TEST_REPO_BRANCH=v1.19, ArgoCD=v3.1.16, upgrade from gitops-1.18
- 1.20: OPERATOR_CHANNEL=gitops-1.20, TEST_REPO_BRANCH=v1.20, ArgoCD=v3.3.12, upgrade from gitops-1.19
- 1.21: OPERATOR_CHANNEL=gitops-1.21, TEST_REPO_BRANCH=v1.21, ArgoCD=v3.4.4, upgrade from gitops-1.20

UI tests use master branch for all channels.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
StepAction results written via $(step.results.X.path) are stored at
/tekton/steps/<name>/results/X but are NOT promoted to TaskRun
.status.taskResults automatically. The pipeline's when expressions
that read $(tasks.check-gate.results.proceed) therefore always saw an
empty string, causing all gated tasks to be skipped.

Add a propagate-result step to each check-gate taskSpec that copies
/tekton/steps/check/results/proceed to $(results.proceed.path) so the
result is visible to the pipeline.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@AdamSaleh

Collaborator Author

/retest


AdamSaleh and others added 6 commits July 2, 2026 09:09
Tekton prefixes internal step names with "step-" when storing
StepAction results, so the result written by the "check" step
is at /tekton/steps/step-check/results/proceed, not
/tekton/steps/check/results/proceed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ctions

Shell script robustness:
- run-e2e-tests.sh: set -exo pipefail; shallow-fetch only target branch
- run-ui-e2e-tests.sh: set -euo pipefail; guard playwright install with || true
- publish-results.sh: track push success, exit 1 if all 3 attempts fail
- wait-for-resources.sh: remove unconditional 30s sleep before CSV poll loop
- print-cluster-login-info.sh: redact kubeadmin password in log output
- run-sanity-tests.sh: use json.dumps() for safe JSON generation
- upgrade-operator.sh: set -euo pipefail; add NAMESPACE default
- run-sequential-tests-shard{1,2}.sh: fix typo/extensions in focus list,
  remove no-op suite_test.go from shard 2

install-operator.sh hardening:
- trap to always kill background pull-secret loop on exit
- head -1 on DaemonSet grep to avoid multi-line DS_NAME
- remove head -3 from registry verification (check all registries)

Security fixes:
- check-gate-labels.yaml: add github-token param + authenticated curl;
  fix IFS cleanup after break using while-read idiom
- send-slack-message.py: replace shell=True + f-string with shell=False
  explicit argument lists throughout

Scenario YAML:
- gitops-channel-tests-{1-19,1-20,1-21}.yaml: point UI e2e TEST_REPO_BRANCH
  at konflux-integration-* QA fork branches instead of master
- remove TEST_IMAGE_URL from all scenarios (param not declared in pipeline,
  silently dropped by Tekton)

Python scripts:
- collect-build-metadata.sh: single-quoted EOF heredoc + os.environ to
  prevent shell injection into Python triple-quoted strings
- render-results.py: wrap clean+render in try/except with git checkout -- .
  recovery if rendering fails after directories are deleted

External images and stepactions:
- resolve-openshift-version.yaml: replace docker.io/python:3-alpine with
  registry.access.redhat.com/ubi9/python-311:latest
- extract-image-content-sources.yaml: same image replacement; --depth=1
  shallow clone; remove redundant apk add git

ArgoCD e2e:
- Add TODO/FIXME comments for bitnami/git pinning, missing JUnit XML
  output (needs go-junit-report in image), and v3.x pre-compilation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All gitops-operator e2e test scenarios (sanity, sequential, parallel,
rollouts, ui-e2e) now use:
  TEST_REPO_URL: https://github.com/rh-gitops-release-qa/gitops-operator
  TEST_REPO_BRANCH: konflux-integration-{1.19,1.20,1.21,latest}

These QA fork branches carry downstream-specific fixes (OCP guided tour
Playwright fix, argocd-agent image check relaxation) that are needed for
tests to pass against the downstream operator.

ArgoCD e2e and DAST scenarios are unchanged — they don't pull test code
from the gitops-operator repo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ripts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
With a single 'read' variable, IFS splitting does not distribute across
the delimiter — the entire string lands in $required. Convert commas to
newlines so each label becomes its own input line.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The konflux-integration-1.19 and konflux-integration-1.20 QA fork branches
do not contain test/ui-e2e — the Playwright tests were only added starting
from the 1.21 cycle. Running the ui-e2e scenario against those branches will
always fail with "directory not found". Remove the scenarios until the tests
are backported.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>