Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
86 commits
Select commit Hold shift + click to select a range
d67ae07
Add support for horizontal/verstical vectorization parameter
sleeepyjack Sep 9, 2025
cb7a78d
Restructure policies
sleeepyjack Sep 9, 2025
21ff88c
Fix indexing bug
sleeepyjack Sep 10, 2025
41e217a
Coalesced output write
sleeepyjack Sep 10, 2025
2b8ecde
Add unit test for adaptive contains kernel
sleeepyjack Sep 10, 2025
c322825
Add parametric filter policy (dummy)
sleeepyjack Sep 10, 2025
17f9c19
Merge remote-tracking branch 'upstream' into exp-filter-policy
sleeepyjack Sep 10, 2025
97693a3
Multiplicative hashing implemented in policy. Some changes needed to …
kevkrist Sep 11, 2025
4285b39
Finalized proposed policy interface.
kevkrist Sep 11, 2025
0934cba
Fixed a mistake in thread_dispatch(). Removed some dead static variab…
kevkrist Sep 16, 2025
cf43c8f
Multiplicative hashing calling code infrastructure.
kevkrist Sep 17, 2025
52d7f17
New example script for sanity checking. Still need to connect the hos…
kevkrist Sep 18, 2025
91714b0
host and device APIs are connected for multiplicative hashing, but co…
kevkrist Sep 18, 2025
fa7d9a7
Debugging done. End-to-end filters working properly.
kevkrist Sep 19, 2025
13918c2
Tests updated.
kevkrist Sep 19, 2025
7008773
Good performance agains arrow FP when early exit is turned off. Will …
kevkrist Sep 19, 2025
9252f69
Updated bloom filter nvbench script.
kevkrist Sep 22, 2025
12b4847
Changing exp kernels from if to while for grid-striding.
kevkrist Sep 22, 2025
d3fcce2
Bug fix in filter size in PFP_EVALUATION_EXAMPLE
kevkrist Sep 23, 2025
a30d1b1
Bug fix in while loop in exp kernels.
kevkrist Sep 23, 2025
18e9c34
Small PR review fixes.
kevkrist Sep 23, 2025
a018896
group-cooperative parametric filter policy code paths implemented.
kevkrist Sep 23, 2025
8020b72
Benchmark scripts updated.
kevkrist Sep 23, 2025
b655183
Notebook with theoretical FPR calculators.
kevkrist Sep 25, 2025
e558f1c
Remove static checks on hash result type that are blocking NVBench.
kevkrist Sep 25, 2025
c83912c
Enum type lists for the add benchmark added.
kevkrist Sep 25, 2025
46cf45f
Added salt generation script. Updated the total number of salts to 64.
kevkrist Sep 25, 2025
892e4a9
Updated block index selection in PFP to match Arrow policy.
kevkrist Sep 26, 2025
e9f8ac9
Merge remote-tracking branch 'upstream' into exp-filter-policy
sleeepyjack Sep 28, 2025
1a4b5e0
Enable magic modulo
sleeepyjack Sep 28, 2025
8c0f49a
Enable warp-cooperative kernels
sleeepyjack Sep 28, 2025
726735a
Update benchmarks
sleeepyjack Sep 30, 2025
04c48ee
Add experimental Arrow filter policy
sleeepyjack Sep 30, 2025
a177107
Update benchmarks
sleeepyjack Oct 1, 2025
db020aa
Add RTX600 Blackwell benchmark results
sleeepyjack Oct 1, 2025
a815053
Add H200 benchmark results
sleeepyjack Oct 1, 2025
a793644
Add B200 benchmark results
sleeepyjack Oct 1, 2025
70376cd
Notebook for generating filter size sweep plots.
kevkrist Oct 1, 2025
f6ef851
Plot generating script added and plots for the add/contains sweeps ge…
kevkrist Oct 2, 2025
69425d5
Modified group-cooperative hashing.
kevkrist Oct 6, 2025
a38caba
Fixed some bugs in the plotting script. New data/plots for new group-…
kevkrist Oct 6, 2025
f638544
Added warpcore benchmarks. Updated frontier.
kevkrist Oct 7, 2025
b6301e2
Updated frontier plot
kevkrist Oct 7, 2025
252a32b
Updated pfp implementation to switch to conditional atomic inserts wh…
kevkrist Oct 9, 2025
8bc13b0
Cleaning up dead code.
kevkrist Oct 9, 2025
087dfdd
Outsource thirdparty benchmarks
sleeepyjack Oct 14, 2025
8368f9b
Merge branch 'dev' into consolidate
sleeepyjack Oct 14, 2025
b454236
Fix add benchmarks and more merge conflicts
sleeepyjack Oct 14, 2025
3db5153
Fix lazy_discard_iterator
sleeepyjack Nov 11, 2025
ac204eb
Add CTA work stealing option
sleeepyjack Nov 11, 2025
4cea931
Add arch guard for work stealing kernels
sleeepyjack Nov 12, 2025
4de8306
Update benchmarks
sleeepyjack Dec 1, 2025
d0ad05c
Minor fixes
sleeepyjack Dec 1, 2025
81cf164
Enable ninja build generator
sleeepyjack Dec 1, 2025
c676524
Merge remote-tracking branch 'upstream' into consolidate
sleeepyjack Dec 1, 2025
65c727b
cache_sectorized implemented in parametric_filter_policy.
kevkrist Dec 5, 2025
555113f
Started on bloom_filter_imp
kevkrist Dec 5, 2025
25bf1cc
Add implemented for cache-sectorized.
kevkrist Dec 8, 2025
aa5b117
Fix bug in set_bits routine for cache-sectorized.
kevkrist Dec 8, 2025
2cce8c2
contains implemented for cache-sectorized.
kevkrist Dec 8, 2025
00d5bde
contains has bug when horizontal_layout > 1.
kevkrist Dec 8, 2025
5643887
Cache-sectorized working.
kevkrist Dec 9, 2025
4d39069
Turned off use_cub_kernels and work stealing for clearer evaluation.
kevkrist Dec 11, 2025
355b646
Add CSBF benchmarks
sleeepyjack Dec 18, 2025
94bce41
Merge remote-tracking branch 'upstream/dev' into consolidate
sleeepyjack Feb 26, 2026
617cb3f
Merge remote-tracking branch 'upstream/dev' into bloom-filter-release
sleeepyjack Apr 29, 2026
b805999
Drop baggage
sleeepyjack Apr 29, 2026
519970f
WIP but working
sleeepyjack Apr 29, 2026
6d299e0
Tuning struct
sleeepyjack Apr 29, 2026
076a5cf
Docs
sleeepyjack Apr 29, 2026
96e2a5a
Address review comments
sleeepyjack Apr 30, 2026
a999f46
[pre-commit.ci] auto code formatting
pre-commit-ci[bot] Apr 30, 2026
c2dc7ab
Merge branch 'dev' into bloom-filter-release
sleeepyjack Apr 30, 2026
b4c4aa5
Update copyright year
sleeepyjack Apr 30, 2026
2a8de62
Merge branch 'bloom-filter-release' of github.com:sleeepyjack/cuColle…
sleeepyjack Apr 30, 2026
20be4e3
Address Doxygen
sleeepyjack Apr 30, 2026
4ac613e
Fix CTK 12.0 build: gate CG invoke_one and cluster launch control
sleeepyjack May 13, 2026
0a2509f
Merge remote-tracking branch 'upstream' into bloom-filter-release
sleeepyjack May 13, 2026
739079f
Review fixes
sleeepyjack Jun 17, 2026
46fbaf0
Merge remote-tracking branch 'upstream/dev' into bloom-filter-release
sleeepyjack Jun 17, 2026
4dfb811
Remove CSBF
sleeepyjack Jun 17, 2026
97635fe
Remove IO-less benchmarks
sleeepyjack Jun 17, 2026
dbe61ec
More tests
sleeepyjack Jun 17, 2026
10f09d6
Remove range scalar device functions as they were ambiguous overloads…
sleeepyjack Jun 17, 2026
8b1e995
Cleanups
sleeepyjack Jun 17, 2026
eae3049
Fix CUCO_HAS_CG_INVOKE_ONE usage
sleeepyjack Jun 17, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 19 additions & 1 deletion benchmarks/benchmark_utils.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@

#include <cuco/detail/__config>
#include <cuco/detail/error.hpp>
#include <cuco/hash_functions.cuh>
#include <cuco/utility/key_generator.cuh>

#include <nvbench/nvbench.cuh>
Expand Down Expand Up @@ -65,7 +66,9 @@ struct lazy_discard {
__device__ void device_dispatch(index_type index, value_type const& value) const
{
// pick some predicate that is always false, but depends on the runtime value
if (threadIdx.x > 2025 + *reinterpret_cast<char const*>(&value)) { *(it + index) = value; }
if (threadIdx.x > 2025 + *reinterpret_cast<char const*>(&value) + static_cast<int>(index)) {
*(it + index) = value;
}
}
__host__ __device__ void operator()(index_type index, value_type const& value) const
{
Expand Down Expand Up @@ -94,6 +97,21 @@ NVBENCH_DECLARE_TYPE_STRINGS(cuco::utility::distribution::gaussian,
"GAUSSIAN",
"distribution::gaussian");

NVBENCH_DECLARE_TYPE_STRINGS(cuco::detail::XXHash_64<char>, "xxhash_64", "cuco::xxhash_64");
NVBENCH_DECLARE_TYPE_STRINGS(cuco::detail::XXHash_32<char>, "xxhash_32", "cuco::xxhash_32");
NVBENCH_DECLARE_TYPE_STRINGS(cuco::detail::MurmurHash3_32<char>,
"murmurhash3_32",
"cuco::murmurhash3_32");
NVBENCH_DECLARE_TYPE_STRINGS(cuco::detail::MurmurHash3_x86_128<char>,
"murmurhash3_x86_128",
"cuco::murmurhash3_x86_128");
NVBENCH_DECLARE_TYPE_STRINGS(cuco::detail::MurmurHash3_x64_128<char>,
"murmurhash3_x64_128",
"cuco::murmurhash3_x64_128");
NVBENCH_DECLARE_TYPE_STRINGS(cuco::detail::identity_hash<char>,
"identity_hash",
"cuco::identity_hash");

#if defined(CUCO_HAS_128BIT_ATOMICS)
NVBENCH_DECLARE_TYPE_STRINGS(__int128_t, "I128", "__int128_t");
#endif
215 changes: 96 additions & 119 deletions benchmarks/bloom_filter/add_bench.cu
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,6 @@
*/

#include "defaults.hpp"
#include "utils.hpp"

#include <benchmark_defaults.hpp>
#include <benchmark_utils.hpp>
Expand All @@ -24,143 +23,121 @@

#include <nvbench/nvbench.cuh>

#include <cuda/iterator>
#include <cuda/std/limits>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/sequence.h>

#include <cstdint>
#include <exception>
#include <limits>

using namespace cuco::benchmark; // defaults, dist_from_state, rebind_hasher_t, add_fpr_summary
using namespace cuco::benchmark; // defaults, dist_from_state, rebind_hasher_t
using namespace cuco::utility; // key_generator, distribution

/**
* @brief A benchmark evaluating `cuco::bloom_filter::add_async` performance
*/
template <typename Key, typename Hash, typename Word, nvbench::int32_t WordsPerBlock, typename Dist>
template <typename Key,
typename Word,
nvbench::int32_t BlockBits,
nvbench::int32_t PatternBits,
nvbench::int32_t HorizontalLayout,
nvbench::int32_t VerticalLayout>
void bloom_filter_add(nvbench::state& state,
nvbench::type_list<Key, Hash, Word, nvbench::enum_type<WordsPerBlock>, Dist>)
nvbench::type_list<Key,
Word,
nvbench::enum_type<BlockBits>,
nvbench::enum_type<PatternBits>,
nvbench::enum_type<HorizontalLayout>,
nvbench::enum_type<VerticalLayout>>)
{
using size_type = std::uint32_t;
using policy_type = cuco::default_filter_policy<rebind_hasher_t<Hash, Key>,
Word,
static_cast<std::uint32_t>(WordsPerBlock)>;
using filter_type =
cuco::bloom_filter<Key, cuco::extent<size_type>, cuda::thread_scope_device, policy_type>;

constexpr auto filter_block_size =
sizeof(typename filter_type::word_type) * filter_type::words_per_block;

auto const num_keys = state.get_int64("NumInputs");
auto const filter_size_mb = state.get_int64("FilterSizeMB");
auto const pattern_bits = WordsPerBlock;

try {
[[maybe_unused]] auto const policy = policy_type{static_cast<uint32_t>(pattern_bits)};
} catch (std::exception const& e) {
state.skip(e.what()); // skip invalid configurations
auto constexpr words_per_block = BlockBits / cuda::std::numeric_limits<Word>::digits;
auto constexpr pattern_bits_per_word = PatternBits / words_per_block;

// Check for a valid configuration
if constexpr ((not cuda::std::has_single_bit(static_cast<uint32_t>(BlockBits))) or
(words_per_block == 0)) {
state.skip("Invalid filter block size");
} else if constexpr (HorizontalLayout * VerticalLayout != words_per_block) {
state.skip("Invalid vectorization layout");
} else if constexpr ((pattern_bits_per_word <= 0) or
(pattern_bits_per_word > cuda::std::numeric_limits<Word>::digits) or
(pattern_bits_per_word * words_per_block > 64)) {
state.skip("Invalid pattern bits per word");
} else {
using size_type = std::uint32_t;
using hasher = cuco::xxhash_64<Key>;
auto constexpr contains_vertical_layout = words_per_block;
auto constexpr contains_horizontal_layout = 1;
using policy_type = cuco::parametric_filter_policy<hasher,
Word,
words_per_block,
PatternBits,
HorizontalLayout,
VerticalLayout,
contains_horizontal_layout,
contains_vertical_layout>;
using filter_type =
cuco::bloom_filter<Key, cuco::extent<size_type>, cuda::thread_scope_device, policy_type>;

auto const num_keys = state.get_int64("NumInputs");
auto const filter_size_mb = state.get_int64("FilterSizeMB");

std::size_t const num_sub_filters =
(filter_size_mb * 1024 * 1024) /
(sizeof(typename filter_type::word_type) * filter_type::words_per_block);

if (num_sub_filters > policy_type::max_filter_blocks) {
// skip invalid configurations
state.skip("num_sub_filters exceeds max_filter_blocks");
}

state.add_element_count(num_keys);

filter_type filter{static_cast<size_type>(num_sub_filters)};

thrust::device_vector<Key> keys(num_keys);
thrust::sequence(thrust::device, keys.begin(), keys.end(), 0);

state.exec(nvbench::exec_tag::timer, [&](nvbench::launch& launch, auto& timer) {
timer.start();
filter.add_async(keys.begin(), keys.end(), {launch.get_stream()});
timer.stop();
filter.clear_async({launch.get_stream()});
});
}

std::size_t const num_sub_filters = (filter_size_mb * 1024 * 1024) / filter_block_size;

if (num_sub_filters > std::numeric_limits<size_type>::max()) {
state.skip("num_sub_filters too large for size_type"); // skip invalid configurations
}

cuda::counting_iterator<Key> keys(0);

state.add_element_count(num_keys);

filter_type filter{
static_cast<size_type>(num_sub_filters), {}, {static_cast<std::uint32_t>(pattern_bits)}};

state.collect_dram_throughput();
state.collect_l2_hit_rates();

add_fpr_summary(state, filter);

state.exec([&](nvbench::launch& launch) {
filter.add_async(keys, keys + num_keys, {launch.get_stream()});
});
}

/**
* @brief A benchmark evaluating `cuco::bloom_filter::add_async` performance with
* `arrow_filter_policy`
*/
template <typename Key, typename Dist>
void arrow_bloom_filter_add(nvbench::state& state, nvbench::type_list<Key, Dist>)
{
using size_type = std::uint32_t;
using policy_type = cuco::arrow_filter_policy<Key>;
using filter_type =
cuco::bloom_filter<Key, cuco::extent<size_type>, cuda::thread_scope_device, policy_type>;

auto const num_keys = state.get_int64("NumInputs");
auto const filter_size_mb = state.get_int64("FilterSizeMB");

std::size_t const num_sub_filters =
(filter_size_mb * 1024 * 1024) /
(sizeof(typename filter_type::word_type) * filter_type::words_per_block);

if (num_sub_filters > policy_type::max_filter_blocks) {
state.skip("bloom filter with arrow policy should have <= 4194304 blocks"); // skip invalid
// configurations
}

cuda::counting_iterator<Key> keys(0);

state.add_element_count(num_keys);

filter_type filter{static_cast<size_type>(num_sub_filters)};

state.collect_dram_throughput();
state.collect_l2_hit_rates();

add_fpr_summary(state, filter);

state.exec([&](nvbench::launch& launch) {
filter.add_async(keys, keys + num_keys, {launch.get_stream()});
});
}

// Default benchmark: single layout matching `cuco::default_filter_policy` (256-bit block,
// 8-bit fingerprint, fully horizontal add) swept across the standard FilterSizeMB range.
NVBENCH_BENCH_TYPES(bloom_filter_add,
NVBENCH_TYPE_AXES(nvbench::type_list<defaults::BF_KEY>,
nvbench::type_list<defaults::BF_HASH>,
nvbench::type_list<defaults::BF_WORD>,
nvbench::enum_type_list<defaults::BF_WORDS_PER_BLOCK>,
nvbench::type_list<distribution::unique>))
nvbench::type_list<nvbench::uint32_t>, ///< Word
nvbench::enum_type_list<256>, ///< BlockBits
nvbench::enum_type_list<8>, ///< PatternBits
nvbench::enum_type_list<8>, ///< HorizontalLayout
nvbench::enum_type_list<1> ///< VerticalLayout
))
.set_name("bloom_filter_add_unique_size")
.set_type_axes_names({"Key", "Hash", "Word", "WordsPerBlock", "Distribution"})
.set_type_axes_names(
{"Key", "Word", "BlockBits", "PatternBits", "HorizontalLayout", "VerticalLayout"})
.add_int64_axis("NumInputs", {defaults::BF_N})
.add_int64_axis("FilterSizeMB", defaults::BF_SIZE_MB_RANGE_CACHE);

NVBENCH_BENCH_TYPES(bloom_filter_add,
NVBENCH_TYPE_AXES(nvbench::type_list<defaults::BF_KEY>,
defaults::HASH_RANGE,
nvbench::type_list<defaults::BF_WORD>,
nvbench::enum_type_list<defaults::BF_WORDS_PER_BLOCK>,
nvbench::type_list<distribution::unique>))
.set_name("bloom_filter_add_unique_hash")
.set_type_axes_names({"Key", "Hash", "Word", "WordsPerBlock", "Distribution"})
.add_int64_axis("NumInputs", {defaults::BF_N})
.add_int64_axis("FilterSizeMB", {defaults::BF_SIZE_MB});

NVBENCH_BENCH_TYPES(bloom_filter_add,
NVBENCH_TYPE_AXES(nvbench::type_list<defaults::BF_KEY>,
nvbench::type_list<defaults::BF_HASH>,
nvbench::type_list<nvbench::uint32_t, nvbench::uint64_t>,
nvbench::enum_type_list<1, 2, 4, 8>,
nvbench::type_list<distribution::unique>))
.set_name("bloom_filter_add_unique_block_dim")
.set_type_axes_names({"Key", "Hash", "Word", "WordsPerBlock", "Distribution"})
.add_int64_axis("NumInputs", {defaults::BF_N})
.add_int64_axis("FilterSizeMB", {defaults::BF_SIZE_MB});

NVBENCH_BENCH_TYPES(arrow_bloom_filter_add,
NVBENCH_TYPE_AXES(nvbench::type_list<defaults::BF_KEY>,
nvbench::type_list<distribution::unique>))
.set_name("arrow_bloom_filter_add_unique_size")
.set_type_axes_names({"Key", "Distribution"})
.add_int64_axis("NumInputs", {defaults::BF_N})
.add_int64_axis("FilterSizeMB", defaults::BF_SIZE_MB_RANGE_CACHE);
// Exhaustive sweep across block sizes and vectorization layouts. Uncomment for performance
// tuning / paper-style characterization; not run by default because the matrix is large.
// NVBENCH_BENCH_TYPES(
// bloom_filter_add,
// NVBENCH_TYPE_AXES(nvbench::type_list<defaults::BF_KEY>,
// nvbench::type_list<nvbench::uint64_t, nvbench::uint32_t>, ///< Word
// nvbench::enum_type_list<64, 128, 256, 512, 1024>, ///< BlockBits
// nvbench::enum_type_list<8, 16>, ///< PatternBits
// nvbench::enum_type_list<1, 2, 4, 8, 16>, ///<
// HorizontalLayout nvbench::enum_type_list<1, 2, 4, 8, 16> ///< VerticalLayout
// ))
// .set_name("bloom_filter_add_full_sweep_u64")
// .set_type_axes_names(
// {"Key", "Word", "BlockBits", "PatternBits", "HorizontalLayout", "VerticalLayout"})
// .add_int64_axis("NumInputs", {defaults::BF_N})
// .add_int64_axis("FilterSizeMB", defaults::BF_SIZE_MB_RANGE_CACHE);
Loading
Loading