Benchmark the performance of KNN algorithmsο
In this doc, we benchmark the performance on multiple K-Nearest Neighbor algorithms implemented by dgl.knn_graph()
.
Given a dataset of N
samples with D
dimensions, the common use case of KNN algorithms in graph learning is to build a KNN graph by finding the K
nearest neighbors for each of the N
samples among the dataset.
Empirically, the three parameters, N
, D
, and K
, all have impact on the computation cost. To benchmark the algorithms, we pick a few represensitive datasets to cover most common scenarios:
A synthetic dataset with mixed gaussian samples:
N = 1000
,D = 3
.A point cloud sample from ModelNet:
N = 10000
,D = 3
.- Subsets of MNIST
A small subset:
N = 1000
,D = 784
A medium subset:
N = 10000
,D = 784
A large subset:
N = 50000
,D = 784
Some notes:
bruteforce-sharemem
is an optimized implementation ofbruteforce
on GPU.kd-tree
is currently only implemented on CPU.bruteforce-blas
conducts matrix multiplication, thus is memory inefficient.nn-descent
is an approximate algorithm, and we also report the recall rate of its result.
Resultsο
In this section, we show the runtime and recall rate (where applicable) for the algorithms under various scenarios.
The experiments are run on an Amazon EC2 P3.2xlarge instance. This instance has 8 vCPUs with 61GB RAM, and one Tesla V100 GPU with 16GB RAM. In terms of the environment, we obtain the numbers with DGL==0.7.0(64d0f3f), PyTorch==1.8.1, CUDA==11.1 on Ubuntu 18.04.5 LTS.
Mixed Gaussian:
Model |
CPU |
GPU |
||
---|---|---|---|---|
K = 8 |
K = 64 |
K = 8 |
K = 64 |
|
bruteforce-blas |
0.010 |
0.011 |
0.002 |
0.003 |
kd-tree |
0.004 |
0.006 |
n/a |
n/a |
bruteforce |
0.004 |
0.006 |
0.126 |
0.009 |
bruteforce-sharemem |
n/a |
n/a |
0.002 |
0.003 |
nn-descent |
0.014 (R: 0.985) |
0.148 (R: 1.000) |
0.016 (R: 0.973) |
0.077 (R: 1.000) |
Point Cloud
Model |
CPU |
GPU |
||
---|---|---|---|---|
K = 8 |
K = 64 |
K = 8 |
K = 64 |
|
bruteforce-blas |
0.359 |
0.432 |
0.010 |
0.010 |
kd-tree |
0.007 |
0.026 |
n/a |
n/a |
bruteforce |
0.074 |
0.167 |
0.008 |
0.039 |
bruteforce-sharemem |
n/a |
n/a |
0.004 |
0.017 |
nn-descent |
0.161 (R: 0.977) |
1.345 (R: 0.999) |
0.086 (R: 0.966) |
0.445 (R: 0.999) |
Small MNIST
Model |
CPU |
GPU |
||
---|---|---|---|---|
K = 8 |
K = 64 |
K = 8 |
K = 64 |
|
bruteforce-blas |
0.014 |
0.015 |
0.002 |
0.002 |
kd-tree |
0.179 |
0.182 |
n/a |
n/a |
bruteforce |
0.173 |
0.228 |
0.123 |
0.170 |
bruteforce-sharemem |
n/a |
n/a |
0.045 |
0.054 |
nn-descent |
0.060 (R: 0.878) |
1.077 (R: 0.999) |
0.030 (R: 0.952) |
0.457 (R: 0.999) |
Medium MNIST
Model |
CPU |
GPU |
||
---|---|---|---|---|
K = 8 |
K = 64 |
K = 8 |
K = 64 |
|
bruteforce-blas |
0.897 |
0.970 |
0.019 |
0.023 |
kd-tree |
18.902 |
18.928 |
n/a |
n/a |
bruteforce |
14.495 |
17.652 |
2.058 |
2.588 |
bruteforce-sharemem |
n/a |
n/a |
2.257 |
2.524 |
nn-descent |
0.804 (R: 0.755) |
14.108 (R: 0.999) |
0.158 (R: 0.900) |
1.794 (R: 0.999) |
Large MNIST
Model |
CPU |
GPU |
||
---|---|---|---|---|
K = 8 |
K = 64 |
K = 8 |
K = 64 |
|
bruteforce-blas |
21.829 |
22.135 |
Out of Memory |
Out of Memory |
kd-tree |
542.688 |
573.379 |
n/a |
n/a |
bruteforce |
373.823 |
432.963 |
10.317 |
12.639 |
bruteforce-sharemem |
n/a |
n/a |
53.133 |
58.419 |
nn-descent |
4.995 (R: 0.658) |
75.487 (R: 0.999) |
1.478 (R: 0.860) |
15.698 (R: 0.999) |
Conclusionο
As long as you have enough memory,
bruteforce-blas
is the default algorithm to go with.Specifically, when
D
is small and the data is on CPU,kd-tree
is the best algorithm.