Benchmark the performance of KNN algorithms

In this doc, we benchmark the performance on multiple K-Nearest Neighbor algorithms implemented by dgl.knn_graph().

Given a dataset of N samples with D dimensions, the common use case of KNN algorithms in graph learning is to build a KNN graph by finding the K nearest neighbors for each of the N samples among the dataset.

Empirically, the three parameters, N, D, and K, all have impact on the computation cost. To benchmark the algorithms, we pick a few represensitive datasets to cover most common scenarios:

  • A synthetic dataset with mixed gaussian samples: N = 1000, D = 3.

  • A point cloud sample from ModelNet: N = 10000, D = 3.

  • Subsets of MNIST
    • A small subset: N = 1000, D = 784

    • A medium subset: N = 10000, D = 784

    • A large subset: N = 50000, D = 784

Some notes:

  • bruteforce-sharemem is an optimized implementation of bruteforce on GPU.

  • kd-tree is currently only implemented on CPU.

  • bruteforce-blas conducts matrix multiplication, thus is memory inefficient.

  • nn-descent is an approximate algorithm, and we also report the recall rate of its result.

Results

In this section, we show the runtime and recall rate (where applicable) for the algorithms under various scenarios.

The experiments are run on an Amazon EC2 P3.2xlarge instance. This instance has 8 vCPUs with 61GB RAM, and one Tesla V100 GPU with 16GB RAM. In terms of the environment, we obtain the numbers with DGL==0.7.0(64d0f3f), PyTorch==1.8.1, CUDA==11.1 on Ubuntu 18.04.5 LTS.

  • Mixed Gaussian:

Model

CPU

GPU

K = 8

K = 64

K = 8

K = 64

bruteforce-blas

0.010

0.011

0.002

0.003

kd-tree

0.004

0.006

n/a

n/a

bruteforce

0.004

0.006

0.126

0.009

bruteforce-sharemem

n/a

n/a

0.002

0.003

nn-descent

0.014 (R: 0.985)

0.148 (R: 1.000)

0.016 (R: 0.973)

0.077 (R: 1.000)

  • Point Cloud

Model

CPU

GPU

K = 8

K = 64

K = 8

K = 64

bruteforce-blas

0.359

0.432

0.010

0.010

kd-tree

0.007

0.026

n/a

n/a

bruteforce

0.074

0.167

0.008

0.039

bruteforce-sharemem

n/a

n/a

0.004

0.017

nn-descent

0.161 (R: 0.977)

1.345 (R: 0.999)

0.086 (R: 0.966)

0.445 (R: 0.999)

  • Small MNIST

Model

CPU

GPU

K = 8

K = 64

K = 8

K = 64

bruteforce-blas

0.014

0.015

0.002

0.002

kd-tree

0.179

0.182

n/a

n/a

bruteforce

0.173

0.228

0.123

0.170

bruteforce-sharemem

n/a

n/a

0.045

0.054

nn-descent

0.060 (R: 0.878)

1.077 (R: 0.999)

0.030 (R: 0.952)

0.457 (R: 0.999)

  • Medium MNIST

Model

CPU

GPU

K = 8

K = 64

K = 8

K = 64

bruteforce-blas

0.897

0.970

0.019

0.023

kd-tree

18.902

18.928

n/a

n/a

bruteforce

14.495

17.652

2.058

2.588

bruteforce-sharemem

n/a

n/a

2.257

2.524

nn-descent

0.804 (R: 0.755)

14.108 (R: 0.999)

0.158 (R: 0.900)

1.794 (R: 0.999)

  • Large MNIST

Model

CPU

GPU

K = 8

K = 64

K = 8

K = 64

bruteforce-blas

21.829

22.135

Out of Memory

Out of Memory

kd-tree

542.688

573.379

n/a

n/a

bruteforce

373.823

432.963

10.317

12.639

bruteforce-sharemem

n/a

n/a

53.133

58.419

nn-descent

4.995 (R: 0.658)

75.487 (R: 0.999)

1.478 (R: 0.860)

15.698 (R: 0.999)

Conclusion

  • As long as you have enough memory, bruteforce-blas is the default algorithm to go with.

  • Specifically, when D is small and the data is on CPU, kd-tree is the best algorithm.