Benchmark the performance of KNN algorithms

In this doc, we benchmark the performance of multiple k-nearest-neighbor (KNN) algorithms implemented in dgl.knn_graph().

Given a dataset of N samples with D dimensions each, the common use case of KNN in graph learning is to build a KNN graph by finding, for each of the N samples, its K nearest neighbors within the dataset.
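For reference, building such a graph looks like the following minimal sketch; the algorithm keyword follows the DGL 0.7 API and selects among the implementations benchmarked in this doc:

```python
# Minimal sketch: build a KNN graph over N random points (DGL 0.7 API).
import dgl
import torch

N, D, K = 1000, 3, 8
x = torch.randn(N, D)  # N samples, D dimensions each

# Each sample receives edges from its K nearest neighbors; `algorithm`
# selects one of the implementations compared below.
g = dgl.knn_graph(x, K, algorithm='bruteforce-blas')
print(g.num_nodes(), g.num_edges())  # N nodes, N * K edges
```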

Empirically, all three parameters N, D, and K affect the computation cost. To benchmark the algorithms, we pick a few representative datasets that cover the most common scenarios:

  • A synthetic dataset of mixed-Gaussian samples: N = 1000, D = 3 (a generation sketch follows this list).

  • A point cloud sample from ModelNet: N = 10000, D = 3.

  • Subsets of MNIST:
    • A small subset: N = 1000, D = 784

    • A medium subset: N = 10000, D = 784

    • A large subset: N = 50000, D = 784
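The mixed-Gaussian dataset above can be synthesized along these lines; the number of modes, the center spread, and the helper name mixed_gaussian are illustrative assumptions, since the doc does not specify the mixture parameters:

```python
# Hypothetical generator for the mixed-Gaussian dataset; the mixture
# parameters (n_modes, scale) are assumptions, not from the benchmark.
import torch

def mixed_gaussian(n=1000, d=3, n_modes=5, scale=5.0):
    centers = torch.randn(n_modes, d) * scale   # random cluster centers
    assign = torch.randint(n_modes, (n,))       # cluster assignment per sample
    return centers[assign] + torch.randn(n, d)  # add unit-variance noise

x = mixed_gaussian()  # shape (1000, 3): N = 1000, D = 3
```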

Some notes:

  • bruteforce-sharemem is an optimized GPU implementation of bruteforce that uses shared memory.

  • kd-tree is currently only implemented on CPU.

  • bruteforce-blas computes all pairwise distances via matrix multiplication (BLAS), and is therefore memory-inefficient.

  • nn-descent is an approximate algorithm, so we also report the recall rate of its results (see the measurement sketch after this list).
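The recall rate (R) measures the fraction of the true K nearest neighbors the approximate graph recovers. A minimal sketch of such a check, assuming (as dgl.knn_graph does) that edges point from each neighbor to its query node:

```python
# Sketch: recall of an approximate KNN graph against an exact reference.
import dgl
import torch

def knn_recall(x, k, algorithm='nn-descent'):
    g_true = dgl.knn_graph(x, k, algorithm='bruteforce-blas')  # exact
    g_apprx = dgl.knn_graph(x, k, algorithm=algorithm)         # approximate
    src_t, dst_t = g_true.edges()
    src_a, dst_a = g_apprx.edges()
    hits = 0
    for v in range(x.shape[0]):
        # In-neighbors of v are its K nearest neighbors in each graph.
        true_nbrs = set(src_t[dst_t == v].tolist())
        apprx_nbrs = set(src_a[dst_a == v].tolist())
        hits += len(true_nbrs & apprx_nbrs)
    return hits / (x.shape[0] * k)  # averaged fraction recovered

x = torch.randn(1000, 3)
print(knn_recall(x, 8))  # the R values in the tables are this ratio
```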

Results

In this section, we report the runtime in seconds and, where applicable, the recall rate (R) for each algorithm under the various scenarios.

The experiments are run on an Amazon EC2 p3.2xlarge instance, which has 8 vCPUs with 61GB of RAM and one Tesla V100 GPU with 16GB of memory. We obtain the numbers with DGL==0.7.0 (64d0f3f), PyTorch==1.8.1, and CUDA==11.1 on Ubuntu 18.04.5 LTS.
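The measurement protocol is not spelled out here; a plausible timing harness looks like the sketch below (the warm-up and repeat counts are assumptions). Note the explicit synchronization: CUDA kernels launch asynchronously, so timing without it would undercount GPU runtime.

```python
# Hypothetical timing harness; warmup/repeat values are assumptions.
import time
import dgl
import torch

def time_knn(x, k, algorithm, device='cpu', warmup=1, repeat=3):
    x = x.to(device)
    for _ in range(warmup):  # warm up caches and the CUDA context
        dgl.knn_graph(x, k, algorithm=algorithm)
    if device != 'cpu':
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeat):
        dgl.knn_graph(x, k, algorithm=algorithm)
    if device != 'cpu':
        torch.cuda.synchronize()  # wait for all kernels to finish
    return (time.perf_counter() - start) / repeat  # mean seconds per run
```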

  • Mixed Gaussian (N = 1000, D = 3):

    Model                 CPU, K = 8         CPU, K = 64        GPU, K = 8         GPU, K = 64
    bruteforce-blas       0.010              0.011              0.002              0.003
    kd-tree               0.004              0.006              n/a                n/a
    bruteforce            0.004              0.006              0.126              0.009
    bruteforce-sharemem   n/a                n/a                0.002              0.003
    nn-descent            0.014 (R: 0.985)   0.148 (R: 1.000)   0.016 (R: 0.973)   0.077 (R: 1.000)

  • Point Cloud (N = 10000, D = 3):

    Model                 CPU, K = 8         CPU, K = 64        GPU, K = 8         GPU, K = 64
    bruteforce-blas       0.359              0.432              0.010              0.010
    kd-tree               0.007              0.026              n/a                n/a
    bruteforce            0.074              0.167              0.008              0.039
    bruteforce-sharemem   n/a                n/a                0.004              0.017
    nn-descent            0.161 (R: 0.977)   1.345 (R: 0.999)   0.086 (R: 0.966)   0.445 (R: 0.999)

  • Small MNIST (N = 1000, D = 784):

    Model                 CPU, K = 8         CPU, K = 64        GPU, K = 8         GPU, K = 64
    bruteforce-blas       0.014              0.015              0.002              0.002
    kd-tree               0.179              0.182              n/a                n/a
    bruteforce            0.173              0.228              0.123              0.170
    bruteforce-sharemem   n/a                n/a                0.045              0.054
    nn-descent            0.060 (R: 0.878)   1.077 (R: 0.999)   0.030 (R: 0.952)   0.457 (R: 0.999)

  • Medium MNIST (N = 10000, D = 784):

    Model                 CPU, K = 8         CPU, K = 64        GPU, K = 8         GPU, K = 64
    bruteforce-blas       0.897              0.970              0.019              0.023
    kd-tree               18.902             18.928             n/a                n/a
    bruteforce            14.495             17.652             2.058              2.588
    bruteforce-sharemem   n/a                n/a                2.257              2.524
    nn-descent            0.804 (R: 0.755)   14.108 (R: 0.999)  0.158 (R: 0.900)   1.794 (R: 0.999)

  • Large MNIST (N = 50000, D = 784):

    Model                 CPU, K = 8         CPU, K = 64        GPU, K = 8         GPU, K = 64
    bruteforce-blas       21.829             22.135             Out of Memory      Out of Memory
    kd-tree               542.688            573.379            n/a                n/a
    bruteforce            373.823            432.963            10.317             12.639
    bruteforce-sharemem   n/a                n/a                53.133             58.419
    nn-descent            4.995 (R: 0.658)   75.487 (R: 0.999)  1.478 (R: 0.860)   15.698 (R: 0.999)

Conclusion

  • As long as you have enough memory, bruteforce-blas is the best default algorithm.

  • The exception: when D is small and the data is on CPU, kd-tree is the fastest option.
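These two rules of thumb can be folded into a tiny dispatch helper; the function name and the threshold on D below are illustrative assumptions, not part of dgl:

```python
# Illustrative helper encoding the conclusions above; the threshold on D
# is an assumption, not something this benchmark pins down precisely.
def pick_knn_algorithm(on_cpu: bool, d: int, low_dim: int = 10) -> str:
    if on_cpu and d <= low_dim:
        return 'kd-tree'          # fastest CPU option for small D
    return 'bruteforce-blas'      # best default given enough memory

# e.g. dgl.knn_graph(x, 8, algorithm=pick_knn_algorithm(True, x.shape[1]))
```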