Jump to content

Compressed cover tree

From Wikipedia, the free encyclopedia

The compressed cover tree is a type of data structure in computer science that is specifically designed to facilitate the speed-up of a k-nearest neighbors algorithm in finite metric spaces.[1] Compressed cover tree is a simplified version of explicit representation of cover tree that was motivated by past issues in proofs of time complexity results[2] of cover tree. The compressed cover tree was specifically designed to achieve claimed time complexities of cover tree[3] in a mathematically rigorous way.

Problem statement

[edit]

In the modern formulation, the k-nearest neighbor problem is to find all nearest neighbors in a given reference set R for all points from another given query set Q. Both sets belong to a common ambient space X with a distance metric d satisfying all metric axioms.

Definitions

[edit]

Compressed cover tree

[edit]

Let (R,d) be a finite metric space. A compressed cover tree has the vertex set R with a root and a level function satisfying the conditions below:

  • Root condition: the level of the root node r satisfies
  • Covering condition: For every node , we select a unique parent p and a level l(q) such that and this parent node pp has a single link to its child node q.
  • Separation condition: For , the cover set has

Expansion constants

[edit]

In a metric space, let be the closed ball with a center p and a radius . The notation denotes the number (if finite) of points in the closed ball.

The expansion constant[3] is the smallest such that for any point and .

the new minimized expansion constant[1] is a discrete analog of the doubling dimension Navigating nets[4] , where A is a locally finite set which covers R.

Note that for any finite metric space (R,d).

Aspect ratio

[edit]

For any finite set R with a metric d, the diameter is . The aspect ratio is , where is the shortest distance between points of R.

Complexity

[edit]

Insert

[edit]

Although cover trees provide faster searches than the naive approach, this advantage must be weighed with the additional cost of maintaining the data structure. In a naive approach adding a new point to the dataset is trivial because order does not need to be preserved, but in a compressed cover tree it can be bounded:

  • using expansion constant: .
  • using minimized expansion constant / doubling dimension .
[edit]

Let Q and R be finite subsets of a metric space (X,d). Once all points of R are inserted into a compressed cover tree it can be used for find-queries of the query point set Q. The following time complexities have been proven for finding the k-nearest neighbor of a query point in the reference set R:

  • using expansion constant: .
  • using minimized expansion constant / doubling dimension , where is a number of points inside a closed ball around q having a radius 5 times the distance of q to its k-nearest neighbor.

Space

[edit]

The compressed cover tree constructed on finite metric space R requires O(|R|) space, during the construction and during the execution of the Find algorithm.

Compared to other similar data structures

[edit]

Using doubling dimension as hidden factor

[edit]

Tables below show time complexity estimates which use minimized expansion constant or dimensionality constant [4] related to doubling dimension. Note that denotes the aspect ratio.

Results for building data structures
Name of datastructure, source

Claimed time complexity Claimed space complexity Proof of result
Navigating nets[4] Theorem 2.5[4]
Compressed cover tree[1] Theorem 3.6[1]

Results for exact k-nearest neighbors of one query point in reference set R assuming that all data structures are already built. Below we denote the distance between a query point q and the reference set R as and distance from a query point q to its k-nearest neighbor in set R as :

Name of datastructure, source

Claimed time complexity Claimed space complexity Proof of result
Navigating nets[4] Proof outline in Theorem 2.3[4]
Compressed cover tree[1] Corollary 3.7[1]

Using expansion constant as hidden factor

[edit]

Tables below show time complexity estimates which use or KR-type constant [4] as a hidden factor. Note that the dimensionality factor is equivalent to

Results for building data structures
Name of datastructure, source

Claimed time complexity Claimed space complexity Proof of result
Navigating nets[4] Not available
Cover tree[3] Counterexample 4.2[2] shows that the past proof is incorrect.
Compressed cover tree[1] Corollary 3.10[1]

Results for exact k-nearest neighbors of one query point assuming that all data structures are already built.

Name of datastructure, source

Claimed time complexity Claimed space complexity Proof of result
Navigating nets[4] Not available
Cover tree[3] for k = 1. Counterexample 5.2[2] shows that the past proof is incorrect.
Compressed cover tree[1] Theorem 4.9

See also

[edit]

References

[edit]
  1. ^ a b c d e f g h i Elkin, Yury; Kurlin, Vitaliy (2021). "A new near-linear time algorithm for k-nearest neighbor search using a compressed cover tree". arXiv:2111.15478 [cs.CG].
  2. ^ a b c Elkin, Yury; Kurlin, Vitaliy (2022). "Counterexamples expose gaps in the proof of time complexity for cover trees introduced in 2006". arXiv:2208.09447 [cs.CG].
  3. ^ a b c d Alina Beygelzimer, Sham Kakade, and John Langford. Cover Trees for Nearest Neighbor. In Proc. International Conference on Machine Learning (ICML), 2006.
  4. ^ a b c d e f g h i Kenneth Clarkson. Nearest-neighbor searching and metric space dimensions. In G. Shakhnarovich, T. Darrell, and P. Indyk, editors, Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, pages 15–59. MIT Press, 2006.