Efficient Document Indexing Using Pivot Tree

Gaurav Singh, Benjamin Piwowarski [1]
[1] BD - Bases de Données, LIP6 - Laboratoire d'Informatique de Paris 6
Abstract: We present a novel method for efficiently finding the top-k nearest neighbors, under cosine similarity, of documents represented in a high-dimensional term space. Documents are most often stored as bag-of-words tf-idf vectors, and one of the most widely used ways of computing the similarity between a pair of documents is the cosine similarity between their vector representations. Cosine similarity, however, is not a metric distance measure, because it does not satisfy the triangle inequality; most metric search methods therefore cannot be applied to it directly. We propose a method for indexing documents with a pivot tree that leads to efficient retrieval. We also study the trade-off between precision and efficiency for the proposed method and compare it with a state-of-the-art method for document search based on the inner product.
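The claim that cosine similarity does not obey the triangle inequality can be checked on a small example. The sketch below is not the pivot-tree method of the report; it only illustrates the cosine similarity, the derived quantity 1 - cos (one common way to turn the similarity into a dissimilarity, assumed here for illustration), and an exhaustive top-k scan of the kind an index is meant to avoid. The function names and the three toy vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """1 - cosine similarity; an assumed dissimilarity, not a metric."""
    return 1.0 - cosine_similarity(u, v)

def top_k_brute_force(query, docs, k):
    """Indices of the k documents most similar to query (exhaustive scan)."""
    order = sorted(
        range(len(docs)),
        key=lambda i: cosine_similarity(query, docs[i]),
        reverse=True,
    )
    return order[:k]

# Toy vectors: a and c are orthogonal, b lies halfway between them.
a, b, c = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)

direct = cosine_distance(a, c)                          # going straight from a to c
detour = cosine_distance(a, b) + cosine_distance(b, c)  # going through b

# A metric would require direct <= detour; here the detour is shorter.
violates_triangle = direct > detour
print(direct, detour, violates_triangle)
```

Because the direct dissimilarity between a and c exceeds the sum of the two legs through b, pruning rules that rely on the triangle inequality are not valid for this quantity, which is the obstacle the abstract describes.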
Document type: Report
[Research Report] Sorbonne Universités, UPMC Univ Paris 06, CNRS, LIP6 UMR 7606. 2016
http://hal.upmc.fr/hal-01358681
Contributor: Benjamin Piwowarski
Submitted on: Thursday, September 1, 2016 - 11:25:37
Last modified on: Thursday, January 11, 2018 - 06:27:14

Identifiers

  • HAL Id: hal-01358681, version 1

Collections

UPMC | LIP6 | LARA

Citation

Gaurav Singh, Benjamin Piwowarski. Efficient Document Indexing Using Pivot Tree. [Research Report] Sorbonne Universités, UPMC Univ Paris 06, CNRS, LIP6 UMR 7606. 2016. 〈hal-01358681〉
