Finding similar vectors in subquadratic time

Let $d:\{0,1\}^k\times \{0,1\}^k \to \mathbb{R}$ be a function, which we call the similarity function. Examples of similarity functions are cosine distance, the $l_2$ norm, Hamming distance, Jaccard similarity, etc.

Consider $n$ binary vectors of length $k$: $\vec{v} \in (\{0,1\}^k)^n$.

Our goal is to group similar vectors. More formally, we want to compute a similarity graph whose nodes are the vectors and whose edges connect similar vectors ($d(v,u) \leq \epsilon$).

Both $n$ and $k$ are very large, and comparing two vectors of length $k$ is expensive, so we cannot perform all $O(n^2)$ brute-force comparisons. We want to compute the similarity graph with significantly fewer operations.
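For reference, the brute-force baseline that the question wants to avoid can be sketched as follows. This is only an illustration: it assumes Hamming distance as $d$, and the names `hamming` and `similarity_graph` are made up for this example.

```python
from itertools import combinations


def hamming(u, v):
    """Number of positions where two equal-length 0/1 sequences differ."""
    return sum(a != b for a, b in zip(u, v))


def similarity_graph(vectors, eps, d=hamming):
    """Return all edges (i, j) with i < j and d(vectors[i], vectors[j]) <= eps.

    Evaluates d on every one of the n*(n-1)/2 pairs; with Hamming
    distance each evaluation costs O(k).
    """
    return [
        (i, j)
        for (i, u), (j, v) in combinations(enumerate(vectors), 2)
        if d(u, v) <= eps
    ]


vectors = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1)]
edges = similarity_graph(vectors, eps=1)
```

Any other similarity function can be passed as `d`; the number of pairs examined stays quadratic in $n$ regardless.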

Is this possible? If not, can we compute an approximation to the graph that contains all edges of the similarity graph plus, possibly, at most $O(1)$ other edges?

ds.algorithms graph-algorithms clustering

— Ram

Should it be $\leq \epsilon$ rather than $\geq \epsilon$?

— usul

@usul Thanks for your comment :) Here, we are interested in grouping items that are highly similar. I have edited the question; I hope it is clear now.

— Ram

Sounds to me like you could use Similarity Preserving Hashing (arxiv.org/pdf/1311.7662v1.pdf) to reduce the problem dimension.
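To give a rough idea of how hashing can cut down the number of comparisons, here is a minimal sketch of bit-sampling locality-sensitive hashing for Hamming distance. Note that this is a different, simpler technique than the one in the linked paper, swapped in for illustration. The function name and the parameters `bits`, `tables`, and `seed` are assumptions made for this example.

```python
import random
from collections import defaultdict
from itertools import combinations


def lsh_candidate_edges(vectors, eps, bits=8, tables=10, seed=0):
    """Find pairs within Hamming distance eps via bit-sampling LSH.

    Each table hashes a vector to the values of `bits` randomly chosen
    coordinates (requires bits <= k). Only pairs sharing a bucket are
    compared exactly, so similar pairs can be missed, and the worst case
    (everything in one bucket) is still quadratic.
    """
    rng = random.Random(seed)
    k = len(vectors[0])
    edges = set()
    for _ in range(tables):
        idx = rng.sample(range(k), bits)  # coordinates used by this table
        buckets = defaultdict(list)
        for i, v in enumerate(vectors):
            buckets[tuple(v[j] for j in idx)].append(i)
        for members in buckets.values():
            for i, j in combinations(members, 2):
                # Verify each candidate pair with the exact distance.
                dist = sum(a != b for a, b in zip(vectors[i], vectors[j]))
                if dist <= eps:
                    edges.add((i, j))
    return edges


vectors = [(0,) * 16, (0,) * 16, (1,) * 16]
edges = lsh_candidate_edges(vectors, eps=0)
```

Because every candidate is verified, the output contains no false edges. Missed edges are the price, and more tables or fewer sampled bits reduce them at the cost of more comparisons.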

— RB

This question is not well-defined at all, please provide more details. E.g., if $d$ is given by an oracle, then you obviously cannot do better than ${n\choose 2}$.

— domotorp

Do you work for Twitter? blog.twitter.com/2014/all-pairs-similarity-via-dimsum Seriously, even detecting if there is an edge in this graph (i.e., that it's not an independent set of vertices) is going to be very hard to do faster than $O(n^2)$ for an arbitrary similarity function.

— Ryan Williams

There may be a way to shoehorn the Johnson-Lindenstrauss theorem into this problem. Essentially, J-L states that you can project high-dimensional data into a lower-dimensional space in such a way that the pairwise distances are nearly preserved. More practically, Achlioptas has a paper called "Database-friendly random projections: Johnson-Lindenstrauss with binary coins" that does this projection in a random way, which works pretty well in practice.

Now, certainly, your similarity function is not exactly the same as something that would fit into the J-L theorem. However, it looks like a distance function and perhaps some of the theory above may help.
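As a rough sketch of what such a projection might look like, the code below builds a sparse random matrix in the style of Achlioptas, with entries $\sqrt{3/m}\cdot\{+1, 0, -1\}$ drawn with probabilities $1/6, 2/3, 1/6$, and applies it to a binary vector. The function names and the target dimension `m` are assumptions made for this example. For 0/1 vectors the squared $l_2$ distance equals the Hamming distance, so this is most relevant when $d$ is one of those two.

```python
import math
import random


def achlioptas_matrix(k, m, seed=0):
    """k x m matrix with entries sqrt(3/m) * {+1, 0, -1},
    drawn with probabilities 1/6, 2/3, 1/6."""
    rng = random.Random(seed)
    scale = math.sqrt(3.0 / m)
    return [
        [scale * rng.choices((1, 0, -1), weights=(1, 4, 1))[0] for _ in range(m)]
        for _ in range(k)
    ]


def project(v, R):
    """Map a 0/1 vector of length k to m reals by summing the rows of R
    at the positions where v has a 1."""
    m = len(R[0])
    out = [0.0] * m
    for bit, row in zip(v, R):
        if bit:
            for j in range(m):
                out[j] += row[j]
    return out


R = achlioptas_matrix(k=20, m=5)
low_dim = project((1,) + (0,) * 19, R)
```

This only lowers the cost of each comparison from $O(k)$ to $O(m)$; the number of pairs to compare is unchanged, so it would have to be combined with something else to beat $O(n^2)$.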

— wyer33