Johnson–Lindenstrauss lemma
In mathematics, the Johnson–Lindenstrauss lemma is a result named after William B. Johnson and Joram Lindenstrauss concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. In the classical proof of the lemma, the embedding is a random orthogonal projection.
The lemma has applications in compressed sensing, manifold learning, dimensionality reduction, graph embedding, and natural language processing. Much of the data stored and manipulated on computers, including text and images, can be represented as points in a high-dimensional space (see vector space model for the case of text). However, the essential algorithms for working with such data tend to become bogged down very quickly as dimension increases.[1] It is therefore desirable to reduce the dimensionality of the data in a way that preserves its relevant structure.
Statement
Given $0 < \varepsilon < 1$, a set $X$ of $N$ points in $\mathbb{R}^n$, and an integer $k > 8(\ln N)/\varepsilon^2$,[2] there is a linear map $f : \mathbb{R}^n \to \mathbb{R}^k$ such that

$$(1 - \varepsilon)\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \varepsilon)\|u - v\|^2$$

for all $u, v \in X$.

The formula can be rearranged:

$$(1 + \varepsilon)^{-1}\|f(u) - f(v)\|^2 \le \|u - v\|^2 \le (1 - \varepsilon)^{-1}\|f(u) - f(v)\|^2.$$

Alternatively, for any $\varepsilon \in (0, 1)$ and any sufficiently large integer $k$,[Note 1] there exists a linear function $f : \mathbb{R}^n \to \mathbb{R}^k$ such that the restriction $f|_X$ is $(1 + \varepsilon)$-bi-Lipschitz.[Note 2]

Also, the lemma is tight up to a constant factor, i.e. there exists a set of points of size $N$ that needs dimension

$$\Omega\!\left(\frac{\log N}{\varepsilon^2}\right)$$

in order to preserve the distances between all pairs of points within a factor of $(1 \pm \varepsilon)$.[3][4]
The classical proof of the lemma takes $f$ to be a suitable scalar multiple of the orthogonal projection $P$ onto a random subspace of dimension $k$ in $\mathbb{R}^n$. An orthogonal projection collapses some dimensions of the space it is applied to, which reduces the length of all vectors, as well as the distances between vectors in the space. Under the conditions of the lemma, concentration of measure ensures there is a nonzero chance that a random orthogonal projection reduces pairwise distances between all points in $X$ by roughly a constant factor $c$. Since the chance is nonzero, such projections must exist, so we can choose one such $P$ and set $f(v) = Pv/c$.

To obtain the projection algorithmically, it suffices, with high probability, to repeatedly sample orthogonal projection matrices at random: since each sample succeeds with nonnegligible probability, a suitable projection is found in randomized polynomial time.
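The following Python sketch illustrates this rejection-sampling procedure (our own illustration, not reference code; the helper names sample_orthogonal_map and max_distortion are not standard): project onto a random $k$-dimensional subspace, rescale by $\sqrt{n/k}$, and resample until every pairwise squared distance is preserved up to a factor $1 \pm \varepsilon$.

```python
# A minimal sketch (our illustration) of the sampling idea described above:
# project onto a random k-dimensional subspace, rescale by sqrt(n/k), and
# resample until all pairwise squared distances are preserved to within 1 +/- eps.
import numpy as np

def sample_orthogonal_map(n, k, rng):
    """Map to a random k-dimensional subspace of R^n (orthonormal rows, via QR)."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))   # Q: n x k, orthonormal columns
    return np.sqrt(n / k) * Q.T                        # k x n; f(v) = F @ v

def max_distortion(F, X):
    """Worst-case | ||F(u - v)||^2 / ||u - v||^2 - 1 | over all pairs in X."""
    worst = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            d = X[i] - X[j]
            worst = max(worst, abs(np.sum((F @ d) ** 2) / np.sum(d ** 2) - 1))
    return worst

rng = np.random.default_rng(0)
N, n, eps = 50, 1000, 0.5
k = int(8 * np.log(N) / eps**2) + 1       # target dimension from the statement
X = rng.standard_normal((N, n))           # N arbitrary points in R^n

F = sample_orthogonal_map(n, k, rng)
while max_distortion(F, X) > eps:         # "roll the dice" until the map works for X
    F = sample_orthogonal_map(n, k, rng)
print(k, max_distortion(F, X))
```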
Proof
The following proof is based on.[5]
Construct a random $k \times n$ matrix $A$ by sampling each entry from the standard normal distribution, and define $f(x) := \frac{1}{\sqrt{k}} A x$. Then, for any nonzero vector $x \in \mathbb{R}^n$, let the projected vector be $y := f(x)$. A standard geometric argument shows that $k\|y\|^2/\|x\|^2$ is chi-square distributed with $k$ degrees of freedom, that is, $k\|y\|^2/\|x\|^2 \sim \chi^2(k)$. Thus, it satisfies a concentration inequality

$$\Pr\Big[\,\Big|\frac{\|y\|^2}{\|x\|^2} - 1\Big| > \varepsilon\,\Big] \le 2 e^{-(\varepsilon^2 - \varepsilon^3)k/4}.$$

By the union bound, the probability that this relation holds simultaneously for all $x = u - v$ with $u, v \in X$ is greater than $1 - N^2 e^{-(\varepsilon^2 - \varepsilon^3)k/4} = 1 - e^{2\ln N - (\varepsilon^2 - \varepsilon^3)k/4}$.

When $k > \frac{8\ln N}{\varepsilon^2 - \varepsilon^3}$, the probability is nonzero.

More generally, when $k \ge \frac{8\ln N + 4\ln(1/\delta)}{\varepsilon^2 - \varepsilon^3}$, the probability of success is at least $1 - \delta$, allowing arbitrarily high probability of success per sample, and a fortiori polynomial random time.
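The concentration can also be checked numerically. The sketch below is an illustration of the construction just described (not part of the proof): it estimates the tail probability for a fixed vector and compares it with the quoted bound.

```python
# Numerical illustration (not part of the proof) of the concentration used above:
# with A a k x n matrix of i.i.d. standard gaussians and f(x) = A x / sqrt(k),
# the quantity k ||f(x)||^2 / ||x||^2 is chi-square with k degrees of freedom,
# so ||f(x)||^2 / ||x||^2 concentrates around 1.
import numpy as np

rng = np.random.default_rng(1)
n, k, eps, trials = 500, 200, 0.2, 2000

x = rng.standard_normal(n)                # any fixed nonzero vector
ratios = np.empty(trials)
for t in range(trials):
    A = rng.standard_normal((k, n))
    y = A @ x / np.sqrt(k)                # y = f(x)
    ratios[t] = np.dot(y, y) / np.dot(x, x)

empirical_tail = np.mean(np.abs(ratios - 1) > eps)
quoted_bound = 2 * np.exp(-(eps**2 - eps**3) * k / 4)
print(f"empirical tail {empirical_tail:.4f} <= bound {quoted_bound:.4f}")
```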
Alternate statement
A related lemma is the distributional JL lemma. This lemma states that for any $0 < \varepsilon, \delta < 1/2$ and positive integer $d$, there exists a distribution over $\mathbb{R}^{k \times d}$ from which the matrix $A$ is drawn such that for $k = O(\varepsilon^{-2}\log(1/\delta))$ and for any unit-length vector $x \in \mathbb{R}^d$, the claim below holds.[6]

$$\Pr\big[\,\big|\|Ax\|_2^2 - 1\big| > \varepsilon\,\big] < \delta$$

One can obtain the JL lemma from the distributional version by setting $x = \frac{u - v}{\|u - v\|_2}$ and $\delta < 1/N^2$ for some pair $u, v$ both in $X$. Then the JL lemma follows by a union bound over all such pairs.
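For completeness, the union-bound step can be written out as follows (a routine verification in the notation above; the shorthand $x_{u,v}$ is ours):

```latex
% Union bound over all pairs of X, with delta < 1/N^2 and x_{u,v} = (u-v)/||u-v||_2.
\begin{align*}
\Pr\Big[\exists\, u \neq v \in X :\ \big|\|A(u-v)\|_2^2 - \|u-v\|_2^2\big| > \varepsilon \|u-v\|_2^2 \Big]
  &\le \sum_{\{u,v\} \subseteq X} \Pr\Big[\big|\|A x_{u,v}\|_2^2 - 1\big| > \varepsilon \Big] \\
  &\le \binom{N}{2}\,\delta \;<\; \frac{N^2}{2}\cdot\frac{1}{N^2} \;=\; \frac{1}{2}.
\end{align*}
% Hence with probability at least 1/2 the matrix A preserves all pairwise distances
% of X up to 1 +/- epsilon, with k = O(eps^{-2} log N) rows.
```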
Sparse JL transform
Database-friendly JL transform
Achlioptas (2003)[7] proposed a "database-friendly" JL transform, using matrices with entries drawn only from $\{-1, 0, +1\}$.
Theorem (Achlioptas, 2003, Theorem 1.1)—Let the random $k \times d$ projection matrix $R$ have entries $R_{ij}$ drawn i.i.d., either from

$$R_{ij} = \begin{cases} +1 & \text{with probability } 1/2, \\ -1 & \text{with probability } 1/2, \end{cases}$$

or from

$$R_{ij} = \begin{cases} +\sqrt{3} & \text{with probability } 1/6, \\ 0 & \text{with probability } 2/3, \\ -\sqrt{3} & \text{with probability } 1/6. \end{cases}$$

Given a vector $x \in \mathbb{R}^d$, we define the random projection $f(x) := \frac{1}{\sqrt{k}} R x$. Then for any vector $x$, we have

$$\Pr\Big[\,\big|\|f(x)\|^2 - \|x\|^2\big| > \varepsilon \|x\|^2\,\Big] \le 2 e^{-(\varepsilon^2/2 - \varepsilon^3/3)\,k/2}.$$
Fix some unit vector $x \in \mathbb{R}^d$. Define $Q_i := (Rx)_i = \sum_j R_{ij} x_j$, so that $\|f(x)\|^2 = \frac{1}{k}\sum_{i=1}^k Q_i^2$. We have $\mathbb{E}[Q_i^2] = \|x\|^2 = 1$.

Now, since the $Q_i^2$ are i.i.d., we want to apply a Chernoff concentration bound for $\frac{1}{k}\sum_i Q_i^2$ around 1. This requires upper-bounding the cumulant generating function (CGF).

Moment bounds (Achlioptas, 2003, Section 6)—For every integer $m \ge 0$, the $m$-th moment of $Q_i$ is upper-bounded by that of the standard gaussian $g \sim N(0, 1)$:

$$\mathbb{E}[Q_i^m] \le \mathbb{E}[g^m].$$
Proof
The case of odd $m$ is easy: $\mathbb{E}[Q_i^m] = 0$ when $m$ is odd, since each term in the expansion of the power decomposes into a product of expectations, one of which is the expectation of an odd power of a symmetric random variable (a Rademacher variable in the $\pm 1$ case), which is zero. The odd moments of $g$ vanish for the same reason, so the inequality holds trivially.

Now, the trick is that we can rewrite $\mathbb{E}[g^{2m}]$ as $\mathbb{E}\big[\big(\sum_j x_j g_j\big)^{2m}\big]$, where the $g_j$ are i.i.d. standard gaussians; this is legitimate because $\sum_j x_j g_j \sim N(0, \|x\|^2) = N(0, 1)$. Then we need to compare

$$\mathbb{E}\Big[\Big(\sum_j x_j R_{ij}\Big)^{2m}\Big] \quad\text{and}\quad \mathbb{E}\Big[\Big(\sum_j x_j g_j\Big)^{2m}\Big].$$

In the top sum, each term decomposes into a product of expectations of powers of the entries $R_{ij}$, times a monomial in the $x_j$. The product of expectations is zero unless the indices are paired off, i.e. every index appears an even number of times. In that case, the monomial in the $x_j$ is the square of something, and so the term is nonnegative, while the corresponding term in the bottom sum is also a (nonnegative) multiple of the square of the same monomial. In the bottom sum, we run a similar argument with each such term, but in this case, since $\mathbb{E}[R_{ij}^{2a}] = 1 \le (2a - 1)!! = \mathbb{E}[g^{2a}]$ for the $\pm 1$ distribution, we find that in each case the term of the bottom sum is at least as large as the corresponding term of the top sum. And so, summing all of them up, $\mathbb{E}[Q_i^{2m}] \le \mathbb{E}[g^{2m}]$.

The same argument works for the other case. Specifically, if $R_{ij}$ takes the values $\pm\sqrt{3}$ with probability $1/6$ each and $0$ with probability $2/3$, then $\mathbb{E}[R_{ij}^{2a}] = 3^{a-1} \le (2a - 1)!! = \mathbb{E}[g^{2a}]$, and the proof goes through exactly the same way.
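For concreteness, here is the lowest nontrivial case $m = 2$ for the $\pm 1$ distribution, worked out explicitly (an illustrative check, not part of the cited proof):

```latex
% Fourth moments (m = 2), for the +/-1 distribution and a unit vector x.
\begin{align*}
\mathbb{E}\Big[\Big(\textstyle\sum_j x_j R_{ij}\Big)^{4}\Big]
  &= \sum_j x_j^4\,\mathbb{E}[R_{ij}^4]
     + 3\sum_{p \neq q} x_p^2 x_q^2\,\mathbb{E}[R_{ip}^2]\,\mathbb{E}[R_{iq}^2]
   = \sum_j x_j^4 + 3\sum_{p \neq q} x_p^2 x_q^2 \\
  &\le 3\sum_j x_j^4 + 3\sum_{p \neq q} x_p^2 x_q^2
   = 3\Big(\sum_j x_j^2\Big)^{2} = 3 = \mathbb{E}[g^4].
\end{align*}
```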
Now that the moments of $Q_i$ are dominated by those of the standard gaussian, and $\mathbb{E}[Q_i^2] = 1$, it remains to perform a Chernoff bound for $\frac{1}{k}\sum_i Q_i^2$, which requires bounding the cumulant generating function on both ends.
The rest of the calculation
Proof
For any $0 \le t < 1/2$, we can bound the moment generating function using the moment bounds:

$$\mathbb{E}\big[e^{t Q_i^2}\big] = \sum_{m \ge 0} \frac{t^m\,\mathbb{E}[Q_i^{2m}]}{m!} \le \sum_{m \ge 0} \frac{t^m\,\mathbb{E}[g^{2m}]}{m!} = \mathbb{E}\big[e^{t g^2}\big] = \frac{1}{\sqrt{1 - 2t}}.$$

Similarly, for any $0 \le t < 1/2$, by independence of the $Q_i$,

$$\mathbb{E}\Big[e^{t \sum_i Q_i^2}\Big] = \prod_{i=1}^k \mathbb{E}\big[e^{t Q_i^2}\big] \le (1 - 2t)^{-k/2}.$$

So by the standard Chernoff bound method, for any $0 \le t < 1/2$ and any $\varepsilon \in (0, 1)$,

$$\Pr\Big[\sum_i Q_i^2 \ge (1 + \varepsilon)k\Big] \le e^{-t(1+\varepsilon)k}\,\mathbb{E}\Big[e^{t\sum_i Q_i^2}\Big] \le (1 - 2t)^{-k/2}\,e^{-t(1+\varepsilon)k}.$$

The right side is minimized at $t = \frac{\varepsilon}{2(1+\varepsilon)}$, at which point we have

$$\Pr\Big[\sum_i Q_i^2 \ge (1 + \varepsilon)k\Big] \le \big((1+\varepsilon)e^{-\varepsilon}\big)^{k/2} \le e^{-(\varepsilon^2/2 - \varepsilon^3/3)\,k/2},$$

using $\ln(1 + \varepsilon) \le \varepsilon - \varepsilon^2/2 + \varepsilon^3/3$.
That’s one half of the bound done. For the other half, begin with some $t \ge 0$, and expand the exponential to the second order: since $e^{-u} \le 1 - u + \frac{u^2}{2}$ for $u \ge 0$,

$$\mathbb{E}\big[e^{-t Q_i^2}\big] \le 1 - t\,\mathbb{E}[Q_i^2] + \frac{t^2}{2}\,\mathbb{E}[Q_i^4] \le 1 - t + \frac{3t^2}{2} \le e^{-t + 3t^2/2},$$

using $\mathbb{E}[Q_i^4] \le \mathbb{E}[g^4] = 3$.

So by the standard Chernoff bound method, for any $t \ge 0$ and any $\varepsilon \in (0, 1)$,

$$\Pr\Big[\sum_i Q_i^2 \le (1 - \varepsilon)k\Big] \le e^{t(1-\varepsilon)k}\,\mathbb{E}\Big[e^{-t\sum_i Q_i^2}\Big] \le e^{t(1-\varepsilon)k}\,e^{(-t + 3t^2/2)k} = e^{(3t^2/2 - t\varepsilon)k}.$$

Plug in $t = \varepsilon/3$ and simplify: the right side is $e^{-k\varepsilon^2/6}$. Together with the bound for the upper tail, this yields the two-sided concentration stated in the theorem, up to the constant in the exponent.
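The construction itself is straightforward to implement. The sketch below is our illustration (not code from [7]): it samples both distributions from the theorem and applies the projection $f(x) = Rx/\sqrt{k}$.

```python
# Achlioptas-style "database-friendly" projections (our illustration): entries
# are +/-1, or sqrt(3) * {+1, 0, -1} with probabilities 1/6, 2/3, 1/6, and the
# projection is f(x) = R x / sqrt(k).
import numpy as np

def database_friendly_matrix(k, d, sparse, rng):
    if sparse:
        return np.sqrt(3) * rng.choice([-1.0, 0.0, 1.0], size=(k, d),
                                       p=[1 / 6, 2 / 3, 1 / 6])
    return rng.choice([-1.0, 1.0], size=(k, d))

rng = np.random.default_rng(2)
d, k = 10_000, 300
x = rng.standard_normal(d)

for sparse in (False, True):
    R = database_friendly_matrix(k, d, sparse, rng)
    y = R @ x / np.sqrt(k)
    print(sparse, np.dot(y, y) / np.dot(x, x))   # both ratios are close to 1
```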
Sparser JL transform on well-spread vectors
Matoušek (2008)[8] proposed a variant of the above JL transform that is even more sparsified, though it only works on "well-spread" vectors.
Theorem (Matoušek 2008, Theorem 4.1)—Fix $\varepsilon, \delta \in (0, 1/2)$ and a spread parameter $\alpha$, and define the sparsity $q := C_1 \alpha^2 \log\frac{1}{\varepsilon\delta}$ and the target dimension $k := C_2\,\varepsilon^{-2}\log\frac{1}{\delta}$, where $C_1, C_2$ are absolute constants.

Let $S$ be a $k \times d$ matrix sampled i.i.d. with

$$S_{ij} = \begin{cases} +1/\sqrt{q} & \text{with probability } q/2, \\ 0 & \text{with probability } 1 - q, \\ -1/\sqrt{q} & \text{with probability } q/2. \end{cases}$$

Then, for any unit vector $x \in \mathbb{R}^d$ such that $\|x\|_\infty \le \alpha$, we have

$$\Pr\big[\,\big|\|y\|^2 - 1\big| > \varepsilon\,\big] \le \delta,$$

where $y := \frac{1}{\sqrt{k}} S x$.
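A minimal sketch of this sparse construction (our illustration; the sparsity value q = 0.05 is an arbitrary choice, not the value prescribed by the theorem) applied to a well-spread unit vector:

```python
# Sparse sign matrix in the spirit of the theorem above (our illustration):
# entries are +/- 1/sqrt(q) with probability q/2 each and 0 otherwise, applied
# to a unit vector with small l_infinity norm.  q = 0.05 is arbitrary.
import numpy as np

rng = np.random.default_rng(3)
d, k, q = 20_000, 400, 0.05

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                    # "well-spread": ||x||_inf is O(sqrt(log d / d))

S = rng.choice([-1.0, 0.0, 1.0], size=(k, d), p=[q / 2, 1 - q, q / 2]) / np.sqrt(q)
y = S @ x / np.sqrt(k)
print(np.linalg.norm(x, np.inf), np.dot(y, y))   # squared norm of y is close to 1
```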
The above cases are generalized to matrices with independent, mean-zero, unit-variance, subgaussian entries in (Dirksen, 2016).[9]
Speeding up the JL transform
Given a $k \times d$ matrix $A$, computing the matrix–vector product $Ax$ takes $O(kd)$ time. There has been some work in deriving distributions for which the matrix–vector product can be computed in less than $O(kd)$ time.
There are two major lines of work. The first, the Fast Johnson–Lindenstrauss Transform (FJLT),[10] was introduced by Ailon and Chazelle in 2006. This method allows the computation of the matrix–vector product in just $O(d\log d + k^{2+\gamma})$ time for any constant $\gamma > 0$.
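To illustrate the structure of such fast transforms, the sketch below implements the closely related subsampled randomized Hadamard transform (random signs, a Walsh–Hadamard transform, then a random subset of coordinates); it is a stand-in for, not a faithful reproduction of, Ailon and Chazelle's FJLT, whose final step uses a sparse Gaussian matrix.

```python
# Subsampled randomized Hadamard transform, a relative of the FJLT (illustration
# only): y = sqrt(d/k) * P H D x, with D a random diagonal sign matrix, H the
# normalized Walsh-Hadamard transform, and P a random choice of k coordinates.
# H D x can be computed in O(d log d) time; H is formed explicitly here only for
# simplicity, a real implementation would use a fast Walsh-Hadamard transform.
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(4)
d, k = 1024, 128                          # d must be a power of two for hadamard()

x = rng.standard_normal(d)
signs = rng.choice([-1.0, 1.0], size=d)   # the diagonal of D
H = hadamard(d) / np.sqrt(d)              # orthogonal
rows = rng.choice(d, size=k, replace=False)

y = np.sqrt(d / k) * (H @ (signs * x))[rows]
print(np.dot(y, y) / np.dot(x, x))        # close to 1
```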
Another approach is to build a distribution supported over matrices that are sparse.[11] This method allows keeping only an $\varepsilon$ fraction of the entries in the matrix, which means the computation can be done in just $O(kd\varepsilon)$ time. Furthermore, if the vector has only $b$ non-zero entries, the sparse JL takes time $O(kb\varepsilon)$, which may be much less than the $d\log d$ time used by the fast JL.
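A sketch in the same spirit (our illustration with arbitrary parameters, not the exact construction of [11]): a $k \times d$ matrix with $s$ nonzero entries of magnitude $1/\sqrt{s}$ per column, applied to a vector with few nonzero entries.

```python
# A sparse JL map in the spirit of [11] (our illustration): each column of the
# k x d matrix has s nonzero entries equal to +/- 1/sqrt(s), so applying it
# costs O(s * nnz(x)) rather than O(k * d).
import numpy as np
from scipy.sparse import csc_matrix

rng = np.random.default_rng(5)
d, k, s = 20_000, 200, 8

rows, cols, vals = [], [], []
for j in range(d):                        # s random rows and signs per column
    rows.extend(rng.choice(k, size=s, replace=False))
    cols.extend([j] * s)
    vals.extend(rng.choice([-1.0, 1.0], size=s) / np.sqrt(s))
S = csc_matrix((vals, (rows, cols)), shape=(k, d))

x = np.zeros(d)
support = rng.choice(d, size=200, replace=False)
x[support] = rng.standard_normal(200)     # a vector with few nonzero entries

y = S @ x                                 # cost scales with nnz(x), not with d
print(np.dot(y, y) / np.dot(x, x))        # close to 1
```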
Tensorized random projections
It is possible to combine two JL matrices by taking the so-called face-splitting product, defined as the tensor product of corresponding rows (proposed by V. Slyusar[12] in 1996[13][14][15][16][17] for radar and digital antenna array applications). More directly, let $A \in \mathbb{R}^{k \times n}$ and $B \in \mathbb{R}^{k \times m}$ be two matrices. Then the face-splitting product is[13][14][15][16][17]

$$A \bullet B = \begin{bmatrix} A_1 \otimes B_1 \\ \vdots \\ A_k \otimes B_k \end{bmatrix},$$

where $A_i$ and $B_i$ denote the $i$-th rows of $A$ and $B$.
This idea of tensorization was used by Kasiviswanathan et al. for differential privacy.[18]
JL matrices defined like this use fewer random bits, and can be applied quickly to vectors that have tensor structure, due to the following identity:[15]

$$(A \bullet B)(x \otimes y) = (Ax) \circ (By),$$

where $\circ$ is the element-wise (Hadamard) product. Such computations have been used to efficiently compute polynomial kernels and many other linear-algebra algorithms[clarification needed].[19]
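The identity is easy to verify numerically (an illustrative check):

```python
# Numerical check (illustration) of the identity
# (A face-split B)(x kron y) = (A x) hadamard (B y).
import numpy as np

rng = np.random.default_rng(6)
k, n, m = 5, 3, 4
A = rng.standard_normal((k, n))
B = rng.standard_normal((k, m))
x = rng.standard_normal(n)
y = rng.standard_normal(m)

# Face-splitting product: the i-th row is the Kronecker product of the i-th rows.
face_split = np.vstack([np.kron(A[i], B[i]) for i in range(k)])   # shape (k, n*m)

lhs = face_split @ np.kron(x, y)
rhs = (A @ x) * (B @ y)                   # element-wise (Hadamard) product
print(np.allclose(lhs, rhs))              # True
```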
In 2020[20] it was shown that if the matrices $C_1, C_2, \dots, C_c$ are independent $\pm 1$ or Gaussian matrices, the combined matrix $C_1 \bullet \cdots \bullet C_c$ satisfies the distributional JL lemma if the number of rows is at least

$$O\!\left(\varepsilon^{-2}\log\tfrac{1}{\delta} + \varepsilon^{-1}\Big(\tfrac{1}{c}\log\tfrac{1}{\delta}\Big)^{c}\right).$$

For large $\varepsilon$ this is as good as the completely random Johnson–Lindenstrauss transform, but a matching lower bound in the same paper shows that this exponential dependency on $(\log\tfrac{1}{\delta})^{c}$ is necessary. Alternative JL constructions are suggested to circumvent this.
See also
Notes
References
Further reading