hopkins_statistic
Compute the Hopkins statistic to assess clustering tendency.
Main entry points are the hopkins and hopkins_test functions.
Installation
pip install hopkins-statistic
Usage
import numpy as np
from hopkins_statistic import hopkins
rng = np.random.default_rng(42)
# Simple clustered example: two Gaussian blobs
centers = np.array([[0, 0], [0, 1]])
labels = rng.integers(len(centers), size=100)
X = centers[labels] + rng.normal(scale=0.1, size=(100, 2))
statistic = hopkins(X, rng=rng)
print(f"{statistic:.3f}")
#> 0.771
Background
The Hopkins statistic is a test statistic for the null hypothesis of complete spatial randomness (CSR), i.e., that points are independently and uniformly distributed within a region of space (the sampling frame). Under this null, the expected value of the statistic is 0.5. Larger values indicate more clustering than expected under CSR, while smaller values indicate more regular spacing. Thus, the statistic is often used as a scalar measure of clustering tendency.
Definition
As noted by Wright (2022), the definition of the Hopkins statistic is a common source of confusion in both literature and software implementations. This library defaults to the formulation by Cross and Jain (1982), which generalizes the original definition by Hopkins and Skellam (1954) to data in any dimension:
Given a set $X$ of $n$ data points in a $d$-dimensional Euclidean space, choose $m$ such that $m \ll n$ and let
- $\lbrace x_i \rbrace_{i=1}^m$ be a simple random sample from $X$ (without replacement), and
- $\lbrace y_i \rbrace_{i=1}^m$ be points placed uniformly at random in the sampling frame.
For each $i \in \lbrace 1,\dots,m \rbrace$, let
- $u_i$ be the distance from $y_i$ to its nearest neighbor in $X$, and
- $w_i$ be the distance from $x_i$ to its nearest neighbor in $X \setminus \lbrace x_i \rbrace$.
Then the Hopkins statistic is defined as
$$ H = \frac{\sum_{i=1}^m u_i^d}{\sum_{i=1}^m u_i^d + \sum_{i=1}^m w_i^d}. $$
Under the CSR null hypothesis, $H \sim \mathrm{Beta}(m,m)$.
Other implementations may follow Lawson and Jurs (1990) by not raising distances to the power of $d$, or may return $1 - H$ instead of $H$.
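The definition above can be sketched as a brute-force reference in pure Python. This is an illustration only, not the library's implementation: `hopkins_reference` is a hypothetical helper that always uses the bounding box as the sampling frame, has no toroidal option, and does not handle the degenerate case where all distances are zero.

```python
import math
import random


def hopkins_reference(points, m, seed=None):
    """Brute-force Hopkins statistic with the bounding box as sampling frame."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])

    # Sampling frame: axis-aligned bounding box of the data.
    lower = [min(p[k] for p in points) for k in range(d)]
    upper = [max(p[k] for p in points) for k in range(d)]

    def nearest(q, skip=None):
        # Distance from q to its nearest data point, ignoring index `skip`.
        return min(math.dist(q, p) for j, p in enumerate(points) if j != skip)

    # w_i: sampled data points (without replacement) to their nearest other point.
    w = [nearest(points[i], skip=i) for i in rng.sample(range(n), m)]

    # u_i: uniformly placed points to their nearest data point.
    probes = [
        tuple(rng.uniform(lo, hi) for lo, hi in zip(lower, upper))
        for _ in range(m)
    ]
    u = [nearest(y) for y in probes]

    sum_u = sum(dist**d for dist in u)
    sum_w = sum(dist**d for dist in w)
    return sum_u / (sum_u + sum_w)


gen = random.Random(0)
# Two tight Gaussian blobs versus uniformly scattered points.
clustered = [
    (gen.gauss(cx, 0.02), gen.gauss(cy, 0.02))
    for cx, cy in ((0.0, 0.0), (0.0, 1.0))
    for _ in range(100)
]
uniform = [(gen.random(), gen.random()) for _ in range(200)]

h_clustered = hopkins_reference(clustered, m=20, seed=1)
h_uniform = hopkins_reference(uniform, m=20, seed=1)
print(f"clustered: {h_clustered:.3f}, uniform: {h_uniform:.3f}")
```

The clustered data should score close to 1, while the uniform data should land somewhere around 0.5.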
Interpretation
While critical values can be obtained from the $\mathrm{Beta}(m,m)$ null distribution, the table below lists commonly used rules of thumb for interpreting $H$.
| $H$ | Pattern | Interpretation |
|---|---|---|
| $\ge 0.7$ | clustered | Suggests a departure from CSR toward clustering. |
| $\approx 0.5$ | random | Consistent with complete spatial randomness (CSR). |
| $\le 0.3$ | regular | Suggests a departure from CSR toward more even spacing. |
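As a rough illustration of how critical values follow from the $\mathrm{Beta}(m,m)$ null distribution, the sketch below computes them with the standard library only. The helpers `beta_mm_cdf` and `beta_mm_quantile` are hypothetical names, not part of this package. The CDF uses the binomial-sum identity for the regularized incomplete beta function, which holds for integer $m$.

```python
from math import comb


def beta_mm_cdf(x, m):
    """CDF of Beta(m, m) for integer m, via the binomial-sum identity."""
    n = 2 * m - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(m, n + 1))


def beta_mm_quantile(q, m, tol=1e-10):
    """Invert the CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_mm_cdf(mid, m) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# One-sided 5% critical values for a few sample sizes.
for m in (10, 20, 50):
    low = beta_mm_quantile(0.05, m)
    high = beta_mm_quantile(0.95, m)
    print(f"m={m}: regular if H < {low:.3f}, clustered if H > {high:.3f}")
```

Because the null distribution narrows as $m$ grows, the critical values move toward 0.5 for larger samples, so fixed thresholds such as 0.7 and 0.3 are only approximate guides.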
Guidelines
- Euclidean distances on non-spatial data often benefit from scaling features in `X` to comparable ranges.
- The sample size `m` should typically be at least 10 to avoid small-sample problems and no more than about one tenth of $n$ to keep the null-distribution approximations accurate.
- The sampling frame defaults to the axis-aligned bounding box of `X`. A known rectangular frame can be specified using the `frame` parameter. If the underlying sampling frame is not aligned with the coordinate axes, `X` may be transformed beforehand.
- To mitigate edge effects, e.g., when events may also occur outside the sampling frame, periodic boundary conditions can be applied with `toroidal=True`. Alternatively, buffer zones can be used by specifying a `frame` smaller than the full extent of `X`.
- The exponent `power` applied to distances defaults to $d$, the number of columns in `X`. This yields the statistic as defined above. Other values alter the null distribution.
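To make the idea of periodic boundary conditions concrete, here is a minimal sketch of a wrap-around Euclidean distance in a rectangular frame. `toroidal_distance` is a hypothetical helper written for illustration; how the library computes toroidal distances internally is not shown here and may differ.

```python
import math


def toroidal_distance(a, b, lower, upper):
    """Euclidean distance between a and b with wrap-around at the frame edges."""
    total = 0.0
    for x, y, lo, hi in zip(a, b, lower, upper):
        side = hi - lo
        delta = abs(x - y) % side
        # Going "around" the frame may be shorter than going across it.
        delta = min(delta, side - delta)
        total += delta**2
    return math.sqrt(total)


# Two points near opposite edges of the unit square.
d_wrapped = toroidal_distance((0.05, 0.5), (0.95, 0.5), (0, 0), (1, 1))
d_plain = math.dist((0.05, 0.5), (0.95, 0.5))
print(d_wrapped, d_plain)
```

With wrap-around, the two points are about 0.1 apart rather than 0.9, so points near an edge are no longer treated as artificially isolated.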
References
Cross, G. R., & Jain, A. K. (1982). Measurement of clustering tendency. In Theory and Application of Digital Control (pp. 315–320). Pergamon. https://doi.org/10.1016/S1474-6670(17)63365-2
Hopkins, B., & Skellam, J. G. (1954). A new method for determining the type of distribution of plant individuals. Annals of Botany, 18(2), 213–227. https://doi.org/10.1093/oxfordjournals.aob.a083391
Lawson, R. G., & Jurs, P. C. (1990). New index for clustering tendency and its application to chemical problems. Journal of Chemical Information and Computer Sciences, 30(1), 36–41. https://doi.org/10.1021/ci00065a010
Wright, K. (2022). Will the Real Hopkins Statistic Please Stand Up? The R Journal, 14(3), 282–292. https://doi.org/10.32614/rj-2022-055
API Documentation
Compute the Hopkins statistic.
The Hopkins statistic measures clustering tendency by comparing nearest-neighbor distances of sampled data points with those of points placed uniformly at random in the sampling frame.
Arguments:
- X: Array-like of shape `(n, d)`, with `n >= 3` observations in `d >= 1` dimensions. Must contain only finite real values.
- m: Sample size, or its fraction of the `n_in` points in the `frame`.
  - If int, this must satisfy `1 <= m <= n_in`.
  - If float, this must satisfy `0 < m <= 1`, and the sample size is `ceil(m * n_in)`.
- frame: Area sampling frame. Must be one of:
  - Literal `bbox` to use the axis-aligned bounding box of `X`, or
  - Pair `(lower, upper)` defining the bounds of a rectangular sampling frame. Both must be broadcastable to shape `(d,)`. While data points outside a given frame are ignored during sampling, they can still be nearest neighbors.
- toroidal: If True, compute distances with periodic boundary conditions.
- power: Exponent applied to Euclidean distances. Defaults to `d`. Must be positive and finite.
- rng: Random number generator or seed passed to `numpy.random.default_rng`. Specify for repeatable behavior.
Returns:
The Hopkins statistic, a value between 0 and 1 (NaN if undefined).
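The rules for `m` described above can be sketched as follows. `resolve_sample_size` is a hypothetical helper used only to illustrate the documented int and float cases; the library's own validation and error messages are not shown in this document and may differ.

```python
import math


def resolve_sample_size(m, n_in):
    """Turn m (an int count or a float fraction) into a concrete sample size."""
    if isinstance(m, int):
        if not 1 <= m <= n_in:
            raise ValueError(f"int m must satisfy 1 <= m <= {n_in}")
        return m
    if not 0 < m <= 1:
        raise ValueError("float m must satisfy 0 < m <= 1")
    # A fraction is rounded up, so any positive fraction samples at least one point.
    return math.ceil(m * n_in)


print(resolve_sample_size(10, 95))   # count used as-is
print(resolve_sample_size(0.1, 95))  # fraction of the 95 points in the frame
```

Here `0.1 * 95 = 9.5` is rounded up, so both calls give a sample size of 10.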
Perform a Hopkins test.
The Hopkins test tests the null hypothesis of complete spatial randomness (CSR) by comparing the observed Hopkins statistic to its Beta(m, m) null distribution.
Arguments:
- X: Array-like of shape `(n, d)`, with `n >= 3` observations in `d >= 1` dimensions. Must contain only finite real values.
- m: Sample size, or its fraction of the `n_in` points in the `frame`.
  - If int, this must satisfy `1 <= m <= n_in`.
  - If float, this must satisfy `0 < m <= 1`, and the sample size is `ceil(m * n_in)`.
- frame: Area sampling frame. Must be one of:
  - Literal `bbox` to use the axis-aligned bounding box of `X`, or
  - Pair `(lower, upper)` defining the bounds of a rectangular sampling frame. Both must be broadcastable to shape `(d,)`. While data points outside a given frame are ignored during sampling, they can still be nearest neighbors.
- toroidal: If True, compute distances with periodic boundary conditions.
- alternative: Alternative hypothesis of departure from CSR toward more `clustered` or `regular` data, or in either direction: `two-sided`.
- rng: Random number generator or seed passed to `numpy.random.default_rng`. Specify for repeatable behavior.
Returns:
The result of the Hopkins test (statistic and p-value).
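For intuition about how a p-value relates to the Beta(m, m) null distribution, the following standard-library sketch maps an observed statistic to a p-value for each alternative. `hopkins_p_value` and `beta_mm_cdf` are hypothetical helpers, and the two-sided rule (doubling the smaller tail) is an assumption; the library's exact computation may differ.

```python
from math import comb


def beta_mm_cdf(x, m):
    """CDF of Beta(m, m) for integer m, via the binomial-sum identity."""
    n = 2 * m - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(m, n + 1))


def hopkins_p_value(h, m, alternative="clustered"):
    """P-value of an observed Hopkins statistic h under the CSR null."""
    cdf = beta_mm_cdf(h, m)
    if alternative == "clustered":
        return 1 - cdf  # large H is evidence of clustering
    if alternative == "regular":
        return cdf  # small H is evidence of regular spacing
    if alternative == "two-sided":
        # Assumption: double the smaller tail, capped at 1.
        return min(1.0, 2 * min(cdf, 1 - cdf))
    raise ValueError(f"unknown alternative: {alternative!r}")


for alt in ("clustered", "regular", "two-sided"):
    print(alt, hopkins_p_value(0.75, m=10, alternative=alt))
```

A statistic well above 0.5 yields a small p-value against the clustered alternative and a large one against the regular alternative.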
Result of a Hopkins test.