Background
The Hopkins statistic is a test statistic for the null hypothesis of complete spatial randomness (CSR), i.e., that points are independently and uniformly distributed within a region of space (the sampling frame). Under this null, the expected value of the statistic is 0.5. Larger values indicate more clustering than expected under CSR, while smaller values indicate more regular spacing. Thus, the statistic is often used as a scalar measure of clustering tendency.
Definition
As noted by Wright (2022), the definition of the Hopkins statistic is a common source of confusion in both the literature and software implementations. This library defaults to the formulation by Cross and Jain (1982), which generalizes the original definition by Hopkins and Skellam (1954) to data in any dimension.
Given a set \(X\) of \(n\) data points in a \(d\)-dimensional Euclidean space, choose \(m\) such that \(m \ll n\) and let
- \(\lbrace x_i \rbrace_{i=1}^m\) be a simple random sample from \(X\) (without replacement), and
- \(\lbrace y_i \rbrace_{i=1}^m\) be points placed uniformly at random in the sampling frame.
For each \(i \in \lbrace 1,\dots,m \rbrace\), let
- \(u_i\) be the distance from \(y_i\) to its nearest neighbor in \(X\), and
- \(w_i\) be the distance from \(x_i\) to its nearest neighbor in \(X \setminus \lbrace x_i \rbrace\).
Then the Hopkins statistic is defined as
\[ H = \frac{\sum_{i=1}^m u_i^d}{\sum_{i=1}^m u_i^d + \sum_{i=1}^m w_i^d}. \]
Under the CSR null hypothesis, \(H \sim \mathrm{Beta}(m,m)\).
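The computation can be sketched in plain Python. This is a minimal, brute-force illustration of the Cross and Jain (1982) formulation, i.e. the ratio of the summed \(u_i^d\) to the summed \(u_i^d\) plus the summed \(w_i^d\), and not this library's implementation. The function name `hopkins_sketch` and its signature are hypothetical, and the sampling frame is assumed to be the axis-aligned bounding box of the data.

```python
import math
import random


def hopkins_sketch(X, m, power=None, seed=None):
    """Illustrative Hopkins statistic (hypothetical helper, brute-force search).

    X is a list of n points, each a list of d coordinates.
    The sampling frame is assumed to be the bounding box of X.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    if not 0 < m < n:
        raise ValueError("m must satisfy 0 < m < n")
    power = d if power is None else power  # default exponent is the dimension d

    # Axis-aligned bounding box of X serves as the sampling frame.
    lows = [min(p[k] for p in X) for k in range(d)]
    highs = [max(p[k] for p in X) for k in range(d)]

    # x_i: simple random sample of m data points (indices, without replacement).
    sample_idx = rng.sample(range(n), m)
    # y_i: m points placed uniformly at random in the frame.
    Y = [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)] for _ in range(m)]

    # u_i: distance from y_i to its nearest neighbor in X.
    u = [min(math.dist(y, p) for p in X) for y in Y]
    # w_i: distance from x_i to its nearest neighbor in X, excluding x_i itself.
    w = [
        min(math.dist(X[i], X[j]) for j in range(n) if j != i)
        for i in sample_idx
    ]

    su = sum(ui ** power for ui in u)
    sw = sum(wi ** power for wi in w)
    return su / (su + sw)


# Usage: a regular 20 x 20 grid should give a value well below 0.5.
grid = [[float(i), float(j)] for i in range(20) for j in range(20)]
print(hopkins_sketch(grid, m=20, seed=1))
```

The brute-force nearest-neighbor search costs on the order of \(m \cdot n\) distance evaluations, which is fine for illustration; a spatial index would be the usual choice for large \(n\).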
Note
Other implementations may follow Lawson and Jurs (1990) by not raising distances to the power of \(d\), or may return \(1 - H\) instead of \(H\).
Interpretation
While critical values can be obtained from the \(\mathrm{Beta}(m,m)\) null distribution, the table below lists commonly used rules of thumb for interpreting \(H\).
| \(H\) | Pattern | Interpretation |
|---|---|---|
| \(\ge 0.7\) | clustered | Suggests a departure from CSR toward clustering. |
| \(\approx 0.5\) | random | Consistent with complete spatial randomness (CSR). |
| \(\le 0.3\) | regular | Suggests a departure from CSR toward more even spacing. |
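As an alternative to the rules of thumb, tail probabilities and critical values can be computed directly from the \(\mathrm{Beta}(m,m)\) null distribution. The sketch below uses only the standard library and relies on the identity that, for integer parameters, the Beta CDF equals a binomial tail sum. The helper names `beta_mm_cdf` and `upper_critical_value` are hypothetical and not part of this library.

```python
import math


def beta_mm_cdf(h, m):
    """P(H <= h) for H ~ Beta(m, m) with integer m >= 1.

    Uses the binomial-sum form of the regularized incomplete beta function,
    which is valid for integer shape parameters.
    """
    if not 0.0 <= h <= 1.0:
        raise ValueError("h must lie in [0, 1]")
    n = 2 * m - 1
    return sum(
        math.comb(n, j) * h ** j * (1.0 - h) ** (n - j)
        for j in range(m, n + 1)
    )


def upper_critical_value(m, alpha=0.05, tol=1e-10):
    """Smallest h with P(H <= h) >= 1 - alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_mm_cdf(mid, m) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Usage: one-sided p-value for clustering, given an observed H and sample size m.
h_observed, m = 0.75, 10
p_clustered = 1.0 - beta_mm_cdf(h_observed, m)
print(p_clustered, upper_critical_value(m))
```

A one-sided test for regular spacing would use `beta_mm_cdf(h_observed, m)` as the p-value instead; by symmetry of \(\mathrm{Beta}(m,m)\), the lower critical value is one minus the upper one.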
Guidelines
- Euclidean distances on non-spatial data often benefit from scaling features in `X` to comparable ranges.
- The sample size `m` should typically be at least 10 to avoid small-sample problems and no more than about one tenth of \(n\) to keep the null-distribution approximations accurate.
- The sampling frame defaults to the axis-aligned bounding box of `X`. A known rectangular frame can be specified using the `frame` parameter. If the underlying sampling frame is not aligned with the coordinate axes, `frame` can be set to `'hull'` to use the convex hull of `X`.
- To mitigate edge effects, e.g., when events may also occur outside the sampling frame, periodic boundary conditions can be applied with `toroidal=True`. Alternatively, buffer zones can be used by specifying a `frame` smaller than the full extent of `X`.
- The exponent `power` applied to distances defaults to \(d\), the number of columns in `X`. This yields the statistic as defined above. Other values alter the null distribution.
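Two of the guidelines above can be illustrated with small standalone helpers: rescaling features to comparable ranges, and measuring distances under periodic boundary conditions in a rectangular frame. Both functions are hypothetical sketches written for this page; they assume min-max scaling and a rectangular frame given by per-axis bounds, and say nothing about how the library handles scaling or `toroidal=True` internally.

```python
import math


def min_max_scale(X):
    """Rescale each column of X to [0, 1]; constant columns map to 0."""
    d = len(X[0])
    lows = [min(p[k] for p in X) for k in range(d)]
    spans = [max(p[k] for p in X) - lows[k] for k in range(d)]
    return [
        [(p[k] - lows[k]) / spans[k] if spans[k] > 0 else 0.0 for k in range(d)]
        for p in X
    ]


def toroidal_distance(a, b, lows, highs):
    """Euclidean distance with wrap-around along every axis of the frame."""
    total = 0.0
    for ak, bk, lo, hi in zip(a, b, lows, highs):
        side = hi - lo
        delta = abs(ak - bk) % side
        # The shorter way around the torus along this axis.
        delta = min(delta, side - delta)
        total += delta * delta
    return math.sqrt(total)


# Usage: points near opposite edges of the unit square are close on the torus.
print(toroidal_distance([0.1, 0.5], [0.9, 0.5], lows=[0.0, 0.0], highs=[1.0, 1.0]))
print(min_max_scale([[0, 10], [5, 20], [10, 30]]))
```

With wrap-around distances, a random point near the edge of the frame can find its nearest neighbor across the opposite edge, so its distance is not inflated by the boundary.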
References
- Cross, G. R., & Jain, A. K. (1982). Measurement of clustering tendency. In Theory and Application of Digital Control (pp. 315–320). Pergamon. https://doi.org/10.1016/S1474-6670(17)63365-2
- Hopkins, B., & Skellam, J. G. (1954). A new method for determining the type of distribution of plant individuals. Annals of Botany, 18(2), 213–227. https://doi.org/10.1093/oxfordjournals.aob.a083391
- Lawson, R. G., & Jurs, P. C. (1990). New index for clustering tendency and its application to chemical problems. Journal of Chemical Information and Computer Sciences, 30(1), 36–41. https://doi.org/10.1021/ci00065a010
- Wright, K. (2022). Will the Real Hopkins Statistic Please Stand Up? The R Journal, 14(3), 282–292. https://doi.org/10.32614/rj-2022-055