hopkins_statistic

Compute the Hopkins statistic to assess clustering tendency.

The main entry points are the hopkins and hopkins_test functions.

Installation

pip install hopkins-statistic

Usage

import numpy as np
from hopkins_statistic import hopkins

rng = np.random.default_rng(42)

# Simple clustered example: two Gaussian blobs
centers = np.array([[0, 0], [0, 1]])
labels = rng.integers(len(centers), size=100)
X = centers[labels] + rng.normal(scale=0.1, size=(100, 2))

statistic = hopkins(X, rng=rng)
print(f"{statistic:.3f}")
#> 0.771

Background

The Hopkins statistic is a test statistic for the null hypothesis of complete spatial randomness (CSR), i.e., that points are independently and uniformly distributed within a region of space (the sampling frame). Under this null, the expected value of the statistic is 0.5. Larger values indicate more clustering than expected under CSR, while smaller values indicate more regular spacing. Thus, the statistic is often used as a scalar measure of clustering tendency.

Definition

As noted by Wright (2022), the definition of the Hopkins statistic is a common source of confusion in both the literature and software implementations. This library defaults to the formulation by Cross and Jain (1982), which generalizes the original definition by Hopkins and Skellam (1954) to data in any dimension:

Given a set $X$ of $n$ data points in a $d$-dimensional Euclidean space, choose $m$ such that $m \ll n$ and let

  • $\lbrace x_i \rbrace_{i=1}^m$ be a simple random sample from $X$ (without replacement), and
  • $\lbrace y_i \rbrace_{i=1}^m$ be points placed uniformly at random in the sampling frame.

For each $i \in \lbrace 1,\dots,m \rbrace$, let

  • $u_i$ be the distance from $y_i$ to its nearest neighbor in $X$, and
  • $w_i$ be the distance from $x_i$ to its nearest neighbor in $X \setminus \lbrace x_i \rbrace$.

Then the Hopkins statistic is defined as

$$ H = \frac{\sum_{i=1}^m u_i^d}{\sum_{i=1}^m u_i^d + \sum_{i=1}^m w_i^d}. $$

Under the CSR null hypothesis, $H \sim \mathrm{Beta}(m,m)$.
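The definition above can be sketched directly in NumPy. This is a minimal brute-force illustration, not the library's implementation: the name hopkins_sketch is hypothetical, the sampling frame is assumed to be the axis-aligned bounding box of X, and all pairwise distances are computed explicitly, which is only practical for small n.

```python
import numpy as np


def hopkins_sketch(X, m, rng):
    """Hopkins statistic per the definition above (brute force, bounding-box frame)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape

    # x_i: simple random sample of m data points, without replacement
    idx = rng.choice(n, size=m, replace=False)

    # y_i: m points placed uniformly at random in the bounding box of X
    Y = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))

    # u_i: distance from y_i to its nearest neighbor in X
    u = np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=2).min(axis=1)

    # w_i: distance from x_i to its nearest neighbor in X without x_i
    dist = np.linalg.norm(X[idx, None, :] - X[None, :, :], axis=2)
    dist[np.arange(m), idx] = np.inf  # exclude each sampled point itself
    w = dist.min(axis=1)

    u_d, w_d = np.sum(u**d), np.sum(w**d)
    return u_d / (u_d + w_d)


# Two tight, well-separated blobs: H should be close to 1.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
X = centers[rng.integers(2, size=200)] + rng.normal(scale=0.01, size=(200, 2))
H = hopkins_sketch(X, m=20, rng=rng)
```

Because the uniformly placed points mostly land far from both blobs while each sampled data point has a very close neighbor, the $u_i^d$ terms dominate the denominator and H approaches 1.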

Note

Other implementations may follow Lawson and Jurs (1990) by not raising distances to the power of $d$, or may return $1 - H$ instead of $H$.

Interpretation

While critical values can be obtained from the $\mathrm{Beta}(m,m)$ null distribution, the table below lists commonly used rules of thumb for interpreting $H$.

| $H$ | Pattern | Interpretation |
|---|---|---|
| $\ge 0.7$ | clustered | Suggests a departure from CSR toward clustering. |
| $\approx 0.5$ | random | Consistent with complete spatial randomness (CSR). |
| $\le 0.3$ | regular | Suggests a departure from CSR toward more even spacing. |
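For comparison with these rules of thumb, critical values can be approximated by sampling from the $\mathrm{Beta}(m,m)$ null distribution. The sketch below is a Monte Carlo approximation under assumed settings: the function name hopkins_critical_values, the number of draws, and the seed are illustrative choices and not part of the library.

```python
import numpy as np


def hopkins_critical_values(m, alpha=0.05, n_draws=200_000, seed=0):
    """Approximate one-sided critical values of H under CSR by simulation."""
    rng = np.random.default_rng(seed)
    draws = rng.beta(m, m, size=n_draws)  # samples from the Beta(m, m) null
    lower, upper = np.quantile(draws, [alpha, 1 - alpha])
    return float(lower), float(upper)


# H above `upper` suggests clustering, H below `lower` suggests regularity,
# each at the one-sided level alpha.
lower, upper = hopkins_critical_values(m=10)
```

Since $\mathrm{Beta}(m,m)$ is symmetric about 0.5, the two critical values are mirror images of each other up to simulation error, and both move toward 0.5 as $m$ grows.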

Guidelines

  • For non-spatial data, scaling the features in X to comparable ranges often helps, so that no single feature dominates the Euclidean distances.

  • The sample size m should typically be at least 10 to avoid small-sample problems and no more than about one tenth of $n$ to keep the null-distribution approximations accurate.

  • The sampling frame defaults to the axis-aligned bounding box of X. A known rectangular frame can be specified using the frame parameter. If the underlying sampling frame is not aligned with the coordinate axes, X may be transformed beforehand.

  • To mitigate edge effects, e.g., when events may also occur outside the sampling frame, periodic boundary conditions can be applied with toroidal=True. Alternatively, buffer zones can be used by specifying a frame smaller than the full extent of X.

  • The exponent power applied to distances defaults to $d$, the number of columns in X. This yields the statistic as defined above. Other values alter the null distribution.
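To illustrate the periodic boundary conditions mentioned above, the following sketch computes Euclidean distances on a torus using the minimum-image convention. How the library implements toroidal=True is not documented here, so this is only one plausible reading; the helper name toroidal_distances is hypothetical, and all points are assumed to lie inside the rectangular frame.

```python
import numpy as np


def toroidal_distances(a, B, lower, upper):
    """Euclidean distances from point a to each row of B on a torus.

    Assumes a and all rows of B lie within the frame [lower, upper].
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    side = upper - lower  # side lengths of the frame

    delta = np.abs(np.asarray(B, dtype=float) - np.asarray(a, dtype=float))
    # Along each axis, take the shorter of the direct and wrapped-around gap.
    delta = np.minimum(delta, side - delta)
    return np.sqrt((delta**2).sum(axis=-1))


# In the unit square, points near opposite edges are close on the torus.
dists = toroidal_distances(
    [0.05, 0.5], [[0.95, 0.5], [0.5, 0.5]], lower=[0, 0], upper=[1, 1]
)
```

Wrapping removes the inflated nearest-neighbor distances that points near the border would otherwise show, which is the edge effect the guideline refers to.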

References

API Documentation

def hopkins( X: ArrayLike, *, m: int | float = 0.1, frame: Literal['bbox'] | tuple[ArrayLike, ArrayLike] | ArrayLike = 'bbox', toroidal: bool = False, power: int | float | None = None, rng: RNGLike | SeedLike | None = None) -> float:

Compute the Hopkins statistic.

The Hopkins statistic measures clustering tendency by comparing nearest-neighbor distances of sampled data points with those of points placed uniformly at random in the sampling frame.

Arguments:
  • X: Array-like of shape (n, d), with n >= 3 observations in d >= 1 dimensions. Must contain only finite real values.
  • m: Sample size, given either as an absolute count or as a fraction of the n_in data points inside the frame.
    • If int, this must satisfy 1 <= m <= n_in.
    • If float, this must satisfy 0 < m <= 1, and the sample size is ceil(m * n_in).
  • frame: Area sampling frame. Must be one of:
    • The literal 'bbox', to use the axis-aligned bounding box of X, or
    • Pair (lower, upper) defining the bounds of a rectangular sampling frame. Both must be broadcastable to shape (d,). While data points outside a given frame are ignored during sampling, they can still be nearest neighbors.
  • toroidal: If True, compute distances with periodic boundary conditions.
  • power: Exponent applied to Euclidean distances. Defaults to d. Must be positive and finite.
  • rng: Random number generator or seed passed to numpy.random.default_rng. Specify for repeatable behavior.
Returns:

The Hopkins statistic, a value between 0 and 1 (NaN if undefined).
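The rules for m and frame described in the arguments above can be sketched as follows. Both helpers are hypothetical and only mirror the documented constraints; in particular, treating points exactly on the frame boundary as inside is an assumption.

```python
import math

import numpy as np


def count_in_frame(X, lower, upper):
    """Number of data points inside the rectangular frame (n_in).

    Assumption: points exactly on the boundary count as inside.
    """
    X = np.asarray(X, dtype=float)
    inside = np.all((X >= lower) & (X <= upper), axis=1)
    return int(inside.sum())


def resolve_sample_size(m, n_in):
    """Turn m (absolute count or fraction of n_in) into a sample size."""
    if isinstance(m, float):
        if not 0 < m <= 1:
            raise ValueError("a float m must satisfy 0 < m <= 1")
        return math.ceil(m * n_in)
    if not 1 <= m <= n_in:
        raise ValueError("an int m must satisfy 1 <= m <= n_in")
    return int(m)


points = [[0.5, 0.5], [2.0, 2.0], [0.1, 0.9]]
n_in = count_in_frame(points, lower=[0, 0], upper=[1, 1])
sample_size = resolve_sample_size(0.5, n_in)
```

Note that only the n_in points inside the frame are eligible for sampling, while points outside it can still serve as nearest neighbors, as stated for the frame argument.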

def hopkins_test( X: ArrayLike, *, m: int | float = 0.1, frame: Literal['bbox'] | tuple[ArrayLike, ArrayLike] | ArrayLike = 'bbox', toroidal: bool = False, alternative: Literal['clustered', 'regular', 'two-sided'] = 'clustered', rng: RNGLike | SeedLike | None = None) -> HopkinsTestResult:

Perform a Hopkins test.

The Hopkins test assesses the null hypothesis of complete spatial randomness (CSR) by comparing the observed Hopkins statistic to its Beta(m, m) null distribution.

Arguments:
  • X: Array-like of shape (n, d), with n >= 3 observations in d >= 1 dimensions. Must contain only finite real values.
  • m: Sample size, given either as an absolute count or as a fraction of the n_in data points inside the frame.
    • If int, this must satisfy 1 <= m <= n_in.
    • If float, this must satisfy 0 < m <= 1, and the sample size is ceil(m * n_in).
  • frame: Area sampling frame. Must be one of:
    • The literal 'bbox', to use the axis-aligned bounding box of X, or
    • Pair (lower, upper) defining the bounds of a rectangular sampling frame. Both must be broadcastable to shape (d,). While data points outside a given frame are ignored during sampling, they can still be nearest neighbors.
  • toroidal: If True, compute distances with periodic boundary conditions.
  • alternative: Alternative hypothesis. Use 'clustered' or 'regular' for a departure from CSR toward more clustered or more regular data, respectively, or 'two-sided' for a departure in either direction.
  • rng: Random number generator or seed passed to numpy.random.default_rng. Specify for repeatable behavior.
Returns:

The result of the Hopkins test (statistic and p-value).
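As a rough illustration of how a p-value could follow from the Beta(m, m) null distribution, the sketch below evaluates the CDF exactly for integer m via the binomial-sum identity for the regularized incomplete beta function. The function names are hypothetical, and the two-sided rule (twice the smaller tail, capped at 1) is an assumed convention; the library may define it differently.

```python
from math import comb


def beta_mm_cdf(h, m):
    """CDF of Beta(m, m) at h, for a positive integer m.

    Uses I_h(a, b) = sum_{j=a}^{a+b-1} C(a+b-1, j) h^j (1-h)^(a+b-1-j)
    with a = b = m.
    """
    n = 2 * m - 1
    return sum(comb(n, j) * h**j * (1 - h) ** (n - j) for j in range(m, n + 1))


def hopkins_pvalue(h, m, alternative="clustered"):
    """p-value of an observed statistic h under the Beta(m, m) null."""
    cdf = beta_mm_cdf(h, m)
    if alternative == "clustered":  # large H is evidence against CSR
        return 1 - cdf
    if alternative == "regular":  # small H is evidence against CSR
        return cdf
    if alternative == "two-sided":  # assumed convention: twice the smaller tail
        return min(1.0, 2 * min(cdf, 1 - cdf))
    raise ValueError(f"unknown alternative: {alternative!r}")


p_clustered = hopkins_pvalue(0.771, m=10)
p_regular = hopkins_pvalue(0.771, m=10, alternative="regular")
```

The 'clustered' and 'regular' p-values are complementary, so a statistic that is strong evidence for clustering is, by construction, no evidence for regularity.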

class HopkinsTestResult(typing.NamedTuple):

Result of a Hopkins test.

statistic: float

The Hopkins statistic.

pvalue: float

The p-value associated with the given alternative.