The sampling distribution represents an empirical distribution based on observed samples. It is useful for bootstrapping, representing posterior distributions from Markov Chain Monte Carlo (MCMC) algorithms, or working with any empirical data where the parametric form is unknown. Unlike parametric distributions, the sampling distribution makes no assumptions about the underlying data-generating process and instead uses the sample itself to estimate distributional properties. The distribution can handle both univariate and multivariate samples.
dist_sample(x)
We recommend reading this documentation on pkgdown which renders math nicely. https://pkg.mitchelloharawild.com/distributional/reference/dist_sample.html
In the following, let \(X\) be a random variable with sample \(x_1, x_2, \ldots, x_n\) of size \(n\).
Support: The observed range of the sample
Mean (univariate):
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$
Mean (multivariate): Computed independently for each variable.
Variance (univariate):
$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$
Covariance (multivariate): The sample covariance matrix.
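For a multivariate sample stored as a matrix with one column per variable, the two multivariate moments above correspond to standard base R calls. The following is a minimal sketch, assuming a small made-up matrix `m`; it illustrates the definitions and is not taken from the package's implementation.

```r
# Hypothetical 4 x 2 sample: rows are observations, columns are variables.
m <- cbind(x = c(1, 2, 3, 4), y = c(2, 4, 6, 9))

# Mean vector: each column is averaged independently.
mean_vec <- colMeans(m)

# Sample covariance matrix (n - 1 denominator, matching the univariate variance).
cov_mat <- stats::cov(m)

mean_vec
cov_mat
```

The diagonal of `cov_mat` holds the per-variable sample variances, and the off-diagonal entries hold the pairwise sample covariances.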
Skewness (univariate):
$$ g_1 = \frac{\sqrt{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{3/2}} \left(1 - \frac{1}{n}\right)^{3/2} $$
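The univariate moment formulas above can be checked by hand in base R. This is a sketch on a made-up sample `x`, written directly from the formulas rather than from the package source; the variable names are illustrative only.

```r
# Hypothetical right-skewed sample.
x <- c(1, 2, 3, 4, 10)
n <- length(x)
d <- x - mean(x)            # deviations from the sample mean

x_bar <- sum(x) / n         # mean
s2 <- sum(d^2) / (n - 1)    # variance with the n - 1 denominator

# Skewness exactly as written in the formula above.
skew <- sqrt(n) * sum(d^3) / sum(d^2)^(3 / 2) * (1 - 1 / n)^(3 / 2)

# Algebraically equivalent form: third central moment over the cubed sample sd.
skew_alt <- (sum(d^3) / n) / stats::sd(x)^3

c(mean = x_bar, variance = s2, skewness = skew)
```

The factor \((1 - 1/n)^{3/2}\) is what converts the moment-based ratio into one scaled by the sample standard deviation, which is why `skew` and `skew_alt` agree.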
Probability density function: Approximated numerically using kernel density estimation.
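As a rough illustration of a kernel density approximation, the sketch below uses `stats::density()` with its default kernel and bandwidth and interpolates the estimate at requested points. The kernel and bandwidth that `dist_sample()` actually uses are not stated here, so these defaults and the helper name `dens_at` are assumptions.

```r
set.seed(1)
x <- rnorm(100)                      # hypothetical sample

# Kernel density estimate on a grid (default Gaussian kernel and bandwidth).
kde <- stats::density(x)

# Interpolate the estimate at arbitrary points; return 0 outside the grid.
dens_at <- function(q) {
  stats::approx(kde$x, kde$y, xout = q, yleft = 0, yright = 0)$y
}

dens_at(c(0, 1, 50))
```

A point far outside the observed range, such as 50 here, gets a density of 0, which is the same behavior seen in the examples when the density of a sample centered at 10 is evaluated at 1.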
Cumulative distribution function (univariate):
$$ F(q) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \leq q) $$
where \(I(\cdot)\) is the indicator function.
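The empirical CDF is just the proportion of observations at or below `q`. A minimal base R sketch on a made-up sample, with the hypothetical helper `emp_cdf` compared against `stats::ecdf()`:

```r
x <- c(1, 2, 3, 4, 10)               # hypothetical sample

# Proportion of observations less than or equal to q.
emp_cdf <- function(q, x) mean(x <= q)

emp_cdf(3, x)        # 3 of the 5 observations are <= 3
stats::ecdf(x)(3)    # base R step function built from the same definition
```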
Cumulative distribution function (multivariate):
$$ F(\mathbf{q}) = \frac{1}{n} \sum_{i=1}^{n} I(\mathbf{x}_i \leq \mathbf{q}) $$
where the inequality is applied element-wise.
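For the multivariate formula, an observation counts only if every one of its components is at or below the corresponding component of \(\mathbf{q}\). The sketch below implements that formula literally for a made-up matrix; `emp_cdf_mv` is a hypothetical helper and not necessarily how the package computes the value.

```r
# Hypothetical 4 x 2 sample: rows are observations.
m <- cbind(c(1, 2, 3, 4), c(4, 3, 2, 1))

# Proportion of rows whose components are all <= the matching component of q.
emp_cdf_mv <- function(x, q) {
  mean(apply(x, 1, function(row) all(row <= q)))
}

emp_cdf_mv(m, c(2, 3))   # only the row (2, 3) satisfies both inequalities
```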
Quantile function (univariate): The sample quantile, computed using the specified quantile type (see stats::quantile()).
Quantile function (multivariate): Marginal quantiles are computed independently for each variable.
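The choice of quantile type matters for small samples. The sketch below calls `stats::quantile()` directly on made-up data to show two of its types and a column-wise (marginal) computation; how the type is passed through the distribution object is not shown here.

```r
x <- c(1, 2, 3, 4, 10)               # hypothetical sample

# Type 7 (the stats::quantile() default) interpolates between order statistics.
q7 <- stats::quantile(x, probs = 0.4, type = 7)

# Type 1 inverts the empirical CDF and always returns an observed value.
q1 <- stats::quantile(x, probs = 0.4, type = 1)

# Marginal quantiles: apply the same computation to each column separately.
m <- cbind(x = c(1, 2, 3, 4, 10), y = c(50, 10, 40, 20, 30))
marginal <- apply(m, 2, stats::quantile, probs = 0.4, type = 7)

q7
q1
marginal
```

Because the marginal quantiles are computed per column, the resulting point need not correspond to any observed row of the sample.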
Random generation: Bootstrap sampling with replacement from the empirical sample.
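Bootstrap generation amounts to drawing observations from the sample with replacement. The base R sketch below uses made-up data; for the matrix case it assumes whole rows are resampled so that each draw stays an observed combination of variables, which is consistent with the repeated rows in the example output but is not confirmed from the package source.

```r
set.seed(42)

# Univariate: resample observed values with replacement.
x <- c(1, 2, 3, 4, 10)
draws <- sample(x, size = 10, replace = TRUE)

# Multivariate (assumed behavior): resample row indices, keep rows intact.
m <- cbind(x = c(1, 2, 3), y = c(10, 20, 30))
idx <- sample(nrow(m), size = 10, replace = TRUE)
draws_mv <- m[idx, , drop = FALSE]

draws
draws_mv
```

Every generated value is one that already appears in the sample, so the draws never fall outside the observed range.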
# Univariate numeric samples
dist <- dist_sample(x = list(rnorm(100), rnorm(100, 10)))
dist
#> <distribution[2]>
#> [1] sample[100] sample[100]
mean(dist)
#> [1] -0.00212688 10.05092587
variance(dist)
#> [1] 1.0439251 0.8202521
skewness(dist)
#> [1] 0.009655314 0.112929608
generate(dist, 10)
#> [[1]]
#>  [1] -1.32180897 -0.06133107 -0.68370077 -0.40519219  2.58014889 -1.15764579
#>  [7]  1.90577393 -0.59244911  0.40793869 -0.13567761
#>
#> [[2]]
#>  [1] 10.439941 10.560629 11.524216 12.256750  9.748285 10.794906 11.274653
#>  [8] 10.006444  9.822368  9.888097
#>
density(dist, 1)
#> [1] 0.2493053 0.0000000
# Multivariate numeric samples
dist <- dist_sample(x = list(cbind(rnorm(100), rnorm(100, 10))))
dimnames(dist) <- c("x", "y")
dist
#> <distribution[1]>
#> [1] sample[100]
mean(dist)
#>               x        y
#> [1,] -0.2056781 10.12689
variance(dist)
#>              x          y
#> [1,]  1.015929 -0.1077450
#> [2,] -0.107745  0.7995047
generate(dist, 10)
#> [[1]]
#>                x         y
#>  [1,] -1.0705715 10.664291
#>  [2,]  1.2953307  9.903387
#>  [3,] -1.2838250 10.450325
#>  [4,] -1.0705715 10.664291
#>  [5,]  0.2545351 11.424024
#>  [6,]  0.6734014  9.745097
#>  [7,]  1.7697899  8.429418
#>  [8,] -1.4626811  9.524618
#>  [9,]  0.2383209 10.418310
#> [10,] -1.1300470 11.096636
#>
quantile(dist, 0.4) # Returns the marginal quantiles
#>               x        y
#> [1,] -0.3904258 9.798941
cdf(dist, matrix(c(0.3,9), nrow = 1))
#> [1] 0.395