
Truncation (statistics)

From Wikipedia, the free encyclopedia

In statistics, truncation results in values that are limited above or below, yielding a truncated sample.[1] A random variable y is said to be truncated from below if, for some threshold value c, the exact value of y is known for all cases y > c, but unknown for all cases y ≤ c. Similarly, truncation from above means the exact value of y is known in cases where y < c, but unknown when y ≥ c.[2]

Truncation is similar to but distinct from the concept of statistical censoring. A truncated sample can be thought of as being equivalent to an underlying sample with all values outside the bounds entirely omitted, with not even a count of those omitted being kept. With statistical censoring, a note would be recorded documenting which bound (upper or lower) had been exceeded and the value of that bound. With truncated sampling, no note is recorded.
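The distinction can be made concrete with a small sketch. The measurements and the upper bound of 10 below are invented purely for illustration; the point is only what each scheme keeps.

```python
# Hypothetical raw measurements and an assumed upper bound (illustration only).
raw = [2.5, 7.0, 11.2, 4.8, 15.9, 9.9]
upper = 10.0

# Truncation: out-of-range values vanish, and no count of them is kept.
truncated = [x for x in raw if x <= upper]

# Censoring: every case is retained; an out-of-range value is replaced by
# the bound, together with a flag noting that the bound was exceeded.
censored = [(min(x, upper), x > upper) for x in raw]

print(truncated)  # four values; nothing reveals that two were dropped
print(censored)   # six records, two of them flagged as exceeding the bound
```

An analyst holding only `truncated` cannot tell how many observations are missing, whereas `censored` still records the original sample size.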

Applications


Usually the values that insurance adjusters receive are left-truncated, right-censored, or both. For example, if policyholders are subject to a policy limit u, then any loss amount above u is reported to the insurance company as exactly u, because u is the amount the insurance company pays. The insurer knows that the actual loss is greater than u but does not know what it is. Left truncation, on the other hand, occurs when policyholders are subject to a deductible: if the deductible is d, any loss amount less than d will not even be reported to the insurance company. If a policy has both a limit of u and a deductible of d, any loss amount greater than u is reported to the insurance company as a loss of u − d, because that is the amount the insurance company has to pay. Insurance loss data is therefore left-truncated, because the insurance company does not learn of losses below the deductible d, for which policyholders make no claim. It is also right-censored when the loss is greater than u, because u is the most the insurance company will pay: the insurer knows only that the loss exceeds u, not its exact amount.
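A minimal sketch of how a deductible and a policy limit shape what the insurer observes is given below. The function name `observed_claim`, the dollar amounts, and the assumption that a loss between d and u is paid as the loss minus d are illustrative choices, not taken from any actual policy.

```python
from typing import Optional, Tuple

def observed_claim(loss: float, d: float, u: float) -> Optional[Tuple[float, bool]]:
    """Return (payment, right_censored) as seen by the insurer, or None if no claim is filed."""
    if loss < d:
        # Left truncation: the loss is never reported, so the insurer has no record of it.
        return None
    if loss > u:
        # Right censoring: the insurer pays u - d and knows only that the loss exceeded u.
        return (u - d, True)
    # Assumption for this sketch: between d and u the insurer pays loss - d.
    return (loss - d, False)

# Hypothetical losses with an assumed deductible of 500 and limit of 10000.
losses = [300.0, 1200.0, 8000.0, 25000.0]
records = [observed_claim(x, d=500.0, u=10000.0) for x in losses]
print(records)
```

The first loss produces no record at all, while the last is recorded only as a payment of u − d with a flag that the limit was exceeded.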

Probability distributions


Truncation can be applied to any probability distribution. This will usually lead to a new distribution, not one within the same family. Thus, if a random variable X has F(x) as its distribution function, the new random variable Y defined as having the distribution of X truncated to the semi-open interval (a, b] has the distribution function

$$G(y) = \frac{F(y) - F(a)}{F(b) - F(a)}$$

for y in the interval (a, b], and 0 or 1 otherwise. If truncation were to the closed interval [a, b], the distribution function would be

$$G(y) = \frac{F(y) - F(a^{-})}{F(b) - F(a^{-})}$$

for y in the interval [a, b], and 0 or 1 otherwise, where F(a−) denotes the left-hand limit of F at a.
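The semi-open case can be sketched in a few lines: the truncated distribution function is (F(y) − F(a)) / (F(b) − F(a)) on (a, b], 0 below the interval and 1 above it. The helper name `truncated_cdf` and the choice of a standard normal truncated to (−1, 2] are assumptions made for illustration.

```python
from statistics import NormalDist

def truncated_cdf(F, a, b):
    """Return G, the distribution function of X truncated to the interval (a, b]."""
    Fa, Fb = F(a), F(b)
    if Fb <= Fa:
        raise ValueError("the interval (a, b] must have positive probability")

    def G(y):
        if y <= a:
            return 0.0  # below the interval
        if y > b:
            return 1.0  # above the interval
        return (F(y) - Fa) / (Fb - Fa)

    return G

# Example: a standard normal truncated to (-1, 2] (hypothetical bounds).
G = truncated_cdf(NormalDist().cdf, -1.0, 2.0)
print(G(-1.0), G(0.0), G(2.0))
```

For a continuous F the closed-interval version gives the same function, because F has no jump at a; the two definitions differ only when X places positive probability on the point a.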

Data analysis


The analysis of data where observations are treated as being from truncated versions of standard distributions can be undertaken using maximum likelihood, where the likelihood would be derived from the distribution or density of the truncated distribution. This involves taking account of the factor F(b) − F(a) in the modified density function, which will depend on the parameters of the original distribution.
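As a hedged illustration, the sketch below fits an exponential distribution to data truncated to (a, b]. The exponential model, the bounds, the sample size, and the golden-section optimiser are all assumptions chosen to keep the example dependency-free; the term to notice is `mass`, which is the normalising probability F(b) − F(a) and changes with the rate being estimated.

```python
import math
import random

def truncated_exp_loglik(rate, data, a, b):
    """Log-likelihood of an exponential(rate) sample truncated to (a, b]."""
    n = len(data)
    # F(b) - F(a) for the exponential distribution function F(x) = 1 - exp(-rate * x).
    mass = math.exp(-rate * a) - math.exp(-rate * b)
    return n * math.log(rate) - rate * sum(data) - n * math.log(mass)

def maximize(f, lo, hi, tol=1e-8):
    """Golden-section search for the maximiser of a unimodal function on [lo, hi]."""
    g = (math.sqrt(5) - 1) / 2
    c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    while hi - lo > tol:
        if f(c) > f(d):
            hi, d = d, c
            c = hi - g * (hi - lo)
        else:
            lo, c = c, d
            d = lo + g * (hi - lo)
    return (lo + hi) / 2

def sample_truncated_exp(rate, a, b, n, rng):
    """Inverse-transform sampling: draw the survival value uniformly on (S(b), S(a)]."""
    s_a, s_b = math.exp(-rate * a), math.exp(-rate * b)
    return [-math.log(s_b + (s_a - s_b) * (1.0 - rng.random())) / rate for _ in range(n)]

# Simulated data with an assumed true rate of 0.5, truncated to (1, 6].
rng = random.Random(0)
a, b = 1.0, 6.0
data = sample_truncated_exp(0.5, a, b, 5000, rng)

rate_hat = maximize(lambda r: truncated_exp_loglik(r, data, a, b), 1e-6, 10.0)
naive = len(data) / sum(data)  # exponential MLE that ignores the truncation
print(rate_hat, naive)
```

Dropping the `mass` term yields the `naive` estimate, which treats the sample as if it came from the untruncated distribution and is therefore pulled away from the true rate.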

In practice, if the fraction truncated is very small, the effect of truncation might be ignored when analysing data. For example, it is common to use a normal distribution to model data whose values can only be positive but for which the typical range of values is well away from zero. In such cases, a truncated or censored version of the normal distribution may formally be preferable (although there would be alternatives), but the more complicated analysis would change the results very little. However, software is readily available for maximum-likelihood estimation of even moderately complicated models, such as regression models, for truncated data.[3]
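How little the results change can be checked numerically. The sketch below uses the standard formula for the mean of a normal variable conditioned on exceeding a threshold c, namely μ + σ·φ(α) / (1 − Φ(α)) with α = (c − μ) / σ; the parameter values are hypothetical.

```python
from statistics import NormalDist

def left_truncated_normal_mean(mu, sigma, c):
    """Mean of a normal(mu, sigma) variable conditioned on exceeding c."""
    std = NormalDist()
    alpha = (c - mu) / sigma
    return mu + sigma * std.pdf(alpha) / (1.0 - std.cdf(alpha))

# Typical values well away from zero: mean 10, standard deviation 2 (assumed).
fraction_far = NormalDist(10.0, 2.0).cdf(0.0)
shift_far = left_truncated_normal_mean(10.0, 2.0, 0.0) - 10.0

# Typical values close to zero: mean 1, standard deviation 2 (assumed).
fraction_near = NormalDist(1.0, 2.0).cdf(0.0)
shift_near = left_truncated_normal_mean(1.0, 2.0, 0.0) - 1.0

print(fraction_far, shift_far)
print(fraction_near, shift_near)
```

In the first case the truncated fraction and the shift in the mean are both negligible, so an ordinary normal model loses almost nothing; in the second, roughly a third of the distribution lies below zero and the mean moves by about half a standard deviation.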

In econometrics, truncated dependent variables are variables for which observations cannot be made for certain values in some range.[4] Regression models with such dependent variables require special care that properly recognizes the truncated nature of the variable. Estimation of such truncated regression models can be done in parametric,[5][6][7] or semi- and non-parametric frameworks.[8][9]
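One way to see why special care is needed is a small simulation, sketched below under invented assumptions: a linear model y = 1 + 2x + e with standard normal errors, and a sample in which only cases with y > 1 are observed. Ordinary least squares is used here only to show the problem; it is not one of the truncated-regression estimators cited above.

```python
import random

def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(1)
xs = [rng.uniform(-1.0, 1.0) for _ in range(20000)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]  # assumed true slope is 2

# Truncated sample: cases with y <= 1 are never observed (hypothetical threshold).
kept = [(x, y) for x, y in zip(xs, ys) if y > 1.0]
xs_t = [x for x, _ in kept]
ys_t = [y for _, y in kept]

slope_full = ols_slope(xs, ys)
slope_truncated = ols_slope(xs_t, ys_t)
print(slope_full, slope_truncated)
```

On the full sample the slope is recovered closely, while on the truncated sample it is biased toward zero, because low values of y are removed mostly where x is small.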


References

  1. Dodge, Y. (2003). The Oxford Dictionary of Statistical Terms. OUP. ISBN 0-19-920613-9.
  2. Breen, Richard (1996). Regression Models: Censored, Sample Selected, or Truncated Data. Quantitative Applications in the Social Sciences. Vol. 111. Thousand Oaks: Sage. pp. 2–4. ISBN 0-8039-5710-6.
  3. Wolynetz, M. S. (1979). "Maximum Likelihood Estimation in a Linear Model from Confined and Censored Normal Data". Journal of the Royal Statistical Society. Series C. 28 (2): 195–206. doi:10.2307/2346749. JSTOR 2346749.
  4. "Truncated Dependent Variables". About.com. Retrieved 2008-03-22.
  5. Amemiya, T. (1973). "Regression Analysis When the Dependent Variable is Truncated Normal". Econometrica. 41 (6): 997–1016. doi:10.2307/1914031. JSTOR 1914031.
  6. Heckman, James (1976). "The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models". Annals of Economic and Social Measurement. 5 (4): 475–492.
  7. Vancak, V.; Goldberg, Y.; Bar-Lev, S. K.; Boukai, B. (2015). "Continuous statistical models: With or without truncation parameters?". Mathematical Methods of Statistics. 24 (1): 55–73. doi:10.3103/S1066530715010044. hdl:1805/7048. S2CID 255455365.
  8. Lewbel, A.; Linton, O. (2002). "Nonparametric Censored and Truncated Regression". Econometrica. 70 (2): 765–779. doi:10.1111/1468-0262.00304. JSTOR 2692291. S2CID 120113700.
  9. Park, B. U.; Simar, L.; Zelenyuk, V. (2008). "Local Likelihood Estimation of Truncated Regression and its Partial Derivatives: Theory and Application" (PDF). Journal of Econometrics. 146 (1): 185–198. doi:10.1016/j.jeconom.2008.08.007. S2CID 55496460.