Talk:Statistics/Archive 2

This is an archive of past discussions about Statistics. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

Forums for Statistics Help?

Are there any open forums for people who help each other with statistics? Can we add here some if anyone knows any? Thanks. Towsonu2003 04:22, 10 February 2006 (UTC)

Wikipedia:Reference desk/Mathematics is very well staffed; the helpers there complain when a day goes by without a question. It's a great place to ask, and to have a conversation, but be sure to read and follow the rules at the top of the page. Asking homework questions is fine, but some people will scoff. --James S. 05:09, 10 February 2006 (UTC)

Reorg on 19 Feb 2006

Hello. I've been lurking for a while, reading the many ideas and suggestions, and now I've chosen to be bold and implement some of them. Here were my main goals:

Make the intro concise. (If past tendencies prevail, it will be made long again soon enough.) I understand that people have strong feelings about what deserves to lie above the contents. I hope that the new conceptual overview section will satisfy most. It also brings in a lot of requested topics not treated here before.
Remove the probability section. There were several "complaints" about it. Most of it was not relevant. I have preserved it here, in case anyone wants to put it into statistical literacy, etc.:

Statistics makes extensive use of the concept of probability. The probability of an event is often defined as a number between one and zero. In reality however there is virtually nothing that has a probability of 1 or 0. You could say that the sun will certainly rise in the morning, but what if an extremely unlikely event destroys the sun? What if there is a nuclear war and the sky is covered in ash and smoke?

We often round the probability of such things up or down because they are so likely or unlikely to occur, that it's easier to recognize them as a probability of one or zero.

However, this can often lead to misunderstandings and dangerous behaviour, because people are unable to distinguish between, e.g., a probability of 10⁻⁴ and a probability of 10⁻⁹, despite the very practical difference between them. If you expect to cross the road about 10⁵ or 10⁶ times in your life, then reducing your risk of being run over per road crossing to 10⁻⁹ will make it unlikely that you will be run over while crossing the road for your whole life, while a risk per road crossing of 10⁻⁴ will make it very likely that you will have an accident, despite the intuitive feeling that 0.01% is a very small risk.
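A quick numerical check of the road-crossing figures in the paragraph above, as a minimal Python sketch. It assumes that crossings are independent and that the per-crossing risk is constant; neither assumption is stated in the passage. The function name `prob_at_least_one` is made up for illustration.

```python
import math


def prob_at_least_one(p_per_event: float, n_events: int) -> float:
    """Chance of at least one occurrence in n independent events: 1 - (1 - p)^n."""
    # log1p/expm1 keep precision when p is very small.
    return -math.expm1(n_events * math.log1p(-p_per_event))


# Per-crossing risk of 10^-4 over 10^5 crossings (the "small" 0.01% risk).
high_risk = prob_at_least_one(1e-4, 100_000)
# Per-crossing risk of 10^-9 over 10^6 crossings.
low_risk = prob_at_least_one(1e-9, 1_000_000)

print(f"lifetime risk at 1e-4 per crossing: {high_risk:.5f}")
print(f"lifetime risk at 1e-9 per crossing: {low_risk:.5f}")
```

Under these assumptions the first lifetime risk is very close to 1 and the second is about one in a thousand, which matches the passage's point that the two "small" probabilities behave very differently in practice.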

Use of prior probabilities of 0 (or 1) causes problems in Bayesian statistics, since the posterior probability is then forced to be 0 (or 1) as well. In other words, the data are not taken into account at all! As Dennis Lindley puts it, if a coherent Bayesian attaches a prior probability of zero to the hypothesis that the Moon is made of green cheese, then even whole armies of astronauts coming back bearing green cheese cannot convince him. Lindley advocates never using prior probabilities of 0 or 1. He calls it Cromwell's rule, from a letter Oliver Cromwell wrote to the synod of the Church of Scotland on August 5th, 1650 in which he said "I beseech you, in the bowels of Christ, consider it possible that you are mistaken."
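The effect of a prior of exactly 0 or 1 can be seen in a minimal Python sketch of Bayes' rule for a single hypothesis. The likelihood values below are invented purely for illustration (they are not from Lindley), and `posterior` is a hypothetical helper name.

```python
def posterior(prior: float, lik_if_true: float, lik_if_false: float) -> float:
    """Bayes' rule for a hypothesis H given one observation."""
    numerator = prior * lik_if_true
    evidence = numerator + (1.0 - prior) * lik_if_false
    return numerator / evidence


# Illustrative likelihoods: the observation is far more probable if H is true.
LIK_TRUE, LIK_FALSE = 0.99, 1e-6

dogmatic = 0.0      # prior of exactly zero
sceptical = 1e-9    # tiny but non-zero prior, in the spirit of Cromwell's rule

for _ in range(3):  # three independent observations, each favouring H
    dogmatic = posterior(dogmatic, LIK_TRUE, LIK_FALSE)
    sceptical = posterior(sceptical, LIK_TRUE, LIK_FALSE)

print(dogmatic)   # stays at 0.0 however strong the evidence
print(sceptical)  # moves close to 1 after repeated strong evidence
```

The zero prior multiplies every likelihood by zero, so the data never enter the result; the tiny non-zero prior is overwhelmed after a few observations.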

And here is what I ripped out of the intro:

Key concepts and terms of statistics assume probability theory; among the terms are: population, sample, sampling, sampling unit and probability. Warning: systems are known to science that violate probability theory empirically.

Once data has been collected, either through a formal sampling procedure or by recording responses to treatments in an experimental setting (cf. experimental design), or by repeatedly observing a process over time (time series), graphical and numerical summaries may be obtained using descriptive statistics.

Patterns in the data are modeled to draw inferences about the larger population, using inferential statistics to account for randomness and uncertainty in the observations. These inferences may take the form of answers to essentially yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), prediction of future observations, descriptions of association (correlation), or modeling of relationships (regression).

The framework described above is sometimes referred to as applied statistics. In contrast, mathematical statistics (or simply statistical theory) is the subdiscipline of applied mathematics which uses probability theory and analysis to place statistical practice on a firm theoretical basis.

I know the links need to be cleaned up, and some of what I added in the conceptual section is probably wrong. I am not a statistician, so I won't be surprised when it is ripped apart. Some things, such as the "empirical violation of probability theory", I left in despite not understanding. Cheers, Joshuardavis 22:59, 19 February 2006 (UTC)

Thanks for taking the initiative. The article has needed a revamp for a long time.

Your first sentence said that statistics is a "branch of mathematics". Since I believe many statisticians feel that statistics is no longer simply a part of mathematics, I changed this. Nrcprm2026 changed it back, saying "give math some credit". But saying statistics is a branch of mathematics gives math too much credit. (Maybe a wayward child would be a better analogy?) It gives the computational and cognitive aspects of statistics short shrift.

I'll quote the first sentence from Moore and Cobb (2000), Statistics and Mathematics: Tension and Cooperation, American Mathematical Monthly, pp. 615-630. (Moore is a past president of the American Statistical Association.)

It has become a truism, at least among statisticians, that while statistics is a mathematical science, it is not a subfield of mathematics.

The rest of this paper's lead paragraph gives further support to my belief, as do the reasons listed here. -- Avenue 13:28, 20 February 2006 (UTC)

In case you wanted a response: This is all fine by me. I did not write the passage you refer to (except for changing "refers to" to "is"). Your view of statistics distinct from math agrees with everything I have ever heard. In fact, I think I'll reword the Historical context section to make this more forceful. Joshuardavis 19:35, 20 February 2006 (UTC)

Presentation of data/Rates

An anonymous poster put up a section "Presentation of data" with a single subsection on "Rates", mainly death rates per 100,000 people. It is rather detailed and focused on death rates, and the definition of specific death rate is identical to that of the crude death rate. It has not been improved upon. I move that it be deleted, and thus the Presentation of data section with it. Do the knowledgeable editors around here think that such sections should exist? If so, what should go in them? Joshuardavis 16:45, 13 March 2006 (UTC)

I agree that the current section on Presentation of Data has little or no value. Deletion is justified, in my opinion. I wouldn't object to some general coverage of data presentation techniques and problems, perhaps as a subsection of Statistical methods. For example, we could cover the pros and cons of tabular vs graphical display, an overview of common graph types and how to use them well (and/or show problems with their usage), and perhaps some more general discussion of how most data summaries hide more than they reveal. At least some of this would probably be too detailed for the main article on Statistics, but it could easily be shifted to more specific articles if necessary. -- Avenue 20:06, 13 March 2006 (UTC)

I'm deleting it from the article and putting it right here:

Rates are the percentages that are based on a particular population. Rates will be based on the same figure, usually per 100,000 populations.

The most widely used rates are the death or mortality rate. Rates are divided into crude death rates and specific death rates.

The crude death rates are the number of deaths that occur in a given year per 100,000 persons in the entire population. Crude death rates are often used on an international level as a comparison between countries. These rates are also used by population specialists to determine growth rates.

Crude death rate = (Number of deaths X 100,000) ÷ Total Population

Specific death rates are the number of deaths that occur in a given year per 100,000 persons in the entire population.

Specific death rate = (Number of deaths for a specific population X 100,000) ÷ Total Population

Joshua Davis 23:37, 11 April 2006 (UTC)
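For reference, the two formulas quoted above can be worked through in a minimal Python sketch. All figures are hypothetical. The last calculation is not from the removed text: it assumes the usual demographic convention, in which a specific death rate is divided by the population of the subgroup itself rather than by the total population.

```python
def rate_per_100k(deaths: int, population: int) -> float:
    """Deaths per 100,000 persons in the given population."""
    return deaths * 100_000 / population


# Hypothetical figures for one year.
total_population = 2_000_000
total_deaths = 16_000
group_population = 300_000   # e.g. one age group
group_deaths = 12_000

# Crude death rate, as quoted: all deaths over the total population.
crude = rate_per_100k(total_deaths, total_population)

# Specific death rate exactly as quoted: group deaths over the TOTAL population.
specific_as_quoted = rate_per_100k(group_deaths, total_population)

# Assumed conventional form: group deaths over the GROUP's own population.
specific_conventional = rate_per_100k(group_deaths, group_population)

print(crude, specific_as_quoted, specific_conventional)
```

With these made-up numbers the quoted formula and the conventional one differ by a large factor, which shows how much the choice of denominator matters.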

wikistatistics

Are there any wikis for collecting raw statistical data?

Do you mean a wiki in which users post raw data from experiments from various scientific disciplines? I know that some specific scientific subfields maintain community repositories of data, but a wiki doesn't seem appropriate to me for such a database. Joshua Davis 23:15, 11 April 2006 (UTC)

I am looking for a community repository of data: one which follows a standard form and is open to peer review. A repository for ALL data, including demographics, surveys, studies, patterns, etc. Basically, a repository of all numerical data. The reason I am looking for it in wiki format is that wikis utilize vast human resources, so it is possible not only to collaborate extensively on popular topics but also to find information on obscure topics. Do you know of any systems which utilize vast human resources in order to collect raw data on all topics? --146.244.137.197 23:54, 11 April 2006 (UTC)

Statistical graphics are poorly covered

Graphical methods are currently mentioned only at one point in the article, in this paragraph:

Descriptive statistics deals with the description problem: Can the data be summarized in a useful way, either numerically or graphically, to yield insight about the population in question? Basic examples of numerical descriptors include the mean and standard deviation. Graphical summarizations include various kinds of charts and graphs.

This does not seem sufficient for such a pervasive aspect of statistical practice. Expanding on this would also provide plenty of scope for illustrations. I'll add something soon. -- Avenue 09:35, 3 May 2006 (UTC)

Disputing "Good Article" status

See Wikipedia:Good articles/Disputes#Statistics. --zenohockey 00:26, 14 May 2006 (UTC)

Let's talk about what needs to be done, specifically, to polish off this article. The criticisms are essentially:

Historical overview section needs expansion
Criticisms section needs expansion
There are too many lists of people/concepts
We need mention of Bayesian methods

Aren't lists appropriate, as the Statistics article serves as an introduction and a jumping-off point for anyone reading WP articles on statistics? Furthermore, what other topics should go in history/criticisms? How much should we say about Bayesian stuff here? Can someone make a concrete outline? Joshua Davis 16:57, 14 May 2006 (UTC)

Wikilinks are needed, but preferably not in the form of several lists of bullet points. They can be embedded in prose instead, as in Mathematics#Overview_of_fields_of_mathematics for example. This has the advantage that someone adding such a link has to figure out something useful to say about it, which might help combat the profusion of links to obscure software packages, for instance. The sheer number of external links is also a problem.

I'm sure there are other important topics besides Bayesian methods that aren't mentioned, but should be. I'll give this some thought. Some pruning would be useful too. -- Avenue 04:42, 15 May 2006 (UTC)

I think whoever nominated this article for "Good Article" has to review his/her opinion, this article is a real disgrace to Wikipedia (only one image, very few and unorganised links, long text but no useful information, and of course, the lack of Bayes ...), neglecting a topic as important that it needs a portal...

I am thinking of making a Portal:Statistics; however, I am not familiar with making new portals, so any help would be appreciated. --Lord Snoeckx 19:20, 17 May 2006 (UTC)

I mentioned this below, but if anyone is not working on revision and expanding the "criticisms" section, then I'd like to work on it. Don't want to step on toes though. --Newideas07 22:06, 3 November 2006 (UTC)

Criticism

We currently have a Criticism section with very little in it. What should go in such a section? Or should the section be deleted, with its contents left to misuse of statistics and statistical literacy? Joshua Davis 14:56, 15 May 2006 (UTC)

We should be integrating the relationship between statistics and half-truths into the article.

For example:

Milo Schield, who has a PhD in Astrophysics, is a professor at Augsburg College who teaches statistical literacy, traditional statistics and critical thinking at the undergraduate and graduate level. [4] In his 2005 paper, Statistical Prevarication: Telling Half Truths Using Statistics [5], Schield notes,
All too often statistics are characterized as lies. But statistics are more likely to be half truths than lies.... statistical prevarication [is] the art of straddling both sides of an issue or idea... If statistics educators are to avoid a charge of statistical negligence, they should focus more on identifying and eliminating sources of statistical prevarication in their teaching and textbooks. And statistical educators should do more to help students become statistically literate in detecting statistical prevarication.
Even if we assume a statistic is true, it represents only one part of the whole picture. The fastest growing sport may not be the most popular. [5]

-- Preceding unsigned comment added by User:Caesarjbsquitti

Statistics is like a hammer: a hammer can be used to kill, but we don't put warning labels on all hammers. Statistics, like any other tool, can be used incorrectly or for wrong ends. The best thing to point out is how it can be, and has been, used misleadingly. I think this would be an interesting sub-section, but short and to the point. You generalize in a way that makes it sound like everyone who uses statistics is a liar. Statistics are not always "generalizable," and that may contribute to misunderstandings. Furthermore, some statistics may be unknowingly incorrect due to error (that is, Type I and II error). We should be careful to discuss cases that were proven misleading rather than merely suspected. Overall, we should be careful not to make such sweeping generalizations about those who use statistics. Chris53516 20:42, 23 August 2006 (UTC)

I would like to propose we change the name of this section to "The Misuse and Limitations of Statistics" or something similar as Joshua suggested. I also would like to make big revisions to it if no one is working on it or attached to it as it is. I'm a statistician (M.S.) and educator. If anyone objects or has a better idea or is already working away hard on this, speak soon or I'll do it. --Newideas07 22:04, 3 November 2006 (UTC)

Hypothesis tests versus confidence intervals

User:JJL recently changed the comment on confidence intervals (as an alternative to hypothesis tests) to read:

One possibility is to avoid hypothesis testing and report confidence intervals instead, but this merely avoids drawing the final conclusion of the test, and statisticians do wish to draw conclusions.

I disagree strongly with this statement. Confidence intervals present the uncertainty about the size of the difference, while hypothesis tests typically condense this into a statement about whether the difference could reasonably be zero. But in many situations, simply having a non-zero difference is not the goal, and hypothesis tests miss the point. JJL's edit summary was "C.I. really is equiv. to hyp. test but w/o the final concl.; fine if no concl. needed, but try selling that to the FDA." Okay, so look at a drug example. Suppose we have a clinical trial on a new cancer drug. Assume it costs $100,000 more per patient than existing treatments, but this would be judged worthwhile if it decreases 5-year mortality by 10%. If the confidence interval for the mortality decrease is (2%,22%), we don't know enough to say whether the drug is worthwhile, but a hypothesis test would show there is a significant difference in mortality - not the right thing to conclude. Alternatively, if the confidence interval for the mortality decrease is (1%,9%), we do know enough to say that the drug is probably not worthwhile. However, a hypothesis test would still show there is a significant difference in mortality, again leading us to the wrong conclusion.

Statisticians do not wish to draw wrong conclusions. This is why many of them have recommended confidence intervals over hypothesis tests. I can dredge out references if you want. -- Avenue 01:24, 23 May 2006 (UTC)
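The drug example above can be restated as a minimal Python sketch. The decision rule here is an illustrative assumption (compare the interval with both zero and the 10% "worthwhile" threshold), and the function name `assess` is hypothetical; it is not a standard procedure from any package.

```python
def assess(ci_low: float, ci_high: float,
           worthwhile_threshold: float, null_value: float = 0.0):
    """Return (significant, verdict) for a confidence interval on an effect size."""
    # Test-style reading: is the null value excluded from the interval?
    significant = not (ci_low <= null_value <= ci_high)

    # Interval-style reading: where does the interval sit relative to the threshold?
    if ci_low >= worthwhile_threshold:
        verdict = "worthwhile"
    elif ci_high < worthwhile_threshold:
        verdict = "probably not worthwhile"
    else:
        verdict = "inconclusive"
    return significant, verdict


# Mortality decrease in percentage points; threshold of 10 as in the example.
print(assess(2, 22, worthwhile_threshold=10))  # (True, 'inconclusive')
print(assess(1, 9, worthwhile_threshold=10))   # (True, 'probably not worthwhile')
```

Both intervals exclude zero, so a test of "no difference" reports significance in each case, while the comparison with the threshold gives two different practical answers.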

I wanted to make the point that in computing a C.I. one is doing essentially all the calculations done in performing a comparable-level hyp. test but then not condensing it to a yes-no answer. In this sense it is more a stylistic choice than a fundamentally different approach. I too have seen more and more people leave things as a C.I. rather than a reject/FTR, just as I've seen the rise of the emphasis on the p-value. Pointing this out is fine, but I think it's important to make it clear that we're talking about different ways of viewing the same calculation, not truly different approaches in the way that, say, a Bayesian approach is different. So, I am certainly fine with someone weakening my statement, but I'd like to see it remain clear that there is a close connection between these approaches. JJL 03:15, 23 May 2006 (UTC)

I agree that mathematically p-values or confidence intervals are "only" stylistic choices, since (if one knows the point estimate and the null hypothesis) you can convert between them. But in practice, people make very different judgements based on these two presentations of the same information. I don't mind both aspects being discussed, but to me the practical advantages of CIs seem much more relevant to the Criticisms section than the mathematical equivalence of these approaches. -- Avenue 00:28, 24 May 2006 (UTC)
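The conversion mentioned above can be sketched in Python with the standard library. This is a minimal illustration that assumes a normal approximation and a symmetric two-sided interval, neither of which is specified in the discussion; the function names are hypothetical, and the edge cases p = 0 and p = 1 are not handled.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def p_value_from_ci(ci_low, ci_high, null_value=0.0, level=0.95):
    """Two-sided p-value implied by a symmetric normal-theory interval."""
    z_crit = STD_NORMAL.inv_cdf(0.5 + level / 2)
    estimate = (ci_low + ci_high) / 2              # point estimate is the midpoint
    std_err = (ci_high - ci_low) / (2 * z_crit)    # half-width divided by critical value
    z = (estimate - null_value) / std_err
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def ci_from_p_value(estimate, p_value, null_value=0.0, level=0.95):
    """Interval implied by a point estimate and its two-sided p-value."""
    z = STD_NORMAL.inv_cdf(1 - p_value / 2)        # |z| that gives this p-value
    std_err = abs(estimate - null_value) / z
    z_crit = STD_NORMAL.inv_cdf(0.5 + level / 2)
    return estimate - z_crit * std_err, estimate + z_crit * std_err


p = p_value_from_ci(2, 22)          # interval from the drug example, in percent
low, high = ci_from_p_value(12, p)  # round trip back to the interval
print(p, low, high)
```

The round trip recovers the original interval, which is the mathematical equivalence under discussion; it says nothing about how readers interpret the two presentations.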

I don't disagree with your overall point, but I think the current wording is too strong--preventing common errors made by hyp. testers sounds like POV to me (would they agree that they make all these 'common' errors?). I also disagree with widely recommended remedy. It's an increasingly popular approach, to my mind. Let's see if I can help us iterate to a happy medium. Edit away at what I write, of course! JJL 00:52, 24 May 2006 (UTC)

Section on Basic statistical techniques for beginners

A section called "Basic statistical techniques for beginners" was recently added. Some of this might be merged into the "Statistical techniques" section, although I think much of this is getting too detailed for this article. I also wouldn't choose ANOVA, regression, and chi-squared tests as the three best techniques to cover here. I've removed the section from the article; its text follows below. -- Avenue 00:38, 6 June 2006 (UTC)

There are three basic statistical techniques: chi-square, which works well on counts; analysis of variance (anova), which works well on measurements from two or more groups; and regression, which builds the best simple equation for seeing how well one variable will convert into another. In the case of chi-square, the technique can be used to support the notion that there is an important, reliable difference in the observed counts in two or more groups, as in the number of members of group A who are found to have a given characteristic versus those in group B. In the case of anova, that technique can be used to support the notion that the means or averages in some groups of scores are importantly and reliably different from the means of other groups. In the case of (correlation)/regression, the technique helps show how much one variable can be changed to another by a simple formula. For instance, through the body mass index human body height might be found to translate into the weight of that body by a fairly simple and straightforward mathematical formula.
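As an illustration of the chi-square case described in the removed text (counts of a characteristic in group A versus group B), here is a minimal Python sketch of Pearson's chi-square test for a 2x2 table. The counts are invented, no continuity correction is applied, and the p-value formula used is specific to one degree of freedom.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square statistic and p-value for the table [[a, b], [c, d]]."""
    observed = ((a, b), (c, d))
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if group and characteristic were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: 30 of 100 in group A and 15 of 100 in group B
# have the characteristic.
stat, p_value = chi_square_2x2(30, 70, 15, 85)
print(f"chi-square = {stat:.3f}, p = {p_value:.4f}")
```

For tables larger than 2x2 the same statistic applies but the p-value needs the general chi-square distribution, which this sketch does not implement.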

I concur with the removal of this section. JJL 14:52, 6 June 2006 (UTC)

Delisting of statistics from Wikipedia:Good articles

Hi all,

Unfortunately, after discussion on Wikipedia:Good articles/Disputes, this article has been delisted. The discussion relating to this article's delisting follows. If you feel the issues discussed below are no longer applicable, please feel free to renominate the article.

Cedars 08:10, 11 June 2006 (UTC)

Largely consists of lists of people and disciplines, few of which are explained even briefly in the text. The explanations seem adequate, where they appear, but the "Criticism" section, e.g., needs to be much larger. --zenohockey 00:24, 14 May 2006 (UTC)

I agree, much that is important about statistics is missing from the article. For instance, Bayesian methods don't get a mention. Also contains only one image. --Avenue 04:51, 14 May 2006 (UTC)

I think this article sucks, only one image, few links, and no useful information, it's a disgrace to Wikipedia to neglect such an important topic which is even worth a portal!--Lord Snoeckx 19:12, 17 May 2006 (UTC)

Please keep things cool, people; let us not go around with terms like that. By doing so you are insulting both the editors of the article and the person that passed it, without actually discussing the issue at hand - the article. If you read the GA guidelines, you'll see that images aren't required, so that can't be used against it. "Few links"? You mean it isn't a link farm or it doesn't link to other articles? We should discuss the context and quality of this article. Play on, Highway (Rainbow Sneakers) 14:59, 26 May 2006 (UTC)

I didn't understand the "few links" comment either, and I think "no useful information" is overstating things. The article has improved somewhat since its listing here, with the "Criticism" section being expanded. But the other criticisms remain. These problems mean that, in my opinion, the article fails to meet the good article criteria of being well written and having broad coverage of its subject. It also includes only one inline citation and one image; although these are not mandatory, more would certainly be desirable. -- Avenue 05:10, 6 June 2006 (UTC)

Not to mention that making free images for a Statistics article is much easier than most. --SeizureDog 08:35, 11 June 2006 (UTC)

Probability

As I looked at this article, I thought that probability should be mentioned because it is part of the mathematical background of statistics. Does anybody think that probability needs to be mentioned in this article? I hesitated to add it myself. Perhaps there is another article about mathematical probability. *~Daniel~* ☎ 02:32, 31 July 2006 (UTC)

Probability is mentioned as the basis for statistics in the historical overview (with moderate discussion) and in the conceptual overview (with brief mention as to its relevance). There are links to probability, probability theory, and mathematical statistics. What more do you think this article needs? Joshua Davis 18:25, 31 July 2006 (UTC)

No, I don't think that this article needs anything. I just thought that mathematical statistics should be mentioned. Thanks anyway. *~Daniel~* ☎ 02:40, 6 August 2006 (UTC)

Historical overview

The historical overview section has grown splendidly in recent months. But this has left it somewhat a mess. Today I tried to separate out a few threads and place each in its own subsection, with the mess concentrated in the "Origins in probability" subsection. In order to make the article sensible while it's being revised, I have temporarily deleted this passage:

Statistics eventually merged with the more mathematically oriented field of inverse probability, referring to the estimation of a parameter from experimental data in the experimental sciences (most notably astronomy).

I don't know enough to polish this section much further. Here's what I don't like:

The discussion of Quetelet is more than half as long as the article Adolphe Quetelet. Let's move most of this material there, and leave a brief mention of his significance here.
People keep wanting to write treatises on probability in this article. In particular, the discussion of aleatory vs. epistemic probability is long and not directly referenced elsewhere in this article. (And it's not historical, but conceptual.) Should it be moved to Probability?
I separated out the paragraph "Other contributors were..." because it's unclear to me whether it's related to the paragraph before it or the one after it.
The article is 33 KB long. If the Historical overview gets much longer, we might consider splitting it off into "History of statistics".

Joshua Davis 16:31, 5 August 2006 (UTC)

Responding to myself here...I have removed the treatment of aleatory vs. epistemic probability, since it appears verbatim at Probability, and I have moved the treatment of Quetelet to Talk: Adolphe Quetelet, leaving just a short summary here. Joshua Davis 15:39, 15 August 2006 (UTC)

Refs and Disraeli in Statistics article

I moved the references out of external links (they weren't external LINKS at all), and made them a bibliography for statistics, and added Dr. Joel Best's book to it. I also clarified the quote on "lies, damned lies, and statistics" to whom it has been reasonably shown to be from -- Disraeli and footnoted it as such. That was my intention. User:Chris53516 asked that I explain it here. I hope it's fine. Bests. --- (Bob) Wikiklrsc 21:35, 24 August 2006 (UTC)

Thanks for the updates. It looks much better. Chris53516 21:38, 24 August 2006 (UTC)

You're welcome, Chris. Thanks for your comments. Bests. --- (Bob) Wikiklrsc 12:50, 25 August 2006 (UTC)