Talk:Data set
This article is rated Start-class on Wikipedia's content assessment scale.
Untitled
What is dataset architecture? — Preceding unsigned comment added by 221.135.220.238 (talk) 2006-08-01T09:43:55 (UTC)
In the social science world, a dataset is a set of files which includes data, usually in encoded form, and documentation, such as a codebook. Rcrice 18:19, 30 November 2006 (UTC) Robin Rice
This article is a plagiarism of reference.com. Here is the url to the original article: [1] 216.166.206.16 01:20, 26 April 2007 (UTC)
- The url makes it clear that it just pulls up this article for display. Melcombe (talk) 15:57, 31 March 2008 (UTC)
Datasets are data sets!
I do not agree that the term 'data set' is unrelated to, or inconsistent with, set theory. In statistics, and in every other field I can imagine, each record of a dataset belongs to a distinct statistical unit (or observation), so two rows of a dataset are always distinct. A dataset is a set of tuples, but rows can exchange their positions, so it is not an ordered list. Any of the usual set operations (union, intersection, subset extraction and complement) can be carried out by treating a row of a dataset as a (one- or n-dimensional) element of a set of data. This is also consistent with the definition of a sample as a subset of a population; in fact, in multivariate analysis a single row of a data matrix is a k-variate observation drawn from a joint k-variate density function (k being the number of columns). So in my opinion a dataset is just a particular form of a (data) set. I'm waiting for the community to contradict me :-) !
Jabbba (talk) 12:39, 1 May 2008 (UTC)
- It is not obvious what you are arguing for or against. I don't think a connection with mathematical sets is any more worth mentioning in this instance than it would be for a "set of bowls". Melcombe (talk) 13:32, 1 May 2008 (UTC)
- Ok, thank you for the answer. The reason it was not obvious is that I had commented out the lines of the text I was referring to (as you can see in the history). Now I understand that this was a mistake, since people could not understand my post. I'm going to restore the content for now, even though I think it should be removed (in line with your opinion, too). Jabbba (talk) 22:15, 3 May 2008 (UTC)
- Actually, I had seen the hidden text. What I meant was that it was not clear whether you were arguing for or against the inclusion of the portion of text. Your first sentence seemed to be arguing for the inclusion, while the last sentence seemed to say "exclude". I do think it should not be in the article. Melcombe (talk) 09:19, 6 May 2008 (UTC)
- Ok text removed :-) Jabbba (talk) 22:01, 9 May 2008 (UTC)
- Is the data set of SAT scores [739, 1336, 1336, 2173] the same as the data set [739, 1336, 2173]? What is its arithmetic mean, 1416 or 1396? In a set you may discard duplicates. Contrary to what the name suggests, the rows of a data set need not be distinct, and so a data set is in general not a set in the usual mathematical sense. The purpose of explaining that a data set is actually a multiset is to make the reader understand that duplicate members must be retained. The fact that identical members of a data set have distinct origins does not imply that the members themselves are distinct. Alice is distinct from Bob, but they may have identical SAT scores. --Lambiam 15:14, 21 May 2008 (UTC)
- It is not a matter of "origin". This discussion could continue for years, because it is only a matter of definition. A dataset is simply a structured list of observed values. I think the simplest and most elegant way to view a database is as a set of (vectors of) observations, even when the row id is not shown. Two rows in a dataset are distinct by definition, because they describe two objects that were actually observed (the population or the sample is undoubtedly a "set"). No data analysis software treats two rows as *identical*. As far as I know, there is always some sort of "row names" or "case number" vector attached to the dataset. It is certainly true that, from a statistical viewpoint, deleting either of two identical rows has the same effect. But it is also true that behind a database there is always a research study conducted by people who almost surely know the "name" of each case, and there are probably other datasets describing the same population. Every time I build a dataset I invariably find some errors, and only with the row name can I get back to the "origin" to correct the value. So in my opinion there is no need to invoke such an unfamiliar concept as the "multiset"; there is simply an implicit row id. Thank you. Jabbba (talk) 00:33, 22 August 2009 (UTC). [P.S.: one more thing I did not note 3 days ago is that the mathematical object usually associated with a dataset is simply a matrix, and obviously a matrix may well have equal values in distinct cells. It is true that the elements of a matrix are numbers. On the other hand, a dataset *certainly* does NOT fit the multiset definition when there is more than one variable. What does it mean that "in the simplest case the data set is a multiset"? The order is important in a dataset.
Datasets are more closely related to tuples and/or coordinate vectors (even with categorical variables it is common to talk about a variable as a dimension in which a unit can take one of a range of definite values). But in my opinion any reference to abstract mathematical objects is off-topic on this page. Jabbba (talk) 21:28, 25 August 2009 (UTC)]
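For readers following the thread, the SAT-score arithmetic and the two readings argued above (multiset versus set of rows with an implicit row id) can be checked with a short Python sketch. It uses only the standard library; the variable names and the choice to number cases from 1 are illustrative assumptions, not anything proposed in the discussion.

```python
from collections import Counter
from statistics import mean

# The four SAT scores from the example, one per test taker.
scores = [739, 1336, 1336, 2173]

# Read as a mathematical set: the duplicate 1336 collapses.
as_set = set(scores)

# Read as a multiset: each value keeps its multiplicity.
as_multiset = Counter(scores)

mean_all = mean(scores)        # mean over all four observations
mean_distinct = mean(as_set)   # mean after duplicates are discarded

# Read as a set of rows with an implicit row id (case number, assumed
# to start at 1): every row is distinct, so nothing collapses.
rows = set(enumerate(scores, start=1))
mean_rows = mean(value for _, value in rows)

print(mean_all, mean_distinct, mean_rows)
```

Under these assumptions the multiset reading and the row-id reading give the same mean, and only the plain set of values loses an observation.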
Also used more loosely
I've added a line that the term may also be used more loosely, to refer to a collection of closely related tables -- for example, in genomics, a "dataset" for a single mRNA expression assay experiment may contain a table for the assay results themselves (number of samples × number of probes), a table giving various phenotypic data for each sample, a table containing annotation information about each probe, a table containing other experimental information, etc.
Because together this makes up the data needed to describe the setup and results of a single experiment, this whole collection of tightly related tables is what tends to be called a "dataset". A particular analysis might then consider bringing together several datasets, from several different sources. Jheald (talk) 10:29, 17 June 2013 (UTC)
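As a rough illustration of this looser sense, here is a minimal Python sketch of a "dataset" as a bundle of related tables. The class name `ExpressionDataset`, its field names, and the sample and probe identifiers are all hypothetical and chosen only for the example; real genomics toolkits define their own containers.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ExpressionDataset:
    """Hypothetical container: one experiment, several related tables."""
    # assay[sample_id][probe_id] -> measured expression value
    assay: Dict[str, Dict[str, float]]
    # phenotypes[sample_id] -> phenotypic attributes of that sample
    phenotypes: Dict[str, Dict[str, str]]
    # probe_annotation[probe_id] -> annotation of that probe
    probe_annotation: Dict[str, Dict[str, str]]
    # free-form information about the experiment itself
    experiment_info: Dict[str, str] = field(default_factory=dict)

    def is_consistent(self) -> bool:
        """True if the tables agree on sample ids and all probes are annotated."""
        samples_ok = set(self.assay) == set(self.phenotypes)
        probes_ok = all(
            set(row) <= set(self.probe_annotation)
            for row in self.assay.values()
        )
        return samples_ok and probes_ok


# Made-up values, for illustration only.
ds = ExpressionDataset(
    assay={"s1": {"p1": 1.5, "p2": 0.2}, "s2": {"p1": 1.1, "p2": 0.4}},
    phenotypes={"s1": {"group": "case"}, "s2": {"group": "control"}},
    probe_annotation={"p1": {"gene": "A"}, "p2": {"gene": "B"}},
    experiment_info={"platform": "example array"},
)
print(ds.is_consistent())
```

The tables are tied together by shared sample and probe identifiers, which is what makes the collection behave as a single unit.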
First sentence
Shouldn't the first sentence also include a qualifier that there is some purpose or reason for the collection of data? It's not just random data, or any data, as in data storage. - Shiftchange (talk) 02:15, 23 November 2016 (UTC)