Comma-separated values: Difference between revisions

From Wikipedia, the free encyclopedia

Revision as of 18:10, 17 July 2012

Comma-separated values
Comma separated list
Filename extension: .csv or .txt
Internet media type: text/csv
Type of format: multi-platform, serial data streams
Container for: database information organized as field-separated lists
Standard: RFC 4180

A comma-separated values (CSV) file stores tabular data (numbers and text) in plain-text form. Plain text means that the file is a sequence of characters, with no data that must instead be interpreted as binary numbers. A CSV file consists of any number of records, separated by line breaks of some kind; each record consists of fields, separated by some other character or string, most commonly a literal comma or tab. Usually, all records have an identical sequence of fields.

Usage

CSV is a common, relatively simple file format that is widely supported by consumer, business, and scientific applications. Among its most common uses is moving tabular data between programs that natively operate on incompatible (often proprietary and/or undocumented) formats. This works because so many programs support some variation of CSV at least as an alternative import/export format.

For example, a user may need to transfer information from a database program that stores data in a proprietary format, to a spreadsheet that uses a completely different format. The database program most likely can export its data as "CSV"; the exported CSV file can then be imported by the spreadsheet program.

"CSV" is not a single, well-defined format (although see RFC 4180 for one definition that is commonly used). Rather, in practice the term "CSV" refers to any file that:

  1. is plain text using a character set such as ASCII, Unicode, EBCDIC, or Shift JIS,
  2. consists of records (typically one record per line),
  3. with the records divided into fields separated by delimiters (typically a single reserved character such as comma, semicolon, or tab; sometimes the delimiter may include optional spaces),
  4. where every record has the same sequence of fields.

Within these general constraints, many variations are in use, so "CSV" files are not guaranteed to be portable. Nevertheless, the variations are fairly small, and many implementations allow users to glance at the file (which is feasible because it is plain text) and then specify the delimiter character(s), quoting rules, etc. If a particular CSV file's variations fall outside what a particular receiving program supports, it is often feasible to examine and edit the file by hand or via simple programming to fix the problem. Thus CSV files are, in practice, quite portable.

History

Comma-separated values are old technology and pre-date personal computers by more than a decade: the IBM Fortran (level G) compiler under OS/360 supported them in 1967.[citation needed]

Comma-separated value lists are easier to type (for example into punched cards) than fixed-column-aligned data, and were less prone to producing incorrect results if a value was punched one column off from its intended location.

The comma-separated list (CSL) is a data format that was originally known as comma-separated values (CSV) in the earliest days of computing. In the personal computer industry (then more commonly known as "home computers"), its most common use was by small businesses generating solicitations from boilerplate form letters and mailing lists.[citation needed]

Some early software applications, such as word processors, allowed a stream of "variable data" to be merged between two files: a form letter, and a CSL of names, addresses, and other data fields. Many applications still do, perhaps because tasks requiring human input (such as constructing a list) are natural and easy using comma delimiters. CSL/CSVs were also used for simple databases.

Comma-separated lists were also widely used in the earliest pre-IBM PC era personal computers for tape storage backup[dubious] and for interchange of database information between machines of two different architectures. The plain-text character of CSV files keeps them valuable even today, especially in a global context, because they largely avoid incompatibilities such as byte order, word size, and character sets, and because they are largely human-readable, making it far easier to deal with them in the absence of perfect documentation or communication.

General functionality

CSV formats are best used to represent sets or sequences of records in which each record has an identical list of fields. This corresponds to a single relation in a relational database, or to data (though not calculations) in a typical spreadsheet.

Records in a CSV file are, by definition, in some order. Whether the recipient maintains and/or uses that order can vary. Thus, CSV files can represent either unordered or ordered record sequences.

CSV formats are not limited to a particular character set. They work just as well with Unicode as with ASCII (although particular programs that support CSV may have their own limitations). CSV files normally will even survive naive translation from one character set to another (unlike nearly all proprietary data formats). CSV does not, however, provide any way to indicate what character set is in use, so that must be communicated separately, or figured out at the receiving end (if possible).
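Because the file itself carries no encoding label, the reader has to supply one. The following is a minimal sketch using Python's standard csv and io modules; the sample data and the helper name read_csv_bytes are illustrative assumptions, not part of any CSV specification.

```python
import csv
import io


def read_csv_bytes(data, encoding):
    """Decode raw CSV bytes with a caller-supplied encoding, then parse them."""
    # newline="" lets the csv module handle line endings itself.
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return list(csv.reader(text))


# The same two records, serialized in two different character sets.
sample = "name,city\r\nZoë,Köln\r\n"
as_utf8 = sample.encode("utf-8")
as_latin1 = sample.encode("latin-1")

# Each byte string parses correctly only when the right encoding is named.
rows_utf8 = read_csv_bytes(as_utf8, "utf-8")
rows_latin1 = read_csv_bytes(as_latin1, "latin-1")

# Guessing wrong does not raise an error here; it silently yields garbled text.
rows_wrong = read_csv_bytes(as_utf8, "latin-1")
```

The record and field structure survives in all three cases because the delimiters are plain ASCII; only the non-ASCII characters are damaged when the encoding is guessed wrongly.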

Databases that include multiple relations cannot be exported as a single CSV file as described here. At best, more notational conventions must be added, for example to identify and separate the different relations. Such notations are not difficult to design or implement, but there is no consensus on them and consequently very little portability.

Similarly, CSV cannot naturally represent hierarchical or object-oriented databases or other data. This is because every CSV record is expected to have the same structure. CSV is therefore rarely appropriate for documents (such as are created with HTML, XML, or other markup or word-processing technologies).

Statistical databases in various fields often have a generally relation-like structure, but with some groups of fields repeatable. For example, health databases such as the Demographic and Health Survey typically repeat some questions for each child of a given parent (perhaps up to a fixed maximum number of children). Statistical analysis systems often include utilities that can "rotate" such data: for example, a "parent" record that includes information about 5 children, can be split into 5 separated records, each containing (a) the information on one child, and (b) a copy of all the non-child-specific information. CSV can represent either the "vertical" or "horizontal" form of such data.
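The "rotation" described above can be sketched in a few lines of Python. The column names (parent_id, region, child1_age and so on), the sample data, and the function to_vertical are hypothetical and chosen only for illustration; real survey files use their own layouts.

```python
import csv
import io

# Hypothetical "horizontal" data: one record per parent, with repeated child fields.
horizontal = (
    "parent_id,region,child1_age,child2_age,child3_age\r\n"
    "P1,North,4,7,\r\n"
    "P2,South,2,,\r\n"
)


def to_vertical(text, max_children=3):
    """Split each parent record into one record per child that is present."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text, newline="")):
        for n in range(1, max_children + 1):
            age = rec[f"child{n}_age"]
            if age == "":  # empty field: no such child
                continue
            # Copy the non-child-specific fields into every child record.
            rows.append([rec["parent_id"], rec["region"], str(n), age])
    return rows


vertical = to_vertical(horizontal)
```

Here the two parent records become three child records, each carrying a copy of the parent's identifier and region.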

In a relational database, similar issues are readily handled by creating a separate relation for each such group, and connecting "child" records to the related "parent" records using a foreign key (such as an ID number or name for the parent). In markup languages such as XML, such groups are typically enclosed in a container (for example, <child>), which is then repeated as necessary. With CSV there is no widely-accepted single-file solution.

Lack of a standard

The name "CSV" indicates the use of the comma to separate data fields. Nevertheless, the term "CSV" is widely used to refer to a large family of formats, which differ in many ways. For example, many so-called "CSV" files in fact use the tab character instead of the comma (such files can be more precisely referred to as "TSV", for tab-separated values); some allow or require single or double quotation marks around some or all fields; and some reserve the very first record for a list of field names.

A particular problem is that in some countries, it is very common to write the decimal point as a comma instead of period. For example: 3,14159. This makes the comma a poor choice for field-separator in many locales. Other implementation differences include handling of more prosaic field separators (such as space or semicolon[1]) and newline characters inside text fields.[2]

Such lack of standardization can cause problems for data exchange based on so-called "CSV" files. One solution is to rely on a standard, such as that proposed by RFC 4180. The more common but technically less satisfactory solution is to rely on human intervention: because CSV files are plain text, humans can view and diagnose most common variants using a text editor.

Toward standardization

The huge variety among "CSV" formats has led to the assertion that there is no "CSV standard".[3][4] In common usage, almost any delimiter-separated text data may be referred to as a "CSV" file. Different CSV formats may not be compatible.

Nevertheless, RFC 4180 is an effort to formalize CSV. It defines the MIME type "text/csv", and CSV files that follow its rules should be very widely portable. Among its requirements:

  • DOS-style lines that end with a carriage return followed by a line feed (CRLF)
  • An optional header record (there is no sure way to detect whether it is present, so care is required when importing).
  • Each record "should" contain the same number of comma-separated fields.
  • Any field may be quoted (with double quotes).
  • Fields containing line breaks, double quotes, and/or commas should be quoted. (If they are not, the file will likely be impossible to process correctly, so this "should" is better read as "must".)
  • A (double) quote character in a field must be represented by two double quote characters.

The format is simple and can be processed by most programs that claim to read CSV files. The exceptions are (a) programs may not support line-breaks within quoted fields, and (b) programs may confuse the optional header with data or interpret the first data line as an optional header.
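As a rough illustration of the requirements listed above, the sketch below writes records with Python's standard csv module configured for comma delimiters, double-quote quoting, doubled embedded quotes, and CRLF line endings. The function name to_rfc4180 and the sample rows are assumptions made for this example; the module is not claimed to be a complete RFC 4180 validator.

```python
import csv
import io


def to_rfc4180(rows):
    """Serialize rows using the conventions listed above."""
    buf = io.StringIO(newline="")  # keep the CRLF endings exactly as written
    writer = csv.writer(
        buf,
        delimiter=",",
        quotechar='"',
        doublequote=True,           # an embedded " is written as ""
        quoting=csv.QUOTE_MINIMAL,  # quote only the fields that need it
        lineterminator="\r\n",      # CRLF record terminator
    )
    writer.writerows(rows)
    return buf.getvalue()


rows = [
    ["Year", "Make", "Model", "Note"],
    ["1997", "Ford", "E350", 'Super, "luxurious" truck'],
    ["1997", "Ford", "E350", "Go get one now\nthey are going fast"],
]
out = to_rfc4180(rows)

# Reading the text back recovers the original fields, line break included.
round_trip = list(csv.reader(io.StringIO(out, newline="")))
```

Only the fields that contain a comma, a double quote, or a line break end up quoted; the plain fields are written as-is.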

Technical background

The format dates back to the early days of business computing and is widely used to pass data between computers with different internal word sizes, data formatting needs, and so forth. For this reason, CSV files are common on all computer platforms.

CSV is a delimited text file that uses a comma to separate values (many implementations of CSV import/export tools allow other separators to be used). Simple CSV implementations may prohibit field values that contain a comma or other special characters such as newlines. More sophisticated CSV implementations permit them, often by requiring " (double quote) characters around values that contain reserved characters (such as commas, double quotes, or less commonly, newlines). Embedded double quote characters may then be represented by a pair of consecutive double quotes (Creativyst 2010), or by prefixing an escape character such as a backslash (for example in Sybase Central).
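The two conventions for embedded double quotes can be compared with a short Python sketch. The sample lines are invented for illustration, and the reader options shown (doublequote, escapechar) are those of Python's csv module; other tools expose the choice differently or not at all.

```python
import csv
import io

doubled = 'a,"say ""hi""",b\r\n'        # embedded quote escaped by doubling
backslashed = 'a,"say \\"hi\\"",b\r\n'  # embedded quote escaped with a backslash

# The default dialect expects doubled quotes.
row_doubled = next(csv.reader(io.StringIO(doubled, newline="")))

# The backslash convention has to be requested explicitly.
row_backslashed = next(csv.reader(
    io.StringIO(backslashed, newline=""),
    doublequote=False,
    escapechar="\\",
))
```

Both calls yield the same three fields, but only because each reader was told which convention the producer used; the file itself does not say.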

In computer science terms, a CSV file may be considered a "flat file".

Basic rules and examples

Many informal documents exist that describe "CSV" formats. IETF RFC 4180 (summarized above) defines the format for the "text/csv" MIME type registered with the IANA. (Shafranovich 2005) Another relevant specification is provided by Fielded Text. Creativyst (2010) provides an overview of the variations used in the most widely used applications and explains how CSV can best be used and supported.

Rules typical of these and other "CSV" specifications and implementations are as follows:

  • A CSV file does not require a specific character encoding, byte order, or line terminator format (some software does not support all line-end variations).
  • A record ends at a line terminator. However, line-terminators can be embedded as data within fields, so software must recognize quoted line-separators (see below) in order to correctly assemble an entire record from perhaps multiple lines.
  • All records should have the same number of fields, in the same order.
  • Data within fields is interpreted as a sequence of characters, not as a sequence of bits or bytes (see RFC 2046, section 4.1). For example, the numeric quantity 65535 may be represented as the 5 ASCII characters "65535" (or perhaps other forms such as "0xFFFF", "000065535.000E+00", etc.); but not as a sequence of 2 bytes intended to be treated as a single binary integer rather than as two characters. If this "plain text" convention is not followed, then the CSV file no longer contains sufficient information to interpret it correctly, the CSV file will not likely survive transmission across differing computer architectures, and will not conform to the text/csv MIME type.
  • Adjacent fields must be separated by a single comma. However, "CSV" formats vary greatly in this choice of separator character. In particular, in locales where the comma is used as a decimal separator, semicolon, TAB, or other characters are used instead.
1997,Ford,E350
  • Any field may be quoted (that is, enclosed within double-quote characters). Some fields must be quoted, as specified in following rules.
"1997","Ford","E350"
  • Fields with embedded commas must be quoted.
1997,Ford,E350,"Super, luxurious truck"
  • Fields with embedded double-quote characters must be quoted, and each of the embedded double-quote characters must be represented by a pair of double-quote characters.
1997,Ford,E350,"Super, ""luxurious"" truck"
  • Fields with embedded line breaks must be quoted (however, many CSV implementations simply do not support this).
1997,Ford,E350,"Go get one now
they are going fast"
  • In some CSV implementations, leading and trailing spaces and tabs are trimmed. This practice is controversial, and does not accord with RFC 4180, which states "Spaces are considered part of a field and should not be ignored."
1997, Ford, E350
is not the same as
1997,Ford,E350
  • In CSV implementations that do trim leading or trailing spaces, fields with such spaces as meaningful data must be quoted.
1997,Ford,E350," Super luxurious truck "
  • The first record may be a "header", which contains column names in each of the fields (there is no reliable way to tell whether a file does this or not; however, it is uncommon to use characters other than letters, digits, and underscores in such column names).
Year,Make,Model
1997,Ford,E350
2000,Mercury,Cougar
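The space-handling rules above can be checked with a small Python sketch. It assumes Python's csv module, whose skipinitialspace option trims only whitespace that immediately follows a delimiter; other implementations may also trim trailing spaces.

```python
import csv
import io

line = "1997, Ford, E350\r\n"

# Default behaviour: spaces after the comma are kept as part of the field.
kept = next(csv.reader(io.StringIO(line, newline="")))

# Opt-in trimming of whitespace that immediately follows a delimiter.
trimmed = next(csv.reader(io.StringIO(line, newline=""), skipinitialspace=True))

# A quoted field keeps its leading and trailing spaces even when trimming is on.
quoted = next(csv.reader(
    io.StringIO('1997,Ford,E350," Super luxurious truck "\r\n', newline=""),
    skipinitialspace=True,
))
```

With the default settings the second field is " Ford" (with the space); with trimming enabled it is "Ford", while the quoted description is unchanged in both modes.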

Example

Year | Make  | Model                                  | Description            | Price
1997 | Ford  | E350                                   | ac, abs, moon          | 3000.00
1999 | Chevy | Venture "Extended Edition"             |                        | 4900.00
1999 | Chevy | Venture "Extended Edition, Very Large" |                        | 5000.00
1996 | Jeep  | Grand Cherokee                         | MUST SELL!             | 4799.00
     |       |                                        | air, moon roof, loaded |

The above table of data may be represented in CSV format as follows:

Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
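A quote-aware parser recovers the original table from this text. The sketch below uses Python's csv.DictReader and treats the first record as a header; the CRLF record endings in the sample string are an assumption, since the listing above does not show them.

```python
import csv
import io

sample = (
    "Year,Make,Model,Description,Price\r\n"
    '1997,Ford,E350,"ac, abs, moon",3000.00\r\n'
    '1999,Chevy,"Venture ""Extended Edition""","",4900.00\r\n'
    '1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00\r\n'
    '1996,Jeep,Grand Cherokee,"MUST SELL!\nair, moon roof, loaded",4799.00\r\n'
)

# The first record supplies the field names; each later record becomes a dict.
records = list(csv.DictReader(io.StringIO(sample, newline="")))

# Six physical lines of text yield one header and four data records,
# because the last record spans two lines inside a quoted field.
record_count = len(records)
```

Note that the doubled quotes collapse back to single quotes, the empty quoted fields come back as empty strings, and the line break inside the last description is preserved.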

Example of a USA/UK CSV file (where the decimal separator is a period/full stop and the value separator is a comma):

Year,Make,Model,Length
1997,Ford,E350,2.34
2000,Mercury,Cougar,2.38

Example of an analogous German and Dutch CSV/DSV file (where the decimal separator is a comma and the value separator is a semicolon):

Year;Make;Model;Length
1997;Ford;E350;2,34
2000;Mercury;Cougar;2,38

The latter format is not RFC 4180 compliant. Compliance could be achieved by the use of a comma instead of a semicolon as a separator and either the international notation for the representation of the decimal mark or the practice of quoting all numbers that have a decimal mark.
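One way to perform the first kind of conversion is sketched below in Python. It assumes, for this sample only, that Length is the sole column containing decimal-comma numbers; the function name convert is hypothetical, and a general converter would need to know which columns are numeric.

```python
import csv
import io

german = (
    "Year;Make;Model;Length\r\n"
    "1997;Ford;E350;2,34\r\n"
    "2000;Mercury;Cougar;2,38\r\n"
)


def convert(text):
    """Re-emit semicolon/decimal-comma data as comma/decimal-point CSV."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=";")
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\r\n")
    header = next(reader)
    writer.writerow(header)
    length_col = header.index("Length")
    for row in reader:
        # Assumption: only the Length column holds decimal-comma numbers.
        row[length_col] = row[length_col].replace(",", ".")
        writer.writerow(row)
    return out.getvalue()


converted = convert(german)
```

If the decimal commas were left in place instead, the writer would quote those fields automatically, which corresponds to the second option mentioned above.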

Application support

The CSV file format is very simple and supported by almost all spreadsheets and database management systems. Many programming languages have libraries available that support CSV files. Many implementations support changing the field-separator character and some quoting conventions, although it is safest to use the simplest conventions, to maximize the recipients' chances of handling the data.

Microsoft Excel will open .csv files, but depending on the system's regional settings, it may expect a semicolon as a separator instead of a comma, since in some languages the comma is used as the decimal separator. Also, many regional versions of Excel will not be able to deal with Unicode in CSV. One simple solution when encountering such difficulties is to change the filename extension from .csv to .txt and then open the file from an already running Excel with the "Open" command.[dubious]

When pasting text data into Excel, the tab character is used as a separator: If you copy "hello<tab>goodbye" into the clipboard and paste it into Excel, it goes into two cells. "hello,goodbye" pasted into Excel goes into one cell, including the comma.

OpenOffice.org Calc and LibreOffice Calc handle CSV files and pasted text with a Text Import dialog asking the user to manually specify the delimiters, encoding, format of columns, etc.

There are many utility programs on Unix-style systems that can deal with at least some CSV files. Many such utilities have a way to change the delimiter character, but lack support for any other variations (or for Unicode). Some of the useful programs are:

  • cut
  • paste
  • join
  • sort
  • uniq (-f to skip comparing the first N fields)
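The limitation of delimiter-only tools can be illustrated in Python by comparing a plain split on commas, which is roughly what a tool that only knows the delimiter character does, with a quote-aware parse. This is a sketch of the general problem, not a description of any particular utility's implementation.

```python
import csv
import io

line = '1997,Ford,E350,"ac, abs, moon",3000.00'

# What a delimiter-only tool effectively does: split on every comma.
naive_fields = line.split(",")

# A quote-aware parser keeps the quoted description together.
parsed_fields = next(csv.reader(io.StringIO(line, newline="")))
```

The naive split produces seven pieces and leaves stray quote characters in two of them, while the quote-aware parse produces the intended five fields.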

References

  1. ^ See e.g. .csv import settings for LibreOffice 3.4.
  2. ^ For example, this bug documents unintentionally different handling of newlines inside text fields between OpenOffice and LibreOffice.
  3. ^ "CSV File Reading and Writing". Retrieved July 24, 2011. Quote: "... is no 'CSV standard'".
  4. ^ Y. Shafranovich. "Common Format and MIME Type for Comma-Separated Values (CSV) Files". Retrieved September 12, 2011.