Talk:Database normalization
This article is rated Start-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects: | ||||||||||||||||||
|
|
|||
Why is Boyce-Codd Normal Form Not Discussed More?
[edit]Throughout this article, there is discussion of 3NF and then EKNF (which is between 3NF and BCNF) and then 4NF, with no mention of BCNF. Why was this left out? Contributors themselves have noted that EKNF is not discussed much in the literature, so I am not sure why it is mentioned but not BCNF.
--StatsJunkie (talk) 04:22, 23 October 2022 (UTC)
- It is listed in various places and a link to its own article is provided. I'm sure a new subsection Satisfying BCNF would be most welcome. Rp (talk) 18:55, 24 October 2022 (UTC)
- I would appreciate this section as well. Perhaps a link to https://wiki.riteme.site/wiki/Boyce%E2%80%93Codd_normal_form#Achievability_of_BCNF would suffice. Jtbwikiman (talk) 20:04, 8 July 2024 (UTC)
Karnaugh map
[edit]How would I know if I had reached the most normalised/optimised stage? Is there any tool like Karnaugh map that lists all permutations/combinations?Anwar (talk) 13:01, 22 May 2008 (UTC)
- I don't know, but for small schemas (say, 20 relations or less) this is easy to see. A related issue is that there may be implicit dependencies that have not been marked explicitly in the database schema - determining those is not so easy, see e.g. [1] Rp (talk) 08:49, 7 July 2009 (UTC)
- We can use the concept of normalization only for the time of removing the redundancy data in the databases. When the data gets the unique data in the records, may get reach the normalised stage. —Preceding unsigned comment added by 210.212.230.196 (talk) 09:52, 18 December 2010 (UTC)
- Normal forms are defined as conditions; you can simply check whether the required conditions hold to know whether a database schema is in a particular normal form. Rp (talk) 23:38, 26 December 2010 (UTC)
Trade-off
[edit]The article does not explain the trade-offs suffered with normalisation. For instance, a highly normalised database needs more tables each stripped to the bare minimum. So serving a singular query would require pulling data from several tables. This costs time and money. In a way, the business process is not optimised (though the database is!).Anwar (talk) 13:18, 22 May 2008 (UTC)
- Do you have an article or a computing paper where this is explained, so we can add it better to the article? --Enric Naval (talk) 11:43, 24 May 2008 (UTC)
- A normalization, and thus pulling data from several tables doesn't cost significantly more time. In fact, a database that is not normalized will take much more CPU time and Disk I/O to process, requiring more resources as query load increases. Then your business will either need to purchase more hardware, or re-architect it's database and code base, and trust me, that costs alot more. Common sense always dictates that you "Do it right the first time" 71.231.132.75 (talk) 14:38, 24 December 2009 (UTC)
- This is not correct - certainly deviations from normal form can cost a lot of space and query time, but denormalization can also help performance - for instance, if certain joins are very frequent, it may pay off to denormalize the database over those joins. Rp (talk) 21:56, 26 December 2009 (UTC)
- A general guideline is that transactional databases (intended for high-concurrency, high-volume interactive updates, inserts and deletes to the data - such as databases used for E-commerce transactions) SHOULD be maximally normalized, because if you don't, changes to certain data elements have to be made in two (or more) places. Denormalization is normal practice for data marts/warehouses, which are read-only when operated interactively (their contents are generated by scripts from a transactional database) to improve query performance. Sometimes, de-normalization can be extreme, as when precomputing aggregates at various levels of hierarchy - e.g., grand sales total, total by division, region, salesperson, product, etc. Prakash Nadkarni (talk) 01:20, 11 May 2020 (UTC)
Denormalization
[edit]The statement, "It has never been proven that this denormalization itself provides any increase in performance, or if the concurrent removal of data constraints is what increases the performance." needs more backup on that argument or needs to be revised or removed. Volomike (talk) 12:53, 17 January 2009 (UTC)
- It's wrong. Denormalization doesn't remove data constraints. It just means that instead of maintaining two tables, you maintain their join. This can be faster if you often need that join. Rp (talk) 21:13, 1 July 2009 (UTC)
- PS the article should bother to explain this. Rp (talk) 21:19, 1 July 2009 (UTC)
- Agreed. 99.60.1.164 (talk) 17:13, 22 August 2009 (UTC)
Request for normalization
[edit]Please help with http://strategy.wikimedia.org/wiki/Proposal:Assessment_content and http://strategy.wikimedia.org/wiki/Proposal_talk:Assessment_content#Normalization_of_assessment_items_.28questions.29_in_database 99.60.1.164 (talk) 01:37, 23 August 2009 (UTC)
- Please note that [2] contains a list per quesion with each element containing two timestamps, so this is definetly a 6NF-level problem. An easier, and perhaps more important sub-requirement is the review system in [3], in particular, this example outlines a sub-schema related to
- a selection of text or a url (to a permanent aritcle version or diff, etc.) could be an item for which multiple, randomly-selected reviewers chosen for their stated familiarity with a topic area would be selected. Those reviewers could be shown that text or url (perhaps as part of a list of such items) in a securely authenticated and captcha-ed channel. They would be asked to vote on the accuracy of the item, and have the opportunity to fully explain their votes in comments. If a statistically significant number of votes are in agreement, then the item could be approved as to veracity or rejected. When the votes are not in agreement, then additional voter(s) would perform a tie-breaking function. Each voter's track record in terms of agreement with other voters could be recorded secretly and used to (1) weight their vote to nullify defective voters, and/or (2) used to select whether the voter is allowed to become a tie-breaker.
- The fields required to support that need to be added to [4]. 99.35.130.5 (talk) 18:00, 10 September 2009 (UTC)
ugly tables
[edit]this article has the ugliest tables i've seen —Preceding unsigned comment added by 67.187.187.128 (talk) 00:42, 27 November 2009 (UTC)
- You noticed that too? 71.231.132.75 (talk) 14:40, 24 December 2009 (UTC)
Elementary Key Normal Form
[edit]This article should include EKNF (given this name in 1982 by Zaniolo, in his paper "A new normal form for the design of database schemata [5]), a normal form which is stronger (more faithful to the principal of separation) than 3NF and which has the "complete representation" property proposed in 1976 by Philip A Bernstein (Bernstein P.A., "Synthesizing third normal form relations from functional dependencies" [6]) as a criterion for schema synthesis algorithms.MichealT (talk) 11:57, 7 March 2010 (UTC)
- Please add it - and while you're at it, perhaps you can remove Date's 6NF which really doesn't fit in with the rest, but I hesitate to touch it. Rp (talk) 23:40, 26 December 2010 (UTC)
Normalization
[edit]In researching the area of normalization, one thing that I often see mentioned is the "insert" or "delete" anomaly. I don't understand how this is an anomaly, let me explain.
Assume a system intended to store information about students and registered courses. If a logical schema presented the following requirements:
-each customer shall enroll in one or more courses and -each course shall have one or more students
Then one can legitimately arrive at a single table
(S#, SNAME, C#, CNAME)
which would face the update anomaly and redundancy.
But, the "insertion" anomaly and "delete" anomaly are a result of the statement of the logical requirements that state that no course may have zero students.
By changing the requirements to be:
-each customer shall enroll in zero or more courses -each course shall have zero or more students
The supposed "update" and "insert" anomalies may be eliminated.
In the original paper by Codd about normalization, I see reference to "redundancy" and attendant update anomalies but where did insert and delete anomalies come from?
130.215.36.61 (talk) 11:01, 16 September 2010 (UTC) amrith
- You are mistaken. The anomaly has nothing to do with requirements (whether courses without students may occur), but with expressiveness: in this table, courses without students cannot be represented, unless we introduce NULL student IDs and names (and NULL is notoriously difficult to deal with consistently). So one may argue that here, the purpose of normalization is to avoid NULLs. Rp (talk) 11:14, 16 September 2010 (UTC)
--Dqmiller (talk) 17:36, 14 December 2012 (UTC)--Dqmiller (talk) 17:36, 14 December 2012 (UTC)
- Delete anomalies can occur for your simplistic table a couple ways. For example, if you delete the only student enrolled in a course, then you also delete the course. Another kind of delete anomaly occurs when a course has multiple students and a "delete course" operation does not catch them all. Neither of those can happen if the structure is better normalized, such that each course and each student are only represented once.
Non-repeating groups
[edit]If I understand "repeating groups" correctly, having no repeating groups sort of just means "don't have separate tables for stuff that could be grouped in a table with the same number of fields as the separate tables". Is that "more or less" correct? I kind of find the example confusing. It's currently presented as:
Customer | Transactions | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Jones |
| ||||||||||||
Wilkins |
| ||||||||||||
Stevens |
|
I'm guessing that this is supposed to represent three different tables in the database that "each do the same thing". Would it be better if the tables were presented like this:
Customer Jones Table | Customer Wilkins Table | Customer Stevens Table | |||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
|
Do you think this more clearly illustrate "repeating groups" than the present example? (And with less chance of confusion?) Jason Quinn (talk) 10:44, 17 June 2011 (UTC)
- I'm not sure I really understand what you intended to say, but the example with repeating groups is meant to be an example of "what not to do", i.e. an example of a relation that is not in first normal form. Think of a repeating group as an array within one row; in the example, there is a column that contains a variable number of transactions (intended to illustrate a clear-cut case, I presume). I am not sure if you are using the word 'table' in the sense of an array rather than representing a relation.--Boson (talk) 12:42, 17 June 2011 (UTC)
- I don't know if I understood the example correctly now. Are the tables given in the example a single table displayed in three separate sections or three separate tables? I was under the impression they were three separate tables. This now illustrates exactly why I think the current nested presentation is ambiguous and confusing. Jason Quinn (talk) 02:38, 18 June 2011 (UTC)
- As I understand the example, it shows (and must show) a single table with two columns (Customer and Transactions). Each Transactions "field" contains an "array" of individual transactions (possibly with a varying number of logical occurrences). I am using terms rather loosely. It may help to remember that we are really talking logical design, not physical implementation. --Boson (talk) 10:07, 18 June 2011 (UTC)
- If this is the case, the example is really terrible and ought to be completely replaced. It seems like most books and websites, use an example of a field that contain multiple phone numbers as a table that is not in 1NF. Conceptually that is so much better than there's no comparison. Jason Quinn (talk) 09:50, 19 June 2011 (UTC)
- I think the current example shows the issue pretty well. 1NF is about avoiding nested records, i.e. not allowing a single field to contain a set of values. In his paper Codd himself describes normalizing to 1NF as "eliminate domains which have relations as elements" - i.e eliminating nested tables. So an example with a nested table is a good example of what 1NF is supposed to eliminate.--80.62.117.218 (talk) 17:15, 8 June 2021 (UTC)
- I don't really see why phone numbers should be better than transactions. I don't know who created the example, but they may have wanted to indicate that a repeating group containing more than one attribute is more typical for a non-1NF database. For that reason an example with only one attribute is, perhaps, not very good and might make the uninitiated think of an (atypical) "repeating attribute". Can you think of a better way of illustrating a repeating group containing more than one (visible) attribute per occurrence.--Boson (talk) 14:41, 19 June 2011 (UTC)
- A similar example of normalization is contained in Date's An Introduction to Database Systems, but is is perhaps clearer because the sub-column headings are shown only once, in the heading row (immediately under Transactions), though, I suppose, it could be argued that that is not as good, because a relation includes the headings and we are effectively talking about relation-valued domains. I don't know if this is the case in later editions of the book; my copy is from 1990. By 2003, of course, Date was saying explicitly that attributes could be relation-valued. I am not expert enough to judge if there really is a contradiction between his earlier and later books, but we need to be careful about how we define 1NF and what examples we use. By the way, the current example seems to go back to a rewrite of the article in January 2009 by Nabav (talk · contribs).
- --Boson (talk) 16:17, 19 June 2011 (UTC)
- Hi, just to clarify: my intention in the Customers & Transactions example was to do what Boson described, i.e. depict a data structure in which each Customer is associated with "package" of transactions. Any number of transactions can be contained within a given package.
- I would like to avoid the implication that the data structure as a whole is a table. Also I'd like to avoid the implication that the "package" of transactions is a table. Before Codd came on the scene, structures like these were implemented in, for example, hierarchical databases (rather than in relational tables, which didn't exist). One of the most fundamental characteristics of relational tables and 1NF is that they allow us to get by without complex structures like the one in the example.
- To avoid people thinking of the example in terms of tables, perhaps we could make the example look less tabular. This could be done by arranging the transactions for each customer HORIZONTALLY, rather than one on top of the other. --Nabav (talk) 13:34, 12 July 2011 (UTC)
--50.132.39.209 (talk) 05:02, 15 December 2012 (UTC)
- Strictly speaking, the comment "[this is] an example of a relation that is not in first normal form" is inconsistent because, by definition, a relation is always in 1NF. Not so a table, however, and I believe the intent was to point out that the table was not in 1NF and, therefore, not a relation.
- Moving on, I dispute the assertion that "repeating groups" is a 1NF violation in this example. In fact, the usual attention to "repeating groups" is not very helpful. I say that despite the fact that "eliminating repeating groups" is often cited as the first step of normalization. That may be relevant to legacy data structures that accommodate records with a variable number of fields, but it is not really applicable to a table structure that has the same number of columns in every row to begin with. A table, as we know it, is intrinsically free of repeating groups, making that consideration unequivocally moot.
- So, while I do agree the table is not in 1NF, "repeating groups" is hardly the reason. The table is not 1NF (and it, therefore, cannot represent a relation) only because it permits duplicate rows. Express a key on the Customer column and then the table satisfies 1NF. The fact that the Transaction column appears to have a table-valued datatype (or possibly even a relation-valued datatype) is interesting--perhaps even a provocative design--but is not a disqualification for 1NF.
- Much of the issue here seems to stem from that the article does not actually define what a repeating group is, or if such definition exists somewhere, it is not associated with the term "repeating group". In the Satisfying 1NF section, repeating groups are treated as being enumerated attributes for storing multiple values of the same information. In the example discussed here, repeating groups are depicted as non-atomic attributes, attributes that contain several, different, pieces of information as one value; and they are also depicting multi-valued attributes, attributes that for a given prime key, return multiple values. The Transaction attribute contains three pieces of information, Transaction ID, Date, and Amount, but for some Customers, it also contains multiple values of those three pieces of information.
- The solution to enumerated attributes, as discussed in Satisfying 1NF, is to move that information into a new table where each value can be it's own row that references this table. The solution to multi-valued attributes is the same, as they are essentially the same problem. The solution to non-atomic attributes is to split the attribute into it's atomic constituent values, each with their own attribute. Ultimately, the article refers to non-atomic and multi-valued together ambiguously, as well as enumerated attributes separately.
- Other sources have a similar lack of clarity, with some describing partial dependency, surely that at least, is incorrect. 68.54.2.135 (talk) 00:11, 26 March 2020 (UTC)
- It is not correct that a relation by definition is 1NF. In Codd's paper he shows how to perform 1NF normalization on a relation, i.e. the data was a valid relation already before it was normalized to 1NF. "Repeating group" is an old term from hierarchical database systems which just means nested records. Codd explains that repeating groups is equivalent to what he more formally calls "non-simple domains", i.e. attributes which does not contain single values but sets or relations. In other words, nested tables.--80.62.117.218 (talk) 17:15, 8 June 2021 (UTC)
The Design Exercise
[edit]- From where I stand, there is an important aspect of Data Normalization that is completely missing in this article. That aspect is the use of these procedures in the design process of any database - or even for the evaluation of how well classes have been partitioned in object-oriented coding.
- I have been programming professionally since 1971 and have found these NF definitions useful since I have started using them - decades now.
- The point of DN is not that it is an implementation goal, but that it is an powerful design tool. Any data base, regardless of how it is to be implemented, should be fully normalized (at a minimum, 3NF) during the design process because that discipline will reduce the chances that the purpose behind the data fields is being missed. DN forces the designer to ask the right kind of questions and collect the required data so that the system, once implemented, will perform productively.
- In the case where there is an existing system which is simply being automated, when interviewing the users of that system, they will described the data fields in the common tougue - very loose terms that barely touch on the purpose of a field. Only on careful investigation can keys and dependencies be determined, and once determined they will commonly result in discoveries that result in more user interviews. As that information is collected, it needs to be recoded and the "common tongue" is not good enough. The simplest comprehensive form for describing the database at this stage in the design process is by describing it in terms of a normalized database.
- Once the database has been documented in this form, the next stage of the desgn process is how it will be implemented. At this point the designer needs to consider precisely how and when backups and restores will be done, whether the harware will support various query and update transactions, etc. At this stage, the database implementation will be described and, except for rare or trivial cases, substantial denormalizations will be applied.
Scott Bowden (talk) 16:08, 9 February 2012 (UTC)
- I would say that probably belongs mainly under Data model, Data modeling, Entity-relationship model, etc., but there doesn't appear to be much in the way of linking to those articles. Perhaps a short section here with a hatnote pointing to those articles? Adding links wouldn't hurt, anyway. --Boson (talk) 20:42, 9 February 2012 (UTC)
Article too technical?
[edit]This article was tagged, a few months ago, as being too technical, but apparently without any suggestions for improvement or explanation of what is not understandable. This article describes a specialist activity. I think the objectives of normalization need to be explained in terms the non-specialist can understand, but I think this is adequately done. I'm not sure that it is possible to describe the concepts that are central to database normalization in less technical terms. I believe they are introduced in a way that a non-specialist can grasp at a level appropriate to their level of knowledge. Is it possble to express the concepts in less technical terms without introducing serious errors or misunderstandings? If the inadequacies cannot be decribed in more detail, I propose to remove the tag. --Boson (talk) 16:25, 21 July 2012 (UTC)
- I think the first paragraph can be improved further. I don't think of 'dependency' and information being 'propagated through the database' are the clearest way to express the intention. I'd rather say that normalization strives to represent each atomic piece of information as a single value, such that when information changes, that change can always be expressed as a change to the corresponding values in the database and nothing else.
- First normal form is not allowing multiple values in a single table cell. I don't think it is associated with a particular normalization operation; a candidate would be the unnest operation of nested relational algebra, but that only covers cases where a column represents sets of values, and first normal form is usually understood in a broader sense.
- The other normal forms are very different, and very similar to each other: they are various degrees of vertical decomposition of tables into smaller tables, such that joined together these tables produce the original table. Here, the goal is complementary to that of first normal form, namely avoiding duplication of the same values across different rows in the same table. Any explanation of these normal forms should start with the project and join operations of relational algebra. By skipping this connection the present article jumps from a very general and imprecise introduction to excessive detail. Rp (talk) 17:10, 21 July 2012 (UTC)
- I agree about the first paragraph, though I'm not sure about "strives to represent each atomic piece of information as a single value . . .". That could be thought of as a definition of 6NF (which few people strive for) - at least if we equate atomic with irreducible. On reflection, I agree that more is needed to bridge the gap between the general/imprecise introduction and the more detailed technical stuff.--Boson (talk) 19:26, 21 July 2012 (UTC)
- Addressing the "too technical" tag, I propose to add this sentence somewhere in the "lead" section:
- For instance, a unique ID is stored everywhere in the system and it's name is just held in just one table. The name can be updated more easily in one row of one table.
- adding an example too...
- A typical update would be the R.I.M. company changing its name to Blackberry.[1] That update would be done in one place and immediately the correct "Blackberry" name would be displayed throughout the system.
- CITE REF USED ABOVE: http://bgr.com/2013/01/30/rblackberry-z10-q10-announce-311920/
- Mediation4u (chat) nb: editing is fun 13:22, 3 June 2014 (UTC)
- I have now added this to the article, after 2 weeks allowed for any feedback. Mediation4u (chat) nb: editing is fun 12:20, 19 June 2014 (UTC)
- Addressing the "too technical" tag, I propose to add this sentence somewhere in the "lead" section:
References
- ^ Reed, B. "R.I.M. changes its name to Blackberry". BGR.com. Retrieved 2014-06-03.
Example for Full functional dependency seems to be inconsistent
[edit]"Even by the removal of {Employee ID} functional dependency still holds between {Employee Address} and {Skill}."
How can there be a functional dependency between {Skill} and {Employee Address}?, consider two people living in the same house. — Preceding unsigned comment added by Prathik Rajendran M (talk • contribs) 13:21, 29 December 2012 (UTC)
I agree that the quoted claim in the above comment seems incorrect, although I think think that the above argument makes a case for why {Employee Address} -> {Skill} is not true, while I think it is more pertinent to demonstrate that {Skill} -> {Employee Address} is not true (which is what I think the example in the article argues in its explanation). This argument, though, is even easier to disprove than the one that Prathik debunked, as it is quite possible for someone living at address A to possess the skill of typing and someone else at address B to also possess the skill of typing.
But regardless of which claim must be disproved ({Skill} -> {Employee Address} or {Employee Address} -> {Skill}), they both seem to be incorrect. Perhaps the example should argue that {Employee Address} does not have a full functional dependency on {Employee ID, Skill} because {Employee Address} has a functional dependency on the proper subset {Employee ID}, although even that is debatable as some people have more than one residence, but it was something that seemed to be enforced in the mention of the update anomaly. Brennere (talk) 19:08, 30 January 2013 (UTC)brennere
References
[edit]The Mike Chapple article on About.com mentioned at the bottom is Ghostwritten, it's copied verbatim from the Microsoft Access Manuals it seems.. If you look in the 2NF reference, there is a a reference to "Sea Cliff, NY 11579" and "Miami, FL 33157" which does not have a corresponding figure on about.com, to locate the figure I did a search on google and found a number of documents, including a PDF of a technicians manual at AIMS.. Which leads me to to believe Mike is either outright plagiarizing, or its ghostwritten by a foreign outsourcer.. There probably isn't a copyright infringement as it seems Microsoft owns 5% of IAC which is the parent company of About.com . It should go noted.. 172.0.207.115 (talk) 02:43, 13 February 2014 (UTC)
3NF/HNF a little confusing
[edit]The article reads "By definition, a table always meets the requirements of its HNF and of all normal forms lower than its HNF; also by definition, a table fails to meet the requirements of any normal form higher than its HNF" but then goes on to say that a database in 3NF is usually also in higher normal forms. I feel like this is sort of confusing. So 3NF isn't usually the HNF for a 3NF table? Some guy (talk) 19:38, 10 June 2014 (UTC)
- I think this just means that the "highest normal form" of a table is defined as the highest NF that the table meets.
- So if you say that the higest normal form of a table is 2NF, this means that it is not 3NF and that you can also say it is 1NF (because 2NF meets all lower requirements.)
- On the other hand most 3NF tables meet the requirements of, say, 4NF, so 3NF is not the HNF for most 3NF tables.
- --Boson (talk) 21:15, 10 June 2014 (UTC)
- But if a table meets the requirements of 4NF why would one say that it is 3NF? Is it not standard practice to state the table's HNF as its NF? Andrew S. (talk) 14:52, 24 July 2015 (UTC)
- It depends on context. If you say all tables must be in 3NF, you would not exclude tables that are in 4NF, so if someone asked if a table was in 3NF, you would say yes, though that is not its HNF. Similarly, an even number remains an even number, even if it is divisible by 10. --Boson (talk) 17:34, 24 July 2015 (UTC).
- But if a table meets the requirements of 4NF why would one say that it is 3NF? Is it not standard practice to state the table's HNF as its NF? Andrew S. (talk) 14:52, 24 July 2015 (UTC)
- The problem I see here is 1NF. 1NF should be excepted here. It is fundamentally different from the other normal forms, and the article should explain this better:
- 1NF is a property of the individual values that appear in a database, the value domains; it doesn't say anything about the structure of a database or the relationships between values within a database. There is no standard way to mathematically formalize 1NF: if you want to do it, you have to do it in terms of constraints on values in domains, or in terms of how values in the database may be used by queries performed on the database.
- 2NF and higher, on the other hand, are completely independent of the particular values in the database, and are all about dependencies, i.e. a particular kind of cardinality constraint on the relations formed by the values in tables.
- Hence, being in 1NF and being in 2NF are independent properties. Even if you provide a mathematical definition of 1NF, being in 1NF will be independent from being in 2NF. The quote from the article is wrong if 1NF is included.
- 2NF and higher are defined mathematically, and these definitions are such that for each i > j > 1, every database in iNF is also in jNF. Hence, for all NFs above 1, the quote is correct. Rp (talk) 16:11, 12 June 2014 (UTC)
- But there is no NF below 1, so doesn't the statement "a table always meets the requirements of its HNF and of all normal forms lower than its HNF" apply? --Boson (talk) 21:07, 12 June 2014 (UTC)
- ?? The distinction required is between 1NF and >1NF. Rp (talk) 12:54, 13 June 2014 (UTC)
- I'm still not sure I get the distinction being made. I can see that you may want to express the rule for 1NF and >1NF differently in simple "mathematical" terms, but for 1NF the condition "there exists at least 1 lower NF" is false; so in normal language, surely, if the highest normal form of a database is 1NF, then the condition
- "meets the requirements of 1NF and all (= any existing) normal forms lower than 1NF.
- is always true. --Boson (talk) 15:29, 13 June 2014 (UTC)
- One problem is that a table can be in 2NF without being in 1NF, and the text doesn't say what its 'HNF' is in that case. Another is that once you fix this, the statement that a table is always in its highest normal form (by definition of 'HNF') becomes trivial and I think trivial statements only serve to confuse the reader. Yet another is that the statement is confusingly similar to two statements often made about normalization that do make sense, namely,
- that each table in (n+1)NF is always in nNF for all n > 1 (by definition of those normal forms); this does not include 1NF;
- that as a general rule of database design, the lower the normal form, the more important it is for databases to be in that form; this includes 1NF. Rp (talk) 10:28, 20 June 2014 (UTC)
- One problem is that a table can be in 2NF without being in 1NF, and the text doesn't say what its 'HNF' is in that case. Another is that once you fix this, the statement that a table is always in its highest normal form (by definition of 'HNF') becomes trivial and I think trivial statements only serve to confuse the reader. Yet another is that the statement is confusingly similar to two statements often made about normalization that do make sense, namely,
- I'm still not sure I get the distinction being made. I can see that you may want to express the rule for 1NF and >1NF differently in simple "mathematical" terms, but for 1NF the condition "there exists at least 1 lower NF" is false; so in normal language, surely, if the highest normal form of a database is 1NF, then the condition
- ?? The distinction required is between 1NF and >1NF. Rp (talk) 12:54, 13 June 2014 (UTC)
- But there is no NF below 1, so doesn't the statement "a table always meets the requirements of its HNF and of all normal forms lower than its HNF" apply? --Boson (talk) 21:07, 12 June 2014 (UTC)
- If a table "can be in 2NF without being in 1NF, we neeed to change the text of Second normal form, which currently states "a table is in 2NF if and only if it is in 1NF and . . ." (which seems to come straight from Date). Could you suggest an example? --Boson (talk) 00:52, 21 June 2014 (UTC)
- I don't think everybody who defines 2NF requires this. If 2NF requires 1NF by definition, then things become much simpler; the present text can then be clarified by saying that all normal forms are defined such that (n+1)NF always implies nNF. Rp (talk) 22:01, 21 June 2014 (UTC), edited 25 June
- If a table "can be in 2NF without being in 1NF, we neeed to change the text of Second normal form, which currently states "a table is in 2NF if and only if it is in 1NF and . . ." (which seems to come straight from Date). Could you suggest an example? --Boson (talk) 00:52, 21 June 2014 (UTC)
Added section to list ALL the normal forms for which there are Wikipedia articles (with links)
[edit]I realize that some links are duplicates since they exist in the body of some preceding paragraphs but I thought it would be better to duplicate them in a complete list instead of calling the section "Other Normal Forms" and only listing forms not discussed in the rest of the article. Andrew S. (talk) 15:15, 24 July 2015 (UTC)
Expanding on normal forms & clarifying terminology
[edit]By way of disclaimer, I am not a database expert myself — far from it. I'm simply a novice web developer seeking to advance his understanding of databases and relational theory. I've made the following changes; in all cases, please correct me if I'm mistaken.
- My understanding is that normal forms are, by definition, the formal end-goal of database normalization. As such, I have edited the lead section to redefine normalization in terms of normal forms, and extracted the list of internal links to various normal forms out of the "See also" section into its own "Normal forms" section.
- I've also removed bits that appeared redundant to me, as well as the bit about "achieving the optimal structure composed of atomic elements" (which, in my understanding, relates only to 1NF).
More importantly, I'm requesting the following changes from an expert:
- a broad overview of what normal forms are exactly; e.g., the property that each normal form requires all prior normal forms to be met (currently, it only features a brief history of when they were proposed and a list of links to each one); and
- a modest reorganization to ensure that no terms are used before they are clearly defined or linked to corresponding articles (currently, "Objectives" mentions normal forms, which aren't defined anywhere, and "Normal forms" mentions update/insertion/deletion anomalies, which are defined in "Objectives").
Pennbrook (talk) 03:42, 10 January 2018 (UTC)
A step by step normalization example
[edit]I've added a step by step normalization example illustrating each normal form. I've ceased to think of an example for EKNF though... I'm by all means no expert on databases, so I welcome any corrections... Honzikec (talk) 21:29, 23 January 2019 (UTC)
- Thanks Honzikec I think that's super helpful. I just went over it and fixed two problems, hopefully I didn't do anything wrong. I think one remaining problem is that right now it's not clear what's the difference between ETNF and 5NF as the example seems to be essentially saying the same things. --a3nm (talk) 20:25, 21 April 2019 (UTC)
Problem with 3NF in the step by step example
[edit]Not sure of how to fix it, but the step-by-step example seems to have another problem: after the 3NF step, the schema is not actually in 3NF because the dependency removed in the BCNF case also violates 3NF. Pinging Honzikec, in case you have an idea of how to fix the example? --a3nm (talk) 22:20, 27 May 2019 (UTC)
- A3nm, we could probably replace the BCNF example with some reservations inspired by the Wiki article on BCNF (https://wiki.riteme.site/wiki/Boyce%E2%80%93Codd_normal_form) ... I currently don't have time to address this in detail though :( Honzikec (talk) 14:11, 29 May 2019 (UTC)
- Honzikec, thanks for your input. No worries if you can't do it, but just to be sure, you agree that there's currently a problem, right? Also, if you see a way to integrate the examples from the BCNF article in a way that fits with the running example of this article, this would be interesting. (As for actually doing it, maybe I could take care of it.) --a3nm (talk) 15:09, 29 May 2019 (UTC)
Yes, this is extremely confusing and had me scratching my head for an hour until I visited this talk section. Something would seem to be in 3NF but not BCNF if information is missing from the table? That what the example on https://wiki.riteme.site/wiki/Boyce–Codd_normal_form seems to imply.
As stated above, the example involving the anomaly Author => Author country is wrong since this anomaly is required to be solved in order to have a 3NF table. This is true since we are having a transitive dependency, with a non key attribute (Author), thus 3NF does not hold. It is the same case as with Genre id => Genre. Once we solve the Author => Author country anomaly, the table becomes 3NF but also BCNF. In general, as soon as we are 3NF and without at least two overlapping key, the table will be BCNF too. This because a non-key attribute should depend on whole key without transitive dependencies (2NF and 3NF holding). Key attributes have no dependencies, since there is only one superkey candidate: if this was true, we would have a redundat key (say PK is {A,B,C}. if AB=>C, then a minimal super key is {A,B}) I would suggest the following modifications:
- We modify Book table PK from Title alone to {Title, Author} (this is someway more realistic, since there are many different books with same title, think about something like "Linear Algebra")
- We add one more column "Publisher Author ID" which is defined as a unique identifier of an author, within a publisher (ok this is less real world, but in my opinion easy to understand). In this way we have a new dependency {Publisher ID, Publisher Author ID} => Author, and a second super key {Publisher id, Publisher Author ID, Title}
- Removing the {Author} => {Author Title} in 3NF paragraph and choosing {Title,Author} PK, we now have a 3NF table but not a BCNF compliant one. Infact BCNF requires that if we have A => B, A should be a super key, but now we have {Publisher ID, Publisher Author ID} => Author with {Publisher ID, Publisher Author ID} which is not a key.
- To become a BCNF table, we add a many to many table Publishers_Authors with columns PublisherId (FK to Publisher), Author (FK to Authors table), Publisher Author ID. Then we can remove "Publisher Author ID" from Book table and have a BCNF
What do you think? Pluttero (talk) 14:31, 16 May 2020 (UTC)
- @Pluttero: I have realized the same issue while reading the article. So your contribution is very welcome. Please go ahead. — Maggyero (talk) 11:30, 21 June 2020 (UTC)
1NF Example isn't right
[edit]Having First and Last name in the same column isn't atomic. If you want to sort by last name, you can't do it without parsing. — Preceding unsigned comment added by Grr (talk • contribs) 00:41, 9 July 2020 (UTC)
- Whether a particular field is atomic or not depends on how it will be used. Having first name and last name seperated may be neccessary if sorting by last name is something you want to do, but this may not be neccessary in the given scenario. I think the example is acceptable as is. AntPraxis (talk) 10:48, 17 February 2021 (UTC)
- "Atomic" is really a simplification of 1NF. The definition is to eliminate non-simple domains, which means collection types. A string type is considered a simple domain, while a column which allowed an array or set of strings as value would would be non-simple. Deciding on one or two columns for a name depends on business requirements, but is unrelated to the issue of normalization, since in either case the attribute domains are the same: a string. --80.62.117.218 (talk) 11:00, 9 June 2021 (UTC)
The table shown as an example of 3NF does NOT conform to 3NF!
[edit]This table has a transitive dependency {Title} → {Author} → {Author Nationality}, which violates 3NF.
Unfortunately, fixing this is not trivial since later in the section about BCNF this same dependency is described to violate BCNF and is decomposed to illustrate the process of normalizing a database from 3NF up to BCNF. But, as I said, the table is not in 3NF in the first place!
Therefore, the sections: #Satisfying_3NF and #Satisfying_BCNF are incorrect and misleading.
Also see this reddit thread that discusses this Wikipedia page. — Preceding unsigned comment added by 5.173.128.7 (talk) 10:03, 12 August 2020 (UTC)
EDIT: Working with the assumption that no content is better than wrong content, I removed this wrong content for now, but in principle better examples should be provided here instead. 5.173.128.7 (talk) 10:10, 12 August 2020 (UTC)
- The incorrect example you removed appears to have been re-added. Also the BCNF example is problematic because author name is unlikely to be unique. Author-Nationality (which should just be called Author) should have a surrogate key, because there could be two authors with the same name from two different countries. In fact if we consider duplicate author names, author name does not even functionally determine author nationality. AntPraxis (talk) 11:09, 17 February 2021 (UTC)
1NF example: why give publisher a own table?
[edit]The problem is about the subject column isn't atomic. Giving the subject a own table is to normalize this 1NF example, but why is for the publisher column also given a own table? There is even no word about the publisher in the problem description at all. 80.208.142.162 (talk) 16:12, 25 April 2021 (UTC)
Reference to UNF/Unnormalized form
[edit]In the normalization grid, the first step is UNF (Unnormalized form) which is described as requiring primary keys. I think this is potentially confusing. "Unnormalized form" have no exact definition AFAIK, but can be used to refer to "free-form" data like an Excel sheet which allow duplicate rows and inconsistent columns. But Codd defines normalization as starting with relations, which means it is already assumed there is no duplicates, no intrinsic order, a well defined number of attributes and domains and so forth. The terms 0NF (zeroth normal form) and NFNF (non-first normal form) is often used to describe data which is in relational form but does not conform to first normal form (i.e may have nested relations). Therefore I suggest we create a separate page (perhaps called 0NF or NFNF) which describes the constraints of a relational form of data as separate from unnormalized data. --80.62.117.218 (talk) 10:47, 9 June 2021 (UTC)
Use "his"
[edit]"Until the new faculty member, Dr. Newsome, is assigned to teach at least one course, their details cannot be recorded."
"Their"? Who are they? At first I was confused, and then I thought that it probably is the new "inclusive" language. Who cares about the sex of an imaginary example doctor other than some crazy zealots? Use a singular pronoun. I suggest using "his", but this is only because that is the tradition. Again, I do not mind the sex of an imaginary doctor. If you think female representation is needed in fictional examples, use "her". Sin Jeong-hun (talk) 12:11, 11 June 2022 (UTC)
- The singular "they" is a commonly accepted usage in modern English. In this case, the sex of the doctor is completely ambiguous and the language can reflect that. Jtbwikiman (talk) 19:58, 8 July 2024 (UTC)
Using "primary key" in normal form definitions and examples
[edit]After looking at the definition of "primary key" I find it surprising that the term would be involved in the definitions and descriptions of the normal forms at all. The "primary key" concept seems to be an implementation detail of RDBMSs that doesn't even necessarily have to exist on a given implementation.
The part that states "As a prerequisite to conform to the relational model, a table must have a primary key, which uniquely identifies a row." doesn't seem true.
The Relational Model article states "Usually one candidate key is chosen to be called the primary key and used in preference over the other candidate keys, which are then called alternate keys." which implies that it is not strictly necessary to choose a primary key.
It seems like the formal and informal definitions in "Every non-prime attribute has a full functional dependency on a candidate key (attributes depend on the complete primary key)" disagree. If any candidate key can be chosen as the primary key and the informal definition is assumed to be correct, then that implies that the formal definition should say "on each candidate key" rather than "on a candidate key" because we don't know which candidate key will be chosen as the primary key.
I don't know how the canonical texts on normalization define these, but the definitions in this article seem problematic. If "primary key" is not used in the formal definitions, perhaps the use of the term should be relegated to one section that clearly explains that the informal definitions aren't strictly correct and should only be used as an convenience to help guide intuition. Since the large example section accounts for most of the explanation of the normal forms and heavily make use of the term "primary key" it makes it seem as though the term is part of the real definition. 172.10.118.68 (talk) 17:25, 2 April 2023 (UTC)
- You are right. The use of the phrase primary key is wrong and inconsistent. Candidate keys are what matter. Oradium (talk) 22:09, 8 April 2023 (UTC)
Overly Technical and Complicated
[edit]I can't recall a time I've ever left Wikipedia to go and learn the basics of a topic elsewhere, because the Wikipedia article was too complicated. But I just did that, for this topic. I'm a software engineer and even for me, this article is very difficult to follow. Due the fact that I don't yet understand the topic, I can't really fix it. But I (and presumably others) would be grateful if someone could dumb this one down some. At least the introduction and early parts of it. This article seems to assume a high degree of familiarity with a lot of material, and knowledge of a lot of specific jargon. That people coming here may not have. As this is covering a pretty high level and general topic. JaHolo (talk) 21:35, 1 August 2023 (UTC)
- This is where I wound up and 15 seconds in, I felt I had a much better understanding of what Database Noramlization is, than I did in several headache inducing minutes trying to figure this one out. https://www.lifewire.com/database-normalization-basics-1019735 JaHolo (talk) 21:39, 1 August 2023 (UTC)
- Ok, with a tiny bit of knowledge one thing I can say is that the 'Objectives' section seems to assume an understanding of what the Normal Forms are. When that isn't even covered until the next section. And there, it's covered in a pretty technical way. A paragraph of introduction on the normal forms - at the end of the introduction or before 'Objectives' would be extremely helpful. JaHolo (talk) 22:50, 1 August 2023 (UTC)
- That article isn't a very reliable technical source. It appears to be a blog. "Dumb it down" isn't one of the goals of Wikipedia. Andre🚐 21:50, 23 November 2023 (UTC)
- Agreed. I can recall hundreds of times I had to go learn the basics of a topic elsewhere, but I suspect I spend more time on math-related pages than JaHolo.
- This page is not a beginner's tutorial, like it or not. It does not need to be immediately digestible to all software engineers. Jtbwikiman (talk) 20:00, 8 July 2024 (UTC)