Talk:Hash table/Archive 1
This is an archive of past discussions about Hash table. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page. |
Archive 1 | Archive 2 | Archive 3 |
Keys
There was a statement saying that keys are usually strings, but don't have to be. I don't think this is generally true, and as it doesn't add much, I took it out. --BlckKnght (it is probably true of perl and awk programs --drj)
- I just changed it to
Keys may be pretty much anything, a number, a string, an object; with some hash tables even a reference to the value being stored. The only requirement is the existence of a hash function for the keys used.
Pictures
Im actually a coder, I've used hash tables before through STL, I vaguely recognize them as being "fast" in some magic way. But, I dont UNDERSTAND them. This little text helped a little bit, now if I just draw this thingy on a paper, then maybe I'll get it. Anyway, what I wanted to say was that, some kind of example of a simple, simple hash and the way it works would be great. If I get the time, maybe I'll do it myself and add to this page, or subpage /Example or so. sandos
- I'll second that. This page is screaming for some nice pictures.
- Yes, I agree. After taking a look in my old algorithm books and thinking about it and drawing on paper for some hours I made a first pic and put it in the article. I am not completely satisfied with it and another editor (Deco) pointed out that the pic should perhaps also show a hash collision. So I will try and make a new pic after a bit more thinking about it. Any comments on the pic are of course welcome! --David Göthberg 12:53, 1 November 2005 (UTC)
- Excellent, thank you very much for that. This I think is 90% of the work done. Showing a collision, and maybe show how different collision policies would handle a few collisions (or just the one) would get be perfection. Anyway a couple of minor points:
- I would remove the following in the caption: "Note that the key space is humongous but the hash table is only 1000 slots, which easily fits in computer memory and thus allows quick lookups." The point here is not that "if we didn't have hash tables we would need a gazillion Gigabytes to store a phone book", so you are risking of confusing the reader IMHO. Also the definition of what "easily fits in computer memory" is quite arbitrary especially considering Moore's law.
- Along the same lines, I find it a bit misleading to mention the "key space" in the picture. I would take out "A", "AA", and the last ones and leave only meaningful examples, for two reasons: a) These key don't seem to map to any index (I know you did it like this because otherwise it would have been a mess, but that's exactly my point, keep it simple) b) Realistic examples of key spaces are infinite (e.g. arbitrarily long strings or files)
- My choice for the headings in the picture, to be consistent with the definitions in the article and to avoid confusions, would be: "Keys", "Indices in the array" or "Array", and "Key-value pairs (aka records)". PizzaMargherita 20:43, 1 November 2005 (UTC)
- Excellent, thank you very much for that. This I think is 90% of the work done. Showing a collision, and maybe show how different collision policies would handle a few collisions (or just the one) would get be perfection. Anyway a couple of minor points:
- I personally like the inclusion of the entire keyspace, since it emphasizes the domain of the hash function, and in math diagrams the domain of a function is often made explicit. I agree that the caption should be changed, but we can fix that up ourselves. Deco 04:19, 2 November 2005 (UTC)
Ok, here are some picture examples. These two are my suggestions for "simple" pictures to have at the top of the article: --David Göthberg 09:49, 2 November 2005 (UTC)
Note: These pictures have been changed several times since the discussion started out and thus now have different content then when the discussion on this page started. --David Göthberg 13:47, 29 January 2006 (UTC)
These pictures are for the section of the article about hash collisions. (I don't think the picture at the top of the article should show a hash collision since the top picture should be a bit simpler.)
My preference goes to pic 32 over pic 22. Also another thing: on the right-hand side of the diagrams (i.e. after the hash mapping) I think it would be better not to include arrows if we are not dealing with pointers. I.e. I would use arrows only for chaining. Something like this (I hope it's clear):
Intro:
| 1| Lisa Smith: +44 1234567 | : : : |873| John Smith: +44 1442121 | : : : |974| Sam Doe: +44 8933234 |
Chaining:
| 1| *-+--> | Lisa Smith: +44 1234567 | : : : |873| *-+--> | John Smith: +44 1442121 | : : : | : : : v : : : | Sandra Dee: +44 8768768 | : : : |974| *-+--> | Sam Doe: +44 8933234 | ^ these are the cells of the array, containing a pointer to a linked list.
Open addressing:
| 1| Lisa Smith: +44 1234567 | : : : |873| John Smith: +44 1442121 | |874| Sandra Dee: +44 8768768 | : : : |974| Sam Doe: +44 8933234 |
PizzaMargherita 12:46, 2 November 2005 (UTC)
- Ok, I see what you mean, and I have considered that alternative myself. But I prefer arrows since in many cases the records are stored in separate allocated space even when using open addressing since records can be of pretty much any size. Anyway, I updated the pics a bit and changed to the ones I think we can agree/compromise on in the article. Hope you like it. --David Göthberg 18:57, 2 November 2005 (UTC)
- "I prefer arrows since in many cases the records are stored in separate allocated space even when using open addressing since records can be of pretty much any size" - Uhm... I think you are confusing two levels of indirection. Say both Name and Telephone number are strings. In open addressing you would still store the two pointers right in the array, whereas in the chaining case you would store a pointer to these two pointers. So that's why I think that having arrows in both cases (and in the introduction) is confusing. PizzaMargherita 19:11, 2 November 2005 (UTC)
- I just noticed the latest updates, this is excellent stuff, thanks for all the work on this. I particularly like the "open addressing" one, with no arrows and the empty slots. I think that the introductory figure should also look like that. Also I changed my mind about the "Array" heading (sorry, I would edit the picture myself and show you what I mean if I could) - I think it should be "Indices", because the array is really the green part, whereas the pink part is immaterial. PizzaMargherita 22:08, 2 November 2005 (UTC)
- Well, as far as I know using chaining is the more well known method, so for me it feels more natural that the intro picture uses chaining. That is, to leave it as it is. (Besides I think chaining looks better, but guess you don't.) And yeah, the naming of the pink array of indexes/hashes is tricky. It is literally an array/table of hashes/indexes. And in the chaining case it sort of is a tangible array (well of slots for pointers) but in the open addressing case it is intangibel/immaterial which makes it even trickier to draw and name properly. By the way, I checked my dictionary, both "indices" and "indexes" are valid forms, and indexes won the google check. So perhaps "indexes" or even revert to the slightly confusing "hash table" again? I guess we should think about this for a while before we decide. --David Göthberg 03:52, 3 November 2005 (UTC)
- Indices/indexes. I normally use indices because that's the original Latin plural, but I don't really care. This actually raises the point that we ought to use it consistently throughout the article, at the moment it's a mixture.PizzaMargherita 09:26, 3 November 2005 (UTC)
- "Well, as far as I know using chaining is the more well known method, so for me it feels more natural that the intro picture uses chaining." I now see where you are coming from. However, I think that since we are illustrating (literally! :-) a concept, given the choice we should stick to the simplest case, not the most common.
- "And in the chaining case it sort of is a tangible array". The indices are always intangible in the array (they are tangible if you have a variable that temporarily stores an indes, but that's another story). So I think "Indexes" for the pink column is always appropriate. Which brings us to the point I mentioned earlier, i.e. that we should explicitly draw pointers (which are tangible) in the chaining picture, much in the same way they are drawn in linked lists. I will upload what I have in mind tonight. PizzaMargherita 10:36, 3 November 2005 (UTC)
- Well, now I think you are being C-centric and are going too much into detail. If you code this in Java both the open addressing and the chaining case would actually be slots with pointers to objects (the records) but from a programmer point of view it would look and feel more like slots that store the object, not pointers, in both cases. (A linked list in Java kind of "stores" the next object "inside" the previous object.) I view it more as conceptual pictures: The key is hashed and thus gives and index, the index in some way indicates where the record is, no matter if the record is stored in the index array itself or if the index array only has a pointer to the record. And in many languages (like Java and Python) the difference between storing and pointing is more or less hidden. But sure, give it a shot and make C-centric complex pictures with pointer slots and arrows originating from within the slots and the index numbers (hash values) "floating" outside the slot since they are not slots (not tangible) them selfs. So lets see how it looks. But I think it will cause too complex pictures. But I beleive in trying and looking at the result before saying no. :) --David Göthberg 11:53, 3 November 2005 (UTC)
- (Enjoying the debate) Look at the pictures in the article about linked lists. They use a box (together with an arrow) to represent a pointer (or a reference, or a memory address, however you may want to call it). That is not being C-centric. That is Computer Science.
- "If you code this in Java both the open addressing and the chaining case would actually be slots with pointers to objects (the records)". In case of open addressing it would be pointers to record objects. In case of chaining it would be pointers to node objects, not record objects. PizzaMargherita 12:26, 3 November 2005 (UTC)
- Please disregard my ASCII "Chaining" picture above, it's slightly inconsistent/wrong. This is what I think we should do, barring topologically equivalent rearrangements
Chaining:
|873| John Smith: +44 1442121 | *-+------> | Sandra Dee : +44 345345 | | ^ \___________________________/ \___________________________/ | ^ ^ | First node of linked list Second node of linked list, pointing to null Index
PizzaMargherita 12:38, 3 November 2005 (UTC)
- Ok, I updated pic 22 to how I would do a more complete description of chaining: With intangible indexes (light grey boxes since without boxes it looked weird), each record value having its own field and proper pointers where even the indexes are pointers to the first record. (I placed the pointer slot in the begin of the records so that the pointer arrow would not have to cross a long distance from the end of one record to the begin of the next.) --David Göthberg 13:22, 3 November 2005 (UTC)
- Sorry David, I must have misled you. My previous suggestion was wrong. In fact, thank you for pointing that out. With chaining, arrays contain a node object, not necessarily a pointer/reference to a node object, which would complicate the explaination unnecessarily. So, I stand by my latest suggestion above, which also makes both pictures for open addressing and chaining look more similar as an added effect. Oh, and I like how the new "pair" looks! PizzaMargherita 15:24, 3 November 2005 (UTC)
- I uploaded a more correct/detailed version of the open addressing pic with separate fields for the record values and immaterial/intangible indexes. Unfortunately it made the colours look less neat but I think exactness is more important then nice colours. :))
- I disagree about the chaining. I did understand that your last ASCII picture shows the first list element stored within the index array. But that is not how I learnt to make chaining hash tables when I studied Computer Science in university, that is not what my algorithm books says and that is not what the article text says. For a hash table optimised for lookups and not lookups+inserts, then storing within the array is optimal, sure. But there are several reasons why one instead only store pointers in the index array: One might use a ready made list class for the list of records. With only pointers in the index array one can afford to have a longer array (more sparsely populated) without using much more memory thus causing much less hash collisions, since pointers usually take up much less space then entire record fields. Also when inserting items one can do that either in the end of the list or the begin of the list, but inserting in the end means one has to traverse the list to find the end and new items in many systems are beleived to be searched for more often thus one prefers to insert in the beginning. If the first record is stored within the array inserting in the beginning becomes more complicated since then one needs to first copy the old first element to a separate record and then insert the new first record in the index array. So using pointers in the index array shows it as all the descriptions I have access to shows it. Also I actually think the picture is clearer this way. --David Göthberg 15:07, 4 November 2005 (UTC)
- I see your point. So that's fine for chaining, but can we please have the simpler case (no indirection, like in open addressing) in the introductory picture? Cheers. PizzaMargherita 18:14, 4 November 2005 (UTC)
- You know PizzaM, not all arrows in the world are pointers. You missread the intro picture. It is a generic description of an algoritm, not a detailed description of a specific implementation. What I intend that picture to mean is: The keys tell you the index values (that is, the keys get hashed), then those index values tell you where the records are (positions in an array or actual pointers to records stored separately). That is, keys give indexes give records. Or with the notation I learnt in math lessons in school: keys => indexes => records. Of course, that you missread the picture might mean that the picture is not clear enough. But on the other hand Deco seems to read the picture exactly as I intended so I hope most readers will understand it as I intended. --David Göthberg 19:10, 4 November 2005 (UTC)
- I like having the additional arrows in the diagram because they visually separate the disparate concepts of the table slot and the data stored there, if any. I think the discussion of internal versus external storage is somewhat beside the point, but I will say that in hash tables external storage has the notable advantage that it can save a great deal of space in a sparse hash table (say less than 50% full), even if the data being stored is small and static-sized. On the other hand some resizing policies would prevent this scenario from arising - however, it's still necessary for variable-sized data. Deco 18:33, 4 November 2005 (UTC)
- Ok, fine by me. Thanks again for the effort. I changed the captions, see how you like them. PizzaMargherita 20:53, 4 November 2005 (UTC)
- I'm very happy with the images as they are now. Good job, guys. Deco 21:04, 4 November 2005 (UTC)
- Thanks Deco! And yes, PizzaM, your new captions are short and sweet and very clear, I like them. Now I am just curios for any comments from others how they understand the pics. Trying to explain complicated things in an easy way is always hard. So thanks guys, I hope that's it for now. --David Göthberg 22:08, 4 November 2005 (UTC)
Prime number array size
Can somebody please help me understand why this is?
- Hash table array sizes are sometimes chosen to be prime numbers. This is done to avoid any tendency for the large integer hash to have common divisors with the hash table size, which would otherwise induce collisions after the modulus operation.
I mean, assuming the output of the hash function is random uniform, how is its output modulus 11 any more uniform than its output modulus 12?
- I'll try to. If your assumption was correct then yes.
- Historically, hashing is often done using equations of the form:
- f(a1,a2,a3...) = a1 * n1 + a2 * n2 + a3 * n3 +... where n1, n2, n3... are carefully chosen primes.
- However if any of the n's divide into the length of the table, or they are a common factor with it, then since datasets not uncommonly differ by a single variable, then collisions occur much more often than you would naively expect. Also, if you use n's where varying the a's in a common way for the dataset causes a hash to appear that varies only by one or only a few places, then large amounts of clustering and very poor performance occurs for open addressed hash tables.WolfKeeper
Ok, here's an example. Let's say that you have a table size of 6. And you're hashing a single variable, called v.
To show how it all messes up let's use the following function:
f(v) = v*3
Now, because 3 divides evenly into 6, this is a bad idea...
Let's put numbers 1-5 into the hash table. They hash 3,0,3,0,3 respectively (after taking the remainder for the table length.) So we have 3 collisions. Which is much more than we should have had.
But if we had used:
f(v) = v*5
Then 5 doesn't divide into 6 at all; and doesn't share any factors with 6.
The hashes would have been: 5,4,3,2,1; absolutely no collisions. (This example is a bit forced because I used consecutive numbers, but rest assured even using random numbers would the second hash would have worked much better, since the first hash always lands on 0 or 3 everytime.) WolfKeeper
- I can now see the light, thanks for taking the time. I think that the article doesn't make it clear that there are (for a typical hash function) two nested modulus operations involved. Mind you I should have done my homework and read the hash functions section. Also I finally understand that a hash function that outputs random uniform indices is not necessarily an "ideal" one. An "ideal" hash function, as the article points out, changes half the bits of the output when one bit of the input changes, and does so independently from the rest.PizzaMargherita 08:49, 25 October 2005 (UTC)
Theoretical conflict probability
I would like to question the following
- statistically and in practice, even with excellent hash functions, and even if the array were to be allocated with space for a million entries, there is a 95% chance of at least one collision occurring before it contains 2500 items
I find it dubious, since it is not stating the cardinality of the key space. For instance, if there are only one million possible keys and we are using a perfect hash, the above is clearly false. I suggest we remove the quantitative figures.
- Please explain further, I do not see how that could work. Note that there is a difference between an excellent hash function and a perfect hash function.WolfKeeper
- True, bad argument. Let me try again. I guess my point is: either we quantify "excellent", or all the other figures are dropped, because I strongly doubt that they hold for every hash function and every cardinality of the key space. Or is it 95% is a theoretical lower bound of some sort that can be calculated from the numbers 1 million and 2500 alone? Even if that is the case, a reference/explaination would be called for.
- The thing about 'excellent' hash functions is that they behave psuedo-randomly, and hence statistics may be sensibly applied. The 95% is a theoretical result based on, yes 2500 and 1 million and the birthday paradox, which is already linked to. There's an equation in the linked article that gives 0.9561, when n=2500 N=1000,000, near enough 95%.WolfKeeper
- I see, thanks for the explaination and sorry for missing the link. I propose rephrasing though:
- The hash function should be designed so as to minimise the number of collisions for any index returned. However, even when an ideal hash function produces indices with a uniform probability distribution over the range of an array with, for instance, 1 million elements, the probability of at least one collision occurring before a mere 2500 records are inserted can be calculated to be around 95% (see birthday paradox). Note that real hash functions cannot perform better.
- What do you think?
- I see, thanks for the explaination and sorry for missing the link. I propose rephrasing though:
- I wouldn't use phrases like 'ideal'. And the final 'real hash function' sentence doesn't make sense to me, it raises questions like 'what do you mean, the example is not real?'.
- Of course it's not real (hence "ideal"), because in our mental experiment (or if you prefer, statistical analysis) we are assuming that the output of the hash function is a uniformly distributed random variable.
- Many real-world hashes can achieve this to measurable accuracy in all but deliberate sabotage cases (e.g. md5). And I wouldn't use 'ideal' because it raises questions about perfect hashes, which really are ideal, and that's not what we're about at this point in the article.WolfKeeper
- Also, it's a statistical result, so sometimes you will get better results.
- No, if by performance you mean "probability of at least a conflict" as opposed to "number of conflicts for a given sequence". Shall we say that explicitly? "It's possible to calculate (see bd paradox) a lower bound for the probability of having at least a conflict, given the number of records to be inserted and the size of the array. <example>"
- No, because of concepts like perfect hashes, and optimisation of hashes for particular input sequences or populations. It's not a lower bound. The lower bound is zero.WolfKeeper
- But doesn't this contradict the "even with excellent hash functions" part? It's not a lower bound. So what is it then? And what is the relevance of that sentence at this point of the article? Right now, to me, it sounds like this:
- If a uniform random number generator were to generate 2500 indices for a 1million array, p=95%. However, there is absolutely no relationship between this object and a generic hash function.
- As you say, we are not talking about perfect hashes here, and I'm not convinced that, under this benchmark, optimised hashes perform better than a uniform random number generator. Though I could be persuaded that that's actually the case. Could you please provide a simple example?
- Another thing. I think that in this sentence the reference to the birthday attack (as opposed to the birthday paradox) is not relevant.
- You raise an interesting point about optimisation of hashes given statistical properties of the keys. Is this mentioned anywhere in the article? I think it should...
- But doesn't this contradict the "even with excellent hash functions" part? It's not a lower bound. So what is it then? And what is the relevance of that sentence at this point of the article? Right now, to me, it sounds like this:
- Hey, it's me again (I got myself an account). If I don't hear any objections to my comments immediately above, I'm gonna edit as explained. I'll make it clear that by "ideal hash" I don't mean "perfect hash".--PizzaMargherita 19:18, 6 October 2005 (UTC)
Rehashing and O(1)
> Rehashing is ... by ... recomputing ... the O(1) property can be maintained.
I'm not familiar with the relevant math, but I would speculate that perhaps there are some restrictions as to how frequently the rehashing may be done, to preserve the O(1) ?
Reply: It is similar to re-indexing a database. It's an intermittent procedure that is not a part of the search algorithm. The rehashing is probably not O(1), but you can comfortably divorce the procedure from the actual search.
- Rehashing (i.e. growing the table) is O(n). If you exponentially grow the table size as it fills you get O(nln(n))
Reply: Actually it is only O(n). If you double the table at each step, then for a given n the rehashing will have cost at most 1 + 2 + ... + n/4 + n/2 + n which is strictly less then 2*n, thus the O(n) complexity. The same holds for any exponential growth, like increasing table size by 25% which would be more sensible for very large tables as you want to avoid large jumps in memory use for a very small variation of the input size. The real proof can be done with geometric series.
pfunk42: You can maintain amortized O(1) in the presence of rehashes as long as you increase size geometricly (e.g. double each time it fills up). The idea behind amortized O(1) is that you can pretend with each item you add, you add some time (O(1) each time) to a savings account for execution time. By the time you need to rehash, you have O(n) in your savings account, which you "spend" doing the rehash. So in terms of overall execution time, each "add" can be considered O(1).
Actually, if you delay completion of the rehashing by maintaining two hash tables over multiple calls you can move one object from the old hash table to the new with each access, and then on completion delete the old table. Provided you atleast double the size of the hash table, this is guaranteed to complete before filling the new table up. You do of course have to check both tables in the interim period, but it's still O(1) for each access, and not just amortised O(1).WolfKeeper
- Is this true? I don't think that the scenario that you describe is quite O(1). The reason is that there is no guarantee that you fill up the table before you need to double the size again. Let's say that we do something silly, like create a 3-element hash table, which we promptly fill up. Your algorithm would then allocate a new 6-element hash table. Let's say that we fill that up, without ever accessing the original 3 items. Then, you would create a 12-element hash table. Now, to find an element, you have to search through the 3-elem, 6-elem, and 12-elem table. As you can see, this will continue to degrade. It might not degrade terribly, but it is certainly not O(1). Josh
Revision
I'm working on some revisions of this page. The working copy is at User:T Long/Hash Table. T Long 05:08, 2004 Aug 29 (UTC)
- Revisions done. T Long 06:05, 2004 Sep 4 (UTC)
Simplification
I'm thinking the introduction is a bit terse when it says that hash tables use a hash function to perform index range reduction. This could probably be made friendlier with an example, but I'm not sure just how, and how much overlap there would be with hash function. Deco 20:21, 23 Sep 2004 (UTC)
- Yes, it doesn't read well to me. Ok, I admit it; I don't know what range reduction is; even though I've implemented hash tables several times. Presumably they mean that the size of the key is bigger than the size of the hash. If so that's probably not even worth saying; or if it is, it should spell it out.
- I don't like this piece in general. It's written to assume too much knowledge of specialist terms on behalf of the reader. It seems to be about MSc level, when undergraduate or lower is more appropriate. It just doesn't have to be this complicated.
- -Wolfkeeper
- You have my agreement there. The likely article viewers are college freshman. It needs to be dumbed down somewhat. I'll look at this pretty soon if no one else gets to it first. Deco 01:25, 24 Sep 2004 (UTC)
- Looks much better now I think. Good work. - WolfKeeper.
Fixed hash value
I deleted the section on using a fixed hash value because this seems like a bad idea to me. The only conceivable reason one might want to do this is to simulate worst-case behaviour.
- No, the main reason you use it is to make sure that the collision resolution code works correctly; and measure the worst case performance.
It certainly hides errors one might wish to find during debugging, especially in the hash table implementation.
- The hash table code itself is generally very short. It is predominately the collision resolution that is more likely to have bugs in. You take it right back out again if the test passes. You can leave it in if the performance is adequate however :-)
- Okay, this is fine, I really just wanted to hear your justification. I'm sorry for re-removing it, but I felt like you were ignoring me. I apologize again and have added it back with some explanation. Deco 16:35, 3 Oct 2004 (UTC)
Hash tables are deterministic, and the argument used with random number generators does not apply. Can anyone explain this paragraph? Deco 00:33, 3 Oct 2004 (UTC)
- Please answer this question. I don't appreciate having my edits undone without justification, especially when I had a good reason. Deco 15:26, 3 Oct 2004 (UTC)
- The wikipedia NPOV says that it is incorrect to remove other peoples stuff; not that it is incorrect to add back in stuff that other people have removed. Please act appropriately given the nature of the document you are contributing to. Your point of view is not necessarily better than other peoples, only different.
- I'm glad we worked out this issue, but I think you're misinterpreting the NPOV policy. I did not remove the content because I wanted to espouse my own non-constant hash point of view; I deleted it because I thought it was incorrect (partly because I misunderstood the intention). Deleting content is perfectly okay and is part of being bold, although it understandably upsets the original editor, and that's why I asked for your explanation here. Sometimes editors who add things really don't mind them being later removed, if they can agree about why they were removed. Deco 16:35, 3 Oct 2004 (UTC)
DOS attacks
The article claims:
- Additionally, in some applications, a black hat with knowledge of the hash function may be able to supply information to a hash which creates worst-case behavior by causing excessive collisions- resulting in very poor performance (i.e. a denial of service attack). In critical applications, a data structure with better worst-case guarantees may be preferable.
I don't think this is very plausible. Can anyone point to an example in which knowledge of an application's hash function was used to initiate a DOS? Neilc 12:54, 7 Apr 2005 (UTC)
There's a paper on it here: http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003.pdf
Basically, using knowledge of the hash function can make the lookup go from o(1) to o(n^2), so large slowdowns are possible. Wolfkeeper 16:12, 7 Apr 2005 (UTC)
- I don't know if it's ever been seen in the wild but the possibility has been demonstrated to exist. -- Antaeus Feldspar 00:00, 9 Apr 2005 (UTC)
- I can't see how a hash lookup can be made O(n^2) unless the bucket's storage structure is already really inefficient. Even a simple hashtable implementation uses a linked list for internal storage so at worst the slowdown is O(n). Daniel
Hash collision article?
i think the collision resolution stuff should come back into the main hash table page. a hash collision can entail or imply a variety of things in other contexts, and i don't think the fact that they all have to do with "hash collisions" necessitate putting the details for each context in that article. A "hash collision" article should be short and sweet. link to data structures and algorithms that suffer from or must deal with hash collisions. let the linked to articles describe the issues for that context. thoughts? (-- pfunk42)
Let me also add that this article is very horribly written right now. I was going to fix some abhorrently imprecise and even colloquial language, but kept discovering more problems. Maybe I'll play with a major update... (-- pfunk42)
- I agree that there was absolutely no need to remove the hash collision material from this article. It has no use, as far as I know, outside the context of hash tables, and without them hash tables are rather trivial. I despise the introduction of this article (despite having contributed to it); it's oversimplified to the point of meaninglessness, and your comments fittingly describe it.
- I am offended though by your contention that this entire article is horribly written or imprecise. I think the Overview and Problems with hash tables sections are good, and I think the Choosing a good hash function section is alright but needs some specific examples with less focus on hash table size (which, I've read, only really matters if your hash function is terrible). Of course I have my biases as a contributor to the article. Deco 07:26, 7 May 2005 (UTC)
- In its current state, i would equate the Choosing a good hash function section with a steaming pile of feces. Many statements are flat wrong; others are misleading. I just did some stuff to the hash table section of hash function, and i might try to complement that in this article. (-- pfunk42)
- Maybe the "choosing a good hash function" section should be removed entirely in favour of discussing this material in hash function. Nothing about hash functions is specific to hash tables, except maybe for the fact that the range is chosen based on table size. Deco 02:02, 29 May 2005 (UTC)
- I sucked the material back from hash collision, leaving the cryptography info there along with a brief summary of the hash table stuff. I think this is as it should be; I just hope our anonymous contributor doesn't get too excited. Deco 07:45, 7 May 2005 (UTC)
- I've made some changes in various places, but I would appreciate you discussing any major overhaul before you do it. It could use some more formal treatment, but I don't see any major inaccuracies — keep in mind the expected audience, which is probably people just learning about data structures. Deco 08:33, 7 May 2005 (UTC)
O(1) should be O(k) k being the key-length
- By definition, O(k) = O(1). Period. You're confusing run time speed with complexity. They are NOT the same thing.WolfKeeper
- Please read more carefully, WolfKeeper. k is not a constant, but the key-length, so this is about complexity. iSee
- But a well-written hash function is linear on the key length as well so O(aN+bK)=O(1). I repeat, you are confusing run time with complexity.WolfKeeper
- Quote: ... function is linear on the key length... so it's linear, not constant. I've just checked with articles about big-O, and nowhere is mentioned that you linear = constant if it happens to be the input. If you can provide a backup for the claim O(n) = O(1) (the variable in question does not really matter) then I'll admit being wrong. If you are wrong and O(k) > O(1) is true, then you should undo your edit. iSee
This might seem trivial and everyone gets it wrong, but it is of crucial importance when comparing hash tables with tries or judy arrays. For long keys a cleverly implemented balanced binary tree will yield the same complexity as a hash table as, you can only store m^k entries in a key of length k the hash table can not really beat O(log n).
- No? What about a large disk-based databases, with a random accessor? Hashing is optimal there. judy array's are optimised for in RAM accesses... Sometimes complexity does make a difference. WolfKeeper
- Now you are confusing complexity and run-time. For a single select, hash tables probably would have better runtime, while it would probably be worse for sequential access in a cursor. The complexity - however - is the same (unless the hash tables are degraded). iSee
- No, I'm saying that complexity can have an effect on run-time which eventually dominates. The performance relates to the size of the database; tree structures are slower on very large databases, since hash table lookups run at constant speed, whereas tree structures require multiple lookups and comparisons. In addition, although you are claiming O(k) time for a trie, in reality the lookup time is O(k ln(k)) in most cases, although the worst case is actually O(k)=O(1) when a trie is full (but in that case an array lookup has a lower constant of proportionality). With hash tables the time complexity is simple O(1) in all cases.WolfKeeper
- How do you come to the conclusion tries would have the complexity O(k ln(k))? Please elaborate a bit more on that. Thank you, iSee
I do not yet fancy being an administrator or start an edit-war with Neilc, but the point should get clear from what I have previously edited into the article and just explained here.
Regards, iSee
- I've already replied to your comment on my user talk page, but I'll repost my reply here, as this seems a more appropriate place to discuss it.
Well, hash tables are (average-case) constant-time with respect to the size of the hash table, which is what people usually care about. Assuming that the hash runs in linear time (which isn't necessarily the case!), then sure, they would be O(k) where k is the key length. That's worth mentioning, but I don't think it's that important -- we're usually concerned with describing how the performance of the hash table changes as the size of the table grows, not as the length of the key grows. Neilc 13:28, 26 July 2005 (UTC)
- Thanks for your reply, I just figured this would be the better place for the discussion. I have read some literature about places where it would have been better if people would have considered tries instead of hash tables but didn't and got bad performance as double hashing and frequent deletes just don't mix very well. It would be best to give the reader the information and let him/her decide if it matters for the problem at hand. iSee
- Balanced binary trees require about O(klog n) time for lookup, but tries are O(log n). This is why tries are preferred to binary trees for string keys. However, just because log n <= k though doesn't necessarily mean tries are always faster than hash tables in practice. When k is the word size, the hardware is fine-tuned for doing operations on data of that length quickly, where tries will still require k distinct machine operations (or somewhat less, if it's a k-trie for k>2). Tries can also occupy a lot more space. Not a clear cut decision, and not sure which if any of this is relevent to the article. Deco 04:29, 27 July 2005 (UTC)
- The reason why it might well matter is the fact that when choosing a data structure for a problem, you would in many cases go for the one with the best complexity and best constants. The article makes it look like everything else would have complexity worse then hash tables, which is not generally true. I suppose you can bring down the complexity for balanced binary trees to O(max(k,log n)) = O(k) by keeping a pointer to the relative key-position so you do not keep comparing. I did not claim tries are necessarily faster, I claim they have the same best-case complexity. Also it doesn't really matter if bad implementations exist, it matters that there is at least one good implementation (consider Judy arrays). iSee
- This is true. I'll add some stuff re this discussion. Deco 02:44, 28 July 2005 (UTC)
- I edited tries complexity to also be O(k), as it is. While I would have changes the intro, I am reasonably satisfied with the compromise. Thanks for your trouble sorting this out. iSee
I took this out of the article:WolfKeeper
Another weak area of hash tables are large keys, such as strings. Suppose the key we seek is k bits long. Either the hashing algorithm or the final key comparison must examine each bit of the key, so O(k) time is required for lookup.
- since k is constant with varying n this isn't true, it's O(1). Note that the order notation properly always contains functions of 'n'. WolfKeeper
- I beg to differ. Please read the passage in O-notation about multiple variables. iSee
In fact, since k-bit objects only take on 2k distinct values, k ≥ log n. This is one reason that data structures such as tries with O(k) lookup time can be competitive with hash tables.
- tries are O(ln n) worst case, since it devolves into a BST. Tries simply have better constants of proportionality. WolfKeeper
- This is simply wrong. There are no comparisions on the complete key. No matter if it is O(1) or O(k) they have the same complexity (except for the worst case). Please check your claims before removing other people's edits. iSee
- The worst case is when it is full, and then, k=ln n; so the worst case for a trie is O(k) == O(1). The more normal case is O(ln n), because for sparse entries it degenerates to a BST.WolfKeeper
- You're simply wrong about this.
- I'm open to the suggestion. I don't believe so though at the moment.WolfKeeper
- The fact that the usual times are given for fixed k is irrelevent, because I explicitly specify that k is not fixed in this short discussion.
- In what way not fixed exactly? As in it what way would you vary it. I conjectured that it mattered, but it didn't look like it, so now I conjecture that it doesn't.WolfKeeper
- That is to say, the asymptotic lookup time may increase with k; usually we assume k has a single constant value (it is "fixed") in which case it gets subsumed into the big-O notation. In this case I don't mean that different keys in one dictionary have different k, although that can also occur. Deco 03:40, 29 July 2005 (UTC)
- I was wrong that tries have O(log n) lookup time (in fact they have O(k) lookup time, since they perform lookup by examining each digit of the input in some base) but BSTs are definitely slower asymptotically in general.
- Definitely slower in terms of wall clock. But complexity has little to do with that. The Trie performance seems to be similar to a balanced binary tree-O(ln n); you're building an n-ary tree up- how can that not be logarithmic in terms of comparisons? It cannot be as slow as O(k) in most normal cases; the search will end far earlier than that. I'm certain in most normal cases it's O(ln n). Note that ln n is less than k except when the trie is full. Still, in terms of operations-you have to store/compare/retrieve all k-bits of the key, so you can argue that it's O(k) in that sense, but O(ln n) just in terms of comparisons.WolfKeeper
- I wasn't talking about wall time. Key comparisons need not require Ω(k) time, for the reasons you give, and in fact two random keys require only constant time on average to compare, but this ignores the fact that the tree structure gathers keys with similar prefixes together. Later comparisons will require much closer to Ω(k) time. Deco 03:28, 29 July 2005 (UTC)
- Not quite as bad as Ω(k log n) for strings, since not all comparisons look at all characters, but worse than O(k). But this is all OR, so let's just leave it out anyway. Deco 23:28, 28 July 2005 (UTC)
Summarize above discussion
Ok since there are too many different threads in the discussion above, I thought it would be good to bring the dicussion back on track and hopefully find consensus.
1. Wolfskeeper thinks O(k) == O(1). iSee disagrees.
- Certainly if k is constant this is so. If k varies, then in a lot of cases we still have this true in an amortised sense. Note that hash functions are already amortised to O(1) anyway, due to collisions.WolfKeeper
2. Hash tables and tries have O(k) (which may or may not be the same as O(1))
3. For full tables O(k) = O(log(n)), for very sparse tables O(log n) < O(k)
4. No lookup algorithm can avoid to at least once do a full key compare, hence Omega(k)? (weel the least cost)
- Actually, hash functions can sometimes avoid this by having a large enough and good enough hash and precalculating it once before multiple insert/deletes. But normally you're right. Hmm. Come to think of it, strings can do the same thing, so that's not true of Tries either. Precalculating a hash for any string gives fast comparisons.WolfKeeper
5. Typical balanced BSTs have O(k*log(n)) which may be improved by avoiding full recompares, but it is OR how well that works and probably would rather belong in the BST article (if anywhere).
- Or in the self-balancing binary search tree article. I'm sure a reference could be found for this; any reference on the advantages of tries should mention it. Deco 17:45, 29 July 2005 (UTC)
6. Constant key lengths are an improtant special case, so even if iSee is correct at 1., we might want to state O(1) and O(k)
7. Wolfskeeper thinks Tries degenerate into BSTs for low densities. iSee disagrees although there might be optimizations going in that direction. (Then "degenerate" would however be inappropriate).
- On second thoughts it depends on how it is implemented. There's two normal ways to implement Tries, you can use an n-ary tree, or a binary tree. If you use a binary tree you have ln(n) insert (and lookup if you use one of the constant-time algorithms for comparing strings), if you use an n-ary tree, it's normally O(k), which for fixed k is O(1); however it is possible to postpone creating the deeper leaves which makes it O(ln n) again.WolfKeeper
Hope this will work out to something sensible in the end. I also want to apologize if I sound rude at any point in the discussion, which was never my intent. Regards, iSee --ISee 07:02, 29 July 2005 (UTC)
- No problemWolfKeeper
I think the bottom line is that with an optimal implementation creating a key is inevitably an O(k) operation (including hashing the key for fast comparisons). Hash tables are then O(1) for all operations. Tries are actually much slower, in general with variable k, I think O(k ln n) sounds right as Dcoe says. In most situations k can be constant, but there's no particular reason you can't use random novels for keys (except appalling inefficiency :-) ).WolfKeeper
- Actually, most operations for tries are definitely Θ(k), because locating an element in a trie involves literally examining each digit of the key in some base, and nothing more. It's BSTs that the time is less clear for. BSTs are also Ω(k), but if you can assume the element is in the tree you could likely pull off quite a speedup for your "random novels" example. Similarly, if you assume an element is in a hash table, and the bucket you hash to has only one element, you know it's the right one, and the hash function need not use the entire key. So there are some ways of surpassing the O(k) limit, but they involve additional assumptions. Deco 17:45, 29 July 2005 (UTC)
- Actually, taking the novels example, novels diverge very quickly, very few novels indeed have the same first words. When doing a lookup, once you've got down to a single data entry, you can do the hash check for equality, and if that matches you have the right entry. So Tries are simply ln n in most cases where the keys are well differentiated, k is irrelevant if you never have to compare it all. WolfKeeper
- But imagine if the novels were all the same for the first 100 pages and then a few characters different at the end. Then the lookup is going to O(k). So I find that the time complexity is between O(k) (worst case) and O(ln n) (typical) for Tries. So O(k ln n) is wrong after all.WolfKeeper
- Again, ordinary tries always require Θ(k) lookup time. This is simply how they work. Your argument applies to BSTs (and is in fact what I just said). Deco 18:06, 30 July 2005 (UTC)
- Not quite. A BST with 400 records in it has 400 nodes. A Trie with 400 records in it can have many more nodes than that; even with hashing the strings for quick comparison. So ultimately it's less efficient than a balanced tree.WolfKeeper
- You seem to be confused. You don't compare strings at all in a trie. You do that in a BST. In fact, avoiding these comparisons is pretty much the only advantage of tries over BSTs, which is why they can be faster. Deco 02:34, 31 July 2005 (UTC)
- It depends on how you implement a Trie. When adding an item to a Trie you can reach a point where you have just created a new branch of the Trie. The problem you have then is whether you should create all further branches; or leave it until another node is placed at that branch. If you do it right away, then you waste memory and slow the wallclock time. If you don't; then when accessing an item later, if there is only one entry in a branch of the tree, a partial string compare is required to ensure you have the right item.WolfKeeper
- Incidentally, Donald Knuth mentions that Tries are most efficient in the first few levels; and it's a normal trick to do something else after that point.WolfKeeper
I'm okay with the removal of the paragraph. Now that I think about it, the question of what key size is best for hash tables seems tricky. If the key is too small, you spend too much relative time hashing and too much space on bucket link structures. If the key is too large, you spend too much time doing equality comparison. All I can say for sure is that all three data structures have advantages and disadvantages as implementations of associative arrays, with no one clearly dominating. Deco 02:08, 5 August 2005 (UTC)
Hash tables may be persistent datastructures
The main article states "Hash tables are also not persistent data structures." This is false. There have been many database implementations that use on-disk (persistent) datastrcutres. Both chaining and open addressing may be used, though the latter is more common.
See, for example
- J. D. Ullman, Principles of Database Systems. 1982
- R Sedgewick, Algorithms in C. 1992
- I am not referring to "persistent" in the sense of disk-based, but in the sense in which a purely functional data structure is persistent. If you clicked the link "persistent data structure", you would've seen that it was being used in this sense - this term is overloaded. Your new text, using this link, is misleading. I will clarify this point. Deco 19:28, 31 October 2005 (UTC)
Hash Table vs. Hash Map
Would it be more correct to rename the article "Hash map"? Maybe hash table is more appropriate for the array that the hash map uses, i.e. some people confusingly refer to the whole hash map as "hash table".--PizzaMargherita 10:24, 24 October 2005 (UTC)
- I strongly disagree. I have been programming as a hobby since 1982 and worked as a programmer since 1993 and also teached computer science. I always known these structures as "Hash tables" not as "Hash maps". I did the Google check: "Hash table" = 1,7 million hits, "Hash map" = Only 54 thousand hits. I also checked my computer science algorithm books, they call it "hash table". However, since "Hash map" seems to mean the same thing it is correct as it is now that "Hash map" redirects to this article. So in Wikipedia tradition I will add something to the effect "also sometimes called Hash map" to the first paragraph. --David Göthberg 16:11, 31 October 2005 (UTC)
- I agree with David. "Hash table" is technically less accurate than "hash map", but far more commonly used, and it is our policy to name articles after the most familiar name rather than the most technically accurate. Deco 19:38, 31 October 2005 (UTC)
Locality of reference
Is it me, or the graph and caption seem to contradict the main text? I'm confused. Also, would it be possible to enlarge it, so that it looks of the same width as the picture above. The long caption will also benefit from that. Thanks. PizzaMargherita 07:32, 8 November 2005 (UTC)
- I'm sorry if this is confusing. The main point of the graph is to demonstrate how sensitive open addressing is to the load factor as opposed to chaining. Open addressing only has less cache misses on average if it uses a larger table size, and even then only for a certain range of load factors - in this graph, the table size is the same for both. However, a larger table size may be justified for small elements since it would use about the same amount of total space as chaining. Deco 22:21, 8 November 2005 (UTC)
- Ehm, Deco, I think your graph is wrong in two ways: First, I think the graph is about lookups, not insertions. Second it should actually look worse for chaining. I guess you are thinking of a kind of chaining that stores the first record in the array slot. The more common chaining case (and what is described in the article) is to only store a pointer to the first record in the array slot. Then for sparse hash tables it should be about one "random" memory access to get the record for open addressing and about two "random" memory accesses to get the record for chaining. (First get the pointer in the array slot, then get the record it points to.) So the chaining line in that graph should start at 2 misses and then slowly increase, probably with about that angle you have there. (That is, at density close to 0 the average cache misses are 2 for chaining.) But the open addressing curve should look like it does now.
- And if you remake the graph, could you do it in a much larger size so we then can use the Wikipedia built in picture scaling (thumbnailing) to resize it to whatever size is proper? The Wikipedia resampling function is really good. Look at how it treated the other pictures in the article, it handles lines and text very well. It also means the graph can be enlarged the day people have even higher screen sizes then today, making your graph usable even long into the future.
- --David Göthberg 23:57, 8 November 2005 (UTC)
- Oops, you're right about the chaining cache misses. The graph is about insertions, though - lookups are somewhat cheaper because you don't have to scan a whole chain/cluster to find the desired element, and I didn't want to get into that complication. I use a small image size because Wikipedia's resizer tends to produce much larger files than is strictly necessary (yours are huge, over 30k, whereas mine is 5k, which matters to modem users), but you're right, I should at the least upload a separate larger image for the future. Deco 01:39, 9 November 2005 (UTC)
- Yes, the resampled images tend to be very large (in bytes). Since I myself was a modem user until recently I spent some time thinking about that. But personally I came to the conclusion that having large pictures (in pixels) means we can resize the pictures as we wish in articles to get decent layout. I find it hard to make pictures that have the proper size for the article offline, it is usually easier to fix that inside the article. And it allows other editors to change the layout later on. Anyway, of course you do as you wish with your pictures.
- By the way, why do you use two user names? It is confusing.
- Regarding the lookup versus insert: Inserts cause several additional cache misses when the memory manager allocates the records, so I find the insert case much more complicated and I think that it causes more cache misses.
- --David Göthberg 02:48, 9 November 2005 (UTC)
- Hmm, I think the best way to go with the images is to upload one at "full" resolution and one at web size and optimized for size, then link their image pages together, although that is somewhat more trouble. I think I'm going to say rather than insertion time, my simulation reflects the time of a failed lookup (which examines every node in the chain/cluster but does nothing else). I'll fix the label. I use two user names for the purpose of privacy - I've become a bit paranoid about people Googling my original username, which is more unique ("deco" is a word). How effective it is I don't really know. Thanks a lot for your feedback. Deco 03:29, 9 November 2005 (UTC)
- Oh, now I really like the picture, nice work Deco! I took the freedom of changing the caption to "lookup". By the way, I just realised it is slightly evil that you start the y-axis (cache misses) at 1 instead of at 0. Kind of makes chaining look worse then it is.
- I should perhaps mention that the original reason I used the picture width 362px and not 360px was that the top picture did get some resample damages at 360px. I just noticed that both you and I just cut and pasted that size for the other pictures. I guess people in the future will wonder why our pics have those extra 2 pixels width. :))
- --David Göthberg 05:52, 9 November 2005 (UTC)
- Thanks! Blame Mathematica for the axis scaling. :-) I wish we could specify resampling options, like "reduce to k colors" or "crop out this region" or "convert to grayscale". I also put all my .png uploads through pngcrush with all options on, it often reduces by 20-40%. Deco 06:53, 9 November 2005 (UTC)
- Yeah, a "reduce to k colours" setting would be great. Guess someone has to join the Wikimedia coders and fix it... I use IrfanView for my png optimisation. That is, I usually reduce to the lowest number of colours that looks good, usually 16, 64 or 256 colours and then just save as png. I have compared that to pngcrush and several other such optimisers and it gives the exact same optimisation (for the same number of colours). Those tools often even state that they can not optimise the pictures further. And oddly enough IrfanView saves much faster then those optimisers optimise.
- --David Göthberg 08:13, 9 November 2005 (UTC)
- Make sure you're using the -brute option with pngcrush. I haven't used IrfanView, but if it does some optimization on saving that's great - maybe the Wikipedia scaling could do something similar. Deco 08:31, 9 November 2005 (UTC)
Hi there, I haven't read the recent changes very throughly, but I have a few comments.
- "An additional disadvantage is that traversing a linked list has poor cache performance." I think it's chaining itself that has poor cache performance, even with zero conflicts. It's not a matter of traversing lists, it's that records (even if corresponding to different slots/hashes) are not stored one next to the other.
- "Dynamic arrays can also be used to decrease space overhead and improve cache performance when records are small." A very limited improvement for low densities, see previous point.
- About the graph, I agree that the y-axis should start from zero, not 1. In fact, even better, since it does not show experimental results, but only theoretical speculations, I would only show a qualitative graph to be conservative (i.e. no numbers on y-axis). Otherwise one may be inclined to think e.g. that with chaining the average number of cache misses is always below 3. Moreover, if the open-addressing table is small and fits entirely in cache (which I repeat is not necessarily a CPU cache), it always performs better than chaining.
- "Open addressing is easier to serialize" - are we sure?
In general I have the feeling that we are being carried away and we are doing some "in-flight" research, which is discouraged in WP. It would be nice to support claims wrt performance with either hard data or reference to publications. PizzaMargherita 13:44, 12 November 2005 (UTC)
Space issues Chained versus open addressed
Do we have a reference to the claim that:
"If they are filled with enough elements, about three to four times the table size, and they are also storing large records, they can occupy less space than open addressing, discussed in the next section. The disadvantage to this is that three to four cache misses will be incurred per access on average."
Consider a record that is 1000 bytes. The hash array consists of just pointers and can be made perfectly big enough to not incur three or four collisions per access (where does this come from please???). By way of contrast an open addressed-in place table has 20% overhead or more; (20% of >1000 = >200 bytes average). There's no way on earth that the overhead of a chained hash table can be that high.
So far as I am aware open addressed hash tables are more often in-place storage. I suppose you could have an open addressed hash table that references the records rather than contains the records, but then you're at best no better off than with chained lists in terms of memory and in terms of performance you'd get the fabled cache misses just the same, and the minimum size of the hash array is larger than chained.
So where the heck did three of four times come from???? I don't understand it and I don't believe it, so I will certainly require a citation for that.WolfKeeper 18:38, 8 November 2005 (UTC)
I'm sorry if that was unclear. This is related to the space waste associated with unused table slots that you cite below. If a chaining hash table is filled to a load factor of 100%, about 35% of the slots will remain unused - try it for yourself if you don't believe me. If the elements are large and only references are stored in the table, this doesn't matter so much. If the elements are small, it amounts to a considerable amount of wasted space comparable to the margin required for open addressing. In order to ensure that most (95%) of the slots are used, we have to fill the table to a load factor of 270%, at which point nearly 3 cache misses per access are required on average (okay, more like 2 to 3 times).Oh, and as for a citation - I'm afraid I don't have one but it is fairly straightforward to show either by experiment or simple probability. Feel free to exclude this if you're not satisfied.Deco 22:10, 8 November 2005 (UTC)- The above is sort of true but imprecise - please see my argument below. Deco 01:52, 9 November 2005 (UTC)
Yes WolfKeeper, I agree that paragraph is wrong. We have some open addressing advocates among us that seems to have a bad understanding of chaining but they don't try to be rude, they just lack some understanding, so take it easy. They (for instance Deco) seem to be thinking of something similar to "coalesced hashing" but a simpler variant, where the first record in the chaining lists are stored in the array slots, instead of just having pointers to records in the array slots. They don't seem to understand that the article text (and now also the chaining picture I made) describes the more common case when the hash table array only stores pointers to records.
So let us do the maths and compare two examples. Assume we have 1000 records (key-value pairs).
1: Open addressing with record storage within the array (as in the open addressing picture in the article): We should only fill the array to tops 80% otherwise we get bad performance. So then we need an array of 1250 slots. (1000 out of 1250 = 80%.) We have no pointer overhead etc, so that gives an overhead of exactly 25% (those 250 unused slots).
2: Chaining with the array slots only storing pointers to linked lists of records (as in the chaining picture in the article): Chaining performs well already when the array has about as many slots as we have records. But for the sake of comparison let us assume we have 1000 records and 1250 array slots (which from the point of view of chaining means even better performance). I assume pointers cost 4 bytes of space (32 bit addressing). So we have 4 bytes of pointer overhead for each array slot and 4 bytes of pointer overhead for each record = 4*1250 + 4*1000 = 9000 bytes overhead in total.
This means at 36 bytes data per record the two examples above cost exactly the same space. Chaining: 36 byte per record * 1000 records + 9000 bytes overhead = Open addressing: 36 byte per array slot * 1250 slots = 45000 bytes in total.
If the records are bigger chaining is more memory efficient, if the records are smaller open addressing wins. 36 byte records and 1000 records are not unusual values. (Actually, number of records doesn't matter in this calculation.) So to me the two alternatives seem about equal in memory consumption.
3: Let us make the chaining case even worse, let us assume that the memory handler costs an additional 10 bytes overhead for each record we allocate. (4-10 bytes is normal.) This means 4*1250 + (10+4)*1000 = 19000 bytes overhead in total.
This means at 76 bytes data per record open addressing and our worst case chaining example cost exactly the same space. Chaining: 76 byte per record * 1000 records + 19000 bytes overhead = Open addressing: 76 byte per array slot * 1250 slots = 95000 bytes in total.
If the records are bigger chaining is more memory efficient, if the records are smaller open addressing wins. I still say the two alternatives seem about equal in memory consumption.
Now, let us look at level 1 CPU cache misses. Basically I am against calculating this since this is the kind of thing one should implement and time to really know which is the fastest implementation. But anyway, here goes:
Lets start with lookups only, not inserts: Let us assume that we use a fairly sparse hash table that is filled to say less then 80%. (As we showed above that is about equally memory efficient for both open addressing and for chaining.) This means most records do not collide. One access into the array costs one "random" memory access which means one cache miss. In the open addressing case we are then done, we have our record since we read it out of the array slot. In the chaining case instead we now only have a pointer to the record so we need to do a second "random" memory access to get the record it self, that is a second cache miss. So yes, chaining causes more cache misses. But not several times more, just about one more on average.
So my conclusions are:
The following statement is wrong: "If they are filled with enough elements, about three to four times the table size, and they are also storing large records, they can occupy less space than open addressing,"
The following statement is only true if you actually fill the table to 2-3 times the array size: "The disadvantage to this is that three to four cache misses will be incurred per access on average." But we don't need to do that, having a bigger array filled only to 50% is more normal in chaining and can still be more or less as memory efficient as open addressing and causes very few collisions. Thus chaining costs only about one more cache miss per lookup then open addressing.
But what is true is that for really small records open addressing seems more memory efficient. Provided we know in advance how many records we will have so we can allocate the right array size, or provided that we reallocate the array each time it gets to full (in the open addressing case). But as we all know, when an open addressing hash table gets too full it gets into much more trouble then a chaining one.
I will not go into arguing about how much inserts costs since that depends on way too many factors. Trying to figure that out by hand calculation is too error prone. That is the kind of thing one has to test and compare for actual implementations.
So, now to my conclusion what that paragraph really should say:
- If the hash table stores large records chaining uses less memory then open addressing. (Large here means more then 40-80 bytes per record or so.)
- If the hash table is sparse (that is, it has a big array with many free array slots and few collisions) chaining uses less memory then open addressing even for very small records. (If the hash table array only stores pointers to records.)
- However, chaining usually causes slightly more memory accesses and thus more cache misses in the CPU memory cache which probably makes chaining slower then open addressing.
--David Göthberg 23:18, 8 November 2005 (UTC)
- Actually I'm not an advocate of open addressing - I don't even like it and generally recommend against its use, primarily for reasons of complexity. I actually didn't realise that my argument relied on storing first elements of chains in the table, but I see now that it does. Thanks for pointing that out. I agree with your revisions for the most part, but I do have an argument that seems to show that open addressing is more space efficient than chaining tables unless the chaining table has load factor 2.5 or more:
- Suppose we have a hash table of size n with k-word values. Suppose the table contains m elements. Then chaining with external storage uses (k+1)m+n words, k for each element, one for each table slot, and one for each element's next pointer. Cache misses per access is 1 + m/n on average. To put m elements in an open addressing table and have it perform well, it must have size at least 1.4m, so depending on the storage method:
- If the elements are stored internally, we need 1.4km words of storage, k words for each of the 1.4m table entries.
- If the elements are stored externally, we need km+1.4m words, k for each element and 1.4m for the table.
- Now, 1.4km > (k+1)m+n (chaining wins over internal OA) whenever k > 2 and m/n > 5/(2k-5). Here is a table of the smallest load factors for which chaining wins for various k:
k 1 2 3 4 5 6 7 8 9 Minimum load factor for which chaining wins ∞ ∞ 5.00 1.67 1.00 0.71 0.56 0.45 0.38
- Similarly, km+1.4m > (k+1)m+n (chaining wins over external OA) whenever m/n > 5/2. So, open addressing with external storage wins whenever the chaining table load factor is < 2.5, independent of k, and chaining wins for all greater load factors (although it's somewhat slower at this load).
- Finally km+1.4m > 1.4km (internal OA wins over external OA) when k ≤ 3, independent of m.
- My conclusion: For k ≤ 3, internal OA beats both external OA and chaining for all reasonable parameter values. For k > 3, external OA beats chaining for a chaining load factor < 2.5, and chaining beats external OA for a load factor > 2.5 (note that this is the chaining table's load factor here - the OA table is fixed at 0.7). Since external OA with load 0.7 is faster than chaining with load 2.5, this seems like the way to go. I think this is correct, but please double check me. Deco 02:07, 9 November 2005 (UTC)
- Ehm, first of all I wish to apologise to you Deco, it was not my intention to offend any one, just to try and make things more clear. (Yes, I read your first edit before you reedited it.) And now you bring up a new but very interesting case: Open addressing with external storage. My first gut feeling is that that case seems very efficient in most cases yes. But I have to think some time about this before I can comment more on it. --David Göthberg 02:37, 9 November 2005 (UTC)
- That's okay, I overreacted. This argument was a bit trickier than I thought, but the above seems to confirm my initial suspicions that OA with external wins except when the chaining table has high load factor (if I didn't screw up again). I hope making my assumptions explicit and my derivation precise helps. I also didn't mention the interesting case of storing two elements in each linked list node, which can help cut space and cache misses both. Please ask if there's any point I didn't explain well enough. Deco 03:05, 9 November 2005 (UTC)
- Ehm, first of all I wish to apologise to you Deco, it was not my intention to offend any one, just to try and make things more clear. (Yes, I read your first edit before you reedited it.) And now you bring up a new but very interesting case: Open addressing with external storage. My first gut feeling is that that case seems very efficient in most cases yes. But I have to think some time about this before I can comment more on it. --David Göthberg 02:37, 9 November 2005 (UTC)
- Ok, basically your maths is right but your conclusions might be a bit confused but about right. Just two minor corrections:
- "Now, 1.4km > (k+1)m+n (chaining wins over internal OA) whenever k > 2.5 and" (not k > 2)
- "Finally km+1.4m > 1.4km (internal OA wins over external OA) when k < 3.5" (not k ≤ 3)
- He, I guess this means your original statement is more or less correct: "If they are filled with enough elements, about three to four times the table size, and they are also storing large records, they can occupy less space than open addressing, discussed in the next section. The disadvantage to this is that three to four cache misses will be incurred per access on average." But if we look at your table above (which is correct) we see that chaining wins much earlier then that. Already at k = 5 and load factor 1 chaining and internal OA is equal. Say 4 bytes per pointer, then k=5 means 5*4 = 20 bytes/record. That is already fairly small records and just load factor 1. With slightly bigger records that table shows we can go to even smaller load factors. Which means for already fairly small records we can afford a big array (a small load factor) which means most of the time there will not be hash collisions and thus no traversing of linked lists.
- Another interesting case which you did not look at is when the hash table is very sparse. Like when we have a fixed size hash table that is big enough for some worst case (so we never have to resize it) but we most of the time only have a very low load factor. Then interesting things happen. For hash tables with only pointers in the array (chaining and external OA) having a very sparse table is not so costly. Then chaining and external OA uses much less memory then internal OA for many reasonable values.
- --David Göthberg 05:33, 9 November 2005 (UTC)
- For the inequalities you mention I was assuming that k is an integer. Come to think of it I guess it need not be. Good catch.
- My original statement makes a bit more sense with reference to external OA, which wins for all k as long as the chaining load factor isn't ≥ 2.5. Internal OA loses to everybody for even medium-sized records, as you note. I made a little graph (just for talk page right now) showing space per table slot as a function of k (chaining load factor 1, OA load factor 0.7):
- You can see that external OA wins for k>3.5, but the difference isn't terribly dramatic. The graph for load factor 4 is similar, but with the parallel lines reversed and closer together.
- You're right about the case of a sparse table - in this case the space occupied by the table itself dominates external space, so external storage wins big for k>1. Deco 06:39, 9 November 2005 (UTC)
- Oh, nice graph! One thing we should keep in mind is that our estimates are simplified since in reality for chaining and external OA the memory handler will cause an extra 4-12 bytes of overhead per allocated record.
- --David Göthberg 07:56, 9 November 2005 (UTC)
- Thanks! Allocation overhead can be made negligible through pooling. If we did include it, the effect would be just to shift the two parallel lines up. One case I haven't considered is dynamic-sized elements, like strings - this seems to mandate external storage. Deco 08:09, 9 November 2005 (UTC)
Hey, I tried to incorporate some of these ideas into the article, but nothing too fancy, just some vague generalities. I think it's better now, and I hope Wolf isn't offended. I was also thinking of adding the following paragraph somewhere:
- Some chaining implementations use an optimization where the first record of each chain is stored in the table. Although this does increase performance, it is generally not recommended: chaining tables with reasonable load factors contain a large proportion of empty slots, and the larger slot size causes them to waste large amounts of space.
Thought I'd post it here for comments first. Also, where should it go? I believe it's accurate. I'd also like to mention somewhere that chaining is more space-efficient when combined with pooling. Thanks. Deco 00:06, 11 November 2005 (UTC)
- I have seen what you and the others have added/changed lately, nice stuff. And yes, I think you should add the text you suggest. I have no suggestion where, I leave that to you guys to decide. I am already busy thinking about pictures I am going to do for other articles. Pictures for Public-key cryptography is on the top of my todo list now. (Apart from putting the final touch to my slides for my upcoming talk at the Berlin congress in December about p2p algorithms!) --David Göthberg 04:45, 12 November 2005 (UTC)
Doesn't open addressing have some problems with deletion operation? Just marking a cell as empty slows down read operations. I have not yet read any satisfactory solutions yet. --Scrambler 06:16, 1 January 2006 (UTC)
- Mainly its awkward complexity. I don't think I've seen an analysis of the performance of deletion. Deco 19:50, 1 January 2006 (UTC)
Ummm...
"** They utilize more of the table slots. Even with elements distributed uniformly at random, a chaining table must contain several times more elements than slots before most of the slots are used, and it does not perform well at this load. At more typical table densities, such as 1, about 37% of the table slots remain unused."
I've read this several times. It makes no sense at all. The more times I read it, the less sense it makes.
Utilisation of slots is not a useful metric in this sense- who in their right mind cares? Speed or space or complexity or cost are useful metrics.
Uhh.. guys... slow down...WolfKeeper 19:28, 8 November 2005 (UTC)
For small elements, unused table slots have a direct impact on space usage, since the space occupied by these slots is wasted. This penalty is roughly the same as that open addressing requires to maintain a small load factor. This was the point I was attempting to make. I'm sorry if I was unclear.Deco 22:16, 8 November 2005 (UTC)- The above is sort of true but not precise enough. See my argument above. Deco 03:22, 9 November 2005 (UTC)
NPOV- it's not just a good idea it's the law!!
Hey, has anyone besides me actually read NPOV?
- I am striving to be accurate and have made numerous contributions regarding both the disadvantages and advantages of both collision resolution mechanisms. You can hardly accuse me of being pro-open addressing with the work I put into that diagram, which is quite disparaging to open addressing. If anyone has a bias here, it isn't me. Deco 22:14, 8 November 2005 (UTC)
Buckets
I've never been good with the terms, the words used, for certain things, since I'm not not natively speaking English. In the section Probabilistic Hashing the word "bucket" is used, but never referenced elsewhere and without clarification to what it is. I've seen the word several times before, but I'm uncertain what it strictly refers to (the keypair? the entire "item" in the table?). Could someone who knows the exact terms clarify this in the article please? /Gustaf
I should add that I found it annoying since the second sentence in the article on Distributed hash table is the following: "Each node is analogous to a bucket in a Hash table". This description makes no sense without a definition of a bucket in this article. /Gustaf
- Thanks for your comments, Gustaf. This was an error on our part, partly an error of inconsistency and partly a failing to define a commonly used alternate term. I've attempted to correct both of the articles you mentioned. A "bucket" is used simply to refer to a slot in the hash table's array. Deco 05:48, 8 December 2005 (UTC)
- Oops, someone beat me to it. Good job to them. Deco 05:52, 8 December 2005 (UTC)
- "In other words, keys in the same bucket hash to the same location in the hash table's array, and a collision is the result of two keys being inserted into the same bucket." Is this always true? Are they always inserted into the same bucket? Or am I getting it right if I say; In open addressing, a bucket always contain one or zero objects, while in chaining, a bucket may contain several objects with identical hashed indices.
Does this infact mean that every item in the array is a bucket, and buckets contain items on the following form: {[key,value], [key,value], ...}, or in the case of open addressing only contain one [key,value]-pair but then pointers to items that share the index. Because a quick thought and you realize that if Alice and Bob share the hashed index, and Ceasar has the next index, and you in open addressing insert: Alice, Ceasar, Bob, and then remove Alice, there's no way you know that Bob is there, since his "real" index is now emptied... This is not entirely explained ;)
"Each element of the array can contain one key–value pair, or record."
- I prefer to think of a bucket as a conceptual thing. Its not how hash tables are actually implemented. So one can think of a bucket as the set of keys that all have the same hash value. They can be implemented in multidute of different ways. In fact, a language that supports variable length index entries, such as Euphoria, can have one table entry per bucket and thus many key-values in an array element. DerekP 05:29, 13 December 2005 (UTC)
- I'm sorry if my edits confused the point. In the case of open addressing, the items with the same hash value are still in the same bucket, even though they are stored in different places in the array. In other words, in an open-addressed hash table, items from different buckets can actually conflict; this is why clustering occurs in open-addressed hash tables but not chained hash tables. I'll try somehow to clarify this. Deco 08:17, 13 December 2005 (UTC)
I think this is more confusing than it needs to be. In my experience, a bucket is exactly one slot in the array. Although I can understand that it might be useful to think about the set of keys with common hash value as a sort of "abstract bucket", defining a bucket that way would I think be a departure from the norm -- unless someone can cite literature in which it is used that way. To take an example of what I think is the common usage, here's a bit from a Microsoft developer reference page:
- When an element is added to the Hashtable, the element is placed into a bucket based on the hash code of the key. Subsequent lookups of the key use the hash code of the key to search in only one particular bucket, thus substantially reducing the number of key comparisons required to find an element.
That is, "bucket" is used as a physical container in which you place entries, not an abstract set of entries sharing some property.
- My understanding of the Microsoft document is that the term 'bucket' refers to a container of zero or more elements - "the element is placed into a bucket", and that a bucket is identified (or accessed) by the key's hash - "based on the hash code of the key". This is further hinted at by the phrase - "search in only one particular bucket", because why would one have to search in a bucket for the required element if the bucket could only hold a single element? I believe the Microsoft document is describing a conceptual, or abstract, construct and it is not suggesting that a bucket and an array-slot are synonymous.
- The next paragraph in that Microsoft document goes on to say ...
- The load factor of a Hashtable determines the maximum ratio of elements to buckets. Smaller load factors cause faster average lookup times at the cost of increased memory consumption. The default load factor of 1.0 generally provides the best balance between speed and size. A different load factor can also be specified when the Hashtable is created.
- Which states quite clearly that it is possible for a bucket to hold multiple elements. Can an array-slot hold multiple elements? Not in most languages.
- However, I do agree that many people use bucket as if they were talking about array slots, but I think they both deserve they own separate definition. DerekP 03:15, 16 December 2005 (UTC)
I would remove the two sentences that currently describe the term "bucket", and just define bucket wherever a slot/position/entry in the array is first mentioned, by inserting "sometimes called a bucket" at the appropriate place. Nethgirb 23:31, 15 December 2005 (UTC)
- Note that in a chaining hash array, the concepts are equivalent, so any context assuming such an implementation would necessarily equate them. Nevertheless, I don't have any supporting evidence myself. Deco 01:55, 16 December 2005 (UTC)
Response to DerekP: Yes, a bucket can hold multiple elements, and so can an array slot, both via chaining as Deco has just said. The programming language isn't an issue. For example NIST's definition of hash table says The table may be an array of buckets, to handle some numbers of collisions easily; that is, one slot in the array is a bucket.
So the only question is whether bucket = slot if you're not using chaining. In that case a slot only has one element and so a slot is not the same as the concept "the place(s) you put the elements with the same hash value". But I would not use "bucket" as the name for that concept. A bucket brings to mind a data structure holding zero or more elements, not a scattered set of elements sharing some property. Consider NIST's definition of bucket: an area of storage where items with a common property are stored. The point here is that it is an area of storage -- a single location or data structure. Thus, I think for the non-chained case, there are two reasonable options: (1) equate "bucket" with "slot"; (2) simply don't use the term bucket in the non-chaining context. Nethgirb 08:37, 16 December 2005 (UTC)
- I don't really want to argue about this, so let's just say that we have a difference of opinion about the topic and go about our separate ways. Thanks for sharing anyway. DerekP 21:02, 16 December 2005 (UTC)
- Let's just equate bucket with array slot, and avoid mentioning the term again in the article. (i.e., keep current version) This seems mostly consistent with other uses of the term I've seen. Deco 21:30, 16 December 2005 (UTC)
- Sounds good... however the current version doesn't equate bucket with array slot ("Note that a bucket is not precisely the same thing as an array slot"). Nethgirb 01:01, 17 December 2005 (UTC)