Talk:Hash table/Archive 2
This is an archive of past discussions about Hash table. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page. |
Archive 1 | Archive 2 | Archive 3 |
Pseudo-Code
I have a few small issues with the current Hash table article. First, the pseudo-code is a little convoluted—in that the findSlot(..) function is not 100% obvious. It hides away the linear probe code, which makes reading the set(..) and lookup(..) functions a little confusing on first read. I was just going over it with an intro computer science student and even had to read it twice myself. Josh 08:11, 10 April 2006 (UTC)
- OK, I tweaked find_slot() to make it more obvious (to me) that it is doing a linear probe.
- Or did I just make it more confusing?
- Please feel free to improve it further.
Something i don't understand from the article
If when adding to a table the hashing function leads to a collision for 2 particular keys, a and b, then using probing b will be stored somewhere after a. When then looking up the value associated with b, won't the table actually return the value associated with a. How does it know which of the two values to return, and how to get to the index associated with b?
I hope that makes sense. Iae 11:22, 8 June 2006 (UTC)
- Providing we're not using perfect hashing, we must search through all the places the desired element could be, comparing our search key with each one, until we hit the end of the list. This is why hash tables require not only a hash function but a means of comparing for equality. This only works efficiently because in a hash table with enough room, these lists are overwhelmingly very short. Deco 11:27, 8 June 2006 (UTC)
- Ah I see, thanks very much. Iae 11:49, 8 June 2006 (UTC)
how do you delete from a hash table that uses probing?
- How do you know that you have hit the end of the list. Or to cut to the gist of the matter; How do you delete from a hash table that uses probing.
You know you hit the end of the list in a (probed) hash table when you hit a empty "not occupied" slot. In theory, one could have a "occupied" bit for each row of the hash table that is initially 0. (In practice, typically each slot that is not occupied begins with a NULL byte. If it *is* occupied, that byte is the first byte of the key).
I know of 2 very different ways to delete from a hash table: the "deleted bit" method", and the "move stuff around" method. (My understanding is that the "move stuff around" method is impossible to implement with "quadratic probing" or "double hashing" hash tables (and probably a few other types). Those hash tables are forced to use the "deleted bit" method.)
Let me try to explain those 2 methods
"deleted bit" method: works with any kind of hash table. Every row of the hash table (in addition to the key, value pairs) has a "deleted" bit that starts out 0. To delete a record from the hash table, use the key to find it (using find_slot), then set the "deleted" bit to 1. (Some hash tables cram the "deleted" bit and the "occupied" bit into the "key" part of the record, reserving 1 special key to indicate "unoccupied", another special key to indicate "deleted", and any other value to indicate a real occupied "key" ).
"move stuff around" method: only works with linear probing.
The function remove(key) in the article is supposed to describe the "move stuff around" method. How could we make it easier to understand?
Often when the application deletes a record, the following slot is already not occupied. In that case you can wipe out the record by marking that record as not occupied -- overwriting the key with a NULL byte -- and you are done.
Unfortunately, there are a bunch of special cases the application needs to be able to handle, even though they rarely happen. As the article says, "For all records in a cluster, there must be no vacant slots between their natural hash position and their current position (else lookups will terminate before finding the record)." The application needs to scan through *all* the records between the record you want to delete, and the next "not occupied" slot, and make sure the above requirement is met. In some cases, you must move all those records up 1 slot. In other cases, you must move some (but not all) of those records.
In yet other cases, you must not move any of them, just mark the deleted slot "not occupied" just like the simple case. (For example, if you want to delete the record in slot 870, and you see that the "key" in slot 871 actually hashes to "871", and slot 872 is "not occupied" -- you must mark slot 870 "not occupied", and leave the record in slot 871 alone).
Once you understand how it works, please update the article to make it easier to understand for the next person.
--68.0.120.35 07:42, 5 March 2007 (UTC)
joaat_hash function error
Hi, I've implemented the joaat_hash function which is described in pseudocode on this page in my C Program, and encountered an error. Better said I produced one, since I misunderstood len as the len of the hashtable, and not as len of the key.
here is my (correct) function implemented in C, please consider updating the article:
int joaat_hash(char *key, size_t len) //len is the size of the hashtable { unsigned int hash = 0; unsigned int i; for (i = 0; i < strlen(key); i++) /* [...] as in the article */ return (hash % len); }
--88.76.141.17 03:43, 2 January 2007 (UTC)
You are right -- the "len" in the pseudocode in the article is the length of the key. (That pseudocode gives the same results as the original C implementation "One-at-a-Time Hash" ub4 one_at_a_time(char *key, ub4 len) http://www.burtleburtle.net/bob/hash/doobs.html , right?)
I think the above implementation gives the same results as the original C implementation for all possible ASCII strings. Unfortunately, the above implementation gives different results for other sorts of data structures cast into byte arrays, when those data structures include a zero byte.
Since the original version gives *better* results for those data structures, and the *same* results for everything else, I think I'll stick with the original version, except use "key_len" to clarify exactly of what it is the length. (Or have I missed something obvious?) --68.0.120.35 07:42, 5 March 2007 (UTC)
Bug in 'remove' Pseudocode
The remove function will loop indefinitely if the hash table is full, since it only exits when it finds an unoccupied slot. I think adding the following after the j := (j+1) line should fix it:
if j = i exit loop
32.97.110.142 20:36, 11 April 2007 (UTC)
Mistake in table resizing section ?
The article states:
To see why this is true, suppose a hash table using chaining begins at the minimum size of 1 and is doubled each time it fills above 100%. If in the end it contains n elements, then the total add operations performed for all the resizings is:
1 + 2 + 4 + ... + n = 2n - 1.
This does not seem right. The expected number of steps it takes to double the table size from (1) to (n>1) at each overfill should be: truncate(LOG2(n-1))+1 (whare LOG2 means logarithm base 2 of n). Also, if my math is correct 1 + 2 + 4 + ... + n = n(n+1)/2, but this does not seem the right computation to make: we're not supposed to add the values corresponding to each step but to count them.
If this reasoning is correct the total cost should read O(LN(n)) rather than O(n), which should mean this method scales very well with n.
- I can't understand your reasoning at all. First of all 1 + 2 + 4 + ... + n is in fact 2n - 1, and not n(n+1)/2 (this is 1 + 2 + 3 + ... + n). Second, this is summing the add operations performed during the resizings. During each resizing, all elements currently in the table have to be added again to the new table, and each resizing occurs at a power of 2 number of elements, so the numbers being added are correct. I really don't understand what you're saying. Dcoetzee 10:29, 9 June 2007 (UTC)
Ambiguity in amortized analysis
In the discussion of the costs of resizing the hash, there seems to be an ambiguity in the amortized running time analysis:
... If in the end it contains n elements, then the total add operations performed for all the resizings is:
1 + 2 + 4 + ... + n = 2n - 1. Because the costs of the resizings form a geometric series, the total cost is O(n)....
The n in "2n - 1" is actually the value of the nth element, not the number of elements. The sentence that follows makes it seem that the sum of a geometric series is linear, which is clearly not the case.
Guray9000 00:47, 16 April 2007 (UTC)
- Actually, it is the number of elements. I'm not sure what makes you think it's the value of the nth element - it's not. And the sum of a geometric sequence is linear in n, if n is the last term of the sequence. Dcoetzee 11:34, 9 June 2007 (UTC)
How this can be true (question about statement in section 'Time complexity and common uses of hash tables')?
In that section it is written: "Like arrays, hash tables provide constant-time O(1) lookup on average, regardless of the number of items in the table." (*)
Searching in sorted list of keys by using binary search can be done in O(log N). Average running time of binary search is also log N (this is according to http://planetmath.org/encyclopedia/BinarySearch.html). I don't know any better algorithm for non special sorted data. So I think statement (*) isn't true in asymptotic sense.
However if we know that N is bounded by some number we can say that O(log N) = O(1). But this would be wrong... But in some philosophical (or maybe practical) sense we could be right (we have large but fixed maximum amount of RAM, HDD space, maximum amount of hash table entries etc). But this would be not right in asymptotic sense where N not bounded but free.
What I don't understand?
- Really, in a fairly real sense, hash tables don't search, they go straight to the right place in the table.WolfKeeper 22:56, 28 June 2007 (UTC)
- At a maximum table utilisation factor (say 80%), the Hash table will do on average a constant number of comparisons (k= ~2 at 80%) which is still an O(1) operation, as k is independent of n, the size of the table. In other words, at 80% full, there's going to be only about 2 colliding entries. Hope this helps.WolfKeeper 22:56, 28 June 2007 (UTC)
- Thanks, WolfKeeper and sorry for my ignorance. Your answer made me do more searching and I understood that, I was confusing hash_map with map (in C++ terminology). map permits lookup, insertion and removal in logarithmic time on average. And hash_map permits these operations in constant time on average. 78.56.1.150 12:37, 29 June 2007 (UTC)
Implementations
This section is quickly devolving into yet another trivia list. MegaHasher 19:45, 11 July 2007 (UTC)
Citations
This article could be improved with more direct citations. MegaHasher 06:11, 14 August 2007 (UTC)
More concrete suggestions for hash function?
I wanted to check in here before making huge changes to the article, but one thing I'd find very helpful is a discussion of concrete choices for the hash function itself. Here's an outline of what I'd say:
It's very common to implement hash tables with poor hashing functions. Knuth is largely to blame, advocating the very weak "multiplicative hash" function, and even going so far as to claim that its clustering property is a good thing! (section 6.4, TAoCP vol III, 2e, p. 517). Variants of the multiplicative hash remain popular, as do other linear techniques such as CRC.
Surveying the field, two excellent newer alternatives stand out. For most simple hashing tasks, the Fowler Noll Vo Hash is an excellent performer. It is among the simplest of known hash functions, is quite fast, and has a good distribution. For use in a hash table, the FNV-1a variant is likely the best choice, as it has better (more dispersed) clustering behavior than FNV-1.
For some applications, particularly when keys are long, the newer Jenkins lookup3.c hash function may be a better performer. It achieves better speed by consuming 12 bytes of the input per iteration, as opposed to one byte for FNV. Disadvantages include greater code complexity, a sensitivity to machine endianness (causing potential difficulties when reusing hash values across disparate computers), and the need to pad byte-aligned input to 32 bit words. Do keep in mind, though, that benchmarks showing impressive speed for large blocks of data may not translate to real-world gains. Common usage patterns involve relatively short keys, so the amount of time spent in the hashing function inner-loop may be less relevant than, say, the gains from a compiler being able to automatically inline a simpler hash function.--Raph Levien 02:32, 5 July 2006 (UTC)
Some interesting links I came across while searching:
- Hash functions by Paul Hsieh
- The Art of Hashing
- Hashtables, Part 2 by Maciej Stachowiak
- Hash Functions at Pluto Scarab - with nice graphs
You might also want to check out HSH 11/13 which seems to distribute quite well (as per published graph) and also performs nicely with its handful of code lines.
Followup:
I went ahead and completely rewrote the section. I may have come close to bending the letter of the law on doing original research (measuring avalanche behavior of Jenkins One-at-a-time hash using Bret Mulvey's tools) and NPOV (I advocate particular hash functions and trash criticize other popular choices), but everything I said is verifiable, and I think the result is the most helpful and reliable advice on choosing hash functions anywhere on the Internet or in dead tree form.
I measured the HSH 11/13 hash using Mulvey's AvalancheTest, and found it slightly inferior to the Jenkins One-at-a-time hash. Herbert Glarner's tests for uniform distribution are not as sensitive as Mulvey's chi-squared tests. Further, HSH 11/13 is quite slow, because of its inner loop.
Obviously, I also changed my mind about FNV, swayed in large part by Mulvey's analysis. It's not a bad choice, being one of the simplest of the "pretty good" choices, but even there the XOR-folding adds a bit more complexity. In any case, I feel like I've provided enough information for people to make their own informed choice.--Raph Levien 21:31, 12 August 2006 (UTC)
- I agree that this analysis of various hash functions are worth putting into Wikipedia.
- But wouldn't the hash function article be a better place?
- Or are there special considerations for hash functions used in a hash table that don't apply to hash functions used for other purposes?
- --68.0.120.35 07:42, 5 March 2007 (UTC)
- Yes, this section evaluates hash functions solely for their use in a hash table. Functions like the Jenkins one-at-a-time are very well suited for such uses, and extremely bad for other hash applications like message integrity checking, which is the domain of cryptographic hashes. Fair enough? --Raph Levien 04:12, 9 May 2007 (UTC)
- Referring to the statement: "Further, HSH 11/13 is quite slow, because of its inner loop." - The HSH documentation states, that a "key like 'Yvonne' [...] requires 92 machine instructions to generate a hash value". - Now I am just wondering, does there exist a speed comparison of any sort in order to choose a fast one? - Regards, --Gulliveig 16:02, 14 August 2007 (UTC)
Benchmarks
Simplicity and speed are readily measured objectively
There is a caveat here. Experimental measurements of speeds are necessarily done on a "representative sample" of inputs. It may be the case that such or such algorithm performs with varying speed depending on the kind of inputs, and that some sample, representative of one kind of inputs, may not be representative of another kind. I don't think this would happen with usual hash functions on e.g. strings but this may happen in more esoteric cases. David.Monniaux 14:38, 13 September 2006 (UTC)
- Without checking further into the matter of this article specifically, the speed of an algorithm is usually (or at least often) expressed as the worst case speed, using the Big O notation. It is an objective measurement which gives the asymptotic upper bound of the execution time as a function of the input length. Of course you are correct in that one algorithm may perform well with a certain kind of input and badly with other kinds of input, but if one algorithm always works in O(n) time, it is, after certain point, always faster than an algorithm that works in O(n2) time. It's just pure mathematics to define that point, as is calculating the asymptotic upper bound too.
- I would be more concerned about the simplicity claim. "Simplicity" is not something you can measure mathematically. You can always count the instructions etc but that's both machine and implementation dependent. —ZeroOne (talk / @) 16:44, 14 August 2007 (UTC)
Load factor
I have seen two separate referenced articles that had measurements that show under separate chaining, an increase of load factor by 1 has the impact of increasing CPU instruction count around 4 to 5%. This is much better than what I would have guessed. The first reference is "Dynamic Hash Tables" by Larson (1988), and the second reference is "Main-memory linear hashing" by Pettersson (1993). Have I completely mis-read these articles? MegaHasher 21:33, 20 August 2007 (UTC)
Picture
Can somebody please add this picture to the article? I believe it will help a lot of people to understand the notion of the hash table. --yanis 12:25, 21 August 2007 (UTC)
- Yes, it is a good picture, but no we can not add it to the article since at the image page you state that you copied it from a non-free source. But as you state on the image page we can draw a license free image that is inspired by it. Then I suggest some changes:
- Make it look like the images we already have (colour match, shapes and so on).
- The output of the hash function would be more clear if it is written as [04] instead of [4].
- The hash function would be more clear if it is written as "Key modulo 100" instead of "Key % 100". Of course, a better hash function usually is to do modulo for instance 101 but that would make the image unclear.
- Perhaps use the phone number as key?
- Each record must also contain the key, since hash tables need to check if they found the right record or not. So I suggest the record should hold the key (the phone number) and some data (the persons name), just like the other images we now use.
- Since I made most of the images in the article now I might take a shot at it when I am in the mood. (But I had lots of input from others when doing them and some one else remade them as SVGs.)
- --David Göthberg 13:00, 21 August 2007 (UTC)
O(1) Removal
There is a line which states:
- The O(1) remove method above is only possible in linearly probed hash tables with single-slot stepping. In the case where many records are to be deleted in one operation, marking the slots for deletion and later rebuilding may be more efficient.
I'm not really sure that this is accurate. I might agree wth calling it expected O(1), or average O(1). But it is not, as far as I can tell, worst-case O(1). In fact, remove is worst-case O(n), as in all hash operations.Josh 08:11, 10 April 2006 (UTC)
- In my experience the O() notation usually refers to expected runtime, unless otherwise stated, but feel free to disambiguate it.WolfKeeper 14:07, 10 April 2006 (UTC)
I know that there has been some discussion of amortized runtime here, but I'm not really sure that you can use amortized analysis on this function. Please correct me if I am wrong, otherwise, I will change shortly. Further, I am not sure why the article says only possible in linearly probed hash tables... If the remove function is O(1) in linear probe, then why not with a quadratic probe? Josh 08:11, 10 April 2006 (UTC)
- You need the delete to remove spaces. the quadratic probe moves a different amount through the hash table, depending on the initial number from the hash function, rather than the slot that the initial collision occured at. So two hash entries may have collided initially at say, slot 10; and one of them got put into slot 20. If slot 10 is deleted, with quadratic probing there's no way to compact down the slot 20 into slot 10, because there's no way to find it- the hash of the object at 10 doesn't tell you how to find any of the successors. With linear probing you just need to look in the next entry and then decide whether to move it or not.WolfKeeper 14:07, 10 April 2006 (UTC)
- I think that this discussion reveals a real need to disambiguate expected vs. worst-case runtime. I think that we'll both agree that remove(..) in a hash table requires worst-case O(n) steps, even with a linear probe. If you don't then we need to first discuss that..
- If you don't see why the worst-case isn't particularly relevant, then I don't want to discuss this further.WolfKeeper 16:31, 10 April 2006 (UTC)
- I found a very good explanation in CLRS03, Chapter 11. Basically, the text proves that, on average, probes are needed in an unsuccessful search, where is the load factor. Using this formula, we can see that even at , the number of probes required is 10. I would still argue that worst case is relevant, but this clearly does not apply until the hash table is very full. The magic number of 80% is explained in the article, but perhaps this would be better understood with a graph illustrating how the performance changes with load? --Josh 01:24, 11 April 2006 (UTC)
- Now, lets go back to what you're saying. Yes, you are correct, that in a linear probe, you only need to look at the next slot, although I would nitpick and say that you need to look at the next n slots if the hash table is full. However, in a quadratic probe, instead of saying , you say . We assume that hash(key) was cached and does not need to be recomputed. Therefore, as long as we know i, we can clearly find the next slot in constant time. The same applies for double hashing. You just say . Since we already know hash(key), then hash(hash(key)) takes O(k), as per previous discussion.
- I'm sorry if my explanation was inadequate. I would recommend you read Knuth or a good book on it.WolfKeeper 16:31, 10 April 2006 (UTC)
- I've looked through CLRS, and I don't really see a reference for the fact that an O(1) remove can only be achieved using a linear probe. If we can do a search in any open address hash table in constant time, then shouldn't we be able to delete elements and eliminate holes in constant time as well? --Josh 01:24, 11 April 2006 (UTC)
- In summary, I think that there needs to be a lot of reworking between the term O(..) and expected O(..) and worst-case O(..). Further, I think that we might argue that remove(..) is expected O(1), but not worst-case O(1). I think that it is always worst-case O(n) with linear and quadratic probing. With double hashing, we would say that it is worst-case O(k n), assuming that k is not fixed size. Josh
- With all due respect, I completely disagree. The whole point of hash tables is to design them so that you are able to apply statistical techniques like averaging. Using worst case values for randomised parameters gives significantly inaccurate results.WolfKeeper 16:31, 10 April 2006 (UTC)
- I'm not suggesting that we re-do all of the analyses in terms of worst-case performance. I understand that doing so would give an incredibly inaccurate picture of hash table performance. I simply think that we should qualify the asymptotic notation with "expected", to acknowledge that bad cases do exist. --Josh 01:24, 11 April 2006 (UTC)
- I agree with Josh here. Unqualified big O notation specifically implies worst-case behavior in many publications. Better to either add the word "expected" where appropriate or make a big sweeping blanket statement at the beginning that it's implied. Deco 11:30, 8 June 2006 (UTC)
"When we say T(N) [is] O(f(N)), we are guaranteeing that the function T(N) grows at a rate no faster than f(N)." (Data Structures and Algorithm Analysis in C++, Third Edition, Mark Allen Weiss, page 44)
It is entirely possible to implement a hash table that makes a collision every time a new value is added. This would be a useless implementation, but a valid one. A hash table can't then be O(1) ever in any way if the above possibility exists.
"Hash tables can be used to implement the insert and contains operations in constant average time...The worst case for hashing generally results from an implementation error." (page 207)
I would suggest that the article should be changed to mention that hash tables are not O(1) by the definition of Big-O notation, but that all good implementations come closer to acting like they are than binary search trees. 199.111.229.133 00:23, 16 October 2007 (UTC)
Confusing a hash table and a hash map
The intro to this article is severely misguided and confusing. A hash table is not the same thing as a "hash map" (which is a Javaism meaning a map/dictionary implemented by a hash table instead of a tree). Hash tables have nothing to do with keys and values -- a hash table only knows that elements have hashes, and that elements are at the index in its array corresponding to the value of the hash function for an element. For example, a hash table can also be used to implement an unordered set. Additionally, the concept of "buckets" is not intrinsic to a hash table. It is a common strategy for resoloving collisions but there are others (double hashing, for instance).
What's the correct way to tag a page that has inaccurate information on it?
There are some non-wikipedia ghits that have the correct definition, such as: http://www.sparknotes.com/cs/searching/hashtables/section1.html
However, it's a fairly common misconception that hashes map keys to values, since so often they are used to implement maps or dictionaries. I am fairly sure that Knuth makes this distinction, but I don't have it on hand.
--67.180.15.227 (talk) 16:49, 5 December 2007 (UTC)
- I always thought the word "table" in "hash table" stood for for "lookup table", go figure. In colloquial usage, the word "hash table" nearly always refers to a table with values, e.g. a hash map. Furthermore, given an explanation of how hash maps work, it doesn't take much imagination to realize that one can also construct a hash table that only stores keys without values. So I think your assertion "severely misguided and confusing", is severely exaggerated and it doesn't really need a tag.
- The URL you provided doesn't explicitly differentiate between "hash tables" and "hash maps" either; it seems that they only store the key in order to keep the diagrams simple.
- But naturally the definition can be changed if you can find significant sources making this distinction. Sorry, but I can't be bothered searching for them at the moment. -- intgr [talk] 18:13, 6 December 2007 (UTC)
- A hash table can be used to implement either a map concept, or a set concept. MegaHasher (talk) 21:23, 7 December 2007 (UTC)
- Hash tables can be used to implement both sets and maps, and the word "hash table" does not imply any particular interface. "Hash map" may connote that it implements a map rather than a set, which is a qualitative difference, but this is a minor point that does not need expounding upon in the introduction. Dcoetzee 21:35, 7 December 2007 (UTC)
Table vs Map
- In computer science, a hash table, or a hash map, is a data structure that associates keys with values.
Already the first sentence is misleading, if not even wrong. A hash map associates keys with values (i.e. maps) and is implemented using a hashtable, though a hashtable itself does not necessarily do any mapping or association of keys and values. Just consider a hashtable where all you do is to store integers, there are no values here. —Preceding unsigned comment added by 91.36.112.7 (talk) 14:51, 23 January 2008 (UTC)
- This comment was already posted once and was moved to #Confusing a hash table and a hash map because newer entries are supposed to go to the *end*. -- intgr [talk] 23:08, 23 January 2008 (UTC)
Additional references?
I can't access it at the moment, but this paper is probably also relevant to the robin hood hashing: Munro, J. I., and Celis, P. 1986. Techniques for collision resolution in hash tables with open addressing. In Proceedings of 1986 ACM Fall Joint Computer Conference (Dallas, Texas, United States). IEEE Computer Society Press, Los Alamitos, CA, 601-610. —Preceding unsigned comment added by 75.71.67.71 (talk) 17:02, 15 April 2008 (UTC)
Open Hashing: inaccurate description
The description of "open hashing" seems quite inaccurate.
That "data indexed by the hash is 'stored externally' to the hash table" may be *required* for this data structure, but does not define it. To store variable length data open addressing may be used in conjunction with storing references to the actual data in the hash table. Open addressing is used synonymously to closed hashing, which results in a contradiction.
The example which follows the first sentence seems at least incomprehensible to me (By the way: Examples should be separated from a general description.) Key-value pairs are added to the group/bucket according to the hash value of the key. (The hash of the key defines the group.)
"The key that identifies a bucket is now put into the hash table (in this case it is as simple as the bucket number)" seems wrong to me. Actually, a reference to the group is 'put into' the hash table. And this is not equal to the key (Hash keys are generally not stored in hash tables, the values are stored).
Separate chaining should be characterized as a form of open hashing.
Sorry for the extensive criticism. This part of the article really confused me...
--Whzlt (talk) 03:24, 18 May 2008 (UTC)
- I've removed it. It was just simply terrible, and seemed to be a duplication of the chaining section anyway, and was unreferenced.- (User) WolfKeeper (Talk) 04:11, 18 May 2008 (UTC)
load factor
Sorry to be thick, but can we define it please? Thanks. 124.101.249.63 (talk) 14:01, 17 July 2008 (UTC)
- The Load Factor page defines it as:
- Load factor (computer science), the ratio of the number of records to the number of addresses or indexes within a data structure
- --Yitscar (talk) 17:38, 20 July 2008 (UTC)
Where did all the code go?
I went reading through the article, as I do every few years since it represents my first ever contribution to Wikipedia, and noticed that the short pseudo-code for the deletion algorithm for a linearly probed hash tree had been removed. Standing in its place is a reference to the ICI implementation where the algorithm can be gleaned, but not so simply because it's handling a more complex case (set subtraction).
I just wondered why the clearer pseudo-code section was removed?
I confess to a feeling of ownership for the deletion algorithm, since that part of the entry came about after Tim Long and I discovered wikipedia. Some years earlier we'd been using a lot of hash tables, and it had bugged me that I could find no reference to an efficient deletion method. I worked out how to do it (I thought), only to discover that my method didn't always work. Tim debugged it (by adding a missing condition to the boolean expression.) We thought Wikipedia was the perfect place to put the small invention.
- The Wikipedia isn't a place to put inventions, unless they are referenced by reliable sources.- (User) WolfKeeper (Talk) 17:07, 28 January 2008 (UTC)
In fact, the pseudo-code for all the basic operations has been removed. I don't understand the logic behind that, hence this note. Lukekendall (talk) 15:08, 28 January 2008 (UTC)
- There's lots of different deletion algorithms that can be used, depending on what kind of hash table it is. It doesn't seem appropriate to include pseudocode for just one kind; instead the wikipedia's job is to talk about general principles, and refer to reliable sources such as Knuth that contain the pseudocode.- (User) WolfKeeper (Talk) 17:07, 28 January 2008 (UTC)
- That's not a practical approach for accepting software knowledge, because there are far more problems - and algorithms to solve them - than there are reputable published solutions. If there were, programming would be a mechanical process of referring to these gems and gluing them together! It isn't. (Unless you're working in the few-but-increasing extremely well-trodden problem areas.) Nor do you even need a "reliable source such as Knuth for code": if you publish the code, any programmer can try it out and verify that it works. The code itself is the most effective proof you can have: it's easier to understand than a formal proof of correctness, or acceptance-through-trust-my-reputation. It shouldn't matter whether it's an invention or not, merely that it's verifiably true.
- As for the issue of of it only applying to one kind of hash table, let's consider that statement a little more deeply:
- For any sort of chained hash table (which is not a pure hash table, and whose behaviour swings more to that of the underlying chaining data structure as the "load" increases), the deletion is effectively solved by using the deletion method for that chaining structure.
- For linearly probed hash tables, it is a solution to a problem which as far as I know is not published elsewhere.
- For quadratically probed hash tables there is no similar solution, as far as I know.
- So objecting to it on those grounds seems the same as objecting to the inclusion of the solutions to the quadratic and cubic polynomials because the formulae are special cases that don't solve all polynomial equations.
- Who removed the sections? I looked through the history and discussions but couldn't find it, and there'd be a lot of binary chopping to do to find it. Do you happen to know, off-hand?
122.106.85.3 (talk) 00:21, 29 January 2008 (UTC)
- Writing your own pseudocode and adding it to the wikipedia contravenes WP:Original research.- (User) WolfKeeper (Talk) 05:35, 29 January 2008 (UTC)
- I disagree - I think pseudocode is justified where it describes knowledge or methods that can be attributed to reliable sources. It's just a method of presentation. As long as it's not made too specific, there isn't an issue, and this is a widespread practice. Dcoetzee 22:42, 4 March 2008 (UTC)
- Dcoetzee is right: "No original research" means no original ideas. Original expression is different and required by copyright law. --Damian Yerrick (talk | stalk) 18:04, 29 December 2008 (UTC)
- Writing your own pseudocode and adding it to the wikipedia contravenes WP:Original research.- (User) WolfKeeper (Talk) 05:35, 29 January 2008 (UTC)
Info about multiplicative hashing is wrong
The page points to some page that claims multiplicative hashing has poor clustering. But that page doesn't get Knuth's multiplicative hashing right -- it falls into the trap that many people do of thinking that it works as (k*a) mod m for some a. In fact, multiplicative hashing does fine if you implement it correctly, by taking only the high bits of k*a. —Preceding unsigned comment added by AndrewCMyers (talk • contribs) 16:45, 13 January 2009 (UTC)
Sum over sequence
For the x-th time, 1+2+4+8+...+n = 2n-1, not 2^n-1; you're thinking of 1+2+3+4+...+n. Just put in values for n and you will see this, or do you need a formal proof? Please people, actually read the text you edit. Adrianwn (talk) 17:22, 26 March 2009 (UTC)
- You should add a comment explaining why it is that, I misread the termination condition myself.- (User) Wolfkeeper (Talk) 17:33, 26 March 2009 (UTC)
- Yes, that is probably a good idea; I will think of something. Adrianwn (talk) 05:34, 27 March 2009 (UTC)
- I rewrote it and tried to clearly distinguish between the ops for the resizing and the ops for adding elements to the table. Please check the formulas for mistakes. Adrianwn (talk) 08:17, 31 March 2009 (UTC)
Why are prime-sized tables bad?
The article claimed that
- Some older hashes are even worse, requiring table sizes to be a prime number rather than a power of two, again computing the bucket index as the hash value modulo the table size. In general, such a requirement is a sign of a fundamentally weak function; using a prime table size is a poor substitute for using a stronger function.
This statement sounds rather biased. What matters is the actual hash function, that maps the key to a bucket index. A "raw" hash function that gives good results when taken modulo a prime is OK as long as the table size s is a prime. One can do dynamic resizing with tables of prime size, with liltle extra cost. The above statement can be turned around to say "using tables whose size is a prime number increases the chance that the hashes will be unformly distributed, since some popular hash functions are known to perform badly when the table size is a power of two and the hash index is obtained by bit masking." So, how should we rephrase this sentence? --Jorge Stolfi (talk) 00:40, 12 April 2009 (UTC)
Pseudo-randomness is not necessary
This sentence has been removed:
- In most cases the hash function is deliberately chosen to have pseudo-random properties, so that small changes of a key give a large and apparently random (although of course reproducible) effect on the hash returned. Because of this random effect, in some cases,
Hash functions need not be pseudo-random, they need only spread the keys as evenly as possible. Moreover collisions are not due to pseudo-randomness of the function. --Jorge Stolfi (talk) 05:59, 3 April 2009 (UTC)
- Congratulations! You seem to have successfully removed every part of the description of how hash tables in general work from the article!!! Well done indeed!- (User) Wolfkeeper (Talk) 23:03, 3 April 2009 (UTC)
- Perhaps you can remove the collision resolution algorithms as well?- (User) Wolfkeeper (Talk) 23:03, 3 April 2009 (UTC)
- Is this comment still relevant, or did you just look at the article between edits? I have tried to keep all the pertinent information, deleting only some details that can be found in the articles specific to each method. Is there anything in particular that you think should remain here? --Jorge Stolfi (talk) 00:46, 12 April 2009 (UTC)
- Yes, I can only continue to congratulate you on removing how they work, and adding more on what they are used for.- (User) Wolfkeeper (Talk) 14:23, 12 April 2009 (UTC)
- Um, forgive me for being dense... Are you being ironic, or do you mean it? If the former, please be more specific. There are gazillions of hash table algorithms, methinks that is is best to give a general survey and leave the details to specific articles. Do you think that the general description of how they work is not adequate? All the best, --Jorge Stolfi (talk) 16:53, 12 April 2009 (UTC)
- Yes, I can only continue to congratulate you on removing how they work, and adding more on what they are used for.- (User) Wolfkeeper (Talk) 14:23, 12 April 2009 (UTC)
- Yes, I was being ironic. People don't come here simply to read in-depth comparisons about something and something else. They come here primarily to find out what something is. What it can be used for, and how it compares and the history is also important, but are not usually the primary reason.- (User) Wolfkeeper (Talk) 16:58, 17 April 2009 (UTC)
- Well, first, I dont't think that is quite true. But it does not matter. What matters is that the lead section and the accompanying figure already give enough information about what a hash table is and how it works --- enough to satisfy the curiosity of most people who don't know that already. Anything beyond that will have to go into gory technical details, and will be of interest only to people who implement hash tables --- which is a very small set indeed, much smaller than those who need to choose between a hash table or a balanced tree. I also cannot believe that the average reader wants to know the details of chained versus open addessing before knowing what hash tables are good for. All the best, --Jorge Stolfi (talk) 22:40, 17 April 2009 (UTC)
- With very few very clearcut exceptions (like birthdates), the lead isn't supposed to contain anything not in the body of the article.- (User) Wolfkeeper (Talk) 23:36, 17 April 2009 (UTC)
- Sorry again, I don't understand your complaint. When I last looked, the lead section had a sufficiently clear and complete explanation of how a hash table works, and that explanation was throughly expanded in the body of the article. So what, exactly, was wrong with the latter? All the best, --Jorge Stolfi (talk) 02:09, 18 April 2009 (UTC)
Is the "Basic algorithm" section needed?
The "Basic algorithm" section does not seem to add to what has aleady been said in the lead section. Considering that the "mod N" trick does not belong here, the section seems redundant. --Jorge Stolfi (talk) 03:01, 18 April 2009 (UTC)
- It's really bad style to only mention something in the lead; and in this case it's the fundamental algorithm the whole thing rests upon.- (User) Wolfkeeper (Talk) 16:14, 18 April 2009 (UTC)
"Good" hash functions
From the lead paragraph of "Choosing a good hash function":
A good hash function is essential for good hash table performance... However, since poor hashing usually degrades hash table performance by a constant factor, and hashing is often a small part of the overall computation, such problems commonly go undetected.
If the latter is true, how can the former be true? If the problems caused by poor hashing commonly go undetected, and the hashing is a small part of the overall performance, it seems that a good hash function can't be called "essential for good performance" but is rather an aesthetic preference at most. One might distinguish between 'adequate' and 'bad' functions, where 'bad' might run slowly and produce uneven distributions, but if your problems are going undetected, I'd say you've at least struck 'adequate', and improving to 'good' would be an exercise in lily-gilding.
But perhaps I'm missing something. --Fader (talk) 14:51, 24 July 2009 (UTC)
- Well, it seems that the text can be improved. The intended meaning is: the difference between a good and a not-so-good hash function may be a factor of 10 or more in the average operation cost; so, if you want the hash table to have good performance, you'd better pay attention to that item. However, performance aside, the table will work correctly with any hash function; and, in many applications, making the lookup 10 times slower will probably make little difference in the overall processing time, because other costs will probably dominate the bill. For example, some shells use a hash table to map command names to directories in the command search path. A bad hash function there may add a few milliseconds to the command startup time, which is usually dwarfed by the cost of loading the program into memory. Such a bug is likely to go undetected for a long time; and even if i is detected, fixing it will be very low in the shell maintainer's priority list. On the other hand, some interpreted languages like Gawk use hashing to map variable names to values; in that context, even a 50% increase in the lookup cost will have a noticeable impact on the performance of every Gawk program.
Hope this helps. All the best, --Jorge Stolfi (talk) 23:42, 25 July 2009 (UTC)
cryptographic hash functions
From the article: "In fact, even a cryptographic hash does not provide protection against an adversary who wishes to degrade hash table performance by choosing keys all hashing to the same bucket."
I thought that one of the criteria for a cryptographic hash function was that it be infeasible to find collisions. Therefore, it would provide defense against such an adversary. Or am I misunderstanding something? Ralphmerridew 21:55, 30 January 2007 (UTC)
- Hash tables are usually very limited in size, and the length of the hash function is clipped modulo the hash table size. That is, while a hash might produce a result of 160 bits, it is obvious that a hash table of 2160 entries would be infeasible, not to mention useless. In-memory hash tables rarely exceed millions of entries, and it is relatively trivial to brute force through this amount of hashes for finding collisions. For a sense of magnitude, a low-end processor today can compute around half a million SHA-1 hashes of 16-character strings per second. -- intgr 23:07, 30 January 2007 (UTC)
- Doesn't that require that the attacker also knows the hash table size? (Alternately, if the attacker adds enough entries to be able to work out the size, it's also likely to force a rehash.) Ralphmerridew 00:51, 31 January 2007 (UTC)
- Well yes, if secrecy is a choice, it's a good idea to choose a large prime as the hash table size. This is, however, unrelated to whether one is using a cryptographic hash function or a classic one — it is impossible to cause collisions if you cannot predict the modulus or one of its factors.
- But lots of hash table implementations resize automatically to predefined constant sizes, and as far as I know, many even use the worst choice of power-of-two sizes. Power-of known number sizes mean that the attacker can cause clustering even if they mispredict the size, and that the attack is effective even through reclustering. This is because when , and even if , it will just cause clustering on different keys simultaneously, which can still be fatal with large hash tables. -- intgr 07:19, 31 January 2007 (UTC)
- I have no idea what I was thinking earlier, the equation should be . -- intgr 17:41, 31 January 2007 (UTC)
- But lots of hash table implementations resize automatically to predefined constant sizes, and as far as I know, many even use the worst choice of power-of-two sizes. Power-of known number sizes mean that the attacker can cause clustering even if they mispredict the size, and that the attack is effective even through reclustering. This is because when , and even if , it will just cause clustering on different keys simultaneously, which can still be fatal with large hash tables. -- intgr 07:19, 31 January 2007 (UTC)
- Re point 1, with a bad hash function, an attacker can choose keys that hash to the same value, which will cause collisions regardless of modulus. Ralphmerridew 16:41, 31 January 2007 (UTC)
- Oh, were you thinking of generating hashes small enough to always be less than the modulus? Good point, never thought that. (though I'm not the author of the quoted claim) -- intgr 17:41, 31 January 2007 (UTC)
- No, I mean that, with a bad hash function, an attacker could, say, generate arbitrarily many strings that hash to, say, 0x16de fa32 4261 1ab3. Whatever modulus is used, all those strings will fall into the same bucket. With a cryptographically secure hash function, given a known modulus, an attacker might be able to produce a large number of strings such that (hash % modulus) are all the same, but he'd be unable to produces a significant number that all have the same hash value. Ralphmerridew 21:43, 31 January 2007 (UTC)
- I wouldn't count on the speed (well, slowness) of a hash function for protection against clustering attacks. While it makes the attack slightly more expensive, it also complicates hash table inserts and lookups by the same proportional amount. If you can cope with the extra processing overhead, you're most likely better off with data structures that do not exhibit such critical worst-case performance, such as various balanced trees.
- Perhaps this is an overly cryptographic point of view, but it is not that expensive to generate truncated hash collisions for cryptographic hash algorithms by brute force. And as mentioned above, a general purpose low-end processor (Athlon 64 3000+ here) is capable of generating half a million hashes per second. A heavily parallelized FPGA-, or even ASIC-based chip could improve that by several magnitudes. (Such programmable FPGA computers are readily available on the market, e.g. COPACOBANA, which can do 1013 DES trials per second) -- intgr 22:37, 31 January 2007 (UTC)
- I'm not depending on "Given string 'str', calculate hash(str)" being slow; I'm depending on "Given 'value', find a string 'str' such that hash(str) == value". The latter is part of the definition of cryptographically secure. And even 10^13 trials/second means brute force will take about three weeks per full collision with a 64 bit hash, and is ineffective against even the deprecated MD5 (128 bits) or SHA-1 (160 bits). By comparison, IIU multiplicative hash C, it's possible to find a full collision in O(#bits) time. Ralphmerridew 23:12, 31 January 2007 (UTC)
- Did you forget that to utilize all these 64 bits, you need to store the table somewhere? There are no practical applications of a 264-entry hash table, and the space requirements are proportional to the brute force collision-finding time (both scale O(2n bits)). Just storing this many hashes (or 8-byte values) alone will take up of space. If you create a smaller hash table, you have to truncate the hash (throw away some of its information), which inherently speeds up brute force cracking. -- intgr 23:29, 31 January 2007 (UTC)
- But a collision against a truncated hash is only useful against a small number of moduli, and then only if the attacker knows or can work out the modulus. Ralphmerridew 23:57, 31 January 2007 (UTC)
- This seems to effectively boil down to what I formulated earlier, "were you thinking of generating hashes small enough to always be less than the modulus?". I do agree that in case the attacker does not know the modulus, and the modulus is a non-tiny prime, then this effectively disables clustering attacks. I disagree that when the modulus is known, the hash table needs to be "small", but let's leave it at that. -- intgr 00:54, 1 February 2007 (UTC)
- Well, I've changed the article now and put a {{fact}} on it, since we still need to cite a source for it. Or do you happen to have one? -- intgr 12:41, 5 February 2007 (UTC)
- I agree that, for a certain special case, Mallory (the attacker) can guarantee that all the keys he generates hash to the same bucket, even when a cryptographic hash is in use.
- That special case happens when an attacker doesn't know the exact table size, but does know that it is a power of 2, and at most some maximum size -- Mallory knows that slot_number = hash(key) % 2^n, and he knows the particular cryptographic hash() used, and although he doesn't know n exactly, he knows some k such that n <= k.
- By doing some work reminiscent of hashcash to zero out the k least-significant bits, Mallory can generate keys that all hit the same bucket. Mallory generates O(2^k) trial keys before he finds one that he knows will hit the same bucket. (With most non-cryptographic hashes, Mallory only needs to do O(1) work to construct a key that will hit the same bucket).
- But so what? Is there any application where this relevant?
- What sorts of applications need to accept keys from a potentially malicious attacker?
- Say we did the "secure" thing by picking a L bit cryptographic hash, and "randomly" picking a new large prime number p every time we resize the table, and using slot_number = ( hash(key) % p ) % tablesize. (Since tablesize << p << 2^L , does it matter whether the tablesize is a power of 2 or not?).
- Even with this "secure" hash -- even if Mallory has no clue what hash we are using or how we are reducing it to the tablesize -- Mallory could still force collisions by sending O(tablesize) submissions.
- (By the birthday paradox, he is *likely* to cause a collision even with O(tablesize^(1/2)) submissions).
- Is there some application where this "secure" hash would work, but the above power-of-2 "special case" wouldn't work?
- --68.0.120.35 07:42, 5 March 2007 (UTC)
- The article text now says that "However, using cryptographic hash functions can protect against collision attacks when the hash table modulus and its factors can not be kept secret from the attacker, or alternatively, by applying a secret salt." (there's a comment there saying "see discussion on talk page; just needs a reference"). This is not true by the argument already given here: If (as the article says) we assume that the attacker knows the hash table modulus, then he may search for collisions by brute force, as the effective amount of bits is log(number of buckets the hash table can use), which is generally very small by cryptographic standards. Salting works with cryptographic hashes, but for non-cryptographic ones there are no guarantees.
- In addition, the case where the attacker can't predict the modulus with a reasonable probability, i.e. when it's not feasible to just generate collisions for the most probable modulus, then the second most probable, etc. until the modulus is discovered, does not seem very important. Which real-life hash table implementations have moduli chosen with this in mind?
- There's also the possibility that information about a modulus can be obtained interactively by throwing values at the hash table and seeing if the response time slows down, even if slightly (see timing attack). A similar attack might work to figure out an effectively equivalent salt (producing the same internal state after processing the salt and before processing the key) when non-cryptographic hashes are used: The probability of certain classes of keys colliding might depend on e.g. whether a certain bit of internal state is 1 or 0, so one might throw keys from those classes at the hash table, measure slowdown and compute a new probability for that bit being 1 or 0, and so on independently for each bit, effectively performing binary search for the salt. -- Coffee2theorems (talk) 20:29, 15 June 2008 (UTC)
- Yeah, using a good cryptographic hash with a decent salt should work OK, independently of the modulus though.- (User) WolfKeeper (Talk) 23:28, 15 June 2008 (UTC)
What lots of people seem to forget when discussing the need for hash functions that reduce predictable collisions is, that they are only needed for hostile environments, where an attacker can control the input data *and* where it is important that proper speed is maintained (e.g. tcp/ip hashtables within a kernel). It is much much less important in simple programs where a hashtable is just used to store some data that might even be created dynamically by the program, or doesn't occur in the input set the program was designed for. It might be possible to degrade a hashtable that just uses some "id * 1000000007UL & ((1<<n)-1)" for finding the appropriate slot in your face recognition software by feeding it with a carefully crafted artificial bitmap pattern. but why care? trash-in, trash-out.
Verbiage?
A recent edit deleted most of the following paragraph, claiming it was "verbiage":
- "Cryptographic hash functions are believed to provide good hash functions for any table size s, either by modulo reduction or by bit masking. They may also be appropriate if there is a risk of malicious users trying to sabotage a network service by submitting requests designed to generate a large number of collisions in the server's hash tables.[citation needed] However, these presumed qualities are hardly worth their much larger computational cost and algorithmic complexity, and the risk of sabotage can be avoided by cheaper methods (such as applying a secret salt to the data, or using a universal hash function)."
This paragraph is explaining that *cyptographic* hash functions (a different concept altogether, see lead section) are not necessarily good choices for hash tables, because their only advantage (probabilistically guaranteed good performance even on data submitted by hostile clients) can be obtained at smaller cost by using ordinary (non-crypto) hash functions with secret salt (as discussed above). The recent edit removed this information. But perhaps the wording is not clear and needs to be improved. All the best, --Jorge Stolfi (talk) 21:28, 6 February 2010 (UTC)
Example of a good hash function?
The following text was recently added to the article:
begin text
An example of a simple hash function with good behavior is:
unsigned long f(char key[], int arrayLength) { unsigned long h = 0, v = 401; int i; if (arrayLength >= 2) { h ^= (0xffUL & key[0]); h ^= (0xffUL & key[1]) << 8; h = ((h + 9) * (h + 2) * v) >> 1; } for (i = 2; i < arrayLength; i += 2) { h ^= (0xffUL & key[i]); h ^= (0xffUL & key[i + 1]) << 8; h ^= ((h + 9) * (h + 2) * v) >> 1; } if ((arrayLength & 1)) { h ^= (0xffUL & key[arrayLength - 1]); h ^= ((h + 9) * (h + 2) * v) >> 1; } return h % N; }
This function has the recommendable property that if arrayLength is 2, and N is 216, then the function is almost a perfect hash, filling 65531 of 65536 slots.
end text
This example needs more context before it is put in the article. Is there a reference for it? Also the comment below the code is confusing. All the best, --Jorge Stolfi (talk) 12:47, 17 February 2010 (UTC)
- "Also the comment below the code is confusing.": No, the comment below the text is not confusing. —Preceding unsigned comment added by Paulsheer (talk • contribs) 11:05, 18 February 2010 (UTC)
- Well, what is N and what is arrayLength? Anyway we need a reference that explains this code and analyzes it performance, etc.. Wikipedia does not publish original research. --Jorge Stolfi (talk) 07:26, 19 February 2010 (UTC)
N is the size of the hash table and array length is the length of the char array if i'm not mistaken. It still needs more info on hash time and comparisons to other hash functions. I think an example is needed to give the reader a better idea of what a hash function is.
Citations (April 2010)
I agree, although I think that the following quote "Poor hashing usually degrades hash table performance by a constant factor,[citation needed] but hashing is often only a small part of the overall computation" was unnecessarily marked as needing a citation, as the truth of the assertion is self-evident. Consider a hash function f such that f(x) = 0 for all objects x. Obviously, this is the worst hash function possible, but it illustrates my point. Clearly, if the keys are comparable, then insertion and lookup will always have time complexity O(log n) rather than O(1). If the keys are not comparable, then insertion will have time complexity O(1) and lookup will have time complexity O(n) (since you can't binary search a list of items without an ordering). —Preceding unsigned comment added by 24.69.148.22 (talk) 01:39, 11 April 2010 (UTC)
- No, it is not self-evident, as you have shown in your example. The change from O(1) to O(n) complexity is a performance degradation by a linear factor, not by a constant one (as claimed in the article). The statement, that performance degradation due to a poor (but not pathologically bad) hash function is constant, does need a citation. – Adrianwn (talk) 05:28, 11 April 2010 (UTC)
Oh, I see what you're saying. I assumed that "constant factor" meant a constant factor of n where proportional to the number of items mapped by the hash function. Perhaps the sentence should be edited for clarity? I think that whoever wrote that didn't mean to say that the performance degradation was in O(k*1) = O(1). I think the person meant O(kn) = O(n) for some "constant factor" k > 1. If that is what was meant, I am sure that we could both agree that it's more likely to be true than the claim that degradation is in O(1), which is should be rejected prime facie. —Preceding unsigned comment added by 24.69.148.22 (talk) 06:35, 11 April 2010 (UTC)
- Actually, I think that the original author meant a performance degradation by a constant factor k, as long as the hash function is suboptimal (and leads to more collisions than a better one) but not purposefully bad (like mapping all inputs to the same hash value). If that is the case, this statement needs a more detailed explanation. If no citation can be found, it should be deleted, although it appears somewhat plausible. – Adrianwn (talk) 06:40, 12 April 2010 (UTC)
Auto archive
Any objections to me setting up auto-archive on this page? It's getting rather lengthy, and I'd rather set it up to work automatically than keep doing it by hand. me_and 16:46, 23 September 2010 (UTC)
External links
My clean-up of the external links section was met with opposition, so let's discuss the individual items (I'll exclude the ones that have been removed again in the meantime):
- [1] – contains nothing that's not already in the article (see WP:ELNO #1)
- [2] – ditto
- [3] – ditto
- [4] – ditto
- [5], [6] – these might actually contain some valuable information that is not already mentioned in this article, but I don't want to go through 160 minutes of video to find out.
- [7] – I don't see the benefit of linking an implementation; if an example makes things clearer, then it should be in the article (in pseudocode).
- [8] – ditto
- [9] – ditto
- [10] – promotional link
I think that all these links should be removed, for the reasons given above (maybe except for #5). – Adrian Willenbücher (talk) 16:39, 23 September 2010 (UTC)
- I see benefit in linking to actual implementations. Actual code will often have more detail in it than an encyclopedia article should have. Actual code is also more concrete than psuedocode. Addition explanations (even if they go over the same ground) could help a reader. Some links may be be weak (perhaps sparknotes), but the NIST link appears to link to significant detail. I'll agree that the list is getting long; it should not list every implementation but rather ones with significant content. Glrx (talk) 21:44, 23 September 2010 (UTC)
- What would you propose are the ones that should be kept?
- Regarding additional explanations: if they go into more detail than appropriate for an encyclopedic article, then I'm fine with a link; if it is not too detailed, then I would like to add the respective content to this article; however, I don't see the benefit of linking websites that don't explain more than already present here (and often do it worse). – Adrian Willenbücher (talk) 21:57, 23 September 2010 (UTC)
- Earlier, I peeked at all but your number 5. If I saw any content, then I kept the link. I'd have to study the links to form a more detailed opinion. (If it is any consolation, I have deleted NIST links in the past - they are often just weak dictionary defs. This NIST link has more substance and should probably stay.) The sparknotes link was disturbing because it was heavy on ads; if you were to look it over and decide that it didn't add any reasonable content, I would not object to your removing it. IIRC, it was a recent addition and that editor may want an explanation. Glrx (talk) 23:15, 23 September 2010 (UTC)
- Of course they all had content. The question is, whether the content they provide justifies an inclusion. According to WP:EL, it is generally more desirable to add content to the article (if possible and reasonable) instead of linking it.
- You haven't given any reason why #1, #3 and #4 should be kept, except that they might contain some information which is not already present in the article. This is not enough to warrant an inclusion.
- As for the links to actual implementations: it might be better to link to b:Data Structures/Hash Tables. – Adrian Willenbücher (talk) 06:40, 24 September 2010 (UTC)
For beginners
This article is terrible - someone needs to make this article clearer for beginners or even intermediates. It barely explains anything at first. How about a simple example instead of jumping into advanced theory??? 129.128.241.131 (talk) 17:53, 7 February 2008 (UTC)
I feel the same way. Someone (like me) who doesn't know what a hash table is for will not find out quickly. I had to read most of the article before I could figure it out. I would suggest something like this be added, either in the introduction and/or in the very beginning of the contents. (I'm no expert; someone who knows what they're talking about should write it.)
Say an array of available space for storing data has indices like 1, 2, 3, ..., 1000: ARRAY[1], ARRAY[2], etc. However, the keys to the data may be something entirely different, like "John Smith", "Lisa Smith", and "Sam Doe". If they would just be put into the array directly, some sort of search method would be necessary to find the desired entry. A hash table is a way of solving this problem, allowing extremely rapid lookup. In the example above, a pseudo-random function is called to assign each of John Smith, Lisa Smith, and Sam Doe to some number between 1 and 1000. If "John Smith" is assigned 873 by the hash function, its data is stored in ARRAY[873]. It can be retrieved immediately on demand by just recomputing the hash function on "John Smith" again locate the correct index. If the hashing function is well-designed, different keys being sent to the same number is unlikely until the array begins to fill up. At some point, of course, keys will begin to "collide", to be assigned the same array index. The hashing algorithm needs a way to find the right index anyhow.
--MikeR7 (talk) 21:07, 4 March 2008 (UTC)
- I agree. A gentle introduction describing a specific example could be very useful. I'll write something up. Dcoetzee 22:41, 4 March 2008 (UTC)
Wow! MikeR7's introduction is great for me. And it would be so for many bigginers. Thank you! -- Kang —Preceding unsigned comment added by 59.17.69.101 (talk) 04:50, 28 August 2009 (UTC) It would be good to add to this an example of how the hash function might be constructed. —Preceding unsigned comment added by Lunedi9 (talk • contribs) 20:28, 6 November 2010 (UTC)
Open addressing time
Currently, it turns out from this article that open addressing is absolute nonsense (except, perhaps, because of caching). Actually it's not better so much because of insertion, deletion etc. speed, but because iteration speed (over whole table) is very good even with very non-uniform hashes. Thus, it's good for example to do duplicate check for values, which you do not insert too often, where the common operation is iterating over all values (for example if you want to iterate over table of size 100 000 each time you have added 20 elements). Consequently, rehashing can also be faster. qtvali --83.178.58.29 (talk) 22:03, 3 July 2011 (UTC)
The "mod N" trick is inside the hash function, not outside it
There is a misunderstanding in the recnt edits. The "modulo the table size" step is technically a part of the hash function, not of the hash table algorithm. If the bucket array has size s, the hash functiion must return a number in 0..s-1. See Talk:Hash function. Briefly, the mod-s trick is not always a good idea, and cannot be discussed separately from the "raw" hash function. More importantly, all discussion about the hash function (desired properties, meaning of "perfect", etc.) assume that there is no external mod-s step. All the best, --Jorge Stolfi (talk) 02:21, 18 April 2009 (UTC)
At work I regularly explain to new developers the implementation of hash tables in our software product. I describe the computation prior to "mod N" as creating a fixed size digest of (some of) the entropy in the (typically larger, sometimes variable length) input value. Often we use a number of digest functions all returning the same type (typically an unsigned 32 bit integer). My developers readily grasp that the range of such digests is too great and must be transformed into an index into the N entry hash array, hence the "mod N" step. Making the size of the array prime simply ensures that all entropy in the digest participates in selection of the hash array position.
The independence of the digest step and the "mod N" step is underscored by the fact that N may change over time as values get added to the hash table, while the chosen digest function does not change. From experience I know that whenever we fail to distinguish clearly these two notions confusion ensues. 67.189.170.148 (talk) 21:20, 8 August 2012 (UTC) "John Yates" <john@yates-sheets.org>