Talk:Hash table/Archive 3

This is an archive of past discussions about Hash table. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

Archive 1

Archive 2

Archive 3

Look up is not O(1) at all

The mathematical definition of O(1) is when n tends to infinity. All the demonstrations of O(1) I've seen assume at some point that the size of your data is bounded, or that you don't use your table at more than 80%, etc. But if your data gets really big (say greater than 10^10^10), either your collision rate will increase to 100% (then you tend to O(log n) ), or you'll have to increase your table size. The table size is about proportional to the data size. Then you'll have to compute a greater hashkey to uniquely identify the buckets. As it happens, the size of the hashkey (and the time it takes to compute it) grows with log n, where n is the number of buckets. So the operations on a hashtable are really O(log n), even if it stays reasonable for large data.--Yitscar (talk) 08:11, 24 May 2008 (UTC)

Hash tables can be designed for any range of n. For that range of n, lookup is O(1). End of.- (User) WolfKeeper (Talk) 08:35, 24 May 2008 (UTC)

I'm talking mathematically here. O() is a concept valid at infinity, not in a range. See Big_O_notation#Formal_definition

I understand that in that range computation time is independant of the size of data, whereas it wouldn't be for, say, quicksort. But I'm saying the Big O notation is not rigorous here.--Yitscar (talk) 20:31, 24 May 2008 (UTC)

I don't agree. In any case, to change the article, you would need a reference.- (User) WolfKeeper (Talk) 22:46, 24 May 2008 (UTC)

His point is valid - that in order for the average number of elements in each bucket to be constant, the table size must be a constant multiple of the number of elements, and consequently the hash function's output must have O(log n) bits. Nevertheless the O(1) lookup claims are so pervasive in standard reference works that we'd need to find a very authoritative source to contradict them for the purposes of this article. In practice what's more important of course is that it's much faster to generate O(log n) hash bits than it is to incur O(log n) cache misses, as in a binary search tree. Dcoetzee 01:29, 25 May 2008 (UTC)

(Deleted my own comment -- yes, you'll always need log(n) bits to index into the hash table.) not-just-yeti (talk) 14:52, 17 July 2008 (UTC)

Interesting comment, but unfortunately it would be OR without references. 124.101.249.63 (talk) 14:00, 17 July 2008 (UTC)

Indeed, and despite looking over the whole internet I haven't found anyone who's 'got it right', so I'm not modifying the page for now--Yitscar (talk) 17:38, 20 July 2008 (UTC)

Don't take the notion of "time"-complexity too literally. For most data structures, "time" refers to the number of key comparisons performed (see, for example, binary search trees), not to the number of bits which are compared. Likewise, time-complexity of hash tables refers to the number of hash-computations and memory lookups (each of which is considered to take take a constant number of operations), so it is in O(1). Adrianwn (talk) 08:40, 16 December 2008 (UTC)

Wrong, all this talking about log(n) is fully wrong. Normal, linear search is O(n/2) with worst case O(n). Hash tables - as they are mostly implemented - have O(n²/m) where m is the size of the hash table. That's why crowded hash tables become as slow as normal search. But nowhere O(log n), forget about that! 178.197.232.59 (talk) 11:16, 22 October 2012 (UTC)

The complexity table at the beginning does not define its variables

You should define k to be the size of the hash table (the size of the underlying array ) and n to be the number of elements in the hash table, and unless you make the hypothesis that k < n, the space complexity should be O(k+n) rather than simply O(n).

By the way, this page has a more detailed complexity analysis for hash table that is much clearer than the mess on the wiki page

Worse than that, the complexity table is inconsistent with the Performance Analysis section of the article. In the complexity table, the search and delete operations are O(1 + n/k). In the Performance Analysis section, the article states: "For the best possible choice of hash function, a table of size n with open addressing has no collisions and holds up to n elements, with a single comparison for successful lookup, and a table of size n with chaining and k keys has the minimum max(0, k-n) collisions and O(1 + k/n) comparisons for lookup." If these two parts of the article are in fact discussing different things, that should be made apparent.

Furthermore, the Performance Analysis section is completely missing the derivation and justification of the time complexity expressions in the complexity table. This doesn't need to be an exhaustive treatment, but should serve to remove ambiguity in the variables shown in the table. 12.9.138.10 (talk) 23:17, 16 October 2012 (UTC)

I already wrote above about the problem. I think thay hash tables are often teached wrongly, and most of the so called "hash table" examples have only O(n). However, a real hash function takes the input data to get the key! It makes some operations on the input "key" (i.e. giving a integer representation of a string), afterwards a modulo over the hash table length, and then you get the real hash table key ("index"). Then the value is inserted at the index number's slot, together with the original key. If there is already another value there, the original keys are compared, and then the value is added to that slot if the original keys aren't equal, or the value will be overwritten. That's how you get O(n²/m), where m is the table size. All this works well only if you use an intelligent operation to transform original keys into table keys/indexes, and it certainly should't take a long time to calculate the index out of the original key (otherwise you would be faster to implement a linear search). While the tables themselves are just normal lists with an incremental index, nothing more, so the trick behind everything is the hash function. Or am I wrong in thinking that many people don't understand that correctly? But then why are they talking about log(n) or making crap hash table examples?? (Don't understand me wrong, this wiki article has correct graphic examples, but the explanations are somehow not clear...a clear explanation would be like what I just wrote here!) 178.197.232.59 (talk) 11:37, 22 October 2012 (UTC)

Separate chaining section -- Sequential Lookup is slower than Balanced Tree Search??

"For example, a chained hash table with 1000 slots and 10,000 stored keys (load factor 10) is five to ten times slower than a 10,000-slot table (load factor 1); but still 1000 times faster than a plain sequential list, and possibly even faster than a balanced search tree."

This example and comment in the article seems to suggest that balanced search tree is even slower than a sequential list (linear search). This is absurd! :) — Preceding unsigned comment added by 151.197.168.146 (talk) 16:15, 9 July 2011 (UTC)

I don't see the problem. The claim is plausible as a lookup in the "load factor 10" scenario involves computing a hash then linear searching 10 items. Apart from the fact that such an operation is probably faster than a balanced tree search of 10,000 items (13 comparisons), searching 10 contiguous items can be much faster than accessing memory in different pages if page swapping is involved. Johnuniq (talk) 01:25, 10 July 2011 (UTC)

"This example and comment in the article seems to suggest that balanced search tree is even slower than a sequential list (linear search)." -- No it doesn't. A hash table at load factor 10 might be 1000 times faster than a sequential search, and 1% faster than a balanced search tree. The balanced search tree in that case is also approximately 1000x faster than a sequential search. And besides, it said "possibly". Theclapp (talk) 20:06, 9 August 2013 (UTC)

"Clustering concern?"

Can anyone explain what the following is trying to get at?? The reference seems to be off line.

For open addressing schemes, the hash function should also avoid clustering, the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash[3] is claimed to have particularly poor clustering behavior.[7]

Any competent hash function will map two or more keys to consecutive slots, given the right keys.
It appears to imply the slowdown ("skyrocket"!?) is caused physical storage concerns rather than collisions. Given typical memory swapping schemes, adjacent physical addresses are more likely to available than not.
In an open addressing scheme, the location of the data record is independent of the location of the slot.

Hew Johns (talk) 17:44, 19 August 2013 (UTC) Okay, I see some of the problem "open addressing" == "closed hashing", i.e. in table storage... Hew Johns (talk) 20:08, 19 August 2013 (UTC) OK, never mind. After reading the whole thing I realize it is a very poor expression of the idea that a probe or re-hash function must also be uniformly unbiased, rather than clustering about the original hash. Hew Johns (talk) 21:26, 19 August 2013 (UTC)

Omitted definition in article: what is "s"?

In the section "Choosing a good hash function", a (presumably integer) "s" is repeatedly referred-to. However this is as far as I can see totally undefined in the wikipedia article. Although (some) computer scientists might infer s from context, other readers may not. This is bad form. Suggest to change.

152.3.68.83 (talk) 18:39, 22 November 2013 (UTC)

Associative arrays

Perl is not interpreted. — Preceding unsigned comment added by 68.183.92.186 (talk) 00:17, 14 May 2014 (UTC)

array hash table

Why link to an non-existent article? — Preceding unsigned comment added by 75.79.3.84 (talk) 20:09, 15 May 2014 (UTC)

Some of the good reasons for linking to an article that does not yet exist are listed at Wikipedia: red link. --DavidCary (talk) 04:25, 24 May 2015 (UTC)

Random table based hash function

I am suspicious of the random table based hash function. This seems to be a variation on Pearson's hash function, except there are multiple tables, therefore much worse cache misses. Pearson's hash function is not all that fast to start with. MegaHasher 03:22, 18 August 2007 (UTC)

For single-byte indices and 32 bit hash numbers, the tables are 256*4 = 1024 bytes each. This algorithm is obviously best when the cache size is larger, but also very suitable when a few keys account for most of the lookups, or when the cost of a collision is far greater than the cost of a memory cache miss (such as when hashing into disk records).

By having multiple tables, this algorithm ensures lookups to the same index in the table don't cancel each other out, thereby generating literally perfectly randomly distributed but repeatable hashes. The Wikipedia article on Pearson's hash function indicates it employ a table with the numbers 0 to 255 randomly arranged, and it ultimately selects one of the 256 different values in the table at random. The multi-table algorithm you removed doesn't simply select one of the 256 values in one of the tables, it xors values together to create a far stronger hash. Look at the algorithm properly and you might appreciate it more.

FWIW, the algorithm is proven technology, used to hash > 3 million highly-repetitive financial security keys by the world's leading financial data provider's proprietary security database, handling hundreds of thousands of database inserts per second.

I'll reinstate the algorithm, with the attribution removed as you've (offensively) referred to it as "vanity" material, and I hope you, Mr "MegaHasher", have much better reasons before you remove it again.

Wikipedia generally does not accept original research, but you are welcome to publish your algorithm in a different location, and submit to Wikipedia as a cited work of concise length. The article of Hash function is probably a better location. MegaHasher 04:19, 29 August 2007 (UTC)

I don't consider this to be original "research". It is simple stuff, and important precisely because of that. Too many programmers either don't use hash tables or use poor algorithms because they don't have a feel for or ability to assess mathematical hashes. This may suit the few specialists who write the algorithms and relish the elitist mystique of it all, but it's a big loss for computing as a whole. More generally, I might accept that this belongs on the Hash Function page if you also moved the joaat_hash off this one. You can't have it both ways. I agree with the earlier discussion comments that your posting and defense of Bob Jenkin's algorithm looks very inappropriate on this page, and all the more so for your refusal to accept anything that removes any of the focus from it. —Preceding unsigned comment added by 205.228.104.142 (talk) 23:59, August 29, 2007 (UTC)

Wikipedia's policy is very simple. Please look at the welcoming material in your user talk page. You need to publish your full article in a location off Wikipedia, then write an one paragraph summary of it on Wikipedia, and give a citation link to your full article. MegaHasher 18:20, 30 August 2007 (UTC)

As far as I can tell, the hash function that was deleted 02:46, 29 August 2007 is an implementation of Zobrist hashing, a topic which has more than enough references to prove its Wikipedia:Notability. The joaat_hash function has been moved from this article to Jenkins hash function#one-at-a-time. Further discussion of those functions should probably go on those respective specific talk pages, or on the more general hash function talk page. --DavidCary (talk) 04:32, 13 August 2016 (UTC)

Wrong "highly optimized" Perl claim

"Python's built-in hash table implementation, in the form of the dict type, as well as Perl's hash type (%) are highly optimized as they are used internally to implement namespaces."

This statement is misleading. Perl used a simple and especially insecure but fast form of a Jenkins OOAT hash until 5.6, when it was attacked by collisions against the simple linear chaining. The proposed fix was to use universal hashing, to keep performance and add security against hash collisions. Implemented was randomized seeding on rehashing instead. This trick was found to be broken with 5.17.6, and so all optimizations were thrown overboard and a rather slow and secure hash function was selected, SipHash, instead of choosing an optimized and proper hash function (like City or Murmur) and a fast and secure collision strategy. Python choose the same road and changed to SipHash. Perl went back before the 5.18 release to a bit faster JenkinsOOAT hash function, but this still didn't solve the problem of the wrong collision resolution. Optimized strategies would have been open addressing, universal hashing or perfect hashing. There are furthermore no other possible optimizations implemented, like using SIMD or HW assisted crc32 intrinsics on Intel processors, as CRC32 turns out to be the fastest and best hash function for the average case (least number of collisions). See http://blogs.perl.org/users/rurban/2014/04/statistics-for-perl-hash-tables.html

I propose to use the wording "Python's built-in hash table implementation, in the form of the dict type, as well as Perl's hash type (%) are used internally to implement namespaces and therefore need to pay more attention to security, i.e. collision attacks."

Highly optimized and concurrent hashes can be studied with Clojure, the Hopscotch table or Robin Hood. ReiniUrban (talk) 17:01, 12 April 2014 (UTC)

Right. Using your own post as evidence - FAIL. — Preceding unsigned comment added by 68.183.92.186 (talk) 00:27, 14 May 2014 (UTC)

Show me one counter-example of a hash implementation worse then perl5. You'll find none in open source. Claiming "highly optimized" for highly deoptimized is grossly misleading. I implemented many improvements in my fork, but it's still worse than ruby or any other typical naive unoptimized implementation. ReiniUrban (talk) 16:30, 28 August 2016 (UTC)

Examples

This article needs simple examples up front so that novices can get the general idea. 2602:306:35FA:D540:2487:9F59:8A65:3095 (talk) 04:44, 27 January 2017 (UTC)

Not really O(n)

Like arrays, hash tables provide constant-time O(1) lookup on average, regardless of the number of items in the table. However, the very rare worst-case lookup time can be as bad as O(n).

This is essentially wrong. In a non buggy implementation which is when the table is a reasonable size, and the table is never too full and the hash function is working correctly n-collisions cannot occur.WolfKeeper 14:36, 16 October 2007 (UTC)

WolfKeeper: You have no idea what you are talking about. In standard implementations in standard SW like all dynamic languages the linear chaining strategy is open to well-known hash collision attacks, which can be defeated even with randomized seeds. So hash tables need to protect themselves against those O(n) worst-cases, which is e.g. a DOS attack on the web against php, perl, ruby with well-crafted keys. And those O(n) possibilities lead the implementations as those attacks appear in practical web security.ReiniUrban (talk) 19:12, 2 April 2014 (UTC)

That's wrong for another reason. Dynamic perfect hashing methods like cuckoo hashing provide constant lookup time. Only updates may take O(n) time in the worst case. 23:24, 2 November 2007 (UTC)

The probability of n collisions by sheer chance is p^n where p is a number usually well under 0.8. The chances of say, 200 collisions is 0.8^200 = 4e-20, and on modern computers that would still run quickly. At a million hashes every second that statistically wouldn't happen once in a million years; and that assumes the table is full, and that's a small table by hashing standards. There's more chance of the hardware failing, by about a factor of 10^6.WolfKeeper 14:36, 16 October 2007 (UTC)

But Big-Oh notation is based on the worst case and the worst case for hash tables would be poor or incorrect implementation/small n. The only reason hash tables never 'feel' like that is because only good implementations would be kept and reused. You can't ignore possibilities just because they aren't likely to occur when you're talking with Big-Oh.199.111.229.133 18:47, 16 October 2007 (UTC)

But O(n) on a 50% full table with 1000 entries and a good hash function, it's not just "very rare" as in the text, it just never happens and never would happen before the end of the universe. If it has never been seen, and never would be seen, then it isn't rare, it just never happens.WolfKeeper 19:12, 16 October 2007 (UTC)

It really is O(n) in the worst case - regardless of the table size or hash function, you can artificially construct a set of keys that all map to the same hash bucket. It never happens in practice - but a worst-case formal analysis must consider this case. Dcoetzee 21:33, 7 December 2007 (UTC)

similarity --- quicksort is also O(n*n) in a similar manner (for few special cases) . 84.16.123.194 (talk) 15:44, 28 January 2008 (UTC)

I feel like we are talking past each other, because we are talking about several different things but using the same name for all of them:

Many hash table implementations use some nice, fixed, nonrandomized, published hash function and always do power-of-two resizes so the table always has a good amount of empty space -- I'll call these "known hash tables" -- is there a better name?
Crosby and Wallach (if I'm reading them right) recommend a randomized algorithm that, each time a hash table is created, picks a fresh new random hash function for that table from some known, published family of hash functions. Then the algorithm uses that same hash function for as long as the hash table exists, which could be many years. My understanding is that the "secret salt" is intended to have effectively the same characteristics. I'll call these "fixed secret hash tables" -- is there a better name?
Several algorithms -- cuckoo hashing, hopscotch hashing, dynamic perfect hashing, etc. -- periodically switch to a *different* fresh new random hash function, from some known, published family of hash functions, often enough to prevent too many collisions. I'll call these "temporary secret algorithms" -- is there a better name?

I agree with Wolfkeeper that, with natural data, known hash tables are so unlikely to ever exhibit O(n) performance that "it just never happens and never would happen before the end of the universe." However, the hash table#Drawbacks section has a few references that imply that known hash tables are vulnerable to a DoS attack involving carefully crafted unnatural data that tricks the computer into the worst-case O(n) performance. Some of those references seem to recommend fixed secret hash tables, which still have worst-case O(n) performance, but make it practically impossible for an attacker to exploit that possibility unless the attacker knows the secret. However, other references imply that an attacker can learn the secret after an hour of probing, and then use that secret to send carefully crafted unnatural data that tricks the computer into the worst-case O(n) performance. Those references recommend temporary secret algorithms, which generally guarantee O(1) lookup times even in the worst case under active attack, at the cost of making the worst-case insertion time slower.

I feel this article needs to clearly distinguish between algorithms that have guaranteed O(1) lookup times even when under active attack, vs. algorithms that suffer from O(n) lookup times when under active attack. Is that difference big enough that these two categories of algorithms need to have two separate Wikipedia articles? --DavidCary (talk) 16:39, 29 July 2016 (UTC)

I see that elsewhere on this talk page someone says "Dynamic perfect hashing methods like cuckoo hashing provide constant lookup time."

Is "dynamic perfect hashing methods" a better name for the entire category of algorithms that have guaranteed O(1) lookup times even when under active attack -- a category I previously called "temporary secret algorithms" -- a category that includes cuckoo hashing, hopscotch hashing, etc.? --DavidCary (talk) 15:52, 10 May 2017 (UTC)

O(1) in worst-case or amortized sense?

Hi, a recent edit claims that hash tables (HT) allow aritrary insertions and deletions in O(1) worst-case sense. I am thinking of usual implementations where the table is doubled/halved whenever the load factor gets above/beyond fixed thresholds. For these implementations the insertion cost is O(n) worst-case but O(1) only in amortized sense.
Perhaps the editor is thinking of implementations that keep 2 (or 3) arrays and spread out the copying load over many ops? Those would indeed have O(1) worst-case insertion cost in theory, but the overhead seems quite large, and I wonder whether such solutions are really practical. Are they? If not, methinks that the claim of O(1) worst-case in that section would be misleading, since the edited sentence is about practical advantages of HTs. All the best, --Jorge Stolfi (talk) 03:30, 27 May 2009 (UTC)

No, that was not claimed at all. The issue was that "amortized" should not be used together with "average" and "O(1)", because amortized analysis is based on worst-case, not average case. Insertion in a hash table, even if the table size never changed (we are not talking about that issue), is definitely not O(1) worst-case, amortized or not. Looking at the article, someone might conclude that it would be okay to say that insertion in a hash table was "amortized O(1)", which is definitely false. --76.173.203.58 (talk) 10:16, 27 May 2009 (UTC)

Oops,yes, I forgot for a moment that the O(1) time for a "normal" (non-resizing) insertion is only average/probabilistic, not worst-case.
However, the word "amortized" (not "worst-case amortized") is still quite appropriate here, in theory and practice. Let's define an (n,m,a,b)-table as a hash table with dynamic resizing by a factor of 2, contaning n items, with m array slots, such that 0 < a ≤ n/m ≤ b < 1. Let's say that a constant, in this context, is a number that does not depend on m or n (but may depend on a and b) Let's also define a random hash function as a function chosen from some large set, in such a way that it may map each item to any existing slot with uniform and independent probabilities. Then:

(1) If (n+1)/m ≤ b and the hash function is random, the expected cost of an insertion in an (n,m,a,b)-table is bounded by a constant A.

(2) If (n+1)/m > b and the hash function is random, the expected cost of an insertion in an (n,m,a,b)-table is at most B + C n, where B,C are constants.

From (1) and (2) alone, one would conclude only that

(3) For any sequence of k distinct items, if the hash function is random and independent of the sequence, then the expected total cost for inserting those items into a (0,1,a,b)-table will be at most B k + C k²/2.

But this is is obviously a very pessimistic estimate. By considering how the table state evolves during those operations, we can get

(4) For any sequence of k distinct items, if the hash function is random and independent of the sequence, then the expected total cost of inserting those items in a (0,1,a,b)-table will be at most D k for some constant D.

or, more geerally

(5) For any sequence of k operations (insertions,deletions, lookups), if the hash function is random and independent of the sequence, and a << b/2, then the expected total cost of performing those operations in an (n,m,a,b)-table will be at most D (k + n) for some constant D.

Thus, if one allows arbitrary insertions and deletions, with dynamic resizing, one cannot claim that the expected cost for *one* *specific* insertion is O(1), even averaging over all hash functions. If the load factor is close to the threshold, the expected cost will be proprtional to the table size.
The cost is O(1) only if averaged over a long sequence of operations. To be precise, the expected cost (over all hash functions) is a term that depends on the initial table state, plus an expected cost per operation that is O(1) only when averaged over all operations in the sequence. To prove (4) and (5) one must use the same reasoning used in amortized worst-case analysis. Indeed, "amortized sense" does not imply "worst-case", it means simply that the analysis is applied to operation sequences rather than single operations.
And this was the intended meaning of the original claim that properly dimensioned hash tables provide "constant cost per operation, in the average (indeed, amortized) sense". Methinks that this statement is as correct and informative as one can hope to be in a lead section. All the best, --Jorge Stolfi (talk) 22:14, 27 May 2009 (UTC)

tl;dr. You can't polish a turd, amortised is completely the wrong word.- (User) Wolfkeeper (Talk) 22:30, 27 May 2009 (UTC)

Incidentally, for a properly working hash table with a thousand entries in and 75% full, O(n) is not worse case; it's never-going-to-happen case. The chances of hitting that case is so low my calculator cannot calculate it; it would not happen once in the life of the universe. The worst conceivable case is still O(1).- (User) Wolfkeeper (Talk) 22:36, 27 May 2009 (UTC)

Your aggressive and dismissive stance doesn't help your case. Worst-case analysis is based on the worst case, however improbable it may be. It's not meant as a personal insult to hash tables, or even a practical consideration. There's no such thing as the "worst conceivable case" - you can talk about the 99th percentile case if you want, but it's not going to differ asymptotically from the average case. We also have to be very clear here about what operations are being counted - even in the average case, the O(1) is counting hash computations and table lookups each as constant-time operations. Dcoetzee 23:13, 27 May 2009 (UTC)

Hash tables are statistical algorithms that have an expected run time of O(1) as n->infinity. The worst case never happens in practice unless you have a buggy implementation. It doesn't matter what the 'worst case' is; and amortized analysis assumes worst case.- (User) Wolfkeeper (Talk) 00:28, 28 May 2009 (UTC)

Regarding the argument at hand: I believe the amortized cost of insertion differs depending on exactly what you're counting and whether or not the table is dynamically expanding. We really need to find an appropriate reference that goes into detail about this. Dcoetzee 23:41, 27 May 2009 (UTC)

People want to know how hash tables behave in the real world. The amortized cost isn't it.- (User) Wolfkeeper (Talk) 00:28, 28 May 2009 (UTC)

Sigh. Dynamically expanding tables are also O(1).- (User) Wolfkeeper (Talk) 00:30, 28 May 2009 (UTC)

(unindent) Please stop the condescension, it's rude and annoying. Currently, this article contains no analysis whatsoever. The article should describe how hash tables perform under various models of analysis, including both the worst-case model where the hash function is unspecified and the more realistic model where it is a random variable over a sensible (say, uniform) distribution. Because the actual hash function is a parameter specified by programmer, these are useful in establishing bounds on performance and emphasizing the impact of hash function choice on performance. And as you know, the linear worst case performance emerges not from geometric table expansion, but from collisions. I'll add something, and I really hope I won't be facing resistance about it. Dcoetzee 02:19, 28 May 2009 (UTC)

I don't plan on resisting at all; unless it's the bunch of unreferenced OR it sounds like, in which case I won't resist, I'll just revert it wholesale.- (User) Wolfkeeper (Talk) 03:25, 28 May 2009 (UTC)

Okay. I went ahead and added some stuff with a source and you can edit it if you want. I understand your point of view about wanting to convey the practical concerns of hash tables that programmers need to worry about, but I'm also interested in speaking to other audiences such as researchers who wish to model hash tables formally. I hope you understand why I was frustrated with the way you treat people - you may disagree with other users on matters of presentation, or they may misunderstand characteristics of hash tables, but I still think it's worthwhile to take the time to listen to and address their concerns. Dcoetzee 04:15, 28 May 2009 (UTC)

I can't understand Wolfkeeper's claim that "O(n) is never-going-to-happen case". With dynamic resizing (which is what is under discussion, and how most "general purpose" hash tables are implmented), resizing will happen as soon as one inserts more than the threshold number of items; and the cost of resizing is O(n), because all items have to be copied to the new table, and (in typical designs) rehashed too. If n is 750,000, m is 1,000,000, and the threshold is 75%, the expected cost of the next insertion is 750,000, not 1. If you start with 749,000 items, and do 2000 insertions, for the fist few ones the expected cost is 1, but you can bet that one of them will cost 750,000 units. Still the expected cost of the whole sequence does not exceed 740,000+2000 = 751,000 units --- a lot more than 2000×1 = 2000, but a lot less than 2000×750,000 = 1,500,000,000. And that, in very practical terms, is what the word "amortized" means. --Jorge Stolfi (talk) 06:55, 28 May 2009 (UTC)

Because in the general case, you use incremental resizing. That does not have 750,000 resizings in a single step, it has for example, 1, and the next 1,000,000 steps will have one also.- (User) Wolfkeeper (Talk) 22:43, 26 July 2009 (UTC)

That is not how dynamic resizing works. Resizing the table, by any amount, has a cost proportional to n, because the whole table must be copied (and rehashed) to a new location. That is how memory allocation works in any platform (including C's realloc, if that is what you have in mind). So if one resizes the table by 1 after each insertion, the cost will be n per isertion, or n^2 for inserting n items. In dynamic resizing, the table size is doubled whenever the load factor reaches a certain threshold. So the 750,000th insertion costs 750,000, but the next 750,000 insertions will cost 1 each. That is why one needs amortized analysis. All the best, --Jorge Stolfi (talk) 12:03, 27 July 2009 (UTC)

No, that's not what I mean, read: Hash_table#Incremental_resizing, when you trigger a resizing, you allocate the array, but don't transfer the hashed data across immediately, instead you do it one at a time with each access. When the old table is empty then you deallocate it. This avoids the 750,000 operation insertion and gives only small statistical variations for any access.- (User) Wolfkeeper (Talk) 12:25, 27 July 2009 (UTC)

Should this article say the "stop everything and rehash the entire table all at once" scheme (where occasionally an insert requires O(n) time) is a naive scheme never recommended for use in practice -- a scheme only used by teachers as an over-simplified example of some important parts of hash table algorithms, before getting into the complicated details of practical hash table algorithms, analogous to the way electronic codebook and overlapping subproblems start with simple examples that are never recommended for use in practice?

Is the term dynamically expanding tables mentioned by Wolfkeeper a good name for the entire category of hash table algorithms that expand the memory used at runtime as necessary, including many that are O(1) for each and every insert (not merely amortized)? --DavidCary (talk) 17:20, 10 May 2017 (UTC)

Nest Robin Hood hashing and 2-choice hashing under Open Addressing?

Right now Robin Hood hashing and 2-choice hashing are listed as a direct children of "Collision Resolution". They seems to fit the definition of Open Addressing resolution strategies. Since all other open addressing strategies are listed under the Open Addressing subsection, shouldn't these two techniques be moved there too?

Andreas Lundblad (talk) 07:13, 10 July 2019 (UTC)

"Hashing" section needs improvement

The section needs to justify the use of a hash function: Why is a hash function even needed? Why not, for example, just store with each key the address of (pointer to) the value?

"The idea of hashing is to distribute the entries (key/value pairs) across an array of buckets."

The diagram above shows buckets separate from the keys, but this sentence seems to say each (full) k-v pair is in a bucket.

"Given a key, the algorithm computes an index that suggests where the entry can be found: [...]"

"Suggests"? Surely that's far too gentle a word. How about "gives" or "tells"?

"A non-uniform distribution increases the number of collisions and the cost of resolving them."

Given the definition of collisions in the introduction, it's not clear why they are a problem that requires resolution. It's also not clear how uniformity has anything to do with them (apart from the mysterious hash function); it would seem that either two keys point the same value or they do not.

JKeck (talk) 17:14, 31 July 2013 (UTC)

JKeck: I'm not sure if you really have no idea how hash tables work, as it seems to be in reading your first sentence, or if the "Hashing" section really needs a better explanation. Of course linear search (unsorted) or binary search (if sorted) is slower than a lookup through a hashed index and then linear search through the collisions.

2nd problem: keys vs bucket vs entries. Each bucket stores one entry or a linked list of entries. Each entry stores k-v pairs. Both, the sentence and the graphics do make sense. I have no idea how to improve the wording to make it clearer to confused people. More graphics probably. 3rd: "Suggests" is a proper word, since the key is not always found at the calculated index. Sometimes it has to search for more locations, either in the same array (open addressing) or in a separate bucket list of colliding entries. Also your forth complain makes not much sense. It's very "clear why there is a problem that requires resolution", dependent on the quality of the initially calculated index, which is dependent on the hash function, the fill factor and the uniformity of the distribution. ReiniUrban (talk) 16:47, 28 August 2016 (UTC)

I'm with JKeck. The section is unscholarly, and "suggests" is an unscholarly word. Hashing is a time/storage space trade-off: it takes a very long time to search a long list, and it takes a very large amount of storage to address directly with say, a 64-bit key (~10^19 slots). Hashing allows data storage and retrieval in a small and nearly constant amount of time, and storage only fractionally larger than that required for the records + keys themselves. The primary requirements for a hash function are efficiency and uniformity. Since the section on Choosing a hash function doesn't mention efficiency, one may presume that a hash function that takes a geological amount of time to compute won't be a problem, or that in any case, one doesn't need to worry about such things. Perfect hashing is a combinatorial curiosity only, and given the paucity of real information in that section, it ought to just go. I could go on and on, but I'll stop here to more carefully read over the entire article, which sort of just rambles on. Sbalfour (talk) 20:30, 24 September 2019 (UTC)

Robin hood hashing - clarity

Robin hood hashing is entirely analogous to the three ways the chains may be ordered in chained hashing: as unordered lists, as serially ordered lists (alphabetically or by some other key-dependent criteria), or as self-ordering lists by probe frequency (i.e. key-independent criteria). So, we reshuffle the bucket sequences for collided keys just as we reshuffle the chains. Yeh? I think we lost sight of the forest for the trees on this one. Sbalfour (talk) 22:38, 24 September 2019 (UTC)

Hash table and binary tree in one data structure

Are there any references on a data structure that implements a hash table and binary tree at the same time? MegaHasher 20:04, 3 August 2007 (UTC)

Um, huh? It's not clear to me why you'd want to do that, or how that even makes sense. Can you expand a bit on why you're asking? Dcoetzee 02:45, 4 August 2007 (UTC)

Seems that a treap could be combined with a hash table. I wasn't able to come up with any citations though. MegaHasher 06:11, 14 August 2007 (UTC)

I still have no idea what you mean. Combined in what way, for what purpose? Dcoetzee 08:13, 14 August 2007 (UTC)

to provide O(1) read access, and ordered traversal at the same time MegaHasher 08:19, 14 August 2007 (UTC)

You could certainly do that by merely keeping a hash table and a binary tree at the same time with the same contents - but on modern systems, I don't think the lookup cost of hash tables compared to an effective cache-aware tree data structure is significantly better (it may even be worse). Considering the space overhead and time overhead for other operations, I wouldn't consider this an effective solution. Dcoetzee 21:50, 20 August 2007 (UTC)

I think the ordered searching and hashing ideas can be combined in the hash table algorithm itself, as seen in this paper, but it does not seem to be exactly what you want. Also, instead of linked lists for buckets, you can use any other data structure that implements the necessary operations, even trees, but that may still be not what you want. Honeypot95 (talk) 12:29, 20 April 2020 (UTC)

circular reference

Recently someone used the "Inside the latency of hash table operations" by Andy Ke as a reference for this article. I feel this is perilously close to becoming a WP:REFLOOP, because many sentences and illustrations in that article are a word-for-word identical to (an earlier version of) this "hash table" Wikipedia article. (A few of of those sentences are sentences I wrote for Wikipedia). What should we do about this? --DavidCary (talk) 23:24, 6 May 2020 (UTC)

Open hashing

I'm considering merging the open hashing article with this article, as has already been done for closed hashing. For that I think we should rename the section on Chaining to Open hashing and make more clear the already mentioned fact that linked lists are only one way, though perhaps the most popular one, of organizing hash buckets and introduce a mention of buckets allocated in a contiguous memory/disk space that is presently discussed in open hashing. Jyotirmoyb 05:16, 20 February 2007 (UTC)

I agree -- let's keep all the different types together in this one article until the article gets too big. At that time, it will (hopefully) be easier to look at the article and decide the "natural" breaking points to split it up into multiple articles. Big Buckets First. --68.0.120.35 07:42, 5 March 2007 (UTC)

I agree, too. It makes more sense to contrast "closed hashing" with "open hashing". I'd like to see the two sections so labeled.Anjin\\^talk 02:23, 25 April 2007 (UTC)

I agree, too. I can see no good reason why not to incorporate such a strongly related topic into this article. I find the question of length of the article largely irrelevant as long as the topics in it are relevant. I'd rather read one long coherent article (possibly skipping sections) than having to jump back and forth in Wikipedia. My 5 cents... TFJ 09:54, 14 June 2007 (UTC+1)

Makes sense - can't have it both ways - with closed hashing merged (thought it's hard to find any content called "closed hashing" in the article) and open hashing in its own article. On the other hand, there is no problem with an article on "open" and "closed" and other types of hashing if there is enough content to justify. Does wikipedia ave guidelines on article length??--121.45.246.110 14:10, 9 April 2007 (UTC)

I agree with all this, because open hashing is simply irrelevant in any context outside of hash tables. Dcoetzee 01:13, 15 June 2007 (UTC)

The Wikipedia guideline on article length is the Wikipedia: article size guideline. --DavidCary (talk) 20:02, 18 November 2021 (UTC)

Concrete example needed

I spent several hours documenting a simple example of a hash table ([1]), only to have it deleted one week later with the edit summary "rm unsourced code, we don't need that here". I agree that it's unsourced, but I disagree with the assessment that we don't need a code example here. I'll never forget my first visit to this article many years ago -- before I had experience with hash tables -- and found the article seriously lacking because it didn't show a single functional example of an actual hash table. I realize that my example may not be acceptable because -- even though I personally tested it and determined that it functions exactly as explained -- it might be considered OR and therefore verboten. Having said that, I don't really care if my example is restored or some other example is used instead, but I would argue that simple functional example is essential to help explain this topic. Lambtron talk 22:50, 7 February 2022 (UTC)

You shouldn't be using your favorite programming languages for writing samples; it should be written in language agnostic way at its dedicated article. If you found a C example to be helpful, maybe you can host a blog or take private notes. But such things aren't utilitarian in our Wikipedia articles. WikiLinuz🍁(talk) 23:52, 7 February 2022 (UTC)

Thanks for the friendly and supportive reply. I should have known that simple C examples are not useful in Wikipedia, and that only language agnostic examples are permitted. Per your advice, I will abandon my unworthy attempts to improve your article and instead start a blog. Hopefully, someone more enlightened will -- at some time in the future -- add an acceptable, actual example to help explain this topic. Lambtron talk 14:32, 8 February 2022 (UTC)

You're welcomed to try again at Perfect hash function article, but not in C. There are language agnostic ways to write pseudocode that involve bitwise operations. If your example was notable enough, it wouldn't be hard to find a scholarly journal article where the code might be discussed (and we could source it from there). WikiLinuz🍁(talk) 15:03, 8 February 2022 (UTC)

It's not the approach I would have taken, but I can see the appeal of simply deleting my contribution rather than working collaboratively to improve it. For example, I would have used my knowledge of "Wikipedia-mandated" language agnostic coding to translate the code -- a job an expert could probably do in a few minutes -- and left intact the explanation. But that's just me: averse to throwing out the baby with the bathwater. BTW, it's not clear to me why you think this basic example is not appropriate for this article, or why a "scholarly journal article" is required for the simple calculations involved. Frankly, I find it a bit frustrating that you seem opposed to having an easily understood example here when there's such an obvious and compelling need for one. Lambtron talk 16:51, 8 February 2022 (UTC)

Hash table (and other data structure-related articles) are candidates for WP:GA. Your source code should be sourced, and "trust me bro" isn't constructive; it's also WP:UNDUE in this article, that's why I said that you could try again at the dedicated article. A source code certainly doesn't fall under WP:Routine calculation. Wikipedia is not a repository, if you really want C code samples, maybe WikiBooks is appropriate for you. WikiLinuz🍁(talk) 17:06, 8 February 2022 (UTC)

You have successfully found 10000 reasons for excluding examples from this article (apologies to Edison). Apparently examples in specific languages are prohibited; these must be translated to pseudocode before they can be considered for inclusion. The translated code would require a citation from a scholarly journal, because a straightforward explanation alone cannot be trusted. Even after translating and citing, a sole example is inherently WP:UNDU because it implements a particular type of hash function, and therefore it cannot be used here, to explain hash tables. Finally, after clearly stating that my goal is to have a basic illustrative example in this article, you flippantly assume that what I really want is C samples -- and suggest that I visit WikiBooks to satisfy that need. I suggest a different approach: proactively work with me to enhance this article with a simple example that helps to make hash tables easier to understand. Lambtron talk 19:47, 8 February 2022 (UTC)

Like I precisely stated, if you can't cite a scholarly source, and wanted me to comply with your "trust me bro" policies, we have nothing to talk about. Any user, including an IP, can challenge an unsourced OR. If you find a source that implements your example in XYZ programming language, and you translate that into pseudocode, it isn't OR in this case. You cannot publish your own thought, no matter how much you're in love with the example. I'm all ears if you could find a source.
straightforward explanation alone cannot be trusted - WP:VERIFYOR WikiLinuz🍁(talk) 20:01, 8 February 2022 (UTC)

I'm glad to hear that you've been listening, with open mind, to my concerns about the need for a simple practical example to meet readers' needs. Lambtron talk 04:36, 9 February 2022 (UTC)

Collision Resolution Chart is Original Research

Sometime in the past the following chart/image was added to the article:

. There are several problems with this graphic but the biggest is that by its own description it appears to be original research. Per the WP:NOR policy, no original research is allowed in Wikipedia articles. If the poster (User:Dcoetzee, apparently) would like to update it with a reliable, published source for its specific data, then I will withdraw my objections. Otherwise I will remove it in the near future. — Preceding unsigned comment added by RBarryYoung (talk • contribs) 14:51, 9 February 2022 (UTC)

@RBarryYoung: It was added by Amakuha in this revision diff. WikiLinuz🍁(talk) 21:16, 9 February 2022 (UTC)

Wikipedia doesn't require references for diagrams. What's your specific concern with the accuracy of the diagram? GliderMaven (talk) 21:33, 9 February 2022 (UTC)

My specific concern is that it is a violation of the WP:NOR policy which I have read and which has no exceptions for diagrams in general let alone for charts of quantitative data that makes specific claims which this graphic definitely does. If you know of some rule that says that quantitative data charts are allowed to be based on original research then please point me to it. Otherwise it is in violation of Wikipedia policy. RBarryYoung (talk) 12:24, 11 February 2022 (UTC)

@RBarryYoung and WikiLinuz: This diagram is not meant to be precise. It just explains the general tradeoff between separate chaining and linear probing. I think, it's pretty self-evident and useful.

Though, "separate chaining with head records in the bucket array" looks to be an optimal way, which combines the best of the two approaches. --Amakuha (talk) 07:07, 10 February 2022 (UTC)

Nowhere in its presentation in the article is it qualified as being imprecise. And in fact is makes very precise claims (irrespective of the accuracy of those claims) and comparisons about the performance of two different implementation methods. And regardless of that, it doesn't matter whether I agree or disagree with its claims, it is original research and does not belong here. RBarryYoung (talk) 12:24, 11 February 2022 (UTC)

Advantages of "array hashing"

Array hashing (separately chained hashing using dynamic arrays to stroe the buckets) is advantageous only in some limited conditions. If the load factor is low (< 1) and the entries are well-distributed, most buckets have at most one entry, so the cache provides little benefit. Also the dynamic arrays usualy require a "bucket length" field; if there is only one entry, then this extra field takes away the memory saved by omitting the chain links. On the other hand, if the load factor is high (>>1), and the bucket arrays are resized by doubling/halving, then the vacant slots will waste more memory than the chain links would. All the best, --Jorge Stolfi (talk) 23:47, 24 April 2009 (UTC)

Dynamic arrays only require a "bucket length" field when they store 32-bit or 64-bit integer keys. It is not required for variable-length null-terminated strings (since strings are length-encoded).

The advantage of the array hash table is that it can scale much more efficiently than a chained hash table, with respect to both time and space. If a very low load factor is used (say 16 million slots and only 1 million keys), and considering that 32-bit integer keys are used, then the array hash table will consume no more space than the chained hash table, while still rivaling its speed. Once more keys are inserted, the array hash table will start to save space over the chained hash table, while retaining high access speed. The chained hash table, on the other hand, will start to consume more memory and slow down --- eventually, performance will get so bad that you will need to resize the chained hash table, further consuming more space and computing resources. This is not the case with the array hash table, which has been shown to scale well as the load factor increases, making it a more practical choice in a dynamic environment.

When variable-length string keys are used (as was the original design of the array hash table), even under low load, the array hash table will consume less space than the chained hash table, because it is cheaper to store a string in a dynamic array, then it is in a node of a linked list (as a result of memory allocation overheads and the requirement of a next-link pointer).

Also, the dynamic arrays are not resized in by doubling or halving. Dynamic arrays are grown in an exact-fit manner --- so they are only grown by as many bytes as needed, which is both fast and memory efficient.

cheers. —Preceding unsigned comment added by Hetori (talk • contribs) 23:26, 11 June 2010 (UTC)

Dear readers,

Do any of you have any references to things which "are grown in an exact-fit manner", which Hetori calls "dynamic arrays"?

Those sound like they would be useful for a lot of memory-constrained things.

The static array article seems to imply that any kind of array that can grow is a "dynamic array".

Alas, the dynamic array article currently seems to imply that "dynamic arrays" always resize by doubling or halving or some other multiplicative amount; never "grown in an exact-fit manner". --DavidCary (talk) 05:55, 18 February 2022 (UTC)

@DavidCary: Interesting question. AFAIK, you can achieve that using realloc() POSIX compliant C API. The implementation would involve you calling realloc() every time you add or remove an element into an array. However, the performance of that operation would be dreadful, since there are multiple memory reallocations involved. I couldn't think of any other implementation that doesn't involve realloc(). WikiLinuz🍁(talk) 06:23, 18 February 2022 (UTC)

@DavidCary: So, if I understand your question correctly, a simple implementation to insert an element into an "exact-fit array" would look something like this:

#include <stdlib.h>
typedef struct {
    void **array_ptr; // underlying array
    size_t array_current_size; // size of allocation of `array_ptr` pointer
} array_structure;

int insert_new_element(array_structure *array, void *new_element) {
    array->array_ptr = realloc(array->array_ptr, ++array->array_current_size * sizeof(void*));
    if (array->array_ptr != NULL) {
        array->array_ptr[array->array_current_size - 1] = new_element;
        return 0;
    }
    return -1;
}

WikiLinuz🍁(talk) 06:33, 18 February 2022 (UTC)

Thank you. This is the thing I was looking for. What is the name of this thing? I added a brief mention of this kind of exact-fit array -- including the "realloc" that User:WikiLinuz mentioned -- at Dynamic_array#Variants. (I also added a reference to an article that, like User:WikiLinuz, *shows* some code that implements the same kind of thing -- but, alas, I don't see it *name* this kind of thing). But I feel like this is common enough (even if common only on teacher's blackboards as a step towards geometric-growth dynamic arrays, even if rare in actual production code) that this should be added to the {{List_data_structure_comparison}} template table -- as soon as we find a name for it. --DavidCary (talk) 07:00, 23 February 2022 (UTC)

@DavidCary: Glad you found it useful. But this is mostly called "exact-fit vector/array" - there isn't any "formal convention", at least I wasn't aware of one. This type of exact-fit container is very specialized and only used in applications like embedded systems where memory is the superior priority compared to performance, as this type of realloc() would result in worst runtime performance, especially if add/remove are frequent operations on the container. WikiLinuz🍁(talk) 07:26, 23 February 2022 (UTC)