← Home

learnDataStructures

CSC/CPE 202

Hash Tables (Set & Map)

A hash table is a data structure that lets you store and retrieve items extremely quickly, usually in constant time on average. The basic idea is simple: each item has a key, and a hash function turns that key into a number. That number is then reduced with modulo (hash(key) mod m) to decide which bucket (or slot) in an underlying array the item belongs to. For example, the string "cat" can be hashed into a number and placed in a specific bucket of the table.

Because the number of possible keys is much larger than the number of available buckets, collisions are unavoidable — different keys may end up in the same bucket. To handle this, hash tables use strategies such as separate chaining, where each bucket points to a small list of entries that share that slot, or open addressing, where items are stored directly in the array and collisions are resolved by probing forward until an empty slot is found.

Hash tables can operate in two modes: a hash set, which stores unique keys only (e.g., {"cat", "dog"}), or a hash map, which stores key–value pairs (e.g., {"cat": 3, "dog": 5}). To keep performance fast, hash tables also monitor their load factor (the ratio of items to buckets). If it grows too large, the table is resized and all items are rehashed into a bigger array.

The strength of hash tables is their speed: insertion, lookup, and deletion are typically O(1) on average. They are widely used in applications that need fast membership checks and quick key-based lookups, such as dictionaries, compiler symbol tables, or web caches. However, they are not ideal when data must remain in sorted order or when range queries are required — in those cases, balanced trees or ordered maps are a better choice.

Hash table set vs map

  • Hash Set: stores unique keys only (e.g., { "cat", "dog" }). Operations use just the key.
  • Hash Map: stores key → value pairs (e.g., { "cat": 3, "dog": 5 }). On insert, an existing key updates its value.
  • In this page’s simulator, switch between Set and Map. In set mode the value field is ignored.

Key ideas

  • Load factor α = size / buckets. Keep α low (≈0.75 for open addressing; 1–5+ okay for chaining) to maintain O(1) average time.
  • Rehashing: growing the table (more buckets) and reinserting all items when α exceeds a threshold.
  • Complexity (average): insert/find/erase are typically O(1); worst-case is O(n) if everything collides or a pathological hash is used.

When to choose a hash table

  • Fast membership tests, counting, indexing by unique keys, symbol tables, caches, deduplication.
  • Avoid when you need sorted iteration or range queries — use balanced trees / ordered maps.

Real-world Examples

  • Dictionaries (word → definition)
  • Compiler symbol tables
  • Web cache (URL → response)

Hash Table — tiny reference snippets



        

Operations (pseudocode)


        
Pseudocode shows both separate chaining and open addressing (linear probing). Deletion in open addressing uses tombstones.

Interactive hash-table simulator

size: 0 · buckets: 0 · α: 0.00 · ops: 0
Chaining: α allowed > 1 Probing: α target ≈ 0.75

Implementations & Complexity

Time complexity (average, good hash)

Operation Separate chaining Open addressing (linear) Notes
insert/put(k,v) O(1) O(1) Rehash when α high; linear probing clusters as α→1
find/get(k) O(1) O(1) Expected probe length ~ 1/(1-α) for linear probing
erase(k) O(1) O(1) Open addressing uses tombstones to maintain probe sequences
rehash O(n) O(n) Resizes are occasional; amortized costs stay O(1)

Implementations by Language

Python

Built-ins: set (hash set), dict (hash map)
S = {"cat", "dog"}           # hash set (unique keys)
M = {"cat": 3, "dog": 5}     # hash map (key → value)
"cat" in S                   # O(1) average
M.get("cat", 0)              # 3
M["dog"] = 6                 # update

Java

HashSet<E> and HashMap<K,V>
import java.util.*;

HashSet<String> S = new HashSet<>();
S.add("cat"); S.add("dog");
boolean has = S.contains("cat");

HashMap<String,Integer> M = new HashMap<>();
M.put("cat", 3);
M.put("dog", 5);
int v = M.getOrDefault("cat", 0);

C++

std::unordered_set and std::unordered_map
#include <unordered_set>
#include <unordered_map>
using namespace std;

unordered_set<string> S;
S.insert("cat"); S.insert("dog");
bool has = S.count("cat");

unordered_map<string,int> M;
M["cat"] = 3; M["dog"] = 5;
int v = M.at("cat");
  • Custom hashing: For compound keys, provide a good hash and equality. Aim for uniform distribution.
  • Iteration order: Typically unspecified for hash tables (Python dict preserves insertion order by design; still O(1) ops).

Conceptual Questions

  1. Why does load factor affect performance differently in chaining vs open addressing?
  2. What's the difference between a hash set and a hash map?
  3. How would you choose table size when rehashing? Why are primes or powers of two common?
  4. What makes a hash function “good” for a given key distribution?
  5. When would you prefer a hash set over a balanced tree set and vice versa?