目录

Searching


Data Structure

Searching

Read Map

Static Search Table

Sequential Search

Binary Search

Indexing Search

Binary Search Trees (Dynamic Search Table)

Binary search tree

AVL tree

B-Tree

Hash Table

What is hashing

Hash Function

Collision Resolution

Closed Hashing

Open Hashing

Analysis of Hashing

Search Table

Definition

A set of the same type of data elements

Key

The value of data item in the data element. It is used to identify the data elements

Primary Key and Second Key

Searching

Based on the search value of key find out the element whose key value is same as search value

It returns the position of the element located in

Operations on searching table

  1. Search a given element in the search table

  2. Get attributes of a given element

  3. Insert an element into the search table

  4. Delete an element from the search table

Static Search Table

Only do search on the search table

Dynamic Search Table

Need do search and insertion and deletion on the search table

Basic concepts

Given: Distinct keys $k_1 , k_2 , …, k_n$ and collection T of n records of the form $((k_1 , I_1 ), (k_2 , I_2 ), …, (k_n , I_n ))\newline$

where $I_j$ is the information associated with key $k_j\ for\ 1 \leq j \leq n\newline$

Search Problem

For key value K, locate the record $(k_j , I_j)$ in T such that $k_j = K\newline$

Searching is a systematic method for locating the $record(s)$ with key value $k_j = K\newline$

Successful vs Unsuccessful

A successful search is one in which a record with key $k_j = K$ is found

An unsuccessful search is one in which no record with $k_j = K$ is found (and presumably no such record exists)

We often ask how many times one key is compared with another during a search.

This gives us a good measure of the total amount of work that the algorithm will do

  1. Sequential and list methods (lists, tables, arrays).

  2. Tree indexing methods

  3. Direct access by key value (hashing)

Static Search Table

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
constexpr auto LIST_SIZE 20;

typedef struct {
	KeyType key;
	OtherType other_data;
} ElemType;

typedef struct {
	ElemType *elem;
	int length;
} SSTable;
General Idea

Begin at one end of the list and scan down it until the desired key is found or the end is reached

A successful search returns the position of the record

An unsuccessful search returns 0

1
2
3
4
5
6
int S_search(SSTable ST[], KeyType K, int n) { 
    int i{};
    ST[0].key = K;/*a sentinel,storage the Key Value*/
	for (i = n; ST[i].key != K; i --);/*compare backwards*/
	return i;/*return the position, or 0*/
}
Analysis

To analyze the behavior of an algorithm that makes comparisons of keys

we shall use the count of these key comparisons as our measure of the work done

Average Search Length(ASL)

$$ ASL=P_1C_1+P_2C_2+…+P_nC_n=\sum_{i=1}^nP_iC_i\newline $$ $Where\ P_i\ is\ probability(frequency)\ of\ search\ i^{th}\ record\newline$ $and\ C_i\ is\ the\ count\ of\ key\ comparisons\ when\ search\ it\newline$

$for\ sequential\ search$ $$ \begin{align} &success\rightarrow\newline &best\ case:\ 1\ comparison\newline &worst\ case:n\ comparisons\newline &average\ case:ASL_{sq}=\sum^n_{i=1}P_iC_i=\sum^n_{i=1}\frac {1}{n}(n-i+1)=\frac {n+1}{2}\newline\newline &unsuccessful\ search:\rightarrow n+1\ comparisons\newline \end{align} $$

Ordered List

An ordered list is a list in which each entry contains a key, such that the keys are in order

That is, if entry $i$ comes before entry $j$ in the list, then the key of entry $i$ is less than or equal to the key of entry $j$ .

Idea

In searching an ordered list, first compare the target to the key in the center of the list.

If it is smaller, restrict the search to the left half. otherwise restrict the search to the right half

and repeat. In this way, at each step we reduce the length of the list to be searched by half

Initialization: Set Low=1 and High= Length of List

Repeat the following $$ \begin{align} &If\ Low > High, Return\ 0\ to\ indicate\ Item\ not\ found\newline &mid=\frac{low+high}{2}\newline &if\ (K<ST[mid])\rightarrow high=mid - 1\newline &else\ if\ (K>ST[mid])\rightarrow low = mid + 1\newline &else\ return\ mid\ as\ location\ of\ the\ target\newline \end{align} $$

code

URL $\rightarrow$ https://zhuanlan.zhihu.com/p/565438258

$Three\ editions:\ 《Data\ Structure》\ from\ Tsinghua\ University\ -\ Deng\ Junhui\newline$

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
/******************************************************************************************
 * Data Structures in C++
 * ISBN: 7-302-33064-6 & 7-302-33065-3 & 7-302-29652-2 & 7-302-26883-3
 * Junhui DENG, deng@tsinghua.edu.cn
 * Computer Science & Technology, Tsinghua University
 * Copyright (c) 2003-2021. All rights reserved.
 ******************************************************************************************/

#pragma once

//[lo, hi) and 0 <= lo <= hi <= _size
template <typename T> static Rank binSearch ( T* S, T const& e, Rank lo, Rank hi ) {
   while ( lo < hi ) { 
      Rank mi = ( lo + hi ) >> 1; 
      if      ( e < S[mi] ) hi = mi; 
      else if ( S[mi] < e ) lo = mi + 1; 
      else                  return mi;
   } 
   return -1; 
} 
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
/******************************************************************************************
 * Data Structures in C++
 * ISBN: 7-302-33064-6 & 7-302-33065-3 & 7-302-29652-2 & 7-302-26883-3
 * Junhui DENG, deng@tsinghua.edu.cn
 * Computer Science & Technology, Tsinghua University
 * Copyright (c) 2003-2021. All rights reserved.
 ******************************************************************************************/

#pragma once

//[lo, hi) and 0 <= lo < hi <= _size
template <typename T> static Rank binSearch ( T* S, T const& e, Rank lo, Rank hi ) {
   while ( 1 < hi - lo ) {
      Rank mi = ( lo + hi ) >> 1;
      ( e < S[mi] ) ? hi = mi : lo = mi; 
   } 
   return e < S[lo] ? lo - 1 : lo; 
} 
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
/******************************************************************************************
 * Data Structures in C++
 * ISBN: 7-302-33064-6 & 7-302-33065-3 & 7-302-29652-2 & 7-302-26883-3
 * Junhui DENG, deng@tsinghua.edu.cn
 * Computer Science & Technology, Tsinghua University
 * Copyright (c) 2003-2021. All rights reserved.
 ******************************************************************************************/

#pragma once

//[lo, hi) and 0 <= lo <= hi <= _size
template <typename T> static Rank binSearch ( T* S, T const& e, Rank lo, Rank hi ) {
   while ( lo < hi ) {
      Rank mi = ( lo + hi ) >> 1; 
      ( e < S[mi] ) ? hi = mi : lo = mi + 1; 
   }
   return lo - 1; 
} 
Analysis

$$ \begin{align} &T(n)=T(\frac{n}{2})+O(1)\newline &T(n)=T(\frac{n}{2})+O(1)=T(\frac{n}{4})+O(2)=T(\frac{n}{8})+O(3)=…=O(\log n)\newline \end{align} $$

Decision Tree
definition

The decision tree (also called comparison tree or search tree) of an algorithm is obtained by tracing the action of binary search algorithm, representing each comparison of keys by a node of the tree (which we draw as a circle)

Branches (lines) drawn down from the circle represent the possible outcomes of the comparison. When the algorithm terminates, we put either F (for failure) or the location where the target is found at the end of the appropriate branch, and draw as a square

Example

$$ array:[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]\newline $$

Analysis by decision tree

$$ \begin{align} &The\ best\ case:1\ comparison\newline &The\ worst\ case:the\ depth\ of\ a\ decision\ tree\ h\newline &The\ depth\ of\ a\ decision\ tree\rightarrow h=\lfloor \log_2n \rfloor+1\newline &O(\log_2 n)\newline &Average\ case(ASL): O(\log_2 n)\newline &ASL_{bn}=\sum^n_{i=1}P_iC_i=\frac {1}{n}[1 * 2^0 + 2 * 2^1 + 3 * 2^2 + … + h * 2^{h-1}]\newline &=\frac {n+1}{n}\log_2(n+1)-1 \approx \log_2(n+1)-1=O(log_2 n)\newline \end{align} $$

Data Structure

Divided a search list R[n] into b sublist.

Each sublist may not be ordered by key, but the maximal key in the front sublist must be lower than the minimal key in the next successor list

An indexing table ID[b]. ID[i] stores the maximal key in i-th list. ID[b] is an ordered list

Idea
  1. Search in ID[b] to determine the possible sublist (Sequential search or binary search can be used)
  2. Search in R[n] (Only sequential search can be used)
Analysis

$ASL=ASL_{Index\ table}+ASL_{Sublist}\newline$

For search list of length n, it is divided into b sublist and the length of a sublist is $s = \frac {n}{b}\newline$

If using binary search in indexing table, the Average Search Length will be $log_2 (1 + \frac {n+1}{s})-1 + \frac {s + 1}{2}\newline$

If using sequential search in indexing table, the Average Search Length will be $\frac {b+1}{2} + \frac {s+1}{2} = \frac {s^2 + 2s + n}{2s}\newline$

For list of length 10000,

Sequential search: $ASL=\frac {n+1}{2}=5000\newline$

Indexing search (s=100):

$ASL= log_2 (b+1)-1+ \frac {S+1}{2} <57\newline$

$or ASL= \frac {b+1}{2} + \frac {S+1}{2} = 101\newline$

Binary search: $ASL=log_2 (n+1)-1 < 14\newline$

Summary and Comparison

Sequential Search Binary Search Indexing Search
ASL Largest Smallest Middle
List Structure Ordered Unordered Ordered Ordered indexing table
Storage Structure Array or Linked List Array Array or Linked List

Binary Search Trees (Dynamic search table)

Binary Search Tree

Definition

Binary Search Tree (BST) is a binary tree in which

  1. Every element has a unique key.

  2. The keys in a nonempty left subtree (right subtree) are smaller (larger) than the key in the root of subtree.

  3. The left and right subtrees are also Binary Search Trees.

Note that this tree is ordered!

  1. Each element to the left of the root is less than the root

  2. Each element to the right of the root is greater than the root

Example
Search path

$Search\ path\ is\ a\ path\ from\ root\ to\ a\ node.\ The\ number\ of\ comparisons\ is\ the\ level\ of\ the\ node.\newline$

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
BiTree SearchBST (BiTree T , keytype k) { 
    // T is the root of the binary search tree
	if (!T || (T->key == k)) 
        return T; 
	else if ( k < T->key) 
		return SearchBST ( T->lchild, k); //Searching in left subtree
	else 
        return SearchBST( T->rchild, k);
		//Searching in right subtree
}

The searching efficiency is affected by the shape of Binary Search Tree

The worst case(one node per level, binary search degenerates to sequential search):$ASL=\frac {n+1}{2}\newline$

The best case (n nodes should have $\lfloor \log_2n\rfloor +1$ levels) :$ASL=\log_2 (n+1)-1\newline$

Insertion into BST

When search is not successful, the new data is inserted into BST

The new node must be a leaf of BST, and it is inserted into the position of searching fault

/img/CreateBST.png
case1(image1)
Deletion from BST
  1. The deleted node is a leaf node – delete node, reset link from parent to Null

  2. The deleted node with 1 child – delete node, reset link from parent to point to child

  3. The deleted node with 2 children

    Let left subtree of P as left subtree of F;

    Or Replace node with inorder predecessor (successor) S

    Delete S (which has 0 or 1 child)

/img/DeletionFromBST1.png
case1 (image1)
/img/DelectionFromBST2.png
case2 (image2)
/img/DelectFromBST3.png
case3 (image3)
/img/DelectFromBST4.png
case4 (image4)
/img/DelectFromBST5.png
case5 (image5)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
Status DeleteBST (BiTree &T , keytype k) { 
    // If there is an element whose key-word EQ key, delete the element, then return true
    //otherwise return false
	if (!T) return false; //The key does not exist
	else {
		if(EQ(key,T->data.key)) { 
            return Delete(T);
        } 
	//Find out the element whose keywork is key
	else if(LT(key, T->data.key)) 
        return DeleteBST(T->lchild,key); 
	else 
        return DeleteBST( T->rchild, key);
} //DeleteBST
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
Status Delete (BiTree &p) { 
    // Delete node p from BST, and reconnect its right/left child node 
	if (!p->rchild) {
		q = p; 
        p = p->lchild; 
        free(q);
    }
	else if(!p->lchild) { 
		q = p; 
        p = p->rchild; 
        free(q); 
    }
	else{ 
		q = p;
		s = p->lchild; 
		while (s->rchild) { 
            q = s; 
            s = s->rchild; 
        }
		//s is the predecessor of the node p 
		p->data = s->data;
		if(q != p) 
            q->rchild = s->lchild; //connect the right subtree of q
		else 
            q->lchild = s->lchild; //connect the left subtree of q
		free(s);
    }
	return true;
} //Delete
/img/06.BST.B3.algorithm.remove/06.BST.B3.algorithm.remove_02.png
case1(image1)
/img/06.BST.B3.algorithm.remove/06.BST.B3.algorithm.remove_03.png
case2(image2)
/img/06.BST.B3.algorithm.remove/06.BST.B3.algorithm.remove_04.png
case3(image3)
/img/06.BST.B3.algorithm.remove/06.BST.B3.algorithm.remove_05.png
case4(image4)

AVL Tree

Tree Balancing

Binary Search Trees(BST) are designed for fast searching!

But the order of insertion into a BST determines the shape of the tree and hence the efficiency with which tree can be searched.

If it grows so that the tree is as “fat” as possible (fewest levels) then it can be searched most efficiently. If the tree grows “lopsided”, with many more items in one subtree than another, then the search efficiency will degrade.

For AVL(different in BST)

An AVL tree is a binary tree that is height balanced

$\rightarrow The\ difference\ in\ height\ between\ the\ left\ and\ right\ subtrees\ at\ any\ point\ in\ the\ tree\ is\ restricted\newline$

Balance Factor
Definition

The balance factor of node x to be the height of x’s left subtree minus the height of its right subtree.

$balFac(v)=height(lc(v))-height(rc(v))\newline$

An AVL tree is a BST in which the balance factor of each node is 0, -1, or 1

Example

/img/Sample Trees.png
case1(image1)
insert
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
template <typename T> BinNodePosi<T> AVL<T>::insert( const T & e ) {
    BinNodePosi<T> & x = search( e ); if ( x ) return x; //若目标尚不存在
    BinNodePosi<T> xx = x = new BinNode<T>( e, _hot ); _size++; //则创建新节点
    // 此时,若x的父亲_hot增高,则祖父有可能失衡
    for ( BinNodePosi<T> g = _hot; g; g = g->parent ) //从_hot起,逐层检查各代祖先g
        if ( ! AvlBalanced( *g ) ) { //一旦发现g失衡,则通过调整恢复平衡
            FromParentTo(*g) = rotateAt( tallerChild( tallerChild( g ) ) );
            break; //局部子树复衡后,高度必然复原;其祖先亦必如此,故调整结束
        } else //否则(g仍平衡)
            updateHeight( g ); //只需更新其高度(注意:即便g未失衡,高度亦可能增加)
    return xx; //返回新节点位置
}
remove
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
template <typename T> bool AVL<T>::remove( const T & e ) {
    BinNodePosi<T> & x = search( e ); if ( !x ) return false; //若目标的确存在
    removeAt( x, _hot ); _size--; //则在按BST规则删除之后,_hot及祖先均有可能失衡
    // 以下,从_hot出发逐层向上,依次检查各代祖先g
    for ( BinNodePosi<T> g = _hot; g; g = g->parent ) {
        if ( ! AvlBalanced( *g ) ) //一旦发现g失衡,则通过调整恢复平衡
            g = FromParentTo( *g ) = rotateAt( tallerChild( tallerChild( g ) ) );
        updateHeight( g ); //更新高度(注意:即便g未曾失衡或已恢复平衡,高度均可能降低)
    } //可能需做过(logn)次调整;无论是否做过调整,全树高度均可能下降
    return true; //删除成功
}
refactoring->”3+4”
/img/3+4.png
case1(image1)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
template <typename T> BinNodePosi<T> 
BST<T>::connect34(BinNodePosi<T> a, BinNodePosi<T> b, BinNodePosi<T> c,BinNodePosi<T> T0, BinNodePosi<T> T1,BinNodePosi<T> T2, BinNodePosi<T> T3) {
    a->lc = T0; if (T0) T0->parent = a;
    a->rc = T1; if (T1) T1->parent = a;
    c->lc = T2; if (T2) T2->parent = c;
    c->rc = T3; if (T3) T3->parent = c;
    b->lc = a; a->parent = b; b->rc = c; c->parent = b;
    updateHeight(a); updateHeight(c); updateHeight(b); 
    return b;
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
template <typename T>
static BinNodePosi<T> removeAt ( BinNodePosi<T> & x, BinNodePosi<T> & hot ) {
   BinNodePosi<T> w = x; //实际被摘除的节点,初值同x
   BinNodePosi<T> succ = NULL; //实际被删除节点的接替者
   if ( !HasLChild ( *x ) ) //若*x的左子树为空,则可
      succ = x = x->rc; //直接将*x替换为其右子树
   else if ( !HasRChild ( *x ) ) //若右子树为空,则可
      succ = x = x->lc; //对称地处理——注意:此时succ != NULL
   else { //若左右子树均存在,则选择x的直接后继作为实际被摘除节点,为此需要
      w = w->succ(); //(在右子树中)找到*x的直接后继*w
      swap ( x->data, w->data ); //交换*x和*w的数据元素
      BinNodePosi<T> u = w->parent;
      ( ( u == x ) ? u->rc : u->lc ) = succ = w->rc; //隔离节点*w
   }
   hot = w->parent; //记录实际被删除节点的父亲
   if ( succ ) succ->parent = hot; //并将被删除节点的接替者与hot相联
   release ( w->data ); release ( w ); return succ; //释放被摘除节点,返回接替者
} //release()负责释放复杂结构,与算法无直接关系,具体实现详见代码包
summary
/img/AVL_summary.png
case1(image1)

B-Tree

Goals of Indexing

  1. Store large files

  2. Support multiple search keys

  3. Support efficient insert, delete, and range queries

Difficulties when storing tree index on disk

  1. Tree must be balanced
  2. Each path from root to leaf should cover few disk pages

The B-Tree is now the standard file organization for applications requiring insertion, deletion, and key range searches

Definition

B-Tree of order m has these properties

  1. The root is either a leaf or has at least two children

  2. Each node, except for the root and the leaves, has between $\lceil \frac {m}{2} \rceil and\ m$ children

  3. The number of Keys in each node(except for the root) is between $\lceil \frac {m}{2} \rceil-1\ and\ m-1\newline$

  4. All leaves are at the same level in the tree, so the tree is always height balanced

Example

$B-Tree\ m=5\newline$

/img/B-Tree For m equal 5.png
case1(image1)
Application

B-Tree node is usually selected to match the size of a disk block

Search in a B-Tree is similar with search in BST

  1. Search keys in current node. If search key is found, then return record. If current node is a leaf node and key is not found, then report unsuccessful

  2. Otherwise, follow the proper branch and repeat the process

B-Tree Insertion

To insert value X into a B-tree, there are 3 steps

  1. using the SEARCH procedure for M-way trees (described above) find the leaf node to which X should be added

  2. add X to this node in the appropriate place among the values already there. Being a leaf node there are no subtrees to worry about

  3. if there are M-1 or fewer values in the node after adding X,then we are finished. If there are M values after adding X, we say the node has overflowed.

    To repair this, we split the node into three parts:

    1. Left: the first $\frac {M-1}{2}$ values

    2. Middle: the middle value $(position\ 1+(\frac {M-1}{2}))\newline$

    3. Right: the last $\frac {M-1}{2}$ values

B-Tree Deletion
  1. How many values might there be in this combined node?

  2. The parent node contributes 1 value

  3. The node that underflowed contributes exactly $\frac {M-1}{2}-1$ values

  4. The neighboring node contributes somewhere between $\frac {M-1}{2}\ and\ M-1$ values

case 1

Suppose that the neighboring node contains more than (M-1)/2 values. In this case, the total number of values in the combined node is strictly greater than 1 + ((M-1)/2 - 1) + ((M-1)/2)

case 2

Suppose, on the other hand, that the neighboring node contains exactly (M-1)/2 values. Then the total number of values in the combined node is 1 + ((M-1)/2 - 1) + ((M-1)/2) = (M-1)。

Hash Table

What is hashing

$Linear\ Search\ \rightarrow O(n)\newline$

$Binary\ Search\rightarrow O(\log_2n)\newline$

  1. Both depend on comparisons of item sought and elements in container

  2. Hash tables place data so that the location of an item is determined directly as a function of the item itself

  3. With a good hash function, searching a hash table takes O(1) time. that is, it is a constant and does not depend on the number of items stored.

$Address(a_i)=Hash(a_i.key)\newline$

Here

$a_i\ is\ an\ element\newline$

$Address(a_i)\ stores\ the\ address\ of\ a_i\newline$

$a_i.key\ is\ the\ key\ of\ a_i\ element\newline$

General Idea

One problem with hash tables is collisions:

$\rightarrow $when more than one item map to the same location in the hash table

$\downarrow \newline$

$\rightarrow $Some strategy is needed to resolve such collisions

The hash function and data set determine the number of collisions

The strategy used to handle collisions affect the performance of searching for an arbitrary item

Hash Function

Hashing: The process of mapping a key value to a position in a table

​ A hash function maps key values to positions in the hash table. It is denoted by h.

​ A hash table is an array that holds the records. It is denoted by HT

HT has m slots, indexed from 0 to m-1

For any value K in the key range and hash function h, h(K)=i, 0<=i <m, such that HT[i].key = K

Choosing a Hash Function
  1. A hash function should be EASY and QUICK to compute

  2. A hash function MUST return a value within the hash table range

  3. To be practical, a hash function SHOULD evenly distribute the records stored among the hash table slots

  4. Ideally, the hash function should distribute records with equal probability to all hash table slots. In practice, success depends on distribution of actual records stored

  5. If we know nothing about the incoming key distribution, evenly distribute the key range over the hash table slots while avoiding obvious opportunities for clustering

Example
  1. $Hash(key)=a * key+b\ [a, b\ are\ constants]\newline$

  2. Digital Analysis : All the keys are known in advance. Select m digits from n

Criterion: Delete the digits having the most skewed distributions

  1. $Hash(key)=key$ % $p\ p \leq m\newline$

Modular arithmetic: We may convert the key to an integer, divide by the size of the index range, and take the remainder as the result

Assuming the address range is 0 to m - 1, let p is a prime which is less than/equal m We can map the keys into addresses with the Hash function

  1. Mid-square method: Square the key value, take the middle r bits from the result for a hash table of $2^r$ slots

  2. $Hash(key)=\lfloor n * (A * key $ % $1)\rfloor \rightarrow A * key $%$ 1 = A * key - \lfloor A * key\rfloor\newline$

  3. Folding

Partition the key into several parts

All parts except for the last one have the same length

Combine the parts in a convenient way to obtain the hash address

Two possibilities:

Shift folding

Folding at the boundaries

Collisions

Given: hash function h with keys $k_1$ and $k_2$ .$\beta$ is a slot in the hash table

$If\ h(k_1 ) = \beta = h(k_2), then\ k_1\ and\ k_2\ have\ a\ collision\ at\ \beta\ under\ h\newline$

Collisions are inevitable in most applications

To use hashing we must

  1. find good hash functions

  2. determine how to resolve collisions

Store or search for the record with key K

  1. Compute the table location h(K)

  2. Starting with slot h(K), locate the record containing key K using (if necessary) a collision resolution policy

Collision Resolution

Collisions are inevitable in most applications, so some strategies are needed to resolve such collisions

  1. Need to be able to place an element when its mapped location is full

  2. Need to be able to retrieve the element when it’s not placed directly according to the hash function

The goal of collision resolution is to find a free slot in the HT. The new slot is found by a collision resolution policy

Search must follow the same policy to find records not in their home slots

Closed hashing

If a collision occurs, alternative slots are tried until an empty slot is found

Probe sequence : The series of slots visited during insert/search by following a collision resolution policy

Open Addressing

  1. Linear Probing

  2. Quadratic Probing

  3. Random Probing

Rehashing

Linear Probing

Linear probing starts with the hash address and searches sequentially for the target key or an empty position.

The array should be considered circular

So that when the last location is reached, the search proceeds to the first location of the array

$H_i=(Hash(key)+d_i)\ mod\ m\ (1 \leq i <m)\newline$

where: Hash (key) is hash function

m is the length of hash table

Probe sequence $d_i$ is 1,2,……,m-1 and $d_i = i\newline$

Problem

Keys tend to cluster together

Records tend to cluster in the table under linear probing since the probabilities for which slot to use next are not the same for all slots

Adjacent cluster tend to coalesce

If the first hash addresses of two keys are different, but following the same probe sequence, they tend to the same slot, this is called secondary clustering

To avoid secondary clustering, need probe sequence to be a function of the original key value, not just the home position

Increase the search time

Quadratic probing

Quadratic probing uses a quadratic function of $i$ as the increment

$H_i=(Hash(key)+d_i)\ mod\ m\newline$

$Hash(key)\ is\ hash\ function\newline$

$m\ is\ the\ length\ of\ hash\ table\newline$

$Proble\ sequenece\ d_i:1^2, -1^2, 2^2, -2^2, …q^2, -q^2, q \leq \frac {m-1}{2}\newline$

Random problem

The ideal probe function would select the next slot on the probe sequence at random

$H_i=(Hash(key)+d_i)\ mod\ m\newline$

Probe sequence is a (random) permutation of the numbers from 1 to M-1: $r_1, r_2, … r_M\newline$

$M \leq\ the\ length\ of\ table\ and\ M\ is\ a\ prime\ number\newline$

Rehashing

$H_i=ReHash_i(key)\newline$

$ReHash_i(key)$ is a different hash function from Hash(key)

Try $h_1 (K), h_2 (K), …, h_m(K)$ if collision occurs

Be sure that all different hash functions $h_i (K)$ are relatively prime to M

Open Hashing
Separate Chaining

Chaining: use a hash table that is an array of linked lists to store the items

The hash table itself can be smaller than the number of records; If the records are large, a chained hash table can save space

Collision resolution with chaining is simple. When a collision occurs, we simply insert the new item into the appropriate linked list. Clustering is no problem

Searching such a hash table is straightforward

apply the hash function to the item sought and then use one of the search algorithms for linked lists

Analysis of Hashing
Searching

To determine if a specified value is in this hash table,apply the hash function to compute the location for this value

  1. if location is empty, value not in the table
  2. if location contains the specified value, the search is successful
  3. if location contains a different value, must rule out collision, begin a “circular” search at this location and continue until either item is found or empty or starting location reached (item not in table)

Comparison is needed in hash searching because of collision. We can use ASL to analyze hashing

For hashing, the following factors may affects the efficiency

  1. Hash function: The behavior of the hash function affects the frequency of collisions

  2. Collision resolution policy

  3. Loading factor ($\alpha$): The load factor of the table is $\alpha = \frac {n}{m}$ , where n positions are occupied out of a total of m positions in the table

/img/Hash_ASL.png
case1(image1)
Summary

Comparison between Binary Search Trees and Hash Tables

  1. difficult to find the minimum (or maximum) element in a hash table

  2. Can not find a range of elements in a hash table

  3. $O(\log_2n)$ is not necessarily much more than $O(1)$,since there are no multiplications or divisions by BSTs

  4. Sorted input can make BSTs perform poorly