[Kamailio-Users] What happens with filled htables?

Alex Balashov abalashov at evaristesys.com
Wed Aug 26 15:58:43 CEST 2009


catalina oancea wrote:

> That seems like an interesting idea, Alex.
> 
> I still have problems understanding the size parameter. For example
> what exactly does size=4 mean?

So, imagine a two-dimensional array.  An array of size 4 would mean a 
2x2 array.  Or, if you prefer, a one-dimensional array of 4 elements. 
But "tables" are easier to visualise as a grid.

Every time a value is inserted,

    location = hash_function(key, table_size);
    hash_table[location] = value;

happens.  The hash_function() is structured in such a way that it will 
produce a deterministic result;  the same key will always result in the 
same hash.  It is also structured to never return a hash value greater 
than or equal to the size of the table (valid indices run from 0 to 
size - 1), as that would result in an out-of-bounds array access.

The size is the size of the set of all possible values hash_function() 
can produce.  How the hash function works exactly depends on the data 
and there are many strategies.  Good hash functions achieve as much of 
an "optimal" distribution as possible;  with size=4, the aim is to have 
about 25% of the stored values go in hash_table[0], 25% in 
hash_table[1], 25% in [2], and 25% in [3].

How best to achieve this is a question I leave to the mathematicians; 
there are many hash algorithms one can use.  Of course, whether the 
result is optimal also depends on how well the hash algorithm matches 
the variation in the data.  A very basic and common one, often used for 
introductory teaching, is to sum up the ASCII values of the key string, 
divide that sum by the size of the table, and use the remainder as the 
hash value, e.g.

    char    key[] = "catalina";
    int     sum = 0;

    /* sum the ASCII values of the characters in the key */
    for(int i = 0; key[i] != '\0'; i++)
            sum += (int) key[i];

    /* map the sum onto one of the table's buckets */
    return (sum % HASH_TABLE_SIZE);

Of course, this is a rather naive hash function and more sophisticated 
ones are available.
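
To make that concrete, here is a small, self-contained sketch of the 
same idea (the naive_hash() name, the keys and the size-4 table are 
purely illustrative and have nothing to do with the htable module); it 
prints which bucket each key would land in:

    #include <stdio.h>
    #include <string.h>

    #define HASH_TABLE_SIZE 4

    /* naive "sum of the ASCII values" hash, as sketched above */
    static unsigned int naive_hash(const char *key)
    {
            unsigned int sum = 0;

            for(size_t i = 0; i < strlen(key); i++)
                    sum += (unsigned char) key[i];

            return sum % HASH_TABLE_SIZE;
    }

    int main(void)
    {
            const char *keys[] = { "catalina", "alex", "kamailio", "htable" };

            for(size_t i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
                    printf("%-10s -> bucket %u\n", keys[i], naive_hash(keys[i]));

            return 0;
    }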

Anyway, the point is that if you have a table with a size of 4, the 
best-case scenario is that the first 4 insertions each map to a unique 
bucket among those 4.

Anything inserted AFTER that will *necessarily* result in a collision - 
AKA there is already another element in that bucket - and it becomes 
necessary to chain the elements together and/or fall back to some sort 
of linear probing.  This increases the computational complexity of the 
lookup and removes the speed advantage that is the point of using a 
hash table as a data structure in the first place:

    struct value_data   *hash_lookup(table_t *tbl, char *key) {
            int idx = hash_function(key, tbl->size);
            struct bucket_node *scan = NULL;

            if(tbl->buckets[idx] == NULL)
                    return NULL;

            /* walk the collision chain for this bucket */
            for(scan = tbl->buckets[idx]; scan; scan = scan->next) {
                    if(!strcmp(scan->key, key))
                            return scan->value;
            }

            return NULL;
    }
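
For completeness, a matching insert for that (assumed) table_t / 
bucket_node layout could look roughly like the sketch below - new 
entries are simply prepended to the chain of whichever bucket the key 
hashes to, which is exactly where the pile-up happens when the table is 
too small (malloc() from stdlib.h, error handling kept minimal):

    struct bucket_node *hash_insert(table_t *tbl, char *key,
                                    struct value_data *value) {
            int idx = hash_function(key, tbl->size);
            struct bucket_node *node = malloc(sizeof(*node));

            if(node == NULL)
                    return NULL;

            node->key   = key;
            node->value = value;

            /* prepend to this bucket's collision chain */
            node->next        = tbl->buckets[idx];
            tbl->buckets[idx] = node;

            return node;
    }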

With a larger table, you get the advantage of having more mathematically 
possible unique buckets for the data to go into, thus reducing or 
eliminating the need for a linear search.  If you try to stuff 200 
values into 4 buckets, even a perfectly uniform hash leaves chains of 
about 50 entries per bucket, so an average lookup costs on the order of 
N/4 = 50 comparisons.  With a table size > 200 and a reasonable hash, 
most buckets hold at most one entry and lookups stay at the O(1) that 
is the whole point of a hash table.

All associative arrays and "maps" in various high-level languages, e.g. Perl

    my %hash = (
        'key' => 'value'
    );

and PHP:

    $arr = array(
        'key' => 'value'
    );

and Java:

    HashMap<String, String> m = new HashMap<String, String>();
    m.put("Catalina", "Oancea");

... work on this principle, but dynamically resize themselves in the 
background in accordance with the runtime population of the structure.

The difference is that in the Kamailio 'htable' module, you must specify 
a fixed size at initialisation.
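
For reference, the size is set when the table is declared in 
kamailio.cfg, roughly along these lines (the table name and values here 
are arbitrary, and the exact semantics of the "size" value are 
described in the htable module README, so treat this as a sketch rather 
than a recipe):

    # declare a hash table named "a" with size=4 and a 2-hour expiry
    modparam("htable", "htable", "a=>size=4;autoexpire=7200;")

    # later, in the routing script, store and read values
    $sht(a=>$fU) = 1;
    xlog("stored flag for $fU: $sht(a=>$fU)\n");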

-- Alex

-- 
Alex Balashov - Principal
Evariste Systems
Web     : http://www.evaristesys.com/
Tel     : (+1) (678) 954-0670
Direct  : (+1) (678) 954-0671



