catalina oancea wrote:
That seems like an interesting idea, Alex.
I still have problems understanding the size parameter. For example
what exactly does size=4 mean?
So, imagine a two-dimensional array. An array of size 4 would mean a
2x2 array. Or, if you prefer, a one-dimensional array of 4 elements.
But "tables" are easier to visualise as a grid.
Every time a value is inserted,
location = hash_function(key, table_size);
hash_table[location] = value;
happens. The hash_function() is structured so that it produces a
deterministic result: the same key will always yield the same hash. It
is also structured never to return a hash value greater than or equal
to the size of the table, as that would result in an array overflow.
The size is the size of the set of all possible values hash_function()
can produce. How the hash function works exactly depends on the data
and there are many strategies. Good hash functions achieve as much of
an "optimal" distribution as possible; with size=4, the aim is to have
about 25% of the stored values go in hash_table[0], 25% in
hash_table[1], 25% in [2], and 25% in [3].
How best to achieve this is a question I leave to the mathematicians;
there are many hash algorithms one can use. Of course, whether the
result is optimal or not also depends on how well the hash algorithm
suits the variation in the data. A very basic one commonly used for
introductory teaching is to sum up the ASCII values of the key string,
divide the sum by the size of the table, and use the remainder as the
hash value, e.g.
char key[] = "catalina";
int sum = 0;

/* Sum the ASCII values of the key's characters (minus the trailing NUL)... */
for(size_t i = 0; i < sizeof(key) - 1; i++)
        sum += (int) key[i];

/* ...and reduce the sum to a valid bucket index. */
return (sum % HASH_TABLE_SIZE);
Of course, this is a rather naive hash function and more sophisticated
ones are available.
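To give a flavour of what "more sophisticated" can mean, here is the
well-known djb2 string hash, reduced modulo the table size at the end.
(This is just an illustrative sketch I'm throwing in, not what any
particular implementation - htable included - actually uses.)

unsigned long djb2_hash(const char *key, unsigned long table_size) {
        unsigned long hash = 5381;
        int c;

        /* hash = hash * 33 + c, for every character in the key. */
        while((c = (unsigned char) *key++) != 0)
                hash = ((hash << 5) + hash) + c;

        /* Clamp to a valid bucket index. */
        return hash % table_size;
}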
Anyway, the point is that if you have a table with a size of 4, the
best-case scenario is that the first 4 insertions each map to a unique
bucket among those 4.
Anything inserted AFTER that will *necessarily* result in a collision -
i.e. there is already another element in that bucket, and it becomes
necessary to chain the elements and/or fall back on some sort of linear
search. This increases the computational complexity of lookups and
erodes the speed advantage that is the whole point of using a hash
table as a data structure in the first place:
struct value_data *hash_lookup(table_t *tbl, char *key) {
        int idx = hash_function(key, tbl->size);
        struct bucket_node *scan = NULL;

        if(tbl->buckets[idx] == NULL)
                return NULL;

        /* Walk the collision chain in this bucket until the key matches. */
        for(scan = tbl->buckets[idx]; scan; scan = scan->next) {
                if(!strcmp(scan->key, key))
                        return scan->value;
        }

        return NULL;
}
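(The table_t and bucket_node types above aren't anything special; a
plausible layout, together with the matching chaining insert, might look
something like the sketch below. This is just to make the lookup above
concrete - it is not the actual internals of the htable module.)

#include <stdlib.h>
#include <string.h>

int hash_function(char *key, int table_size);   /* as sketched earlier */

struct value_data;                              /* opaque payload type */

struct bucket_node {
        char *key;
        struct value_data *value;
        struct bucket_node *next;               /* collision chain */
};

typedef struct {
        int size;                               /* number of buckets */
        struct bucket_node **buckets;           /* array of chain heads */
} table_t;

void hash_insert(table_t *tbl, char *key, struct value_data *value) {
        int idx = hash_function(key, tbl->size);
        struct bucket_node *node = malloc(sizeof(*node));

        node->key = strdup(key);
        node->value = value;

        /* On a collision, just push the new node onto the front of the
           chain already hanging off this bucket. */
        node->next = tbl->buckets[idx];
        tbl->buckets[idx] = node;
}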
With a larger table, you get the advantage of having more mathematically
possible unique buckets for the data to go into, thus reducing or
eliminating the need for a linear search. If you try to stuff 200
values into 4 buckets, then even with a perfectly even distribution each
bucket ends up holding about 200/4 = 50 entries, so a lookup degenerates
into a linear scan of up to ~50 elements. With a table size > 200, you
can get close to O(1) lookups, which is the point of a hash table.
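Another way to see it: the average chain length is just the load factor,
i.e. the number of stored entries divided by the number of buckets. A
trivial helper (the name is mine, purely for illustration) makes the
trade-off obvious:

/* Average chain length (load factor) for a given table population. */
double average_chain_length(int num_entries, int table_size) {
        return (double) num_entries / (double) table_size;
}

/* average_chain_length(200, 4)   -> 50.0   (long linear scans)      */
/* average_chain_length(200, 256) -> ~0.78  (most lookups stay O(1)) */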
All associative arrays and "maps" in various high-level languages, e.g. Perl:
my %hash = (
    'key' => 'value'
);
and PHP:
$arr = array(
    'key' => 'value'
);
and Java:
HashMap<String, String> m = new HashMap<String, String>();
m.put("Catalina", "Oancea");
... work on this principle, but dynamically resize themselves in the
background in accordance with the runtime population of the structure.
The difference is that in the Kamailio 'htable' module, you must specify
a fixed size at initialisation.
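Conceptually, the background resize those languages perform amounts to
allocating a bigger bucket array and re-hashing every existing entry
into it, roughly like this (again building on the sketch structures
above, not htable's actual code):

void hash_resize(table_t *tbl, int new_size) {
        struct bucket_node **new_buckets =
                calloc(new_size, sizeof(*new_buckets));

        for(int i = 0; i < tbl->size; i++) {
                struct bucket_node *scan = tbl->buckets[i];

                while(scan) {
                        struct bucket_node *next = scan->next;

                        /* A key's bucket index depends on the table size,
                           so every entry must be re-hashed into the new
                           array. */
                        int idx = hash_function(scan->key, new_size);
                        scan->next = new_buckets[idx];
                        new_buckets[idx] = scan;

                        scan = next;
                }
        }

        free(tbl->buckets);
        tbl->buckets = new_buckets;
        tbl->size = new_size;
}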
-- Alex
--
Alex Balashov - Principal
Evariste Systems
Web :
http://www.evaristesys.com/
Tel : (+1) (678) 954-0670
Direct : (+1) (678) 954-0671