[Serdev] Re: postgres module
Greg Fausak
lgfausak at august.net
Mon Jan 26 21:37:14 UTC 2004
Jan Janak wrote:
> Greg, comments inline.
>
> On 26-01 12:08, Greg Fausak wrote:
>
>>Jan,
>>
>>Thank you for writing me about these subjects.
>>I am very opinionated. I have wrestled with most of the
>>subjects you discuss over the years (decades actually) and
>>I have found things that work for me. My responses are in
>>no way directed at you or SER...I am just expressing my opinions!
>>I appreciate being offered the forums, and I'll respond candidly.
>
>
> Sure, that's what we have this mailing list for.
>
>
>>>- Connection pool -- I would like to implement the same connection pool
>>> which is now implemented in mysql module. It allows sharing of
>>> connections with the same URL among modules within the same process.
>>> That means the number of connections will not grow with the number of
>>> modules using db anymore.
>>
>>A connection pool is fine with me.
>>
>>There are basic problems with the
>>approach that you are using for database operations.
>>I went over this when I created the first postgres module.
>>The main problem is that a file descriptor is *not* a database
>>connection. The practice of opening the database and then
>>forking is just completely wrong for postgres! The correct
>>approach for postgres is to open the database *in the thread or process*
>>that it is used in.
>
>
> I remember the problem you are mentioning, but I have removed all such
> constructs, so this shouldn't be a problem anymore. Currently all
> database connections are opened and closed in the process in which
> they will be used and are never inherited. The last relics were the
> usrloc and auth_db modules, which inherited connections but never used
> them in the descendants.
>
> I've fixed even those two modules over the weekend. Actually, the
> mysql module does not allow inherited connections and will scream (and
> ser won't start) if it detects open inherited database connections.
>
>
>>If the connection pool operates outside
>>the ser modules, and is communicated with over a pipe/datagram/ip
>>then that would be fine. If all queries are atomic in nature (that
>>is, they do not span multiple statements, like a 'select' followed by
>>an 'update') then a completely shared pool would work. Otherwise,
>>the pool would need to be 'reserved' so that a transaction
>>can be started, run, and committed/aborted. This would require
>>reuse of the same database connection throughout the entire
>>transaction.
>
>
> The term connection pool has a slightly different meaning here. Let me
> briefly describe what it is good for.
>
> Older versions of ser opened a huge number of database connections
> (you might remember some emails on the list about maximum number of
> allowed connections in mysql).
>
> For example, suppose you configure ser to start 16 processes and you
> load the usrloc, auth_db, acc and domain modules.
>
> Usrloc will open a database connection in each child, that is, 16
> database connections. After that the auth_db module gets initialized
> and also opens a database connection in each child, so we have 16 +
> 16 = 32 database connections.
>
> The same happens for the acc and domain modules, so we will end up
> with 64 open database connections. All the connections usually have
> the same username, password and database. Each new module that needs
> the database will add 16 new connections. Each connection will start
> a new thread (in the case of mysql) on the server.
>
> Module functions within one process will never be executed in
> parallel; they will always be executed in the order in which they
> are written in the configuration file. A function must return before
> the next one is called. If a function performs any database
> operations, they will be finished before the function returns (this is
> true in all ser modules and in fact it must be true).
>
> Given the constraints described above, a single database connection
> can be reused by multiple modules within the same process (as long as
> they are configured with the same database URL). Modules will never
> conflict with each other because they are executed sequentially and
> each database operation is finished before a function from a different
> module is executed.
>
> The connection pool relies upon these facts. When a module opens a
> database connection, the connection will be remembered by the database
> module (mysql in this case).
>
> When a different module opens a connection _within the same process_,
> the database module will iterate through the pool to see if a
> connection with the same URL has already been opened. If so, it
> will return a reference to the connection opened by the previous
> module; otherwise it will open a new one. And so on... Each ser
> process has a distinct connection pool.
>
> With the connection pool, the example (mentioned above) will look like
> this:
>
> Again, you start ser with 16 processes. The usrloc module will open a
> database connection in each process; since it is the first module,
> there are no open connections in the pool yet, so 16 new connections
> will be opened.
>
> After that, auth_db tries to open a connection in each child. But
> because it was configured with the same database URL as usrloc, a
> previously opened connection is found in the pool and returned to
> auth_db. So in fact auth_db doesn't open any new database connections.
> The same happens in acc and domain provided that they were configured
> with the same database URL.
>
> So in this case we have only 16 opened database connections (compared
> to 64 previously). That's it. The purpose of the connection pool is to
> reduce the number of opened connections.
I hacked dbase.c in the postgres module to:
1) Skip the connect_db() when doing db_init()
2) Call connect_db() in any operation that tries to use the
database while the database is not currently open.
Using this technique has cut the number of database connections
to two. Maybe a combination of both techniques is in order?
>
>
>>We (Andy Fullford, mostly) actually coded a module called
>>RI (relational interface) a long time ago. It does pooling,
>>communicates with remote processes via IP/datagram/pipe, and
>>insulates the client program from the underlying database type.
>>That is a different story.
>
>
> Yes, we have a similar (but probably simpler) API in ser.
>
>
>>>- Memory management functions -- I've noticed that you have been using
>>>  your own memory management functions that make it easy to find
>>>  memory leaks. I'd like to remove them. I understand that they are
>>>  good for debugging, but they also introduce an unnecessary
>>>  performance bottleneck. Of course I take responsibility for any
>>>  memory leaks I might introduce and will fix them immediately.
>>
>>I have strong opinions about memory management. I feel that with
>>current processor speeds and memory sizes, memory management should
>>lean towards robustness at the expense of efficiency. Certainly if
>>there is a performance problem it needs to be addressed. Have you
>>determined there is a performance problem? I have profiled this stuff;
>>it doesn't have a measurable performance hit, nor does it really take
>>too much memory (the machine I just built for our backup SER
>>box has a 3 GHz processor and 4 GB of memory!).
>
>
> No, I haven't done any performance measurements of postgres module and
> maybe you are right that the performance impact is minimal compared to
> the rest of the server.
>
> We've taken a different approach in ser -- performance and efficiency
> at the expense of the programmer's convenience.
>
> SER is anything but memory efficient. We use handcrafted memory
> management which is very fast, but it uses much more memory than
> needed. SER in the default configuration allocates only 1 MB of
> private memory and only this memory can be used. If you reach the
> limit, ser will bail out and you will have to recompile it. (The
> shared memory segment is much bigger, of course.)
>
> The postgres module uses the standard malloc (the one from libc),
> which is slower than ours. In addition, our memory allocator can be
> switched into a debugging mode which makes it easy to find memory
> leaks. You simply start the server, let it run for a while, and then
> stop it. It will print all unfreed memory blocks along with the file
> and line at which they were allocated. We are able to search for
> memory leaks in the whole of ser.
>
> Because of that I think that having a separate garbage collector in
> the postgres module is not necessary, and I would like to switch it
> over to the memory allocator we are using everywhere else. I don't
> know if there will be any performance boost (probably not), but at
> least the postgres module will be smaller and easier to read and
> understand, and we will have all the memory management in one place.
>
> I would prefer a smaller code base and memory allocation aligned with
> the rest of the server at the expense of careful memory handling in
> the module.
>
>
>>Our memory routines were donated to the cause. If they aren't needed,
>>I won't mind. From a programmer's point of view I find it very
>>appealing to free a single pointer (like the memory associated with
>>a dbopen) and know in my heart that all memory associated with
>>that pointer is freed. So, all memory associated with the database
>>connection is freed with a single free. Or all memory associated
>>with a single query is freed with a single free. That's clean.
>>I don't think micro-management of strings inside one memory
>>allocation is necessary or called for.
>
>
> First of all, I have nothing against your memory allocation routines,
> and we are thankful to anyone who is willing to contribute and give
> their work away for free like you did.
>
> I am just presenting my point of view and trying to clarify why I
> think it would be better without them. Of course you have the final
> word as the module maintainer -- I wouldn't do anything you disagree
> with.
>
>
>>I'm an old dog, and the tricks I know work for me. I'm not going to
>>learn any new tricks. If you can manage the memory through brute
>>force then by all means, go for it. However, if it were me, I would
>>use the memory management we have developed everywhere else in the SER
>>code.
>
>
> It's a tradeoff between efficiency and programmer's comfort. We have
> chosen the former. For the latter, C is probably not the best language
> (that's one of the reasons why so many people use java :-).
I program in java as well. It really isn't a fair comparison. Java is
extremely inefficient and remarkably elegant. We are talking about
a memory management routine in compiled C. Anyway, feel free to replace
the memory management.
>
>
>>[...]
>>
>>By the way, we have developed 'views' for postgres that completely
>>isolate each 'domain' from the others as far as SER is concerned.
>>Each view has insert/delete/update ability. Each 'domain' has its own
>>login, and the views only allow access to that domain's records. The
>>schema can be published to the domain holder, and access to the
>>database can be granted without concern about that domain seeing and
>>manipulating other domains' records. The postgres views enabled these
>>changes to the database without any changes to the SER code.
>
>
> That sounds cool. Are you willing to contribute this? (In the form of
> a description, scripts, whatever.)
Sure. We use a schema DDL generator; I can donate the output of the DDL
generation, which includes the .html descriptions and creation scripts
(both tables and views).
---greg
>
> Jan.
>
> _______________________________________________
> Serdev mailing list
> serdev at lists.iptel.org
> http://lists.iptel.org/mailman/listinfo/serdev
>
>