[Serdev] Re: postgres module

Mon Jan 26 20:10:55 UTC 2004

Greg, comments inline.

On 26-01 12:08, Greg Fausak wrote:
> Jan,
> 
> Thank you for writing me about these subjects.
> I am very opinionated.  I have wrestled with most of the
> subjects you discuss over the years (decades actually) and
> I have found things that work for me.  My responses are in
> no way directed at you or SER...I am just expressing my opinions!
> I appreciate being offered the forums, and I'll respond candidly.

  Sure, that's what we have this mailing list for.

> >- Connection pool -- I would like to implement the same connection pool
> >  which is now implemented in mysql module. It allows sharing of
> >  connections with the same URL among modules within the same process.
> >  That means the number of connections will not grow with the number of
> >  modules using db anymore.
> 
> A connection pool is fine with me.
> 
> There are basic problems with the
> approach that you are using with database operations.
> I went over this when I created the first postgres module.
> The main problem is that a file descriptor is *not* a database
> connection.  The practice of opening the database and then
> forking is just completely wrong for postgres!  The correct
> approach for postgres is to open the database *in the thread or process*
> that it is used in.

  I remember the problem you are mentioning, but I have removed all such
  constructs, so this shouldn't be a problem anymore. Currently all
  database connections are opened and closed in the process in which
  they will be used and are never inherited. The last relics were the
  usrloc and auth_db modules, which inherited connections but never used
  them in the descendant.

  I fixed even those two modules over the weekend. In fact, the mysql
  module does not allow inherited connections and it will scream (and
  ser won't start) if it detects open inherited database connections.
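
  A minimal sketch of how such a check could work. This is an
  assumption: the message does not say how the mysql module detects
  inherited connections. Here the handle simply remembers the pid of the
  process that opened it. The struct and function names are hypothetical
  and are not ser's real API.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical connection handle; not the real ser/mysql structure. */
struct db_con {
    pid_t owner;   /* process that opened the connection */
    int   fd;      /* stand-in for the real client handle */
};

/* Record the opening process so inheritance can be detected later. */
static void db_con_init(struct db_con *c, int fd)
{
    assert(c != NULL);
    c->owner = getpid();
    c->fd = fd;
}

/* Return 1 if the calling process opened the connection itself,
 * 0 if the handle was inherited across fork(). */
static int db_con_is_own(const struct db_con *c)
{
    return c->owner == getpid();
}

/* Complain loudly and return -1 when an inherited connection is found,
 * so that the caller can refuse to start. Returns 0 otherwise. */
static int db_con_check(const struct db_con *c)
{
    if (!db_con_is_own(c)) {
        fprintf(stderr,
                "BUG: inherited database connection (opened by pid %ld)\n",
                (long)c->owner);
        return -1;
    }
    return 0;
}
```

  A module would call db_con_check() from its per-child initialization
  and abort startup on a negative return value.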

> If the connection pool operates outside
> the ser modules, and is communicated with over a pipe/datagram/ip
> then that would be fine.  If all queries are atomic in nature (that
> is, they do not span multiple queries, like 'select' followed by
> 'update') then a completely shared pool would work.  Otherwise,
> the pool would need to be 'reserved' so that a transaction
> can be started, run, and committed/aborted.   This would require
> reuse of the same database connection throughout the entire
> transaction.

  The term connection pool has a slightly different meaning here. Let me
  briefly describe what it is good for.

  Older versions of ser opened a huge number of database connections
  (you might remember some emails on the list about the maximum number
  of allowed connections in mysql).

  For example, suppose you configure ser to start 16 processes and you
  load the usrloc, auth_db, acc and domain modules.

  Usrloc will open a database connection in each child, that is 16
  database connections. After that the auth_db module gets initialized
  and also opens a database connection in each child, so we have 16 + 16
  = 32 database connections.

  The same goes for the acc and domain modules, so we will end up with
  64 open database connections. All the connections usually have the
  same username, password and database. Each new module that needs a
  database will add 16 new connections. Each connection will start a new
  thread (in the case of mysql) on the server.

  Module functions within one process will never be executed in
  parallel; they are always executed in the order in which they
  are written in the configuration file. A function must return before
  the next one is called. If a function performs any database
  operations, they will be finished before the function returns (this is
  true in all ser modules and in fact it must be true).

  Given the constraints described above, a single database connection
  can be reused by multiple modules within the same process (as long as
  they are configured with the same database URL). Modules will never
  conflict with each other because they are executed sequentially and
  each database operation is finished before a function from a different
  module is executed.

  The connection pool relies upon these facts. When a module opens a
  database connection, the connection will be remembered by the database
  module (mysql in this case).

  When a different module opens a connection _within the same process_,
  the database module will iterate through the pool to see if a
  connection with the same URL has already been opened. If so, it will
  return a reference to the connection opened by the previous module;
  otherwise it will open a new one. And so on. Each ser process has a
  distinct connection pool.
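
  The lookup described above can be sketched as follows. This is only an
  illustration, not the mysql module's actual code: the struct, the
  function names, the reference counting and the "opened" counter are
  all assumptions. The actual connect and disconnect calls are left as
  comments.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical pool entry; names are illustrative, not ser's real API. */
struct pool_con {
    char *url;              /* database URL the connection was opened with */
    int refs;               /* number of modules currently sharing it */
    struct pool_con *next;
};

/* One list per process: every ser child gets its own pool. */
static struct pool_con *pool = NULL;
static int opened = 0;      /* real connections held, for illustration only */

/* Return a shared connection for url, opening a new one only if needed. */
struct pool_con *pool_get(const char *url)
{
    struct pool_con *c;
    size_t len;

    for (c = pool; c; c = c->next) {
        if (strcmp(c->url, url) == 0) {
            c->refs++;          /* same URL: reuse the existing connection */
            return c;
        }
    }

    c = malloc(sizeof(*c));
    if (!c) return NULL;
    len = strlen(url) + 1;
    c->url = malloc(len);
    if (!c->url) { free(c); return NULL; }
    memcpy(c->url, url, len);
    /* A real module would connect to the database server here. */
    c->refs = 1;
    c->next = pool;
    pool = c;
    opened++;
    return c;
}

/* Drop one reference; unlink and release the connection when unused. */
void pool_put(struct pool_con *con)
{
    struct pool_con **p;

    assert(con != NULL && con->refs > 0);
    if (--con->refs > 0) return;

    for (p = &pool; *p; p = &(*p)->next) {
        if (*p == con) {
            *p = con->next;
            break;
        }
    }
    /* A real module would disconnect from the database server here. */
    free(con->url);
    free(con);
    opened--;
}
```

  No locking is needed in this sketch because, as explained above,
  module functions within one process never run in parallel.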

  With the connection pool, the example above looks like this:

  Again, you start ser with 16 processes. The usrloc module will open a
  database connection in each process. Since it is the first module,
  there are no open connections in the pool yet, and 16 new connections
  will be opened.

  After that auth_db tries to open a connection in each child. But
  because it was configured with the same database URL as usrloc, a
  previously opened connection is found in the pool and returned to
  auth_db. So in fact auth_db doesn't open any new database connections.
  The same happens in acc and domain provided that they were configured
  with the same database URL.

  So in this case we have only 16 open database connections (compared
  to 64 previously). That's it. The purpose of the connection pool is to
  reduce the number of open connections.
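
  The arithmetic of the example can be written down as two small
  functions. The function names are made up for illustration; the
  numbers in the note below are the ones from the example.

```c
#include <assert.h>

/* Connections opened when every module opens its own connection
 * in every process (the old behaviour). */
static int conns_without_pool(int processes, int modules)
{
    assert(processes >= 0 && modules >= 0);
    return processes * modules;
}

/* Connections opened when the modules within a process share one
 * connection per distinct database URL (the pooled behaviour). */
static int conns_with_pool(int processes, int distinct_urls)
{
    assert(processes >= 0 && distinct_urls >= 0);
    return processes * distinct_urls;
}
```

  With 16 processes, 4 modules and a single shared URL this gives
  16 * 4 = 64 connections without the pool and 16 * 1 = 16 with it.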

> We (Andy Fullford mostly) have actually coded a module called
> RI (relational interface) a long time ago.  It does pooling,
> communicates with remote processes via IP/datagram/pipe, and
> insulates the client program from the underlying database type.
> That is a different story.

  Yes, we have a similar (but probably simpler) API in ser.

> >- Memory management functions -- I've noticed that you have been using
> >  your own memory management functions that allow to find mem leaks
> >  easily. I'd like to remove them. I understand that they are good for
> >  debugging, but they also introduce performance bottleneck which is not
> >  necessary. Of course I take the responsibility for any memory leaks
> >  which I might introduce and will fix them immediately.
> 
> I have strong opinions about memory management.  I feel with current
> processor speed and memory size memory management should lean towards
> robustness at the expense of efficiency.  Certainly if there is a
> performance problem it needs to be addressed.  Have you determined
> there is a performance problem?  I have pref'fed this stuff, it
> doesn't have a measurable performance hit, nor does it really take
> too much memory (the machine I just built for our backup-SER
> box has a 3Gig processor and 4GB of memory!)!

  No, I haven't done any performance measurements of the postgres module
  and maybe you are right that the performance impact is minimal
  compared to the rest of the server.

  We've taken a different approach in ser -- performance and efficiency
  at the expense of the programmer's convenience.

  SER is anything but memory efficient. We use a handcrafted memory
  manager which is very fast, but it uses much more memory than
  needed. SER in its default configuration allocates only 1 MB of
  private memory and only this memory can be used. If you reach the
  limit then ser will bail out and you will have to recompile it. (The
  shared memory segment is much bigger, of course.)

  The postgres module uses the standard malloc (the one from libc),
  which is slower than ours. In addition to that, our memory allocator
  can be switched into a debugging mode which allows you to find memory
  leaks easily. You simply start the server, let it run for a while and
  then stop it. It will print all unfreed memory blocks along with the
  file and line at which they were allocated. We are able to search for
  memory leaks in the whole of ser.
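
  The idea behind such a debugging mode can be sketched on top of plain
  malloc. This is not ser's allocator, which manages its own memory
  pool. All names below are hypothetical, and the linear search in
  dbg_free() is chosen for brevity, not speed.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* One record per live block, remembering where it was allocated. */
struct dbg_rec {
    void *ptr;
    size_t size;
    const char *file;
    int line;
    struct dbg_rec *next;
};

static struct dbg_rec *live = NULL;   /* blocks not yet freed */

/* Allocate size bytes and remember the caller's file and line. */
void *dbg_malloc_at(size_t size, const char *file, int line)
{
    struct dbg_rec *r = malloc(sizeof(*r));
    if (!r) return NULL;
    r->ptr = malloc(size);
    if (!r->ptr) { free(r); return NULL; }
    r->size = size;
    r->file = file;
    r->line = line;
    r->next = live;
    live = r;
    return r->ptr;
}

/* Free a block and drop its record; unknown pointers are ignored. */
void dbg_free(void *ptr)
{
    struct dbg_rec **p, *r;

    for (p = &live; *p; p = &(*p)->next) {
        if ((*p)->ptr == ptr) {
            r = *p;
            *p = r->next;
            free(r->ptr);
            free(r);
            return;
        }
    }
}

/* Print every block still allocated; returns the number of leaks. */
int dbg_report(FILE *out)
{
    struct dbg_rec *r;
    int n = 0;

    assert(out != NULL);
    for (r = live; r; r = r->next, n++)
        fprintf(out, "unfreed: %lu bytes allocated at %s:%d\n",
                (unsigned long)r->size, r->file, r->line);
    return n;
}

/* Callers use this macro so that file and line are filled in for them. */
#define dbg_malloc(size) dbg_malloc_at((size), __FILE__, __LINE__)
```

  Calling dbg_report(stderr) at shutdown lists whatever is still
  allocated, which is the kind of report described above.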

  Because of that I think that having a separate garbage collector in
  the postgres module is not necessary and I would like to switch it
  over to the memory allocator we are using everywhere else. I don't
  know if there will be any performance boost (probably not), but at
  least the postgres module will be smaller, easier to read and
  understand, and we will have all the memory management in one place.

  I would prefer a smaller code base and memory allocation aligned with
  the rest of the server at the expense of careful memory handling in
  the module.

> Our memory routines were donated to the cause.  If they aren't needed
> I won't mind.  From a programmer's point of view I find it very
> appealing to free a single pointer (like the memory associated with
> a dbopen) and I know in my heart that all memory associated with
> that pointer is freed.  So, all memory associated with the database
> connection is freed with a single free.  Or all memory associated
> with a single query is freed with a single free.  That's clean.
> I don't think micro-management of strings inside one memory
> allocation is necessary or called for.

  First of all, I have nothing against your memory allocation routines
  and we are thankful to anyone who is willing to contribute and give it
  away for free like you did.

  I am just presenting my point of view and trying to clarify why I
  think it would be better without them. Of course you have the final
  word as the module maintainer -- I wouldn't do anything you disagree
  with.
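
  For readers unfamiliar with the "single free" style Greg describes,
  here is a minimal sketch of the idea. It is not the postgres module's
  actual code. The owner handle and all function names are assumptions
  made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* One node per block that belongs to an owner. */
struct own_node {
    void *ptr;
    struct own_node *next;
};

/* Hypothetical owner handle, e.g. one per connection or per query. */
struct owner {
    struct own_node *head;
    int count;              /* number of blocks currently owned */
};

/* Create an empty owner. */
struct owner *owner_new(void)
{
    return calloc(1, sizeof(struct owner));
}

/* Allocate size bytes that live exactly as long as the owner does. */
void *owner_alloc(struct owner *o, size_t size)
{
    struct own_node *n;

    assert(o != NULL);
    n = malloc(sizeof(*n));
    if (!n) return NULL;
    n->ptr = malloc(size);
    if (!n->ptr) { free(n); return NULL; }
    n->next = o->head;
    o->head = n;
    o->count++;
    return n->ptr;
}

/* Release the owner and every block allocated through it.
 * Returns the number of blocks that were released. */
int owner_free(struct owner *o)
{
    struct own_node *n, *next;
    int released = 0;

    assert(o != NULL);
    for (n = o->head; n; n = next) {
        next = n->next;
        free(n->ptr);
        free(n);
        released++;
    }
    free(o);
    return released;
}
```

  The convenience is that individual strings never need their own free;
  the cost is extra bookkeeping per allocation.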

> I'm an old dog, and the tricks I know work for me.  I'm not going to
> learn any new tricks.  If you can manage the memory through brute
> force then by all means, go for it.  However, if it were me, I would
> use the memory management we have developed everywhere else in the SER
> code.

  It's a tradeoff between efficiency and the programmer's comfort. We
  have chosen the former. For the latter C is probably not the best
  language (that's one of the reasons why so many people use Java :-).

> [...]
> 
> By the way, we have developed 'views' for postgres that completely
> isolate each 'domain' from each other as far as SER is concerned.
> Each view has insert/delete/update ability.  Each 'domain' has its own
> login, and the views only allow access to that domain's records.  The
> schema can be published to the domain holder, and access to the database
> can be granted without concern about that domain seeing and manipulating
> other domain's records.  The postgres views enabled the changes to the
> database without any changes to the SER code.

  That sounds cool. Are you willing to contribute this (in the form of a
  description, scripts, whatever)?

    Jan.