Discussion:
How Unix manages processes in userland
Erik Fair
2013-12-05 20:34:46 UTC
So, daemons - watched over or not?

There's been a philosophical split (sort of) between the BSD/v7 school of thought and the System III/V school of thought, on how the Unix system should start up and manage userland processes where (background) daemons are concerned, with the structure and function of process 1 (whether you call that /{etc,sbin}/init or something else) at the center of it.

Process 1 has always been responsible for reaping processes whose parents did not do so (e.g. processes not started by shells), but what it does with those events beyond simple status collection has varied.

In BSD-land, derived from Version 7 Unix (or 7th Edition, if you prefer), daemons are expected to fork into independent processes and their parents exit, leaving them with the default parent of process 1. Monitoring independent daemons has been somewhat ad-hoc and messy; we've left process 1 more or less alone in this regard; it merely collects status (with one exception).

In USG/System III/V-land, they opted to add a more general process monitoring facility into process 1, in what's known as "inittab" (or by other names in other variants). I'd argue: good idea, pretty terrible implementation - when System III shipped, it was clear to me what needed to be done: the "daemonization" routines of all daemons run by inittab needed to be removed (no more fork/exit) so that /etc/init could properly manage those processes, and restart them when they died (and terminate them when the system is being shut down). I did that in the early 1980's at Dual Systems, a small mc68k Unix-box manufacturer that was my employer, and it worked well. Had to redo all the work for System V, alas (it always annoyed me that USG/AT&T added new facilities like inittab and then didn't perform the necessary code rototill for the system to properly use them), and I've never liked inittab's "run levels".

It is important to note right here that there's one area in which both schools of thought agreed: user login sessions initiated by getty & login on ttys needed to be explicitly managed by process 1. One can argue that USG/System III people merely extended that model to daemons, too.

One can also argue that BSD simply didn't change what had been inherited from v7 Unix, and then went and did its own thing when TCP/IP (network) sessions over telnet, rlogin, et alia, showed up. You don't have to hang getty off a pty (unlike a tty) to accept a network user session. A good thing, too, but that's where we get inetd(8) from.

Why isn't that inetd stuff in process 1, too? Reasonable fear of code bloat & bugs, I suppose, and a philosophy that process 1 needs to be as simple as possible so that it can be reasonably expected to work properly (after all, if process 1 dies unexpectedly, all kinds of bad bad things happen).

One more important aside that we should consider: "user sessions" now come in more flavors than a person pounding on a tty (pty) with a shell (or three): there's FTP logins, IMAP/POP logins, and so on. There have been some attempts at reflecting those in utmp(5) but I don't think anyone has been consistent about it. I think we ought to tie that stuff into the basic authentication libraries, i.e. when a user authenticates for something, if it's going to last more than a second or three (i.e. a user is asking for a "session"), it ought to get an entry in utmp(5) and wtmp(5) so that you can see with who(1) or w(1) the users of the system and what sort of session they're in.

We're NetBSD - it should be easy to see what the Network users are doing in our systems (never mind http or NFS, for now …).

End of "user session" digression - back to daemon management.

NetBSD's rc.d(8) system is great - proper dependency management, and it's easy to manually start, stop, or restart a given daemon or service, but we totally fall down on daemon monitoring - they're expected to "just work" (perfect code!) and if they're important enough, someone will notice and manually restart when they die. Or not.

I've had some problems with that - named(8) likes to die on some of my systems because it's a big, complicated beast, and the Internet now encompasses enough of the world that the totality of all code paths through named are being relatively regularly exercised and bugs discovered quite rapidly in deployment, but not fixed anywhere near fast enough. So, I wrote a little shell script for cron(8) the other day to keep those daemon processes that are polite enough to leave a PID file in /var/run alive, and after testing in my own environment, I posted it to tech-userlevel for those who might also be having the same problems. It's a simple, somewhat hacky patch to a design deficiency in NetBSD.
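
The idea, in outline - a sketch of the approach, not the exact script
I posted (it assumes a pid file is named after its rc.d script, which
is only mostly true):

    #!/bin/sh
    # keepalive sketch, run from cron: restart any rc.d service whose
    # pid file in /var/run names a process that no longer exists.
    for pidfile in /var/run/*.pid; do
        [ -f "$pidfile" ] || continue
        svc=$(basename "$pidfile" .pid)
        [ -x "/etc/rc.d/$svc" ] || continue    # only things rc.d knows
        pid=$(head -1 "$pidfile")
        kill -0 "$pid" 2>/dev/null && continue # still running
        logger -t keepalive "$svc (pid $pid) died; restarting"
        /etc/rc.d/"$svc" restart
    done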

The right place to deal with all of this is in process 1. It is deemed responsible for startup & shutdown of the system, which mode (single user, multi-user) to run in, the secure levels (ugh), and ultimate reaping of all processes, so it "knows" a priori whether a daemon should be running or not, and it can know whether one actually is, provided the relationship between a daemon (service) and its PID is known. The trick is in expressing in some kind of configuration system what we want in a simple but hopefully sufficiently rich syntax.

However, I don't like either of the two schemes I've seen to date for dealing with the issue. I've already expressed my distaste for inittab(5) as I've seen it (has Linux done something more sensible with it in the last many years?), and I had a look at Apple's OS X "launchd" and I don't like it either - it really wants to be talked to through a control program interface (launchctl, with yet another control language to learn) rather than allowing one to simply edit configuration files.

Worse, neither system has proper dependency management as we have in rc.d(8), and I really, really don't want to lose that.

So, clear statement of the problem: daemons should be started and managed by process 1 because it is in a position to monitor for their death and restart them as necessary, and log those deaths (kern.logsigexit is OK but not really the right thing, and I was the one who ported it from FreeBSD), but we need a configuration system for process 1 that not merely names all the daemons (services) to be started/stopped, but also expresses dependency for both startup and shutdown.
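
To make the shape of the thing concrete, a strawman - every keyword
below is invented, and the syntax is entirely up for debate:

    # hypothetical /etc/daemons.conf
    named:
        command=/usr/sbin/named -f    # run in foreground; no fork/exit
        requires=network syslogd      # startup order; reversed at shutdown
        restart=yes                   # respawn on death, log each death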

your comments and thoughts are solicited,

Erik <***@netbsd.org>
David Brownlee
2013-12-05 21:04:31 UTC
Post by Erik Fair
So, clear statement of the problem: daemons should be started and managed by process 1 because it is in a position to monitor for their death and restart them as necessary, and log those deaths (kern.logsigexit is OK but not really the right thing, and I was the one who ported it from FreeBSD), but we need a configuration system for process 1 that not merely names all the daemons (services) to be started/stopped, but also expresses dependency for both startup and shutdown.
A possible variant - process 1 needs to be involved, but does it need
to do everything?

We could have another process responsible for starting daemons and
'registering' their pids with init, and init could notify it when one
of them dies (or if the pid is not present by the time the
registration is processed). Whether this could be less complex than
just putting the functionality in init is another question...
James K. Lowden
2013-12-07 21:40:55 UTC
On Thu, 5 Dec 2013 21:04:31 +0000
Post by David Brownlee
Post by Erik Fair
So, clear statement of the problem: daemons should be started and
managed by process 1 because it is in a position to monitor for
their death and restart them as necessary, and log those deaths
(kern.logsigexit is OK but not really the right thing, and I was
the one who ported it from FreeBSD), but we need a configuration
system for process 1 that not merely names all the daemons
(services) to be started/stopped, but also expresses dependency for
both startup and shutdown.
A possible variant - process 1 needs to be involved, but does it need
to do everything?
We could have another process responsible for starting daemons and
'registering' their pids with init, and init could notify it when one
of them dies (or if the pid is not present by the time the
registration is processed).
ISTM you're on the right track here. We could take advantage of
the fact that process 1 reaps the exit status of expiring daemons. But
it doesn't know the process's name, much less how or whether to
restart it, nor should it. And that's OK.

A watchdog daemon could monitor /proc with kevent. It could take
note of the appearance of "interesting" processes in /proc, those
that it's charged with monitoring and perhaps restarting. When the
process disappears, it does what it's configured to do.

A very simple BSD approach would be to keep a list of interesting daemons
in, say, /etc/watchdog. For anything in that list, the watchdog would
use rc.d to restart the process. To add a little intelligence, rc.d
could be extended with an "auto-restart" action for the watchdog to
prefer if available.
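
A polling sketch of that loop (the kevent-on-/proc version would react
to exits rather than poll; this assumes each name listed in
/etc/watchdog has an rc.d script of the same name):

    #!/bin/sh
    # watchdog sketch: restart anything listed in /etc/watchdog
    # that is no longer running.
    while read svc; do
        case "$svc" in ''|'#'*) continue ;; esac  # skip blanks, comments
        if ! /etc/rc.d/"$svc" status >/dev/null 2>&1; then
            logger -t watchdog "$svc is down; restarting"
            /etc/rc.d/"$svc" restart
        fi
    done < /etc/watchdog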

That does leave unanswered the question of who's watching the
watchers. All the more reason to keep the watchdog itself simple and
separated from the restart logic.

I see no advantage to the launchd model or to changing the way daemons
daemonize, nor any reason to add anything to the daemon's list of things
it must do for the sake of management.

--jkl
Matt Thomas
2013-12-05 21:53:09 UTC
Post by Erik Fair
your comments and thoughts are solicited,
And then there's OS X launchd :)
Robert Elz
2013-12-06 00:01:28 UTC
There is no reason that management of daemons, or for that matter,
logins on ttys, needs to be done by process 1. tty login management
isn't done that way because it has to be, but because it always has been
(and it gives init some work to do, something less morbid than just
being a graveyard for orphans.)

That process 1 inherits orphans is irrelevant - there's no information
content in the death of an orphan except that it happened - there's no
meaningful way to extract identity from them, etc.

If we have switched from a fork & exit paradigm (the "old" way of
starting daemon processes, which makes monitoring using wait()
essentially impossible) to one where daemons are run (as daemons)
by some controlling process, which then (perhaps) cleans up and restarts
them when they die, then any process at all can be the parent process;
it certainly doesn't need to be process 1.

What's more, there can be lots of different ones - if you like launchd,
you could run it, if you like linux's inittab processing, you can run it.
If you like something different, you can run it - what's more, you can
run all of them in parallel if you like, each managing a particular subset
of the daemons that are best suited to the particular system's management
style.

There's no need to impose anything in particular on anyone, just create
a suitable monitoring program, and add it to pkgsrc - people who like it
can run it. All that might be needed is to make sure that any daemon
processes that might want to be run have some kind of "don't fork" option.
Most do, to ease debugging, but I think, not quite all of them.
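
Such a monitor can be tiny. A sketch, where "somed" is a made-up
daemon and "-f" stands for whatever its don't-fork option is:

    #!/bin/sh
    # minimal per-daemon supervisor; note it need not be process 1
    while :; do
        /usr/sbin/somed -f
        logger -t supervise "somed exited (status $?); restarting"
        sleep 1    # don't spin if it dies immediately
    done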

kre
Mouse
2013-12-06 04:57:54 UTC
Post by Robert Elz
There is no reason that management of daemons, or for that matter,
logins on ttys, needs to be done by process 1. [...more on daemon
process management...]
What kre said.
Post by Robert Elz
All that might be needed is to make sure that any daemon processes
that might want to be run have some kind of "don't fork" option. Most
do, to ease debugging, but I think, not quite all of them.
Some of the ones that do have that option also have it do something
else, suitable for debugging but undesirable here, such as forcing
logging to stdout. I really think all the daemons need at least a
look-in for this....

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Matthew Orgass
2013-12-06 09:55:14 UTC
Post by Robert Elz
There's no need to impose anything in particular on anyone, just create
I started out disagreeing, but I think I somewhat changed my mind before
hitting send :/. What you say makes sense to me in that modularity
encourages more well defined interactions between simpler parts (dealing
with packages, etc.), which makes it easier to figure out what is going on
at each step of the way. I do think it is super important to have
something that works well in the base system.

I've had two basic problems with most init systems: the most common is
just having no idea WTF is going on or how to find out, and the much less
common one is wanting to replace as little as possible if the system just
doesn't do what I want. Being able to replace the whole system doesn't
help if there is no other system that does all of what I want, and it at least
means I need to learn how some other system works.

NetBSD seems to do very well on both of these basic issues and I think
it is a major reason to use NetBSD. I've had the misfortune of being
stuck mostly running various versions of Linux over the past year or two
and NetBSD's rc.d is one of the things I miss most (sh, mtree, vis, and
build.sh are other things that come to mind easily). My current Linux
system uses systemd and netctl and when something fails it is either
silent or tells me to check two different log files which have a bunch of
lines but usually nothing more helpful than "... status: FAILED".

daemontools is one that uses the "don't fork" model with per-service
monitors. I think it has some good ideas and is much simpler than the
others I've seen. I don't think it deals with dependencies, though, which
is a major issue. I think it would be great if something like that was
merged nicely with rc.d and used by default.
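
For reference, a daemontools-style per-service run script is about
this small (a sketch; "somed" and its "-f" flag are made up):

    #!/bin/sh
    # /service/somed/run - supervise respawns this whenever it exits
    exec 2>&1                  # stderr to the service's log
    exec /usr/sbin/somed -f    # run in foreground; no fork/exit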

daemontools has a fghack program to try to work around services that
fork by opening some pipes before starting the service and exiting when
they all get closed.

djb also wrote a single-service inetd-like program (ucspi-tcp).

OTOH, I'm personally not that fond of having an extra monitoring process
for every service and I also think it makes sense for init to do generic
"is it running" monitoring. I don't think it makes that much difference
either way, though, and I don't think init should try to do anything more
complex than checking whether something is running and restarting it if not.

Another thing that I think should be part of a service monitoring system
is dbus type "message bus" functionality and ability to deal with on
demand services. Unfortunately, dbus itself has a quite unfriendly UI.

IMO, the basic Unix/BSD security model, while useful for servers, does
not cover how users actually interact with a personal computer. Fixing
that involves defining and limiting what any individual application and
application instances can do (in a way relevant to the user, such as this
app can only modify files in this particular directory or must get
approval for files to be written through a notification mechanism or
something like that). Which probably involves moving more things into a
service realm with processing capability but not the direct ability to do
things with side effects. With fewer and simpler programs doing things
with side effects (and the OS enforcing this), it would be easier to
visualize and control what those side effects are.

-Matt
James K. Lowden
2013-12-07 21:40:52 UTC
On Fri, 6 Dec 2013 04:55:14 -0500 (EST)
Post by Matthew Orgass
the basic Unix/BSD security model, while useful for servers, does
not cover how users actually interact with a personal computer.
For that to be true, you'd have to explain how a personal computer is
not like a server, and why that matters to the security model. ISTM
the Unix security model was invented precisely to control user
interaction with the system.
Post by Matthew Orgass
Fixing that involves defining and limiting what any individual
application and application instances can do (in a way relevant to
the user, such as this app can only modify files in this particular
directory
So you want to associate permissions with programs instead of users.
Which is what setuid(2) gives you without creating a new vector of
things that can have permissions granted to them. That it's not used
very much suggests to me the "application may do" model is of limited
use.

--jkl
Matthew Orgass
2013-12-09 13:26:13 UTC
Post by James K. Lowden
Post by Matthew Orgass
the basic Unix/BSD security model, while useful for servers, does
not cover how users actually interact with a personal computer.
For that to be true, you'd have to explain how a personal computer is
not like a server, and why that matters to the security model. ISTM
the Unix security model was invented precisely to control user
interaction with the system.
Servers usually are trying to accomplish a more limited task, so it is
often possible to run fewer and simpler applications and to put some
effort into determining what should be run and as what uid/gid and then
setting it up so that happens. I think the basic issues affect servers as
well, just much less severely. Personal computer users run and install
software without trusting the source, but the current security model is
based on trusting applications. Often it is quite complex software and we
want to be able to try it and see how it works before putting any
significant effort into setup.

In terms of security properties, I think what I am most looking for is
to control:

1) what gets written to disk, particularly so that stuff I care about
doesn't get overwritten but also to be able to tell what is using disk
space and be able to delete stuff that isn't needed when the disk fills up
(and ideally prevent it from unnecessarily filling up in the first place)

2) private data should not be removed from my computer without my consent

3) preserve a trusted communication path with various levels of control
software

4) get good battery life on mobile devices and still be able to play audio
and video without skipping

With #3, #1 could be implemented as a reliable rollback mechanism. #2 is
harder, but very important IMO. Solving #2 is likely to make other
options available for #1. Also, in general an application like mplayer
usually has no need to write to any file at all but contains a lot of
complex code from a variety of sources, so being able to easily and
reliably prevent it from doing that ever would eliminate a whole class of
possible issues (there are still side channel issues, of course, and
potential issues with whatever i/o methods it uses).

Networking was never really integrated into the Unix security model that
much, and being able to revoke any network access for a process and its
descendants seems like a basic minimal starting point to even possibly be
able to address #2.

#4 has often not been classified as a security issue, but unintended
sucking of battery life is a DoS attack that can be very significant in
practice and the control methods needed to deal with it are closely
related to other security properties.
Post by James K. Lowden
Post by Matthew Orgass
Fixing that involves defining and limiting what any individual
application and application instances can do (in a way relevant to the
user, such as this app can only modify files in this particular
directory
So you want to associate permissions with programs instead of users.
Which is what setuid(2) gives you without creating a new vector of
things that can have permissions granted to them. That it's not used
very much suggests to me the "application may do" model is of limited
use.
setuid doesn't really do most of what I am looking for and is clunky at
what it does do. True capability systems are one extreme form of
application based privileges and I think there are some ways that those
ideas can be useful within an overall more unixy framework.

It seems to me that one core aspect of the traditional unix model is
"the privileges of a process or any of its direct decendents will never
increase" and IMO that seems likely to make a better foundation than a
true capability system if most code runs with almost no privileges and
simpler privileged code organizes higher level tasks (this is much more
easily said than made to work in practice :/). OTOH, there are already
multiple ways that things don't quite work that way in practice and it
seems to me like some such way is needed. IMO, descriptor passing should
be the main such method and there should probably be some increased
ability to pass capabilities via file descriptors (being careful how this
affects current apps). OTOH, passing a descriptor should give the
application the ability to use the descriptor, not have some magical
effect.

Describing file privileges in terms of read/write/execute for
user/group/other also seems fundamental to me; it might be possible to
tweak that a little (I've wondered if subuser users and groups might make
sense or some other method of one login user using multiple uids), but any
major change probably means it is a bad idea to try to run in the same OS
as apps designed for the current model. I think it would be possible to
associate decreased permissions with an application by some other method,
but not increased.

Since networking is currently so permissive there is a lot that could be
done there that would cause current apps to fail in a safe way when run in
the restrictive environment. I also think there could be a unified
privilege delegation model that could assign particular privileges to
particular uids without breaking the fundamental security model (NetBSD
already has some ways of doing that), although there are also many failed
attempts to do this and I don't like any of the more general ones I've
seen (most of what is called capabilities on unixy systems are trying to
do this).

Whatever basic model is used, the main challenge seems to me to be able
to define privileges in a way that is useful but not annoying to the user
and can be enforced reliably. When try 1 at doing this fails it should
also be possible to make a reasonably safe transition to try 2. It should
be possible to have split points in simpler privileged code that can
choose one method or another such that most code will not have access to
both methods, although this can easily get complicated. Some possible
uses of a dbus type system would need to do this to be safely used with
current code.

Things like systrace and SELinux try to apply privilege restriction at
a level where there are complex interactions between things treated as
separate privileges.

I don't think Plan9 directly tried to address the issues I mentioned but
has some ideas that might be helpful.

Phone oriented OSes seem to be trying various new security models,
although I don't know much about the details.

It might make sense to create a basic "server of hardware" via the
current code base and run a completely different OS with a completely
different security model in a virtual machine. OSes with significantly
different security models that I know about have so far tried to run
directly on hardware, which makes it very difficult to support enough
hardware to reach enough potential users to possibly have an impact.

OTOH, I tend to think it is possible to adapt the current model in a way
that provides a smooth transition with a configurable level of security
vs. interoperability with current applications that fails in a safe way.
Some things that make sense to me to start in that direction are:

1) a disable_network() system call and no_new_file_descriptors() system
call (and equivalent flags to posix_spawn). Possibly some other way to
pass descriptors that doesn't use AF_UNIX.

2) a highly restricted execution environment such that a process basically
can just use the file descriptors passed to it. Also, a way to associate
this with a particular application such that it cannot be accidentally run
outside that restricted environment. This might need changes to how
shared libraries are loaded to work.

3) a service framework and new shell functionality to help organize #2

-Matt
David Young
2013-12-11 05:44:51 UTC
Post by Matthew Orgass
In terms of security properties, I think what I am most looking
for is to control:
1) what gets written to disk, particularly so that stuff I care
about doesn't get overwritten but also to be able to tell what is
using disk space and be able to delete stuff that isn't needed when
the disk fills up (and ideally prevent it from unnecessarily filling
up in the first place)
2) private data should not be removed from my computer without my consent
3) preserve a trusted communication path with various levels of
control software
4) get good battery life on mobile devices and still be able to play
audio and video without skipping
With #3, #1 could be implemented as a reliable rollback mechanism.
#2 is harder, but very important IMO. Solving #2 is likely to make
other options available for #1. Also, in general an application
like mplayer usually has no need to write to any file at all but
contains a lot of complex code from a variety of sources, so being
able to easily and reliably prevent it from doing that ever would
eliminate a whole class of possible issues (there are still side
channel issues, of course, and potential issues with whatever i/o
methods it uses).
Networking was never really integrated into the Unix security
model that much, and being able to revoke any network access for a
process and its descendants seems like a basic minimal starting point
to even possibly be able to address #2.
#4 has often not been classified as a security issue, but
unintended sucking of battery life is a DOS attack that can be very
significant in practice and the control methods needed to deal with
it are closely related to other security properties.
It sounds like you want to give the user (and his agents: programs)
more fine-grained control over program resources. I don't
think that these controls are necessarily the same as security
properties/mechanisms/policies.

We can say, we need more control because there are security issues. Or
we can say, we have security issues because we lack control. It's more
useful and truthful to say the latter, I think: we lack control over
the resources programs use. If a program gets to use any of a user's
resources, it gets to use them all, so we have to be very careful what
programs we run.

I think that with good resource control, we will solve more problems
than we will solve with a security mechanism, but we will surely
solve the security problems, too. For example, I would like for my
NetBSD boxes both to conserve energy and to reduce heat and noise
most of the time, but it does not matter to me if build.sh and its
descendant processes make those boxes noisily draw power and generate
heat. Really, I would like to grant build.sh unlimited watts, but put
all other processes on an energy budget. If I had energy budgets for
processes, surely I could run an untrusted program under a strict power
limit?

If I could limit the universe of network 5-tuples for a process, so that
it could not bind(2) arbitrary local addresses or ports, connect(2)
to or send(2) to arbitrary remote hosts, then that would be enough to
implement a lot of useful policies for diagnostics, privacy, etc.

If I also could limit the number of unique disk blocks a program could
use, and the number of pages of virtual memory, and if I could restrict
the directories where it could link or unlink files, then maybe I could
depend on some untrusted program to apply some useful algorithm to its
standard input and write the result on its standard output, or else
crash trying to exceed its power, storage, network, or memory limits.
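
Pieces of that are crudely approximable today with sh's ulimit - a
sketch only (option letters and units vary by shell, and nothing here
touches the network or energy):

    # run an untrusted filter under coarse resource limits
    ( ulimit -t 60        # CPU seconds
      ulimit -v 262144    # virtual memory, in kbytes
      ulimit -f 20480     # largest file it may write, 512-byte blocks
      exec ./untrusted-filter < input > output )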

Dave
--
David Young
***@pobox.com Urbana, IL (217) 721-9981
Matthew Orgass
2013-12-11 15:42:31 UTC
Post by David Young
We can say, we need more control because there are security issues. Or
we can say, we have security issues because we lack control. It's more
useful and truthful to say the latter, I think: we lack control over
the resources programs use. If a program gets to use any of a user's
resources, it gets to use them all, so we have to be very careful what
programs we run.
That makes sense to me too. I think the two perspectives should be the
same, but aren't necessarily the same in practice, so IMO it would make
sense to consider particular issues both ways and try to solve for both
:).
Post by David Young
If I also could limit the number of unique disk blocks a program could
use, and the number of pages of virtual memory, and if I could restrict
the directories where it could link or unlink files, then maybe I could
depend on some untrusted program to apply some useful algorithm to its
standard input and write the result on its standard output, or else
crash trying to exceed its power, storage, network, or memory limits.
Yes, like that. Consider a restricted process that acts like a web
server and which I access through a web browser (via the service
framework, so the process itself doesn't deal with how it gets connected
to the web browser). I think there is both an access control issue (the
process only gets to interact with its data) and a resource control issue
(how much data can it store and how much processor and memory does it get
to use, etc). The web browser would filter interaction with the rest of
the system and might internally use restricted processes to help achieve
that reliably.

In some cases the server process might internally need additional
services (it might make use of a particular web API, for example). By
providing these services via a separate local process (through the service
framework) the use of these services can be filtered by trusted
applications.

This example also brings up that the networking API and Plan9
per-process file systems are to some extent different ways of solving the
same underlying issue of separating what you are connecting to or
providing from distinct data streams, and that underlying functionality is
useful even without network access.

I would also eventually s/web browser/well designed rendering console/,
such that a web browser would be a filter that converts sloppy XML or HTML
to well formed whatever the rendering console uses. That type of model
would also make good use of an "everything is a resource" perspective.

People have been running untrusted binaries for a while now and web
browsers are currently the main environment that has been attempting to
deal with this (not with binaries exactly, but the lack of a way to deal
with the problem effectively for actual executables means that higher
level attempts keep failing in addition to being slower than they need to
be). This is so much of what people actually want to do with a personal
computer that web browsers are literally turning into operating systems
(not wildly popular ones yet, but it is still a fairly new thing). I
don't think they are actually dealing with the fundamental problem that
well, but I predict that if no OS actually deals with this fundamental
problem well then people will end up using OSes that deal with it badly
but at least attempt to deal with it.

-Matt

David Laight
2013-12-06 23:51:54 UTC
Post by Robert Elz
There is no reason that management of daemons, or for that matter,
logins on ttys, needs to be done by process 1. tty login management
isn't done that way because it has to be, but because it always has been
(and it gives init some work to do, something less morbid than just
being a graveyard for orphans.)
One thing that has to be done for ttys is to reset the permissions when
the user logs out, and tidy up the utmp file.
IIRC on sysv this is usually done by writing a message to init (via a pipe)
giving it the pid and tty name.

We used to end up with /dev entries for ttys that were just minor numbers
on the protocol stack's device (enough STREAMS modules were pushed).
If all the processes closed their terminal, the minor number could get
reused [1], but the tty /dev entry would still be present - and then
programs like wall(1) would write into random connections!

One solution was to use fattach() (like a mount) to attach the cloned
tty device onto a dummy entry in /dev and then change the permissions.
This all 'fell apart' on last close.

David

[1] but not by the same process - it wasn't allowed to read/write
the major/minor pair that used to be its controlling terminal!
--
David Laight: ***@l8s.co.uk
Masao Uebayashi
2013-12-06 05:02:12 UTC
I like the idea that init(8) is responsible for managing daemons.

Some time ago I had to play with Pacemaker, a cluster management
system. I didn't fully understand it, but I viewed it as kind of yet
another launchd; it periodically polls various services/daemons, and
restarts them when they fail.

Then I got the idea to add a "monitor" method to rc.d scripts and make
init(8) periodically invoke rc "monitor" to check whether the system's
service tree is healthy.
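
Roughly like this - a sketch only, since rc.subr(8) has no "monitor"
convention today, and "somed" is a made-up service:

    # in /etc/rc.d/somed: expose a "monitor" method via extra_commands
    extra_commands="monitor"
    monitor_cmd="somed_monitor"

    somed_monitor()
    {
        # check_pidfile prints a pid only if the process is alive and
        # matches $command; empty output means "needs a restart"
        [ -n "$(check_pidfile "$pidfile" "$command")" ]
    }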
Post by Erik Fair
So, daemons - watched over or not?
[...]
Mouse
2013-12-06 05:31:54 UTC
Post by Erik Fair
So, daemons - watched over or not?
[...]
your comments and thoughts are solicited,
It would help if you didn't use paragraph-length lines. (If you want
your text to be reflowed by the recipient, see RFC 3676.)
Post by Erik Fair
NetBSD's rc.d(8) system is great - proper dependency management, and it's easy [...]
Maybe - but only if you're running a stock system and don't want to do
anything but the Officially Approved operations, the ones the system's
designers chose to support.

Step outside that box and it all flips upside down and you're faced
with a great deal of undocumented complexity which various other pieces
of the system assume is being used exactly as designed and which thus
ends up being a twisty little maze of shell scripts all different and
all getting in your way.

There's a reason I tend to turn off the stock daemons and run my own
from /etc/rc.local.
Post by Erik Fair
The right place to deal with all of this is in process 1. [...]
I disagree. I'm in agreement with what kre said: there is no reason,
possibly excepting history, to do any of this in process 1. Indeed, I
would prefer to move /etc/ttys processing out of init, into (say)
ttyspawner; I'm not sure what I'd do about runlevel processing (whether
BSD-style or SV-style or something else). I would actually be tempted
to cut process 1 back to nothing but reaping zombies, perhaps moving it
into the kernel or even eliminating it entirely (by arranging for a
parentless process that dies to be reaped within the kernel).

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
David Holland
2013-12-06 08:42:19 UTC
Post by Erik Fair
So, daemons - watched over or not?
There's been a philosophical split (sort of) between the BSD/v7
school of thought and the System III/V school of thought, on how
the Unix system should startup and manage userland processes where
(background) daemons are concerned, with the structure and function
of process 1 (whether you call that /{etc,sbin}/init or something
else) at the center of it.
I wouldn't go so far as to say that. A better way to describe that
split is that the AT&T unixes invented a new and monstrously complex
mechanism to monitor getty and nothing else. They *could* have used
their init arrangements to start and monitor/restart daemons, but as
you note they never did, and I don't think they ever really intended
to; they also at the same time invented a different monstrously
complex way to start up and shut down daemon processes and other
system services.

Anyhow, nowadays even most of the Linux world has realized that that
design is no good, and it's basically of no importance any more except
as a negative example.

What we do have, though, is a pile of hysterical raisins: the
init-getty-login combination works in a particular way that's been the
same all the way back to at least V7 and I think well before; init is
responsible for tidying up sessions started with getty because that
way you don't have an extra useless process hanging around underneath
the user's shell wasting memory. This mattered back then; now it's
just a poorly framed abstraction. It would be better if each getty
were just another anonymous daemon that spawned login (instead of
execing it) and cleaned up afterwards, like telnetd or rlogind or sshd
or basically everything else. (Note that if you're using PAM you get
an ill-conceived partial form of this behind your back for PAM
reasons...)

Even without that cleanup there's no reason init has to be the process
that spawns getty and cleans up after getty sessions; that work could
be farmed out to another daemon.

There are a number of recent attempts to rearrange the way services
and daemons get started (and restarted) -- there's launchd, upstart,
at least one other whose name I'm forgetting, and perhaps others. So
far, none of these has seemed to me like a very good idea; they all
seem hastily conceived (without e.g. an understanding of how
init/getty/login traditionally works) and some of them just don't seem
very ... unixish.

I think if we want to improve the state of the art in this regard the
way to do it is to look at what a "service" is (in the sense of things
like "service nfsd start", not /etc/services) and try to come up with
some abstractions that make sense and aren't oversimplified or
crippled.

Right now, for example, we have "services" that are rc.d scripts
(rpcbind, sshd, syslogd, ...) that start daemons; we also have
"services" (telnetd, fingerd) that are inetd.conf entries, although
most of these are basically dead nowadays; and there are also
"services" like ipf and npf that are rc.d scripts and behave much like
daemons except that they're really kernel state.

Most of these "services" are turned on and off via rc.conf, but not
all of them. (For example, if you want to enable fingerd, you have to
know it's an inetd service.)
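
(Concretely: turning sshd on is one line in rc.conf -

    sshd=YES

- while turning fingerd on means uncommenting a line like

    finger  stream  tcp  nowait  nobody  /usr/libexec/fingerd  fingerd

in inetd.conf: two unrelated switches for the same kind of decision.)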

Meanwhile, not all rc.d scripts are "services"; e.g. fsck isn't and
cleartmp isn't; these are purely part of the boot sequence.

In an ideal world, all "services" would be configured the same way
(whether that way is rc.conf or something else) and you wouldn't have
to know or care about the implementation to work with them.

Similarly, in an ideal world, all daemons (which might or might not be
part of a "service") would have a failure and recovery/restart path
(just respawning is not necessarily adequate) and would get run in a
framework that handles this, instead of requiring manual monitoring
and hand restart.

An ideal world would also allow users to have "services"; that is, you
log in, the system enables your talkd receiver or biff proxy or
whatever, and if it's a daemon sees to keeping it running... and shuts
it down when you log off. The absolute lack of all infrastructure
support for this in Unix is getting to be a fairly serious drawback.

We are something like 75% to 80% of the way to having a workable
abstraction for system services, but it's still too tightly coupled to
the implementation and still all mixed up with boot-time activities.

As you note, we have bupkis for daemon management, but I don't think
it makes sense to try to tackle that without fitting it into a clear
model of system services, and preferably also of user sessions.

Also, few of the daemons we commonly use have much in the way of
useful failure and recovery behavior; for many of them if you just
respawn them blindly they'll keep crashing, and most of the rest lose
all their state such that restarting them is a long way from
transparent. Some are even worse than this: in the case of syslogd, if
it crashes you (may) silently lose data, and if something respawns it
you may also lose the ability to notice that you may have lost data...
Post by Erik Fair
One more important aside that we should consider: "user sessions"
now come in more flavors than person pounding on a tty (pty) and a
shell (or three): there's FTP logins, IMAP/POP logins, and so
on. There have been some attempts at reflecting those in utmp(5)
but I don't think anyone has been consistent about it. I think we
ought to tie that stuff into the basic authentication libraries,
i.e. when a user authenticates for something, if it's going to last
more than a second or three (i.e. a user is asking for a
"session"), it ought to get an entry in utmp(5) and wtmp(5) so that
you can see with who(1) or w(1) the users of the system and what
sort of session they're in.
Yes, but more than that, there's X logins.
Post by Erik Fair
The right place to deal with all of this is in process 1.
I don't agree; to the extent init is magic, it should not do any
unnecessary work, because that exposes it to risk of failure. To the
extent init isn't magic, it doesn't need to be process 1 any more.

I've built systems where init (that is, the process that sequences
boot and shutdown) is not process 1. I've also built systems where pid
1 is reserved; if your parent exits, getppid() returns 1, but no
actual process 1 exists. Both of these things are perfectly
straightforward; there's no more reason to have a daemon hanging
around just to call wait() than there is to have a daemon hanging
around just to call nfssvc(). Less, in fact - it's fairly easy to
implement wait/exit in a way that doesn't require orphaned processes
to be waited for.
--
David A. Holland
***@netbsd.org
Roy Marples
2013-12-06 09:27:22 UTC
Post by David Holland
There are a number of recent attempts to rearrange the way services
and daemons get started (and restarted) -- there's launchd, upstart,
at least one other whose name I'm forgetting, and perhaps others. So
far, none of these has seemed to me like a very good idea; they all
seem hastily conceived (without e.g. an understanding of how
init/getty/login traditionally works) and some of them just don't seem
very ... unixish.
http://en.wikipedia.org/wiki/OpenRC

dh is right on one part: it was written without any knowledge of
init/getty/login.
But then, that wasn't its job. Its job was to start & stop things in a
dependent order.
The fact it dealt with them at all was just an aside. I would imagine
others have taken the same approach.
Post by David Holland
I think if we want to improve the state of the art in this regard the
way to do it is to look at what a "service" is (in the sense of things
like "service nfsd start", not /etc/services) and try to come up with
some abstractions that make sense and aren't oversimplified or
crippled.
This is a very important point, and most init systems I've seen get it
wrong.
Oh so wrong.

I'm not going to say too much on this topic other than that I've written
one, patched two others and tinkered with a fourth.
OpenRC can run quite nicely on NetBSD ... but I don't use it anymore.
That doesn't stop the existing devs emailing me with problems every
once in a while ;)

NetBSD's rc.d system works very nicely for me. Could I improve it? Sure!
Do I have the time? Probably not.
The only thing it needs from my pov is a dependency graph, so you can
see the dependency ordering and needs - it took me far too long to work
out how to get syslogd to start before dhcpcd when I had to debug a
boot-time-only issue.
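
(For anyone else hitting that: the ordering knobs are the rcorder(8)
keywords inside the scripts themselves, something like

    # PROVIDE: dhcpcd
    # REQUIRE: network syslogd

in /etc/rc.d/dhcpcd - that REQUIRE list is illustrative, not NetBSD's
actual one. What's missing is anything that shows you the resulting
graph.)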

Thanks

Roy
Alan Barrett
2013-12-09 15:57:27 UTC
Post by Roy Marples
NetBSD's rc.d system works very nicely for me. Could I improve it?
Sure! Do I have the time? Probably not.
The only thing that it needs from my pov is a dependency graph, [...]
See src/sbin/rcorder/rcorder-visualize.sh in the NetBSD source tree.
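
It emits a Graphviz graph, so something like

    sh /usr/src/sbin/rcorder/rcorder-visualize.sh | dot -Tps > rc.ps

should draw the ordering (assuming graphviz from pkgsrc and a source
tree under /usr/src).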

--apb (Alan Barrett)
Aaron B.
2013-12-07 01:08:39 UTC
On Fri, 6 Dec 2013 08:42:19 +0000
Post by David Holland
I think if we want to improve the state of the art in this regard the
way to do it is to look at what a "service" is (in the sense of things
like "service nfsd start", not /etc/services) and try to come up with
some abstractions that make sense and aren't oversimplified or
crippled.
Agreed!

As a sysadmin, I often care less about the internal details, and more about what a system provides. What I want to see:

1) Easy to define/install a new service
2) Easy to manipulate a service (enable/disable/restart/etc)
3) Easy to query a service's state.

IMHO, #3 is the tricky part. People often assume things are 'up' or 'down' and ignore the scope of all the other failures in between.

I think the question of what a service is, is in fact simple: anything in userland you can configure on or off, and anything in userland that might break. Daemons, firewall state, inetd-driven processes, possibly even network interfaces and filesystems are fair game.
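
To illustrate #3 with an entirely made-up interface:

    $ service named status
    named: enabled, up, pid 213, started Thu Dec  5 20:34,
           3 restarts since boot (last exit: signal 11)

That last part - failure history, not just up/down - is what no
current tool gives me.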
--
Aaron B. <***@zadzmo.org>
Roy Marples
2013-12-07 01:26:27 UTC
Post by Aaron B.
As a sysadmin, I often care less about the internal details, and more
about what a system provides. What I want to see:
1) Easy to define/install a new service
2) Easy to manipulate a service (enable/disable/restart/etc)
3) Easy to query a service's state.
IMHO, #3 is the tricky part. People often assume things are 'up' or
'down' and ignore the scope of all the other failures in between.
Well, from a service management perspective it's either up or down.
It may have an intermediate state of starting, but that still falls in
the down category.

Anything else would be a configuration error of the service, which is
outside the scope of this discussion.

Thanks

Roy
John Nemeth
2013-12-07 01:33:16 UTC
On Dec 7, 1:26am, Roy Marples wrote:
} On 07/12/2013 1:08, Aaron B. wrote:
} > As a sysadmin, I often care less about the internal details, and more
} > about what a system provides. What I want to see:
} >
} > 1) Easy to define/install a new service
} > 2) Easy to manipulate a service (enable/disable/restart/etc)
} > 3) Easy to query a service's state.
} >
} > IMHO, #3 is the tricky part. People often assume things are 'up' or
} > 'down' and ignore the scope of all the other failures in between.
}
} Well, from a service management perspective it's either up or down.
} It may have an intermediate state of starting, but that still falls in
} the down category.

Not quite. If it's down, you may want to do a restart.
However, if it is currently starting, then you want to give it time
to complete the startup process before doing a restart. And, of
course, if it fails to come up after a few restart attempts, you
want to mark it as failed, stop doing restarts, and bring it to
the attention of an administrator.

}-- End of excerpt from Roy Marples
Roy Marples
2013-12-07 01:37:24 UTC
Post by John Nemeth
} > As a sysadmin, I often care less about the internal details, and more
} > about what a system provides. What I want to see:
} >
} > 1) Easy to define/install a new service
} > 2) Easy to manipulate a service (enable/disable/restart/etc)
} > 3) Easy to query a service's state.
} >
} > IMHO, #3 is the tricky part. People often assume things are 'up' or
} > 'down' and ignore the scope of all the other failures in between.
}
} Well, from a service management perspective it's either up or down.
} It may have an intermediate state of starting, but that still falls in
} the down category.
Not quite. If it's down, you may want to do a restart.
However, if it is currently starting, then you want to give it time
to complete the startup process before doing a restart. And, of
course, if it fails to come up after a few restart attempts, you
want to mark it as failed, stop doing restarts, and bring it to
the attention of an administrator.
What you describe are actions.
I was defining states, sorry if that wasn't clear :)

Thanks

Roy
John Nemeth
2013-12-07 01:42:00 UTC
On Dec 7, 1:37am, Roy Marples wrote:
} On 07/12/2013 1:33, John Nemeth wrote:
} > On Dec 7, 1:26am, Roy Marples wrote:
} > } On 07/12/2013 1:08, Aaron B. wrote:
} > } > As a sysadmin, I often care less about the internal details, and more
} > } > about what a system provides. What I want to see:
} > } >
} > } > 1) Easy to define/install a new service
} > } > 2) Easy to manipulate a service (enable/disable/restart/etc)
} > } > 3) Easy to query a service's state.
} > } >
} > } > IMHO, #3 is the tricky part. People often assume things are 'up' or
} > } > 'down' and ignore the scope of all the other failures in between.
} > }
} > } Well, from a service management perspective it's either up or down.
} > } It may have an intermediate state of starting, but that still falls
} > } in the down category.
} >
} > Not quite. If it's down, you may want to do a restart.
} > However, if it is currently starting, then you want to give it time
} > to complete the startup process before doing a restart. And, of
} > course, if it fails to come up after a few restart attempts, you
} > want to mark it as failed, stop doing restarts, and bring it to
} > the attention of an administrator.
}
} What you describe are actions.
} I was defining states, sorry if that wasn't clear :)

Well, in that case, starting should be a state. Actions are
simply forced state transitions. You don't want to force a transition
from down to up (i.e. restart) if it is currently starting (assuming
it hasn't missed a startup timeout).

}-- End of excerpt from Roy Marples
Brett Lymn
2013-12-08 09:37:13 UTC
Post by John Nemeth
Not quite. If it's down, you may want to do a restart.
However, if it is currently starting, then you want to give it time
to complete the startup process before doing a restart. And, of
course, if it fails to come up after a few restart attempts, you
want to mark it as failed, stop doing restarts, and bring it to
the attention of an administrator.
Heh - you know, you are close to describing the Solaris SMF (service
management facility)...
--
Brett Lymn
Staple Guns: because duct tape doesn't make that KerCHUNK sound - xkcd.com
John Nemeth
2013-12-08 09:54:21 UTC
On Dec 8, 8:07pm, Brett Lymn wrote:
} On Fri, Dec 06, 2013 at 05:33:16PM -0800, John Nemeth wrote:
} >
} > Not quite. If it's down, you may want to do a restart.
} > However, if it is currently starting, then you want to give it time
} > to complete the startup process before doing a restart. And, of
} > course, if it fails to come up after a few restart attempts, you
} > want to mark it as failed, stop doing restarts, and bring it to
} > the attention of an administrator.
}
} Heh - you know, you are close to describing the Solaris SMF (service
} management facility)...

If we're serious about service management, then something like
that, or a similar facility from another OS is most likely what we
need. Using something that already exists, if a suitable one can
be found, would probably be a good thing.

In the above, I didn't even get into the issue of dependencies.
I.e. if you type "service start lockd" it should also start rpcbind.
Should rpcbind fail, then lockd should also be marked as failed.
Right now, typing "/etc/rc.d/lockd onestart" will not automatically
start rpcbind. If rpcbind isn't running, then lockd will simply
fail to start.
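
A sketch of the difference (the structure and function names are
invented for illustration; this is not how rc.d works today):

    /* Sketch of recursive, dependency-aware starting.  Everything
     * here is hypothetical; rc.d does not behave this way. */

    #include <stddef.h>

    struct svc {
        const char *name;
        struct svc **requires;   /* NULL-terminated dependency list */
        int running;
        int failed;
    };

    int
    svc_start(struct svc *sv)
    {
        if (sv->running)
            return 0;
        if (sv->failed)
            return -1;

        /* Start dependencies first: lockd pulls in rpcbind. */
        for (struct svc **dep = sv->requires;
             dep != NULL && *dep != NULL; dep++) {
            if (svc_start(*dep) != 0) {
                sv->failed = 1;  /* rpcbind failed, so lockd fails too */
                return -1;
            }
        }

        /* ... fork/exec the service's start method here ... */
        sv->running = 1;
        return 0;
    }

With something like that, failure propagates: a service whose
dependency cannot start is itself marked failed rather than being
left to fall over on its own.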

Doing service management properly can quickly get quite complex,
which is a good reason to use something that already exists where
the kinks have already been worked out. It is also a very good
reason to not have init doing service management, since init should
be kept simple.

}-- End of excerpt from Brett Lymn
David Holland
2013-12-08 21:03:50 UTC
Permalink
Post by John Nemeth
} Heh - you know, you are close to describing the Solaris SMF (service
} management facility)...
If we're serious about service management, then something like
that, or a similar facility from another OS is most likely what we
need. Using something that already exists, if a suitable one can
be found, would probably be a good thing.
In the above, I didn't even get into the issue of dependencies.
[...]
Doing service management properly can quickly get quite complex,
which is a good reason to use something that already exists [...]
Given the infrastructure we already have (for dependencies and other
things), trying to splice in third party code is not a good idea.
--
David A. Holland
***@netbsd.org
John Nemeth
2013-12-08 21:56:25 UTC
Permalink
On Dec 8, 9:03pm, David Holland wrote:
} On Sun, Dec 08, 2013 at 01:54:21AM -0800, John Nemeth wrote:
} > } Heh - you know, you are close to describing the Solaris SMF (service
} > } management facility)...
} >
} > If we're serious about service management, then something like
} > that, or a similar facility from another OS is most likely what we
} > need. Using something that already exists, if a suitable one can
} > be found, would probably be a good thing.
} >
} > In the above, I didn't even get into the issue of dependencies.
} > [...]
} > Doing service management properly can quickly get quite complex,
} > which is a good reason to use something that already exists [...]
}
} Given the infrastructure we already have (for dependencies and other
} things), trying to splice in third party code is not a good idea.

What infrastructure? We don't do service management. Our
rc.d startup code does not count as service management.

}-- End of excerpt from David Holland
David Holland
2013-12-08 22:17:22 UTC
Permalink
} > } Heh - you know, you are close to describing the Solaris SMF (service
} > } management facility)...
} >
} > If we're serious about service management, then something like
} > that, or a similar facility from another OS is most likely what we
} > need. Using something that already exists, if a suitable one can
} > be found, would probably be a good thing.
} >
} > In the above, I didn't even get into the issue of dependencies.
} > [...]
} > Doing service management properly can quickly get quite complex,
} > which is a good reason to use something that already exists [...]
}
} Given the infrastructure we already have (for dependencies and other
} things), trying to splice in third party code is not a good idea.
What infrastructure? We don't do service management. Our
rc.d startup code does not count as service management.
It is what we have and it handles dependencies, starting and stopping;
regardless of whether it's adequate as it is, bolting on something
else that doesn't interoperate with it would be a serious mistake.
--
David A. Holland
***@netbsd.org
Mouse
2013-12-08 22:32:09 UTC
Permalink
[rc.d] is what we have and it handles dependencies, starting and
stopping; regardless of whether it's adequate as it is, bolting on
something else that doesn't interoperate with it would be a serious
mistake.
Would it?

If what's there is insufficient for a task, as far as I can see the
alternatives are to bolt on something else (which will probably
interoperate poorly to not at all, or there wouldn't be any need for
it) or switch operating systems. This makes it sound as though your
position is that if NetBSD doesn't answer the user's desire out of the
box, the user should run some other OS. Surely that's not what you
mean to be saying....

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
John Nemeth
2013-12-08 22:34:59 UTC
Permalink
On Dec 8, 10:17pm, David Holland wrote:
} On Sun, Dec 08, 2013 at 01:56:25PM -0800, John Nemeth wrote:
} > } > } Heh - you know, you are close to describing the Solaris SMF (service
} > } > } management facility)...
} > } >
} > } > If we're serious about service management, then something like
} > } > that, or a similar facility from another OS is most likely what we
} > } > need. Using something that already exists, if a suitable one can
} > } > be found, would probably be a good thing.
} > } >
} > } > In the above, I didn't even get into the issue of dependencies.
} > } > [...]
} > } > Doing service management properly can quickly get quite complex,
} > } > which is a good reason to use something that already exists [...]
} > }
} > } Given the infrastructure we already have (for dependencies and other
} > } things), trying to splice in third party code is not a good idea.
} >
} > What infrastructure? We don't do service management. Our
} > rc.d startup code does not count as service management.
}
} It is what we have and it handles dependencies, starting and stopping;
} regardless of whether it's adequate as it is, bolting on something
} else that doesn't interoperate with it would be a serious mistake.

A proper service management facility would replace it. The
only real question is whether things like rc.conf would be kept (rc.d files
might be used for dependency information, but generally aren't
suitable for real service management). There is no reason why rc.d
should be sacrosanct. It has served us well as a replacement for
a monolithic /etc/rc, but it doesn't adequately satisfy modern
service management.

On that score, I disagree with the idea that service management
should be a plug-in where one can drop in any number of different
service monitors. This way lies madness, as it would be difficult
for a random admin to figure out how to administer a random system
(I realise this may not happen a lot in the NetBSD world, but it
is something that we should think about), and it would be nearly
impossible for a package that wants to install a service to figure
out what it should do. Service management really needs to be part
of the base system.

}-- End of excerpt from David Holland
David Holland
2013-12-08 23:05:31 UTC
Permalink
Post by John Nemeth
} > What infrastructure? We don't do service management. Our
} > rc.d startup code does not count as service management.
}
} It is what we have and it handles dependencies, starting and stopping;
} regardless of whether it's adequate as it is, bolting on something
} else that doesn't interoperate with it would be a serious mistake.
A proper service management facility would replace it. The
only real question is would things like rc.conf be kept (rc.d files
might be used for dependency information, but generally aren't
suitable for real service management).
Yes, and if the question is: rework it, or throw it out and replace it
with something totally different and incompatible; then I'd rather
rework it. This is not a NIH response so much as a recognition of the
difficulties associated with migrating deployed systems.

Also, while your attention is still on this, please describe what
you'd consider the properties of a "proper service management
facility" to be. I haven't seen this Solaris thing you referenced
(thankfully the last time I had to deal with administering Solaris was
more than ten years ago) but I have seen various other things, most of
which seem like badly conceived bolt-ons to sysvinit.
Post by John Nemeth
On that score, I disagree with the idea that service management
should be a plug-in where one can drop in any number of different
service monitors.
If you're going to be monitoring whether a service is working, rather
than merely running, you need to be able to supply custom monitoring;
local conditions and local configuration often cause local problems that
need this. But I don't see this as a problem.
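
For instance, something as small as a per-service check command that
the monitor runs would do (the interface here is invented, just to
illustrate the hook):

    /* Sketch: a site-supplied health check run by the monitor.
     * The interface is hypothetical; exit status 0 means healthy. */

    #include <stdlib.h>

    int
    svc_healthy(const char *check_cmd)
    {
        if (check_cmd == NULL)
            return 1;            /* no local check configured */
        return system(check_cmd) == 0;
    }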

However, what I think you mean is that the general monitoring scheme
and framework has to be designed in and not an afterthought -- that I
agree with and that was my point that Mouse misunderstood about
"bolted on".
--
David A. Holland
***@netbsd.org
Brett Lymn
2013-12-09 09:13:00 UTC
Permalink
Post by David Holland
Yes, and if the question is: rework it, or throw it out and replace it
with something totally different and incompatible; then I'd rather
rework it. This is not a NIH response so much as a recognition of the
difficulties associated with migrating deployed systems.
or perhaps just the ability to import the current rc.d config into
something else - that could be automated to a fair degree.
Post by David Holland
Also, while your attention is still on this, please describe what
you'd consider the properties of a "proper service management
facility" to be. I haven't seen this Solaris thing you referenced
(thankfully the last time I had to deal with administering Solaris was
more than ten years ago) but I have seen various other things, most of
which seem like badly conceived bolt-ons to sysvinit.
SMF is a total rework, not a bolt-on to sysvinit. A service is
described by a manifest, a bit of XML that provides all the
information SMF needs to run the service; this XML is imported
into the service "database". The manifest contains information like
the name of the service, a description, dependencies, what action to
start the service, what action to stop it, what action to restart
it, timeouts for start and stop, command line flags for the daemon,
plus a lot of other things I have forgotten (I don't write SMF
manifests often...). Once SMF knows about the service, you will
usually see it in one of three states (there are more, but they are
not common): disabled - the service is not configured to start;
enabled - the service is configured to run; and, lastly,
maintenance - an attempt was made to start the service, but there
was a problem and it has been suspended pending administrator
intervention. Usually a service will hit maintenance if the daemon
restarts too quickly due to, say, a configuration error; once the
error has been corrected, there is a command to tell SMF that the
fault has been cleared and it can start the service again. Note that
if a daemon just exits it will be restarted; only if it exits
repeatedly will the service enter maintenance mode.
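
As a rough illustration of that restart-throttling policy (invented
thresholds and names, not SMF's actual code):

    /* Sketch of "too many restarts too quickly -> maintenance".
     * Thresholds and names are hypothetical; this is not SMF source. */

    #include <time.h>

    #define RESTART_WINDOW 60    /* seconds */
    #define RESTART_LIMIT  5     /* exits tolerated within the window */

    enum svc_state { SVC_DISABLED, SVC_ENABLED, SVC_MAINTENANCE };

    struct svc {
        enum svc_state state;
        time_t exits[RESTART_LIMIT];  /* ring of recent exit times */
        unsigned nexits;
    };

    /* Called when the daemon exits; returns 1 if it should restart. */
    int
    svc_on_exit(struct svc *sv)
    {
        time_t now = time(NULL);

        sv->exits[sv->nexits++ % RESTART_LIMIT] = now;
        if (sv->nexits >= RESTART_LIMIT &&
            now - sv->exits[sv->nexits % RESTART_LIMIT] < RESTART_WINDOW) {
            sv->state = SVC_MAINTENANCE;  /* wait for the admin to clear */
            return 0;
        }
        return sv->state == SVC_ENABLED;
    }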

Here is a more cogent description of SMF for those interested:

http://www.oracle.com/technetwork/articles/servers-storage-admin/intro-smf-basics-s11-1729181.html
--
Brett Lymn
Staple Guns: because duct tape doesn't make that KerCHUNK sound - xkcd.com
Aaron B.
2013-12-07 02:39:23 UTC
Permalink
On Sat, 07 Dec 2013 01:26:27 +0000
Post by Roy Marples
Post by Aaron B.
As a sysadmin, I often care less about the internal details, and more
1) Easy to define/install a new service
2) Easy to manipulate a service (enable/disable/restart/etc)
3) Easy to query a service's state.
IMHO, #3 is the tricky part. People often assume things are 'up' or
'down' and ignore the scope of all the other failures in between.
Well, from a service management perspective it's either up or down.
It may have an intermediate state of starting, but that still falls in
the down category.
Anything else would be a configuration error of the service, which is
outside the scope of this discussion.
What I had in mind are situations like Tomcat running out of PermGen space: it's up, but completely frozen. 'service tomcat status' says 'up', 'svstat /var/service/tomcat' says 'up', but it's down.

I know, the real fix to this problem is either 'fix Tomcat' or 'use a non-sucky service', but those aren't real world solutions in a lot of cases. What would be great is some kind of heartbeat or keepalive API where a service could inform the service manager that it is still alive.
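
Something along these lines, perhaps (entirely hypothetical; no such
API exists today): the daemon writes a byte down an inherited pipe
every so often, and the service manager treats silence past a
deadline as 'up but hung', distinct from 'exited':

    /* Hypothetical heartbeat check, supervisor side.  The daemon is
     * assumed to write a byte to an inherited pipe at least every
     * HEARTBEAT_SECS; silence past the deadline means "up but hung". */

    #include <poll.h>
    #include <unistd.h>

    #define HEARTBEAT_SECS 30

    /* Returns 1 if the service proved it is alive, 0 if it went silent. */
    int
    heartbeat_wait(int fd)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        char buf[64];

        if (poll(&pfd, 1, HEARTBEAT_SECS * 1000) <= 0)
            return 0;            /* timeout or error: treat as hung */
        (void)read(fd, buf, sizeof(buf));   /* drain pending heartbeats */
        return 1;
    }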
--
Aaron B. <***@zadzmo.org>
David Holland
2013-12-08 21:04:55 UTC
Permalink
Post by Roy Marples
Well, from a service management perspective it's either up or down.
It may have an intermediate state of starting, but that still falls
in the down category.
Anything else would be a configuration error of the service, which
is outside the scope of this discussion.
...no, not really.
--
David A. Holland
***@netbsd.org