Moving a running process from one computer to another

AlbertGoodwill · February 23, 2007, 6:12am

Hi,

Is there a way to move a running process from one computer to another
and resume execution from the point where the original process was
interrupted?

int main(…)
{
    for (int i = 0; i < very_large_number; i++) {
        cout << i << "," << endl;
        // solve rocket science equations
        if (i == 3)  // some condition happened and we decided to move from Computer-A to Computer-B
            MoveTo("192.168.1.123");  // hypothetical call: move this process to another computer and continue running there
    }
    return 0;
}

Computer-A
1,2,3,

Computer-B
4,5,6,7,8, …

qnxloader · February 23, 2007, 6:36am

There is no way to move a running process from one computer to another, but you could run the same process on two computers and coordinate their work based on a condition.
Maybe your problem has another solution…

AlbertGoodwill · February 23, 2007, 7:01am

Apparently there are techniques called “Process Migration” and “Checkpoint/Restart” which can be used to move/migrate a process from one computer to another while it is running. The following URLs provide more information on these techniques:

hpl.hp.com/personal/Dejan_Milojicic/pm7.pdf
klammeraffe.org/~fritsch/uni … 0000000000
cap.anu.edu.au/cap/projects/esky/
ftg.lbl.gov/CheckpointRestart/Ch … tart.shtml
sei.cmu.edu/publications/art … t-001.html
science.uva.nl/research/scs/Software/
cs.wisc.edu/condor/

I wonder if QNX is blessed with built-in (or third-party) support for “Process Migration”, “Checkpoint/Restart”, or both.

Regards,

Albert Goodwill

mario · February 23, 2007, 7:08am

As far as I know there is nothing built in. Unless there is kernel support for this, I don’t think it’s possible. In fact, moving the process itself is not that big of a deal; the problem is that the kernel resources used by the process must be reallocated on the new machine. In your example, when the process is moved, stdout (cout) needs to be opened again as well. Since file descriptors are based on coids, which are themselves based on nid/pid/chid, I don’t see how that can be “moved”.

AlbertGoodwill · February 23, 2007, 7:55am

I’m new to the “Process Migration” and “Checkpoint/Restart” techniques, but my understanding so far is that these techniques can also be implemented without “special” kernel support. For example, libckpt provides checkpoint/restart above the kernel level.
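To make the user-level idea concrete, here is a minimal sketch of application-level checkpointing for a loop like the one in the first post. It is not how libckpt works internally (tools like that try to save the process state for you); here the program explicitly saves the little state it needs. The names Checkpoint, save_checkpoint, load_checkpoint, clear_checkpoint and run_steps, and the use of a running sum as a stand-in for the real computation, are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Everything the program needs in order to resume (hypothetical layout).
struct Checkpoint {
    long next_i = 0;     // next loop iteration to execute
    long long sum = 0;   // stand-in for the result computed so far
};

// Write the state to a file. Returns false if the write failed.
bool save_checkpoint(const std::string& path, const Checkpoint& cp) {
    std::ofstream out(path, std::ios::trunc);
    if (!out) return false;
    out << cp.next_i << ' ' << cp.sum << '\n';
    out.flush();
    return out.good();
}

// Read the state back. Leaves cp untouched if no usable checkpoint exists.
bool load_checkpoint(const std::string& path, Checkpoint& cp) {
    std::ifstream in(path);
    Checkpoint tmp;
    if (!(in >> tmp.next_i >> tmp.sum)) return false;
    cp = tmp;
    return true;
}

// Delete a checkpoint so the next run starts from the beginning.
void clear_checkpoint(const std::string& path) {
    std::remove(path.c_str());
}

// Resume from the checkpoint (if any) and run until iteration stop_before,
// saving the state after every iteration.
Checkpoint run_steps(const std::string& path, long stop_before) {
    assert(stop_before >= 0);
    Checkpoint cp;
    load_checkpoint(path, cp);              // start from 0 if nothing was saved
    while (cp.next_i < stop_before) {
        cp.sum += cp.next_i;                // the "work" for this iteration
        ++cp.next_i;
        if (!save_checkpoint(path, cp)) break;
    }
    return cp;
}
```

With this shape, Computer-A could call run_steps("state.txt", 4), the file would be copied (or live on shared storage), and Computer-B could call run_steps("state.txt", N) to continue from iteration 4. Only the saved fields travel; open files, sockets and timers are not part of the checkpoint.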

Regards,

Albert

mario · February 23, 2007, 2:52pm

If you customized your program to use a special library that provides cover functions, then maybe. In fact QNX’s HAM could take care of some of that, but your program has to be built for it.

The example you gave would have to be modified ;-0

AlbertGoodwill · February 23, 2007, 5:15pm

Hi,

As I search and read more about checkpointing, I’ve found that some checkpoint/restart applications do NOT need the user application to be linked with special libraries, and they do NOT need kernel-level modifications either. Yes, it is amazing! I have not managed a successful run yet, but I’m trying some freely available tools on my Linux machine (and hopefully on QNX later on).

See the following links:
checkpointing.org/
geocities.com/asimshankar/ch … index.html
cryopid.berlios.de/
cluster.kiev.ua/tasks/chpx_eng.html

Regards,

Albert

mario · February 23, 2007, 6:25pm

As I said earlier, I think it’s possible, but you will have to seriously limit what the program can do. For example, it’s impossible to move a process with open sockets, timers, or notifications. QNX has many features that Linux doesn’t have, which makes this extremely challenging. Furthermore, you cannot assume that none of the functions in the C library make use of any of these resources.

For example, let’s assume delay() creates a timer on the first call and reuses that same timer on subsequent calls. If you move the process and delay() is called after the move, the data in memory holds a valid-looking timer id, but the timer does not really exist. The code that performs the move could probably figure out that the program has a timer, but when it recreates the timer it would have to write the new timer id into the variable that holds it, and there is no way to know where that variable is. The same goes for semaphores/mutexes.

Note that I made up that example; under QNX6 delay() doesn’t create a timer ;-)

You are opening up a huge can of worms. If this is for a real production system as opposed to experimental/research work, I think you are asking for trouble.
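The stale-id problem, and the “cover function” way around it mentioned earlier in the thread, can be sketched as follows. This is only an illustration under strong assumptions: FakeKernel is a made-up stand-in for one machine’s kernel (it is not a QNX API), and TimerTable is a hypothetical cover layer. It only helps for resources the program creates through that layer, which is exactly why resources hidden inside the C library remain a problem.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Stand-in for one machine's kernel: the timer ids it hands out are only
// meaningful on that machine. (Hypothetical; not a real kernel interface.)
class FakeKernel {
public:
    explicit FakeKernel(int first_id) : next_id_(first_id) {}
    int create_timer(int period_ms) {
        timers_[next_id_] = period_ms;
        return next_id_++;
    }
    bool timer_exists(int id) const { return timers_.count(id) != 0; }
private:
    int next_id_;
    std::map<int, int> timers_;   // kernel timer id -> period in ms
};

// Cover layer: the application holds only logical handles (0, 1, 2, ...),
// never raw kernel ids, so the ids can be swapped out after a move.
class TimerTable {
    struct Entry { int period_ms; int kernel_id; };
    std::vector<Entry> entries_;
public:
    // Create a timer and return a logical handle for it.
    int create(FakeKernel& k, int period_ms) {
        entries_.push_back({period_ms, k.create_timer(period_ms)});
        return static_cast<int>(entries_.size()) - 1;
    }
    // Look up the kernel id currently behind a logical handle.
    int kernel_id(int handle) const { return entries_.at(handle).kernel_id; }
    // What would be checkpointed: the description of each timer, not its id.
    std::vector<int> snapshot() const {
        std::vector<int> periods;
        for (const Entry& e : entries_) periods.push_back(e.period_ms);
        return periods;
    }
    // On the new machine: recreate every timer and record its new kernel id.
    void restore(FakeKernel& k, const std::vector<int>& periods) {
        entries_.clear();
        for (int p : periods) entries_.push_back({p, k.create_timer(p)});
    }
};
```

Because logical handles are just positions in the table, handle 0 before the move is still handle 0 after restore(), even though the kernel id behind it has changed.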

AlbertGoodwill · February 23, 2007, 6:55pm

Hi Mario,

Thank you for your replies.

I use QNX both on some embedded hardware (PC104) and in VMware as a guest OS hosted on Windows XP.
If you use VMware, you know that we can “suspend” the whole guest OS (QNX in my case), copy the suspended image anywhere, and restart it from the point where it was left. This is NOT the same as migrating an arbitrary, individual process. But I think it would be fantastic if we could “suspend” a process, move/migrate it to any other computer, and restart it there.

May I ask your expert advice on a fault-tolerance / redundancy issue?
Assume that we have a critical control system controlled by a QNX-based computer. Because the system is critical for us, we would like to have fault tolerance / redundancy such that if one computer and/or process dies, the other computer takes control and continues controlling our critical system from the point where the failed computer/process left off.
How can we achieve this sort of fault-tolerant system?
Does QNX have some special services/utilities for this sort of redundancy?

Regards,

Albert

mario · February 23, 2007, 7:58pm

Definitely NOT the same. Suspending a VMware session brings its own set of problems.

I have been somewhat involved with the group that designed the RadarSat satellites. Everything is redundant: all inputs go to two systems, and all outputs are “merged”. So you end up with two systems doing the same things at the same time; there is no moving of software from one place to the other. This is the approach I would take. Depending on your current design this can be simple or very complex, but it is definitely less risky than moving programs from one machine to another.

Think about this: if the computer the program is to be moved from dies, how do you move the program? That kind of defeats the purpose, doesn’t it? Or if the software has gone wild because of a hardware glitch, what would be the point of moving it?
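As a rough illustration of the “two systems, merged outputs” idea, the sketch below runs the same control law on two units and merges the results. How the outputs are actually merged in a system like the one described is not stated in this thread, so the policy here (accept when both agree, fall back to the surviving unit, flag a disagreement) is an assumption, and Output, Merged, merge_outputs, control_law and step are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Result from one unit; valid == false means the unit produced nothing.
struct Output { bool valid; double value; };

enum class MergeStatus { Agree, Disagree, SingleUnit, NoOutput };

// Merged result; value is only meaningful for Agree and SingleUnit.
struct Merged { MergeStatus status; double value; };

// Assumed merge policy: both units must agree within a tolerance; if only
// one unit answered, use it; if they disagree, report it and emit nothing.
Merged merge_outputs(Output a, Output b, double tolerance) {
    assert(tolerance >= 0.0);
    if (a.valid && b.valid) {
        if (std::fabs(a.value - b.value) <= tolerance)
            return {MergeStatus::Agree, a.value};
        return {MergeStatus::Disagree, 0.0};
    }
    if (a.valid) return {MergeStatus::SingleUnit, a.value};
    if (b.valid) return {MergeStatus::SingleUnit, b.value};
    return {MergeStatus::NoOutput, 0.0};
}

// The same (made-up) control law runs on both units from the same input.
Output control_law(double sensor, bool unit_alive) {
    if (!unit_alive) return {false, 0.0};
    return {true, 2.0 * sensor + 1.0};
}

// One control cycle: feed the same input to both units, merge the outputs.
Merged step(double sensor, bool a_alive, bool b_alive) {
    return merge_outputs(control_law(sensor, a_alive),
                         control_law(sensor, b_alive), 1e-9);
}
```

Note that nothing is ever moved: if one unit dies, the other is already up to date because it has seen the same inputs all along. With only two units a disagreement can be detected but not resolved, which is one reason the merge policy is left as an assumption here.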

Mario

maschoen · February 23, 2007, 8:09pm

It’s interesting hearing the different ideas about this subject, but there seems to be some confusion about what is wanted and what can be done.
I think it should be clear that there is no built-in way to just freeze any process, transport it to another CPU, and expect it to pick up where it left off. It should also be clear that if you are writing the process yourself, and it has a well-defined result, it should be possible to provide the type of fault tolerance that Albert was asking about. In the hope of opening up the discussion, I’ll describe a project that I recently worked on that does something like the latter.

The requirement was for a very large, fast, and redundant database. For speed, the database uses a hash algorithm. There is no need to read records sequentially, so this is adequate. Two copies of the database exist on separate nodes, and each node has a resource manager that looks after its copy. I/O requests are channeled through an interface library to the Active resource manager. That manager handles all reads directly. Updates, adds, and deletes are also forwarded to the Inactive resource manager. If the Active resource manager dies, requests are re-routed, and the Inactive manager becomes active. There are synchronization issues at startup, and when a dead manager comes back to life, which must be handled carefully.
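The routing logic described above can be sketched in a few lines. This is a single-process simulation under stated assumptions, not the project’s code: Replica stands in for a resource manager and its hash table, ReplicatedStore stands in for the interface library, plain function calls stand in for messages between nodes, and the alive flag stands in for whatever failure detection the real system uses. Resynchronizing a manager that comes back to life is deliberately left out.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Stand-in for one node's resource manager and its copy of the database.
class Replica {
public:
    bool alive = true;   // assumed failure indicator
    void apply_put(const std::string& k, const std::string& v) { table_[k] = v; }
    void apply_erase(const std::string& k) { table_.erase(k); }
    bool get(const std::string& k, std::string& out) const {
        auto it = table_.find(k);
        if (it == table_.end()) return false;
        out = it->second;
        return true;
    }
private:
    std::unordered_map<std::string, std::string> table_;   // hash-based store
};

// Stand-in for the interface library that clients call.
class ReplicatedStore {
public:
    ReplicatedStore(Replica& a, Replica& b) : active_(&a), inactive_(&b) {}

    // Reads are handled by the active manager only.
    bool get(const std::string& k, std::string& out) {
        return ensure_active() && active_->get(k, out);
    }
    // Updates go to the active manager and are forwarded to the inactive one.
    bool put(const std::string& k, const std::string& v) {
        if (!ensure_active()) return false;
        active_->apply_put(k, v);
        if (inactive_->alive) inactive_->apply_put(k, v);
        return true;
    }
    // Deletes are forwarded the same way as updates.
    bool erase(const std::string& k) {
        if (!ensure_active()) return false;
        active_->apply_erase(k);
        if (inactive_->alive) inactive_->apply_erase(k);
        return true;
    }
private:
    // If the active manager has died, re-route to the other one.
    bool ensure_active() {
        if (!active_->alive) std::swap(active_, inactive_);
        return active_->alive;
    }
    Replica* active_;
    Replica* inactive_;
};
```

Clients never need to know which node answered; after a failure the former inactive copy already holds every forwarded update, so requests simply continue against it.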

The special service QNX provides that makes this all work is message passing. Clients send messages to the resource managers, and the active resource manager sends updates to the inactive manager. I hope this is some good food for thought.