The scheduling algorithm used by the NQE scheduler is designed to be modified. Also, it is possible to completely rewrite the scheduler. This chapter describes the NQE database and scheduler in greater detail than Chapter 9, “NQE Database”, and it guides you in writing a site-specific scheduler.
You should note the following points:
All data concerning requests submitted to the NQE database is contained within that database. Therefore, any data that the NQE scheduler needs to persist after it shuts down must be stored in the NQE database. Examples of such data include request information, request script files, job log and history information, scheduling configuration, and request exit statuses.
If the scheduler is killed and then restarted, or the connection to the NQE database is lost and then regained, the scheduler should be capable of recovering its state based entirely on the contents of the NQE database.
This chapter contains information about the following:
Scheduler system task
Writing a simple scheduler
Attribute names and nqeinfo file variable names
Tcl built-in functions
All system tasks, including the scheduler, consist of Tcl scripts that do the following:
Run with root privilege.
Connect to the NQE database and remain continuously connected.
Regularly make queries of the NQE database, analyze data, and modify or act on that data.
Figure 10-1 shows the structure of the scheduler system task (referred to as the NQE scheduler). As shown by the diagram, it consists of three main parts:
nqe_tclsh. This is the NQE version of the Tcl shell. It contains built-in Tcl functions to perform, among other things, the task of connecting to and accessing the NQE database.
nqedb_scheduler.tcl. This is the name of the Tcl script that implements the mandatory parts of the NQE scheduler. It may be invoked as the first argument to nqe_tclsh or directly as an executable script. The nqedbmgr start scheduler command simply executes this Tcl script as a background UNIX process.
local_sched.tcl. This is the name of the local scheduler (that is, the Tcl script containing the local or site-specific scheduling algorithm). It is sourced by nqedb_scheduler and contains callback functions to implement the algorithm.
By default, the local scheduler script resides in the location defined by NQE_BIN in the nqeinfo file. Its name and location may be changed by either of the following:
Defining another nqeinfo file variable, NQEDB_SCHEDFILE, set to the name of the script file (this can be an absolute path name or relative to NQE_BIN).
Setting a new attribute, s.schedfile, for a particular scheduler system task object. See “Writing a Simple Scheduler” for an example of how to do this.
The scheduler is written in the Tcl language. The most effective way of describing the operation of the NQE scheduler and the way in which it can be modified is by performing the steps necessary to create a new, site-specific NQE scheduler.
This section provides a tutorial in writing an NQE scheduler.
The following is required before starting the tutorial:
You must have knowledge of the Tcl language in order to modify the NQE scheduler. Tcl is a very simple but powerful interpreted language. Many Tcl language books include excellent tutorials on the language.
You must have root access on the NQE database server node running the mSQL daemon, the NQE scheduler system task, and the NQE monitor system task. Only one NQE node is configured to run the mSQL daemon; this is typically determined during installation of NQE. To ensure that a given machine is configured to run the NQE database server, check that the nqeinfo file variable NQE_DEFAULT_COMPLIST setting contains the NQEDB, SCHEDULER, MONITOR, and LWS components, and that the MSQL_SERVER variable is set to the name of the local machine.
Your PATH must include the directory defined by the nqeinfo file variable, NQE_BIN.
The NQE database, NQE scheduler, and NQE monitor must be running.
The nqeinit process should start these system tasks by default; the following message is displayed:
NQE database startup complete
To check that system tasks are running, execute the nqedbmgr status command. The output should show that mSQL is running and that the monitor, scheduler, and at least the local lws system tasks are running.
Following is an example of the status output for a two-node cluster with the NQE database server running on latte and an LWS running on pendulum:
nqedbmgr status
----------------------------------------------------------------
Connection:
-----------
Connecting...       Successful
Client user name:   root
DB user name:       root
DB server name:     latte
DB server port:     603
----------------------------------------------------------------
Global config object:
---------------------
Heartbeat timeout:       90
Heartbeat rate:          30
System event purge rate: 24*60*60
User task purge rate:    24*60*60
Summary status:          all
----------------------------------------------------------------
System tasks:
-------------
monitor.main (id=s2):
    Status:         Running
    Pid and host:   2669 on latte
    Last heartbeat: 26 seconds ago (Update rate is every 30 secs)
scheduler.main (id=s3):
    Status:         Running
    Pid and host:   2678 on latte
    Last heartbeat: 22 seconds ago (Update rate is every 30 secs)
lws.latte (id=s4):
    Status:         Running
    Pid and host:   2686 on latte
    Last heartbeat: 20 seconds ago (Update rate is every 30 secs)
    Default queue:  nqebatch
lws.pendulum (id=s5):
    Status:         Running
    Pid and host:   24149 on pendulum
    Last heartbeat: 22 seconds ago (Update rate is every 30 secs)
    Default queue:  nqebatch
The goals of the tutorial are as follows:
Create a new NQE scheduler, called simple, that will work alongside the default NQE scheduler (by default called main).
Write a simple scheduler algorithm that schedules tasks to lightweight servers (LWSs) in a round-robin manner. The simple scheduler should also limit the total number of jobs a given LWS can run at any time; in other words, you will implement a global LWS run limit.
Test the scheduler.
Add the enhancement of allowing the run limit to be configurable.
The steps in achieving this are listed below and are described in the following sections:
Create a new scheduler system task object for simple
Create a simple local scheduler file
Write a Tinit callback function to initialize variables
Write a Tentry callback function to handle user tasks newly assigned or newly reassigned to your scheduler
Write a Tpass callback function to schedule user tasks
Write a Tnewowner callback function to update internal counts and lists
Write a Tentry callback function for user tasks returned from an LWS
Run the new scheduler and submit jobs that it will schedule
Add configuration enhancements to the Tinit callback function
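Taken together, these steps implement a round-robin placement loop bounded by a per-LWS run limit. The following Python sketch is illustrative only (the real scheduler is written in Tcl, and the function and variable names here are hypothetical); it shows the core decision each scheduling pass makes:

```python
def schedule_pass(pending, lws_running, counts, rrindex, limit):
    """Return (task, lws, new_rrindex) for the first schedulable task,
    or (None, None, rrindex) if nothing can be placed this pass."""
    if not lws_running:
        # No running LWS: nothing can be scheduled
        return None, None, rrindex
    for task in pending:
        # Rotate the LWS list so each pass starts at a different server
        start = rrindex % len(lws_running)
        ordered = lws_running[start:] + lws_running[:start]
        for lws in ordered:
            if counts.get(lws, 0) >= limit:
                continue  # this LWS is already at its run limit
            # Schedule one task only, then advance the round-robin index
            return task, lws, rrindex + 1
    return None, None, rrindex
```

As in the Tcl version, only one task is placed per pass; the counts themselves are maintained elsewhere (by the Tentry and Tnewowner callbacks) so that they always reflect the NQE database contents.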
It is desirable to develop and test a scheduler without disturbing any production NQE scheduler already running on the system. This can be achieved by creating a new scheduler system task object with the following command:
nqedbmgr create systask scheduler.simple
This command inserts a new system task object into the NQE database.
Note: This action does not start the scheduler; it only creates the system task object.
The output should look similar to the following:
Connected to latte as DB user root (conn=4)
Object scheduler.simple inserted (s6)
Here the local host name is latte. A new system task object has been created with the ID s6. This ID varies depending upon how many system task objects already exist in the NQE database. The system task class is scheduler. Its name is simple (the default scheduler name is main). When test jobs are submitted to the NQE database, you will be able to specify which scheduler should receive, own, and schedule the job.
The system task object can be deleted by using the following commands:
nqedbmgr shutdown scheduler.simple
nqedbmgr delete scheduler.simple
Caution: The delete command should never be issued while the simple scheduler is still running.
For additional information about the nqedbmgr shutdown command, see “Stopping the NQE Database” in Chapter 9.
The following local scheduler files are provided with NQE:
File | Description |
template_sched.tcl | (Located in the /nqebase/bin directory). This is a template file for local schedulers. If run without modifications, this scheduler will accept all tasks but never schedule them. |
local_sched.tcl | (Located in the /nqebase/bin directory). This is the default scheduler supplied. (For more information about the default scheduler, see “Scheduler Attributes” in Chapter 9.) |
simple_sched.tcl | (Located in the /nqebase/examples directory). This is a final version of the simple scheduler described in the remainder of this tutorial and may be used for comparison. |
Copy the template file to a file named simple_sched.tcl in the $NQE_BIN directory. Ensure that the file has write permission because it will be modified in later steps.
The newly created scheduler system task object (step 1) must be configured as follows to read the Tcl file you just created:
nqedbmgr
% update scheduler.simple {s.schedfile simple_sched.tcl}
% exit
This command updates the scheduler.simple system task object, setting the attribute s.schedfile to simple_sched.tcl.
You can list the attributes in scheduler.simple by using the following command:
nqedbmgr select -f scheduler.simple
The scheduler will read this object and source the file specified by this attribute.
It is the responsibility of the local scheduler file, simple_sched.tcl, to do the following:
Define at least four special Tcl functions to perform the scheduling. These special Tcl functions are callbacks used by the scheduler program.
Invoke the Tcl function sched_set_callback for each of the four callbacks to register the functions' names. The syntax is as follows:
sched_set_callback callback functionname
callback may be one of the following:
Function | Description
Tinit | Initialize scheduler
Tentry | Handle new tasks
Tpass | Make a scheduling pass
Tnewowner | Assign a new owner
An example of its use follows:
# Register function fnc_Tinit for callback Tinit
sched_set_callback Tinit fnc_Tinit
The remaining steps describe how to create the functions listed above and explain their purpose.
Write a Tinit callback function.
The Tinit callback is called when the following occurs:
The scheduler starts up.
The scheduler receives the TINIT event. You can send an event to the simple scheduler by using the following nqedbmgr command:
nqedbmgr ask scheduler.simple tinit
The scheduler reconnects to the NQE database after a previous, unexpected disconnection.
The Tinit callback should be used to perform any local initialization required by the scheduler. It returns with a value of 1 if initialization was successful; otherwise, it returns with a value of 0, and the scheduler shuts down.
The following two global Tcl variables are defined before the Tinit callback is invoked:
Obj_config. This is a global Tcl array containing the global configuration object. The elements of the array correspond to the attributes of the NQE database object of type g and name config.main.
The config.main object is read from the NQE database just before the Tinit callback function is called. The object can contain any configuration information required by user commands and system tasks.
For a list of predefined attributes, see “Global Configuration Object”. You may add others.
Obj_me. This global Tcl array contains the attributes corresponding to the scheduler system task object itself.
This array is the main mechanism for passing configuration data to a particular system task.
For a list of predefined attributes, see “System Task Object”.
The first version of the simple scheduler will not make use of the two global Tcl arrays. It will, however, use its own global Tcl array (called Sched) to store all running information concerning the tasks. This array should be unset and initialized at this point. The code for the Tinit callback follows. The variables initialized will be described later. The following code should be added to simple_sched.tcl:
#
# Register the function
#
sched_set_callback Tinit fnc_Tinit

#
# Define function fnc_Tinit
# Arguments: None
#
proc fnc_Tinit {} {
    global Obj_config Obj_me
    global Sched

    sched_log "TINIT: Entered"

    if { [info exists Sched] } {
        #
        # Clear internal information array if it exists
        #
        unset Sched
    }

    # Global run limit per LWS
    set Sched(total_lws_max) 5
    sched_trace "TINIT: Global per-LWS run limit:\
        $Sched(total_lws_max)"

    # Scheduler pass rate (when idle)
    set Sched(passrate) 60

    # The round-robin index
    set Sched(rrindex) 0

    # Initialize lists
    set Sched(pending_list) ""
    set Sched(schedule_list) ""

    return 1
}
Write a Tentry callback function to handle new tasks.
The Tentry callback is called when the following occurs:
A newly submitted user task is detected. New tasks always enter the system with state New.
Upon startup or Tinit, Tentry is called once for each user task already owned by the scheduler.
A user task is completed or fails on an LWS.
The Tentry callback should be used to update local lists and counts, depending upon the state of the user task. The Tentry callback may also change the state and/or owner of the user task.
A local scheduler may define as many states as it wants for the user tasks it owns. Table 10-1 lists the states that must be recognized by the Tentry callback:
Table 10-1. Mandatory Task States Recognized by the Tentry Callback
Task state | Description
New | Indicates the initial state immediately after submission by the user (by using the NQE GUI or the cqsub command).
Scheduled | Indicates that the scheduler has assigned a user task to a specific LWS. The LWS system task interprets this state as meaning "This task is to be submitted to the local NQS." After successful submission, the LWS places the task in Submitted state. However, this latter state is never communicated to the scheduler and does not need to be trapped.
Completed | Indicates that the job request has completed successfully.
Failed | Indicates that the job request cannot be submitted to NQS.
Aborted | Indicates that NQS cannot run the job request associated with this user task. This may occur, for example, if the job request is successfully submitted to an NQS pipe queue but then cannot be placed in a batch queue due to queue limits.
Terminated | Indicates that a user task has been terminated by execution of a delete request (either through the NQE GUI or the cqdel command).
In this step, you will write code to deal with the New state and to define a new local state, Pending, which indicates that the user task has been accepted by this scheduler and is queued.
Replace the template version of fnc_Tentry with the following fnc_Tentry function:
#
# Register the function
#
sched_set_callback Tentry fnc_Tentry

#
# Define function fnc_Tentry
# Argument: objid. Object ID of user task
#
proc fnc_Tentry { objid } {
    global Obj_config Obj_me Us Lwsid
    global Sched

    # Associate global variable Obj_$objid
    # with variable obj in this fnc.
    upvar #0 Obj_$objid obj

    #
    # Set some local variables for ease of use
    #
    set state    $obj(t.state)
    set substate $obj(t.substate)

    # Log a message to scheduler log file
    sched_log "TENTRY: objid = $objid, state = $state/$substate"

    # Look at the state of this task
    switch -regexp $state {

        New {
            # New: Task has just been submitted to NQE by cqsub
            set upd(t.state) Pending
            sched_update $objid upd
        }

        Pending {
            # Pending: Scheduler-specific state

            # Add task to pending_list
            nqe_listrep Sched(pending_list) $objid

            # Acknowledge task
            sched_update $objid

            # Ask for a scheduling pass as soon as possible
            sched_post_tpass
        }
    }
    return
}
The following two additional global Tcl variables are defined before the Tentry callback is invoked:
Us. This global Tcl array contains various scheduling values. It is described in more detail in the next step.
Obj_$objid. Each attribute of each user task object owned by the scheduler is represented in Tcl by an array with a name of the form Obj_$objid; $objid is the object ID of the task (such as t6).
Unfortunately, Tcl arrays with variable names are difficult to access directly in Tcl scripts. Therefore, it is usually more convenient to associate the name of the array with a local name for the duration of the function. This is achieved by using either the Tcl function upvar #0 or the NQE Tcl function globvar. See standard Tcl documentation for a full description of these functions.
The following important functions have been introduced in this step:
Tcl function | Description
sched_log message [severity] | Logs a message to the log file. message is the message to log; severity describes the severity level of the message. If a severity level is provided, the message is prefixed in the log file by a string showing that severity; if no severity level is provided, the message is logged without a prefix. The function does not return any value.
sched_post_tpass | Requests a scheduling pass. This function causes the scheduling algorithm to be called as soon as possible; in other words, as soon as the Tentry callback completes, the Tpass callback is invoked. The function does not return any value.
sched_update objid updatearr or sched_update objid updatelist | This is the main NQE database update function. It must be used to update any attributes in a user task object. objid is the object ID of the object to be updated. updatearr is the name of a Tcl array that contains all the attributes to update; updatelist is the equivalent information expressed as a flat Tcl list of alternating attribute names and values. The function does not return any value. If an error occurs, Tcl traps it and deals with it elsewhere.
Write the Tpass scheduling algorithm.
The Tpass callback is called as follows:
Once upon receipt of Tinit or upon startup
Regularly (as defined by the Tpass callback's return value)
Soon after any call to sched_post_tpass
Note: The Tpass callback is not called when the state of any LWS changes. This means that the scheduler may not react to the startup of a new LWS until the next regular scheduling pass.
The Tpass callback should attempt to find a user task that can be scheduled on an LWS. In other words, the main scheduling algorithm should be coded here.
Only one task should be scheduled in any one pass, and no internal counts or lists should be updated by this function; those updates should be left to the Tentry and Tnewowner callbacks. This ensures that lists and counts are based solely on the contents of the NQE database.
The simple scheduler algorithm is as follows:
#
# Register the function
#
sched_set_callback Tpass fnc_Tpass

#
# Define function fnc_Tpass
# Argument: None
#
proc fnc_Tpass {} {
    global Obj_config Obj_me Us Lwsid
    global Sched

    sched_trace "TPASS..."

    # Skip if there are no lws running
    if { $Us(lwsruncount) <= 0 } {
        sched_trace "TPASS: No running LWS"
        return $Sched(passrate)
    }

    # Loop through tasks
    foreach taskid $Sched(pending_list) {

        # Find an ordered list of target LWS
        set lwstouselist [getrrlist]

        foreach lws $lwstouselist {

            # Has the lws any room?
            if { [info exists Sched(total_$lws)] && \
                 $Sched(total_$lws) >= $Sched(total_lws_max) } {
                # This LWS is already at its limit
                continue
            }

            # Schedule the task to this LWS
            sched_log "TPASS: Scheduling $taskid\
                to $lws ($Lwsid($lws))"
            set upd(t.state)    Scheduled
            set upd(t.sysclass) $Us(lwsclass)
            set upd(t.sysname)  $lws
            sched_update $taskid upd

            # Update round-robin index
            incr Sched(rrindex)

            # Ask for another pass as soon as possible
            return 0
        }
    }
    return $Sched(passrate)
}
The Us global Tcl array contains useful scheduling information. The following array items are defined:
lwsclass, which is the name of the LWS class. This is needed to assign ownership of a user task to an LWS. It is always set to lws.
lwscount, which is set to the number of LWS system tasks configured. Not all LWSs configured are necessarily running.
lwslist, which is set to the names of the LWS system task objects configured. The number of names in this list equals lwscount.
lwsruncount, which is set to the number of LWS system tasks currently running (that is, in Running state).
lwsrunlist, which is set to the names of the LWS system tasks currently running.
monitorclass, which is the name of the monitor class. This is needed to assign ownership of a user task to the monitor when the task is completed. It is always set to monitor.
monitorname, which is the name of the monitor. This is needed to assign ownership of a user task to the monitor. It is always set to main.
schedulerclass, which is the name of the scheduler class. This is needed to assign ownership of a user task to another scheduler. It is usually set to scheduler.
schedulername, which is the name of this scheduler. For the tutorial example scheduler, the name will be simple.
The function to obtain an ordered list of LWSs by the round-robin method (getrrlist) is as follows; this function is in the scheduler released with NQE, in the file local_sched.tcl:
proc getrrlist {} {
    global Us Sched

    # Round-robin tasks to the current active LWS list
    #
    if { $Us(lwsruncount) <= $Sched(rrindex) } {
        #
        # Round-robin index is greater than the current number
        # of active LWSs, reset to 0
        #
        set Sched(rrindex) 0
    }

    # Create the LWS to use list in round-robin order
    #
    return [concat \
        [lrange $Us(lwsrunlist) $Sched(rrindex) end] \
        [lrange $Us(lwsrunlist) 0 [expr $Sched(rrindex) - 1]] \
    ]
}
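For comparison, the same rotation can be expressed as a short Python sketch (illustrative only, not part of NQE): the running-LWS list is split at the round-robin index and the two halves are swapped, with the index wrapping back to 0 when it falls off the end of the list.

```python
def getrrlist(lwsrunlist, rrindex):
    """Return (rotated list, effective index), mirroring the Tcl getrrlist."""
    if len(lwsrunlist) <= rrindex:
        # Index has run past the current number of active LWSs; reset
        rrindex = 0
    # Tail of the list first, then the head, giving round-robin order
    return lwsrunlist[rrindex:] + lwsrunlist[:rrindex], rrindex
```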
Write the Tnewowner callback function.
The Tnewowner callback is called as follows:
Once for every user task owned by any LWS, immediately after Tinit or upon startup.
Whenever the system task ownership of a task changes away from the scheduler. In other words, the Tnewowner callback is invoked after sched_update has been used to assign a user task to an LWS to run or to a monitor to be disposed.
The Tnewowner callback should update any lists based on the new owner of the task. When the function returns, the scheduler will unset the in-memory Tcl global array of the user task because the task will no longer be owned by the scheduler. The following code writes the Tnewowner callback function:
#
# Register the function
#
sched_set_callback Tnewowner fnc_Tnewowner

#
# Define function fnc_Tnewowner
# Arguments: objid  - object ID of task
#            oclass - new owner class
#            oname  - new owner name
#
proc fnc_Tnewowner { objid oclass oname } {
    global Obj_config Obj_me Us Lwsid
    global Sched

    # Associate global variable Obj_$objid with variable obj
    upvar #0 Obj_$objid obj

    lg_log "TNEWOWNER: objid = $objid, owner now $oclass/$oname"

    # See which type of system task now owns the task
    switch $oclass {

        lws {
            #
            # A LWS now owns task. This occurs immediately after
            # changing the owner of the task to a lws (i.e. after
            # scheduling a task to run in Tpass).
            #

            # Remove it from the pending list
            nqe_listrem Sched(pending_list) $objid

            # Add task ID to list of scheduled tasks
            nqe_listrep Sched(schedule_list) $objid

            # Update counts
            if { ! [info exists Sched(total_$oname)] } {
                set Sched(total_$oname) 0
            }
            incr Sched(total_$oname)
            sched_trace "LWS $oname run count increased to\
                $Sched(total_$oname)"
        }

        monitor {
            # Monitor now owns task. This occurs when a task
            # has completed and is never to run again.

            # Remove it from the lists.
            nqe_listrem Sched(pending_list) $objid
            nqe_listrem Sched(schedule_list) $objid
        }
    }
    return
}
The following additional functions were introduced in this section (step 6):
Tcl function | Description
nqe_listrem listname element | Removes an element from a list. This function searches for and deletes an element of a list. No action is performed if the list does not exist or if the element is not found.
nqe_listrep listname element | Adds an element to a list if it is not already present. If the list does not yet exist, this function creates the list and adds the element to it. If the element is already in the list, no action is performed.
Add to the Tentry callback.
When a user task that is owned by an LWS completes or fails in some way, it is reassigned to the scheduler. This action allows the scheduler to update local counts and possibly requeue the task to run again.
Tasks to be rerun enter the scheduler in the same way as new tasks.
In this simple scheduler, the scheduler will assign any task in Completed, Failed, Aborted, or Terminated state to the monitor unmodified. A more sophisticated scheduler could analyze the reasons for completion and perform some other action.
Merge the following code into the Tentry function that you have already defined:
proc fnc_Tentry { objid } {
    ...
    # Note use of -regexp to allow "|" in cases
    switch -regexp $state {

        New {
            ...
        }

        Pending {
            ...
        }

        Completed|Failed|Aborted|Terminated {
            # Task finished
            # Send to the monitor.
            sched_log "$objid: Task has completed ($state)"

            # Update the LWS lists/counts
            # Note that task attribute t.lastlws
            # contains the name of the lws which owned the task
            if { [lsearch -exact $Sched(schedule_list) $objid] \
                 >= 0 } {
                incr Sched(total_$obj(t.lastlws)) -1
                sched_trace "LWS $obj(t.lastlws) run count\
                    reduced to $Sched(total_$obj(t.lastlws))"
            }

            set upd(t.sysclass) $Us(monitorclass)
            set upd(t.sysname)  $Us(monitorname)
            sched_update $objid upd

            sched_post_tpass
        }

        default {
            # Catch-all for unknown states.
            sched_log "$objid: Unknown state,\
                $state. Task disposed"
            sched_update $objid [list t.state $state \
                t.sysclass $Us(monitorclass) \
                t.sysname $Us(monitorname) \
            ]
        }
    }
    return
}
The t.lastlws attribute is defined automatically and contains the name of the LWS that last ran this task.
The following two mechanisms exist for running the scheduler:
Invoke the nqedb_scheduler.tcl script interactively as a command in a terminal by using the following command:
nqedb_scheduler.tcl name
This command is useful while debugging. Logging and tracing output is sent to the screen.
Use the nqedbmgr start scheduler.simple command. This command sets the process in the background and creates a log file in the log directory under NQE's spool directory (see the nqeinfo file variable NQE_SPOOL). This command is useful when the scheduler is in production.
Before invoking the scheduler, set a trace level so that scheduling trace messages are reported; this will aid debugging. Use the nqedbmgr ask or nqedbmgr post event command to set the trace level. For information about trace levels and how to set them, see “Trace Levels”.
This tutorial uses the first mechanism to invoke the scheduler; to do this, simply type the following command:
nqedb_scheduler.tcl simple
The first argument is the name of the scheduler. If no name is supplied, main is assumed. Output and trace messages will appear on the screen.
To submit a job that will be scheduled by scheduler simple, use the scheduler attribute of the cqsub command, as follows:
cqsub -d nqedb -la scheduler=simple
While a local scheduler is being developed, it is likely that the Tcl interpreter will encounter syntax errors.
All Tcl errors are trapped. If the error was due to a failure in the connection to the NQE database server, the scheduler enters a retry state. All other Tcl errors cause the scheduler to abort immediately. A description of the error and a Tcl stack trace are placed in the log file (or printed to the screen if the scheduler is being run interactively).
A significant advantage of an interpreted language such as Tcl is that this type of problem can be corrected quickly.
It is possible to increase the amount of information logged by the scheduler by setting its trace_level attribute. Its value is a mask of the types of tracing required. Values may be separated by commas. The following list provides some useful trace-level names:
Trace-level name | Usage
sched | Local scheduler trace messages; this type of trace is reserved for use by local schedulers.
pr | Process fork/exec tracing.
gen | General system task tracing.
db | NQE database function tracing.
0 | Disable tracing; this is the default trace-level setting.
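The mask semantics described above can be sketched as follows. This is an illustrative Python model only (the scheduler itself interprets trace_level internally); it assumes a message category is traced exactly when its name appears in the comma-separated mask, and that 0 disables tracing:

```python
def trace_enabled(trace_level, category):
    """Model of a comma-separated trace mask such as "sched,db"."""
    levels = {v.strip() for v in trace_level.split(",")}
    if "0" in levels:
        return False        # 0 disables tracing entirely
    return category in levels
```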
You can set the trace level while the system object is running or when it is not running. To set the trace level while the system object is running, use the nqedbmgr ask or the nqedbmgr post event command. The nqedbmgr ask command is the same as the nqedbmgr post event command, except that it waits for a response. The following example sets the trace level to sched on the simple scheduler as it runs:
nqedbmgr ask scheduler.simple trace_level sched
The nqedbmgr post event -o command sets the trace level whether or not the system object is in a Running state. The following example sets the trace level to sched on the simple scheduler whether or not it is running:
nqedbmgr post event -o scheduler.simple trace_level sched
The DUMP_VARS event causes the contents and names of a running scheduler's global Tcl variables to be logged. The event value is a glob pattern to match the global Tcl variable name or value.
For example, to ask the simple scheduler to dump out the contents of the Sched array, use the following command:
nqedbmgr ask scheduler.simple dump_vars Sched
To dump out all variables, use the following command:
nqedbmgr ask scheduler.simple dump_vars "*"
The quotation marks are needed if the command is passed on a UNIX command line.
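The selection performed by DUMP_VARS can be modeled with shell-style glob matching. The sketch below is illustrative only (it uses Python's fnmatch and a hypothetical select_vars helper, not NQE code) and assumes standard glob semantics, where * matches any sequence of characters:

```python
from fnmatch import fnmatch

def select_vars(names, pattern):
    """Return the variable names matched by a DUMP_VARS-style glob."""
    return [n for n in names if fnmatch(n, pattern)]
```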
The following simple addition to the Tinit callback allows modification of configuration:
proc fnc_Tinit {} {
    global Obj_config Obj_me
    ...
    # Global run limit per LWS
    set Sched(total_lws_max) 5
    catch {set Sched(total_lws_max) $Obj_me(total_lws_max)}
    sched_trace "TINIT: Global per-LWS run limit:\
        $Sched(total_lws_max)"

    # Scheduler pass rate (when idle)
    set Sched(passrate) 60
    catch {set Sched(passrate) $Obj_me(passrate)}
    ...
    return 1
}
The total_lws_max and passrate attributes can be overridden by setting attributes with the same names in the system task's object (Obj_me). The simple scheduler loads its own system task object on startup or after receiving the TINIT event.
Update the system task's object as follows:
nqedbmgr
% update scheduler.simple {total_lws_max 2}
% exit
Restart the scheduler or post the Tinit event. The new maximum should now be in force.
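The catch-based idiom above amounts to "use a built-in default unless the system task object carries an attribute of the same name." A Python model of that pattern, for illustration only (the function name and the dictionary representation of Obj_me are hypothetical):

```python
def load_sched_config(obj_me):
    """Built-in defaults, overridden by same-named system task attributes."""
    cfg = {"total_lws_max": 5, "passrate": 60}  # built-in defaults
    for key in cfg:
        if key in obj_me:
            # Attribute was set via nqedbmgr update; it takes precedence
            cfg[key] = int(obj_me[key])
    return cfg
```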
This section describes some common errors you may experience and provides suggested solutions.
FATAL: Failure occurred. Was not loss of connection FATAL: Don't know what to do here. Exiting | |
This shows a general problem encountered by a system task. A Tcl error description and Tcl stack trace follow the messages. | |
FATAL: TINIT reports fatal error | |
The local scheduler Tinit function returned 0, signifying that it could not initialize correctly. Check the local scheduler's Tinit code. | |
FATAL: System object abc.xyz must be created before starting up this system task | |
An attempt was made to start a system task before its system task object was inserted into the database. Insert its object by using the nqedbmgr create command before starting the system task. | |
FATAL: System task file lock could not be obtained. Currently held by PID xxxx INFO: Another system task with ID sxx may be running | |
This signifies that a system task of the same ID is already running on this host. Check for a process with this PID on the system. | |
FATAL: Configured time syntax for "xyz" is bad: expr | |
A global or system task attribute representing a time value (for example, purge.task_age) is not a valid mathematical expression. Check the attribute (using nqedbmgr select -f obj) and modify it. The system task should be restarted. | |
WARN: Callback Tpass is not defined. No regular scheduling passes will occur | |
Your local scheduler has not registered a Tpass function. This means that no regular scheduling passes can ever occur. Edit the scheduler, and either restart the scheduler or send it a TINIT event. | |
WARN: System object xyz host name attribute (hostA) different from this host (hostB)

This warning is issued if a given system task is started on a machine other than the one used for its previous invocation (in this case, system file locks cannot be used to ensure that multiple copies of the system task are not running). It may occur as a result of moving the mSQL database to another machine.
ERROR: DB_CONNECT: Failed. Retry in 10 secs...

The system task could not connect to the NQE database. This may be because the mSQL daemon is not running on the server host, or because the MSQL_SERVER or MSQL_TCP_PORT settings are incorrect.
ERROR: Cannot find the user scheduler file: xyz

You have specified a local scheduler file that does not exist. Check the nqeinfo file variable NQEDB_SCHEDFILE and the scheduler's system task attribute, s.schedfile (which takes precedence), for the file name given. If the file name is not an absolute path name, it is assumed to be relative to the NQE bin directory.
INFO: Interrupt occurred

An interrupt was received by the system task. This generally occurs when the scheduler is being run interactively and CONTROL-C is pressed. The scheduler shuts down and must be restarted.
This section lists and describes all predefined attribute names for the nqeinfo file and for NQE database objects.
Table 10-2 describes the variables that may be defined in the nqeinfo file, along with their default values where applicable.
Table 10-2. Variables That May Be Defined in the nqeinfo File
Name of variable | Description | Notes | |
---|---|---|---|
MSQL_DBDEFS | Location and name of mSQL table definitions for NQE database. The file specified by this variable is used as input to mSQL to create tables for the first time. Once the NQE database is created, the file is no longer required. | Default value is $NQE_BIN/nqedb.struct. Needed by the node configured as the NQE database server only. | |
MSQL_DBNAME | Name of mSQL database. The mSQL daemon may serve multiple databases. Each database comprises one or more SQL tables. After connecting, NQE database clients such as cqsub and system tasks will select a database based on this name. | Default value is nqedb. Value must not be greater than 8 characters in length. Needed by the node configured as the NQE database server only. | |
MSQL_HOME | Location of mSQL database files. This directory specifies the location on disk of the data stored by the mSQL daemon. | Default value is $NQE_SPOOL. mSQL automatically uses the msqld directory below this. Needed by the node configured as the NQE database server only. | |
MSQL_SERVER | Name of mSQL daemon server host. This variable contains the name (plus optional colon followed by TCP port) of the host that is running the mSQL daemon. Only master server machines can run the daemon and only one should be defined for the network. | No default is defined. May be overridden by setting the environment variable MSQL_SERVER. Needed by all NQE machines. | |
MSQL_TCP_PORT | Name of mSQL daemon server port. This contains the name or number of the TCP port on which the mSQL daemon listens. | Mandatory if no port is specified in MSQL_SERVER. Default is 603. May be overridden by setting the port number in the MSQL_TCP_PORT environment variable. If MSQL_SERVER is set with a port number, the value of MSQL_TCP_PORT is ignored. Needed by all NQE machines. | |
NQEDB_IDDIR | Name of directory to contain the LWS ID files. The ID directory contains files used for communication between an NQS and its local LWS system task. | Default is $NQE_SPOOL/nqedb_iddir. Needed by nodes configured as the NQE database server, NQS server, and LWS. | |
NQEDB_SCHEDFILE | File specification of local scheduler Tcl script. This variable may be set to give an alternative name (relative to NQE_BIN) or the full path name of the Tcl script used to implement local scheduling algorithms. | Default if not specified is local_sched.tcl. Overridden by defining the attribute s.schedfile in the scheduler's system task object. Used by the node configured as the Scheduler. |
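As an illustration of how these variables fit together, a node acting as the NQE database server might carry nqeinfo entries like the following. The host name, port, and paths are placeholders, and the exact set of variables needed depends on which roles the node plays:

```
MSQL_SERVER=dbhost:603
MSQL_DBNAME=nqedb
MSQL_HOME=/usr/spool/nqe
NQEDB_IDDIR=/usr/spool/nqe/nqedb_iddir
```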
The event object type is e. Event objects are used to communicate among tasks (both system and user). Attributes are described in Table 10-3:
Table 10-3. Event Objects Attributes
Name of attribute | Description | Notes | |
---|---|---|---|
type | Object type (e). | |
e.ack | Records whether the target of this event has acknowledged it. 0 means the event has not yet been acknowledged; 1 means the target has acknowledged it. | |
e.name | A name for this event. | See Table 10-4, for list of predefined event names. | |
e.response | Optional response associated with the event. This is set when an event is acknowledged. | If not explicitly set, is set to "". | |
e.targid | Object ID of the target of this event. It may be a user task or system task ID. | |
e.targtype | Object type of the target of this event. | |
e.value | Optional value for event. | If not explicitly set, is set to "". | |
id | Unique ID for the event. In mSQL, this is of the form en, where n is an integer. | Allocated when the event is first inserted into the NQE database. |
sec.ins.dbuser | NQE database user name of the originator of this event. | |
timein | Time object was inserted into the NQE database. | |
timeup | Time object was last updated. | |
Table 10-4 lists the events that are defined by default:
Table 10-4. Events Defined by Default
Event name | Value | Description | |
---|---|---|---|
DUMP_VARS | Glob pattern for variables to dump | Dump to the log all variable names and values matching the glob pattern supplied. For debugging. | |
SIGDEL | Signal name or number | Send the signal to the process if it is running. Otherwise, delete the process. This event is posted by qdel/cqdel. | |
SHUTDOWN | None | Shut down the system task. | |
TRACE_LEVEL | Trace level | Set the trace level of the target system object to the value passed. | |
TINIT | None | Reinitialize the system task. This event causes the system task to leave the Running state and enter the Connected state, where it clears and reloads all information. The principal use of this event is to restart the scheduler. |
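From a Tcl script that is connected to the database, events such as TINIT can be delivered with the nqedb_insert_event built-in listed in Table 10-8. The sketch below is hypothetical: the argument order shown (target type, target ID, event name, event value) is an assumption, and s1 is a made-up system task ID.

```tcl
# Hypothetical sketch: post a TINIT event to the scheduler system
# task (object ID s1 here). The real nqedb_insert_event signature
# may differ; it mirrors the NQE database C APIs.
nqedb_insert_event s s1 TINIT ""

# The target acknowledges the event by setting e.ack to 1 and,
# optionally, setting e.response.
```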
The user task object type is t. This object is used to contain all information concerning a job to run on NQS. Attributes are described in Table 10-5:
Table 10-5. User Task Object Attributes
Name of attribute | Description | Notes | |
---|---|---|---|
att.account | Account name or project name. | cqsub option: -A | |
att.compartment | Compartment names. | cqsub option: -C | |
att.seclevel | Security levels. | cqsub option: -L | |
att.time.sched | Request run time. | cqsub option: -a | |
file.1.name | Standard output file specification name. | cqsub option: -o | |
file.2.name | Standard error file name. | cqsub option: -e | |
file.3.name | Job log file name. | cqsub option: -j | |
file.eo | Merge standard error and standard output files. | cqsub option: -eo | |
file.jo | Job log control (-J m|n|y). | cqsub option: -J | |
file.ke | Keep standard error file on execution machine. The option is set if value is 1 and is not set if value is 0 or attribute is not set. | cqsub option: -ke | |
file.kj | Keep job log file on execution machine. The option is set if value is 1, and is not set if value is 0 or the attribute is not set. | cqsub option: -kj | |
file.ko | Keep standard output file on execution machine. The option is set if value is 1, and is not set if value is 0 or attribute is not set. | cqsub option: -ko | |
id | Unique ID for the task. In mSQL, this will be of the form tn, where n is an integer. | Mandatory attribute. Allocated when the task is first inserted into the NQE database. | |
lim.core | Core file size limit. | cqsub option: -lc | |
lim.data | Data segment size limit. | cqsub option: -ld | |
lim.mpp_m | Per-request MPP memory size limit. | cqsub option: -l mpp_m | |
lim.mpp_p | Per-request MPP processor elements. | cqsub option: -l mpp_p | |
lim.p_mpp_m | Per-process MPP memory size limit. | cqsub option: -l p_mpp_m | |
lim.mpp_t | Per-request MPP time limit. | cqsub option: -l mpp_t | |
lim.p_mpp_t | Per-process MPP time limit. | cqsub option: -l p_mpp_t | |
lim.ppcpu | Per-process CPU time limit. | cqsub option: -lt | |
lim.ppfile | Per-process permanent file size limit. | cqsub option: -lf | |
lim.ppmem | Per-process memory size limit. | cqsub option: -lm | |
lim.ppnice | Per-process nice value. | cqsub option: -ln | |
lim.pptemp | Per-process temporary file size limit. | cqsub option: -lv | |
lim.prcpu | Per-request CPU time limit. | cqsub option: -lT | |
lim.prfile | Per-request permanent file size limit. | cqsub option: -lF | |
lim.prmem | Per-request memory size limit. | cqsub option: -lM | |
lim.prq | Per-request quick file space limit. | cqsub option: -lQ | |
lim.prtemp | Per-request temporary file size limit. | cqsub option: -lV | |
lim.shm | Per-request shared memory limit. | cqsub option: -l shm_lim | |
lim.shm_seg | Per-request shared memory segments. | cqsub option: -l shm_seg | |
lim.stack | Stack segment size limit. | cqsub option: -ls | |
lim.tape.a | Per-request drive limit for tape type a. | cqsub option: -lUa | |
lim.tape.b | Per-request drive limit for tape type b. | cqsub option: -lUb | |
lim.tape.c | Per-request drive limit for tape type c. | cqsub option: -lUc | |
lim.tape.d | Per-request drive limit for tape type d. | cqsub option: -lUd | |
lim.tape.e | Per-request drive limit for tape type e. | cqsub option: -lUe | |
lim.tape.f | Per-request drive limit for tape type f. | cqsub option: -lUf | |
lim.tape.g | Per-request drive limit for tape type g. | cqsub option: -lUg | |
lim.tape.h | Per-request drive limit for tape type h. | cqsub option: -lUh | |
lim.work | Working set size limit. | cqsub option: -lw | |
mail.begin | Send mail upon beginning execution. The option is set if value is 1 and is not set if value is 0 or the attribute is not set. | cqsub option: -mb | |
mail.end | Send mail upon ending execution. The option is set if value is 1, and is not set if value is 0 or the attribute is not set. | cqsub option: -me | |
mail.rerun | Send mail upon request rerun. The option is set if value is 1, and is not set if value is 0 or the attribute is not set. | cqsub option: -mr | |
mail.transfer | Send mail upon request transfer. The option is set if value is 1, and is not set if value is 0 or the attribute is not set. | cqsub option: -mt | |
mail.user | User name to which mail is sent; default is the user submitting the request. | cqsub option: -mu | |
nqs.export | Export environment variables. The option is set if value is 1, and is not set if value is 0 or the attribute is not set. | cqsub option: -x | |
nqs.nochkpnt | Do not checkpoint; not restartable. The option is set if value is 1, and is not set if value is 0 or the attribute is not set. | cqsub option: -nc | |
nqs.norerun | Do not rerun. The option is set if value is 1, and is not set if value is 0 or the attribute is not set. | cqsub option: -nr | |
nqs.pri | Intraqueue request priority. | cqsub option: -p | |
nqs.queue | Queue name for submission. If not set or set to "default", the LWS will use a default queue (defined in its system task attribute default_queue). | cqsub option: -q | |
nqs.re | Writes to the standard error file while the request is executing. The option is set if value is 1, and is not set if value is 0 or attribute is not set. | cqsub option: -re | |
nqs.ro | Writes to the standard output file while the request is executing. The option is set if value is 1, and not set if value is 0 or attribute is not set. | cqsub option: -ro | |
nqs.shell | Batch script shell name; initial shell when a job is spawned. | cqsub option: -s | |
sec.ins.clienthost | Client host name. | Host name used for file validation; name must be in the .rhosts or .nqshosts file. | |
sec.ins.clientuser | Client user name. | User name used for file validation; name must be in the .rhosts or .nqshosts file. | |
sec.ins.dbuser | NQE database user name of originator of this task. | ||
sec.ins.hash | Hash value for sec.ins.* values. | |
sec.ins.tuser | Target user specification. | |
sec.ins.tuser.pwe | Target user specification (encrypted). | |
sec.target.user | Target user for a particular LWS. | This attribute is created by the LWS after the task is submitted locally. | |
script.0 | Script file contents. | This may be a very long attribute and may not be displayable by nqedbmgr. | |
t.ack | This field is used by the system to record when a system task has acknowledged the existence of the user task object it now owns. 0 means that this task has changed owner but that the new owner has not acknowledged ownership. 1 means that the new owner has acknowledged ownership. | |
t.class | The class name for this task. | Always set to USER. | |
t.failedlws | List of LWS names where this job has failed. This may be used to determine whether the task should be rescheduled to a given LWS. | |
t.failure | Failure message. | |
t.lastlws | Name of LWS that last submitted this task to NQS. This is updated by the LWS when the task is completed or has failed. | This is set by the scheduler when passing ownership to an lws. | |
t.lastsched | System task name of the scheduler that owned and scheduled the task. | This is set by the scheduler when passing ownership to an lws. | |
t.message | Messages concerning success or failure of the job. | |
t.name | A name for this task. The name is not used except for information. By default, it is set to the name of the NQS job, as specified by -r. | cqsub option: -r | |
t.reqid | The current NQS request ID for this task. This ID may change if the task is submitted to one LWS's NQS, fails, and is then submitted to another. | Not set until task has been submitted to NQS. | |
t.state | State of task. This serves both as information and as a checkpoint for the system task that owns this task. When changing owner from one system task to another, the state must be set to a value known to the new system task. | |
t.substate | Substate. | May be "". | |
t.sysclass | Name of class of system object that owns this task. The class may be scheduler, lws, or monitor. | |
t.sysname | Name of system object that owns this task. The name is main for the scheduler and monitor and the name of the host for the lws. | |
timein | Time at which the task was inserted into the NQE database. | |
timeup | Time at which the task was last updated. | |
type | Task type (t). | Always set to t for user tasks. | |
user.attr_name | Job attributes. | cqsub option: -la |
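To see how these attributes originate, consider a submission such as the following (the request name, queue, limit, and file names are examples only):

```
cqsub -r myjob -q batch -lt 60 -o myjob.out -e myjob.err jobscript
```

This request arrives in the NQE database as a user task object with t.name set to myjob, nqs.queue to batch, lim.ppcpu to 60, file.1.name to myjob.out, and file.2.name to myjob.err.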
The system task object type is s. One system task object is created for each system task that is to run in this system. On a typical system, there is one scheduler system task, one monitor system task, and one LWS system task for each NQS server node. Attributes are described in Table 10-6:
Table 10-6. System Task Object Attributes
Name of attribute | Description | Notes | |
---|---|---|---|
default_queue | Provides the name of the NQS queue to which jobs are submitted by default. | Used by the LWS only. | |
id | Unique ID for the object. In mSQL, this will be of the form sn, where n is an integer. | Mandatory attribute. Allocated when the task is first inserted into the NQE database. | |
qmgrnow | If set and true, requests that the LWS follow any submission to NQS with the command qmgr>scheduler request xyz now, where xyz is the NQS request ID. This forces NQS to run a job, even if local NQS scheduling parameters deny it. | LWS specific. | |
run_mechanism | Defines the validation mechanism used to check the target user against the client user for local submission to NQS. | Used only by the LWS. If it is set to File, LWS checks for local ~/.nqshosts or ~/.rhosts file; if it is set to Passwd, LWS checks for target password. | |
s.class | The class name of this system task. Possible values are scheduler, monitor, and lws. | |
s.host | Name of the host on which this system task is running. | |
s.name | The system task's name. This is main for the scheduler and monitor and is, by default, the host name for the LWS. | |
s.pid | UNIX PID of the system task. | Set to -1 when the system task exits. |
s.schedfile | Local scheduler (user scheduler) file. The file may include the absolute path name or the path name relative to the NQE bin directory. This attribute is always recognized, no matter which local scheduler is running. | Used by the scheduler only. Default is local_sched.tcl. See “Step 2: Create the Simple Scheduler File”, for usage. |
s.state | State of the system task (for example, Connected or Running). | |
sec.ins.dbuser | NQE database user name of the originator of this task. For a system task, this will usually be root. | ||
trace_level | Current trace level for the system task. | Default setting is 0. |
The global configuration object type is g. This special object type exists for the purposes of containing global configuration data. There is only one instance of this object type in the NQE database. Attributes are described in Table 10-7:
Table 10-7. Global Configuration Object Attributes
Name of attribute | Description | Notes | |
---|---|---|---|
hb.systask_timeout _period | Time before monitor system task marks a given system task as Down. The value may be a mathematical expression. Units are seconds. | Default value is 90. | |
hb.update_period | Time between each update of the heartbeat. Each system task will update the NQE database within this period. The heartbeat is used by the monitor to determine whether a system task is still alive. The value may be a mathematical expression. Units are seconds. | Default value is 30. | |
id | Unique ID for the object. In mSQL, this will be of the form gn, where n is an integer. | Allocated when the object is first inserted into the NQE database. | |
purge.systask_age | The monitor system task will delete (purge from the NQE database) all acknowledged events concerning system tasks that have not been updated within the specified time period. The value may be a mathematical expression. Units are seconds. | Default value is 24*60*60. |
purge.task_age | The monitor system task will delete (purge from the NQE database) all user tasks that are finished and have not been updated within the specified time period. The value may be a mathematical expression. Units are seconds. | Default value is 24*60*60. |
s.class | Object class (config). | |
s.name | Object name (main). | |
sec.ins.dbuser | NQE database user name of the originator of this object (usually root). | ||
status.access | Type of access granted to users when attempting to obtain summary information about jobs in the NQE database. Value can be all (all users may obtain summary information about all tasks) or own (users may obtain summary information only for their own tasks). In either case, a user may only see full information for tasks submitted by that user. | Default value is all. |
sys.ev_rate | Rate at which system tasks check for new events targeted at them or at tasks they own. This attribute may be a mathematical expression. Units are seconds. | Default value is 10. | |
sys.purge_rate | Rate at which the monitor checks the age of a completed job. This attribute may be a mathematical expression. Units are seconds. | Default value is 5*60. | |
sys.st_rate | Rate at which system tasks check for new tasks, changes in task state, and so on. This attribute may be a mathematical expression. Units are seconds. | Default value is 10. |
sys.upd_rate | Rate at which LWS system tasks check for messages from NQS. This attribute may be a mathematical expression. Units are seconds. | Default value is 10. |
timehb | Time of last heartbeat. | |
timein | Time at which the object was inserted into the NQE database. | |
timeup | Time at which the object was last updated. | |
type | Object type (g); always set to g for the global configuration object. | |
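Global attributes can be inspected and changed with the same nqedbmgr syntax used earlier for the scheduler's system task object. The object name below is an assumption (following the class.name pattern, the single global configuration object would be config.main); use select to confirm the actual object ID before updating:

```
nqedbmgr
% select -f config.main
% update config.main {purge.task_age 48*60*60}
% exit
```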
Tcl was chosen for NQE scheduling in order to provide a simple, standard interpreted language with which administrators can write schedulers. It is very portable and can be interfaced to C very easily.
The manager command, nqedbmgr, is simply this Tcl shell invoked interactively. All nqedbmgr commands are Tcl functions (defined in nqedbmgr.tcl).
To use Tcl effectively in system tasks, various built-in functions (that is, functions written in C and linked into Tcl) were implemented. Table 10-8 briefly describes these functions.
Table 10-8. Tcl Built-in Functions
Tcl built-in function | Description | |
---|---|---|
globvar | Because Tcl arrays with variable names are difficult to access directly in Tcl scripts, globvar is used to associate the name of the array with a local name for the duration of the function. | |
lg_init, lg_config, lg_log, lg_trace | Logging and tracing functions. | |
pr_create, pr_set, pr_get, pr_execute, pr_signal, pr_wait, pr_clear | Provide a mechanism for forking and execing processes, and changing context to another user. | |
nqe_after, nqe_pause | Implementation of a simple asynchronous timing loop. These functions are used to give the system tasks event-driven characteristics. | |
nqe_curtime | Returns the number of seconds elapsed since the UNIX epoch. | |
nqe_daemon_init | Initializes the caller as a daemon process. It calls setsid() to create a new session with no controlling tty, and it ignores SIGHUP signals. | |
nqe_strtime | Converts a UNIX epoch time to a human-readable string. | |
nqe_lock, nqe_locked, nqe_unlock | File locking functions. | |
nqe_pipe | UNIX pipe() implementation. | |
nqe_rmfile | Unlink or remove a file. | |
nqedb_connect, nqedb_insert, nqedb_select, nqedb_getobj, nqedb_update, nqedb_delete, nqedb_disconnect, nqedb_insert_event, nqedb_insert_task, nqedb_cancel_event | Tcl implementations of the NQE database C APIs. | |
nqe_getenv, nqe_getvar | Obtain environment variable/ nqeinfo file variables. | |
nqe_getid | Returns effective/real UIDs and GIDs. | |
nqe_gethostname | Returns local host name. |
Table 10-9 lists other useful functions implemented in Tcl:
Table 10-9. Additional Functions Implemented in Tcl
Tcl function | Description | |
---|---|---|
nqe_listrem listname element | Removes an element from a list. This function searches for and deletes an element of a list. No action is performed if the list does not exist or if the element is not found. | |
nqe_listrep listname element | Replaces or adds an element in a list. This function adds the element if the list does not yet exist or if the element is not in the list; otherwise, it performs no action. | |
sched_log message [severity] | Logs a message to the log file. message is the message to log; severity is an optional severity level (one of the levels seen in the log, such as INFO, WARN, ERROR, and FATAL). If a severity level is provided, the message is prefixed in the log file with a string showing that severity; if not, the message is logged without a prefix. There is no return value. | |
sched_post_tpass | Requests a scheduling pass. This function causes the scheduling algorithm (Tpass) to be called as soon as possible. There is no return value. | |
sched_set_callback callback functionname | Registers the Tcl function functionname with the callback callback. callback names a scheduler entry point such as Tinit or Tpass. There is no return value. | |
sched_update objid updatearr or sched_update objid updatelist | This is the main NQE database update function. It must be used to update any attributes in a user task object. objid is the object ID of the object to be updated. updatearr is the name of a Tcl array that contains all the attributes to update; alternatively, you may pass a list of attribute names and values in updatelist. The function does not return a value. If an error occurs, Tcl traps it and handles it elsewhere. |
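A typical use of these functions together is the point at which the scheduler passes ownership of a task to an LWS during a scheduling pass. The sketch below is illustrative only: the state name Submitting and the LWS name hostA are assumptions, although the attributes themselves (t.sysclass, t.sysname, t.state) are those listed in Table 10-5.

```tcl
# Illustrative sketch: hand a user task to the LWS on hostA.
# sched_update must be used for all updates to user task objects.
proc pass_task_to_lws {taskid} {
    sched_update $taskid {
        t.sysclass lws
        t.sysname  hostA
        t.state    Submitting
    }
    sched_log "task $taskid passed to lws hostA" INFO
}
```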