<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom"><title>jacked.in</title><link href="http://jacked.in" /><updated>2010-07-11T01:05:37Z</updated><id>http://jacked.in</id><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/jackedin" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="jackedin" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><entry><title>Moab / Torque Primer Part 2: Resources</title><author><name>Oliver Baltzer</name></author><link href="/2009/11/moab-torque-primer-part-2-resources/" /><updated>2009-11-08T23:53:00Z</updated><published>2009-11-08T23:53:00Z</published><id>/2009/11/moab-torque-primer-part-2-resources/</id><content type="html">
       
&lt;p&gt;
The main purpose of Torque and Moab is to manage, monitor and schedule
resources. At a very high level the most obvious types of resources are
compute nodes. Each of these compute nodes, however, contributes resources
at a much finer level to the resource pool, and resource monitors such as
Torque often monitor processors, main memory, storage, etc. In addition to
these rather common resource types, nodes may also be associated with other
more specific resource types, such as network bandwidth, software licenses
or power consumption. However, for a general understanding, considering
compute nodes and processors as resources is&amp;nbsp;sufficient.
&lt;/p&gt;

&lt;p&gt;
Torque is responsible for monitoring the resources provided by compute
nodes. Each compute node runs a small daemon program called
&lt;code&gt;pbs_mom&lt;/code&gt; which collects all relevant resource information on
the compute node it is running on and sends this information to Torque&amp;#8217;s
&lt;code&gt;pbs_server&lt;/code&gt; which typically runs on the headnode of the
cluster. The &lt;code&gt;pbs_server&lt;/code&gt; then stores this information in an
internal database which can be used by schedulers such as Moab to get
real-time information about the availability of&amp;nbsp;resources. 
&lt;/p&gt;

&lt;p&gt;
Torque provides access to its list of available resources through the
&lt;a href="http://linux.die.net/man/8/pbsnodes"&gt;&lt;code&gt;pbsnodes&lt;/code&gt;&lt;/a&gt;
command. Without any command-line arguments the command lists all nodes
that Torque knows about and the resources it monitors for each&amp;nbsp;node:
&lt;/p&gt;

&lt;pre&gt;
$ pbsnodes
compute-0-0.local
     state = job-exclusive
     np = 8
     ntype = cluster
     jobs = 0/45.cluster.local, 1/46.cluster.local, 2/47.cluster.local, 3/48.cluster.local, 4/49.cluster.local, 5/50.cluster.local, 6/51.cluster.local, 7/52.cluster.local
     status = opsys=linux,uname=Linux compute-0-0.local 2.6.9-42.ELsmp #1 SMP Tue Aug 15 10:35:26 BST 2006 x86_64,sessions=9878 9881 9882 9883 9889 9918 9924 9927,nsessions=8,nusers=1,idletime=16682405,totmem=12258160kb,availmem=10771408kb,physmem=8161596kb,ncpus=8,loadave=9.43,netload=1074777221096,state=free,jobs=31.cluster.local 46.cluster.local 45.cluster.local 47.cluster.local 49.cluster.local 48.cluster.local 50.cluster.local 51.cluster.local 52.cluster.local,varattr=,rectime=1250282323

...

compute-0-3.local
     state = free
     np = 8
     ntype = cluster
     status = opsys=linux,uname=Linux compute-0-3.local 2.6.9-42.ELsmp #1 SMP Tue Aug 15 10:35:26 BST 2006 x86_64,sessions=? 0,nsessions=? 0,nusers=0,idletime=27314517,totmem=12258140kb,availmem=11926516kb,physmem=8161576kb,ncpus=8,loadave=0.00,netload=4289316359815,state=free,jobs=972217.cluster.local,varattr=,rectime=1250282329
&lt;/pre&gt;

&lt;p&gt;
For each node it lists the node&amp;#8217;s state, the number of processors provided
by the node, as well as other attributes such as available memory or jobs
currently executed on the node. In most cases the default
&lt;code&gt;pbsnodes&lt;/code&gt; output is too verbose and a number of command-line
switches can help filter the&amp;nbsp;information:
&lt;/p&gt;
&lt;ul&gt;
 &lt;li&gt;&lt;code&gt;pbsnodes -l -n&lt;/code&gt; lists all nodes marked with states
   &lt;em&gt;offline&lt;/em&gt;, &lt;em&gt;down&lt;/em&gt;, or &lt;em&gt;unknown&lt;/em&gt; and the respective
   user-defined comments associated with these nodes:
&lt;pre&gt;
$ pbsnodes -l -n
compute-0-0.local   offline                    reimaging
compute-0-5.local   offline                    blinking HDD
compute-0-7.local   down,offline               replacing CPU
&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
  &lt;code&gt;pbsnodes -l free&lt;/code&gt; lists all nodes that are free, i.e. which can
   potentially run jobs:
&lt;pre&gt;
$ pbsnodes -l free
compute-0-1.local   free
compute-0-2.local   free
compute-0-3.local   free
compute-0-4.local   free
compute-0-6.local   free
&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
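&lt;p&gt;
For a quick overview of the whole cluster, the per-node listing can be
condensed into a count of nodes per state. The following is a minimal sketch
and not part of Torque: the helper name &lt;code&gt;count_states&lt;/code&gt; is made
up for this example, and it assumes the &lt;code&gt;state = ...&lt;/code&gt; line
layout shown in the &lt;code&gt;pbsnodes&lt;/code&gt; output above.
&lt;/p&gt;
&lt;pre&gt;
# Hypothetical helper: reads pbsnodes-style output on stdin and prints
# one line per node state together with the number of nodes in that state.
count_states() {
    awk '$1 == "state" { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# Intended usage on a Torque head node:
#   pbsnodes | count_states
&lt;/pre&gt;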
&lt;p&gt;
In addition to reporting node states, &lt;code&gt;pbsnodes&lt;/code&gt; may be used to
set a node state manually. This is useful for maintenance or debugging of
hardware or software problems. &lt;code&gt;pbsnodes -o [nodename]&lt;/code&gt; is used to
mark a node as &lt;em&gt;offline&lt;/em&gt;. This allows jobs that may currently be
running on this node to complete, but no new jobs will be allocated to the
node. In Moab terminology the node is considered to be &lt;em&gt;drained&lt;/em&gt; if
it is marked as &lt;em&gt;offline&lt;/em&gt;. Additionally, when putting a node into an
&lt;em&gt;offline&lt;/em&gt; state, a comment or note should always be provided, such
that the reason for taking the node &lt;em&gt;offline&lt;/em&gt; can be easily
determined. The user-defined comment can be set with the &lt;code&gt;-N&lt;/code&gt;
command line&amp;nbsp;option:
&lt;/p&gt;

&lt;pre&gt;
$ pbsnodes -o -N "installing new hard-drive" compute-0-5.local
$ pbsnodes -l -n
compute-0-5.local   offline                    installing new hard-drive
&lt;/pre&gt;
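
&lt;p&gt;
Because the note is easy to forget, the two options can be wrapped so that a
reason is always required. This is only a sketch: the function name
&lt;code&gt;drain_node&lt;/code&gt; is an assumption made for illustration, while the
&lt;code&gt;pbsnodes -o -N&lt;/code&gt; call is the one shown above.
&lt;/p&gt;
&lt;pre&gt;
# Hypothetical wrapper: marks a node offline and refuses to do so
# unless both a node name and a non-empty reason are given.
drain_node() {
    node=$1
    reason=$2
    if [ -z "$node" ] || [ -z "$reason" ]; then
        echo "usage: drain_node NODE REASON"
        return 1
    fi
    pbsnodes -o -N "$reason" "$node"
}

# Intended usage:
#   drain_node compute-0-5.local "installing new hard-drive"
&lt;/pre&gt;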

&lt;p&gt;Once a node can be placed back into the pool of available resources its
&lt;em&gt;offline&lt;/em&gt; state needs to be cleared. This is done by using the
&lt;code&gt;-c&lt;/code&gt; option to &lt;code&gt;pbsnodes&lt;/code&gt;. At the same time the
comment can be removed from the node using &lt;code&gt;-N ""&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;
$ pbsnodes -c -N "" compute-0-5.local
$ pbsnodes -l -n free
...
compute-0-5.local    free
...
$ pbsnodes compute-0-5.local
compute-0-5.local
     state = free
     np = 8
     ntype = cluster
     status = opsys=linux,uname=Linux compute-0-5.local 2.6.9-42.ELsmp #1 SMP Tue Aug 15 10:35:26 BST 2006 x86_64,sessions=18901,nsessions=1,nusers=1,idletime=36154599,totmem=12258140kb,availmem=11888652kb,physmem=8161576kb,ncpus=8,loadave=0.01,netload=3655128653370,state=free,varattr=,rectime=1250283674
&lt;/pre&gt;

&lt;p&gt;
In addition to Torque&amp;#8217;s tools to manage nodes, Moab also provides tools to
gain insight into Moab&amp;#8217;s view of the nodes and the resources they
provide. This is done using the &lt;a
href="http://www.clusterresources.com/products/mwm/docs/commands/checknode.shtml"&gt;&lt;code&gt;checknode&lt;/code&gt;&lt;/a&gt;&amp;nbsp;command.
&lt;/p&gt;
&lt;p&gt;
The &lt;code&gt;checknode&lt;/code&gt; command is only used to report information about
a node and cannot be used to modify any of the node&amp;#8217;s parameters. A typical
command output for a busy node looks&amp;nbsp;like:
&lt;/p&gt;

&lt;pre&gt;
$ checknode compute-0-3.local
node compute-0-3.local
State:      Busy  (in current state for 00:32:26)
&lt;strong&gt;Configured Resources: &lt;span class="caps"&gt;PROCS&lt;/span&gt;: 8  &lt;span class="caps"&gt;MEM&lt;/span&gt;: 7970M  &lt;span class="caps"&gt;SWAP&lt;/span&gt;: 11G  &lt;span class="caps"&gt;DISK&lt;/span&gt;: 1M&lt;/strong&gt;
&lt;strong&gt;Utilized   Resources: &lt;span class="caps"&gt;PROCS&lt;/span&gt;: 8  &lt;span class="caps"&gt;SWAP&lt;/span&gt;: 1189M&lt;/strong&gt;
&lt;strong&gt;Dedicated  Resources: &lt;span class="caps"&gt;PROCS&lt;/span&gt;: 8&lt;/strong&gt;
  &lt;span class="caps"&gt;MTBF&lt;/span&gt;(longterm):   &lt;span class="caps"&gt;INFINITY&lt;/span&gt;  &lt;span class="caps"&gt;MTBF&lt;/span&gt;(24h):   &lt;span class="caps"&gt;INFINITY&lt;/span&gt;
Vars:       &lt;span class="caps"&gt;RACK&lt;/span&gt;,&lt;span class="caps"&gt;SLOT&lt;/span&gt;,Rack
Opsys:      linux     Arch:      ---   
Speed:      1.00      CPULoad:   8.080
Network Load: 10.35 kB/s
Flags:      rmdetected
Network:    &lt;span class="caps"&gt;DEFAULT&lt;/span&gt;
Classes:    [queue1 8:8][queue2 0:8][queue3 8:8]
&lt;span class="caps"&gt;RM&lt;/span&gt;[cluster] &lt;span class="caps"&gt;TYPE&lt;/span&gt;=&lt;span class="caps"&gt;PBS&lt;/span&gt;  &lt;span class="caps"&gt;STATE&lt;/span&gt;=Busy
EffNodeAccessPolicy: &lt;span class="caps"&gt;SHARED&lt;/span&gt;

Total Time: 78:12:29:09  Up: 78:12:04:28 (99.98%)  Active: 13:20:15:23 (17.63%)

Reservations:
  res1.609x8  User  9:51:12 -&gt; 11:51:12 (2:00:00)
    Blocked Resources@9:51:12     Procs: 8/8 (100.00%)  Mem: 0/7970 (0.00%)  Swap: 0/11970 (0.00%)  Disk: 0/1 (0.00%)
  &lt;strong&gt;66x8  Job:Running  -00:32:26 -&gt; 1:27:34 (2:00:00)&lt;/strong&gt;
  res1.614x8  User  1:09:51:12 -&gt; 1:11:51:12 (2:00:00)
    Blocked Resources@1:09:51:12  Procs: 8/8 (100.00%)  Mem: 0/7970 (0.00%)  Swap: 0/11970 (0.00%)  Disk: 0/1 (0.00%)
Jobs:        66
&lt;/pre&gt;

&lt;p&gt;
The output provides a number of details about the compute node including
information on what resources are available on the node (Configured), which
of those are currently in use (Utilized), and which resources have been
dedicated to a job (Dedicated). Note that a node may show a utilization of
resources that are not explicitly dedicated. In the case above, the
resources that are dedicated to a job are 8 CPUs; however, the node also
reports a utilization of 1189 &lt;span class="caps"&gt;MB&lt;/span&gt; of swap space. In most cases this is fine
until the utilized resources reach the limit of physically available
resources or the load of the system dramatically exceeds the number of
available processors. In the ideal case a job should always only utilize at
most the resources that are dedicated to it and if this is not the case the
job&amp;#8217;s owner should be informed to adjust the resources requested by any
subsequent similar&amp;nbsp;jobs.
&lt;/p&gt;
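
&lt;p&gt;
To judge whether the load exceeds the number of processors, the two values
can be pulled out of the &lt;code&gt;checknode&lt;/code&gt; output and expressed as a
ratio. The sketch below is an illustration only: the helper name
&lt;code&gt;load_per_proc&lt;/code&gt; is made up, and it assumes the field positions
of the &lt;code&gt;Configured Resources&lt;/code&gt; and &lt;code&gt;CPULoad&lt;/code&gt; lines
in the sample output above, which may differ between Moab versions.
&lt;/p&gt;
&lt;pre&gt;
# Hypothetical helper: reads checknode output on stdin and prints the
# CPU load divided by the number of configured processors.
# A value well above 1.00 suggests the node is oversubscribed.
load_per_proc() {
    awk '
        $1 == "Configured" { procs = $4 }
        $3 == "CPULoad:"   { load = $4 }
        END { if (procs) printf "%.2f\n", load / procs }
    '
}

# Intended usage:
#   checknode compute-0-3.local | load_per_proc
&lt;/pre&gt;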

&lt;p&gt;
The &lt;code&gt;checknode&lt;/code&gt; output additionally provides an overview of any current and
future resource reservations. The example above shows two
future standing reservations (&lt;code&gt;res1&lt;/code&gt;) and one current reservation
for a job (66). One can identify current reservations by their start
time being&amp;nbsp;negative. 
&lt;/p&gt;
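
&lt;p&gt;
The negative start time also makes current reservations easy to filter out
mechanically. A minimal sketch, assuming the reservation line layout of the
sample output above (name, type, start time); the helper name
&lt;code&gt;current_reservations&lt;/code&gt; is made up for this example:
&lt;/p&gt;
&lt;pre&gt;
# Hypothetical helper: reads checknode output on stdin and prints the
# name of every reservation whose start time (third field) is negative.
current_reservations() {
    awk '$3 ~ /^-[0-9]/ { print $1 }'
}

# Intended usage:
#   checknode compute-0-3.local | current_reservations
&lt;/pre&gt;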

&lt;p&gt;
Aside from typical computation resources such as &lt;span class="caps"&gt;CPU&lt;/span&gt;, main memory and disk,
Torque and Moab allow user-defined resources to be associated with nodes.
Such resources can for example be special hardware devices, a limited
number of software licenses or application specific resource measures (e.g.
&lt;a
href="http://blog.dreamhosters.com/kbase/index.cgi?area=2583"&gt;Dreamhost&amp;#8217;s
Conueries&lt;/a&gt;). These kinds of resources are typically defined through the
&lt;a
href="http://www.clusterresources.com/products/mwm/docs/12.2nodeattributes.shtml"&gt;&lt;code&gt;NODECFG&lt;/code&gt;&lt;/a&gt;
directive in Moab&amp;#8217;s configuration&amp;nbsp;file.
&lt;/p&gt;
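
&lt;p&gt;
As a hedged illustration, a generic resource such as a pool of software
licenses might be attached to a node as follows. The node name is taken from
the examples above; the resource name &lt;code&gt;matlab&lt;/code&gt; and the count
are made up, and the exact attribute syntax should be checked against the
documentation of the installed Moab version.
&lt;/p&gt;
&lt;pre&gt;
# moab.cfg (sketch): two hypothetical "matlab" licenses on compute-0-3.local
NODECFG[compute-0-3.local] GRES=matlab:2
&lt;/pre&gt;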


   </content></entry><entry><title>Moab / Torque Primer Part 1: Submitting Jobs</title><author><name>Oliver Baltzer</name></author><link href="/2009/08/moab-torque-primer-part-1-submitting-jobs/" /><updated>2009-08-26T04:38:00Z</updated><published>2009-08-26T03:38:00Z</published><id>/2009/08/moab-torque-primer-part-1-submitting-jobs/</id><content type="html">
       
&lt;p&gt;Traditionally, the most common way of submitting jobs to a cluster is from
the command-line or a shell script by using the &lt;a
href="http://www.clusterresources.com/torquedocs21/commands/qsub.shtml"&gt;&lt;code&gt;qsub&lt;/code&gt;&lt;/a&gt;
command. &lt;code&gt;qsub&lt;/code&gt; is part of Torque and submits jobs into Torque&amp;#8217;s
job queue. When Torque and Moab are used in combination, Moab receives
the information about the job through Torque automatically, which is
typically completely transparent to the&amp;nbsp;user.&lt;/p&gt;
&lt;p&gt;For the user to control various properties of the job and have control over
what the job does, &lt;code&gt;qsub&lt;/code&gt; takes a so-called submission script as
input. The submission script is typically a normal shell script written by
the user that describes which commands should be executed once the job gets
to run. The submission script can be passed to &lt;code&gt;qsub&lt;/code&gt; either
through standard input or as a file specified as a command-line&amp;nbsp;argument:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ cat &amp;lt;&amp;lt; EOF &amp;gt; test_job.sh
sleep 60
hostname
EOF
$ qsub test_job.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;is equivalent&amp;nbsp;to:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ echo "sleep 60 ; hostname" | qsub
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In either case, Torque makes an internal copy of the script and uses this
copy to actually run the job. In most cases the user would write the
submission script as a file, since such scripts are typically rather&amp;nbsp;complex.&lt;/p&gt;
&lt;p&gt;In addition to the submission script &lt;code&gt;qsub&lt;/code&gt; also accepts
arguments that specify certain properties about the job, such as the job&amp;#8217;s
name, its expected runtime, the priority or into which queue it should be
submitted. Those properties can either be specified as command-line
arguments to the &lt;code&gt;qsub&lt;/code&gt; command or specified within the job&amp;#8217;s
submission script using a special syntax. For example, to submit a job with
name &amp;#8220;testjob&amp;#8221;, the name of the job can either be specified as a
command-line&amp;nbsp;argument:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ cat &amp;lt;&amp;lt; EOF | qsub -N testjob
sleep 60
hostname
EOF
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;or as part of the submission&amp;nbsp;script:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ cat &amp;lt;&amp;lt; EOF | qsub
#PBS -N testjob
sleep 60
hostname
EOF
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note the &lt;code&gt;#PBS&lt;/code&gt; directive in the submission script. It
instructs Torque to interpret the following parameters as if they were
specified on the&amp;nbsp;command-line.&lt;/p&gt;
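&lt;p&gt;Several directives can be combined in one script. The following sketch
generates such a script on standard output so that it can be piped into
&lt;code&gt;qsub&lt;/code&gt;. The function name &lt;code&gt;make_job_script&lt;/code&gt; is made
up for this example, and the &lt;code&gt;walltime&lt;/code&gt; resource request stands
in for the expected runtime mentioned above; the values are&amp;nbsp;arbitrary.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical helper: prints a submission script with a job name and
# a walltime limit given as the first and second argument.
make_job_script() {
    name=$1
    walltime=$2
    printf '#!/bin/sh\n'
    printf '#PBS -N %s\n' "$name"
    printf '#PBS -l walltime=%s\n' "$walltime"
    printf 'sleep 60\nhostname\n'
}

# Intended usage:
#   make_job_script testjob 00:10:00 | qsub
&lt;/code&gt;&lt;/pre&gt;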
&lt;p&gt;Please see the
&lt;a
href="http://www.clusterresources.com/torquedocs21/commands/qsub.shtml"&gt;&lt;code&gt;qsub&lt;/code&gt;&lt;/a&gt;
manpage for information on other command-line arguments and additional&amp;nbsp;information.&lt;/p&gt;

   </content></entry><entry><title>Moab / Torque Primer: A Gentle Introduction to Job Management</title><author><name>Oliver Baltzer</name></author><link href="/2009/08/moab-torque-primer-a-gentle-introduction/" /><updated>2009-08-26T03:34:00Z</updated><published>2009-08-26T03:34:00Z</published><id>/2009/08/moab-torque-primer-a-gentle-introduction/</id><content type="html">
       
&lt;p&gt;Moab and Torque are two software packages that work closely together and
are used in combination at many &lt;span class="caps"&gt;HPC&lt;/span&gt;&amp;nbsp;sites:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.clusterresources.com/products/torque-resource-manager.php"&gt;Torque&lt;/a&gt;
   is an Open Source resource manager which is responsible for collecting
   status and health information from compute nodes and keeps track of jobs
   running in the system. It is also responsible for spawning the actual
   executables that are associated with a job, e.g. running the executable
   on the corresponding compute&amp;nbsp;node.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.clusterresources.com/products/mwm/docs/moabusers.shtml"&gt;Moab&lt;/a&gt;
   is a commercial scheduler product developed by &lt;del&gt;&lt;a
   href="http://www.clusterresources.com/"&gt;Cluster Resources
   Inc.&lt;/a&gt;&lt;/del&gt; &lt;a href="http://www.adaptivecomputing.com/"&gt;Adaptive
   Computing&lt;/a&gt; which is responsible for allocating resources to jobs that
   are requesting resources. It does so by collecting all the information
   that Torque (or other resource managers) can provide about currently
   running jobs, available nodes and other resources. Once Moab has
   scheduled resources for a job, it instructs Torque to execute the job on
   the allocated&amp;nbsp;resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This collection of articles is intended as a very basic introduction to
resource and job management with Moab and Torque. It provides a high-level
practical starting point for new &lt;span class="caps"&gt;HPC&lt;/span&gt; system administrators who want to
become more familiar with these systems. It is not intended to provide
comprehensive and detailed descriptions of all of the systems&amp;#8217; features or
their&amp;nbsp;configuration.&lt;/p&gt;
&lt;p&gt;The articles are organized in parts which will be posted as they become&amp;nbsp;available:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="http://www.hpc-admin.com/moab-torque-primer-part-1-submitting-jobs"&gt;Submitting&amp;nbsp;Jobs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.hpc-admin.com/moab-torque-primer-part-2-resources"&gt;Resources&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Queues&lt;/li&gt;
&lt;li&gt;Reservations&lt;/li&gt;
&lt;li&gt;Job&amp;nbsp;Status&lt;/li&gt;
&lt;li&gt;Modifying&amp;nbsp;Jobs&lt;/li&gt;
&lt;li&gt;Canceling&amp;nbsp;Jobs&lt;/li&gt;
&lt;/ol&gt;

   </content></entry><entry><title>Decipher OSM log messages</title><author><name>Oliver Baltzer</name></author><link href="/2009/08/decipher-osm-log-messages/" /><updated>2009-08-05T23:55:00Z</updated><published>2009-08-05T23:55:00Z</published><id>/2009/08/decipher-osm-log-messages/</id><content type="html">
       
&lt;p&gt;The &lt;a href="http://www.hpc-admin.com/magically-filling-up-the-root-file-system"&gt;last post&lt;/a&gt; described a situation where the OpenSM Infiniband (&lt;span class="caps"&gt;IB&lt;/span&gt;) subnet manager was logging hundreds of messages per second to its log file &lt;code&gt;/tmp/osm.log&lt;/code&gt; and proposed an interim solution for preventing it from filling up the file system. This post is about tracking down the root of the&amp;nbsp;issue.&lt;/p&gt;
&lt;p&gt;Suppose the OpenSM log file contains messages&amp;nbsp;like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Aug 04 23:27:02 910870 [40A04960] -&amp;gt; __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x008B Port 6 TID:0x000000002c20f6fa
Aug 04 23:27:02 910896 [40A04960] -&amp;gt; __osm_trap_rcv_process_request: ERR 3804: Received trap 1091363 times consecutively
Aug 04 23:27:02 912507 [40401960] -&amp;gt; __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x008B Port 6 TID:0x000000002c20f6fb
Aug 04 23:27:02 912533 [40401960] -&amp;gt; __osm_trap_rcv_process_request: ERR 3804: Received trap 1091364 times consecutively
Aug 04 23:27:02 914191 [40602960] -&amp;gt; __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x008B Port 6 TID:0x000000002c20f6fc
Aug 04 23:27:02 914212 [40602960] -&amp;gt; __osm_trap_rcv_process_request: ERR 3804: Received trap 1091365 times consecutively
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These messages are rather cryptic and by themselves not particularly helpful. However, they contain information about the origin of the error: &lt;code&gt;Producer:2 from LID:0x008B Port 6&lt;/code&gt;. This obscure pair of &lt;span class="caps"&gt;LID&lt;/span&gt; and port actually refers to a port on the &lt;span class="caps"&gt;IB&lt;/span&gt; switch, which is reporting the error to the subnet manager. Now, one can log in to the &lt;span class="caps"&gt;IB&lt;/span&gt; switch and try to figure out which physical port and cable on the &lt;span class="caps"&gt;IB&lt;/span&gt; switch are associated with the given (&lt;span class="caps"&gt;LID&lt;/span&gt;, port) pair. Then a trip to the server room and digging among numerous &lt;span class="caps"&gt;IB&lt;/span&gt; cables may reveal the machine that is the cause of all this&amp;nbsp;trouble.&lt;/p&gt;
&lt;p&gt;Alternatively, one can open the software toolbox and pull out &lt;code&gt;ibdiagnet&lt;/code&gt;, a tool that is part of the OpenIB/&lt;span class="caps"&gt;OFED&lt;/span&gt; distribution. &lt;code&gt;ibdiagnet&lt;/code&gt; provides a number of useful functions to debug Infiniband networks and, in addition to general &lt;span class="caps"&gt;IB&lt;/span&gt; network path information, it conveniently provides a mapping from &lt;span class="caps"&gt;IB&lt;/span&gt; switch ports to machine hostnames for all &lt;span class="caps"&gt;IB&lt;/span&gt; host interfaces that are reachable. Even though it can only report that mapping for interfaces that are reachable, it can still be used to identify interfaces that are offline, assuming the &lt;span class="caps"&gt;IB&lt;/span&gt; cabling follows some predictable&amp;nbsp;pattern.&lt;/p&gt;
&lt;p&gt;When run without any command-line arguments, &lt;code&gt;ibdiagnet&lt;/code&gt; runs a number of diagnostics and leaves a couple of files in &lt;code&gt;/tmp&lt;/code&gt;. A detailed list of the files can be found in the &lt;code&gt;ibdiagnet&lt;/code&gt; &lt;a href="http://linux.die.net/man/1/ibdiagnet"&gt;manpage&lt;/a&gt;. The file &lt;code&gt;/tmp/ibdiagnet.lst&lt;/code&gt; provides a list of all active ports in the &lt;span class="caps"&gt;IB&lt;/span&gt; fabric, including ports that are internal to the switch. Additionally, for any host ports that are active it will show the hostname configured for the corresponding host. This information is used to eventually identify the &lt;span class="caps"&gt;IB&lt;/span&gt; host that is causing the&amp;nbsp;trouble:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;...
{ SW Ports:18 ... Chip A} LID:008B PN:02 } { CA ... {compute-0-3 HCA-1} LID:008F PN:01 } PHY=4x LOG=ACT SPD=5
{ SW Ports:18 ... Chip A} LID:008B PN:04 } { CA ... {compute-0-5 HCA-1} LID:0002 PN:01 } PHY=4x LOG=ACT SPD=5
{ SW Ports:18 ... Chip A} LID:008B PN:05 } { CA ... {compute-0-6 HCA-1} LID:0006 PN:01 } PHY=4x LOG=ACT SPD=5
{ SW Ports:18 ... Chip A} LID:008B PN:07 } { CA ... {compute-0-8 HCA-1} LID:0004 PN:01 } PHY=4x LOG=ACT SPD=5
{ SW Ports:18 ... Chip A} LID:008B PN:08 } { CA ... {compute-0-9 HCA-1} LID:0005 PN:01 } PHY=4x LOG=ACT SPD=5
...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since &lt;code&gt;/tmp/osm.log&lt;/code&gt; explicitly refers to switch &lt;span class="caps"&gt;LID&lt;/span&gt; 0x8B Port 6, a look at the listing quickly shows that the entry for &lt;span class="caps"&gt;LID&lt;/span&gt; 0x8B Port 6 is missing. The nodes are connected to the switch in a particular order and following that wiring pattern &lt;span class="caps"&gt;LID&lt;/span&gt; 0x8B Port 6 would be connected to &lt;code&gt;compute-0-7&lt;/code&gt;, the presumed troublemaker. A quick check on the node indeed revealed a kernel panic which prevented the &lt;span class="caps"&gt;IB&lt;/span&gt; driver from initializing the host interface&amp;nbsp;correctly.&lt;/p&gt;
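&lt;p&gt;When the log is large, the offending (&lt;span class="caps"&gt;LID&lt;/span&gt;, port) pairs can be summarized rather than read off by eye. The following is a minimal sketch, assuming the &lt;code&gt;Received Generic Notice&lt;/code&gt; line format shown in the log excerpt above; the function name &lt;code&gt;trap_sources&lt;/code&gt; is made up for this&amp;nbsp;example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical helper: reads an OpenSM log on stdin and prints each
# distinct switch LID and port followed by the number of notices seen.
trap_sources() {
    sed -n 's/.*from \(LID:0x[0-9A-Fa-f]*\) Port \([0-9]*\).*/\1 \2/p' |
        sort | uniq -c | awk '{ print $2, $3, $1 }'
}

# Intended usage:
#   cat /tmp/osm.log | trap_sources
&lt;/code&gt;&lt;/pre&gt;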
&lt;p&gt;Obviously, doing such an investigation every single time there is a problem with the &lt;span class="caps"&gt;IB&lt;/span&gt; network may become tedious. So using the information collected by &lt;code&gt;ibdiagnet&lt;/code&gt; to create an up-to-date chart of (&lt;span class="caps"&gt;LID&lt;/span&gt;, port) to hostname mappings may be a good&amp;nbsp;idea.&lt;/p&gt;
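&lt;p&gt;Such a chart could be generated along the following lines. This is a sketch under assumptions: it relies on the line layout of the abbreviated &lt;code&gt;/tmp/ibdiagnet.lst&lt;/code&gt; excerpt above, the real file contains more fields than shown there, and the function name &lt;code&gt;port_chart&lt;/code&gt; is made up for this&amp;nbsp;example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical helper: reads ibdiagnet.lst on stdin and prints, for each
# switch-to-host link, the switch LID, the switch port and the hostname.
port_chart() {
    sed -n 's/^{ SW .*} LID:\([0-9A-Fa-f]*\) PN:\([0-9]*\) } { CA .*{\([^ ]*\) [^}]*} LID.*/\1 \2 \3/p'
}

# Intended usage:
#   cat /tmp/ibdiagnet.lst | port_chart
&lt;/code&gt;&lt;/pre&gt;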

   </content></entry></feed>
