LondonGrid

XrootD and ARGUS authentication

2014-10-08T10:20:00.000+01:00

A couple of months ago, I set up a test machine running XrootD version 4 at QMUL. This was to test three things:

IPv6 (see blog post),
Central authorisation via ARGUS (the subject of this blog post).
XrootD 4

We run StoRM/Lustre on our grid storage, and have run an XrootD server for some time as part of the ATLAS federated storage system, FAX. This allows local (and non local) ATLAS users interactive access, via the xrootd protocol, to files on our grid storage.

For the new machine, I started by following ATLAS's Fax for Posix storage sites instructions. These instructions document how to use VOMS authentication, but not central banning via ARGUS. CMS do however have some instructions on using xrootd-lcmaps to do the authorisation - though with RPMs from different (and therefore potentially incompatible) repositories. It is, however, possible to get them to work.

The following packages are needed (or at least what I have installed):

yum install xrootd4-server-atlas-n2n-plugin
yum install argus-pep-api-c
yum install lcmaps-plugins-c-pep
yum install lcmaps-plugins-verify-proxy
yum install lcmaps-plugins-tracking-groupid
yum install xerces-c
yum install lcmaps-plugins-basic

Now the packages are installed, xrootd needs to be configured to use them - the appropriate lines in /etc/xrootd/xrootd-clustered.cfg are:

xrootd.seclib /usr/lib64/libXrdSec.so
xrootd.fslib /usr/lib64/libXrdOfs.so
sec.protocol /usr/lib64 gsi -certdir:/etc/grid-security/certificates -cert:/etc/grid-security/xrd/xrdcert.pem -key:/etc/grid-security/xrd/xrdkey.pem -crl:3 -authzfun:libXrdLcmaps.so -authzfunparms:--osg,--lcmapscfg,/etc/xrootd/lcmaps.cfg,--loglevel,5|useglobals -gmapopt:10 -gmapto:0
#
acc.authdb /etc/xrootd/auth_file
acc.authrefresh 60
ofs.authorize 1

In /etc/xrootd/lcmaps.cfg it is necessary to change the path and the ARGUS server (my ARGUS server is obscured in the example below). My config file looks like:

################################

# where to look for modules
#path = /usr/lib64/modules
path = /usr/lib64/lcmaps

good = "lcmaps_dummy_good.mod"
bad = "lcmaps_dummy_bad.mod"
# Note: put your own argus host in place of argushost.mydomain
pepc        = "lcmaps_c_pep.mod"
             "--pep-daemon-endpoint-url https://argushost.mydomain:8154/authz"
             " --resourceid http://esc.qmul.ac.uk/xrootd"
             " --actionid http://glite.org/xacml/action/execute"
             " --capath /etc/grid-security/certificates/"
             " --no-check-certificates"
             " --certificate /etc/grid-security/xrd/xrdcert.pem"
             " --key /etc/grid-security/xrd/xrdkey.pem"

xrootd_policy:
pepc -> good | bad
################################################

Then after restarting xrootd, you just need to test that it works.

It seems to work: I was able to ban myself successfully. Unbanning didn't take effect instantly, and I resorted to restarting xrootd, though perhaps if I'd had more patience it would have worked eventually.

Overall, whilst it wasn't trivial to do, it's not actually that hard, and is one more step along the road to having central banning working on all our grid services.

Serial Consoles over ipmi

2013-06-04T15:47:00.002+01:00

To get serial consoles over IPMI working properly with Scientific Linux 6.4 (aka RHEL 6.4 / CentOS 6.4) I had to modify several settings both in the BIOS and in the OS.

Hardware Configuration

For the Dell C6100 I set these settings in the BIOS:

Remote Access = Enabled
Serial Port Number = COM2
Serial Port Mode = 115200 8,n,1
Flow Control = None
Redirection After BIOS POST = Always
Terminal Type = VT100
VT-UTF8 Combo Key Support = Enabled

Note: "Redirection After Boot = Disabled" is required otherwise I get a 5 minute timeout before booting the kernel. Unfortunately with this set up you get a gap in output while the server attempts to pxeboot. However, you can interact with the BIOS and once Grub starts you will see and be able to interact with the grub and Linux boot processes.

For the Dell R510/710 I set these settings in the BIOS:

Serial Communication = On with Console Redirection via COM2
Serial Port Address = Serial Device1=COM1,Serial Device2=COM2
External Serial Connector = Serial Device1
Failsafe Baud Rate = 115200
Remote Terminal Type = VT100/VT220
Redirection After Boot = Disabled

Note: With these settings you will be unable to see the progress of the kickstart install on the non default console.

Grub configuration

In grub.conf you should have these two lines (they were there by default in my installs).

serial --unit=1 --speed=115200
terminal --timeout=5 serial console

This allows you to access grub via the consoles. The "serial" (ipmi) terminal will be the default unless you press a key when asked during the boot process. This only applies to grub and not to the rest of the Linux boot process.

SL6 Configuration

The last console specified in the linux kernel boot options is taken to be the default console. However, if the same console is specified twice this can cause issues (e.g. when entering a password the characters are shown on the screen!)

For the initial kickstart pxe boot I append "console=tty1 console=ttyS1,115200" to the linux kernel arguments. Here the serial console over ipmi will be the default during the install process, while the other console should echo the output of the ipmi console.

After install, the kernel argument "console=ttyS1,115200" had already been added to the kernel boot arguments. I have additionally added "console=tty1" before it; this may be required to enable interaction with the server via a directly connected terminal if needed.
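The "last console wins" rule and the duplicate-console pitfall described above can be checked mechanically. This is a minimal sketch, not a tool from the original setup: the function names are made up for illustration, and the `ro root=/dev/sda1` part of the example command line is a placeholder.

```python
from typing import List, Optional, Tuple


def console_order(cmdline: str) -> List[str]:
    """Return the console devices named in a kernel command line, in order."""
    consoles = []
    for arg in cmdline.split():
        if arg.startswith("console="):
            # "console=ttyS1,115200" -> device name "ttyS1"
            consoles.append(arg[len("console="):].split(",")[0])
    return consoles


def check_consoles(cmdline: str) -> Tuple[Optional[str], List[str]]:
    """Return (default console, devices that are listed more than once)."""
    consoles = console_order(cmdline)
    # The last console= argument is the one the kernel treats as the default.
    default = consoles[-1] if consoles else None
    duplicates = sorted({c for c in consoles if consoles.count(c) > 1})
    return default, duplicates


if __name__ == "__main__":
    # Hypothetical command line using the console arguments from this post.
    print(check_consoles("ro root=/dev/sda1 console=tty1 console=ttyS1,115200"))
    # -> ('ttyS1', [])
```

A non-empty second element would flag the "same console specified twice" case that caused passwords to be echoed.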

With the ipmi port set as default (last console specified in the kernel arguments) SL6 will automatically start a getty for ttyS1. If it were not the default console we would have to add an upstart config file in /etc/init/. Note that SL6 uses upstart; previous SL5 console configurations in /etc/inittab are ignored!

e.g. ttyS1.conf

start on stopping rc runlevel [345]
stop on starting runlevel [S016]

respawn
exec /sbin/agetty /dev/ttyS1 115200 vt100

The art of cabling

2013-04-21T22:43:00.000+01:00

The challenge of organising your cables behind your TV is nothing compared to that of a large computing cluster.

One of our standard racks contains 12 Dell R510 servers (for storage) and 6 Dell C6100 chassis (providing 24 compute nodes). All 36 nodes are connected with a 10 Gb (SFP+), a 1 Gb (backup) and a 100 Mb (for IPMI) network cable, connecting to 3 different network switches at the top of the rack. In addition the 18 "boxes" need a total of 36 power connections. A total of 144 cables per rack!
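The cable count is easy to verify. The short calculation below uses only the figures quoted in the paragraph above; the variable names are just for illustration.

```python
# Figures taken from the paragraph above.
storage_servers = 12          # Dell R510
c6100_chassis = 6             # Dell C6100
nodes_per_chassis = 4         # 24 compute nodes spread over 6 chassis
network_cables_per_node = 3   # 10 Gb, 1 Gb backup, 100 Mb IPMI
power_connections = 36        # total for the 18 boxes

nodes = storage_servers + c6100_chassis * nodes_per_chassis
network_cables = nodes * network_cables_per_node
total_cables = network_cables + power_connections

print(nodes, network_cables, total_cables)  # 36 108 144
```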

How to cope? Separate the network cables from the power cables, a possible source of noise. Use different colour cables for the different traffic and add a unique ID number to each cable. Use loose, removable cable ties. When a cable breaks don't remove it, just add a new cable.

The 10 Gb switches, in our case Dell S4810s, connect using 4 40Gb QSFP+ cables to two Dell Z9000 core switches. Having two core switches allows us to take one unit out of service without downtime (we use the VLT protocol and it works!). However this does add cable complexity. The backup 1 gig switches connect to each other in a daisy chain using 10 Gb cx4 cables, left over from before our 10 Gb upgrade. Finally the ipmi switches connect to a front-end switch using 1GBaseT cables.

The picture shows the inter-switch links. Visible are the orange 40Gb connections and blue 10Gb cx4 cables. In addition each 40 Gb cable has an ID indicating which rack it came from and which core switch it's going to.

We have one rack full of critical, world facing servers. These servers need to be available at all times, making it very difficult to reorganise the cabling. As a result, over time, as we add and remove servers, the cabling becomes a mess. This is starting to become a risk! We are just going to have to accept some down time to sort it out in the near future.

virtualization performance hit

2013-04-15T10:30:00.000+01:00

As in the rest of the world, there is a lot of discussion going on in GridPP about the use of clouds and virtualization.

http://www.gridpp.ac.uk/gridpp30/mcnab-lhcb-vmclouds-march-2013.pdf

http://www.admin-magazine.com/HPC/articles/the_cloud_s_role_in_hpc

Using virtualization will have a performance impact, so using it for our type of computing (HPC/HTC) may not be the best solution. However, just what impact does it have? A quick search of the web suggests anywhere between 3 and 30%. Most of the overhead appears to be in the kernel and in I/O.

http://serverfault.com/questions/261974/how-much-overhead-does-x86-x64-virtualization-have

http://www.altechnative.net/2012/08/04/virtual-performance-part-1-vmware/

http://www.anandtech.com/show/3827/virtualization-ask-the-experts-1

I decided that I wanted to do some of my own tests with the focus on the type of work we do in gridpp.

Testbed: 24 thread westmere processor running at 2.66 GHz + 48 Gig of memory using Scientific Linux 6.3 (basically RHEL6). I'm using the default install of KVM with the virtual image as a local file setup to use all 24 threads.

Benchmarks: 1) I unpack and make the ROOT analysis package using 24 threads; 2) as 1 but using only one thread; 3) I generate 500,000 Monte Carlo events using the HERWIG++ generator; 4) as 3 but I also include the time taken to unpack and install HERWIG++; 5) I run the HEP-SPEC06 benchmark. For tests 1 to 4 I use the time command to obtain the real time taken (smaller is better); for 5 I report the HEP-SPEC score (larger is better). I run the benchmarks on the bare metal install and on the VM on the same hardware and compare the results.

Results:

Out of the box, KVM results in a ~3% (CPU intensive) to 20% (sys call intensive) reduction in performance. There is some indication of a correlation with the ratio of sys time to user time (a particular effect with make/tar/gzip?). This is not seen in the HEP-SPEC result. Sys time is the CPU time spent within the kernel, and from previous studies we expect this to incur a high performance hit under virtualization.
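For anyone repeating the comparison, the quantities discussed above can be computed as follows. This is only a sketch under assumptions: the post does not say exactly how the percentage reduction was calculated, so the definitions here (relative increase in wall-clock time, relative drop in score) are one plausible choice, and the numbers in the example are placeholders, not measurements from this test.

```python
def slowdown_percent(bare_metal_s: float, vm_s: float) -> float:
    """Percentage increase in real (wall-clock) time on the VM vs bare metal."""
    return 100.0 * (vm_s - bare_metal_s) / bare_metal_s


def score_loss_percent(bare_metal_score: float, vm_score: float) -> float:
    """Percentage drop in a larger-is-better score such as HEP-SPEC06."""
    return 100.0 * (bare_metal_score - vm_score) / bare_metal_score


def sys_user_ratio(sys_s: float, user_s: float) -> float:
    """Ratio of kernel (sys) CPU time to user CPU time, as reported by time."""
    return sys_s / user_s


if __name__ == "__main__":
    # Placeholder timings only; substitute the output of the time command.
    print(f"{slowdown_percent(bare_metal_s=100.0, vm_s=120.0):.1f}% slower")
    # -> 20.0% slower
```

Plotting the slowdown against `sys_user_ratio` for each benchmark would be one way to test the suspected correlation.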

If I get the time I intend to repeat the analysis using optimisations (e.g. guest image on LVM), repeat it using Fedora 18 (~RHEL 7) and a Sandy Bridge CPU, and look at network performance (e.g. iozone with Lustre).

The Queen Mary Grid Cluster

2013-04-10T14:24:00.000+01:00

The QMUL grid/HTC cluster is a high throughput computing (HTC) research cluster based at Queen Mary, University of London. We primarily serve the scientific grid community and are funded by the GridPP collaboration (i.e. the UK STFC research council). By high throughput we mean the ability to do lots of individual separate jobs. Our main workload is data analysis for the ATLAS experiment at CERN. We are the top site in the UK for this type of work, and one of the leading sites for the ATLAS LHC experiment in the world. We are part of the LondonGrid (hence the post to this blog!).

Our cluster comprises:

For running the actual jobs
30 Dell C6100s using X5650 processors, contributing a total of 2880 job slots, and
60 older streamline nodes using E5420 processors, contributing a total of 480 job slots.

For Storage we run the Lustre parallel file system using
72 Dell R510s with 1800 TBytes of disk and
12 older Dell 1950s with MD100 disk arrays with 360TB of disk
Our actual provision is about 1600TB due to the use of raid 6 and "real" disk sizes.
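Adding up the figures listed above gives the overall size of the cluster. The calculation below uses only the numbers from this post; the dictionary labels are shorthand for illustration.

```python
# Job slots and raw disk, as listed above.
job_slots = {"Dell C6100 (X5650)": 2880, "Streamline (E5420)": 480}
raw_tb = {"Dell R510": 1800, "Dell 1950 + disk arrays": 360}
usable_tb = 1600  # "about 1600TB" after RAID 6 and "real" disk sizes

total_slots = sum(job_slots.values())
total_raw_tb = sum(raw_tb.values())
usable_fraction = usable_tb / total_raw_tb

print(total_slots)               # 3360
print(total_raw_tb)              # 2160
print(f"{usable_fraction:.0%}")  # 74%
```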

We have a lot of development work to do over the next year, which I hope to describe in this blog over the coming months, including...

A new monitoring system probably based on opennms.
A new deployment system, to replace our hand made perl/mason/kickstart system probably using razor and puppet.
A cloud stack. We've been doing scientific computing using the grid software, but this model of computing is likely to be replaced with a cloud-type model, so we will need to look at the various options (OpenStack, CloudStack or OpenNebula).

The 11 racks of the QMUL cluster

RHUL cluster expands

2011-03-11T12:41:00.003+00:00

Yesterday, RHUL took delivery of new storage and compute nodes to beef up its Tier2 cluster.
The GridPP and CIF funded kit was supplied by Dell and is being installed and configured by Alces.
The extra 6.3 kHS06 and 420 TB will more than double the capacity of the cluster.
Once the installation is complete and accepted, work to integrate it with the existing cluster and bring up the gLite services will begin.

RHUL 'Newton' cluster comes home

2010-02-19T17:29:00.002+00:00

After two years hosted by Imperial College, our 'Newton' Grid computing cluster has finally been relocated to Royal Holloway's new state-of-the-art computer centre. The move was carried out by Clustervision and everything went smoothly. Before the cluster goes back into production, analysing LHC data, a software upgrade to SL5 is planned.

A small part of Newton remains at IC: the racks were donated to become part of the particle physics cluster.

Comparing ATLAS analysis at RHUL using the file-staging and RFIO approaches

2009-07-31T12:14:00.028+01:00

I have been looking at the performance of the Royal Holloway cluster during Hammercloud tests in which data was accessed directly from the DPM pool nodes using the RFIO protocol and comparing it to the recent UK-wide file-staging test (540).

For the RFIO approach two identical tests (537 and 538) were requested in order to ensure enough jobs arrived on site. The RFIO IOBUFSIZE was set to 4KB. Job CPU efficiencies and cluster throughput (the product of number of running jobs and average job efficiency) were extracted using Sam and Dug's script. The job throughput climbed steadily up to a peak at around 320 running jobs. At this point the throughput started to decline, probably compounded by the fact that one of the disk servers lost a disk and became overloaded.
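The throughput metric defined above is simple enough to write down. This is a sketch of the definition only, not Sam and Dug's script, and the sample points are invented purely to show how throughput can fall once efficiency drops faster than the job count rises; they are not data from tests 537 or 538.

```python
def cluster_throughput(running_jobs: int, mean_cpu_efficiency: float) -> float:
    """Throughput = number of running jobs x average job CPU efficiency."""
    return running_jobs * mean_cpu_efficiency


# Invented (running jobs, mean efficiency) points, for illustration only.
samples = [(50, 0.90), (150, 0.70), (250, 0.50), (350, 0.30)]

# The sample with the highest throughput.
best = max(samples, key=lambda s: cluster_throughput(*s))

if __name__ == "__main__":
    for jobs, eff in samples:
        print(jobs, eff, round(cluster_throughput(jobs, eff), 1))
    print("highest throughput at", best[0], "running jobs")
```

In this made-up series the throughput rises to the third point and then falls, even though the number of running jobs keeps increasing.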

The CPU efficiency declined relatively consistently as the number of running jobs increased:

Each job was reading data at about 1 MB/s, so that at the peak the total bandwidth was around 350 MB/s, roughly 30 MB/s per disk server. The disk servers were working hard, however: the iostat %util values were around 100%, with high cpu iowait values.
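As a consistency check on the bandwidth figures, the numbers quoted in this post can be combined as below. Note that the number of disk servers is not stated anywhere in the post; the value computed here is only what the quoted rates imply.

```python
# Approximate figures quoted in this post.
peak_running_jobs = 320       # "a peak at around 320 running jobs"
total_bandwidth_mb_s = 350    # "around 350 MB/s"
per_server_mb_s = 30          # "roughly 30 MB/s per disk server"

per_job_mb_s = total_bandwidth_mb_s / peak_running_jobs
implied_disk_servers = total_bandwidth_mb_s / per_server_mb_s

print(f"{per_job_mb_s:.2f} MB/s per job")              # 1.09 MB/s per job
print(f"{implied_disk_servers:.1f} disk servers implied")  # 11.7 disk servers implied
```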

So how do these results compare to those obtained when staging files to the worker node prior to analysis? This graph shows the same RFIO throughput data together with results from the recently run file-staging test:

The throughput during file-staging leveled off earlier - at around 175 running jobs. Similarly the average job efficiency drops more steeply:

The job failure rate for the RFIO tests was 4% compared to 17% for the file-staging test.

RHUL getting good rates into MCDISK from RAL

2009-05-13T19:55:00.004+01:00

RHUL has regularly got good rates, by which I mean 80-100 MB/s, from Fermilab when downloading CMS data. It is nice now to see similarly high rates downloading ATLAS data into the MCDISK space token from RAL.

Exercised space token creation at UCL-HEP

2008-04-16T13:44:00.003+01:00

I thought it would be neat to give it a try, so I created a small test reservation for dteam, following the instructions on the LCG Twiki. All went well and all the tweaks for SL3 / gLite 3.0 worked well. The only oddity was that:

[root@pc55 root]# dpm-reservespace --gspace 10M --lifetime Inf --group lcgdteam --token_desc dteam_10M
send2nsd: NS009 - fatal configuration error: Host unknown: UNUSED
invalid group: lcgdteam

but:

[root@pc55 root]# dpm-reservespace --gspace 10M --lifetime Inf --gid 2688 --token_desc dteam_10M

worked well. Perhaps this is due to the fact that the group id is not the same as the VO name? (I also tried 'dteam' in place of 'lcgdteam', but had the same error.)

RHUL aircon problems

2008-03-26T14:41:00.003+00:00

Our machine room aircon system broke down last week and the temperatures have been all over the place.

After a few days of summer clothing and a few nights of temperature alarms, it was diagnosed to be a refrigerant gas leak from the chiller on the roof. The bad news is that this takes 2 weeks to fix. Luckily the estates engineer was very efficient and organised the delivery and connection of a backup chiller on the last day of term, then personally looked in over Easter to keep an eye on it.

It has been stable the last few days so I've just brought the cluster back up. The site will come out of downtime this evening.

UCL-HEP APEL accounting fixed

2007-08-07T16:15:00.000+01:00

After upgrading to gLite r27 on the 4th of July, APEL stopped publishing to the central RGMA registry. The apel-publisher failed with an unhandled

RGMABufferFullException

To fix this, we had to update to the latest version of the APEL RPMs (2.0.5-1) on the MON and CE and re-run YAIM on both.

Imperial SE - dCache removed ~30TB of CMS data

2007-07-20T14:00:00.000+01:00

As requested by CMS users, this week we have cleaned up around 30TB of CMS data (orphaned files) from the IC dCache. We need to understand why so many orphaned files are generated in dCache.

Brunel SE running DPM 1.6.5

2007-07-20T13:19:00.001+01:00

We were having problems with the storage element at Brunel so I upgraded it to DPM version 1.6.5 (via 1.6.3) this week. The upgrade didn't go totally smoothly but now things seem a lot better. Thanks to Greig for his usual excellent support.

Brunel running SL4 cluster

2007-07-20T12:46:00.000+01:00

The worker nodes of dgc-grid-40 are now running the glite worker node release on SL4. It is passing the ops SAM tests and the VO tests that have run recently. There was a problem with LHCb production jobs trying to use edg-brokerinfo rather than glite-brokerinfo which I reported and they have now fixed. CMS production jobs have also completed successfully. Steve Lloyd's ATLAS tests pass apart from the 'New Package' part. Steve's comment was "My tests are still running release 12.0.6 for which the requirement is SL3 so they shouldn't really go into SL4 machines...this problem will go away when I switch to release 13.0.X as that's supposed to work on SL4". ATLAS production jobs seem to run OK but there seems to be a problem copying the output files back.

RHUL accounting problem

2007-07-20T12:32:00.000+01:00

There was a problem with the apel accounting at RHUL this week:

ZoneInfo: /usr/java/j2sdk1.4.2_12/jre/lib/zi/ZoneInfoMappings (Too
many open files)
Thu Jul 19 00:35:06 GMT 2007: apel-pbs-log-parser - WARNING -
Exception opening file /var/spool/PBS/server_priv/accounting/20070713
java.io.FileNotFoundException:
/var/spool/PBS/server_priv/accounting/20070713 (Too many open files)

We solved it by moving some of the files out of /var/spool/PBS/server_priv/accounting.
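The clean-up could be automated along the following lines. This is a minimal sketch, not what was actually run at RHUL: the function name, the archive directory and the retention period are hypothetical, and it assumes the accounting files are named YYYYMMDD as in the log excerpt above.

```python
import shutil
from datetime import datetime, timedelta
from pathlib import Path


def archive_old_logs(log_dir, archive_dir, keep_days, today=None):
    """Move accounting files named YYYYMMDD that are older than keep_days."""
    log_dir, archive_dir = Path(log_dir), Path(archive_dir)
    cutoff = (today or datetime.now()) - timedelta(days=keep_days)
    archive_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(log_dir.iterdir()):
        name = path.name
        # Only touch regular files whose name is exactly eight digits.
        if not (path.is_file() and len(name) == 8 and name.isdigit()):
            continue
        try:
            stamp = datetime.strptime(name, "%Y%m%d")
        except ValueError:
            continue  # eight digits but not a valid date, leave it alone
        if stamp < cutoff:
            shutil.move(str(path), str(archive_dir / name))
            moved.append(name)
    return moved
```

For example, `archive_old_logs("/var/spool/PBS/server_priv/accounting", "/some/archive/dir", keep_days=30)` would leave only the last month of files in place; the archive path here is a placeholder.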

bdii counts

2007-06-26T05:53:00.000+01:00

I promised to monitor the bdii. This is the plot of the bdii count from a while ago. I'll have to redo it for a longer period. It seems clear that it is not the entire site bdii that disappears but only individual entries, which is very probably correlated with load. We have seen the same with the ce mds.

RB very slow

2007-06-19T10:28:00.000+01:00

Yesterday I was wrestling with our RB. It takes several hours for a job to go from waiting to scheduled, which means that the matchmaking process is overloaded. I think the reason was that the database was very big (4GB). Exactly 2^32. As suggested here I cleaned the database and it seems better now. The problem is that I never got to the root of what was going wrong...

dCache failures (dcache-server-1.7.0-36)

2007-06-19T10:22:00.000+01:00

Again this morning we have pools going down with a memory allocation problem:
--
06/19 00:45:58 Cell(sedsk01_5@sedsk01Domain) : Thread : ping got : java.lang.OutOfMemoryError: Java heap space
--
I think the only way we will solve this will be to get hold of a dCache developer who can have a look. Clearly we did not have this problem when we were running the previous version (release 35).

dCache pools went down

2007-06-18T13:25:00.000+01:00

From Friday afternoon several dCache pools went down. They ran out of memory; here is the content of the sedsk01Domain.log file.

06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : at java.lang.Thread.run(Thread.java:595)
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : Storing incomplete file : 0003000000000000006E0B80 with 2756018417
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : Stacked Exception (Original) for : 0003000000000000006E0B80 <-P---------(0)[0]> 2756018417 si={cms:cms} : CacheException(rc=10006;msg=Pnfs request timed out)
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : Stacked Throwable (Resulting) for : 0003000000000000006E0B80 <-P---------(0)[0]> 2756018417 si={cms:cms} : CacheException(rc=33;msg=Illegal State Transition -P-------- -> -P--------)
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : CacheException(rc=33;msg=Illegal State Transition -P-------- -> -P--------)
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : at diskCacheV111.repository.CacheRepository2$CacheEntry.setPrimaryState(CacheRepository2.java:107)
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : at diskCacheV111.repository.CacheRepository2$CacheEntry.setPrecious(CacheRepository2.java:219)
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : at diskCacheV111.repository.CacheRepository2$CacheEntry.setPrecious(CacheRepository2.java:215)
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : at diskCacheV111.pools.MultiProtocolPool2$RepositoryIoHandler.run(MultiProtocolPool2.java:1538)
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : at diskCacheV111.util.SimpleJobScheduler$SJob.run(SimpleJobScheduler.java:64)
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : at java.lang.Thread.run(Thread.java:595)
06/15 16:35:02 Cell(c-100@sedsk01Domain) : runIO : java.lang.OutOfMemoryError: Java heap space
06/15 16:35:02 Cell(c-100@sedsk01Domain) : java.lang.OutOfMemoryError: Java heap space
06/15 16:35:02 Cell(c-100@sedsk01Domain) : java.lang.OutOfMemoryError: Java heap space
06/15 16:38:25 Cell(c-100@sedsk01Domain) : runIO : java.lang.OutOfMemoryError: Java heap space

dCache is started with those parameters:
-server -Xmx512m -XX:MaxDirectMemorySize=512m

We don't know what happened.

Dataset access problem at IC-HEP

2007-06-15T03:47:00.000+01:00

Some users are experiencing dataset access problems at IC-HEP. The ticket in question is GGUS 22106. The problem is that our CMS users don't have the problem with the same dataset.
This raises the question of how to debug these problems when you don't have the users on hand. In this case the only solution will be to do it interactively with the user.

SAM Failures in London

2007-06-15T03:28:00.000+01:00

Summary of SAM failures and solutions

mars-ce2: CA certificates updated but permissions were wrong for the lt2-lcg group and hence the certs were not readable. Fixed now.
hep-ce:

Update of the images. Missing ssl and uuid libraries caused the lcg-cp tools to fail. Matt solved this.
Updated the CA but unfortunately the crl cronjob did not run since it is being run by mona. Now fixed.

gw-2 (UCL-CENTRAL): Investigated intermittent failures and discovered that the sam jobs are sometimes killed by sge, which has a vmem limit of 2GB. The problem is that python, when creating a new thread, tries to use the max stack size of the parent process. Since sge sets this to a very high value, any new thread will try to create a big stack and the vmem limit will be reached. The solution is to change the max stack size in the sge configuration. We tried a ulimit -s 10 in the jobmanager, but since then gw-2 has been failing the ops test consistently. William has been contacted to revert this change and make the modification in the sge queue configuration instead.

Note: this problem was seen on the ic-hep cluster (ce00) and fixed using the stack size limit.

ce1.pp (RHUL): gatekeeper problem, it seems I cannot access with the ssh keys I am using at home. Have to check from IC.
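The gw-2 stack size issue can also be worked around inside a Python program, as an alternative to changing the limit in the batch system. This is a minimal sketch, not the fix applied at UCL or IC: the helper name and the 256 KiB value are arbitrary choices for illustration. It relies on the standard library calls `resource.getrlimit` and `threading.stack_size`, and on my understanding that on Linux the default stack for new threads is derived from the soft stack limit (ulimit -s) unless the program asks for something else.

```python
import resource
import threading


def run_with_small_stack(target, stack_bytes=256 * 1024):
    """Run target in a thread created with an explicit, modest stack size."""
    # stack_size() applies to threads created after the call and returns
    # the previously configured value so that it can be restored.
    previous = threading.stack_size(stack_bytes)
    try:
        worker = threading.Thread(target=target)
        worker.start()
        worker.join()
    finally:
        threading.stack_size(previous)


if __name__ == "__main__":
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    limit = "unlimited" if soft == resource.RLIM_INFINITY else f"{soft} bytes"
    print("soft stack limit inherited from the parent:", limit)
    run_with_small_stack(lambda: print("worker ran with a 256 KiB stack"))
```

With an explicit stack size each thread reserves only that much virtual memory, so a large inherited stack limit no longer counts against the vmem limit for every thread.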

It's a black week for the availability in London...

London Tier2 Workshop

2007-05-02T12:03:00.000+01:00

The London Tier2 Workshop took place on the 16th of April.
It was a good opportunity to see which non-HEP applications are running on the grid.
The slides of the workshop can be found here.

New Grid Security Policy Document

2007-05-02T10:37:00.000+01:00

The new Grid Security Policy Document can be found here. It is still a draft, and comments are welcome. See version 5.6.

RB Wrestling the comeback

2007-02-20T10:55:00.000+00:00

This morning, looking at the monitoring, our RB does not look happy. You can judge for yourself from the plot below. It clearly seems that when the submission rate is too high the workload manager just cannot eat the jobs fast enough to reduce the queue length. I have asked Maarten for help; we'll see what he comes up with. I think I will have a look in the RB code to find out what is going on...