WLCG RAL Tier 1

RAL Tier1 – Plans for Christmas & New Year Holiday 2016/17

Gareth Smith — Thu, 15 Dec 2016 11:33:04 +0000

RAL closes at the end of the working day on Friday 23rd December and will re-open on Tuesday 3rd January. During this time we plan for services at the RAL Tier1 to remain up. The usual on-call cover will be in place (as per nights and weekends). This cover will be enhanced by daily checks of key systems.

Furthermore, we do not have support around 25/26 December and 1st January for some site services we rely on, so the impact of any failures around these dates may last longer. Over the holiday we have also relaxed our expectation that the on-call person will respond within two hours, particularly on the dates just mentioned.

During the holiday we will check for tickets in the usual manner. However, only service-critical issues will be dealt with.

The status of the RAL Tier1 can be seen on the dashboard at:

http://www.gridpp.rl.ac.uk/status/

Gareth Smith

Analysis of Call-outs for 2015 and First Part of 2016

Gareth Smith — Mon, 11 Jul 2016 12:19:57 +0000

During the first week of July our work experience student, Ellen, analysed the data on calls to the Tier 1 team from 2015 to 2016 so far. She has provided this report.

The above plot shows the distribution of call-outs for each day of the week, with each line showing a category of call resolution. The “n/a” category is normally applied to alarms that occur during working hours and are therefore not handled by the out-of-hours team. In 2015 there were 250 calls in total. Call-outs can arise from genuine failures or from cases where work being undertaken triggers the call-out system. The graph shows call-outs peaking early in the week, primarily on Tuesdays, and then dipping towards the weekend. This largely reflects the scheduling of work in the early part of the week, particularly on Tuesdays.

These pie charts show the distribution of calls from different systems. Most sectors of the 2016 pie are similar to those of 2015. However, calls for “CE” and “SRM” have decreased whilst “Disk Server” and “Database” have increased significantly, although this may change as the year progresses.

R89 Water Pump Outage

James Adams — Tue, 10 May 2016 13:57:29 +0000

Yesterday an unexplained site-wide BMS glitch caused the R89 BMS to stop the four pumps that circulate water from the CRACs to the chillers and back again. The chillers shut down due to lack of water flow, but the CRACs continued to circulate air in the rooms.

Machines across R89 began to heat up at roughly 1°C per minute. Our servers try to shut down cleanly at a threshold that depends on the hardware model (most commonly 60°C). Shortly after being notified of the pump shutdown we paused all running batch jobs and prevented any new jobs from starting, which appeared to stabilise temperatures. During this time other groups using the data centre were also shutting down their services. Just before 5pm the pumps and chillers were restarted and temperatures started to fall. After some discussion, the paused jobs were allowed to continue (but new jobs were still prevented).
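As a rough illustration of why acting quickly mattered, the sketch below estimates how long a host has before it reaches its shutdown threshold at a constant heating rate. The 60°C threshold and the 1°C per minute rate come from the account above; the 30°C starting temperature is an assumed value for illustration, not a measurement from the event.

```python
def minutes_to_threshold(start_c: float, threshold_c: float, rate_c_per_min: float) -> float:
    """Minutes until a host reaches its shutdown threshold at a constant heating rate."""
    if rate_c_per_min <= 0:
        raise ValueError("heating rate must be positive")
    # A host already at or above the threshold has no headroom left
    return max(0.0, (threshold_c - start_c) / rate_c_per_min)


# 60 C and 1 C/min are taken from the post; the 30 C start is an assumption
headroom = minutes_to_threshold(start_c=30.0, threshold_c=60.0, rate_c_per_min=1.0)
print(f"{headroom:.0f} minutes of headroom")
```

In practice the rate would not stay constant once jobs were paused, so this gives only an order-of-magnitude figure.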

The graph below shows the mean internal (not CPU) temperature of all 168 hosts identified as “worker nodes” throughout yesterday’s event with key events labelled (note time is in UTC).

For a wider view of the whole room, we can look at the period from ARTEMIS’s point of view.

http://www.gridpp.rl.ac.uk/blog/wp-content/uploads/2016/05/heatmap-2016-05-10.ogv

Or with (incomplete) rack layouts overlaid on the data:

http://www.gridpp.rl.ac.uk/blog/wp-content/uploads/2016/05/heatmap-2016-05-11.ogv

Long delayed hat day

johnkelly — Wed, 27 Apr 2016 09:40:22 +0000

It has been a while since there was a Tier1 Hat Day. It took numerous meetings, including an unprecedented full committee meeting, to arrange the latest display of millinery finesse.

All in all it was a good turnout. In addition to the more ‘normal’ members, the less normal attendees included:

The Ceph member of staff being transformed to a man-octopus hybrid.
The pipe wielding mad scientist who denies all knowledge of man-octopus transformations.
The captain of the Tier1 finally dons his official hat, hopefully to navigate R89 through the rocks of uncertainty.
Fidel sent his personal look-alike representative, or maybe even Fidel himself in disguise.
The wizard wore the magical conical hat covered in magical invisible symbols.

The dark figure in the background who may be a fencer or maybe just a ‘dark figure’ mimic

RAL Tier1 – Plans for Christmas & New Year Holiday

Gareth Smith — Wed, 16 Dec 2015 09:52:19 +0000

RAL Tier1 – Plans for Christmas & New Year Holiday 2015/16

RAL closes at the end of the working day on Thursday 24th December and will re-open on Monday 4th January. During this time we plan for services at the RAL Tier1 to remain up. The usual on-call cover will be in place (as per nights and weekends). This cover will be enhanced by daily checks of key systems.

During the holiday we will check for tickets in the usual manner. However, only service-critical issues will be dealt with.

The status of the RAL Tier1 can be seen on the dashboard at:

http://www.gridpp.rl.ac.uk/status/

Gareth Smith

Analysis of Callout Data

Dan O'Riordan — Fri, 03 Jul 2015 13:47:32 +0000

As a work experience student at RAL, I have collected and analysed the data detailing the callouts made to the Tier 1 on-call team. The team provide 24×7 cover for the Tier 1 service.

Total number of Callouts per year

Over the past few years a trend has emerged, as the above graph of total callouts per year shows. Callouts fell from 467 in 2011 to 91 halfway through 2015. Estimating the 2015 total as double 91 (182) gives a decrease of 285 callouts. This could reflect the weekly review of callouts carried out by the Tier 1 team. Another explanation is improvements in technology that reduce the risk of faults and callouts. The only anomaly is 2014, which shows a higher number of callouts; no specific cause is known, as the team has not analysed all of the data. Even with this anomaly, the overall data shows a trend towards fewer callouts each year. Hopefully, we will hit zero soon!
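The arithmetic behind the 285 figure can be checked with a short sketch. It uses only the two counts quoted above and the same assumption that the second half of 2015 matches the first; the percentage is derived here and is not a figure from the original analysis.

```python
callouts_2011 = 467        # from the post
callouts_2015_half = 91    # first half of 2015, from the post

# Assumption used in the post: the full-year total is double the half-year count
callouts_2015_est = 2 * callouts_2015_half
decrease = callouts_2011 - callouts_2015_est
percent = 100 * decrease / callouts_2011

print(callouts_2015_est, decrease, round(percent))
# prints: 182 285 61
```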

Types of Alarms by Server

During 2014 there were 294 callouts in total. The graph above divides this total among the different services and types of alarm. We can conclude from this data that Castor, Database, Disk Server and SRM cause the most callouts. This could be because we treat storage services as more critical, so they are more often configured to call out. We also have a large number of storage servers, which could lead to more callouts. The (Condor) batch system doesn’t produce many callouts, and there are relatively few for other grid services.

Types of Alarms by who handled them

The on-call team consists of a ‘Primary on-call’ (PoC) person who receives the message from the automated call-out system. The PoC makes an initial assessment of the problem and will attempt to resolve it. Should further assistance be needed the PoC passes the problem onto the on-call ‘expert’ from each of the support teams (Fabric, Castor, Database, Grid Services).

The graph above shows the split between problems handled by the PoC alone and those handled by the PoC plus an expert. In 2014, two thirds of the problems that arose were too complex or too large for the PoC alone and were referred to an expert.

My, how we have grown!

johnkelly — Wed, 13 May 2015 12:13:32 +0000

There was a recent request for batch farm capacity data going back some years. This data had been removed from the live ganglia server a while ago but I did some tweaking to get it back online.

While dealing with the day to day running of the batch farm, we don’t really see the ‘big picture’. I was surprised to see the growth of the RAL batch farm capacity.

This plot also brought questions about the amount of idle HEPspec we appear to have. I occasionally investigate idle capacity on the farm. There are a few common reasons why machines have spare capacity.

The most common reason is memory limitations. Older machines have less memory, so a machine can have empty CPU slots but no spare memory and so can’t run jobs. We have also discovered that some jobs use much more memory than the job requirements state. A single CPU job using 16GB of memory will prevent many other jobs from starting on a worker node. This problem should slowly fade away as we upgrade machines and deploy new technologies like cgroups and containers.
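A minimal sketch of the memory effect, assuming a simplified first-come, first-served model in which a job starts only if a CPU slot and enough free memory both remain. The node size (16 slots, 32GB) and the 2GB per-job figure are hypothetical values chosen for illustration; the real batch system matches jobs in a more sophisticated way.

```python
def jobs_started(cpu_slots: int, node_mem_gb: float, job_mem_gb: list[float]) -> int:
    """Start queued jobs in order while both a CPU slot and enough memory remain."""
    free_mem = node_mem_gb
    started = 0
    for mem in job_mem_gb:
        if started < cpu_slots and mem <= free_mem:
            free_mem -= mem
            started += 1
    return started


# Hypothetical worker node: 16 job slots, 32 GB of memory
well_behaved = [2.0] * 16           # every job uses 2 GB
one_heavy = [16.0] + [2.0] * 15     # one single-CPU job uses 16 GB

full = jobs_started(16, 32.0, well_behaved)
limited = jobs_started(16, 32.0, one_heavy)
print(f"all 2 GB jobs: {full} of 16 slots used")
print(f"with one 16 GB job: {limited} of 16 slots used")
```

In this toy case a single heavy job leaves seven slots idle even though there is CPU to spare.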

Other common causes are:

Empty pilot jobs occupying job slots but not doing anything. This should be less common now.
Slow I/O where jobs sit idle while waiting on data.
Jobs that are simply inefficient. All experiments seem to go through phases of submitting such jobs.
There are always some machines running in some restricted manner so as to test something. For example we now have some new-build machines being tested. They are in the batch farm, but not running jobs while we investigate and resolve issues.
Machines are drained for updates and reboots. At the time of writing, there is one cluster being drained for kernel and errata updates.
And occasionally the experiments simply all ‘go away’.

Stress test of Ceph Cloud cluster

Alastair Dewhurst — Thu, 22 Jan 2015 16:51:39 +0000

RAL has a Ceph storage cluster (referred to as the Cloud Cluster) that provides a RADOS Block Device interface for our Cloud infrastructure. We recently ran a stress test of the Ceph instance.

We had 222 VMs running, of which 50 were randomly writing large volumes of data. We realised we had maxed out when we noticed a slowdown in the responsiveness of our VMs. Increasing the number of VMs writing data did not increase the amount of data being written, so we believe we hit the limit on the cluster.

The write rate into the cluster peaked at 1044 MB/s (8.2 Gb/s), as reported by ‘ceph status’. This is the rate of incoming data; as we store three copies, 24.6 Gb/s was actually being written (not including journaling). Investigation showed that the limiting factor was the storage node disks, which were all writing as fast as they could.
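The unit conversion and replication factor can be reproduced with a few lines. The 8.2 Gb/s figure appears to have been obtained by dividing by 1024 rather than 1000; that is an inference from the numbers, not something stated in the post, so both conversions are shown.

```python
client_write_mb_s = 1044   # reported by `ceph status`, from the post
replicas = 3               # three copies of each object, from the post

# Dividing by 1024 reproduces the quoted 8.2 Gb/s (assumed to be the method used);
# dividing by 1000 would give about 8.4 Gb/s instead.
client_gbit_binary = client_write_mb_s * 8 / 1024
client_gbit_decimal = client_write_mb_s * 8 / 1000

# The post multiplies the rounded 8.2 Gb/s by the number of copies
raw_gbit = replicas * round(client_gbit_binary, 1)

print(f"client rate: {client_gbit_binary:.1f} Gb/s (or {client_gbit_decimal:.1f} Gb/s decimal)")
print(f"written to disk across replicas: {raw_gbit:.1f} Gb/s")
```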

We have not optimised the mount commands in the Ceph configuration, and this is something we should probably explore in future for performance gains.

The cluster currently consists of 15 storage nodes, each with 7 OSDs and 10Gb/s client and rebalancing networks.

The following graphs show the network, CPU and memory utilisation on one of the storage nodes. They are typical of the rest of the cluster. The step change marks the point where we fired up the VMs doing random writes. Network in was about 220 MB/s per node; fifteen times this is 3300 MB/s, or roughly 26 Gb/s, which is close to the 24.6 Gb/s figure quoted above and provides an independent check on the figure reported by ‘ceph status’.
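The cross-check can be written out as a short sketch. The per-node rate is the approximate value read from the graph, so the result is only indicative; the decimal conversion (1000 MB per GB) and the percentage difference are choices made here rather than figures from the post.

```python
nodes = 15                 # storage nodes in the cluster, from the post
net_in_mb_s = 220          # approximate per-node network-in rate, from the post
ceph_raw_gbit = 24.6       # three-copy write rate derived from `ceph status`

total_mb_s = nodes * net_in_mb_s
total_gbit = total_mb_s * 8 / 1000     # decimal conversion, an assumption

# Relative gap between the network-based estimate and the Ceph-based figure
diff_pct = 100 * (total_gbit - ceph_raw_gbit) / ceph_raw_gbit

print(f"network estimate: {total_mb_s} MB/s = {total_gbit:.1f} Gb/s")
print(f"difference from Ceph figure: {diff_pct:.0f}%")
```

A gap of a few percent is plausible given that the network figure also carries protocol overhead and was read off a graph.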

RAL Tier1 – Plans for Christmas & New Year Holiday

Gareth Smith — Wed, 17 Dec 2014 14:31:00 +0000

RAL Tier1 – Plans for Christmas & New Year Holiday 2014/15

RAL closes at 3pm on Wednesday 24th December and will re-open on Monday 5th January. During this time we plan for services at the RAL Tier1 to remain up. The usual on-call cover will be in place (as per nights and weekends). This cover will be enhanced by daily checks of key systems.

During the holiday we will check for tickets in the usual manner. However, only service-critical issues will be dealt with.

The status of the RAL Tier1 can be seen on the dashboard at:

http://www.gridpp.rl.ac.uk/status/

Gareth Smith

Deploying 2013 worker nodes at RAL

johnkelly — Thu, 03 Apr 2014 13:39:35 +0000

We have just had a very busy week deploying the 2013 tranches of worker nodes here at RAL. We had hoped to deploy these sooner but many staff were unavailable due to conferences or annual leave. Consequently there was a rush to ensure that we continued to meet our pledged capacity on the 1st April.

The new machines are in two tranches: 64 OCF machines and 64 Viglen machines. They all have dual Xeon E5-2650 @ 2.60GHz processors. Hyperthreading is enabled and each machine is configured to have 32 job slots. (This adds 4096 job slots in total.)
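The slot count follows directly from the hardware. In the sketch below the tranche sizes and the 32 slots per machine come from the text above; the figure of 8 cores per E5-2650 processor is not stated in the post and is added here to show where 32 comes from.

```python
machines = {"OCF": 64, "Viglen": 64}   # tranche sizes, from the post

sockets = 2             # dual-processor machines, from the post
cores_per_socket = 8    # not stated in the post: core count of the Xeon E5-2650
threads_per_core = 2    # hyperthreading enabled, from the post

slots_per_machine = sockets * cores_per_socket * threads_per_core
total_slots = sum(machines.values()) * slots_per_machine

print(slots_per_machine, total_slots)
# prints: 32 4096
```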

The new machines have been put into production with the latest kernels and errata. So the next week or so will see staff at RAL doing a rolling upgrade on the batch farm to ensure that it is homogeneous, with all worker nodes running the same kernel, errata and EMI version. We are also taking the opportunity to do a slight update of the condor version, from condor-8.0.4-189770.x86_64 to condor-8.0.6-225363.x86_64.
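During a rolling upgrade it can be handy to list which nodes still run an older package. The sketch below parses package strings of the form shown above and compares versions. The host names and the inventory dictionary are hypothetical; how the installed versions are actually collected at RAL is not described here.

```python
def parse_rpm(pkg: str) -> tuple[str, tuple[int, ...], str]:
    """Split 'name-version-release.arch' into (name, numeric version, release)."""
    nvr, _arch = pkg.rsplit(".", 1)
    name, version, release = nvr.rsplit("-", 2)
    return name, tuple(int(part) for part in version.split(".")), release


target = "condor-8.0.6-225363.x86_64"   # version named in the post

# Hypothetical inventory: host name -> installed package
installed = {
    "wn-example-01": "condor-8.0.4-189770.x86_64",
    "wn-example-02": "condor-8.0.6-225363.x86_64",
}

target_version = parse_rpm(target)[1]
outdated = sorted(
    host for host, pkg in installed.items() if parse_rpm(pkg)[1] < target_version
)
print("still to upgrade:", outdated)
```

Comparing versions as tuples of integers avoids the pitfalls of plain string comparison, for example 8.0.10 sorting before 8.0.6.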

As the new machines come in, it is also a reminder that we will be retiring the old 2008 machines. At the moment there is no hurry and we will continue to exploit whatever resources we have available.

The graph shows the increase in HSPEC06 capacity over the past week. The idle HSPEC has increased because many machines are now being drained for kernel and errata updates.