GridPP PMB Meeting 657

GridPP PMB Meeting 657 (22.01.18)
=================================
Present: Dave Britton (Chair), Tony Cass, Pete Clarke, Jeremy Coles, David Colling, Tony Doyle, Pete Gronbech, Steve Lloyd, Andrew McNab, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Roger Jones, Dave Kelsey, Andrew Sansum.

1. Risk Register
================
PG will liaise separately with AS on items on the Risk Register he is responsible for. Overview of points discussed:
1 – Castor storage system problems – inadequate performance. Currently on red following discussions last time and it should remain red.

2 – Tier1 CEPH project fails – the previous comment was that the likelihood had been reduced and no issues seen, but some issues have since been seen. It was suggested this risk should be increased, particularly because staffing in the team is a concern and one member will soon leave. It was suggested that the impact should be raised and the likelihood raised by 1.

3 – Outage of UK T1 – the likelihood continues to be low as this has happened only twice in 15 years. The impact should perhaps rise as a result of the lengthy impact at CNAF. Remains at the current level.

4 – Failure at T1 to meet SLA or MoU – relates to failure to meet commitments. Remains at current level.

5 – Significant loss of custodial data at the T1 – relates mainly to the Echo project and associated risks. There has been some data loss on occasion, but recent experience is that lost data is recoverable. Remains at current level.

6 – Substantial loss of or damage to hardware. Remains at current level.

7 – Disaster at T1 leads to prolonged outage. Remains at current level.

8 – Recruitment and retention problems at RAL – this should increase due to the recent loss of staff. Likelihood should perhaps be raised to red, but this should be discussed with AS before a decision is made.

9 – Failure to deploy or operate hardware – this will be clearer when the procurement is complete, currently it has been better than expected. Remains at current level for the moment.

10 – Insufficient network bandwidth – perhaps the firewall issue, discussed below, should be factored in. PG will ensure this is included here. Should be raised with an explanation after more discussion with GS and AS.

11 – Over contention for resources – this was not changed last time and is currently at amber level. Remains at current level.

12 – Capital vs Resources at the T1 – this was reduced slightly last time. Remains at current level.

13 – Technology mismatch – cloud technologies and others may affect this risk. This could be increased by 1, with a note that new technologies may have an impact.

14 – Loss of experience or insufficient personnel at T2s – this remains at amber.

15 – Insufficient funding at T2s for h/w – remains at current level.

16 – T2 not fit for purpose – remains at current level.

17 – Experiment software runs poorly on the grid – previously increased. Remains at current level.

18 – Security problem affecting reputation – this was increased recently due to experiences in the NHS in particular, and there have been recent issues with Meltdown and Spectre. The risk is about reputation, rather than whether or not there will be problems. It was suggested we are not more vulnerable now than previously and this should remain at current level.

19 – Loss of GridPP service due to security – still green and we are in the same position. We have been affected and required to patch systems or take offline etc, but should remain at current level as the impact is low.

20 – Insufficient VO/User support effort – this remained the same last time but noted increased number of VOs. Remains at current level.

21 – Mismatch between budget and Hardware costs – last time marked high due to inflation and currency exchange. We will know more when the CPU tender is clear but should be slightly reduced to amber.

22 – Core service funding insufficient – risks 22 and 23 should be considered together (23 is about EGI and 22 is about non-EGI). This should not be high but should be discussed with AS as there is funding for e.g. GOCDB and APEL. Remains at current level.

23 – Breakdown of NGI/EGI infrastructure – see risk 22.

24 – Insufficient travel funds for effective management – a discussion should be held with DK in advance of the OC meeting. Remains at current level.

25 – GridPP resources prove insufficient for actual requirements – previously quite high, but we now aim to meet 90% due to increased funding. Should be reduced to amber.

26 – Critical middleware no longer supported – relates to Globus etc.; LCG has a plan to deal with this. There are several aspects this could cover, but it is a shared risk, which helps mitigate it. Remains at current level.

27 – Unplanned infrastructure costs – e.g. electricity costs at Tier2 increased. Remains at current level.

28 – Loss of EGI.eu – relates to Brexit. Remains at current level.

29 – GridPP funding uncertainty – remains at current level.

31 – Conflicting opinion amongst GridPP stakeholders – there are numerous meetings to ensure this is covered. Remains at current level.

32 – Failure of achieving further integration with PPAN community – Remains at current level.

PG will make relevant changes.

2. Procurement Update
=====================
AS circulated an update but is not present to discuss.

3. AOCB
=======
a) PG circulated the summary report. He highlighted the CMS complaints, as these have appeared on several reports and CMS have raised a concern that performance at our Tier1 is worse than at other Tier1s. DC suggested more information will be available from the CMS taskforce. The suspicion is that the situation results from reading data remotely; this will be better understood when testing and investigation are complete. PG noted site availability has been particularly raised as an issue and should be considered. SL’s network tests highlighted issues at RAL. Worker nodes may have an impact here but it is not yet clear: worker nodes reading remotely would be affected by the firewall, but reading locally would not be. Lazy download and Castor performance may improve as Echo comes on-stream, but this is separate from the worker nodes issue.
DB asked when the taskforce is expected to conclude. It was expected to conclude last year, but the complexity has led to an extension and it should conclude in the next month or two.
SL circulated network test results for RAL. The access rate to PPD vs rest of the world appears to be the main issue: LCG to PPD is good and vice versa; on the rows this is not so good. This is a snapshot over 72 hours but the situation has been similar for several weeks. Duncan has been contacting people to try to resolve this – focus has been on failing sites, e.g. VAC. It was noted Glasgow appears rather asymmetric and should be investigated further.
b) Q3 reports have been received and Q4 are awaited.

c) PC noted that, when he reports on the standing item on external issues, one point regularly discussed is that, to make the case to BEIS for money, STFC must disclose its management structure. GridPP (and DIRAC) have been successfully delivering billions of core hours to the LHC, and it was suggested a link or set of documents should be available to outline this. The structure is described in the GridPP proposals, which contain useful diagrams. Over the years less detail has been included on the management structure, but GridPP3 or GridPP4 had more extensive text associated with the diagrams that could be worked into a comprehensive document. There is also text on the website (‘about’ page). GridPP4+ had documents which could be useful, but members have little time due to other deadlines and commitments. PG will bring together the various sections of information into something that is easy to point to as evidence of our management structure etc.
d) Mark Thomson has just been announced as the Head of STFC.
ACTION 657.1: GS and SL will assess network tests for RAL and report to the PMB.
ACTION 657.2: DC to report on the CMS taskforce.
ACTION 657.3: PG will provide PC with documents and diagrams relating to the management structure.

4. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
There was a meeting on Friday and DC suggested the PMB read the minutes. First part related to Spectre/Meltdown with input from David Crooks and others. Then suggestions for the next GridPP meeting were discussed which DC has summarised in the minutes (DC is circulating a link to the minutes).

SI-1 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
RJ was not present and no report was submitted.

SI-2 CMS Weekly Review and Plans (DC)
-------------------------------------
Nothing of significance to report.

SI-3 LHCb Weekly Review and Plans (PC)
--------------------------------------
Nothing of significance to report.

SI-4 Production Manager’s report (JC)
-------------------------------------
Only three items to report this week, other items being simply pointers for information.

1. There was a WLCG ops coordination meeting last week https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes180118 with a theme of SAM tests and their use in WLCG reporting. The short summary is that there is overwhelming support for use of SAM tests for operational benefits but there is often a gap between the SAM reports and experiment perceptions of site performance. One major concern was the effort required to recompute results given the number of requests where sites are trying to show even small improvements in their performance to their management/funders. A more useful measure would include a reference to job accounting. A proposal will be made to the MB.

2. Within the EGI A/R reports the UK remained green at over 90% for the month. In fact the NGI has been green for the whole of 2017.

3. The SoLid experiment recently started data taking. It seems they underestimated their CPU requirements and so we are seeking other GridPP sites to enable the VO. The data is held at Imperial.

For information:

4. There will be a pre-GDB on 2nd February on HPC utilisation: https://indico.cern.ch/event/651338/ and another on 13th February on GPU utilisation. The GDB is on 14th: https://indico.cern.ch/event/651350/.

5. Registration is open for the next LHCOPN and LHCONE meeting: RAL, Abingdon UK, 6-7 March 2018: http://www.cvent.com/d/xtq4sy.

6. There is a proposal for a new HEPiX WG on Network Functions Virtualisation (NFV) and Software Defined Networks (SDN). It will be more research oriented and aims to bring together people interested in networking R&D for HEP. https://indico.cern.ch/event/637013/contributions/2739266/.

I note that CHEP 2018 Abstract submission is closed. Do we have a list of GridPP submissions?

A statement was made that funding agencies do not look at the monthly A/R reports, which raises the question of why we put any effort into correcting them. I said that we refer to them in GridPP along with additional metrics like job accounting, but was not aware of our funding agency looking at them directly for Tier-2s. It would be useful if we could state our interest (or lack of it) in the reports at the MB in case there is a misunderstanding being propagated to the Ops Coordination group. DB confirmed our funding agencies do not generally look at this information, but the reports are routinely included in the project map and the data is therefore important to us.

SI-5 Tier-1 Manager’s Report (GS)
---------------------------------
A very brief report covering the last week.

• Patching for Spectre and Meltdown was taking place last week. (E.g. an outage of Castor for a few hours last Wednesday, 17th Jan).
• There have been problems since last week with the ATLAS Castor instance. There has been a high rate of deletion requests, and the back-end stager database has been struggling to keep up with the total load (reads, writes and deletes). ATLAS have backed off for the moment. No more specific cause for the problems has been found but investigations are ongoing.
• There will be an intervention on one of the BMS (Building Management Systems) in the R89 machine room on Wednesday. It has been intermittently faulty, causing some system restarts. In one case the chillers failed to restart. This system manages the pumps. The intervention should take less than 30 minutes. On Tuesday one of the pumps will be reconnected ‘stand-alone’ and checked to confirm it works OK; it will then be used to maintain flow during the BMS controller swap.

Alison should join the PMB in the next few weeks for an update.

ACTION 657.4: DB will ask AS to invite Alison to join the PMB in the next few weeks for an update.

SI-6 LCG Management Board Report of Issues (DB)
-----------------------------------------------
There was a meeting but nothing of significance to report. There was an update on CNAF – there is a talk on the MB agenda which DB circulated. There has been some recovery of the power and an inspection to understand how water managed to enter the area and how to prevent this in the future. Recovery of the IT equipment is ongoing and new kit will be installed, to be completed by mid-February. This could have been much worse and a positive outcome is that much of the equipment was saved.
Meltdown and Spectre were also covered – information is still being gathered but no performance issues are known.

SI-7 External Contexts (PC)
---------------------------
PC will update next week.

REVIEW OF ACTIONS
=================
644.3: AS to put together a starting plan for staff ramp-down. (Update: a draft will be produced in January). Ongoing.
644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY?) Ongoing.
OS documents MUST be done and submitted to PG this week.
649.1: DB will write Introduction of OS documents. Ongoing.
649.2: PC will write Wider Context of OS documents. Ongoing.
649.3: PG will schedule a discussion of the Risk Register at a PMB meeting in December then update this in the OS documents. Done.
649.4: GS and AS will write the Tier1 Status section of OS documents. Ongoing.
649.5: JC will write Deployment Status section of OS documents with input from PG. Ongoing.
649.6: RJ, DC and AM will write LHC section of User Reports in OS documents. Ongoing.
649.7: JC will write Other Experiments section of User Reports in OS documents with input from DC and PG. Ongoing.
654.1: PG will telephone Mark Sutton to arrange for him to have access to GridPP. Done.
655.1: DC will discuss migration from WMS with T2K. Done.
655.2: AS to prepare a report on failure of the generator to come up after a recent issue. Ongoing.
655.3: PG to consider the agenda and date for Tier1 review and include disaster recovery plans. (UPDATE: appropriate dates are being considered with AS). Ongoing.
656.1: DK will report before the end of February on any actions GridPP should take to comply with GDPR. Ongoing.
656.2: DC will report on CPU efficiencies. Ongoing.
656.3: GS will discuss Tier-1 procurement with Laura and Martin and report to the PMB. Done.
656.4: DB will contact external contacts to invite them to attend and/or contribute to GridPP40. Ongoing.

ACTIONS AS OF 22.01.18
======================
644.3: AS to put together a starting plan for staff ramp-down. (Update: a draft will be produced in January). Ongoing.
644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY?) Ongoing.
OS documents MUST be done and submitted to PG this week.
649.1: DB will write Introduction of OS documents. Ongoing.
649.2: PC will write Wider Context of OS documents. Ongoing.
649.4: GS and AS will write the Tier1 Status section of OS documents. Ongoing.
649.5: JC will write Deployment Status section of OS documents with input from PG. Ongoing.
649.6: RJ, DC and AM will write LHC section of User Reports in OS documents. Ongoing.
649.7: JC will write Other Experiments section of User Reports in OS documents with input from DC and PG. Ongoing.
655.2: AS to prepare a report on failure of the generator to come up after a recent issue. Ongoing.
655.3: PG to consider the agenda and date for Tier1 review and include disaster recovery plans. (UPDATE: appropriate dates are being considered with AS). Ongoing.
656.1: DK will report before the end of February on any actions GridPP should take to comply with GDPR. Ongoing.
656.2: DC will report on CPU efficiencies. Ongoing.
656.4: DB will contact external contacts to invite them to attend and/or contribute to GridPP40. Ongoing.
657.1: GS and SL will assess network tests for RAL and report to the PMB.
657.2: DC to report on the CMS taskforce.
657.3: PG will provide PC with documents and diagrams relating to the management structure.
657.4: DB will ask AS to invite Alison to join the PMB in the next few weeks for an update.