GridPP PMB Meeting 665

GridPP PMB Meeting 665 (16.04.18)
Present: Dave Britton (Chair), Tony Cass, Pete Clarke, David Colling, Alastair Dewhurst, Tony Doyle, Pete Gronbech, Steve Lloyd, Andrew McNab, Andrew Sansum, Louisa Campbell (Minutes).

Apologies: Jeremy Coles, Dave Kelsey, Roger Jones.

1. GridPP40 Feedback

The meeting was very well received and feedback has been very positive. It was recognised that Pitlochry is challenging to travel to, but as it is not a regular venue this is manageable. There was excellent interaction with the Dell representatives, who were very well engaged. They are keen to sponsor GridPP41 at Ambleside.

2. Tier-1 Procurement
There is no new information on funds. (VENDOR) are expected to deliver h/w on 31st April and acceptance testing is being undertaken to ensure requirements are met. Potential sanctions are written into the contract, but invoking them may not be appropriate, and it is hoped that delivery will proceed as planned. If no sanctions are imposed, there will still be an impact if we lose access to funds, so we should ensure (VENDOR) are made aware of the potential impacts and sanctions while being careful to maintain relationships. Reviews of procurement at RAL, STFC and SBF, including Tier-1 procurement, are currently ongoing.
AD will raise the issues around (VENDOR) with Lindsay and Martin this week and will produce a procurement schedule for the coming FY that builds in an additional month to buffer against a similar situation in future.
ACTION 665.1: AD will raise issues relating to (VENDOR) delivery of h/w with Lindsay and Martin.
ACTION 665.2: AD will produce Procurement schedule for the coming FY to build in an additional month to buffer any delays in the future.


4. Standing Items

SI-0 Bi-Weekly Report from Technical Group (DC)
DC was not present and no report submitted.

SI-1 ATLAS Weekly Review and Plans (RJ)
The ATLAS Liaison role has progressed. Stephen Heywood has suggested filling the role internally with Tim Adye, who has the requisite skillset, until April 2019; after that PPD would split the role between Stuart and Martin and/or Tim. Tim may not want to do this in the longer term. RJ has expressed some concerns, but Andrew Taylor considers the plan reasonable and thinks it should be attempted. DB summarised RJ’s concerns: if the role is to continue into GridPP6 it must be seen as indispensable by next year, and job-sharing is not attractive at this stage because the role should be embedded in the Tier-1, the Tier-2s and GridPP. This is unlikely to be compatible with two people sharing the role. It is a PPD post and should preferably be filled there to ensure it is retained in GridPP6. It is essential that the role holder is fully engaged and committed to the role in the longer term. A job description (JD) for the role has been drawn up with PPD and could be reviewed.
ACTION 665.3: DB will follow up with RJ on the Atlas post.

SI-2 CMS Weekly Review and Plans (DC)
Nothing of significance to report.

SI-3 LHCb Weekly Review and Plans (PC)
Nothing of significance to report.

SI-4 Production Manager’s report (JC)
The GDB meetings last week produced some interesting summaries.
The Ops Coordination meeting made changes to the reliability report. Security has been quiet. EOSC-hub week is next week; JC will circulate a report.

SI-5 Tier-1 Manager’s Report (AD)
Here is a brief report covering the period since the last ‘normal’ PMB on 5th March (i.e. the last time I produced a report). I normally send this round the PMB list.
I have put the availability figures for March at the bottom.


– Delayed intervention to patch the Oracle databases took place successfully on 27th March.
– At the start of April four newer disk servers were added to the Castor GenTape instance. The older ones were set read-only ahead of withdrawal. A throughput problem last week (during the GridPP meeting) led to the older ones being reinstated (temporarily).
– The dual stacking of Echo took place on 27th Feb. There were some problems after this. These were traced to a race condition in our system configuration utility (Quattor) that was causing the IPv6 configuration not to be applied correctly in some cases. This has now been fixed and we believe the problem affecting the FTS service is now resolved. While the problem persisted, some VOs stopped using our (RAL) FTS and instead managed their file transfers using other FTS servers. Some other dual-stacked services still need to be checked.
– On 4th April a minor Ceph update was applied to Echo to fix the ‘backfill’ bug. This will make adding more hardware into Echo easier.

– There was a problem in mid-March when a disproportionately large number of GGUS tickets were raised, all appearing to have xrootd as the common denominator. After extensive investigation it was found that a firewall rule had been lost and packets were consequently being dropped. Once identified, the rule was quickly reinstated.
– The upgrade (replacement) of the RAL firewall is scheduled for the morning of 25th April. Hopefully this will fix the problems we have seen with data flows to/from our worker nodes.

RAL Tier1 Availabilities for March:
Alice: 100%
Atlas: 99% (reflects the planned Castor outage on 27th March)
Atlas Echo: 100%
CMS: 98% (reflects the planned Castor outage on 27th March)
LHCb: 100%
OPS: 99.2%

SI-6 LCG Management Board Report of Issues (DB)
The Management Board meets tomorrow.

SI-7 External Contexts (PC)
UKT0 has successfully captured £16M, which is excellent news; this is now in the STFC allocation letter. PC, Dave Corney and AS will now progress the appropriate oversight and approval process for STFC to award the funds. The PMB congratulated UKT0 on capturing these funds.

REVIEW OF ACTIONS

644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. (Update: AS confirmed £40K will roll over into FY18 – AD has agreed with Tim to begin the process). Done.
657.2: DC to report on the CMS taskforce. Done.
655.3: PG to consider the agenda and date for Tier1 review and include disaster recovery plans. (UPDATE: 13 September at RAL has been agreed). Done.
656.1: DK will report before the end of February on any actions GridPP should take to comply with GDPR. (UPDATE: DK circulated slides) Ongoing.
656.2: DC will report on CPU efficiencies and CMS taskforce. Done.

662.1: Summaries to be provided of any likely contribution to the broad aims of GridPP from the CDT: AM/RJ for Manchester, JC for Cambridge, DB for Glasgow. Done.

663.1: RJ will discuss with Stephan about reducing the number of sites and confirm if they are happy to acknowledge CPU deriving from these sites is part of the pledge. Done.
663.2: PG will canvas sites to ascertain when they want to spend money and determine how disk will be phased out. Ongoing.
663.3: RJ and DC will advise how the experiments want disk divided for the start of Run 3 (Alice and LHCb are resolved). Ongoing.
663.4: PC will publish our input to Balance of Programmes Review on GridPP website. Ongoing.
663.5: GS will respond on availability for proposed date of 13 September for Tier1 review. Ongoing.
663.6: PC will confirm to DK the number of staff from Edinburgh that should attend CHEP. Done.
663.7: RJ will send information to DK on CHEP talks and posters accepted. Done.
663.8: JC will examine GridPP staff roles/service/areas of expertise. Ongoing.
663.9: AM will share baseline of interfaces he will draw up for UKT0 participating sites before a F2F in June. Ongoing.
663.10: AM will share list of interfaces which experiments need to be able to participate in the UKT0 service. Ongoing.

ACTIONS AS OF 16.04.18

656.1: DK will report before the end of February on any actions GridPP should take to comply with GDPR. (UPDATE: DK circulated slides) Ongoing.
663.2: PG will canvas sites to ascertain when they want to spend money and determine how disk will be phased out. Ongoing.
663.3: RJ and DC will advise how the experiments want disk divided for the start of Run 3 (Alice and LHCb are resolved). Ongoing.
663.4: PC will publish our input to Balance of Programmes Review on GridPP website. Ongoing.
663.5: GS will respond on availability for proposed date of 13 September for Tier1 review. Ongoing.
663.8: JC will examine GridPP staff roles/service/areas of expertise. Ongoing.
663.9: AM will share baseline of interfaces he will draw up for UKT0 participating sites before a F2F in June. Ongoing.
663.10: AM will share list of interfaces which experiments need to be able to participate in the UKT0 service. Ongoing.
665.1: AD will raise issues relating to (VENDOR) delivery of h/w with Lindsay and Martin.
665.2: AD will produce Procurement schedule for the coming FY to build in an additional month to buffer any delays in the future.
665.3: DB will follow up with RJ on the Atlas post.