Present: Dave Britton (Chair), Tony Cass, Pete Clarke, Jeremy Coles, David Colling, Tony Doyle, Pete Gronbech, Roger Jones, Dave Kelsey, Steve Lloyd, Andrew Sansum, Gareth Smith, Louisa Campbell (Minutes).
Apologies: Andrew McNab
1. Tier-1 tape capacity planning
This item followed up on an email from AM re Atlas’s use of tape. Atlas requested tape but RJ noted that actual use falls below what is being requested and there is no real pressure for additional tape. There will be an announcement from Oracle on the plans for T10KE use very soon. Good progress was made on approvals last week. We remain tight on procurement and can bring forward tape procurement; however, the tape purchase planned for FY17 spanned the entire GridPP5 project. If we don’t re-profile, a decision will still be required in 2017 to meet the whole GridPP5 plan, and the funds may no longer be available. Delaying 6 months would allow us to determine the actual status with tape. The PMB decided to monitor and make a decision at a later date, perhaps at RRBs in either spring or September, as flexibility is required while we await the current procurement.
2. LHCOPN upgrade
As discussed at Ambleside – potential costs and timescales for receipt of the link were discussed as it is likely to be required around the time when LHC restarts this year. The PMB agreed to progress the upgrade to the network per discussions from Ambleside asap.
a) Procurement – there is no news re tenders.
b) Finance team – AS received an internal email requesting information on carry-forwards into the next Financial Year commitments to gauge the scale. AS has mentioned the possible underspend due to the timetable to the finance team.
c) Researchfish – changes have been implemented and PC has dipped into the new system. It has all the functionality to allow input of data and all subgroup grants will automatically have been linked to that by STFC. The experiments will insert all their publications and individuals can link in to them.
4. Standing Items
SI-0 Bi-Weekly Report from Technical Group (DC)
No report this week as the meeting was delayed for one week due to the Atlas Jamboree.
SI-1 Dissemination Report (SL)
Nothing of significance to report.
SI-2 ATLAS Weekly Review and Plans (RJ)
RJ mentioned a blip which appeared to be site errors but was in fact in the production – it has been traced and halted. Due to the Atlas Jamboree the GridPP technical meeting was cancelled. So that more guidance can be provided to GridPP sites, DB will this week progress the document he is writing on Tier-2 site evolution, taking account of the Atlas Jamboree and computing management.
SI-3 CMS Weekly Review and Plans (DC)
Nothing of significance to report.
SI-4 LHCb Weekly Review and Plans (PC)
Nothing of significance to report.
SI-5 Production Manager’s report (JC)
1. The ops meeting this week will look at APEL vs ATLAS accounting figures for the last months (this is to address the issue noticed when we last reviewed the metrics for Tier-2 allocations).
2. Sites are being warned to take additional care when changing accounting client settings so that a site does not inadvertently re-publish old data which is leading to APEL loading. No UK sites have been flagged specifically.
3. Looking at sites issues.
– Durham are having intermittent, hard-to-explain Nagios test failures on their ARC CE
– Glasgow are seeing some high CVMFS usage on some nodes running LHCb jobs
– RALPP is following up on an SE issue raised by biomed
– Sheffield have an ongoing SNO+ problem related to CVMFS
And on the regional dashboard:
– Birmingham had a SURL problem that required a BDII restart
– QMUL have been getting glue2 errors on their new test CEs.
4. From the site roundtable last week, two items of note: ZFS is being investigated by several sites for their storage, and CentOS7 setup and testing features quite widely.
5. Looking to get Tier-2 reports back on 25th January.
SI-6 Tier-1 Manager’s Report (GS)
– The main item of note was the 2.1.15 update to the LHCb Castor instance on Wednesday. However, this was problematic. After the upgrade the SAM tests etc. passed but we were unable to handle any real level of load. During the following day the Castor Team worked with CERN (Giuseppe Lo Presti of the Castor Team there). Two problems were identified. One was the version of a particular library used by Castor internal messaging; the other was a couple of parameters that needed tuning. The instance was returned to normal operation at the end of Thursday afternoon.
– We have applied the patches for CVE-2016-7117 on LHCb Castor nodes this morning. This is being done now so that we can check this patch would not break the other Castor instances if applied at the same time as their 2.1.15 updates.
– The Atlas stager update is planned for tomorrow, with ‘GEN’ on Thursday (26th).
– We continue to see load on the CMS Castor instance that has led to intermittent failures of the SAM tests of the SRMs – which in turn has led to poor availability for CMS – as seen in the December availability figures.
DC will keep track of workloads. Several production jobs failed at RAL; DC will also track this and report to the PMB if any issues arise.
– No significant changes to report.
Here are the availability figures for December 2016 for the RAL Tier1.
Alice: 100%
Atlas: 100%
CMS: 92%
LHCb: 100%
OPS: 100%
Tier-1 review agenda will be discussed by GS and AS this afternoon and circulated to the PMB by tomorrow.
SI-7 LCG Management Board Report of Issues (DB)
No MB meeting and therefore nothing to report.
SI-8 External Contexts (PC)
Nothing significant to report. It is possible we can get between £1M and £2M into STFC for us and partners. PC may attend the LSST in the USA tomorrow; they have GridPP on the agenda, and he will report if anything arises. George Beckett and Andy Washbrook will attend at 6pm tomorrow.
Tier-1 review takes place on Wednesday and no PMB will be necessary next Monday unless something pressing arises (DC will circulate a summary of the technical meeting).
610.1: AS/GS to produce suggestions for one or more metrics that will summarise the Tier-1 network availability/performance. Ongoing.
616.3: DB and SL will discuss how best to progress replacement of TW’s role. Ongoing.
620.1: DB to contact DK re the procedure to deal with a security incident and the media. (Update: DK had devised an interim statement which involved TW as dissemination officer and he is no longer in post – there is no prescriptive full response as this would be dependent on circumstances and probably involve an emergency PMB and communication with relevant PR representatives). DK will send the statement to PMB in case required in future – spokesman SL as head of board or DB as project leader.
ACTIONS AS OF 23.01.17
610.1: AS/GS to produce suggestions for one or more metrics that will summarise the Tier-1 network availability/performance. Ongoing.
616.3: DB and SL will discuss how best to progress replacement of TW’s role. Ongoing.
620.1: DB to contact DK re the procedure to deal with a security incident and the media. (Update: DK had devised an interim statement which involved TW as dissemination officer and he is no longer in post – there is no prescriptive full response as this would be dependent on circumstances and probably involve an emergency PMB and communication with relevant PR representatives). DK will send the statement to PMB in case required in future – spokesman SL as head of board or DB as project leader.