Present: Dave Britton (Chair), Tony Cass, Pete Clarke, Jeremy Coles, David Colling, Alastair Dewhurst, Tony Doyle, Roger Jones, Dave Kelsey, Steve Lloyd, Andrew McNab, Gareth Roy (minutes), Andrew Sansum.
Apologies: Pete Gronbech.
1. OSC Document Updates
Documents are urgently required for the OSC deadline of 14th November. Inputs are currently awaited from Experiments, Operations, Other VOs and the Tier-1. RJ states he is working on the ATLAS input. JC aims to complete the OPS sections this week; he has collected inputs for the Other VOs section and has the material needed to complete that report. An initial version of the Tier-1 section is available and AD is working on improving it.
2. Tier-1 CPU Usage
AD presents on resource usage at the Tier-1 during October. Resources provided by the Tier-1 are now comfortably above pledge, with ATLAS, LHCb and ALICE all currently meeting fairshare usage. CMS was below fairshare due to a lack of submitted jobs; DC confirmed that this was caused by a problem in the 2018 MC outputs, which led to a stop on production simulation.
3. Crick Meeting
James Fleming (Director of IT and Services – Crick) and Steve Hindmarsh (Head of Scientific Computing – Crick) have contacted DB due to concerns over how scientific computing is provided at the Crick and its satellite institutes. They foresee an increase in the distributed nature of compute and wish to discuss how HEP overcame the associated issues. A meeting has been arranged for the 21st November to discuss this topic.
4. Status of GridPP6 Background Documents
a) Storage – PC, the current document has useful content but is in the form of discussions and commentary. An executive summary is prepared which could be used for input to the proposal.
b) Security – DK is concentrating on the executive summary. More information is available and a fuller document can be completed in time for the GridPP6 proposal.
c) Tier-1 – AD sent an initial draft of the Tier-1 report. More detail needs to be added to some sections to make a complete document.
d) Tier-2 – SL, no updates in the last month, but the document was in reasonable shape and is ready for review.
e) Experimental Support – RJ, sent an amended draft, no updates since then.
A discussion took place about ensuring staff posts at Tier-2 would be described as RSE within the GridPP6 proposal as this better reflected the role and its requirements.
5. Letter of Support
A request from Becky Parker for a letter of support from GridPP for the Nucleus Application for Research in Schools was discussed. The PMB agreed that this was appropriate and that some mention of providing access to opportunistic resource usage should be added.
7. Standing Items
SI-0 Bi-Weekly Report from Technical Group (DC)
Proposal for the Technical Group to monitor the Rucio developments. AD had suggested additional topics and DC felt it was an appropriate time to restart the Technical Meetings. It was suggested that Rucio meetings be held on a bi-weekly schedule, with the off weeks filled with other topics. Additionally, the status of the currently IRIS-funded Rucio projects was discussed.
SI-1 ATLAS Weekly Review and Plans (RJ)
RJ, ATLAS moved to a new Tape Stager at the Tier-1. The issue with ATLAS not using its pledged resources appears to be solved, with overusage in October. Discussion is ongoing about the storage systems at Birmingham and the ATLAS position on using EOS there. RJ to talk with the central ADC teams and come back with a position.
SI-2 CMS Weekly Review and Plans (DC)
Nothing to report.
SI-3 LHCb Weekly Review and Plans (PC)
Nothing to report.
SI-4 Production Manager’s report (JC)
– Some sites ticketed due to CVE updates.
– Some issues with the OPS security dashboard and Pakiti updates not being pulled in.
– Sites are increasingly looking at migration to CentOS 7.
– Sites starting procurements to purchase IRIS equipment.
– Daniela Bauer is helping with the migration of the T2K LFC catalogue to DFC; the large number of files is causing issues.
SI-5 Tier-1 Manager’s Report (AD)
– ALICE have had authentication problems with CASTOR. An update was performed on Monday 29th October, which promptly broke it. This was reverted but issues remained until Thursday 1st November.
– LHCb have started syncing more of their data from Castor to Echo. They have written over a PB to Echo since the 2nd November (just under 3 days). The write rate into Echo is about 10 times higher than normal ATLAS and CMS production work (5GB/s instead of 500MB/s).
– Some (1 – 5%) of CMS GridFTP SAM test jobs are failing against Echo due to “System error in bind: Address already in use”. This happens when GridFTP can’t find a contiguous block of ports to use for a transfer. This potential problem has been known about for a long time, but we believed we had sufficient mitigation in place to prevent it causing any real issues. This may be related to the bulk transfers LHCb have been doing in the last week. We are asking Ian Johnson, who developed the plugin, if there is anything else he can do to fix this.
– CMS AAA, problems remain. The manager continues to crash randomly and we will set up a second one (to hide the problem). This week we will be pushing out the newest version of XRootD (4.8.5), which claims to fix the problem.
– ATLAS migrated to the new tape instance (wlcgTape) on Wednesday. ATLAS are now completely off their old instance which will be decommissioned. CMS and non-LHC VOs will follow before Christmas.
– After NA62 lost data at CERN, the Castor team recalled what we had backed up at RAL to the new wlcgTape instance, as this buffer was larger and more performant than the gen instance one. This sped up recovery by a day or so.
– We received multiple GGUS tickets regarding the FTS problems, which quickly pointed to an IPv6 issue. Inbound IPv6 traffic was getting blocked to machines that were not on the OPN subnet (i.e. a firewall problem). We believed we fixed the issue on Friday and did not get any further complaints over the weekend. IPv6 problems both at RAL and other sites are impacting the FTS service very frequently. To mitigate this, we are reverting the FTS test instance to IPv4 only; this will allow VOs to continue to function in the event of a problem. We are also planning to move the FTS service on to the OPN subnet this week.
– It was discovered that CERN has been incorrectly routing IPv6 packets. At the LHCONE meeting last week it was noted that KIT was receiving packets from RAL via LHCONE. It turned out that KIT was only advertising its IPv6 address to those on the OPN via the LHCONE. This should have meant no IPv6 transfers between RAL and KIT were possible. The fact that the Tier-1 is not on the LHCONE does cause confusion for other sites, especially Tier-1s, who assume we are part of it.
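As a rough cross-check of the LHCb figures quoted in the Tier-1 report above (over a PB written to Echo in just under 3 days, at about 5GB/s against a normal 500MB/s), the arithmetic can be sketched as below. This is an illustration only: it assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes), which the report does not state, and the function names are hypothetical.

```python
# Rough consistency check of the Echo write-rate figures quoted in the report.
# Assumption (not stated in the minutes): decimal units throughout.
PB = 10**15
GB = 10**9
MB = 10**6
SECONDS_PER_DAY = 86_400


def average_rate_gb_per_s(volume_bytes, days):
    """Average write rate in GB/s needed to move volume_bytes in the given days."""
    return volume_bytes / (days * SECONDS_PER_DAY) / GB


def days_to_write(volume_bytes, rate_bytes_per_s):
    """Days needed to write volume_bytes at a sustained rate."""
    return volume_bytes / rate_bytes_per_s / SECONDS_PER_DAY


# Exactly 1 PB over 3 full days needs a sustained average of roughly 3.9 GB/s.
avg_rate = average_rate_gb_per_s(1 * PB, 3)

# The quoted peak (5GB/s) against the quoted normal rate (500MB/s).
ratio = (5 * GB) / (500 * MB)

# At a sustained 5GB/s, 1 PB takes roughly 2.3 days.
days_at_quoted_rate = days_to_write(1 * PB, 5 * GB)

print(f"average rate for 1 PB in 3 days: {avg_rate:.2f} GB/s")
print(f"quoted rate / normal rate: {ratio:.0f}x")
print(f"days to write 1 PB at 5 GB/s: {days_at_quoted_rate:.2f}")
```

Under these assumptions the quoted numbers are mutually consistent: a rate of about 5GB/s is ten times 500MB/s and is enough to write a little over a PB in just under 3 days.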
SI-6 LCG Management Board Report of Issues (DB)
DB highlights a report from a PRACE workshop held at CERN. This meeting looked at the usage of HPC resources for HL-LHC. A joint effort between WLCG, SKA and PRACE to carry out software development was suggested.
Start and end dates for Run2 are fixed, but a large amount of data may not be generated in the first year.
GDPR was discussed and how long a user would be associated with a VO after the user has left.
SI-7 External Contexts (PC)
PC, the eInfrastructure bid is completed and moving forward.
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress. 08/10/18 – Leicester are producing a PO for tapes and will send to AD to produce an invoice). Ongoing.
667.2: PG will do h/w planning before next OC to provide OC with details of shortfall in funds. (Update: PG will check the OSC minutes for details and cover with GR). Ongoing.
678.2: DK to finalise the Security, Trust and Identity background document by mid October. (Update: DK and David Crooks have been working on this and it is nearly complete) Ongoing.
678.3: AD to finalise the Tier1 background document, including tape strategy by end September. (Update: Almost complete and will circulate current iteration for comment). Ongoing.
678.5: JC to finalise the Storage background document by end September. (Update: 17 October meeting with Tony Medland – DB and PC will attend. This is almost complete and awaiting a few minor elements to be worked in – GR will upload into Googledocs for info). Ongoing.
680.2: JC will follow up GDPR implications relating to VOMS with DK. Completed.
684.1: PG will contact the owners of risks who were absent from today’s PMB to confirm they are satisfied with the decisions taken on the risk register.
684.2: DB will write to AD to suggest the Technical Group take on the coordinating of Rucio. Completed.
684.3: DB will contact the authors of the OSC documents asking them to complete their sections this week. Completed.
ACTIONS AS OF 05.11.18
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress. 08/10/18 – Leicester are producing a PO for tapes and will send to AD to produce an invoice). Ongoing.
667.2: PG will do h/w planning before next OC to provide OC with details of shortfall in funds. (Update: PG will check the OSC minutes for details and cover with GR). Ongoing.
678.2: DK to finalise the Security, Trust and Identity background document by mid October. (Update: DK and David Crooks have been working on this and it is nearly complete) Ongoing.
678.3: AD to finalise the Tier1 background document, including tape strategy by end September. (Update: Almost complete and will circulate current iteration for comment). Ongoing.
678.5: JC to finalise the Storage background document by end September. (Update: 17 October meeting with Tony Medland – DB and PC will attend. This is almost complete and awaiting a few minor elements to be worked in – GR will upload into Googledocs for info). Ongoing.