
Re: Intel libraries

Posted: Thu May 16, 2019 7:37 pm
by pattim
OK, I'm still using 19.4 - I haven't tried 19.5.
What was the output you received when you ran the recommended test (ldd wrfm_arw.exe.intel)?
Are you only running on a single machine? (specs?)
Can you post the contents of your /etc/hosts file?
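For anyone following along, the suggested `ldd` check looks like this (the binary name is from the post; run it from wherever your UEMS installation keeps the executables):

```shell
# List the shared-library resolution for the Intel-built executable
# (the filename comes from the post above; use your UEMS bin directory).
ldd wrfm_arw.exe.intel

# A quicker check: print only unresolved libraries. No output means
# everything resolved; any line printed names a missing .so that has to
# be fixed (e.g. via LD_LIBRARY_PATH).
ldd wrfm_arw.exe.intel | grep 'not found'
```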

Re: Intel libraries

Posted: Sat May 18, 2019 10:52 am
by alfe
@meteoadriatic

I haven't tried it yet...

Re: Intel libraries

Posted: Thu May 23, 2019 9:06 pm
by alfe
Thank god it's working !!!!!!!!!! :D :mrgreen:

Oh dear, I am shocked!!

I mean, it is working with the UEMS Intel-compiled binaries! They are approximately 20-30% faster than the PGI-compiled binaries on my system.

I found the solution by digging deeper and deeper on the internet.
See here : https://software.intel.com/en-us/forums ... pic/270043
And here : http://wiki.seas.harvard.edu/geos-chem/ ... n_Compiler

So the point is the stack size. On my system it was set to 64 MB by default. Obviously this is not enough for the Intel binaries. In the terminal window I typed:
ulimit -s unlimited
export OMP_STACKSIZE=500M

And then ran it!
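For future readers, here are the same two settings in one place, plus a sketch of how to make them stick across sessions (assuming a bash login shell; the 500M value is just the one used above, adjust as needed):

```shell
# Raise the process stack limit and the per-OpenMP-thread stack size
# before launching the run (these only affect the current shell).
ulimit -s unlimited
export OMP_STACKSIZE=500M

# To make this the default for future shells, append the same two
# lines to your shell startup file, e.g.:
#   echo 'ulimit -s unlimited'       >> ~/.bashrc
#   echo 'export OMP_STACKSIZE=500M' >> ~/.bashrc
```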

Thanks to all of you guys for your support,

Alain

Re: Intel libraries

Posted: Fri May 24, 2019 10:47 am
by meteoadriatic
Hi,

I didn't realize until now that the post on the official WRF forum was yours :) Great that you solved the problem. Yes, the stack size is usually the first thing that has to be adjusted for WRF in general; it is strange that UEMS does not set it high enough by default.

Re: Intel libraries

Posted: Fri May 24, 2019 5:13 pm
by j0nes2k
Hello,

I also see a really good speedup of at least approx. 30% when using the Intel binaries - a huge improvement! I am only running on one machine, and I ran into just one error. Fortunately, the solution was printed in the error message; I'm leaving it here for future reference.

I had to set:

Code:

echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
...and to make this change permanent on my system (Ubuntu 18.04 on AWS) I had to edit /etc/sysctl.d/10-ptrace.conf, setting "0" instead of "1" there.
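The two steps described above, sketched in one place (the `sed` pattern assumes the stock Ubuntu drop-in file, which contains the line `kernel.yama.ptrace_scope = 1`; check your copy first):

```shell
# Temporary fix (lost on reboot): allow ptrace attachment system-wide.
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope

# Permanent fix: flip the value in the sysctl drop-in, then reload
# all sysctl configuration files.
sudo sed -i 's/ptrace_scope = 1/ptrace_scope = 0/' /etc/sysctl.d/10-ptrace.conf
sudo sysctl --system
```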

Best regards,

Jonas

Re: Intel libraries

Posted: Thu Oct 10, 2019 9:57 pm
by meteo60
I'm trying to use the Intel executables on a new cluster:
Old configuration: just one master server with 28 cores, one non-nested domain (4 km resolution), Intel executables: no problem.
New configuration: 1 master + 1 node (same servers), 48 cores, one domain (same setup but 3 km), Intel executables: sometimes I get an error:

Code:

Simulation Failed (101)! I hate when this #%^!#!!% happens.
                 System Signal Code (SN) : 101 (Unknown Signal)
                 Here is some information from rsl.error.0000:
                 ----------------------------------------------------------------------------------------
                   Error Log: forrtl: severe (174): SIGSEGV, segmentation fault occurred
                 ----------------------------------------------------------------------------------------
                 System Signal Code (SN) : 101 (Unknown Signal)
                 Here is some information from rsl.error.0024:
                 ----------------------------------------------------------------------------------------
                   Error Log: MPIDU_Complete_posted_with_error(1710): Process failed
In rsl.error.0000:

Code:

Timing for main (dt= 24.00): time 2019-10-12_01:12:24 on domain   1:    0.55257 elapsed seconds
forrtl: severe (174): SIGSEGV, segmentation fault occurred
In rsl.error.0024:

Code:

WRF NUMBER OF TILES =   4
Fatal error in PMPI_Wait: Unknown error class, error stack:
PMPI_Wait(219)........................: MPI_Wait(request=0x5a33b7c, status=0x7ffe94cb6d80) failed
MPIR_Wait_impl(100)...................: fail failed
MPIDU_Complete_posted_with_error(1710): Process failed
If I retry the run, I get exactly the same bug at the same time.
Of course I have typed:
ulimit -s unlimited
export OMP_STACKSIZE=500M

And at the next run it's OK... or not, and the same bug again. Sometimes it happens at the beginning, in the middle, or at the end of the run. And if I retry a failed run, the bug occurs again at the same time.
It's very random.

I will try a couple of runs of the old domain (4 km) on this cluster.
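One thing worth checking in a multi-node setup: `ulimit -s` and `OMP_STACKSIZE` set in the shell on the master are not automatically inherited by MPI ranks launched on the other node, since those processes start from that node's login environment. A hedged sketch of how to verify (mpirun flags differ between MPI implementations; adjust for your stack):

```shell
# Print the effective stack limit and OMP_STACKSIZE as each rank sees
# them. If a rank on the second node reports a small stack or an unset
# variable, add the ulimit/export lines to that node's shell startup
# file as well.
mpirun -np 2 sh -c 'echo "$(hostname): stack=$(ulimit -s) OMP_STACKSIZE=${OMP_STACKSIZE:-unset}"'
```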

Re: Intel libraries

Posted: Fri Oct 11, 2019 7:56 am
by alfe
Hello meteo60,
In the newest UEMS version I have a similar problem.
With the previous UEMS version, the Intel libraries worked fine after setting an unlimited stack size and an arbitrarily high OMP_STACKSIZE. But with this newest UEMS version it doesn't work. I get the same 174 SIGSEGV error, always after the first wrfout time step.
For the moment I have given up and use the PGI-compiled libraries.
Robert might have changed something in the compilation options, or a new bug was introduced somewhere in the code.

Alain

Re: Intel libraries

Posted: Tue Oct 22, 2019 10:23 am
by meteo60
Which OS do you use?
I still hit that issue (174) sometimes, despite OMP_STACKSIZE etc.
Again, it's very random: most runs are OK, but sometimes it crashes.
When a run crashes, I retry it with other input data. I usually use GFSP25PT; if it crashes, I retry with GFSP25 and it's OK... strange...

Re: Intel libraries

Posted: Tue Oct 22, 2019 7:48 pm
by alfe
Hello meteo60,
I use CentOS 7.
I have tried the standard 3.10 kernel and also the newest 5.2 kernel. No change.
I can set OMP_STACKSIZE to any arbitrarily high value, but no change.
Other strange behaviour: I have a very large domain, but with DX = 25 km. This one runs, but all the wrfout files seem corrupted.
A 3-domain run always crashes. :cry:

Re: Intel libraries

Posted: Tue Oct 22, 2019 8:07 pm
by meteo60
I use CentOS 7 too.
I think we have to wait for Robert to resolve this issue...