watkinrt

Error 158: Error encountered when accessing the scratch file


I am receiving the following error while trying to run a job on a Linux cluster:

(Scratch disk space usage for starting iteration = 2704 MB)
  
 
 *** ERROR # 158 ***
 Error encountered when accessing the scratch file.
    error from subroutine xdslrs 
    Solver error no. =   -702 
 This may be caused by insufficient disk space or some other
 system resource related limitations.
 Try one or more of the following workarounds if you think otherwise 
 and if this issue happens on Windows. 
 - Resubmit a job. 
 - Avoid writing scratch files in the drive where the Operating System is 
   installed (start the job on other drive or use TMPDIR/-tmpdir options).  
 - Disable real time protection at minimum for files with extension rs~ .
 - Use of environment variable OS_SCRATCH_EXT=txt may help. 
 This error was detected in subroutine adjslvtm.
 
 *** ERROR # 5019 ***
 Specified temprature vectors (1 - 1) out of allowed range (1 - 0).
 
 This error occurs in module "snsdrv".

************************************************************************

RESOURCE USAGE INFORMATION
--------------------------

 MAXIMUM MEMORY USED                                      4985 MB
   IN ADDITION    177 MB MEMORY WAS ALLOCATED FOR TEMPORARY USE
   INCLUDING MEMORY FOR MUMPS                             2929 MB
 MAXIMUM DISK SPACE USED                                  5939 MB
   INCLUDING DISK SPACE FOR MUMPS                         3934 MB

************************************************************************

I've tried some of the troubleshooting suggestions from the error log, without any luck:

  • I've pointed the scratch directory at a large file system (petabytes of storage); see the snippet after this list.
  • I've set OS_SCRATCH_EXT=txt.
  • Since this is a high-performance computing environment, I don't have access to any "real time protection" settings. However, I'm not aware of any virus protection running on the cluster, and OS_SCRATCH_EXT=txt should sidestep virus-scan issues anyway, since I'd expect the scanner to let text files pass.
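
For reference, here is roughly how my job script sets this up before the solver call (a sketch; the two exports are the real settings, the rest is just sanity checking):

export OS_SCRATCH_EXT=txt      # scratch files get a .txt extension
export TMPDIR="$SCRATCH"       # keep temp files off the OS drive
mkdir -p "$SCRATCH"            # make sure the scratch directory exists
df -h "$SCRATCH"               # confirm free space before the run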

 

I should note that I am trying to run this problem as a parallel job with the following command:

 

$ALTAIR_HOME/scripts/invoke/init_solver.sh -mpi pl -ddm -np 2 -nt 8 -scr $SCRATCH -outfile $WORK <filename>.fem
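
For clarity, my understanding of the options (as I read the OptiStruct run scripts; happy to be corrected):

# -mpi pl    use IBM Platform MPI
# -ddm       domain decomposition: the model is split across the MPI processes
# -np 2      two MPI processes
# -nt 8      eight solver threads per MPI process
# -scr       scratch directory ($SCRATCH here is a network file system; see below)
# -outfile   directory where the result files are written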

Below are some other relevant notes:

 

  • If I run this job in serial (i.e., without -mpi pl -ddm -np 2), I don't hit the above error, so this appears to be something that arises only when running MPI jobs.
  • I've tried running this job with -mpi i, but my system doesn't seem to be set up for Intel MPI (the required .so files can't be found).
  • Cluster node information:
    • Dual Socket
    • Xeon E5-2690 v3 (Haswell) : 12 cores per socket (24 cores/node), 2.6 GHz
    • 64 GB DDR4-2133 (8 x 8GB dual rank x8 DIMMS)
    • Hyperthreading Enabled - 48 threads (logical CPUs) per node
  • The $SCRATCH drive I'm pointing to is a network drive. I tried running -scr slow=1,$SCRATCH, but I still get the same error.
  • When I make the above DDM call, I request two nodes and a total of 48 MPI processes (although I'm not using them all); the batch script is sketched below.
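
For completeness, the submission looks roughly like this (assuming a SLURM scheduler; the directives and paths are illustrative, not my exact script):

#!/bin/bash
#SBATCH --nodes=2                 # the two nodes mentioned above
#SBATCH --ntasks=48               # 48 MPI slots requested (not all used)

export OS_SCRATCH_EXT=txt         # per the error log's suggestion
export TMPDIR="$SCRATCH"          # network scratch area

$ALTAIR_HOME/scripts/invoke/init_solver.sh -mpi pl -ddm -np 2 -nt 8 \
    -scr $SCRATCH -outfile $WORK <filename>.fem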

 

Thoughts?

Q.Nguyen-Dai replied on 3/4/2020:
  • The scratch drive should be a fast local (SSD) disk for performance. Don't use a network drive!
  • Never run with hyperthreading enabled for FEA on an HPC cluster (a quick check is shown below).

(Note: I work daily with HPC Linux clusters for FEA.)
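
To see whether a node is exposing hyperthreaded logical CPUs, a quick lscpu check is enough (plain Linux, nothing OptiStruct-specific):

# With HT on, "CPU(s)" is double "Core(s) per socket" x "Socket(s)"
lscpu | grep -E 'CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'

As a rule of thumb, keep -np times -nt per node at or below the physical core count, so every solver thread gets a real core.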

 

 

 


Thanks for the response. 

 

On 3/4/2020 at 10:50 AM, Q.Nguyen-Dai said:
  • The scratch drive should be a fast local (SSD) disk for performance. Don't use a network drive!

Unfortunately, I have very limited space allocated on my HPC local drive (a 1 GB quota), so for big jobs I have to offload the scratch work to a network drive. Ideally I'd run everything in-core, but sometimes that isn't an option either.
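
For scale, the resource summary above reports about 5.9 GB of peak disk use, so a 1 GB local quota is nowhere near enough; a quick comparison shows the gap (the local path is a placeholder for my node's temp area):

df -h /tmp "$SCRATCH"    # local temp area vs. the network scratch drive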

 

On 3/4/2020 at 10:50 AM, Q.Nguyen-Dai said:
  • Never run with hyperthreading enabled for FEA on an HPC cluster.

Is that due to an HPC limitation, or is it just not a good idea for OptiStruct in general? I ask because my cluster supposedly has hyperthreading available on the CPUs (not that I've been using it).
