Discussion:
Looking for guidance with Chapel+Slurm+OmniPath
Barry Moore
2017-03-28 03:06:44 UTC
Hello All,

Looking to get off the ground with Chapel on our HPC resources here at Pitt. I wrote the following code to approximate Pi:

```
// Need BlockDist module to use Block distribution mapping
use BlockDist;

// Some config constants for running the problem
config const tasksPerLocale : int = 1;
config const numberGridPoints : int = 10;

// The Grid and Domain specification
const Grid = {1..numberGridPoints};
const Domain: domain(1) dmapped Block(boundingBox = Grid) = Grid;

// Our approximation procedure
proc approx(a: real) return 4.0 / (1.0 + a ** 2);

// The separation between grid points, and the sum that will receive the reduction
var dGrid : real = 1.0 / numberGridPoints;
var sum : real = 0.0;

// For all values in the domain, compute valueOnGrid and add approx of that
// value into the reduced sum
forall d in Domain with (+ reduce sum) {
  var valueOnGrid : real = dGrid * (d + 0.5);
  sum += approx(valueOnGrid);
}

// Print out Pi
writeln("Pi is approximately: ", sum * dGrid);
```
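For anyone following along, this is the quadrature sum the forall computes (writing N for numberGridPoints and Δx for dGrid); the exact value of the integral is Pi:

```
\pi = \int_0^1 \frac{4}{1 + x^2}\, dx
    \;\approx\; \Delta x \sum_{d=1}^{N} \frac{4}{1 + \big((d + 0.5)\,\Delta x\big)^2},
    \qquad \Delta x = \frac{1}{N}
```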

I'm using the pull request https://github.com/chapel-lang/chapel/pull/5843 to compile the code.
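For reference, the build and launch steps look roughly like this (the file/executable names, locale count, and grid size below are just example values, and I'm assuming a multi-locale Chapel configuration is already set up in the environment):

```
# compile the program with the Chapel compiler
chpl piApprox.chpl -o piApprox

# run on 2 locales; config consts can be overridden on the command line
./piApprox -nl 2 --numberGridPoints=1000000
```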

I am convinced the code is running over OmniPath, but I am quite sure it is
not using multiple cores per locale. Am I missing an additional level in my
forall loop (something with tasksPerLocale)?

Any help is appreciated.

Thanks,

Barry
--
Barry E Moore II, PhD
E-mail: ***@pitt.edu

Assistant Research Professor
Center for Simulation and Modeling
University of Pittsburgh
Pittsburgh, PA 15260
Elliot Ronaghan
2017-03-28 06:08:09 UTC
Hi Barry,

Off the top of my head I'd recommend turning off any external CPU
affinity binding. Try doing:
`export HFI_NO_CPUAFFINITY=1`

We had another user run into a similar issue where PSM2 was limiting
the chapel executable to only have affinity to 1 processor:
https://github.com/chapel-lang/chapel/issues/5536#issuecomment-285826038
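One place to set it is in your batch script right before the launch, along these lines (the node count and executable name here are just placeholders):

```
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --exclusive

# turn off PSM2's CPU-affinity pinning before the Chapel program starts
export HFI_NO_CPUAFFINITY=1

./piApprox -nl 2
```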

If that doesn't help, can you send us the slurm command you're using as
well as the output of `$CHPL_HOME/util/printchplenv --anonymize` and
`printenv | grep -i affinity`?

It'd also help to see how much parallelism and how many physical cores a
Chapel program thinks it has access to by running this program:

```
writeln(here.maxTaskPar);               // per-locale parallelism
writeln(here.numPUs(accessible=true));  // # cores we have access to
writeln(here.numPUs(accessible=false)); // # cores on the locale
```

Elliot


Barry Moore
2017-03-28 16:48:00 UTC
Elliot,

The top of your head must be a nice place, because that worked. Just to prove it to myself, I added the `here` stuff to the bottom of my code; here are the results before and after setting HFI_NO_CPUAFFINITY=1.

```
Pi is approximately: 3.14159
1
1
28

export HFI_NO_CPUAFFINITY=1

Pi is approximately: 3.14159
28
28
28
```

I will put together some documentation for running on OmniPath w/ Slurm as part of my pull request.

- Barry
--
Barry E Moore II, PhD
E-mail: ***@pitt.edu

Assistant Research Professor
Center for Simulation and Modeling
University of Pittsburgh
Pittsburgh, PA 15260
Elliot Ronaghan
2017-03-28 17:08:01 UTC
Great, glad that worked out.

We don't have much experience running on OmniPath ourselves, so any documentation you'd be willing to put together would be greatly appreciated.

Elliot
