Hello world: not-so-optimal prime numbers

Previous post of “Programming XeonPHI using .NET and F#” series

 

In this post we start playing with the XeonPHI KNL system using F# and Visual Studio Code. The goal is to become familiar with the platform and see how simple, even inefficiently written, code performs on the platform.

In this post series we will return on XeonPhi internal architecture, but being this the first true post about it (previous post was about installing the environment) let’s start with a brief discussion about the internal structure of the platform. Our main source about KNL is the Knights Landing edition of the “Intel Xeon Phi Processor High Performance Programming” book.

XeonPhi architecture

The XeonPhi processor is basically a mesh of up to 36 tiles interconnected by a 2D mesh as represented in the following schema (source):

Each tile contains two cores whose design is derived from Silvermont Atom architecture. However, the core design has been significantly altered from the original design, supporting 4 hyper-threads. Each core has 2 vector processing units (VPU) 512 bit wide each. Each core has two L1 caches for instructions and data, 32KB each. The two cores in a tile share 1MB of L2 cache through wich they access the memory.

To ensure enough memory bandwidth to feed all the cores, Intel developed a special RAM called MCDRAM (Multi-channel DRAM). XeonPhi processor is also capable to address DDR4 memory to store larger quantity of data.

Our ninja system features 128GB of DDR4, 16GB MCDRAM, and 64 cores (32 tiles).

XeonPhi provides four configurations defining how memory is close to a core, and three memory models (cache, hybrid, and flat). You may think that things are already getting more complicated than you are ready to accept, but the good news is that the default settings are often optimal (quadrant mode for cores and cache model for memory) so you can look into different configuration only when needed.

 

The prime numbers test

Computing prime numbers is  a typical exercise to get a CPU busy, also because the primality test does not depend from any state so the test can be safely executed in parallel without any worries about race conditions. The isPrime function we will use is written in F# and, given a number n, recursively looks wether a number between sqrt(n) and 1 divides evenly n.

let isPrime n =
  let top = (sqrt(n |> float) |> int) + 1
  let rec tp k d =
    if d = 1 then true
    elif k % d = 0 then false
    else tp k (d - 1)
  tp n top
val isPrime : n:int -> bool

Full name: Prime.isPrime

val n : int
val top : int
val sqrt : value:’T -> ‘U (requires member Sqrt)

Full name: Microsoft.FSharp.Core.Operators.sqrt

Multiple items
val float : value:’T -> float (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.float

——————–
type float = System.Double

Full name: Microsoft.FSharp.Core.float

——————–
type float<‘Measure> = float

Full name: Microsoft.FSharp.Core.float<_>

Multiple items
val int : value:’T -> int (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.int

——————–
type int = int32

Full name: Microsoft.FSharp.Core.int

——————–
type int<‘Measure> = int

Full name: Microsoft.FSharp.Core.int<_>

val tp : (int -> int -> bool)
val k : int
val d : int

We used tail recursion for defining isPrime, the compiler will transform the recursive definition into a loop.

For a number n whose binary representation is k bits the function will perform at most 2^(k/2) tests .

The function has been written to test division by d from sqrt(n) down to 1 instead than starting from 2 and going up to increase the CPU usage. In fact it is more likely to find a number that can be divided by 2 or 3, and the test in this case would exit after few cycles.

Sequential execution

We are interested in testing how the XeonPHI KNL behaves when running ordinary .NET code and how far we can go by using high level parallel programming primitives. Since the KNL we are using has 64 cores it is natural to distribute the computation across the available cores. Moreover, since it is also important to understand how the memory access performs we will abuse of this resource in our example: we will allocate an array containing numbers from 1 to 100M and we will use the Array.map function in order to fill another array containing pairs (n, isPrime(n)). It is important to notice that F# pairs are heap allocated so in this way we allocate two arrays of 400MB each plus 1GB of heap allocated pairs.

#time
let n = [| 1 .. 100000000 |]
let isPrime n =
  let top = (sqrt(n |> float) |> int) + 1
  let rec tp k d =
    if d = 1 then true
    elif k % d = 0 then false
    else tp k (d - 1)
  tp n top

let pn = 
  n
  |> Array.map (fun n -> (n, isPrime n)) 
  |> Seq.filter (fun (_, p) -> p) 
  |> Seq.length

printfn "Found %d primes" pn
val n : int []

Full name: Primes.n

val isPrime : n:int -> bool

Full name: Primes.isPrime

val n : int
val top : int
val sqrt : value:’T -> ‘U (requires member Sqrt)

Full name: Microsoft.FSharp.Core.Operators.sqrt

Multiple items
val float : value:’T -> float (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.float

——————–
type float = System.Double

Full name: Microsoft.FSharp.Core.float

——————–
type float<‘Measure> = float

Full name: Microsoft.FSharp.Core.float<_>

Multiple items
val int : value:’T -> int (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.int

——————–
type int = int32

Full name: Microsoft.FSharp.Core.int

——————–
type int<‘Measure> = int

Full name: Microsoft.FSharp.Core.int<_>

val tp : (int -> int -> bool)
val k : int
val d : int
val pn : int

Full name: Primes.pn

module Array

from Microsoft.FSharp.Collections

val map : mapping:(‘T -> ‘U) -> array:’T [] -> ‘U []

Full name: Microsoft.FSharp.Collections.Array.map

module Seq

from Microsoft.FSharp.Collections

val filter : predicate:(‘T -> bool) -> source:seq<‘T> -> seq<‘T>

Full name: Microsoft.FSharp.Collections.Seq.filter

val p : bool
val length : source:seq<‘T> -> int

Full name: Microsoft.FSharp.Collections.Seq.length

val printfn : format:Printf.TextWriterFormat<‘T> -> ‘T

Full name: Microsoft.FSharp.Core.ExtraTopLevelOperators.printfn

To understand how well a parallel code it is useful to know how long it takes to execute a function on a single core. When executed from within Visual Studio Code the script above interactively we obtained the following output:

Real: 03:09:21.840, CPU: 03:09:27.673, GC gen0: 572, gen1: 12

Parallel execution

One of the advantages of using high level languages such F# on .NET rather than C/C++ for writing parallel programs is that you can parallelize your program more easily. Functional programming is well known to encourage writing code that is more parallel-friendly due to a reduced use of global and shared state. Moreover, F# is statically typed and compiled into the intermediate language executed by common language runtime after translated into machine language by the Just in time compiler.

The F# Array.Parallel module offers utilities to automatically process arrays in parallel, taking into account the number of cores available on the system. We use the map function that behaves like Array.map function but starting multiple threads performing the map of a function over slices of the Array in parallel.

#time
let n = [| 1 .. 100000000 |]
let isPrime n =
  let top = (sqrt(n |> float) |> int) + 1
  let rec tp k d =
    if d = 1 then true
    elif k % d = 0 then false
    else tp k (d - 1)
  tp n top

let pn = 
  n 
  |> Array.Parallel.map (fun n -> (n, isPrime n)) 
  |> Seq.filter (fun (_, p) -> p) 
  |> Seq.length

printfn "Found %d primes" pn
val n : int []

Full name: Multi.n

val isPrime : n:int -> bool

Full name: Multi.isPrime

val n : int
val top : int
val sqrt : value:’T -> ‘U (requires member Sqrt)

Full name: Microsoft.FSharp.Core.Operators.sqrt

Multiple items
val float : value:’T -> float (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.float

——————–
type float = System.Double

Full name: Microsoft.FSharp.Core.float

——————–
type float<‘Measure> = float

Full name: Microsoft.FSharp.Core.float<_>

Multiple items
val int : value:’T -> int (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.int

——————–
type int = int32

Full name: Microsoft.FSharp.Core.int

——————–
type int<‘Measure> = int

Full name: Microsoft.FSharp.Core.int<_>

val tp : (int -> int -> bool)
val k : int
val d : int
val pn : int

Full name: Multi.pn

module Array

from Microsoft.FSharp.Collections

module Parallel

from Microsoft.FSharp.Collections.ArrayModule

val map : mapping:(‘T -> ‘U) -> array:’T [] -> ‘U []

Full name: Microsoft.FSharp.Collections.ArrayModule.Parallel.map

module Seq

from Microsoft.FSharp.Collections

val filter : predicate:(‘T -> bool) -> source:seq<‘T> -> seq<‘T>

Full name: Microsoft.FSharp.Collections.Seq.filter

val p : bool
val length : source:seq<‘T> -> int

Full name: Microsoft.FSharp.Collections.Seq.length

val printfn : format:Printf.TextWriterFormat<‘T> -> ‘T

Full name: Microsoft.FSharp.Core.ExtraTopLevelOperators.printfn

When executed interactively we obtained the following output:

Real: 00:03:39.085, CPU: 08:50:45.152, GC gen0: 614, gen1: 12

As it can be easily observed the single thread execution required 11361,84 seconds versus just 219,085 seconds needed to complete the parallel version. This is about 52x speedup of the parallel version from the sequential one. Considering that our use of Array.map and Array.Parallel.map together combined with the generation of heap allocated pairs generates significant access to memory the obtained speedup is not bad at all considered that the maximum would have been 64x.

Conclusions

In this first experiment of programming XeonPhi with F# and .NET we found that is rather easy to leverage on parallel oriented constructs to obtain significant speedups in our code, even if not optimal. The XeonPhi memory handling has proven to perform well under garbage collector-like memory handling, even though we didn’t generate too much collection cycles. The code can be executed as it is on any platform, and this implies that XeonPhi is easy to be used to speedup any program. In the next posts we will continue this exploration of examining the performance of .NET code on KNL, we will compare it with native code to understand how much speedup is reasonable to expect from well written programs.

 

Update (8 Oct. 2016): The post has been edited to polish the code and remove the useless test function originally used to test different functions in a single file.

 

Previous post of “Programming XeonPHI using .NET and F#” series

Leave a Reply